核心内容摘要
MIPI 接口详解
- 模型加载仅占用
32 GiB 内存- 可用 KV 缓存内存
57 GiB- 总显存使用约
89 GiB 符合 1GB 以内的要求(TraeAI-
~/my_python_server/wsl [1] $ cd /root/my_python_server/wsl ; /root/my_python_server/vllm-env/bin/python test_inference.py INFO
20:28:51 [utils.py:263] non-default args: {trust_remote_code: True, dtype: bfloat16, gpu_memory_utilization:
1, max_num_batched_tokens: 512, disable_log_stats: True, quantization: gptq_marlin, model: /root/my_python_server/models/OpenBMB_MiniCPM4-
5B-QAT-Int4-GPTQ-format} The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. INFO
20:28:51 [model.py:530] Resolved architecture: MiniCPMForCausalLM WARNING
20:28:51 [model.py:1869] Casting torch.float16 to torch.bfloat
INFO
20:28:51 [model.py:1545] Using max model len 32768 INFO
20:28:53 [gptq_marlin.py:230] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel. INFO
20:28:53 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens
INFO
20:28:53 [vllm.py:630] Asynchronous scheduling is enabled. INFO
20:28:53 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling. WARNING
20:28:53 [interface.py:470] Using pin_memoryFalse as WSL is detected. This may slow down the performance. (EngineCore_DP0 pid
INFO
20:28:53 [core.py:97] Initializing a V1 LLM engine (v
0.
14.
with config: model/root/my_python_server/models/OpenBMB_MiniCPM4-
5B-QAT-Int4-GPTQ-format, speculative_configNone, tokenizer/root/my_python_server/models/OpenBMB_MiniCPM4-
5B-QAT-Int4-GPTQ-format, skip_tokenizer_initFalse, tokenizer_modeauto, revisionNone, tokenizer_revisionNone, trust_remote_codeTrue, dtypetorch.bfloat16, max_seq_len32768, download_dirNone, load_formatauto, tensor_parallel_size1, pipeline_parallel_size1, data_parallel_size1, disable_custom_all_reduceFalse, quantizationgptq_marlin, enforce_eagerFalse, enable_return_routed_expertsFalse, kv_cache_dtypeauto, device_configcuda, structured_outputs_configStructuredOutputsConfig(backendauto, disable_fallbackFalse, disable_any_whitespaceFalse, disable_additional_propertiesFalse, reasoning_parser, reasoning_parser_plugin, enable_in_reasoningFalse), observability_configObservabilityConfig(show_hidden_metrics_for_versionNone, otlp_traces_endpointNone, collect_detailed_tracesNone, kv_cache_metricsFalse, kv_cache_metrics_sample
01, cudagraph_metricsFalse, enable_layerwise_nvtx_tracingFalse, enable_mfu_metricsFalse, enable_mm_processor_statsFalse, enable_logging_iteration_detailsFalse), seed0, served_model_name/root/my_python_server/models/OpenBMB_MiniCPM4-
5B-QAT-Int4-GPTQ-format, enable_prefix_cachingTrue, enable_chunked_prefillTrue, pooler_configNone, compilation_config{level: None, mode: CompilationMode.VLLM_COMPILE: 3, debug_dump_path: None, cache_dir: , compile_cache_save_format: binary, backend: inductor, custom_ops: [none], splitting_ops: [vllm::unified_attention, vllm::unified_attention_with_output, vllm::unified_mla_attention, vllm::unified_mla_attention_with_output, vllm::mamba_mixer2, vllm::mamba_mixer, vllm::short_conv, vllm::linear_attention, vllm::plamo2_mamba_mixer, vllm::gdn_attention_core, vllm::kda_attention, vllm::sparse_attn_indexer], compile_mm_encoder: False, compile_sizes: [], compile_ranges_split_points: [512], inductor_compile_config: {enable_auto_functionalized_v2: False, combo_kernels: True, benchmark_combo_kernel: True}, inductor_passes: {}, cudagraph_mode: CUDAGraphMode.FULL_AND_PIECEWISE: (2,
, cudagraph_num_of_warmups: 1, cudagraph_capture_sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], cudagraph_copy_inputs: False, cudagraph_specialize_lora: True, use_inductor_graph_partition: False, pass_config: {fuse_norm_quant: False, fuse_act_quant: False, fuse_attn_quant: False, eliminate_noops: True, enable_sp: False, fuse_gemm_comms: False, fuse_allreduce_rms: False}, max_cudagraph_capture_size: 512, dynamic_shapes_config: {type: DynamicShapesType.BACKED: backed, evaluate_guards: False, assume_32_bit_indexing: True}, local_cache_dir: None} (EngineCore_DP0 pid
INFO
20:28:53 [parallel_state.py:1214] world_size1 rank0 local_rank0 distributed_init_methodtcp://
172.
19.
3
159:52679 backendnccl (EngineCore_DP0 pid
INFO
20:28:54 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A (EngineCore_DP0 pid
INFO
20:28:54 [gpu_model_runner.py:3808] Starting to load model /root/my_python_server/models/OpenBMB_MiniCPM4-
5B-QAT-Int4-GPTQ-format... (EngineCore_DP0 pid
INFO
20:28:54 [gptq_marlin.py:377] Using MarlinLinearKernel for GPTQMarlinLinearMethod (EngineCore_DP0 pid
/root/my_python_server/vllm-env/lib/python
12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:174: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled. (EngineCore_DP0 pid
We recommend installing via pip install torch-c-dlpack-ext (EngineCore_DP0 pid
warnings.warn( (EngineCore_DP0 pid
INFO
20:28:56 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: (FLASH_ATTN, FLASHINFER, TRITON_ATTN, FLEX_ATTENTION) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:0000:00,
26it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:0000:00,
26it/s] (EngineCore_DP0 pid
(EngineCore_DP0 pid
INFO
20:28:57 [default_loader.py:291] Loading weights took
81 seconds (EngineCore_DP0 pid
INFO
20:28:58 [gpu_model_runner.py:3905] Model loading took
32 GiB memory and
018920 seconds (EngineCore_DP0 pid
INFO
20:29:03 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/6e55f109d6/rank_0_0/backbone for vLLMs torch.compile (EngineCore_DP0 pid
INFO
20:29:03 [backends.py:704] Dynamo bytecode transform time:
40 s (EngineCore_DP0 pid
INFO
20:29:09 [backends.py:261] Cache the graph of compile range (1,
for later use (EngineCore_DP0 pid
INFO
20:29:12 [backends.py:278] Compiling a graph for compile range (1,
takes
89 s (EngineCore_DP0 pid
INFO
20:29:12 [monitor.py:34] torch.compile takes
1
29 s in total (EngineCore_DP0 pid
INFO
20:29:13 [gpu_worker.py:358] Available KV cache memory:
57 GiB (EngineCore_DP0 pid
INFO
20:29:13 [kv_cache_utils.py:1305] GPU KV cache size: 50,144 tokens (EngineCore_DP0 pid
INFO
20:29:13 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request:
53x Capturing CUDA graphs (mixed prefill-decode, P Capturing CUDA graphs (decode, FULL): 100%|█| (EngineCore_DP0 pid
INFO
20:29:17 [gpu_model_runner.py:4856] Graph capturing finished in 4 secs, took
37 GiB (EngineCore_DP0 pid
INFO
20:29:17 [core.py:273] init engine (profile, create kv cache, warmup model) took
1
44 seconds INFO
20:29:18 [llm.py:347] Supported tasks: [generate] Adding requests: 100%|█| 1/1 [00:0000:00, 458 Processed prompts: 100%|█| 1/1 [00:0100:00, 北京这座历史与现代交织的城市拥有众多令人向往的景点。
以下是五个不容错过的北京景点推荐
**故宫博物院**作为明清两代的皇家宫殿故宫博物院内珍藏了大量珍贵的文物和艺术品如《清明上河图》、《千里江山图》等是了解中国历史和文化的重要窗口。
**天安门广场**作为世界上最大的城市广场之一天安门广场不仅见证了新中国的诞生还承载着中国人民的骄傲和记忆。
广场周围有毛主席纪念堂、人民英雄纪念碑等标志性建筑是市民和游客聚集的场所。
**颐和园**作为中国保存最完整的一座皇家园林颐和园以其精美的园林设计和丰富的自然景观闻名于世。
园内有长廊、佛香阁、十七孔桥等著名景点是夏季避暑和欣赏园林艺术的好去处。
**北海公园**位于北京市海淀区北海公园以“湖光山色”为特色拥有众多历史名亭和石雕如“海印”、“玉女潭”等是体验京城园林艺术和休闲的好地方。
**798艺术区**作为北京最具现代感的创意产业区之一798艺术区聚集了众多艺术家和创意工作室如798艺术博物馆、798艺术街等是了解当代艺术和创意产业的好去处。
这些景点各具特色不仅展现了北京的历史文化底蕴也体现了现代艺术的创新精神是体验北京魅力不可或缺的组成部分。
ERROR
20:29:19 [core_client.py:610] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.from modelscope import AutoTokenizer from vllm import LLM, SamplingParams import os # 使用与 run_vllm.py 中相同的本地模型路径 LLM_MODEL OpenBMB/MiniCPM4-
5B-QAT-Int4-GPTQ-format LLM_DIR f/root/my_python_server/models/{LLM_MODEL.replace(/, _)} # 检查本地模型是否存在如果不存在则从 ModelScope 下载 if not os.path.exists(LLM_DIR) or not os.listdir(LLM_DIR): print(f本地模型不存在将从 ModelScope 下载: {LLM_MODEL}) from modelscope import snapshot_download snapshot_download(model_idLLM_MODEL, local_dirLLM_DIR) model_name LLM_DIR # 使用本地模型路径 prompt [{role: user, content: 推荐5个北京的景点。
}] tokenizer AutoTokenizer.from_pretrained(model_name, trust_remote_codeTrue) input_text tokenizer.apply_chat_template(prompt, tokenizeFalse, add_generation_promptTrue) llm LLM( modelmodel_name, quantizationgptq_marlin, trust_remote_codeTrue, max_num_batched_tokens512, dtypebfloat16, gpu_memory_utilization
1, ) sampling_params SamplingParams(top_p
7, temperature
7, max_tokens1024, repetition_penalty
1.