首页速度优化福建UU小学生：乘风破浪，闪耀不止的未来之星

网站优化

渴望那份精彩？“想要XX在线观看”的终极指南，一站式满足你的所有期待！

怡红院欧美

2026-06-09 14:14:18

阅读时长:8分钟

562次阅读

核心内容摘要

探寻“哈昂”的神秘力量：一场关于声音、情感与生活的奇妙旅程

- 模型加载仅占用

32 GiB 内存- 可用 KV 缓存内存

57 GiB- 总显存使用约

89 GiB 符合 1GB 以内的要求(TraeAI-

~/my_python_server/wsl [1] $ cd /root/my_python_server/wsl ; /root/my_python_server/vllm-env/bin/python test_inference.py INFO

20:28:51 [utils.py:263] non-default args: {trust_remote_code: True, dtype: bfloat16, gpu_memory_utilization:

1, max_num_batched_tokens: 512, disable_log_stats: True, quantization: gptq_marlin, model: /root/my_python_server/models/OpenBMB_MiniCPM4-

5B-QAT-Int4-GPTQ-format} The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. INFO

20:28:51 [model.py:530] Resolved architecture: MiniCPMForCausalLM WARNING

20:28:51 [model.py:1869] Casting torch.float16 to torch.bfloat

INFO

20:28:51 [model.py:1545] Using max model len 32768 INFO

20:28:53 [gptq_marlin.py:230] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel. INFO

20:28:53 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens

INFO

20:28:53 [vllm.py:630] Asynchronous scheduling is enabled. INFO

20:28:53 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling. WARNING

20:28:53 [interface.py:470] Using pin_memoryFalse as WSL is detected. This may slow down the performance. (EngineCore_DP0 pid

INFO

20:28:53 [core.py:97] Initializing a V1 LLM engine (v

0.

14.

with config: model/root/my_python_server/models/OpenBMB_MiniCPM4-

5B-QAT-Int4-GPTQ-format, speculative_configNone, tokenizer/root/my_python_server/models/OpenBMB_MiniCPM4-

5B-QAT-Int4-GPTQ-format, skip_tokenizer_initFalse, tokenizer_modeauto, revisionNone, tokenizer_revisionNone, trust_remote_codeTrue, dtypetorch.bfloat16, max_seq_len32768, download_dirNone, load_formatauto, tensor_parallel_size1, pipeline_parallel_size1, data_parallel_size1, disable_custom_all_reduceFalse, quantizationgptq_marlin, enforce_eagerFalse, enable_return_routed_expertsFalse, kv_cache_dtypeauto, device_configcuda, structured_outputs_configStructuredOutputsConfig(backendauto, disable_fallbackFalse, disable_any_whitespaceFalse, disable_additional_propertiesFalse, reasoning_parser, reasoning_parser_plugin, enable_in_reasoningFalse), observability_configObservabilityConfig(show_hidden_metrics_for_versionNone, otlp_traces_endpointNone, collect_detailed_tracesNone, kv_cache_metricsFalse, kv_cache_metrics_sample

01, cudagraph_metricsFalse, enable_layerwise_nvtx_tracingFalse, enable_mfu_metricsFalse, enable_mm_processor_statsFalse, enable_logging_iteration_detailsFalse), seed0, served_model_name/root/my_python_server/models/OpenBMB_MiniCPM4-

5B-QAT-Int4-GPTQ-format, enable_prefix_cachingTrue, enable_chunked_prefillTrue, pooler_configNone, compilation_config{level: None, mode: CompilationMode.VLLM_COMPILE: 3, debug_dump_path: None, cache_dir: , compile_cache_save_format: binary, backend: inductor, custom_ops: [none], splitting_ops: [vllm::unified_attention, vllm::unified_attention_with_output, vllm::unified_mla_attention, vllm::unified_mla_attention_with_output, vllm::mamba_mixer2, vllm::mamba_mixer, vllm::short_conv, vllm::linear_attention, vllm::plamo2_mamba_mixer, vllm::gdn_attention_core, vllm::kda_attention, vllm::sparse_attn_indexer], compile_mm_encoder: False, compile_sizes: [], compile_ranges_split_points: [512], inductor_compile_config: {enable_auto_functionalized_v2: False, combo_kernels: True, benchmark_combo_kernel: True}, inductor_passes: {}, cudagraph_mode: CUDAGraphMode.FULL_AND_PIECEWISE: (2,

, cudagraph_num_of_warmups: 1, cudagraph_capture_sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], cudagraph_copy_inputs: False, cudagraph_specialize_lora: True, use_inductor_graph_partition: False, pass_config: {fuse_norm_quant: False, fuse_act_quant: False, fuse_attn_quant: False, eliminate_noops: True, enable_sp: False, fuse_gemm_comms: False, fuse_allreduce_rms: False}, max_cudagraph_capture_size: 512, dynamic_shapes_config: {type: DynamicShapesType.BACKED: backed, evaluate_guards: False, assume_32_bit_indexing: True}, local_cache_dir: None} (EngineCore_DP0 pid

INFO

20:28:53 [parallel_state.py:1214] world_size1 rank0 local_rank0 distributed_init_methodtcp://

172.

19.

3

159:52679 backendnccl (EngineCore_DP0 pid

INFO

20:28:54 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A (EngineCore_DP0 pid

INFO

20:28:54 [gpu_model_runner.py:3808] Starting to load model /root/my_python_server/models/OpenBMB_MiniCPM4-

5B-QAT-Int4-GPTQ-format... (EngineCore_DP0 pid

INFO

20:28:54 [gptq_marlin.py:377] Using MarlinLinearKernel for GPTQMarlinLinearMethod (EngineCore_DP0 pid

/root/my_python_server/vllm-env/lib/python

12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:174: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled. (EngineCore_DP0 pid

We recommend installing via pip install torch-c-dlpack-ext (EngineCore_DP0 pid

warnings.warn( (EngineCore_DP0 pid

INFO

20:28:56 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: (FLASH_ATTN, FLASHINFER, TRITON_ATTN, FLEX_ATTENTION) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:0000:00,

26it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:0000:00,

26it/s] (EngineCore_DP0 pid

(EngineCore_DP0 pid

INFO

20:28:57 [default_loader.py:291] Loading weights took

81 seconds (EngineCore_DP0 pid

INFO

20:28:58 [gpu_model_runner.py:3905] Model loading took

32 GiB memory and

018920 seconds (EngineCore_DP0 pid

INFO

20:29:03 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/6e55f109d6/rank_0_0/backbone for vLLMs torch.compile (EngineCore_DP0 pid

INFO

20:29:03 [backends.py:704] Dynamo bytecode transform time:

40 s (EngineCore_DP0 pid

INFO

20:29:09 [backends.py:261] Cache the graph of compile range (1,

for later use (EngineCore_DP0 pid

INFO

20:29:12 [backends.py:278] Compiling a graph for compile range (1,

takes

89 s (EngineCore_DP0 pid

INFO

20:29:12 [monitor.py:34] torch.compile takes

1

29 s in total (EngineCore_DP0 pid

INFO

20:29:13 [gpu_worker.py:358] Available KV cache memory:

57 GiB (EngineCore_DP0 pid

INFO

20:29:13 [kv_cache_utils.py:1305] GPU KV cache size: 50,144 tokens (EngineCore_DP0 pid

INFO

20:29:13 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request:

53x Capturing CUDA graphs (mixed prefill-decode, P Capturing CUDA graphs (decode, FULL): 100%|█| (EngineCore_DP0 pid

INFO

20:29:17 [gpu_model_runner.py:4856] Graph capturing finished in 4 secs, took

37 GiB (EngineCore_DP0 pid

INFO

20:29:17 [core.py:273] init engine (profile, create kv cache, warmup model) took

1

44 seconds INFO

20:29:18 [llm.py:347] Supported tasks: [generate] Adding requests: 100%|█| 1/1 [00:0000:00, 458 Processed prompts: 100%|█| 1/1 [00:0100:00, 北京这座历史与现代交织的城市拥有众多令人向往的景点。

以下是五个不容错过的北京景点推荐

故宫博物院作为明清两代的皇家宫殿故宫博物院内珍藏了大量珍贵的文物和艺术品如《清明上河图》、《千里江山图》等是了解中国历史和文化的重要窗口。

天安门广场作为世界上最大的城市广场之一天安门广场不仅见证了新中国的诞生还承载着中国人民的骄傲和记忆。

广场周围有毛主席纪念堂、人民英雄纪念碑等标志性建筑是市民和游客聚集的场所。

颐和园作为中国保存最完整的一座皇家园林颐和园以其精美的园林设计和丰富的自然景观闻名于世。

园内有长廊、佛香阁、十七孔桥等著名景点是夏季避暑和欣赏园林艺术的好去处。

北海公园位于北京市海淀区北海公园以“湖光山色”为特色拥有众多历史名亭和石雕如“海印”、“玉女潭”等是体验京城园林艺术和休闲的好地方。

798艺术区作为北京最具现代感的创意产业区之一798艺术区聚集了众多艺术家和创意工作室如798艺术博物馆、798艺术街等是了解当代艺术和创意产业的好去处。

这些景点各具特色不仅展现了北京的历史文化底蕴也体现了现代艺术的创新精神是体验北京魅力不可或缺的组成部分。

ERROR

20:29:19 [core_client.py:610] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.from modelscope import AutoTokenizer from vllm import LLM, SamplingParams import os # 使用与 run_vllm.py 中相同的本地模型路径 LLM_MODEL OpenBMB/MiniCPM4-

5B-QAT-Int4-GPTQ-format LLM_DIR f/root/my_python_server/models/{LLM_MODEL.replace(/, _)} # 检查本地模型是否存在如果不存在则从 ModelScope 下载 if not os.path.exists(LLM_DIR) or not os.listdir(LLM_DIR): print(f本地模型不存在将从 ModelScope 下载: {LLM_MODEL}) from modelscope import snapshot_download snapshot_download(model_idLLM_MODEL, local_dirLLM_DIR) model_name LLM_DIR # 使用本地模型路径 prompt [{role: user, content: 推荐5个北京的景点。

}] tokenizer AutoTokenizer.from_pretrained(model_name, trust_remote_codeTrue) input_text tokenizer.apply_chat_template(prompt, tokenizeFalse, add_generation_promptTrue) llm LLM( modelmodel_name, quantizationgptq_marlin, trust_remote_codeTrue, max_num_batched_tokens512, dtypebfloat16, gpu_memory_utilization

1, ) sampling_params SamplingParams(top_p

7, temperature

7, max_tokens1024, repetition_penalty

1.

outputs llm.generate(promptsinput_text, sampling_paramssampling_params) print(outputs[0].outputs[0].text)

免费黄色APP-免费黄色应用