Open-Source Model Deployment in Practice: Offline Inference with Qwen2.5-7B-Instruct and vLLM (CPU Version)

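I. Introduction

Offline inference runs a trained model over input data that has been prepared in advance, usually in batches. For large datasets this improves throughput: predictions are computed ahead of time instead of on demand, so downstream decisions and user-facing features can read ready-made results. Offline inference can also lower cloud costs, because jobs can be scheduled for periods when resources are cheap or idle.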

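This article shows how to combine the Qwen2.5-7B-Instruct model with the vLLM framework for offline inference with the help of the CPU. vLLM does the heavy lifting for inference: in the examples that follow, its swap_space and cpu_offload_gb options place part of the workload in CPU memory, so the model keeps its accuracy and still responds reasonably quickly when resources are limited. The goal is to make offline inference deliver more value in real projects.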

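GPU version: "Open-Source Model Deployment in Practice: Offline Inference with Qwen2.5-7B-Instruct and vLLM, Cutting Costs and Improving Efficiency (Part 1)"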

II. Terminology

2.1. vLLM

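vLLM is an open-source inference acceleration framework for large language models. It uses PagedAttention to manage the key/value tensors cached by attention efficiently, and its authors report 14-24x higher throughput than HuggingFace Transformers.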
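The bookkeeping idea behind PagedAttention can be illustrated with a toy block table. The sketch below is not vLLM's implementation: the class `ToyBlockManager`, its methods, and the block size of 4 are hypothetical and chosen only for illustration. It shows the core idea that each sequence's cache is split into fixed-size blocks, which are allocated on demand and returned to a shared pool when the sequence finishes.

```python
class ToyBlockManager:
    """Toy illustration of paged KV-cache bookkeeping (not vLLM's real code)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one more token of the given sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new physical block is needed only when the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=4, block_size=4)
for _ in range(5):  # sequence "a" caches 5 tokens and needs 2 blocks
    manager.append_token("a")
for _ in range(3):  # sequence "b" caches 3 tokens and needs 1 block
    manager.append_token("b")
print(manager.block_tables)  # {'a': [0, 1], 'b': [2]}
manager.free("a")
print(manager.free_blocks)   # [3, 0, 1]
```

Because blocks are small and allocated only when needed, memory is not reserved up front for the maximum sequence length, which is where the efficiency gain comes from.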

2.2. Qwen2.5

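All models in the Qwen2.5 series are pretrained on the latest large-scale dataset, which contains up to 18T tokens. Compared with Qwen2, Qwen2.5 has acquired significantly more knowledge (MMLU: 85+) and has improved substantially in coding (HumanEval: 85+) and mathematics (MATH: 80+).

In addition, the new models show significant improvements in instruction following, long-text generation (over 8K tokens), understanding structured data (for example, tables), and generating structured output, especially JSON. Qwen2.5 models are generally more robust to a variety of system prompts, which improves role-play and condition-setting for chatbots.

Like Qwen2, the Qwen2.5 language models support up to 128K tokens of context and can generate up to 8K tokens. They also retain support for more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. Basic information about the models is provided in the table below.

The specialized expert models, Qwen2.5-Coder for programming and Qwen2.5-Math for mathematics, improve substantially on their predecessors CodeQwen1.5 and Qwen2-Math. Specifically, Qwen2.5-Coder is trained on 5.5T tokens of code-related data, which lets even the smaller coding models compete with larger language models on coding benchmarks. Qwen2.5-Math supports both Chinese and English and integrates several reasoning methods, including CoT (Chain of Thought), PoT (Program of Thought), and TIR (Tool-Integrated Reasoning).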

2.3. Qwen2.5-7B-Instruct

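Qwen2.5-7B-Instruct is a language model released by the Tongyi Qianwen (Qwen) team. It has 7 billion parameters and has been instruction-tuned so that it understands and follows instructions better. As part of the Qwen2.5 series, it is pretrained on 18T tokens of data and performs markedly better than its predecessor. Its strengths include language understanding, task adaptability, and multilingual support, and it can also handle reasonably long texts, which makes it suitable for a wide range of natural language processing tasks.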

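2.4. Offline inference

Offline inference is the process of using a trained model to produce predictions or outputs without interacting with it in real time or computing online. The input data is usually prepared in advance and processed in batches, locally or in the cloud, to obtain the model's outputs.

The advantages of offline inference include:

1. Many inputs can be processed in one batch, which makes full use of compute resources and improves inference efficiency.

2. In a cloud environment, inference can be scheduled for off-peak hours to reduce compute costs.

3. It does not depend on real-time data, so inference can run against a stable version of the model, avoiding the uncertainty of online inference.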
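The batch workflow described above can be sketched as a small driver loop. This is a minimal illustration, not code from the article: `chunked`, `run_offline`, and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a real model call so the sketch runs without a model.

```python
import json


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_offline(prompts, generate_fn, batch_size=2):
    """Run generate_fn over all prompts in batches and collect records."""
    records = []
    for batch in chunked(prompts, batch_size):
        texts = generate_fn(batch)  # one output text per prompt, same order
        for prompt, text in zip(batch, texts):
            records.append({"prompt": prompt, "text": text})
    return records


def fake_generate(batch):
    # Stand-in for a real model call, used only to keep the sketch runnable.
    return [f"echo: {p}" for p in batch]


records = run_offline(["q1", "q2", "q3"], fake_generate, batch_size=2)
for record in records:
    # One JSON object per line is a convenient format for offline results.
    print(json.dumps(record, ensure_ascii=False))
```

In a real job, `generate_fn` could wrap a model call that accepts a list of prompts and returns one text per prompt, and the records could be written to a file instead of printed.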

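III. Prerequisites

3.1. Base environment and prerequisites

1) Operating system: CentOS 7

2) GPU: Tesla V100-SXM2-32GB, CUDA Version: 12.2

3) Download the Qwen2.5-7B-Instruct model in advance

The model can be downloaded from either of the two sources below; ModelScope is the recommended one.

Hugging Face: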

https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main

ModelScope:

git clone https://www.modelscope.cn/qwen/Qwen2.5-7B-Instruct.git

Choose the SDK or the Git download method as needed.

Example of downloading with Git:

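3.2. Installing Anaconda

See "Open-Source Model Deployment in Practice: The Right Way to Accelerate Inference with qwen-7b-chat and vLLM (Part 1)".

3.3. Upgrading the vllm package

Consider whether the upgrade will affect your existing environment before you proceed.

First-time installation: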

conda create --name vllm python=3.10
conda activate vllm
pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple

Upgrading an existing installation:

# Clone the existing vllm environment and perform the upgrade in the clone
conda create --name vllm2 --clone vllm
conda activate vllm2
pip install --upgrade vllm

Note: the vllm version must be ≥ 0.4.0.
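The version requirement can be checked from Python before running the examples. The sketch below uses only the standard library. The helper names `version_tuple`, `meets_minimum`, and `installed_vllm_version` are hypothetical, and the comparison only looks at the leading numeric parts of the version string, so it is a rough check and not a full version parser.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.6.3.post1' into (0, 6, 3) by keeping the leading numeric parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '0.4' so that they compare as (0, 4, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum="0.4.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_vllm_version():
    """Return the installed vllm version string, or None if it is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


found = installed_vllm_version()
if found is None:
    print("vllm is not installed in this environment")
else:
    print(f"vllm {found}: meets minimum = {meets_minimum(found)}")
```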

IV. Implementation

4.1. Offline generation

# -*- coding: utf-8 -*-
from vllm import LLM, SamplingParams


def generate(model_path, prompts):
    sampling_params = SamplingParams(temperature=0.45, top_p=0.9, max_tokens=1048)
    # dtype='float16': the V100 does not support bfloat16 (see 5.1).
    # swap_space and cpu_offload_gb are given in GiB of CPU memory (see 5.2).
    llm = LLM(model=model_path, dtype='float16', swap_space=16, cpu_offload_gb=2)
    outputs = llm.generate(prompts, sampling_params)
    return outputs


if __name__ == '__main__':
    model_path = '/data/model/qwen2.5-7b-instruct'
    prompts = ["广州有什么特色景点？",]
    outputs = generate(model_path, prompts)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

(vllm) [root@gpu test]# python -u test.py 
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-21 16:28:58 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-21 16:29:03 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/data/model/qwen2.5-7b-instruct', speculative_config=None, tokenizer='/data/model/qwen2.5-7b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/model/qwen2.5-7b-instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-21 16:29:04 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:29:04 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-21 16:29:06 model_runner.py:1060] Starting to load model /data/model/qwen2.5-7b-instruct...
INFO 10-21 16:29:06 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:29:06 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:26<01:20, 26.86s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:53<00:53, 26.55s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:19<00:26, 26.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:43<00:00, 25.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:43<00:00, 25.96s/it]
INFO 10-21 16:30:51 model_runner.py:1071] Loading model weights took 13.0675 GB
INFO 10-21 16:30:57 gpu_executor.py:122] # GPU blocks: 9932, # CPU blocks: 11702
INFO 10-21 16:30:57 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 4.85x
INFO 10-21 16:31:03 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-21 16:31:03 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-21 16:31:28 model_runner.py:1530] Graph capturing finished in 24 secs.
Processed prompts: 100%|████████████████████████████████████| 1/1 [00:43<00:00, 43.42s/it, est. speed input: 0.12 toks/s, output: 8.04 toks/s]
Prompt: '广州有什么特色景点？', Generated text: ' 广州是广东省的省会城市，拥有丰富的历史文化底蕴和现代化的城市风貌。以下是一些广州的特色景点：\n\n1. 白云山：白云山是广州的标志性景点之一，被誉为“羊城第一秀”，是广州市民休闲娱乐的好去处。白云山上有许多名胜古迹，如白云观、摩星岭等。\n\n2. 广州塔：广州塔是广州的地标性建筑之一，高600米，是世界上最高的电视塔之一。游客可以乘坐电梯到达观景台，欣赏广州的城市风光。\n\n3. 陈家祠：陈家祠是广州著名的古建筑之一，建于清朝，是一座典型的岭南风格的建筑。陈家祠内有许多精美的雕刻和壁画，展示了广东地区的传统文化。\n\n4. 番禺长隆旅游度假区：番禺长隆旅游度假区是广州著名的旅游景点之一，拥有各种游乐设施和动物表演，是家庭旅游的好去处。\n\n5. 越秀公园：越秀公园是广州著名的公园之一，建于清朝，是广州市民休闲娱乐的好去处。公园内有许多名胜古迹，如五羊石像、越秀山等。\n\n6. 海心沙：海心沙是广州珠江新城的一个重要景点，是一个集休闲、娱乐、文化于一体的大型公园。海心沙上有许多现代化的建筑和设施，如广州大剧院、广州塔等。\n\n以上是一些广州的特色景点，当然广州还有很多其他值得一游的地方，如珠江夜游、广州动物园等。广州是一个充满活力和魅力的城市，游客可以在这里体验到广东地区的传统文化和现代城市风貌。'
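The "Maximum concurrency" figure in the log above can be reproduced from the number of GPU blocks. The sketch below assumes a KV-cache block size of 16 tokens; that value is not printed in the log and is an assumption here. The function name `max_concurrency` is hypothetical.

```python
def max_concurrency(num_gpu_blocks, max_seq_len, block_size=16):
    """How many full-length requests fit in the GPU KV cache at once.

    block_size=16 is an assumption; the log does not print it.
    """
    return num_gpu_blocks * block_size / max_seq_len


# 9932 GPU blocks and max_seq_len=32768 are taken from the log above.
print(f"{max_concurrency(9932, 32768):.2f}x")  # 4.85x
```

Under the same assumption, 9932 blocks hold 9932 × 16 = 158,912 tokens, which is about 4.85 requests of 32,768 tokens each.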

4.2. Offline chat

# -*- coding: utf-8 -*-
from vllm import LLM, SamplingParams


def chat(model_path, conversation):
    sampling_params = SamplingParams(temperature=0.45, top_p=0.9, max_tokens=1024)
    # dtype='float16': the V100 does not support bfloat16 (see 5.1).
    llm = LLM(model=model_path, dtype='float16', swap_space=2, cpu_offload_gb=2)
    # llm.chat applies the model's chat template to the conversation.
    outputs = llm.chat(
        conversation,
        sampling_params=sampling_params,
        use_tqdm=False
    )
    return outputs


if __name__ == '__main__':
    model_path = '/data/model/qwen2.5-7b-instruct'
    conversation = [
        {
            "role": "system",
            "content": "你是一位专业的导游"
        },
        {
            "role": "user",
            "content": "请介绍一些广州的特色景点",
        },
    ]
    outputs = chat(model_path, conversation)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

(vllm) [root@gpu test]# python -u test.py 
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-21 16:34:30 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-21 16:34:35 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/data/model/qwen2.5-7b-instruct', speculative_config=None, tokenizer='/data/model/qwen2.5-7b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/model/qwen2.5-7b-instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-21 16:34:36 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:34:36 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-21 16:34:37 model_runner.py:1060] Starting to load model /data/model/qwen2.5-7b-instruct...
INFO 10-21 16:34:37 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:34:37 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:16<00:50, 16.70s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:18<00:15,  7.84s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:19<00:04,  4.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:21<00:00,  3.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:21<00:00,  5.34s/it]
INFO 10-21 16:35:02 model_runner.py:1071] Loading model weights took 12.1953 GB
INFO 10-21 16:35:08 gpu_executor.py:122] # GPU blocks: 10952, # CPU blocks: 2340
INFO 10-21 16:35:08 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 5.35x
INFO 10-21 16:35:09 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-21 16:35:09 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-21 16:35:36 model_runner.py:1530] Graph capturing finished in 27 secs.
Prompt: '<|im_start|>system\n你是一位专业的导游<|im_end|>\n<|im_start|>user\n请介绍一些广州的特色景点<|im_end|>\n<|im_start|>assistant\n', Generated text: '广州作为中国的南大门，不仅有着悠久的历史和丰富的文化底蕴，还拥有许多特色景点。下面是一些广州的特色景点介绍：\n\n1. **广州塔（小蛮腰）**：广州塔是广州的标志性建筑之一，不仅外观独特，而且塔内设有观光层、旋转餐厅等，是俯瞰广州全景的最佳地点。\n\n2. **白云山**：位于广州市中心，是广州市民休闲娱乐的好去处。白云山不仅自然风光优美，还有许多历史遗迹和文化景点，如摩星岭、云台花园等。\n\n3. **陈家祠**：位于广州市越秀区，是广东省内规模最大的一座宗祠建筑群，展示了岭南建筑艺术的独特魅力，也是研究岭南文化的重要场所。\n\n4. **上下九步行街**：位于广州市荔湾区，是广州最繁华的商业街之一，这里不仅有各种特色小吃，还有许多传统手工艺品店，是体验广州地道生活的好地方。\n\n5. **珠江夜游**：乘坐游船在珠江上夜游，可以欣赏到广州塔、海心沙、珠江新城等现代化建筑的夜景，同时也能感受到广州独特的水乡风情。\n\n6. **广州博物馆**：位于广州市越秀区，是一座综合性博物馆，收藏了大量的历史文物和艺术品，是了解广州历史文化的绝佳场所。\n\n7. **荔枝湾涌**：位于广州市荔湾区，是一条充满岭南水乡特色的古水道，沿岸有许多保存完好的岭南建筑，是体验广州传统水乡文化的好地方。\n\n这些景点各具特色，涵盖了自然风光、历史文化、现代建筑等多个方面，是了解广州不可或缺的部分。'
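The prompt printed in the log above shows how the conversation was flattened into a single string before generation. The sketch below reproduces that layout by hand for the two messages used here. The function `render_chatml` is a hypothetical helper for illustration only: the real template ships with the model's tokenizer and covers more cases than this simplified version.

```python
def render_chatml(conversation, add_generation_prompt=True):
    """Render role/content messages in the layout shown in the log above."""
    parts = []
    for message in conversation:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # The trailing assistant header asks the model to start its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "你是一位专业的导游"},
    {"role": "user", "content": "请介绍一些广州的特色景点"},
]
print(repr(render_chatml(conversation)))
```

The printed string matches the `Prompt:` value in the log, which makes it easier to see why the generated text starts directly with the assistant's answer.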

V. Additional notes

5.1. Problem: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100S-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

Cause: V100 GPUs do not support the bfloat16 precision.

Solution: explicitly specify dtype='float16' in the code.
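The rule stated in the error message can be written as a small helper that picks the dtype from the GPU's compute capability. This is only a sketch: `pick_dtype` is a hypothetical name, and the 8.0 threshold is taken from the error message above.

```python
def pick_dtype(compute_capability):
    """Choose a dtype string from a (major, minor) CUDA compute capability."""
    # bfloat16 needs compute capability 8.0 or higher, per the error above.
    return "bfloat16" if tuple(compute_capability) >= (8, 0) else "float16"


print(pick_dtype((7, 0)))  # float16 (the V100 case from the error message)
print(pick_dtype((8, 0)))  # bfloat16
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple, and the result could be passed as the `dtype` argument of `LLM`.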

5.2. Parameters supported by LLM

model: The name or path of a HuggingFace Transformers model.
tokenizer: The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
skip_tokenizer_init: If true, skip initialization of tokenizer and detokenizer. Expect valid prompt_token_ids and None for prompt from the input.
trust_remote_code: Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
tensor_parallel_size: The number of GPUs to use for distributed execution with tensor parallelism.
dtype: The data type for the model weights and activations. Currently, we support `float32`, `float16`, and `bfloat16`. If `auto`, we use the `torch_dtype` attribute specified in the model config file. However, if the `torch_dtype` in the config is `float32`, we will use `float16` instead.
quantization: The method used to quantize the model weights. Currently, we support "awq", "gptq", and "fp8" (experimental). If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the data type of the weights.
revision: The specific model version to use. It can be a branch name, a tag name, or a commit id.
tokenizer_revision: The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.
seed: The seed to initialize the random number generator for sampling.
gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors.
swap_space: The size (GiB) of CPU memory per GPU to use as swap space. This can be used for temporarily storing the states of the requests when their `best_of` sampling parameters are larger than 1. If all requests will have `best_of=1`, you can safely set this to 0. Otherwise, too small values may cause out-of-memory (OOM) errors.
cpu_offload_gb: The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer for every forward pass.
enforce_eager: Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid.
max_context_len_to_capture: Maximum context len covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. Additionally for encoder-decoder models, if the sequence length of the encoder input is larger than this, we fall back to the eager mode.
disable_custom_all_reduce: See ParallelConfig
**kwargs: Arguments for :class:`~vllm.EngineArgs`.
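To see how `swap_space` relates to the "# CPU blocks" figure in the logs, the size of one KV-cache block can be estimated from the model shape. The sketch below rests on assumptions that are not stated in the article: a block size of 16 tokens, 2 bytes per value for float16, and a Qwen2.5-7B-Instruct shape of 28 layers, 4 KV heads, and head dimension 128. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_block_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes=2, block_size=16):
    """Bytes of KV cache held by one block: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * block_size


def cpu_blocks(swap_space_gib, block_bytes):
    """Number of whole KV-cache blocks that fit in the CPU swap space."""
    return int(swap_space_gib * GIB) // block_bytes


# Assumed model shape: 28 layers, 4 KV heads, head dimension 128.
block_bytes = kv_block_bytes(num_layers=28, num_kv_heads=4, head_dim=128)
print(block_bytes)                 # 917504
print(cpu_blocks(2, block_bytes))  # 2340
```

Under these assumptions, `swap_space=2` gives 2340 blocks, which matches "# CPU blocks: 2340" in the log of section 4.2. The log in section 4.1 reports 11702 CPU blocks, which under the same assumptions corresponds to about 10 GiB of swap space, so that run may have used a different `swap_space` value than the listing shows.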