[AIGC魔童] DeepSeek v3 Inference Deployment: Huawei Ascend NPU / TRT-LLM

news/2025/2/9 11:15:35/

    • (1) Deploying DeepSeek inference on Huawei Ascend NPUs
    • (2) Deploying DeepSeek inference with TRT-LLM

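(1) Deploying DeepSeek inference on Huawei Ascend NPUs

Reference blog post: "Huawei Ascend runs DeepSeek-R1 inference with performance on par with high-end GPUs and a free, unlimited API: Luchen's in-house inference engine makes its move"

The MindIE framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3.

For a step-by-step guide to running on Ascend NPUs, follow the instructions here.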

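(2) Deploying DeepSeek inference with TRT-LLM

GitHub repository: https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM now supports the DeepSeek-V3 model, with precision options such as BF16 and INT4/INT8 weight-only. FP8 support is in progress and will be released soon.

To try the new features directly, use the custom TensorRT-LLM branch dedicated to DeepSeek-V3 support: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3.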

2.8.1 Download the DeepSeek model weights

Download the DeepSeek-V3 weights from Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base.

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

Optional: convert the FP8 weights to BF16.

This is only necessary if you want to run the model end to end in BF16 precision.

git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference/
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-V3 --output-bf16-hf-path /path/to/deepseek-v3-bf16
cp /path/to/DeepSeek-V3/config.json /path/to/DeepSeek-V3/configuration_deepseek.py /path/to/deepseek-v3-bf16/

2.8.2 Build the TensorRT engine

First, use convert_checkpoint.py to convert the DeepSeek weights into a TensorRT-LLM checkpoint. Then build the TensorRT engine from that checkpoint.

  • Model conversion

Convert to FP8 weights:

# Convert Deepseek-v3 HF Native FP8 weights to TensorRT-LLM checkpoint.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
    --output_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
    --dtype bfloat16 \
    --use_fp8_weights \
    --tp_size 8 \
    --workers 8  # using multiple workers can accelerate the conversion process

Optional: convert to BF16 weights:

# Convert Deepseek-v3 HF weights to TensorRT-LLM checkpoint in BF16.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
    --output_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
    --dtype bfloat16 \
    --tp_size 32 \
    --workers 8  # using multiple workers can accelerate the conversion process
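
The two conversion commands differ only in the output directory, the tensor-parallel size, and the --use_fp8_weights switch. The following is a minimal Python sketch that assembles both argument lists from one helper. The helper name convert_checkpoint_args is hypothetical, and it uses only the flags that appear in the commands above; it does not run the conversion itself.

```python
from typing import List


def convert_checkpoint_args(model_dir: str, output_dir: str, tp_size: int,
                            use_fp8_weights: bool, workers: int = 8) -> List[str]:
    """Assemble the argv for convert_checkpoint.py (hypothetical helper).

    Only flags shown in the commands above are used.
    """
    args = [
        "python", "convert_checkpoint.py",
        "--model_dir", model_dir,
        "--output_dir", output_dir,
        "--dtype", "bfloat16",
        "--tp_size", str(tp_size),
        "--workers", str(workers),
    ]
    if use_fp8_weights:
        # Keep the native FP8 weights instead of expanding them to BF16.
        args.append("--use_fp8_weights")
    return args


fp8_args = convert_checkpoint_args(
    "./DeepSeek-V3", "./trtllm_checkpoint_deepseek_v3_8gpu_fp8",
    tp_size=8, use_fp8_weights=True)
bf16_args = convert_checkpoint_args(
    "./DeepSeek-V3", "./trtllm_checkpoint_deepseek_v3_32gpu_bf16",
    tp_size=32, use_fp8_weights=False)

print(" ".join(fp8_args))
print(" ".join(bf16_args))
```

Either list could then be passed to subprocess.run(args, check=True) from the directory that contains convert_checkpoint.py.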

  • Build the TensorRT engine

For the FP8 model:

# Build FP8 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
    --output_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
    --max_batch_size 4 \
    --max_seq_len 4096 \
    --max_input_len 2048 \
    --use_paged_context_fmha enable \
    --workers 8

For the BF16 model:

# Build BF16 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
    --output_dir ./trtllm_engines/deepseek_v3/bf16/tp32-sel4096-isl2048-bs4 \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_batch_size 4 \
    --max_seq_len 4096 \
    --max_input_len 2048 \
    --use_paged_context_fmha enable \
    --workers 8

Caution: --max_batch_size and --max_seq_len are the main factors that determine how much GPU memory is used at runtime. If you later hit OOM when running, for example, summarize.py, mmlu.py, or gptManagerBenchmark.cpp, rebuild the TensorRT engine with smaller --max_batch_size and --max_seq_len values that fit your GPU memory. The mechanism is explained in perf-best-practices.md (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).
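
Because these limits are fixed at build time, it can help to check a planned workload against them before launching a run. The sketch below is illustrative only and rests on two assumptions: that the engine directory names used above encode the build limits (tp for tensor-parallel size, sel for --max_seq_len, isl for --max_input_len, bs for --max_batch_size), and that --max_seq_len bounds prompt tokens plus generated tokens for a single request. EngineLimits, limits_from_engine_dir, and fits are hypothetical names, not part of TensorRT-LLM.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineLimits:
    tp_size: int
    max_seq_len: int
    max_input_len: int
    max_batch_size: int


# Assumed naming convention, inferred from the directories used above.
_NAME_RE = re.compile(r"tp(\d+)-sel(\d+)-isl(\d+)-bs(\d+)")


def limits_from_engine_dir(engine_dir: str) -> EngineLimits:
    """Recover the build-time limits from an engine directory name."""
    match = _NAME_RE.search(engine_dir)
    if match is None:
        raise ValueError(f"no build limits encoded in {engine_dir!r}")
    tp, sel, isl, bs = (int(group) for group in match.groups())
    return EngineLimits(tp, sel, isl, bs)


def fits(limits: EngineLimits, batch_size: int,
         input_len: int, max_output_len: int) -> bool:
    """Check a workload against the limits (assumes sel covers input + output)."""
    return (batch_size <= limits.max_batch_size
            and input_len <= limits.max_input_len
            and input_len + max_output_len <= limits.max_seq_len)


limits = limits_from_engine_dir(
    "./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4")
print(limits)
print(fits(limits, batch_size=1, input_len=6, max_output_len=30))
print(fits(limits, batch_size=8, input_len=6, max_output_len=30))
```

A workload that fails this check needs either smaller requests or an engine rebuilt with larger limits. Passing the check does not guarantee the run fits in GPU memory.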

2.8.3 Model inference

Test the FP8 model with the run.py script:

# run.sh
python3 ../run.py --input_text "Today is a nice day." \
    --max_output_len 30 \
    --tokenizer_dir ./DeepSeek-V3 \
    --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
    --top_p 0.95 \
    --temperature 0.3

Multi-node inference:

srun -N 2 -w node-[1-2] --gres=gpu:8 --ntasks-per-node 8 \
    --container-image tensorrt_llm/release:latest \
    --container-mounts ${PWD}:/workspace \
    sh /workspace/command/run.sh

Output:

...
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
Input [Text 0]: "Today is a nice day."
Output [Text 0 Beam 0]: " I am going to the park with my friends. We are going to play soccer. We are going"
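
One quick sanity check on a multi-node run is to compare the number of "Refreshed the MPI local session" lines with the number of tasks that srun launched: 2 nodes with 8 tasks each gives 16, which matches the 16 lines in the log above. The sketch below assumes that each MPI rank logs this line exactly once, which is an inference from this output rather than documented behavior. The function names are hypothetical.

```python
def expected_ranks(nodes: int, tasks_per_node: int) -> int:
    """Total MPI ranks launched by srun -N <nodes> --ntasks-per-node <tasks>."""
    return nodes * tasks_per_node


def count_session_lines(log_text: str) -> int:
    """Count the per-rank session lines in captured run output."""
    marker = "[TensorRT-LLM][INFO] Refreshed the MPI local session"
    return sum(1 for line in log_text.splitlines() if line.strip() == marker)


# Stand-in for the captured output shown above (16 identical lines).
sample_log = "\n".join(
    ["[TensorRT-LLM][INFO] Refreshed the MPI local session"] * 16)

ranks = expected_ranks(nodes=2, tasks_per_node=8)
seen = count_session_lines(sample_log)
print(f"expected {ranks} ranks, saw {seen} session lines")
```

If the two numbers differ, some ranks may have failed to start, and the srun arguments are a reasonable first place to look.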

2.8.4 Model evaluation

Evaluate the model with the mmlu.py script:

# Download MMLU dataset
mkdir mmlu_data && cd mmlu_data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
# Run MMLU evaluation
python3 mmlu.py \
    --hf_model_dir ${MODEL_DIR} \
    --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
    --data_dir mmlu_data \
    --test_trt_llm 2>&1 | tee ${ENGINE_DIR}/test_with_mmlu.log

Output:

Average accuracy 0.926 - high_school_macroeconomics
Average accuracy 0.752 - high_school_mathematics
Average accuracy 0.954 - high_school_microeconomics
Average accuracy 0.848 - high_school_physics
Average accuracy 0.967 - high_school_psychology
Average accuracy 0.861 - high_school_statistics
Average accuracy 0.956 - high_school_us_history
Average accuracy 0.954 - high_school_world_history
Average accuracy 0.861 - human_aging
Average accuracy 0.931 - human_sexuality
Average accuracy 0.975 - international_law
Average accuracy 0.907 - jurisprudence
Average accuracy 0.920 - logical_fallacies
Average accuracy 0.848 - machine_learning
Average accuracy 0.951 - management
Average accuracy 0.957 - marketing
Average accuracy 0.950 - medical_genetics
Average accuracy 0.957 - miscellaneous
Average accuracy 0.870 - moral_disputes
Average accuracy 0.798 - moral_scenarios
Average accuracy 0.918 - nutrition
Average accuracy 0.916 - philosophy
Average accuracy 0.932 - prehistory
Average accuracy 0.869 - professional_accounting
Average accuracy 0.714 - professional_law
Average accuracy 0.956 - professional_medicine
Average accuracy 0.908 - professional_psychology
Average accuracy 0.800 - public_relations
Average accuracy 0.869 - security_studies
Average accuracy 0.960 - sociology
Average accuracy 0.950 - us_foreign_policy
Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy 0.852 - math
Average accuracy 0.874 - health
Average accuracy 0.905 - physics
Average accuracy 0.936 - business
Average accuracy 0.958 - biology
Average accuracy 0.825 - chemistry
Average accuracy 0.888 - computer science
Average accuracy 0.912 - economics
Average accuracy 0.890 - engineering
Average accuracy 0.851 - philosophy
Average accuracy 0.917 - other
Average accuracy 0.932 - history
Average accuracy 0.944 - geography
Average accuracy 0.904 - politics
Average accuracy 0.936 - psychology
Average accuracy 0.949 - culture
Average accuracy 0.744 - law
Average accuracy 0.883 - STEM
Average accuracy 0.827 - humanities
Average accuracy 0.926 - social sciences
Average accuracy 0.898 - other (business, health, misc.)
Average accuracy: 0.877
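
With this many lines, the weakest subjects are easy to miss. The following is a small Python sketch that parses lines in the format shown above and lists the lowest-scoring entries. It assumes the log keeps the "Average accuracy <score> - <name>" layout; the function names are hypothetical. Results are kept as a list rather than a dictionary because some names, such as philosophy, appear twice in the output (once as a subject and once as a category).

```python
import re
from typing import List, Tuple

# Matches per-subject and per-category lines. The final overall line
# ("Average accuracy: 0.877") has a colon and is deliberately not matched.
_LINE_RE = re.compile(r"^Average accuracy (\d+\.\d+) - (.+)$")


def parse_accuracies(log_text: str) -> List[Tuple[str, float]]:
    """Extract (name, accuracy) pairs in the order they appear."""
    results = []
    for line in log_text.splitlines():
        match = _LINE_RE.match(line.strip())
        if match:
            results.append((match.group(2), float(match.group(1))))
    return results


def lowest(results: List[Tuple[str, float]], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n entries with the lowest accuracy."""
    return sorted(results, key=lambda item: item[1])[:n]


# A few lines copied from the output above.
sample = """Average accuracy 0.950 - us_foreign_policy
Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy: 0.877"""

results = parse_accuracies(sample)
for name, score in lowest(results, n=2):
    print(f"{score:.3f}  {name}")
```

To run it on a real evaluation, read the test_with_mmlu.log file written by the tee command and pass its contents to parse_accuracies.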

