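DeepSpeed-Inference is the inference extension of the DeepSpeed framework. It combines parallelism techniques (tensor parallelism and pipeline parallelism) with custom optimized CUDA kernels. DeepSpeed provides a seamless inference mode that is compatible with Transformer models trained with DeepSpeed, Megatron, and HuggingFace. Because DeepSpeed-Inference integrates model parallelism, large models can be served across multiple GPUs.
This post uses HuggingFace's BLOOM model as an example to show how DeepSpeed-Inference is used.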
import os
import torch
import deepspeed
import numpy as np
import transformers
from time import perf_counter
from transformers import AutoTokenizer, AutoModelForCausalLM
from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference

transformers.logging.set_verbosity_error()
1. Direct inference
First, load a 7B model:
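model_name = "bigscience/bloomz-7b1-mt"
# The prompt is Chinese. In English: "The beginning of a legend, an undying myth; this is
# not just a film but a hallmark of entering a new era, forever inscribed in history.
# Do you think the stance of this sentence is praise, neutral, or criticism?"
payload = "一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。你认为这句话的立场是赞扬、中立还是批评?"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
print(f"Model loaded on device {model.device.type}")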
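As a rough check on whether the checkpoint fits on one card, the weight footprint can be estimated from the parameter count and the dtype width. The sketch below assumes roughly 7.1 billion parameters (implied by the model name, not measured here) and counts weights only; activations and the KV cache need additional memory. The helper name `weight_memory_gib` is made up for illustration.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="float16", num_gpus=1):
    """Approximate per-GPU memory (GiB) for weights split evenly across GPUs."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / num_gpus / 1024**3


if __name__ == "__main__":
    params = 7.1e9  # assumption: approximate parameter count of a "7B" model
    print(f"fp16, 1 GPU : {weight_memory_gib(params):.1f} GiB")
    print(f"fp16, 4 GPUs: {weight_memory_gib(params, num_gpus=4):.1f} GiB per GPU")
```

Under these assumptions the fp16 weights come to a little over 13 GiB, which is why the model can be loaded on a single 24 GB card for the baseline run.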
Define an inference function and try it out:
def inference(payload, model, tokenizer):
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(model.device)
    print(f"Input:\n {payload}")
    # generate() returns token IDs (the prompt followed by the new tokens), not logits
    output_ids = model.generate(input_ids, do_sample=True, num_beams=1, max_new_tokens=128)
    # Decode only the newly generated tokens
    print(f"Generated:\n {tokenizer.decode(output_ids[0].tolist()[len(input_ids[0]):])}")


if __name__ == "__main__":
    inference(payload, model, tokenizer)
# Output (赞扬 means "praise")
"""
Input:
 一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。你认为这句话的立场是赞扬、中立还是批评?
Generated:
 赞扬</s>
"""
Define a function that measures latency:
def measure_latency(model, tokenizer, payload, device, generation_args=None):
    generation_args = generation_args or {}
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
    latencies = []
    # Warm-up
    for _ in range(2):
        _ = model.generate(input_ids, **generation_args)
    # Timed runs
    for _ in range(10):
        start_time = perf_counter()
        _ = model.generate(input_ids, **generation_args)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Summary statistics
    time_avg_ms = 1000 * np.mean(latencies)  # mean latency
    time_std_ms = 1000 * np.std(latencies)  # standard deviation of the latency
    time_p95_ms = 1000 * np.percentile(latencies, 95)  # 95th percentile of the latency
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f};", time_p95_ms


def test_inference_time():
    print(f'Input sequence length: {len(tokenizer(payload)["input_ids"])}')
    generation_args = dict(do_sample=False, num_beams=1, max_new_tokens=128)
    vanilla_results = measure_latency(model, tokenizer, payload, model.device, generation_args)
    print(f"Vanilla model: {vanilla_results[0]}")


if __name__ == "__main__":
    test_inference_time()
# Output
"""
Vanilla model: P95 latency (ms) - 147.3398147150874; Average latency (ms) - 143.50 +/- 2.45;
"""
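2. DeepSpeed-Inference
DeepSpeed-Inference can use tensor parallelism to split a large model across multiple GPUs, which makes inference possible and also provides some speedup.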
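To make the idea of tensor parallelism concrete: the weight matrix of a linear layer can be split by columns, each shard multiplied on its own device, and the partial outputs concatenated. The toy sketch below does this with plain Python lists standing in for GPUs. It illustrates the general technique only and is not DeepSpeed's actual implementation; all function names are made up.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split w into num_shards equally wide column blocks."""
    cols = len(w[0])
    assert cols % num_shards == 0, "column count must divide evenly"
    step = cols // num_shards
    return [[row[i:i + step] for row in w] for i in range(0, cols, step)]


def column_parallel_matmul(x, w, num_shards):
    # Each shard's product would run on a separate device in a real system.
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate the partial outputs along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded computation gives the same result as the full product.
assert column_parallel_matmul(x, w, num_shards=4) == matmul(x, w)
print(column_parallel_matmul(x, w, num_shards=4))
```

Each shard holds only a quarter of the weights in this example, which is the property that lets a model too large for one GPU be served from several.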
Load the 7B model:
model_name = "bigscience/bloomz-7b1-mt"
payload = "一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。你认为这句话的立场是赞扬、中立还是批评?"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
ds_model = deepspeed.init_inference(
    model=model,  # Transformers model
    mp_size=4,  # number of GPUs
    dtype=torch.float16,  # weight dtype (fp16)
    replace_method="auto",  # let DeepSpeed choose which layers to replace
    replace_with_kernel_inject=True,  # replace layers using the kernel injector
)
print(f"Model loaded on device {ds_model.module.device}\n")
assert isinstance(ds_model.module.transformer.h[0], DeepSpeedTransformerInference), "Model not successfully initialized"
Define a function that measures latency:
def measure_latency(model, tokenizer, payload, device, generation_args=None):
    generation_args = generation_args or {}
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
    latencies = []
    # Warm-up
    for _ in range(2):
        _ = model.generate(input_ids, **generation_args)
    # Timed runs
    for _ in range(10):
        start_time = perf_counter()
        _ = model.generate(input_ids, **generation_args)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Summary statistics
    time_avg_ms = 1000 * np.mean(latencies)  # mean latency
    time_std_ms = 1000 * np.std(latencies)  # standard deviation of the latency
    time_p95_ms = 1000 * np.percentile(latencies, 95)  # 95th percentile of the latency
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f};", time_p95_ms
Run inference and measure the latency:
def test_inference():
    # Run the model once; inputs go to the device that holds the DeepSpeed module
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(ds_model.module.device)
    output_ids = ds_model.generate(input_ids, do_sample=True, max_length=100)
    print(tokenizer.decode(output_ids[0].tolist()))


def test_inference_time():
    # Measure inference latency
    print(f'Input sequence length: {len(tokenizer(payload)["input_ids"])}')
    generation_args = dict(do_sample=False, num_beams=1, max_new_tokens=128)
    ds_results = measure_latency(ds_model, tokenizer, payload, ds_model.module.device, generation_args)
    print(f"DeepSpeed model: {ds_results[0]}")


if __name__ == "__main__":
    test_inference()
    test_inference_time()
# Output
"""
DeepSpeed model: P95 latency (ms) - 80.4762739688158; Average latency (ms) - 77.79 +/- 2.20;
"""
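The DeepSpeed script cannot be run with the plain python command; it has to be launched with the deepspeed command, which starts one process per GPU. For example:
# bloom_ds_inference.py is the Python script to run
deepspeed --num_gpus 4 --master_port 60000 bloom_ds_inference.py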
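The launcher passes each process its position in the job through environment variables such as `LOCAL_RANK` and `WORLD_SIZE`. A script can read them as sketched below. The fallback defaults are an assumption that lets the same code run as a single process, and `read_dist_env` is an illustrative helper, not a DeepSpeed API.

```python
import os


def read_dist_env(env=None):
    """Read the local rank and world size set by a distributed launcher."""
    env = os.environ if env is None else env
    return {
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # default: single process
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    info = read_dist_env()
    print(f"process {info['local_rank']} of {info['world_size']}")
```

One possible use is to pass `world_size` in place of the hard-coded GPU count, so the script follows whatever value is given to `--num_gpus`.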
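3. Results
This experiment used four RTX 3090 GPUs:
- The average latency was 143.5 ms before using DeepSpeed-Inference and 77.79 ms after, an overall speedup of 1.84x;
- DeepSpeed-Inference can distribute a single model across multiple GPUs, so models that do not fit on one GPU can still be served using several.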
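The speedup figure follows directly from the two average latencies. The short calculation below reproduces it and also derives a per-GPU scaling efficiency, which is an extra metric added here for illustration and not something the experiment reported.

```python
def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


baseline_ms = 143.50   # average latency without DeepSpeed-Inference
deepspeed_ms = 77.79   # average latency with DeepSpeed-Inference
num_gpus = 4

s = speedup(baseline_ms, deepspeed_ms)
print(f"speedup: {s:.2f}x")
print(f"per-GPU scaling efficiency: {s / num_gpus:.0%}")
```

An efficiency well below 100% is expected for a model of this size, since splitting the layers adds communication between the GPUs on every forward pass.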