Triton Inference Server with vLLM on AMD GPUs

ROCm Blogs, January 8, 2025, by Fabricio Flores, Tiffany Mintz, Eliot Li, Yao Liu, Ted Themistokleous, Brian Pickrell, Vish Vadlamani

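Triton Inference Server is an open-source platform designed to streamline AI inference. It supports deploying, scaling, and running inference on trained AI models from a range of machine learning and deep learning frameworks, including TensorFlow, PyTorch, and vLLM, which makes it suitable for diverse AI workloads. It is designed to work across multiple environments, including cloud, data center, and edge devices.

Some of the features of Triton Inference Server include:

Framework flexibility: Allows models from different frameworks to be deployed (see Triton Inference Server Backends) regardless of the underlying infrastructure. This flexibility makes it possible to run multiple models, or multiple instances of one model, on the same hardware, improving resource utilization.
Hardware and deployment versatility: It is optimized for both GPU and CPU environments, so it can be deployed on a wide range of hardware. Triton Inference Server can be used in the cloud, in data centers, or on edge devices, making it highly versatile.
Performance optimization: Improves inference performance through dynamic batching, which aggregates smaller inference requests so they are processed together, and through concurrent model execution. Concurrent execution allows multiple models to run at the same time, which is essential for real-time applications that need minimal latency.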
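
To make the dynamic batching idea concrete, here is a minimal, illustrative sketch in Python. It is not Triton Inference Server's scheduler: the function name `form_batches`, the `max_batch_size` parameter, and the request labels are assumptions made for this example. It only shows how several small pending requests can be grouped so that each group is handled in one pass.

```python
from typing import List, Sequence


def form_batches(pending: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Group pending requests, in arrival order, into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending[i:i + max_batch_size])
        for i in range(0, len(pending), max_batch_size)
    ]


# Five hypothetical requests grouped with a batch size limit of 2.
requests_queue = ["req-1", "req-2", "req-3", "req-4", "req-5"]
batches = form_batches(requests_queue, max_batch_size=2)
print(batches)
```

A real scheduler also has to decide how long to wait for a batch to fill, which this sketch leaves out.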

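In this blog we show you, step by step, how to set up Triton Inference Server with the vLLM backend on AMD GPUs using ROCm. We start with a brief introduction to some key aspects of vLLM as a Triton Inference Server backend. We then provide a detailed how-to guide for setting up Triton Inference Server with the vLLM backend and testing inference on 3 LLMs (`microsoft/phi-2`, `mistral-7b-instruct`, and `meta-llama/Meta-Llama-3-8B-Instruct`).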

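Requirements

AMD GPU: See the ROCm documentation page for supported hardware and operating systems. This blog was tested on a machine with 8 AMD Instinct MI210 GPUs.
ROCm 6.1+: See the ROCm installation guide for Linux for installation instructions.
Docker: See Install Docker Engine on Ubuntu for installation instructions.
Hugging Face access token: This blog requires a Hugging Face account and a generated user access token.
Access to the mistral and Llama-3 models on Hugging Face: These are gated models on Hugging Face. To request access, see mistralai/Mistral-7B-Instruct-v0.2 and meta-llama/Meta-Llama-3-8B-Instruct.

You can find the files related to this blog in this GitHub folder.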

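Triton Inference Server: vLLM backend

A Triton Inference Server backend is the component responsible for executing an AI model during inference. A backend is a wrapper around a specific machine learning framework (such as PyTorch, TensorFlow, vLLM, or others). Each backend is implemented as a shared library, and a model can be configured to use a specific backend. For example, if a model uses PyTorch, the backend is configured to interact with the PyTorch library.

The Triton Inference Server project provides a set of supported backends that are tested and updated with each release. For the list of supported backends, see Where can I find all the backends that are available for Triton? This blog focuses on the vLLM backend.

Using vLLM as a backend enables inference serving for large language models (LLMs) with high throughput and low latency. vLLM is an engine optimized specifically for LLM inference, particularly in scenarios where continuous batching and memory efficiency are critical.

Here are some key aspects of vLLM in Triton Inference Server:

vLLM integration: vLLM has been integrated into Triton Inference Server since release 23.10. The integration can be used through a pre-built container that includes the vLLM backend or by building a custom container. It allows models such as Facebook's OPT family and the LLaMA models to be served through Triton Inference Server's flexible and scalable architecture.
Configuration and deployment: Setting up vLLM as the backend requires configuring a model repository. This repository includes the model.json and config.pbtxt files. These configurations define model parameters such as memory utilization, batch size, and model-specific settings.
Performance features: The vLLM backend in Triton Inference Server supports asynchronous inference, which is essential for tasks such as large-scale text generation and processing. Features such as tensor parallelism and paged attention improve multi-GPU performance, making vLLM suitable for handling large models across distributed systems.
Deployment options: Models using the vLLM backend can be deployed on a variety of platforms, including cloud environments. Containerized deployment allows models to be scaled horizontally according to performance needs and supports Kubernetes and other orchestration systems.

Using vLLM as the backend for Triton Inference Server provides a highly optimized serving engine tailored to the specific needs of LLMs, while also leveraging Triton Inference Server's robust infrastructure for scalable inference serving.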

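Setting up Triton Inference Server with vLLM backend

To run inference on large language models using Triton Inference Server with the vLLM backend, follow these steps:

Set up Triton Inference Server with the vLLM backend: We configure a docker compose file that includes a Triton Inference Server container with the vLLM backend. The docker compose file references a Docker image with Triton Inference Server pre-installed (the image can be built from source or pulled from a registry), defines GPU access, sets the repository paths, and exposes the necessary ports.
Prepare the model repository: The model repository is a directory, or set of directories, containing the models that will be used for inference. Each model is organized in the repository with a specific structure. Triton Inference Server scans and loads this structure every time it starts.

The structure of the model repository is as follows: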
```
model_repository/
├── <model_name_1>/
│   ├── config.pbtxt  # Configuration file describing the model
│   ├── 1/  # Version directory (Triton Inference Server supports versioning)
│   │   └── model.onnx  # The actual model file (for example, ONNX, PyTorch, vLLM)
│   └── 2/
│       └── model.onnx
├── <model_name_2>/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.json
│   └── 2/
│       └── model.json
```
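model_repository is the root directory and contains one or more subdirectories, each representing a model. Each model is organized into a model directory (<model_name>) whose name corresponds to the name of the model. Inside the model directory, version directories (1/, 2/) allow multiple versions of the same model to coexist. Each version directory contains the actual model files, which lets Triton Inference Server identify and serve the correct version. The model files (model.onnx, model.json, and so on) store the model's architecture and inference parameters. Finally, the configuration file (`config.pbtxt`) defines the names, shapes, and data types of the input and output tensors, along with other configuration.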
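
The layout described above can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the helper names `create_model_dir` and `list_layout` and the model name `my_model` are hypothetical, and the files are created empty as placeholders rather than with real model contents.

```python
import tempfile
from pathlib import Path
from typing import List


def create_model_dir(repo_root: Path, model_name: str, version: int = 1) -> Path:
    """Create <repo_root>/<model_name>/<version>/ and return the version directory."""
    version_dir = repo_root / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; real contents depend on the backend in use.
    (repo_root / model_name / "config.pbtxt").touch()
    (version_dir / "model.json").touch()
    return version_dir


def list_layout(repo_root: Path) -> List[str]:
    """Return every path under repo_root, relative to it, in sorted order."""
    return sorted(p.relative_to(repo_root).as_posix() for p in repo_root.rglob("*"))


# Build a throwaway repository in a temporary directory and show its layout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    create_model_dir(repo, "my_model")  # "my_model" is a hypothetical model name
    layout = list_layout(repo)
    print("\n".join(layout))
```
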
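Define the model configuration and model file: When using the vLLM backend, the model configuration file must specify the backend type in addition to the data types and shapes. A simplified version of config.pbtxt is: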
```
backend: "vllm"

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
```
The model file model.json, which specifies the model initialization and inference parameters, is:
```
{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "trust_remote_code": true,
    "disable_log_requests": true,
    "enforce_eager": true,
    "max_model_len": 2048
}
```
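Of these parameters, `model` specifies the name of the model, `gpu_memory_utilization` limits the model to a given fraction of GPU memory (0.8 here, that is, 80%), and `tensor_parallel_size` defines the number of GPUs the model should use for parallel processing. For more parameters and details on the configuration files, see the Triton Inference Server-vLLM documentation: Start Triton Inference Server.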
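
As a rough sketch of how such a model.json could be produced programmatically, the Python snippet below builds the same set of keys shown above and performs two basic sanity checks. The function name `build_vllm_model_json` and the validation rules are assumptions made for this example, not part of Triton Inference Server or vLLM.

```python
import json


def build_vllm_model_json(
    model: str,
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.8,
    max_model_len: int = 2048,
) -> str:
    """Return the text of a model.json using the keys shown in this blog."""
    # Assumed sanity checks: a fraction in (0, 1] and at least one GPU.
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    settings = {
        "model": model,
        "gpu_memory_utilization": gpu_memory_utilization,
        "tensor_parallel_size": tensor_parallel_size,
        "trust_remote_code": True,
        "disable_log_requests": True,
        "enforce_eager": True,
        "max_model_len": max_model_len,
    }
    return json.dumps(settings, indent=4)


model_json = build_vllm_model_json(
    "meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2
)
print(model_json)
```

The returned string could then be written to the `1/model.json` file of the corresponding model directory.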

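We created a Docker Compose configuration that automates the entire setup of Triton Inference Server with the vLLM backend. The setup includes building the Docker image, configuring access to the AMD GPUs through the docker-compose.yaml file, and setting up a model repository (`./triton_server_vllm/src/model_repository`) containing 3 different large language models (LLMs) for testing. With this setup, running the docker compose build and docker compose up commands starts Triton Inference Server without having to complete the previous steps manually.

Let's start by building the Triton Inference Server Docker image from source. Clone the repository that contains the AMD ROCm version of Triton Inference Server: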

git clone https://github.com/ROCm/tritoninferenceserver-vllm.git

Next, go to the tritoninferenceserver-vllm directory and run the build-vllm-docker.py Python script to build the Docker image:

cd tritoninferenceserver-vllm
python3 build-vllm-docker.py --no-container-pull --enable-logging --enable-stats \
    --enable-tracing --enable-rocm --endpoint=grpc \
    --image gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2 \
    --endpoint=http --backend=python --backend=vllm

The newly built Docker image is named tritonserver. To verify that it exists, use the following command:

docker images | grep tritonserver

The output will be similar to:

REPOSITORY                TAG            IMAGE ID       CREATED         SIZE
tritonserver              latest         fffefb8a8258   22 hours ago    62.8GB

With the tritonserver Docker image built, let's go back to the original directory and clone this blog's repository:

cd ..
git clone https://github.com/ROCm/rocm-blogs.git
cd rocm-blogs/blogs/artificial-intelligence/triton_server_vllm/docker

Then edit the environment file `./triton_server_vllm/docker/.env` and provide your Hugging Face token:

HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_ACCESS_TOKEN>

Next, run the following command to give the start_services.sh script execute permission:

chmod +x start_services.sh

Finally, build and start the Docker container:

docker compose build
docker compose up

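Note: Starting the container and services will take some time, because the models Mistral-7B-Instruct-v0.1, `microsoft/phi-2`, and meta-llama/Meta-Llama-3-8B-Instruct are downloaded from Hugging Face Hub and served.

After running the docker compose up command, the terminal shows output similar to the following: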

[+] Running 2/1
 ✔ Network docker_default                 Created  0.1s
 ✔ Container docker-triton_server_vllm-1  Created  0.0s
Attaching to triton_server_vllm-1
...
triton_server_vllm-1  | [I 2024-08-27 15:33:39.976 ServerApp] Jupyter Server 2.14.2 is running at:
triton_server_vllm-1  | [I 2024-08-27 15:33:39.976 ServerApp] http://3dd761dca9b9:8888/lab
triton_server_vllm-1  | [I 2024-08-27 15:33:39.976 ServerApp]     http://127.0.0.1:8888/lab
triton_server_vllm-1  | [I 2024-08-27 15:33:39.976 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
...
triton_server_vllm-1  | INFO 08-27 22:22:12 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
triton_server_vllm-1  | INFO 08-27 22:22:13 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='microsoft/phi-2', tokenizer='microsoft/phi-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
triton_server_vllm-1  | INFO 08-27 22:22:13 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)

When Triton Inference Server is ready, the console shows the following:

triton_server_vllm-1  | I0827 22:27:53.490967 15 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
triton_server_vllm-1  | I0827 22:27:53.491185 15 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
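
Instead of watching the log for these lines, readiness can also be polled from a script. The sketch below is a minimal example under stated assumptions: it relies on the `/v2/health/ready` endpoint of Triton Inference Server's HTTP API (not mentioned in this blog) being reachable on port 8000, and the helper names `http_ready` and `wait_until_ready`, the retry count, and the delay are choices made for illustration. The demonstration at the end uses a fake probe, so it runs without a server.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str = "http://localhost:8000/v2/health/ready") -> bool:
    """Return True if a GET request to the readiness URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping between failed calls."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Offline demonstration with a fake probe that succeeds on the third call.
fake_answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(fake_answers), attempts=5, delay_s=0.0)
print(ready)
```

Against the server started above, `wait_until_ready(http_ready)` would poll the real endpoint; downloading and loading three models can take several minutes, so the defaults are deliberately generous.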

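In the console output we see that:

The Jupyter server is running at http://127.0.0.1:8888/lab
The model mistralai/Mistral-7B-Instruct-v0.1 is being initialized. Requests can be sent to http://localhost:8000/v2/models/mistral-7b-instruct/generate
The model microsoft/phi-2 is being initialized. Requests can be sent to http://localhost:8000/v2/models/phi2/generate
The model meta-llama/Meta-Llama-3-8B-Instruct is being initialized. Requests can be sent to http://localhost:8000/v2/models/llama3-8b-instruct/generate

With the models ready for inference, we can run some tests.

Understanding the model repository structure and configuration

The docker-compose.yaml file contains the configuration needed to create a Docker container that can serve the phi-2, Mistral-7B-Instruct-v0.1, and Meta-Llama-3-8B-Instruct models. The specific configuration used to serve and run inference with each model is located in the ./triton_server_vllm/src/model_repository folder. In our case, the model_repository folder has the following structure: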

model_repository/
├── llama3-8b-instruct/
│   ├── config.pbtxt    # Configuration file describing the model
│   ├── 1/              # Version directory
│       └── model.json  # The actual model file
├── mistral-7b-instruct/
│   ├── config.pbtxt
│   ├── 1/
│       └── model.json
├── phi2/
│   ├── config.pbtxt
│   ├── 1/
│       └── model.json

Each model's model.json file contains its own configuration. For the llama3-8b-instruct model, its model.json file is:

{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "trust_remote_code": true,
    "disable_log_requests": true,
    "enforce_eager": true,
    "max_model_len": 2048
}

For the Mistral-7B-Instruct-v0.1 model, its model.json file is:

{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "trust_remote_code": true,
    "disable_log_requests": true,
    "enforce_eager": true,
    "max_model_len": 2048
}

For the phi-2 model, its model.json file is:

{
    "model": "microsoft/phi-2",
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 1,
    "trust_remote_code": true,
    "disable_log_requests": true,
    "enforce_eager": true,
    "max_model_len": 2048
}

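The tensor_parallel_size value in each model.json file specifies the number of GPUs used for parallel computation for that model. Since we want to run the 3 models at the same time and have 8 AMD Instinct MI210 GPUs, Meta-Llama-3-8B-Instruct will use 2 of the 8 GPUs, `Mistral-7B-Instruct-v0.1` will use 2 of the remaining 6 GPUs, and phi-2 will use 1 of the remaining 4 GPUs. If a model needs more GPUs, we need to adjust the tensor_parallel_size value of one or more models to fit the number of GPUs available.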
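
The GPU arithmetic in this paragraph can be checked with a short Python sketch. It assumes, as the description above does, that each model takes its GPUs exclusively; the helper names `gpus_required` and `fits` are hypothetical, and the sizes are copied from the three model.json files.

```python
from typing import Dict


def gpus_required(tensor_parallel_sizes: Dict[str, int]) -> int:
    """Sum the tensor_parallel_size values of all models."""
    return sum(tensor_parallel_sizes.values())


def fits(tensor_parallel_sizes: Dict[str, int], available_gpus: int) -> bool:
    """Return True if the combined request does not exceed the available GPUs."""
    return gpus_required(tensor_parallel_sizes) <= available_gpus


# tensor_parallel_size values taken from the model.json files in this blog.
sizes = {"llama3-8b-instruct": 2, "mistral-7b-instruct": 2, "phi2": 1}
available = 8  # the test machine has 8 AMD Instinct MI210 GPUs

total = gpus_required(sizes)
print(f"{total} of {available} GPUs requested, {available - total} left unassigned")
print("fits:", fits(sizes, available))
```

With these values the three models request 5 GPUs in total, which leaves 3 of the 8 GPUs unassigned.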

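For more information on parameters and configuration files, see the Triton Inference Server-vLLM documentation: Start Triton Inference Server.

Inference with phi-2, Mistral-7B-Instruct-v0.1, and Meta-Llama-3-8B-Instruct

With our Jupyter Lab and Triton Inference Server running, navigate to http://127.0.0.1:8888/lab/tree/src/triton_server_vllm.ipynb to run inference with these models.

Let's start by testing microsoft/phi-2 as follows: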

import json

import requests

# Define the endpoint URL
url = "http://localhost:8000/v2/models/phi2/generate"

# Define the payload
payload = {
    "text_input": "What is triton inference server?",
    "parameters": {"stream": False, "temperature": 0, "max_tokens": 100},
}

# Set the request headers (optional)
headers = {"Content-Type": "application/json"}

# Send the POST request
response = requests.post(url, data=json.dumps(payload), headers=headers)

# Print the response
print(response.status_code)
print(response.json())

We are sending a payload with the prompt `"What is triton inference server?"` to Triton Inference Server. The output contains the response status `200` and a JSON object:

200
{'model_name': 'phi2', 'model_version': '1', 'text_output': 'What is triton inference server?\n\nTriton inference server is a software that helps to run machine learning models on a computer. It is like a helper that makes sure the models work correctly and gives us the results we need.\n\nWhat is the purpose of triton inference server?\n\nThe purpose of triton inference server is to help us use machine learning models in our daily lives. It makes it easier for us to use these models and get the results we need.\n\nHow does triton inference server'}
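
Note that `text_output` in the response above begins with the prompt itself. If only the generated continuation is wanted, the echoed prompt can be removed on the client side. The following Python sketch shows one way to do that; the helper name `strip_echoed_prompt` is hypothetical, and `sample` is an abbreviated copy of the response shown above rather than a new result.

```python
def strip_echoed_prompt(prompt: str, text_output: str) -> str:
    """Remove the prompt from the start of text_output if it was echoed back."""
    if text_output.startswith(prompt):
        return text_output[len(prompt):].lstrip()
    return text_output


# Abbreviated copy of the phi2 response shown above.
sample = {
    "model_name": "phi2",
    "model_version": "1",
    "text_output": (
        "What is triton inference server?\n\n"
        "Triton inference server is a software that helps to run "
        "machine learning models on a computer."
    ),
}

completion = strip_echoed_prompt("What is triton inference server?", sample["text_output"])
print(completion)
```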

For `Mistral-7B-Instruct-v0.1`, we have the following code:

# Define the endpoint URL
url = "http://localhost:8000/v2/models/mistral-7b-instruct/generate"

# Define the payload
payload = {
    "text_input": "What is triton inference server?",
    "parameters": {"stream": False, "temperature": 0, "max_tokens": 100},
}

# Set the request headers (optional)
headers = {"Content-Type": "application/json"}

# Send the POST request
response = requests.post(url, data=json.dumps(payload), headers=headers)

# Print the response
print(response.status_code)
print(response.json())

The output is:

200
{'model_name': 'mistral-7b-instruct', 'model_version': '1', 'text_output': 'What is triton inference server?\n\nTriton Inference Server is an open-source, high-performance, and scalable inference engine for deep learning models. It supports a wide range of deep learning frameworks, including TensorFlow, PyTorch, and MXNet, and can be used to deploy deep learning models in various environments, such as edge devices, cloud services, and on-premises data centers.\n\nTriton Inference Server provides a unified API for accessing'}

Finally, for inference with `meta-llama/Meta-Llama-3-8B-Instruct`, we have the following code:

# Define the endpoint URL
url = "http://localhost:8000/v2/models/llama3-8b-instruct/generate"

# Define the payload
payload = {
    "text_input": "What is triton inference server?",
    "parameters": {"stream": False, "temperature": 0, "max_tokens": 100},
}

# Set the request headers (optional)
headers = {"Content-Type": "application/json"}

# Send the POST request
response = requests.post(url, data=json.dumps(payload), headers=headers)

# Print the response
print(response.status_code)
print(response.json())

The response to the POST request is:

200
{'model_name': 'llama3-8b-instruct', 'model_version': '1', 'text_output': 'What is triton inference server?¶\n\nTriton Inference Server is an open-source, high-performance, scalable, and extensible deep learning inference server developed by NVIDIA. It is designed to serve as a production-ready inference engine for deep learning models, allowing developers to deploy and manage their models in a scalable and efficient manner.\n\nTriton Inference Server provides a number of features that make it an attractive choice for deploying deep learning models in production environments, including:\n\n1. **Model serving**: Triton Inference Server can'}

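Deploying the three models (`microsoft/phi-2`, `Mistral-7B-Instruct-v0.1`, and `meta-llama/Meta-Llama-3-8B-Instruct`) at the same time lets us serve multiple LLMs. Triton Inference Server with the vLLM backend manages the necessary resources and optimizes memory utilization to run these models concurrently.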
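
Since the three requests differ only in the model name, they can be folded into a single loop. The sketch below is one possible arrangement, not code from the blog's notebook: the helpers `generate_url`, `build_payload`, and `ask_all` are hypothetical, and the HTTP call is passed in as a function so that the example runs offline with a stand-in (`fake_post`).

```python
import json
from typing import Callable, Dict, Iterable

# Model names as they appear in the model repository of this blog.
MODEL_NAMES = ("phi2", "mistral-7b-instruct", "llama3-8b-instruct")


def generate_url(model_name: str, base_url: str = "http://localhost:8000") -> str:
    """Build the generate endpoint URL for one model."""
    return f"{base_url}/v2/models/{model_name}/generate"


def build_payload(prompt: str, max_tokens: int = 100) -> Dict:
    """Build the request body used in the examples above."""
    return {
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": 0, "max_tokens": max_tokens},
    }


def ask_all(
    prompt: str,
    post: Callable[[str, str], Dict],
    model_names: Iterable[str] = MODEL_NAMES,
) -> Dict[str, str]:
    """Send the prompt to every model; post(url, body) must return the decoded JSON reply."""
    answers = {}
    for name in model_names:
        reply = post(generate_url(name), json.dumps(build_payload(prompt)))
        answers[name] = reply["text_output"]
    return answers


def fake_post(url: str, body: str) -> Dict[str, str]:
    """Stand-in for the HTTP call so the example runs without a server."""
    return {"text_output": f"reply from {url}"}


answers = ask_all("What is triton inference server?", fake_post)
for name, text in answers.items():
    print(name, "->", text)
```

With the server running, `post` could be replaced by a small wrapper around `requests.post`, for example `lambda url, body: requests.post(url, data=body, headers={"Content-Type": "application/json"}).json()`.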