Mixed Precision Training Memory Requirements of Meta-Llama-3-8B-Instruct with the AdamW Optimizer



In-Depth Analysis of the Memory Requirements for Mixed Precision Training of the Meta-Llama-3-8B-Instruct Model

Meta-Llama-3-8B-Instruct is an 8-billion-parameter language model fine-tuned for instruction-following tasks. Compared to smaller models, it requires more compute and memory. Mixed precision training, which stores the working weights in BF16 while keeping the master (update) copies of the weights and the optimizer states in FP32, helps reduce memory usage and speed up training. In this post, we calculate the memory requirements of Meta-Llama-3-8B-Instruct under mixed precision training and explain why FP32 is used for the optimizer states even though the weights and gradients are kept in BF16.

1. Meta-Llama-3-8B-Instruct Model Parameters

The Meta-Llama-3-8B-Instruct model consists of 8 billion parameters. The memory requirements vary depending on the precision format used for different components (weights, gradients, optimizer parameters). Let’s calculate the memory needed for various components of the model under mixed precision training.

Precision Formats (a quick size check follows this list):

  • BF16: 2 bytes per parameter.
  • FP32: 4 bytes per parameter.
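To make these per-parameter sizes concrete, here is a minimal check using PyTorch's element_size(); the parameter count of 8 × 10^9 is the rounded figure used throughout this post.

import torch

# Bytes per parameter for the two formats discussed above.
bf16_bytes = torch.tensor([], dtype=torch.bfloat16).element_size()  # 2
fp32_bytes = torch.tensor([], dtype=torch.float32).element_size()   # 4

n_params = 8e9  # ~8 billion parameters (rounded)
print(f"BF16 copy: {n_params * bf16_bytes / 1e9:.0f} GB")  # 16 GB
print(f"FP32 copy: {n_params * fp32_bytes / 1e9:.0f} GB")  # 32 GB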

2. Detailed Calculation of Memory Requirements

Mixed precision training involves storing model weights and gradients in BF16, while optimizer parameters (momentum) and the weight update copies are stored in FP32. Let’s break down the memory requirements for the different components:

1) Model Weights

  • Forward and Backward Propagation:
    Weights are stored in BF16.
    \text{Model Weights (BF16)} = 8 \times 10^9 \times 2 \, \text{bytes} = 16 \, \text{GB}

  • Weight Update Copies:
    The FP32 copy of the weights is used for updates.
    \text{Model Weights (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}

2) Gradients

The gradients have the same number of elements as the model weights and are typically stored in BF16:
\text{Gradients (BF16)} = 8 \times 10^9 \times 2 \, \text{bytes} = 16 \, \text{GB}

3) Optimizer Momentum Parameters

The AdamW optimizer stores a first-moment and a second-moment estimate for every parameter, both kept in FP32 (a minimal sketch of one update step follows this list):

  • First Moment (m):
    \text{First Moment (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}

  • Second Moment (v):
    \text{Second Moment (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}
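The per-parameter nature of these buffers is easiest to see in code. Below is a minimal, framework-free sketch of one AdamW step, assuming an FP32 master copy of the weights, FP32 moment buffers m and v, and a BF16 gradient; it illustrates the bookkeeping, not a production optimizer implementation.

import torch

def adamw_step(master_w, m, v, grad_bf16, step, lr=1e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # master_w, m and v are FP32 tensors with the same shape as the parameters,
    # which is exactly why the optimizer state costs 3 x 4 bytes per parameter.
    g = grad_bf16.float()                           # upcast the BF16 gradient to FP32
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # first moment (FP32)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second moment (FP32)
    m_hat = m / (1 - beta1 ** step)                 # bias correction
    v_hat = v / (1 - beta2 ** step)
    master_w.mul_(1 - lr * weight_decay)            # decoupled weight decay
    master_w.addcdiv_(m_hat, v_hat.sqrt() + eps, value=-lr)
    return master_w.to(torch.bfloat16)              # BF16 working copy for the next forward pass

# Example with a single dummy parameter tensor:
w32 = torch.zeros(10)                      # FP32 master weights
m, v = torch.zeros(10), torch.zeros(10)    # FP32 moment buffers
w16 = adamw_step(w32, m, v, torch.randn(10, dtype=torch.bfloat16), step=1)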

4) Activations

During backpropagation, the activations saved during the forward pass must be kept in memory, typically in BF16. Their actual footprint depends on batch size and sequence length (and can be reduced with activation checkpointing); as a rough simplification, assume they amount to about 30% of the BF16 weight size:
\text{Activations (BF16)} \approx 0.3 \times 16 \, \text{GB} = 4.8 \, \text{GB}


3. Memory Requirements Summary

Component                      Precision Format    Memory Requirement
Model Weights                  BF16                16 GB
Weight Update Copies           FP32                32 GB
Gradients                      BF16                16 GB
First Moment (m)               FP32                32 GB
Second Moment (v)              FP32                32 GB
Activations (~30% estimate)    BF16                4.8 GB
Total                                              132.8 GB
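For convenience, here is a short script that reproduces the totals in the table above under the same simplifying assumptions (8 × 10^9 parameters, decimal gigabytes, activations at 30% of the BF16 weights).

# Reproduce the table above under the same simplifying assumptions.
N = 8e9                  # parameter count (rounded to 8 billion)
BF16, FP32 = 2, 4        # bytes per parameter

components = {
    "Model Weights (BF16)":        N * BF16,
    "Weight Update Copies (FP32)": N * FP32,
    "Gradients (BF16)":            N * BF16,
    "First Moment (FP32)":         N * FP32,
    "Second Moment (FP32)":        N * FP32,
    "Activations (BF16, ~30%)":    0.3 * N * BF16,
}

for name, size in components.items():
    print(f"{name:30s} {size / 1e9:6.1f} GB")
print(f"{'Total':30s} {sum(components.values()) / 1e9:6.1f} GB")  # 132.8 GB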

4. Detailed Analysis

Why are Optimizer Parameters Stored in FP32?

Even though BF16 offers significant memory savings, it keeps only 8 exponent bits and 7 explicit mantissa bits, so it lacks the precision needed for reliable optimizer updates. A single AdamW update is typically several orders of magnitude smaller than the weight it is applied to, and in BF16 such a small increment can be rounded away entirely. Keeping the momentum buffers and the master weight copy in FP32 (23 mantissa bits) preserves these small updates and maintains numerical stability during optimization.
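A two-line experiment makes the rounding problem concrete; the 1e-3 update below is just an illustrative magnitude for lr * m_hat / (sqrt(v_hat) + eps).

import torch

# Near 1.0, BF16 can only represent steps of 2**-7 (about 0.0078), so a 1e-3
# update added to a BF16 weight is rounded away; in FP32 it survives.
w_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
w_fp32 = torch.tensor(1.0, dtype=torch.float32)
update = 1e-3  # illustrative size of a single AdamW step

print((w_bf16 + update) - w_bf16)  # tensor(0., dtype=torch.bfloat16): update lost
print((w_fp32 + update) - w_fp32)  # tensor(0.0010): update kept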

Why is the Memory Requirement for Momentum Parameters So High?

Both first and second moment parameters in AdamW are stored in FP32, which results in a significant memory footprint. The memory needed for these parameters scales linearly with the number of model parameters. In the case of LLaMA-3 8B, the optimizer’s memory requirements are substantial, occupying 64 GB just for these momentum terms.

How Can Memory Usage Be Optimized?

  • ZeRO optimization: DeepSpeed's ZeRO partitions the optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across data-parallel GPUs, sharply reducing the per-GPU memory footprint (a rough per-GPU estimate follows this list).
  • Mixed precision optimization: keeping weights, gradients, and activations in BF16 reduces memory usage while retaining enough precision for the forward and backward passes; the FP32 master weights and moment buffers preserve update accuracy.
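As a rough back-of-the-envelope sketch (ignoring activations, communication buffers, and framework overhead), the per-GPU cost of the model states under ZeRO Stage 2 can be estimated as follows; the function below is purely illustrative.

# Rough per-GPU estimate of model-state memory under ZeRO Stage 2:
# BF16 weights are replicated, while BF16 gradients and the FP32
# optimizer states (master weights + m + v) are partitioned across ranks.
N = 8e9  # parameter count

def per_gpu_gb(n_gpus):
    weights_bf16 = N * 2                 # replicated on every GPU
    grads_bf16   = N * 2 / n_gpus        # partitioned at Stage 2
    optim_fp32   = N * 4 * 3 / n_gpus    # master weights + m + v, partitioned
    return (weights_bf16 + grads_bf16 + optim_fp32) / 1e9

for g in (1, 4, 8):
    print(f"{g} GPU(s): ~{per_gpu_gb(g):.0f} GB per GPU (model states only)")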

5. Code Example: DeepSpeed Mixed Precision Training

Below is a minimal example of configuring DeepSpeed for BF16 mixed precision training of Meta-Llama-3-8B-Instruct with AdamW and ZeRO Stage 2:

import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DeepSpeed configuration
ds_config = {
    "bf16": {
        "enabled": True  # enable BF16 mixed precision training
    },
    "optimizer": {
        "type": "AdamW",  # AdamW optimizer; its states are kept in FP32 by DeepSpeed
        "params": {
            "lr": 1e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "zero_optimization": {
        "stage": 2,  # ZeRO Stage 2: partition optimizer states and gradients
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    "gradient_accumulation_steps": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0
}

# Initialize DeepSpeed (model_parameters is required so DeepSpeed can build
# the AdamW optimizer defined in the config)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Example input, moved to the engine's device
inputs = tokenizer("Hello, DeepSpeed!", return_tensors="pt")
inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

# Forward pass, backward pass, and optimizer step
outputs = model_engine(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()
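In practice, DeepSpeed scripts are launched with the deepspeed launcher rather than plain python, for example: deepspeed --num_gpus=8 train.py (the script name is just a placeholder). With ZeRO Stage 2, the FP32 optimizer states (96 GB) and BF16 gradients (16 GB) from the table above are partitioned across the data-parallel ranks, while the 16 GB of BF16 weights and the activations remain on every GPU.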

6. Conclusion

  1. Mixed precision training reduces memory usage by storing weights and gradients in BF16, while using FP32 for the momentum terms and weight update copies to ensure numerical stability.
  2. For the Meta-Llama-3-8B-Instruct model, the total comes to about 132.8 GB under the assumptions above, with the FP32 optimizer states (the master weights and the two moment buffers) accounting for the largest share.
  3. By using techniques like ZeRO optimization, DeepSpeed helps further optimize memory usage, making training large models more feasible and efficient.

Postscript

Completed in Shanghai at 14:46 on December 1, 2024, with the assistance of the GPT-4o model.

