Mixed Precision Training Memory Requirements of Meta-Llama-3-8B-Instruct with the AdamW Optimizer



In-Depth Analysis of the Memory Requirements for Mixed Precision Training of the Meta-Llama-3-8B-Instruct Model

Meta-Llama-3-8B-Instruct is an 8-billion-parameter language model fine-tuned for instruction-following tasks. Compared to smaller models, it requires more compute and memory. Mixed precision training, which stores the working weights in BF16 while keeping the master (update) copies of the weights and the optimizer states in FP32, helps reduce memory usage and speed up training. In this post, we calculate the memory requirements of Meta-Llama-3-8B-Instruct under mixed precision training and explain why FP32 is used for the optimizer states even though the weights and gradients are kept in BF16.

1. Meta-Llama-3-8B-Instruct Model Parameters

The Meta-Llama-3-8B-Instruct model consists of 8 billion parameters. The memory requirements vary depending on the precision format used for different components (weights, gradients, optimizer parameters). Let’s calculate the memory needed for various components of the model under mixed precision training.

Precision Formats (a quick size check follows this list):

  • BF16: 2 bytes per parameter.
  • FP32: 4 bytes per parameter.
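To make these per-parameter sizes concrete, here is a minimal check using PyTorch's element_size(); the parameter count of 8 × 10^9 is the rounded figure used throughout this post.

import torch

# Bytes per parameter for the two formats discussed above.
bf16_bytes = torch.tensor([], dtype=torch.bfloat16).element_size()  # 2
fp32_bytes = torch.tensor([], dtype=torch.float32).element_size()   # 4

n_params = 8e9  # ~8 billion parameters (rounded)
print(f"BF16 copy: {n_params * bf16_bytes / 1e9:.0f} GB")  # 16 GB
print(f"FP32 copy: {n_params * fp32_bytes / 1e9:.0f} GB")  # 32 GB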

2. Detailed Calculation of Memory Requirements

Mixed precision training involves storing model weights and gradients in BF16, while optimizer parameters (momentum) and the weight update copies are stored in FP32. Let’s break down the memory requirements for the different components:

1) Model Weights

  • Forward and Backward Propagation:
    Weights are stored in BF16.
    \text{Model Weights (BF16)} = 8 \times 10^9 \times 2 \, \text{bytes} = 16 \, \text{GB}

  • Weight Update Copies:
    The FP32 copy of the weights is used for updates.
    \text{Model Weights (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}

2) Gradients

The gradients have the same number of elements as the model weights and are typically stored in BF16:
\text{Gradients (BF16)} = 8 \times 10^9 \times 2 \, \text{bytes} = 16 \, \text{GB}

3) Optimizer Momentum Parameters

The AdamW optimizer stores a first-moment and a second-moment estimate for every parameter, both kept in FP32 (a minimal sketch of one update step follows this list):

  • First Moment (m):
    \text{First Moment (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}

  • Second Moment (v):
    \text{Second Moment (FP32)} = 8 \times 10^9 \times 4 \, \text{bytes} = 32 \, \text{GB}
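The per-parameter nature of these buffers is easiest to see in code. Below is a minimal, framework-free sketch of one AdamW step, assuming an FP32 master copy of the weights, FP32 moment buffers m and v, and a BF16 gradient; it illustrates the bookkeeping, not a production optimizer implementation.

import torch

def adamw_step(master_w, m, v, grad_bf16, step, lr=1e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # master_w, m and v are FP32 tensors with the same shape as the parameters,
    # which is exactly why the optimizer state costs 3 x 4 bytes per parameter.
    g = grad_bf16.float()                           # upcast the BF16 gradient to FP32
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # first moment (FP32)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second moment (FP32)
    m_hat = m / (1 - beta1 ** step)                 # bias correction
    v_hat = v / (1 - beta2 ** step)
    master_w.mul_(1 - lr * weight_decay)            # decoupled weight decay
    master_w.addcdiv_(m_hat, v_hat.sqrt() + eps, value=-lr)
    return master_w.to(torch.bfloat16)              # BF16 working copy for the next forward pass

# Example with a single dummy parameter tensor:
w32 = torch.zeros(10)                      # FP32 master weights
m, v = torch.zeros(10), torch.zeros(10)    # FP32 moment buffers
w16 = adamw_step(w32, m, v, torch.randn(10, dtype=torch.bfloat16), step=1)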

4) Activations

During backpropagation, the activations saved during the forward pass must be kept in memory, typically in BF16. Their actual footprint depends on batch size and sequence length (and can be reduced with activation checkpointing); as a rough simplification, assume they amount to about 30% of the BF16 weight size:
\text{Activations (BF16)} \approx 0.3 \times 16 \, \text{GB} = 4.8 \, \text{GB}


3. Memory Requirements Summary

Component                      Precision Format    Memory Requirement
Model Weights                  BF16                16 GB
Weight Update Copies           FP32                32 GB
Gradients                      BF16                16 GB
First Moment (m)               FP32                32 GB
Second Moment (v)              FP32                32 GB
Activations (~30% estimate)    BF16                4.8 GB
Total                                              132.8 GB
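For convenience, here is a short script that reproduces the totals in the table above under the same simplifying assumptions (8 × 10^9 parameters, decimal gigabytes, activations at 30% of the BF16 weights).

# Reproduce the table above under the same simplifying assumptions.
N = 8e9                  # parameter count (rounded to 8 billion)
BF16, FP32 = 2, 4        # bytes per parameter

components = {
    "Model Weights (BF16)":        N * BF16,
    "Weight Update Copies (FP32)": N * FP32,
    "Gradients (BF16)":            N * BF16,
    "First Moment (FP32)":         N * FP32,
    "Second Moment (FP32)":        N * FP32,
    "Activations (BF16, ~30%)":    0.3 * N * BF16,
}

for name, size in components.items():
    print(f"{name:30s} {size / 1e9:6.1f} GB")
print(f"{'Total':30s} {sum(components.values()) / 1e9:6.1f} GB")  # 132.8 GB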

4. Detailed Analysis

Why are Optimizer Parameters Stored in FP32?

Even though BF16 offers significant memory savings, it keeps only 8 exponent bits and 7 explicit mantissa bits, so it lacks the precision needed for reliable optimizer updates. A single AdamW update is typically several orders of magnitude smaller than the weight it is applied to, and in BF16 such a small increment can be rounded away entirely. Keeping the momentum buffers and the master weight copy in FP32 (23 mantissa bits) preserves these small updates and maintains numerical stability during optimization.
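A two-line experiment makes the rounding problem concrete; the 1e-3 update below is just an illustrative magnitude for lr * m_hat / (sqrt(v_hat) + eps).

import torch

# Near 1.0, BF16 can only represent steps of 2**-7 (about 0.0078), so a 1e-3
# update added to a BF16 weight is rounded away; in FP32 it survives.
w_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
w_fp32 = torch.tensor(1.0, dtype=torch.float32)
update = 1e-3  # illustrative size of a single AdamW step

print((w_bf16 + update) - w_bf16)  # tensor(0., dtype=torch.bfloat16): update lost
print((w_fp32 + update) - w_fp32)  # tensor(0.0010): update kept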

Why is the Memory Requirement for Momentum Parameters So High?

Both first and second moment parameters in AdamW are stored in FP32, which results in a significant memory footprint. The memory needed for these parameters scales linearly with the number of model parameters. In the case of LLaMA-3 8B, the optimizer’s memory requirements are substantial, occupying 64 GB just for these momentum terms.

How Can Memory Usage Be Optimized?

  • ZeRO optimization: DeepSpeed's ZeRO partitions the optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across data-parallel GPUs, sharply reducing the per-GPU memory footprint (a rough per-GPU estimate follows this list).
  • Mixed precision optimization: keeping weights, gradients, and activations in BF16 reduces memory usage while retaining enough precision for the forward and backward passes; the FP32 master weights and moment buffers preserve update accuracy.
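As a rough back-of-the-envelope sketch (ignoring activations, communication buffers, and framework overhead), the per-GPU cost of the model states under ZeRO Stage 2 can be estimated as follows; the function below is purely illustrative.

# Rough per-GPU estimate of model-state memory under ZeRO Stage 2:
# BF16 weights are replicated, while BF16 gradients and the FP32
# optimizer states (master weights + m + v) are partitioned across ranks.
N = 8e9  # parameter count

def per_gpu_gb(n_gpus):
    weights_bf16 = N * 2                 # replicated on every GPU
    grads_bf16   = N * 2 / n_gpus        # partitioned at Stage 2
    optim_fp32   = N * 4 * 3 / n_gpus    # master weights + m + v, partitioned
    return (weights_bf16 + grads_bf16 + optim_fp32) / 1e9

for g in (1, 4, 8):
    print(f"{g} GPU(s): ~{per_gpu_gb(g):.0f} GB per GPU (model states only)")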

5. Code Example: DeepSpeed Mixed Precision Training

Below is a minimal example of configuring DeepSpeed for BF16 mixed precision training of Meta-Llama-3-8B-Instruct with AdamW and ZeRO Stage 2:

import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DeepSpeed configuration
ds_config = {
    "bf16": {
        "enabled": True  # enable BF16 mixed precision training
    },
    "optimizer": {
        "type": "AdamW",  # AdamW optimizer; its states are kept in FP32 by DeepSpeed
        "params": {
            "lr": 1e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "zero_optimization": {
        "stage": 2,  # ZeRO Stage 2: partition optimizer states and gradients
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    "gradient_accumulation_steps": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0
}

# Initialize DeepSpeed (model_parameters is required so DeepSpeed can build
# the AdamW optimizer defined in the config)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Example input, moved to the engine's device
inputs = tokenizer("Hello, DeepSpeed!", return_tensors="pt")
inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}

# Forward pass, backward pass, and optimizer step
outputs = model_engine(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()
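In practice, DeepSpeed scripts are launched with the deepspeed launcher rather than plain python, for example: deepspeed --num_gpus=8 train.py (the script name is just a placeholder). With ZeRO Stage 2, the FP32 optimizer states (96 GB) and BF16 gradients (16 GB) from the table above are partitioned across the data-parallel ranks, while the 16 GB of BF16 weights and the activations remain on every GPU.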

6. Conclusion

  1. Mixed precision training reduces memory usage by storing weights and gradients in BF16, while using FP32 for the momentum terms and weight update copies to ensure numerical stability.
  2. For the Meta-Llama-3-8B-Instruct model, the total comes to about 132.8 GB under the assumptions above, with the FP32 optimizer states (the master weights and the two moment buffers) accounting for the largest share.
  3. By using techniques like ZeRO optimization, DeepSpeed helps further optimize memory usage, making training large models more feasible and efficient.

Postscript

Completed in Shanghai at 14:46 on December 1, 2024, with the assistance of the GPT-4o model.

