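Tsinghua University Open-Sources the CogVideoX-5B-I2V Model for Image-to-Video Generation

CogVideoX is an open-source video generation model originating from QingYing (清影). The table below lists the video generation models provided in this release.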


| Model Name | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
| --- | --- | --- | --- |
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | Image-to-video version of CogVideoX-5B. |
| Inference Precision | FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single GPU Memory Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* |
| Multi-GPU Inference Memory Usage | FP16: 10 GB using diffusers* | BF16: 15 GB using diffusers* | BF16: 15 GB using diffusers* |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 | BF16 |
| Fine-tuning Memory Usage | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) | 78 GB (bs=1, LoRA); 75 GB (bs=1, SFT, 16 GPUs) |
| Prompt Language | English* | English* | English* |
| Maximum Prompt Length | 226 tokens | 226 tokens | 226 tokens |
| Video Length | 6 seconds | 6 seconds | 6 seconds |
| Frame Rate | 8 frames per second | 8 frames per second | 8 frames per second |
| Video Resolution | 720 x 480, no support for other resolutions (including fine-tuning) | 720 x 480, no support for other resolutions (including fine-tuning) | 720 x 480, no support for other resolutions (including fine-tuning) |
| Position Embedding | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
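The video length and frame rate in the table can be related to the `num_frames=49` argument used in the inference code later in this post. The sketch below is only an arithmetic check; the helper name `expected_num_frames` is hypothetical, and the "+1" is an assumption that the clip contains one initial frame in addition to 6 seconds at 8 frames per second.

```python
def expected_num_frames(seconds: int, fps: int) -> int:
    """Frame count for a clip of `seconds` at `fps`, plus one initial frame (assumption)."""
    return seconds * fps + 1


# 6 seconds at 8 frames per second, as listed in the table
print(expected_num_frames(6, 8))  # 49
```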

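Data Explanation

When testing with the diffusers library, all optimizations included in the library were enabled. This scheme has not been tested for actual memory usage on devices other than the NVIDIA A100/H100. In general, it should work on all devices with the NVIDIA Ampere architecture or newer. If the optimizations are disabled, memory consumption multiplies, with peak memory usage about 3 times the values in the table; in return, speed improves by about 3-4 times. You can selectively disable some of the optimizations, including: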

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
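For multi-GPU inference, the enable_sequential_cpu_offload() optimization must be disabled.
Using an INT8 model noticeably reduces inference speed. This trade-off lets the model fit on GPUs with less memory while keeping the loss in video quality minimal.
The CogVideoX-2B model was trained in FP16 precision, while all CogVideoX-5B models were trained in BF16 precision. We recommend running inference in the precision the model was trained in.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This allows the model to run on a free T4 Colab or on GPUs with less VRAM. Note also that TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision must be used on NVIDIA H100 or newer devices and requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the memory optimizations described above. Without memory optimization, inference is about 10% faster. Only the diffusers version of the model supports quantization.
The model only supports English input. Prompts in other languages can be translated into English by a large language model during prompt refinement.
Fine-tuning memory usage was tested in an 8 * H100 environment, and the program automatically uses ZeRO-2 optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used for fine-tuning.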
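The notes above say that the three optimizations can be disabled selectively, and that sequential CPU offload must be off for multi-GPU inference. A minimal sketch of a toggle helper follows. The function name `apply_memory_optimizations` and its keyword arguments are hypothetical; the only pipeline methods it calls are the three listed in the notes.

```python
def apply_memory_optimizations(pipe, cpu_offload=True, vae_slicing=True, vae_tiling=True):
    """Enable only the selected optimizations on a diffusers-style pipeline.

    Returns the names of the optimizations that were enabled, in call order.
    """
    enabled = []
    if cpu_offload:
        # Lowest memory use, but slowest; must stay off for multi-GPU inference.
        pipe.enable_sequential_cpu_offload()
        enabled.append("sequential_cpu_offload")
    if vae_slicing:
        pipe.vae.enable_slicing()
        enabled.append("vae_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        enabled.append("vae_tiling")
    return enabled
```

For a multi-GPU setup, the call would be `apply_memory_optimizations(pipe, cpu_offload=False)`, which leaves the two VAE optimizations on.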

Reminders

Use SAT for inference and for fine-tuning the SAT version of the model. Visit our GitHub for more details.

Getting Started Quickly 🤗

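The model can be deployed with the Hugging Face diffusers library. Follow the steps below to get started.

We recommend visiting our GitHub to see the prompt optimization and conversion guidance for a better experience.

Install the required dependencies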

# diffusers>=0.30.3
# transformers>=4.44.2
# accelerate>=0.34.0
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
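After installing, it can be useful to confirm that the environment meets the minimum versions listed in the comments. The following is a simplified sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical, the comparison looks only at leading numeric components, and the transformers minimum is written in its 4.x numbering.

```python
import re
from importlib import metadata

# Minimum versions from the install comments (transformers uses 4.x version numbers).
MINIMUMS = {
    "diffusers": "0.30.3",
    "transformers": "4.44.2",
    "accelerate": "0.34.0",
    "imageio-ffmpeg": "0.5.1",
}


def version_tuple(version: str) -> tuple:
    """Leading numeric components of a version string, e.g. '0.31.0.dev0' -> (0, 31, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """True if the installed version is at least the required one (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(required)


def check_environment(minimums=MINIMUMS) -> dict:
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, required):
            problems[name] = f"{installed} < {required}"
    return problems


if __name__ == "__main__":
    print(check_environment() or "all minimum versions satisfied")
```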

Run the code

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
image = load_image(image="input.jpg")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
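Because the model only supports 720 x 480 output, an input image with a different aspect ratio may be stretched when it is resized. One way to avoid that, offered here as an optional preprocessing sketch rather than a step the model requires, is to center-crop the image to a 3:2 aspect ratio first. The helper `center_crop_box` is hypothetical and uses only integer arithmetic.

```python
def center_crop_box(width: int, height: int, target_w: int = 720, target_h: int = 480):
    """Largest centered (left, top, right, bottom) box with the target aspect ratio."""
    # Compare aspect ratios by cross-multiplication to stay in integer arithmetic.
    if width * target_h > height * target_w:
        # Source is wider than the target: keep the full height, trim the sides.
        crop_w = height * target_w // target_h
        crop_h = height
    else:
        # Source is taller than (or matches) the target: keep the full width.
        crop_w = width
        crop_h = width * target_h // target_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


print(center_crop_box(1920, 1080))  # (150, 0, 1770, 1080)
```

Since `load_image` returns a PIL image, the box can be applied with `image.crop(center_crop_box(*image.size)).resize((720, 480))` before the image is passed to the pipeline.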

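Quantized Inference

PytorchAO and Optimum-quanto can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This allows the model to run on a free T4 Colab or on GPUs with less VRAM. Note also that TorchAO quantization is fully compatible with torch.compile, which can greatly speed up inference.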

# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
# Source and nightly installation is only required until the next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
image = load_image(image="input.jpg")
video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)