Network architecture of the post-trained DeepSeek-VL2 model
flyfish
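The vision module extracts image features, the projector module maps them into a feature space compatible with the language module, and the language module combines visual and textual information for causal language modeling. The model is fine-tuned parameter-efficiently with LoRA through PEFT, and the MoE architecture in the language module improves computational efficiency and expressive capacity.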
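Overall architecture
This is a multimodal causal language model wrapped for parameter-efficient fine-tuning (PEFT), combining vision and language processing. It consists of a wrapper layer, a vision module, a projector module and a language module, which work together to process visual and text input.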
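Detailed analysis of each part
1. The PeftModelForCausalLM and LoraModel wrapper layers
- PeftModelForCausalLM: the PEFT wrapper class for causal language models. To reduce the compute and time spent on fine-tuning, PEFT methods train only a small set of parameters rather than all of them. The model can then adapt quickly to a specific task while the pretrained weights stay unchanged.
- LoraModel: LoRA (Low-Rank Adaptation) is one specific PEFT technique. LoraModel attaches a pair of low-rank matrices to each targeted linear layer, and only these matrices are updated during training, which greatly reduces the number of trainable parameters. In this model, the lora_A and lora_B matrices inside every lora.Linear layer are the low-rank matrices introduced by LoRA. The printout shows rank 8: for q_proj, for example, lora_A maps 2048 to 8 and lora_B maps 8 to 3072.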
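To make the low-rank update concrete, here is a minimal pure-Python sketch of what a LoRA-wrapped linear layer computes, together with the parameter arithmetic for q_proj. The shapes 2048, 3072 and rank 8 come from the printout; the scaling factor alpha is not shown there, so the value used below is purely illustrative, and the lora_dropout (p=0.05) applied to the input of the LoRA branch during training is left out.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r  # alpha is a hyperparameter; its value here is an assumption
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(in_features, out_features, r):
    """Return (frozen base weights, trainable LoRA weights) for one bias-free layer."""
    return in_features * out_features, r * in_features + out_features * r


# Toy layer: in_features=2, out_features=3, rank 1.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[0.5, -0.5]]
B_zero = [[0.0], [0.0], [0.0]]  # B starts at zero, so the update is zero at first
x = [1.0, 1.0]
print(lora_forward(x, W, A, B_zero, alpha=16, r=1))  # [3.0, 7.0, 11.0], same as W x

# q_proj from the printout: base 2048 -> 3072, lora_A 2048 -> 8, lora_B 8 -> 3072.
base, lora = lora_param_counts(2048, 3072, 8)
print(base, lora)  # 6291456 40960
```

For q_proj this means roughly 41 thousand trainable weights sit next to about 6.3 million frozen ones, which is well under 1% of the layer.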
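2. Vision module (vision)
This module is based on the Vision Transformer (ViT) architecture and processes the visual input. Its structure is as follows:
- patch_embed (PatchEmbed):
  - proj: a convolution layer, Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14)), which splits the 3-channel input image into non-overlapping 14x14 patches and maps each patch to a 1152-dimensional feature vector.
  - norm: Identity(), so no normalization is applied here.
- pos_drop: a Dropout layer with p = 0.0, so no elements are dropped during training.
- blocks: a sequence of 27 Block modules (indices 0 to 26 in the printout). Each Block contains multi-head self-attention (Attention) and a multilayer perceptron (Mlp), each preceded by a LayerNorm (norm1 and norm2).
- norm: a LayerNorm layer that normalizes the visual features.
- attn_pool: an attention pooling layer, AttentionPoolLatent, which pools the visual features into a latent representation.
- head: currently Identity(); a task-specific output layer could be attached here.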
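The numbers in the vision module can be checked with a little arithmetic. The sketch below computes how many patches the 14x14, stride-14 convolution produces and how many parameters one Block holds, using only the layer shapes from the printout. The 384x384 input size is an assumption for illustration, since the printout does not state the image resolution.

```python
def patch_grid(height, width, patch=14):
    """Grid produced by Conv2d(kernel_size=patch, stride=patch) with no padding."""
    rows = (height - patch) // patch + 1
    cols = (width - patch) // patch + 1
    return rows, cols, rows * cols


def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of one Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def vit_block_params(dim=1152, hidden=4304):
    """Parameters of one Block as printed: norm1, qkv, proj, norm2, fc1, fc2."""
    layer_norm = 2 * dim  # weight and bias, since elementwise_affine=True
    attn = linear_params(dim, 3 * dim) + linear_params(dim, dim)  # qkv: 1152 -> 3456
    mlp = linear_params(dim, hidden) + linear_params(hidden, dim)
    return 2 * layer_norm + attn + mlp


# Assumed 384x384 input: 27 x 27 = 729 patch tokens of 1152 dimensions each.
print(patch_grid(384, 384))      # (27, 27, 729)

# The printout lists Block indices 0 to 26, i.e. 27 blocks.
print(vit_block_params())        # 15239504
print(27 * vit_block_params())   # 411466608
```

The qkv layer has 3456 = 3 x 1152 output features because the query, key and value projections are fused into a single Linear layer.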
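3. Projector module (projector)
(projector): MlpProjector(
  (layers): Sequential(
    (0): Linear(in_features=4608, out_features=2048, bias=True)
    (1): GELU(approximate='none')
    (2): Linear(in_features=2048, out_features=2048, bias=True)
  )
)
- This is a multilayer perceptron (MLP) projector that maps 4608-dimensional visual features to 2048 dimensions, matching the embedding size of the language module. The vision module itself outputs 1152-dimensional features per patch, and 4608 = 4 x 1152, so the projector input is presumably the concatenation of four patch features (for example a 2x2 neighborhood); that merging step has no parameters and does not appear in the printout. A GELU activation between the two linear layers introduces non-linearity.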
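The printout only shows the two linear layers, but the input width 4608 equals 4 x 1152, four times the per-patch width of the vision module. One way to get there is to concatenate each 2x2 neighborhood of patch features before the first Linear layer. The sketch below illustrates that idea in pure Python; the function name merge_2x2, the zero padding for odd grids and the concatenation order are assumptions for illustration, not details taken from the printout.

```python
def merge_2x2(grid):
    """Concatenate each 2x2 neighbourhood of patch features into one vector.

    grid is a list of rows; each row is a list of feature vectors (lists).
    Odd-sized grids are zero-padded on the bottom and right (an assumption).
    """
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    zero = [0.0] * dim

    def at(r, c):
        return grid[r][c] if r < rows and c < cols else zero

    merged = []
    for r in range(0, rows, 2):
        merged.append([
            at(r, c) + at(r, c + 1) + at(r + 1, c) + at(r + 1, c + 1)
            for c in range(0, cols, 2)
        ])
    return merged


# Toy 3x3 grid of 2-dimensional features becomes a 2x2 grid of 8-dimensional features.
toy = [[[float(r), float(c)] for c in range(3)] for r in range(3)]
out = merge_2x2(toy)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 8

# With the printed sizes, four 1152-dim patch features give the 4608-dim projector input.
print(4 * 1152)  # 4608
```

Under this scheme the number of visual tokens passed to the language module shrinks by roughly a factor of four while each token becomes four times wider.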
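4. Language module (language)
This module is a DeepseekV2ForCausalLM, a model for causal language modeling. Its structure is as follows:
- embed_tokens: an embedding layer, Embedding(102400, 2048), which maps input token indices to 2048-dimensional embedding vectors.
- layers: a ModuleList of 27 DeepseekV2DecoderLayer modules. Each layer contains self-attention (DeepseekV2Attention) and a feed-forward module (DeepseekV2MLP or DeepseekV2MoE), and uses DeepseekV2RMSNorm for normalization.
  - DeepseekV2Attention: the self-attention module. Its projections (q_proj, kv_a_proj_with_mqa, kv_b_proj and o_proj) are lora.Linear layers fine-tuned with LoRA. rotary_emb is the rotary embedding layer that encodes position information.
  - DeepseekV2MLP or DeepseekV2MoE: the feed-forward module. Layer 0 uses a dense DeepseekV2MLP, while layers 1 to 26 use the mixture-of-experts (MoE) module DeepseekV2MoE. DeepseekV2MoE contains 64 expert networks (experts), a gating network (MoEGate) that decides which experts each input is routed to, and a shared_experts MLP. In the printout the 64 experts are wrapped with LoRA, while shared_experts consists of plain Linear layers.
- norm: a DeepseekV2RMSNorm layer that normalizes the final hidden states.
- lm_head: a linear layer, Linear(in_features=2048, out_features=102400, bias=False), which maps the hidden states to the vocabulary size (102400) to produce the scores used to predict the next token.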
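The routing idea behind DeepseekV2MoE can be sketched in a few lines. MoEGate() prints no internals, so everything below is a generic top-k routing illustration rather than the module's actual code. The softmax scoring, the value of top_k, the lack of weight renormalization and the toy experts are all assumptions, and the shared_experts branch is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k):
    """Return (expert index, weight) for the top_k highest-scoring experts."""
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:top_k]]


def moe_forward(x, gate_logits, experts, top_k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in route(gate_logits, top_k):
        y = experts[idx](x)  # only the chosen experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Four toy "experts" that simply scale their input by 1, 2, 3 and 4.
toy_experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.0, 2.0, 1.0, -1.0]  # hypothetical gate scores for one token

print([i for i, _ in route(logits, top_k=2)])  # [1, 2]
print(moe_forward([1.0, -1.0], logits, toy_experts, top_k=2))
```

Because only the selected experts run for a given token, the layer can hold 64 experts' worth of parameters while spending compute on just a few of them per token.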
model: PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): DeepseekVLV2ForCausalLM(
      (vision): VisionTransformer(
        (patch_embed): PatchEmbed(
          (proj): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14))
          (norm): Identity()
        )
        (pos_drop): Dropout(p=0.0, inplace=False)
        (patch_drop): Identity()
        (norm_pre): Identity()
        (blocks): Sequential(
          (0-26): 27 x Block(
            (norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1152, out_features=3456, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1152, out_features=1152, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1152, out_features=4304, bias=True)
              (act): GELU(approximate='tanh')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4304, out_features=1152, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
        )
        (norm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
        (attn_pool): AttentionPoolLatent(
          (q): Linear(in_features=1152, out_features=1152, bias=True)
          (kv): Linear(in_features=1152, out_features=2304, bias=True)
          (q_norm): Identity()
          (k_norm): Identity()
          (proj): Linear(in_features=1152, out_features=1152, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (norm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1152, out_features=4304, bias=True)
            (act): GELU(approximate='none')
            (drop1): Dropout(p=0.0, inplace=False)
            (norm): Identity()
            (fc2): Linear(in_features=4304, out_features=1152, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
        )
        (fc_norm): Identity()
        (head_drop): Dropout(p=0.0, inplace=False)
        (head): Identity()
      )
      (projector): MlpProjector(
        (layers): Sequential(
          (0): Linear(in_features=4608, out_features=2048, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=2048, out_features=2048, bias=True)
        )
      )
      (language): DeepseekV2ForCausalLM(
        (model): DeepseekV2Model(
          (embed_tokens): Embedding(102400, 2048)
          (layers): ModuleList(
            (0): DeepseekV2DecoderLayer(
              (self_attn): DeepseekV2Attention(
                (q_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=3072, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=3072, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (kv_a_proj_with_mqa): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=576, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=576, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (kv_a_layernorm): DeepseekV2RMSNorm()
                (kv_b_proj): lora.Linear(
                  (base_layer): Linear(in_features=512, out_features=4096, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=512, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=4096, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (o_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=2048, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (rotary_emb): DeepseekV2RotaryEmbedding()
              )
              (mlp): DeepseekV2MLP(
                (gate_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=10944, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=10944, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (up_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=10944, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=10944, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (down_proj): lora.Linear(
                  (base_layer): Linear(in_features=10944, out_features=2048, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=10944, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=2048, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (act_fn): SiLU()
              )
              (input_layernorm): DeepseekV2RMSNorm()
              (post_attention_layernorm): DeepseekV2RMSNorm()
            )
            (1-26): 26 x DeepseekV2DecoderLayer(
              (self_attn): DeepseekV2Attention(
                (q_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=3072, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=3072, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (kv_a_proj_with_mqa): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=576, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=576, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (kv_a_layernorm): DeepseekV2RMSNorm()
                (kv_b_proj): lora.Linear(
                  (base_layer): Linear(in_features=512, out_features=4096, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=512, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=4096, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (o_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                  (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                  (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                  (lora_B): ModuleDict((default): Linear(in_features=8, out_features=2048, bias=False))
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (rotary_emb): DeepseekV2RotaryEmbedding()
              )
              (mlp): DeepseekV2MoE(
                (experts): ModuleList(
                  (0-63): 64 x DeepseekV2MLP(
                    (gate_proj): lora.Linear(
                      (base_layer): Linear(in_features=2048, out_features=1408, bias=False)
                      (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                      (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                      (lora_B): ModuleDict((default): Linear(in_features=8, out_features=1408, bias=False))
                      (lora_embedding_A): ParameterDict()
                      (lora_embedding_B): ParameterDict()
                      (lora_magnitude_vector): ModuleDict()
                    )
                    (up_proj): lora.Linear(
                      (base_layer): Linear(in_features=2048, out_features=1408, bias=False)
                      (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                      (lora_A): ModuleDict((default): Linear(in_features=2048, out_features=8, bias=False))
                      (lora_B): ModuleDict((default): Linear(in_features=8, out_features=1408, bias=False))
                      (lora_embedding_A): ParameterDict()
                      (lora_embedding_B): ParameterDict()
                      (lora_magnitude_vector): ModuleDict()
                    )
                    (down_proj): lora.Linear(
                      (base_layer): Linear(in_features=1408, out_features=2048, bias=False)
                      (lora_dropout): ModuleDict((default): Dropout(p=0.05, inplace=False))
                      (lora_A): ModuleDict((default): Linear(in_features=1408, out_features=8, bias=False))
                      (lora_B): ModuleDict((default): Linear(in_features=8, out_features=2048, bias=False))
                      (lora_embedding_A): ParameterDict()
                      (lora_embedding_B): ParameterDict()
                      (lora_magnitude_vector): ModuleDict()
                    )
                    (act_fn): SiLU()
                  )
                )
                (gate): MoEGate()
                (shared_experts): DeepseekV2MLP(
                  (gate_proj): Linear(in_features=2048, out_features=2816, bias=False)
                  (up_proj): Linear(in_features=2048, out_features=2816, bias=False)
                  (down_proj): Linear(in_features=2816, out_features=2048, bias=False)
                  (act_fn): SiLU()
                )
              )
              (input_layernorm): DeepseekV2RMSNorm()
              (post_attention_layernorm): DeepseekV2RMSNorm()
            )
          )
          (norm): DeepseekV2RMSNorm()
        )
        (lm_head): Linear(in_features=2048, out_features=102400, bias=False)
      )
    )
  )
)
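The layer shapes in the printout are enough to estimate how many parameters LoRA actually trains. The sketch below adds up the lora_A and lora_B weights of every lora.Linear in the language module. It assumes that these adapter matrices are the only trainable parameters (no extra modules marked for training), which the printout alone cannot confirm.

```python
RANK = 8  # out_features of every lora_A in the printout


def lora_params(in_features, out_features, r=RANK):
    """Weights in lora_A (r x in) plus lora_B (out x r); neither has a bias."""
    return r * (in_features + out_features)


# (in_features, out_features) of each LoRA-wrapped base layer, copied from the printout.
attention = {
    "q_proj": (2048, 3072),
    "kv_a_proj_with_mqa": (2048, 576),
    "kv_b_proj": (512, 4096),
    "o_proj": (2048, 2048),
}
dense_mlp = {"gate_proj": (2048, 10944), "up_proj": (2048, 10944), "down_proj": (10944, 2048)}
expert_mlp = {"gate_proj": (2048, 1408), "up_proj": (2048, 1408), "down_proj": (1408, 2048)}


def total(shapes):
    return sum(lora_params(i, o) for i, o in shapes.values())


attn_total = 27 * total(attention)           # every decoder layer has the same attention
dense_total = total(dense_mlp)               # only layer 0 uses the dense MLP
expert_total = 26 * 64 * total(expert_mlp)   # layers 1-26, 64 routed experts each
grand_total = attn_total + dense_total + expert_total

print(attn_total, dense_total, expert_total)  # 3552768 311808 138018816
print(grand_total)                            # 141883392
```

Under these assumptions about 142 million parameters are trainable, and nearly all of them sit in the routed experts, because each of the 26 x 64 experts carries its own three adapter pairs.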