4 Implementing a GPT model from Scratch To Generate Text
4.4 Adding shortcut connections
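Next, let's discuss the concept behind shortcut connections, also known as skip connections or residual connections. They were originally proposed in residual networks (ResNet) to mitigate the vanishing gradient problem. The vanishing gradient problem refers to gradients (which guide weight updates during training) becoming progressively smaller as they propagate backward through the layers, which makes it difficult to train the earlier layers effectively, as shown in the figure below.
A comparison of a deep neural network with 5 layers, without shortcut connections on the left and with shortcut connections on the right. A shortcut connection adds a layer's input to its output, effectively creating an alternative path that bypasses certain layers.
How it works:
- It creates a shorter gradient path that skips intermediate layers.
- It is implemented by adding the output of one layer to the output of a later layer.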
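To make the gradient-path argument concrete, here is a minimal plain-Python sketch. It assumes a toy scalar network in which every layer has the same local derivative; the value 0.5 and the helper name `end_to_end_gradient` are illustrative assumptions, not part of the original code. Without a shortcut, the chain rule multiplies the local derivatives together. With a shortcut, each layer computes x + f(x), so its derivative is 1 + f'(x).

```python
# Toy illustration of why shortcut connections help gradients survive.
# Assumption: each layer is a scalar function whose local derivative is
# the same constant; real layers have input-dependent Jacobians.

def end_to_end_gradient(local_derivative, num_layers, use_shortcut):
    """Chain-rule product of per-layer derivatives for a scalar network."""
    grad = 1.0
    for _ in range(num_layers):
        if use_shortcut:
            # y = x + f(x)  ->  dy/dx = 1 + f'(x)
            grad *= 1.0 + local_derivative
        else:
            # y = f(x)      ->  dy/dx = f'(x)
            grad *= local_derivative
    return grad

plain = end_to_end_gradient(0.5, num_layers=5, use_shortcut=False)
residual = end_to_end_gradient(0.5, num_layers=5, use_shortcut=True)
print(f"without shortcut: {plain}")     # 0.5 ** 5 = 0.03125
print(f"with shortcut:    {residual}")  # 1.5 ** 5 = 7.59375
```

The added 1 is the identity path: even if f'(x) were close to zero, each factor would stay near 1 instead of shrinking the product.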
Adding shortcut connections in the forward method
```python
import torch
import torch.nn as nn


class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        # GELU is assumed to be the activation module defined earlier in this chapter
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x
```
This code implements a deep neural network with 5 layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and, if the `self.use_shortcut` attribute is set to `True`, add the shortcut connections shown in the figure above wherever a layer's input and output have the same shape.
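The shape check in `forward` matters for the layer sizes used later in this section. The following plain-Python sketch lists which layers would receive a shortcut. The helper `shortcut_layers` is a hypothetical name introduced only for illustration, and the sketch assumes, as in the class above, that layer `i` maps `layer_sizes[i]` features to `layer_sizes[i + 1]` features and leaves the batch dimension unchanged.

```python
def shortcut_layers(layer_sizes):
    """Return, per layer, whether input and output widths match.

    Mirrors the `x.shape == layer_output.shape` test: a Linear layer keeps
    the batch dimension, so shapes match exactly when in/out widths match.
    """
    return [
        layer_sizes[i] == layer_sizes[i + 1]
        for i in range(len(layer_sizes) - 1)
    ]

flags = shortcut_layers([3, 3, 3, 3, 3, 1])
for index, applies in enumerate(flags):
    print(f"layers.{index}: shortcut applied = {applies}")
```

With these sizes the first four layers add their input to their output, while the final 3-to-1 layer falls through to the `else` branch and simply passes on its output.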
Next, we implement a function that computes the gradients in the model's backward pass:
```python
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")
```
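The last line of `print_gradients` reduces each weight gradient to a single number with `param.grad.abs().mean()`. A minimal plain-Python sketch of that reduction is shown here. The matrix entries are made-up numbers, not gradients from the model, and `mean_abs` is a hypothetical helper used only for illustration.

```python
def mean_abs(matrix):
    """Average of the absolute values of all entries in a 2D list."""
    values = [abs(v) for row in matrix for v in row]
    return sum(values) / len(values)

# Made-up 3x3 "gradient" used only to show the reduction.
example_grad = [
    [0.5, -0.5, 0.25],
    [-0.25, 1.0, 0.0],
    [0.75, -1.0, 0.25],
]
print(mean_abs(example_grad))  # 4.5 / 9 = 0.5
```

Taking absolute values first keeps positive and negative entries from cancelling, so the result reflects gradient magnitude rather than direction.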
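In `print_gradients`, we define a loss function that measures how close the model output is to a target value (here 0), and calling `loss.backward()` automatically computes the gradient of the loss with respect to the parameters of each layer. With `model.named_parameters()` we can iterate over the weight parameters. For example, for a 3×3 weight matrix the gradient also has 3×3 values, and we print their mean absolute value to get a single gradient value per layer, which makes the layers easy to compare. The advantage of the `.backward()` method is that it computes the gradients automatically, without our having to implement the math by hand, which greatly simplifies training and working with deep neural networks. Next, we first print the gradients of the model without shortcut connections.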
```python
# Without shortcut connections
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)
```
Output:
```
layers.0.0.weight has gradient mean of 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041653171182
layers.3.0.weight has gradient mean of 0.001398873864673078
layers.4.0.weight has gradient mean of 0.005049646366387606
```
Next, we print the gradients of the model with shortcut connections:
```python
# With shortcut connections
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
```
Output:
```
layers.0.0.weight has gradient mean of 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694106817245483
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317
layers.4.0.weight has gradient mean of 1.3258541822433472
```
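As the outputs above show, shortcut connections prevent the gradients from vanishing in the early layers (such as `layers.0`, whose mean gradient rises from about 0.0002 to about 0.22), ensuring that gradients propagate effectively.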
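One way to quantify this from the printed values is to compare each run's last-layer gradient with its first-layer gradient. The sketch below copies the numbers from the two outputs above, rounded to six significant digits; the variable and function names are illustrative only.

```python
# Mean absolute weight gradients copied from the two runs above
# (layers.0 ... layers.4), rounded to six significant digits.
without_shortcut = [0.000201736, 0.000120112, 0.000715204, 0.00139887, 0.00504965]
with_shortcut = [0.221698, 0.206941, 0.328970, 0.266573, 1.32585]

def last_to_first_ratio(grads):
    """How many times larger the last layer's gradient is than the first's."""
    return grads[-1] / grads[0]

print(f"without shortcut: {last_to_first_ratio(without_shortcut):.1f}x")
print(f"with shortcut:    {last_to_first_ratio(with_shortcut):.1f}x")

# First-layer gradient gain from enabling the shortcut connections.
first_layer_gain = with_shortcut[0] / without_shortcut[0]
print(f"layers.0 gain:    {first_layer_gain:.0f}x")
```

In this run the gap between the last and first layers is roughly 25x without shortcuts and roughly 6x with them, and the first layer's gradient is about three orders of magnitude larger once shortcuts are enabled.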