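Deep Learning - A Step-by-Step Walkthrough of RNN Training

1. Data preparation

The character sequence "hello" is converted to one-hot vectors over the character set {'h', 'e', 'l', 'o'}, indexed in that order:

Input: ['h', 'e', 'l', 'l']
Output (target): ['e', 'l', 'l', 'o']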
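
As a minimal sketch of this encoding, the pure-Python snippet below builds the one-hot vectors for the input and target characters. The index order h, e, l, o and the helper name `one_hot` are assumptions chosen for illustration.

```python
# Assumed index order: h -> 0, e -> 1, l -> 2, o -> 3
chars = ['h', 'e', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(chars)}


def one_hot(ch):
    """Return a list with a single 1 at the character's index (hypothetical helper)."""
    vec = [0] * len(chars)
    vec[char_to_idx[ch]] = 1
    return vec


text = "hello"
inputs = [one_hot(ch) for ch in text[:-1]]   # 'h', 'e', 'l', 'l'
targets = [one_hot(ch) for ch in text[1:]]   # 'e', 'l', 'l', 'o'

for ch, vec in zip(text[:-1], inputs):
    print(ch, vec)
```

Each target is simply the input shifted by one position, which is what makes this an N-to-N (one prediction per input character) setup.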

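2. Parameter initialization

We use a single-layer RNN (N-to-N, i.e. one output per input step) with a hidden size of 2, fed one character per time step. The initial parameters are: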

$W_{xh} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \\ 0.7 & 0.8 \end{pmatrix}, \quad W_{hh} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix}, \quad W_{hy} = \begin{pmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{pmatrix}$

All bias terms are initialized to 0.

3. Forward and backward propagation

Time step 1 (input 'h'):

Input vector $x_1 = [1, 0, 0, 0]^T$, initial hidden state $h_0 = [0, 0]^T$.

$W_{xh}$ is stored as a $4 \times 2$ (input × hidden) matrix and $W_{hy}$ as a $2 \times 4$ (hidden × output) matrix, so both are transposed when applied to column vectors:

$h_1 = \tanh(W_{xh}^T x_1 + W_{hh} h_0) = \tanh \left( \begin{pmatrix} 0.1 & 0.3 & 0.5 & 0.7 \\ 0.2 & 0.4 & 0.6 & 0.8 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right) = \tanh \left( \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix} \right) = \begin{pmatrix} 0.0997 \\ 0.1974 \end{pmatrix}$

The output logits are written $o_1$ to keep them distinct from the target $y_1$:

$o_1 = W_{hy}^T h_1 = \begin{pmatrix} 0.1 & 0.5 \\ 0.2 & 0.6 \\ 0.3 & 0.7 \\ 0.4 & 0.8 \end{pmatrix} \begin{pmatrix} 0.0997 \\ 0.1974 \end{pmatrix} = \begin{pmatrix} 0.1087 \\ 0.1384 \\ 0.1681 \\ 0.1978 \end{pmatrix}$

Prediction: $\hat{y}_1 = \text{softmax}(o_1)$

The true output is 'e', whose one-hot encoding is $y_1 = [0, 1, 0, 0]$.
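
The forward pass for this step can be checked numerically. The sketch below is plain Python and assumes the weights are stored exactly as in section 2, so the input-to-hidden and hidden-to-output products use the transposed matrices; the helper names `matvec`, `transpose` and `softmax` are illustrative, not part of the original text.

```python
import math

# Initial weights from section 2 (W_xh: input x hidden, W_hy: hidden x output)
W_xh = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
W_hh = [[0.1, 0.2], [0.3, 0.4]]
W_hy = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]


def matvec(M, v):
    """Multiply matrix M (a list of rows) by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


x1 = [1, 0, 0, 0]   # one-hot 'h'
h0 = [0.0, 0.0]     # initial hidden state

# a_1 = W_xh^T x_1 + W_hh h_0, then h_1 = tanh(a_1)
a1 = [p + q for p, q in zip(matvec(transpose(W_xh), x1), matvec(W_hh, h0))]
h1 = [math.tanh(a) for a in a1]

# o_1 = W_hy^T h_1 (logits), y_hat_1 = softmax(o_1)
o1 = matvec(transpose(W_hy), h1)
y_hat1 = softmax(o1)

print([round(v, 4) for v in h1])
print([round(v, 4) for v in o1])
print([round(v, 4) for v in y_hat1])
```

Because the logits are close together, the resulting probabilities are all near 0.25: the untrained network is almost indifferent between the four characters.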

Cross-entropy loss:

$\text{loss}_1 = - \sum_{i} y_{1i} \log(\hat{y}_{1i})$
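
For a one-hot target the sum collapses to a single term: the negative log-probability assigned to the correct character. A minimal sketch follows; the probability vector `y_hat1` is an assumed, illustrative value (roughly what nearly uniform initial logits would give), not a number taken from the text.

```python
import math


def cross_entropy(y_true, y_hat):
    """loss = -sum_i y_i * log(y_hat_i); entries with y_i == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_hat) if t > 0)


y1 = [0, 1, 0, 0]                       # one-hot target 'e'
y_hat1 = [0.239, 0.246, 0.254, 0.261]   # assumed softmax output (illustrative)

loss1 = cross_entropy(y1, y_hat1)
uniform = cross_entropy(y1, [0.25] * 4)  # baseline: log(4) for a uniform guess

print(round(loss1, 4), round(uniform, 4))
```

A loss close to $\log 4 \approx 1.386$ therefore signals a model that has not yet learned anything about the sequence.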

Gradients:

$\frac{\partial \text{loss}_1}{\partial W_{hy}} = h_1 (\hat{y}_1 - y_1)^T$ (a $2 \times 4$ matrix, the same shape as $W_{hy}$)

$\frac{\partial \text{loss}_1}{\partial W_{xh}} = \frac{\partial \text{loss}_1}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{xh}}$

$\frac{\partial \text{loss}_1}{\partial W_{hh}} = \frac{\partial \text{loss}_1}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{hh}}$
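
The two chain-rule expressions above can be written out explicitly. The following is a sketch under two assumptions that the text does not state: the pre-activation is $a_1 = W_{xh}^T x_1 + W_{hh} h_0$ with output logits $W_{hy}^T h_1$ (so the weights keep the shapes given in section 2), and the gradient is truncated to the current time step instead of being propagated back through earlier steps. The symbols $a_1$ and $\delta_1$ are introduced here only for illustration.

```latex
\begin{aligned}
\frac{\partial \text{loss}_1}{\partial h_1} &= W_{hy} (\hat{y}_1 - y_1) \\
\delta_1 = \frac{\partial \text{loss}_1}{\partial a_1} &= \frac{\partial \text{loss}_1}{\partial h_1} \odot (1 - h_1 \odot h_1) \\
\frac{\partial \text{loss}_1}{\partial W_{xh}} &= x_1 \delta_1^T \\
\frac{\partial \text{loss}_1}{\partial W_{hh}} &= \delta_1 h_0^T
\end{aligned}
```

Two consequences follow directly: since $x_1$ is one-hot, only the row of $W_{xh}$ belonging to the current input character receives a nonzero gradient, and since $h_0 = 0$, the gradient with respect to $W_{hh}$ vanishes at the first time step.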

Parameter update (with learning rate $\eta$):

$W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_1}{\partial W_{xh}}$

$W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_1}{\partial W_{hh}}$

$W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_1}{\partial W_{hy}}$
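
To make the update concrete, the sketch below computes the step-1 gradients and applies one gradient-descent step in plain Python. It rests on assumptions not fixed by the text: a learning rate of `eta = 0.1`, gradients truncated to the current time step, and the weight shapes of section 2 (hence the transposed products in the forward pass). The helper `sgd` is hypothetical.

```python
import math

W_xh = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # 4 x 2
W_hh = [[0.1, 0.2], [0.3, 0.4]]                           # 2 x 2
W_hy = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]       # 2 x 4
eta = 0.1                                                 # assumed learning rate

x1, h0, y1 = [1, 0, 0, 0], [0.0, 0.0], [0, 1, 0, 0]

# Forward pass: a = W_xh^T x + W_hh h_prev, h = tanh(a), o = W_hy^T h
a1 = [sum(W_xh[i][j] * x1[i] for i in range(4)) +
      sum(W_hh[j][k] * h0[k] for k in range(2)) for j in range(2)]
h1 = [math.tanh(a) for a in a1]
o1 = [sum(W_hy[j][c] * h1[j] for j in range(2)) for c in range(4)]
exps = [math.exp(o) for o in o1]
y_hat1 = [e / sum(exps) for e in exps]

# Backward pass, truncated to this time step
d_o = [p - t for p, t in zip(y_hat1, y1)]                         # y_hat - y
dW_hy = [[h1[j] * d_o[c] for c in range(4)] for j in range(2)]    # h (y_hat - y)^T
d_h = [sum(W_hy[j][c] * d_o[c] for c in range(4)) for j in range(2)]
delta = [d_h[j] * (1 - h1[j] ** 2) for j in range(2)]             # back through tanh
dW_xh = [[x1[i] * delta[j] for j in range(2)] for i in range(4)]  # x delta^T
dW_hh = [[delta[j] * h0[k] for k in range(2)] for j in range(2)]  # delta h_prev^T


def sgd(W, dW):
    """Return W - eta * dW element-wise."""
    return [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W, dW)]


W_xh_new, W_hh_new, W_hy_new = sgd(W_xh, dW_xh), sgd(W_hh, dW_hh), sgd(W_hy, dW_hy)
print(W_xh_new)
print(W_hy_new)
```

In this first step only the 'h' row of `W_xh` and the entries of `W_hy` move; `W_hh` is untouched because the previous hidden state is zero.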

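Time step 2 (input 'e'):

This step uses the updated $W_{xh}$, $W_{hh}$ and $W_{hy}$. Since no value of $\eta$ has been fixed, the numbers below reuse the initial weights to keep the arithmetic easy to follow, with $h_1 = \tanh([0.1, 0.2]^T) = [0.0997, 0.1974]^T$ from step 1.

Input vector $x_2 = [0, 1, 0, 0]^T$

$h_2 = \tanh(W_{xh}^T x_2 + W_{hh} h_1) = \tanh \left( \begin{pmatrix} 0.1 & 0.3 & 0.5 & 0.7 \\ 0.2 & 0.4 & 0.6 & 0.8 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \begin{pmatrix} 0.0997 \\ 0.1974 \end{pmatrix} \right)$

Evaluating the products gives:

$h_2 = \tanh \left( \begin{pmatrix} 0.3 \\ 0.4 \end{pmatrix} + \begin{pmatrix} 0.0494 \\ 0.1089 \end{pmatrix} \right) = \tanh \left( \begin{pmatrix} 0.3494 \\ 0.5089 \end{pmatrix} \right) = \begin{pmatrix} 0.3359 \\ 0.4690 \end{pmatrix}$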
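
The recurrence is what distinguishes this step from the first one: the previous hidden state now contributes through $W_{hh}$. A small sketch, assuming the initial weights of section 2 are still in effect (i.e. ignoring the update from step 1) and computing the input term as $W_{xh}^T x$; the function name `step` is illustrative.

```python
import math

W_xh = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # 4 x 2
W_hh = [[0.1, 0.2], [0.3, 0.4]]                           # 2 x 2


def step(x, h_prev):
    """One recurrence step: h = tanh(W_xh^T x + W_hh h_prev)."""
    a = [sum(W_xh[i][j] * x[i] for i in range(len(x))) +
         sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
         for j in range(len(h_prev))]
    return [math.tanh(v) for v in a]


h0 = [0.0, 0.0]
h1 = step([1, 0, 0, 0], h0)   # input 'h'
h2 = step([0, 1, 0, 0], h1)   # input 'e', carries h1 forward

print([round(v, 4) for v in h1])
print([round(v, 4) for v in h2])
```

Feeding the same character with a different `h_prev` yields a different hidden state, which is how the network can later tell the two 'l' inputs apart.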

Output logits: $o_2 = W_{hy}^T h_2$

Prediction: $\hat{y}_2 = \text{softmax}(o_2)$

The true output is 'l', whose one-hot encoding is $y_2 = [0, 0, 1, 0]$.

Cross-entropy loss:

$\text{loss}_2 = - \sum_{i} y_{2i} \log(\hat{y}_{2i})$

Gradients:

$\frac{\partial \text{loss}_2}{\partial W_{hy}} = h_2 (\hat{y}_2 - y_2)^T$

$\frac{\partial \text{loss}_2}{\partial W_{xh}} = \frac{\partial \text{loss}_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_{xh}}$

$\frac{\partial \text{loss}_2}{\partial W_{hh}} = \frac{\partial \text{loss}_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_{hh}}$

Parameter update:

$W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_2}{\partial W_{xh}}$

$W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_2}{\partial W_{hh}}$

$W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_2}{\partial W_{hy}}$

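Time step 3 (input 'l'):

This step uses the updated $W_{xh}$, $W_{hh}$ and $W_{hy}$.

Input vector $x_3 = [0, 0, 1, 0]^T$

$h_3 = \tanh(W_{xh}^T x_3 + W_{hh} h_2)$

With the initial weights, $W_{xh}^T x_3$ is the third row of $W_{xh}$:

$h_3 = \tanh \left( \begin{pmatrix} 0.5 \\ 0.6 \end{pmatrix} + W_{hh} h_2 \right)$

Output logits: $o_3 = W_{hy}^T h_3$

Prediction: $\hat{y}_3 = \text{softmax}(o_3)$

The true output is 'l', whose one-hot encoding is $y_3 = [0, 0, 1, 0]$.

Cross-entropy loss: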

$\text{loss}_3 = - \sum_{i} y_{3i} \log(\hat{y}_{3i})$

Gradients:

$\frac{\partial \text{loss}_3}{\partial W_{hy}} = h_3 (\hat{y}_3 - y_3)^T$

$\frac{\partial \text{loss}_3}{\partial W_{xh}} = \frac{\partial \text{loss}_3}{\partial h_3} \cdot \frac{\partial h_3}{\partial W_{xh}}$

$\frac{\partial \text{loss}_3}{\partial W_{hh}} = \frac{\partial \text{loss}_3}{\partial h_3} \cdot \frac{\partial h_3}{\partial W_{hh}}$

Parameter update:

$W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_3}{\partial W_{xh}}$

$W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_3}{\partial W_{hh}}$

$W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_3}{\partial W_{hy}}$

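Time step 4 (input 'l'):

This step uses the updated $W_{xh}$, $W_{hh}$ and $W_{hy}$.

Input vector $x_4 = [0, 0, 1, 0]^T$

$h_4 = \tanh(W_{xh}^T x_4 + W_{hh} h_3)$

With the initial weights, $W_{xh}^T x_4$ is again the third row of $W_{xh}$:

$h_4 = \tanh \left( \begin{pmatrix} 0.5 \\ 0.6 \end{pmatrix} + W_{hh} h_3 \right)$

Output logits: $o_4 = W_{hy}^T h_4$

Prediction: $\hat{y}_4 = \text{softmax}(o_4)$

The true output is 'o', whose one-hot encoding is $y_4 = [0, 0, 0, 1]$.

Cross-entropy loss: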

$\text{loss}_4 = - \sum_{i} y_{4i} \log(\hat{y}_{4i})$

Gradients:

$\frac{\partial \text{loss}_4}{\partial W_{hy}} = h_4 (\hat{y}_4 - y_4)^T$

$\frac{\partial \text{loss}_4}{\partial W_{xh}} = \frac{\partial \text{loss}_4}{\partial h_4} \cdot \frac{\partial h_4}{\partial W_{xh}}$

$\frac{\partial \text{loss}_4}{\partial W_{hh}} = \frac{\partial \text{loss}_4}{\partial h_4} \cdot \frac{\partial h_4}{\partial W_{hh}}$

Parameter update:

$W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_4}{\partial W_{xh}}$

$W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_4}{\partial W_{hh}}$

$W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_4}{\partial W_{hy}}$
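
The four time steps can be strung together into a small training loop. The sketch below is plain Python and mirrors the walkthrough in updating the weights after every time step; it is not a full backpropagation-through-time implementation, because each gradient is truncated to the current step. The learning rate `eta = 0.1`, the 200 passes over the sequence, and the helper names `one_hot` and `run_pass` are all assumptions made for illustration.

```python
import math

chars = ['h', 'e', 'l', 'o']
W_xh = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # 4 x 2
W_hh = [[0.1, 0.2], [0.3, 0.4]]                           # 2 x 2
W_hy = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]       # 2 x 4
eta = 0.1                                                 # assumed learning rate


def one_hot(ch):
    return [1.0 if c == ch else 0.0 for c in chars]


def run_pass(update):
    """Process 'hell' -> 'ello' once; return the summed loss of the four steps."""
    h_prev, total = [0.0, 0.0], 0.0
    for ch_in, ch_out in zip("hell", "ello"):
        x, y = one_hot(ch_in), one_hot(ch_out)
        # Forward: h = tanh(W_xh^T x + W_hh h_prev), o = W_hy^T h
        h = [math.tanh(sum(W_xh[i][j] * x[i] for i in range(4)) +
                       sum(W_hh[j][k] * h_prev[k] for k in range(2)))
             for j in range(2)]
        o = [sum(W_hy[j][c] * h[j] for j in range(2)) for c in range(4)]
        m = max(o)
        exps = [math.exp(v - m) for v in o]
        y_hat = [e / sum(exps) for e in exps]
        total += -math.log(y_hat[chars.index(ch_out)])
        if update:
            # Backward, truncated to the current step (computed before any update)
            d_o = [p - t for p, t in zip(y_hat, y)]
            d_h = [sum(W_hy[j][c] * d_o[c] for c in range(4)) for j in range(2)]
            delta = [d_h[j] * (1 - h[j] ** 2) for j in range(2)]
            for j in range(2):
                for c in range(4):
                    W_hy[j][c] -= eta * h[j] * d_o[c]
                for i in range(4):
                    W_xh[i][j] -= eta * x[i] * delta[j]
                for k in range(2):
                    W_hh[j][k] -= eta * delta[j] * h_prev[k]
        h_prev = h
    return total


loss_before = run_pass(update=False)
for _ in range(200):
    run_pass(update=True)
loss_after = run_pass(update=False)
print(f"summed loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Comparing the summed loss before and after training gives a quick sanity check that the per-step updates move the weights in a useful direction.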

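4. Code implementation

Below is example code that implements a simple RNN (recurrent neural network) in PyTorch. It takes a character sequence as input and predicts the next character, using a small character set for demonstration.

Installing PyTorch

Make sure PyTorch is installed before you start. You can install it with:

pip install torch

RNN implementation example

We implement a character-level RNN that predicts the next character in the sequence "hello". The character set is {'h', 'e', 'l', 'o'}.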

import torch
import torch.nn as nn
import torch.optim as optim

# Character set and the char <-> index mappings
chars = ['h', 'e', 'l', 'o']
char_to_idx = {ch: idx for idx, ch in enumerate(chars)}
idx_to_char = {idx: ch for idx, ch in enumerate(chars)}

# Hyperparameters
input_size = len(chars)
hidden_size = 10
output_size = len(chars)
num_layers = 1
learning_rate = 0.01
num_epochs = 100


# Data preparation
def char_to_tensor(char):
    """Return the one-hot vector for a single character."""
    tensor = torch.zeros(input_size)
    tensor[char_to_idx[char]] = 1.0
    return tensor


def string_to_tensor(string):
    """Return a (len(string), input_size) matrix with one one-hot row per character."""
    tensor = torch.zeros(len(string), input_size)
    for idx, char in enumerate(string):
        tensor[idx][char_to_idx[char]] = 1.0
    return tensor


input_seq = "hell"
target_seq = "ello"
input_tensor = string_to_tensor(input_seq)
target_tensor = torch.tensor([char_to_idx[ch] for ch in target_seq])


# RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        # Apply the output layer at every time step: (batch, seq_len, output_size)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self):
        return torch.zeros(num_layers, 1, self.hidden_size)


# Model, loss function and optimizer
model = RNN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training
for epoch in range(num_epochs):
    hidden = model.init_hidden()
    optimizer.zero_grad()
    inputs = input_tensor.unsqueeze(0)  # add the batch dimension: (1, 4, 4)
    output, hidden = model(inputs, hidden)
    # CrossEntropyLoss expects (N, C) logits and (N,) class indices
    loss = criterion(output.squeeze(0), target_tensor)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')


# Testing
def predict(model, char, hidden=None):
    if hidden is None:
        hidden = model.init_hidden()
    x = char_to_tensor(char).unsqueeze(0).unsqueeze(0)  # (1, 1, input_size)
    with torch.no_grad():
        output, hidden = model(x, hidden)
    predicted_idx = output[0, -1].argmax().item()
    return idx_to_char[predicted_idx], hidden


hidden = model.init_hidden()
input_char = 'h'
predicted_seq = input_char
for _ in range(len(input_seq)):
    next_char, hidden = predict(model, input_char, hidden)
    predicted_seq += next_char
    input_char = next_char
print(f'Predicted sequence: {predicted_seq}')

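Code walkthrough

Data preparation:
- We define a small character set {'h', 'e', 'l', 'o'} and build the character-to-index and index-to-character mappings.
- char_to_tensor converts a single character into a one-hot vector.
- string_to_tensor converts a string into a sequence of one-hot vectors, one row per character.
Model definition:
- The RNN class inherits from nn.Module and contains one RNN layer followed by a fully connected layer.
- The forward method runs the forward pass.
- The init_hidden method returns an all-zero initial hidden state.
Training:
- We use the cross-entropy loss and the Adam optimizer.
- In each epoch we run the forward pass, compute the loss, backpropagate, and update the parameters.
Testing:
- predict generates the next character for a given input character.
- Starting from 'h', the trained model generates a character sequence one character at a time.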