Deep Learning Notes on Recurrent Neural Networks: LSTM from the Backpropagation Perspective
- Introduction
- Review and supplement: backpropagation through time
- The backpropagation process of $\text{LSTM}$
- Setting up the scenario
- Example: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}$
- Backpropagation: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$ at time $\mathcal T$
- Backpropagation: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$ at time $\mathcal T - 1$
- Comparing the gradients with respect to $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ at times $\mathcal T - 2$ and $\mathcal T - 1$
- Why $\text{LSTM}$ can suppress vanishing gradients
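Introduction
The previous section described the vanishing-gradient problem that arises when backpropagating through a recurrent neural network, and used it to motivate the long short-term memory network $(\text{Long-Short Term Memory, LSTM})$. This section looks at $\text{LSTM}$ from the backpropagation perspective to see why it can suppress vanishing gradients.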
Review and supplement: backpropagation through time
Recall from the previous section how $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}}$ is backpropagated in an $\text{RNN}$:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(1)}
$$
If we only want the gradient of the loss $\mathcal L^{(\mathcal T)}$ at time $\mathcal T$ with respect to the weight component $\mathcal W_{x^{(1)} \Rightarrow h^{(1)}}$ (the red path), the formula above is enough. In practice, however, the layer is applied recurrently, so what we need is the gradient of $\mathcal L^{(\mathcal T)}$ with respect to the whole weight $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$.
This means that at every time step $t\ (t=1,2,\cdots,\mathcal T)$, the gradient of $\mathcal W_{x^{(t)}\Rightarrow h^{(t)}}$ at that step is accumulated into the gradient of $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$. The gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}$ of $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{\mathcal X \Rightarrow \mathcal H}$ therefore takes the following form:
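Note: replace $1$ with $t$ in the formula above. The braces in the last line of the formula below are not a matrix; they are only a convenient way to write out the sum of $\mathcal T$ terms.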
$$
\begin{aligned}
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}
& = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow h^{(\mathcal T)}}} + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-1)} \Rightarrow h^{(\mathcal T-1)}}} + \cdots + \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(1)} \Rightarrow h^{(1)}}} \\
& = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow h^{(t)}}} \\
& = \sum_{t=1}^{\mathcal T} \left[(\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot \left\{\prod_{k=t}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right]\right\} \cdot \left\{\prod_{k=t+1}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}}\right\} \cdot x^{(t)}\right] \\
& = (\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}} \cdot
\begin{Bmatrix}
\text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(\mathcal T)})\right] \cdot x^{(\mathcal T)} \\
+\prod_{k=\mathcal T-1}^{\mathcal T} \text{Diag} \left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow h^{(\mathcal T)}} \cdot x^{(\mathcal T - 1)} \\
\vdots \\
+\prod_{k=1}^{\mathcal T} \text{Diag}\left[1 - \text{Tanh}^2(\mathcal Z_1^{(k)})\right] \cdot \prod_{k=2}^{\mathcal T} \mathcal W_{h^{(k-1)} \Rightarrow h^{(k)}} \cdot x^{(1)}
\end{Bmatrix}
\end{aligned}
$$
Clearly, the braces above contain a sum of $\mathcal T$ terms. Inspecting them shows that the further down a term sits (that is, the earlier its time step $t$), the more severely its gradient vanishes. In other words, the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal H}}$ is dominated by the first few steps of backpropagation, which are the time steps closest to $\mathcal T$. This backpropagation algorithm, whose computational cost is $\mathcal O(\mathcal T)$, is called backpropagation through time $(\text{Back-Propagation Through Time, BPTT})$.
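The decay of the later summands can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions that are not in the text: a scalar RNN, a constant input $x^{(t)} = 1$, the loss taken to be $h^{(\mathcal T)}$ itself (so the leading factor $(\mathcal O^{(\mathcal T)} - y^{(\mathcal T)}) \cdot \mathcal W_{h^{(\mathcal T)} \Rightarrow \mathcal O^{(\mathcal T)}}$ equals 1), and hypothetical names `rnn_forward` and `per_step_terms`.

```python
import math

def rnn_forward(w_x, w_h, xs, h0=0.0):
    """Scalar tanh RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}). Returns [h_1, ..., h_T]."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        hs.append(h)
    return hs

def per_step_terms(w_x, w_h, xs):
    """Contribution of each time step t to d h_T / d w_x.

    term_t = prod_{k=t}^{T} (1 - tanh^2(z_k)) * w_h^(T - t) * x_t,
    the scalar counterpart of one summand inside the braces.
    """
    hs = rnn_forward(w_x, w_h, xs)
    T = len(xs)
    terms = []
    for t in range(T):                      # t is 0-based here
        prod = 1.0
        for k in range(t, T):
            prod *= 1.0 - hs[k] ** 2        # tanh'(z_k) = 1 - h_k^2
        terms.append(prod * w_h ** (T - 1 - t) * xs[t])
    return terms

# Hypothetical weights and an 8-step constant input sequence.
terms = per_step_terms(0.4, 0.5, [1.0] * 8)
for t, term in enumerate(terms, start=1):
    print(f"t={t}: contribution {term:.3e}")
```

With these values every extra step back multiplies the contribution by $(1 - \text{Tanh}^2) \cdot w_h < 1$, so the printed contributions shrink as $t$ moves away from $\mathcal T$.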
The backpropagation process of $\text{LSTM}$
Setting up the scenario
The forward computation of $\text{LSTM}$ at time $t$ is:
Here $y^{(t)}$ denotes the final output at the current time step, which then feeds the loss $\mathcal L^{(t)}$ of that time step.
$$
\begin{aligned}
& f^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)}\Rightarrow f^{(t)}} \cdot h^{(t-1)} + b_f\right] \\
& i^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \cdot h^{(t-1)} + b_i\right] \\
& {\mathcal O}^{(t)} = \sigma \left[\mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \cdot h^{(t-1)} + b_{\mathcal O}\right] \\
& \widetilde{\mathcal C}^{(t)} = \text{Tanh} \left[\mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot x^{(t)} + \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \cdot h^{(t-1)} + b_{\widetilde{\mathcal C}}\right] \\
& \mathcal C^{(t)} = \mathcal C^{(t-1)} * f^{(t)} + \widetilde{\mathcal C}^{(t)} * i^{(t)} \\
& h^{(t)} = {\mathcal O}^{(t)} * \text{Tanh}(\mathcal C^{(t)}) \\
& y^{(t)} = \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \cdot h^{(t)} + b_y \quad \mathcal W_{h^{(t)} \Rightarrow y^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal Y}
\end{aligned}
$$
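The forward equations above translate almost line by line into code. The following is a minimal scalar sketch, not a reference implementation: every weight is a single float, the dictionary keys such as `w_xf`, `w_hf` and `b_f` are hypothetical names chosen for this example, and `c_tilde` stands for $\widetilde{\mathcal C}^{(t)}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps (hypothetical) parameter names to floats."""
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])          # forget gate
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])          # input gate
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["b_o"])          # output gate
    c_tilde = math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])  # candidate state
    c = c_prev * f + c_tilde * i                                         # cell state
    h = o * math.tanh(c)                                                 # hidden state
    y = p["w_hy"] * h + p["b_y"]                                         # output
    return h, c, y

# With all parameters set to zero, every gate equals sigmoid(0) = 0.5 and the
# candidate state is tanh(0) = 0, so the cell state is simply halved.
zero = {k: 0.0 for k in ("w_xf", "w_hf", "b_f", "w_xi", "w_hi", "b_i",
                         "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c",
                         "w_hy", "b_y")}
h, c, y = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero)
print(h, c, y)
```

The same parameter dictionary is reused at every time step, which is what makes the per-step gradients accumulate into shared weights during backpropagation.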
As with $\text{BPTT}$ above, weight parameters in these formulas such as $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ and $\mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}$ are the weights that a given time step $t$ applies, inside each gate, to the input $x^{(t)}$ or to the previous hidden state $h^{(t-1)}$. During backpropagation, the gradient from every time step is accumulated into the corresponding weight parameter. We denote these shared weights as follows:
$$
\begin{aligned}
& \mathcal W_{\mathcal X \Rightarrow}: \begin{cases} \mathcal W_{x^{(t)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal F} \quad \mathcal W_{x^{(t)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal I} \\ \mathcal W_{x^{(t)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \mathcal O} \quad \mathcal W_{x^{(t)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal X \Rightarrow \widetilde{\mathcal C}} \end{cases} \\
& \mathcal W_{\mathcal H \Rightarrow}: \begin{cases} \mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal F} \quad \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal I} \\ \mathcal W_{h^{(t-1)} \Rightarrow {\mathcal O}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \mathcal O} \quad \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}} \Rightarrow \mathcal W_{\mathcal H \Rightarrow \widetilde{\mathcal C}} \end{cases}
\end{aligned}
$$
Example: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}$
Suppose the sequence has length $\mathcal T$ and the loss produced at time $\mathcal T$ is $\mathcal L^{(\mathcal T)}$. We want the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}}$ of $\mathcal L^{(\mathcal T)}$ with respect to the weight matrix $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$:
By the same logic as for the recurrent neural network, it is the sum of the gradients over all time steps.
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{\mathcal X \Rightarrow \mathcal F}} = \sum_{t=1}^{\mathcal T} \frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(t)} \Rightarrow f^{(t)}}}
$$
Backpropagation: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$ at time $\mathcal T$
We first look at the gradient at the last time step, $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$:
- Its gradient propagation path can be written as:
$$
\begin{cases}
\widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \underbrace{\mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f}_{\text{irrelevant to this gradient}} \\
f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\
\mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \underbrace{\widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)}}_{\text{irrelevant to this gradient}} \\
m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y
\end{cases}
$$
- In the corresponding figure, this propagation path is the one marked by the red arrows.
The gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}}$ can therefore be written as:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}} = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T)}\Rightarrow f^{(\mathcal T)}}}
$$
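The seven-factor product for time $\mathcal T$ can be checked numerically against a finite difference. This is a sketch under assumptions that are not in the text: all quantities are scalars, the loss is taken to be the squared error $\frac{1}{2}(y^{(\mathcal T)} - \text{target})^2$, and $i^{(\mathcal T)}$, $\mathcal O^{(\mathcal T)}$, $\widetilde{\mathcal C}^{(\mathcal T)}$ are treated as fixed constants because they do not depend on $\mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}}$. All numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar values for the last time step.
X, H_PREV, C_PREV = 0.8, 0.3, 0.7
I_GATE, O_GATE, C_TILDE = 0.6, 0.7, 0.2      # independent of w_xf, held constant
W_HF, B_F, W_HY, B_Y, TARGET = -0.4, 0.1, 1.5, 0.0, 1.0

def forward(w_xf):
    """Forward pass along the path f~ -> f -> C -> m -> h -> y -> loss."""
    f = sigmoid(w_xf * X + W_HF * H_PREV + B_F)
    c = C_PREV * f + C_TILDE * I_GATE
    m = math.tanh(c)
    h = O_GATE * m
    y = W_HY * h + B_Y
    loss = 0.5 * (y - TARGET) ** 2           # assumed loss, not from the text
    return {"f": f, "m": m, "y": y, "loss": loss}

def chain_rule_gradient(w_xf):
    """Product of the seven partial derivatives, in the order of the formula."""
    s = forward(w_xf)
    factors = [
        s["y"] - TARGET,          # dL/dy
        W_HY,                     # dy/dh
        O_GATE,                   # dh/dm
        1.0 - s["m"] ** 2,        # dm/dC
        C_PREV,                   # dC/df
        s["f"] * (1.0 - s["f"]),  # df/df~
        X,                        # df~/dW
    ]
    return math.prod(factors)

w, eps = 0.5, 1e-6
numeric = (forward(w + eps)["loss"] - forward(w - eps)["loss"]) / (2 * eps)
print(chain_rule_gradient(w), numeric)
```

The two printed numbers should agree to many decimal places, which confirms that at time $\mathcal T$ there is exactly one path and its gradient is a single product.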
Backpropagation: computing the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$ at time $\mathcal T - 1$
We now have the gradient with respect to $\mathcal W_{\mathcal X \Rightarrow \mathcal F}$ at the time step closest to the loss, time $\mathcal T$. What about time $\mathcal T-1$, and how does it differ from time $\mathcal T$? We next write out the gradient $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$ of the loss $\mathcal L^{(\mathcal T)}$ with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ at time $\mathcal T-1$:
The gradient with respect to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$ must pass through time $\mathcal T$, and its backpropagation consists of several kinds of paths:
- Through the output gate: backpropagate from $h^{(\mathcal T)}$ directly to $h^{(\mathcal T - 1)}$, and then from $h^{(\mathcal T - 1)}$ to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. The propagation path can be written as:
The part elided by the dots is the same as at time $\mathcal T$, with $\mathcal T$ replaced by $\mathcal T-1$; the same applies to the dots below.
$$
\begin{cases}
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
{\mathcal O}^{(\mathcal T)} = \sigma \left[\underbrace{\mathcal W_{x^{(\mathcal T)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + b_{\mathcal O}}_{\text{independent of } h^{(\mathcal T-1)}} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow {\mathcal O}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)}\right] \\
h^{(\mathcal T - 1)} = {\mathcal O}^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
\quad\vdots
\end{cases}
$$
The gradient along this path is therefore:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]
$$
- Through the cell state: backpropagate from $\mathcal C^{(\mathcal T)}$ to $\mathcal C^{(\mathcal T -1)}$, then from $\mathcal C^{(\mathcal T - 1)}$ to $f^{(\mathcal T-1)}$, and finally to $\mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}$. The propagation path can be written as:
$$
\begin{cases}
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
h^{(\mathcal T)} = {\mathcal O}^{(\mathcal T)} * m^{(\mathcal T)} \\
m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
\mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
\mathcal C^{(\mathcal T-1)} = \mathcal C^{(\mathcal T -2)} * f^{(\mathcal T-1)} + \widetilde{\mathcal C}^{(\mathcal T-1)} * i^{(\mathcal T-1)} \\
f^{(\mathcal T-1)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T-1)}) \\
\widetilde{f}^{(\mathcal T-1)} = \mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T-1)}} \cdot x^{(\mathcal T-1)} + \underbrace{\mathcal W_{h^{(\mathcal T -2)} \Rightarrow f^{(\mathcal T-1)}}\cdot h^{(\mathcal T -2)} + b_f}_{\text{irrelevant to this gradient}}
\end{cases}
$$
The gradient along this path is therefore:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{ \partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}\right]
$$
- Through the forget gate: starting from the cell state $\mathcal C^{(\mathcal T)}$, backpropagate through the forget gate to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. The propagation path can be written as:
$$
\begin{cases}
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
\mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
f^{(\mathcal T)} = \text{Sigmoid}(\widetilde{f}^{(\mathcal T)}) \\
\widetilde{f}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)} \Rightarrow f^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T -1)} \Rightarrow f^{(\mathcal T)}}\cdot h^{(\mathcal T -1)} + b_f \\
h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
\quad\vdots
\end{cases}
$$
The gradient along this path is therefore:
The gradient path from $h^{(\mathcal T - 1)}$ to $\mathcal W_{x^{(\mathcal T-1)} \Rightarrow f^{(\mathcal T - 1)}}$ is fixed: see case $1$ (its last $5$ factors). It is written with dots here, and likewise below.
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots\right]
$$
- Through the input gate: starting from the cell state $\mathcal C^{(\mathcal T)}$, backpropagate through the input gate to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. The propagation path can be written as:
New symbol: $\widetilde{i}^{(\mathcal T)}$ denotes the linear (pre-activation) computation of the input gate.
$$
\begin{cases}
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
\mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
i^{(\mathcal T)} = \text{Sigmoid}(\widetilde{i}^{(\mathcal T)}) \\
\widetilde{i}^{(\mathcal T)} = \mathcal W_{x^{(\mathcal T)}\Rightarrow i^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)}\Rightarrow i^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_i \\
h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
\quad\vdots
\end{cases}
$$
The gradient along this path is:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]
$$
- Through the candidate state: starting from the cell state $\mathcal C^{(\mathcal T)}$, backpropagate through the candidate state $\widetilde{\mathcal C}^{(\mathcal T)}$ to $h^{(\mathcal T - 1)}$; the rest is the same as in the first case. The propagation path can be written as:
$$
\begin{cases}
y^{(\mathcal T)} = \mathcal W_{h^{(\mathcal T)} \Rightarrow y^{(\mathcal T)}} \cdot h^{(\mathcal T)} + b_y \\
h^{(\mathcal T)} = \mathcal O^{(\mathcal T)} * m^{(\mathcal T)} \\
m^{(\mathcal T)} = \text{Tanh}(\mathcal C^{(\mathcal T)}) \\
\mathcal C^{(\mathcal T)} = \mathcal C^{(\mathcal T -1)} * f^{(\mathcal T)} + \widetilde{\mathcal C}^{(\mathcal T)} * i^{(\mathcal T)} \\
\widetilde{\mathcal C}^{(\mathcal T)} = \text{Tanh} \left[\mathcal W_{x^{(\mathcal T)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot x^{(\mathcal T)} + \mathcal W_{h^{(\mathcal T - 1)} \Rightarrow \widetilde{\mathcal C}^{(\mathcal T)}} \cdot h^{(\mathcal T - 1)} + b_{\widetilde{\mathcal C}}\right] \\
h^{(\mathcal T - 1)} = \mathcal O^{(\mathcal T - 1)} * m^{(\mathcal T - 1)} \\
\quad\vdots
\end{cases}
$$
The gradient along this path is:
$$
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot \left[\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot \frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots\right]
$$
We have now found every gradient path contributing to $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}}$. Summing the gradients along these paths gives:
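As before, the braces only denote the sum of the terms inside them and are not a matrix. Of the five paths, $4$ pass through the cell state $\mathcal C^{(\mathcal T)}$, so their common prefix is factored out.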
$$
\begin{aligned}
\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T - 1)} \Rightarrow f^{(\mathcal T - 1)}}} & = \frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}} \cdot
\begin{Bmatrix}
\frac{\partial h^{(\mathcal T)}}{\partial \mathcal O^{(\mathcal T)}} \cdot \frac{\partial \mathcal O^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdot \underbrace{\frac{\partial h^{(\mathcal T-1)}}{\partial m^{(\mathcal T-1)}} \cdot \frac{\partial m^{(\mathcal T-1)}}{\partial \mathcal C^{(\mathcal T-1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}}}_{\cdots} \\
\quad \\
+\frac{\partial h^{(\mathcal T)}}{\partial m^{(\mathcal T)}} \cdot \frac{\partial m^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T)}} \cdot
\begin{bmatrix}
\frac{\partial \mathcal C^{(\mathcal T)}}{\partial \mathcal C^{(\mathcal T - 1)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial f^{(\mathcal T - 1)}} \cdot \frac{\partial f^{(\mathcal T - 1)}}{\partial \widetilde{f}^{(\mathcal T - 1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial \mathcal W_{x^{(\mathcal T-1)}\Rightarrow f^{(\mathcal T-1)}}} \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial f^{(\mathcal T)}} \cdot \frac{\partial f^{(\mathcal T)}}{\partial \widetilde{f}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T)}}{\partial h^{(\mathcal T-1)}} \cdots \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial i^{(\mathcal T)}} \cdot \frac{\partial i^{(\mathcal T)}}{\partial \widetilde{i}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots \\
+\frac{\partial \mathcal C^{(\mathcal T)}}{\partial \widetilde{\mathcal C}^{(\mathcal T)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T)}}{\partial h^{(\mathcal T - 1)}} \cdots
\end{bmatrix}
\end{Bmatrix}
\end{aligned}
$$
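The five-path decomposition can be sanity-checked on a two-step scalar LSTM: the sum of the five path products should match a finite-difference derivative taken with respect to the copy of the forget-gate input weight used at time $\mathcal T - 1$ only. This is a sketch under assumptions not in the text: scalar hypothetical parameters in `P`, made-up inputs and initial states, `g` standing for $\widetilde{\mathcal C}$, and the loss taken to be $h^{(\mathcal T)}$ itself so that the leading factor $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}}$ equals 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar parameters shared by both time steps.
P = {"w_xf": 0.5, "w_hf": -0.4, "b_f": 0.1, "w_xi": 0.3, "w_hi": 0.6, "b_i": 0.0,
     "w_xo": -0.2, "w_ho": 0.7, "b_o": 0.2, "w_xc": 0.8, "w_hc": -0.5, "b_c": 0.0}
X_PREV, X_LAST, H0, C0 = 1.0, -0.5, 0.3, 0.7   # x(T-1), x(T), h(T-2), C(T-2)

def step(x, h_prev, c_prev, w_xf):
    """One scalar LSTM step; returns every intermediate value by name."""
    f = sigmoid(w_xf * x + P["w_hf"] * h_prev + P["b_f"])
    i = sigmoid(P["w_xi"] * x + P["w_hi"] * h_prev + P["b_i"])
    o = sigmoid(P["w_xo"] * x + P["w_ho"] * h_prev + P["b_o"])
    g = math.tanh(P["w_xc"] * x + P["w_hc"] * h_prev + P["b_c"])  # candidate state
    c = c_prev * f + g * i
    m = math.tanh(c)
    return {"f": f, "i": i, "o": o, "g": g, "c": c, "m": m, "h": o * m}

def two_steps(w_xf_prev):
    """Run T-1 then T; only the weight copy used at T-1 is varied."""
    s1 = step(X_PREV, H0, C0, w_xf_prev)
    s2 = step(X_LAST, s1["h"], s1["c"], P["w_xf"])
    return s1, s2

def five_paths():
    s1, s2 = two_steps(P["w_xf"])
    tail_c = C0 * s1["f"] * (1 - s1["f"]) * X_PREV   # C(T-1) -> f(T-1) -> f~ -> W
    tail_h = s1["o"] * (1 - s1["m"] ** 2) * tail_c   # h(T-1) -> m(T-1) -> C(T-1) -> ...
    via_c = s2["o"] * (1 - s2["m"] ** 2)             # h(T) -> m(T) -> C(T)
    return [
        s2["m"] * s2["o"] * (1 - s2["o"]) * P["w_ho"] * tail_h,          # output gate
        via_c * s2["f"] * tail_c,                                        # C(T) -> C(T-1)
        via_c * s1["c"] * s2["f"] * (1 - s2["f"]) * P["w_hf"] * tail_h,  # forget gate
        via_c * s2["g"] * s2["i"] * (1 - s2["i"]) * P["w_hi"] * tail_h,  # input gate
        via_c * s2["i"] * (1 - s2["g"] ** 2) * P["w_hc"] * tail_h,       # candidate state
    ]

eps = 1e-6
numeric = (two_steps(P["w_xf"] + eps)[1]["h"] - two_steps(P["w_xf"] - eps)[1]["h"]) / (2 * eps)
print(sum(five_paths()), numeric)
```

The variable `tail_h` is the group of five factors shared by every path that reaches $h^{(\mathcal T - 1)}$, which is what the dots abbreviate in the formula.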
Comparing the gradients with respect to $\mathcal W_{x^{(t)} \Rightarrow f^{(t)}}$ at times $\mathcal T - 2$ and $\mathcal T - 1$
Is the gradient at time $\mathcal T - 2$, $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial \mathcal W_{x^{(\mathcal T-2)} \Rightarrow f^{(\mathcal T - 2)}}}$, obtained in the same way as at time $\mathcal T - 1$? It is not. The step $\mathcal T \Rightarrow \mathcal T - 1$ only involves paths that start from $h^{(\mathcal T)}$; every one of them begins with $\frac{\partial \mathcal L^{(\mathcal T)}}{\partial y^{(\mathcal T)}} \cdot \frac{\partial y^{(\mathcal T)}}{\partial h^{(\mathcal T)}}$. The step $\mathcal T- 1 \Rightarrow \mathcal T -2$, however, involves not only the paths that start from $h^{(\mathcal T - 1)}$ but also the paths that start from $\mathcal C^{(\mathcal T - 1)}$:
The paths that start from $h^{(\mathcal T - 1)}$ have the same form as those from $h^{(\mathcal T)}$ and are not repeated here. The paths that start from $\mathcal C^{(\mathcal T-1)}$ take the forms listed below. The two sets clearly overlap in part, but their gradients must be computed separately, because $\mathcal C^{(\mathcal T-1)}$ and $h^{(\mathcal T -1)}$ are different quantities.
$$
\mathcal C^{(\mathcal T -1)} \Rightarrow \begin{cases} \begin{aligned}
& \frac{\partial \mathcal C^{(\mathcal T - 1)}}{\partial \mathcal C^{(\mathcal T-2)}} \cdot \frac{\partial \mathcal C^{(\mathcal T - 2)}}{\partial f^{(\mathcal T - 2)}} \cdot \frac{\partial f^{(\mathcal T - 2)}}{\partial \widetilde{f}^{(\mathcal T - 2)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-2)}}{\partial \mathcal W_{x^{(\mathcal T-2)}\Rightarrow f^{(\mathcal T-2)}}} \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial f^{(\mathcal T-1)}} \cdot \frac{\partial f^{(\mathcal T-1)}}{\partial \widetilde{f}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{f}^{(\mathcal T-1)}}{\partial h^{(\mathcal T-2)}} \cdots \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial i^{(\mathcal T-1)}} \cdot \frac{\partial i^{(\mathcal T-1)}}{\partial \widetilde{i}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{i}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots \\
& \frac{\partial \mathcal C^{(\mathcal T-1)}}{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(\mathcal T-1)}}{\partial h^{(\mathcal T - 2)}} \cdots
\end{aligned} \end{cases}
$$
In the end, $\mathcal T \Rightarrow \mathcal T-2$ contains $4 \times 5 + 4 = 24$ paths in total.
The $+4$ refers to the output-gate path: because that path does not pass through the cell state $\mathcal C^{(t)}$, each time it reaches $h^{(\mathcal T)}, h^{(\mathcal T-1)}$ there is exactly one path by which it propagates to the corresponding $h^{(\mathcal T-1)}, h^{(\mathcal T-2)}$.
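The path count can be reproduced with a small bookkeeping sketch. It is not the author's own counting scheme. It assumes a standard LSTM cell without peephole connections and counts one edge per distinct chain-rule route between the nodes $h$ and $\mathcal C$ of adjacent time steps; the function name `count_paths` and the edge counts in the comments are illustrative assumptions.

```python
# Assumed edges from time step t back to time step t-1:
#   h(t) -> h(t-1): output gate, or via C(t) through the forget gate,
#                   input gate, or candidate cell state      => 4 routes
#   h(t) -> C(t-1): via C(t) directly                        => 1 route
#   C(t) -> h(t-1): forget gate, input gate, candidate       => 3 routes
#   C(t) -> C(t-1): directly                                 => 1 route

def count_paths(steps: int) -> int:
    """Number of backward paths from h(T) to the nodes of step T - steps."""
    n_h, n_c = 1, 0  # start from a single node, h(T)
    for _ in range(steps):
        # Paths landing on h(t-1) and C(t-1), given the counts at step t.
        n_h, n_c = 4 * n_h + 3 * n_c, n_h + n_c
    return n_h + n_c


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, count_paths(k))
```

Under these assumptions, two steps back gives $4 \times 5 + 1 \times 4 = 24$ paths, matching the total above, and the count keeps multiplying by a factor between 4 and 5 with every additional step.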
Why $\text{LSTM}$ can suppress vanishing gradients
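As backpropagation goes deeper, the number of backpropagation paths grows exponentially. Even if the gradient along individual paths vanishes, the number of paths can compensate for it;
For example, the cell-state gradient $\begin{aligned}\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}\end{aligned}$ can be adjusted through the weight parameters of the gate structures: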
$$
\begin{aligned}
\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}
& = f^{(t)} + \begin{pmatrix}
\begin{aligned}
& \quad \frac{\partial \mathcal C^{(t)}}{\partial f^{(t)}} \cdot \frac{\partial f^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\
& + \frac{\partial \mathcal C^{(t)}}{\partial i^{(t)}} \cdot \frac{\partial i^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}} \\
& + \frac{\partial \mathcal C^{(t)}}{\partial \widetilde{\mathcal C}^{(t)}} \cdot \frac{\partial \widetilde{\mathcal C}^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial \mathcal C^{(t-1)}}
\end{aligned}
\end{pmatrix} \\
& = f^{(t)} + \begin{pmatrix}
\begin{aligned}
& \quad \mathcal C^{(t-1)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow f^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\
& + \widetilde{\mathcal C}^{(t)} \cdot \left\{\left[\text{Sigmoid}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow i^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\} \\
& + i^{(t)} \cdot \left\{\left[\text{Tanh}(\cdot)\right]' \cdot \mathcal W_{h^{(t-1)} \Rightarrow \widetilde{\mathcal C}^{(t)}}\right\} \cdot \left\{\mathcal O^{(t-1)} * \left[\text{Tanh}(\mathcal C^{(t-1)})\right]'\right\}
\end{aligned}
\end{pmatrix}
\end{aligned}
$$
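Note that each step of backpropagation through the cell state produces $4$ partial-derivative terms. Three of them are determined jointly by the weight parameters feeding the forget gate, the input gate, and the candidate cell state at the current time step, each scaled by the previous step's output gate $\mathcal O^{(t-1)}$.
This can be understood as follows:
- Throughout backpropagation, the gate weights at every time step take part in adjusting $\begin{aligned}\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}\end{aligned}$; compared with a recurrent neural network, where a single weight matrix plays this role, it is considerably more robust;
- Moreover, the weight matrix in a recurrent neural network enters purely as a repeated product, whereas in $\text{LSTM}$ the terms are summed: even if the gradient through one gate structure vanishes at some time step, the remaining gate structures can adjust accordingly to maintain the gradient at that step.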
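The four-term decomposition of $\frac{\partial \mathcal C^{(t)}}{\partial \mathcal C^{(t-1)}}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a scalar LSTM cell without biases or peephole connections, treats $\mathcal O^{(t-1)}$ as a constant with respect to $\mathcal C^{(t-1)}$, and the weight names in the dictionary `w` (`f_h`, `i_h`, `c_h`, and so on) are hypothetical and chosen arbitrarily.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cell_step(c_prev, o_prev, x, w):
    """One scalar cell-state update C(t) = f * C(t-1) + i * C~(t)."""
    h_prev = o_prev * math.tanh(c_prev)            # h(t-1) = O(t-1) * Tanh(C(t-1))
    f = sigmoid(w["f_h"] * h_prev + w["f_x"] * x)  # forget gate
    i = sigmoid(w["i_h"] * h_prev + w["i_x"] * x)  # input gate
    g = math.tanh(w["c_h"] * h_prev + w["c_x"] * x)  # candidate cell state C~(t)
    return f * c_prev + i * g, (f, i, g)


def analytic_grad(c_prev, o_prev, x, w):
    """dC(t)/dC(t-1) as the sum of the direct term f(t) and three gate terms."""
    _, (f, i, g) = cell_step(c_prev, o_prev, x, w)
    dh = o_prev * (1.0 - math.tanh(c_prev) ** 2)   # dh(t-1)/dC(t-1)
    return (
        f
        + c_prev * f * (1.0 - f) * w["f_h"] * dh   # through the forget gate
        + g * i * (1.0 - i) * w["i_h"] * dh        # through the input gate
        + i * (1.0 - g * g) * w["c_h"] * dh        # through the candidate state
    )


def numeric_grad(c_prev, o_prev, x, w, eps=1e-6):
    """Central finite difference of C(t) with respect to C(t-1)."""
    hi, _ = cell_step(c_prev + eps, o_prev, x, w)
    lo, _ = cell_step(c_prev - eps, o_prev, x, w)
    return (hi - lo) / (2.0 * eps)


if __name__ == "__main__":
    w = {"f_h": 0.7, "f_x": -0.3, "i_h": 0.5, "i_x": 0.2, "c_h": -0.4, "c_x": 0.9}
    args = (0.8, 0.6, 1.5, w)  # C(t-1), O(t-1), x(t), weights
    print(analytic_grad(*args), numeric_grad(*args))
```

The two printed values should agree up to finite-difference error. Setting one of the weights such as `f_h` to zero removes only the corresponding term from the sum and leaves the others in place, which is the additive behaviour the second bullet describes.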