1.
a.
$$
\begin{aligned}
\frac{\partial y_i}{\partial x_j}
&=\frac{\partial\,\sigma(W_{i1}x_1+\dots+W_{id}x_d)}{\partial x_j}\\
&=\frac{\partial}{\partial x_j}\left(\frac{1}{1+e^{-(W_{i1}x_1+\dots+W_{id}x_d)}}\right)\\
&=\frac{e^{-(W_{i1}x_1+\dots+W_{id}x_d)}\,W_{ij}}{\left(1+e^{-(W_{i1}x_1+\dots+W_{id}x_d)}\right)^2}
\end{aligned}
$$
Let
$$
q_i=\frac{e^{-(W_{i1}x_1+\dots+W_{id}x_d)}}{\left(1+e^{-(W_{i1}x_1+\dots+W_{id}x_d)}\right)^2}
$$
$$
\frac{\partial Y}{\partial X}=
\begin{bmatrix}
q_1W_{11} & \cdots & q_1W_{1d} \\
\vdots & \ddots & \vdots \\
q_nW_{n1} & \cdots & q_nW_{nd}
\end{bmatrix}
=
\begin{bmatrix}
q_1 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & q_n
\end{bmatrix}
\begin{bmatrix}
W_{11} & \cdots & W_{1d} \\
\vdots & \ddots & \vdots \\
W_{n1} & \cdots & W_{nd}
\end{bmatrix}
$$
Let $z = Wx$. Then $\sigma'(z_i)=q_i$, so
$$
\sigma'(z)=
\begin{bmatrix}
q_1 \\ q_2 \\ \vdots \\ q_n
\end{bmatrix}
$$
and therefore
$$
\frac{\partial Y}{\partial X}=\operatorname{diag}(\sigma'(z))\,W
$$
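The result $\operatorname{diag}(\sigma')\,W$ can be checked numerically. The following is a minimal sketch, not part of the original solution: the matrix `W`, the input `x`, and all function names are hypothetical values chosen for illustration. It compares the analytic Jacobian with a central finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, x):
    """y = sigmoid(W x), with W given as a list of rows."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

def jacobian_analytic(W, x):
    """diag(sigma'(Wx)) W: row i of W scaled by q_i."""
    rows = []
    for row in W:
        z_i = sum(w_ij * x_j for w_ij, x_j in zip(row, x))
        q_i = math.exp(-z_i) / (1.0 + math.exp(-z_i)) ** 2
        rows.append([q_i * w_ij for w_ij in row])
    return rows

def jacobian_numeric(W, x, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n, d = len(W), len(x)
    J = [[0.0] * d for _ in range(n)]
    for j in range(d):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus = forward(W, x_plus)
        y_minus = forward(W, x_minus)
        for i in range(n):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * eps)
    return J

# Hypothetical example values: n = 2 outputs, d = 3 inputs.
W = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
x = [0.2, -0.4, 0.1]

J_a = jacobian_analytic(W, x)
J_n = jacobian_numeric(W, x)
max_err = max(abs(a - b) for ra, rn in zip(J_a, J_n) for a, b in zip(ra, rn))
print("largest entrywise difference:", max_err)
```

The two Jacobians should agree up to finite-difference error, which supports the factorization into a diagonal matrix of $q_i$ values times $W$.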
b. Derive the quantity
$$
\frac{\partial L}{\partial W}=\sum_{t=0}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}
$$
By the chain rule. Write the hidden state and the loss at step $t$ as
$$
h_k=f(h_{k-1},x_k;W),\qquad L_t=\mathrm{Loss}(h_t,h_{GT})
$$
The same $W$ is used at every step, so $L_t$ depends on $W$ through every hidden state $h_k$ with $k\le t$. The contribution that passes through a single $h_k$ is
$$
\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}
$$
where $\frac{\partial h_k}{\partial W}$ is taken with $h_{k-1}$ held fixed. Summing over $k$ gives
$$
\frac{\partial L_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}
$$
Summing over all time steps up to $T$ then gives:
$$
\frac{\partial L}{\partial W}=\sum_{t=0}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}
$$
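The double sum can be tested on a toy model. This is a sketch under stated assumptions, not the setup of the original problem: a scalar RNN $h_t=\tanh(w\,h_{t-1}+x_t)$ with $h_0=0$, a squared-error loss $L_t=\tfrac12(h_t-y_t)^2$, and hypothetical inputs and targets. It evaluates the double sum term by term and compares it with a finite-difference gradient of the total loss.

```python
import math

def run_rnn(w, xs, h0=0.0):
    """Return [h_0, h_1, ..., h_T] for h_t = tanh(w * h_{t-1} + x_t)."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w * hs[-1] + x_t))
    return hs

def total_loss(w, xs, targets):
    """L = sum_t 0.5 * (h_t - y_t)^2 for t = 1..T (assumed loss)."""
    hs = run_rnn(w, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], targets))

def grad_double_sum(w, xs, targets):
    """Evaluate sum_t sum_{k<=t} dL_t/dh_t * dh_t/dh_k * dh_k/dw."""
    hs = run_rnn(w, xs)
    T = len(xs)
    grad = 0.0
    for t in range(1, T + 1):
        dL_dh_t = hs[t] - targets[t - 1]
        for k in range(1, t + 1):
            # dh_t/dh_k = prod_{i=k+1}^{t} dh_i/dh_{i-1}
            dh_t_dh_k = 1.0
            for i in range(k + 1, t + 1):
                dh_t_dh_k *= (1.0 - hs[i] ** 2) * w
            # Immediate derivative of h_k w.r.t. w, with h_{k-1} held fixed.
            dh_k_dw = (1.0 - hs[k] ** 2) * hs[k - 1]
            grad += dL_dh_t * dh_t_dh_k * dh_k_dw
    return grad

# Hypothetical values for illustration.
w = 0.8
xs = [0.5, -0.3, 0.9]
targets = [0.1, 0.2, -0.4]

eps = 1e-6
numeric = (total_loss(w + eps, xs, targets) - total_loss(w - eps, xs, targets)) / (2 * eps)
analytic = grad_double_sum(w, xs, targets)
print(analytic, numeric)
```

The two values should match up to finite-difference error. Note that the product inside `dh_t_dh_k` is where repeated multiplication by $w$ appears, which is what makes long-range terms shrink or grow.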
2.
a.
When $T=3$:
$$
\begin{aligned}
\frac{\partial L}{\partial W}
&=\sum_{t=0}^{3}\sum_{k=1}^{t}\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}\\
&=\frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial h_1}\frac{\partial h_1}{\partial W}\\
&\quad+\frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}
+\frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial h_2}\frac{\partial h_2}{\partial W}\\
&\quad+\frac{\partial L_3}{\partial h_3}\frac{\partial h_3}{\partial h_1}\frac{\partial h_1}{\partial W}
+\frac{\partial L_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W}
+\frac{\partial L_3}{\partial h_3}\frac{\partial h_3}{\partial h_3}\frac{\partial h_3}{\partial W}
\end{aligned}
$$
The $t=0$ term contributes nothing because its inner sum over $k$ is empty, and each factor $\frac{\partial h_t}{\partial h_t}$ is the identity.
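Each of the six terms in the expansion corresponds to one index pair $(t, k)$ with $1 \le k \le t \le 3$. As a small illustration (the variable names are only for this sketch), the pairs can be enumerated directly from the summation limits:

```python
T = 3

# One (t, k) pair per term of the double sum; t = 0 yields no pairs
# because range(1, 1) is empty.
terms = [(t, k) for t in range(0, T + 1) for k in range(1, t + 1)]

for t, k in terms:
    print(f"dL_{t}/dh_{t} * dh_{t}/dh_{k} * dh_{k}/dW")

print(len(terms))  # 6
```

In general the double sum has $T(T+1)/2$ terms, so the count grows quadratically with the sequence length.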
b.
Since $M^n=M^{n-1}M$ and $M=QAQ^{-1}$:
$$
\begin{aligned}
M^n&=M^{n-1}QAQ^{-1}\\
&=M^{n-2}QAQ^{-1}QAQ^{-1}\\
&=M^{n-2}QA^2Q^{-1}\\
&\;\;\vdots\\
&=QA^nQ^{-1}
\end{aligned}
$$
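The identity $M^n=QA^nQ^{-1}$ can be confirmed numerically. In this sketch the matrix `Q` is a hypothetical invertible matrix picked only for illustration, and the diagonal entries of `A` are example values; the helper functions are plain-Python stand-ins for a linear algebra library.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def inverse_2x2(X):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matrix_power(X, n):
    """X^n for a 2x2 matrix by repeated multiplication."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = matmul(result, X)
    return result

# Hypothetical Q; A is diagonal with example eigenvalues 0.9 and 0.4.
Q = [[2.0, 1.0], [1.0, 1.0]]
A = [[0.9, 0.0], [0.0, 0.4]]
Q_inv = inverse_2x2(Q)
M = matmul(matmul(Q, A), Q_inv)

n = 30
A_n = [[A[0][0] ** n, 0.0], [0.0, A[1][1] ** n]]  # power of a diagonal matrix
lhs = matrix_power(M, n)
rhs = matmul(matmul(Q, A_n), Q_inv)

max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print("largest entrywise difference:", max_diff)
```

Raising the diagonal matrix to a power only requires raising each diagonal entry to that power, which is why the decomposition makes the long-term behaviour of $M^n$ easy to read off.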
c.
$$
A^{30}=
\begin{bmatrix}
0.9^{30} & 0 \\
0 & 0.4^{30}
\end{bmatrix}
$$
$$
w^{30}=
\begin{bmatrix}
0.6\cdot 0.9^{30} & 0.8\cdot 0.4^{30} \\
0.8\cdot 0.9^{30} & 0.6\cdot 0.4^{30}
\end{bmatrix}
$$
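Analysis: after raising the matrix to the 30th power, all of its entries are close to 0. If every eigenvalue has absolute value less than 1, the powers of the eigenvalues tend to 0 as $n$ grows, so the result tends to 0. If instead an eigenvalue has absolute value greater than 1, it grows exponentially and the corresponding column diverges to infinity.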
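To make the magnitudes concrete, the powers of the two eigenvalues can be evaluated directly. The value 1.1 below is a hypothetical eigenvalue added only to illustrate the case of magnitude greater than 1; it does not come from the problem.

```python
# Eigenvalues 0.9 and 0.4 appear in the matrix above; 1.1 is a
# hypothetical value illustrating the exploding case.
eigenvalues = (0.9, 0.4, 1.1)
powers = {lam: lam ** 30 for lam in eigenvalues}

for lam, value in powers.items():
    print(f"{lam}^30 = {value:.4e}")

# Approximate values: 0.9^30 is about 4.24e-02, 0.4^30 is about 1.15e-12,
# and 1.1^30 is about 1.74e+01.
```

Even an eigenvalue as close to 1 as 0.9 shrinks by more than a factor of 20 over 30 steps, while one only slightly above 1 grows by a comparable factor.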
3.
a.
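The three functions are the gates of an LSTM, which protect and control the cell state. Each gate is built on the sigmoid function. Among them, $i_t$ and $o_t$ control how state information is stored.
$f_t$: the forget gate. For each entry of $C_{t-1}$ it produces a number between 0 and 1, where 1 means keep it entirely and 0 means discard it entirely.
$i_t$: the input gate. It produces a number between 0 and 1 that decides which values to update. A tanh layer creates the new candidate values, and the two are combined to form the update to the state.
$o_t$: the output gate. It produces a number between 0 and 1 that decides which parts of the cell state are output. The cell state is passed through tanh and multiplied by this sigmoid output, so the resulting output lies between -1 and 1.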
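The roles of the three gates can be seen in a single update step. The sketch below assumes the standard LSTM formulation with scalar states; the function name `lstm_step`, the parameter layout, and every weight value are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step (assumed standard formulation).

    params maps a name in {"f", "i", "o", "c"} to (w_x, w_h, b).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))        # forget gate, in (0, 1)
    i_t = sigmoid(pre_activation("i"))        # input gate, in (0, 1)
    o_t = sigmoid(pre_activation("o"))        # output gate, in (0, 1)
    c_tilde = math.tanh(pre_activation("c"))  # candidate values, in (-1, 1)

    c_t = f_t * c_prev + i_t * c_tilde        # keep part of old state, add update
    h_t = o_t * math.tanh(c_t)                # gated output, in (-1, 1)
    return h_t, c_t, (f_t, i_t, o_t)

# Hypothetical weights and inputs.
params = {
    "f": (0.5, -0.2, 0.1),
    "i": (1.0, 0.3, 0.0),
    "o": (-0.7, 0.4, 0.2),
    "c": (0.9, -0.5, 0.0),
}
h_t, c_t, gates = lstm_step(x_t=0.6, h_prev=-0.1, c_prev=0.8, params=params)
print("gates (f, i, o):", gates)
print("c_t:", c_t, "h_t:", h_t)
```

The gates only scale other quantities, so they never change a sign; any negative values in $C_t$ or $h_t$ come from the tanh terms.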
b.
$f_t$, $i_t$ and $o_t$ are always non-negative: each is the output of a sigmoid, so its value lies in $(0, 1)$. The other functions involve tanh, whose output lies in $(-1, 1)$, so they can take negative values.
c.
Because
$$
\frac{\partial C_t}{\partial C_k}=\prod_{i=k+1}^{t}\frac{\partial C_i}{\partial C_{i-1}}
$$
setting $f_t=1$, $i_t=0$ gives
$$
C_t=f_t\otimes C_{t-1}+i_t\otimes\overline{C_t}=C_{t-1}
$$
so every factor $\frac{\partial C_i}{\partial C_{i-1}}$ equals 1. Therefore:
$$
\frac{\partial C_t}{\partial C_k}=\prod_{i=k+1}^{t}\frac{\partial C_i}{\partial C_{i-1}}=\prod_{i=k+1}^{t}1=1
$$
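The effect of the forget gate on this product can be illustrated with a scalar cell state. This is a sketch under simplifying assumptions: the gates are treated as fixed constants, so each factor $\partial C_i/\partial C_{i-1}$ is just the forget-gate value, and the step count and candidate values are hypothetical.

```python
def cell_state_gradient(forget_gates):
    """dC_t/dC_k as the product of the forget-gate values for i = k+1..t
    (scalar cell state, gates treated as constants)."""
    product = 1.0
    for f_i in forget_gates:
        product *= f_i
    return product

def roll_cell_state(c_k, forget_gates, input_gates, candidates):
    """Apply C_i = f_i * C_{i-1} + i_i * C~_i repeatedly, starting from C_k."""
    c = c_k
    for f_i, i_i, c_tilde in zip(forget_gates, input_gates, candidates):
        c = f_i * c + i_i * c_tilde
    return c

steps = 50  # hypothetical number of steps between k and t

grad_open = cell_state_gradient([1.0] * steps)   # f = 1: gradient stays 1
grad_leaky = cell_state_gradient([0.9] * steps)  # f = 0.9: gradient decays

# With f = 1 and i = 0 the candidate values are ignored and C_t = C_k.
candidates = [0.3, -0.8] * (steps // 2)
c_t = roll_cell_state(2.5, [1.0] * steps, [0.0] * steps, candidates)

print(grad_open, grad_leaky, c_t)
```

With $f=1$ and $i=0$ the cell state is copied unchanged, so the gradient neither vanishes nor explodes, whereas a forget gate of 0.9 already shrinks it to well under 1% over 50 steps.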