Forward and Backward Propagation in Recurrent Neural Networks (RNNs)


1. The RNN Model

*(Figure: RNN model diagram.)*

2. RNN Forward Propagation

At the current time index $t$, the hidden state $h^t$ is obtained jointly from the input $x^t$ and the previous hidden state $h^{t-1}$:

$$h^t = \tanh(Ux^t + Wh^{t-1} + b) \tag{1}$$

where $\tanh$ is chosen as the activation function and $b$ is the bias.

The raw output of the network at each step:

$$o^t = Vh^t + c \tag{2}$$

The predicted output:

$$a^t = \text{softmax}(o^t) = \text{softmax}(Vh^t + c) \tag{3}$$

Using the cross-entropy loss function at step $t$:

$$L^t = -\sum_{i=1}^N y_i^t \log a_i^t = -\log a_k^t$$

The simplification holds because among all $N$ classes only the true class $k$ has $y_k^t = 1$; every other component of the one-hot label $y^t$ is zero.
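To make equations (1)-(3) concrete, here is a minimal NumPy sketch of the forward pass over one input sequence. The function and variable names (`rnn_forward`, `xs`, `hs`) are illustrative assumptions, not from the original post:

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over the class axis of a column vector."""
    e = np.exp(o - o.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(xs, U, W, V, b, c, h0):
    """Run equations (1)-(3) over one input sequence.

    xs : list of input column vectors x^t, each of shape (n_x, 1)
    h0 : initial hidden state, shape (n_h, 1)
    Returns dicts of hidden states h^t and predictions a^t, keyed by t.
    """
    hs, as_ = {-1: h0}, {}
    for t, x in enumerate(xs):
        hs[t] = np.tanh(U @ x + W @ hs[t - 1] + b)  # eq. (1)
        o_t = V @ hs[t] + c                         # eq. (2)
        as_[t] = softmax(o_t)                       # eq. (3)
    return hs, as_
```

Here `hs[t - 1]` with `t = 0` picks up the initial state stored under key `-1`, so the same line implements the recursion of equation (1) at every step.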

3. RNN Backward Propagation

RNN backpropagation is sometimes called BPTT (back-propagation through time). All of the parameters $U, W, V, b, c$ are shared across every position of the unrolled network, which is why their gradients are sums over time steps.

The cost function over one sequence:

$$L = \sum_{t=1}^m L^t$$

where $m$ is the length of the sequence, i.e. the number of time steps.

From the gradient derivation of cross-entropy with a softmax activation (see the companion article 《交叉熵的反向传播梯度推导(使用softmax激活函数)》), we know that

$$\frac{\partial L^t}{\partial o^t} = a^t - y^t$$

Since $o^t$ enters the cost only through the single term $L^t$,

$$\frac{\partial L}{\partial o^t} = a^t - y^t \tag{4}$$
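As a quick sanity check, this gradient can be verified against a finite-difference approximation of the loss. A short self-contained sketch (all names here are mine, for illustration only):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

rng = np.random.default_rng(1)
o = rng.normal(size=5)
y = np.eye(5)[2]                # one-hot target, true class k = 2
analytic = softmax(o) - y       # a - y, as in eq. (4)

eps = 1e-6
numeric = np.zeros_like(o)
for i in range(5):
    d = np.zeros_like(o)
    d[i] = eps
    # cross-entropy loss L = -log a_k, evaluated at o ± eps in direction i
    numeric[i] = (-np.log(softmax(o + d)[2]) + np.log(softmax(o - d)[2])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```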

The gradients of the parameters $V, c$ can be computed directly:

$$\left\{\begin{aligned} &\frac{\partial L}{\partial V} = \sum_{t=1}^m \frac{\partial L}{\partial o^t}\frac{\partial o^t}{\partial V} = \sum_{t=1}^m (a^t - y^t)(h^t)^T \\ &\frac{\partial L}{\partial c} = \sum_{t=1}^m \frac{\partial L}{\partial o^t}\frac{\partial o^t}{\partial c} = \sum_{t=1}^m (a^t - y^t) \end{aligned}\right.$$
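In code, accumulating these two gradients is a direct loop over the sequence. This sketch continues the hypothetical names from the forward-pass code above, with `ys[t]` an assumed one-hot target column vector at step $t$:

```python
# Output-layer gradients, accumulated over all time steps
dV, dc = np.zeros_like(V), np.zeros_like(c)
for t in range(len(xs)):
    do = as_[t] - ys[t]    # (a^t - y^t), from eq. (4)
    dV += do @ hs[t].T     # Σ_t (a^t - y^t)(h^t)^T
    dc += do
```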

The gradients of the parameters $W, U, b$ can be computed in the same way as in DNN backpropagation, by defining an auxiliary variable, the gradient of the loss with respect to the hidden state:

$$\delta^t = \frac{\partial L}{\partial h^t} \tag{5}$$

$$\begin{aligned} \delta^t &= \frac{\partial L}{\partial o^t}\frac{\partial o^t}{\partial h^t} + \frac{\partial L}{\partial h^{t+1}}\frac{\partial h^{t+1}}{\partial h^t} \\ &= V^T(a^t - y^t) + W^T\left(\delta^{t+1} \odot \tanh'(h^{t+1})\right) \end{aligned} \tag{6}$$

In the full chain-rule expansion over later steps, only $h^{t+1}$ contains $h^t$ directly, so only the term $\dfrac{\partial h^{t+1}}{\partial h^t}$ survives; the direct partial derivatives $\dfrac{\partial h^{t+k}}{\partial h^t}$ vanish for $k > 1$, and the influence of $h^t$ on later steps is carried recursively through $\delta^{t+1}$.

At the final time step $t = m$ there is no later hidden state, so the recursion terminates with:

$$\delta^m = \frac{\partial L}{\partial o^m}\frac{\partial o^m}{\partial h^m} = V^T(a^m - y^m) \tag{7}$$

The gradients of the parameters $W, U, b$ can then be computed as follows:

$$\left\{\begin{aligned} &\frac{\partial L}{\partial W} = \sum_{t=1}^m \frac{\partial L}{\partial h^t}\frac{\partial h^t}{\partial W} = \sum_{t=1}^m \left[\delta^t \odot (1-(h^t)^2)\right](h^{t-1})^T \\ &\frac{\partial L}{\partial U} = \sum_{t=1}^m \frac{\partial L}{\partial h^t}\frac{\partial h^t}{\partial U} = \sum_{t=1}^m \left[\delta^t \odot (1-(h^t)^2)\right](x^t)^T \\ &\frac{\partial L}{\partial b} = \sum_{t=1}^m \frac{\partial L}{\partial h^t}\frac{\partial h^t}{\partial b} = \sum_{t=1}^m \delta^t \odot (1-(h^t)^2) \end{aligned}\right. \tag{8}$$

where $1-(h^t)^2 = \tanh'(h^t)$ is the derivative of the activation in equation (1).
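Putting equations (5)-(8) together, here is a minimal BPTT sketch, again using the hypothetical names from the forward-pass code above (`xs`, `hs`, `as_`, `ys`), looping backward over time:

```python
# Backward pass through time for W, U, b (eqs. (5)-(8))
dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
dh_next = np.zeros_like(b)  # holds W^T (delta^{t+1} ⊙ tanh'(h^{t+1})); zero at t = m, so eq. (7) falls out
for t in reversed(range(len(xs))):
    delta = V.T @ (as_[t] - ys[t]) + dh_next  # eq. (6)
    dtanh = delta * (1 - hs[t] ** 2)          # delta^t ⊙ tanh'(h^t)
    dW += dtanh @ hs[t - 1].T                 # eq. (8); hs[-1] is the initial state
    dU += dtanh @ xs[t].T
    db += dtanh
    dh_next = W.T @ dtanh                     # propagate to step t - 1
```

As in the forward pass, `hs[t - 1]` at `t = 0` resolves to the initial hidden state stored under key `-1`.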

4. Drawbacks of RNNs

  1. When the gap between related pieces of information grows, an RNN can no longer retrieve the earlier information effectively: it fails to learn long-term dependencies (see the sketch after this list).
    Remedy: use an LSTM network.
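The difficulty is visible directly in the recursion of equation (6): each step backward in time multiplies the gradient by another factor of $W^T$ and $\tanh'$, so its magnitude typically shrinks (or explodes) exponentially with the distance. A toy illustration, with an assumed random weight scale and a fixed representative hidden state for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h = 32
W = rng.normal(scale=0.5 / np.sqrt(n_h), size=(n_h, n_h))
g = np.ones((n_h, 1))                  # gradient arriving at the final step
h = rng.uniform(-0.9, 0.9, (n_h, 1))   # fixed representative hidden state
for k in range(1, 51):
    g = W.T @ (g * (1 - h ** 2))       # one step of the recursion in eq. (6)
    if k % 10 == 0:
        print(f"{k:2d} steps back: |gradient| = {np.linalg.norm(g):.2e}")
```

With weights at this assumed scale the gradient norm drops by orders of magnitude every few steps, which is the vanishing-gradient behavior that LSTM gating is designed to mitigate.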
