RNN-LSTM-GRU

Recurrent Neural Network (RNN)

Suppose $X_t \in \mathbb{R}^{n \times d}$ is the mini-batch input at time step $t$ of a sequence and $H_t \in \mathbb{R}^{n \times h}$ is the hidden variable at that time step. Unlike in a multilayer perceptron, here we keep the previous time step's hidden variable $H_{t-1}$ and introduce a new weight parameter $W_{hh} \in \mathbb{R}^{h \times h}$ that describes how the previous time step's hidden variable is used at the current time step. Concretely, the current hidden variable is determined jointly by the current input and the previous time step's hidden state:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h),$$

The hidden variable here captures the sequence's historical information up to the current time step, much like the network's state or memory at that time step, so it is also called the hidden state.

The output at time step $t$ is

$$O_t = H_t W_{hy} + b_y.$$

from mxnet import nd

def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices
    # of shape (batch_size, vocab_size).
    W_xh, W_hh, b_h, W_hy, b_y = params
    H, = state
    outputs = []
    for X in inputs:
        # H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh
        H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
        # O_t = H_t W_hy + b_y
        Y = nd.dot(H, W_hy) + b_y
        outputs.append(Y)
    return outputs, (H,)
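
A minimal usage sketch for the function above. The sizes (vocab_size = 28, num_hiddens = 256, a batch of 2 sequences over 5 time steps), the get_params helper, and the one-hot inputs are illustrative assumptions, not part of the original post:

# Illustrative sizes and parameters (assumptions for this sketch).
vocab_size, num_hiddens, batch_size, num_steps = 28, 256, 2, 5

def get_params():
    # Shapes follow X_t in R^{n x d} and H_t in R^{n x h} above.
    W_xh = nd.random.normal(scale=0.01, shape=(vocab_size, num_hiddens))
    W_hh = nd.random.normal(scale=0.01, shape=(num_hiddens, num_hiddens))
    b_h = nd.zeros(num_hiddens)
    W_hy = nd.random.normal(scale=0.01, shape=(num_hiddens, vocab_size))
    b_y = nd.zeros(vocab_size)
    return [W_xh, W_hh, b_h, W_hy, b_y]

# num_steps matrices of shape (batch_size, vocab_size), e.g. one-hot token indices.
inputs = [nd.one_hot(nd.array([0, 2]), vocab_size) for _ in range(num_steps)]
state = (nd.zeros((batch_size, num_hiddens)),)  # initial hidden state H_0 = 0

outputs, (H,) = rnn(inputs, state, get_params())
print(len(outputs), outputs[0].shape, H.shape)  # 5 (2, 28) (2, 256)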

Deep Recurrent Neural Network

In a deep RNN with $L$ hidden layers, the first layer reads the input $X_t$, and each deeper layer $l$ reads the hidden state of layer $l-1$ at the same time step:

$$H_t^{(1)} = \phi(X_t W_{xh}^{(1)} + H_{t-1}^{(1)} W_{hh}^{(1)} + b_h^{(1)}),$$

$$H_t^{(l)} = \phi(H_t^{(l-1)} W_{xh}^{(l)} + H_{t-1}^{(l)} W_{hh}^{(l)} + b_h^{(l)}),$$

$$O_t = H_t^{(L)} W_{hy} + b_y.$$
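
There is no companion code for these equations in the post; below is a minimal sketch of one way to implement the stacked recurrence with the same nd primitives. The parameter layout (a list of (W_xh, W_hh, b_h) triples per hidden layer plus an output pair) is an assumption for illustration:

def deep_rnn(inputs, states, params):
    # states: one hidden state H^(l) per layer; params:
    # [(W_xh^(1), W_hh^(1), b_h^(1)), ..., (W_xh^(L), W_hh^(L), b_h^(L)), (W_hy, b_y)]
    *layer_params, (W_hy, b_y) = params
    H = list(states)
    outputs = []
    for X in inputs:
        inp = X  # layer 1 reads X_t; layer l > 1 reads H_t^(l-1)
        for l, (W_xh, W_hh, b_h) in enumerate(layer_params):
            H[l] = nd.tanh(nd.dot(inp, W_xh) + nd.dot(H[l], W_hh) + b_h)
            inp = H[l]
        outputs.append(nd.dot(H[-1], W_hy) + b_y)  # O_t from the top layer H^(L)
    return outputs, tuple(H)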

Bidirectional Recurrent Neural Network

A bidirectional RNN maintains a forward hidden state $\overrightarrow{H}_t$ and a backward hidden state $\overleftarrow{H}_t$, obtained by scanning the sequence in opposite directions:

$$\overrightarrow{H}_t = \phi(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}),$$

$$\overleftarrow{H}_t = \phi(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}).$$

The two are concatenated along the feature dimension,

$$H_t = \mathrm{concat}(\overrightarrow{H}_t, \overleftarrow{H}_t), \qquad H_t \in \mathbb{R}^{n \times 2h},$$

and the output is

$$O_t = H_t W_{hy} + b_y.$$
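
The post gives no reference implementation for the bidirectional case either; the following is a minimal sketch under the assumption of separate forward/backward parameter sets and zero initial states (all names are illustrative):

def bi_rnn(inputs, state, params):
    (W_xh_f, W_hh_f, b_h_f, W_xh_b, W_hh_b, b_h_b, W_hy, b_y) = params
    H_f, H_b = state
    forward, backward = [], []
    for X in inputs:                      # scan left to right
        H_f = nd.tanh(nd.dot(X, W_xh_f) + nd.dot(H_f, W_hh_f) + b_h_f)
        forward.append(H_f)
    for X in reversed(inputs):            # scan right to left
        H_b = nd.tanh(nd.dot(X, W_xh_b) + nd.dot(H_b, W_hh_b) + b_h_b)
        backward.append(H_b)
    backward.reverse()                    # realign backward states with t = 1..T
    outputs = []
    for Hf, Hb in zip(forward, backward):
        H = nd.concat(Hf, Hb, dim=1)      # H_t in R^{n x 2h}
        outputs.append(nd.dot(H, W_hy) + b_y)  # W_hy has shape (2h, q)
    return outputs, (H_f, H_b)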

Gradient Clipping

Recurrent neural networks are prone to vanishing or exploding gradients. To cope with exploding gradients, we can clip the gradient. Suppose we concatenate the elements of the gradients of all model parameters into a single vector $g$ and set the clipping threshold to $\theta$. The L2 norm of the clipped gradient then never exceeds $\theta$ (if $\|g\| > \theta$ the gradient is rescaled to have norm exactly $\theta$; otherwise it is left unchanged):

$$\min\left(\frac{\theta}{\|g\|}, 1\right) g$$

def grad_clipping(params, theta, ctx):
    # L2 norm of all parameter gradients, treated as one concatenated vector g.
    norm = nd.array([0.0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    # If ||g|| exceeds theta, rescale every gradient by theta / ||g||.
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
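
A hedged sketch of where the clipping fits in a training step, reusing the rnn function and the illustrative params/inputs/state from the sketches above; the placeholder loss and the hyperparameter values are assumptions, not the original training code:

import mxnet as mx
from mxnet import autograd

params = get_params()
for param in params:
    param.attach_grad()                    # allocate .grad buffers for autograd

with autograd.record():
    outputs, state = rnn(inputs, state, params)
    loss = nd.zeros(1)
    for Y in outputs:
        loss = loss + (Y ** 2).sum()       # placeholder loss, for illustration only
loss.backward()

theta, lr = 1e-2, 1.0                      # illustrative hyperparameters
grad_clipping(params, theta, mx.cpu())     # clip gradients before the update
for param in params:
    param[:] = param - lr * param.grad     # plain SGD step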

http://zh.gluon.ai/chapter_recurrent-neural-networks/rnn-scratch.html

Gradient clipping addresses exploding gradients, but what about vanishing gradients? -> gated units


LSTM

Input gate, forget gate, and output gate:

$$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i),$$

$$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f),$$

$$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o).$$

Candidate memory cell:

$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c).$$

Memory cell:

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t,$$

where $\odot$ denotes element-wise multiplication.

Hidden state:

$$H_t = O_t \odot \tanh(C_t).$$

def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hy, b_y] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = nd.sigmoid(nd.dot(X, W_xi) + nd.dot(H, W_hi) + b_i)     # input gate
        F = nd.sigmoid(nd.dot(X, W_xf) + nd.dot(H, W_hf) + b_f)     # forget gate
        O = nd.sigmoid(nd.dot(X, W_xo) + nd.dot(H, W_ho) + b_o)     # output gate
        C_tilda = nd.tanh(nd.dot(X, W_xc) + nd.dot(H, W_hc) + b_c)  # candidate cell
        C = F * C + I * C_tilda    # memory cell
        H = O * C.tanh()           # hidden state
        Y = nd.dot(H, W_hy) + b_y  # output layer
        outputs.append(Y)
    return outputs, (H, C)
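
Unlike the plain RNN, the LSTM carries a two-part state (H, C). A minimal call sketch, reusing the illustrative sizes and inputs from the RNN sketch above; the get_lstm_params helper is an assumption for this sketch, not code from the post:

def get_lstm_params():
    def three():  # one (W_x*, W_h*, b_*) triple per gate / candidate cell
        return (nd.random.normal(scale=0.01, shape=(vocab_size, num_hiddens)),
                nd.random.normal(scale=0.01, shape=(num_hiddens, num_hiddens)),
                nd.zeros(num_hiddens))
    W_xi, W_hi, b_i = three()   # input gate
    W_xf, W_hf, b_f = three()   # forget gate
    W_xo, W_ho, b_o = three()   # output gate
    W_xc, W_hc, b_c = three()   # candidate memory cell
    W_hy = nd.random.normal(scale=0.01, shape=(num_hiddens, vocab_size))
    b_y = nd.zeros(vocab_size)
    return [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o,
            W_xc, W_hc, b_c, W_hy, b_y]

state = (nd.zeros((batch_size, num_hiddens)),   # H_0
         nd.zeros((batch_size, num_hiddens)))   # C_0
outputs, (H, C) = lstm(inputs, state, get_lstm_params())
print(outputs[0].shape, H.shape, C.shape)       # (2, 28) (2, 256) (2, 256)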

GRU

Reset gate and update gate:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r),$$

$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z).$$

Candidate hidden state:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h).$$

Hidden state:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t.$$

def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hy, b_y = params
    H, = state
    outputs = []
    for X in inputs:
        Z = nd.sigmoid(nd.dot(X, W_xz) + nd.dot(H, W_hz) + b_z)        # update gate
        R = nd.sigmoid(nd.dot(X, W_xr) + nd.dot(H, W_hr) + b_r)        # reset gate
        H_tilda = nd.tanh(nd.dot(X, W_xh) + R * nd.dot(H, W_hh) + b_h) # candidate state
        H = Z * H + (1 - Z) * H_tilda   # interpolate between old and candidate state
        Y = nd.dot(H, W_hy) + b_y
        outputs.append(Y)
    return outputs, (H,)
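
For comparison, a hedged sketch of the same model using Gluon's built-in layer (this assumes the mxnet.gluon.rnn API is available; note that the built-in layer returns the hidden states H_t and leaves the output projection W_hy, b_y to a separate dense layer):

from mxnet import gluon, nd

gru_layer = gluon.rnn.GRU(hidden_size=256)  # built-in counterpart of gru() above
gru_layer.initialize()
X = nd.random.normal(shape=(5, 2, 28))      # (num_steps, batch_size, vocab_size)
state = gru_layer.begin_state(batch_size=2)
output, state = gru_layer(X, state)
print(output.shape)                         # (5, 2, 256): hidden states, not O_t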

http://www.tensorfly.cn/tfdoc/tutorials/recurrent.html

Recurrent Neural Network Regularization - Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals https://arxiv.org/abs/1409.2329

https://github.com/tensorflow/models/tree/master/tutorials/rnn

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

http://deeplearning.net/tutorial/lstm.html#lstm

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://distill.pub/2016/augmented-rnns/
