Introduction to Deep Learning (62) Recurrent Neural Network - Deep Recurrent Neural Network

foreword

The core content comes from blog link 1 and blog link 2; please give the original authors your support.
This article is a personal note, kept so I don't forget the material.

Recurrent Neural Networks - Deep Recurrent Neural Networks

courseware

Recap: Recurrent Neural Networks


Update the hidden state: $h_t = \phi(W_{hh} h_{t-1} + W_{hx} x_{t-1} + b_h)$
Output: $o_t = \phi(W_{ho} h_t + b_o)$
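
As a concrete illustration (not code from the course), here is a minimal PyTorch sketch of these two equations for a single time step; all tensor names and sizes are chosen only for illustration:

import torch

n, d, h, q = 2, 8, 16, 4                           # batch size, inputs, hidden units, outputs
W_hx, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_ho, b_o = torch.randn(h, q), torch.zeros(q)

x = torch.randn(n, d)                              # input at this step (x_{t-1} in the slide's notation)
h_prev = torch.zeros(n, h)                         # hidden state from the previous time step

h_t = torch.tanh(x @ W_hx + h_prev @ W_hh + b_h)   # update the hidden state
o_t = torch.tanh(h_t @ W_ho + b_o)                 # output (the slide applies phi here; the textbook section below uses a linear output)
print(h_t.shape, o_t.shape)                        # torch.Size([2, 16]) torch.Size([2, 4])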

How to get more non-linearity?

Plan A: Nonlinearity in the units

$h_t = \phi(W_{hh} h_{t-1} + W_{hx} x_{t-1} + b_h)$
$o_t = \phi(W_{ho} h_t + b_o)$

Replace $\phi$ with a more complex nonlinear function.

Deeper

(Figure: stacking several hidden layers on top of each other yields a deeper recurrent neural network.)

Summary

  • Deep Recurrent Neural Networks use multiple hidden layers for more nonlinearity

Textbook

So far, we have only discussed recurrent neural networks with a single unidirectional hidden layer. In such networks, the specific functional form of how hidden variables and observations interact is rather arbitrary. This is not a big problem as long as we have enough flexibility to model different types of interactions; with a single layer, however, this can be quite challenging. In linear models we solved this problem by adding more layers. In recurrent neural networks it is a bit trickier, because we first need to decide how to add more layers and where to add the extra nonlinearity.

In fact, we can stack multiple layers of recurrent neural networks on top of each other; through the composition of several simple layers, this yields a flexible mechanism. In particular, data may be relevant at different levels of the stack. For example, we might want to keep high-level data about financial market conditions (bear or bull market) available, while at a lower level we record only shorter-term temporal dynamics.

The figure below depicts a deep recurrent neural network with $L$ hidden layers. Each hidden state is passed both to the next time step of the current layer and to the current time step of the next layer.
(Figure: architecture of a deep recurrent neural network)

1 Functional dependencies

We can formalize the functional dependencies within the deep architecture of $L$ hidden layers depicted in the figure above. The following discussion focuses on the classical recurrent neural network model, but it applies to other sequence models as well.

Suppose that at time step $t$ we have a minibatch of input data $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of samples: $n$, number of inputs in each sample: $d$). At the same time, let the hidden state of the $l^\mathrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$), and the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\mathrm{th}$ hidden layer, which uses the activation function $\phi_l$, is computed as:

$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}),$$

where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$ and the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$ are the model parameters of the $l^\mathrm{th}$ hidden layer.

Finally, the calculation of the output layer is based only on the final hidden state of the $L^\mathrm{th}$ hidden layer:

$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q,$$

where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.

As with multilayer perceptrons, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters that we can tune. Moreover, replacing the hidden state computation above with that of a gated recurrent unit or a long short-term memory cell readily gives us a deep gated recurrent neural network or a deep long short-term memory neural network.
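
To make these dependencies concrete, the following is a minimal sketch of one time step of a deep recurrent network with tanh activations, written from the equations above rather than taken from d2l; all names and sizes are illustrative (note that, in code, the first layer's input weight has shape d × h rather than h × h):

import torch

n, d, h, q, L = 2, 8, 16, 4, 3   # batch size, inputs, hidden units, outputs, layers

# Per-layer parameters: layer 1 maps the d-dimensional input to h units,
# deeper layers map h units to h units
W_xh = [torch.randn(d if l == 0 else h, h) * 0.01 for l in range(L)]
W_hh = [torch.randn(h, h) * 0.01 for _ in range(L)]
b_h = [torch.zeros(h) for _ in range(L)]
W_hq, b_q = torch.randn(h, q) * 0.01, torch.zeros(q)

def deep_rnn_step(X_t, H_prev):
    """One time step. X_t: (n, d); H_prev: list of L states, each (n, h)."""
    H_new, inp = [], X_t                        # H_t^{(0)} = X_t
    for l in range(L):
        H_l = torch.tanh(inp @ W_xh[l] + H_prev[l] @ W_hh[l] + b_h[l])
        H_new.append(H_l)
        inp = H_l                               # feeds layer l+1 at the same time step
    O_t = H_new[-1] @ W_hq + b_q                # output uses only the top layer's state
    return O_t, H_new

X_t = torch.randn(n, d)
H_prev = [torch.zeros(n, h) for _ in range(L)]
O_t, H_t = deep_rnn_step(X_t, H_prev)
print(O_t.shape)                                # torch.Size([2, 4])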

2 Concise implementation

Many of the logistical details required to implement multilayer recurrent neural networks are readily available in the high-level API. To keep things simple, we only illustrate the implementation using these built-in functions. Taking an LSTM model as the example, the code is very similar to the one we used earlier in the LSTM section; the only real difference is that we specify the number of layers explicitly instead of using the default of a single layer. As usual, we begin by loading the dataset.

import torch
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
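
As a quick sanity check, one minibatch can be inspected; assuming the usual behavior of d2l's time machine iterator, each batch is a pair of token-index tensors of shape (batch_size, num_steps):

# Peek at one minibatch (shapes under the assumption stated above)
X, Y = next(iter(train_iter))
print(X.shape, Y.shape, len(vocab))   # expected: torch.Size([32, 35]) torch.Size([32, 35]) 28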

Architectural decisions such as choosing hyperparameters are very similar to those in the LSTM section. We pick the same number for both inputs and outputs as we have distinct tokens, i.e., vocab_size. The number of hidden units is still 256. The only difference is that we now set the number of hidden layers by passing a value for num_layers.

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)
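
The effect of num_layers shows up in the shapes of the LSTM's hidden and cell states, which carry one slice per layer. The following check uses only plain nn.LSTM, independently of the d2l wrapper; the zero input is for shape inspection only:

# nn.LSTM expects input of shape (num_steps, batch_size, num_inputs) by default
lstm_layer.to(device)   # already moved by model.to(device); repeated so this snippet stands alone
X = torch.zeros(num_steps, batch_size, num_inputs, device=device)
state = (torch.zeros(num_layers, batch_size, num_hiddens, device=device),
         torch.zeros(num_layers, batch_size, num_hiddens, device=device))
Y, (H, C) = lstm_layer(X, state)
print(Y.shape)   # (num_steps, batch_size, num_hiddens): outputs of the top layer only
print(H.shape)   # (num_layers, batch_size, num_hiddens): one hidden state per layer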

3 Training and Prediction

Since we instantiate the LSTM model with two layers, this more complex architecture slows down training considerably.

num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr*1.0, num_epochs, device)

output:

perplexity 1.0, 224250.2 tokens/sec on cuda:0
time travelleryou can show black is white by argument said filby
travelleryou can show black is white by argument said filby
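
To generate more text after training, the predict_ch8 helper from the RNN chapter can be reused; the call below assumes the (prefix, num_preds, net, vocab, device) signature that d2l.torch uses there:

# Generate 50 characters following the prefix (signature assumed as noted above)
print(d2l.predict_ch8('time traveller', 50, model, vocab, device))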

4 Summary

  • In deep recurrent neural networks, the information of the hidden state is passed to the next time step of the current layer and the current time step of the next layer.

  • There are many different flavors of deep recurrent neural networks, such as long short-term memory networks, gated recurrent units, or classical recurrent neural networks. These models are covered in the high-level APIs of deep learning frameworks.

  • Overall, deep recurrent neural networks require extensive tuning (such as the learning rate and gradient clipping) to ensure proper convergence, and the model also needs careful initialization; a minimal clipping example is sketched below.
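
As a standalone illustration of the gradient clipping mentioned above (d2l.train_ch8 performs the equivalent step internally via d2l.grad_clipping with theta = 1), the tiny linear model below is just a stand-in so the snippet runs on its own:

# Minimal, self-contained gradient clipping example using native PyTorch
net = nn.Linear(4, 1)
loss = (net(torch.randn(8, 4)).sum()) ** 2     # an arbitrary scalar loss
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
print(float(total_norm))                       # gradient norm before clipping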

Origin blog.csdn.net/qq_52358603/article/details/128376643