Chapter 6 Recurrent Neural Networks

Series Article Directory

Chapter 1 Introduction
Chapter 2 Overview of Machine Learning
Chapter 3 Linear Model
Chapter 4 Feedforward Neural Network
Chapter 5 Convolutional Neural Network
Chapter 6 Recurrent Neural Network
Chapter 7 Network Optimization and Regularization
Chapter 8 Attention Mechanism and External Memory
Chapter 9 Unsupervised Learning
Chapter 10 Model-Independent Learning
Chapter 11 Probabilistic Graphical Models
Chapter 12 Deep Belief Networks
Chapter 13 Deep Generative Models
Chapter 14 Deep Reinforcement Learning
Chapter 15 Sequence Generative Models



Foreword

This article introduces recurrent neural networks.


6.1 Adding memory to neural networks

6.1.1 Time Delay Neural Network

A Time Delay Neural Network (TDNN) adds extra delay units to store the network's historical information (inputs, outputs, hidden states, etc.):
$$h_t^{(l)} = f\big(h_t^{(l-1)}, h_{t-1}^{(l-1)}, \dots, h_{t-K}^{(l-1)}\big)$$
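A minimal NumPy sketch of one time-delay layer under illustrative assumptions (the window length K, the tanh nonlinearity, and the weight shapes are not from the text): each unit at layer l sees the current and the K most recent activations of layer l-1.

```python
import numpy as np

def tdnn_layer(h_below, W, b, K=2):
    """h_t^(l) = f(h_t^(l-1), h_{t-1}^(l-1), ..., h_{t-K}^(l-1)).

    h_below: activations of layer l-1 over time, shape (T, d_in).
    W: (d_out, (K + 1) * d_in); b: (d_out,). Returns layer-l activations, shape (T, d_out).
    """
    T, d_in = h_below.shape
    padded = np.vstack([np.zeros((K, d_in)), h_below])   # zero history before t = 0
    out = []
    for t in range(T):
        # Concatenate h_t, h_{t-1}, ..., h_{t-K} from the layer below.
        window = padded[t : t + K + 1][::-1].reshape(-1)
        out.append(np.tanh(W @ window + b))
    return np.stack(out)
```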

6.1.2 Autoregressive models

The Autoregressive (AR) model is a type of time-series model that uses a variable's own history to predict it:
$$y_t = w_0 + \sum_{k=1}^{K} w_k\, y_{t-k} + \epsilon_t$$
where $\epsilon_t$ is the noise at time $t$.
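As a quick illustration (the toy data, the order K = 2, and the least-squares fit are assumptions added here, not part of the text), the AR weights can be estimated by ordinary least squares on lagged values:

```python
import numpy as np

def fit_ar(y, K):
    """Estimate w_0, ..., w_K in y_t = w_0 + sum_k w_k * y_{t-k} by least squares."""
    T = len(y)
    # Design matrix: a constant column plus the K previous values of y.
    X = np.column_stack([np.ones(T - K)] + [y[K - k : T - k] for k in range(1, K + 1)])
    w, *_ = np.linalg.lstsq(X, y[K:], rcond=None)
    return w   # w[0] is w_0, w[1:] are w_1 ... w_K

# Toy usage: a noisy AR(2) series with true weights 0.6 and -0.2.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal(scale=0.1)
print(fit_ar(y, K=2))   # approximately [0, 0.6, -0.2]
```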

6.1.3 Nonlinear autoregressive models

The Nonlinear Autoregressive with Exogenous Inputs model (NARX) generalizes this:
$$y_t = f\big(x_t, x_{t-1}, \dots, x_{t-K_x},\; y_{t-1}, y_{t-2}, \dots, y_{t-K_y}\big)$$
where $f(\cdot)$ is a nonlinear function that can be realized by a feedforward network, and $K_x$ and $K_y$ are hyperparameters.

6.2 Recurrent Neural Networks

6.2.1 Network structure

(figure: a recurrent neural network with self-feedback connections)
$$h_t = f(h_{t-1}, x_t)$$

  • Recurrent neural networks are capable of processing time-series data of arbitrary length by using neurons with self-feedback.

  • Recurrent neural networks are more consistent with the structure of biological neural networks than feedforward neural networks.

  • Recurrent neural networks have been widely used in tasks such as speech recognition, language models, and natural language generation.

6.2.2 Unrolling the Network in Time

(figure: the recurrent network unrolled along the time dimension)

6.2.3 Simple Recurrent Networks

State update function:
$$h_t = f(Uh_{t-1} + Wx_t + b)$$
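A minimal NumPy sketch of this update rule (the tanh nonlinearity, the zero initial state, and the shapes are illustrative assumptions):

```python
import numpy as np

def srn_forward(xs, U, W, b, h0=None):
    """Simple recurrent network: h_t = tanh(U h_{t-1} + W x_t + b).

    xs: input sequence, shape (T, input_dim); U: (hidden, hidden);
    W: (hidden, input_dim); b: (hidden,). Returns all hidden states, shape (T, hidden).
    """
    h = np.zeros(U.shape[0]) if h0 is None else h0
    hs = []
    for x in xs:
        h = np.tanh(U @ h + W @ x + b)   # the state update from the formula above
        hs.append(h)
    return np.stack(hs)
```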
Universal approximation theorem:

(figure: statement of the universal approximation theorem for recurrent networks)

6.2.4 Turing Completeness

Turing completeness refers to a set of data-manipulation rules, such as a programming language, that can realize all the functions of a Turing machine and thus solve any computable problem.
A fully connected recurrent neural network can approximately solve all computable problems.

6.2.5 Applications

  • As an input-output mapping, i.e., an ordinary machine learning model (the focus of this section).
  • As an associative memory model.

6.3 Applications of Recurrent Neural Networks in Machine Learning

6.3.1 Sequence-to-Category Mode

Model structure:
(figure: sequence-to-category model)
Practical application: sentiment classification

(figure: sentiment classification example)
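As a rough sketch of this mode (not the exact model in the figure), one can run the simple recurrent network from Section 6.2.3 over the sequence and feed its final hidden state to a softmax classifier; the output parameters V and c below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_sequence(xs, params):
    """Sequence-to-category: map a whole input sequence to a class distribution."""
    # Reuses srn_forward from the sketch in Section 6.2.3.
    hs = srn_forward(xs, params["U"], params["W"], params["b"])
    h_final = hs[-1]                   # alternative: hs.mean(axis=0), average over time
    return softmax(params["V"] @ h_final + params["c"])
```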

6.3.2 Synchronized Sequence-to-Sequence Mode

Model structure:
(figure: synchronized sequence-to-sequence model)
Practical applications:

  1. Chinese word segmentation

(figure: Chinese word segmentation example)
  2. Information Extraction (IE)

Extract structured information from unstructured text to form knowledge.
(figure: information extraction example)

  3. Speech recognition

(figure: speech recognition example)

6.3.3 Asynchronous Sequence-to-Sequence Mode

Model structure:
(figure: asynchronous sequence-to-sequence model)
Practical application: machine translation

(figure: machine translation example)

6.4 Gradients

6.4.1 Definition

Given a training sample $(x, y)$, where $x = (x_1, \dots, x_T)$ is an input sequence of length $T$ and $y = (y_1, \dots, y_T)$ is a label sequence of length $T$, the instantaneous loss at time $t$ is:
$$L_t = \mathcal{L}(y_t, g(h_t))$$
The total loss is:
$$L = \sum_{t=1}^{T} L_t$$
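In code, this is just a per-step loss accumulated over the unrolled sequence. The sketch below reuses srn_forward and softmax from the earlier sketches and assumes that g is a linear layer followed by softmax, with cross-entropy as the per-step loss (these choices are assumptions for illustration).

```python
import numpy as np

def sequence_loss(xs, ys, params):
    """Total loss L = sum_t L(y_t, g(h_t)) for one training sample."""
    hs = srn_forward(xs, params["U"], params["W"], params["b"])
    total = 0.0
    for h, y in zip(hs, ys):                       # y is an integer class label at time t
        probs = softmax(params["V"] @ h + params["c"])
        total += -np.log(probs[y] + 1e-12)         # cross-entropy for step t
    return total
```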

6.4.2 Calculation

Backpropagation Through Time (BPTT) algorithm:
(figures: derivation of the parameter gradients by backpropagation through time)

Vanishing and Exploding Gradients:

(figure: the factor that appears repeatedly in the gradient)
Because the gradient computation multiplies this factor $\lambda$ once per unrolled time step, the gradient explodes when $\lambda > 1$ and vanishes when $\lambda < 1$.
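A tiny numerical illustration of this effect (the factor values and the number of steps are arbitrary):

```python
# The same factor enters the gradient once per unrolled time step.
for lam in (0.9, 1.1):
    print(f"lambda = {lam}: after 50 steps the gradient is scaled by {lam ** 50:.2e}")
# lambda = 0.9 -> about 5.2e-03 (vanishing); lambda = 1.1 -> about 1.2e+02 (exploding)
```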

6.4.3 Long-range dependency problem

Cause:
A recurrent neural network is effectively a very deep network along the time dimension, so gradients vanish or explode, and in practice only short-term dependencies can be learned. This is the so-called long-range dependency problem.

Improvement principle:
For the gradient explosion problem, weight decay or gradient clipping (truncation) can be used (see the sketch after the list below); for the gradient vanishing problem, the model itself must be improved.
Ways to improve the model:

  1. Change the recurrent edge to a linear dependency:
    $$h_t = h_{t-1} + g(x_t; \theta)$$
  2. Add a nonlinearity:
    $$h_t = h_{t-1} + g(x_t; h_{t-1}; \theta)$$
  3. Use gating mechanisms:
    control the speed at which information accumulates, selectively adding new information and selectively forgetting previously accumulated information.
    $$h_t = h_{t-1} + g(x_t; h_{t-1}; \theta)$$
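As promised above, a sketch of gradient clipping (truncation) by global norm, which directly targets the explosion problem; the threshold of 5.0 is an arbitrary illustrative choice.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```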

6.5 GRU and LSTM

6.5.1 GRU (Gated Recurrent Unit)

  1. Structure diagram
    (figure: GRU cell)
  2. Update equations
    $$\begin{aligned} r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\ z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\ \tilde h_t &= \tanh\big(W_c x_t + U(r_t \odot h_{t-1})\big) \\ h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde h_t \end{aligned}$$
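A direct NumPy transcription of these equations for a single time step; the parameter dictionary p and its shapes are assumptions used for illustration, and sigma is the logistic sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU update following the equations above."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    h_tilde = np.tanh(p["W_c"] @ x_t + p["U"] @ (r_t * h_prev))    # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde                    # new hidden state h_t
```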

6.5.2 LSTM (Long Short-Term Memory)

  1. Structure diagram
    (figure: LSTM cell)

  2. Update equations (a NumPy sketch follows this list)
    $$\begin{aligned} f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\ \tilde c_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$

  3. LSTM variants
    (figure: LSTM variants)
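The NumPy sketch referred to in item 2: a direct transcription of the standard LSTM equations for one time step (the parameter names and shapes in p are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):                        # logistic function, as in the GRU sketch
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following the equations in item 2 above."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])       # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])   # candidate cell state
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde                                 # new cell state
    h_t = o_t * np.tanh(c_t)                                           # new hidden state
    return h_t, c_t
```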

6.6 Deep Models

Stacked recurrent neural networks:
(figure: a stacked (deep) recurrent neural network)
Bidirectional recurrent neural networks:
(figure: a bidirectional recurrent neural network)
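A sketch of the bidirectional idea, reusing srn_forward from Section 6.2.3; concatenating the forward and backward states is one common choice, assumed here rather than taken from the figure.

```python
import numpy as np

def bidirectional_forward(xs, fwd, bwd):
    """Run one SRN left-to-right and one right-to-left, then concatenate their states."""
    h_fwd = srn_forward(xs, fwd["U"], fwd["W"], fwd["b"])
    h_bwd = srn_forward(xs[::-1], bwd["U"], bwd["W"], bwd["b"])[::-1]  # re-align in time
    return np.concatenate([h_fwd, h_bwd], axis=1)                      # shape (T, 2 * hidden_dim)
```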

6.7 Extending to Graph Structures

6.7.1 Recursive Neural Networks

A recursive neural network shares a composition function over a directed acyclic graph:

(figure: a recursive neural network)
When the graph degenerates to a chain, it becomes a recurrent neural network:

(figure: the chain-structured special case)
Application in natural language processing:
(figure: natural language processing example)

6.7.2 Graph Networks

Computation process:
(figure: graph network computation process)
Update functions:
(figure: graph network update functions)

6.8 Applications of Recurrent Networks

Judging whether a sentence is well-formed, writing lyrics and poems, machine translation, image captioning, text generation, and dialogue systems.


Summary

This article covered how to add memory to neural networks, the structure and applications of recurrent neural networks, gradient computation and the long-range dependency problem, GRU and LSTM, deep and bidirectional variants, and extensions to recursive and graph networks.

Origin blog.csdn.net/qq_40940944/article/details/127994822