Series Article Directory
Chapter 1 Introduction
Chapter 2 Overview of Machine Learning
Chapter 3 Linear Model
Chapter 4 Feedforward Neural Network
Chapter 5 Convolutional Neural Network
Chapter 6 Recurrent Neural Network
Chapter 7 Network Optimization and Regularization
Chapter 8 Attention Mechanism and External Memory
Chapter 9 Unsupervised Learning
Chapter 10 Model Independent Learning
Chapter 11 Probabilistic Graphical Models
Chapter 12 Deep Belief Networks
Chapter 13 Deep Generative Models
Chapter 14 Deep Reinforcement Learning
Chapter 15 Sequence Generative Models
Foreword
This article introduces recurrent neural networks.
6.1 Adding memory to neural networks
6.1.1 Time Delay Neural Networks
A Time Delay Neural Network (TDNN) adds delay units to the network to store its historical information (inputs, outputs, hidden states, etc.):
$h_t^{(l)} = f\left(h_t^{(l-1)}, h_{t-1}^{(l-1)}, \dots, h_{t-K}^{(l-1)}\right)$
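As a sketch, the delay mechanism can be written directly in numpy; the name `tdnn_layer`, the parameter layout, and the tanh activation are illustrative assumptions, not from the text:

```python
import numpy as np

def tdnn_layer(h_prev_layer, t, K, W, b):
    """One TDNN unit: combine the previous layer's activations at times
    t, t-1, ..., t-K (clamped at index 0) through a shared weight matrix W."""
    window = [h_prev_layer[max(t - k, 0)] for k in range(K + 1)]
    return np.tanh(W @ np.concatenate(window) + b)
```

Here `h_prev_layer` has shape `(T, d)` and `W` has shape `(d_out, (K+1)*d)`, so each output depends on a fixed window of the previous layer's history.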
6.1.2 Autoregressive models
The Autoregressive Model (AR) is a class of time-series models that uses a variable's own history to predict its next value:
$y_t = w_0 + \sum_{k=1}^{K} w_k y_{t-k} + \epsilon_t$
where $\epsilon_t$ is the noise at time $t$.
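A minimal numpy sketch of AR($K$) prediction (the function name and argument layout are made up for illustration):

```python
import numpy as np

def ar_predict(y_hist, w0, w):
    """Predict y_t = w0 + sum_k w[k-1] * y_{t-k} from the observed
    history y_hist (oldest value first); len(w) = K."""
    K = len(w)
    return w0 + np.dot(w, y_hist[-1:-K - 1:-1])  # pairs w_k with y_{t-k}
```

For example, with $w_0 = 1$, $w = (0.5, 0.25)$ and history $(1, 2, 3)$, the prediction is $1 + 0.5 \cdot 3 + 0.25 \cdot 2 = 3.0$.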
6.1.3 Nonlinear autoregressive models
Nonlinear Autoregressive with Exogenous Inputs Model (NARX):
$y_t = f(x_t, x_{t-1}, \dots, x_{t-K_x}, y_{t-1}, y_{t-2}, \dots, y_{t-K_y})$
where $f(\cdot)$ is a nonlinear function (for example, a feedforward network), and $K_x$ and $K_y$ are hyperparameters.
6.2 Recurrent Neural Networks
6.2.1 Network structure
$h_t = f(h_{t-1}, x_t)$
- By using neurons with self-feedback, recurrent neural networks can process time-series data of arbitrary length.
- Recurrent neural networks are more consistent with the structure of biological neural networks than feedforward neural networks.
- Recurrent neural networks have been widely used in tasks such as speech recognition, language modeling, and natural language generation.
6.2.2 Network Expansion in Time
6.2.3 Simple Recurrent Networks
State update function:
$h_t = f(Uh_{t-1} + Wx_t + b)$
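The state update maps directly to code. This numpy sketch chooses tanh as the nonlinearity $f$; all names are illustrative:

```python
import numpy as np

def srn_step(h_prev, x, U, W, b):
    """One simple-recurrent-network step: h_t = tanh(U h_{t-1} + W x_t + b)."""
    return np.tanh(U @ h_prev + W @ x + b)

def srn_forward(xs, U, W, b):
    """Run the SRN over a whole sequence, starting from h_0 = 0."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = srn_step(h, x, U, W, b)
        hs.append(h)
    return np.stack(hs)  # shape (T, hidden_dim)
```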
Universal Approximation Theorem:
6.2.4 Turing Completeness
Turing completeness refers to a set of data-manipulation rules (such as a computer programming language) that can realize all the functions of a Turing machine, i.e., solve any computable problem.
A fully connected recurrent neural network can approximately solve any computable problem.
6.2.5 Applications
- As a machine learning model defining an input-output mapping (this section focuses on this case).
- As an associative memory model.
6.3 Applications of Recurrent Neural Networks in Machine Learning
6.3.1 Sequence-to-Category Mode
Model Architecture :
Practical Applications :
Sentiment Classification Tasks
6.3.2 Synchronized Sequence-to-Sequence Mode
Model Structure :
Practical Application :
1. Chinese word segmentation
2. Information extraction (IE): extract structured information from unstructured text to form knowledge.
3. Speech recognition
6.3.3 Asynchronous Sequence-to-Sequence Mode
Model Structure :
Application :
Machine Translation
6.4 Gradients
6.4.1 Definition
Given a training sample $(x, y)$, where $x = (x_1, \dots, x_T)$ is an input sequence of length $T$ and $y = (y_1, \dots, y_T)$ is a label sequence of length $T$, the instantaneous loss function at time $t$ is:
$L_t = L(y_t, g(h_t))$
The total loss function is:
$L = \sum_{t=1}^{T} L_t$
6.4.2 Calculation
Backpropagation through time algorithm:
Vanishing and Exploding Gradients:
In the gradient computation, a factor $\lambda$ is raised to a power that grows with the number of time steps, so when $\lambda > 1$ the gradient explodes, and when $\lambda < 1$ the gradient vanishes.
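The effect is easy to see numerically: propagating a gradient back $T$ steps multiplies it by the same factor $T$ times (the values below are purely illustrative):

```python
def backprop_scale(lam, T):
    """Overall scale applied to a gradient propagated back T time steps
    when each step multiplies it by the factor lam."""
    return lam ** T

# lam = 1.1 over 100 steps inflates the gradient by roughly 1.4e4 (explosion),
# while lam = 0.9 shrinks it to roughly 2.7e-5 (vanishing).
```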
6.4.3 Long-range dependency problem
Cause:
A recurrent neural network is very deep along the time dimension, so gradients vanish or explode, and in practice it can only learn short-term dependencies. This is the so-called long-range dependency problem.
Improvement principle:
For exploding gradients, weight decay or gradient truncation (clipping) can be used; for vanishing gradients, the model itself must be improved.
Improvement methods:
- Change the recurrent edge to a linear dependency:
  $h_t = h_{t-1} + g(x_t; \theta)$
- Add nonlinearity:
  $h_t = h_{t-1} + g(x_t, h_{t-1}; \theta)$
- Use gating units to control the speed at which information accumulates, including selectively adding new information and selectively forgetting previously accumulated information.
6.5 GRU and LSTM
6.5.1 GRU (Gated Recurrent Unit)
- Structure diagram
- Update equations:
$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh(W_c x_t + U(r_t \odot h_{t-1})) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$
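The four GRU equations map directly to code. This numpy sketch stores the parameters in a dict `P`; the key names are an assumed layout:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, P):
    """One GRU update following the equations above."""
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])   # reset gate
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])   # update gate
    h_tilde = np.tanh(P["Wc"] @ x + P["U"] @ (r * h_prev))  # candidate state
    return z * h_prev + (1 - z) * h_tilde                   # interpolation
```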
6.5.2 LSTM (Long Short-Term Memory)
- Structure diagram
- Update equations:
$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
- Variants of the LSTM
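The LSTM equations can be sketched the same way; again the parameter dict `P` and its key names are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, P):
    """One LSTM update following the equations above; returns (h_t, c_t)."""
    f = sigmoid(P["Wf"] @ x + P["Uf"] @ h_prev + P["bf"])     # forget gate
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_prev + P["bi"])     # input gate
    c_tilde = np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev + P["bc"])  # candidate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_prev + P["bo"])     # output gate
    c = f * c_prev + i * c_tilde   # internal cell state accumulates linearly
    h = o * np.tanh(c)             # exposed hidden state
    return h, c
```

The linear accumulation in `c` is exactly the "loop edge as linear dependency" idea from Section 6.4.3, which is what lets gradients survive over long spans.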
6.6 Deep Models
Stacked Recurrent Neural Networks: stack multiple recurrent layers, so that the hidden state of layer $l$ at time $t$ depends on both $h_t^{(l-1)}$ and $h_{t-1}^{(l)}$.
Bidirectional Recurrent Neural Networks: run two recurrent layers over the sequence in opposite directions and concatenate their hidden states, so each position sees both past and future context.
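A bidirectional layer can be sketched generically over any step function; the helper below is illustrative, not the book's exact formulation:

```python
import numpy as np

def bidirectional_rnn(xs, step_fwd, step_bwd, d):
    """Run one recurrent layer forward and one backward over the sequence,
    then concatenate the two hidden states at each time step."""
    T = len(xs)
    hf = np.zeros(d)
    hb = np.zeros(d)
    fwd, bwd = [], [None] * T
    for t in range(T):                 # left-to-right pass
        hf = step_fwd(hf, xs[t])
        fwd.append(hf)
    for t in reversed(range(T)):       # right-to-left pass
        hb = step_bwd(hb, xs[t])
        bwd[t] = hb
    return np.stack([np.concatenate([fwd[t], bwd[t]]) for t in range(T)])
```

Any step function with signature `(h_prev, x) -> h` works here, e.g. the `srn_step`, `gru_step`, or `lstm_step` sketches above (the LSTM would also need its cell state threaded through).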
6.7 Extensions to Graph Structures
6.7.1 Recursive Neural Networks
Recursive neural networks apply a shared composition function over a directed acyclic graph:
Degenerates to a recurrent neural network:
Natural language processing:
6.7.2 Graph Networks
Calculation process:
Update functions:
6.8 Applications of Recurrent Networks
Judging sentence plausibility, writing lyrics, composing poems, machine translation, image captioning, text generation, and dialogue systems.