Deep Learning: Recurrent Neural Networks

(1) Recurrent neural network theory

RNNs are designed for processing sequential data. In a traditional neural network, the layers from the input layer to the hidden layer to the output layer are fully connected, but the nodes within each layer are not connected to each other. This ordinary architecture is powerless for many problems. For example, to predict the next word in a sentence you generally need the preceding words, because the words in a sentence are not independent of one another. RNNs are called recurrent neural networks because the current output of a sequence also depends on the previous outputs. Concretely, the network remembers the preceding information and uses it when computing the current output: the nodes within the hidden layer are now connected to each other, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. In theory, RNNs can process sequences of any length. In practice, however, to reduce complexity it is often assumed that the current state is related only to the previous few states.

One-hot vector: one-hot encoding converts categorical variables into a form that machine learning algorithms can easily use. Each attribute is represented by a feature vector in which only one position is active (non-zero) at a time; that single element is 1 and all the others are 0, so the vector is extremely sparse.
A one-hot vector is written as ti = {0, 0, 0, ..., 1, ..., 0}. Its length is determined by the discrete values of all the features: the numbers of discrete values of every feature are added together, so it is generally larger than the number of features; the exact size depends on the scenario.

After the network receives the input Xt at time t, the value of the hidden layer is St and the output value is Ot. The key point is that St depends not only on Xt but also on St-1.
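The two formulas referenced below, written here in the standard form consistent with their description, are:

$$O_t = g(V \cdot S_t) \qquad (1)$$

$$S_t = f(U \cdot X_t + W \cdot S_{t-1}) \qquad (2)$$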

Formula 1 is the output layer, which is a fully connected layer, i.e. each of its nodes is connected to every node in the hidden layer. V is the weight matrix of the output layer and g is the activation function. Formula 2 computes the hidden layer, which is the recurrent layer. U is the weight matrix for the input x, W is the weight matrix for the previous hidden state used as input at the current time, and f is the activation function.

We can see from the above formulas that the difference between the recurrent layer and a fully connected layer is the extra weight matrix W.
If we repeatedly substitute Formula 2 into Formula 1, we get an expansion like the sketch below.
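A sketch of the expansion (same notation as above):

$$O_t = g(V \cdot S_t) = g\Big(V \cdot f\big(U \cdot X_t + W \cdot f(U \cdot X_{t-1} + W \cdot f(U \cdot X_{t-2} + \dots))\big)\Big)$$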

As can be seen from the above, the output value Ot of the recurrent neural network is affected by all the previous input values Xt, Xt-1, Xt-2, Xt-3, ..., which is why a recurrent neural network can look back over any number of preceding input values.

(2) Training algorithm for recurrent neural networks: BPTT
BPTT is the training algorithm for the recurrent layer. Its basic principle is the same as the BP algorithm, and it also consists of the same three steps:
(a) compute the output value of each neuron in a forward pass;
(b) compute the error term δj of each neuron in a backward pass, i.e. the partial derivative of the error function E with respect to the weighted input netj of neuron j;
(c) compute the gradient of each weight.
Finally, the weights are updated with the stochastic gradient descent algorithm.

① Forward computation

Here s and x are vectors, U and W are matrices, and the subscripts denote the time step.
Assume the input vector x has dimension m and the output (hidden state) vector s has dimension n; then U is an n×m matrix and W is an n×n matrix. Expanded in matrix form, the formula looks like the sketch below.
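A sketch of the expanded matrix form (the element layout is assumed from the notation described below):

$$\begin{bmatrix} s_1^t \\ s_2^t \\ \vdots \\ s_n^t \end{bmatrix} = f\left( \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1m} \\ u_{21} & u_{22} & \cdots & u_{2m} \\ \vdots & & & \vdots \\ u_{n1} & u_{n2} & \cdots & u_{nm} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & & & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix} \begin{bmatrix} s_1^{t-1} \\ s_2^{t-1} \\ \vdots \\ s_n^{t-1} \end{bmatrix} \right)$$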

A handwritten (lowercase) letter denotes an element of a vector; its subscript indicates which element of the vector it is, and its superscript indicates the time step.

② Computing the error terms
The BPTT algorithm propagates the error term δ of layer l at time t in two directions:
1) One direction is propagated to the previous layer of the network, giving that layer's error term; this part is related only to the weight matrix U.
2) The other direction is propagated backward along the time axis to the initial time step, giving the error term at each earlier time; this part is related only to the weight matrix W.

Let the vector nett denote the weighted input of the neurons at time t.
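In the standard form (consistent with Formula 2 above):

$$net_t = U x_t + W s_{t-1}, \qquad s_{t-1} = f(net_{t-1})$$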

In other words:
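The following chain-rule step is a sketch, reconstructed from the surrounding description:

$$\frac{\partial net_t}{\partial net_{t-1}} = \frac{\partial net_t}{\partial s_{t-1}} \cdot \frac{\partial s_{t-1}}{\partial net_{t-1}}$$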

We use a to denote a column vector and aT to denote a row vector. The first term is the derivative of a vector function with respect to a vector; the result is a Jacobian matrix.

Similarly, the second term is a Jacobian matrix.

diag[a] denotes the diagonal matrix created from the vector a.

Finally, combining the two:
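A sketch of the combined result (in the standard derivation, the first term is ∂nett/∂st-1 = W and the second is ∂st-1/∂nett-1 = diag[f'(nett-1)]):

$$\frac{\partial net_t}{\partial net_{t-1}} = W \, diag\big[f'(net_{t-1})\big]$$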

......

Several derivation steps are omitted here; the result is stated directly:

This is the algorithm for propagating the error term back to the previous layer.

③ Computing the gradient of each weight
1) The gradient of the weight matrix W at time t is:

2) The final gradient is the sum of the gradients at each time step:

3) In the same way as for the weight matrix W, we can obtain the gradient of the weight matrix U at time t:

The final gradient is likewise the sum of the gradients at each time step.
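To make the forward computation and the BPTT gradient accumulation concrete, here is a minimal NumPy sketch of a vanilla recurrent layer with a tanh activation, following this section's notation (U for the input, W for the recurrent connection). All names, shapes, and the random test data are illustrative assumptions, not the article's own code.

import numpy as np

def rnn_forward(xs, U, W):
    """Run the recurrent layer forward over a sequence of input vectors xs."""
    n = W.shape[0]
    s = np.zeros(n)                 # s_0: initial hidden state
    states, inputs = [s], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)  # s_t = f(U x_t + W s_{t-1})
        states.append(s)
        inputs.append(x)
    return states, inputs

def rnn_backward(states, inputs, ds_list, U, W):
    """Accumulate gradients of U and W by backpropagation through time.

    ds_list[t] is the gradient of the loss with respect to the hidden state
    produced at step t (e.g. coming from the output layer at that step)."""
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    dnext = np.zeros_like(states[0])          # error flowing back along the time axis
    for t in reversed(range(len(inputs))):
        s_t, s_prev = states[t + 1], states[t]
        ds = ds_list[t] + dnext               # error from the output + from the future
        dnet = ds * (1.0 - s_t ** 2)          # tanh'(net_t) = 1 - s_t^2
        dU += np.outer(dnet, inputs[t])       # gradient at time t, summed over all t
        dW += np.outer(dnet, s_prev)
        dnext = W.T @ dnet                    # propagate the error to time t-1
    return dU, dW

# Tiny usage example with random data (m-dimensional input, n-dimensional state).
m, n, T = 3, 4, 5
rng = np.random.default_rng(0)
U, W = rng.normal(size=(n, m)) * 0.1, rng.normal(size=(n, n)) * 0.1
xs = [rng.normal(size=m) for _ in range(T)]
states, inputs = rnn_forward(xs, U, W)
ds_list = [rng.normal(size=n) for _ in range(T)]   # stand-in output-layer errors
dU, dW = rnn_backward(states, inputs, ds_list, U, W)
print(dU.shape, dW.shape)   # (4, 3) (4, 4)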

(3) RNN-based language model
Words are fed into the recurrent neural network one at a time; after each word is input, the network outputs the most likely next word given everything seen so far.
First, build a dictionary containing all the words; each word in the dictionary has a unique number.
Second, any word can then be represented by an N-dimensional one-hot vector (where N is the size of the dictionary).

This yields a high-dimensional sparse vector (sparse means that most of the elements are zero). Processing such vectors gives the neural network a very large number of parameters and brings a large amount of computation. Therefore, some dimensionality reduction method is often needed to turn the high-dimensional sparse vector into a dense low-dimensional vector.

We let the recurrent neural network compute, for every word in the dictionary, the probability that it is the next word; the word with the largest probability is then the most likely next word. Accordingly, the output of the network is an N-dimensional vector in which each element is the probability that the corresponding word in the dictionary is the next word.

How do we make the neural network output probabilities? The method is to use a softmax layer as the output layer of the network.

Softmax is called the normalized exponential function. It compresses a vector so that every element lies in the range (0, 1) and the elements sum to 1, which is exactly what a probability output requires.
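For reference, the standard definition is:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$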

For the language model, this output means: the model predicts that the probability that the next word is the first word in the dictionary is 0.03, the probability that it is the second word in the dictionary is 0.09, and so on.
(4) Training the language model
This is supervised learning, so the data must first be labeled.

Using the vectorization method described above, the input x and the label y are both turned into vectors.
Finally, cross-entropy is used as the error function (the optimization objective) to optimize the model.

For example, suppose y1 = [1,0,0,0] and the network output is o = [0.03, 0.09, 0.24, 0.64].
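Using the usual cross-entropy definition (with natural logarithm), the loss for this example would be:

$$E = -\sum_{i} y_i \log o_i = -\log 0.03 \approx 3.51$$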

When modeling probabilities, choosing cross-entropy as the error function is more reasonable.

With the model, the optimization objective, and the gradient expressions in hand, the model can be trained with the gradient descent algorithm.

7. Recurrent neural network structures
(1) Unidirectional recurrent neural network
Assume that at time t the network input is xt and the hidden state (i.e. the activity value of the hidden layer neurons) is ht. ht depends not only on the input xt at the current time but also on the hidden state ht-1 of the previous time step, and is therefore related to the entire past input sequence (x1, x2, ..., xt-1, xt).
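In the standard form consistent with the description below:

$$z_t = U h_{t-1} + W x_t + b, \qquad h_t = f(z_t)$$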

Here zt is the net input of the hidden layer; f(·) is a nonlinear activation function, usually the Sigmoid function or the Tanh function; U is the state-state weight matrix, W is the state-input weight matrix, and b is the bias.

(2) Bidirectional recurrent neural network
A bidirectional recurrent neural network (Bidirectional Recurrent Neural Network, Bi-RNN) consists of two layers of recurrent neural networks. Both layers take the same sequence x as input, but they pass information in opposite directions.
Suppose the first layer passes information forward in time and the second layer passes information backward in time; the hidden states of the two layers at time t are ht(1) and ht(2) respectively:
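In the usual formulation (a sketch consistent with the description below):

$$h_t^{(1)} = f\big(U^{(1)} h_{t-1}^{(1)} + W^{(1)} x_t + b^{(1)}\big)$$

$$h_t^{(2)} = f\big(U^{(2)} h_{t+1}^{(2)} + W^{(2)} x_t + b^{(2)}\big)$$

$$h_t = h_t^{(1)} \oplus h_t^{(2)}$$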

The third formula means that the hidden state vectors of the two layers are concatenated.

8. Parameter learning in recurrent neural networks

Parameters to learn: the state-state weight matrix U, the state-input weight matrix W, and the bias b.
Learning method: gradient descent.

(1) Loss function and gradient
Compute the gradient of the loss function over the entire sequence with respect to the parameter U.
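In the usual notation (a sketch consistent with the description below):

$$L = \sum_{t=1}^{T} L_t, \qquad \frac{\partial L}{\partial U} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial U}$$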

Here Lt is the loss at time t and T is the length of the sample sequence.
(2) Backpropagation through time
The backpropagation through time algorithm is similar to the error backpropagation algorithm of feedforward neural networks, except that the recurrent neural network is regarded as an unrolled multilayer feedforward network in which each layer corresponds to one time step of the recurrent network.
In the unrolled multilayer feedforward network, the parameters of all layers are shared, so the true gradient of a parameter is the sum of the gradients of that parameter over all layers of the feedforward network.
First compute the partial derivative of the loss at time t with respect to the parameter U, then compute the gradient of the loss function over the entire sequence with respect to U.
① Compute the partial derivative of the loss at time t with respect to the parameter U
The loss at time t is determined by the net input zt, which is itself computed step by step from the earlier time steps.
    

The net input of the hidden layer at time step k (1 ≤ k ≤ t, i.e. every time step up to and including t) is zk = U hk-1 + W xk + b, so the loss at time t depends on the parameter Uij through every zk.

Note that when computing the partial derivative of zk = U hk-1 + W xk + b with respect to Uij, hk-1 is treated as a constant.
The gradient of the loss at time t with respect to the parameter Uij is then obtained by summing the contributions of all time steps k (1 ≤ k ≤ t), as sketched below.
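A sketch of the standard result, writing δt,k = ∂Lt/∂zk for the error term at step k with respect to the loss at time t (the notation is an assumption, chosen to match section 9):

$$\frac{\partial L_t}{\partial u_{ij}} = \sum_{k=1}^{t} \big[\delta_{t,k}\big]_i \big[h_{k-1}\big]_j, \qquad \frac{\partial L_t}{\partial U} = \sum_{k=1}^{t} \delta_{t,k}\, h_{k-1}^{T}$$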

② Compute the gradient of the loss function over the entire sequence with respect to the parameter U

③ In the backpropagation through time (BPTT) algorithm, the parameter gradient requires one complete "forward" computation and one complete "backward" computation before it can be obtained and the parameters updated; the intermediate gradients at all time steps must be stored, so the space complexity is high. The real-time recurrent learning (RTRL) algorithm can compute the gradient of the loss with respect to the parameters in real time at time t, without backpropagating gradients through time, so its space complexity is low.
9. The long-term dependency problem of recurrent neural networks
(1) The long-term dependency problem
Long-term dependency means that the current state of the system may be influenced by the state of the system a long time ago; it is a problem that RNNs cannot solve well.
Although in theory a recurrent neural network can establish dependencies between states across long time intervals, in practice it may only be able to learn short-term dependencies, because of gradient explosion or gradient vanishing.
Expanding the equation for the error term δt,k shows that it is a product, over the time steps between k and t, of factors involving the derivative of the activation function and the state-state weight matrix. If we define γ as (approximately) the norm of one such factor, then δt,k behaves like γ^(t-k) δt,t; a sketch of the standard derivation follows.
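A sketch of the standard derivation (using the notation of section 8, with zτ the net input at step τ and f the activation function):

$$\delta_{t,k} = \prod_{\tau=k}^{t-1} \Big( diag\big[f'(z_\tau)\big]\, U^{T} \Big)\, \delta_{t,t}$$

If we define

$$\gamma \cong \big\| diag\big[f'(z_\tau)\big]\, U^{T} \big\|,$$

then

$$\delta_{t,k} \cong \gamma^{\,t-k}\, \delta_{t,t}.$$

When γ > 1 and t − k → ∞, γ^(t−k) → ∞, causing the gradient to explode; when γ < 1, γ^(t−k) → 0, causing the gradient to vanish.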

(2) Improvements
Alleviating the gradient explosion and gradient vanishing problems of recurrent neural networks helps avoid the long-term dependency problem.
① Gradient explosion:
Gradient explosion can be avoided by weight decay and gradient clipping. Weight decay limits the range of the parameters by adding an L1 or L2 regularization term, so that γ ≤ 1. Gradient clipping truncates the gradient to a smaller value whenever its modulus exceeds a predetermined threshold.
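A minimal sketch of gradient clipping by norm (the threshold value and function name are illustrative assumptions):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient so that its norm does not exceed the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])      # norm = 50, above the threshold
print(clip_gradient(g))          # [ 3. -4.] -> norm truncated to 5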

② Gradient vanishing:
The model can be changed, for example by making U = I while requiring f'(zi) = 1.

③ Memory capacity problem:
The solution to gradient vanishing introduces a function g(·) so that ht has both a linear and a nonlinear relationship with ht-1. This brings a memory capacity problem: as ht keeps storing new input information, it becomes more and more saturated. The information that the hidden state ht can store is limited; as more and more content is stored in the memory unit, more and more information is lost.
To solve the capacity problem, two methods can be used. The first is to add extra memory, i.e. external memory units; the second is selective forgetting and selective updating, i.e. the gating mechanism used by the long short-term memory network (LSTM).

10. Recurrent neural network modes
  Recurrent neural network modes:
the sequence-to-category mode,
the synchronized sequence-to-sequence mode,
the asynchronous sequence-to-sequence mode.
(1) Sequence-to-category mode
The sequence-to-category mode is mainly used for classification of sequence data: the input is a sequence (T data points) and the output is a category (one value). A typical example is text classification: the input data is a sequence of words (a document) and the output is the category of the text.
Suppose a sample x1:T = (x1, x2, ..., xT) is a sequence of length T and the output is a category y ∈ {1, 2, ..., C}. The samples x at different times are fed into the recurrent neural network, giving the hidden states h1, h2, ..., hT at different times; hT is then regarded as the final representation of the entire sequence and fed into the classifier g(·) for classification.

The entire sequence can also be represented by all of its states, using the average of all the hidden states as the representation of the whole sequence.
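In the usual notation (a sketch of the two variants described above):

$$\hat{y} = g(h_T) \qquad \text{or} \qquad \hat{y} = g\Big(\frac{1}{T}\sum_{t=1}^{T} h_t\Big)$$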

(2) Synchronized sequence-to-sequence mode
The synchronized sequence-to-sequence mode is mainly used for sequence labeling tasks, i.e. there is an input and an output at every time step, and the input sequence and the output sequence have the same length. For example, in part-of-speech tagging (POS Tagging), every word must be labeled with its part of speech. Named entity recognition (Named Entity Recognition, NER) can also be viewed as a sequence labeling problem, handled similarly to POS tagging, except that for named entities the output is a named entity tag instead of a part of speech.
Suppose a sample x1:T = (x1, x2, ..., xT) is a sequence of length T and the output sequence is y1:T = (y1, y2, ..., yT). The samples x at different times are fed into the recurrent neural network, giving the hidden states h1, h2, ..., hT at different times; the hidden state at each time step is then fed into the classifier g(·) to obtain the label for the current time.
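In the usual notation (a sketch consistent with the description above):

$$\hat{y}_t = g(h_t), \qquad t \in [1, T]$$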

(3) Asynchronous sequence-to-sequence mode
The asynchronous sequence-to-sequence mode is also called the encoder-decoder (Encoder-Decoder) model, i.e. the input sequence and the output sequence do not need to have a strict one-to-one correspondence, nor do they need to have the same length. An example is machine translation: the input is a word sequence in the source language and the output is a word sequence in the target language.
In the asynchronous sequence-to-sequence mode, the input is a sequence of length T, x1:T = (x1, x2, ..., xT), and the output is a sequence of length M, y1:M = (y1, y2, ..., yM); this is achieved by encoding first and then decoding.
First, the samples x at different times are fed into a recurrent neural network (the encoder) to obtain an encoding hT; then another recurrent neural network (the decoder) is used to obtain the output sequence ŷ1:M. To establish dependencies within the output sequence, a nonlinear autoregressive model is usually used in the decoder.
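In the usual formulation (a sketch consistent with the description below):

$$h_t = f_1(h_{t-1}, x_t), \qquad t \in [1, T]$$

$$h_{T+t} = f_2(h_{T+t-1}, \hat{y}_{t-1}), \qquad t \in [1, M]$$

$$\hat{y}_t = g(h_{T+t}), \qquad t \in [1, M]$$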

Here f1(·) and f2(·) denote the recurrent neural networks used as the encoder and the decoder respectively, and g(·) is the classifier.


Origin www.cnblogs.com/hello-bug/p/12524776.html