Understand the structural principles of RNN, LSTM, and Transformer in 10 minutes!

1. RNN

RNN (recurrent neural network) is the basic network architecture for processing sequential data in tasks such as NLP and speech recognition. Unlike image data, sequential data is collected at different time steps, and its state is generally tied to time. For a sentence, a single word is rarely enough to convey the overall meaning; only by processing the whole sequence formed by these words can the overall information be understood.

1.1 RNN basic architecture

The figure below shows the basic architecture of an RNN. The input is the sentence "I dislike the boring movie"; it passes through the hidden layer H in the middle, and after some computation the output O is obtained. We can regard each vertical sub-module (shown in the red box) as a fully connected network; along the time dimension there are five such fully connected structures, and they all share the same set of parameters.

![RNN basic architecture](https://img-blog.csdnimg.cn/14c15e7edb364e818b705537d4959246.png)
The difference from an ordinary fully connected network is that the state of the hidden layer is affected not only by the input at the current time step but also by the hidden state at the previous time step. The formula is:

$$h_t = f(U x_t + W h_{t-1} + b), \qquad o_t = g(V h_t)$$

Expanding the recursion further:

$$h_t = f\big(U x_t + W\, f(U x_{t-1} + W\, f(U x_{t-2} + \cdots) + b) + b\big)$$
The following figure (a sentiment classification example) shows the hidden-state updates more intuitively (blue box in the figure):
(figure omitted)
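A minimal NumPy sketch of the hidden-state update above, unrolled over a 5-step input; all sizes and the random weights are illustrative assumptions, not values from the figure:

```python
import numpy as np

# h_t = tanh(U x_t + W h_{t-1} + b), with U, W, b shared across time steps.
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))   # e.g. 5 word vectors
h = np.zeros(hidden_dim)                        # initial hidden state
for t in range(seq_len):
    h = np.tanh(U @ x_seq[t] + W @ h + b)       # current input + previous state
print(h.shape)                                  # final hidden state, shape (16,)
```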

1.2 Three classic structures of RNN

Depending on the form of its input and output, an RNN is usually organized into one of three structures: vector-to-sequence (one-to-many), sequence-to-vector (many-to-one), and Encoder-Decoder (many-to-many).

1.2.1 vector-to-sequence structure

Sometimes the input is a single vector and the output is a sequence, for example generating a text caption from an image. There are two main modeling approaches:

  • Method 1: Feed the input only at a single time step, typically the first one. The modeling is as follows:
    (figure omitted)

  • Method 2: Feed the input information X as the input at every time step (see the sketch after this list). The modeling is as follows:
    (figure omitted)
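A minimal PyTorch sketch of Method 2, assuming an image-captioning-like setup: a single feature vector is repeated and fed to the RNN at every step. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim, seq_len, batch = 64, 128, 5, 1
rnn = nn.RNN(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(batch, feat_dim)               # single input vector
x_seq = x.unsqueeze(1).repeat(1, seq_len, 1)   # same vector at every time step
outputs, h_n = rnn(x_seq)                      # outputs: (batch, seq_len, hidden_dim)
print(outputs.shape)                           # one output per generated step
```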

1.2.2 sequence-to-vector structure

When the input is a sequence and the output is a single value (e.g., sentiment classification or text classification), the output transformation is usually applied only at the last time step. The modeling is as follows (a minimal sketch follows the figure):
(figure omitted)
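A minimal PyTorch sketch of the sequence-to-vector idea for sentiment classification: run the RNN over the whole sequence and classify from the last hidden state. The vocabulary size, embedding size, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        _, h_n = self.rnn(emb)                 # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                # classify from the last hidden state

logits = SentimentRNN()(torch.randint(0, 1000, (2, 5)))
print(logits.shape)                            # (2, num_classes)
```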

1.2.3 Encoder-Decoder structure

A plain sequence-to-sequence RNN requires the input and output sequences to have the same length. However, most real problems involve sequences of unequal length; in machine translation, for example, sentences in the source and target languages often differ in length. The Encoder-Decoder structure handles this and consists of an encoder and a decoder (a minimal code sketch follows the list below):

  • Encoder: encodes the input sequence into a context vector $c$. There are many ways to obtain $c$: the simplest is to take the Encoder's last hidden state as $c$; you can also apply a transformation to the final hidden state, or transform all of the hidden states. Its schematic is as follows:
    (figure omitted)
  • Decoder: decodes with another RNN network (called the Decoder). There are generally two ways to decode:
    • Use $c$ only at step 1 of the Decoder (e.g., as its initial state):
      (figure omitted)
    • Use $c$ as the input at every step of the Decoder, as shown below:
      (figure omitted)
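A minimal seq2seq sketch with GRUs, assuming the simplest variant above: the encoder's last hidden state serves as the context vector $c$ and initializes the decoder. All vocabulary and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, c = self.encoder(self.src_embed(src_ids))            # c = last hidden state
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), c)   # c initializes the decoder
        return self.out(dec_out)                                # per-step vocabulary logits

logits = Seq2Seq()(torch.randint(0, 1000, (2, 7)),   # source length 7
                   torch.randint(0, 1000, (2, 5)))   # target length 5
print(logits.shape)                                  # (2, 5, tgt_vocab)
```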

1.3 Common application areas of RNN

Different structures suit different scenarios:

  • Vector-to-Sequence (one-to-many): often used for generating text from images, or generating speech or music from an image, etc.
  • Sequence-to-Vector (many-to-one): commonly used for text classification, sentiment recognition, video classification, and similar tasks.
  • Encoder-Decoder (many-to-many): has the widest range of use cases, including machine translation, text summarization, reading comprehension, speech recognition, and more.

1.4 Advantages and disadvantages of RNN

Advantages:

  • In the hidden layer, the state at time t is jointly determined by the input at time t and the state at time t-1, which helps to establish the connection between word contexts.
  • Each fully connected network in RNN shares a set of parameters, which greatly reduces the amount of network parameters and makes network training more efficient.

Disadvantages:

  • In the general Encoder-Decoder structure above, the Encoder compresses the entire input sequence into a single semantic vector $c$ before decoding. Since $c$ must carry all of the information in the original sequence, its fixed size becomes a bottleneck that limits model performance. In machine translation, for example, when the sentence to be translated is long, a single $c$ may not be able to hold that much information, and translation accuracy drops.
  • Because the memory an RNN carries forward influences all later time steps, the gradient is sometimes large and sometimes small, and the learning rate cannot be tuned for each step individually, so the training loss tends to oscillate. (A common remedy is gradient clipping: set a threshold, and when the gradient exceeds it, truncate the gradient to that threshold to prevent large oscillations; see the sketch below.)
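A minimal sketch of gradient clipping during RNN training. The model, data, loss, and threshold (max_norm) are illustrative placeholders:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 10, 16)          # (batch, seq_len, features)
output, _ = model(x)
loss = output.pow(2).mean()         # dummy loss just for the example

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm to a threshold to avoid large oscillations.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```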

1.5 Why gradients vanish in RNN

The sigmoid function and its derivative are plotted below:
(figure omitted)

  • As the figure shows, the derivative of the sigmoid function lies in (0, 0.25] and the derivative of tanh lies in (0, 1]; neither derivative ever exceeds 1.
  • If tanh or sigmoid is used as the activation function, then as the sequence gets deeper in time, the repeated multiplication of these derivatives drives the product smaller and smaller until it approaches 0: this is the "vanishing gradient" phenomenon (a small numeric demonstration follows this list).
  • In practice tanh is usually preferred, because its gradient is larger than sigmoid's, so convergence is faster and the gradient vanishes more slowly.
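A small numeric demonstration of the effect described above: repeatedly multiplying sigmoid derivatives (each at most 0.25) shrinks the accumulated product toward 0 as the number of time steps grows. The random pre-activations are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # maximum value is 0.25, at x = 0

rng = np.random.default_rng(0)
grads = sigmoid_grad(rng.normal(size=50))   # derivatives at 50 "time steps"

product = 1.0
for t, g in enumerate(grads, start=1):
    product *= g
    if t in (5, 10, 20, 50):
        print(f"after {t:2d} steps, accumulated derivative product ≈ {product:.2e}")
```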

Vanishing gradients come from multiplying through an ever-longer history, yet the defining feature of an RNN is precisely that it uses historical data to extract more useful information. The main ways to mitigate vanishing gradients in RNNs are:

  • Choose a better activation function, such as ReLU. The derivative of ReLU is 0 on the left and a constant 1 on the right, which avoids the vanishing gradient; a constant derivative of 1 can instead lead to exploding gradients, but setting an appropriate clipping threshold handles that.
  • Add batch normalization (BN) layers, whose benefits include faster convergence, some control of overfitting (allowing less or no dropout and regularization), reduced sensitivity to weight initialization, and the ability to use larger learning rates.
  • Change the propagation structure, for example to the LSTM structure described below.

2. LSTM

LSTM (long short-term memory) is a special kind of RNN designed to address the long-term dependency problem. The problem arises because relating nodes that are far apart in the sequence involves many repeated multiplications by the Jacobian matrix, which makes the gradient vanish. LSTM can also be described as a gated RNN: by letting the gating coefficients change across time steps, the network can control how much of the accumulated information to forget, which alleviates the problem.

2.1 Differences between LSTM and RNN

  • An RNN is a chain of repeated neural network modules. In a standard RNN this repeating module has a very simple structure, such as a single tanh layer (a sigmoid activation is also possible):
    (figure omitted)
  • In an LSTM the repeating module instead contains four neural network layers that interact in a very specific way:
    (figure omitted)

2.2 Diagram of the core idea of LSTM

LSTM removes information from or adds information to the cell state through carefully designed "gate" structures. A gate is a way of letting information through selectively; it consists of a sigmoid neural network layer and a pointwise multiplication operation. The schematic is as follows:
(figure omitted)
LSTM has three such gates, the forget gate, the input gate, and the output gate, which protect and control the cell state.

2.2.1 Forget gate

  • Object of action: the cell state.
  • Function: selectively forget information in the cell state.
  • Operation: the gate reads $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each entry of the cell state $C_{t-1}$; 1 means "keep completely" and 0 means "discard completely", i.e. $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The schematic is as follows:
    (figure omitted)

2.2.2 Input layer gate

  • Object of action: cell state
  • Function: selectively record new information into the cell state.
  • Operation steps:
    • Step 1: a sigmoid layer called the "input gate layer" decides which values we are going to update.
    • Step 2: a tanh layer creates a vector of new candidate values $\tilde{C}_t$ that could be added to the state. Its schematic is as follows:
      (figure omitted)
    • Step 3: update $C_{t-1}$ to $C_t$. Multiply the old state by $f_t$, discarding the information we decided to forget, then add $i_t * \tilde{C}_t$, the new candidate values scaled by how much we decided to update each entry: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$. Its schematic is as follows:
      (figure omitted)

2.2.3 Output layer gate

  • Object of action: the hidden state $h_t$.
  • Function: Determine what value to output.
  • Steps:
    • Step 1: Use the sigmoid layer to determine which part of the cell state will be output.
    • Step 2: pass the cell state $C_t$ through tanh and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to output: $h_t = o_t * \tanh(C_t)$.

Its schematic diagram is as follows:
(figure omitted)
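A minimal NumPy sketch of one LSTM cell step, putting the three gates above into formulas (the standard LSTM equations); all concrete sizes and the random weights are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

hidden, inp = 4, 3
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(hidden, hidden + inp)) for _ in range(4)]
bs = [np.zeros(hidden) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, *Ws, *bs)
print(h.shape, c.shape)                      # (4,) (4,)
```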

2.3 LSTM application scenarios

3. Transformer

The Transformer is a model architecture for natural language processing (NLP) and other sequence-to-sequence tasks, proposed by Google in 2017. It achieved a major breakthrough in machine translation and has since been widely used across NLP.

Traditional sequence models, such as recurrent neural networks (RNNs), suffer from vanishing gradients and low computational efficiency when processing long sequences. The Transformer uses an attention mechanism (Attention Mechanism) to establish a global context, which effectively solves these problems.

3.1 The core of Transformer

The self-attention mechanism is the core of the Transformer. When generating the representation of each word, it lets the model weight the context information contributed by every other word in the input sequence, so the model can better capture dependencies between words, including long-distance dependencies. The attention computation is efficient because it can be expressed as matrix operations and run in parallel.

In addition to the attention mechanism, Transformer also introduces technologies such as Residual Connections and Layer Normalization to help model convergence and training stability.

Transformer applications include machine translation, text generation, question answering systems, language models, and more. It not only surpasses traditional sequence models in performance, but also has the advantage of parallel computing and can efficiently process long sequences.

The success of Transformer has triggered the development of a series of Transformer-based models, such as BERT and GPT. These models have achieved major breakthroughs in various NLP tasks and have become one of the important milestones in the field of natural language processing.

3.2 Transformer main structure

Take machine translation as an example: in the sentence below, "it" refers to "The animal", something that can only be resolved by understanding the context, and the Transformer handles this very well. How does it do that? We need to explore its attention mechanism in what follows.
(figure omitted)

3.2.1 Overall structure

The Transformer model consists of an encoder (Encoder) and a decoder (Decoder).

  • The encoder converts each word in the input sequence into an embedding vector and performs information extraction and feature representation through multiple self-attention layers and feed-forward neural network layers.
  • The decoder builds on the encoder's output and additionally uses its own attention layers to generate the target sequence.
    (figure omitted)
    In practice, multiple encoder and decoder layers are stacked in series, as shown in the following figure (a minimal code sketch follows):
    (figure omitted)
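A minimal sketch of stacking encoder layers using PyTorch's built-in modules, just to illustrate the "multiple encoders in series" idea; the sizes (d_model=512, 8 heads, 6 layers) are illustrative choices:

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # 6 encoders in series

tokens = torch.randn(2, 10, 512)   # (batch, seq_len, d_model), already embedded
memory = encoder(tokens)           # encoder output, later consumed by the decoder
print(memory.shape)                # (2, 10, 512)
```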

3.2.2 Encoder

In the Transformer model, the encoder mainly includes the following structures:

  • Positional Encoding: Since the Transformer uses neither convolution nor recurrence, it has no built-in notion of element order in the sequence. To address this, positional encodings are added to the input embeddings to provide information about where each element sits relative to the others.

  • Self-Attention: The self-attention mechanism is a core component of the Transformer model. Each attention head in the encoder computes an attention weight for each element in the input sequence, which represents how relevant that element is to other elements. Through the self-attention mechanism, the encoder can learn the dependencies between elements in the sequence at different levels.

  • Multi-Head Attention: In order to capture information from different points of attention, encoders usually use multiple attention heads. Each head independently learns its own attention weights and produces its own weighted sum of values. Multi-head attention lets the model better capture associations between different features.

  • Feed-forward Neural Network: Each attention sublayer in the encoder is usually followed by a feed-forward neural network. The feed-forward neural network is a fully connected forward propagation network , which is used to nonlinearly transform and map the output of the attention sublayer . With a feed-forward neural network, the encoder can introduce more nonlinearity and expressiveness.

  • Residual Connections and Layer Normalization: In order to stabilize training and speed up information transfer , residual connections and layer normalization are introduced in the encoder. Residual connections allow information to skip directly between different layers, helping to avoid the problem of vanishing or exploding gradients . Layer normalization is used to normalize the attention sublayer and the feedforward neural network to improve the training stability of the model.

Through the combination and stacking of the above structures, the encoder performs multi-level feature extraction and representation learning on the input sequence, providing more accurate and richer representations for downstream tasks (such as machine translation and text classification).

(figure omitted)

(1) Input part

Word embedding converts each word into a vector, and positional encoding is added so that the model can easily learn relative position information.

(figure omitted)
The positional encoding formula is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
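A minimal NumPy sketch of the sinusoidal positional encoding above; max_len and d_model are illustrative sizes:

```python
import numpy as np

def positional_encoding(max_len=50, d_model=512):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cos
    return pe

pe = positional_encoding()
print(pe.shape)   # (50, 512); row `pos` is added to the embedding at that position
```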

(2) Attention Mechanism Structure

The figure below shows the structure of a single (scaled dot-product) attention head and of multi-head attention:
(figure omitted)
The computation of a single attention head can be described in three steps: Q and K are dot-multiplied to measure vector similarity; softmax converts the scores into a probability distribution; finally, the distribution is used to take a weighted sum over V. The overall formula is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Through the following picture, you can understand it more intuitively:

(figure omitted)
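A minimal NumPy sketch of the scaled dot-product attention formula above; the Q, K, V shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity between queries and keys
    weights = softmax(scores, axis=-1)        # probability distribution per query
    return weights @ V                        # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))   # 5 tokens, d_k = d_v = 64
print(attention(Q, K, V).shape)                          # (5, 64)
```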

(3) Derivation process of attention mechanism

The derivation, step by step, using the translation of the two words "Thinking Machines" as an example:

  • First, embed each word and add its positional encoding vector to obtain the input X of the self-attention layer.
  • Initialize three weight matrices $W^Q$, $W^K$, $W^V$, and multiply X by each of them to obtain the query matrix Q, the key matrix K, and the value matrix V.
  • Compute attention with the $\mathrm{Attention}(Q, K, V)$ formula to obtain each word's attention vector, which reflects the context-weighted result.
  • Since the Transformer uses a multi-head attention structure, the output vectors of the individual attention heads are concatenated and then passed through a fully connected layer to obtain the final output.
    (figure omitted)

Note: Since Q, K, and V are calculated from the same input X, it is usually called a self-attention layer

(figure omitted)
One stated motivation for the multi-head attention mechanism is to reduce the influence of the initial values of the $W^Q$, $W^K$, $W^V$ matrices; another common explanation is that, much like multiple convolution kernels in a CNN, multiple heads enlarge the model's representation space (a minimal multi-head sketch follows).
(figure omitted)
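A minimal, self-contained NumPy sketch of multi-head attention: X is projected into per-head Q, K, V, each head runs scaled dot-product attention independently, and the head outputs are concatenated and passed through an output projection. All sizes and the random weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=8, d_model=64, seed=0):
    rng = np.random.default_rng(seed)
    d_head = d_model // num_heads
    W_q, W_k, W_v = (rng.normal(size=(num_heads, d_model, d_head)) for _ in range(3))
    W_o = rng.normal(size=(d_model, d_model))        # output projection after concat

    heads = []
    for h in range(num_heads):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
        scores = Q @ K.T / np.sqrt(d_head)           # scaled dot-product per head
        heads.append(softmax(scores) @ V)
    concat = np.concatenate(heads, axis=-1)          # (seq_len, d_model)
    return concat @ W_o

X = np.random.default_rng(1).normal(size=(5, 64))    # 5 tokens, d_model = 64
print(multi_head_attention(X).shape)                 # (5, 64)
```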

3.2.3 Decoder

In the Transformer model, the decoder (Decoder) is the part responsible for generating the target sequence from the output of the encoder. The decoder mainly contains the following structures:

  • Self-Attention layer: Each attention head of the decoder computes an attention weight for each position in the target sequence, representing how relevant that position is to the others; in the decoder this self-attention is masked so that a position can only attend to earlier positions. Through the self-attention mechanism, the decoder learns the dependencies between positions in the target sequence at different levels.

  • Encoder-Decoder Attention Layer: In the decoder, in order to obtain contextual information related to the encoder output, the encoder-decoder attention layer is introduced. This layer computes attention weights between the target sequence position and the encoder output to capture the relevance of the encoder output to the current target position.

  • Feed-forward Neural Network: Each attention sublayer in the decoder is usually followed by a feed-forward neural network. The feed-forward neural network is a fully connected forward propagation network, which is used to nonlinearly transform and map the output of the attention sublayer.

  • Residual Connections and Layer Normalization: Similar to the encoder, residual connections and layer normalization are also introduced in the decoder. Residual connections allow information to skip directly between different layers, helping to avoid the problem of vanishing or exploding gradients. Layer normalization is used to normalize the attention sublayer and the feed-forward neural network.

In the decoder, it is common to stack multiple decoder layers, each with the same structure (but its own parameters). This stacking lets the decoder generate the target sequence step by step while gradually accumulating more contextual information and richer semantic representations.

Through the combination and stacking of the above structures, the decoder can decode and generate target sequences from the output of the encoder, such as translating source language sentences into target language sentences in machine translation tasks. The decoder is designed to allow the introduction of contextual information and global dependencies when generating sequences to improve the quality and consistency of generated sequences .
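A minimal sketch of the decoder side using PyTorch's built-in modules, showing the decoder consuming both the embedded target sequence and the encoder output (memory); the causal mask restricts each position to earlier positions. Sizes are illustrative:

```python
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)   # stacked decoder layers

memory = torch.randn(2, 10, 512)      # encoder output: (batch, src_len, d_model)
tgt = torch.randn(2, 7, 512)          # embedded target sequence so far
tgt_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)   # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)                      # (2, 7, 512), fed to the output layer below
```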

3.2.4 Output layer

The decoder output at each step is a vector, which still has to be turned into a translated word. The whole process: the vector first passes through a Linear layer (a fully connected layer whose number of output nodes equals the vocabulary size, e.g. 20,000), then softmax converts the scores into probabilities, and finally the word corresponding to the index with the highest probability is emitted as the translation result.

(figure omitted)
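A minimal sketch of the output layer just described: a Linear layer projects the decoder output to vocabulary-sized logits, softmax turns them into probabilities, and argmax picks the most likely word index. All sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 20000
to_vocab = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(1, 5, d_model)   # (batch, target_len, d_model)
logits = to_vocab(decoder_output)
probs = torch.softmax(logits, dim=-1)         # probability over the vocabulary
predicted_ids = probs.argmax(dim=-1)          # index of the most probable word
print(predicted_ids.shape)                    # (1, 5): one word id per position
```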

3.3 Common models based on Transformer

There are many Transformer-based models, the following are some common Transformer-based models:

  • BERT (Bidirectional Encoder Representations from Transformers): BERT is a Transformer-based pre-trained language model. It is pre-trained on large-scale unlabeled data and can then be used for various downstream tasks, such as text classification, named entity recognition, and sentence similarity.

  • GPT (Generative Pre-trained Transformer): GPT is a Transformer-based generative pre-training model, which learns contextual information by pre-training on large-scale text data, and then can be used to generate text, machine translation and other tasks .

  • Transformer-XL: Transformer-XL is an extended Transformer model for language modeling, which solves the problem in modeling long text sequences by introducing a recurrent mechanism, and is able to capture longer distance dependencies in long texts.

  • T5 (Text-to-Text Transfer Transformer): T5 is a multi-task Transformer model that unifies various natural language processing tasks by converting both inputs and outputs into a common text-to-text form, so different tasks can be handled by fine-tuning.

  • XLNet: XLNet is a Transformer-based autoregressive language model that uses permutation language modeling to capture bidirectional context while avoiding the order bias of a fixed left-to-right factorization, and it has achieved excellent performance on multiple downstream tasks.

These models improve on and extend the Transformer architecture. By making full use of the Transformer's self-attention and multi-head attention mechanisms, they effectively learn the dependencies and semantic representations within a sequence, achieving remarkable performance improvements on natural language processing tasks.


Given my limited expertise, some mistakes in this post are inevitable. If you spot any, please feel free to point them out!


Origin blog.csdn.net/wjinjie/article/details/131643496