Detailed Encoder-Decoder Model Architecture

Overview

  • Encoder-Decoder is not a specific model but a general framework.
  • The Encoder and Decoder can operate on any kind of data: text, speech, images, or video.
  • The underlying model can be a CNN, RNN, LSTM, GRU, Attention mechanism, etc.
  • "Encoding" means transforming the input sequence into a fixed-length vector; "decoding" means converting that fixed-length vector into an output sequence.

Important points:

  1. Regardless of the lengths of the input and output sequences, the length of the intermediate "vector c" is fixed; this is the main shortcoming of the Encoder-Decoder framework (a minimal sketch of the framework follows this list).
  2. Different encoders and decoders can be chosen for different tasks (RNN, CNN, LSTM, GRU).
  3. Encoder-Decoder is an end-to-end learning algorithm; a typical application is machine translation, e.g. translating French into English. Such a model is also called Seq2Seq.
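
A minimal sketch of the framework, assuming PyTorch and illustrative sizes (not taken from the post): a GRU encoder compresses the whole input sequence into one fixed-length vector c, and a GRU decoder generates the output from it.

import torch
import torch.nn as nn

# Minimal Encoder-Decoder sketch with GRUs (hypothetical sizes).
# The encoder compresses the whole input sequence into one fixed-length vector c.
class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))       # h: (1, batch, hid_dim)
        return h                               # the fixed-length "vector c"

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, c):                 # tgt: (batch, tgt_len), c: encoder state
        out, _ = self.rnn(self.embed(tgt), c)  # c initializes the decoder state
        return self.out(out)                   # (batch, tgt_len, vocab_size)

enc, dec = Encoder(), Decoder()
src = torch.randint(0, 1000, (2, 7))           # two source sentences of length 7
tgt = torch.randint(0, 1000, (2, 5))           # shifted target sentences of length 5
logits = dec(tgt, enc(src))                    # (2, 5, 1000)

Whatever the input length (7 here), the decoder only ever sees the single vector c, which is exactly the limitation discussed later.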

Seq2Seq (Sequence-to-Sequence)

  • The lengths of the input sequence and the output sequence are variable.
  • Seq2Seq emphasizes the goal rather than a specific method: any model that maps an input sequence to an output sequence can be referred to as a Seq2Seq model.
  • The concrete methods used for Seq2Seq basically all fall under the Encoder-Decoder framework.

For example:

  • In the training data set, we can append the special token "<eos>" (end of sequence) to each sentence to mark the end of the sequence.
  • The special token "<bos>" (begin of sequence) is prepended to each sentence to mark the beginning of the sequence (see the data-preparation sketch after this list).
  • The hidden state of the Encoder at the final time step is used as the representation and encoded information of the input sentence.
  • At each time step, the Decoder takes as input the encoded information of the input sentence together with its own output and hidden state from the previous time step.
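
A hedged sketch of this data preparation, using the French-to-English translation task mentioned earlier; the token strings and the example sentence are illustrative, not taken from the post.

# Illustrative <bos>/<eos> handling and decoder input/target construction.
BOS, EOS = "<bos>", "<eos>"

src_tokens = "je suis étudiant".split()          # source (French)
tgt_tokens = "I am a student".split()            # target (English)

encoder_input  = src_tokens + [EOS]              # je suis étudiant <eos>
decoder_input  = [BOS] + tgt_tokens              # <bos> I am a student
decoder_target = tgt_tokens + [EOS]              # I am a student <eos>
# At each time step the decoder is fed the previous target token (decoder_input)
# and is trained to predict the next one (decoder_target).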

Case: the process of translating the English sentence "it is a cat." into Chinese.

  1. First, the whole source sentence is tokenized, and fixed special marks are added as the start and end symbols of the translation. The sentence then becomes, e.g., "<bos> it is a cat . <eos>".
  2. The model then predicts the translated word with the highest probability; for example, the first word is "this". The generated word is appended to the translation sequence, and this step is repeated iteratively (see the decoding-loop sketch after this list).
  3. When the model selects the termination symbol, the iteration stops, the sequence is de-tokenized, and the final translation is obtained.
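
A hedged sketch of this iterative greedy decoding loop, reusing the Encoder/Decoder modules sketched earlier; bos_id, eos_id and max_len are illustrative assumptions.

import torch

def greedy_translate(enc, dec, src, bos_id=1, eos_id=2, max_len=20):
    c = enc(src)                                 # encode the source sentence once; src: (1, src_len)
    tokens = [bos_id]
    for _ in range(max_len):
        # For simplicity the decoder is re-run over the whole generated prefix each step.
        tgt = torch.tensor([tokens])             # (1, generated_len)
        logits = dec(tgt, c)                     # (1, generated_len, vocab_size)
        next_id = logits[0, -1].argmax().item()  # word with the highest probability
        tokens.append(next_id)
        if next_id == eos_id:                    # termination symbol selected: stop iterating
            break
    return tokens[1:]                            # drop <bos>, keep the translated token ids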

Defects of Encoder-Decoder

The length of the "vector c" in the middle is fixed

  • The RNN-based Encoder-Decoder model suffers from vanishing gradients over long ranges.
  • For longer sentences, we can hardly hope to compress the input sequence into a fixed-length vector while preserving all of the useful information.
  • Even though the LSTM adds gating mechanisms to selectively forget and remember, this structure still performs poorly as the sentences to be translated become more difficult.

The introduction of the Attention mechanism

  • Attention was introduced to solve the information loss caused by overly long inputs.
  • In the Attention model, when translating the current word, we look for the corresponding words in the source sentence and then translate the next word in combination with the previously translated sequence.


How does Attention focus on exactly the parts it cares about?

  1. Attention scores are computed over the RNN (Encoder) outputs: the weight between the vector at the final moment and the vector at each moment i is computed and normalized with a softmax. If a particular part of the sequence deserves special attention, its computed score will be relatively large.
  2. The hidden vector at each moment in the Encoder is computed.
  3. The hidden vectors are weighted by the attention scores for the final output, which determines how much attention should be paid to the vector at each moment i.
  4. The attention-weighted result from step 3 is fed into the Decoder at each moment. The Decoder's inputs then include: the attention-weighted hidden vector, the Encoder's output vector, and the Decoder's hidden vector from the previous moment.
  5. By iterating continuously, the Decoder outputs the final translated sequence (a minimal sketch of the weighting step follows this list).
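
A minimal sketch of the weighting step, assuming PyTorch and dot-product scoring (the post does not fix a particular scoring function).

import torch
import torch.nn.functional as F

# Attention weighting for an RNN Encoder-Decoder: score each encoder hidden
# vector against the current decoder state, softmax, then take a weighted sum.
def attention_context(dec_hidden, enc_outputs):
    # dec_hidden:  (batch, hid)          current decoder hidden state
    # enc_outputs: (batch, src_len, hid) encoder hidden vector at every moment
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                    # attention score per source position
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)     # (batch, hid)
    return context, weights            # context is the attention-weighted hidden vector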

Under the Attention-based Encoder-Decoder framework, the machine translation task follows the general process described in the steps above.

Encoder-Decoder in Transformer

  • The Attention in the Transformer is Self-Attention (the self-attention mechanism), and specifically Multi-Head Attention (the multi-head attention mechanism).


Attention mechanism

  • Source can be regarded as a series of <Key, Value> pairs. Given an element Query in Target, the similarity between the Query and each Key is computed to obtain a weight coefficient for the corresponding Value; the Values are then weighted and summed to obtain the final Attention value (a sketch follows).
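
The post does not give the formula explicitly; the following minimal sketch uses the scaled dot-product form from the Transformer paper (the 1/sqrt(d_k) scaling is that paper's convention).

import math
import torch
import torch.nn.functional as F

# Query/Key/Value attention: similarity(Query, Key) -> softmax weights ->
# weighted sum of the Values.
def scaled_dot_product_attention(Q, K, V):
    # Q: (batch, tgt_len, d_k), K: (batch, src_len, d_k), V: (batch, src_len, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (batch, tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)                        # weight coefficient of each Key's Value
    return weights @ V                                         # (batch, tgt_len, d_v)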

Self-Attention

  • It refers not to Attention between Target and Source, but to Attention computed among the elements inside Source or among the elements inside Target; it can also be understood as Attention in the special case Target = Source.
  • Source = Target, that is, Key = Value = Query (see the sketch after this list).
  • Self-Attention can capture syntactic features between words in the same sentence (for example, a phrase structure spanning a certain distance) as well as semantic features (for example, resolving which noun, such as "Law", a pronoun like "its" refers to).
  • Once Self-Attention is introduced, long-distance dependencies within a sentence become much easier to capture, because Self-Attention directly connects any two words in the sentence during computation.
  • Self-Attention does not depend on the time sequence, which increases the parallelism of the computation (a short self-attention sketch follows).
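
Since Source = Target means Key = Value = Query, self-attention can be sketched by feeding the same sequence in all three roles, reusing the scaled_dot_product_attention function defined above (the sizes are illustrative).

import torch

# Self-Attention: the same sentence representation x serves as Query, Key and Value.
x = torch.randn(2, 10, 64)                               # (batch, sentence_len, model_dim)
self_attended = scaled_dot_product_attention(x, x, x)    # (2, 10, 64)
# Every position attends to every other position in a single step, so long-distance
# dependencies are connected directly and all positions are computed in parallel.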

Multi-Head Attention

  • The model is split into multiple heads, forming multiple subspaces, which allows the model to attend to different kinds of information.
  • The individual layers of Transformer or BERT have distinct roles: the lower layers tend to focus more on syntax, while the upper layers tend to focus more on semantics.
  • Within the same layer, there are usually one or two heads whose attended tokens are clearly different from those of the other heads.
  • When two self-attention heads process the same sentence, they show different attention patterns; with the multi-head mechanism, the model evidently learns to attend differently for different purposes (a minimal multi-head sketch follows).

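A hedged sketch of the head splitting (hypothetical sizes, reusing the scaled_dot_product_attention function above): the model dimension is projected and divided into several heads so that each head attends in its own subspace, and the heads are concatenated afterwards.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def split(self, t):                        # (B, L, d_model) -> (B*h, L, d_head)
        B, L, _ = t.shape
        return t.view(B, L, self.h, self.d_head).transpose(1, 2).reshape(B * self.h, L, self.d_head)

    def forward(self, x):                      # x: (B, L, d_model)
        B, L, _ = x.shape
        q = self.split(self.q_proj(x))
        k = self.split(self.k_proj(x))
        v = self.split(self.v_proj(x))
        heads = scaled_dot_product_attention(q, k, v)   # each head attends in its own subspace
        heads = heads.reshape(B, self.h, L, self.d_head).transpose(1, 2).reshape(B, L, -1)
        return self.o_proj(heads)                       # concatenated heads, projected back to d_model

PyTorch's built-in torch.nn.MultiheadAttention provides the same functionality out of the box.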

The Encoder in the Transformer consists of 6 identical layers, each containing 2 parts:

  • Multi-Head Self-Attention
  • Position-Wise Feed-Forward Network (fully connected layer)

The Decoder is also composed of 6 identical layers, each containing 3 parts:

  • Multi-Head Self-Attention
  • Multi-Head Context-Attention
  • Position-Wise Feed-Forward Network

Each of the above sub-layers has a residual connection, followed by Layer Normalization (see the encoder-layer sketch below).
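
A minimal sketch of one such encoder layer (hypothetical sizes; post-norm ordering as in the original Transformer paper), showing each sub-layer wrapped with a residual connection and Layer Normalization.

import torch
import torch.nn as nn

# One Transformer encoder layer: multi-head self-attention, then the position-wise
# feed-forward network, each wrapped as LayerNorm(x + Sublayer(x)).
class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)      # Multi-Head Self-Attention
        x = self.norm1(x + attn_out)               # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))            # FFN + residual connection + LayerNorm
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # 6 identical layers
out = encoder(torch.randn(2, 10, 64))                          # (2, 10, 64)

A decoder layer would additionally insert the Multi-Head Context-Attention (attending over the encoder output) between the two sub-layers shown here.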

Encoder-Decoder limitations:
The only link between encoding and decoding is the fixed-length semantic vector c: the encoder compresses the information of the entire sequence into a single fixed-length vector.

  • The semantic vector cannot fully represent the information of the entire sequence.
  • The information carried by earlier inputs is diluted by the information entered later.

Attention model:
When generating each output, the model also produces an "attention range" that indicates which parts of the input sequence to focus on for the next output; the next output is then generated according to the attended region, and so on.


The biggest difference of the attention model is that it no longer requires the encoder to encode all input information into a fixed-length vector.

  • Instead, the encoder encodes the input into a sequence of vectors, and at each decoding step a subset of this vector sequence is selectively chosen for further processing.
  • In this way, the information carried by the input sequence can be fully utilized every time an output is generated.

Representation of the Encoder-Decoder:

A few notes

  • The middle "vector c" has a fixed length regardless of the length of the input and output (this is its flaw).
  • Different encoders and decoders can be selected according to different tasks (for example, CNN, RNN, LSTM, GRU, etc.)
  • A distinctive feature of Encoder-Decoder is that it is an end-to-end learning algorithm.
  • As long as the model conforms to this framework structure, it can be collectively referred to as the Encoder-Decoder model.

The relationship between Seq2Seq and Encoder-Decoder
Encoder-Decoder emphasizes the model design (an encode-then-decode process), while Seq2Seq emphasizes the task type (sequence-to-sequence problems).


The four decoding modes of Encoder-Decoder (their per-step differences are sketched after this list):

  • The simplest decoding mode
  • The decoding mode with output feedback
  • The decoding mode with the encoded vector
  • The attentional decoding mode
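
Since the original figures are not reproduced here, the following rough per-step update equations summarize how the four modes differ. They follow the standard formulations of these decoding modes rather than the post's figures; s_t is the decoder state, y_t the output at step t, c the fixed encoder vector, and h_i the encoder hidden states:

    Simplest mode:          s_t = f(s_{t-1}),                y_t = g(s_t)          (c only initializes s_0)
    With output feedback:   s_t = f(s_{t-1}, y_{t-1}),       y_t = g(s_t)
    With encoded vector:    s_t = f(s_{t-1}, y_{t-1}, c),    y_t = g(s_t, c)
    Attentional:            s_t = f(s_{t-1}, y_{t-1}, c_t),  where c_t = Σ_i α_{t,i} · h_i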
