Attention Mechanism, Decoder and Encoder Improvements

Attention structure

Seq2seq is a very powerful framework with a wide range of applications. Here we introduce the attention mechanism, which strengthens seq2seq further. With attention, seq2seq can focus on the necessary information, just as we humans do.

The problem with seq2seq

In seq2seq, the encoder encodes the time-series data, and the encoded information is then passed to the decoder. At this point, the output of the encoder is a fixed-length vector. There is actually a big problem with this fixed-length vector: no matter how long the input sentence is, it is converted into a vector of the same length.

No matter how long the text is, the current encoder converts it into a fixed-length vector, like stuffing a large suit into a small closet: the encoder forces the information into a vector of a predetermined length. Sooner or later this hits a bottleneck; just as the suit will eventually spill out of the closet, useful information will overflow from the vector.
Let's now improve seq2seq, first the encoder and then the decoder.

Encoder improvements

So far we have passed only the last hidden state of the LSTM layer to the decoder. However, the length of the encoder's output should change according to the length of the input text. This is where the encoder can be improved: specifically, we use the hidden state of the LSTM layer at every time step.

If we use the hidden states of the LSTM layer at all time steps (denoted hs here), we obtain as many vectors as there are input words: if 5 words are input, the encoder outputs 5 vectors. This frees the encoder from the constraint of a single fixed-length vector. In many deep learning frameworks, when initializing an RNN layer you can choose whether to return the hidden state vectors at all time steps or only the hidden state at the last time step. So what information is contained in the hidden state of the LSTM layer at each time step?
One thing is certain: the hidden state at each time step contains a great deal of information about the word input at that moment. For example, the output of the LSTM layer when the word "cat" is input is influenced most strongly by that input word, so this hidden state vector can be thought of as containing many "cat" components. From this point of view, the hs matrix output by the encoder can be regarded as a set of vectors, one corresponding to each word.
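
To make this concrete, here is a minimal sketch (not the original implementation) of an encoder loop that keeps the hidden state at every time step instead of only the last one; `lstm_step` is a hypothetical one-step LSTM function.

```python
import numpy as np

def encode_all_states(xs, h0, c0, lstm_step):
    # xs: word vectors of the input sentence, shape (T, D)
    # h0, c0: initial hidden and cell states, shape (H,)
    # lstm_step: hypothetical one-step LSTM, (h, c, x) -> (h, c)
    h, c = h0, c0
    hs = []
    for x in xs:                  # left to right over the T input words
        h, c = lstm_step(h, c, x)
        hs.append(h)              # keep the hidden state at this time step
    return np.stack(hs)           # shape (T, H): one row per input word
```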

The encoder processes the sentence from left to right, so each word's hidden state vector mainly contains information about the words up to that point. For better overall balance, a bidirectional RNN or bidirectional LSTM, which processes the time-series data in both directions, is more effective.
Summary: the improvement to the encoder is simply to take out all of the encoder's hidden states at every time step. With this small change, the encoder can encode information in proportion to the length of the input sentence.

Decoder Improvement 1

The encoder outputs, as a whole, the hidden state vectors hs of the LSTM layer corresponding to each word, and this hs is then passed to the decoder to carry out the time-series conversion.

Until now, only the encoder's last hidden state vector was passed to the decoder; specifically, the last hidden state of the encoder's LSTM layer was used as the initial hidden state of the decoder's LSTM layer.

We now improve the decoder so that it can use all of hs.
In the history of machine translation, much research has addressed which words in the input correspond to which words in the output, using knowledge of word correspondences such as the source word for "cat" mapping to "cat". Information representing such word correspondences is called alignment. The earliest alignment was mostly done by hand. The attention technique introduced next automatically incorporates the idea of alignment into seq2seq; it is an evolution from manual work to mechanical automation.
Our goal is to find the source-word information that corresponds to each target word and to use that information for the translation. In other words, our goal is to focus only on the necessary information and to perform the time-series conversion based on it. This mechanism is called attention.

We add a new layer that performs some kind of computation. At each time step this computation receives the hidden state of the decoder's LSTM layer and the encoder's hs, selects the necessary information from them, and passes it on to the Affine (fully connected) layer. As before, the encoder's final hidden state vector is passed to the decoder's first LSTM layer.

The job of the network in the figure above is to extract word-alignment information. Concretely, at each time step it should select from hs the vector corresponding to the word the decoder is outputting; for example, when the decoder outputs "I", it should select from hs the vector of the corresponding source word. We would like to realize this selection through some kind of computation. But there is a problem: selecting one item out of several is an operation that cannot be differentiated.
Neural networks are generally trained with the error backpropagation algorithm. If the network is built from differentiable operations, it can be trained within the framework of error backpropagation; if it relies on non-differentiable operations, error backpropagation basically cannot be applied.
Can the selection operation be replaced with a differentiable one? The idea that solves this problem is surprisingly simple, like the egg of Columbus: hard to come up with at first because it breaks with convention. The idea is, instead of selecting a single item, to select everything at once; we compute a weight representing the importance of each word.

The weights representing the importance of each word (denoted a) are used here. At this point a has the same form as a probability distribution: each element is a scalar between 0.0 and 1.0, and the elements sum to 1. The weighted sum of hs with these importance weights then gives the vector we are after.

We compute the weighted sum of the word vectors. The result is called the context vector and is denoted by the symbol c. Incidentally, looking closely, the weight corresponding to the source word aligned with "I" is 0.8, which means that the context vector c contains many components of that word's vector; this weighted sum largely replaces the operation of "selecting" the vector. If that weight were 1 and the weights of all other words were 0, it would be exactly equivalent to selecting that word's vector. The context vector c contains the information needed for the conversion at the current time step; more precisely, the model has to learn this ability from the data.
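
A minimal sketch of this weighted sum in NumPy (the weight values are made up for illustration):

```python
import numpy as np

T, H = 5, 4                                   # 5 input words, hidden size 4
hs = np.random.randn(T, H)                    # encoder hidden states, one row per word
a = np.array([0.8, 0.1, 0.03, 0.05, 0.02])    # illustrative importance weights, sum to 1

ar = a.reshape(T, 1)                          # broadcast the weights over the H dimension
c = (ar * hs).sum(axis=0)                     # weighted sum -> context vector, shape (H,)
```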

Decoder Improvement 2

Using the weights a that represent the importance of each word, the context vector can be obtained as a weighted sum. So how do we find a? Of course, we do not specify it by hand; we only set things up so that the model can learn it automatically from the data.
Let us look at how the weight a of each word is computed. First, consider the flow from the encoder's processing up to the point where the decoder's first LSTM layer outputs its hidden state vector.

Let h denote the hidden state vector of the decoder's LSTM layer. Our goal is to express numerically how similar this h is to each word vector in hs. There are several ways to do this; here we use the simplest, the vector inner product. Besides the inner product, there are also methods that use a small neural network to output the score.

Here, the similarity between h and each word vector (row) of hs is computed by the vector inner product, and the result is denoted s. This s is the value before normalization, also called the score. We then use the familiar softmax function to normalize s.
The computation graph here consists of a Repeat node, a × node for the element-wise product, a Sum node, and a Softmax layer.
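
A sketch of this computation in plain NumPy; writing the inner products as a single matrix-vector product is equivalent to the Repeat, ×, and Sum nodes in the computation graph:

```python
import numpy as np

def softmax(x):
    x = x - x.max()               # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention_weight(hs, h):
    # hs: encoder hidden states, shape (T, H)
    # h:  decoder hidden state at the current step, shape (H,)
    s = hs.dot(h)                 # scores before normalization, shape (T,)
    a = softmax(s)                # weights between 0 and 1 that sum to 1
    return a
```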

Decoder Improvement 3


The calculation graph for computing the context vector c is shown above. We implement it split into a Weight Sum layer and an Attention Weight layer. To reiterate, the Attention Weight layer attends to each word vector in the encoder output hs and computes the weight a of each word; the Weight Sum layer then computes the weighted sum of a and hs and outputs the context vector c. We call the layer that performs this series of computations the Attention layer.
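
Putting the two parts together, a minimal Attention layer sketch (reusing `attention_weight` from the sketch above):

```python
import numpy as np
# assumes attention_weight(hs, h) from the previous sketch is available

def attention(hs, h):
    # hs: encoder hidden states, shape (T, H)
    # h:  decoder LSTM hidden state at this time step, shape (H,)
    a = attention_weight(hs, h)               # Attention Weight part
    c = (a.reshape(-1, 1) * hs).sum(axis=0)   # Weight Sum part -> context vector (H,)
    return c
```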

The overall structure is as follows

As shown in the figure above, the encoder's output hs is fed into the Attention layer at every time step of the decoder. In addition, the LSTM layer's hidden state vector is fed, together with the Attention layer's output (the context vector), into the Affine layer. We thus improve the decoder by adding this attention information to the decoder from the previous chapter.
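
A hedged sketch of that decoder loop; `lstm_step`, `W`, and `b` are hypothetical placeholders for the decoder's one-step LSTM and the Affine layer's parameters, and `attention` is the function from the sketch above:

```python
import numpy as np

def decode_with_attention(hs, xs_dec, h, c_state, lstm_step, W, b):
    # hs:      all encoder hidden states, shape (T_enc, H)
    # xs_dec:  decoder input word vectors, shape (T_dec, D)
    # h, c_state: decoder LSTM state, initialized from the encoder's last state
    outputs = []
    for x in xs_dec:
        h, c_state = lstm_step(h, c_state, x)
        ctx = attention(hs, h)                 # context vector for this time step
        concat = np.concatenate([ctx, h])      # Affine layer sees both ctx and h
        outputs.append(concat.dot(W) + b)      # scores over the vocabulary
    return np.stack(outputs)
```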

Implementation of seq2seq with attention

Bidirectional LSTM


A bidirectional LSTM adds, on top of the LSTM layer used so far, an LSTM layer that processes the sequence in the opposite direction. The hidden states of the two LSTM layers at each time step are then concatenated and used as the final hidden state vector.

Through this bidirectional processing, the hidden state vector corresponding to each word aggregates information from both directions, so these vectors encode information in a more balanced way. Implementing a bidirectional LSTM is simple: one way is to prepare two LSTM layers and adjust the order of the words fed to each of them. Specifically, one layer receives the input sentence as before, while the other LSTM layer receives the input words from right to left. Concatenating the outputs of the two LSTM layers then yields a bidirectional LSTM layer.
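
One way to sketch this, assuming a hypothetical `run_lstm(xs, h0, c0)` that returns the hidden states at all time steps:

```python
import numpy as np

def bidirectional_lstm(xs, run_lstm, h0, c0):
    hs_fwd = run_lstm(xs, h0, c0)           # left-to-right pass, shape (T, H)
    hs_bwd = run_lstm(xs[::-1], h0, c0)     # same kind of layer, input reversed
    hs_bwd = hs_bwd[::-1]                   # re-align to the original word order
    return np.concatenate([hs_fwd, hs_bwd], axis=1)   # (T, 2H) concatenated states
```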

Deepening of seq2seq
As the problems to be solved become more complex, we want seq2seq with attention to have greater expressive power. In that case we can consider deepening the RNN (LSTM) layers; by stacking more layers we can build models with higher expressive power.

The figure shows a seq2seq with attention that uses 3 LSTM layers. The decoder and the encoder usually use the same number of LSTM layers. There are also many variants in how the Attention layer is used; here, the hidden state of the decoder's LSTM layer is fed into the Attention layer, and the context vector (the Attention layer's output) is passed to multiple layers of the decoder (the LSTM layers and the Affine layer).
Another trick that helps when deepening layers is the residual connection, a simple technique for connecting across layers.

A residual connection "skips" across layers: at the point where the connection joins, the two outputs are added. Note that this addition is, to be precise, an element-wise addition, and it is very important. During backpropagation, addition passes the gradient through unchanged, so the gradient flowing through the residual connection reaches the earlier layers without being weakened. Even when the layers are deepened, gradients can therefore propagate normally, without vanishing or exploding, and learning proceeds smoothly.
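
In code, a residual connection is just an element-wise addition of a layer's input to its output; a minimal sketch:

```python
def residual_block(x, layer):
    # layer(x) and x must have the same shape for the element-wise addition;
    # on the backward pass, the addition passes the upstream gradient through unchanged
    return layer(x) + x
```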

In the time direction, backpropagation through an RNN layer suffers from vanishing or exploding gradients. Vanishing gradients can be addressed with LSTM, GRU and similar gated units, and exploding gradients with gradient clipping. For vanishing gradients in the depth direction, the residual connections introduced here work well.
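
For reference, a simple sketch of gradient clipping by global norm (the threshold handling and epsilon are illustrative):

```python
import numpy as np

def clip_grads(grads, max_norm):
    # grads: list of gradient arrays; rescale them if their global norm is too large
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    rate = max_norm / (total_norm + 1e-6)
    if rate < 1.0:
        grads = [g * rate for g in grads]
    return grads
```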

Transformer

So far we have used RNNs throughout, from the language model to text generation, and from seq2seq to seq2seq with attention. RNNs handle variable-length time-series data well, but they also have shortcomings, one of which concerns parallel processing.
An RNN must be computed step by step, each step depending on the result of the previous time step, so it is essentially impossible to parallelize an RNN in the time direction. This becomes a major bottleneck when deep learning is run on GPUs, which rely on parallel computation, so there is an incentive to avoid RNNs.

The Transformer is the method proposed in the paper "Attention Is All You Need". As the title suggests, the Transformer does not use RNNs; it relies on attention for its processing. Here we take a brief look at the Transformer.

The figure on the left shows conventional attention, and the figure on the right shows self-attention.
Up to now we have used attention to find the correspondence between two time-series, the translation source and the translation target. In the case of the left figure, the two inputs of the Attention layer are different time-series. In contrast, the two inputs of the self-attention on the right are the same time-series. In this way, the correspondences between the elements within a single time-series can be obtained.
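
A bare-bones sketch of self-attention on a single sequence; note that the real Transformer additionally uses learned query/key/value projections, scaling, and multiple heads, which are omitted here:

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)      # numerical stability per row
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def self_attention(xs):
    # xs: one time-series, shape (T, H); both "queries" and "keys/values" are xs itself
    scores = xs.dot(xs.T)                     # similarity of every position with every other
    weights = softmax_rows(scores)            # each row sums to 1
    return weights.dot(xs)                    # weighted sums, shape (T, H)
```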

Transformer's layer structure
The Transformer uses attention instead of RNNs; in fact, both the encoder and the decoder use self-attention. The Feed Forward layer in the figure above is a feed-forward neural network, a fully connected network with ReLU activation. The "Nx" in the figure means that the elements inside the gray background are stacked N times.
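
A minimal sketch of such a position-wise feed-forward layer, with the weights assumed to be given:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # x: (T, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    # the same weights are applied independently at every time step
    hidden = np.maximum(0, x.dot(W1) + b1)    # ReLU activation
    return hidden.dot(W2) + b2
```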

The figure above simplifies the Transformer. In practice, in addition to this architecture, techniques such as residual connections and layer normalization are also used. Other common techniques include running multiple attention heads in parallel (multi-head attention) and encoding the positional information of the time-series data (positional encoding).
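
As one example, the sinusoidal positional encoding from the paper can be sketched as follows (assuming an even d_model):

```python
import numpy as np

def positional_encoding(T, d_model):
    # returns a (T, d_model) matrix added to the input embeddings so that the
    # model, which has no recurrence, still sees word-order information
    pos = np.arange(T)[:, None]                   # positions 0 .. T-1
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe
```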

With the Transformer, the amount of computation can be controlled, and the benefits of GPU parallel computing can be fully exploited.

Summary
In translation, speech recognition and other tasks that convert one time-series into another, there is often a correspondence between the two time-series.
Attention learns the correspondence between two time-series from data.
Attention computes the similarity between vectors with the vector inner product and outputs the weighted sum based on that similarity.
Because the operations used in attention are differentiable, the model can be trained with error backpropagation.
By visualizing the attention weights, we can observe the correspondence between the input and the output.


Origin blog.csdn.net/dream_home8407/article/details/131312979