A preliminary study on transformers

Most QA problems can be framed as seq2seq problems; in fact, most NLP problems can be cast into seq2seq form.

That said, the best results usually come from training a model designed for the specific problem.

Overview

The Transformer is a seq2seq model.

Let's first look at the general framework of a seq2seq model (which is, in essence, an encoder plus a decoder):

image-20211110011108154
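To make the division of labor concrete, here is a minimal sketch of the seq2seq interface (not real Transformer code; the `encode` and `decode` functions below are hypothetical placeholders): the encoder turns the input sequence into a sequence of vectors, and the decoder turns those vectors into the output sequence.

```python
from typing import List

def encode(input_tokens: List[str]) -> List[List[float]]:
    """Hypothetical encoder: maps each input token to a vector.
    A real Transformer encoder would use self-attention here."""
    return [[float(len(tok))] for tok in input_tokens]   # dummy one-dimensional "embeddings"

def decode(memory: List[List[float]]) -> List[str]:
    """Hypothetical decoder: produces output tokens from the encoder's vectors.
    A real Transformer decoder would attend over `memory` step by step."""
    return [f"out_{i}" for i in range(len(memory))]      # dummy output sequence

# seq2seq = run the encoder, then let the decoder read its output
memory = encode(["machine", "learning"])
print(decode(memory))   # ['out_0', 'out_1']
```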

Encoder

Let's look at the encoder part first (and how it fits into the Transformer):

image-20211110011238315

Let's open up the encoder and see what it looks like inside. The picture below shows the inside of the encoder:

image-20211110011445224

As can be seen, the encoder is composed of several Block modules, and these Blocks are what the Transformer's encoder is built from.

**Note:** In the original model, the self-attention inside each Block is not this simple; the actual implementation is more involved, as shown in the following figure:

image-20211110012024616

The encoder part of the Transformer can then be redrawn as follows (the Add & Norm parts mentioned in the figure above are a residual connection followed by layer normalization):

(image unavailable)
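As a rough sketch of what one encoder Block computes under this "Add & Norm" design (self-attention, a residual connection plus layer normalization, a position-wise feed-forward network, then another Add & Norm), here is a single-head NumPy version; the weight matrices are random placeholders rather than trained parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's vector to zero mean / unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V

def encoder_block(x, params):
    # 1) self-attention, then Add (residual) & Norm
    a = self_attention(x, params["Wq"], params["Wk"], params["Wv"])
    x = layer_norm(x + a)
    # 2) position-wise feed-forward network, then Add & Norm
    f = np.maximum(0, x @ params["W1"]) @ params["W2"]   # ReLU MLP
    return layer_norm(x + f)

d = 8
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(d, d)) for k in ["Wq", "Wk", "Wv", "W1", "W2"]}
x = rng.normal(size=(5, d))            # 5 input positions, dimension 8
print(encoder_block(x, params).shape)  # (5, 8)
```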

Going further: it is worth learning about some small variations on this model.

image-20211110012251512

Decoder

The decoder comes in two main forms; the first is Autoregressive.

Autoregressive

First, an animation:

https://vdn3.vzuu.com/SD/cf255d34-ec82-11ea-acfd-5ab503a75443.mp4?disable_local_cache=1&auth_key=1636876035-0-0-1b8200d56431047742a6772de99b7384&f=mp4&bu=pico&expiration=1636876035&v=tx

Reference: https://www.zhihu.com/question/337886108/answer/893002189

A brief introduction

Let's look directly at what the result looks like, that is, how the decoder produces its output from the encoder's output. Here we take translation as an example:

image-20211110215515474

The Begin token marks the start of the sequence, which makes it easy for the decoder to know where to start.

Next, let's see what the output is:

image-20211111172332837

Depending on the language, the output is a probability vector over a vocabulary (for Chinese we might use, say, 2000 common characters; for English, the output could range over the 26 letters or over common words; it depends on the situation).
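In other words, at each step the decoder scores every entry in the vocabulary, and a softmax turns those scores into probabilities; in the simplest (greedy) case, the most probable entry is taken as the output. A tiny sketch with a made-up vocabulary and made-up scores:

```python
import numpy as np

vocab = ["机", "器", "学", "习", "深"]              # toy vocabulary (made up)
logits = np.array([2.0, 0.5, 3.1, 1.2, -1.0])      # made-up scores from the decoder

probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax: a distribution over the vocabulary
print(dict(zip(vocab, probs.round(3))))
print("greedy output:", vocab[int(np.argmax(probs))])   # pick the most probable entry
```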


Next, let's look at how the output of the next step is obtained: the previous output is used as the input of the next step (that is, the current input is the previous step's output):

image-20211111172612970

There is a problem with this approach: if the translation at one step is wrong, won't every subsequent step be affected as well, i.e., one mistake leads to a cascade of mistakes? How do we deal with this?

Let's set this issue aside for now and come back to it later.


Internal structure

Then let's see what the inside of this Decoder looks like:

(image unavailable)

Comparison with the Encoder

image-20211111175849366

From the figure above, the difference between the Decoder and the Encoder is the extra shaded block, and the first multi-head attention layer has the word Masked added to it. So what is that?

Masked

The following picture shows the Self-attention mechanism we discussed earlier:

image-20211111180212444

And here is the so-called Masked Self-attention:

(image unavailable)

As the figure shows, the difference from ordinary Self-attention is that Masked Self-attention only looks at the current and earlier positions, instead of attending over all of the inputs at once the way Self-attention does.

To be more specific, let's take a look at how it is calculated:

This is the original computation; we need to remove the boxed entries below:

image-20211111181027388

It now looks like this:

image-20211111181103063


So the question is: why do this? Why add a mask?

This can be understood from how the Decoder actually runs.

During decoding, outputs are produced one at a time, and each output is fed back in as part of the next step's input; this means the current output can only depend on what has already been generated, and must not look at tokens that have not been produced yet.
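A small NumPy sketch of the masking itself: the attention scores for "future" positions (the boxed entries in the figure above) are set to negative infinity before the softmax, so each position ends up attending only to itself and to earlier positions. The Q and K matrices here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # 4 positions, dimension 8
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)    # raw attention scores, shape (T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = "future" positions
scores[mask] = -np.inf           # the boxed entries from the figure are removed here

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax row by row
print(weights.round(2))          # lower-triangular: each step only sees earlier steps
```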

How to stop

Now let's think about another question: how does the output know when to stop?

Because the decoder feeds each output back in as the next input, if the output never stops, the input never stops either, and the model would keep generating forever:

(image unavailable)

The solution is to add an end symbol to the output vocabulary: a "Stop Token".

image-20211111181910958

With this token, the stopping problem is handled neatly: the model assigns a probability to the Stop Token at every step, and when the Stop Token is chosen, the output ends.

(image unavailable)
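Putting the pieces together, here is a minimal greedy decoding loop. The `decoder_step` function is a hypothetical stand-in for the real decoder (it just returns a faked probability distribution over a toy vocabulary); decoding stops when the Stop Token comes out, with a maximum-length cap as a safety net:

```python
VOCAB = ["机", "器", "学", "习", "<STOP>"]           # toy vocabulary; <STOP> is the stop token

def decoder_step(encoder_output, generated):
    """Hypothetical decoder step: returns a probability for each word in VOCAB.
    Faked here so that it spells out '机器学习' and then stops."""
    script = ["机", "器", "学", "习", "<STOP>"]
    wanted = script[min(len(generated) - 1, len(script) - 1)]
    return [0.9 if w == wanted else 0.025 for w in VOCAB]

def greedy_decode(encoder_output, max_len=20):
    generated = ["<BEGIN>"]                          # start from the Begin token
    while len(generated) < max_len:                  # safety cap on length
        probs = decoder_step(encoder_output, generated)
        next_word = VOCAB[probs.index(max(probs))]   # greedy: take the most probable word
        if next_word == "<STOP>":                    # the Stop Token ends generation
            break
        generated.append(next_word)                  # previous output becomes the next input
    return generated[1:]                             # drop the Begin token

print(greedy_decode(encoder_output=None))            # ['机', '器', '学', '习']
```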

Non-Autoregressive

First, a picture showing the difference between Autoregressive and Non-Autoregressive:

(image unavailable)

As the figure shows, the Autoregressive decoder outputs tokens one after another, while the Non-Autoregressive (NAT) decoder takes all of its inputs at once and produces the whole output in one shot.

So how does NAT determine the length of the output? There are two methods:

  • Feed the Decoder's input into a separate predictor and let that predictor decide the final output length;
  • Feed everything into the Decoder, then look for a "Stop Token" in the output; if one appears, it and everything after it are discarded (see the sketch below).
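Here is a minimal sketch of the second approach: pretend the Decoder has predicted every position in parallel, then discard everything from the first Stop Token onward. The `outputs` list is a faked stand-in for the parallel predictions:

```python
def nat_decode(encoder_output):
    """Hypothetical non-autoregressive decode: all positions are predicted at once
    (faked here with a fixed list), then the first <STOP> and everything after it are dropped."""
    outputs = ["机", "器", "学", "习", "<STOP>", "习", "习"]   # pretend parallel predictions
    if "<STOP>" in outputs:
        outputs = outputs[:outputs.index("<STOP>")]            # keep only what precedes the stop token
    return outputs

print(nat_decode(encoder_output=None))   # ['机', '器', '学', '习']
```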

So what are the advantages of NAT?

  • Can be calculated in parallel;
  • You can control the length of the output;

In addition, NAT is now a popular research direction, so I won't elaborate here.

Encoder and Decoder Connections

At this point we can look at how the two are connected. First, the connection between the decoder and the encoder from a macro perspective:

image-20211111191132605

As can be seen from the figure above, the output of the encoder is used as part of the decoder's input in the second stage.

Next, let's look at how Cross attention works:

The figure below shows the first step: the decoder receives a Begin Token and outputs the first word:

image-20211111192539255

Next is the second output:

image-20211111193121693

As an aside, Cross Attention actually appeared long before Self-attention and had already been used in practice.
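A minimal NumPy sketch of cross attention as described above: the queries come from the decoder's states, while the keys and values come from the encoder's output, which is how the decoder pulls information out of the encoder. All matrices here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
enc_out = rng.normal(size=(6, d))     # encoder output: 6 source positions
dec_state = rng.normal(size=(3, d))   # decoder states: 3 target positions generated so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = dec_state @ Wq                    # queries come from the decoder
K, V = enc_out @ Wk, enc_out @ Wv     # keys and values come from the encoder

scores = Q @ K.T / np.sqrt(d)         # (3, 6): each target position attends over the source
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V                 # (3, 8): information pulled from the encoder
print(context.shape)
```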

A follow-up question

As mentioned earlier, the Decoder uses its previous output as its next input, so there is a problem: one wrong step can throw off every step that follows:

image-20211111235516449

The fix, shown in the figure above, is to give the model not only correct inputs during training but also some deliberately wrong ones, so that it learns how to recover from mistakes.

The following paper is the basis for this method (Scheduled Sampling):

image-20211111235906211
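A rough sketch of the idea (not the paper's exact algorithm): when building the decoder's training inputs, with some probability feed the model's own, possibly wrong, previous prediction instead of the ground-truth token, so the model learns to recover from its own mistakes. `model_predict` and the sampling probability are illustrative placeholders:

```python
import random

def model_predict(prev_token):
    """Hypothetical model prediction for the next token; faked with a random (possibly wrong) word."""
    return random.choice(["机", "器", "学", "习", "惯"])

def make_decoder_inputs(ground_truth, sample_prob=0.3):
    """Build the decoder's input sequence for one training example.
    With probability `sample_prob`, feed the model's own prediction instead of the true token."""
    inputs = ["<BEGIN>"]
    for t in range(len(ground_truth) - 1):
        if random.random() < sample_prob:
            inputs.append(model_predict(inputs[-1]))   # model's own (possibly wrong) output
        else:
            inputs.append(ground_truth[t])             # teacher forcing: the correct token
    return inputs

print(make_decoder_inputs(["机", "器", "学", "习"]))
```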

Additional materials

  • First, I have to mention the most classic and, in my opinion, best blog post I have seen so far: "The Illustrated Transformer". The title alone may not ring a bell, but you have probably seen its figures, because they show up in many places.

  I personally feel that once you understand that blog post, you pretty much understand the Transformer. I also have a [PPT in Chinese](https://pan.baidu.com/s/1LovEFd4Fswwk0jr8wIKkvA?pwd=ajh9) (extraction code: ajh9) that illustrates the blog's material well.

  • There is also HuggingFace's introduction to how Transformers work. Its content is also good and very concise, and it is available in Chinese.


Origin: blog.csdn.net/c___c18/article/details/131154180