5 minutes to understand the decoder in the transformer

The decoder and encoder in the transformer are very similar.
First, let us treat the decoder as a standalone block and study its inputs and outputs.

1. Autoregressive (AT)


If the decoder is regarded as a black box, it first accepts a special BEGIN symbol indicating that prediction starts and outputs the first character ("machine"). That output is then fed back into the decoder, which outputs the next character ("device"), and so on, until a special END symbol is produced at the end. This kind of decoding, where the previous output is used as the next input, is called autoregressive (AT).
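The loop below is a minimal sketch of this idea in Python. Here `decoder` is a hypothetical callable (standing in for the trained transformer decoder) that takes the tokens generated so far plus the encoder output and returns a probability distribution over the vocabulary; the BEGIN/END ids and the length limit are placeholders.

```python
# Minimal sketch of autoregressive (AT) greedy decoding, under the assumptions above.
import numpy as np

BEGIN, END = 0, 1          # hypothetical ids for the special symbols
MAX_LEN = 50               # safety limit so the loop always terminates

def greedy_decode(decoder, encoder_output):
    tokens = [BEGIN]                              # start with the special BEGIN symbol
    for _ in range(MAX_LEN):
        probs = decoder(tokens, encoder_output)   # distribution over the vocabulary
        next_token = int(np.argmax(probs))        # greedy: pick the most likely token
        if next_token == END:                     # stop when END is produced
            break
        tokens.append(next_token)                 # feed the output back in as the next input
    return tokens[1:]                             # drop BEGIN, return the generated tokens
```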

2. AT vs NAT

Non-autoregressive (NAT) decoding, as the name suggests, no longer uses the previous output as the current input. This naturally raises a few questions.
1. How is the output length determined in non-autoregressive decoding?
You can use a separate prediction network to predict the output length and then decode the whole sentence at once, or you can always decode up to a maximum length and truncate the sentence at the first END token.
2. What are the advantages of non-autoregressive decoding?
Parallelism: the computation for all positions can be done at once. The input is simply a row of BEGIN tokens, which are fed in together and decoded together (see the sketch after this list). The output length is also directly controllable: feeding in N BEGIN tokens produces N output tokens, so in speech synthesis, for example, you can double the number of BEGIN tokens to slow down the generated speech.
3. NAT usually performs worse than AT. I think this is fairly intuitive: without conditioning on the tokens generated so far, the quality naturally suffers.
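For contrast, here is a sketch of the NAT idea under the same hypothetical interface: `parallel_decoder` takes the whole row of BEGIN tokens in one call and returns one distribution per position, and `out_len` would come either from a length-prediction network or simply be a maximum length, with truncation at END.

```python
# Sketch of non-autoregressive (NAT) decoding: one parallel call instead of a loop.
import numpy as np

BEGIN, END = 0, 1

def nat_decode(parallel_decoder, encoder_output, out_len):
    begins = [BEGIN] * out_len                         # out_len BEGIN tokens in, out_len tokens out
    probs = parallel_decoder(begins, encoder_output)   # shape (out_len, vocab), computed in one pass
    tokens = [int(np.argmax(p)) for p in probs]
    if END in tokens:                                  # alternative to length prediction:
        tokens = tokens[:tokens.index(END)]            # decode to a max length, truncate at END
    return tokens
```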

In fact, the decoder is not a standalone block; it is connected to the encoder. Next, let us look at how they are connected.

3. Connection with encoder

In the architecture diagram, two arrows come from the encoder into this block, and one arrow comes from the decoder's own input. The details are as follows: the two arrows from the encoder carry the key and value vectors, and the single arrow from the decoder carries the query.
The output of the encoder stack (six encoder layers in the paper) provides the keys $k$ and values $v$, while the decoder's masked self-attention provides the queries $q_i$. Each $q_i$ is compared with the keys by dot-product similarity, and the resulting weights are used to take a weighted sum of the values, giving an output vector $v_i$ for each position. To put it simply, self-attention with a mask means that each position only looks at the earlier information and not the later information. Of course, it is not strictly necessary to have six encoder layers feeding into the decoder; this choice is flexible, it is simply what the original paper does.
After obtaining these vectors $v_i$, they are passed through a fully connected network and become the input to the next self-attention layer.
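The following numpy sketch shows the two attention steps just described for a single head, with the learned projection matrices omitted: masked self-attention inside the decoder produces the queries, and cross-attention then compares those queries against keys and values taken from the encoder output. The shapes and random tensors are only illustrative.

```python
# Single-head scaled dot-product attention, used first with a causal mask
# (masked self-attention) and then for decoder-encoder cross-attention.
import numpy as np

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])            # dot-product similarity
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ v                                 # weighted sum of the values

T, S, d = 4, 6, 8                                      # decoder length, encoder length, model dim
dec_in = np.random.randn(T, d)                         # decoder-side vectors
enc_out = np.random.randn(S, d)                        # encoder output (supplies keys and values)

causal = np.tril(np.ones((T, T), dtype=bool))          # each position sees only earlier positions
q = attention(dec_in, dec_in, dec_in, mask=causal)     # masked self-attention -> queries q_i
out = attention(q, enc_out, enc_out)                   # cross-attention: q vs encoder k, v -> v_i
```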

To sum up, take speech-to-text as an example. The transformer takes a speech signal, converts it into vector representations, encodes these vectors with the encoder (feature extraction), and then runs the decoder step by step, outputting one token at a time; the concatenation of these tokens is the final sentence.
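As a rough illustration of this pipeline (not the paper's exact configuration), the sketch below wires PyTorch's built-in transformer modules into the same encode-then-decode loop. The dimensions, vocabulary size, BEGIN/END ids, and the random tensor standing in for vectorized speech features are placeholders, and positional encoding is omitted for brevity.

```python
# Encode once, then decode autoregressively with a causal (subsequent-position) mask.
import torch
import torch.nn as nn

d_model, vocab, max_len = 256, 1000, 50
BEGIN, END = 0, 1

enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)   # six layers, as in the paper
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
embed = nn.Embedding(vocab, d_model)
to_vocab = nn.Linear(d_model, vocab)

speech_vectors = torch.randn(1, 80, d_model)   # stand-in for vectorized speech frames
memory = encoder(speech_vectors)               # encoder output: keys/values for cross-attention

tokens = [BEGIN]
with torch.no_grad():
    for _ in range(max_len):                   # autoregressive loop, one token per step
        tgt = embed(torch.tensor([tokens]))
        n = len(tokens)                        # boolean causal mask: True = not allowed to attend
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        out = decoder(tgt, memory, tgt_mask=mask)
        next_tok = int(to_vocab(out[:, -1]).argmax(-1))
        if next_tok == END:
            break
        tokens.append(next_tok)
print(tokens[1:])                              # the decoded sentence (token ids)
```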


Origin blog.csdn.net/xiufan1/article/details/122571920