Understanding the Transformer encoder in 5 minutes


This article covers only the structure of the network, not its training.
The original Transformer stacks 6 encoder layers and 6 decoder layers.

1. Self-attention

Skipping single-head self-attention, multi-head simply means there is more than one set of q, k, v vectors; the figure shows a self-attention with two heads.
So why not single-head attention? Perhaps because each pair of q and k captures a different kind of correlation, so combining several correlations makes the computation more robust.
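To make the multi-head computation concrete, here is a minimal NumPy sketch of multi-head self-attention. The dimensions, weight matrices, and function names are illustrative assumptions, not values from the article or its figures.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project the input into queries, keys and values, then split into heads.
    q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Each head computes its own correlations via scaled dot-product attention.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                          # (n_heads, seq_len, d_head)

    # Concatenate the heads and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Example: a sequence of 4 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (4, 8)
```

Each head sees the whole sequence and computes its own set of attention weights; the output projection Wo then mixes what the different heads found.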
Positional encoding: to be clear, the a vectors here are the already-embedded inputs. If the inputs were fed in one at a time over time, this positional information would not be necessary, but because of the particular nature of self-attention, it must be added. That particularity is that self-attention treats every position equally: when computing correlations, the first word and the last word do not receive a smaller weight just because they are far apart. This operation effectively discards the original position information, so it is added back in here.
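As a concrete example of adding position information back in, below is a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper; the shapes and names (seq_len, d_model) are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe

# The encoding is simply added to the embedded vectors, restoring the position
# information that self-attention by itself would otherwise ignore.
embedded = np.random.default_rng(0).normal(size=(4, 8))
x = embedded + sinusoidal_positional_encoding(4, 8)
print(x.shape)  # (4, 8)
```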

2. Encoder

An encoder, as the name suggests, encodes (transforms) the input into vectors that a machine can learn from easily. The Transformer authors' view here is that, for example, an input speech signal becomes easier for the machine to learn after passing through the 6 encoder layers.
1. The input passes through an embedding layer, which converts the speech signal or text into vector form.
2. Add the positional encoding: the embedded vectors carry no position information, so it is added here.
3. Pass through the multi-head attention layer (see the sketch above).
4. Add a residual connection to the output of the previous step, then apply layer normalization (over all the features of one sample).
5. Pass through the MLP (feed-forward) layer.
6. Add a residual connection to the output of the previous step, then apply layer normalization again.

Of the above steps, the first two are done once, while steps 3-6 form one encoder block that is stacked six times to obtain the output. That output is again a set of vectors, i.e. a sequence, but one from which certain features have been extracted, which may make it easier for the decoder that follows to work with.
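Continuing the NumPy sketch from the self-attention section (and re-using the multi_head_self_attention function defined there), here is a minimal sketch of one encoder block, steps 3-6 above, stacked six times; the weight shapes, initialization, and hidden width are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over all features of one sample (one row), as in step 4.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, p):
    # Step 3: multi-head attention (function defined in the earlier sketch).
    attn = multi_head_self_attention(x, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads=2)
    # Step 4: residual connection + layer normalization.
    x = layer_norm(x + attn)
    # Step 5: MLP (feed-forward) layer with a ReLU in between.
    mlp = np.maximum(0, x @ p["W1"]) @ p["W2"]
    # Step 6: residual connection + layer normalization again.
    return layer_norm(x + mlp)

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.normal(size=(seq_len, d_model))  # embedded input + positional encoding (steps 1-2)

# Steps 1-2 happen once; the encoder block itself is stacked six times.
for _ in range(6):
    p = {name: rng.normal(size=(d_model, d_model)) * 0.1
         for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"] = rng.normal(size=(d_model, d_ff)) * 0.1
    p["W2"] = rng.normal(size=(d_ff, d_model)) * 0.1
    x = encoder_block(x, p)

print(x.shape)  # (4, 8): still a sequence of vectors, now with features extracted
```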
