Transformer model architecture analysis

Reference:
https://www.bilibili.com/video/BV1UL411g7aX/?spm_id_from=333.880.my_history.page.click&vd_source=de203b26ba8599fca1d56a5ac83a051c
1. What is a Transformer
[Figure: overall Transformer architecture]
    The Transformer is different from RNNs and CNNs: its entire network structure is composed of the attention mechanism and feed-forward neural networks.
    The Transformer in the figure above can be seen as a Seq2Seq model built on "self-attention" (as in the original paper).

2. Detailed analysis of the Transformer
(1) Encoder part: data input

Input Embedding: turns each input word into a word vector.
For example: I (-1.2188, 1.1676, -1.0574, -0.1188)
    is (-0.9078, 0.3452, -0.5713, -0.2351)
    good (1.0076, -0.7529, -0.225, -0.4327)
    (Four dimensions are used only for ease of understanding; the actual dimension is much higher.)

Positional Encoding:
    Positional Encoding is a vector with the same dimension as the input; its role is to use small values to mark each word vector's position. (In a traditional RNN the word vectors of a sentence are processed one by one, so order is implicit, whereas in the Transformer all words of a sentence are input together.) Specifically, sine and cosine functions are applied to the even and odd dimensions respectively. The Input Embedding and the Positional Encoding are added together, producing an output vector of the same dimension.
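    The sine/cosine scheme above can be sketched in a few lines of NumPy (a minimal illustration, not the full implementation; d_model = 4 and the embedding values reuse the toy example):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) word positions
    i = np.arange(d_model)[None, :]          # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])     # sine on even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])     # cosine on odd dimensions
    return pe

embeddings = np.array([[-1.2188, 1.1676, -1.0574, -0.1188],   # "I"
                       [-0.9078, 0.3452, -0.5713, -0.2351],   # "is"
                       [ 1.0076, -0.7529, -0.225,  -0.4327]]) # "good"
x = embeddings + positional_encoding(3, 4)   # same 3x4 shape as the input
```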

Multi-Head Attention:
    This is the attention mechanism. "Multi-Head" refers to multiple self-attention heads. There are two heads here, so the input word vectors must be split into two parts, specifically:
    I (-1.2188, 1.1676, -1.0574, -0.1188)
    is (-0.9078, 0.3452, -0.5713, -0.2351)
    good (1.0076, -0.7529, -0.225, -0.4327)
are split into (-1.2188, 1.1676), (-0.9078, 0.3452), (1.0076, -0.7529)
and (-1.0574, -0.1188), (-0.5713, -0.2351), (-0.225, -0.4327),
i.e. the data is cut along the feature dimension.
    After splitting, each head's data is copied into three identical pieces, which are passed in simultaneously as the "value", "key", and "query".
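    As a minimal sketch of this splitting step (assuming the toy 3x4 input from above):

```python
import numpy as np

x = np.array([[-1.2188, 1.1676, -1.0574, -0.1188],
              [-0.9078, 0.3452, -0.5713, -0.2351],
              [ 1.0076, -0.7529, -0.225,  -0.4327]])  # 3 words x 4 dims

heads = np.split(x, 2, axis=-1)       # two 3x2 matrices, one per head
for h in heads:
    q, k, v = h, h, h                 # the same data passed in three ways
    print(h.shape)                    # (3, 2)
```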

[Figure: Multi-Head Attention structure]

    After the "value", "key", and "query" three kinds of data are passed in, the first step is to enter a linear layer respectively, and then pass through a Scaled Dot-Product Attention module. The specific structure of this module here is:
[Figure: Scaled Dot-Product Attention structure]
    The sizes of "value", "key", and "query" are 3x2, 3x2, and 3x2 respectively. (Follow this example which I'm fine with). MatMul first performs matrix multiplication of query and key matrices to obtain a 3x3 matrix. After the Scale module, its effect is dmodel of all matrix elements/root signs. dmodel is the length of the word vector, which is 4 in this example. After softMax, perform matrix multiplication with value. to get a 3x2 matrix.
    Re-concatenate the matrices computed by the two heads into a 3x4 matrix, then pass the result through a linear layer to produce the output.
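    A minimal NumPy sketch of one Multi-Head Attention pass on the toy input, assuming identity projections in place of the learned linear layers (a real module learns separate weight matrices for query, key, value, and the output):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]                         # per-head dimension (2 here)
    scores = q @ k.T / np.sqrt(d_k)           # MatMul + Scale -> 3x3
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # SoftMax, rows sum to 1
    return w @ v                              # MatMul with value -> 3x2

x = np.array([[-1.2188, 1.1676, -1.0574, -0.1188],
              [-0.9078, 0.3452, -0.5713, -0.2351],
              [ 1.0076, -0.7529, -0.225,  -0.4327]])

head_outputs = [scaled_dot_product_attention(h, h, h)
                for h in np.split(x, 2, axis=-1)]   # two 3x2 results
out = np.concatenate(head_outputs, axis=-1)         # concat back to 3x4
```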
    The Add&Norm module first adds the vector after Attention to the initial vector without Attention, then passes through a layer norm module, and finally outputs a 4x3 data (that is, the size of the matrix remains unchanged, and there are also trainable parameters in this module) .
    The next module, Feed Forward, is a feed-forward network: a combination of linear layers and activation layers. It is followed by another Add & Norm module.
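    A minimal sketch of the position-wise feed-forward network, with a hidden size of 8 assumed purely for illustration (the paper uses a much larger hidden layer):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # expand d_model=4 -> hidden=8
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)   # project back to d_model=4

def feed_forward(x):
    # ReLU(x W1 + b1) W2 + b2, applied to each word vector independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```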
    So the final output of the Encoder is a 3x4 matrix.

(2) Decoder part
    First of all, note that in this model a sample consists of a pair of sequences, {source sentence, target sentence}; in this example the source is the three-word sentence embedded above and the target is "I am fine". In principle the decoder outputs starting from an initial token and proceeds word by word, but during training the whole target can also be arranged as a matrix and input together. For details, see the description of the decoder at https://blog.csdn.net/anshiquanshu/article/details/112384896.
    Feed in the starting token -----> predict the first word (cross-entropy loss against the first target word) -----> feed in the token plus the first word -----> predict the second word (cross-entropy loss against the second target word) -----> ...
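    A minimal sketch of this training scheme (teacher forcing); `<sos>` is an assumed name for the starting token:

```python
target = ["I", "am", "fine"]              # toy target sentence from the example
decoder_input = ["<sos>"] + target[:-1]   # target shifted right, start token first
labels = target                           # one cross-entropy term per position
for inp, lab in zip(decoder_input, labels):
    # (each position also attends to all earlier positions via masked attention)
    print(f"feed {inp!r} -> predict {lab!r}")
```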
[Figure: Masked Multi-Head Attention in the Decoder]
    This part is similar to the first stage of the Encoder, except that the attention here is Masked Attention. The mask is a sequence mask, used only in the Decoder. Its purpose is to prevent the decoder from seeing later positions: decoding at time t may rely only on the outputs at and before t, not on those after t, so the later information must be hidden. The specific method is shown in the figure below.
[Figure: masking out future positions in the attention matrix]
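    A minimal sketch of the sequence mask: positions after t are set to -inf before SoftMax, so their attention weights become 0:

```python
import numpy as np

seq_len = 3
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)  # True above diagonal
scores = np.zeros((seq_len, seq_len))    # toy attention scores before masking
scores[mask] = -np.inf                   # hide every position after t
print(scores)
# [[  0. -inf -inf]
#  [  0.   0. -inf]
#  [  0.   0.   0.]]
```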
    Here, the data from the decoder is used as the query, and the encoder's output is used as the value and key. After this Multi-Head Attention come an Add & Norm module and a feed-forward network, and finally a linear layer and a softmax layer output the prediction results.
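    A minimal single-head sketch of this encoder-decoder attention, with random toy data and identity projections for brevity:

```python
import numpy as np

encoder_out = np.random.default_rng(1).normal(size=(3, 4))  # key and value source
decoder_x = np.random.default_rng(2).normal(size=(3, 4))    # query source

d_k = encoder_out.shape[-1]
scores = decoder_x @ encoder_out.T / np.sqrt(d_k)           # query x key^T, scaled
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)                          # SoftMax
context = w @ encoder_out                                   # weighted sum of values
```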
