Understanding Transformer details

  1. What is d_model
    d_model is the dimension of the word vector obtained after the one-hot token vector passes through word embedding (512 in the paper).
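
A minimal sketch of what this means in PyTorch, with an assumed vocabulary size of 10000 and d_model = 512: each token index (the one-hot position) is mapped to a dense d_model-dimensional vector.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512           # vocabulary size is assumed; d_model = 512 as in the paper
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7]])     # a batch of one sequence with three token indices
word_vectors = embedding(token_ids)        # shape (1, 3, 512): each token becomes a d_model vector
print(word_vectors.shape)                  # torch.Size([1, 3, 512])
```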

  2. The difference between batch normalization and layer normalization
    is that batch normalization normalizes each feature across the samples of a batch (along the batch dimension), while layer normalization normalizes all the features of a single sample (along the feature dimension of the input vector).
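
A small sketch contrasting the two on a (batch, features) tensor with assumed shapes: the only difference shown is the dimension over which the mean and variance are computed.

```python
import torch

x = torch.randn(4, 8)    # (batch_size=4, num_features=8), shapes assumed for illustration

# Batch normalization: statistics per feature, computed across the batch dimension
bn_mean, bn_var = x.mean(dim=0), x.var(dim=0, unbiased=False)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

# Layer normalization: statistics per sample, computed across the feature dimension
ln_mean, ln_var = x.mean(dim=1, keepdim=True), x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)
```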

  3. Why padding?
    Because the input sequences in a batch have different lengths, which the network cannot handle directly, a fixed length max_seq_len is used and sequences shorter than this length are padded with 0.
    So that the padded positions do not affect attention, a padding mask adds a very large negative number (negative infinity) to the attention scores at these positions, so that after the softmax their probability is close to 0!
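
A sketch of this padding mask, assuming the pad positions use token id 0: the mask marks the padded key positions, whose attention scores are set to negative infinity before the softmax so that their probability is close to 0.

```python
import torch

PAD = 0                                            # assumed padding token id
token_ids = torch.tensor([[5, 42, 7, PAD, PAD]])   # one sequence padded to max_seq_len = 5
pad_mask = token_ids.eq(PAD)                       # True where the position is padding

scores = torch.randn(1, 5, 5)                      # toy attention scores (batch, query, key)
scores = scores.masked_fill(pad_mask.unsqueeze(1), float('-inf'))  # hide padded key positions
weights = torch.softmax(scores, dim=-1)            # padded positions get ~0 probability
```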

  4. The role of the Sequence mask
    is that the decoder's self-attention must not see future information; in other words, the decoder output at time t should depend only on the outputs at and before time t. The sequence mask is used to hide the subsequent positions.
    The specific method is to generate a triangular matrix in which the lower triangle is all 1 and the upper triangle is all 0, with the diagonal set to 1 because the output at the current moment also depends on the current moment; positions where the mask is 0 are then set to negative infinity before the softmax.
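
A sketch of this sequence mask using torch.tril (1 = visible, 0 = hidden); the hidden positions are again filled with negative infinity before the softmax.

```python
import torch

seq_len = 5
# 1 on and below the diagonal (current and earlier positions are visible), 0 above it (the future is hidden)
seq_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(1, seq_len, seq_len)                  # toy decoder self-attention scores
scores = scores.masked_fill(seq_mask == 0, float('-inf'))  # hide future positions
weights = torch.softmax(scores, dim=-1)                    # step t attends only to steps <= t
```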

  5. The role of Positional Embedding
    is to encode the positions of the input sequence so that the transformer can extract information about the order of the tokens. The best encoding is still a matter of study, and the sinusoidal one used in the paper is not necessarily optimal. The resulting Positional Embedding vector is added directly to the word embedding.
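
A sketch of the sinusoidal positional encoding described in the paper (sine on even dimensions, cosine on odd ones); the resulting matrix is added element-wise to the word embeddings, and the shapes here are assumed.

```python
import math
import torch

def sinusoidal_positional_encoding(max_seq_len, d_model):
    position = torch.arange(max_seq_len).unsqueeze(1).float()        # (max_seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_seq_len=100, d_model=512)
# usage: x = word_embedding + pe[:seq_len]
```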

  6. Self attention and encoder-decoder attention
    can be understood literally: self attention uses your own query with your own key-value, while encoder-decoder attention uses your own query with someone else's key-value; in the transformer, it refers to the attention between the decoder and the encoder output.
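
A small sketch with torch.nn.MultiheadAttention showing that the difference lies only in where the query and key-value come from (layer sizes are assumed; a real decoder layer uses separate modules for the two attentions).

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # reused here only for illustration

encoder_out = torch.randn(1, 10, d_model)   # encoder output, (batch, src_len, d_model)
decoder_x = torch.randn(1, 6, d_model)      # decoder hidden states, (batch, tgt_len, d_model)

self_out, _ = attn(decoder_x, decoder_x, decoder_x)       # self attention: Q, K, V all from yourself
cross_out, _ = attn(decoder_x, encoder_out, encoder_out)  # encoder-decoder attention: Q from the decoder, K and V from the encoder
```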

  7. What is scaled dot-product attention?
    There are three points here:

  • The first is attention: you need the Q, K, V matrices;
  • Then scaled: the scores $QK^T$ are divided by $\sqrt{d_k}$, where $d_k$ is the dimension of K (64 per head in the paper; without the multi-head mechanism it would be 64*8 = 512). The reason for this scaling factor is that when $d_k$ is very large, the dot products become very large in magnitude, which pushes the softmax into the region where its gradient is very small;
  • The last is dot-product, indicating multiplicative attention (there are many kinds of attention, the common ones being additive and multiplicative), rather than additive attention, in which Q and K are concatenated.

The general description of the formula is:
$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
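
A minimal implementation of the formula above (masks omitted; the shapes are assumed to be (batch, seq_len, d_k)).

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)

Q = K = V = torch.randn(2, 5, 64)               # batch=2, seq_len=5, d_k=64 as in the paper
out = scaled_dot_product_attention(Q, K, V)     # shape (2, 5, 64)
```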

  8. What are the dimensions of Q, K, V without the multi-head mechanism?
    First of all, the inputs are processed in batches, so the first dimension must be the batch size B; the other two dimensions are the sequence length and d_model, so each of Q, K, V has shape (B, max_seq_len, d_model).
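
A shape check under those assumptions (B = 2, max_seq_len = 5, d_model = 512), using one linear projection per matrix:

```python
import torch
import torch.nn as nn

B, max_seq_len, d_model = 2, 5, 512          # batch size and sequence length are assumed
x = torch.randn(B, max_seq_len, d_model)     # embedded input

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)             # each has shape (B, max_seq_len, d_model) = (2, 5, 512)
```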

  9. The role of the residual connection
    is that adding the term x means that when the gradient with respect to x is computed through the layer, there is an additional constant term 1. Therefore, during backpropagation, the chain of gradient multiplications will not cause the gradient to vanish!
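
A minimal residual-connection sketch: with y = x + F(x), the derivative of y with respect to x is 1 + dF/dx, which is the extra constant term described above (the sublayer here is just a placeholder linear layer).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + sublayer(x): the '+ x' contributes a constant 1 to the gradient of y w.r.t. x."""
    def __init__(self, d_model=512):
        super().__init__()
        self.sublayer = nn.Linear(d_model, d_model)   # stand-in for an attention or feed-forward sublayer

    def forward(self, x):
        return x + self.sublayer(x)

x = torch.randn(2, 5, 512)
y = ResidualBlock()(x)    # same shape as x; gradients also flow through the identity path unchanged
```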

Origin blog.csdn.net/weixin_43335465/article/details/121255485