Task04: Machine translation and related techniques; attention mechanism and seq2seq model; Transformer

  1. Suppose we want to do English translation, for example translating the Chinese sentence 我是中国人 into "i am Chinese". We find that there are five input characters but only three output English words, so the input length and the output length are not equal. To handle this we introduce the encoder-decoder model: the encoder first encodes the input 我是中国人 into an intermediate representation, and this encoded representation is then fed into the decoder for decoding. Both the encoder and the decoder are usually recurrent neural networks.
  2. Of course, to let the machine know where a sentence ends, we append a special token <eos> (end of sequence) to every sentence so that the computer can recognize the boundary. During training we also usually add a <bos> (begin of sequence) token as the first input of the decoder to mark the start of prediction.
  3. Meanwhile, to keep every sentence the same length, we fix a predetermined length in advance; if a sentence is shorter than this length, we pad it until it reaches the predetermined length.
  4. As the input to the decoder, we generally use the context vector c = q(h1, h2, ..., ht), a function of the encoder's hidden states, as the first hidden-layer input. In practice c = ht is often used directly, i.e. we take only the last hidden state rather than all of the earlier hidden-layer information.
  5. During training we generally use teacher forcing: instead of feeding the prediction y_hat1 back in as the decoder's second input, we feed the ground-truth label y1 directly as input. (A minimal code sketch of the encoder-decoder with teacher forcing is given after this list.)
  6. If we instead use greedy search, we apply a softmax to the output at each time step and pick the single best y for the current output. Such locally optimal choices do not guarantee a globally optimal output sequence.
    (figure from Dive into Deep Learning)
  7. Of course, we could solve this problem by brute force, enumerating every possible candidate sequence, but the overhead is too large. Instead we use beam search: at every time step we keep only the k candidates with the largest conditional probability (k is the beam size).

Because the different candidate sequences have different lengths, we also apply a length penalty of L^α: each candidate's log-probability is divided by L^α (α is commonly set to about 0.75) so that longer sequences are not unfairly penalized.
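
To make the greedy-search versus beam-search discussion in items 6 and 7 concrete, here is a minimal pure-Python sketch of beam search with the L^α length penalty. The toy log_probs function and its tiny vocabulary are invented stand-ins for the decoder's softmax output, not something from the original post.

```python
# A toy stand-in for the decoder: given the tokens generated so far,
# return a log-probability for every word in a tiny vocabulary.
# (Invented for illustration only.)
def toy_log_probs(prefix):
    return {"<eos>": -1.6, "i": -1.2, "am": -1.1, "chinese": -0.9}

def beam_search(log_probs, k=2, max_len=10, alpha=0.75):
    """Keep the k highest-scoring partial sequences at each time step,
    then rank finished candidates by log P(y_1..y_L) / L^alpha."""
    beams = [(["<bos>"], 0.0)]          # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<eos>":   # this candidate is already complete
                finished.append((tokens, score))
                continue
            for tok, lp in log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        if not candidates:
            break
        # keep only the k most probable partial sequences (the beam)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    finished.extend(beams)
    # length penalty: divide the accumulated log-probability by L^alpha
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** alpha))

print(beam_search(toy_log_probs, k=2))
```

Setting k = 1 recovers greedy search, while keeping every candidate instead of only the top k would be the exhaustive search whose cost item 7 calls too large.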
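
And here is a minimal PyTorch sketch of the encoder-decoder with teacher forcing referred to in items 1-5. The vocabulary size, hidden size, single-layer GRUs and the random token ids standing in for real (padded) sentences are all assumptions made to keep the example short; the original post does not specify an implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a tiny vocabulary and hidden dimension for illustration.
VOCAB, EMB, HID = 10, 8, 16   # vocabulary includes ids for <pad>, <bos>, <eos>

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        outputs, state = self.rnn(self.emb(src))  # state: (1, batch, HID)
        return outputs, state                     # state plays the role of c = h_T

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tgt_in, state):             # tgt_in: (batch, tgt_len)
        outputs, state = self.rnn(self.emb(tgt_in), state)
        return self.out(outputs), state           # logits over the vocabulary

# Teacher forcing: the decoder input is the gold sequence shifted right
# (<bos> y1 y2 ...), not the model's own previous predictions.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, VOCAB, (2, 5))             # stand-in for padded source ids
tgt = torch.randint(0, VOCAB, (2, 4))             # stand-in for "<bos> y1 y2 <eos>"
_, state = enc(src)
logits, _ = dec(tgt[:, :-1], state)               # feed <bos> y1 y2 as input
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
```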

Attention mechanism

In fact, each word in the encoder's input contributes a different amount to each word the decoder produces, but the basic seq2seq model does not reflect this: it uses the same c = q(h1, h2, ..., ht) as the decoder's first hidden-layer input for every output word.
Take the example "tom chase jerry", translated into Chinese as 汤姆 追逐 杰瑞 (Tom chases Jerry). When translating the word 杰瑞 (Jerry), we can see that the source word "jerry" contributes the most, while "tom" and "chase" contribute relatively little to this particular word. In the plain seq2seq model, however, every input word carries the same weight for every translated word.
So researchers cleverly invented the attention mechanism, which assigns a weight to each input word when predicting each word of the translation.
Here we introduce the notions of query, key and value: the query is the previous decoder hidden state s^{t-1}, and the keys and values are the encoder hidden states. When predicting the next output, the query is first scored against each key to produce a weight for every source position; the values are then multiplied by these attention weights and summed to obtain the attention output.
Note that in NLP the keys and values are generally identical.
At time step t, the attention output and the previous decoder hidden state s^{t-1} together generate the current hidden state, which is then used to make the prediction.
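
Below is a small NumPy sketch of the query/key/value computation just described. Dot-product scoring and the random toy vectors are assumptions for illustration; the post does not fix a particular score function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    # query:  previous decoder hidden state s^{t-1}, shape (h,)
    # keys = values: encoder hidden states, shape (T, h)
    scores  = keys @ query                 # one score per source position
    weights = softmax(scores)              # attention weights, sum to 1
    context = weights @ values             # weighted sum of the values
    return context, weights

# Toy numbers: 4 source positions (e.g. "tom", "chase", "jerry", "<eos>"), hidden size 3.
enc_states = np.random.randn(4, 3)
s_prev     = np.random.randn(3)
context, w = attention(s_prev, enc_states, enc_states)  # keys and values are the same
print(w)   # how much each source position contributes to the next target word
```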

In the Transformer model we introduce self-attention, i.e. self-attention layers, in which the queries, keys and values all come from the same sequence.

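
As a companion to the sentence above, here is a minimal NumPy sketch of self-attention, where the queries, keys and values all come from the same sequence; the learned projection matrices and multiple heads of the actual Transformer are omitted as a simplification.

```python
import numpy as np

def self_attention(X):
    # X: one sequence of hidden vectors, shape (seq_len, d).
    # Queries, keys and values are all taken to be X itself (no projections).
    d = X.shape[-1]
    scores  = X @ X.T / np.sqrt(d)                           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X    # every output position mixes information from all positions

X = np.random.randn(5, 4)       # 5 tokens, hidden size 4
print(self_attention(X).shape)  # (5, 4)
```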
