An illustrated explanation of the BERT model and its inputs and outputs

1. BERT's overall structure

BERT uses only the Transformer Encoder, not the Decoder. My understanding is that, because BERT is a pre-trained model, it only needs to learn semantic relationships; it does not need a decoder to complete any specific downstream task. The overall structure is shown in the figure below:

Stacking multiple Transformer Encoder layers on top of one another yields BERT. In the paper, the authors assembled two BERT models, one with 12 Transformer Encoder layers and one with 24; their total parameter counts are 110M and 340M, respectively.
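As a concrete reference, the two configurations can be inspected directly. The snippet below is a minimal sketch, assuming the Hugging Face transformers library is installed; "bert-base-uncased" and "bert-large-uncased" are the public checkpoints corresponding to the 12-layer and 24-layer models described in the paper.

# Minimal sketch: load the 12-layer and 24-layer BERT encoders and count their parameters.
# Assumes the Hugging Face `transformers` library and PyTorch (pip install transformers torch).
from transformers import BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: {cfg.num_hidden_layers} encoder layers, "
          f"hidden size {cfg.hidden_size}, about {n_params / 1e6:.0f}M parameters")
# The printed totals come out to roughly 110M for the base model and 340M for the large model.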

2. Revisiting the Attention mechanism in the Transformer

The Attention mechanism (in Chinese, "注意力机制"), as the name suggests, lets a neural network "focus" on part of its input; that is, it distinguishes how much different parts of the input influence the output. Here we explain the Attention mechanism from the perspective of enhancing the semantic representation of a word or character.

We know that the meaning a word or character expresses in a piece of text is usually related to its context. For example, seeing the character "鹄" (hú) on its own, we may find it unfamiliar (we may not even remember how to pronounce it), but once we see its context "鸿鹄" (swan), it immediately becomes familiar. The context of a word or character therefore helps enhance its semantic representation. At the same time, different words or characters in the context often contribute differently to that enhancement. In the example above, "鸿" contributes the most to understanding "鹄", while a function word such as "的" contributes relatively little. To exploit context while distinguishing these different contributions to the target word's semantic representation, we can use the Attention mechanism.

The Attention mechanism mainly involves three concepts: Query, Key, and Value. In the word-semantics-enhancement scenario above, the target word and each context word each have their own original Value. The Attention mechanism treats the target word as the Query and each context word as a Key, and uses the similarity between the Query and each Key as a weight to fuse the context words' Values into the target word's original Value. As shown below, the Attention mechanism takes the semantic vector representations of the target word and each context word as input. It first obtains, via linear transformations, the target word's Query vector, each context word's Key vector, and the original Value vectors of the target word and each context word. It then computes the similarity between the Query vector and each Key vector as weights (ultimately forming a weight relationship between the target word and each of its context words, with the weights summing to 1), and performs a weighted fusion of the target word's Value vector and the context words' Value vectors (in practice a weighted sum, i.e., a dot product of the weights with the Value vectors). The result is the Attention output: an enhanced semantic vector representation of the target word.
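To make the computation concrete, here is a minimal sketch of single-query scaled dot-product attention in Python with NumPy. This is not BERT's actual implementation (which uses multiple heads and learned projection matrices inside each Transformer layer); it only illustrates the Query/Key similarity, the softmax weights that sum to 1, and the weighted fusion of Value vectors described above. All names and dimensions are illustrative.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(query, keys, values):
    # query:  (d,)   Query vector of the target word
    # keys:   (n, d) Key vectors of the target word and its context words
    # values: (n, d) Value vectors of the same words
    d = query.shape[-1]
    # Similarity between the Query and each Key, scaled by sqrt(d).
    scores = keys @ query / np.sqrt(d)
    # Softmax turns the similarities into weights that sum to 1.
    weights = softmax(scores)
    # Weighted fusion of the Value vectors: the enhanced representation of the target word.
    return weights @ values, weights

# Toy example: a target word plus 3 context words, with 4-dimensional vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(4, 4))
V = rng.normal(size=(4, 4))
enhanced, w = attention(q, K, V)
print("attention weights (sum to 1):", w, w.sum())
print("enhanced target-word vector:", enhanced)

In BERT itself, every position's vector serves as a Query against all positions' Keys in parallel (self-attention), so each word's representation is enhanced by its entire context at once.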


Source: www.cnblogs.com/gczr/p/11785930.html