Natural Language Processing - BERT

Transformer

self-attention + Feed Forward Neural Network

  1. In an RNN, the state at time step $t$ depends on the state at $t-1$. It cannot be parallelized, and it still suffers from the long-term dependency problem
  2. Transformer: the attention mechanism reduces the distance between any two positions in the sequence to a constant, and it is easy to parallelize

Encoder:

$\text{selfAttention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

$\text{FFN}(z)=\max(0,\,zW_1+b_1)W_2+b_2$

  • Multi-head attention
  • Short-cut (residual) connections, as in ResNet
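
A minimal single-head NumPy sketch of the two encoder formulas above (scaled dot-product self-attention followed by the position-wise FFN). The shapes, the identity Q/K/V projections, and the omission of residual connections and layer normalization are simplifying assumptions.

```python
# Single-head self-attention + position-wise FFN, illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)
    return softmax(scores) @ V                   # weighted sum of values

def ffn(z, W1, b1, W2, b2):
    return np.maximum(0, z @ W1 + b1) @ W2 + b2  # ReLU, then linear

seq_len, d_model, d_ff = 4, 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
# In a real Transformer, Q, K, V come from learned projections of x;
# here we reuse x directly to keep the sketch short.
out = self_attention(x, x, x)
out = ffn(out,
          rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)   # (4, 8)
```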

Decoder:

  • Multi-Head Attention
  • Encoder-Decoder Attention: $Q$ comes from the previous output of the decoder, while $K$ and $V$ come from the output of the encoder.
  • Masked Attention: each position may only attend to earlier positions (see the mask sketch after this list)
  • Positional encoding added to the input embeddings
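
A small sketch of the causal mask behind Masked Attention, again assuming a single head and no learned projections: position $i$ may only attend to positions $\le i$, which is what lets the decoder be trained as a left-to-right generator.

```python
# Causal (look-ahead) mask for the decoder's Masked Attention.
import numpy as np

def causal_mask(seq_len):
    # True where attention is NOT allowed (future positions).
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores[causal_mask(Q.shape[0])] = -1e9       # block the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

x = np.random.default_rng(1).normal(size=(5, 8))
print(masked_attention(x, x, x).shape)           # (5, 8)
```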

ELMO

  1. Use a character-level CNN to get the basic embedding
  2. Use a multi-layer bidirectional LSTM; the output of the top layer goes through a softmax to predict the next word
  3. Combine the embeddings from all layers (with learned weights) into a dynamic word embedding (see the sketch below)

LSTM is a weak feature extractor, and simply concatenating the forward and backward representations is a weak way to merge bidirectional features.
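
A sketch of step 3, assuming three representation layers of width 1024: the per-layer outputs are combined with softmax-normalized task weights $s_j$ and a scalar $\gamma$ into one dynamic embedding per token. The names follow ELMo's notation; the random tensors are placeholders.

```python
# Task-weighted combination of layer outputs into an ELMo-style embedding.
import numpy as np

def elmo_embedding(layer_outputs, s, gamma):
    # layer_outputs: (num_layers, seq_len, dim); s: (num_layers,)
    w = np.exp(s) / np.exp(s).sum()                    # softmax over layers
    return gamma * np.tensordot(w, layer_outputs, axes=1)

layers = np.random.default_rng(2).normal(size=(3, 6, 1024))    # 3 layers, 6 tokens
print(elmo_embedding(layers, s=np.zeros(3), gamma=1.0).shape)  # (6, 1024)
```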

GPT2

A traditional language model (predicting the next word) built from a multi-layer Transformer decoder (without the encoder-decoder attention layer)

GPT is more suitable for natural language generation (NLG) tasks because it uses a traditional language model: these tasks typically generate the next token from the information available so far. BERT, in contrast, is more suitable for natural language understanding (NLU) tasks.
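
A sketch of the traditional (left-to-right) language-model objective GPT-2 is trained with, using random logits and a toy vocabulary as placeholders: the prediction at position $t$ is scored against the token at $t+1$.

```python
# Next-token (causal LM) cross-entropy over shifted targets.
import numpy as np

def causal_lm_loss(logits, token_ids):
    # logits: (seq_len, vocab); token_ids: (seq_len,)
    # position t predicts token_ids[t + 1], so drop the last prediction.
    logits, targets = logits[:-1], token_ids[1:]
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
vocab, seq_len = 50, 8
print(causal_lm_loss(rng.normal(size=(seq_len, vocab)),
                     rng.integers(0, vocab, size=seq_len)))
```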

BERT:

  1. Uses a cloze-style objective: mask some tokens and predict them from the surrounding context (sketched below)
  2. Uses the Transformer encoder architecture, giving richer interaction between words
  3. Uses next sentence prediction (NSP) as an additional multi-task objective to capture sentence-level context
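
A sketch of the cloze-style masking in point 1, following the 15% / 80-10-10 recipe from the BERT paper; the token strings and the tiny vocabulary are made up for illustration.

```python
# BERT-style masked-language-model corruption of an input sentence.
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok                      # model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)    # 10%: random replacement
            # else: 10%: keep the original token
    return inputs, labels

print(mlm_mask(["the", "cat", "sat", "on", "the", "mat"],
               vocab=["dog", "ran", "blue", "mat"]))
```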

RoBERTa

  1. Removes the next sentence prediction task
  2. Dynamic masking: BERT generates a fixed mask for each sentence during data preprocessing, whereas RoBERTa samples a new mask each time the sentence is fed to the model (see the sketch after this list)
  3. Larger amounts of data, larger batches
  4. Text encoding: Byte-Pair Encoding (BPE) is a mixture of character-level and word-level representations, and it can handle the large vocabularies common in natural language corpora
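
A sketch of the static vs. dynamic masking contrast in point 2, with masking simplified to "replace roughly 15% of tokens with [MASK]"; the sentence and seeds are arbitrary.

```python
# Static masking (BERT): the corruption is fixed once during preprocessing.
# Dynamic masking (RoBERTa): a new corruption is sampled on every pass.
import random

def mask_once(tokens, rng):
    return [t if rng.random() > 0.15 else "[MASK]" for t in tokens]

sentence = ["roberta", "uses", "dynamic", "masking", "instead"]

# Static: one corruption, reused for every epoch.
static_view = mask_once(sentence, random.Random(42))
static_epochs = [static_view for _ in range(3)]

# Dynamic: a fresh corruption per epoch.
rng = random.Random(42)
dynamic_epochs = [mask_once(sentence, rng) for _ in range(3)]

print(static_epochs)
print(dynamic_epochs)
```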

ERNIE (BERT-wwm)

Designed for Chinese: when masking, a whole Chinese word (all of its characters) is masked at once.
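
A sketch of whole-word masking, assuming the sentence has already been segmented into words: the mask decision is made per word, so a multi-character word like 哈尔滨 is either fully masked or left intact.

```python
# Whole-word masking: mask all characters of a segmented word together.
import random

def whole_word_mask(words, rng=random.Random(0), rate=0.3):
    chars, labels = [], []
    for word in words:                 # word = list of its characters
        mask = rng.random() < rate     # decide per word, not per character
        for ch in word:
            chars.append("[MASK]" if mask else ch)
            labels.append(ch if mask else None)
    return chars, labels

# "哈尔滨" (Harbin) is one word: either all three characters are masked
# or none of them is.
print(whole_word_mask([["我"], ["住", "在"], ["哈", "尔", "滨"]]))
```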

XLNET

Inside the Transformer, from the context words (both left and right) of $T_i$, randomly select $i-1$ of them and place them in the left-context positions of $T_i$; the inputs of the remaining words are hidden via the attention mask.

  1. Permutation Language Modeling: XLNet randomly permutes the factorization order of the words in a sentence, so that for a word $x_i$ its original right-context words may appear in its current left context, combining the strengths of autoregressive and autoencoding language models (see the sketch after this list)
  2. Two-Stream Self-Attention solves the problem introduced by 1 (the model needs to know the target position without seeing the target token)
  3. Introduces Transformer-XL to capture longer-distance word dependencies (segment-level recurrence mechanism)
  4. Relative Segment Encoding (only determines whether two words are in the same segment, instead of which segment each belongs to)
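
A sketch of the permutation idea in point 1, simplified to a single attention stream: sample a random factorization order, then build a mask so each token only sees tokens that precede it in that order, not in the original sentence order. The two-stream query/content split of the real XLNet is omitted here.

```python
# Permutation-LM attention mask from a random factorization order.
import numpy as np

def permutation_mask(seq_len, rng):
    order = rng.permutation(seq_len)          # random factorization order
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)          # rank[i] = position of token i in the order
    # allowed[i, j] == True if token i may attend to token j
    allowed = rank[:, None] > rank[None, :]
    return order, allowed

order, allowed = permutation_mask(5, np.random.default_rng(0))
print(order)
print(allowed.astype(int))
```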

DistilBERT

Knowledge distillation is a model compression method, also called teacher-student learning: a small model (the student) is trained to reproduce the behavior of a large model or an ensemble of models (the teacher).

  1. Removes the token-type embeddings and the pooler (used for the next-sentence classification task)
  2. The rest of the BERT architecture is retained, but the number of layers is only 1/2 of the original version
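
A sketch of the teacher-student objective described above: the student is trained to match the teacher's temperature-softened output distribution. In practice this soft loss is combined with the usual hard-label loss (and, in DistilBERT, a cosine embedding loss), which is omitted here.

```python
# Soft-target distillation loss with temperature, NumPy stand-in.
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # cross-entropy between soft teacher targets and the student prediction
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

rng = np.random.default_rng(4)
print(distillation_loss(rng.normal(size=(3, 10)), rng.normal(size=(3, 10))))
```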

ALBERT

  1. Factorized embedding parameterization: the word embedding size is smaller than the hidden size, with a projection layer in between (see the sketch below)
  2. Parameter sharing across hidden layers
  3. Replaces next sentence prediction with a sentence order prediction (SOP) task

The model has fewer parameters, but training and inference take longer, and the results are good only when the model is scaled up.
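
A back-of-the-envelope sketch of item 1, using BERT-base-like sizes assumed only for illustration: factorizing the V x H embedding matrix into V x E plus an E x H projection shrinks the embedding parameters whenever E is much smaller than H.

```python
# Parameter count: direct embedding vs. ALBERT's factorized embedding.
V, H, E = 30_000, 768, 128           # vocab size, hidden size, embedding size

bert_embedding_params   = V * H      # direct V x H embedding matrix
albert_embedding_params = V * E + E * H   # embed to E, then project to H

print(bert_embedding_params)         # 23,040,000
print(albert_embedding_params)       # 3,938,304
```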

Origin blog.csdn.net/lovoslbdy/article/details/104860635