Machine Learning: Bert and its family


Bert

First, a general model is trained on a large unlabeled corpus in an unsupervised way; it is then fine-tuned on the labeled data of each specific downstream task.


  • ELMo
  • Bert
  • ERNIE
  • Grover
  • Bert&PALS

Outline


Pre-train Model


Let's first introduce pre-trained models. The role of a pre-trained model is to represent each token as a vector (an embedding).

For example:

  • Word2vec
  • GloVe

But English has far too many distinct words to enumerate them all, so it is better to encode at the character (or subword) level:

  • FastText

For Chinese, radicals can be added as extra features, or the characters can be fed into the network as images so their glyph structure can be exploited.

The problem with all of the above methods is that context is ignored: the same word gets the same embedding in every sentence, even though its meaning can differ from sentence to sentence.
Contextualized word embedding

A contextualized embedding model works like the encoder of a sequence-to-sequence model: it reads the whole sentence before producing an embedding for each token, so the same token gets different embeddings in different contexts. In the lecture's example, sentences that all contain the word "apple" in different senses yield clearly different embeddings for that token.
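Below is a minimal sketch of this behaviour (an illustration, not the lecture's code), assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint are available:

```python
# Same surface token, different contexts -> different contextualized embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I ate an apple for lunch.", "Apple released a new phone."]
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    # locate the position of the token "apple" in this sentence
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    vectors.append(hidden[tokens.index("apple")])

# The two embeddings are clearly not identical, so the similarity is below 1.
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'apple' embeddings: {sim.item():.3f}")
```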

  • Bigger models: one line of work simply keeps scaling the model up.
  • Smaller models: the opposite line of work makes the model smaller; ALBERT is a representative of the compression techniques used here.
  • Network architecture design: new architectures let the model read very long inputs, not just an article but possibly an entire book:
      • Transformer-XL
      • Reformer
      • Longformer

The computational complexity of self-attention is O(n²) in the sequence length n, which is what makes very long inputs expensive.
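A minimal sketch of where the quadratic cost comes from (shapes only; the sizes are arbitrary):

```python
# The score matrix Q @ K^T has shape (n, n), so both memory and compute
# grow quadratically with the sequence length n. d is the head dimension.
import torch

n, d = 1024, 64
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

scores = Q @ K.T / d**0.5           # (n, n) -> the quadratic term
attn = torch.softmax(scores, dim=-1)
out = attn @ V                      # (n, d)
print(scores.shape)                 # torch.Size([1024, 1024])
```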

How to fine-tune


  • Input:
    either a single sentence, or two sentences separated by a [SEP] token.
  • Output:
    To output one class for the whole input, a [CLS] token is added, and the embedding produced at that position serves as a representation of the entire sentence for the classifier. Without [CLS], the alternative is to pool all the token embeddings and feed them to a small model that produces a single output.
    The second case gives every token its own class, which amounts to attaching a classifier to each token embedding (useful for tagging tasks).
    Extraction-based QA: the model outputs two positions that mark the start and the end of the answer span in the document.
    How can a general sequence be generated? Using only the encoder in the structure above does not work well. A better way is to put the pre-trained model inside the generation loop: each generated word is fed back into the model, and generation continues until the <EOS> terminator is produced. A minimal sketch of the two classification heads follows.
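A minimal sketch (illustrative names, not the lecture's code) of the two task heads described above: one that classifies the whole input via the [CLS] embedding, and one that classifies every token:

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Sentence-level classification from the [CLS] position."""
    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_len, hidden_size) from the pre-trained model
        cls_embedding = encoder_output[:, 0]       # position 0 is [CLS]
        return self.classifier(cls_embedding)      # (batch, num_classes)

class TokenHead(nn.Module):
    """Token-level classification: one label per position (e.g. tagging)."""
    def __init__(self, hidden_size=768, num_labels=9):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoder_output):
        return self.classifier(encoder_output)     # (batch, seq_len, num_labels)
```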

There are two ways to fine-tune (both are sketched in code after this list):

  • Option 1: keep the pre-trained model fixed, use the embeddings it produces as features, and train only the task-specific model on top.
  • Option 2: train the pre-trained model and the task-specific model together. This costs more, but usually gives better results than option 1. Training the whole model does raise a problem, however:
  • After fine-tuning, the pre-trained model itself has changed, so every task ends up with its own, different copy of a very large model, which is wasteful.
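A minimal sketch of the two options in PyTorch; `pretrained` and `task_head` are hypothetical stand-ins for the pre-trained encoder and the task-specific model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pre-trained encoder and the task head.
pretrained = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2
)
task_head = nn.Linear(768, 2)

# Option 1: freeze the pre-trained model, train only the task-specific head.
for p in pretrained.parameters():
    p.requires_grad = False
opt_frozen = torch.optim.Adam(task_head.parameters(), lr=1e-3)

# Option 2: fine-tune everything end to end. This usually works better, but
# every task then keeps its own full copy of the large pre-trained model.
for p in pretrained.parameters():
    p.requires_grad = True
opt_full = torch.optim.Adam(
    list(pretrained.parameters()) + list(task_head.parameters()), lr=2e-5
)
```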

Solutions to this problem:

  • Adapter: insert a small adapter structure (APT) and train only its parameters.

During fine-tuning, only the adapter (APT) parameters are updated. The adapters are inserted inside the transformer layers, which deepens the network slightly while leaving the original pre-trained weights untouched. A minimal adapter sketch follows:
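```python
# A minimal bottleneck-adapter sketch (an illustration of the general idea,
# not the exact architecture from the paper): a small module inserted into
# each transformer layer; only these parameters are updated at fine-tune time.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        # down-project, non-linearity, up-project, residual connection
        return x + self.up(torch.relu(self.down(x)))
```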

  • Weighted Features: combine the embeddings from every layer of the pre-trained model with learnable weights, and feed the weighted sum to the task-specific model; the weights are learned together with the task (see the sketch below).
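A minimal sketch of such a learnable weighted sum over layer outputs (illustrative names):

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers=12):
        super().__init__()
        # one learnable scalar per layer, normalized with softmax
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, hidden) tensors, one per layer
        stacked = torch.stack(layer_outputs, dim=0)           # (L, B, T, H)
        w = torch.softmax(self.layer_weights, dim=0)
        return (w[:, None, None, None] * stacked).sum(dim=0)  # (B, T, H)
```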
Why does fine-tuning from a pre-trained model generalize well? Looking at the loss landscape from the start point to the end point of training: with a pre-trained initialization the end point stays close to the start and lands in a wide, shallow basin, which indicates good generalization; training from scratch tends to travel much further and end in narrow, deep minima, which generalize worse.

How to pre-train


Translation task

  • Context Vector (CoVe): the input sentence A is fed to an encoder and a decoder produces its translation B; the encoder's context vectors are then used as word embeddings. The drawback is that this requires a large amount of paired translation data.

Self-supervised Learning

In self-supervised learning, the supervision signal is generated from the data itself: part of the input is used to predict another part, so no human labels are needed.

Predict Next Token

Given the tokens seen so far, predict the next token: w1 is used to predict w2, then w1 and w2 predict w3, then w1, w2, w3 predict w4, and so on. Tokens to the right must never be used when predicting tokens to their left. The early models of this kind used an LSTM as the backbone network:

  • LM
  • ELMo
  • ULMFiT

Later algorithms replace the LSTM with self-attention:

  • GPT
  • Megatron
  • Turing NLG

Note: when self-attention is used this way, its scope must be restricted so that each position can only attend to earlier positions (a causal mask), as in the sketch below.

Such language models can be used to generate articles, e.g. talktotransformer.com.
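A minimal sketch of left-to-right next-token training with a causal attention mask (illustrative sizes, random data):

```python
import torch
import torch.nn as nn

vocab_size, hidden, seq_len = 1000, 128, 16
embed = nn.Embedding(vocab_size, hidden)
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
lm_head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))
# upper-triangular boolean mask: True above the diagonal = "do not attend",
# so position i only sees positions <= i
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

hidden_states = layer(embed(tokens), src_mask=causal_mask)
logits = lm_head(hidden_states)                      # (1, seq_len, vocab_size)

# each position is trained to predict the *next* token
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
print(loss.item())
```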


If the model can condition on the context to the left of a token, why not also use the context to its right?

Predict Next Token - Bidirectional

One solution runs a left-to-right model and a right-to-left model and concatenates the contexts from the two directions as the final representation.

The problem is that the left-to-right model only ever sees the left context and never the end of the sentence, while the right-to-left model only ever sees the right context and never the beginning, so neither direction sees the whole sentence when it encodes a token.

Masking input

Randomly mask a word and let the model see the rest of the sentence to predict what the masked word is. Taken a step further, this idea is very similar to the earlier CBOW; the difference between BERT and CBOW is that BERT can use an unbounded context on both sides instead of a fixed window.
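A minimal demonstration of the masked-token idea using the Hugging Face fill-mask pipeline (assumes `transformers` and the public `bert-base-uncased` checkpoint are available):

```python
from transformers import pipeline

# BERT sees the whole sentence and fills in the masked position.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The man went to the [MASK] to buy milk."):
    print(pred["token_str"], round(pred["score"], 3))
```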

Is masking single random tokens good enough? Several masking strategies have been proposed:

  • wwm (whole word masking)
  • ERNIE (phrase-level and entity-level masking)
  • SpanBERT
  • SBO (Span Boundary Objective)

Instead of masking isolated tokens, a whole word or phrase can be covered, or entities can be detected first and then masked. SpanBERT masks a contiguous span whose length is sampled from a probability distribution (shown in the slides); a sketch of span masking follows. SBO takes the token embeddings at the left and right boundaries of the masked span, plus an index into the span, and predicts the word at that position; the design forces the boundary embeddings to carry information about the span they enclose.
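A minimal sketch of span masking (the SpanBERT-style idea, with an illustrative geometric-style length distribution):

```python
import random

def sample_span_length(p=0.2, max_len=10):
    # geometric-style sampling: short spans are more likely than long ones
    length = 1
    while length < max_len and random.random() > p:
        length += 1
    return length

def mask_span(tokens, mask_token="[MASK]"):
    length = min(sample_span_length(), len(tokens))
    start = random.randrange(0, len(tokens) - length + 1)
    masked = list(tokens)
    for i in range(start, start + length):
        masked[i] = mask_token
    # the span boundaries are what SBO uses to predict the covered content
    return masked, start, start + length - 1

print(mask_span("the quick brown fox jumps over the lazy dog today".split()))
```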

XLNet

Architecturally, XLNet does not use the vanilla Transformer but Transformer-XL.

Randomly permuting the token order during training means that each token learns to be predicted from many different subsets of the surrounding context, rather than always from the same regular context that BERT sees in its training corpus.

BERT is also not good at generative tasks: during training it is always given the whole sentence, whereas generation provides only a prefix and asks the model to predict the next token from left to right.

MASS/BART

The input w1, w2, w3, w4 has to be damaged in some way, otherwise the model can simply copy it and learns nothing. The corruption methods (sketched in code after the list):

  • mask (randomly mask tokens)
  • delete (remove tokens entirely)
  • permutation (shuffle the order)
  • rotation (start the sequence from a different position)
  • text infilling (replace a span with a single mask; the mask may cover several missing tokens, or may even be inserted where nothing is missing, which is deliberately misleading)

The experiments then compare how well each corruption strategy turns out to work.
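A minimal, illustrative implementation of these corruptions (not the papers' exact code; BART applies permutation at the sentence level, here everything operates on a flat token list):

```python
import random

def token_mask(tokens):          # replace random tokens with [MASK]
    return [t if random.random() > 0.15 else "[MASK]" for t in tokens]

def token_delete(tokens):        # drop random tokens entirely
    return [t for t in tokens if random.random() > 0.15]

def permute(tokens):             # shuffle the order
    out = list(tokens)
    random.shuffle(out)
    return out

def rotate(tokens):              # start the sequence from a random position
    k = random.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

def text_infill(tokens, start=2, length=3):
    # replace a whole span with a single [MASK]; the model must also
    # figure out how many tokens are missing
    return tokens[:start] + ["[MASK]"] + tokens[start + length:]

sentence = "w1 w2 w3 w4 w5 w6".split()
print(text_infill(sentence))     # ['w1', 'w2', '[MASK]', 'w6']
```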

UniLM

UniLM is a single model trained in several modes at once: as a bidirectional LM (BERT-style), as a left-to-right LM (GPT-style), and as a sequence-to-sequence LM, by controlling which positions each token is allowed to attend to.

Replace or Not

  • ELECTRA avoids generation altogether: instead of predicting masked tokens, it judges for every position whether the token there has been replaced. The training signal is simple, and every output position is used.
    Good replacement words are hard to obtain; if tokens are replaced at random, the task becomes too easy. So a small BERT is trained to predict masked words, and its predictions are used as the replacements. This small BERT must not be too good, otherwise its predictions would be identical to the original tokens and there would be nothing to detect.
    ELECTRA reaches the performance of XLNet with roughly a quarter of the compute.
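A minimal sketch of the replaced-token-detection objective (shapes only; random tokens stand in for the small generator's samples):

```python
import torch
import torch.nn as nn

vocab, hidden, seq_len, batch = 1000, 128, 12, 2
discriminator = nn.Sequential(
    nn.Embedding(vocab, hidden),
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
)
replaced_head = nn.Linear(hidden, 1)              # per-token binary decision

original = torch.randint(0, vocab, (batch, seq_len))
corrupted = original.clone()
swap = torch.rand(batch, seq_len) < 0.15          # positions to corrupt
corrupted[swap] = torch.randint(0, vocab, (int(swap.sum()),))

logits = replaced_head(discriminator(corrupted)).squeeze(-1)   # (batch, seq_len)
labels = (corrupted != original).float()          # 1 = replaced, 0 = original
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```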

Sentence Level

Some tasks need an embedding that represents an entire sentence.

  • Skip thought: encode a sentence and train a decoder to predict the next sentence; sentences whose predicted continuations are similar end up with similar embeddings.
  • Quick thought: avoid generation; encode two sentences and train the encoder so that the embeddings of adjacent sentences are close together.

Both approaches push similar sentences toward similar embeddings; Quick thought avoids the expensive generation step.

The original BERT also has a sentence-level task, NSP (Next Sentence Prediction): two sentences are separated by the [SEP] symbol and the model predicts whether they are actually consecutive.

  • NSP: the effect is not good.
  • RoBERTa: also finds NSP of little benefit and drops it.
  • SOP (Sentence Order Prediction): the two sentences are always adjacent; in the forward order the label is "connected", in the reversed order it is "not connected". Used in ALBERT.
  • structBERT (Alice): uses a similar sentence-ordering objective.
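A minimal, illustrative sketch of how NSP and SOP training pairs differ:

```python
import random

sentences = ["sent A", "sent B", "sent C", "sent D"]   # a toy "document"

def nsp_pair(i):
    # NSP: the negative example is a random sentence (in practice, from
    # another document), so topic similarity alone often solves the task
    if random.random() < 0.5:
        return sentences[i], sentences[i + 1], 1       # truly consecutive
    return sentences[i], random.choice(sentences), 0   # random pairing

def sop_pair(i):
    # SOP (ALBERT): the negative example is the same two sentences reversed,
    # so the model must actually learn sentence order
    if random.random() < 0.5:
        return sentences[i], sentences[i + 1], 1
    return sentences[i + 1], sentences[i], 0

print(nsp_pair(0), sop_pair(0))
```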

T5 Comparison

The five T's in "Text-To-Text Transfer Transformer" give the model its name, T5, and the four C's in "Colossal Clean Crawled Corpus" give its training corpus the name C4. T5 systematically compares many of the pre-training choices discussed above.

ERNIE

The goal of ERNIE is to incorporate external knowledge during pre-training.

There is also an audio version of BERT that applies the same pre-training ideas to speech.


Multi-lingual BERT

A single BERT model is trained on text from 104 languages. Such a multilingual model can do zero-shot reading comprehension: fine-tuned on an English QA corpus, it still performs reasonably well on Chinese QA tasks.

Translating the Chinese data into English and handling it with the English-trained model turns out to be not as good as a model trained directly on Chinese.

  • NER
  • POS (part-of-speech tagging)

Both the NER task and the part-of-speech tagging task show the same pattern: train in one language, then perform the task in another language.

Could this even be extended to handle oracle bone script?

Cross-lingual Alignment

The embedding of the Chinese word for "rabbit" is close to the embedding of the English word "rabbit". The model seems to strip away language-specific surface features and keep mainly the meaning.

Alignment is measured with mean reciprocal rank: for each word, the candidate translations are ranked by embedding similarity, and the correct one scores the reciprocal of its rank. For example, if the translation of "year" is ranked first it scores 1, and if the translation of "month" is ranked third it scores 1/3; the scores are then averaged.

The amount of training data has to be very large before the alignment becomes good, as can be seen from the BERT-200k and BERT-1000k results. Running the same experiment with the traditional GloVe and Word2Vec embeddings shows that BERT's alignment is still better than these earlier algorithms. A minimal sketch of the mean-reciprocal-rank computation follows.
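```python
# A minimal sketch of mean reciprocal rank (MRR) for word alignment: random
# vectors stand in for real embeddings, and the i-th source word is assumed
# to translate to the i-th target word.
import torch

src = torch.randn(5, 768)       # e.g. English word embeddings
tgt = torch.randn(5, 768)       # embeddings of their translations

reciprocal_ranks = []
for i in range(src.size(0)):
    sims = torch.cosine_similarity(src[i].unsqueeze(0), tgt, dim=1)
    rank = (sims > sims[i]).sum().item() + 1   # 1 = correct translation ranked first
    reciprocal_ranks.append(1.0 / rank)

print("MRR:", sum(reciprocal_ranks) / len(reciprocal_ranks))
```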

How alignment happens

In one experiment, real English is replaced with "fake English" (each English token is mapped to a new, made-up token) and the model is trained again. Cross-lingual ability still emerges, so it does not depend on shared tokens acting as a bridge between the languages.

At the same time, BERT does know which language it is looking at, even though its meaning representations care little about the language type.

In the visualization, each group of points corresponds to one language, and there is still a gap between the languages.
In the figure, yellow is the average English embedding and blue is the average Chinese embedding; the two are mixed with a weight α.

The model is fine-tuned on English and then tested on Chinese. At test time, adding the blue (Chinese) vector makes the representations more Chinese-like, and this improves the results. A minimal sketch of this shift is given below.
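```python
# A minimal sketch of the language-vector shift described above: estimate the
# mean embedding of English tokens and of Chinese tokens, and at test time
# shift English-fine-tuned representations toward Chinese by alpha times the
# difference. Random tensors stand in for the real token embeddings.
import torch

english_embs = torch.randn(10000, 768)   # embeddings of English tokens
chinese_embs = torch.randn(10000, 768)   # embeddings of Chinese tokens

language_shift = chinese_embs.mean(dim=0) - english_embs.mean(dim=0)

def adapt(hidden_states, alpha=0.5):
    # hidden_states: (batch, seq_len, 768) produced by the multilingual model
    return hidden_states + alpha * language_shift

shifted = adapt(torch.randn(2, 16, 768))
print(shifted.shape)
```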


Origin blog.csdn.net/uncle_ll/article/details/131869537