Machine Learning: Bert and its family

insert image description here

Bert

insert image description here
First use unsupervised corpus to train the general model, and then conduct special training and learning for small tasks.

insert image description here

ELMo
Bert
ERNIE
Grover
Bert&PALS

Outline

insert image description here

Pre-train Model

insert image description here

First introduce the pre-training model, the role of the pre-training model is to represent some tokens as a vector

insert image description here
for example:

Word2vec
Glove

But for English, there are too many English words, and a single character should be encoded at this time:

FastText

For Chinese, you can add radicals, or send Chinese characters as pictures to the network for output:
insert image description here
the problem with the above method will not consider that the same word in each sentence will have different meanings, and the same token will be generated:

contextualized word embedding

Similar to the encoder of the sequence2sequence model.

The same token gives different embeddings, and the above sentence contains the word Apple.

Bigger Model
Smaller Model

focuses on ALBERT, the technology to make the model smaller:

network architecture design:

let the model read a long content, not just an article, it may be a book.
Transformer-XL
Reformer
Longformer

The computational complexity of self-attention is $O(n^2)$

How to fine-tune

How to pre-train
insert image description here

Input:
one sentence or two sentences, [sep] for segmentation.
Output part:
output a class, add a [cls], and generate an embedding related to the entire sentence.

If there is no cls, it is to put all the embeddings together and send them to the model to get an output.

The second is to give each token a class, which is equivalent to a class for each embedding. How does

the Extraction-based QA

General Sequence be used to generate text? The encoder of the above structure cannot be used well

insert image description here
Use the pre-trained model as an encoder. After generating a word each time, send it to the model to continue generating until the eos terminator is generated.

insert image description here
There are two methods of fine-tuning:

The first one: the pre-training model does not move, and the embedding generated by it is trained for specific tasks, and only the upper model is fine-tuned;
The second method: the pre-training model and the specific task model are combined for training, and the consumption will be higher; the
second method will achieve better results than the first method, but there are some problems encountered in training the entire model :
After training, the pre-training model has also changed, which means that each task will have a different pre-training model, and each model is relatively large, which is very wasteful.

For the above problems, solutions:

Adapter: Only train a small amount of parameter structure APT

insert image description here

When fine-tune, only the parameters of the APT structure will be adjusted, but it will be inserted into the transformer structure to deepen the network:

Weighted Features
synthesizes the embedding of each layer and sends them to specific tasks for learning, and the weight parameters can be learned.

The loss of the model, the generalization ability. From start-point to end-point, the wider the distance between the two points and the shallower the concave, the more general the generalization ability; the closer the distance between the two points, the deeper the concave, the better the generalization ability.

How to pre-train

How to pre-train:

insert image description here

translation task

Context Vector (Cove)

sends the input sentence A to the encoder, and then the decoder gets the sentence B, which requires a lot of pair data

Self-supervised Learning

insert image description here
The input and output of self-supervised are generated by themselves.

Predict Next Token

Given an input, predict the next token

insert image description here
With w1 predicting w2, using w1, w2 to predict w3, and then using w1, w2, w3 to predict w4, but the data on the right cannot be used to predict the data on the left: the infrastructure network uses LSTM

:

LM
ELMo
ULMFiT

Some subsequent algorithms replace LSTM with Self-attention
insert image description here

GPT
Megatron
Turing NLG

Note: Control the scope of Attention

insert image description here
Can be used to generate articles: talktotransformer.com

insert image description here

insert image description here
If only the occurrence relationship on the left is considered, why not the text on the right?

Predict Next Token-Bidrectional

The contexts generated on the left and right sides are combined as the final representation:
insert image description here

But the problem is that the left can only see the left, but cannot see the end of the right, and the right can only see the right, but cannot see the beginning of the left.

Masking input

insert image description here

Randomly cover a certain word, it is to see the complete sentence to predict what the word is.
Pushing this idea forward, it is very similar to the previous cbow:
insert image description here
The difference between Bert and cbow is that the length of the left and right sides can be infinite, instead of having a window length.

Is a random mask good enough? There are several mask methods:

wwm
ERNIE
TeamBert
SBO

covers an entire sentence or covers several words. Or find out the Entity first, and then cover these words:

the length of the cover is according to the probability of appearance in the above picture.

Cover the embeddings on the left and right sides to predict, and the input index to restore which word in the middle.
The design of SBO expects the token embedding on the left and right to contain the embedding information on the left and right.

XLNet

The structure is not using Transformer, but using Transformer-XL

insert image description here

Randomly shuffle the order and train a token with a variety of different information.

Bert's training corpus is relatively regular:
insert image description here
Bert is not good at doing Generative tasks, because Bert gives the whole sentence during training, while Generative only gives a part, and then predicts the next token from left to right

MASS/BART

insert image description here
Do some damage to w1, w2, w3, w4, otherwise the model can't learn anything, the method of destruction:

mask (random mask)
delete (delete directly)
permutation
rotation (change the starting position)
Text Infilling (insert another misleading, missing a mask)

turn out:
insert image description here

UniLM

insert image description here

UniLM for multiple training

Replace or Not

ELECTRA avoids the things that need to be trained and generated, and judges which position is replaced. The training is very simple, and every output is used.
*
The replacement words are not easy to get, if it is replaced casually, it will be easy to know. So with the following results, use a small bert predicted result as the replacement result. The small bert effect should not be too good, otherwise the predicted result will be the same as the real one, and the replacement effect will not be obtained, because the replacement result is exactly the same of.

Only a quarter of the calculation is needed to achieve the effect of XLNet.

Sentence Level

insert image description here
The embedding of the entire sentence is required.

Using skip thought, if the prediction results of two sentences are more similar, then the two input sentences are more similar.
Quick thought, if the output of two sentences is connected, the closer the similar sentences are, the better.
The above approach avoids doing the generated task.

The original Bert actually has a task NSP, predicting whether two sentences are connected or not. Separate the two sentences with the sep symbol.
insert image description here

nsp: the effect is not good
Roberta: Average effect
sop: Forward is connected, reverse is not connected, used in ALBERT
structBert：Alice，

T5 Comparison

insert image description here
5 Ts are called T5,
4 Cs are called C4

ERNIE

I hope to add knowledge during the train

insert image description here
Audio Bert

Multi-lingual BERT

Multilingual BERT
insert image description here

insert image description here
Using multiple languages to train a Bert model in

104 languages can achieve zero-shot reading comprehension.

Trained on English corpus, but on Chinese QA tasks, the effect is not bad

insert image description here

After translating Chinese into English, and then conducting English training, it was found that there was no model directly trained in Chinese.

NER
Pire: part-of-speech tagging

Both the NER task and the part-of-speech tagging task conform to the above rules, training in one language, and then performing task processing on another voice.

Is it possible to handle Oracle?
insert image description here

Cross-lingual Alignment

The Chinese rabbit embedding is relatively close to the English rabbit embedding. The model may have removed the characteristics of the speech and only considered the meaning.

insert image description here

Year ranks first, month ranks third, and the corresponding score is the reciprocal of rank

insert image description here

The amount of data needs to be very large to have better results, as can be seen from the results of BERT200k and BERT1000k.
The same experiment was carried out on the traditional algorithms GloVe and Word2Vec, and it was found that Bert's effect is still better than the previous algorithm.
insert image description here

How alinment happens

insert image description here

Use fake-english instead of real english, and then train. The cross-language ability does not require the existence of intermediary voice.

insert image description here
Bert knows the language information, but doesn't care much about the language type.

insert image description here
Each string of text represents a language, and there are still some gaps in the language.

Yellow is the English code, blue is the Chinese code, the two are fused together and controlled by α:

Perform fine-tune on English, and then test on Chinese to make the embedding more like Chinese. In the testing phase, adding blue vectors will improve the effect.