A Brief Analysis of Language Models, Word2vec, and BERT

A language model estimates the probability that a piece of text is reasonable natural language, and it plays an important role in information retrieval, machine translation, speech recognition, and other tasks. Based on my earlier study notes, this article briefly summarizes the NLP language models Word2vec and BERT; please point out anything I have missed. Detailed analyses of various language model theories and applications will follow, so stay tuned.

1. Introduction to language models

A language model estimates the probability that a piece of text is reasonable natural language, and plays an important role in information retrieval, machine translation, speech recognition, and other tasks.

1.1 What is a language model

  • Definition of language model

Given a word sequence W = w1, w2, ..., wm, compute the joint probability P(W) that this word sequence forms a reasonable piece of natural language:

P(W) = P(w1, w2, ..., wm) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · ... · P(wm | w1, ..., wm-1)

The probability of a sentence is thus decomposed into the product of the conditional probabilities of its words. If the text is long, estimating these conditional probabilities becomes very difficult (the curse of dimensionality), so we assume that the current word depends only on the n words immediately before it and has nothing to do with anything earlier; each word's conditional probability is then computed from only the previous N words. This is the N-gram language model, where N is usually between 1 and 3 (a toy bigram sketch is given below).
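
As a hedged illustration of the N = 2 (bigram) case, the Python sketch below counts bigrams on a tiny hand-made corpus and scores sentences with add-one smoothing; the corpus, sentence markers, and smoothing choice are assumptions made purely for the example.

```python
from collections import Counter

# Toy corpus; in practice this would be a large text collection (whitespace tokenization assumed).
corpus = [
    "i was bitten by a dog",
    "the dog was bitten by me",
    "a dog bit me",
]

tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(sentence):
    """P(sentence) under a bigram model with add-one smoothing."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams)
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return p

print(bigram_prob("i was bitten by a dog"))     # relatively high probability
print(bigram_prob("dog the bitten was by me"))  # relatively low probability
```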

  • Intuitive understanding of a language model

Given a set of words, arrange them into a piece of text and judge whether the result reads like a sentence a person would say:

Word sequence: [I, dog, was, got, bitten]

Combined into sentences:

(1) I was bitten by a dog P=0.5

(2) Bitten by my dog P=0.2

(3) The dog was bitten by me P=0.03

1.2 Classification of language models

There are two main types of commonly used language models: statistical language models and neural network language models.

  • statistical language model

HAL (Hyperspace Analogue to Language)

LSA (Latent Semantic Analysis)

N-Gram

  • neural language model

CBOW (Continuous Bag-Of-Words Model)

Skip-gram (Continuous Skip-gram Model)

2. Neural language models: Word2vec and BERT

Neural language models can be used directly for NLP tasks. With the rapid development of deep learning, language models are also increasingly used to pre-train NLP models. Commonly used pre-trained models include:

  • Word2vec (Google)
  • GloVe (Stanford)
  • ELMo (Allen Institute for Artificial Intelligence / AllenNLP)
  • GPT (OpenAI)
  • BERT (Google)
  • RoBERTa (Facebook)
  • ALBERT (Google)
  • MT-DNN (Microsoft)

2.1 Vectorization of words

  • Word set method (one-hot): count the vocabulary of the corpus to build a dictionary of size N, and represent each word as an N-dimensional, highly sparse vector in which the element at the word's index is 1 and all other elements are 0;
  • Bag of words: count the vocabulary size N, and represent a piece of text as an N-dimensional, highly sparse vector in which the element at each word's index is that word's frequency in the text and all other elements are 0;
  • Distributed representation: represent words as low-dimensional, dense vectors, usually obtained by training a language model with a neural network, e.g. word2vec, GloVe, BERT (a small numpy illustration follows this list).
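
To make the three representations concrete, here is a small numpy sketch over a toy six-word vocabulary; the vocabulary, text, and dimensions are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Tiny illustrative vocabulary and text.
vocab = ["i", "was", "bitten", "by", "a", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """N-dimensional sparse vector: 1 at the word's index, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def bag_of_words(text):
    """N-dimensional vector holding each word's frequency in the text."""
    v = np.zeros(len(vocab))
    for w in text.split():
        v[word_to_id[w]] += 1.0
    return v

print(one_hot("dog"))                         # [0. 0. 0. 0. 0. 1.]
print(bag_of_words("i was bitten by a dog"))  # [1. 1. 1. 1. 1. 1.]

# A distributed representation is instead a trained dense matrix,
# e.g. a (vocab_size x 100) embedding table looked up by word id.
embedding_matrix = np.random.randn(len(vocab), 100)  # placeholder, not trained
dog_vector = embedding_matrix[word_to_id["dog"]]
```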

2.2 Word2vec

2.2.1 Training method of Word2vec

As the name suggests, Word2Vec converts words into vectors. It is essentially a form of word clustering and a means to support tasks such as inferring word semantics and analyzing sentence sentiment.

Word2Vec has two training methods, based on the CBOW and Skip-gram language models.

1) The core idea of CBOW is to remove a word from a sentence and use that word's surrounding context to predict the removed word;

2) Skip-gram is the opposite of CBOW: a word is given as input, and the network predicts its context words.

After training is complete, we generally do not use the model itself for prediction. Instead, we take out the parameters (weight matrix) between the model's input layer and projection layer and use them as the vector representation (word vector) of each word, either as input to downstream NLP tasks or as the word-embedding layer of an NLP model. A small gensim sketch of this workflow follows.
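
As a minimal sketch of this train-then-extract workflow, the snippet below assumes the gensim library (the original post does not name a toolkit): it trains CBOW (sg=0) and Skip-gram (sg=1) models on a toy corpus and reads out the learned word vectors.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real corpus would be far larger.
sentences = [
    ["i", "was", "bitten", "by", "a", "dog"],
    ["the", "dog", "was", "bitten", "by", "me"],
    ["the", "cat", "chased", "the", "dog"],
]

# sg=0 -> CBOW, sg=1 -> Skip-gram (gensim 4.x parameter names).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
sg_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# The learned input->projection weights are exposed as the word vectors.
dog_vec = cbow_model.wv["dog"]          # 50-dimensional dense vector
print(dog_vec.shape)
print(cbow_model.wv.most_similar("dog", topn=3))
```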

[Figure: CBOW and Skip-gram model architectures]

CBOW model:

  • **Input:** the input layer, holding the one-hot vectors of the context words;
  • **Projection:** the projection layer, which looks up the word vectors of the context words and sums (averages) them;
  • **Output:** the output layer, which outputs a probability distribution over the vocabulary for the predicted word (a minimal numpy sketch of this forward pass follows this list).
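
The following numpy sketch (an illustrative toy, not the original Word2Vec implementation; all sizes and weights are assumptions) makes the three layers concrete: one-hot context vectors go in, their embeddings are looked up and averaged in the projection layer, and a softmax over the vocabulary comes out.

```python
import numpy as np

vocab_size, embed_dim = 6, 8   # toy sizes chosen for illustration
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embed_dim))    # input -> projection weights (the word vectors)
W_out = rng.normal(size=(embed_dim, vocab_size))   # projection -> output weights

def cbow_forward(context_ids):
    """Predict a probability distribution over the vocabulary from context word ids."""
    # Input layer: one-hot vectors of the context words.
    one_hots = np.eye(vocab_size)[context_ids]   # (num_context, vocab_size)
    # Projection layer: look up and average the context word vectors.
    projection = one_hots @ W_in                 # (num_context, embed_dim)
    hidden = projection.mean(axis=0)             # (embed_dim,)
    # Output layer: softmax over the vocabulary.
    scores = hidden @ W_out                      # (vocab_size,)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = cbow_forward([0, 2, 3, 4])   # ids of the context words around the removed word
print(probs, probs.sum())            # probabilities over the 6-word vocabulary, sums to 1
```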

2.2.2 Advantages and disadvantages of Word2vec

  • advantage

(1) It maps highly sparse one-hot word vectors to low-dimensional, dense semantic vectors, effectively overcoming the high sparsity and high redundancy of one-hot word vectors;

(2) It represents the semantic information of words much more completely, effectively overcoming the lack of semantics in one-hot word vectors.

  • shortcoming

The representation of each word is static, so it cannot handle polysemy. For example, the word vector of "apple" the fruit is identical to that of "Apple" the company, even though the two meanings are completely different.

2.3 BERT

BERT, i.e. pre-training of deep bidirectional Transformers for language understanding, was introduced in the paper 《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding》.


2.3.1 BERT's input representation

[Figure: BERT input representation as the sum of token, segment, and position embeddings]

  • Token Embedding: word (token) feature embedding, i.e. the word vector; for Chinese, only character-level embeddings are currently supported;
  • Segment Embedding: sentence-level feature embedding; for two-sentence input tasks, sentences A and B get their own segment embeddings, while single-sentence tasks use only the sentence-A embedding;
  • Position Embedding: word position features; the maximum sequence length is currently 512 (see the tokenizer sketch after this list).
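
As a hedged illustration, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (neither is mentioned in the original post), the tokenizer output below shows where these embeddings come from: input_ids index the token embeddings, token_type_ids select segment A or B, and position embeddings are added inside the model.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Two-sentence input: sentence A and sentence B.
encoded = tokenizer("我被狗咬了", "狗被我咬了")
print(encoded["input_ids"])       # ids used to look up token embeddings
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens
# Position embeddings are added inside the model, one per position up to the 512-token limit.
```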

2.3.2 BERT's pre-training

BERT's pre-training consists of two tasks: a masked language model and sentence-pair relationship prediction.

  • **Bidirectional (masked) language model:** randomly mask 15% of the input tokens and predict the masked tokens from the remaining ones (a typical language-model task). Through iterative training, the model learns the contextual and syntactic features of words, which ensures comprehensive feature extraction and is especially important for any NLP task (a quick fill-mask illustration follows this list).
  • **Sentence-pair prediction:** input sentence A and sentence B and predict whether sentence B is the sentence that follows sentence A. Through iterative training, the model learns relationships between sentences, which is especially important for text matching tasks.
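
To see the masked language model in action, the sketch below uses the Hugging Face transformers fill-mask pipeline with the bert-base-uncased checkpoint (both are assumptions; the original post does not prescribe a toolkit).

```python
from transformers import pipeline

# A pre-trained BERT predicting a masked token from its bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I was bitten by a [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Likely completions include animal words such as "dog" or "snake".
```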

2.3.3 Characteristics of BERT

  • **Truly bidirectional:** the bidirectional Transformer uses the left and right context of the current word for feature extraction at the same time. This is essentially different from a bidirectional RNN, which reads the left and right context in two separate passes, and from a CNN, which only sees the context inside a window of limited size;
  • **Dynamic representation:** features are extracted from the context of each word, so the word vector is adjusted dynamically for different contexts, which solves the polysemy problem of word2vec;
  • **Parallel computation:** the Transformer blocks use a multi-head attention mechanism internally, so the features of all words in the input sequence can be extracted in parallel, which is essentially different from an RNN's one-directional, step-by-step processing over time;
  • **Easy transfer learning:** with a pre-trained BERT, you only need to load the model as the word-embedding layer of your current task, or use it for the NLP task directly, without heavy code modification or tuning.

2.3.4 How to use BERT

[Figure: BERT fine-tuning setups for the four downstream task types (a)-(d) described below]

BERT can be regarded as a text encoder and used as the text-embedding layer when building networks for various NLP tasks. The figure above shows (a) a text matching task; (b) a text classification task; (c) an extractive question answering task; (d) a sequence labeling task. The specific usage for BERT fine-tuning (an alternative transformers sketch follows the list):

  • Sequence labeling tasks
  1. Load the pre-trained BERT model;
  2. Take the output token (word) vectors:
    embedding = bert_model.get_sequence_output()
  3. Build the subsequent task-specific network on top of them.
  • Text classification and text matching tasks
  1. Load the pre-trained BERT model;
  2. Take the output sentence vector:
    output_layer = bert_model.get_pooled_output()
  3. Build the subsequent task-specific network on top of it.
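
The get_sequence_output() / get_pooled_output() calls above follow the TensorFlow BERT interface referenced in the original post. As an alternative sketch, assuming the Hugging Face transformers library (not part of the original post), the same two outputs can be obtained like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert_model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("我被狗咬了", return_tensors="pt")
with torch.no_grad():
    outputs = bert_model(**inputs)

token_vectors = outputs.last_hidden_state   # per-token vectors, analogous to get_sequence_output()
sentence_vector = outputs.pooler_output     # pooled [CLS] vector, analogous to get_pooled_output()
print(token_vectors.shape, sentence_vector.shape)
# A task-specific head (classifier, CRF, span predictor, ...) is then built on top of these outputs.
```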


Source: blog.csdn.net/linjie_830914/article/details/130733660