NLP BigModel

NLP basics

It is recommended to read [CS224N 2023] to lay the foundation
[Introduction to NLP] 1. n-gram model / recurrent neural network
[Introduction to NLP] 3. Word2Vec / GloVe

  • Language Model: a language model assigns probabilities to word sequences. Under the Markov assumption, the probability of each word depends only on the preceding words, and the model is autoregressive (the same factorization used by decoder-only models). It supports two abilities: ① given the preceding context, predict the next word $w_n$ through the conditional probability $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$ (language modeling ability); ② estimate the joint probability $P(w_1, w_2, \ldots, w_n)$ that a word sequence forms a sentence (language understanding ability).
  • N-gram Model: slide a window of size N over the text to form fragments of length N; each fragment is called a gram. Count the frequency of all grams and filter by a preset threshold to obtain a key-gram list, which defines the text's vector feature space, with each distinct gram as one feature dimension. The model assumes that the occurrence of the N-th word depends only on the previous N-1 words and on nothing else, so the probability of a whole sentence is the product of the per-word conditional probabilities, which can be estimated by directly counting how often the N-word sequences appear together in the corpus (see the bigram sketch after this list). Behind n-gram, however, is a one-hot view of words: the similarity between any two different words is zero, so the semantic similarity of synonyms is not captured.
  • Vocabulary: during word embedding, words seen in the vocabulary are encoded and mapped to vectors, while unseen words are mapped to UNK. To relieve the word-level sparsity problem and the unbounded-vocabulary (out-of-vocabulary) problem, char-level encoding can be used instead, but its granularity is too fine, so subword-level encodings (word roots, prefixes, suffixes, etc.) are used in practice. Vocabulary embedding alone, however, does not model the relationships between words within a sentence.
  • Word2Vec: during word embedding, a pre-trained neural network maps words into a vector space so that textually similar words lie close together in the feature space. A separate LM must then take these embeddings as input to perform downstream tasks.
  • Pre-train Whole Model: pre-training the whole model is a technique for training large-scale neural networks on large-scale text corpora, directly unifying the word embedding and the downstream task model.
  • BigModel: the mechanism behind large models is transfer learning (unsupervised pre-training + supervised fine-tuning).
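
Below is a minimal sketch of the n-gram idea from the list above: a bigram (N = 2) language model whose conditional and joint probabilities are estimated from raw counts. The toy corpus and sentences are made up for illustration only.

```python
from collections import Counter

# Toy corpus with sentence-boundary markers.
corpus = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the dog sat on the rug </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def p_cond(word, prev):
    """P(word | prev): conditional probability under the first-order Markov assumption."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_sentence(words):
    """Joint probability P(w_1, ..., w_n) as a product of bigram conditionals."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_cond(word, prev)
    return p

print(p_cond("sat", "cat"))                                   # P(sat | cat)
print(p_sentence("<s> the cat sat on the rug </s>".split()))  # joint probability of a new sentence
```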


Pre-trained Language Model (PLM)

PLMs are mainly divided into two categories:

  • Feature-based model: the output of the pre-trained model is used as input features for downstream tasks; Word2Vec was the first PLM of this kind.
  • Fine-tune-based model: the pre-trained model itself serves as the downstream task model and its parameters are also updated, i.e. the Pre-train Whole Model approach, such as GPT, BERT, etc.

Language pre-training model review: BERT & GPT & T5 & BART
A summary of nearly 30 recent models, from T5, GPT-3, Chinchilla, PaLM to LLaMA, Alpaca, etc.
[Transformer 101 series] A first look at LLM base models: encoder-only, encoder-decoder and decoder-only
[Transformer 101 Series] How is ChatGPT made?
[Transformer 101 Series] The road to multi-modal unification

Fine-tune based model

The main representatives of Fine-tune based models are:

  • Encoder-only (masked-LM model / bidirectional understanding, also called Auto-Encoding): pink branch; every output token can attend to all input tokens. Pre-trained with tasks such as masked language modeling and next sentence prediction. Examples: BERT, RoBERTa, etc.
  • Decoder-only (left-to-right LM model / generation tasks, also called Auto-Regressive): blue branch; each output token can only attend to the input tokens before it (causal attention). Pre-trained generatively. Examples: the GPT series, LLaMA, PaLM, OPT, LaMDA, Chinchilla, BLOOM, etc.
  • Encoder-Decoder (Text2Text / MLM model): green branch; the encoder sees the full input bidirectionally while the decoder generates autoregressively, attending to the encoder output. Examples: T5, BART, GLM, etc.

Representative models

GPT: a 12-layer Transformer decoder that performs left-to-right autoregressive generative pre-training, i.e. a generative model.
GPT-2: scales up GPT's parameters and pre-training data, demonstrating zero-shot capability (handling unseen tasks with only a prompt) and in-context learning (imitating the few examples given in the prompt).
BERT: a Transformer encoder pre-trained with bidirectional masked-token cloze (Masked LM), adding a CLS token for downstream tasks.
T5: a Transformer encoder-decoder architecture that casts everything as text2text and is pre-trained with an MLM-style objective.
BART: an encoder-decoder architecture; compared with T5, it uses a wider variety of text-noising pre-training objectives.

Encoder-only

LLMs with an encoder-only architecture are better at analyzing and classifying text content, including sentiment analysis and named entity recognition. BERT is used as the example here. RoBERTa is an upgrade of BERT: it enlarges the batch size, trains on more data, and drops BERT's next-sentence prediction pre-training task.

BERT's pre-training is based on the next-sentence prediction task and masked language modeling:

  • Next-sentence prediction: given a pair of sentences, BERT predicts whether the second sentence actually follows the first in the corpus (half of the training pairs are true next sentences, half are randomly sampled sentences).

  • Masked language modeling: randomly select 15% of the tokens in a large text corpus and let BERT predict them from the surrounding context. Of the selected tokens, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged (a minimal sketch of this masking rule follows below).

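A minimal sketch of the 15% / 80-10-10 masking rule described above; the toy vocabulary and sentence are illustrative, not BERT's real WordPiece tokenizer.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat", "rug"]   # toy vocabulary for random replacement

def mlm_corrupt(tokens, mask_prob=0.15):
    """Select ~15% of positions as prediction targets; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                          # BERT must recover the original token here
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK                  # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

print(mlm_corrupt("the cat sat on the mat".split()))
```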

Decoder-only

The decoder's main purpose is to predict the next output token, using the previously generated tokens as context. In fact, decoder-only models are as effective as encoder-only LLMs at analysis and classification.

The decoder layers of a decoder-only model are similar to encoder layers, but apply a mask over positions. This "mask" prevents the model from attending to positions after position i: during the decoder's self-attention, the attention distribution (i.e. the attention weights) at position i covers only position i and the positions before it, never the positions after it. This ensures that the output at position i depends only on the known outputs before position i (a minimal sketch of this causal attention follows below).

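A minimal PyTorch sketch of the causal self-attention just described: a single head with no learned projections, kept deliberately small for illustration.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Scaled dot-product attention where position i may only attend to positions <= i."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5                          # (T, T) raw attention scores
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # block attention to future positions
    weights = F.softmax(scores, dim=-1)                  # each row sums to 1 over past positions only
    return weights @ v

x = torch.randn(5, 8)                        # 5 tokens, hidden size 8; here Q = K = V = x
print(causal_self_attention(x, x, x).shape)  # torch.Size([5, 8])
```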

Two ways of decoder pre-training:

  1. Train on the last token of the autoregressive sequence (in the autoregressive structure only the last position can attend to all of the input) and feed its hidden state to an FC layer for a classification task: $y = A h_T + b$
  2. Use the natural LM sequence-generation pre-training task: $w_t = A h_{t-1} + b$ (see the sketch below)

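A minimal PyTorch sketch of the two pre-training heads above; `hidden` stands in for the decoder's output states, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

d_model, vocab_size, num_classes = 64, 1000, 2
hidden = torch.randn(1, 10, d_model)        # (batch, seq_len, d_model) decoder hidden states

# 1) Classification head on the last token: y = A h_T + b
#    (only the last position has attended to the whole input).
cls_head = nn.Linear(d_model, num_classes)
y = cls_head(hidden[:, -1, :])              # (1, num_classes)

# 2) LM head for generative pre-training: w_t = A h_{t-1} + b
#    (each position predicts the next token).
lm_head = nn.Linear(d_model, vocab_size)
next_token_logits = lm_head(hidden[:, :-1, :])   # (1, 9, vocab_size); targets are tokens 2..10

print(y.shape, next_token_logits.shape)
```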

Encoder-Decoder

Encoder-only means that every output token can see all input tokens, past and future. This is naturally friendly for NLU tasks, but for seq2seq tasks such as machine translation this structure is not particularly suitable, because it is hard to use it directly to generate the translation output.

A direct remedy is to add a decoder for predictive generation, which yields the encoder-decoder architecture.

LLMs of this architecture usually combine the advantages of the two types above, with new techniques and architectural adjustments to optimize performance. The encoder enjoys bidirectional context to understand the input (NLU), while the decoder handles processing and generation (NLG). This architecture is especially good at tasks with complex mapping relationships between input and output sequences, where capturing the relationships between elements of the two sequences is crucial.

  • The decoder's first MHA becomes a masked MHA, using the causal attention mask described above, so that each output token can only see the tokens generated in the past.
  • The decoder adds a second MHA whose K and V come from the encoder output, so that the full original input is visible (a minimal sketch of this cross-attention follows this list).
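
A minimal PyTorch sketch of that second MHA (cross-attention): Q comes from the decoder, while K and V come from the encoder output, so every decoder position can see the entire source sequence. Sizes are arbitrary.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 12, d_model)    # encoder output: the full source sequence
dec_states = torch.randn(1, 7, d_model)  # decoder states after the masked self-attention

# Q from the decoder; K and V from the encoder.
out, attn = cross_attn(query=dec_states, key=enc_out, value=enc_out)
print(out.shape, attn.shape)             # (1, 7, 64) and (1, 7, 12)
```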

At this point we can sort out the two variants of the encoder-decoder design:

  • Encoder and decoder kept separate, the standard original structure: A and B use a fully-visible attention mask, while C uses a causal attention mask.
  • Encoder and decoder merged into one stack: the first half is fully visible and the second half is causal; D is the causal-with-prefix attention mask (a sketch of these mask patterns follows below).
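
A minimal sketch of the three mask patterns mentioned above, built as boolean matrices where True means "may attend"; sequence length and prefix length are arbitrary.

```python
import torch

def fully_visible(T):
    """Encoder-style: every position attends to every position."""
    return torch.ones(T, T, dtype=torch.bool)

def causal(T):
    """Decoder-style: position i attends only to positions <= i."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def causal_with_prefix(T, prefix_len):
    """Prefix LM: the first prefix_len positions are visible to all positions;
    the remaining positions follow the causal rule."""
    mask = causal(T)
    mask[:, :prefix_len] = True
    return mask

print(causal_with_prefix(6, prefix_len=2).int())
```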

Fine tuning

Large model fine-tuning paradigm:

Prompt engineering:

LoRA fine-tuning:
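
A minimal, hand-rolled LoRA sketch (not the peft library API): the pre-trained weight is frozen and a trainable low-rank update B·A, scaled by alpha/r, is added on top.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # the pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at the start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)   # only A and B receive gradients during fine-tuning
```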

RAG: after the business obtains its data, it builds a knowledge base and fine-tunes the large model:

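A minimal, purely illustrative retrieve-then-generate sketch of the RAG flow; `embed` and `llm_generate` are toy placeholders for a real embedding model and a real LLM call, not any specific library's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model (bag of character codes)."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch)
    return vec / (np.linalg.norm(vec) + 1e-9)

def llm_generate(prompt: str) -> str:
    """Placeholder for the large-model call."""
    return f"[LLM would answer based on]\n{prompt}"

def rag_answer(question, kb_chunks, top_k=2):
    """Retrieve the top-k most similar knowledge-base chunks, then let the
    large model answer with them as grounding context."""
    q = embed(question)
    ranked = sorted(kb_chunks, key=lambda c: float(q @ embed(c)), reverse=True)
    context = "\n".join(ranked[:top_k])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)

kb = ["Returns are accepted within 30 days.", "Shipping takes 5-7 business days."]
print(rag_answer("How long does shipping take?", kb))
```
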
Audit system:
Intelligent error correction:

Model compression and acceleration:
Model quantization:
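
A minimal sketch of symmetric per-tensor int8 weight quantization, the basic idea behind post-training quantization: store int8 values plus one scale factor, and dequantize when needed. Tensor sizes are arbitrary.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximation of the original fp32 weights."""
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
print("int8 bytes:", q.numel(), "vs fp32 bytes:", w.numel() * 4)
print("max abs error:", (dequantize(q, scale) - w).abs().max().item())
```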


Origin blog.csdn.net/weixin_54338498/article/details/133133938