Large Language Model (LLM)

1. Development of language models

The goal of a language model (LM) is to model the probability distribution of natural language. Concretely, the task is to construct a probability distribution over word sequences w1, w2, ..., wm, i.e., to compute the probability P(w1 w2 ... wm) that a given word sequence occurs as a sentence. However, the joint probability P has an enormous number of parameters, on the order of N^m (where m is the sentence length and N is the number of possible words). A simplifying idea is to use the left-to-right generation process of the sentence to decompose the joint probability:

P(w1 w2 ... wm) = P(w1) · P(w2 | w1) · ... · P(wm | w1 w2 ... w(m-1)) = ∏_{i=1}^{m} P(wi | w1 ... w(i-1))

That is, generating the word sequence w1 w2 ... wm is treated as generating words one by one, where the probability of the i-th word depends on the preceding i-1 words. Note that this decomposition does not by itself reduce the number of parameters the model needs, but the transformation opens the way for the simplifications that follow.
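To make the decomposition concrete, the sketch below estimates these conditional probabilities from corpus counts under a bigram assumption (the simplest case of the n-gram model introduced below). The toy corpus is illustrative only:

```python
from collections import Counter

# Toy corpus; in practice these counts come from a large text corpus.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
    ["<s>", "the", "cat", "ran", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_next(word: str, prev: str) -> float:
    """MLE estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_sentence(sent: list[str]) -> float:
    """Chain-rule factorization, truncated to a first-order (bigram) history."""
    p = 1.0
    for prev, w in zip(sent, sent[1:]):
        p *= p_next(w, prev)
    return p

print(p_next("cat", "the"))                              # 2/3
print(p_sentence(["<s>", "the", "cat", "sat", "</s>"]))  # 1/3
```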

  • Statistical language model: the n-gram model. Under the assumption above, the probability of a word depends on the preceding words (its history); an n-gram model truncates this history to the previous n-1 words, and the simplest way to estimate the conditional probabilities is to count the frequencies of word sequences in a corpus (as in the bigram sketch above).
  • Neural network language model (NNLM): overcomes the curse of dimensionality of the n-gram model through distributed representations (word embeddings); representative architectures include FFN, RNN, and LSTM language models
  • Pre-trained language model (PLM): ELMo first pre-trained a bidirectional LSTM network; the Transformer is based on the attention mechanism; and BERT used a masking mechanism to build a pre-training task of predicting masked intermediate words from their context. Three architecture families followed:
  1. Encoder-only pre-trained models: e.g., the BERT model
  2. Decoder-only pre-trained models: e.g., the GPT model
  3. Encoder-decoder pre-trained models: Seq2Seq models such as BART; the pre-training method feeds in text corrupted by various kinds of noise, and the model learns to denoise and reconstruct it

  • Large language model (LLM)

In 2019, Google released T5. In January 2020, OpenAI published the paper "Scaling Laws for Neural Language Models", which studied empirical scaling laws for language-model performance measured by cross-entropy loss. It found that larger models are significantly more sample-efficient, so the most compute-efficient way to train is to train very large models on a moderate amount of data and stop early, well before convergence.

Some key techniques for LLM success:

  • Model scaling: increasing model size; along with scale, the quality of the training data plays a key role in achieving good performance
  • Model training: use optimization frameworks that implement distributed parallel training, such as DeepSpeed and Megatron-LM
  • Ability eliciting: use appropriate task instructions and prompts to guide large language models to solve complex tasks
  • Alignment tuning: because large language models are trained on vast amounts of data, low-quality data may lead them to generate erroneous or even harmful content. GPT adopts reinforcement learning with human feedback (RLHF), which brings humans into the training loop through carefully designed labeling to mitigate this problem
  • Tool manipulation: external tools such as plug-ins can extend the capabilities of large language models

LLMs can be roughly divided into two types: base LLMs and instruction-tuned LLMs. A base LLM is trained on text data to predict the next word, typically on large volumes of data from the Internet and other sources. An instruction-tuned LLM usually starts from a base LLM that has already been trained on large amounts of text; it is then fine-tuned on a dataset whose inputs are instructions and whose outputs are the results the model should return, teaching it to follow those instructions. It is often further improved with a technique called RLHF (reinforcement learning from human feedback) to make the system more helpful at following instructions.

2. LLM training techniques

Basic process: pre-training → instruction fine-tuning → alignment with human feedback (RLHF), as detailed below.

The sources of pre-training corpora can be roughly divided into two categories: general-purpose corpora and specialized corpora.

  • General-purpose corpora: web pages, books, and conversational text; their large scale, diversity, and easy accessibility strengthen the language-modeling and generalization abilities of large language models.
  • Specialized corpora: given the strong generalization ability of large language models, studies also extend the pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, to endow large language models with task-specific problem-solving abilities.

Pre-trained language models in the BERT mold must be fine-tuned on task-specific data. This paradigm works for pre-trained models with millions to hundreds of millions of parameters, but for large models with billions or tens of billions of parameters, the computational overhead and time cost of fine-tuning for every task are almost unacceptable.

Therefore, instruction fine-tuning (Instruction Finetuning) was introduced: a large number of tasks of different types are unified into a generative natural-language framework, and a training corpus is constructed for fine-tuning (a sample format is sketched below).
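For illustration, instruction-tuning corpora are commonly stored in an instruction/input/output format (the convention popularized by datasets such as stanford-alpaca, listed below); the samples here are hypothetical:

```python
# Hypothetical samples in the common instruction/input/output format.
samples = [
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "LoRA freezes the pre-trained weights and injects trainable "
                 "rank-decomposition matrices into each Transformer block...",
        "output": "LoRA fine-tunes large models cheaply by training only small "
                  "low-rank matrices while the original weights stay frozen.",
    },
    {
        "instruction": "Classify the sentiment of this review as positive or negative.",
        "input": "The battery dies within two hours.",
        "output": "Negative",
    },
]

# Each sample is rendered into a single training string for the generative model:
s = samples[0]
print(f"Instruction: {s['instruction']}\nInput: {s['input']}\nOutput: {s['output']}")
```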

Through instruction fine-tuning, a large model learns how to respond to human instructions and can directly generate reasonable answers to them. Because a large number of tasks are covered in the instruction fine-tuning stage, the task abilities of the large model generalize to previously unseen tasks, giving the model an initial ability to answer any instruction people pose. This capability is crucial for large models to perform well in the open domain.

Although instruction-fine-tuned models perform well on open-domain tasks, their outputs often differ markedly from human answers; in short, they are "inhuman". The model therefore needs further optimization so that its outputs align with human habits. The most representative and hugely successful method is Reinforcement Learning from Human Feedback (RLHF), developed by OpenAI and central to ChatGPT; the core reward-model loss is sketched below.
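At the heart of the RLHF recipe is a reward model trained on human preference pairs; a minimal sketch of its pairwise loss (InstructGPT-style, not the full RLHF pipeline) is:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred answer
    above the reward of the rejected answer."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards for two (chosen, rejected) response pairs:
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
print(loss)  # small when chosen responses already outscore rejected ones
```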

3. LLM memory-saving methods

  • fp16
  • int8
  • LoRA: Low-Rank Adaptation of Large Language Models, for fine-tuning large models. LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each Transformer block; since gradients need not be computed for most model weights, this greatly reduces the number of trainable parameters and the GPU memory requirements (see Section 3.1).
  • Textual Inversion: given a few images of a concept, learns pseudo-words in the text-embedding space of a text-to-image generation model to represent the concept; these pseudo-words are then composed into natural-language sentences to guide personalized generation. In effect, an image prompt (see Section 3.3).
  • Gradient checkpointing
  • Torch FSDP
  • CPU offloading
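A sketch of how several of these knobs are switched on, assuming the Hugging Face transformers stack (with bitsandbytes for int8); "gpt2" is a stand-in checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # stand-in; substitute the target checkpoint

# fp16: load weights in half precision to roughly halve weight memory.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Gradient checkpointing: trade compute for memory by recomputing activations
# during the backward pass instead of storing them all.
model.gradient_checkpointing_enable()

# int8: quantized loading via bitsandbytes (pip install bitsandbytes accelerate).
# model_int8 = AutoModelForCausalLM.from_pretrained(
#     model_name, load_in_8bit=True, device_map="auto"
# )
```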

3.1 LoRA low-rank adaptation fine-tuning

LoRA (Low-Rank Adaptation of Large Language Models) is mainly used for fine-tuning large models. Highly capable models such as GPT-3, with parameter counts now in the billions and beyond, incur huge overhead when fine-tuned to adapt to downstream tasks. LoRA proposes to freeze the pre-trained model weights and inject trainable rank-decomposition matrices into each Transformer block. Since gradients do not need to be computed for most of the model weights, this greatly reduces the number of trainable parameters and the GPU memory requirements. The researchers found that by focusing on the Transformer attention blocks, LoRA fine-tuning matched the quality of full-model fine-tuning while being faster and requiring less compute.

(Figure 1: the LoRA reparameterization; only A and B are trained.)

 Advantages of LoRA:

  • A pre-trained model can be shared and used to build many small LoRA modules for different tasks. By swapping the matrices A and B in Figure 1 while keeping the shared model parameters frozen, tasks can be switched efficiently, significantly reducing storage requirements and task-switching overhead.
  • With adaptive optimizers, LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3×, because gradients need not be computed and optimizer state need not be maintained for most parameters; only the much smaller injected low-rank matrices are optimized.
  • The simple linear design allows the trainable matrices to be merged with the frozen weights at deployment time, so by construction LoRA introduces no extra inference latency compared with a fully fine-tuned model.

For example, suppose ΔW is the update to a weight matrix W ∈ R^(A×B). The update matrix can be decomposed into two smaller matrices: ΔW = W_A · W_B, where W_A ∈ R^(A×r) and W_B ∈ R^(r×B). The original weights W are kept frozen and only the new matrices W_A and W_B are trained.

When reparameterizing, W_A is initialized with a random Gaussian and W_B with zeros, so ΔW = W_A · W_B is zero at the beginning of training. ΔWx is then scaled by α/r, where α is a constant; when optimizing with Adam, tuning α is roughly equivalent to tuning the learning rate if the initialization is scaled appropriately.
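A minimal PyTorch sketch of this reparameterization (illustrative, not the official implementation; the Gaussian initialization scale is simplified here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update ΔW = W_A @ W_B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        # W_A: random Gaussian init; W_B: zeros, so ΔW starts at exactly zero.
        self.W_A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.W_B = nn.Parameter(torch.zeros(r, base.out_features))
        self.scaling = alpha / r                # the α/r factor described above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank update: h = Wx + (α/r)·(x W_A W_B)
        return self.base(x) + (x @ self.W_A @ self.W_B) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 12288 of 590592

# At deployment the update can be merged into the frozen weight, so no extra
# inference latency remains (the LoRA branch is dropped after merging):
with torch.no_grad():
    layer.base.weight += (layer.W_A @ layer.W_B).T * layer.scaling
```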

3.2 Model quantization

(Figure: model quantization compresses FP32 weights to FP16.)
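A minimal demonstration of the saving (assumes PyTorch):

```python
import torch

w = torch.randn(1024, 1024)        # FP32 weights: 4 bytes per element
w_fp16 = w.half()                  # FP16 copy: 2 bytes per element

print(w.element_size() * w.nelement())            # 4194304 bytes
print(w_fp16.element_size() * w_fp16.nelement())  # 2097152 bytes (half)
print((w - w_fp16.float()).abs().max())           # rounding error introduced
```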

3.3 Textual Inversion

Text embedding

  • Each word or sub-word in the input string is converted into a token, i.e., an index into a predefined dictionary (see the BPE algorithm);
  • Each token then maps to a unique embedding vector, and these embedding vectors are usually learned as part of the text encoder;
  • This embedding space is chosen as the target of the inversion, and a placeholder string S* is designated to represent the new concept to be learned;
  • The embedding vector associated with the placeholder is replaced by the newly learned embedding vector v*, i.e., the concept is injected into the vocabulary;
  • Like any other word, the concept token can then be composed into new sentences, for example: "A photo of S*", "A painting in the style of S*".

Textual Inversion optimizes v* by minimizing the image reconstruction loss of the latent diffusion model (LDM).
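A minimal sketch of that optimization loop (names and token ids are hypothetical, and the loss below is a stand-in for the real LDM denoising loss):

```python
import torch

# Frozen token-embedding table of the text encoder (CLIP-sized, hypothetical).
embedding_table = torch.nn.Embedding(49408, 768)
embedding_table.weight.requires_grad = False

placeholder_id = 49407                                 # token id reserved for S*
v_star = torch.nn.Parameter(torch.randn(768) * 0.01)   # the only trainable tensor
optimizer = torch.optim.Adam([v_star], lr=5e-3)

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    """Look up embeddings, substituting v* wherever the placeholder S* appears."""
    emb = embedding_table(token_ids)
    emb[token_ids == placeholder_id] = v_star
    return emb

token_ids = torch.tensor([320, 1125, 539, placeholder_id])  # "a photo of S*" (hypothetical ids)
loss = embed(token_ids).pow(2).mean()   # stand-in for the LDM reconstruction loss
loss.backward()
optimizer.step()                        # only v* moves; the rest of the model is frozen
```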

[AIGC Part 3] Textual Inversion: Image Generation Technology Based on Prompt tuning


Several ways to train a conversational large language model on vertical-domain problems plus knowledge documents (ordered by required resources, from small to large):

  1. P-tuning v2: deep prompt encoding plus multi-task learning (an extended version of Prefix-Tuning)
  2. LoRA (a PEFT-based sketch follows this list)
  3. Full fine-tuning
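As a sketch of option 2, LoRA fine-tuning through the Hugging Face PEFT library (as alpaca-lora below does); the checkpoint and hyperparameters are placeholders:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the decomposition matrices
    lora_alpha=16,               # the α scaling constant from Section 3.1
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's attention projection; names vary by model
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the injected LoRA matrices are trainable
# `model` now trains with a standard transformers Trainer or a manual loop.
```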

Large language model projects

  • ChatGLM: an open-source 6B Chinese-English bilingual model with Q&A, multi-turn dialogue, and code-generation capabilities
  1. SwissArmyTransformer: a unified Transformer programming framework; ChatGLM-6B is implemented in SAT and supports P-tuning fine-tuning
  2. ChatGLM-MNN: an MNN-based C++ inference implementation of ChatGLM-6B that automatically splits computation between GPU and CPU according to available memory
  3. JittorLLMs: runs ChatGLM-6B FP16 in as little as 3 GB of memory, or even without a GPU; supports deployment on Linux, Windows, and macOS
  • Colossal AI: SFT supervised fine-tuning -> reward model (RM) training -> reinforcement learning (RLHF)
  • LLaMA: stanford-alpaca
  • alpaca-lora: uses LoRA (via the PEFT library) for low-cost, efficient fine-tuning
  • Dolly: Fine-tuning on GPT-J-6B using the Alpaca dataset
  • PaLM-rlhf-pytorch: Based on Google's large model PaLM architecture
  • Vicuna: a fine-tuned LLaMA model
  • LMFlow: low-cost training, open web experience
  • ChatRWKV
  • Fudan MOSS
  • GPTrillion
  • Koala
  • StackLLaMA

Chinese-LangChain: a component-by-component walkthrough of its knowledge-base pipeline

Large language model datasets

  • RefGPT: generates large volumes of factual, customized dialogue datasets (Chinese)
  • Alpaca-CoT: unifies rich IFT data (e.g., CoT data, still being expanded), multiple efficient training methods (e.g., LoRA, P-tuning), and multiple LLMs behind a unified interface at all three levels, giving researchers an easy on-ramp to an LLM-IFT research platform

Large language model evaluation

  • FlagEval (Libra) large-model evaluation system and open platform: builds a three-dimensional "ability-task-metric" evaluation framework, characterizes the cognitive-ability boundaries of foundation models at fine granularity, and visualizes the evaluation results
  • C-Eval: a knowledge evaluation benchmark for Chinese large models
  • AGIEval: a new benchmark released by Microsoft, built from 20 official, public, high-standard admission and qualification exams designed for human test-takers

LLM application integrations

  • koishi : Create cross-platform, scalable, high-performance bots

Data Repository on GitHub

