
A Comprehensive Overview of Large Language Models


Paper link

1. Preface

Compared with the large-model survey published by Renmin University of China in April this year, this review focuses more on how large models are actually implemented. It is more technical and better suited for readers who want to understand the details of large models in depth.

2. Introduction

The figure below shows the trend of open-source and closed-source large models in recent years. Except for a dip in closed-source work in 2023, the number of large models has grown steadily, whether counted as open source, closed source, or in total. Note, however, that the paper omits many recent models, such as Baichuan, ChatGLM3, and InternLM (Puyu); the real landscape this year is a battle of hundreds of models.
(Figure: number of open-source and closed-source LLMs released per year)
The figure below gives a timeline of representative large models released in recent years.

(Figure: timeline of representative LLMs in recent years)

The figure below shows the overall structure of the survey, covering 1. training, 2. inference, 3. evaluation, 4. applications, and 5. challenges.
(Figure: structure of the survey)

3. Relevant basics

1. Tokenization

Tokenization converts raw text into the list of token IDs that is fed to the model; it is a necessary preprocessing step. Readers can refer to this blog for details.
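As a quick illustration (my own example, not from the paper), this is what tokenization looks like with the Hugging Face transformers library; the `gpt2` checkpoint is just an arbitrary choice:

```python
from transformers import AutoTokenizer

# Example only: any pretrained checkpoint with a tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models are trained on token IDs."
ids = tokenizer(text)["input_ids"]           # text -> list of integer IDs
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))  # IDs -> subword tokens
print(tokenizer.decode(ids))                 # IDs -> text again
```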

2. Attention Mechanisms

Self-Attention: the attention mechanism of the original Transformer, in which the queries, keys, and values all come from the same sequence.

Cross-Attention: the inputs of cross-attention come from different sequences, whereas in self-attention Q, K, and V come from the same sequence. For example, Q may be obtained by encoding an image while K and V come from encoding text; the attention itself is then computed exactly as in self-attention.
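A minimal sketch of this (my own simplification: single head, no masking, random projections and inputs):

```python
import torch
import torch.nn.functional as F

def cross_attention(q_seq, kv_seq, d_model=64):
    # Illustrative only: fresh random projections each call, single head.
    Wq = torch.nn.Linear(d_model, d_model)
    Wk = torch.nn.Linear(d_model, d_model)
    Wv = torch.nn.Linear(d_model, d_model)

    Q = Wq(q_seq)    # queries from one sequence (e.g. encoded image patches)
    K = Wk(kv_seq)   # keys from the other sequence (e.g. encoded text)
    V = Wv(kv_seq)   # values from the other sequence

    scores = Q @ K.transpose(-2, -1) / d_model ** 0.5  # (Lq, Lkv) attention scores
    weights = F.softmax(scores, dim=-1)
    return weights @ V                                  # (Lq, d_model) attended output

image_tokens = torch.randn(16, 64)   # hypothetical encoded image patches
text_tokens = torch.randn(10, 64)    # hypothetical encoded text
out = cross_attention(image_tokens, text_tokens)
print(out.shape)   # torch.Size([16, 64])
```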

Full Attention: the dense attention of the original Transformer, in which every token attends to every other token; in this sense it is the same as self-attention above.

Sparse Attention: standard self-attention produces a full score matrix describing the relation between every pair of tokens. Sparse attention masks out some of these pairs (so they receive zero attention weight), which reduces computation and lets the model handle longer contexts; for example, each token may attend only to a local window, as sketched below.
(Figure and formula omitted: illustration of the sparse attention pattern.)
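As an illustration (my own sketch, not the paper's formulation), a simple local-window sparse mask can be built like this:

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = allowed to attend, False = masked out.
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist <= window        # each token only sees a local window

mask = local_attention_mask(seq_len=8, window=2)
scores = torch.randn(8, 8)                         # dense attention scores
scores = scores.masked_fill(~mask, float("-inf"))  # outside-window pairs get -inf
weights = torch.softmax(scores, dim=-1)            # masked pairs receive zero weight
```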
Flash Attention: mathematically identical to standard self-attention; nothing about the attention itself changes. What changes is how the computation is organized on the GPU (tiling the computation to reduce reads and writes to high-bandwidth memory), so the results are the same but memory traffic is much lower.

3. Encoding Positions

After tokenization, the model adds positional encodings to the input embeddings. This step is generally considered necessary (although some recent work argues it matters less than expected). There are two broad approaches, with a small code sketch after the list:

  1. Absolute encoding: the most straightforward way to add order information; each position in the sequence is assigned a unique encoding (for example, the fixed sinusoidal embeddings of the original Transformer) before the input is passed to the attention module.
  2. Relative encoding: instead of absolute identifiers, information about the relative distance between tokens is injected into the attention computation, usually through learned or fixed transformations. Two well-known relative schemes are ALiBi and RoPE.
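To make both flavors concrete, here is a minimal sketch (my own simplification: single head, 2-D tensors) of the classic sinusoidal absolute encoding and of RoPE, which injects relative position by rotating query/key vectors:

```python
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    # Absolute positional encodings of the original Transformer (dim must be even).
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]      # (L, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)[None, :]      # (1, dim/2)
    angles = pos / (10000 ** (i / dim))
    enc = torch.zeros(seq_len, dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    # Minimal RoPE sketch: rotate consecutive dimension pairs by a position-dependent angle.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]
    theta = (10000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim))[None, :]
    angles = pos * theta                       # (L, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * torch.cos(angles) - x2 * torch.sin(angles)
    out[:, 1::2] = x1 * torch.sin(angles) + x2 * torch.cos(angles)
    return out

q = torch.randn(16, 64)      # hypothetical per-token query vectors
q_rot = apply_rope(q)        # relative position enters via dot products of rotated q and k
```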

4. Activation Functions

Common activation functions are as follows:

  1. ReLU:ReLU(x) = max(0, x)
  2. GELU: can be seen as a smooth combination of the ideas behind ReLU, dropout, and zoneout; it is the most widely used activation in LLMs.
  3. GLU variants: LLMs often use variants of GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c), including 1. ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c), 2. GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c), 3. SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c). A SwiGLU sketch is given below.
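A minimal SwiGLU feed-forward block (my own sketch; the bias-free layers and hidden size are illustrative, not tied to any particular model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Sketch of a SwiGLU feed-forward block as used in many LLMs, simplified.
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)    # the xW branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)    # the xV (gate) branch
        self.out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.w(x)) * self.v(x))       # Swish(xW) ⊗ (xV)

ffn = SwiGLUFeedForward(d_model=512, d_hidden=1376)
y = ffn(torch.randn(4, 16, 512))   # (batch, seq_len, d_model)
```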

5. Layer Normalization

  1. LayerNorm: normalizes over a different dimension than BatchNorm (over the neurons of a single sample rather than over the batch). Let $n^l$ be the number of neurons in layer $l$ and $a_i^l$ the summed input of neuron $i$ in layer $l$. LayerNorm re-centers and re-scales each summed input using the statistics of the layer:
    $\bar{a}_i^l = \frac{a_i^l - \mu^l}{\sigma^l}\, g_i^l, \quad \mu^l = \frac{1}{n^l}\sum_{i=1}^{n^l} a_i^l, \quad \sigma^l = \sqrt{\frac{1}{n^l}\sum_{i=1}^{n^l}\big(a_i^l - \mu^l\big)^2},$
    where $g_i^l$ is a gain parameter.

  2. RMSNorm: a modification of LayerNorm. It argues that the benefit of LayerNorm comes from re-scaling invariance rather than re-centering, and therefore proposes a computationally simpler, re-scaling-invariant normalization that delivers similar performance. RMSNorm replaces the normalized summed input above with
    $\bar{a}_i^l = \frac{a_i^l}{\mathrm{RMS}(a^l)}\, g_i^l, \quad \mathrm{RMS}(a^l) = \sqrt{\frac{1}{n^l}\sum_{i=1}^{n^l}\big(a_i^l\big)^2}.$
    (A code sketch of RMSNorm inside a pre-norm block follows this list.)

  3. Pre-Norm and Post-Norm: these are not normalization techniques themselves; they refer to whether normalization is applied inside or outside the residual connection. The order originally used in the Transformer is usually called Post-Norm: $x_{t+1} = \mathrm{Norm}(x_t + F_t(x_t))$. It was later found that switching to Pre-Norm, $x_{t+1} = x_t + F_t(\mathrm{Norm}(x_t))$, makes training more stable. Note that with Pre-Norm the residual connection adds the input taken before the LayerNorm to the output of the attention (or feed-forward) sublayer.

  4. DeepNorm: addresses the issue of earlier layers receiving larger gradients than later ones, an instability of very deep Transformers, by scaling the residual connection.
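As a concrete illustration (my own minimal sketch, not the paper's code), here is RMSNorm together with a pre-norm residual block; the generic `sublayer` stands in for attention or the feed-forward network:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # RMSNorm sketch: re-scale by the root mean square, no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))   # g_i in the formula above
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gain

class PreNormBlock(nn.Module):
    # Pre-Norm residual wiring: x + F(Norm(x)), with F standing in for attention or FFN.
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

block = PreNormBlock(512, nn.Linear(512, 512))   # toy sublayer for illustration
y = block(torch.randn(2, 16, 512))
```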

6. Distributed LLM Training

Anyone who has trained large models across multiple machines and multiple GPUs knows that distributed parallel training is a discipline in its own right!
Suppose we have 8 GPUs, a batch of 16 samples, and a model consisting of 8 decoder layers with a hidden dimension of 512.

  1. Data Parallelism
    In data parallelism, the training batch is split into smaller shards and each GPU computes gradients on its own shard. Concretely, with 8 GPUs and a batch size of 16, each GPU processes 2 samples; after the backward pass the gradients are averaged across GPUs and the (replicated) model parameters are updated. The computation for the whole batch is thus spread across the GPUs, speeding up training (a minimal DDP sketch appears after this list).

  2. Tensor Parallelism
    Tensor parallelism splits the model's weight matrices across GPUs. In our example, the weights inside each of the 8 decoder layers can be sharded over the GPUs, with each GPU computing the activations and gradients for its shard. Such partitioning allows models that are too large for a single GPU to be trained.

  3. Pipeline Parallelism
    Pipeline parallelism allocates different parts of the model to different GPUs, and each GPU is responsible for processing a part of the entire model. In our case, each GPU computes the result of one decoder layer and then passes it to the next GPU. Such pipeline processing can reduce the size of the model on each GPU, allowing larger models to fit into limited GPU memory.

  4. Model Parallelism
    Model parallelism is a combination of tensor and pipeline parallelism. In this strategy, the model is decomposed into multiple parts, and each part is assigned to a different device for computation. This strategy is often used for larger models where one GPU cannot accommodate the entire model.

  5. 3D Parallelism
    3D parallelism combines data parallelism, tensor parallelism, and pipeline parallelism along three axes. In our example, the batch is split across data-parallel groups, each layer's weights are sharded across the tensor-parallel GPUs within a group, and groups of layers are assigned to pipeline stages; combining the three allows very large models to be trained efficiently.

  6. Optimizer Parallelism
    Optimizer parallelism, also known as the zero-redundancy optimizer (ZeRO), partitions the optimizer states, gradients, and optionally the parameters themselves across devices, reducing memory consumption while keeping communication costs as low as possible. This strategy is especially useful for very large models.
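As one concrete instance of the first strategy above, here is a minimal data-parallel skeleton using PyTorch's DistributedDataParallel. This is a sketch that assumes the script is launched with `torchrun --nproc_per_node=8 train_ddp.py` (so the rank environment variables are set), and the tiny linear layer merely stands in for the 8-layer decoder:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK and WORLD_SIZE for each process (one per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(512, 512).to(device)    # stand-in for the 8-layer decoder
    model = DDP(model, device_ids=[local_rank])     # gradients are all-reduced across GPUs
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each of the 8 processes sees its own 2-sample shard of the 16-sample batch.
    x = torch.randn(2, 512, device=device)
    loss = model(x).pow(2).mean()   # dummy loss for illustration
    loss.backward()                 # DDP synchronizes gradients here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```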

7. Commonly used libraries

Training libraries

  1. Transformers

Transformers is a natural language processing (NLP) library developed by Hugging Face. It offers pre-trained models for many tasks, from text generation to sentiment analysis, and is a rich resource for the NLP community.

  2. DeepSpeed

DeepSpeed is a deep learning training library developed by Microsoft Research, aiming to improve the training speed and efficiency of large-scale models. Its features include mixed precision training, model parallelization, and data parallelization.

  3. Megatron-LM

Megatron-LM is a large-scale deep learning library developed by NVIDIA Research, focusing on the training of large language models. It supports model parallelism and data parallelism and is optimized for multi-GPU systems.

  4. JAX

JAX is a numerical computing library from Google Research with automatic differentiation and high-performance GPU/TPU acceleration. It is known for its concise API and support for functional programming.

  5. Colossal-AI

Colossal-AI is a deep learning training library for large-scale models. It supports distributed training, model parallelism and data parallelism, and is designed to solve the performance bottleneck when training large models.

  6. BMTrain

BMTrain is a toolkit for efficient training of large models developed by the OpenBMB team, focusing on low memory footprint and good scaling in distributed settings.

  7. FastMoE

FastMoE is a deep learning library developed by a Tsinghua University research team that provides efficient Mixture-of-Experts (MoE) training on top of PyTorch, allowing models to scale their parameter count without a proportional increase in computation.

Frameworks

  1. MindSpore

MindSpore is a deep learning framework developed by Huawei. It supports data parallelism and model parallelism, and provides an easy-to-use Python API along with graph-mode training.

  2. PyTorch

PyTorch is a deep learning framework developed by Facebook and is known for its dynamic computation graph and intuitive API. PyTorch is widely used in academia and industry and supports both dynamic and static graphs.

  3. TensorFlow

TensorFlow is a deep learning framework developed by Google. It supports both static and dynamic computation graphs and is widely used in deep learning research and production.

  4. MXNet

MXNet is an open-source deep learning framework that combines the advantages of dynamic and static graphs. It supports multiple programming languages and performs well when training large models.

8. Data Preprocessing

  1. Quality filtering: ① classifier-based: train a model to judge text quality; ② heuristic-based: hand-crafted filtering rules based on language, metrics such as perplexity, simple statistics, and keywords (a toy heuristic filter is sketched after this list).
  2. Deduplication
  3. Privacy reduction: removing personally identifiable information (PII) and other private data.
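As an illustration of the heuristic route (the thresholds and rules here are invented for the example, not taken from the paper):

```python
import re

def heuristic_filter(doc: str,
                     min_words: int = 50,
                     max_symbol_ratio: float = 0.1,
                     banned_phrases: tuple = ("lorem ipsum", "click here")) -> bool:
    """Toy heuristic quality filter: return True if the document should be kept."""
    words = doc.split()
    if len(words) < min_words:                          # too short
        return False
    symbols = len(re.findall(r"[^\w\s]", doc))
    if symbols / max(len(doc), 1) > max_symbol_ratio:   # too much punctuation/markup noise
        return False
    lowered = doc.lower()
    if any(phrase in lowered for phrase in banned_phrases):  # boilerplate keywords
        return False
    return True

corpus = ["Some long informative paragraph about training LLMs. " * 20,
          "click here to win"]
kept = [doc for doc in corpus if heuristic_filter(doc)]   # only the first document survives
```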

9. Architectures

  1. Encoder-Decoder: the architecture of the original Transformer.
  2. Causal Decoder: decoder-only, with a strictly causal (left-to-right) attention mask.
  3. Prefix Decoder: also decoder-only, but the prefix (input) tokens attend to each other bidirectionally while generated tokens remain causal, so the prefix effectively plays the role of an encoder. The mask difference is sketched below.
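A small sketch (mine, not the paper's) of the attention masks that distinguish a causal decoder from a prefix decoder:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Causal decoder: token i may only attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Prefix decoder: the first `prefix_len` tokens attend bidirectionally,
    # the remaining (generated) tokens stay causal.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(5).int())
print(prefix_mask(5, prefix_len=3).int())
```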

10. Model fine-tuning

The model fine-tuning framework is shown in the figure below:
(Figure: overview of model fine-tuning approaches)

Alignment fine-tuning

Large language models (LLMs) can generate erroneous, biased, and harmful text. To make these models more helpful, truthful, and harmless, researchers align them using human feedback: undesirable responses are identified and the model's parameters are updated to avoid them, so that the generated text stays consistent with human intentions and values.

Criteria for aligned models: a model is considered "aligned" if it meets three criteria, Helpful, Honest, and Harmless, the so-called "HHH" standard. This ensures that LLMs operate in line with human intentions and values.

Reinforcement Learning from Human Feedback (RLHF) for alignment: researchers use RLHF to align models. In RLHF, a model that has already been fine-tuned on demonstration data is further trained via reward modeling (RM) and reinforcement learning (RL). The RM and RL stages of RLHF are discussed briefly below.

Reward Modeling (RM): reward modeling trains a model to rank generated responses according to human preferences, using a classification-style objective. To train it, humans annotate responses generated by the LLM according to the HHH criteria (a sketch of the usual pairwise ranking loss follows).
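A common way to train such a reward model is a pairwise ranking loss over preferred vs. rejected responses. The sketch below is my own simplification: the tiny `RewardHead` and random hidden states stand in for the LLM backbone, whose final hidden state would normally feed the scalar reward head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    # Illustrative stand-in: in practice the backbone is the LLM itself and the
    # reward is read from the final token's hidden state.
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_hidden: torch.Tensor) -> torch.Tensor:
        return self.score(response_hidden).squeeze(-1)   # scalar reward per response

reward_model = RewardHead()
h_chosen = torch.randn(8, 512)    # hidden states of human-preferred responses
h_rejected = torch.randn(8, 512)  # hidden states of rejected responses

r_chosen = reward_model(h_chosen)
r_rejected = reward_model(h_rejected)

# Pairwise ranking loss: push the preferred response's reward above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```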

Reinforcement Learning (RL): in the next stage, the reward model is combined with RL for alignment. The previously trained reward model scores the responses generated by the LLM as more or less preferred, and proximal policy optimization (PPO) updates the model to increase this reward. The process is repeated iteratively until convergence.

With RLHF, researchers can align LLMs so that the text they generate better matches human expectations while remaining helpful, honest, and harmless. This alignment is critical for avoiding problems in real-world applications of large language models.

Parameter-efficient fine-tuning methods

Training large language models (LLMs) requires enormous memory and compute. To fine-tune them with far fewer resources, researchers have proposed parameter-efficient fine-tuning (PEFT) techniques that update only a small number of parameters, either by adding new parameters to the model or by updating a small subset of the existing ones. Commonly used methods include:

Prompt Tuning
Prompt tuning introduces trainable prompt-token embeddings that are prepended (as a prefix, or mixed in freely) to the input token embeddings. During fine-tuning on a downstream task, only these embeddings are trained while all other weights remain frozen, which makes fine-tuning much cheaper. A minimal sketch follows.
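A minimal sketch of the idea (stand-in modules only; a real setup would wrap an actual LLM and its embedding table):

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    # Sketch: learn `num_prompts` virtual token embeddings, keep the base model frozen.
    def __init__(self, base_model: nn.Module, embed: nn.Embedding, num_prompts: int = 20):
        super().__init__()
        self.base_model = base_model
        self.embed = embed
        for p in list(base_model.parameters()) + list(embed.parameters()):
            p.requires_grad = False                      # freeze everything but the prompts
        self.prompt = nn.Parameter(torch.randn(num_prompts, embed.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                                # (B, L, D)
        prompts = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return self.base_model(torch.cat([prompts, tok], dim=1))   # prepend virtual tokens

# Toy usage with stand-in modules (a real LLM would replace both).
embed = nn.Embedding(1000, 64)
base = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
model = PromptTunedModel(base, embed, num_prompts=10)
logits = model(torch.randint(0, 1000, (4, 16)))
```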

Prefix Tuning
Prefix tuning is another parameter-efficient method: task-specific trainable prefix vectors are injected into the Transformer layers. Only these prefix parameters are fine-tuned while the rest of the model stays frozen; the input tokens then attend to the prefixes as if they were virtual tokens, so fine-tuning uses resources much more efficiently.

Adapter Tuning
Adapter tuning inserts small bottleneck modules (a down-projection, a non-linearity, and an up-projection with a residual connection) after the attention and feed-forward layers of each Transformer block, or in parallel with them. Only these adapter layers are fine-tuned while the rest of the model remains frozen, so only a small fraction of the parameters has to be stored and updated for each task. A minimal adapter sketch follows.
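A minimal bottleneck adapter (my own sketch; sizes and initialization are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter sketch: down-project, non-linearity, up-project, residual add.
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as (near) identity so training is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Inserted after a frozen sublayer: only the adapter's parameters are trained.
adapter = Adapter(d_model=512)
hidden = torch.randn(2, 16, 512)   # output of a frozen attention/FFN sublayer
out = adapter(hidden)
```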

Such parameter-efficient methods become particularly important under resource constraints: by fine-tuning only small, targeted parts of the model, researchers can retain most of the performance without the full memory and compute cost.

4. Large models

1. Common pre-training models

(Figures omitted: tables of common pre-trained LLMs from the paper.)

2. Fine-tuning of large models

  1. Fine-tuning on manually crafted datasets
  2. Fine-tuning on LLM-generated datasets
  3. Aligning with human preferences: RLHF, RLAIF (RL from AI feedback)
  4. Continual pre-training

3. Extending the context window

  1. Interpolating the positional encodings (a one-line sketch follows this list)
  2. Use efficient attention mechanisms
  3. Training-free extrapolation: see LM-Infinite and PCW.
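A tiny sketch of the position-interpolation idea (my own illustration, assuming a RoPE-style model trained with context length `train_len`): positions of a longer sequence are squeezed back into the range seen during training before the rotary angles are computed.

```python
import torch

def interpolated_positions(seq_len: int, train_len: int) -> torch.Tensor:
    # Squeeze positions of a longer sequence into [0, train_len) before computing
    # RoPE angles, so the model never sees positions beyond its training range.
    scale = min(1.0, train_len / seq_len)
    return torch.arange(seq_len, dtype=torch.float32) * scale

pos = interpolated_positions(seq_len=8192, train_len=2048)   # positions now span [0, 2048)
```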

4. Robotics

In robotics, LLMs are mainly used for planning/reasoning, manipulation/action, and navigation/path planning.

5. Multimodality

For multimodal LLMs (MLLMs), see a dedicated multimodal survey.

6. Tool-augmented LLMs

  1. Retrieval augmentation: enhance the LLM's capabilities with external knowledge sources such as databases and search. Because this is particularly important, it is listed as its own point (a minimal retrieval-augmented sketch appears below the figures).

(Figure: retrieval-augmented LLM pipeline)
  2. Tool augmentation: enhance the LLM with external tools such as calculators, APIs, and code interpreters; there is a large body of work in this area.
(Figure: examples of tool-augmented LLMs)
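A toy retrieval-augmented generation sketch (entirely my own illustration: the character-count "embedder", the two-document corpus, and the prompt template are hypothetical stand-ins for a real sentence encoder, vector store, and LLM call):

```python
import numpy as np

corpus = [
    "RoPE encodes relative positions by rotating query/key vectors.",
    "RMSNorm drops mean subtraction and normalizes by the root mean square.",
]

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: bag of character codes (a real system would use a sentence encoder).
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in corpus]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What does RMSNorm do?"))
# The resulting prompt would then be passed to the LLM for generation.
```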

5. Model configurations

(Figures omitted: model configuration and training-detail tables from the paper.)

6. Datasets and Evaluation

(Figures omitted: dataset and benchmark tables from the paper.)

7. Summary

(Figure omitted: summary figure from the paper.)


Source: blog.csdn.net/a1920993165/article/details/134396598