A brief introduction to deep learning large-model frameworks (the basic principles behind ChatGPT)

1. Introduction

There are many mainstream basic deep learning frameworks: TensorFlow, PyTorch, PaddlePaddle, Keras, Caffe, etc.

With the arrival of NLP pre-trained language models represented by the BERT and GPT series, research on NLP language models has moved towards large-scale pre-training.

In the CV field, the combination of GANs, diffusion models, and Transformers with traditional CV techniques has likewise moved towards large models. The breakout success of DALL·E 2 is also built on "scale works miracles".

In the multimodal domain, models such as CLIP also have very large parameter counts.

This seems to indicate that "scale works miracles" is the path towards strong artificial intelligence. The principle behind the wildly popular ChatGPT is that the super-large-scale GPT-3 model is further trained with SFT + RLHF (Supervised Fine-Tuning + Reinforcement Learning from Human Feedback).

Therefore, beyond the basic deep learning frameworks, it is essential to learn and understand the large-model deep learning frameworks. Research in this area sits largely at the intersection of the systems direction and deep learning.

2. A brief introduction to the development of deep learning systems

Deep learning took off after 2014; the trigger was AlexNet, which revolutionized computer vision image classification in 2012. Since then, models in the DL field have grown ever larger and more complex, and research on the systems side has appeared in an endless stream.

2.1 Parameter Server

Paper address: Parameter Server: Li Mu
This work was completed by Li Mu in 2014. Its core contribution is a data-parallel (Data Parallel) parameter server, whose purpose is to make it possible to train large-scale machine learning models in industry.
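To make the data-parallel idea concrete, here is a toy push/pull cycle only loosely inspired by that design; the `ParameterServer` class and the single-process worker loop are hypothetical simplifications (the real system is distributed, sharded, and asynchronous).

```python
import numpy as np

class ParameterServer:
    """Toy parameter server: holds the global weights and applies pushed gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()          # workers fetch the latest weights

    def push(self, grad):
        self.w -= self.lr * grad      # server applies the worker's gradient

def worker_step(server, x_shard, y_shard):
    w = server.pull()                                    # pull current weights
    pred = x_shard @ w
    grad = x_shard.T @ (pred - y_shard) / len(y_shard)   # local gradient on this data shard
    server.push(grad)                                    # push gradient back to the server

# Each worker holds a different shard of the data (data parallelism).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
server = ParameterServer(dim=5)
for shard in np.array_split(np.arange(100), 4):          # 4 "workers"
    worker_step(server, X[shard], y[shard])
```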

Its significance: it provided a reference for the later large-scale deep learning framework ZeRO.

(I will do an intensive reading of this paper in a follow-up blog post; interested readers can stay tuned.)

2.2 GPipe

Paper address: GPipe: 2019

This paper proposes a new large-scale deep learning framework. It uses pipeline parallelism to train larger models with less GPU memory. Such large models include language models and some CV models.

Its significance is to propose a pipeline-parallel method that slices the large model between layers for training.

[Figure: GPipe model pipeline example diagram]
It has two key technologies:
(1) Micro-batch
The L layers of the neural network are cut at layer boundaries into a total of K blocks, and each block is placed on one GPU for computation, as shown in Figure (b) above. In terms of time, however, this alone is no faster than single-card training, because only one block is active at any moment. To solve this, each mini-batch of data is further divided into micro-batches, and at each moment one micro-batch is fed to a GPU, as shown in Figure (c) above. This yields simple pipeline parallelism. The relative size of the resulting bubble depends on the number of pipeline stages and shrinks as the number of micro-batches grows, so increasing the number of micro-batches further improves GPU utilization. A sketch of the idea follows.
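A minimal sketch of the micro-batch idea, with two pipeline stages simulated on one device (the stage split, sizes, and micro-batch count are illustrative, not GPipe's actual scheduler):

```python
import torch
import torch.nn as nn

# Two pipeline "stages"; in GPipe each stage would live on a different GPU.
stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
stage2 = nn.Sequential(nn.Linear(64, 10))

def pipelined_forward(x, num_micro_batches=4):
    outputs = []
    # Split the mini-batch into micro-batches; on a real pipeline, stage 1 starts
    # on micro-batch i+1 while stage 2 is still working on micro-batch i,
    # which is what shrinks the idle "bubble".
    for mb in torch.chunk(x, num_micro_batches, dim=0):
        h = stage1(mb)      # would run on GPU 0
        out = stage2(h)     # would run on GPU 1
        outputs.append(out)
    return torch.cat(outputs, dim=0)

x = torch.randn(16, 32)
y = pipelined_forward(x)
print(y.shape)  # torch.Size([16, 10])
```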
(2) Re-materialization (activation checkpointing)
The intermediate activations computed at each layer during the forward pass occupy a large amount of GPU memory.
[Figure: simple neural network formula example]
In fact, when the gradient descent algorithm updates the parameters, it computes the partial derivative of the loss with respect to the parameters W. Evaluating terms of the chain rule such as the partial derivative of y with respect to x requires the intermediate values produced in the forward pass (usually called activations). If these values have already been computed during forward propagation and kept in GPU memory, backpropagation is faster, but memory usage increases.

Re-materialization means that after the L-layer large-scale neural network is divided into several blocks, only the forward-pass activations at the boundaries between blocks are saved, while the activations of the layers inside each block are recomputed during backpropagation (trading time for space).

This saves a lot of GPU memory.

This makes it possible to train larger models on a small number of GPUs, or with less GPU memory.

Recomputation takes up roughly one third of the total computation time. (There is no paper discussing the reason yet.)
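PyTorch exposes this time-for-space trade-off directly through `torch.utils.checkpoint`; a minimal sketch, where the block sizes and boundaries are chosen arbitrarily rather than taken from GPipe's partitioner:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# An L-layer network split into blocks; only activations at block boundaries
# are kept, everything inside a block is recomputed during backward.
blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
    for _ in range(4)
])

def forward_with_rematerialization(x):
    for block in blocks:
        # checkpoint() discards the block's internal activations after the
        # forward pass and recomputes them when backward reaches this block.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 256, requires_grad=True)
loss = forward_with_rematerialization(x).sum()
loss.backward()   # triggers recomputation block by block (time for space)
```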

Other work: PipeDream (from Microsoft).

2.3 Megatron-LM (Tensor Parallelism, TP)

Paper address: Megatron-LM: 2019
This work proposes a specific model parallel (Model Parallel) method: intra-layer model parallelism, also called tensor parallelism (Tensor Parallel).

The biggest contribution of the framework is: open source + simplicity.

This has led to subsequent open-source large-model deep learning frameworks being, to a greater or lesser extent, improvements and modifications built on top of this framework.
[Figure: comparison of the BERT layer normalization modification]
The Megatron-LM framework targets language models only, mainly GPT, BERT, T5, and similar models.

For BERT, the framework modifies its layer normalization as shown above, so that the large-scale BERT model can converge.

[Figure: TP partitioning structure diagram]
For tensor parallelism there are only two partitioning schemes: one partitions the MLP, the other partitions Self-Attention.
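Why the MLP split works can be checked numerically. The sketch below simulates two tensor-parallel ranks on one device (the matrix sizes are arbitrary): the first weight matrix is split by columns so the GeLU can be applied independently on each rank, and the second is split by rows so the only communication needed is one all-reduce (here just a sum) of the partial outputs.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)          # activations
A = torch.randn(8, 16)         # first MLP weight
B = torch.randn(16, 8)         # second MLP weight

# Reference: the full (single-GPU) MLP, Y = GeLU(X A) B
Y_full = torch.nn.functional.gelu(X @ A) @ B

# Tensor parallelism over 2 simulated "GPUs":
# A is split by columns, B is split by rows; each rank computes a partial output.
A1, A2 = A.chunk(2, dim=1)
B1, B2 = B.chunk(2, dim=0)
Y1 = torch.nn.functional.gelu(X @ A1) @ B1   # rank 0
Y2 = torch.nn.functional.gelu(X @ A2) @ B2   # rank 1

# The all-reduce at the end of the block is just a sum of the partial outputs.
Y_tp = Y1 + Y2
print(torch.allclose(Y_full, Y_tp, atol=1e-4))  # True
```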

2.4 ZeRO (Offload, Infinity)

Paper address: ZeRO

The framework is built on top of Megatron, and its open-source implementation is DeepSpeed.

This is a relatively easy-to-use framework for large-scale pre-trained language models. It implements not only ZeRO but also ZeRO-Offload and ZeRO-Infinity.

[Figure: ZeRO three-way state partitioning illustration]
In fact, the idea adopted by ZeRO is basically the same as that of the parameter server: the parameters, optimizer states, gradients, and so on produced while training an ultra-large-scale deep neural network are partitioned across the data-parallel ranks, which removes redundancy and allows larger models to be trained with less memory.

ZeRO partitions three kinds of training state across the data-parallel ranks:
optimizer states (Optimizer states), os
gradients (gradients), g
model parameters (parameters), p

Here P_os = ZeRO-1, P_os+g = ZeRO-2, and P_os+g+p = ZeRO-3.
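In DeepSpeed the stage is simply selected in the training config. A minimal sketch, assuming a DeepSpeed-capable environment launched with its usual distributed launcher; the config values are placeholders, not a tuned setup:

```python
# Minimal sketch of how the three ZeRO stages are selected in DeepSpeed.
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},      # mixed precision, see 2.4.1
    "zero_optimization": {
        "stage": 2,                 # 1 = P_os, 2 = P_os+g, 3 = P_os+g+p
    },
}

# DeepSpeed wraps the model/optimizer and shards the chosen states
# across the data-parallel ranks.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```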

Before discussing these three kinds of data, one key point should be covered first: mixed-precision training.

2.4.1 Mixed-precision training

NVIDIA GPUs are faster at half-precision training, that is, with fp16 16-bit floating-point numbers.

The reason: in hardware, every bit of a number needs corresponding logic, i.e., physical gates, to carry out the computation. Halving a floating-point operation from 32 bits to 16 frees a large amount of gate logic, which means more parallel compute units can be placed on a chip of the same size. In terms of compute density, therefore, fp16 is higher than fp32.

Using half-precision training means that each layer's parameters w, its inputs and outputs, and its intermediate results (activations) are all stored and computed in fp16.

The computation w*x = y runs in fp16, but fp16's limited precision causes underflow: very small numbers become 0. This shows up when accumulating updates into the weights, i.e., when gradient terms are repeatedly added to them. If the weights were also kept in fp16, tiny updates would be rounded away and never accumulate. Therefore, fp32 is used when updating the weights.

For this purpose an additional fp32 copy of the weights is kept. Gradient updates are applied to this copy in fp32; after the update, the weights are cast back to fp16 and take part in the forward and backward passes. The pattern looks roughly like the sketch below.
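A minimal sketch of the fp32 master-copy pattern, assuming a CUDA device and omitting the dynamic loss scaling that real mixed-precision training also needs:

```python
import torch

device = "cuda"   # assumes an NVIDIA GPU is available
master_w = torch.randn(512, 512, device=device, dtype=torch.float32)  # fp32 master copy
x = torch.randn(64, 512, device=device, dtype=torch.float16)
lr = 1e-3

for step in range(10):
    w16 = master_w.half().requires_grad_(True)   # fp16 copy used for compute
    loss = (x @ w16).float().pow(2).mean()       # forward runs in fp16 (toy loss)
    loss.backward()                              # gradient w16.grad arrives in fp16
    with torch.no_grad():
        # The update is applied to the fp32 copy, so small updates are not lost.
        master_w -= lr * w16.grad.float()
```

In practice, PyTorch packages this pattern, together with loss scaling, in `torch.cuda.amp` (`autocast` plus `GradScaler`).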

2.4.2 Calculating the amount of data maintained during training

Assume the total number of parameters of a model is Y.

Then, during forward and backward propagation, the fp16 parameters (parameters) require 2Y bytes to maintain, and the fp16 gradients (gradients) require another 2Y bytes.

The optimizer (Adam) additionally maintains three pieces of fp32 state (the gradient update itself is computed in fp32): an fp32 copy of the parameters, 4Y bytes; the momentum, 4Y bytes; and the variance, 4Y bytes.

Altogether that is 16Y bytes of storage. Training a GPT-2 (1.5B) model therefore requires about 1.5B × 16 bytes ≈ 24 GB for these states alone, as the quick calculation below shows.
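A quick back-of-the-envelope check of that number (1.5e9 is the GPT-2 1.5B parameter count quoted above):

```python
# Rough memory estimate for mixed-precision Adam training (the 16Y rule above).
num_params = 1.5e9              # GPT-2 1.5B

fp16_params     = 2 * num_params   # bytes
fp16_grads      = 2 * num_params
fp32_param_copy = 4 * num_params
fp32_momentum   = 4 * num_params
fp32_variance   = 4 * num_params

total = fp16_params + fp16_grads + fp32_param_copy + fp32_momentum + fp32_variance
print(total / 2**30, "GiB")     # ~22.4 GiB (16 bytes per parameter), before activations
```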

(Content about ZeRO-Offload and ZeRO-Infinity will be added later.)

2.5 Pathways 2022

Paper address: Pathways: 2022

Large models built on Google's TensorFlow line of systems.
[Figure: large-model training approaches under different deep learning architectures]
This leads to Jeff Dean's predictions for the next generation of deep learning frameworks:

Multimodal, sparse, dynamic routing

2.6 InstructGPT

Paper address: InstructGPT

This model is one of the models behind ChatGPT. The ChatGPT paper itself has not been published yet and is expected to take a few more months, but from the core ideas of this paper we can already tell which direction ChatGPT is developing in.

[Figure: ChatGPT training procedure]

2.6.1 Dataset collection

(1) Hire annotators to write a dataset of questions and their corresponding answers.
(2) Use this first batch of data to train a first InstructGPT model, have it generate predictions for related questions, and expand these into a larger dataset.

2.6.2 Supervised Fine-Tuning (SFT) prompt

The idea is very simple: fine-tune GPT-3 on the human-labeled question-answer pairs, with the ideas of prompt learning also adopted. A rough sketch of the objective follows.
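A minimal sketch of the SFT objective using Hugging Face `transformers`, with GPT-2 standing in for GPT-3 (which is not openly available) and an illustrative, made-up prompt/answer pair; the loss is computed only on the answer tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain pipeline parallelism in one sentence.\nAnswer:"
answer = " It splits a model's layers across devices and streams micro-batches through them."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

# Standard causal-LM loss, but only on the answer tokens: prompt positions are
# masked with -100 so the model is trained to reproduce the demonstrated answer.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(full_ids, labels=labels).loss
loss.backward()   # one SFT gradient step would follow (optimizer omitted)
```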

2.6.3 RLHF (Reinforcement Learning from Human Feedback)

To put it simply, different answers to the same question are ranked; a model, known as the Reward Model, is trained so that the ranking it predicts matches the human ranking, and that feedback comes from humans. Reinforcement learning is then used to optimize the language model against this reward model's scores.
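The pairwise ranking objective used to train such a reward model fits in a few lines. A minimal sketch, where the scores are made-up placeholders; in the real setup they come from a language model with a scalar reward head, evaluated on two answers to the same prompt:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the score of the answer humans preferred
    above the score of the answer they ranked lower."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Placeholder scores a (hypothetical) reward model assigned to two answers per prompt.
r_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # human-preferred answers
r_rejected = torch.tensor([0.4, 0.9], requires_grad=True)  # less-preferred answers
loss = reward_ranking_loss(r_chosen, r_rejected)
loss.backward()
```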

3. Summary

At present, there are two main technical routes for training large-scale language models: TPU + XLA + TensorFlow/JAX (Pathways) and GPU + PyTorch + Megatron-LM + DeepSpeed. The former is led by Google; because the TPU is deeply bound to Google's own cloud platform GCP, non-Googlers can only admire it from a distance. The latter is backed by NVIDIA, Meta, and Microsoft; its community is active and it is the more widely adopted of the two.

Origin blog.csdn.net/weixin_42529594/article/details/128921397