The Basic Math of Transformer Models



Authors | Quentin Anthony, Stella Biderman, Hailey Schoelkopf

Compiled by OneFlow

Translation | Jia Chuan, Xu Jiayu, Yang Ting

1

Introduction

A great deal of basic and important information about Transformer language models can be derived from fairly simple calculations. Unfortunately, the formulas behind these calculations are not widely known in the natural language processing (NLP) community. EleutherAI, a non-profit AI research organization, has collected and organized these formulas and describes where they come from and why they matter.

Note: This article focuses primarily on training costs, which are dominated by GPU memory (VRAM). For a similar discussion of inference cost and latency, see the previously published Inference Calculus for Large Language Models.

(This translation is compiled and published by OneFlow with authorization. Please contact OneFlow for permission to reprint. Original text: https://blog.eleuther.ai/transformer-math/)

2

Compute Requirements

The training cost of a Transformer model can be calculated with the following basic formula:

C ≈ τT ≈ 6PD

where:

  •  C represents the amount of computation required to train the Transformer model, in units of total floating-point operations.

  • τ represents the aggregate throughput of the hardware setup (τ = (No. GPUs) × (Actual FLOPs/GPU)), in FLOPs.

  •  T represents the time required to train the model in seconds.

  •  P represents the number of parameters in the Transformer model.

  • D represents the size of the dataset, in tokens.


These formulas were proposed and experimentally confirmed in OpenAI's paper "Scaling Laws for Neural Language Models" (https://arxiv.org/abs/2001.08361) and DeepMind's paper "Training Compute-Optimal Large Language Models" (https://arxiv.org/abs/2203.15556).

It is worth discussing the units of C. C is a measure of total compute and can be reported in several different units, for example:

  • FLOP-seconds, in units of [FLOPs] × [Seconds]

  • GPU-hours, in units of [No. GPUs] × [Hours]

  • Scaling-law papers tend to report values in PetaFLOP-days, i.e. 10^15 × 24 × 3600 total floating-point operations.

Actual FLOPs is a useful concept worth keeping in mind. GPU accelerator whitepapers usually advertise theoretical FLOPs, but these values are often unachievable in practice (especially in distributed settings). Commonly reported values for Actual FLOPs in distributed training settings are noted in the "Engineering Takeaways for Compute Costs" section below.

Note: We use the throughput-time version of the cost formula, which is also used in an article on the cost of LLM training (https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4).
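
To make these formulas concrete, the following minimal Python sketch estimates total FLOPs, PetaFLOP-days, wall-clock time, and GPU-hours from C ≈ 6PD and an assumed Actual FLOPs per GPU. All concrete numbers in the example are illustrative assumptions, not values from this article.

```python
# Sketch: estimate training cost from C ~= 6 * P * D and an assumed per-GPU throughput.
def training_estimate(params, tokens, n_gpus, actual_flops_per_gpu):
    """Return total FLOPs, PetaFLOP-days, wall-clock days, and GPU-hours."""
    c = 6 * params * tokens                      # total compute, in FLOPs
    throughput = n_gpus * actual_flops_per_gpu   # aggregate throughput of the cluster, in FLOP/s
    seconds = c / throughput                     # T = C / throughput
    return {
        "total_FLOPs": c,
        "PetaFLOP_days": c / (1e15 * 24 * 3600),
        "wall_clock_days": seconds / (24 * 3600),
        "GPU_hours": n_gpus * seconds / 3600,
    }

# Example (illustrative): a 13B-parameter model trained on 260B tokens
# using 128 A100s at an assumed 150 TFLOP/s each.
print(training_estimate(params=13e9, tokens=260e9, n_gpus=128,
                        actual_flops_per_gpu=150e12))
```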

Parameter vs. Dataset Tradeoffs

Strictly speaking, you can train a Transformer model on an arbitrary number of tokens. However, the number of tokens used for training has a large impact on both the computational cost and the performance of the final model, so striking the right balance is important.

Let's start with the key question of "compute-optimal" language models. The "Chinchilla scaling laws" state that, for a compute-optimal language model, the number of parameters and the dataset size must satisfy the approximate relation D = 20P.

This is optimal only in a specific sense: assuming that using 1,000 GPUs for 1 hour costs the same as using 1 GPU for 1,000 hours, if your goal is to maximize model performance while minimizing the GPU cost of training, then you should use the formula above.

We do not recommend training large language models on fewer than 200B tokens. While this would be "Chinchilla optimal" for many models, the resulting models are usually quite poor. For almost all applications, we instead recommend determining an acceptable inference cost for your use case, and then training the largest model that stays below that inference cost on as many tokens as you can afford.
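
As a rough illustration of how D = 20P combines with C ≈ 6PD, the sketch below backs out the "Chinchilla-optimal" parameter and token counts for a given compute budget. The budget in the example is an arbitrary assumption.

```python
import math

# Sketch: with C ~= 6 * P * D and D ~= 20 * P, we get C ~= 120 * P**2.
def chinchilla_optimal(compute_budget_flops):
    params = math.sqrt(compute_budget_flops / 120)
    tokens = 20 * params
    return params, tokens

# Example (illustrative): a budget of 1e23 FLOPs.
p, d = chinchilla_optimal(1e23)
print(f"~{p / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")
```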

Engineering Takeaways for Compute Costs

Transformer models usually express computational costs in terms of GPU-hours or FLOP-seconds.

  • GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries; for example, Megatron-DS reports between 137 and 163 TFLOP/s/A100.

  • As a rule of thumb, we should always be able to achieve about 120 TFLOP/s/A100. If compute performance falls below 115 TFLOP/s/A100, there is likely something wrong with the model or the hardware configuration.

  • With high-quality interconnects (such as InfiniBand), we can achieve linear or sub-linear scaling across the data-parallel dimension (i.e. increasing the data-parallel degree should increase overall throughput nearly linearly). The figure below shows results from testing the GPT-NeoX library on Oak Ridge National Lab's Summit supercomputer. Note that V100s are on the x-axis, while most of the numerical examples in this article are for A100s.

[Figure: GPT-NeoX throughput scaling on Oak Ridge National Lab's Summit supercomputer (V100 GPUs)]

3

Memory Requirements

Usually, the size of a Transformer model is described by its number of parameters. However, to determine whether a model fits in a given amount of compute resources, we need to know how many bytes it will occupy. This tells us how large a model will fit for inference on a local GPU, or how large a model can be trained across some total amount of accelerator memory.

Inference

Model Weights


Most Transformers are trained in mixed precision, either fp16 + fp32 or bf16 + fp32. This reduces the amount of memory required to train the model, and also the amount of memory required to run inference. We can convert language models from fp32 to fp16 or even int8 without a substantial performance penalty. These numbers refer to the size in bits of a single parameter; since a byte is 8 bits, we divide the number of bits by 8 to get the number of bytes per parameter:

  • int8, memory_model = (1 byte/param) × (No. params)

  • fp16 and bf16, memory_model = (2 bytes/param) × (No. params)

  • fp32, memory_model = (4 bytes/param) × (No. params)


Total Inference Memory

In addition to the memory required to store the model weights, there is a small amount of additional overhead during the actual forward pass. Empirically, this overhead is ≤20% and is usually irrelevant for determining the largest model the GPU is compatible with.

In general, the following heuristic can be used to judge whether a model will fit for inference:

Total Inference Memory ≈ (1.2) × Model Memory

This article does not explore the sources of this overhead, which are covered in detail in other articles. Next, we move on to the memory required for model training. If you want to learn more about the calculations required for inference, see the previously published Inference Calculus for Large Language Models.
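
A minimal sketch of this heuristic follows; the 1.2 factor is the ≤20% overhead rule of thumb above, and the model size and GPU capacity in the example are illustrative assumptions.

```python
# Sketch: will a model fit in GPU memory for inference?
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "bf16": 2, "fp32": 4}

def inference_memory_gb(n_params, dtype="fp16", overhead=1.2):
    """Model weights plus the ~20% forward-pass overhead, in GB."""
    return overhead * BYTES_PER_PARAM[dtype] * n_params / 1e9

# Example (illustrative): a 13B-parameter model in fp16 on an 80 GB accelerator.
need = inference_memory_gb(13e9, "fp16")
print(f"need ~{need:.0f} GB -> fits in 80 GB: {need <= 80}")
```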

Training

In addition to the model parameters, training also requires the optimizer states and the gradients to be stored in device memory. That's why, when asked "how much memory do I need to fit model X", the immediate answer is "it depends on whether you mean training or inference". Training typically requires more memory than inference.

Model Parameters

First, the model can be trained in pure fp32 or fp16:

  • Pure fp32, memory_model = (4 bytes/param) × (No. params)

  • Pure fp16, memory_model = (2 bytes/param) × (No. params)

In addition to the common model-weight data types discussed in the inference section, training introduces mixed-precision training, such as AMP. This technique aims to maximize the throughput of GPU tensor cores while maintaining convergence. Mixed-precision training is widely used in modern deep learning because: 1) fp32 training is stable, but has a high memory overhead and does not exploit NVIDIA GPU tensor cores; and 2) pure fp16 training is unstable and hard to converge. Note that mixed precision requires the fp16/bf16 and fp32 versions of the model to both be stored in memory, which requires:

  • Mixed precision (fp16/bf16 and fp32), memory_model = (2 bytes/param) × (No. params)

plus an additional fp32 copy of the model, of size (4 bytes/param) × (No. params), which we count as part of the optimizer states below.
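
A small sketch of this bookkeeping is shown below (the parameter count in the example is an arbitrary assumption):

```python
# Sketch: how mixed-precision weight memory is usually counted. The fp16/bf16
# weights are "model memory"; the fp32 master copy is counted with the optimizer states.
def mixed_precision_weight_bytes(n_params):
    return {
        "fp16_or_bf16_weights": 2 * n_params,
        "fp32_master_copy_in_optimizer": 4 * n_params,
    }

print(mixed_precision_weight_bytes(13e9))  # example (illustrative): 13B parameters
```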


Optimizer States

Adam is magic, but it is highly memory inefficient. In addition to requiring a copy of the model parameters and of the gradients, it needs to keep an additional three copies of the gradient parameters. Therefore (see the sketch after this list):

  • For vanilla AdamW, memory_optimizer = (12 bytes/param) × (No. params):

    • fp32 parameter copy: 4 bytes/parameter

    • Momentum: 4 bytes/parameter

    • Variance: 4 bytes/parameter

  • For 8-bit optimizers like bitsandbytes, memory_optimizer = (6 bytes/param) × (No. params):

    • fp32 parameter copy: 4 bytes/parameter

    • Momentum: 1 byte/parameter

    • Variance: 1 byte/parameter

  • For SGD-like optimizers with momentum, memory_optimizer = (8 bytes/param) × (No. params):

    • fp32 parameter copy: 4 bytes/parameter

    • Momentum: 4 bytes/parameter
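
The sketch below simply restates the per-parameter costs listed above as code (the parameter count in the example is an arbitrary assumption):

```python
# Sketch: optimizer-state memory per parameter, in bytes, as listed above.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw":        4 + 4 + 4,  # fp32 copy + momentum + variance = 12 bytes/param
    "adamw_8bit":   4 + 1 + 1,  # fp32 copy + 8-bit momentum + 8-bit variance = 6 bytes/param
    "sgd_momentum": 4 + 4,      # fp32 copy + momentum = 8 bytes/param
}

def optimizer_memory_gb(n_params, optimizer="adamw"):
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * n_params / 1e9

print(optimizer_memory_gb(13e9, "adamw"))  # example (illustrative): 13B parameters
```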

Gradients

Gradients can be stored in fp32 or fp16 (note that the gradient data type usually matches the model data type; when training with fp16 mixed precision, gradients are therefore usually stored in fp16), so their contribution to the memory overhead is:

  • fp32, memory_gradients = (4 bytes/param) × (No. params)

  • fp16, memory_gradients = (2 bytes/param) × (No. params)

Activations and Batch Size

When training large language models, modern GPUs are usually bottlenecked by memory rather than by floating-point operations. Activation recomputation (also known as activation checkpointing) has therefore become a very popular method for trading reduced memory cost against additional compute cost.

Activation recomputation works by recomputing the activations of certain layers instead of storing them in GPU memory. The memory reduction depends on how we choose which layers' activations to discard, but Megatron's selective recomputation scheme is illustrated in the figure below:

[Figure: activation memory required per transformer layer under different recomputation strategies, from Reducing Activation Recomputation in Large Transformer Models]

Here, the red dashed line indicates the memory capacity of an A100-80GB GPU, and "present work" indicates the memory requirement after applying selective activation recomputation. For more details and the derivation of the formulas below, see Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198).

The basic formulas for the memory required to store Transformer model activations are as follows:

No recomputation: memory_activations = s · b · h · L · (10 + 24/t + 5·a·s/(h·t)) bytes

Selective recomputation: memory_activations = s · b · h · L · (10 + 24/t) bytes

Full recomputation: memory_activations = 2 · s · b · h · L bytes

where:

  • s is the sequence length, in tokens

  • b is the batch size per GPU

  • h is the dimension of the hidden size within each Transformer layer

  • L is the number of layers in the Transformer model

  • a is the number of attention heads in the Transformer model

  • t is the degree of tensor parallelism being used (1 if not)

  • We assume no sequence parallelism is being used

  • We assume activations are stored in fp16

The amount of additional recomputation required also depends on the chosen method, but it is bounded by one full additional forward pass. Hence the updated cost of the forward pass is given by:

2PD ≤ C_forward ≤ 4PD
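
Here is a minimal sketch of the three activation-memory formulas above. It assumes fp16 activations and no sequence parallelism, as stated in the variable list, and the example configuration is illustrative rather than taken from the article.

```python
# Sketch: activation memory, in bytes, for the three recomputation strategies above.
# Assumes fp16 activations and no sequence parallelism.
def activation_memory_bytes(s, b, h, L, a, t=1, strategy="none"):
    if strategy == "none":        # store all activations
        return s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))
    if strategy == "selective":   # selective recomputation
        return s * b * h * L * (10 + 24 / t)
    if strategy == "full":        # recompute everything; keep only each layer's input
        return 2 * s * b * h * L
    raise ValueError(strategy)

# Example (illustrative): s=2048, b=1 per GPU, h=5120, L=40, a=40 heads, tensor parallel t=4.
for strat in ("none", "selective", "full"):
    gb = activation_memory_bytes(2048, 1, 5120, 40, 40, t=4, strategy=strat) / 1e9
    print(f"{strat:>9}: {gb:6.1f} GB")
```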

Total Training Memory

Therefore, a good heuristic answer to "will this model fit for training" is:

Total Training Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory
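
Putting the pieces together, here is a hedged sketch of the "will it fit for training" heuristic. It assumes fp16/bf16 mixed precision, vanilla AdamW, and fp16 gradients; the parameter count and activation figure are illustrative (the activation figure reuses the example above).

```python
# Sketch: total per-GPU training memory before any sharding or model parallelism.
# Assumes fp16/bf16 mixed precision, vanilla AdamW, and fp16 gradients.
def training_memory_gb(n_params, activation_bytes):
    model_memory     = 2 * n_params    # fp16/bf16 weights
    optimizer_memory = 12 * n_params   # fp32 copy + momentum + variance
    gradient_memory  = 2 * n_params    # fp16 gradients
    return (model_memory + optimizer_memory + gradient_memory + activation_bytes) / 1e9

# Example (illustrative): 13B parameters with ~6.7 GB of activations after selective recomputation.
print(f"~{training_memory_gb(13e9, 6.7e9):.0f} GB per GPU without sharding")
```

The result makes it obvious why the sharding and model-parallelism techniques discussed next are needed for models of this size.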

Distributed Training

Sharded Optimizers

The huge memory overhead of the optimizer is the primary motivation for sharded optimizers such as ZeRO and FSDP. Such sharding strategies reduce the optimizer overhead by a factor of No. GPUs, which is why a given model configuration may fit at large scale but run out of memory (OOM) at small scale.

To calculate the memory overhead required for training with a sharded optimizer, you need the formulas below. For an example calculation of sharded optimization, see the following figure from the ZeRO paper (note that P_os, P_os+g and P_os+g+p are commonly denoted ZeRO-1, ZeRO-2 and ZeRO-3, respectively; ZeRO-0 usually means "ZeRO disabled"):

[Figure: per-device memory consumption for the different ZeRO optimization stages, from the ZeRO paper]

In the terminology of this blog post (assuming mixed precision and the Adam optimizer):

  • For ZeRO-1, Total Training Memory ≈ Model Memory + Optimizer Memory/(DP Degree) + Activation Memory + Gradient Memory

  • For ZeRO-2, Total Training Memory ≈ Model Memory + Activation Memory + (Optimizer Memory + Gradient Memory)/(DP Degree)

  • For ZeRO-3, Total Training Memory ≈ (Model Memory + Optimizer Memory + Gradient Memory)/(DP Degree) + Activation Memory

Here (DP Degree) is just (No. GPUs) unless pipeline and/or tensor parallelism are applied. See the "Sharded Optimizers + 3D Parallelism" section below for details (https://eleutherai.notion.site/Transformers-Math-101-d2fcfc7a25d446388fde97821ad2412a#9c476d020d7641a299fb6be6ae82e9f8).

Note that ZeRO-3 introduces a set of live parameters. Its configuration options (stage3_max_live_parameters, stage3_max_reuse_distance, stage3_prefetch_bucket_size, stage3_param_persistence_threshold) control how many parameters are kept in GPU memory at a time (larger values need more memory but require less communication). These options can have a significant effect on total GPU memory usage.

Note that ZeRO can also partition activations across data-parallel ranks via ZeRO-R, on top of tensor parallelism. For more details, read the ZeRO paper (https://arxiv.org/abs/1910.02054) and the configuration options (https://www.deepspeed.ai/docs/config-json/#activation-checkpointing) (note that in GPT-NeoX this is the partition_activations flag). If you are training a huge model and want to trade some memory overhead for additional communication cost, activations can become a bottleneck. An example of using ZeRO-R together with ZeRO-1 is sketched below.
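
Below is a minimal sketch of the ZeRO formulas above. The activation_shards argument is an added assumption meant to illustrate ZeRO-R-style activation partitioning (set it to 1 to disable it); all concrete numbers in the example are illustrative.

```python
# Sketch: per-GPU training memory under ZeRO, following the formulas above.
# `activation_shards` is an assumption used to illustrate ZeRO-R-style activation
# partitioning; it is not part of the formulas above.
def zero_memory_gb(model, optimizer, activation, gradient,
                   dp_degree, stage=1, activation_shards=1):
    activation = activation / activation_shards
    if stage == 1:
        total = model + optimizer / dp_degree + activation + gradient
    elif stage == 2:
        total = model + activation + (optimizer + gradient) / dp_degree
    elif stage == 3:
        total = (model + optimizer + gradient) / dp_degree + activation
    else:                          # ZeRO-0: ZeRO disabled
        total = model + optimizer + activation + gradient
    return total / 1e9

# Example (illustrative): the earlier 13B-parameter figures over 64 data-parallel ranks.
args = dict(model=26e9, optimizer=156e9, activation=6.7e9, gradient=26e9, dp_degree=64)
for s in (1, 2, 3):
    print(f"ZeRO-{s}: ~{zero_memory_gb(**args, stage=s):.0f} GB per GPU")
```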

3D Parallelism

Parallelism for LLMs comes in three main forms:

Data parallelism: split the data across (possibly model-parallel) replicas of the model.

Pipeline or tensor/model parallelism: these schemes split the parameters of the model across GPUs. They require significant communication overhead, but their memory reduction is approximately:

Model Memory with parallelism ≈ Model Memory / ((Pipe-Parallel-Size) × (Tensor-Parallel-Size))

Gradient Memory with parallelism ≈ Gradient Memory / ((Pipe-Parallel-Size) × (Tensor-Parallel-Size))

Note that these equations are approximate for the following reasons:

(1) Pipeline parallelism does not reduce the memory footprint of activations;

(2) Pipeline parallelism requires that all GPUs store the activations for all in-flight micro-batches, which becomes significant for large models;

(3) GPUs need to temporarily store the additional communication buffers required by the parallelism schemes.

Sharded Optimizers + 3D Parallelism

When ZeRO is combined with tensor and/or pipeline parallelism, the resulting parallelism strategy forms a mesh like the one shown below:

[Figure: the mesh formed by combining ZeRO data parallelism with tensor and pipeline parallelism]

An important aside: the data-parallel degree (DP Degree) is vital for calculating the global training batch size. The data-parallel degree depends on the number of complete model replicas:

DP Degree = (No. GPUs) / ((Pipe-Parallel-Size) × (Tensor-Parallel-Size))

While pipeline parallelism and tensor parallelism are compatible with all stages of ZeRO (for example, ZeRO-3 with tensor parallelism first slices the tensors and then applies ZeRO-3 within each tensor-parallel unit), only ZeRO-1 tends to perform well in combination with tensor and/or pipeline parallelism. This is due to conflicting parallelism strategies for the gradients (pipeline parallelism and ZeRO-2 both partition gradients; tensor parallelism and ZeRO-3 both partition model parameters), which leads to significant communication overhead.

Putting everything together, for a typical 3D-parallel ZeRO-1 run with activation partitioning:

Total Training Memory ≈ Model Memory/((Pipe-Parallel-Size) × (Tensor-Parallel-Size)) + Optimizer Memory/(No. GPUs) + Activation Memory/(Tensor-Parallel-Size) + Gradient Memory/((Pipe-Parallel-Size) × (Tensor-Parallel-Size))
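
To close the loop, the sketch below mirrors the approximate formula above. Per the caveats in the 3D parallelism section, it ignores in-flight micro-batch activations and communication buffers, and all concrete numbers in the example are illustrative.

```python
# Sketch: per-GPU memory for a 3D-parallel ZeRO-1 run with activation partitioning,
# mirroring the approximate formula above.
def parallel_training_memory_gb(model, optimizer, activation, gradient,
                                n_gpus, tp_size, pp_size):
    mp = tp_size * pp_size                 # model-parallel group size
    total = (model / mp                    # weights split by pipeline and tensor parallelism
             + optimizer / n_gpus          # ZeRO-1 shards optimizer states over all GPUs
             + activation / tp_size        # activation partitioning
             + gradient / mp)              # gradients split by pipeline and tensor parallelism
    return total / 1e9

# Example (illustrative): the earlier 13B-parameter figures on 64 GPUs with TP=4 and PP=2.
gb = parallel_training_memory_gb(model=26e9, optimizer=156e9, activation=6.7e9,
                                 gradient=26e9, n_gpus=64, tp_size=4, pp_size=2)
print(f"~{gb:.0f} GB per GPU")
```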

4

Summary

Engineers at EleutherAI regularly use the heuristics above to plan efficient model training and to debug distributed runs. We hope this clarifies these often-overlooked implementation details.


Try OneFlow: github.com/Oneflow-Inc/oneflow/
