A plain-language guide to Megatron-DeepSpeed: the technology behind the 100-billion-parameter-scale model BLOOM

Foreword

This article can be regarded as study notes on "The technology behind the 100-billion-parameter open-source large model BLOOM" (here is the original English text) and related papers. Some details and errors in them have been corrected, and a large number of explanations have been added to make everything clearer and easier to understand.

Part 1: BLOOM and the Megatron-DeepSpeed behind it

1.1 BLOOM training details: hardware/Checkpoints/dataset

The model architecture of BLOOM is very similar to GPT-3, with some improvements added. Training the 176B BLOOM model took about 3.5 months (roughly 1 million compute hours), from March to July 2022. Below are some details of the training.

Training hardware

  • GPU: 384 NVIDIA A100 80GB GPUs (48 nodes) + 32 spare GPUs
  • 8 GPUs per node, with 4 NVLink inter-GPU interconnects and 4 OmniPath links
  • CPU: AMD EPYC 7543 32-core processor
  • CPU memory: 512GB per node
  • GPU memory: 640GB per node
  • Inter-node connection: Omni-Path Architecture (OPA) network cards, with a non-blocking fat-tree network topology
  • NCCL communications network: a fully dedicated subnet
  • Disk IO network: GPFS shared with other nodes and users

Checkpoints

  • main checkpoints
  • Each checkpoint contains the optimizer states in fp32 precision and the weights in bf16+fp32 precision, and occupies 2.3TB of storage. Saving only the bf16 weights takes just 329GB.

Dataset

1.2 Megatron-DeepSpeed:

The 176B BLOOM model was trained using Megatron-DeepSpeed, which combines two main technologies: Megatron-LM and DeepSpeed.

The DeepSpeed team combined the first of the four techniques below (from Megatron-LM) with the last three (from DeepSpeed):

  • Tensor Parallelism (TP) from Megatron-LM (which can be understood as a type of model parallelism):
    each tensor is split into multiple shards, so that each shard of the tensor resides on its designated GPU rather than the whole tensor living on a single GPU. During processing each shard is processed separately, in parallel, on different GPUs, and the results are synchronized at the end of the step. This is called horizontal parallelism, because the split is horizontal.
  • The Zero Redundancy Optimizer (ZeRO, the core of Microsoft's DeepSpeed library):
    it also performs tensor sharding somewhat like TP, but the whole tensor is reconstructed in time for the forward or backward computation, so the model does not need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
  • Data Parallelism (DP):
    the same setup and model are replicated across multiple GPUs, and each replica is fed a different slice of the data. Processing happens in parallel, and all replicas are synchronized at the end of each training step.
  • Pipeline Parallelism (PP):
    the model is split vertically (i.e., by layer) across multiple GPUs, so that only one or a few layers of the model sit on each GPU. Each GPU processes a different stage of the pipeline in parallel, working on a small portion of the batch.

In this way they developed a 3D-parallel implementation, namely Megatron-DeepSpeed, which makes distributed training of large language models with more than 100 billion parameters, such as BLOOM, easier, more efficient, and more effective.

Note that the BLOOM team's BigScience fork of Megatron-DeepSpeed is based on the original Megatron-DeepSpeed code base, but adds quite a bit of code on top of it.

The following table lists which components of each of the two frameworks were used to train BLOOM:

Component                DeepSpeed    Megatron-LM
ZeRO Data Parallelism    yes
Tensor Parallelism                    yes
Pipeline Parallelism     yes
BF16 Optimizer           yes
CUDA fused kernels                    yes
DataLoader                            yes

Note that both Megatron-LM and DeepSpeed have pipeline parallelism and BF16 optimizer implementations, but we use DeepSpeed's, because they are integrated with ZeRO.


Part 2: Tensor Parallelism (TP, a type of model parallelism)

In Tensor Parallelism (TP), each GPU processes only a portion of a tensor, and aggregation operations are triggered only when certain operators require the full tensor.

As is well known, a Transformer block has two main modules: a self-attention layer with a residual connection, and a fully connected MLP layer with a residual connection.

In 2019, NVIDIA introduced this approach in the Megatron-LM paper (Efficient Large-Scale Language Model Training on GPU Clusters). The dot-product part of the MLP can be written as Y = GeLU(XA), where X and Y are the input and output matrices and A is the weight matrix.

It is easy to see how matrix multiplication can be split across multiple GPUs if represented in matrix form, as shown in the following diagram (denoted as Figure 1 ):

2.1 Parallelizing the MLP: the weight matrix A is split by column, B is split by row, and the results are merged at the end

Don't underestimate the schematic above; it actually contains many details worth pondering, which are spelled out in the following figure (denoted as Figure 2):

  1. For the input X, the number of rows is the batch size b times the sequence length l, and the number of columns is the hidden width k.
    The MLP module itself consists of two fully connected layers.
  2. Let the weight of the first fully connected layer be A (k rows and k' columns, where k' is usually 4k). We first compute the matrix product X·A and then apply an activation function σ such as GeLU (GeLU is similar to ReLU, but its bend near zero is smooth).
  3. Let the weight of the second fully connected layer be B (k' rows and k columns). The final output is Y = σ(X·A)·B.
  4. Next, how do we split this so that multiple GPUs can work in parallel?
    If the input data is large, we can choose data parallelism first, i.e. split the input X.
    If the model itself is large, we choose model parallelism first, i.e. split the matrix A. There are two ways to split it:
    → The first (corresponding to the lower part of Figure 1): A is split horizontally by row (X must then be split vertically by column to match, and the partial products have to be summed, which requires communication between the two GPUs).
    → The second (corresponding to the upper part of Figure 1): A is split vertically by column, as in Figure 2 (one column block in blue, the other in green). X is then kept whole, with a copy on both GPUs, so no extra communication is needed at this point.
  5. Having settled on the second split (A split vertically by column), we multiply X by each column block to obtain slices of X·A, and then split the matrix B by row (in Figure 2, the second matrix is B: the blue rows go on GPU 0 and the green rows on GPU 1). Each GPU multiplies its column block of σ(X·A) by its row block of B. The result on each GPU has the same shape as Y, but is only one of N partial sums (N being the number of GPUs).
    In other words, by first computing the products XA_1, ..., XA_n we obtain n outputs Y_1, Y_2, ..., Y_n that can be fed through GeLU independently: [Y_1, Y_2] = [GeLU(XA_1), GeLU(XA_2)]. Summing (all-reducing) the partial results then yields the complete Y. With this trick we can parallelize an MLP of any depth, synchronizing the GPUs only after each "split column / split row" pair. The authors of the Megatron-LM paper provide a nice illustration of this (denoted as Figure 3):

    Split by column, then split by row

    Here f is the identity operator in the forward pass and all-reduce in the backward pass, while g is all-reduce in the forward pass and the identity in the backward pass.
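To make the column-then-row split concrete, here is a minimal single-process sketch in plain PyTorch (not the actual Megatron-LM API; all sizes are illustrative). It splits A by column and B by row across two "virtual GPUs" and checks that summing the two partial results reproduces the unsplit Y = GeLU(XA)B; in a real setup that final sum is the all-reduce performed by the operator g.

```python
# Single-process sketch of Megatron-style tensor-parallel MLP math (illustrative only).
import torch
import torch.nn.functional as F

b, l, k = 2, 8, 16          # batch, sequence length, hidden width
kp = 4 * k                  # k' = 4k, the MLP inner width

X = torch.randn(b * l, k)
A = torch.randn(k, kp)      # first linear weight
B = torch.randn(kp, k)      # second linear weight

# Reference: Y = GeLU(X A) B computed on one device
Y_ref = F.gelu(X @ A) @ B

# "GPU 0" holds A[:, :kp//2] and B[:kp//2, :]; "GPU 1" holds the other halves
A1, A2 = A[:, : kp // 2], A[:, kp // 2 :]
B1, B2 = B[: kp // 2, :], B[kp // 2 :, :]

Y1 = F.gelu(X @ A1) @ B1    # partial result on GPU 0 (same shape as Y_ref)
Y2 = F.gelu(X @ A2) @ B2    # partial result on GPU 1

Y = Y1 + Y2                 # the all-reduce (operator g) sums the partials
print(torch.allclose(Y, Y_ref, atol=1e-4))  # True
```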

2.2 Parallelizing the multi-head attention layer: each head is computed independently

Parallelizing the multi-head attention layer is even simpler, since it is already inherently parallel thanks to its multiple independent heads, as shown in the figure below (denoted as Figure 4):

  1. For the input matrix X, the number of rows is again the batch size b times the sequence length l (assume a batch size of 1 here), and the number of columns is k. In the self-attention mechanism (if you have forgotten how self-attention works, see Part 3 of the Transformer notes), X is projected into three matrices Q, K, V (roughly three "clones" of X).
  2. With multi-head attention, the per-head dimension is k/h; assume h = 2. For each head, each word vector in that head's l × (k/h) slice of Q takes a scaled dot product with the corresponding K vectors of its context, a softmax turns the scores into attention weights, a weighted sum over that head's V gives the head's output, and finally a k × k output projection produces the l × k result.
  3. The computation of the second head is analogous. Notice that each head's computation is independent and parallel, with no interference between heads, which means one head can be placed on GPU 0 (marked in blue) and the other on GPU 1 (marked in green).

    The whole process is shown in the figure below (denoted as Figure 5)
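The following toy sketch (plain PyTorch, not Megatron-LM code; sizes are made up) illustrates why this works: each head only needs its own k/h-wide slice of Q, K and V, so the heads can be computed independently, and on a cluster head 0 and head 1 would simply live on different GPUs.

```python
import torch
import torch.nn.functional as F

l, k, h = 8, 16, 2                 # sequence length, hidden width, number of heads
d = k // h                         # per-head dimension k/h

X = torch.randn(l, k)
Wq, Wk, Wv = (torch.randn(k, k) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # each l x k, sliced per head below

def head_attention(i: int) -> torch.Tensor:
    q, kk, v = (M[:, i * d : (i + 1) * d] for M in (Q, K, V))
    scores = F.softmax(q @ kk.T / d ** 0.5, dim=-1)   # scaled dot product + softmax
    return scores @ v                                  # l x (k/h) output of head i

# Heads are independent; here they run sequentially, but on a cluster each would run
# on its own GPU and only the concatenation/output projection needs communication.
out = torch.cat([head_attention(i) for i in range(h)], dim=-1)   # l x k
print(out.shape)
```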

Special consideration needs to be given to:

  1. Since each layer requires two all-reduces in the forward pass and two in the backward pass, TP needs a very fast interconnect between devices. Therefore, unless you have a very fast network, it is not advisable to do TP across multiple nodes. In the hardware we used to train BLOOM, inter-node communication is much slower than PCIe. In practice, if a node has 4 GPUs, a TP degree of at most 4 is best; if you need a TP degree of 8, you need nodes with at least 8 GPUs.
  2. This component is implemented by Megatron-LM. Megatron-LM has recently extended tensor parallelism with sequence parallelism, which covers operators such as LayerNorm that are hard to split with the scheme described above. The paper Reducing Activation Recomputation in Large Transformer Models gives the details of this technique. Sequence parallelism was developed after BLOOM was trained, so BLOOM training did not use it.

2.3 Parallelizing the input and output embeddings

Next, let's look at how the input and output are parallelized, as shown in the figure below (denoted as Figure 6):

  1. For the input X, it is a matrix with b × l rows (batch size times sequence length) holding the sentences row by row, and the embedding layer is a table with vocab rows (one per vocabulary entry) and k columns. The whole table can be split horizontally (for example, the upper half, marked blue, on GPU 0 and the lower half, marked green, on GPU 1), and the b × l × k output is obtained by table lookup.
  2. For the output, the number of rows is b × l and the number of columns is k. After multiplying by the vocabulary projection we get a b × l × v output, whose left half of the columns can be placed on GPU 0 and right half on GPU 1.
    Each row of this b × l × v output then needs to be reduced across the vocabulary dimension (for the softmax), but since v may be quite large (tens of thousands of entries), each GPU first computes the result over its own columns before the partial results are combined.
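Here is a small single-process sketch of the vocabulary-split lookup idea described in point 1 (illustrative only, not the Megatron-LM implementation): each shard answers only for the token ids it owns, and summing the partial results, an all-reduce in a real setup, recovers the full embedding output.

```python
import torch

vocab, k = 10, 4
table = torch.randn(vocab, k)
tokens = torch.tensor([1, 7, 3, 9])                 # a flattened b*l batch of token ids

# split the table by rows: ids 0-4 live on "GPU 0", ids 5-9 on "GPU 1"
shards = [(0, table[:5]), (5, table[5:])]

out = torch.zeros(len(tokens), k)
for start, shard in shards:
    local = tokens - start
    mask = (local >= 0) & (local < shard.shape[0])  # ids owned by this shard
    out[mask] += shard[local[mask]]                 # others contribute zero; an
                                                    # all-reduce sums the partials
print(torch.allclose(out, table[tokens]))           # True
```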

Part 3: Data Parallelism and ZeRO

Most users with only a few GPUs are probably familiar with DistributedDataParallel (DDP); here is the corresponding PyTorch documentation. In this approach the model is fully replicated on every GPU, and after each iteration all replicas synchronize their states with one another. It speeds up training and solves problems by throwing more GPU resources at them, but it only works if the model fits on a single GPU.
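For reference, a minimal DDP sketch looks like the following (assuming it is launched with torchrun, e.g. torchrun --nproc_per_node=2 train.py; this is generic PyTorch usage, not BLOOM's training script):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                       # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # full model replicated on every GPU
model = DDP(model, device_ids=[local_rank])           # gradients are all-reduced automatically

opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(8, 1024, device=local_rank)           # each rank feeds a different data shard
loss = model(x).sum()
loss.backward()                                       # replicas synchronize here, every iteration
opt.step()
```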

3.1 ZeRO data parallelism

3.1.1 ZeRO 1

In 2020, Microsoft's DeepSpeed team proposed the Zero Redundancy Optimizer (ZeRO) in the paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models". But optimization is an everlasting topic, so over the past few years the DeepSpeed team has published three ZeRO-related papers (the method proposed in the latest one is referred to as ZeRO 3), introducing techniques such as removing redundant parameters, bringing in CPU memory, and bringing in NVMe, all with a single goal from beginning to end: to push GPU memory optimization as far as it can go.

Figure 7 below describes ZeRO data parallelism well (it is taken from this blog post).

It may look rather imposing and hard to wrap your head around, but the concept is actually very simple: this is just ordinary DDP, except that instead of each GPU replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of them. Then, at run time, when the full parameters of a given layer are needed, all GPUs synchronize to give each other the parts they are missing - and that's all there is to it.
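In practice this behaviour is switched on through the DeepSpeed configuration. The sketch below shows the relevant knobs as a Python dict; the keys are standard DeepSpeed options, but the concrete values are illustrative, not the ones used for BLOOM:

```python
# Minimal DeepSpeed config sketch enabling ZeRO sharding (values are illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "zero_optimization": {
        "stage": 1,             # 1: shard optimizer states, 2: + gradients, 3: + weights
    },
    "bf16": {"enabled": True},  # BLOOM-style bf16 mixed precision
}

# Typical usage (hypothetical model/optimizer objects):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config)
```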

3.1.2 ZeRO 2

// to be updated

3.1.3 ZeRO 3

to be updated..


The following sections are still to be revised (note from 8/25, 4 pm).

Part 4: Pipeline Parallelism

Naive pipeline parallelism (naive PP) distributes groups of model layers across multiple GPUs and simply moves data from GPU to GPU, as if the whole thing were one large composite GPU. The mechanism is relatively simple: you bind the desired layers to their devices with .to(), and now whenever data enters or leaves those layers, the layer moves the data to its own device while everything else stays unchanged.

This is really vertical model parallelism, because if you recall how most models' topology is drawn, we are splitting the model's layers vertically. For example, the diagram below shows an 8-layer model:

===================  ===================
|  0 | 1 | 2 | 3  |  |  4 | 5 | 6 | 7  |
===================  ===================
        GPU0                 GPU1

We cut it vertically into 2 parts, placing layers 0-3 on GPU0 and layers 4-7 on GPU1.

Now, when data is passed from layer 0 to layer 1, layer 1 to layer 2, and layer 2 to layer 3, it's just like a normal forward pass on a single GPU. But when data needs to pass from layer 3 to layer 4, it has to be transferred from GPU0 to GPU1, which introduces communication overhead. If the participating GPUs are on the same compute node (e.g., the same physical machine), the transfer is very fast, but if the GPUs are on different compute nodes (e.g., multiple machines), the communication overhead can be much larger.

Then layers 4 to 5 to 6 to 7 behave like a normal model again, and when layer 7 finishes we usually need to send the data back to layer 0, where the labels are (or send the labels to the last layer). Now the loss can be computed and the optimizer can update the parameters.
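A toy sketch of this naive scheme, assuming two GPUs (it falls back to CPU so it runs anywhere; this is illustrative code, not Megatron/DeepSpeed code):

```python
import torch
import torch.nn as nn

# pick two devices; fall back to CPU so the sketch runs anywhere
dev0 = torch.device("cuda:0") if torch.cuda.device_count() >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() >= 2 else torch.device("cpu")

class NaivePipelineModel(nn.Module):
    def __init__(self, width: int = 1024, layers: int = 8):
        super().__init__()
        half = layers // 2
        self.part0 = nn.Sequential(*[nn.Linear(width, width) for _ in range(half)]).to(dev0)
        self.part1 = nn.Sequential(*[nn.Linear(width, width) for _ in range(half)]).to(dev1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.part0(x.to(dev0))   # layers 0-3 run on GPU0
        x = self.part1(x.to(dev1))   # activations are copied to GPU1, layers 4-7 run there
        return x

model = NaivePipelineModel()
out = model(torch.randn(4, 1024))
print(out.shape)   # while part1 works, part0 sits idle - the "bubble" micro-batching reduces
```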

Problems with this approach:

  • Why is this method called naive pipeline parallelism, and what are its drawbacks? Mainly because at any given moment all but one GPU is idle. So with 4 GPUs you almost quadruple the memory of a single GPU, while the other resources (such as compute) go largely unused; on top of that comes the overhead of copying data between devices. So 4 x 6GB cards under naive pipeline parallelism can hold a model of the same size as a single 24GB card, but the single card trains faster because it has no data-transfer overhead. However, if, say, you have 40GB cards but need to fit a 45GB model, you can do it with 4 x 40GB cards (just barely, since gradients and optimizer states also need memory).

  • Shared embeddings may need to be copied back and forth between GPUs. The pipeline parallelism (PP) we actually use is almost the same as the naive PP above, but it solves the GPU idling problem by chunking the incoming batch into micro-batches and artificially creating a pipeline, which lets different GPUs participate in the computation at the same time.

The figure below is from the GPipe paper; the upper part shows the naive PP scheme and the lower part the (micro-batched) PP approach:

mp-pp

From the bottom half of the figure it is easy to see that PP has fewer dead zones (where GPUs are idle), i.e. fewer "bubbles".

The degree of parallelism of both schemes in the figure is 4, i.e. the pipeline consists of 4 GPUs. So there are the four forward stages F0, F1, F2 and F3, followed by the backward path B3, B2, B1 and B0.

PP introduces a new hyperparameter to tune, called chunks (块). It defines how many chunks of data are sent in sequence through the same pipeline stage. For example, in the bottom half of the figure you can see chunks = 4. GPU0 executes the same forward path on chunks 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3), then waits for the other GPUs to finish their work, and only starts working again when it executes the backward path on chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).

Note that this is conceptually the same as gradient accumulation steps (GAS): PyTorch calls it chunks, while DeepSpeed calls it GAS.

Because of the chunks, PP introduces the notion of micro-batches (MBS). DP splits the global batch size into smaller batch sizes, so with a DP degree of 4, a global batch size of 1024 is split into 4 batches of 256 each (1024/4). And if the number of chunks (or GAS) is 32, we end up with a micro-batch size of 8 (256/32). Each pipeline stage processes one micro-batch at a time.

The formula for the global batch size of a DP + PP setup is therefore: mbs * chunks * dp_degree (8 * 32 * 4 = 1024).
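A quick sanity check of this bookkeeping with the numbers above (illustrative only):

```python
# DP + PP batch-size arithmetic from the text above.
global_batch_size = 1024
dp_degree = 4
chunks = 32                                       # a.k.a. gradient accumulation steps (GAS)

per_dp_batch = global_batch_size // dp_degree     # 256 samples per DP replica
micro_batch_size = per_dp_batch // chunks         # 8 samples per pipeline-stage step

assert micro_batch_size * chunks * dp_degree == global_batch_size
print(per_dp_batch, micro_batch_size)             # 256 8
```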

Let's go back and look at the picture again.

With chunks=1 you end up with naive PP, which is very inefficient. With a very large chunks value you end up with tiny micro-batch sizes, which is probably not very efficient either. So one has to experiment to find the chunks value that makes the most efficient use of the GPUs.

The figure shows that there are "dead" time bubbles that cannot be parallelized, because the last forward stage has to wait for backward to complete the pipeline. The problem of finding the best chunks value, so that all participating GPUs reach high concurrent utilization, then boils down to minimizing the number of bubbles.

This scheduling mechanism is known as all forward, all backward. Other alternatives are one forward, one backward and interleaved one forward, one backward.

While both Megatron-LM and DeepSpeed have their own implementation of PP, Megatron-DeepSpeed uses the DeepSpeed implementation because it is integrated with DeepSpeed's other features.

Another important issue here is the size of the word embedding matrix. Although word embedding matrices usually need less memory than transformer blocks, in the case of BLOOM with its 250k vocabulary the embedding layer requires 7.2GB for bf16 weights, while a transformer block needs only 4.9GB. We therefore had to make Megatron-DeepSpeed treat the embedding layer as if it were a transformer block. So we have a pipeline of 72 stages, 2 of which are dedicated to the embedding (the first and the last). This lets us balance GPU memory consumption. If we hadn't done this, the first and last stages would consume a lot of GPU memory while the other stages would use very little, and training would be very inefficient.

DP+PP

The DeepSpeed pipeline parallelism tutorial contains a diagram demonstrating how to combine DP with PP, shown below.

dp-pp-2d

The important thing to understand here is that DP rank 0 cannot see GPU2, and DP rank 1 cannot see GPU3. For DP, there are only GPUs 0 and 1, and data is fed to them. GPU0 uses PP to "secretly" offload some of its load to GPU2. Likewise, GPU1 will also get help from GPU3.

Since at least 2 GPUs are required for each dimension, at least 4 GPUs are required here.

DP+PP+TP

For more efficient training, PP, TP, and DP can be combined, called 3D parallelism, as shown in the figure below.

dp-pp-tp-3d

This figure is from the blog post 3D Parallelism: Scaling to Trillion Parameter Models, which is also a good read.

Since you need at least 2 GPUs per dimension, here you need at least 8 GPUs for full 3D parallelism.

ZeRO DP+PP+TP

One of DeepSpeed's main features is ZeRO, a super-scalable extension of DP, which we discussed above in the section on ZeRO data parallelism. It is usually a standalone feature that requires neither PP nor TP, but it can be combined with both.

When ZeRO-DP is combined with PP (and therefore TP), it usually only enables ZeRO stage 1, which shards only the optimizer states. ZeRO stage 2 also shards the gradients, and stage 3 also shards the model weights.

While it is theoretically possible to use ZeRO stage 2 together with pipeline parallelism, it has a bad impact on performance: each micro-batch would need an extra reduce-scatter to aggregate the gradients before sharding them, adding potentially significant communication overhead. By the nature of pipeline parallelism we work with small micro-batches, and the focus is on the trade-off between arithmetic intensity (micro-batch size) and minimizing pipeline bubbles (number of micro-batches). The extra communication therefore hurts pipeline parallelism.

Also, because of PP, each GPU already holds fewer layers than normal, so the memory savings would not be large. PP already reduces the gradient size to 1/PP of the original, so sharding gradients on top of that saves far less than it does in pure DP.

ZeRO stage 3 could also be used to train a model of this size, but it requires more communication than DeepSpeed's 3D parallelism. After a careful evaluation of our environment a year ago, we found that Megatron-DeepSpeed 3D parallelism performed best. The performance of ZeRO stage 3 has improved significantly since then, and if we were to re-evaluate it today, perhaps we would choose stage 3.

BF16Optimizer

Training huge LLM models with FP16 is a no-no.

We demonstrated this to ourselves by spending several months training a 104B model which, as you can see from the Tensorboard, was a complete failure. In the process of fighting the ever-diverging lm-loss we learned a lot:

104B-fail

We also got the same advice from the Megatron-LM and DeepSpeed teams after they trained the 530B model. The recently released OPT-175B also reported that training on FP16 was very hard.

So back in January we knew we would be training on A100s, which support the BF16 format. Olatunji Ruwase developed a BF16Optimizer for training BLOOM.

If you're not familiar with this data format, take a look at its bit layout. The key to the BF16 format is that it has the same number of exponent bits as FP32, so it doesn't overflow nearly as easily, while FP16 overflows all the time! FP16's maximum value is about 64k (65504), so you can only multiply relatively small numbers: for example 250*250=62500 works, but 256*256=65536 overflows, and this is the main cause of training problems. It means your weights must be kept small. A technique called loss scaling helps alleviate this, but FP16's small range can still be an issue when models get very large.

BF16 has no such problem: you can easily do 10_000*10_000=100_000_000 without any trouble.
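A quick way to see the difference in dynamic range (assumes PyTorch; the exact printed values may vary slightly):

```python
import torch

x16 = torch.tensor(256.0, dtype=torch.float16)
xbf = torch.tensor(10_000.0, dtype=torch.bfloat16)

print(x16 * x16)   # inf  - 65536 exceeds fp16's max finite value of 65504
print(xbf * xbf)   # roughly 1e8 - bf16 shares fp32's exponent range, so no overflow
```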

Of course, since BF16 and FP16 are both 2 bytes, there is no free lunch: the trade-off of BF16 is its very poor precision. However, remember that the stochastic gradient descent method and its variants used in training work a bit like staggering along: if you don't find the perfect direction in this step, that's fine, you correct yourself in the next steps.

Whether you use BF16 or FP16, there is always a copy of the weights kept in FP32 - this is what the optimizer updates. So the 16-bit format is only used for the computation; the optimizer updates the FP32 weights at full precision and then casts them to the 16-bit format for the next iteration.
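Conceptually the update loop looks like the following sketch (this is the general mixed-precision pattern, not the actual DeepSpeed BF16Optimizer code):

```python
import torch

master_w = torch.randn(1024, 1024, dtype=torch.float32)   # fp32 master weights, updated by the optimizer
bf16_w = master_w.to(torch.bfloat16)                        # bf16 copy used for forward/backward compute

for step in range(3):
    grad_bf16 = torch.randn_like(bf16_w)                    # stand-in for a real gradient from backward()
    master_w -= 1e-3 * grad_bf16.float()                    # the update happens in full fp32 precision
    bf16_w = master_w.to(torch.bfloat16)                     # re-cast to bf16 for the next iteration
```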

All PyTorch components have been updated to ensure that they perform any accumulation in FP32, so no loss of precision occurs.

A key issue is gradient accumulation, one of the main features of pipeline parallelism, since the gradients of all the micro-batches are accumulated. Accumulating gradients in FP32 is critical for training accuracy, and that is exactly what BF16Optimizer does.

Thanks to this and other improvements, we believe that BF16 mixed-precision training turned a potential nightmare into a relatively smooth process, as can be seen in the following lm-loss plot:

176B - Loss

CUDA fused kernels

A GPU mainly does two things: it reads data from and writes data to its memory, and it performs computations on that data. While the GPU is busy reading and writing data, its compute units sit idle. If we want to utilize the GPU efficiently, we want to keep that idle time to a minimum.

A kernel is a set of instructions that implements a specific PyTorch operation. For example, when you call torch.add, it goes through the PyTorch dispatcher, which decides, based on the input tensors and other variables, which code it should run, and finally runs it. A CUDA kernel implements that code using CUDA and therefore runs only on NVIDIA GPUs.

Now, when computing c = torch.add(a, b); e = torch.max([c, d]) on the GPU, PyTorch will typically launch two separate kernels: one that adds a and b, and another that takes the maximum of c and d. In this case, the GPU fetches a and b from its memory, performs the addition, and writes the result back to memory; it then fetches c and d, performs the max operation, and writes the result back to memory again.

If we fused these two operations, i.e. put them into a single "fused kernel" and launched that kernel instead, the intermediate result c would not be written out to memory but kept in GPU registers, and only d would need to be fetched for the final computation. This saves a lot of overhead and keeps the GPU from idling, so the whole operation is much more efficient.

That is exactly what fused kernels do: they replace multiple discrete computations and round-trips to GPU memory with fused computations and very little data movement. In addition, some fused kernels rewrite the math so that certain groups of computations can be performed faster.

To train BLOOM quickly and efficiently, it was necessary to use several custom CUDA fused kernels provided by Megatron-LM. In particular, there is a fused LayerNorm kernel, as well as kernels for various fused combinations of scaling, masking and softmax. Bias addition is also fused with GeLU through PyTorch's JIT. These operations are all memory-bound, so it is important to fuse them to maximize the amount of computation done per memory read. For example, executing the bias add while executing the memory-bound GeLU adds no extra running time. These kernels can be found in the Megatron-LM repository.
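As an illustration of the JIT-fusion idea mentioned above, here is a hedged sketch of a bias+GeLU function scripted with PyTorch's JIT (Megatron-LM's real fused kernels differ in detail; the tanh-based GeLU approximation is written out so the elementwise chain can be fused into a single kernel):

```python
import torch

@torch.jit.script
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias
    # tanh approximation of GeLU, spelled out so the JIT can fuse the elementwise ops
    return y * 0.5 * (1.0 + torch.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))

x = torch.randn(4, 1024)
b = torch.randn(1024)
out = bias_gelu(x, b)   # one fused elementwise kernel instead of several memory-bound ops
print(out.shape)
```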

Datasets

Another important feature of Megatron-LM is its efficient data loader. Before the first training run, every sample of every dataset is split into samples of fixed sequence length (2048 for BLOOM), and an index is created to number each sample. Based on the training hyperparameters, we determine how many epochs each dataset should participate in and, based on that, create an ordered list of sample indices and then shuffle it. For example, if a dataset contains 10 samples and should be gone through twice, the system first lays out the sample indices as [0, ..., 9, 0, ..., 9] and then shuffles that order to create the final global order for the dataset. Note that this means training will not simply iterate over the whole dataset and then repeat it: you may see the same sample twice before seeing another sample at all, but by the end of training the model will have seen each sample exactly twice. This helps ensure a smooth training curve throughout training. These indices, including the offset of each sample in the original dataset, are saved to a file so they don't have to be recomputed every time training starts. Finally, several of these datasets can be blended with different weights into the final data used for training.
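The index-building idea can be sketched in a few lines (illustrative only, not Megatron-LM's actual data loader; the sample count, epoch count and seed are made up):

```python
import numpy as np

num_samples, num_epochs, seed = 10, 2, 42
idx = np.tile(np.arange(num_samples), num_epochs)   # [0..9, 0..9]
np.random.default_rng(seed).shuffle(idx)            # final global sample order for the dataset
print(idx)
```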

Embedding LayerNorm

In our efforts to prevent the 104B model from diverging, we found that adding an extra LayerNorm right after the first word embedding layer made training more stable.

This insight came from experiments with bitsandbytes, whose StableEmbedding operation is an ordinary embedding followed by a LayerNorm, initialized with a uniform Xavier function.
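A minimal sketch of "embedding followed by LayerNorm" in that spirit (this is not the bitsandbytes StableEmbedding code; vocabulary and hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class EmbeddingWithLayerNorm(nn.Module):
    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        nn.init.xavier_uniform_(self.embed.weight)   # uniform Xavier init, as noted above
        self.norm = nn.LayerNorm(hidden)             # the extra LayerNorm after the embedding

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.norm(self.embed(token_ids))

emb = EmbeddingWithLayerNorm(vocab_size=250_000, hidden=64)
print(emb(torch.tensor([[1, 5, 9]])).shape)   # torch.Size([1, 3, 64])
```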

Positional encoding

Based on the paper Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, we also replaced the ordinary positional embeddings with ALiBi, which allows extrapolation to input sequences longer than those the model was trained on. So even though we train on sequences of length 2048, the model can handle longer sequences during inference.
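For intuition, here is a hedged sketch of how the ALiBi bias can be constructed (head count and sequence length are illustrative); the resulting per-head linear distance penalty is simply added to the attention scores instead of adding positional embeddings to the inputs:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # head-specific slopes: the geometric sequence from the ALiBi paper (power-of-two head counts)
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]  # j - i
    return slopes[:, None, None] * distance[None, :, :]   # (heads, seq, seq), added to Q K^T scores

bias = alibi_bias(num_heads=8, seq_len=16)
print(bias.shape)   # torch.Size([8, 16, 16])
```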

Difficulties during training

With the architecture, hardware and software in place, we were able to start training in early March 2022. Since then, however, things have not been all smooth sailing. In this section, we discuss some of the main obstacles we encountered.

Before training began there were a lot of questions to figure out. In particular, we found several issues that only appeared once we started training on 48 nodes and did not show up at small scale. For example, we needed CUDA_LAUNCH_BLOCKING=1 to keep the framework from hanging, and we needed to split the optimizer groups into smaller groups, otherwise the framework would hang again. You can read more about these in the pre-training chronicles.

The main type of problem encountered during training was hardware failures. Since this was a brand-new cluster with about 400 GPUs, we experienced 1-2 GPU failures per week on average. We saved a checkpoint every 3 hours (100 iterations), so on average we lost 1.5 hours of training per week to hardware crashes. The Jean Zay system administrators would then replace the faulty GPU and restore the node; in the meantime we had spare nodes to use.

We also had various other issues that led to 5-10 hours of downtime on several occasions, some related to deadlock bugs in PyTorch, others to insufficient disk space. See the training chronicles if you're interested in the specifics.

All of this downtime was accounted for in the feasibility analysis for training this model, and we chose the model size and the amount of data for the model to consume accordingly. So even with these downtime issues we managed to finish training within the estimated time. As mentioned earlier, it took about 1 million compute hours to complete.

Another problem was that SLURM was not designed to be used by a team. A SLURM job is owned by a single user, and if that user is not around, the other members of the group can do nothing with the running job. We developed a kill-switch scheme that lets other users in the group terminate the current process without requiring the user who started it to be present. This worked for 90% of the problems. If the SLURM designers ever read this: please add the concept of Unix groups, so that a SLURM job can be owned by a group.

Since training runs 24/7 we needed someone to be on call, but since we had people both in Europe and on the west coast of Canada, nobody needed to carry a pager and we covered for each other quite well. Of course, weekend training still had to be watched. We automated most things, including recovery from hardware crashes, but human intervention was sometimes still needed.


Important links

Papers and Articles

We cannot explain everything in detail in this article, so if the techniques presented here pique your curiosity and you want to learn more, please read the following papers:

Megatron-LM:

DeepSpeed:

Megatron-LM and DeepSpeed combined:

ALiBi:

BitsNBytes:

  • 8-bit Optimizers via Block-wise Quantization (we used the embedding LayerNorm from this paper, but the other parts of the paper and its techniques are also very good - the only reason we didn't use the 8-bit optimizers is that we already save optimizer memory with DeepSpeed-ZeRO).


Origin blog.csdn.net/v_JULY_v/article/details/132462452