Foreword
This article can be regarded as study notes on "The Technology Behind the 100-Billion-Parameter Open-Source Large Model BLOOM" (see the original English text) and related papers. Some details and errors have been corrected, and many explanations have been added to make it clearer and easier to read.
Part I BLOOM and the Megatron-DeepSpeed behind it
1.1 BLOOM training details: hardware/Checkpoints/dataset
The model architecture of BLOOM is very similar to GPT-3, with some improvements added. Training the 176B BLOOM model took about 3.5 months (about 1 million compute hours), from March to July 2022. Some training details follow.
Training hardware
- GPU: 384 NVIDIA A100 80GB GPUs (48 nodes) + 32 spare GPUs
- 8 GPUs per node, 4 NVLink inter-card interconnects, 4 OmniPath links
- CPU: AMD EPYC 7543 32-core processor
- CPU memory: 512GB per node
- GPU memory: 640GB per node
- Inter-node connection: Omni-Path Architecture (OPA) network cards, with a non-blocking fat-tree topology
- NCCL communications network: a fully dedicated subnet
- Disk IO network: GPFS shared with other nodes and users
Checkpoints
- main checkpoints
- Each checkpoint contains an optimizer state with a precision of fp32 and a weight with a precision of bf16+fp32, and occupies a storage space of 2.3TB. If only the weight of bf16 is saved, only 329GB of storage space will be occupied.
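The two checkpoint sizes can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: a round figure of 176B parameters, two fp32 Adam moments per parameter as the optimizer state, and sizes read in binary units (GiB/TiB). None of these details are spelled out above.

```python
# Rough checkpoint-size arithmetic. All constants below are assumptions
# made for illustration, not figures taken from the training logs.
N_PARAMS = 176e9           # assumption: round parameter count (~176B)
GIB = 2 ** 30
TIB = 2 ** 40

BYTES_BF16_WEIGHTS = 2     # one bf16 copy of the weights
BYTES_FP32_WEIGHTS = 4     # fp32 master copy of the weights
BYTES_ADAM_STATES = 8      # assumption: two fp32 Adam moments per parameter


def checkpoint_bytes(n_params, full=True):
    """Return the estimated checkpoint size in bytes."""
    per_param = BYTES_BF16_WEIGHTS
    if full:
        per_param += BYTES_FP32_WEIGHTS + BYTES_ADAM_STATES
    return n_params * per_param


bf16_only = checkpoint_bytes(N_PARAMS, full=False)
full_ckpt = checkpoint_bytes(N_PARAMS, full=True)
print(f"bf16 weights only: {bf16_only / GIB:.0f} GiB")
print(f"full checkpoint:   {full_ckpt / TIB:.2f} TiB")
```

Under these assumptions the estimates land close to the 329GB and 2.3TB quoted above, which suggests that the full checkpoint is dominated by the fp32 optimizer state.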
Dataset
- 1.5TB of heavily deduplicated and cleaned text in 46 languages, converted into 350B tokens
- The model's vocabulary contains 250,680 tokens
- For more details, see The BigScience Corpus: A 1.6TB Composite Multilingual Dataset
1.2 Megatron-DeepSpeed:
The 176B BLOOM model is trained using Megatron-DeepSpeed , which combines two main techniques:
- Megatron-LM is a large, powerful transformer framework developed by NVIDIA's Applied Deep Learning Research team. The corresponding paper is "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism"; see also Li Mu's video interpretation and the text/code interpretation of it.
- DeepSpeed is a deep learning optimization library developed by Microsoft that makes distributed training simple, efficient, and effective.
The DeepSpeed team combined the first item below with the last three:
- Tensor Parallelism (TP) in Megatron-LM, which can be understood as a type of model parallelism: each tensor is divided into multiple chunks, so each shard of the tensor resides on its designated GPU instead of the whole tensor residing on a single GPU. During processing, each shard is processed separately and in parallel on a different GPU, and the results are synchronized at the end of the step. This is called horizontal parallelism, because the split happens at the horizontal level.
- The Zero Redundancy Optimizer (ZeRO for short, the core of the Microsoft DeepSpeed library): it also shards tensors, similar to TP, but the whole tensor is reconstructed in time for the forward or backward computation, so the model does not need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
- Data Parallelism (DP): the same setup and model are replicated multiple times, and each replica is fed a different slice of the data. Processing is done in parallel, and all replicas are synchronized at the end of each training step.
- Pipeline Parallelism (PP): the model is split vertically (i.e., by layer) across multiple GPUs, so that only one or a few layers are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel and works on a small part of the batch.
The result is a 3D-parallel implementation, Megatron-DeepSpeed, which makes distributed training of large language models with more than 100 billion parameters, such as BLOOM, easier, more efficient, and more effective.
Note that the BigScience version of Megatron-DeepSpeed used by the BLOOM team is based on the original Megatron-DeepSpeed code base, with quite a lot of code added on top of it.
The following table lists which components of each of the two frameworks were used to train BLOOM.
Component | DeepSpeed | Megatron-LM |
---|---|---|
ZeRO Data Parallelism | yes | |
Tensor Parallelism | | yes |
Pipeline Parallelism | yes | |
BF16 optimizer | yes | |
Fused CUDA kernels | | yes |
DataLoader | | yes |
Note that both Megatron-LM and DeepSpeed have pipeline parallelism and BF16 optimizer implementations, but we use DeepSpeed's implementations because they are integrated with ZeRO.
Part II Tensor Parallelism (a type of model parallelism)
In Tensor Parallelism (TP), each GPU processes only a slice of a tensor, and the full tensor is aggregated only for operations that need it.
As we all know, a Transformer has two main modules: a self-attention layer + residual connection, and a fully connected MLP layer + residual connection.
In 2019, NVIDIA introduced Megatron-LM. Following the notation of the Megatron-LM paper (Efficient Large-Scale Language Model Training on GPU Clusters), the dot-product part of a fully connected layer followed by GeLU can be written as $Y = \mathrm{GeLU}(XA)$, where $X$ and $Y$ are the input and output vectors and $A$ is the weight matrix.
It is easy to see how matrix multiplication can be split across multiple GPUs if represented in matrix form, as shown in the following diagram (denoted as Figure 1 ):
2.1 Parallelization of the MLP: weight matrix A is split by column, matrix B is split by row, and the results are merged at the end
Don't underestimate the schematic above. It hides many details that are worth going over carefully, as shown in the following figure (denoted as Figure 2).
- For the input $X$, the number of rows is the batch size $b$ times the sequence length $l$, and the number of columns is the hidden width $K$, that is, $X \in \mathbb{R}^{(b \cdot l) \times K}$. The MLP module is actually two fully connected layers.
- Assume the weight of the first layer is $A$ (with $K$ rows and $K'$ columns, where $K'$ is generally $4K$). First compute the matrix product $XA$, then apply an activation function such as GeLU (GeLU is similar to ReLU, but with a smooth transition around the inflection point), giving $Y = \mathrm{GeLU}(XA)$.
- Assume the weight of the second layer is $B$ (with $K'$ rows and $K$ columns). The final output is $Z = YB = \mathrm{GeLU}(XA)B$.
- Next, how should the work be split across multiple GPUs? If the input data is relatively large, do data parallelism first, that is, split the input. If the model itself is relatively large, do model parallelism first, that is, split the weight matrix. There are two ways to split it.
- The first way (the lower part of Figure 1 above): $A$ is split horizontally by row. $X$ must then be split by column, and the two GPUs need to communicate to combine their results.
- The second way (the upper part of Figure 1 above): $A$ is split vertically by column, as in Figure 2 above, where one column block is blue and the other is green. $X$ is then either split by row or copied to both GPUs, and no additional communication is needed at this point.
- With the second way chosen, each GPU multiplies $X$ by its column block $A_i$ to get its part of the large matrix $XA$. Matrix $B$ is then split by row (in Figure 2 above, the blue block goes to GPU 0 and the green block to GPU 1). Each GPU multiplies its column block $Y_i$ by its row block $B_i$. The result $Y_i B_i$ has the same size as $Z$, but it is only one of the $n$ partial sums that make up $Z$ (where $n$ is the number of GPUs).
In other words, each GPU computes $XA_i$ and obtains $Y_i = \mathrm{GeLU}(XA_i)$ independently, because GeLU is applied elementwise. Summing the partial results $Y_i B_i$ then gives the complete $Z$. With these operations we can handle an MLP of any depth, synchronizing the GPUs only after each column-split/row-split sequence. The authors of the Megatron-LM paper provide a nice illustration of this (denoted as Figure 3):
Here $f$ is the identity operator in the forward pass and an all-reduce in the backward pass, while $g$ is an all-reduce in the forward pass and the identity in the backward pass.
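The column split of A followed by the row split of B can be checked numerically. The sketch below is a toy, pure-Python illustration: two simulated "GPUs", tiny hand-picked matrices, and helper names (`matmul`, `gelu`, `add`) that are all invented here. A real implementation would run on separate devices and replace the final sum with an all-reduce.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Exact (erf-based) GeLU applied elementwise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def add(a, b):
    """Elementwise sum of two matrices, standing in for an all-reduce."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: 2 tokens, hidden width K = 2, inner width K' = 4, 2 simulated GPUs.
X = [[1.0, -2.0], [0.5, 3.0]]
A = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]        # K x K'
B = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6], [0.7, 0.8]]    # K' x K

# Reference: the whole MLP on one device.
Z_full = matmul(gelu(matmul(X, A)), B)

# Column split of A: each GPU gets half of the columns.
A_shards = [[row[:2] for row in A], [row[2:] for row in A]]
# Row split of B: each GPU gets the matching half of the rows.
B_shards = [B[:2], B[2:]]

# Each GPU computes GeLU(X A_i) B_i locally, with no communication.
partials = [matmul(gelu(matmul(X, A_i)), B_i) for A_i, B_i in zip(A_shards, B_shards)]

# One synchronization at the end: sum the partial results.
Z_tp = add(partials[0], partials[1])
print(Z_full)
print(Z_tp)
```

The point of splitting A by column first is that GeLU can then be applied to each shard independently; splitting A by row would require a synchronization before the nonlinearity.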
2.2 Parallelization of multi-head attention layer: each head calculates separately
Parallelizing multi-head attention layers is even simpler since they are inherently parallel due to multiple independent heads! As shown in the figure below (marked as Figure 4)
- The input matrix $X$ still has batch size times sequence length rows (assume the batch size is 1) and $K$ columns. In the self-attention mechanism (if you have forgotten how self-attention works, see the third part of Transformer Notes), the input is copied three times, corresponding to the query, key, and value matrices (like three clones).
- For multi-head attention with $h$ heads, the dimension of each head is $K/h$ (assume $h = 2$ here). Within each head, every word vector of the input does a scaled dot product between its query and the keys of its context, followed by a softmax to get the attention scores (weights). The scores are used for a weighted sum over the values to get the head's output, which is finally multiplied by a projection matrix to get the result.
- The computation of the second head is similar. Each head is computed independently and in parallel without affecting the others, which means that one head can be placed on GPU 0 (shown in blue) and the other on GPU 1 (shown in green).
The whole process is shown in the figure below (denoted as Figure 5)
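The independence of the heads can be verified on a toy example. The sketch below is a pure-Python illustration with two heads on two simulated GPUs; the helper names, the filler weights from `toy_matrix`, and the absence of a causal mask are simplifications chosen here, not details from the Megatron-LM code.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax over each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cols(m, start, stop):
    """Slice a block of columns out of a matrix."""
    return [row[start:stop] for row in m]


def toy_matrix(rows, n_cols, seed):
    """Deterministic filler values, purely for illustration."""
    return [[math.sin(seed + i * n_cols + j) for j in range(n_cols)] for i in range(rows)]


def attention_head(x, wq, wk, wv):
    """Scaled dot-product attention for one head (no mask, for brevity)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(c) for c in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


HIDDEN, N_HEADS = 4, 2
HEAD_DIM = HIDDEN // N_HEADS
x = toy_matrix(3, HIDDEN, seed=0)                      # 3 tokens
wq, wk, wv, wo = (toy_matrix(HIDDEN, HIDDEN, seed=s) for s in (1, 2, 3, 4))

head_out, partial = [], []
for gpu in range(N_HEADS):
    lo, hi = gpu * HEAD_DIM, (gpu + 1) * HEAD_DIM
    # Each GPU holds the column block of Q/K/V weights for its own head ...
    out = attention_head(x, cols(wq, lo, hi), cols(wk, lo, hi), cols(wv, lo, hi))
    head_out.append(out)
    # ... and the matching row block of the output projection.
    partial.append(matmul(out, wo[lo:hi]))

# Summing the partial projections stands in for the all-reduce.
out_tp = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*partial)]

# Reference: concatenate the heads, then apply the full projection.
concat = [r0 + r1 for r0, r1 in zip(*head_out)]
out_ref = matmul(concat, wo)
```

This mirrors the MLP case: the per-head weights play the role of the column split, and the output projection plays the role of the row split.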
Special consideration needs to be given to:
- Since there are two all-reduces per layer in both the forward and the backward pass, TP requires a very fast interconnect between devices. Unless you have a very fast network, TP across multiple nodes is not recommended. In the hardware configuration used to train BLOOM, inter-node communication is much slower than PCIe. In practice, if a node has 4 GPUs, the TP degree should be at most 4. If you need a TP degree of 8, you need nodes with at least 8 GPUs.
- This component is implemented by Megatron-LM. Megatron-LM has recently extended tensor parallelism with sequence parallelism, for operators that are hard to split with the approach above, such as LayerNorm. The paper Reducing Activation Recomputation in Large Transformer Models describes this technique in detail. Sequence parallelism was developed after BLOOM was trained, so BLOOM training does not use it.
2.3 Parallelization for input and output
Next, let's look at the parallelization for input and output, as shown in the figure below (denoted as Figure 6)
- The input is a matrix of batch size by sequence length that stores sentences row by row. The embedding layer is a lookup table with as many rows as the vocabulary size $V$ and $K$ columns. The whole table can be cut horizontally (for example, the upper half on GPU 0, marked in blue, and the lower half on GPU 1, marked in green), and the output is obtained by table lookup.
- For the output, the number of rows is the batch size times the sequence length and the number of columns is $K$. Multiplying by the vocabulary matrix gives an output with $V$ columns, whose left half can be placed on GPU 0 and right half on GPU 1. Each row of the output has to be summed across its columns, but $V$ can be large, on the order of tens of thousands or more, so each GPU first computes the sum over its own part.
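The horizontal cut of the embedding table can be sketched as follows. This is a toy, pure-Python illustration under assumptions made here: two simulated GPUs, a tiny vocabulary, and a zero vector returned for tokens a GPU does not own, so that summing across GPUs recovers the full lookup. The names `local_lookup` and `all_reduce_sum` are hypothetical.

```python
VOCAB, K, N_GPUS = 8, 3, 2

# Full embedding table: row t is the embedding of token id t.
table = [[float(10 * t + k) for k in range(K)] for t in range(VOCAB)]

# Horizontal cut: GPU g owns a contiguous block of rows.
rows_per_gpu = VOCAB // N_GPUS
shards = [table[g * rows_per_gpu:(g + 1) * rows_per_gpu] for g in range(N_GPUS)]


def local_lookup(gpu, token_ids):
    """Look up the tokens this GPU owns; return zeros for all others."""
    lo = gpu * rows_per_gpu
    out = []
    for t in token_ids:
        if lo <= t < lo + rows_per_gpu:
            out.append(list(shards[gpu][t - lo]))
        else:
            out.append([0.0] * K)
    return out


def all_reduce_sum(parts):
    """Sum the per-GPU results elementwise, standing in for an all-reduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]


token_ids = [1, 6, 3, 7]
embedded = all_reduce_sum([local_lookup(g, token_ids) for g in range(N_GPUS)])
print(embedded)
```

Because exactly one GPU owns each token id, every other GPU contributes zeros and the sum equals an ordinary lookup in the unsplit table.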
Part III Data Parallelism and ZeRO
Most users with only a few GPUs are probably familiar with DistributedDataParallel (DDP); see the corresponding PyTorch documentation. In this approach, the model is fully replicated on each GPU, and all replicas synchronize their states after each iteration. This speeds up training by throwing more GPU resources at the problem, but it only works if the model fits on a single GPU.
3.1 ZeRO data parallelism
3.1.1 ZeRO 1
In 2020, the Microsoft DeepSpeed team proposed the Zero Redundancy Optimizer (ZeRO for short) in the paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models". Optimization is a never-ending topic, so the DeepSpeed team has published three ZeRO-related papers over the past few years (the method proposed in the latest one is referred to as ZeRO 3). They propose removing redundant parameters, bringing in the CPU and host memory, and bringing in NVMe, with one goal from beginning to end: to push GPU memory optimization as far as it will go.
Figure 7 below describes ZeRO data parallelism well (from this blog post).
It may look intimidating and hard to follow at first, but the concept is actually very simple. It is just the usual DDP, except that instead of each GPU replicating the full model parameters, gradients, and optimizer states, each GPU stores only a slice of them. Then, at run time, when the full parameters of a given layer are needed, all GPUs synchronize to give each other the pieces they are missing, and that is all.
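The "each GPU stores only a slice" idea can be sketched with plain lists. The code below is a conceptual toy, not DeepSpeed's implementation: the `shard` and `all_gather` helpers are hypothetical stand-ins for the real partitioning and collective operations, and a "layer" is just a flat list of numbers.

```python
N_GPUS = 4
layer_params = [float(i) for i in range(10)]   # flattened parameters of one layer


def shard(values, n):
    """Split a flat parameter list into n contiguous shards (the last may be shorter)."""
    size = -(-len(values) // n)                # ceiling division
    return [values[r * size:(r + 1) * size] for r in range(n)]


def all_gather(shards):
    """Rebuild the full parameter list from every rank's shard."""
    full = []
    for piece in shards:
        full.extend(piece)
    return full


shards = shard(layer_params, N_GPUS)

# What each rank keeps resident between steps: only its own shard.
resident = [len(piece) for piece in shards]

# What a rank materializes just in time for this layer's forward/backward.
gathered = all_gather(shards)

print(resident)
print(gathered == layer_params)
```

After the layer has been computed, the gathered copy can be dropped again, which is why peak memory stays close to one shard plus one full layer rather than the whole model.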
3.1.2 ZeRO 2
To be updated.
3.1.3 ZeRO 3
To be updated.
Part IV Pipeline Parallelism
Naive pipeline parallelism (naive PP) distributes groups of model layers across multiple GPUs and simply moves data from GPU to GPU as if it were one large composite GPU. The mechanism is relatively simple: you bind the desired layers to the desired devices with the .to() method, and whenever data goes in and out of those layers, it is switched to the same device as the layer, while the rest remains unchanged.
This is actually vertical model parallelism, because if you remember how we draw the topology of most models, we actually split the layers of the model vertically. For example, if the image below shows an 8-layer model:
=================== ===================
| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 |
=================== ===================
GPU0 GPU1
We cut it vertically into 2 parts, placing layers 0-3 on GPU0 and layers 4-7 on GPU1.
Now, when data is passed from layer 0 to layer 1, layer 1 to layer 2, and layer 2 to layer 3, it is just like a normal forward pass on a single GPU. But when data needs to pass from layer 3 to layer 4, it has to be transferred from GPU0 to GPU1, which introduces communication overhead. If the participating GPUs are on the same compute node (e.g., the same physical machine), the transfer is very fast, but if the GPUs are on different compute nodes (e.g., multiple machines), the communication overhead can be much larger.
Then layers 4 to 5 to 6 to 7 are like normal models again, and when layer 7 is done we usually need to send data back to layer 0 where the labels are (or send the labels to the last layer). Now the loss can be calculated and the optimizer can be used to update the parameters.
Problems:
- Why is this method called naive pipeline parallelism, and what are its defects? Mainly because all but one GPU are idle at any given moment. So using 4 GPUs is almost the same as quadrupling the memory of a single GPU while ignoring the rest of the hardware, and on top of that there is the overhead of copying data between devices. So four 6GB cards running naive PP can hold a model of the same size as one 24GB card, but the single card trains faster because it has no data transfer overhead. On the other hand, if you have 40GB cards and need to fit a 45GB model, you can do it with 4x 40GB cards (only just, because the gradients and optimizer states also need GPU memory).
- Shared embeddings may need to be copied back and forth between GPUs.
The pipeline parallelism (PP) we use is almost the same as the naive PP above, but it solves the GPU idling problem by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to participate in the computation at the same time.
The figure below is from the GPipe paper. The upper part shows the naive PP scheme, and the lower part shows the PP method:
From the bottom half of the figure it is easy to see that PP has fewer dead zones (where a GPU is idle), i.e., a smaller "bubble".
Both schemes in the figure have a degree of parallelism of 4, that is, the pipeline is made up of 4 GPUs. So there are four forward passes F0, F1, F2, and F3, followed by the backward passes B3, B2, B1, and B0 in reverse order.
PP introduces a new hyperparameter to tune, called chunks. It defines how many chunks of data are sent in sequence through the same pipeline stage. For example, in the bottom half of the figure you can see chunks = 4. GPU0 runs the same forward pass on chunks 0, 1, 2, and 3 (F0,0, F0,1, F0,2, F0,3), then waits until the other GPUs finish their work, and only then starts working again, running the backward pass for chunks 3, 2, 1, and 0 (B0,3, B0,2, B0,1, B0,0).
Note that this is conceptually the same as gradient accumulation steps (GAS). PyTorch calls it chunks, and DeepSpeed calls it GAS.
Because of chunks, PP introduces the concept of micro-batches (MBS). DP splits the global batch size into mini-batches, so if the DP degree is 4, a global batch size of 1024 is split into 4 mini-batches of 256 each (1024/4). And if the number of chunks (or GAS) is 32, we end up with a micro-batch size of 8 (256/32). Each pipeline stage processes one micro-batch at a time.
The formula to calculate the global batch size for the DP + PP setup is mbs*chunks*dp_degree (8*32*4=1024).
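The batch-size bookkeeping above is easy to get wrong by a factor, so it can help to write it down as a small function. This is only a sketch of the arithmetic; the function name and the divisibility checks are additions made here.

```python
def micro_batch_size(global_batch, dp_degree, chunks):
    """Derive the micro-batch size from the global batch size.

    global_batch = mbs * chunks * dp_degree, so we divide twice and
    insist that both divisions are exact.
    """
    per_replica, rem = divmod(global_batch, dp_degree)
    if rem:
        raise ValueError("global batch must be divisible by the DP degree")
    mbs, rem = divmod(per_replica, chunks)
    if rem:
        raise ValueError("per-replica batch must be divisible by chunks")
    return mbs


dp_degree, chunks = 4, 32
mbs = micro_batch_size(1024, dp_degree, chunks)
print(mbs, mbs * chunks * dp_degree)
```

With the numbers from the text, each of the 4 replicas receives 256 samples and pushes 32 micro-batches through its pipeline per step.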
Let's go back and look at the figure again.
With chunks=1 you end up with naive PP, which is very inefficient. With a very large number of chunks, you end up with tiny micro-batch sizes, which may not be very efficient either. So one must experiment to find the number of chunks that makes the most efficient use of the GPUs.
The figure shows bubbles of "dead" time that cannot be parallelized, because the last forward stage has to wait for backward to complete the pipeline. The problem of finding the best number of chunks, so that all participating GPUs achieve high concurrent utilization, therefore becomes one of minimizing the size of the bubble.
This scheduling mechanism is called all forward all backward. Some other options are one forward one backward and interleaved one forward one backward.
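How the bubble shrinks as the number of chunks grows can be estimated by laying out the schedule from the figure (all forward passes, then all backward passes) on a time grid. The sketch below assumes that every forward and every backward pass of a micro-batch takes exactly one time slot, which is a simplification (backward passes are usually slower); the function name is hypothetical.

```python
def bubble_fraction(stages, chunks):
    """Fraction of idle (stage, time-slot) cells in an all-forward,
    all-backward schedule with unit-cost forward and backward passes."""
    # Forward: stage s works on chunk c in time slot s + c.
    busy = {(s, s + c) for s in range(stages) for c in range(chunks)}
    fwd_slots = stages + chunks - 1
    # Backward mirrors forward: the last stage starts first.
    busy |= {
        (s, fwd_slots + (stages - 1 - s) + c)
        for s in range(stages)
        for c in range(chunks)
    }
    total_slots = 2 * fwd_slots
    idle = stages * total_slots - len(busy)
    return idle / (stages * total_slots)


for m in (1, 4, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

Under this model the idle fraction works out to (stages - 1) / (chunks + stages - 1): with 4 GPUs and chunks=1, three quarters of the time is idle, matching the "all but one GPU idle" description of naive PP, while more chunks push the fraction toward zero at the price of smaller micro-batches.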
While both Megatron-LM and DeepSpeed have their own implementations of the PP protocol, Megatron-DeepSpeed uses the DeepSpeed implementation because it is integrated with other features of DeepSpeed.
Another important issue here is the size of the word embedding matrix. Word embedding matrices usually need less memory than a transformer block, but in BLOOM's case, with a 250k vocabulary, the embedding layer needs 7.2GB in bf16 weights, while a transformer block needs only 4.9GB. Therefore, we had to make Megatron-DeepSpeed treat the embedding layer as a transformer block. So we have a pipeline of 72 layers, 2 of which are dedicated to the embedding (the first and the last). This lets us balance GPU memory consumption. Without it, the first and last stages would consume most of the GPU memory, 95% of the GPUs would use much less memory, and training would be far from efficient.
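The 7.2GB and 4.9GB figures can be sanity-checked with a short calculation. The sketch below relies on two assumptions that are not stated above: a hidden size of 14,336 for the 176B model, and the common approximation of 12·h² weights per transformer block (4·h² for attention, 8·h² for the MLP), ignoring biases and LayerNorm.

```python
VOCAB_SIZE = 250_680      # vocabulary size given in the dataset section
HIDDEN = 14_336           # assumption: hidden size of the 176B model
BYTES_BF16 = 2

# Embedding matrix: one row of HIDDEN values per vocabulary entry.
embedding_bytes = VOCAB_SIZE * HIDDEN * BYTES_BF16

# Assumption: a block holds about 12 * h^2 weights
# (4 * h^2 for attention, 8 * h^2 for the MLP); small terms ignored.
block_bytes = 12 * HIDDEN ** 2 * BYTES_BF16

print(f"embedding:         {embedding_bytes / 1e9:.1f} GB")
print(f"transformer block: {block_bytes / 1e9:.1f} GB")
```

Under these assumptions both numbers agree with the ones quoted above, which makes it clear why the embedding is heavy enough to deserve a pipeline slot of its own.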
DP+PP
There is a diagram in the DeepSpeed Pipeline Parallel Tutorial that demonstrates how to combine DP and PP, as shown below.
The important thing to understand here is that DP rank 0 cannot see GPU2, and DP rank 1 cannot see GPU3. For DP, there are only GPUs 0 and 1, and data is fed to them. GPU0 uses PP to "secretly" offload some of its load to GPU2. Likewise, GPU1 will also get help from GPU3.
Since at least 2 GPUs are required for each dimension, at least 4 GPUs are required here.
DP+PP+TP
For more efficient training, PP, TP, and DP can be combined, called 3D parallelism, as shown in the figure below.
This figure is from the blog post "3D Parallelism: Scaling to Trillion Parameter Models", which is also a good article.
Since you need at least 2 GPUs per dimension, here you need at least 8 GPUs for full 3D parallelism.
ZeRO DP+PP+TP
One of the main features of DeepSpeed is ZeRO, a highly scalable extension of DP, which we discussed in the section ZeRO data parallelism. It normally works as a standalone feature that requires neither PP nor TP, but it can also be combined with PP and TP.
When ZeRO-DP is combined with PP (and optionally TP), it typically enables only ZeRO stage 1, which shards only the optimizer states. ZeRO stage 2 additionally shards the gradients, and stage 3 additionally shards the model weights.
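What each stage buys in memory can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for the 16-bit weights, 2 for the 16-bit gradients, and 12 for the optimizer states (fp32 weights, momentum, and variance). The sketch below applies that accounting; the function name is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_gpu(n_params, dp_degree, stage):
    """Model-state memory per GPU for ZeRO stage 0 (plain DP) to 3,
    assuming mixed-precision Adam (2 + 2 + 12 bytes per parameter)."""
    params = 2 * n_params    # 16-bit weights
    grads = 2 * n_params     # 16-bit gradients
    optim = 12 * n_params    # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= dp_degree   # stage 1 shards the optimizer states
    if stage >= 2:
        grads /= dp_degree   # stage 2 also shards the gradients
    if stage >= 3:
        params /= dp_degree  # stage 3 also shards the weights
    return params + grads + optim


for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The example uses a 7.5B-parameter model on 64 GPUs purely as an illustration. Stage 1 already removes the largest term, which is part of why it combines well with PP and TP.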
While it is theoretically possible to use ZeRO stage 2 with pipeline parallelism, it hurts performance. Each micro-batch would need an additional reduce-scatter to aggregate the gradients before sharding, which adds potentially significant communication overhead. By the nature of pipeline parallelism we use small micro-batches and instead focus on balancing arithmetic intensity (micro-batch size) against minimizing the pipeline bubble (number of micro-batches). The added communication cost therefore hurts pipeline parallelism.
Also, because of PP, each GPU already holds fewer layers than normal, so the memory savings would not be large. PP already reduces the gradient size by a factor of 1/PP, so gradient sharding on top of that saves less than it does with pure DP.
ZeRO stage 3 can also be used to train models of this size, but it requires more communication than DeepSpeed 3D parallelism. A year ago, after careful evaluation in our environment, we found that Megatron-DeepSpeed 3D parallelism performed best. The performance of ZeRO stage 3 has improved significantly since then, and if we were to re-evaluate today, we might choose stage 3.
BF16Optimizer
Training huge LLM models in FP16 is a no-no.
We proved this to ourselves by spending several months training a 104B model which, as you can tell from the TensorBoard, was a complete failure. We learned a lot while fighting the ever-diverging lm-loss.
We also got the same advice from the Megatron-LM and DeepSpeed teams after they trained the 530B model. The recently released OPT-175B also reported great difficulty training in FP16.
So back in January we knew we would be training on A100s, which support the BF16 format. Olatunji Ruwase developed a BF16Optimizer, which we used to train BLOOM.
If you're not familiar with this data format, take a look at its bit layout. The key to the BF16 format is that it has the same number of exponent bits as FP32, so it does not suffer from overflow the way FP16 often does. FP16's largest representable value is about 64k (65504), so you can only multiply small numbers. For example, you can do 250*250=62500, but if you try 256*256=65536 you will overflow, which is the main cause of problems during training. This means your weights must be kept small. A technique called loss scaling helps alleviate this problem, but FP16's small range can still be an issue when models get very large.
BF16 doesn't have this problem: you can easily do 10_000*10_000=100_000_000 with no overflow at all.
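The difference in range can be reproduced without a GPU. The sketch below uses only the Python standard library: `struct`'s half-precision format stands in for FP16 (its largest finite value is 65504), and BF16 is emulated by keeping the top 16 bits of the float32 encoding with round-to-nearest-even. Both helper functions are illustrative and are not how PyTorch implements these types.

```python
import math
import struct


def to_fp16(x):
    """Round x to IEEE half precision; values beyond its range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.inf


def to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 encoding,
    rounding to nearest even. NaN and values near the float32 maximum
    are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for a in (250, 256, 10_000):
    product = float(a * a)
    print(a, "fp16:", to_fp16(product), "bf16:", to_bf16(product))
```

The BF16 results stay finite but are visibly rounded, which illustrates the trade: a much wider range in exchange for fewer significant bits.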
Of course, since BF16 and FP16 are the same size, 2 bytes, there is no free lunch: the tradeoff when using BF16 is that its precision is much lower. However, remember that training uses stochastic gradient descent and its variants, which is a bit like stumbling forward: if you don't find the perfect direction at this step, that's fine, you will correct yourself in the next steps.
Whether using BF16 or FP16, there is a copy of the weights that is always in FP32 - this is what is updated by the optimizer. So the 16-bit format is only used for calculations, the optimizer updates the FP32 weights with full precision, and then converts them to 16-bit format for the next iteration.
All PyTorch components have been updated to ensure that they perform any accumulation in FP32, so no loss of precision occurs.
A key issue is gradient accumulation, which is one of the main features of pipeline parallelism, since the gradients of each micro-batch are accumulated. Accumulating gradients in FP32 is critical for training accuracy, and this is exactly what BF16Optimizer does.
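Why the accumulator needs FP32 can be seen by adding the same small gradient many times. The sketch below emulates BF16 and FP32 rounding with the standard library (the helpers are illustrative, not the BF16Optimizer code) and compares a BF16 accumulator with an FP32 one.

```python
import struct


def to_bf16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32
    encoding with round-to-nearest-even (edge cases not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x):
    """Round a Python float (double) to float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(grad, steps, rounder):
    """Add grad `steps` times, rounding the running total after each add."""
    total = 0.0
    for _ in range(steps):
        total = rounder(total + grad)
    return total


grad = to_bf16(0.001)     # one micro-batch gradient value, already in bf16
steps = 4096
sum_bf16 = accumulate(grad, steps, to_bf16)
sum_fp32 = accumulate(grad, steps, to_fp32)
exact = grad * steps

print("exact:", exact)
print("fp32 accumulator:", sum_fp32)
print("bf16 accumulator:", sum_bf16)
```

Once the running total is large enough that the gradient is smaller than half the spacing between neighbouring BF16 values, further additions are rounded away and the BF16 accumulator stops growing, while the FP32 accumulator keeps tracking the true sum.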
Among other improvements, we believe that using BF16 mixed-precision training turned a potential nightmare into a relatively smooth process, as can be seen in the following lm loss plot:
Fused CUDA kernels
The GPU mainly does two things. It can write data to and read data from video memory and perform computations on that data. When the GPU is busy reading and writing data, the computing units of the GPU are idle. If we want to utilize the GPU efficiently, we want to keep idle time to a minimum.
A kernel is a set of instructions that implements a specific PyTorch operation. For example, when you call torch.add, it goes through the PyTorch dispatcher, which decides what code to run based on the input tensors and other variables, and then runs it. A CUDA kernel is an implementation written with CUDA, and therefore runs only on NVIDIA GPUs.
Now, when the GPU is asked to compute c = torch.add(a, b); e = torch.max([c,d]), what PyTorch typically does is launch two separate kernels, one that adds a and b to produce c, and another that takes the maximum of c and d. In this case, the GPU fetches a and b from its memory, performs the addition, and writes the result back to memory. It then fetches c and d, performs the max operation, and writes the result back to memory again.
If we fuse these two operations, i.e. put them into one "fused kernel" and launch just that kernel, the intermediate result c is not written to memory but stays in GPU registers, and only d needs to be fetched to finish the computation. This saves a lot of overhead and prevents the GPU from idling, so the whole operation is much more efficient.
Fused kernels do just that. They mainly replace multiple discrete computations and data movements to and from memory with fused computations that involve very little data movement. In addition, some fused kernels transform the math so that certain combinations of computations can be performed faster.
To train BLOOM quickly and efficiently, it was necessary to use several custom fused CUDA kernels provided by Megatron-LM. In particular, there is a fused LayerNorm kernel, as well as kernels that fuse various combinations of scaling, masking, and softmax operations. Bias Add is also fused with GeLU using PyTorch's JIT functionality. These operations are all memory-bound, so it is important to fuse them to maximize the amount of computation done per memory read. For example, performing the Bias Add during a memory-bound GeLU operation adds no execution time. These kernels can be found in the Megatron-LM repository.
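The saving from fusion is mostly a matter of counting memory accesses. The sketch below is a toy model in plain Python, not CUDA: each loop stands for one kernel launch, list elements stand for values in GPU memory, and the counters tally reads and writes. The function names and the counting scheme are illustrative choices made here.

```python
def unfused(a, b, d):
    """Two 'kernels': c = a + b, then e = max(c, d)."""
    traffic = {"reads": 0, "writes": 0}
    c = []
    for x, y in zip(a, b):            # kernel 1: read a and b, write c
        traffic["reads"] += 2
        c.append(x + y)
        traffic["writes"] += 1
    e = []
    for x, y in zip(c, d):            # kernel 2: read c and d, write e
        traffic["reads"] += 2
        e.append(max(x, y))
        traffic["writes"] += 1
    return e, traffic


def fused(a, b, d):
    """One 'kernel': the intermediate sum never goes back to memory."""
    traffic = {"reads": 0, "writes": 0}
    e = []
    for x, y, z in zip(a, b, d):      # read a, b and d, write e
        traffic["reads"] += 3
        c = x + y                     # stays in a "register"
        e.append(max(c, z))
        traffic["writes"] += 1
    return e, traffic


a, b, d = [1.0, 2.0, 3.0], [4.0, -5.0, 6.0], [0.0, 0.0, 10.0]
e_unfused, t_unfused = unfused(a, b, d)
e_fused, t_fused = fused(a, b, d)
print(e_unfused, t_unfused)
print(e_fused, t_fused)
```

In this toy model the fused version does 4 memory accesses per element instead of 6, with identical results. For memory-bound operations that ratio translates fairly directly into run time.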
Datasets
Another important feature of Megatron-LM is its efficient data loader. Before the first training run starts, each dataset is divided into samples of fixed sequence length (2048 for BLOOM), and an index is created to number each sample. Based on the training hyperparameters, we determine the number of epochs each dataset needs to participate in, create an ordered list of sample indices accordingly, and then shuffle it. For example, if a dataset has 10 samples and should be seen for 2 epochs, the system first lays out the sample indices in order, [0, ..., 9, 0, ..., 9], and then shuffles them to create the final global order for the dataset. Note that this means training does not simply iterate over the whole dataset and then repeat: you may see the same sample twice before seeing another sample, but by the end of training the model will have seen each sample exactly twice. This helps ensure a smooth training curve throughout. These indices, including the offset of each sample in the original dataset, are saved to a file to avoid recomputing them every time training starts. Finally, several of these datasets can be blended with different weights into the final training data.
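The index construction described above fits in a few lines. The sketch below is a minimal illustration, not the Megatron-LM data loader: the function name and the use of Python's `random` module with a fixed seed are choices made here so that the order is reproducible.

```python
import random
from collections import Counter


def build_sample_order(n_samples, epochs, seed):
    """Lay out the indices for all epochs, then shuffle them globally."""
    order = list(range(n_samples)) * epochs    # [0..n-1, 0..n-1, ...]
    random.Random(seed).shuffle(order)         # one shuffle across all epochs
    return order


order = build_sample_order(n_samples=10, epochs=2, seed=1234)
counts = Counter(order)
print(order)
print(counts)
```

Because the shuffle runs over the concatenated epochs rather than within each epoch, a sample can appear twice before another appears once, yet every sample still appears exactly `epochs` times overall. Saving `order` to disk corresponds to the cached index file mentioned above.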
Embedding LayerNorm
While working to keep the 104B model from diverging, we found that adding an extra LayerNorm right after the first word embedding layer made training more stable.
This insight comes from experiments with bitsandbytes, which has a StableEmbedding operation: a normal embedding with a LayerNorm, using Xavier uniform initialization.
Positional encoding
Based on the paper Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, we also replaced the usual positional embeddings with ALiBi, which allows extrapolation to input sequences longer than those the model was trained on. So even though we train on sequences of length 2048, the model can handle much longer sequences during inference.
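To give a feel for what replaces the positional embeddings, here is a small sketch of the linear biases as described in the ALiBi paper: each head adds a penalty proportional to the distance between query and key to its attention scores, with a fixed head-specific slope. The code assumes a power-of-two number of heads and a causal mask; the function names are hypothetical, and this is not the BLOOM implementation.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n).
    Assumes n_heads is a power of two, as in the paper's basic recipe."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope, seq_len):
    """Bias added to the attention scores of one head.

    Query i attending to key j <= i gets slope * (j - i), which is zero
    on the diagonal and more negative the further back the key is.
    Future positions are masked with -inf.
    """
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], seq_len=4)
print(slopes)
for row in bias:
    print(row)
```

Since the bias depends only on the distance between positions, nothing in it is tied to the training length, which is the intuition behind the extrapolation to longer sequences.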
Training difficulties
With the architecture, hardware, and software in place, we were able to start training in early March 2022. However, it was not all smooth sailing from there. In this section, we discuss some of the main obstacles we encountered.
There were a lot of issues to figure out before training started. In particular, we found several issues that manifested only once we started training on 48 nodes and did not appear at small scale. For example, we needed CUDA_LAUNCH_BLOCKING=1 to prevent the framework from hanging, and we needed to split the optimizer groups into smaller groups, otherwise the framework would hang again. You can read more about these in the pre-training chronicles.
The main type of issue encountered during training was hardware failure. Since this was a new cluster with about 400 GPUs, we had 1-2 GPU failures per week on average. We saved a checkpoint every 3 hours (100 iterations), so we lost 1.5 hours of training per hardware crash on average. The Jean Zay system administrators would then replace the faulty GPU and bring the node back up. In the meantime, we had spare nodes to use instead.
We've also had various other issues that resulted in 5-10 hour downtime multiple times, some related to deadlock bugs in PyTorch, others due to insufficient disk space. See the training chronicles if you're interested in specifics .
All of this downtime was planned for in the feasibility analysis of training this model, and we chose the appropriate model size and the amount of data we wanted the model to consume accordingly. So, even with these downtime issues, we managed to complete the training within the estimated time. As mentioned earlier, it takes about 1 million compute hours to complete.
Another problem is that SLURM was not designed to be used by a group of people. SLURM jobs are owned by a single user, and if they are not around, other members of the group cannot do anything with the running job. We have a termination scheme that allows other users in the group to terminate the current process without the presence of the user who started the process. This works great on 90% of the problems. If the SLURM designers read this, please add the concept of a Unix group so that a SLURM job can be owned by a group.
Since the training runs 24/7, we need someone on call - but since we have people in Europe and the west coast of Canada, there's no need for someone to carry a pager and we're pretty good at backing each other up. Of course, weekend training has to be watched. We automate most things, including automatically recovering from hardware crashes, but sometimes human intervention is still required.
Important links
Papers and Articles
It is impossible for us to explain everything in detail in this article, so if the techniques presented here pique your curiosity and make you want to learn more, please read the following papers:
Megatron-LM:
- Efficient Large-Scale Language Model Training on GPU Clusters.
- Reducing Activation Recomputation in Large Transformer Models
DeepSpeed:
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- DeepSpeed: Extreme-scale model training for everyone
Megatron-LM and DeepSpeed combined:
ALiBi:
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- What Language Model to Train if You Have One Million GPU Hours? There you will find the experiments that ultimately led us to choose ALiBi.
BitsNBytes:
- 8-bit Optimizers via Block-wise Quantization (we used the embedding LayerNorm from this paper; the other parts of the paper and its techniques are also very good, and the only reason we didn't use 8-bit optimizers is that we were already saving optimizer memory with DeepSpeed-ZeRO).