Efficient multi-GPU compute strategies for LLMs

Chances are at some point you will need to scale your model training workload to more than one GPU. In the last video, I emphasized that you need to use a multi-GPU computing strategy when your model becomes too large to fit on a single GPU. But even if your model does fit on a single GPU, there are benefits to using multiple GPUs to speed up training. Even if you're working with small models, it can be useful to understand how computations are distributed across GPUs. Let's discuss how to efficiently do this scaling across multiple GPUs.

You'll start by considering the case where the model still fits on a single GPU. The first step in scaling model training is to distribute large datasets across multiple GPUs and process those batches of data in parallel. A popular implementation of this model replication technique is PyTorch's Distributed Data Parallel, or DDP for short. DDP copies your model onto each GPU and sends batches of data to each GPU in parallel. Each batch is processed in parallel, and then a synchronization step combines the results from all GPUs, updating the model on each GPU so that it always stays identical across chips. This approach allows parallel computation across all GPUs and results in faster training. Note that DDP requires your model weights, along with all of the other parameters, gradients, and optimizer states needed for training, to fit onto a single GPU.
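
To make this concrete, here is a minimal DDP sketch in PyTorch. The tiny linear model and random dataset are stand-ins for a real model and data, and the script assumes it is launched with `torchrun` so that one process runs per GPU.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes launch via `torchrun --nproc_per_node=<num_gpus> ddp_demo.py`,
# which starts one process per GPU and sets LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A toy model and dataset stand in for your LLM and training data.
model = torch.nn.Linear(512, 10).to(local_rank)
model = DDP(model, device_ids=[local_rank])   # replicate the model on this GPU
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))

# DistributedSampler hands each GPU a different slice of the dataset.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for inputs, labels in loader:
    inputs, labels = inputs.to(local_rank), labels.to(local_rank)
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()    # DDP all-reduces (synchronizes) gradients across GPUs here
    optimizer.step()   # every replica applies the same update and stays identical

dist.destroy_process_group()
```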

If your model is too large for this, you should look at another technique called model sharding. A popular implementation of model sharding is PyTorch's Fully Sharded Data Parallel, or FSDP for short. FSDP was motivated by a 2019 paper from Microsoft researchers that proposed a technique called ZeRO. ZeRO stands for Zero Redundancy Optimizer, and its goal is to optimize memory by distributing, or sharding, model state across GPUs with zero data overlap. This allows you to scale model training across GPUs when the model does not fit in the memory of a single chip.

Before going back to FSDP, let's take a quick look at how ZeRO works.

Earlier this week, you looked at all of the memory components required to train an LLM. The largest memory requirement is for the optimizer states, which take up twice as much space as the weights, followed by the weights themselves and the gradients.
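
As a rough reminder of that breakdown, the short sketch below estimates the model-state memory for full-precision (fp32) training with an Adam-style optimizer, assuming 4 bytes per weight, 4 bytes per gradient, and 8 bytes of optimizer state per parameter; activations and temporary buffers are deliberately ignored.

```python
GIB = 1024 ** 3  # bytes per gibibyte

def training_state_gib(num_params: int) -> dict:
    """Rough per-component memory for fp32 training with Adam (no activations)."""
    weights   = 4 * num_params   # fp32 parameters
    gradients = 4 * num_params   # fp32 gradients
    optimizer = 8 * num_params   # Adam momentum + variance, both in fp32
    return {
        "weights_gib":   weights / GIB,
        "gradients_gib": gradients / GIB,
        "optimizer_gib": optimizer / GIB,
        "total_gib":     (weights + gradients + optimizer) / GIB,
    }

# Example: a 1-billion-parameter model needs roughly 15 GiB of model state,
# and half of that is optimizer state -- twice the size of the weights.
print(training_state_gib(1_000_000_000))
```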

Let's represent the parameters as a blue box, the gradients in yellow, and the optimizer states in green.

A limitation of the model replication strategy shown earlier is that you need to keep a full copy of the model on each GPU, which leads to redundant memory consumption: you are storing the same numbers on every GPU.

ZeRO, on the other hand, eliminates this redundancy by distributing, also known as sharding, the model parameters, gradients, and optimizer states across GPUs instead of replicating them. At the same time, the communication overhead of syncing model state stays close to that of the Distributed Data Parallel (DDP) approach discussed earlier.

ZeRO offers three optimization stages. ZeRO Stage 1 shards only the optimizer states across GPUs, which can reduce your memory footprint by up to a factor of four.

ZeRO Stage 2 also shards the gradients across chips. When applied together with Stage 1, this can reduce your memory footprint by up to a factor of eight.

Finally, ZeRO Stage 3 shards all components, including the model parameters, across GPUs. When applied together with Stages 1 and 2, the memory reduction scales linearly with the number of GPUs. For example, sharding across 64 GPUs can reduce your memory by a factor of 64.
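
The factors of four and eight quoted above come from the ZeRO paper's mixed-precision setup (2 bytes each for fp16 weights and gradients, plus 12 bytes of fp32 optimizer state per parameter). The sketch below applies those assumed byte counts to estimate per-GPU model-state memory at each stage; activations and communication buffers are ignored.

```python
GB = 1e9  # bytes per gigabyte (decimal, as used in the ZeRO paper)

def per_gpu_state_gb(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory for ZeRO stages 0-3 (mixed precision)."""
    weights   = 2.0 * num_params    # fp16 parameters
    gradients = 2.0 * num_params    # fp16 gradients
    optimizer = 12.0 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optimizer /= num_gpus       # Stage 1: shard the optimizer states
    if stage >= 2:
        gradients /= num_gpus       # Stage 2: also shard the gradients
    if stage >= 3:
        weights /= num_gpus         # Stage 3: also shard the parameters
    return (weights + gradients + optimizer) / GB

# A 7.5B-parameter model across 64 GPUs: roughly 120 GB of model state unsharded,
# ~31 GB at Stage 1 (about 4x less), ~17 GB at Stage 2 (about 8x less),
# and under 2 GB at Stage 3 (scaling with the number of GPUs).
for stage in range(4):
    print(f"stage {stage}: {per_gpu_state_gb(7_500_000_000, 64, stage):.1f} GB per GPU")
```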

Let's apply this concept to the DDP diagram from earlier and replace the LLM with its in-memory representation of model parameters, gradients, and optimizer states. When you use FSDP, you distribute the data across multiple GPUs just as you saw with DDP.

However, with FSDP you also distribute, or shard, the model parameters, gradients, and optimizer states across the GPU nodes, using one of the strategies specified in the ZeRO paper. With this strategy, you can now work with models that are too large to fit on a single chip.
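
As a rough sketch, moving from the DDP snippet shown earlier to FSDP mostly means swapping the wrapper. The toy model is again a stand-in; real LLMs usually also need an auto-wrap policy so that individual transformer layers become separate FSDP units, which is omitted here for brevity.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# FSDP shards the parameters, gradients, and optimizer states across all ranks
# instead of keeping a full replica on every GPU.
model = FSDP(torch.nn.Linear(512, 10), device_id=local_rank)

# Create the optimizer *after* wrapping, so it only tracks this rank's shards.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```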

In contrast to DDP, where each GPU has all of the model state needed to process each batch of data locally, FSDP requires you to gather this sharded state from all of the GPUs before the forward and backward passes.

Each GPU requests the data it needs from the other GPUs on demand, materializing the sharded data into unsharded data for the duration of the operation. After the operation, you release the unsharded non-local data back to the other GPUs as the original sharded data. You can also choose to keep it for future operations, for example during the backward pass; note that this requires more GPU RAM, which is a typical performance versus memory trade-off decision.

In the last step after the backward pass, FSDP synchronizes gradients across GPUs in the same way as DDP.
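
In PyTorch's FSDP, this keep-or-release choice maps roughly onto the sharding strategy; a brief sketch, reusing `local_rank` from the earlier snippets:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# FULL_SHARD frees the gathered parameters right after the forward pass and
# gathers them again for the backward pass (lowest memory, more communication).
# SHARD_GRAD_OP keeps the gathered parameters around until the backward pass
# has used them (more GPU RAM, less communication) -- the trade-off above.
model = FSDP(
    torch.nn.Linear(512, 10),                         # toy stand-in for an LLM
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    device_id=local_rank,                             # as set up earlier
)
```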

Model sharding as implemented by FSDP:

  1. Allows you to reduce overall GPU memory usage.
  2. Optionally offloads part of the training computation to CPUs to further reduce GPU memory usage.
  3. Lets you manage the trade-off between performance and memory usage by configuring the level of sharding via FSDP's sharding factor (a configuration sketch follows the discussion below).

A sharding factor of 1 basically removes the sharding and replicates the full model on every GPU, similar to DDP.

If you set the sharding factor to the maximum number of available GPUs, you turn on full sharding. This saves the most memory but increases the communication between GPUs.

Any sharding factor in between enables hybrid sharding.
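
In PyTorch's FSDP API, this knob is exposed as a `sharding_strategy` argument rather than a numeric sharding factor. The sketch below shows how the options roughly correspond to full replication, full sharding, and hybrid sharding, along with the CPU-offload option mentioned earlier; `HYBRID_SHARD` requires a reasonably recent PyTorch release, and `local_rank` is again assumed from the earlier snippets.

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    CPUOffload,
)

# NO_SHARD     -> full replication on every GPU (sharding factor 1, DDP-like)
# FULL_SHARD   -> shard everything across all GPUs (maximum memory savings,
#                 maximum communication)
# HYBRID_SHARD -> shard within a node and replicate across nodes (in between)
model = FSDP(
    torch.nn.Linear(512, 10),                      # toy stand-in for an LLM
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),   # optionally push shards to CPU
    device_id=local_rank,
)
```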

Let's see how FSDP performs compared to DDP, measured in teraflops per GPU. These tests were performed using up to 512 NVIDIA V100 GPUs, each with 80 GB of memory. Note that one teraflop corresponds to one trillion (10^12) floating-point operations per second. The first figure shows FSDP performance for T5 models of different sizes, with full sharding in blue, hybrid sharding in orange, and full replication in green; for reference, DDP performance is shown in red.

For the first two T5 models, with 611 million and 2.28 billion parameters, FSDP and DDP perform similarly. If you choose a model size beyond 2.28 billion parameters, such as the T5 model with 11.3 billion parameters, DDP runs into an out-of-memory error. FSDP, on the other hand, can easily handle models of this size and achieves higher teraflops when the model's precision is lowered to 16-bit.
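
A quick back-of-the-envelope check, reusing the fp32-plus-Adam byte counts assumed in the earlier memory sketch, makes the out-of-memory failure plausible: the model state alone far exceeds a single 80 GB card, even before activations are counted.

```python
# ~16 bytes of model state per parameter (fp32 weights + gradients + Adam states)
params = 11_300_000_000
print(f"{params * 16 / 1024**3:.0f} GiB of model state")  # ~168 GiB, far above 80 GB
```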

The second figure shows a 7% decrease in per-GPU teraflops when increasing the number of GPUs from 8 to 512 for the 11-billion-parameter T5 model, using batch sizes of 16 (orange) and 8 (blue). As models grow in size and are distributed across more and more GPUs, the increasing volume of communication between chips starts to impact performance, slowing down computation.

Taken together, this shows that you can use FSDP for both small and large models, and seamlessly scale model training across multiple GPUs.

I know this discussion has been very technical, and I want to stress that you don't need to memorize all of these details. What matters most is understanding how data, model parameters, and training computations are shared across processes when training LLMs. Given the expense and technical complexity of training models across GPUs, some researchers have been exploring ways to achieve better performance with smaller models. In the next video, you'll learn about this research into compute-optimal models. Let's move on.

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/e8hbI/optional-video-efficient-multi-gpu-compute-strategies
