DeepSpeed usage experience

Today's models keep getting larger, often several billion or even hundreds of billions of parameters, while GPU memory simply cannot keep up for training or inference. For example, on a consumer GPU with only ~10 GB of memory, merely loading such a model is enough to cause an OOM, let alone the subsequent training and optimization.

As an alternative to PyTorch's traditional DataParallel, DeepSpeed's goal is to make models with hundreds of millions of parameters trainable and usable for inference on one's own personal workstation or server.

This article briefly introduces the core concepts behind using DeepSpeed for large-scale model training, as well as the most basic usage. For more information, the author strongly recommends reading the DeepSpeed tutorial on the HuggingFace Transformers website:

Transformer DeepSpeed Integration

1. Core Idea (TLDR)

The core idea of DeepSpeed is simple: when GPU memory is not enough, make up the difference with CPU memory.

For example, if we only have a 10 GB GPU, then we will likely need around 80 GB of CPU RAM to train a large model.

Take a look at the official website’s description of this concept:

Why would you want to use DeepSpeed with just one GPU?

  1. It has a ZeRO-offload feature which can delegate some computations and memory to the host’s CPU and RAM, and thus leave more GPU resources for model’s needs - e.g. larger batch size, or enabling a fitting of a very big model which normally won’t fit.
  2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit bigger models and data batches.

Specifically, DeepSpeed offloads the parameters that the training process does not need at the current moment to CPU memory, and moves them back to the GPU when they are needed. The "parameters" here refer not only to model weights, but also to optimizer states, gradients, and so on.

The more parameters are offloaded to the CPU, the lighter the burden on the GPU; but the price is more frequent CPU-GPU transfers, which greatly increase the time cost of training and inference. Therefore, one of the core aspects of DeepSpeed is the trade-off between time overhead and memory usage.

2. How to install

Direct pip installation:

pip install deepspeed

The official recommendation, however, is to clone the repository and compile/install locally, which adapts better to your local hardware environment:

git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

In addition, HuggingFace provides a friendly integration with DeepSpeed: many of the parameters that DeepSpeed needs can be filled in automatically by the Transformers Trainer. It is fair to say that using DeepSpeed through HuggingFace Transformers is more convenient (of course, DeepSpeed can also be used on its own and does not depend on Transformers).

Install it as an extra of Transformers:

pip install transformers[deepspeed]

3. How to use

After using DeepSpeed, your command line will look like this:

deepspeed --master_port 29500 --num_gpus=2 run_s2s.py \
--deepspeed ds_config.json
  • --master_port: the port number. It is best to specify it explicitly; the default is 29500, which may already be occupied (e.g., when running multiple DeepSpeed processes).
  • --num_gpus: the number of GPUs; by default all currently visible GPUs will be used.
  • --deepspeed: the config file provided here specifies many of DeepSpeed's important parameters.

One of the key points of using DeepSpeed is writing a config file (it can be .json, or any json-like configuration file). In this file you specify the parameters you want, for example how to trade off time against GPU memory (as mentioned earlier, this is the important trade-off). Therefore, among the parameters above, the most important one is the config file you pass via --deepspeed, i.e., the ZeRO configuration. This is what this article focuses on next.
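Below is a minimal sketch of how such a config file is handed to the HuggingFace Transformers Trainer; the file name ds_config.json and the concrete argument values are placeholders chosen for illustration, not values prescribed by DeepSpeed:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,   # fills the "auto" train_micro_batch_size_per_gpu
    learning_rate=5e-5,              # fills the "auto" optimizer lr
    warmup_steps=500,                # fills the "auto" scheduler warmup_num_steps
    fp16=True,                       # fills the "auto" fp16.enabled
    deepspeed="ds_config.json",      # the ZeRO config file passed via --deepspeed
)

# Build Trainer(model=..., args=training_args, ...) as usual and call trainer.train(),
# launching the script with the `deepspeed` launcher shown above.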

3.1 ZeRO Overview

Zero Redundancy Optimizer (ZeRO) is the workhorse of DeepSpeed. Users can provide different ZeRO config files to implement different features of DeepSpeed.

Let's take a look at the description of ZeRO in the official website tutorial:

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.

One-sentence summary: partitioning instead of replicating; divide rather than copy.

That is to say, in traditional data-parallel training, the full set of model parameters is replicated onto every GPU and only the data is split (for example, torch's DataParallel), which wastes a lot of memory on redundant copies. ZeRO is designed to eliminate this redundancy and improve memory utilization. Note that "memory" here refers not only to the memory of multiple GPUs, but also to CPU memory.

ZeRO implements this by logically dividing the memory occupied during training into three kinds of model states, and partitioning each of them:

  • optimizer states: the state kept by the optimizer, e.g., Adam's momentum and variance.
  • gradients: the gradient cache, corresponding to the optimizer states.
  • parameters: the model parameters themselves.

Correspondingly, DeepSpeed's ZeRO config files can be divided into the following categories:

  • ZeRO Stage 1: partition the optimizer states. The optimizer states are split across multiple devices, and the process on each device is only responsible for updating its own shard of the parameters.
  • ZeRO Stage 2: additionally partition the gradients. Each device only keeps the gradients corresponding to the optimizer-state shard it holds. This makes sense because gradients and optimizer states are closely linked: knowing the gradient without the corresponding optimizer state gives no way to update the model parameters.
  • ZeRO Stage 3: additionally partition the model parameters, or in other words, the different layers. ZeRO-3 automatically gathers and re-partitions model parameters across devices during the forward and backward passes.

Since ZeRO-1 only partitions the optimizer states (and its savings are limited compared with the later stages, which subsume it), in practice we generally only consider ZeRO-2 and ZeRO-3.
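To build intuition for how much each stage saves, here is a rough back-of-the-envelope illustration (not an official DeepSpeed formula; it ignores activations, temporary buffers, and fragmentation): with mixed-precision Adam, each parameter costs about 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and 12 bytes for the fp32 optimizer states, and each ZeRO stage partitions one more of these states across devices:

# Rough illustration only: model states for mixed-precision Adam,
# ignoring activations, buffers, and fragmentation.
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:   # ZeRO-1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:   # ZeRO-2: also partition gradients
        grads /= num_gpus
    if stage >= 3:   # ZeRO-3: also partition the parameters themselves
        params /= num_gpus
    return params + grads + optim

# e.g., a hypothetical 1.5B-parameter model on 4 GPUs
for stage in (0, 1, 2, 3):
    gb = model_state_bytes_per_gpu(1.5e9, 4, stage) / 2**30
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB of model states per GPU")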

Next, we introduce the commonly used config files for stages 2 and 3.

3.2 ZeRO Stage 2

Based on the introduction on the official website, the author provides a commonly used ZeRO-stage-2 config file:

{
    "bfloat16": {
        "enabled": "auto"
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}
  • About offload

Among the parameters above, the most important one is "offload_optimizer". As shown above, its "device" is set to cpu, so DeepSpeed will follow the ZeRO strategy described earlier and place the optimizer states in CPU memory during training, thereby reducing the memory usage of a single GPU.

  • About overlap_comm

Another parameter worth mentioning is overlap_comm. Roughly speaking, it tries to overlap communication with computation by using larger communication buffers between the processes on the different devices. Enabling it speeds up inter-process communication and therefore training, but the corresponding GPU memory usage also increases; and vice versa.

Therefore, overlap_comm is also a parameter that requires a certain trade-off.
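Relatedly, the two bucket-size parameters in the config above (allgather_bucket_size, reduce_bucket_size) work the same way: larger buckets mean faster communication but more GPU memory. A hedged illustration (the values below are arbitrary examples, not recommendations; the config above uses 2e8):

# Illustrative only: smaller buckets reduce the GPU memory used for communication,
# at the cost of slower all-gather / reduce-scatter operations.
stage2_bucket_overrides = {
    "allgather_bucket_size": 5e7,
    "reduce_bucket_size": 5e7,
}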

  • About auto

You may notice that many of the parameters above are set to auto. Since DeepSpeed is integrated into the HuggingFace Transformers framework, many DeepSpeed parameters correspond exactly to settings of the Transformers Trainer (for example, "optimizer" and "scheduler"). The official recommendation is therefore to set these commonly shared training parameters to auto; when training with the Trainer, their values are automatically filled in from the Trainer's settings, or computed for you.

Of course, you can also set them yourself, but then make sure they are identical to the Trainer's settings, because if they are inconsistent, DeepSpeed may still appear to run normally and will not raise an error immediately.

  • Summary

In most cases, you only need to pay attention to DeepSpeed-specific parameters (such as offload). For parameters that duplicate Trainer settings, it is strongly recommended to use auto. For the specific meaning and value choices of each parameter, please refer to the detailed introduction on the official website.

All in all, thanks to the auto settings, the config above can cover most stage-2 use cases under the Transformers framework.

3.3 ZeRO Stage 3

Similar to stage-2, the author also provides a template config for stage-3:

{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  • About "offload_param"

As you can see, in addition to the same offload_optimizer parameter as stage-2, stage-3 also has an offload_param parameter, i.e., the model parameters are partitioned and offloaded as well.

  • Other parameters related to stage-3

The following parameters are stage-3-specific:

"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true

Likewise, many of these values can be used to balance the memory usage and training efficiency of stage-3 (e.g., sub_group_size); at the same time, some of them can also be set to auto and left for the Trainer to decide (e.g., reduce_bucket_size, stage3_prefetch_bucket_size, stage3_param_persistence_threshold).
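As a hedged illustration (the values below are arbitrary examples, not recommendations; 1e9 matches both the config above and DeepSpeed's defaults), lowering the two "live parameter" knobs trades training speed for lower GPU memory usage:

# Illustrative only: smaller values keep fewer full parameters resident on the GPU,
# reducing memory at the cost of more frequent re-gathering (slower training).
stage3_memory_saving_overrides = {
    "stage3_max_live_parameters": 3e8,
    "stage3_max_reuse_distance": 3e8,
}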

For detailed descriptions of these parameters and the trade-off of values, please see the official website:
ZeRO-3 Config

  • Summary

Likewise, the config file above can cover most use cases. Some stage-3-specific parameters may require additional attention; for the specifics, it is recommended to read the official documentation.

3.4 ZeRO Infinity

In addition to stages 2 and 3, here is a brief introduction to ZeRO-Infinity.

ZeRO-Infinity can be regarded as an advanced version of stage-3 and requires NVMe support. It can offload all of the model states to CPU memory and NVMe. Thanks to the NVMe protocol, ZeRO can additionally use SSDs (solid-state drives) on top of CPU memory, which greatly reduces memory overhead while keeping communication reasonably efficient.

The official documentation provides a detailed introduction to ZeRO-Infinity:

DeepSpeed official tutorial:
ZeRO-Infinity has all of the savings of ZeRO-Offload, plus is able to offload more of the model weights and has more effective bandwidth utilization and overlapping of computation and communication.
HuggingFace documentation:
It allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during offloading so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training process. ZeRO-Infinity requires ZeRO-3 enabled.

For specific config files and usage instructions, please see the official website.
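For orientation only, here is a hedged sketch of what the change looks like: the stage-3 config above is reused, but the offload targets point at an NVMe path instead of "cpu". "/local_nvme" is a placeholder mount path, and the NVMe buffer/aio tuning keys are omitted here; see the official ZeRO-Infinity docs for the full set.

# Sketch: offload sections of a stage-3 config redirected to NVMe.
zero_infinity_offload = {
    "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
    "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
}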

4. Others

4.1 Model inference

In addition to training, sometimes the model is so large that even inference can exhaust GPU memory.

DeepSpeed naturally also supports inference. For inference, you can simply reuse a config file with the same parameters as stage-3; the training-only parameters (e.g., optimizer, lr) will be automatically ignored.
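As a hedged sketch of running ZeRO-3 inference without the Trainer (loosely following the HuggingFace ZeRO-Inference docs; "ds_config_zero3.json" and "t5-large" are placeholders, and in newer transformers versions HfDeepSpeedConfig lives in transformers.integrations rather than transformers.deepspeed):

import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = "ds_config_zero3.json"       # a stage-3 config; training-only fields are ignored
dschf = HfDeepSpeedConfig(ds_config)     # must be created BEFORE the model, and kept alive
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()                     # run forward passes / generate() on engine.module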

Specific reference:
ZeRO-Inference

4.2 Memory estimation

As emphasized several times already, one of the difficulties in using DeepSpeed lies in the trade-off between time and memory.

Allocating more parameters to the CPU can reduce the memory overhead, but it will also greatly increase the time overhead.

DeepSpeed provides a simple memory-estimation utility:

from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

## specify the model you want to train on your device
model = AutoModel.from_pretrained("t5-large") 
## estimate the memory cost (both CPU and GPU)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Taking T5-large as an example and using only one GPU, the estimated overhead with DeepSpeed is as follows:

[Figure: estimator output for T5-large, listing per-stage CPU and GPU memory requirements]
As shown above, without the stage-2/stage-3 offloading options (the bottom two rows), training T5-large requires a graphics card with at least 12.49 GB of memory (taking other cached variables and your batch_size into account, it may actually need a 24 GB card). With stage 2 and then stage 3 enabled, the GPU memory overhead drops sharply, while CPU memory consumption rises significantly and training time increases accordingly.

Suggestion:
Before using DeepSpeed, use the code above to roughly estimate memory consumption, and then decide how many GPUs to use and which ZeRO stage.

The principle is, if you can directly train with multiple cards, don’t use ZeRO; if you can use ZeRO-2, don’t use ZeRO-3.

For details, please see the official website: Memory Requirements

4.3 Usage evaluation

The author tried to use DeepSpeed ​​for model training.


First, stage 2, i.e., only the optimizer states are placed on the CPU. Below is a comparison of GPU memory usage and training speed before and after enabling it:

  • GPU memory: 20513 MiB => 17349 MiB
  • Training speed (estimated by tqdm): 1.3 iter/s => 0.77 iter/s

It is clear that GPU memory usage dropped noticeably, but training also became slower. In the author's case, DeepSpeed therefore brought no real benefit.

The author's machine has a 24000 MiB (24 GB) graphics card; with batch_size 2, the job occupied 20513 MiB. DeepSpeed only freed about 3000 MiB, which was still not enough to increase the batch_size, so the total training time simply became longer.

Therefore, DeepSpeed is probably only worthwhile when GPU memory is extremely scarce (i.e., the model cannot run even with batch_size == 1), or when the memory saved by DeepSpeed is just enough to support a larger batch_size. Otherwise, as in the author's situation, using DeepSpeed only adds time overhead with no other benefit.


After that, the author also tried stage 3, but it was extremely slow. A training run that originally took 6 hours had, with DeepSpeed stage 3, been running for two days and two nights with no sign of finishing, so the author had to terminate the test.

In addition, when using DeepSpeed stage 2, because the model states are distributed across multiple devices, no output could be seen in the console (even though the GPU was still busy and utilization stayed at 100%), so there was no way to tell the training progress, which is quite unfriendly to users.

4.4 Some frequently asked questions

Since DeepSpeed reduces GPU memory pressure by occupying CPU memory, when the system does not have enough CPU memory, the DeepSpeed process will be killed by the system, which manifests as DeepSpeed failing to start without any error message. It is recommended to first use the estimation introduced above to gauge CPU memory usage, and then check the machine's free CPU memory with free -h to determine whether DeepSpeed can be used.

In addition, it is also possible that the loss becomes NaN due to training-precision issues. See: Troubleshooting.


After enabling DeepSpeed stage 2, the optimizer can no longer be changed flexibly. Looking at the source of the Transformers DeepSpeed integration (deepspeed.py):
[Figure: screenshot of the relevant deepspeed.py source, omitted]
the default optimizer must be set in the config, i.e., the default optimizer and learning rate are used, so per-parameter-group learning rates cannot be realized. If you want to customize how the optimizer is initialized, you must implement the optimizer in both CPU and GPU versions. As officially stated:

Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB).

In short, it will be more troublesome to customize the optimizer in this case.


Finally, a note for those who rely heavily on VSCode:
Unfortunately, DeepSpeed processes cannot currently be debugged in VSCode, because the corresponding VSCode plug-in does not support them. For details, see: github issue


5. References

  1. HuggingFace Transformers DeepSpeed Integration
  2. DeepSpeed tutorial (English)
  3. DeepSpeed config parameter documentation
