Deep learning: Large-scale model distributed training framework DeepSpeed

Introduction to DeepSpeed

As machine learning models grow larger and more complex, the computational resources required to train them keep increasing. In fields such as natural language processing (NLP) in particular, many models have reached billions or even hundreds of billions of parameters, which makes multi-GPU or multi-node distributed training a necessity. To train these giant models effectively, Microsoft open-sourced DeepSpeed, a large-scale model distributed training framework that provides optimization strategies and tools designed to support large-scale model training and make the training process more efficient.

DeepSpeed is designed to simplify and optimize the training of large-scale models. It implements 3D parallelism, a flexible combination of three parallelism methods: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. These techniques greatly reduce the memory consumption of a single GPU, allowing larger models to be trained with limited resources. In addition, through ZeRO-Offload, DeepSpeed can use both CPU and GPU memory to train large models, further reducing GPU memory consumption.

DeepSpeed provides highly optimized data loading and network communication tools that reduce communication volume and significantly improve training efficiency in multi-GPU and multi-node environments. The framework also supports mixed-precision training, further improving computing speed and resource utilization. In addition, DeepSpeed provides sparse attention kernels that support input sequences an order of magnitude longer than classic dense Transformers and achieve up to 6x faster execution while maintaining comparable accuracy.

Beyond optimizing large-scale training, DeepSpeed also focuses on user experience. It provides an easy-to-integrate API that makes migrating existing PyTorch models to the DeepSpeed framework straightforward, without extensive code rewriting.

DeepSpeed is an active open source project that is continuously updated and maintained on GitHub: https://github.com/microsoft/DeepSpeed .

DeepSpeed core features

The core features of DeepSpeed are as follows:
In model training, DeepSpeed provides innovative technologies such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity that make large-scale deep learning training effective and efficient, greatly improve ease of use, and redefine what is possible in terms of training scale.
In model inference, DeepSpeed brings together innovations in parallelism technologies such as tensor, pipeline, expert, and ZeRO parallelism and combines them with high-performance custom inference kernels, communication optimizations, and heterogeneous memory technologies, enabling inference at unprecedented scale while achieving unparalleled latency, throughput, and performance.
In model compression, DeepSpeed provides easy-to-use and flexibly composable compression techniques for compressing models, delivering faster speed, smaller model size, and significantly reduced compression cost.
The DeepSpeed team has also launched a new initiative called DeepSpeed4Science, which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest scientific mysteries.

How does DeepSpeed work?

Let’s take a look at some of DeepSpeed’s core components and how they work.

  1. ZeRO Optimizer
    ZeRO (Zero Redundancy Optimizer) is a key innovation of DeepSpeed, designed to solve the memory bottleneck in large model training. It saves memory by eliminating the redundant data kept by standard data-parallel strategies: instead of storing a complete copy on every GPU, ZeRO partitions the model parameters, gradients, and optimizer states across multiple GPUs. During training it uses a dynamic communication schedule to share the necessary states among the distributed devices while retaining the computational granularity and communication volume of data parallelism. This significantly reduces the memory load on each GPU, making it possible to train much larger models. ZeRO's levels are classified as follows:
  • ZeRO-0: Disables all sharding; DeepSpeed is used only as DDP (Distributed Data Parallel);
  • ZeRO-1: Shards the optimizer states, giving a 4x memory reduction with the same communication volume as data parallelism;
  • ZeRO-2: Shards the optimizer states and gradients, giving an 8x memory reduction with the same communication volume as data parallelism;
  • ZeRO-3: Shards the optimizer states, gradients, and parameters; the memory reduction is linear in the data-parallel degree;
  • ZeRO-Infinity: An extension of ZeRO-3 that makes it possible to train very large models by extending GPU and CPU memory with NVMe SSDs. ZeRO-Infinity requires ZeRO-3 to be enabled.
  2. ZeRO-Offload technology
    ZeRO-Offload further extends ZeRO's capabilities by offloading part of the work (such as optimizer states and the associated computation) to the CPU, reducing the memory and compute pressure on the GPU. This allows large models to be trained with limited GPU resources while still using CPU resources efficiently.
  3. Parameter Sharding
    In DeepSpeed, parameter sharding is another means of reducing GPU memory requirements. By splitting a model's parameters into smaller pieces and loading them into memory only when needed during training, DeepSpeed can further reduce the amount of memory required on a single GPU, allowing larger models to be trained.
  4. Mixed precision training
    Mixed precision training refers to using both FP16 (half-precision floating point) and FP32 (single-precision floating point) during training. Using FP16 greatly reduces the memory footprint, allowing larger models to be trained. Mixed precision training requires additional techniques to avoid problems such as vanishing gradients and training instability, for example dynamic loss scaling and a mixed-precision optimizer. In DeepSpeed, the ZeRO stage, offloading, and mixed precision are all selected through its configuration, as sketched after this list.
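
As a rough illustration of how these options come together, the following is a minimal sketch of a DeepSpeed configuration expressed as a Python dict. The concrete values (batch size, learning rate, ZeRO stage) are placeholders for illustration, not recommendations; the same settings can equally be stored in a JSON file:

# Minimal DeepSpeed configuration sketch; all numeric values are placeholders.
ds_config = {
    "train_batch_size": 32,              # global batch size
    "fp16": {
        "enabled": True,                 # mixed precision training (FP16)
        "loss_scale": 0,                 # 0 selects dynamic loss scaling
    },
    "zero_optimization": {
        "stage": 2,                      # ZeRO-1/2/3 choose what gets sharded
        "offload_optimizer": {           # ZeRO-Offload: keep optimizer states on CPU
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},          # placeholder learning rate
    },
}

Setting "stage" to 3 (optionally combined with NVMe offloading) is the configuration path toward ZeRO-Infinity; the full set of options is described in DeepSpeed's configuration documentation.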

How to use DeepSpeed?

DeepSpeed integrates seamlessly with existing PyTorch code bases, allowing developers to start using it with only minimal modifications. Here are the basic steps for using DeepSpeed with a model you have already built:

  1. Install DeepSpeed: It can be installed quickly through the pip package manager: pip install deepspeed.
  2. Modify the code to be DeepSpeed compatible: This usually involves importing the DeepSpeed library (import deepspeed), initializing the DeepSpeed engine (deepspeed.initialize()), setting up the data loader, and writing the training iteration (forward and backward passes), as sketched after these steps.
  3. Configure the running environment: Set up the configuration file according to your hardware and model size, including batch size, learning rate, optimizer selection, memory optimization, etc.
  4. Start training: Use the command-line tool provided by DeepSpeed to start the training process. This tool runs your model in a distributed fashion across multiple GPUs.
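
To make steps 2-4 concrete, here is a minimal, hedged sketch of a DeepSpeed training script. The model and dataset are placeholders; the DeepSpeed-specific parts are deepspeed.initialize(), which returns a wrapped engine (and, when given a dataset, a distributed data loader), and the engine's backward() and step() methods, which take the place of the usual loss.backward() and optimizer.step():

import torch
import deepspeed

# Placeholder model and dataset; substitute your own.
model = torch.nn.Linear(784, 10)
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 784), torch.randint(0, 10, (1024,))
)

# deepspeed.initialize wraps the model in a DeepSpeed engine; given a dataset,
# it also builds a distributed data loader. ds_config can be a dict (such as
# the sketch above) or the path to a JSON configuration file.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

loss_fn = torch.nn.CrossEntropyLoss()
for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    if model_engine.fp16_enabled():
        inputs = inputs.half()            # cast inputs when training in FP16
    loss = loss_fn(model_engine(inputs), labels)
    model_engine.backward(loss)           # replaces loss.backward()
    model_engine.step()                   # replaces optimizer.step() / zero_grad()

Saved as a script (say train.py, a name chosen here only for illustration), it is then started with the DeepSpeed launcher, for example: deepspeed --num_gpus=2 train.py.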

In addition, DeepSpeed is now integrated into several open source deep learning frameworks and libraries, such as Transformers, Accelerate, Lightning, MosaicML, Determined, and MMEngine, and can be used together with them. For example, the Transformers library can use the integrated DeepSpeed functionality through its Trainer; this usage requires a configuration file such as deepspeed_config.json. For a detailed tutorial, see the DeepSpeed integration page in the Transformers documentation:

from transformers import Trainer, TrainingArguments

deepspeed_config = "./deepspeed_config.json"

model = ...
train_dataset = ...
data_collator = ...

# Passing the DeepSpeed config file to TrainingArguments enables the integration.
args = TrainingArguments(
    ...,  # output_dir and any other training arguments
    deepspeed=deepspeed_config,
)

# The optimizer and learning-rate schedule are usually defined in
# deepspeed_config.json (or left to the Trainer defaults), so no optimizer
# object is passed to the Trainer here.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("best")
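
The deepspeed_config.json referenced above is an ordinary DeepSpeed configuration file. As a rough sketch (field values are illustrative, and the Transformers integration additionally accepts the special value "auto" for settings it can derive from TrainingArguments), such a file could be generated like this:

import json

# Illustrative ZeRO-2 configuration for use with the Transformers Trainer;
# "auto" lets the Trainer fill in values from its own TrainingArguments.
hf_ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

with open("deepspeed_config.json", "w") as f:
    json.dump(hf_ds_config, f, indent=2)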

