Microsoft stuns with open-source DeepSpeed Chat: one click for end-to-end RLHF training of your own ChatGPT


Edited by | Aeneas, So Sleepy
Source | Xin Zhiyuan

Microsoft's open-source DeepSpeed Chat lets developers realize the dream of a ChatGPT for everyone!

Is the dream of having your own ChatGPT about to come true?

Just now, Microsoft open-sourced DeepSpeed Chat, a system framework that adds a complete RLHF pipeline to model training.

In other words, high-quality ChatGPT-like models of all sizes are now available at your fingertips!

▲Project address: https://github.com/microsoft/DeepSpeed

Unlock hundred-billion-parameter ChatGPT-like models with one click, at more than 15x lower cost

As we all know, because OpenAI is not so open, the open-source community has launched models such as LLaMA, Alpaca, Vicuna, and Databricks Dolly to let more people use ChatGPT-like models.

However, due to the lack of an end-to-end RLHF training system at scale, training ChatGPT-like models remains very difficult. DeepSpeed Chat fills exactly this gap.


What's more, DeepSpeed Chat greatly reduces the cost.

Previously, expensive multi-GPU setups were beyond the reach of many researchers, and even those with access to multi-GPU clusters could not afford to train ChatGPT-like models with hundreds of billions of parameters using existing methods.

Now, with the hybrid engine DeepSpeed-HE, you can train an OPT-66B model in 2.1 days for $1,620.

With a multi-node, multi-GPU system, DeepSpeed-HE can train an OPT-13B model in 1.25 hours for $320, and an OPT-175B model in less than a day for $5,120.

Elvis, formerly of Meta AI, excitedly shared the news, calling it a big deal and wondering how DeepSpeed Chat compares with ColossalChat.


Next, let's see how it works.

After training with DeepSpeed-Chat, the 1.3-billion-parameter version of "ChatGPT" performs very well in Q&A: it not only grasps the context of the question, but also gives decent answers.


In multi-turn dialogue, this 1.3-billion-parameter "ChatGPT" performs far beyond what one would expect from a model of this scale.


A few lines of code to generate your first "ChatGPT"

Of course, before the experience, you need to configure the environment:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

One coffee later: a 1.3-billion-parameter "ChatGPT"

If you only have about 1-2 hours, say a coffee or lunch break, you can also try training a "little toy" with DeepSpeed-Chat.

The team has prepared a training example for the 1.3B model that can be tested on consumer-grade GPUs. Best of all, it will be done by the time you get back from your lunch break.

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

This example runs on a single consumer-grade NVIDIA A6000 GPU with 48 GB of memory.

One GPU node, 13 billion parameters in half a day

If you only have about half a day and a single server node, you can use pre-trained OPT-13B as the actor model and OPT-350M as the reward model to generate a 13-billion-parameter ChatGPT-like model:

python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8

This configuration is a single DGX node with 8 NVIDIA A100-40G GPUs.

A money-saving cloud option: training a 66-billion-parameter model

If you have access to a multi-node cluster or cloud resources and want to train a larger, higher-quality model, just specify the desired model size (e.g., 66B) and GPU count (e.g., 64) in the following line:

python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --num-gpus 64

This configuration is 8 DGX nodes, each with 8 NVIDIA A100-80G GPUs.

Specifically, the time and cost required by the DeepSpeed-RLHF system for different model sizes and hardware configurations, as cited in this article, are:

| Model | Setup | Training time | Cost |
|---|---|---|---|
| OPT-13B | Azure cloud | 9 hours | < $300 |
| OPT-30B | Azure cloud | 18 hours | < $600 |
| OPT-66B | DeepSpeed-HE | 2.1 days | $1,620 |
| OPT-13B | multi-node, multi-GPU | 1.25 hours | $320 |
| OPT-175B | multi-node, multi-GPU | < 1 day | $5,120 |

What is DeepSpeed Chat?

DeepSpeed Chat is a general system framework that enables end-to-end RLHF training of ChatGPT-like models, helping us generate our own high-quality ChatGPT-like models.


DeepSpeed Chat has the following three core capabilities:

  1. Simplified training and enhanced inference for ChatGPT-like models

Developers can run all training steps with a single script, and afterwards use the inference API for interactive conversational testing.

  2. The DeepSpeed-RLHF module

DeepSpeed-RLHF reproduces the training pipeline from the InstructGPT paper, and provides data abstraction and blending functions so that developers can train with multiple datasets from different sources.

  3. The DeepSpeed-RLHF system

The team integrated DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine, or DeepSpeed-HE) for RLHF training. Because DeepSpeed-HE can seamlessly switch between inference and training modes, it can take advantage of the various optimizations from DeepSpeed-Inference.

The DeepSpeed-RLHF system is unmatched in efficiency at scale, making complex RLHF training fast, economical, and easy to scale:

  • Efficient and economical:

DeepSpeed-HE is more than 15 times faster than existing systems, making RLHF training fast and affordable.

For example, on the Azure cloud, DeepSpeed-HE can train an OPT-13B model in just 9 hours and an OPT-30B model in 18 hours. These two training runs cost less than $300 and $600, respectively.
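These totals are simple products of GPU count, wall-clock time, and hourly price. A back-of-the-envelope sketch (the $4/GPU-hour price below is an illustrative assumption, not an actual Azure quote):

```python
def training_cost(num_gpus: int, hours: float, price_per_gpu_hour: float) -> float:
    """Total cloud cost = number of GPUs x wall-clock hours x hourly price per GPU."""
    return num_gpus * hours * price_per_gpu_hour

# E.g., 8 GPUs for 9 hours at a hypothetical $4/GPU-hour:
cost_13b = training_cost(8, 9, 4.0)
print(cost_13b)  # 288.0, consistent with the "< $300" figure above
```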

  • Excellent scalability:

DeepSpeed-HE can support the training of models with hundreds of billions of parameters, and exhibits excellent scalability on multi-node multi-GPU systems.

As a result, a model with 13 billion parameters can be trained in just 1.25 hours, and a 175-billion-parameter model in under a day with DeepSpeed-HE.

  • Democratizing RLHF training:

With just a single GPU, DeepSpeed-HE supports training models with more than 13 billion parameters. This lets data scientists and researchers without access to multi-GPU systems create not only lightweight RLHF models, but also large and powerful models for different usage scenarios.


Complete RLHF training process

To provide a seamless training experience, the team follows InstructGPT and includes a complete end-to-end training pipeline in DeepSpeed-Chat.

▲DeepSpeed-Chat's RLHF training pipeline, including some optional features

The process consists of three main steps:

Step 1:

Supervised fine-tuning (SFT), which uses curated human responses to fine-tune a pre-trained language model on a variety of queries.

Step 2:

Reward model fine-tuning, which trains a separate reward model (RW, typically smaller than the SFT model) on a dataset containing multiple human-scored answers to the same query.

Step 3:

RLHF training, in which the SFT model is further fine-tuned with reward feedback from the RW model, using the Proximal Policy Optimization (PPO) algorithm.
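Step 3's policy update relies on PPO's clipped surrogate objective. A minimal sketch of that formula for a single action, with toy values (this illustrates the standard PPO objective, not DeepSpeed-Chat's actual implementation):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one action.

    ratio = pi_new(a|s) / pi_old(a|s); clipping keeps the update within
    [1 - eps, 1 + eps] of the old policy, and taking the min makes the
    objective a pessimistic bound.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Unchanged policy (ratio = 1): the objective is just the advantage.
print(ppo_clipped_objective(0.0, 0.0, 1.5))  # 1.5
# A large ratio with positive advantage is clipped at 1 + eps.
print(ppo_clipped_objective(1.0, 0.0, 1.0))  # 1.2
```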

In step 3, the researchers also provided two additional functions to help improve the quality of the model:

  • Exponential Moving Average (EMA) collection, with the option to select an EMA-based checkpoint for final evaluation.

  • Hybrid training, which mixes the pre-training objective (i.e., next word prediction) with the PPO objective to prevent performance regression on public benchmarks such as SQuAD2.0.
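The EMA collection above is a simple recurrence: shadow weights are updated as a decayed blend of their previous value and the current parameters. A minimal sketch, with plain Python dicts standing in for model parameters:

```python
def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> dict:
    """One exponential-moving-average step over named parameters:
    shadow <- decay * shadow + (1 - decay) * params."""
    return {name: decay * shadow[name] + (1.0 - decay) * params[name]
            for name in shadow}

shadow = {"w": 1.0}
params = {"w": 0.0}
shadow = ema_update(shadow, params, decay=0.9)
print(shadow["w"])  # 0.9
```

The EMA checkpoint smooths out step-to-step noise in the policy weights, which is why it tends to give better response quality than the raw final checkpoint.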


Other open-source frameworks often omit the EMA and hybrid-training features, since training still completes without them.

However, according to InstructGPT, EMA checkpoints tend to give better response quality than the conventional final checkpoint, and hybrid training helps the model retain its pre-training benchmark performance.

Therefore, the researchers provide these functions for users so that they can fully obtain the training experience described in InstructGPT.

In addition to being highly consistent with the InstructGPT paper, the researchers also provided functions that allow developers to use a variety of data resources to train their own RLHF models:

  • Data abstraction and blending capabilities:

DeepSpeed-Chat provides (1) an abstract dataset layer to unify the format of different datasets; and (2) data splitting/blending functions, so that multiple datasets can be properly blended and then split across the three training stages.
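Conceptually, the blending and splitting could look like the following sketch (a hypothetical helper for illustration, not DeepSpeed-Chat's actual data API; proportions and stage splits are made-up parameters):

```python
import random

def mix_and_split(datasets, proportions, stage_splits=(0.2, 0.4, 0.4), seed=0):
    """Blend several datasets by the given proportions, then partition the
    blend across the three RLHF stages (SFT, reward model, PPO)."""
    rng = random.Random(seed)
    blended = []
    for data, p in zip(datasets, proportions):
        k = int(len(data) * p)
        blended.extend(rng.sample(data, k))   # take a proportion of each source
    rng.shuffle(blended)
    # Contiguous split of the shuffled blend into the three stages.
    n = len(blended)
    a = int(n * stage_splits[0])
    b = a + int(n * stage_splits[1])
    return blended[:a], blended[a:b], blended[b:]

sft, rm, rlhf = mix_and_split([list(range(100)), list(range(100, 150))],
                              proportions=[1.0, 1.0])
print(len(sft), len(rm), len(rlhf))  # 30 60 60
```

Splitting the blend across stages ensures the reward model and the PPO stage never train on the exact prompts the SFT stage already saw.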


DeepSpeed Hybrid Engine

Steps 1 and 2 of the instruction-guided RLHF pipeline, like conventional fine-tuning of large models, achieve scale and speed through ZeRO-based optimizations and the flexible combination of parallelism strategies in DeepSpeed training.

And step 3 of the pipeline is the most complicated part in terms of performance impact.

Each iteration must efficiently handle two phases: (a) an inference phase for token/experience generation, producing the inputs for training; and (b) a training phase that updates the weights of the actor and reward models, along with the interaction and scheduling between them.

This introduces two major difficulties: (1) memory cost, since multiple SFT and RW models must be kept in memory throughout Stage 3; and (2) the answer-generation phase is slower, and will significantly slow down the whole of Stage 3 if not properly accelerated.

In addition, the two important features the researchers added in Stage 3, exponential moving average (EMA) collection and hybrid training, incur extra memory and training costs.

To meet these challenges, the researchers combined the full system capabilities of DeepSpeed training and inference into a unified infrastructure: the Hybrid Engine.

It leverages the original DeepSpeed engine for fast training mode, while effortlessly applying the DeepSpeed inference engine for generation/evaluation mode, providing a faster training system for Stage-3 RLHF training.

As shown in the figure below, the transition between the DeepSpeed training and inference engines is seamless: by enabling the typical eval and train modes on the actor model, DeepSpeed selects different optimizations when running the inference and training pipelines, running the model faster and improving overall system throughput.

▲DeepSpeed hybrid engine design for accelerating the most time-consuming part of the RLHF pipeline
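The mode-switching idea can be illustrated with a toy class (purely conceptual; the class, method names, and optimization lists are made up for illustration and the real DeepSpeed Hybrid Engine API differs):

```python
class ToyHybridEngine:
    """Toy illustration of the hybrid-engine idea: one wrapper that picks
    inference-style optimizations in eval mode and training-style
    optimizations in train mode."""

    def __init__(self):
        self.mode = "train"

    def eval(self):
        # Generation/experience phase: inference kernels, KV cache, etc.
        self.mode = "eval"

    def train(self):
        # Update phase: ZeRO partitioning, optimizer states, etc.
        self.mode = "train"

    def active_optimizations(self):
        if self.mode == "eval":
            return ["inference kernels", "kv-cache", "tensor parallelism"]
        return ["ZeRO sharding", "gradient checkpointing"]

engine = ToyHybridEngine()
engine.eval()
print(engine.active_optimizations())   # inference-side optimizations
engine.train()
print(engine.active_optimizations())   # training-side optimizations
```

The point is that one object owns both configurations, so each RLHF iteration can flip between generation and training without rebuilding the model.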

During inference in the experience-generation phase of RLHF training, the DeepSpeed hybrid engine uses a lightweight memory-management system to handle the KV cache and intermediate results, together with highly optimized inference CUDA kernels and tensor-parallel computation, greatly improving throughput (tokens per second) compared with existing solutions.

During training, the Hybrid Engine enables memory-optimization techniques such as DeepSpeed's ZeRO family and Low-Rank Adaptation (LoRA).

The researchers designed and implemented these system optimizations to be compatible with one another, so they can be combined under the unified hybrid engine to deliver the highest training efficiency.

The hybrid engine can seamlessly change model partitions during training and inference to support tensor-based parallel inference and ZeRO-based training sharding mechanisms.

It can also reconfigure the memory system to maximize memory availability in each mode.

This avoids memory allocation bottlenecks, supports large batch sizes, and greatly improves performance.

In conclusion, the Hybrid Engine pushes the boundaries of modern RLHF training, delivering unparalleled scale and system efficiency for RLHF workloads.


Performance evaluation

Compared with existing systems such as Colossal-AI or HuggingFace DDP, DeepSpeed-Chat achieves more than an order of magnitude higher throughput, so it can train larger actor models within the same latency budget, or train similarly sized models at lower cost.

For example, DeepSpeed improves RLHF training throughput by more than 10x on a single GPU. And while CAI-Coati and HF-DDP can each run at most a 1.3B model, DeepSpeed can run a 6.5B model on the same hardware, a model 5x larger.


On multiple GPUs within a single node, DeepSpeed-Chat's system throughput is 6-19x that of CAI-Coati and 1.4-10.5x that of HF-DDP.


According to the team, one of the key reasons why DeepSpeed-Chat can achieve such excellent results is the acceleration provided by the hybrid engine during the generation phase.



[1]https://github.com/microsoft/DeepSpeed

Origin blog.csdn.net/xixiaoyaoww/article/details/130177877