A Blockbuster Release That Puts Superpowers in Everyone's Hands: DeepSpeed-Chat Is Open Source!


This blog post was written and translated by the Microsoft DeepSpeed team ( https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat ), which has authorized Kaiyuanshe to publish it for the Chinese community. Please credit the source when reprinting.


#01

Overview

In recent days, ChatGPT and similar models have sparked a wave in the field of artificial intelligence (AI). This trend has revolutionized the digital world. ChatGPT-like models are surprisingly versatile, able to perform tasks such as summarization, programming, and translation with results comparable to, or better than, those of human experts. To make training and deploying models like ChatGPT easier, the AI open source community has made various attempts (e.g., ChatLLaMa, Alpaca, Vicuna, Databricks-Dolly, etc.).

However, despite the great efforts of the open source community, there is still no scalable system that supports end-to-end reinforcement learning with human feedback (RLHF), which makes it difficult to train a powerful ChatGPT-like model. For example, training a ChatGPT-like model with 6.7 billion parameters using existing open source systems usually requires an expensive multi-node, multi-GPU cluster, and such resources are inaccessible to most data scientists and researchers. Moreover, even with such computing resources, existing open source systems typically achieve less than 5% of the maximum training efficiency these machines can deliver. In short, even with expensive multi-GPU clusters, existing solutions cannot easily, quickly, and affordably train state-of-the-art ChatGPT-like models with hundreds of billions of parameters.

The training of ChatGPT-style models is based on the RLHF method from the InstructGPT paper, which is quite different from the common pre-training and fine-tuning of large language models. As a result, existing deep learning systems face various limitations when training ChatGPT-like models. Therefore, to make ChatGPT-style models accessible to ordinary data scientists and researchers, and to make RLHF training truly widespread in the AI community, we are releasing DeepSpeed-Chat. DeepSpeed-Chat has the following three core capabilities:

(i) A simplified training and enhanced inference experience for ChatGPT-style models: a single script takes a model pre-trained with Hugging Face, runs it through all three steps of InstructGPT-style training using the DeepSpeed-RLHF system, and produces your own ChatGPT-like model. Additionally, we provide an easy-to-use inference API for users to test conversational interaction after training.

(ii) The DeepSpeed-RLHF module: DeepSpeed-RLHF replicates the training pipeline from the InstructGPT paper, faithfully implementing its three steps in one-to-one correspondence: a) supervised fine-tuning (SFT), b) reward model fine-tuning, and c) reinforcement learning with human feedback (RLHF). In addition, we provide data abstraction and blending functions so users can train on multiple datasets from different sources.

(iii) The DeepSpeed-RLHF system: We integrate DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine, or DeepSpeed-HE) for RLHF training. DeepSpeed-HE seamlessly switches between inference and training modes within RLHF, allowing it to leverage various optimizations from DeepSpeed-Inference, such as tensor parallelism and high-performance CUDA kernels for language generation, while the training mode benefits from ZeRO- and LoRA-based memory optimization strategies. DeepSpeed-HE also performs intelligent memory management and data caching automatically across the different stages of RLHF.

The DeepSpeed-RLHF system has unparalleled efficiency in large-scale training, making complex RLHF training fast, affordable, and easy to scale up:

Efficient and economical: DeepSpeed-HE is more than 15x faster than existing systems, making RLHF training both fast and affordable. For example, DeepSpeed-HE can train an OPT-13B model in just 9 hours and an OPT-30B model in just 18 hours on the Azure cloud. These two training runs cost less than $300 and $600, respectively.


Table 1. Single Node 8x A100: Training Duration and Estimated Azure Cost.

Excellent scalability: DeepSpeed-HE supports training models with hundreds of billions of parameters and exhibits excellent scalability on multi-node, multi-GPU systems. As a result, even a 13-billion-parameter model can be trained in only 1.25 hours, while a huge 175-billion-parameter model can be trained with DeepSpeed-HE in under a day.


Table 2. Multi-Node 64x A100-80GB: Training Duration and Estimated Azure Cost.

A very important detail: the data in the two tables above (i.e., Tables 1 and 2) are all for step 3 of RLHF training, based on real datasets and measured DeepSpeed-RLHF training throughput. The training runs for one epoch over a total of 135 million (135M) tokens: 67.5 million query tokens (131.9k queries with sequence length 256) and 67.5 million generated/answer tokens (131.9k answers with sequence length 256), with a maximum global batch size of approximately 0.5 million tokens (1024 query-answer pairs) per step. We urge readers to understand these settings before making any cost or end-to-end time comparisons against DeepSpeed-RLHF. See our benchmark settings page for more details.

Democratizing RLHF Training: With just a single GPU, DeepSpeed-HE can support training models with over 13 billion parameters. This enables data scientists and researchers who do not have access to multi-GPU systems to easily create not only lightweight RLHF models, but also large and powerful models for different usage scenarios.


Table 3. The largest model sizes supported by DeepSpeed-HE on a single GPU of various types.

Next, we take a deeper look at the three capabilities of DeepSpeed-Chat introduced above.

#02

A Simple, Efficient, and Affordable ChatGPT Training and Inference Experience

Next, we show how to train OPT-1.3B, OPT-13B, and OPT-66B models with the DeepSpeed-RLHF system using a single command. Even with limited time, you can train an OPT-1.3B model in about two hours on a single consumer-grade GPU. Finally, we demonstrate how to use our DeepSpeed-Chat RLHF API to develop your own custom models:

Easily train your first ChatGPT-like model using DeepSpeed-Chat's RLHF example

a) Only one script is needed to complete all three stages of RLHF training and generate your first ChatGPT model!

The following example shows how to use one script to train a final 13-billion-parameter ChatGPT-like model, with the pre-trained OPT-13B as the actor model and OPT-350M as the reward model:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8

In about half a day, your 13-billion-parameter model will be fully trained. The table below shows how long each of the three steps takes:


Table 4. Time required to train OPT-13b using DeepSpeed-Chat for different RLHF steps on a single DGX node (8x A100-40G).

b) Test the trained model using DeepSpeed-Chat's inference API

Load the trained model into the provided simple API, and you can not only ask the model questions but also hold multi-turn conversations with it.

Human:        Do you know Microsoft? 

Assistant:    Yes, I’m a big fan of Microsoft.  What about them? 

Human:      Can you explian it to a 6-year old child? I wonder how I should describe it 

Assistant:   Sure, I can try.  Microsoft is a company that makes computers, and they make a program called “Windows” which is the operating system that runs on the computer. It’s like the “brain” of the computer.  It’s where all the programs and files are stored. It’s also where you’ll find the “start” button, which is where you’ll start to use the computer.

Want to try different model sizes and configurations? Easy to do!

We have found that users often like to experiment with different model sizes and configurations to meet their varying training-time, resource, and quality needs. With DeepSpeed-Chat, this is easy. For example, if you want to train a larger, higher-quality model on your GPU cluster for research or business, you can use the same script and simply pass in your desired model size (e.g., 66 billion parameters) and number of GPUs (e.g., 64):

python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --num-gpus 64

Within the next 9 hours, you will have a 66 billion parameter ChatGPT model ready to use in your favorite front-end GUI:


Table 5. Time required to train OPT-66b using DeepSpeed-Chat for different RLHF steps on 8 DGX nodes (64x A100-80G).

If you only have about 1-2 hours for a coffee or lunch break, you can also try training a small model with DeepSpeed-Chat. For example, we prepared a training example for a 1.3-billion-parameter model on a single dataset so you can test our framework on your consumer-grade GPU. When you come back from your lunch break, your model will be ready for you to use!

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1


Table 6. Time required to train OPT-1.3b using DeepSpeed-Chat for different RLHF steps on a single consumer-grade A6000-48G.

Use DeepSpeed-Chat's RLHF API to customize your own RLHF training process

DeepSpeed-Chat allows users to build their own RLHF training pipeline using our flexible APIs (shown below), with which users can construct custom RLHF training strategies. We hope these APIs provide a common interface and backend for creating and exploring RLHF algorithms in research.

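The figure in the original post shows this API in code form; a minimal sketch is reproduced below. The class and method names (DeepSpeedRLHFEngine, DeepSpeedPPOTrainer, generate_experience, train_rlhf) follow the DeepSpeedExamples repository, but the surrounding setup (args, tokenizer, dataloader) is elided and may differ across versions:

# Minimal sketch of the DeepSpeed-Chat RLHF training API; argument
# plumbing is abbreviated and should be checked against DeepSpeedExamples.
engine = DeepSpeedRLHFEngine(
    actor_model_name_or_path=args.actor_model_name_or_path,
    critic_model_name_or_path=args.critic_model_name_or_path,
    tokenizer=tokenizer,
    num_total_iters=num_total_iters,
    args=args)

trainer = DeepSpeedPPOTrainer(engine=engine, args=args)

for prompt_batch in prompt_train_dataloader:
    # Inference: generate answers (experience) for a batch of prompts.
    out = trainer.generate_experience(prompt_batch)
    # Training: update the actor and critic with PPO on that experience.
    actor_loss, critic_loss = trainer.train_rlhf(out)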

#03

Overview of the complete RLHF training process

To achieve a seamless training experience, we follow the approach of the InstructGPT paper and integrate an end-to-end training pipeline in DeepSpeed-Chat, as shown in Figure 1.


Figure 1: Diagram of the RLHF training process for DeepSpeed-Chat, including some optional features.

Our process consists of three main steps:

  • Step 1: Supervised fine-tuning (SFT)  - use curated human responses to fine-tune the pre-trained language model for various queries;

  • Step 2: Reward model fine-tuning  - train a separate reward model (RW, usually smaller than the SFT model) on a dataset containing multiple human-ranked answers to the same query (a sketch of the typical pairwise loss follows this list);

  • Step 3: RLHF training  - further fine-tune the SFT model with the Proximal Policy Optimization (PPO) algorithm, using reward feedback from the RW model.
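For step 2, the reward model is typically trained with an InstructGPT-style pairwise ranking loss, which pushes the score of the human-preferred answer above the score of a rejected answer to the same query. A minimal sketch (the helper name is ours; the actual DeepSpeed-Chat implementation may differ in detail):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    # Scalar scores for the preferred and rejected answers of each query;
    # minimizing this loss widens the margin (chosen - rejected).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()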

In step 3, we provide two additional features to help improve model quality:

  • Exponential Moving Average (EMA)  - EMA-based checkpoints can be selected for final evaluation

  • Hybrid training  - mix the pre-training objective (i.e., next-word prediction) with the PPO objective to prevent a loss of performance on public benchmarks such as SQuAD2.0

These two training features, EMA and hybrid training, are often neglected by other open source frameworks, since training can proceed without them. However, according to InstructGPT, EMA checkpoints generally provide better response quality than the conventionally trained final model, and hybrid training helps the model retain its pre-training benchmark-solving ability. We therefore provide both features so users can fully reproduce the training experience described in InstructGPT and pursue higher model quality.
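For reference, EMA keeps a shadow copy of the actor weights and blends the live weights into it after each optimizer step; the shadow checkpoint can then be selected for final evaluation. A minimal sketch (the decay value is illustrative, not DeepSpeed-Chat's exact setting):

import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module,
               model: torch.nn.Module,
               decay: float = 0.992) -> None:
    # Blend live weights into the shadow copy after each optimizer step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage: create the shadow copy once before training starts.
# ema_model = copy.deepcopy(model)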

In addition to being highly consistent with the InstructGPT paper, we also provide a convenient feature to support researchers and practitioners to train their own RLHF models using multiple data sources:

  • Data abstraction and blending capabilities: DeepSpeed-Chat can train on datasets from multiple different sources for better model quality. It is equipped with (1) an abstract dataset layer that unifies the format of different datasets, and (2) data splitting/blending functions that mix multiple datasets appropriately and then split them across the three training stages (sketched below).
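To illustrate the idea only (the class and function names here are hypothetical, not the exact DeepSpeed-Chat abstractions): each source dataset is adapted into a unified record format, then pooled, shuffled, and carved into disjoint slices for the three stages so no stage trains on another stage's data:

from dataclasses import dataclass
from typing import List, Optional
import random

@dataclass
class UnifiedSample:
    prompt: str
    chosen: str               # preferred answer (used in steps 1 and 2)
    rejected: Optional[str]   # rejected answer (used in step 2 only)

def blend_and_split(datasets: List[List[UnifiedSample]],
                    ratios=(0.2, 0.4, 0.4), seed: int = 1234):
    # Pool samples from every source, shuffle deterministically, then carve
    # out disjoint slices for the SFT, reward-model, and RLHF stages.
    pool = [s for ds in datasets for s in ds]
    random.Random(seed).shuffle(pool)
    a = int(len(pool) * ratios[0])
    b = a + int(len(pool) * ratios[1])
    return pool[:a], pool[a:b], pool[b:]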

In the example shown earlier, you can see how a model trained with the complete DeepSpeed-Chat pipeline performs in multi-turn conversation.

#04

DeepSpeed Hybrid Engine - the Unified, High-Efficiency Hybrid Engine Powering and Optimizing RLHF Training

The first two steps of the DeepSpeed-Chat pipeline resemble the regular fine-tuning of large models, and thanks to the flexible combination of ZeRO-based memory-management optimizations and parallelism strategies in DeepSpeed training, they achieve both scale and speed. The third step, however, is the most challenging part in terms of performance. Each iteration must efficiently handle two phases: a) an inference phase that generates answers to provide inputs for training, and b) a training phase that updates the weights of the actor and reward models, along with the interaction and scheduling between the two. This introduces two major difficulties: (1) the memory cost, since multiple SFT and RW models must be kept throughout the third stage; and (2) the slower answer-generation phase, which, if not properly accelerated, will significantly slow down the entire third stage. In addition, the two important optional features we added in step 3, exponential moving average (EMA) collection and hybrid training, incur extra memory and training costs.

To address these challenges, we combined the full system capabilities of DeepSpeed training and inference into a unified infrastructure we call the Hybrid Engine. It uses the original DeepSpeed engine for the high-speed training mode while effortlessly applying the DeepSpeed inference engine for the generation/evaluation mode, providing a significantly faster training system for the third stage of RLHF training. As shown in Figure 2, the transition between the DeepSpeed training and inference engines is seamless: by enabling the typical eval and train modes on the actor model, DeepSpeed selects its different optimizations when running the inference and training pipelines, running the model faster and improving overall system throughput.


Figure 2. Design diagram: the DeepSpeed Hybrid Engine accelerates the most time-consuming part of the RLHF process.
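Conceptually, each step-3 iteration toggles the actor model between the two modes, and the hybrid engine picks the right optimizations on each side. A simplified sketch (generate and ppo_step are hypothetical stand-ins for DeepSpeed-Chat's experience generation and PPO update, not the engine's actual internals):

import torch

def step3_iteration(actor, critic, actor_opt, critic_opt,
                    prompt_batch, generate, ppo_step):
    actor.eval()                      # inference mode: KV caching, fused
    with torch.no_grad():             # kernels, tensor parallelism apply
        experience = generate(actor, prompt_batch)

    actor.train()                     # training mode: ZeRO sharding and LoRA apply
    actor_loss, critic_loss = ppo_step(actor, critic, experience)
    actor_loss.backward()
    actor_opt.step(); actor_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step(); critic_opt.zero_grad()
    return actor_loss.item(), critic_loss.item()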

During inference execution in the experience-generation phase of RLHF training, the DeepSpeed hybrid engine uses a lightweight memory-management system to handle the KV cache and intermediate results, together with highly optimized inference CUDA kernels and tensor-parallel computation. Compared with existing solutions, DeepSpeed-HE significantly improves generation throughput (tokens per second).

During training execution, the hybrid engine applies a variety of memory-optimization techniques, such as DeepSpeed's ZeRO family of technologies and the popular LoRA method. These techniques are mutually compatible within the hybrid engine and can be combined to deliver the highest training efficiency.

DeepSpeed-HE can seamlessly switch model partitioning between training and inference, supporting tensor-parallelism-based inference and ZeRO-based sharding for training. It also reconfigures the memory system in each mode to maximize available memory, and further improves performance by avoiding memory-allocation bottlenecks and supporting large batch sizes. By integrating a spectrum of system technologies from DeepSpeed training and inference, the hybrid engine pushes past the limits of existing RLHF training and delivers unparalleled scale and system efficiency for RLHF workloads.
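As a concrete illustration, a step-3 training configuration might combine ZeRO-3 with the hybrid engine roughly as follows. This is a minimal sketch with illustrative values; check the DeepSpeed-Chat scripts for the exact schema and keys in your version:

# Sketch of a DeepSpeed config enabling ZeRO-3 plus the hybrid engine.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # shard weights, grads, optimizer state
    "hybrid_engine": {
        "enabled": True,                 # switch train/inference engines in step 3
        "inference_tp_size": 1,          # tensor-parallel degree for generation
        "max_out_tokens": 512,           # generation budget per sequence
    },
}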

#05

DeepSpeed-RLHF: Unparalleled Scale and Efficiency with the Hybrid Engine

Recap

As mentioned earlier, DeepSpeed-HE is a powerful system combining inference and training, designed to enable DeepSpeed-RLHF to achieve superior scale and efficiency on a wide range of hardware, making RLHF training fast, affordable, and easy for the AI community to use.

In terms of efficiency and affordability, as shown in Table 1, DeepSpeed-HE can train an OPT-13B model in only 9 hours and an OPT-30B model in 18 hours on the Azure cloud, costing less than $300 and $600, respectively. In terms of speed and scalability, as shown in Table 2, even a 13B model can be trained in 1.25 hours, and a huge 175B model can be trained in under a day on a 64-GPU cluster. In terms of accessibility and the democratization of RLHF, DeepSpeed-HE can train models with more than 13 billion parameters on a single GPU, as shown in Table 3.

Throughput and model size scalability comparison with existing RLHF systems

Compared with other RLHF systems, such as Colossal-AI and HuggingFace DDP powered by native PyTorch, DeepSpeed-RLHF excels in both system performance and model scalability:

  • In terms of throughput, DeepSpeed achieves more than a 10x improvement in RLHF training on a single GPU (Figure 3). In multi-GPU setups, it is 6-19x faster than Colossal-AI and 1.4-10.5x faster than HuggingFace DDP (Figure 4).

  • In terms of model scalability, Colossal-AI can run models of up to 1.3B parameters on a single GPU and up to 6.7B on a single A100-40G node, whereas DeepSpeed-HE can run 6.5B and 50B models on the same hardware, an improvement of up to 7.5x in supported model size.

Thus, with more than an order of magnitude higher throughput, DeepSpeed-HE can train larger actor models under the same time budget than existing RLHF systems such as Colossal-AI or HuggingFace DDP, or train models of similar size at less than one-tenth of the cost.


Figure 3. Step-3 RLHF training throughput compared with the other two system frameworks on a single NVIDIA A100-40G GPU. Missing bars indicate out-of-memory (OOM) conditions.


Figure 4. End-to-end training throughput comparison across model sizes for step 3 of the training pipeline (the longest part), using 8 NVIDIA A100-40G GPUs on a single DGX node. Missing bars indicate out-of-memory (OOM) conditions.

This efficiency improvement comes from DeepSpeed-HE's ability to accelerate the RLHF generation phase using DeepSpeed's inference optimizations. Figure 5 shows the time breakdown of an RLHF training iteration for a 1.3B-parameter model: most of the time is spent in the generation phase. By leveraging DeepSpeed's high-performance inference kernels, DeepSpeed-HE achieves up to a 9x throughput improvement over HuggingFace and 15x over Colossal-AI in this phase, leading to unparalleled end-to-end efficiency.


Figure 5. Superior acceleration of DeepSpeed-Chat's hybrid engine in the generation phase: time/sequence breakdown for an OPT-1.3B actor model plus OPT-350M reward model, trained on 8 A100-40G GPUs in a single DGX node.

Effective Throughput and Scalability Analysis

(I) Effective throughput analysis.  In stage 3 of RLHF training, the effective throughput of DeepSpeed-HE depends on the throughput it achieves in the generation and RL training phases. In our RLHF pipeline (see the benchmark settings for details), the generation phase accounts for about 20% of the total computation, while the RL training phase accounts for the remaining 80%. Despite its smaller share, the former can take most of the end-to-end time, because it must run the actor model once for every generated token, which makes it memory-bandwidth bound and hard to drive to high throughput. In contrast, the RL training phase is compute-intensive: it runs the reference actor model with just a few forward and backward passes, with each sample containing the full 512 tokens from both prompt and generation, so it can achieve good throughput.
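A back-of-the-envelope calculation (with illustrative efficiencies, not measurements) shows why generation can dominate wall-clock time despite its ~20% share of compute:

# Illustrative numbers only: 20% of FLOPs in generation at ~10% of peak
# (memory-bandwidth bound), 80% in RL training at ~60% of peak (compute bound).
gen_flops, train_flops = 20.0, 80.0
gen_eff, train_eff = 0.10, 0.60

gen_time = gen_flops / gen_eff          # 200 time units
train_time = train_flops / train_eff    # ~133 time units
share = gen_time / (gen_time + train_time)
print(f"generation share of wall-clock time: {share:.0%}")   # ~60%

Under these assumed efficiencies, the 20%-of-compute generation phase consumes roughly 60% of the end-to-end time, which is why DeepSpeed-HE concentrates its inference optimizations there.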


Figure 6. DeepSpeed-HE RLHF generation, training, and effective throughput for different model sizes at maximum efficiency.

To maximize the effective throughput, DeepSpeed-HE optimizes both phases. First, it uses the largest batch size possible to achieve higher efficiency in both. Second, in the generation phase, it leverages high-performance CUDA kernels to maximize GPU memory bandwidth utilization when the model fits on a single GPU, and uses tensor parallelism (TP) when it does not. Using TP instead of ZeRO in the generation phase reduces inter-GPU communication and keeps GPU memory bandwidth utilization high.

Figure 6 shows the best effective throughput (in TFlops/GPU) that DeepSpeed-HE can achieve across model sizes from 1.3B to 175B, along with the throughput achieved in the generation and training phases individually. DeepSpeed-HE is most efficient for models in the 6.7B-66B range. Beyond this range, up to 175B, throughput drops because memory limits the batch sizes that can be supported, but it is still 1.2x higher than that of the small 1.3B model. The per-GPU throughput of these huge models could improve further when we scale them to more GPUs with more memory.

Furthermore, we would like to point out that, as shown in Figure 4, the effective performance of our system is up to 19x higher than that of existing systems, which suggests they are operating at less than 5% of peak speed. This illustrates both the challenge of optimizing RLHF workloads and the effectiveness of our system in meeting it.


Figure 7. Scalability of training 13B (left) and 66B (right) actor models with a 350M reward model on varying numbers of DGX nodes (8x A100-40/80G GPUs per node).

(II) Scalability analysis.  The best effective throughput for different model sizes depends on different numbers of GPUs. This is partly because some larger model sizes require more memory to run. Based on this, we next discuss the scalability properties of DeepSpeed-HE.

Figure 7 shows that DeepSpeed-RLHF achieves good overall scaling on clusters of up to 64 GPUs. On closer inspection, however, DeepSpeed-RLHF training achieves super-linear scaling at small scale, followed by near-linear or sub-linear scaling at larger scale. This is due to the interplay between memory availability and the maximum global batch size.

The core of DeepSpeed-HE's training phase is based on ZeRO. This means that as the number of GPUs increases, the memory consumption per GPU decreases, enabling DeepSpeed-HE to support a larger batch size per GPU and produce super-linear scaling. At large scale, however, the maximum global batch size caps the batch size per GPU, leading to near-linear or sub-linear scaling even though available memory keeps growing. As a result, for a given maximum global batch size (e.g., we set 1024 sequences, each of length 512), DeepSpeed-HE achieves its best throughput and cost efficiency at the boundary between super-linear and sub-linear scaling. The exact balance point depends primarily on the largest batch size that can run on each GPU, which in turn is a function of the available memory and the global batch size.
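A toy model of this interplay (illustrative numbers only): ZeRO frees per-GPU memory as GPUs are added, allowing a larger per-GPU batch, until the fixed global batch cap forces the per-GPU batch back down:

GLOBAL_BATCH_CAP = 1024   # sequences, as in our runs

def per_gpu_batch(gpus, mem_batch_at_8_gpus=4):
    memory_allowed = mem_batch_at_8_gpus * gpus // 8   # grows with ZeRO sharding
    return min(memory_allowed, GLOBAL_BATCH_CAP // gpus)

for gpus in (8, 16, 32, 64, 128, 256):
    # A larger per-GPU batch means higher per-GPU efficiency; watch it rise
    # (super-linear regime) and then fall once the global cap binds.
    print(f"{gpus:>3} GPUs -> per-GPU batch {per_gpu_batch(gpus)}")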

Release: Try DeepSpeed-Chat Now!

We are very happy to announce that DeepSpeed-Chat is now open source and available to the AI community.

  • If you find our work useful or like our open source projects, please click ⭐ on DeepSpeed ( https://github.com/microsoft/DeepSpeed ) and DeepSpeedExamples ( https://github.com/microsoft/DeepSpeedExamples ).

  • Visit the DeepSpeed-Chat GitHub page to get started ( https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat ).

  • We will continue to improve DeepSpeed-Chat based on your feedback and support. Our roadmap ( https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/README.md#-deepspeed-chats-roadmap- ) shows the currently supported features and those planned for the future.

DeepSpeed-Chat is part of the larger DeepSpeed ecosystem, which includes numerous deep learning systems and modeling technologies. For more information:

  • Please visit our website ( https://www.deepspeed.ai/ ) for detailed blog posts, tutorials, and useful documentation.

  • You can also follow our English Twitter (DeepSpeed) and Japanese Twitter (マイクロソフトDeepSpeed) to learn about the latest developments of DeepSpeed.

DeepSpeed welcomes your contributions! We encourage you to report issues, contribute PRs, and join discussions on the DeepSpeed GitHub page ( https://github.com/microsoft/DeepSpeed/ ). Please refer to our contributing guide ( https://github.com/microsoft/DeepSpeed/blob/master/CONTRIBUTING.md ) for more details. We are open to collaborations with universities, research labs, companies, and others on deep learning research and on applying DeepSpeed to AI models and applications that empower the real world. For such requests (and other requests not suitable for GitHub), please email [email protected] directly.


Author丨Microsoft DeepSpeed Team

Editor丨Corrie

Design丨Zhao Yuyue


