Everyone can GPT! Microsoft open-sources DeepSpeed Chat to help users train models

Introduction

On April 12, Microsoft announced that it has open-sourced DeepSpeed Chat to help users accelerate the training of a model similar to ChatGPT.

DeepSpeed Chat simplifies the training of ChatGPT-style models and improves the inference experience. Its DeepSpeed-RLHF system can switch between inference and training modes, making complex RLHF training faster and easier to scale. DeepSpeed-HE is more than 15 times faster than existing systems, at a lower cost: on Microsoft's Azure cloud, the system can train an OPT-13B model in just 9 hours and an OPT-30B model in 18 hours.

DeepSpeed-Chat has the following three core functions:

  • Simplified training and enhanced inference experience for ChatGPT-like models: a single script runs multiple training steps, taking a Hugging Face pretrained model through all three steps of InstructGPT-style training with the DeepSpeed-RLHF system to produce your own ChatGPT-like model. In addition, an easy-to-use inference API is provided for testing conversational interactions after training (a minimal sketch of such a test follows this list).
  • DeepSpeed-RLHF module: DeepSpeed-RLHF replicates the training pipeline of the InstructGPT paper, with a one-to-one correspondence for its three steps: a) supervised fine-tuning (SFT), b) reward model fine-tuning, and c) reinforcement learning from human feedback (RLHF). In addition, data abstraction and blending functions are provided so that users can train with data from multiple different sources.
  • DeepSpeed-RLHF system: it integrates DeepSpeed's training and inference engines into a unified Hybrid Engine (DeepSpeed Hybrid Engine, or DeepSpeed-HE) for RLHF training. DeepSpeed-HE seamlessly switches between inference and training modes within RLHF, which lets it leverage optimizations from DeepSpeed-Inference, such as tensor parallelism and high-performance CUDA kernels for language generation, while the training side benefits from ZeRO- and LoRA-based memory optimization strategies. DeepSpeed-HE also performs intelligent memory management and data caching automatically across the different stages of RLHF.
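
As a quick, hedged illustration of that inference experience: the project provides its own inference API, but if the final actor checkpoint is saved in standard Hugging Face format, a plain transformers generation call is enough for a smoke test. The checkpoint path and prompt format below are assumptions, not DeepSpeed-Chat's exact conventions.

# Minimal smoke test of a trained actor model (sketch; the path and prompt format are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output/actor"  # hypothetical step-3 output directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "Human: What is DeepSpeed?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))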

DeepSpeed-Chat has the following three features:

  • Efficiency and affordability: In terms of efficiency, DeepSpeed-HE is more than 15 times faster than existing systems, making RLHF training fast and affordable. For example, DeepSpeed-HE can train OPT-13B in just 9 hours and OPT-30B in 18 hours on Azure, for under $300 and $600 respectively.
    [Table: single-node training time and approximate Azure cost for various model sizes]

  • Excellent scalability: DeepSpeed-HE supports models with hundreds of billions of parameters and can achieve excellent scalability on multi-node multi-GPU systems. Thus, even a 13B model can be trained in 1.25 hours, while a massive 175B model can be trained in a day using DeepSpeed-HE.
    [Table: multi-node, multi-GPU training time and approximate Azure cost for various model sizes]

    The numbers in the two tables above are for step 3 of training and are based on actual measured training throughput on the DeepSpeed-RLHF curated dataset, with a recipe that trains for one epoch on a total of 135M tokens. We have 67.5M query tokens (131.9k queries with sequence length 256) and 67.5M generated tokens (131.9k answers with sequence length 256) in total, with a maximum global batch size of 0.5M tokens (1024 query-answer pairs) per step.

  • Democratize RLHF training: with just a single GPU, DeepSpeed-HE supports training models with more than 13 billion parameters, enabling data scientists without access to multi-GPU systems to create not only toy RLHF models but also large, powerful models for real-world scenarios (see the configuration sketch after this list).
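
The single-GPU capability above comes largely from the ZeRO- and LoRA-based memory optimizations mentioned earlier. As a rough, illustrative sketch only (not the exact configuration DeepSpeed-Chat generates), a DeepSpeed config that offloads parameters and optimizer state to CPU looks like this:

# Illustrative DeepSpeed configuration (assumed values; not the one emitted by DeepSpeed-Chat).
# ZeRO stage 3 with CPU offload lets a single GPU train models far larger than its own memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition parameters, gradients, optimizer state
        "offload_param": {"device": "cpu"},      # keep parameters in CPU memory
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in CPU memory
    },
}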

Hands-on experience

Example 1: Q&A session with a 1.3B final model trained with DeepSpeed-Chat


Example 2: Multi-turn conversation with a model trained with DeepSpeed-Chat


One script completes all three stages of RLHF training and produces your first ChatGPT model!

pip install "deepspeed>=0.9.0"

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node

In about half a day, the 13B model will be fully trained, with its checkpoints ready. The following table shows the training time breakdown for each of the three steps:

[Table: training time breakdown for the three steps of the 13B training run]

Different model sizes and configurations

For example, if you want to train a larger, higher-quality model for your research or business on a GPU cluster, you can simply use the same script with your desired model size (e.g., 66B) and GPU count (e.g., 64 GPUs):

python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node

[Table: training time breakdown for the 66B training run on 64 GPUs]

Within 9 hours, you can have a 66-billion-parameter ChatGPT model ready to serve in your favorite front-end GUI.

If you only have around 1-2 hours for a coffee or lunch break, you can also try training a small toy model with DeepSpeed-Chat. For example, we prepared a training example for a 1.3B model with a single dataset so that you can test it on a consumer-grade GPU.

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu

Specifically, for models of different scales and hardware configurations, the time and cost required by the DeepSpeed-RLHF system are as follows:

[Table: DeepSpeed-RLHF training time and cost by model scale and hardware configuration]

Customize your own RLHF training pipeline using DeepSpeed-Chat's RLHF API

DeepSpeed-Chat allows users to build their own RLHF training pipelines using the flexible APIs shown below, which can be used to reconstruct their own RLHF training strategies. This provides a common interface and backend for creating and exploring various RLHF algorithms.


# Build the RLHF engine, which wraps the actor and critic models with DeepSpeed.
engine = DeepSpeedRLHFEngine(
  actor_model_name_or_path=args.actor_model_name_or_path,
  critic_model_name_or_path=args.critic_model_name_or_path,
  tokenizer=tokenizer,
  num_total_iters=num_total_iters,
  args=args)

# PPO trainer that drives the RLHF optimization loop on top of the engine.
trainer = DeepSpeedPPOTrainer(engine=engine, args=args)

# Each iteration: generate experience from prompts (inference mode),
# then update the actor and critic with PPO (training mode).
for prompt_batch in prompt_train_dataloader:
  out = trainer.generate_experience(prompt_batch)
  actor_loss, critic_loss = trainer.train_rlhf(out)

Reference: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/README.md

Reference: https://www.deepspeed.ai/

Reference: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/README.md#-deepspeed-chat-roadmap-

Proven RLHF training pipeline

To provide a seamless training experience, we follow InstructGPT and include a mature end-to-end training pipeline in DeepSpeed-Chat, as shown in Figure 1.

[Figure 1: the DeepSpeed-Chat RLHF training pipeline, following InstructGPT]
Our pipeline consists of three main steps:

  • Step 1: Supervised fine-tuning (SFT), where carefully selected human responses to various queries are used to fine-tune the pretrained language model.
  • Step 2: Reward model fine-tuning, where a separate model (RW, typically smaller than the SFT model) is trained on a dataset with human-provided rankings of multiple answers to the same query (a sketch of this ranking objective follows the list).
  • Step 3: RLHF training, where the SFT model is further fine-tuned with reward feedback from the RW model using the proximal policy optimization (PPO) algorithm.
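
To make step 2 a little more concrete: reward models for RLHF are commonly trained with a pairwise ranking objective, so that the human-preferred answer to a query scores higher than the rejected one. The PyTorch sketch below shows that common formulation; DeepSpeed-Chat's actual implementation may differ in details, and the tensor names are hypothetical.

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model objective: prefer the human-chosen answer.

    chosen_rewards / rejected_rewards are per-query scalar scores (shape [batch])
    produced by the reward model for the preferred and rejected answers.
    """
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()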

We provide two additional functions in step 3 to help improve model quality:

  • Exponential Moving Average (EMA) collection, where an EMA-based checkpoint can be chosen for the final evaluation (a minimal sketch of the EMA update follows this list).
  • Hybrid training, which mixes the pre-training objective (i.e., next-word prediction) with the PPO objective to prevent regressions in performance on public benchmarks such as SQuAD2.0.
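
As a reminder of what EMA collection involves (first bullet above), a shadow copy of the actor's weights is updated after every optimizer step and checkpointed separately. A minimal sketch follows; the decay value is illustrative, not the one used by DeepSpeed-Chat.

import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.99):
    """Keep an exponential moving average of the actor's weights.

    The EMA copy is never trained directly; an EMA-based checkpoint can be
    chosen for the final evaluation instead of the last raw checkpoint.
    """
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: start with ema_actor = copy.deepcopy(actor_model) (kept frozen),
# then call update_ema(ema_actor, actor_model) after each optimizer step.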

These two training features, EMA collection and hybrid training, are often omitted by other recent work because they are optional. However, according to InstructGPT, EMA checkpoints generally provide better response quality than the conventional final trained model, and hybrid training helps the model retain its pre-training benchmark-solving ability. We therefore provide both, so that users can fully obtain the training experience described in InstructGPT and strive for higher model quality.

In addition to staying highly consistent with the InstructGPT paper, we also provide convenient features to support researchers and practitioners in training their own RLHF models with multiple data sources:

  • Data abstraction and blending capabilities: DeepSpeed-Chat can train models with multiple datasets for better model quality. It is equipped with (1) an abstract dataset layer to unify the format of different datasets, and (2) data splitting/blending functions that properly mix multiple datasets and then split them across the three training stages (a rough sketch of such splitting follows).
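
As a rough sketch of what such splitting can look like, the hypothetical helper below partitions one dataset into disjoint portions for the three stages so that no example is reused across stages; the split ratios are illustrative, not DeepSpeed-Chat's defaults.

import random
from typing import Dict, List, Sequence

def split_for_stages(samples: Sequence, ratios=(0.2, 0.4, 0.4), seed: int = 1234) -> Dict[str, List]:
    """Hypothetical helper: split one dataset into disjoint portions for
    SFT (stage 1), reward-model training (stage 2), and RLHF prompts (stage 3)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle
    n1 = int(len(indices) * ratios[0])
    n2 = n1 + int(len(indices) * ratios[1])
    return {
        "sft": [samples[i] for i in indices[:n1]],
        "reward": [samples[i] for i in indices[n1:n2]],
        "rlhf": [samples[i] for i in indices[n2:]],
    }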

DeepSpeed Hybrid Engine – unified infrastructure to power and optimize RLHF training

Steps 1 and 2 of the instruction-guided RLHF pipeline resemble conventional fine-tuning of large models, and they are powered by a flexible combination of ZeRO-based optimizations and parallelism strategies in DeepSpeed training to achieve scale and speed. Step 3, on the other hand, is the most complex part of the pipeline in terms of performance impact. Each iteration needs to efficiently handle two phases: a) an inference phase for token/experience generation, which produces the inputs for training, and b) a training phase that updates the weights of the actor and reward models, plus the interaction and scheduling between the two. This introduces two major costs: (1) memory cost, since multiple copies of the SFT and RW models must be served throughout phase 3; and (2) the dominant generation phase, which, if not properly accelerated, will significantly slow down the entire phase 3. In addition, the two important features we added in phase 3, Exponential Moving Average (EMA) collection and hybrid training, incur extra memory and training costs.

To address these challenges, we combine the full system capabilities of DeepSpeed Training and DeepSpeed Inference into a unified infrastructure that we call the Hybrid Engine. It uses the original DeepSpeed engines for the fast training mode while effortlessly applying the DeepSpeed inference engine for the generation/evaluation mode, providing a significantly faster training system for phase-3 RLHF training. As shown in Figure 2, the transition between the DeepSpeed training and inference engines is seamless: with the typical eval and training modes enabled for the actor model, DeepSpeed selects its different optimizations when running the inference and training pipelines, running the model faster and improving overall system throughput (a simplified sketch of this mode switch appears below).

[Figure 2: seamless switching between the DeepSpeed training and inference engines in the Hybrid Engine]
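
Conceptually, each step-3 iteration alternates between the two modes described above. The sketch below is generic PyTorch, not the Hybrid Engine's actual implementation: compute_ppo_loss stands in for the full PPO machinery (reward model, critic, advantages), and prompt_batch is assumed to be a dict of input tensors.

import torch

def rlhf_iteration(actor, prompt_batch, compute_ppo_loss, optimizer):
    """One simplified RLHF iteration: generate experience, then train."""
    # Generation / experience phase: inference-style execution of the actor.
    actor.eval()
    with torch.no_grad():
        responses = actor.generate(**prompt_batch, max_new_tokens=256)

    # Training phase: PPO update of the actor's weights.
    actor.train()
    loss = compute_ppo_loss(actor, prompt_batch, responses)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()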
