Custom model and data for DeepSpeed-Chat training

This article demonstrates how to use pre-trained models other than Facebook OPT in the DS-Chat code, and how to prepare and use custom data for model training, so that you can train large-scale models for specific domains or applications.

Video: "DeepSpeed-Chat: training a ChatGPT model with your own model and data" (Bilibili)

The main contents of this chapter include the following points:

  • Introduction to the experiment settings: we describe the settings used in this experiment and the reasons for choosing them, as a reference for your own AI learning.

  • How to replace the model: we explain how to swap out the pre-trained model used by DS-Chat and how to integrate your own pre-trained model into DS-Chat.

  • How to prepare and replace the data: we explain how to prepare your own dataset, including the data format required for training and format conversion, and how to use your own dataset in DS-Chat training.

I hope this content helps you understand how to use different models and datasets in DS-Chat, so that you can train large-scale models better suited to your specific application scenarios.

Many models and public datasets in the NLP field can now be found on Huggingface, and the DS-Chat tool also uses models and data from Huggingface. Therefore, this chapter is mainly based on Huggingface models and data.

1 Experimental setup: model and data

【Watch the video explanation】

The experiments in this chapter are mainly set up with reference to LLMZoo.
GitHub - FreedomIntelligence/LLMZoo: ⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡

1.1 Why choose this model?

There are two main reasons for choosing this model:

  • The model and data are both public and relevant articles are available for reference.
  • This model (Phoenix-inst-chat-7b) performs very well on Chinese data.

Because the model, the data, and the accompanying article are all public, we can aim to match the published model's performance, which makes it easier to confirm whether our own training setup is correct.

At the time this video was made, the 7B-scale Phoenix-inst-chat-7b model performed very well on Chinese data. The evaluation results from the authors are shown below. Although there is still a gap compared with very large models such as ChatGPT and Baidu-Wenxin, its performance is very good for a model with 7B parameters. ChatGLM with 6B parameters appears to perform better, but its training data is not public.

Readers with sufficient hardware resources can practice large-model training with the goal of reproducing this model.

The following table shows the GPT-4 evaluation of the model:

Model comparison                              Ratio
Phoenix-inst-chat-7b vs. ChatGPT              85.2%
Phoenix-inst-chat-7b vs. ChatGLM-6b           94.6%
Phoenix-inst-chat-7b vs. Baidu-Wenxin         96.8%
Phoenix-inst-chat-7b vs. MOSS-moon-003-sft    109.7%
Phoenix-inst-chat-7b vs. BELLE-7b-2m          122.7%
Phoenix-inst-chat-7b vs. Chinese-Alpaca-7b    135.3%
Phoenix-inst-chat-7b vs. Chinese-Alpaca-13b   125.2%

The following are the results of human evaluation

Model comparison                 Win   Tie   Lose
Phoenix vs. ChatGPT               12    35     53
Phoenix vs. ChatGLM-6b            36    11     53
Phoenix vs. Baidu-Wenxin          29    25     46
Phoenix vs. BELLE-7b-2m           55    31     14
Phoenix vs. Chinese-Alpaca-13b    56    31     13

1.2 Pre-trained model

LLMZoo has released two types of models. Our main reference is the Chinese-oriented Phoenix-inst-chat-7b model, whose base pre-trained model is BLOOMZ-7b1-mt. Although this 7B model is much smaller than the current OpenAI models, it is still a very large model, and training it requires dozens of GPUs.

In the early learning stage, it is recommended to use the BLOOMZ-560M model, which has far fewer parameters. Although this model is small, by adjusting the parameters and continuously optimizing it you can still effectively learn the knowledge and skills involved in LLM training.

Once the code runs smoothly on this smaller model, readers with sufficient resources can try to train the 7B-scale model with more hardware. Of course, at that point you will need to know more about DeepSpeed in order to master multi-node training.

The relevant pre-trained models are as follows:

  • BLOOMZ-7b1-mt: the pre-trained model used by Phoenix-inst-chat-7b. Huggingface name: bigscience/bloomz-7b1-mt. Training requires about 48 32G GPUs.
  • BLOOMZ-560M: the learning model, bigscience/bloomz-560m, which can be trained on a single 24G GPU.

Reference: GPU resources required for different configurations:

Model           Minimum GPUs     Total batch size   Batch size (per device)   max_seq_len   Status
BLOOMZ-7b1-mt   48 (6x8, 32G)    48x8x1             8                         512           normal training
BLOOMZ-560m     1 (32G)          2                  2                         512           normal training

1.3 Training Data

The model training data used in LLMZoo has been made public on Huggingface under the name FreedomIntelligence/phoenix-sft-data-v1. The training data contains 473K records in total, of the instruction and conversation types: 267K instruction records and 198K conversation records. The data covers more than 40 languages, including Chinese (113K records) and English (51K records).

You can download the data directly from the following link:
https://huggingface.co/datasets/FreedomIntelligence/phoenix-sft-data-v1/resolve/main/data.json
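
If you prefer to load the data programmatically, the following is a minimal sketch (assuming the Huggingface datasets library is installed; whether the dataset resolves directly by name may depend on your datasets version, so the downloaded data.json can be used as a fallback):

from datasets import load_dataset

# Option 1: load directly from the Huggingface Hub (if the dataset resolves by name).
ds = load_dataset("FreedomIntelligence/phoenix-sft-data-v1", split="train")

# Option 2: load the downloaded data.json file locally instead.
# ds = load_dataset("json", data_files="data.json", split="train")

print(len(ds))  # total number of records
print(ds[0])    # inspect one record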

For more information about the data, please refer to: https://arxiv.org/abs/2304.10453

2 Replacement models

【Watch the video explanation】

The default DS-Chat training uses models and data in the Huggingface format, so switching to the Huggingface-based BLOOMZ model is very simple: you only need to change the model_name_or_path parameter to the model you want to use.
Note: Due to differences in model architecture and wrapper classes, not all models on Huggingface can be used directly. For example, GLM models cannot be used directly by DS-Chat.

The following takes the BLOOMZ-560M model as an example to introduce how to use the BLOOMZ model in DS-Chat.

The following is a modified run1.3b.sh script that uses this pre-trained model by changing model_name_or_path to bigscience/bloomz-560m:

# Note: ZERO_STAGE and OUTPUT are defined earlier in run1.3b.sh
deepspeed --num_gpus 1 main.py \
   --data_path Dahoas/rm-static \
   --model_name_or_path bigscience/bloomz-560m \
   --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage $ZERO_STAGE \
   --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
   --deepspeed --output_dir $OUTPUT 2>&1 | tee $OUTPUT/training.log

Note: With the above settings, training uses about 30G of GPU memory. You can adjust per_device_train_batch_size and per_device_eval_batch_size to reduce the GPU memory usage.

Model import can be divided into three parts:

  • Import the tokenizer: AutoTokenizer.from_pretrained(...)
  • Import the model config: AutoConfig.from_pretrained(...)
  • Import the model: AutoModelForCausalLM.from_pretrained(...)

For the implementation details, please refer to the code below.

from transformers import AutoTokenizer, AutoModelForCausalLM
from utils.model.model_utils import create_hf_model

tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path,
                                          fast_tokenizer=True)
model = create_hf_model(AutoModelForCausalLM,
                        args.model_name_or_path,
                        tokenizer,
                        ds_config,
                        disable_dropout=args.disable_dropout)

The implementation of the create_hf_model function is as follows:

# imports used by this function
import math

from transformers import AutoConfig
from transformers.deepspeed import HfDeepSpeedConfig


def create_hf_model(model_class,
                    model_name_or_path,
                    tokenizer,
                    ds_config=None,
                    rlhf_training=False,
                    disable_dropout=False):
    model_config = AutoConfig.from_pretrained(model_name_or_path)
    if disable_dropout:
        model_config.dropout = 0.0
    # Note: dschf is defined in function scope to avoid global effects
    # https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration
    if ds_config is not None and ds_config["zero_optimization"]["stage"] == 3:
        dschf = HfDeepSpeedConfig(ds_config)
    else:
        dschf = None
    if rlhf_training:
        # the weight loading is handled by create critic model
        model = model_class.from_config(model_config)
    else:
        model = model_class.from_pretrained(
            model_name_or_path,
            from_tf=bool(".ckpt" in model_name_or_path),
            config=model_config)

    model.config.end_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = model.config.eos_token_id
    model.resize_token_embeddings(int(8 * math.ceil(len(tokenizer) / 8.0)))
    # make the vocab size a multiple of 8

    return model

When using BLOOMZ-series models, no changes to the model import code are needed. However, for some other models, such as GLM, DS-Chat cannot import the model directly, and the code above needs to be adjusted.
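
As an illustration only (this is not DS-Chat code), the sketch below shows one common way to load a model that ships custom modeling code, such as a GLM-family model; the model id, the AutoModel class choice, and the trust_remote_code flag are assumptions that depend on the specific model card:

# Hedged sketch: loading a model that AutoModelForCausalLM does not cover
# (e.g. a GLM-family model shipping custom modeling code). The model id is
# illustrative; check the model card for the required classes and flags.
from transformers import AutoTokenizer, AutoModel

name = "THUDM/chatglm-6b"  # assumption: example model id
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

Integrating such a model into DS-Chat would additionally require adapting create_hf_model and the end/pad token handling shown above.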

Common problems:

  • Insufficient GPU memory during training:
    Countermeasure: reduce the batch size, for example by adding the parameters --per_device_train_batch_size 1 --per_device_eval_batch_size 1; you can also reduce the maximum sequence length, e.g. --max_seq_len 255.

  • Local storage location of models downloaded from Huggingface:
    by default they are stored in the ~/.cache/huggingface/hub directory.

  • How to use your own model?
    Set the model_name_or_path parameter to the local path. Note that you need to confirm that the files tokenizer_config.json and tokenizer.json are present in the model folder (DS-Chat does not save these two files when saving a model). A quick check is sketched below.
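
As a quick illustrative check (the directory path below is a placeholder), you can verify that both tokenizer files are present before pointing model_name_or_path at a local folder:

# Illustrative check: make sure the tokenizer files exist in a local model folder.
import os

model_dir = "/path/to/your/model"  # placeholder: your local model directory
for name in ("tokenizer_config.json", "tokenizer.json"):
    path = os.path.join(model_dir, name)
    print(name, "found" if os.path.exists(path) else "MISSING")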

3 Replacement data

【Watch the video explanation】

An important development effort for large models is to further optimize the model using task-specific data. Typically, models optimized using data from related tasks will perform better on the target task. Using your own data for model training in the DS-Chat tool can be divided into the following three steps:

  1. Prepare the data and organize the data according to a certain format, such as using JSON format.
  2. Modify the code of data_utils.py and raw_datasets.py to add support for new data.
  3. Set up the new data in the training shell script and start model training.

3.1 How to prepare data

Before preparing data, you first need to understand the data format required for model training. We can see the format used during training by looking at the raw_datasets.py code. The following is an example of one of the data reading classes implemented in the code:

class HelloSimpleAIHC3ChineseDataset(PromptRawDataset):
    def get_prompt(self, sample):
        if sample['question'] is not None:
            return " Human: " + sample['question'] + " Assistant:"
        return None

    def get_chosen(self, sample):
        if sample['human_answers'][0] is not None:
            return " " + sample['human_answers'][0]
        return None

    def get_prompt_and_chosen(self, sample):
        if sample['question'] is not None and sample['human_answers'][
                0] is not None:
            return " Human: " + sample['question'] + " Assistant: " + sample[
                'human_answers'][0]
        return None

    def get_rejected(self, sample):
        ...
    def get_prompt_and_rejected(self, sample):
        ...

From the code above, we can see that this dataset exposes three kinds of fields: prompt, answer (chosen), and rejected, plus their combinations prompt+answer and prompt+rejected. Therefore, the most basic contents of the training data are prompt, answer, and rejected.

Then, looking at the code around line 141 of data_utils.py, we can see:

  • In Stage 1, get_prompt_and_chosen() is called to read the training data. Therefore, if we want to perform Stage 1 training, we need to prepare prompt and answer.

  • In Stage 2, get_prompt_and_chosen and get_prompt_and_rejected are called to read data to train the reward model, that is, this part requires prompt, answer and rejected data.

  • Only get_prompt is called in Stage 3, so only prompt is required to perform Stage 3 training.

The training of the LLMZoo model is similar to Stage 1, so the data you need to prepare only needs to contain the prompt and the answer.

In order to facilitate data reading, I formatted the phoenix-sft-data-v1 data. The following is a JSON example of its data:

[
  {
    "id": "0",
    "type": "Instruction",
    "from_human": "假设你是一位Airbnb房主。... \n",
    "from_gpt": "很抱歉,作为AI语言模型,我无法检查您的Airbnb列表。"
  },
  {
    "id": "1",
    "type": "Instruction",
    "from_human": "假设你是一位翻译。... \n",
    "from_gpt": "\"Al dente\" means cooking the ..."
  }
]

Here, from_human is the prompt and from_gpt is the answer. If you have your own data, you can prepare it in the above format; a small conversion sketch is shown below.
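
As an example, the following minimal sketch (the pairs list is a placeholder for your own data) writes prompt/answer pairs into the JSON layout shown above:

# Minimal sketch: dump (prompt, answer) pairs into the JSON layout shown above.
# The `pairs` list is a placeholder for your own data.
import json

pairs = [
    ("Your prompt text ...", "The expected answer ..."),
]

records = [
    {"id": str(i), "type": "Instruction", "from_human": q, "from_gpt": a}
    for i, (q, a) in enumerate(pairs)
]

with open("yourData.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)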

3.2 Modify the code to read data

【Watch the video explanation】

Next, we show how to modify the code to read custom data. DS-Chat provides data reading classes for multiple formats. You can either choose a data reading class whose format is similar to your own data and modify it, or directly pick one of the existing formats and prepare your data accordingly, which reduces the amount of code modification.

Code modifications include (please refer to the video for the modification process):

  • data_utils.py: define the object and interface for the new data reading class.
  • raw_datasets.py: define a new data reading class. To read local data with load_dataset: self.raw_datasets = load_dataset(path="/home/data/", data_files="yourData.json"). A sketch of such a class follows the next paragraph.
  • run1.3b.sh: set the script to use your own dataset name.

During model training, data_utils.py uses the dataset name to initialize the data reading object. Then, in raw_datasets.py, the first time load_dataset is called it converts the JSON file into the arrow format and caches it in the cache_dir directory; the next time the data is read, the cached arrow files are used directly.
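
The following is only a hedged sketch of what such a new class in raw_datasets.py might look like for the JSON format from section 3.1. The class name and file paths are illustrative, the constructor signature should be aligned with the PromptRawDataset base class in your checkout, and the real classes in the repo additionally split train/eval data via get_raw_dataset_split_index:

# Hedged sketch of a new data reading class (to be added in raw_datasets.py,
# where the PromptRawDataset base class is defined). The class name, dataset
# name, and file paths are illustrative; align the constructor with the
# PromptRawDataset signature in your version of the code.
from datasets import load_dataset  # already imported at the top of raw_datasets.py


class MyLocalJsonDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank):
        super().__init__(output_path, seed, local_rank)
        self.dataset_name = "my_local_json"
        self.dataset_name_clean = "my_local_json"
        # Read the local JSON file prepared in the format from section 3.1.
        self.raw_datasets = load_dataset(path="/home/data/",
                                         data_files="yourData.json")

    def get_train_data(self):
        # Simplified: return everything as training data. The classes in the
        # repo split train/eval with get_raw_dataset_split_index.
        return self.raw_datasets["train"]

    def get_eval_data(self):
        return self.raw_datasets["train"]

    def get_prompt(self, sample):
        return " Human: " + sample["from_human"] + " Assistant:"

    def get_chosen(self, sample):
        return " " + sample["from_gpt"]

    def get_prompt_and_chosen(self, sample):
        return (" Human: " + sample["from_human"] +
                " Assistant: " + sample["from_gpt"])

After adding such a class, register its name in get_raw_dataset() in data_utils.py and pass that name to the training script, as described in the call chain below.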

Note:
If you are using distributed training, it is recommended to cache the data with a single GPU process first, because during distributed training, errors may occur when multiple processes cache the data at the same time, especially when the dataset is large.

Also note that DS-Chat performs a second, local cache of the data, which takes up additional disk space and can lead to excessive memory consumption when the dataset is large. This issue is being addressed upstream; see the link below for details. During the learning phase, you can use a small number of samples or multi-GPU training to mitigate the problem.
Feature Request: add LazyPromptDataset to DeepSpeedChat · Issue #450 · microsoft/DeepSpeedExamples · GitHub

Data calling process
The call chain below shows how the data is read; you can refer to it when modifying the code.

- File: step1_supervised_finetuning/main.py
  - Line 224: train_dataset, eval_dataset = create_prompt_dataset()
    - File: training/utils/data/data_utils.py
      - Line 268: train_dataset, eval_dataset = create_dataset()
      - Line 212: raw_dataset = get_raw_dataset()
        - Line 20: def get_raw_dataset():
                       return raw_datasets.Wangrui6ZhihuKOLDataset()
          - File: training/utils/data/raw_datasets.py
            - Line 307: class Wangrui6ZhihuKOLDataset(PromptRawDataset)
      - Line 220: train_dataset = create_dataset_split()
        - Line 141: if train_phase == 1:
                        chosen_sentence = raw_dataset.get_prompt_and_chosen()

Common problems

  • Q/A 1: Error: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run
    Problem description: you may encounter this error during training. It is usually caused by unstable training. It is recommended to increase the batch size to improve training stability; note that a larger batch size uses more GPU memory. Alternatives are to use multiple GPUs, or to set gradient_accumulation_steps to achieve the effect of a larger batch size.
    If the problem persists, you can try using float32 (usually for NaN errors).

  • Q/A 2: Remember to delete temporary data.
    By default, the DS-Chat program caches the data multiple times, including:

    • Huggingface's caching of data: for example, map operations automatically cache data (program modifications may trigger re-caching, so be careful to delete old cache files).
    • load_dataset automatically caches the JSON data in the arrow format.
    • DS-Chat caches the data on the local machine: traindata-xxxx.pt and evaldata-xxx.pt files in the local /tmp/data_files/ directory, together with a data index file (*.npy).
  • Q/A 3: Data reading errors during distributed training.
    It is recommended to run the load_dataset part separately on a single GPU first to cache the basic data processing, and then start multi-node distributed training.

  • Q/A 4: How to reduce machine memory usage when the dataset is large?
    Split the data appropriately (this requires corresponding code adjustments), or load the data lazily. An official solution is in progress; you can follow the link below for the latest progress: Feature Request: add LazyPromptDataset to DeepSpeedChat · Issue #450 · microsoft/DeepSpeedExamples · GitHub

  • Q/A 5: After local data has been modified, re-training still uses the data from before the modification.
    This is caused by DS-Chat's data cache. You need to manually delete the cache files on the local machine. The default cache directory is /tmp/data_files/; delete this directory and restart training (see the snippet below for an illustrative cleanup).
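
An illustrative cleanup (using the default cache path mentioned above) could be:

# Illustrative: remove DS-Chat's local data cache so that modified data is re-read.
import shutil

shutil.rmtree("/tmp/data_files/", ignore_errors=True)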
