Zephyr-7B: paper analysis, full training, and LoRA training

1. Zephyr: Direct Distillation of LM Alignment

1.1 Development process

1.1.1 Zephyr-7B-alpha

  A few months ago, a new team in Paris released their first model: Mistral 7B, a small but powerful model that outperformed all similar-sized models in benchmark tests and is also fully open source.

  Two members of the Hugging Face H4 team, at a small gathering, discussed the possibility of fine-tuning the Mistral 7B model using the newly published DPO method from Stanford University. They then found some public datasets on the HF Hub, including two large-scale, high-quality fine-tuning datasets from OpenBMB (jointly backed by ModelBest and the Tsinghua University NLP lab): UltraFeedback and UltraChat.

  1. UltraFeedback: a large-scale, diverse, fine-grained preference dataset. Its construction process is as follows:
    • About 64k prompts collected from multiple sources such as UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN
    • To prevent the reward model from overfitting to certain text styles or capturing spurious correlations between text style and reward, 17 base models of different sizes, architectures, and training data were selected to build a model pool, including LLaMA, Falcon, StarChat, MPT, GPT, and Bard
    • Five principles (Helpfulness, Truthfulness, Honesty, Verbalized Calibration, Harmlessness) were defined to steer model behavior from different aspects
    • For each instruction, 4 models are randomly sampled to complete it; for each completion, a principle is randomly sampled and added to the system prompt to adjust the model's behavior
    • The final dataset includes 64k instructions, 256k dialogue completions with corresponding preference annotations, and 308k pieces of high-quality feedback. Among non-community-annotated preference datasets, it is the largest. Each preference annotation contains fine-grained GPT-4 scores and comments in four aspects: instruction-following, truthfulness, honesty, and helpfulness. For the detailed construction of the dataset, see its paper.

Based on UltraFeedback, the OpenBMB team also trained a reward model, UltraRM, and a critic model, UltraCM, to further assist model evaluation and feedback learning. For more information, see "How UltraFeedback, OpenBMB's alignment technique, lets a 7B model beat 70B LLaMA2".

  • UltraChat: a high-quality dialogue dataset containing more than 1.5 million multi-turn instruction dialogues, generated by having multiple ChatGPT API instances talk to each other.

  After several rounds of experiments, the new model trained on the two OpenBMB datasets proved very strong: it was the strongest model the H4 team had seen on the Berkeley and Stanford benchmarks, and it was later named Zephyr. Zephyr-7B-alpha's average MT-Bench score of 7.09 exceeds that of Llama2-70B-Chat.

  A 7B model built on high-quality data defeating LLaMA2-70B-Chat, a model with ten times the parameters, shows that the underlying data work is the scarcest and most valuable part; this may be one of the breakthrough points in the current war of a hundred large models.

  Besides the data, another main reason Zephyr outperforms LLaMA2-70B-Chat is the use of the DPO method recently proposed by Stanford University and CZ Biohub. Unlike the traditional PPO reinforcement-learning approach, DPO abandons reinforcement learning and is much more stable than PPO.

  A simple explanation of DPO: to make model outputs more consistent with human preferences, the traditional approach fine-tunes the target model with a reward model, rewarding good outputs and withholding reward from bad ones. DPO bypasses modeling the reward function and instead optimizes the model directly on preference data, avoiding the difficulty and high training cost of reinforcement learning from human feedback.

1.1.2 Zephyr-7B-beta

  When developing the second-generation model, the team considered distilled supervised fine-tuning (dSFT) on larger models, but a model trained this way remains misaligned and does not reliably generate output that matches user intent.

  So the team turned to preference data from AI Feedback (AIF): a "teacher model" ranks the outputs of other models to form a dataset, and distilled direct preference optimization (dDPO) is then applied to train a model aligned with user intent, without any additional sampling during fine-tuning. The researchers also tested skipping the SFT step, and performance dropped sharply, indicating that dSFT is crucial.

  The second-generation Zephyr-7B-beta explored the idea of extracting alignment from GPT-4 and Claude 2 and injecting it into a small model, developing the approach of applying distilled direct preference optimization (dDPO) to the small model; the average MT-Bench score rose to 7.34.

On AlpacaEval, Zephyr's win rate is 90.6%, better than ChatGPT (GPT-3.5):

1.2 Summary

  This paper aims to create a smaller language model that better aligns with user intent.

  Previous research has shown that distilled supervised fine-tuning (dSFT) on outputs of larger models can significantly improve task accuracy, but such models still respond poorly to natural prompts. To improve this, the authors use preference data from AI Feedback (AIF): with an output dataset ranked by a teacher model, they apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.

  This approach requires only a few hours of training and no additional sampling during fine-tuning. The resulting Zephyr-7B sets a new state of the art on chat benchmarks among 7B-parameter models, without any human annotation. MT-Bench results show that Zephyr-7B exceeds LLaMA2-70B-Chat. Code, models, data, and tutorials for the system can be found in the alignment-handbook.

1.3 Related work

  In recent years, open-source large language models have continued to emerge; models such as LLaMA, RedPajama-INCITE, Falcon, Llama 2, and Mistral appeared after ChatGPT and provide the research community with base models for research and applications. With the development of open-source models, researchers have been studying how to transfer knowledge from large models to improve the performance of small models. This trend began with self-instruct and Alpaca, and distillation strategies such as SFT and preference optimization have since become a research focus.

  To keep up with the pace of innovation in generative AI, tools for benchmarking and evaluating LLMs have also made great progress, for example:

  • Using a powerful LLM (GPT-4, Claude) as an evaluator to score model outputs or rank pairs of replies, judging the quality of model responses.
  • LMSYS chatbot arena: Use crowdsourcing to benchmark LLM through anonymous random battles. Models are ranked based on their Elo score on the leaderboard.
  • AlpacaEval: A ranking method similar to LMSYS that compares models in pairs, but uses larger LLMs such as GPT-4 and Claude to replace humans for evaluation
  • MT-Bench: uses GPT-4 to score multi-turn conversations (1-10 points) across different task categories, including reasoning, role-playing, mathematics, coding, writing, humanities, STEM, and information extraction.
  • Other evaluation tools: HuggingFace Open LLM Leaderboard, Chain-of-Thought Hub, ChatEval, FastEval, etc.

  This article finally demonstrates the effect of the Zephyr model through the evaluation results on MTBench, AlpacaEval and HuggingFace OpenLLM rankings.

Model performance on MT-Bench

1.4 Algorithm

Reference: "Zephyr-7B: Fine-Tuning and Inference with W&B", "HuggingFace's new work: 7B beats 70B Llama 2, the open-source model Zephyr-7B! It even runs on a Mac"

The paper aims to align open-source large language models with user intent. As shown in the paper's overview figure, the entire training process is divided into three steps:

1.4.1 Distillation Supervised Fine-tuning (dSFT)

  dSFT (Distilled Supervised Fine-Tuning) uses a high-quality instruction-response dataset to teach the model to respond to instructions and prompts. Rather than using traditional supervised fine-tuning (SFT) on an existing instruction-response dataset, Zephyr uses a teacher model to generate these high-quality responses, thereby "distilling" some of the teacher model's capabilities into the student model; you can also think of it as a pseudo-labeling method.

  Suppose we have a set of seed prompts $\{x_1, \ldots, x_J\}$. For each prompt $x_i$, the teacher model (GPT-4) produces a response $y_i$ and then refines the instruction based on that response to obtain $\hat{x}_i$. The final dataset is $C = \{(\hat{x}_1, y_1), \ldots, (\hat{x}_J, y_J)\}$.

The model is then instruction-tuned to optimize the following objective:

$$\pi_{dSFT} = \max_{\pi} \; \mathbb{E}_{(x, y) \sim C} \, \log \pi(y \mid x)$$

  • $\pi$: the student model whose parameters are being optimized
  • $C$: the training dataset generated by the teacher model, consisting of refined prompts $\hat{x}_i$ and responses $y_i$
  • $\mathbb{E}_{(x, y) \sim C}$: the expectation over pairs $(x, y)$ sampled from the dataset $C$

  The objective maximizes the log-likelihood of the teacher's responses under the student model: by making the student imitate the teacher's responses, knowledge is transferred.
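
  To make the objective concrete, here is a minimal PyTorch sketch of this loss (not the alignment-handbook implementation, which uses TRL's SFTTrainer shown in section 2.2): it computes the negative log-likelihood of a teacher response given the prompt, with the prompt tokens masked out.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def dsft_loss(model, tokenizer, prompt: str, teacher_response: str) -> torch.Tensor:
    # Negative log-likelihood of the teacher response given the (refined) prompt.
    # Minimizing this over C = {(x_hat, y)} maximizes E log pi(y | x).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + teacher_response, return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss   # mean NLL of response tokens

# usage sketch (any Hugging Face causal LM works the same way):
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# loss = dsft_loss(model, tokenizer, "Explain DPO.\n", "DPO optimizes ...")
# loss.backward()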

1.4.2 Preference-based AI feedback (AIF)

  Human feedback (HF) can provide an additional guidance signal for aligning large language models (LLMs) and is a common way to align them. Since this paper works by distillation, the teacher model is instead used to provide feedback on outputs generated by other models, i.e. AI Feedback through Preferences (AIF). Put plainly, AI feedback from the teacher model replaces human feedback.

  Specifically, following the UltraFeedback recipe: for each prompt $x$ in $\{x_1, \ldots, x_J\}$, four models (Claude, LLaMA, Falcon, etc.) each generate a response, giving $(y^1, y^2, y^3, y^4)$. GPT-4 is then used as the teacher to score each response, $s^{\{1,2,3,4\}} = \pi_T(\cdot \mid x, y^{\{1,2,3,4\}})$. The highest-scoring response is denoted $y_w$, and one randomly chosen lower-scoring response is denoted $y_l$. From the prompt list we thus derive the AI feedback dataset $D = \{(x_1, y_1^w, y_1^l), \ldots, (x_J, y_J^w, y_J^l)\}$, i.e. triples of a prompt, a stronger response, and a weaker response. A small sketch of this construction follows.
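
  A minimal sketch of building the $(x, y_w, y_l)$ triples from scored responses; the field names (responses, scores, prompt) are illustrative, not the actual UltraFeedback schema.

import random

def build_aif_triples(records):
    # records: one entry per prompt with the 4 candidate responses and their
    # GPT-4 scores (illustrative field names).
    triples = []
    for rec in records:
        scored = list(zip(rec["responses"], rec["scores"]))
        y_w, best_score = max(scored, key=lambda t: t[1])      # highest-scoring response
        weaker = [resp for resp, s in scored if s < best_score]
        if not weaker:                                         # all four tied: skip
            continue
        y_l = random.choice(weaker)                            # one random lower-scoring response
        triples.append({"prompt": rec["prompt"], "chosen": y_w, "rejected": y_l})
    return triples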

1.4.3 Distilled direct preference optimization (dDPO)

  The goal of distilled direct preference optimization (dDPO) is to maximize the probability that the preference model ranks the preferred response $y_w$ above $y_l$, thereby further optimizing the model $\pi_{dSFT}$ obtained from dSFT. The preference model is determined by a reward function that is itself defined through the student language model.

  Past work on learning from AI feedback has mainly relied on reinforcement-learning methods such as PPO (Proximal Policy Optimization): first train a reward function, then sample from the current policy to compute updates and optimize $\theta$. In DPO, the preference model is instead determined by a reward function $r_\theta(x, y)$ that is parameterized directly in terms of the student language model $\pi_\theta$.

  The key observation of DPO is that the optimal reward function can be written in terms of the optimal language-model policy $\pi^*$ and the original policy $\pi_{dSFT}$. For an appropriate choice of preference model, a constant $\beta$, and partition function $Z$:

$$r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{dSFT}(y \mid x)} + \beta \log Z(x)$$

Substituting this reward function into the preference model, the objective can be written as:

$$\pi_\theta = \max_{\pi} \; \mathbb{E}_{(x, y_w, y_l) \sim D} \, \log \sigma\!\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_{dSFT}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{dSFT}(y_l \mid x)}\right)$$

  Compared with RLHF, DPO optimizes the model directly from static preference data without requiring a trained reward model. According to the authors, DPO is lightweight and more stable. The method used in this article is called dDPO because the dataset is distilled from an earlier step, leveraging AI-provided preference labels.

  Summary of the entire training process:

  1. dSFT: instruction-tune the raw LLM to obtain the model $\pi_{dSFT}$.
  2. AIF: following the UltraFeedback recipe, build the AI feedback dataset $D = \{(x_1, y_1^w, y_1^l), \ldots, (x_J, y_J^w, y_J^l)\}$ from the prompt list $\{x_1, \ldots, x_J\}$.
  3. dDPO: use each AIF triple $(x, y^w, y^l)$ to optimize the model (a minimal sketch of the loss follows this list):
    • Compute the dSFT model's probabilities for $(x, y_w)$ and $(x, y_l)$ (forward pass only).
    • Compute the dDPO model's probabilities for $(x, y_w)$ and $(x, y_l)$.
    • Compute the loss from the objective above, backpropagate to update the parameters, and repeat.
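
  As referenced in step 3, a minimal sketch of the dDPO loss given per-sequence log-probabilities; in practice TRL's DPOTrainer (section 2.2.3) computes this, so treat the function below as an illustration only.

import torch
import torch.nn.functional as F

def ddpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Inputs: summed log-probabilities log pi(y|x) of the chosen (w) and rejected (l)
    # responses under the policy being trained and under the frozen dSFT reference
    # model (reference values come from a forward pass only, as in the steps above).
    chosen_margin = beta * (policy_logp_w - ref_logp_w)
    rejected_margin = beta * (policy_logp_l - ref_logp_l)
    # maximizing log sigma(margin) is the same as minimizing -logsigmoid(margin)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# toy example with dummy log-probabilities
loss = ddpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                 torch.tensor([-13.0]), torch.tensor([-14.5]))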
1.4.4 Training details
  • $\pi_{dSFT}$ training hyperparameters: cosine LR scheduler, peak learning rate 2e-5, warmup ratio 10%, 1 epoch, sequence length 2048, batch size 512.
  • DPO training hyperparameters: linear LR scheduler, peak learning rate 5e-7, warmup ratio 10%, batch size 32, β = 0.1, 3 epochs.

  The final Zephyr-7B model is initialized from the weights of the SFT model (trained for 1 epoch) and then trained with DPO for 3 epochs.
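  As a quick cross-check against the configs in section 2.2 (per_device_train_batch_size=32, gradient_accumulation_steps=2) and, assuming one process per GPU as in the 8-process Accelerate config, the effective SFT batch size recovers the 512 quoted above:

$$512 = 32\ (\text{per-device batch}) \times 2\ (\text{gradient accumulation}) \times 8\ (\text{processes})$$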

1.5 Experiment

  1. dDPO improves the performance on the dialogue data sets MT-Bench and AlpacaEval
  2. dDPO improves the performance on traditional tasks (Academic Task)
  3. Is preference optimization necessary?
    In Table 3, we examine the impact of different steps in the alignment process by fine-tuning Mistral 7B in four different ways:
    • dDPO - dSFT: run 1 epoch of DPO training directly on the base model with the UltraFeedback dataset. Without the preceding SFT step, the model cannot learn from feedback and performs poorly.
    • dSFT1: run 1 epoch of SFT training on the base model with the UltraChat dataset. This step significantly improves the model's scores on the two chat benchmarks.
    • dSFT2: first perform dSFT1, then run 1 epoch of SFT training on the UltraFeedback dataset; the model overfits.
    • dDPO + dSFT: the strategy used in this paper, dSFT1 followed by 1 epoch of DPO training on the UltraFeedback dataset; there is a significant improvement on both benchmarks.
  4. Will overfitting result in loss of performance on downstream tasks?
    • After one epoch of DPO training, the model overfits strongly, as shown by the near-perfect training-set accuracy reported in the paper. However, this does not hurt downstream performance on MT-Bench and AlpacaEval; performance actually keeps improving with longer training even after overfitting. The researchers believe this is similar to overfitting in SFT.
    • If SFT training exceeds 1 epoch, the subsequent DPO step causes performance degradation.
    • The best model was trained with one epoch of SFT followed by three epochs of DPO.

2. alignment-handbook: training Zephyr at low cost

Reference: "How to train a model Zephyr-7B that can surpass 70B Llama2 at low cost", project address《alignment-handbook》

  The complete training process of Zephyr is published in "alignment-handbook". For environment installation, see the project homepage. The training process is briefly introduced below.

2.1 Project Introduction

The entire training process is divided into two steps:

  1. SFT training: use the UltraChat dataset to run SFT on the Mistral 7B model.
    For SFT training, we used the UltraChat dataset, which contains about 1.6M dialogues generated by GPT-3.5. We initially trained on all the data, but found that the resulting model had a somewhat annoying personality, so we filtered out about 200K of the more helpful examples; the filtered dataset is ultrachat_200k.
  2. DPO fine-tuning: use a preprocessed version of the UltraFeedback dataset to fine-tune the SFT model with DPO (direct preference optimization) and align it with AI feedback.
    The UltraFeedback dataset covers a wide range of models; each response was scored by GPT-4 on criteria such as helpfulness to derive the AI's preferences. An interesting finding is that with DPO, performance actually keeps improving with longer training even after overfitting; the researchers believe this is similar to overfitting in SFT. (A sketch of loading both datasets follows this list.)
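
  As mentioned above, a minimal sketch of loading the two datasets with the datasets library; the split names come from the configs below, while the column names (messages, prompt, chosen, rejected) are taken from the Hub dataset cards and should be treated as assumptions.

from datasets import load_dataset

# split names match the training configs below; column names are assumptions
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
print(ultrachat[0]["messages"][:2])              # list of {"role", "content"} turns

ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
example = ultrafeedback[0]
print(example["prompt"])
print(example["chosen"][-1]["content"][:200])    # GPT-4-preferred answer
print(example["rejected"][-1]["content"][:200])  # lower-scored answer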

  In addition, TRL (SFTTrainer, DPOTrainer) and DeepSpeed ZeRO-3 were used in all experiments. Total compute cost: about $500, i.e. 8 hours on 16 x A100. Demo: zephyr-chat.

  Evaluation method: We used the excellent tool MT Bench provided by LMSYS. This multi-round benchmark evaluates the chatbot’s capabilities in various areas including creative writing, coding, and mathematics. It provides more accurate information about chatbot performance than other rankings.

Ultimately, the project provides two training methods:

  • Zephyr-7B full training: since this is full-parameter training, DeepSpeed ZeRO stage 3 is enabled. The Accelerate configuration is recipes/accelerate_configs/deepspeed_zero3.yaml.

    # Step 1 - SFT
    ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml
    
    # Step 2 - DPO
    ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
    
  • Zephyr-7B LoRA training: fine-tuning does not require DeepSpeed. The Accelerate configuration is recipes/accelerate_configs/multi_gpu.yaml.

    # Step 1 - SFT
    ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml
    
    # Step 2 - DPO
    ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
    

The training code is given below. I'm too tired to write notes today and haven't run it yet; I'll make up for it when I have time.

2.2 Full training

2.2.1 Environment configuration

The Accelerate configuration is recipes/accelerate_configs/deepspeed_zero3.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
2.2.2 SFT training
  1. The model configuration file is recipes/zephyr-7b-beta/sft/config_full.yaml:
# Model arguments
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: true

# Data training arguments
dataset_mixer:
  HuggingFaceH4/ultrachat_200k: 1.0
dataset_splits:
- train_sft
- test_sft
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 2
gradient_checkpointing: true
hub_model_id: zephyr-7b-sft-full
hub_strategy: every_save
learning_rate: 2.0e-05
log_level: info
logging_steps: 5  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 2048
max_steps: -1
num_train_epochs: 1
output_dir: data/zephyr-7b-sft-full
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 32
push_to_hub: true
remove_unused_columns: true
report_to:
- tensorboard
save_strategy: "no"
save_total_limit: null
seed: 42
tf32: true
  2. The SFT training code is scripts/run_sft.py:
#!/usr/bin/env python
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Supervised fine-tuning script for decoder language models.
"""

import logging
import random
import sys

import datasets
import torch
import transformers
from transformers import set_seed

from accelerate import Accelerator
from alignment import (
    DataArguments,
    H4ArgumentParser,
    ModelArguments,
    SFTConfig,
    apply_chat_template,
    get_datasets,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
    get_tokenizer,
)
from trl import SFTTrainer


logger = logging.getLogger(__name__)


def main():
    parser = H4ArgumentParser((ModelArguments, DataArguments, SFTConfig))
    model_args, data_args, training_args = parser.parse()

    # Set seed for reproducibility
    set_seed(training_args.seed)

    accelerator = Accelerator()

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Model parameters {
      
      model_args}")
    logger.info(f"Data parameters {
      
      data_args}")
    logger.info(f"Training/evaluation parameters {
      
      training_args}")

    ###############
    # Load datasets
    ###############
    raw_datasets = get_datasets(data_args, splits=data_args.dataset_splits)
    logger.info(
        f"Training on the following datasets and their proportions: {[split + ' : ' + str(dset.num_rows) for split, dset in raw_datasets.items()]}"
    )

    ################
    # Load tokenizer
    ################
    tokenizer = get_tokenizer(model_args, data_args)

    #####################
    # Apply chat template
    #####################
    raw_datasets = raw_datasets.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer, "task": "sft"})
    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["test"]

    with training_args.main_process_first(desc="Log a few random samples from the processed training set"):
        for index in random.sample(range(len(raw_datasets["train"])), 3):
            logger.info(f"Sample {
      
      index} of the processed training set:\n\n{
      
      raw_datasets['train'][index]['text']}")

    #######################
    # Load pretrained model
    #######################
    logger.info("*** Load pretrained model ***")
    torch_dtype = (
        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
    )

    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        use_flash_attention_2=model_args.use_flash_attention_2,
        torch_dtype=torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map(),
        quantization_config=get_quantization_config(model_args),
    )
    logger.info("*** Model loaded! ***")

    ########################
    # Initialize the Trainer
    ########################
    trainer = SFTTrainer(
        model=model_args.model_name_or_path,
        model_init_kwargs=model_kwargs,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        max_seq_length=training_args.max_seq_length,
        tokenizer=tokenizer,
        packing=True,
        peft_config=get_peft_config(model_args),
    )

    ###############
    # Training loop
    ###############
    logger.info("*** Train ***")
    train_result = trainer.train()
    metrics = train_result.metrics
    max_train_samples = data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
    metrics["train_samples"] = min(max_train_samples, len(train_dataset))
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    ##########
    # Evaluate
    ##########
    if training_args.do_eval:
        logger.info("*** Evaluate ***")
        metrics = trainer.evaluate()
        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    ##################################
    # Save model and create model card
    ##################################
    logger.info("*** Save model ***")
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {
      
      training_args.output_dir}")

    # Save everything else on main process
    if accelerator.is_main_process:
        kwargs = {
            "finetuned_from": model_args.model_name_or_path,
            "dataset": list(data_args.dataset_mixer.keys()),
            "dataset_tags": list(data_args.dataset_mixer.keys()),
            "tags": ["alignment-handbook"],
        }
        trainer.create_model_card(**kwargs)
        # Restore k,v cache for fast inference
        trainer.model.config.use_cache = True
        trainer.model.config.save_pretrained(training_args.output_dir)

        if training_args.push_to_hub is True:
            logger.info("Pushing to hub...")
            trainer.push_to_hub()

    accelerator.wait_for_everyone()


if __name__ == "__main__":
    main()
2.2.3 DPO training
  1. The environment configuration file is the same as for SFT training.
  2. The model configuration file is recipes/zephyr-7b-beta/dpo/config_full.yaml:
# Model arguments
model_name_or_path: alignment-handbook/zephyr-7b-sft-full

# Data training arguments
# For definitions, see: src/h4/training/config.py
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 12

# DPOTrainer arguments
bf16: true
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 1
gradient_checkpointing: true
hub_model_id: zephyr-7b-dpo-full
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 3
optim: rmsprop
output_dir: data/zephyr-7b-dpo-full
per_device_train_batch_size: 8
per_device_eval_batch_size: 4
push_to_hub: true
save_strategy: "no"
save_total_limit: null
seed: 42
warmup_ratio: 0.1
  3. The DPO training code is scripts/run_dpo.py:
#!/usr/bin/env python
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import sys

import torch
import transformers
from transformers import AutoModelForCausalLM, set_seed

from accelerate import Accelerator
from alignment import (
    DataArguments,
    DPOConfig,
    H4ArgumentParser,
    ModelArguments,
    apply_chat_template,
    get_datasets,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
    get_tokenizer,
    is_adapter_model,
)
from peft import PeftConfig, PeftModel
from trl import DPOTrainer


logger = logging.getLogger(__name__)


def main():
    parser = H4ArgumentParser((ModelArguments, DataArguments, DPOConfig))
    model_args, data_args, training_args = parser.parse()

    #######
    # Setup
    #######
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.info(f"Model parameters {
      
      model_args}")
    logger.info(f"Data parameters {
      
      data_args}")
    logger.info(f"Training/evaluation parameters {
      
      training_args}")

    # Set seed for reproducibility
    set_seed(training_args.seed)

    # Increase distributed timeout to 3h to enable push to Hub to complete
    accelerator = Accelerator()

    ###############
    # Load datasets
    ###############
    raw_datasets = get_datasets(data_args, splits=data_args.dataset_splits)
    logger.info(
        f"Training on the following splits: {[split + ' : ' + str(dset.num_rows) for split, dset in raw_datasets.items()]}"
    )
    column_names = list(raw_datasets["train"].features)

    #####################################
    # Load tokenizer and process datasets
    #####################################
    data_args.truncation_side = "left"  # Truncate from left to ensure we don't lose labels in final turn
    tokenizer = get_tokenizer(model_args, data_args)

    #####################
    # Apply chat template
    #####################
    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
    )

    # Replace column names with what TRL needs, text_chosen -> chosen and text_rejected -> rejected
    for split in ["train", "test"]:
        raw_datasets[split] = raw_datasets[split].rename_columns(
            {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
        )

    torch_dtype = (
        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
    )
    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        use_flash_attention_2=model_args.use_flash_attention_2,
        torch_dtype=torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map(),
        quantization_config=get_quantization_config(model_args),
    )

    model = model_args.model_name_or_path
    if is_adapter_model(model, model_args.model_revision):
        # load the model, merge the adapter weights and unload the adapter
        # Note: to run QLora, you will need to merge the based model separately as the merged model in 16bit
        logger.info(f"Merging peft adapters for {
      
      model_args.model_name_or_path=}")

        peft_config = PeftConfig.from_pretrained(model_args.model_name_or_path, revision=model_args.model_revision)

        model_kwargs = dict(
            revision=model_args.base_model_revision,
            trust_remote_code=model_args.trust_remote_code,
            use_flash_attention_2=model_args.use_flash_attention_2,
            torch_dtype=torch_dtype,
            use_cache=False if training_args.gradient_checkpointing else True,
        )
        base_model = AutoModelForCausalLM.from_pretrained(
            peft_config.base_model_name_or_path,
            **model_kwargs,
        )
        model = PeftModel.from_pretrained(
            base_model, model_args.model_name_or_path, revision=model_args.model_revision
        )
        model.eval()
        model = model.merge_and_unload()
        model_kwargs = None

    ref_model = model
    ref_model_kwargs = model_kwargs

    if model_args.use_peft is True:
        ref_model = None
        ref_model_kwargs = None

    #########################
    # Instantiate DPO trainer
    #########################
    dpo_trainer = DPOTrainer(
        model,
        ref_model,
        model_init_kwargs=model_kwargs,
        ref_model_init_kwargs=ref_model_kwargs,
        args=training_args,
        beta=training_args.beta,
        train_dataset=raw_datasets["train"],
        eval_dataset=raw_datasets["test"],
        tokenizer=tokenizer,
        max_length=training_args.max_length,
        max_prompt_length=training_args.max_prompt_length,
        peft_config=get_peft_config(model_args),
    )

    ###############
    # Training loop
    ###############
    train_result = dpo_trainer.train()
    metrics = train_result.metrics
    max_train_samples = (
        data_args.max_train_samples if data_args.max_train_samples is not None else len(raw_datasets["train"])
    )
    metrics["train_samples"] = min(max_train_samples, len(raw_datasets["train"]))
    dpo_trainer.log_metrics("train", metrics)
    dpo_trainer.save_metrics("train", metrics)
    dpo_trainer.save_state()

    logger.info("*** Training complete ***")

    ##########
    # Evaluate
    ##########
    if training_args.do_eval:
        logger.info("*** Evaluate ***")
        metrics = dpo_trainer.evaluate()
        max_eval_samples = (
            data_args.max_eval_samples if data_args.max_eval_samples is not None else len(raw_datasets["test"])
        )
        metrics["eval_samples"] = min(max_eval_samples, len(raw_datasets["test"]))
        dpo_trainer.log_metrics("eval", metrics)
        dpo_trainer.save_metrics("eval", metrics)

    ##################################
    # Save model and create model card
    ##################################
    dpo_trainer.save_model(training_args.output_dir)
    # Save everything else on main process
    if accelerator.is_main_process:
        kwargs = {
            "finetuned_from": model_args.model_name_or_path,
            "dataset": list(data_args.dataset_mixer.keys()),
            "dataset_tags": list(data_args.dataset_mixer.keys()),
            "tags": ["alignment-handbook"],
        }
        dpo_trainer.create_model_card(**kwargs)
        # Restore k,v cache for fast inference
        dpo_trainer.model.config.use_cache = True
        dpo_trainer.model.config.save_pretrained(training_args.output_dir)
        if training_args.push_to_hub is True:
            dpo_trainer.push_to_hub()

    # Ensure we don't timeout on model save / push to Hub
    logger.info("*** Waiting for all processes to finish ***")
    accelerator.wait_for_everyone()

    logger.info("*** Run complete! ***")


if __name__ == "__main__":
    main()

2.3 LoRA training

  The training scripts for LoRA training are exactly the same as for full training; only the model configs (config_lora.yaml instead of config_full.yaml) and the environment configuration differ. Because this is just LoRA fine-tuning, there is no need to enable ZeRO stage 3, so the environment configuration is recipes/accelerate_configs/multi_gpu.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
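
  For intuition about what the LoRA recipe changes, the PEFT-related fields in config_lora.yaml are presumably turned into a peft LoraConfig by the handbook's get_peft_config helper used in both scripts above; a hypothetical equivalent in code, with illustrative values only:

from peft import LoraConfig

# Illustrative values: the real hyperparameters live in
# recipes/zephyr-7b-beta/{sft,dpo}/config_lora.yaml
peft_config = LoraConfig(
    r=16,                     # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)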

2.4 Testing

  The test part can be found in the test folder under the alignment-handbook project. The author has not uploaded the relevant documentation yet, so you can continue to track the progress.
