Implementing a ChatGPT-like model with RLHF from scratch: parsing Microsoft DeepSpeed Chat line by line

A note before we begin

This content was first written in April 2023 as part of the article "Implementing ChatGPT with RLHF from scratch: from TRL/ChatLLaMA/ColossalChat to DeepSpeed Chat". Later, when I had to cover DSC in the "Large Model Project Development Offline Camp", the DSC part at the end of that article was split out into this standalone article to keep the original from becoming too long.

Preface

As mentioned in that article, Microsoft's open-source DeepSpeed Chat (DSC for short) is well implemented. Its ChatGPT-style three-stage training pipeline gives you a complete and transparent, code-level view of the "PPO algorithm / RLHF" process (good material really can help you get twice the result with half the effort).

Once you dig into its key code, you will find that it corresponds one-to-one with the principles written up in another article on this blog (if you have not read the principles yet, it is recommended to read that article first: "Analysis of ChatGPT technical principles"; only by understanding the principles, especially its third part, can you really understand the implementation). This one-to-one correspondence between papers, principles/algorithms, formulas and code can bring a qualitative change to your understanding.

DSC makes full use of the optimizations of the DeepSpeed project. During the RLHF stage the actor has to switch repeatedly between train mode (parameter updates) and eval mode (experience collection); without optimization the overall speed would be very slow, yet DeepSpeed's existing training acceleration and inference acceleration were two disconnected solutions. DSC therefore designed an engine called DeepSpeedHybridEngine, which lets the actor enjoy both training and inference acceleration during the RLHF stage and improves the overall RLHF speed.

To sum it up in one sentence: DeepSpeed came along to speed up RLHF, and the result is DeepSpeed Chat.

Note: "Spring", a student of the July online ChatGPT class, wrote up this project in great detail (in the more than five months since the beginning of the year, apart from the ChatGPT series on this blog, Spring's DeepSpeed Chat analysis is the only write-up I have personally seen that is deep and detailed enough; partly because the technology is very new, partly because there are so many details involved), so most of the analysis in this article is adapted from his write-up.

In general, the three-stage training of DeepSpeed Chat is similar to that of InstructGPT. The three stages are referred to as phase1, phase2 and phase3 below.

The following is a brief description of the three stages of training:

Part 1 DSC phase-1: Supervised Finetuning

1.1 SFT training process

The core code of phase1 can be found at applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py. The training process is shown in the figure below; the numbers in parentheses in the following steps refer to the numbered calls in that diagram.

  1. Load the tokenizer (1-2)
  2. Load the base model (currently only some CausalLM models are supported) (3-4)
  3. Decide whether LoRA is enabled based on whether lora_dim (the low-rank dimension of LoRA) is set. If enabled, the base model structure is LoRA-transformed (details later) and the transformed model is returned (5-6)
  4. Decide whether "update LoRA parameters only" is enabled. If enabled, freeze the remaining parameters and return the frozen model (7-8)
  5. Get the Dataset (9-10)
  6. Instantiate the DataLoader (11)
  7. Wrap the model and related objects with DeepSpeed's DeepSpeedEngine optimization (12)
  8. Before formal training starts, run a metric evaluation; the chosen metric is perplexity (13-14)
  9. Start training and loop over epochs (a minimal sketch of this loop is given right after this list)
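
To make step 9 concrete, here is a minimal sketch (not the verbatim DSC source) of what the per-epoch training loop in step1's main.py boils down to, assuming model is the DeepSpeedEngine-wrapped CausalLM and train_dataloader yields dicts with input_ids, attention_mask and labels:

import torch

def train_one_epoch(model, train_dataloader, device):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        # a CausalLM returns the cross-entropy loss directly when labels are provided
        outputs = model(**batch, use_cache=False)
        loss = outputs.loss
        # a DeepSpeedEngine exposes backward()/step() in place of loss.backward()/optimizer.step()
        model.backward(loss)
        model.step()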

1.2 Explanation on LoRA and Perplexity

There are two details in the above process that are worth mentioning:

  1. For a detailed explanation of LoRA, please see Section 2.2.3 of the article "Alpaca-LoRA: fine-tuning the LLaMA-based Alpaca on consumer-grade GPUs through the PEFT library"
  2. DeepSpeed-Chat chose perplexity as the evaluation metric during phase1 training.
    Perplexity measures how well a trained language model fits the test data. For each token of the output sentence you can obtain the model's confidence (probability) for that token; multiplying these probabilities and taking the reciprocal of their geometric mean gives the perplexity. Expressed as a formula:
    \text { perplexity }=\left(\prod_{t=1}^{T} p_{t}\right)^{-\frac{1}{T}}
    where the output sentence has T tokens in total and p_t is the confidence probability of the t-th token.

    The training of a CausalLM is usually optimized with a negative log-likelihood loss:
    \text { loss }=-\frac{1}{T} \sum_{t=1}^{T} \log p_{t}
    where, again, the output sentence has T tokens and p_t is the confidence probability of the t-th token.

    Therefore perplexity and the CausalLM loss are related by
    \text { perplexity }=\exp (\text { loss })
    The perplexity computation in the relevant source code is based on exactly this formula: the validation data is first fed into the model to obtain the loss, and the perplexity is then computed through the exponential relation between perplexity and loss (a short numeric sanity check follows the source code below).
        def evaluation(model, eval_dataloader):
            """
            Run validation with perplexity as the evaluation metric
            """
            model.eval()
            losses = 0
            for step, batch in enumerate(eval_dataloader):
                """
                batch: a dict made up of input_ids, attention_mask and labels,
                each of shape (bs, max_seq_len)
                """
                batch = to_device(batch, device)
                with torch.no_grad():
                    outputs = model(**batch)

                """The loss of a Causal LM is the cross-entropy loss"""
                loss = outputs.loss
                losses += loss.float()
            losses = losses / (step + 1)

            try:
                """Perplexity can usually be computed as exp(CELoss)"""
                perplexity = torch.exp(losses)
            except OverflowError:
                perplexity = float("inf")

            try:
                """
                - get_all_reduce_mean calls torch.distributed.all_reduce(perplexity, op=torch.distributed.ReduceOp.SUM)
                - which sums the perplexity over all processes, i.e. GPUs (usually one process drives one GPU)
                - and then divides by the global world size torch.distributed.get_world_size() to get the mean perplexity
                """
                perplexity = get_all_reduce_mean(perplexity).item()
            except:
                pass
            return perplexity
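
As a quick numeric sanity check of the exp(loss) relation (a toy example, not part of the DSC source), the direct definition of perplexity and exp of the mean negative log-likelihood give the same value:

import math

# toy per-token probabilities for a 4-token output sentence
p = [0.5, 0.25, 0.8, 0.1]
T = len(p)

# direct definition: reciprocal of the geometric mean of the token probabilities
ppl_direct = math.prod(p) ** (-1.0 / T)

# via the loss: mean negative log-likelihood, then exponentiate
loss = -sum(math.log(x) for x in p) / T
ppl_from_loss = math.exp(loss)

print(ppl_direct, ppl_from_loss)  # both are about 3.1623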

Part 2 DSC phase-2: Reward Model Finetuning

2.1 Training data

The model is trained with a ranking loss on data pairs like the examples below, finally yielding an RM (Reward Model) with a human-like scoring ability.

  • chosen_sentence: the complete dialogue preferred by humans, obtained by concatenating the prompt with the chosen response; used in phase1 and phase2. Sample: "Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products."
  • reject_sentence: the complete dialogue rejected by humans, obtained by concatenating the prompt with the rejected response; used in phase2. Sample: "Human: Please tell me about Microsoft in a few sentence? Assistant: I'm not sure what you mean."

2.2 Training process

The general training process of phase2 is shown in the UML sequence diagram below; the numbers in parentheses in the following steps refer to the numbered calls in that diagram:

  1. Load tokenizer(1-2)
  2. Load the model (rm_model), which involves certain structural changes (3-8)
  3. Determine whether LoRA technology is enabled based on whether lora_dim (the low-rank dimension of LoRA) is set. If enabled, the base model structure will be LoRA transformed (see details later), and the transformed model will be returned (9-10)
  4. Determine whether "Only update LoRA parameters" is enabled. If enabled, freeze the remaining structural parameters and return the frozen model (11-12)
  5. Get Dataset(13-14)
  6. Instantiate DataCollator to further organize the loaded data (15-16)
  7. Instantiate DataLoader(17)
  8. Use DeepSpeed's optimization technology DeepSpeedEngine to wrap objects such as rm_model (18)
  9. Before starting formal training, the indicator evaluation is first carried out. The selected indicator is the accuracy of the sorting results (19-20)
  10. Start training, epoch cycle:

2.3 Detailed explanation of key codes

2.3.1 Specific structure of RM

First, the transformers AutoModel class is used to load the backbone of the specified model (i.e. the network without an output head); then a linear layer that projects from hidden_size down to 1 is introduced. This linear layer serves as the output head of the backbone and outputs one scalar score for each position of the input sequence.

# applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py
"""
rm_model is loaded via create_critic_model;
by default rm_model has dropout disabled
"""
rm_model = create_critic_model(···)

# applications/DeepSpeed-Chat/training/utils/model/model_utils.py
def create_critic_model(···):
    """The model is read here with "AutoModel", so critic_model is only the backbone"""
    critic_model = create_hf_model(AutoModel, ···)

    """
    Passing critic_model into RewardModel adds the linear output head,
    so the critic_model here has the structure "v_head + backbone"
    """
    critic_model = RewardModel(critic_model, ···)
    ...
    return critic_model

# applications/DeepSpeed-Chat/training/utils/model/reward_model.py
class RewardModel(nn.Module):
    """
    Adapts the loaded model into a RewardModel:
    in short, the loaded backbone is used for feature extraction,
    and the extracted features (the last layer's per-position hidden_states)
    are passed to a linear layer that outputs a single value per position.
    That value is the score, so every position along max_seq_len gets one score.
    """
    def __init__(self, base_model, ...):
        super().__init__()
        ···
        if hasattr(self.config, "word_embed_proj_dim"):
            """
            For the OPT family, word_embed_proj_dim is the output dimension of the embedding layer,
            which in a transformer model usually equals hidden_size;
            v_head predicts a score from the backbone's output features hidden_state,
            producing max_seq_len scores in total
            """
            self.v_head = nn.Linear(self.config.word_embed_proj_dim,
                                    1,
                                    bias=False)
        ···
        """base_model is the backbone, so the RM consists of 1 backbone plus 1 linear layer"""
        self.rwtranrsformer = base_model

The resulting RM structure is roughly as follows (the base model here is "facebook/opt-125m"); it consists of the backbone network rwtranrsformer (spelled that way in the source) and the output head v_head:

RewardModel(
  (v_head): Linear(in_features=768, out_features=1, bias=False)
  (rwtranrsformer): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
)
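
As an illustration of this "backbone + scalar head" structure (a minimal sketch, not DSC's create_critic_model code; the model name and input below are just for demonstration), the same computation can be reproduced in a few lines:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

base = AutoModel.from_pretrained("facebook/opt-125m")        # backbone only, no LM head
v_head = nn.Linear(base.config.word_embed_proj_dim, 1, bias=False)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
batch = tokenizer(["Human: hi Assistant: hello"], return_tensors="pt")

hidden = base(**batch)[0]              # (bs, seq_len, hidden_size)
rewards = v_head(hidden).squeeze(-1)   # (bs, seq_len): one score per position
print(rewards.shape)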

2.3.2 Input form required by DataCollator and RM

The data collator used in phase2 is DataCollatorReward(). A single sample fetched in this phase is actually a chosen-rejected data pair (see the code block below).

That is, one batch fetches batch_size data pairs, and data_collator splits each pair into a chosen_sentence and a reject_sentence (one example becomes two), so the actual number of samples fed into the model per batch is batch_size * 2.

# applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py
"""phase2 uses DataCollatorReward() as its data_collator"""
data_collator = DataCollatorReward()

# applications/DeepSpeed-Chat/training/utils/data/data_utils.py
class DataCollatorReward:
    def __call__(self, data):
        """
        Further organize the data fetched by the dataloader into batch input form.
        The exact layout of the argument data is shown in the next code block.
        """
        batch = {}

        """f is one tuple in data; its 0th and 2nd elements are
        the input_ids of chosen_sentence and reject_sentence respectively"""
        batch["input_ids"] = torch.cat([f[0] for f in data] +
                                       [f[2] for f in data],
                                       dim=0)

        """f is one tuple in data; its 1st and 3rd elements are
        the attention_mask of chosen_sentence and reject_sentence respectively"""
        batch["attention_mask"] = torch.cat([f[1] for f in data] +
                                            [f[3] for f in data],
                                            dim=0)

        """the exact layout of batch is shown in the next code block"""
        return batch

The input data is a list with one element per pair in the batch; each element is a chosen-rejected tuple:

    (
	 chosen_sentence_input_ids, 
	 chosen_sentence_attention_mask,
	 reject_sentence_input_ids,
	 reject_sentence_attention_mask
	)

The 0th and 2nd elements of each set of data are input_ids, and the 1st and 3rd elements are attention_mask.

The output batch is a dictionary: {"input_ids": tensor([...]), "attention_mask": tensor([...])}
and in the dictionary value, chosen is in the first half and rejected is in the second half:

    {
    "input_ids": [
                  chosen_sentence_1_input_ids,
                  chosen_sentence_2_input_ids,
                  ...,
                  reject_sentence_1_input_ids,
                  reject_sentence_2_input_ids,
                  ...
                 ]
    "attention_mask": [
                       chosen_sentence_1_attention_mask,
                       chosen_sentence_2_attention_mask,
                       ...,
                       reject_sentence_1_attention_mask,
                       reject_sentence_2_attention_mask,
                       ...
                      ]
        
    }

When this batch is later fed into the model, it is simply split into a first half (chosen) and a second half (rejected) whose entries line up index by index, recovering the corresponding chosen-rejected pairs.
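
To make the batch_size * 2 effect concrete, here is a small demonstration with dummy tensors (assuming the DataCollatorReward class above is in scope; the token ids are arbitrary):

import torch

# two chosen-rejected pairs; each tuple is
# (chosen_ids, chosen_mask, reject_ids, reject_mask), each of shape (1, seq_len)
data = [
    (torch.tensor([[11, 22, 33]]), torch.ones(1, 3, dtype=torch.long),
     torch.tensor([[11, 22, 40]]), torch.ones(1, 3, dtype=torch.long)),
    (torch.tensor([[12, 23, 34]]), torch.ones(1, 3, dtype=torch.long),
     torch.tensor([[12, 23, 41]]), torch.ones(1, 3, dtype=torch.long)),
]

batch = DataCollatorReward()(data)
print(batch["input_ids"].shape)   # torch.Size([4, 3]): 2 pairs -> 4 sequences
print(batch["input_ids"][:2])     # first half: the chosen sentences
print(batch["input_ids"][2:])     # second half: the rejected sentences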

2.3.3 Reward design and pairwise ranking loss of the entire conversation

The forward propagation process of RM is not complicated. In general, it is:

  1. The data passes through the backbone network to obtain the last layer output feature hidden_states with shape (bs*2, max_seq_len, hidden_size);
  2. Then send the output features to the linear layer v_head to get the score rewards with shape (bs*2, max_seq_len)

The more complicated part is actually "computation of pairwise ranking loss" and "score aggregation design"

2.3.3.1 Pairwise Ranking Loss

\operatorname{loss}(\theta)=\mathrm{E}_{\left(x, y_{c}, y_{r}\right) \sim D}\left[-\log \left(\sigma\left(r_{\theta}\left(x, y_{c}\right)-r_{\theta}\left(x, y_{r}\right)\right)\right)\right]

Here r_\theta is the RM, x is the prompt, y_c is the chosen response, y_r is the rejected response, and (x, y_c) and (x, y_r) are chosen_sentence and reject_sentence respectively.
The purpose of this loss is to maximize the gap between the score of "chosen / better / ranked higher" and that of "rejected / worse / ranked lower", pushing r_\theta to learn the corresponding ranking pattern.

In DeepSpeed-Chat's implementation, r_\theta(x,y_c) and r_\theta(x,y_r) are taken over the aligned part of the answers of chosen_sentence and reject_sentence respectively. The textual description is a bit abstract; the example below should make the concept clearer:

max_seq_len is 10, pad_token_id is 0,
and chosen_sentence and reject_sentence belong to the same prompt:
prompt: [11, 22, 33]
chosen_sentence: [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
reject_sentence: [11, 22, 33, 40, 50, 0, 0, 0, 0, 0]

"The aligned part of the two answers" means "neither the prompt part nor the padding part, but with the lengths aligned":
chosen_truncated: [44, 55, 66]
reject_truncated: [40, 50, 0]

So when chosen_sentence's answer is the longer one, as above, the corresponding slice of reject_sentence is taken up to the same length as the chosen part;
likewise, if reject_sentence's answer is longer, the same rule applies the other way around.

To obtain the "aligned part" described above, the code performs some rather obscure index operations. As long as you understand that their ultimate purpose is to obtain the rewards of the aligned segments of chosen_sentence and reject_sentence, the loss follows naturally; a rough sketch of these index operations is given below.
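
As a purely illustrative sketch (the variable names follow the excerpt further below, but this is not the verbatim DSC index logic), the elided index operations amount to something like the following, using the example sequences from above:

import torch

PAD_ID = 0
chosen_id   = torch.tensor([11, 22, 33, 44, 55, 66, 0, 0, 0, 0])
rejected_id = torch.tensor([11, 22, 33, 40, 50, 0, 0, 0, 0, 0])
chosen_reward   = torch.randn(10)   # per-position scores from v_head (dummy values)
rejected_reward = torch.randn(10)

# c_ind / r_ind: index of the first pad token after each answer
c_inds = (chosen_id == PAD_ID).nonzero()
c_ind = c_inds[0].item() if len(c_inds) > 0 else len(chosen_id)
r_inds = (rejected_id == PAD_ID).nonzero()
r_ind = r_inds[0].item() if len(r_inds) > 0 else len(rejected_id)
end_ind = max(c_ind, r_ind)                 # 6 in this example

# divergence_ind: first position where the two sequences differ (start of the answers)
divergence_ind = (chosen_id != rejected_id).nonzero()[0].item()   # 3 in this example

# rewards over the aligned segment, and the pairwise ranking loss on that segment
c_truncated_reward = chosen_reward[divergence_ind:end_ind]
r_truncated_reward = rejected_reward[divergence_ind:end_ind]
loss = -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()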

2.3.3.2 Dialogue reward design

Although the rewards of the "aligned part" are used to compute the pairwise ranking loss, the RM's prediction score for a conversation is actually the reward of the conversation's last valid token (usually the "end token"). The block below gives a simple example of this.

pad_token_id = 0
conversation = [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
conversation_rewards = [2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62]
The token with token_id 66 is the last valid token of this conversation,
so its reward, 2.25, is used to represent the reward of the entire conversation.
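
A two-line runnable version of this lookup (illustrative only; the real index logic appears in the code below, and this assumes the sequence is right-padded):

import torch

pad_token_id = 0
conversation = torch.tensor([11, 22, 33, 44, 55, 66, 0, 0, 0, 0])
rewards      = torch.tensor([2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62])

last_valid = (conversation != pad_token_id).sum() - 1   # index 5, i.e. the token 66
print(rewards[last_valid])                              # tensor(2.2500)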

The overall code is as follows

# applications/DeepSpeed-Chat/training/utils/model/reward_model.py
class RewardModel(nn.Module):
    def __init__(self, ···):
        ···
    ···
    def forward(self, input_ids=None, ···):
        """Get the output features of the backbone network"""
        transformer_outputs = self.rwtranrsformer(···)

        """
        Take the output features of the last layer
        hidden_states.shape: (bs*2, max_seq_len, hidden_size)
        """
        hidden_states = transformer_outputs[0]

        """
        Feed the features into the linear head to get the regression scores
        rewards.shape: (bs*2, max_seq_len)
        """
        rewards = self.v_head(hidden_states).squeeze(-1)

        """As mentioned earlier, the actual bs is half of the input batch size"""
        bs = input_ids.shape[0] // 2

        """Separate chosen and rejected"""
        chosen_ids = input_ids[:bs]
        rejected_ids = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]

        chosen_mean_scores = []
        rejected_mean_scores = []

        loss = 0
        for i in range(bs):
            """
            Take the token ids and per-position scores of the i-th chosen/rejected pair
            chosen_id.shape: (max_seq_len, )
            """
            chosen_id = chosen_ids[i]
            rejected_id = rejected_ids[i]
            chosen_reward = chosen_rewards[i]
            rejected_reward = rejected_rewards[i]

            """
            Various index-related operations are omitted here for readability;
            they are only bookkeeping and not directly related to the underlying principle
            """

            """
            c_ind is the index of the first pad_token after the answer of chosen_sentence,
            e.g. with pad_token_id=0 and sentence [11,22,33,44,55,66,0,0,0,0], c_ind is the index of the first pad token, i.e. 6
            """
            c_ind = ···

            """
            Similarly, r_ind is the index of the first pad_token after the answer of reject_sentence
            """
            r_ind = ···

            """end_ind is the larger of the two"""
            end_ind = max(c_ind, r_ind)

            # the index where chosen and rejected first differ, i.e. "the index of the first freely generated token of the two responses"
            """divergence_ind is the index of the first answer token of chosen_sentence / reject_sentence"""
            divergence_ind = ···

            """
            Starting from where chosen_sentence and reject_sentence first differ and ending where generation ends,
            take the scores of both sequences over this segment.
            This is exactly the "aligned part" mentioned in the previous code block.
            """
            c_truncated_reward = chosen_reward[divergence_ind:end_ind]
            r_truncated_reward = rejected_reward[divergence_ind:end_ind]

            """
            (c_truncated_reward - r_truncated_reward).shape: (truncated_seq_len,)
            The loss takes the rank-loss form and is computed over the "aligned segment" of chosen and rejected
            """
            loss += -torch.log(
                torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()

            """
            Take the score of the position just before the pad token that marks the end
            (i.e. the position of the last valid token) as the reference score
            """
            chosen_mean_scores.append(
                chosen_reward[c_ind - 1])  # use the end score for reference
            rejected_mean_scores.append(rejected_reward[r_ind - 1])

        loss = loss / bs
        chosen_mean_scores = torch.stack(chosen_mean_scores)
        rejected_mean_scores = torch.stack(rejected_mean_scores)

        """Return the loss and the reference scores"""
        return {
            "loss": loss,
            "chosen_mean_scores": chosen_mean_scores,
            "rejected_mean_scores": rejected_mean_scores,
        }
    ···
   ···

2.3.4 Evaluation metric of phase2

The evaluation metric used by DeepSpeed-Chat in phase2 is the ranking accuracy. The main process is:

  1. Feed several chosen-rejected data pairs (split into chosen_sentence and reject_sentence by the data_collator along the way) into the RM for inference and obtain the score of each sentence;
  2. Compare the chosen_sentence score and the reject_sentence score belonging to the same prompt: if the chosen_sentence score is greater than the reject_sentence score it counts as a "correct prediction", otherwise as a "wrong prediction";
  3. Count the correct predictions and compute the accuracy as the evaluation metric;
  4. In addition, the average chosen_sentence score "scores" is computed during evaluation for reference.
def evaluation_reward(model, eval_dataloader):
    model.eval()
    """count the correctly predicted (scored) results,
    i.e. the number of results where chosen_reward > rejected_reward"""
    correct_predictions = 0

    """count the total number of predictions"""
    total_predictions = 0
    scores = 0
    for step, batch in enumerate(eval_dataloader):
        batch = to_device(batch, device)
        with torch.no_grad():
            """outputs: {'loss':tensor(),
                        'chosen_mean_scores':tensor(bs,),
                        'rejected_mean_scores':tensor(bs,)}"""
            outputs = model(**batch)

        """chosen.shape: (bs,)"""
        chosen = outputs["chosen_mean_scores"]

        """rejected.shape: (bs,)"""
        rejected = outputs["rejected_mean_scores"]

        """a "correct scoring" means the chosen score is greater than the rejected score"""
        correct_predictions += (chosen > rejected).sum()
        total_predictions += chosen.shape[0]

        """accumulate each step's mean chosen score"""
        scores += outputs["chosen_mean_scores"].mean().float()

        if step == 99:  # For faster evaluation and debugging
            break
    """compute the acc metric"""
    acc = correct_predictions / total_predictions

    """compute the mean chosen score up to the current step"""
    scores = scores / (step + 1)
    try:
        """sum and average the results across processes"""
        acc = get_all_reduce_mean(acc).item()
        scores = get_all_reduce_mean(scores).item()
    except:
        pass
    return scores, acc

One last point about the RM: in DeepSpeed-Chat's implementation, the RM's prediction score for a conversation is based on the reward of the conversation's last valid token. This is by no means the only possible choice: how to aggregate per-token rewards into a single conversation score is an open design question, and the DeepSpeed-Chat team simply adopted this particular strategy. Users can design their own aggregation, such as averaging the rewards over the answer part, or passing the per-token reward sequence through an additional fully connected layer to obtain an aggregated reward, and so on.

In our implementation, we use either the end token of the sequence or the first padding token as the aggregated score and compare them. Others may also use the average score for the entire answer as an alternative.
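
Purely as an illustration of the "average score for the entire answer" alternative mentioned above (this is not what DeepSpeed-Chat does), mean-pooling the per-token rewards over the answer span could look like this, reusing the toy numbers from the earlier example:

import torch

pad_token_id = 0
prompt_length = 3
seq     = torch.tensor([11, 22, 33, 44, 55, 66, 0, 0, 0, 0])
rewards = torch.tensor([2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62])

answer_mask = (seq != pad_token_id)
answer_mask[:prompt_length] = False          # keep only the answer tokens
mean_answer_score = rewards[answer_mask].mean()
print(mean_answer_score)                     # (0.66 + 0.33 + 2.25) / 3 ≈ 1.08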

Part 3 DSC phase-3: RLHF Finetuning

This part is adapted from the third section of the DeepSpeed Chat analysis written by Spring, the student from the July online ChatGPT class mentioned earlier.

3.1 Training data of RLHF

  • prompt: a description of the current context, the instruction input from which the model generates; colloquially, the "question". Used in phase3. Sample: "Human: Please tell me about Microsoft in a few sentence? Assistant: " (a text example is given for readability; in practice it is input_ids)
  • seq: the complete dialogue sequence generated by the actor from the prompt input. Sample: "Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a world-renowned company." (text example for readability; in practice it is input_ids)
  • logprobs: the actor's log-probabilities (the log of its policy) over seq. Shape: the raw logits would be (seq_bs, max_seq_len, vocab_size); after gather processing only the log-probability of the actual label token is kept, giving (seq_bs, max_seq_len, 1)
  • ref_logprobs: the reference/SFT model's log-probabilities over seq. Shape: likewise (seq_bs, max_seq_len, vocab_size) before gather and (seq_bs, max_seq_len, 1) after
  • value: the critic's value estimate for each position of the sequence, based on seq. Shape: (seq_bs, max_seq_len)
  • reward: the (environmental) reward given by the reward/RM model for the whole conversation seq; in the actual code a β-weighted KL penalty term is added later. Shape: (seq_bs,)
  • attention_mask: used to mask out non-valid elements. Shape: (seq_bs, max_seq_len)

There are two points worth mentioning:

  1. Different frameworks do not define experience data in exactly the same way. For example, the experience data defined by ColossalChat has two more items than here, "adv" and "reward" (this reward is not the reward above: ColossalChat's reward refers to the "KL_Reward" that has already been corrected by the KL-divergence penalty). They are essentially the same; the frameworks just draw the boundary differently, because adv (the advantage) and KL_Reward can both be computed from the existing items logprobs, ref_logprobs, reward and value.
  2. From the perspective of code efficiency, ColossalChat's definition of experience data is arguably the tighter one: since the advantage adv and the KL-penalized reward can be computed from the basic experience data, they can be computed once during the experience-generation stage. DeepSpeed-Chat
    instead arranges to compute them during the training stage, once per PPO iteration. But the advantages and KL-penalized rewards are computed from the basic experience data, which is already fixed once experience generation finishes, so they do not change across PPO iterations over the same batch; DeepSpeed-Chat therefore recomputes adv and the KL-penalized reward redundantly (the relevant team will presumably adjust the order of this computation in the future).

3.2 The entire training process of RLHF

The entire RLHF training process is shown in the figure below:

  1. Load the tokenizer (1-2);
  2. Get the Datasets and instantiate the DataCollator (3-9): get the Dataset of prompts used for experience collection (4-5); if unsupervised training is enabled, also get the Dataset of unsupervised data (6-7); and instantiate the DataCollator, which further organizes the loaded data.

    data_collator is instantiated from DataCollatorRLHF. This class mainly implements "pad to max_prompt_len (by default half of max_seq_len), then flip". Why does the prompt token sequence need this extra flip operation?
    The reason lies in how the prompt is used in phase3: the prompt is fed into the actor model, and the actor autoregressively generates the continuation from it, which is how experience is collected.

    Take an actor whose base model is opt-125m as an example. The maximum sequence length (max_seq_len) this model supports is 512, and phase3 also presets a maximum prompt length (max_prompt_len), usually half of max_seq_len, i.e. 256; the remaining half of the length is reserved for generation.
    When an input prompt is shorter than max_prompt_len, it has to be padded (this is reflected in phase3's data_collator code). Padding is usually appended to the end of the sequence, so the padded input would take the form [prompt, padding], and the autoregressive generation would then have to continue after the pad tokens, which is unreasonable.
    Therefore the prompt is first flipped, then padded, then flipped back, so the padded input takes the form [padding, prompt]; the autoregressive generation then continues directly from the prompt content, which is reasonable (a runnable sketch of this trick follows this list).

    The following example should make the purpose of this operation clearer.
    max_prompt_len = 5
    pad_token_id = 0
    
    prompt_token_ids = [233, 11, 22]
    # padding at the end ×
    prompt_token_ids.padding() = [233, 11, 22, 0, 0]
    
    prompt_token_ids.flip(0) = [22, 11, 233]
    prompt_token_ids.flip(0).padding() = [22, 11, 233, 0, 0]
    # padding at the front √
    prompt_token_ids.flip(0).padding().flip(0) = [0, 0, 233, 11, 22]
  3. Instantiate the DataLoader (10);
  4. Use DeepSpeedRLHFEngine() to load the various models needed for PPO training (actor, ref/SFT, critic, reward/RM) and wrap them, obtaining rlhf_engine (11-12);
  5. Instantiate the PPO training manager trainer (13-14);
  6. Instantiate the MiniDataset used for PPO training (unlike the Dataset above, which provides the data for each outer round, MiniDataset further manages the data provided by the Dataset and allocates it to the PPO rounds, i.e. the inner training rounds) (15-16);
  7. Start training: outer epoch loop (prompt_epoch)
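
As referenced in step 2 above, here is a runnable sketch of the flip-pad-flip trick in plain PyTorch (F.pad is used here just for illustration; the real DataCollatorRLHF has its own padding code):

import torch
import torch.nn.functional as F

pad_token_id = 0
max_prompt_len = 5
prompt = torch.tensor([233, 11, 22])

# naive right padding: pad tokens end up after the prompt (bad for autoregressive generation)
right_padded = F.pad(prompt, (0, max_prompt_len - prompt.numel()), value=pad_token_id)
print(right_padded)          # tensor([233,  11,  22,   0,   0])

# flip -> pad on the right -> flip back, which effectively left-pads the prompt
left_padded = F.pad(prompt.flip(0), (0, max_prompt_len - prompt.numel()),
                    value=pad_token_id).flip(0)
print(left_padded)           # tensor([  0,   0, 233,  11,  22])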

3.3 Detailed explanation of the key code of stage three: step3_rlhf_finetuning

3.3.1 Initialization of each model in stage three: main.py, rlhf_engine.py in step3_rlhf_finetuning

For model initialization, the source code uses the DeepSpeedRLHFEngine class to initialize the actor, ref/SFT, critic, reward/RM and actor_ema models. This class mainly implements:

  1. Reading the models. Although it also supports pulling the corresponding models directly from the Hugging Face hub, the models trained in phase1 and phase2 are usually read from local paths:
    \rightarrow  actor, ref/SFT and actor_ema (EMA stands for Exponential Moving Average, a model-training technique: the parameters used after the k-th update are not the k-th new parameters directly, but a weighted average of the historical parameters at step k-1 and the new parameters at step k, mainly to make training more stable) are usually initialized from the model obtained by phase1 training;
    \rightarrow  critic and reward are usually initialized from the model obtained by phase2 training
  2. Setting a different DeepSpeed configuration (ds_config) for each model and wrapping it with DeepSpeedEngine. The actor is wrapped with DeepSpeedHybridEngine by default; a brief introduction to DeepSpeedHybridEngine appears below.
  3. Finally, an object rlhf_engine carrying all of the related models is obtained.

The corresponding code is as follows

  1. In main.py in step3_rlhf_finetuning , you can see that DeepSpeedRLHFEngine is called

    # applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py
    """
    使用DeepSpeedRLHFEngine类直接初始化模型
    当然其内部仍旧调用了“create_hf_model”方法来读取模型,
    但其中实现了更为精细的DeepSpeed控制
    """
    rlhf_engine = DeepSpeedRLHFEngine(···)
  2. The implementation of DeepSpeedRLHFEngine is in step3_rlhf_finetuning/rlhf_engine.py , which involves the initialization of four models such as actor, ref, critic, and reward.

    # applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py
    class DeepSpeedRLHFEngine():
        def __init__(···):
            """
            加载模型并进行DS封装
            1. actor与ref(以及actor_ema)通常都初始化自phase1训练所得的模型;
            2. critic与reward通常都初始化自phase2训练所得的模型
            根据它们的入参就能知道
            """
            ···
    
            """此处的actor是模型经过DeepSpeed封装后得到的DeepSpeedHybridEngine对象"""
            self.actor = self._init_actor(actor_model_name_or_path)
    
            """此处的reference是模型经过DeepSpeed封装后得到的DeepSpeedEngine对象"""
            self.ref = self._init_ref(actor_model_name_or_path)
    
            self.actor_ema = None
            """如果开启了ema,则初始化并封装ema"""
            if self.args.enable_ema:
                """此处的ema是模型经过DeepSpeed封装后得到的DeepSpeedEngine对象"""
                self.actor_ema = self._init_ema(actor_model_name_or_path)
    
            """此处的critic是模型经过DeepSpeed封装后得到的DeepSpeedEngine对象"""
            self.critic = self._init_critic(critic_model_name_or_path)
    
            """此处的reward是模型经过DeepSpeed封装后得到的DeepSpeedEngine对象"""
            self.reward = self._init_reward(critic_model_name_or_path)
  3. Then, the initialization details of the actor are as follows

    # applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py
    def _init_actor(self, actor_model_name_or_path):
        """
        初始化actor并使用DeepSpeedHybridEngine封装
        :param actor_model_name_or_path: phase1训练好的actor模型路径
        :return: 经DeepSpeedHybridEngine封装的actor
        """
    	···
    
        """
        DS Config
        根据传参构建ds config,
        与其他相关模型不同的地方在于,如果传参指定启用了enable_hybrid_engine,
        那么HybridEngine将作用于actor,对actor进行封装,
        因为HybridEngine可以使得模型可以在训练与推理两种模式中进行自动切换,
        同时享有训练与推理的优化,
        这对于既需要进行推理生成、又需要进行训练的actor来说是有增益作用的。
        """
        ds_config = get_train_ds_config(···,
            enable_hybrid_engine=self.args.enable_hybrid_engine,
            ···)
    	···
    
        # Model
        """使用CausalLM结构载入模型及权重,实例化actor"""
        actor_model = create_hf_model(
            model_class=AutoModelForCausalLM,
            model_name_or_path=actor_model_name_or_path,
            ds_config=ds_config,
            ···)
    
        # LoRA
        """如果开启LoRA训练则添加LoRA旁路"""
        if self.args.actor_lora_dim > 0:
            actor_model = convert_linear_layer_to_lora(···)
            if self.args.only_optimize_lora:
                actor_model = only_optimize_lora_parameters(actor_model)
    
        # Optimizer
        """实例化优化器:分组权重衰减等"""
        AdamOptimizer = DeepSpeedCPUAdam if self.args.offload else FusedAdam
        optim_params = get_optimizer_grouped_parameters(
            actor_model, self.args.actor_weight_decay)
        optim = AdamOptimizer(optim_params,
                              lr=self.args.actor_learning_rate,
                              betas=(0.9, 0.95))
    
        # LR Scheduler
        """实例化学习率调度器"""
        lr_scheduler = get_scheduler(
            name=self.args.lr_scheduler_type,
            optimizer=optim,
            num_warmup_steps=self.args.num_warmup_steps,
            num_training_steps=self.num_total_iters,
        )
    	
    	"""
        DeepSpeedEngine封装
        若ds_config中定义了启用HybridEngine,
        则返回的actor_engine不仅是个DeepSpeedEngine实例,
        确切地说还是个DeepSpeedHybridEngine实例,集成有HybridEngine的优化
        """
        actor_engine, *_ = deepspeed.initialize(model=actor_model,
                                                optimizer=optim,
                                                lr_scheduler=lr_scheduler,
                                                config=ds_config)
        ···
        return actor_engine

    The initialization of the rest of ref, actor_ema, critic, and reward is almost the same, except that the ds_config settings are different, but they will eventually return objects encapsulated by DeepSpeedEngine.

3.3.2 The difference between reward_score and values, and how experience data is obtained

3.3.2.0 Obtaining the experience data: step3_rlhf_finetuning/ppo_trainer.py

As shown in the figure below, DeepSpeed-Chat obtains experience data at this stage as follows:

  1. Prepare the prompt data (prompt_input_ids, prompt_attention_mask);
  2. Use the current actor to answer the prompts, obtaining the complete dialogue sequences seq (the "sequence" in the figure above);
  3. Feed seq into the current actor, which outputs the current (old) policy logits (action_logits in the figure above); take the log to get logprobs;
  4. Feed seq into ref/SFT, which outputs the baseline policy ref_logits (sft_logits in the figure above); take the log to get ref_logprobs;
  5. Feed seq into reward/RM, which outputs the environmental reward reward_score (r(x,y) in the figure above);
  6. Feed seq into the current critic, which outputs the current (old) value estimates values (the "value" in the figure above);
  7. At this point all the basic experience data needed for PPO training has been obtained. The adv, reward and similar items shown in the figure are, in DeepSpeed-Chat, only computed later during the actual training process (incidentally, the "reward" in the
    figure refers to the "KL reward" mentioned in InstructGPT: to prevent over-trusting the reward learned in phase2, the KL divergence between the SFT policy and the actor's logits is introduced as a penalty on the reward, which is explained in detail below).

The relevant code implementation can be seen in the code block below

# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py
def generate_experience(self, prompts, mask):
    """
    Generate experience
    :param prompts: prompt input ids, (bs, max_prompt_len)
    :param mask: prompt attention mask, (bs, max_prompt_len)
    :return:
    """

    """switch actor, reference, critic and reward to eval mode"""
    self.eval()

    """
    seq.shape: (seq_bs, max_seq_len)
    seq_bs is the batch_size after excluding overly short answers.
    By default a "short answer" means "an answer whose valid length is smaller than 1";
    seqs with such short answers are filtered out,
    so the batch_size may end up smaller than before,
    but this is extremely unlikely; DS-C considers it can only happen when generating
    with a model that has not gone through phase1 training.

    More details of _generate_sequence() are explained later.
    """
    seq = self._generate_sequence(prompts, mask)

    """switch actor and critic back to train mode, since both still need to be trained"""
    self.train()

	···

    with torch.no_grad():
        """
        Experience collection: this is essentially gathering everything needed to compute the phase3 losses
        1. actor: (old) policy - output.logits
        2. reference: SFT policy - output_ref.logits
        3. reward: reward - reward_score, the r_\theta in InstructGPT
        4. critic: (old) value estimate - values
        """
        output = self.actor_model(seq, attention_mask)
        output_ref = self.ref_model(seq, attention_mask)

        # (seq_bs, max_seq_len, vocab_size)
        logits = output.logits

        # (seq_bs, max_seq_len, vocab_size)
        logits_ref = output_ref.logits

        """more details of the reward head's forward_value() are explained right below"""
        """reward_score takes the value of the answer's last token"""
        # reward_score.shape: (seq_bs,)
        reward_score = self.reward_model.forward_value(
            seq, attention_mask, prompt_length=self.prompt_length)['chosen_end_scores'].detach()

        """critic_model.forward_value(return_value_only=True)
        returns the per-token values of the sequence with shape (seq_bs, max_seq_len)"""
        # i.e. it outputs the old value sequence `values`
        values = self.critic_model.forward_value(
            seq, attention_mask, return_value_only=True).detach()[:, :-1]

    # the returned dict is "the set of data needed for PPO"
    # prompts.shape: (bs, max_prompt_len)
    # logits[:, :-1, :].shape: (seq_bs, max_seq_len - 1, vocab_size)
    # seq[:, 1:].shape: (seq_bs, max_seq_len - 1)
    # gather_log_probs() takes logits and labels, applies log_softmax to the logits and picks the value at each label position
    # so logprobs.shape: (seq_bs, max_seq_len - 1) and ref_logprobs.shape: (seq_bs, max_seq_len - 1)
    # values.shape: (seq_bs, max_seq_len - 1)
    # rewards.shape: (seq_bs,); reward_score is the r_\theta in InstructGPT
    # input_ids.shape: (seq_bs, max_seq_len)
    # attention_mask.shape: (seq_bs, max_seq_len)
    """more details of gather_log_probs() are explained later."""
    return {
        'prompts': prompts,
        'logprobs': gather_log_probs(logits[:, :-1, :], seq[:, 1:]),
        'ref_logprobs': gather_log_probs(logits_ref[:, :-1, :], seq[:, 1:]),
        'value': values,
        'rewards': reward_score,
        'input_ids': seq,
        "attention_mask": attention_mask
    }
Next, there are three points that need to be emphasized:

3.3.2.1 Seq generation: step3_rlhf_finetuning/ppo_trainer.py

The prompts of the current batch are fed into the current actor (relative to the actor that will later be updated on this experience data, the "current actor" can be regarded as the "old policy network"), which generates the answers (as shown in the figure below); prompt and answer are then concatenated to obtain seq.

The sample code is as follows

# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py
def _generate_sequence(self, prompts, mask):
    """
	生成seq
    """
	···
	"""
    获取prompt拼接上answer后的最大长度,实际上相当于max_seq_len,用于对生成长度做限制
    """
    max_min_length = self.max_answer_seq_len + prompts.shape[1]

    with torch.no_grad():
        """调用actor,输入input_ids和attention_mask进行生成"""
        seq = self.actor_model.module.generate(prompts,
                                               attention_mask=mask,
                                               max_length=max_min_length,
                                               min_length=max_min_length)

    """下方操作是为了过滤掉只有极短answer(有效长度小于1)的seq"""
    batch_size = seq.shape[0]

    """prompt长度:实际上就是max_prompt_len"""
    prompt_length = prompts.shape[1]

    """取出answer部分,此时还含有pad token"""
    ans = seq[:, prompt_length:]

    """统计answer的有效长度(去掉pad token后的长度)"""
    valid_ans_len = (ans != self.tokenizer.pad_token_id).sum(dim=-1)

    """排除较短(此处设置为有效长度小于1)的answer,余下的answer将被存入out_seq作为最终返回"""
    out_seq = []
    for i in range(batch_size):
        if valid_ans_len[
                i] <= 1:  # if the answer is shorter than 1 token, drop it
            continue
        else:
            out_seq.append(seq[i:i + 1])
    out_seq = torch.cat(out_seq, dim=0)  # concate output in the batch dim

    # out_seq.shape: (valid_batch_size, max_seq_len)
    return out_seq

3.3.2.2 The difference between the reward reward_score and the value estimates values: utils/model/reward_model.py

The "reward / environmental reward / reward_score" gives a single reward value (score) for the entire dialogue sequence.
The "value estimate / values" gives a value prediction for every position of the dialogue sequence, and is therefore closely tied to the time step / state.

For example, take the dialogue sequence seq = [11, 22, 33, 44, 55, 66, 0, 0, 0, 0] with conversation_rewards = [2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62]:

  • the reward reward_score is just a scalar, here reward_score_seq = 2.25, corresponding to the last valid token;
  • the value estimate values is a 1-dimensional array, e.g. values_seq = [0.21, 1.26, 2.52, 0.03, 0.59, 1.55, 1.75, 2.12, 2.22, 1.32]

As shown below, the reward model class RewardModel implements forward_value(), the method used to obtain environmental rewards and value estimates. Two points deserve emphasis:

  1. When forward_value is called during the experience-generation phase covered in this section 3.3.2, the values it returns are the old value estimates:
            # i.e. it outputs the old value sequence `values`
            values = self.critic_model.forward_value(
            	seq, attention_mask, return_value_only=True).detach()[:, :-1]
    When forward_value is instead called while computing the loss in section "3.3.4.4 Final computation of the value loss" below, the values it returns are the new value estimates:
        # here the value loss is being computed, so this is the new value estimate
        value = self.critic_model.forward_value(**batch,
                                                return_value_only=True,
                                                use_cache=False)[:, :-1]
  2. It differs from the other method of RewardModel, forward(), which is used during RM training and mainly implements obtaining the environmental reward and computing the ranking loss. In short, the RewardModel class implements both forward(), used for training, and forward_value(), used for inference.

Finally, forward_value is implemented in the RewardModel class as follows:

# applications/DeepSpeed-Chat/training/utils/model/reward_model.py
class RewardModel(nn.Module):

    def __init__(self, base_model, tokenizer, num_padding_at_beginning=0):
        ···
    ···
    def forward(···):
    	"""forward()在之前“2.3.3 整个对话的reward设计和成对排序损失”中已经进行过详解,且与此处所述内容无关,此处不再赘述"""
        ···

    def forward_value(···, return_value_only=False, ···):
        """
        和forward的差别在于:forward需要针对输入的chosen-rejected对计算排序损失并返回
        而forward_value只需要考虑一个输入,然后返回分值
        说白了,forward的输入是数据对,因为要计算数据对的排序损失,而forward value的输入是单个数据,直接推理出其分值
        至于参数return_value_only: 如果设置为True,则在计算出values(在序列上每个位置的分值预测)后直接返回
        """
        
        """经过主干网络正向传播得到输出"""
        transformer_outputs = self.rwtranrsformer(···)

        # hidden_states.shape: (bs, max_seq_len, hidden_size)
        hidden_states = transformer_outputs[0]

        """将隐状态特征传入线性层v_head输出得到分值"""
        # values.shape: (bs, max_seq_len)
        values = self.v_head(hidden_states).squeeze(-1)
        
        if return_value_only:
        	"""
			如果传参中预设了“return_value_only=True”,
			那么将直接返回 values: (bs, max_seq_len)
			"""
            return values
        else:
        	"""否则还将进一步取得reward_score"""
            # 相当于为true  返回values序列,为false 返回values序列和reward标量值 
            bs = values.size(0)
            seq_len = input_ids.shape[1]
            chosen_end_scores = []
            for i in range(bs):
            	···
                # value.shape: (max_seq_len,)
                value = values[i]

                """c_ind即为prompt之后的序列片段中,第一个pad_token的index"""
                c_ind = ···

                """取c_ind的前一个index(实际上就是answer的最终位置)作为reward_score"""
                ···
                chosen_end_scores.append(value[c_ind - 1])
            
            """返回values和reward_score"""
            return {
                "values": values,
                "chosen_end_scores": torch.stack(chosen_end_scores),
            }
3.3.2.3 Further processing of policy model logits

The logits output by the policy models (actor, ref/SFT) have shape (bs, max_seq_len, vocab_size). However, when computing the KL-divergence penalty and the importance weights, we do not need the logits over the entire vocabulary; only the ground-truth entries (the entries corresponding to each actual token in seq) are needed.

batch_size = 1
max_seq_len = 4
vocab_size  = 3

logits = [
          [[1.23, 2.11, -0.56], 
           [-1.52, -1.11, 1.66], 
           [0.32, 0.13, 1.55], 
           [-0.55, -0.23, -1.62]]
         ]

seq = [
       [2, 2, 0, 1]
      ]

For a CausalLM, the logits at time step t predict the seq token at step t+1, so logits[:, :-1, :] and seq[:, 1:] stand in a "prediction vs. label" relationship:

logits[:, :-1, :] = [
                      [[1.23, 2.11, -0.56], 
                       [-1.52, -1.11, 1.66], 
                       [0.32, 0.13, 1.55]]
                    ]
seq[:, 1:] = [
              [2, 0, 1]
             ]

We only need to pick out, from the predictions, the entries at the positions given by the labels (in the actual implementation a log_softmax is applied first, so what is gathered are log-probabilities rather than raw logits). For the example above, gathering the raw logits gives

probs = [
             [-0.56, -1.52, 0.13]
            ]

DeepSpeed-Chat therefore defines the function gather_log_probs() to post-process the output logits and obtain the log-probability result log_probs.

# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py
def gather_log_probs(logits, labels):
    """
    Takes logits and labels, applies log_softmax to the logits and picks out
    the value at each label's position
    :param logits: (bs, seq_len, vocab_size)
    :param labels: (bs, seq_len)
    :return: log_probs_labels.squeeze(-1): (bs, seq_len)
    """

    # log_probs.shape: (bs, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)

    """
    gather() picks the values from log_probs at the indices given by labels;
    in short, it extracts from the logits the values at the positions of the label tokens
    log_probs_labels.shape: (bs, seq_len, 1)
    """
    log_probs_labels = log_probs.gather(dim=-1, index=labels.unsqueeze(-1))
    return log_probs_labels.squeeze(-1)
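
Using the toy logits and seq from above, a quick check of this gather logic (assuming gather_log_probs as defined above is in scope, with torch.nn.functional imported as F as in the source file):

import torch

logits = torch.tensor([[[ 1.23,  2.11, -0.56],
                        [-1.52, -1.11,  1.66],
                        [ 0.32,  0.13,  1.55],
                        [-0.55, -0.23, -1.62]]])   # (1, 4, 3)
seq = torch.tensor([[2, 2, 0, 1]])                 # (1, 4)

log_probs = gather_log_probs(logits[:, :-1, :], seq[:, 1:])
print(log_probs.shape)   # torch.Size([1, 3])
print(log_probs)         # log-softmax values gathered at label positions 2, 0, 1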

3.3.3 PPO training data management-MiniDataset: utils/data/data_utils.py

The Dataset was already loaded once at the very beginning, but that Dataset manages all of the training data, whereas the MiniDataset used here mainly manages the data consumed by the PPO training iterations. The data handling before PPO training can be understood as follows:

  1. First, the DataLoader takes from the Dataset one prompt_batch of unsupervised data and one prompt_batch of prompt data (note: the unsupervised data is used for the ptx term; the reason this ptx term exists is that unsupervised (pre-)training gives the model its basic ability to generate fluent sentences, and introducing ptx in the RLHF stage lets the model pursue human preference without forgetting that basic generation ability). The prompt_batch of prompt data is then used for experience
    collection, yielding one prompt_batch of experience data.
  2. After that, the prompt_batch of unsupervised data and the prompt_batch of experience data are each handed to their own MiniDataset instance for management: one prompt_batch is divided into several ppo_batches for several iterations of PPO training, as in the following code (from step3_rlhf_finetuning/main.py):
    # applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py
    """both the experience data and the unsupervised data are managed by MiniDataset"""
    exp_mini_dataset = MiniDataset(···)
    unsup_mini_dataset = MiniDataset(···)  
    
    # out is the experience data     
    out = trainer.generate_experience(···)
    exp_dataset = exp_mini_dataset.add(out)
    unsup_dataset = unsup_mini_dataset.add(batch_unsupervised)

Step 2 above is exactly what MiniDataset does. The MiniDataset class is defined in utils/data/data_utils.py and performs the following three operations:

  1. seperate() (spelled this way in the source): subdivides the data into ppo_batches; its implementation is
    # applications/DeepSpeed-Chat/training/utils/data/data_utils.py
    class MiniDataset:
        def __init__(self, max_size, small_batch_size):
            """
            :param max_size: batch数。通常此处指“用于给actor做生成的prompt的batch数(注意是batch数不是batch_size)”
            :param small_batch_size: batch size。通常此处指“PPO训练的batch_size”。
            """
            self.dataset = []
            self.max_size = max_size
            self.small_batch_size = small_batch_size
    
        def seperate(self):
        	"""维护1个small_dataset"""
            small_dataset = []
    
            # 从self.dataset中逐个取batch
            for large_batch in self.dataset:
                """判断batch的数据类型(列表/元组/字典),
                根据数据类型取其batch_size,赋值给large_size"""
                if type(large_batch) == list or type(large_batch) == tuple:
                    large_size = len(large_batch[0])
                elif type(large_batch) == dict:
                    large_size = len(large_batch[list(large_batch.keys())[0]])
                else:
                    large_size = len(large_batch)
                """
    
                以下部分代码略微抽象,需要举例说明
                - 比如prompt的batch_size设置为3,PPO训练用的batch_size设置为4,则最后能取来用、存入small_dataset的也就只有3条数据
                - (因为生成用的dataloader只采样出了3条,最多也就只有3条)
    
                - 比如prompt的batch_size设置为5,PPO训练用的batch_size设置为4,则最后能取来用、存入small_dataset的就是2组数据
                - (第1组为idx0,idx1,idx2,idx3共4条数据、第2组为idx4共1条数据)
    
                - 比如prompt的batch_size设置为9,PPO训练用的batch_size设置为4,则最后能取来用、存入small_dataset的就是3组数据
                - ([0,1,2,3],[4,5,6,7],[8])
                """
                for i in range(0, large_size, self.small_batch_size):
                    if type(large_batch) == list or type(large_batch) == tuple:
                        small_dataset.append(
                            [x[i:i + self.small_batch_size] for x in large_batch])
                    elif type(large_batch) == dict:
                        small_dataset.append({
                            k: v[i:i + self.small_batch_size]
                            for k, v in large_batch.items()
                        })
                    else:
                        small_dataset.append(large_batch[i:i + self.small_batch_size])
            """清空self.dataset"""
            self.free()
            
            """返回最终取用的数据,该ppo_batch数据将用于ppo训练迭代"""
            return small_dataset
  2. add(): Get batch (prompt_batch) data;

        def add(self, data):
            """
    		在最开始的时候可以传参预设“生成X个batch再进行PPO训练”,
    		此处的max_size就是其中的X,
    		如果少于max_size则将batch数据加入至MiniDataset中,
    		直至达到max_size个batch
    		"""
            if len(self.dataset) < self.max_size:
                self.dataset.append(data)
                if len(self.dataset) == self.max_size:
                    """
                    seperate()主要实现了
                    1. 在batch的基础上,再细分ppo_batch并返回
                    2. 清空MiniDataset中的数据
                    """
                    return self.seperate()
                else:
                    return None
            else:
                raise ValueError(
                    "The dataset is full but we did not stop it. There is a bug in the code."
                )
  3. free(): clears the batch data stored in the MiniDataset (it is called inside seperate(), which is what actually returns the ppo_batch data)
        def free(self):
            """清空self.dataset中的数据"""
            self.dataset = []
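
A tiny demonstration of this behavior (assuming the MiniDataset class above is in scope), mirroring the "prompt batch_size 9, PPO batch_size 4" case from the comments in seperate():

import torch

# collect 1 large batch before PPO training (max_size=1), then split it into ppo_batches of size 4
mini = MiniDataset(max_size=1, small_batch_size=4)

large_batch = {"input_ids": torch.arange(9)}       # a "batch" of 9 samples
ppo_batches = mini.add(large_batch)                # max_size reached, so seperate() is called

print(len(ppo_batches))                            # 3 ppo_batches of sizes 4, 4 and 1
print([b["input_ids"].tolist() for b in ppo_batches])
# [[0, 1, 2, 3], [4, 5, 6, 7], [8]]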

3.3.4 PPO training under the actor-critic architecture: with a β-penalized, clipped reward from the RM, the policy and the value estimate are iterated continuously on experience data

For each batch of collected experience data, MiniDataset splits it into multiple ppo_batches, which are used for multiple training iterations of the relevant models.

From a reinforcement-learning perspective, the ppo_epochs setting in DeepSpeed-Chat is really the number of times a batch of experience data is reused:

  • if ppo_epochs is set to 1, the batch of experience data is discarded after a single full pass during training, and the next prompt_epoch round then collects a fresh batch of experience data;
  • if ppo_epochs is set to n, the batch of experience data is traversed n times before being discarded, i.e. it is reused n times for off-policy training.
# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py, roughly lines 470-490 of that file
for ppo_ep in range(args.ppo_epochs):
    """ppo_epoch loop"""
    for i, (exp_data, unsup_data) in enumerate(zip(exp_dataset, unsup_dataset)):
        """
        ppo_step loop:
        from the data returned by MiniDataset,
        take 1 ppo_batch of experience data and of unsupervised data for training
        """

        """train on the experience data; returns actor_loss and critic_loss"""
        actor_loss, critic_loss = trainer.train_rlhf(exp_data)

        """accumulate this ppo_step's metrics; later divided by the number of inner iterations to get means"""
        actor_loss_sum += actor_loss.item()
        critic_loss_sum += critic_loss.item()
        average_reward += exp_data["rewards"].mean()

        """training on the unsupervised data"""
        if unsupervised_training_enabled:
            """returns the unsupervised loss"""
            unsup_loss = trainer.train_unsupervised(unsup_data, 
                                                        args.unsup_coef)
            """accumulate this ppo_step's unsupervised loss; later divided by the number of inner iterations to get the mean"""
            unsup_loss_sum += unsup_loss.item()

        """increment the number of PPO training iterations (ppo_step)"""
        inner_iter += 1

        """whether exponential moving average is enabled"""
        if args.enable_ema:
            moving_average(rlhf_engine.actor,
                           rlhf_engine.actor_ema,
                           zero_stage=args.actor_zero_stage)

    """shuffle the data for off-policy reuse"""
    random.shuffle(exp_dataset)
    random.shuffle(unsup_dataset)

A single PPO training step is managed by the train_rlhf() method, which mainly implements the following (note: if the content below is hard to follow, it helps to read it together with Section 3.2 of the article "Analysis of ChatGPT technical principles: from RL's PPO algorithm and RLHF to GPT-4 and InstructGPT" on this blog to reinforce understanding):

  1. Computation of the KL-penalized reward old_rewards: to avoid over-trusting the reward r(x,y) learned by the phase-2 reward model, a KL-divergence penalty term is subtracted from it:

    r_{KL} = r(x,y) - \beta \log \frac{\pi_{old}^{RL}(y|x)}{\pi^{SFT}(y|x)}

  2. Calculation of advantages and returns.
    Like most frameworks, this one does not use the bare TD-error as the advantage; it combines the MC method with the TD-error, i.e. GAE (generalized advantage estimation): with \lambda=0 the estimate degenerates to the single-step TD-error, and with \lambda=1 it degenerates to the MC estimate. For a trajectory of length T, the advantage at time step t is
    \begin{array}{c} \hat{A}_{t}=\delta_{t}+(\gamma \lambda) \delta_{t+1}+(\gamma \lambda)^{2} \delta_{t+2}+\cdots+(\gamma \lambda)^{T-t-1} \delta_{T-1} \\ \text { where } \delta_{t}=r_{KL, t}+\gamma \cdot V_{\text {old }}\left(s_{t+1}\right)-V_{\text {old }}\left(s_{t}\right) \end{array}
    and the return at time step t is

    R_t = \hat{A}_t + V_t

  3. In 1 ppo_batch, the actor's loss calculation formula is:
    p g_{-} l o s s=E_{\tau \sim \pi_{\text {old }}^{R L}} E_{\left(s_{t}, a_{t}\right) \sim \tau}\left[\max \left(-\hat{A}_{t} \cdot \frac{p_{\text {new }}^{R L}\left(a_{t} \mid s_{t}\right)}{p_{\text {old }}^{R L}\left(a_{t} \mid s_{t}\right)},-\hat{A}_{t} \cdot \operatorname{clip}\left(\frac{p_{n e w}^{R L}\left(a_{t} \mid s_{t}\right)}{p_{\text {old }}^{R L}\left(a_{t} \mid s_{t}\right)}, 1-\epsilon, 1+\epsilon\right)\right)\right]
    Here, the time steps t cover only the "answer" part of the sequence and do not include the prompt part.
  4. In 1 ppo_batch, the critic's loss is computed as follows:
    first clip the new value estimate V_{new} so that it does not deviate too far from the old value estimate recorded when the experience was collected, so that experience replay remains effective:

    V_{clip} = clip(V_{new}, V_{old}-\phi, V_{old}+\phi)

    the critic then fits the return R:

    vf\_loss = \frac{1}{2} \cdot E_{\tau \sim \pi_{old}^{RL}} E_{s_t \sim {\tau}} [\max((V_{new}(s_t)-R_t)^2, (V_{clip}(s_t)-R_t)^2)]

    Here, too, t covers only the "answer" part and not the prompt: the loss is computed only over the answer tokens, and the prompt part does not enter this formula.

  Next, let's look at the code implementation. To keep the reading smooth, the July online ChatGPT class student "spring" rearranged part of the code so that each function definition sits right after its call site, which makes it easy to compare the arguments being passed in and to tell apart the old and new policies and the old and new value estimates. On top of that, I split the code into several sections and interleaved a series of formulas, diagrams, and explanations, finally combining "code plus diagrams" to give you a more intuitive and transparent walkthrough.

3.3.4.1 First, a series of definitions, and the KL penalty applied to the phase-2 reward

The formula for adding a KL penalty to the phase-2 reward expands to (from Section 3.1.3 "InstructGPT training stage 3: how the policy is further optimized via the PPO algorithm" in another article on this blog, Analysis of ChatGPT technical principles):

\begin{aligned} objective(\phi ) &= E_{(x,y)\sim D_{\pi _{\phi }^{RL}}} [r_\theta (x,y) - \beta log(\pi _{\phi }^{RL}(y|x) / \pi ^{SFT}(y|x) )] + \gamma E_{x\sim D_{pretrain}} [log(\pi _{\phi }^{RL})] \\&= E_{(x,y)\sim D_{\pi _{ }^{RL'}}} \left [ \frac{\pi _{\phi }^{RL}(y|x)}{\pi ^{RL'}(y|x)}r_{\theta'}(x,y) - \beta log(\pi^{RL'}(y|x) / \pi ^{SFT}(y|x) ) \right ] + \gamma E_{x\sim D_{pretrain}} [log(\pi _{\phi }^{RL})] \\&= E_{(x,y)\sim D_{\pi _{ }^{RL'}}} \left [ \min \left(\frac{\pi_{\phi }^{RL}(y|x)}{\pi ^{RL'}(y|x)} r_{\theta'}(x,y),{clip}\left(\frac{\pi_{\phi }^{RL}(y|x)}{\pi ^{RL'}(y|x)}, 1-\varepsilon, 1+\varepsilon\right) r_{\theta'}(x,y)\right) - \beta log(\pi^{RL'}(y|x) / \pi ^{SFT}(y|x) ) \right ]+ \gamma E_{x\sim D_{pretrain}} [log(\pi _{\phi }^{RL})]\\&= E_{(x,y)\sim D_{\pi _{ }^{RL'}}} \left [ \min \left(\frac{\pi_{\phi }^{RL}(y|x)}{\pi ^{RL'}(y|x)} A^{\theta^{RL'}}\left(x,y\right),{clip}\left(\frac{\pi_{\phi }^{RL}(y|x)}{\pi ^{RL'}(y|x)}, 1-\varepsilon, 1+\varepsilon\right) A^{\theta^{RL'}}\left(x,y\right)\right) \right ]+ \gamma E_{x\sim D_{pretrain}} [log(\pi _{\phi }^{RL})] \end{aligned}

The corresponding diagram is shown below.
There are two points worthy of special attention:

  1. In the actual code, when the β-weighted KL penalty is applied to the RM reward, the numerator is the old policy from the experience data (corresponding to π(RL') in the formula above). Even so, the β penalty is still a ratio between two genuinely different policies, π(RL')/π(SFT): although π(RL') is initialized from π(SFT), after one or more update steps it has already diverged from π(SFT) (whether it is one step or several depends on whether the ppo_epochs mentioned above is 1 or n). In the code, the penalty ratio corresponds to old policy / SFT policy = log_probs/ref_log_probs, not the ambiguous action_logits in Figure 1 below.
  2. Also, in the actual code, besides adding the KL penalty to the RM reward, the RM score is clipped (reward_clip) for safety, which is likewise not reflected in Figure 1 below.

The corresponding code is

# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py
def train_rlhf(self, inputs):
    """
    Run one RLHF training iteration using one ppo_batch of experience data.
    """
    # prompt input ids
    prompts = inputs['prompts']

    # (old) policy
    log_probs = inputs['logprobs']

    # SFT policy
    ref_log_probs = inputs['ref_logprobs']

    # RM reward
    reward_score = inputs['rewards']

    # (old) value estimates
    values = inputs['value']
    attention_mask = inputs['attention_mask']

    # seq input ids
    seq = inputs['input_ids']

    """
    Take the last position of the prompts as start.
    E.g. if prompt_len is 256, start is 256-1 = 255.
    start is mainly used to slice out the "non-prompt" part
    (i.e. the "answer + padding" part) of the experience data.
    """
    start = prompts.size()[-1] - 1

    """
    action_mask is attention_mask with the 0-th sequence position dropped.
    Note:
    1. in most cases, including transformers-style code such as this,
       attention_mask is really a "padding_mask" rather than a "sequence_mask";
    2. the [:, 1:] slice removes position 0 so the mask aligns with seq shifted by one,
       hence action_mask.shape: (bs, max_seq_len - 1);
    3. it is later used to filter out information at pad-token positions;
    4. in practice it is almost always combined with the start defined above to slice
       out the "non-prompt" part again, e.g. action_mask[start:] is effectively the
       "non-prompt" part, with action_mask[start:].shape: (bs, max_answer_len).
    """
    action_mask = attention_mask[:, 1:]
    ···

    """the value estimates in the experience data are the "old" value estimates"""
    old_values = values
    with torch.no_grad():

        ### compute the KL-penalty-corrected reward ################################################
        """
        Compute the corrected reward from the KL-divergence penalty and r_\theta
        (the reward model from phase-2).
        Note the arguments here:
        1. log_probs is the old policy from the experience data
           (as in the formula above), NOT the action_logits in the diagram;
        2. ref_log_probs is the SFT policy from the experience data;
        3. reward_score is the RM score from the experience data.
        """
        old_rewards = self.compute_rewards(prompts, log_probs,
                                           ref_log_probs, reward_score,
                                           action_mask)
        def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score,
                            action_mask):
            """
            Compute the actual rewards: the KL-divergence penalty between the (old)
            policy and SFT, plus the RM reward.
            """
            """KL-divergence penalty between the actor at experience-sampling time and SFT"""
            kl_divergence_estimate = -self.kl_ctl * (log_probs - ref_log_probs)
            rewards = kl_divergence_estimate

            """
            Find the start of the answer: the position of the last prompt token.
            E.g. if the prompt length is 256, the answer starts at 256-1 = 255.
            """
            start = prompts.shape[1] - 1

            """
            ends holds the index of the last valid token of each sample in the batch;
            since that position very likely differs from sample to sample,
            ends is an array.
            """
            ends = ···

            """
            Clamp the RM score to a fixed range, (-5, 5) by default.
            This both corrects the RM score and applies the reward_clip truncation.
            """
            reward_clip = torch.clamp(reward_score, -self.clip_reward_value,
                                      self.clip_reward_value)
            ···

            """
            Because each sample's last valid token position is likely different,
            this cannot be done as one matrix operation; loop over the samples instead.
            """
            for j in range(batch_size):
                """
                KL_reward = KL + reward
                the addition happens only at the last valid token
                """
                rewards[j, start:ends[j]][-1] += reward_clip[j]
            """return the KL rewards"""
            return rewards
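
To see concretely what compute_rewards produces, here is a self-contained toy version of the same logic (a sketch only: kl_ctl=0.1 and clip_reward_value=5 are assumed values, and ends is hard-coded instead of being derived from the attention mask as in the real method):

import torch

def compute_rewards_toy(prompts, log_probs, ref_log_probs, reward_score,
                        ends, kl_ctl=0.1, clip_reward_value=5.0):
    # per-token KL penalty between the (old) policy and the SFT policy
    rewards = -kl_ctl * (log_probs - ref_log_probs)
    start = prompts.shape[1] - 1
    # clamp the RM score, then add it only at each sample's last valid token
    reward_clip = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
    for j in range(prompts.shape[0]):
        rewards[j, start:ends[j]][-1] += reward_clip[j]
    return rewards

prompts       = torch.zeros(1, 3, dtype=torch.long)            # prompt_len=3, so start=2
log_probs     = torch.tensor([[-1.0, -1.2, -0.9, -1.1, -0.8]]) # old policy log-probs
ref_log_probs = torch.tensor([[-1.1, -1.0, -1.0, -1.0, -1.0]]) # SFT log-probs
reward_score  = torch.tensor([7.0])                            # will be clipped to 5.0
ends          = [5]                                            # one past the last valid token
print(compute_rewards_toy(prompts, log_probs, ref_log_probs, reward_score, ends))
# every position carries only the small KL term; the clipped RM score lands on the last valid token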
3.3.4.2 Next, calculate the advantages and returns: get_advantages_and_returns in step3_rlhf_finetuning/ppo_trainer.py

(One more thing: as mentioned at the beginning of Section 3.1, since the advantage adv and the KL-penalized reward can both be computed from the basic experience data, they could equally well be computed in one go during the experience-generation stage.)


        ### compute the advantages and returns ################################################
        """
        Compute the advantages and the returns.
        Note the arguments here:
        4. old_values is the (old) value estimate from the experience data;
        5. old_rewards is the KL_reward just computed above.
        """
        advantages, returns = self.get_advantages_and_returns(
            old_values, old_rewards, start)
        def get_advantages_and_returns(self, values, rewards, start):
            """
            Compute the advantages and returns;
            the implementation essentially follows the formulas above.
            """
            lastgaelam = 0
            advantages_reversed = []
            length = rewards.size()[-1]

            """traverse backwards to compute the advantage of each time step"""
            for t in reversed(range(start, length)):
                """value estimate of the next time step, V_{old}(s_{t+1})"""
                nextvalues = values[:, t + 1] if t < length - 1 else 0.0

                """single-step TD-error"""
                # gamma is the discount factor γ; delta is the TD error δ
                delta = rewards[:, t] + self.gamma * nextvalues - values[:, t]

                """accumulate the advantage"""
                lastgaelam = delta + self.gamma * self.lam * lastgaelam

                """store the advantage of each time step"""
                advantages_reversed.append(lastgaelam)

            """reverse the reversed list to get the advantages in normal time-step order"""
            advantages = torch.stack(advantages_reversed[::-1], dim=1)

            """
            return_t = adv_t + V(s_t)
            the returns are obtained from the advantages
            """
            returns = advantages + values[:, start:]

            """return the advantages and returns"""
            return advantages.detach(), returns

In addition, Section 3.2 of the article "Analysis of Technical Principles of ChatGPT: From RL PPO Algorithm, RLHF to GPT4, instructGPT" includes a complete worked example of computing the advantage and return values, which makes the whole calculation process clearer through concrete numbers.
"To further help you clarify every detail of the related concepts, let's take a closer look at how DSC calculates the returns.

In DSC, the advantage function is defined as the discounted sum of the TD errors (Temporal-Difference Error, δ here) at all future time steps. This method of calculating the advantage function is called Generalized Advantage Estimation (GAE) )

First, we need to calculate the TD-error sequence; here we use the provided V_old value sequence and reward sequence:
values = [-0.2761, -2.3945, 0.1729, -0.0919, -0.0867, -0.0818, -0.0758]
KL_rewards = [-4.6873e-04, -3.1257e-04, 5.8591e-05, -5.5084e-03, -4.0741e-03, -5.5275e-03, -8.5999e-02]
γ = 0.9
...
"
The calculation code corresponding to this example is

# -*- coding: utf-8 -*-
"""
Created on Wed Jan 19 20:42:44 2022
@author: Remixa
"""
import torch

def get_advantages_and_returns(values, rewards, start, length, gamma, lam):
    lastgaelam = 0
    advantages_reversed = []
    length = rewards.size(-1)
    print("#" * 10 + "init:" + "#" * 10)
    print(f"start: {start}, length: {length},\ngamma: {gamma}, lam: {lam}\nvalues: {values}\nrewards: {rewards}")
    for t in reversed(range(start, length)):
        nextvalues = values[:, t + 1] if t < length - 1 else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
        print("#" * 10 + f"t = {t}" + "#" * 10)
        print(f"t = {t}:\nnextvalues: {nextvalues}\ndelta: {delta}\nlastgaelam: {lastgaelam}\nadvantages_reversed: {advantages_reversed}")
    advantages = torch.stack(advantages_reversed[::-1], dim=1)
    returns = advantages + values[:, start:]
    print("#" * 10 + "result:" + "#" * 10)
    print(f"advantages: {advantages}\nreturns: {returns}")
    return advantages.detach(), returns

if __name__ == '__main__':
    old_values = torch.tensor([[-0.2761, -2.3945, 0.1729, -0.0919, -0.0867, -0.0818, -0.0758]])
    old_rewards = torch.tensor([[-4.6873e-04, -3.1257e-04, 5.8591e-05, -5.5084e-03, -4.0741e-03, -5.5275e-03, -8.5999e-02]])
    start = 3
    length = 7
    gamma = 0.9
    lam = 0.95
    advantages, returns = get_advantages_and_returns(old_values, old_rewards, start, length, gamma, lam)
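
If you run this, start=3 means the backward loop only covers the last four time steps; the advantages come out to roughly [0.0156, 0.0084, -0.0007, -0.0102] and the returns to roughly [-0.0763, -0.0783, -0.0825, -0.0860], i.e. each return is simply the advantage at that step plus the corresponding old value estimate.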
3.3.4.3 Then calculate the policy loss: actor_loss_fn of step3_rlhf_finetuning/ppo_trainer.py

The corresponding diagram is shown below.

But before looking at the code, two points need to be mentioned:

  1. The importance ratio between the new and old policies, logprobs/old_logprobs, is reflected in the figure above, but the clipping (truncation) of this ratio is not. As we already know, the ratio has to be constrained, via clipping or a KL-divergence term, to keep the new policy from updating too far away from the old one.
  2. log_probs is the (old) policy from the experience data defined earlier, while the actor's new policy is named actor_log_probs. The correspondence between these four names in the code is shown in the figure below (the reason may be, as spring put it, that one person wrote the policy-loss function and another wrote the experience-data code; in any case, take care not to confuse them).

The corresponding code is

    ### compute the actor loss and update ################################################
    batch = {'input_ids': seq, "attention_mask": attention_mask}

    """feed the seq experience data to the actor for an autoregressive forward pass"""
    actor_prob = self.actor_model(**batch, use_cache=False).logits

    """gather the log-probs of the generated tokens; this is the new policy"""
    actor_log_prob = gather_log_probs(actor_prob[:, :-1, :], seq[:, 1:])

    """
    Compute the actor loss.
    Note the arguments here:
    1. actor_log_probs is the new policy just produced above;
    2. log_probs is the (old) policy from the experience data defined earlier;
    3. advantages are the advantages computed earlier.
    """
    actor_loss = self.actor_loss_fn(actor_log_prob[:, start:],
                                    log_probs[:, start:], advantages,
                                    action_mask[:, start:])
    def actor_loss_fn(self, logprobs, old_logprobs, advantages, mask):
        """compute the actor's loss"""

        """
        importance-sampling weight: ratio = exp(log(new) - log(old))
        """
        log_ratio = (logprobs - old_logprobs) * mask
        ratio = torch.exp(log_ratio)

        """the two candidate policy-gradient losses: weighted advantage vs clipped weighted advantage"""
        pg_loss1 = -advantages * ratio
        pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - self.cliprange,
                                             1.0 + self.cliprange)

        """
        take the larger of the two as the actual loss,
        averaged over all valid time steps of all samples in the ppo_batch
        """
        pg_loss = torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / mask.sum()
        return pg_loss

    """actor backward pass and parameter update"""
    self.actor_model.backward(actor_loss)
    self.actor_model.step()
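
To make the clipping behaviour easier to see, here is a minimal, standalone sketch of the same clipped-surrogate logic with toy tensors (this is not DSC's code: cliprange=0.2 and all the numbers are made up for illustration, and the middle token is chosen so that its ratio exceeds 1+ε and gets clipped):

import torch

def actor_loss_fn_toy(logprobs, old_logprobs, advantages, mask, cliprange=0.2):
    log_ratio = (logprobs - old_logprobs) * mask
    ratio = torch.exp(log_ratio)                 # importance ratio new/old
    pg_loss1 = -advantages * ratio               # unclipped surrogate
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    # per token, keep the larger (more pessimistic) loss, then average over valid tokens
    return torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / mask.sum()

# 1 sample, 3 answer tokens; the middle ratio exp(0.5) ≈ 1.65 exceeds 1+ε and is clipped to 1.2
logprobs     = torch.tensor([[-1.0, -0.5, -2.0]])  # new policy log-probs
old_logprobs = torch.tensor([[-1.1, -1.0, -1.9]])  # old policy log-probs from experience
advantages   = torch.tensor([[ 0.3,  0.8, -0.2]])
mask         = torch.ones(1, 3)
print(actor_loss_fn_toy(logprobs, old_logprobs, advantages, mask))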
3.3.4.4 Final calculation of value loss: critic_loss_fn of step3_rlhf_finetuning/ppo_trainer.py

    ### compute the critic loss and update ################################################
    """feed the seq experience data to the critic to predict the new value estimates"""
    # the forward_value called here is the one analyzed earlier in the section on
    # "the difference between the reward reward_score and the value estimates values";
    # since we are now computing the value loss, what is computed here is the NEW value estimate
    value = self.critic_model.forward_value(**batch,
                                            return_value_only=True,
                                            use_cache=False)[:, :-1]

    """
    Compute the critic loss.
    Note the arguments here:
    1. value is the new value estimate just produced above;
    2. old_values is the (old) value estimate from the experience data;
    3. returns are the returns computed earlier.
    """
    critic_loss = self.critic_loss_fn(value[:, start:], old_values[:, start:],
                                      returns, action_mask[:, start:])
    def critic_loss_fn(self, values, old_values, returns, mask):
        """compute the value loss"""

        """clip the current new values so they do not drift too far from the old values recorded at experience-sampling time"""
        values_clipped = torch.clamp(
            values,
            old_values - self.cliprange_value,
            old_values + self.cliprange_value,)

        """L2 loss between the current values and the returns"""
        vf_loss1 = (values - returns)**2

        """L2 loss between the clipped current values and the returns"""
        vf_loss2 = (values_clipped - returns)**2

        """
        take the larger of the two as the actual loss,
        averaged over all valid time steps of all samples in the ppo_batch;
        the critic loss term additionally carries a coefficient of 0.5
        """
        vf_loss = 0.5 * torch.sum(
            torch.max(vf_loss1, vf_loss2) * mask) / mask.sum()
        return vf_loss

    """critic backward pass and parameter update"""
    self.critic_model.backward(critic_loss)
    self.critic_model.step()


    """this ppo_step returns actor_loss and critic_loss for metric tracking"""
    return actor_loss, critic_loss
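
For completeness, a matching standalone sketch of the clipped value loss with toy numbers (again only an illustration under an assumed cliprange_value=0.2, not the library code); note how taking the max keeps the larger, i.e. more pessimistic, of the clipped and unclipped penalties:

import torch

def critic_loss_fn_toy(values, old_values, returns, mask, cliprange_value=0.2):
    # clip the new value estimates to stay near the old ones from experience collection
    values_clipped = torch.clamp(values,
                                 old_values - cliprange_value,
                                 old_values + cliprange_value)
    vf_loss1 = (values - returns) ** 2           # unclipped L2 term
    vf_loss2 = (values_clipped - returns) ** 2   # clipped L2 term
    return 0.5 * torch.sum(torch.max(vf_loss1, vf_loss2) * mask) / mask.sum()

values     = torch.tensor([[0.10, 0.90, -0.30]])  # new value estimates
old_values = torch.tensor([[0.00, 0.40, -0.25]])  # old estimates from the experience data
returns    = torch.tensor([[0.05, 0.50, -0.20]])
mask       = torch.ones(1, 3)
# the 2nd value (0.90) is clipped to 0.60, but max() still keeps the larger unclipped penalty
print(critic_loss_fn_toy(values, old_values, returns, mask))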

By the way, during the RLHF training of phase 3, so that the model can keep the pretrained model's ability to solve tasks while it learns human preferences, traditional autoregressive language modeling is brought in for joint training.

\mathrm{p}(\mathrm{x})=\prod_{\mathrm{t}=1}^{\mathrm{T}} \mathrm{p}\left(\mathrm{x}_{\mathrm{t}} \mid \mathrm{x}_{<\mathrm{t}}\right)

The corresponding sample code is

# applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py
unsup_loss = trainer.train_unsupervised(unsup_data, args.unsup_coef)
def train_unsupervised(self, inputs, unsup_coef):
    """
    Unsupervised training on one ppo_batch.
    :param inputs: dict: input_ids, attention_mask, labels
    :param unsup_coef: coefficient of the unsupervised loss
    """
    """make sure the actor is in training mode, otherwise an error is raised"""
    self._validate_training_mode()

    """the actor performs regular CausalLM training"""
    outputs = self.actor_model(**inputs, use_cache=False)
    loss = outputs.loss
    """backward pass and parameter update"""
    self.actor_model.backward(unsup_coef * loss)
    self.actor_model.step()

    return loss
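
(A small correspondence worth noting: the unsup_coef that scales this loss plays the role of the γ coefficient in front of the E_{x∼D_pretrain}[log π] pretraining term in the InstructGPT objective expanded in Section 3.3.4.1 above.)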

Finally, let me quote once more a few of the summary points made by spring:

  1. "RLHF training involves reinforcement learning, and the training process is extremely sensitive to hyperparameter settings. After trying a variety of settings, the DeepSpeed-Chat team ended up defaulting to per_device_train_batch_size (i.e. prompt_batch_size) = per_device_mini_batch_size (i.e. ppo_batch_size), with training starting immediately after each prompt_batch is generated; in this way, what is actually being done is on-policy reinforcement learning: collect once, learn once, so data utilization is not high."
  2. In addition, the DeepSpeed-Chat team also found that the coefficient for the unsupervised training loss (unsup_coef) is very hard to set and makes the training process more oscillatory, and the team did not spend much effort tuning this parameter.

Of course, these are not the best hyperparameter configurations. The DeepSpeed-Chat team still encourages users to try more and share their own parameter tuning experiences.
