【RLHF】Want to train a ChatGPT? First take a look at reinforcement learning (RL) + language models (LM) (with source code)

With ChatGPT taking off recently, more and more people are paying attention to RLHF (Reinforcement Learning from Human Feedback), the core idea behind it.

The biggest advantage of using reinforcement learning (rather than supervised learning) to update a language model is that it lets the model explore its update direction more freely, and thus break through the performance ceiling of supervised learning.

As for why RL can achieve better results here, see the example in the accompanying video (around the 6:30 mark).

In today's article, we will walk through a concrete example of updating a language model with reinforcement learning.

1. Task description: using RL to train a positive-review generator

Our task goal is to learn a "positive-review generator".

The model receives a prompt, for example: "Just received the goods, and I feel..."

and must then complete the sentence, for example: "...it's a bit below expectations; the product is poor."

prompt: 刚收到货,感觉

output 1: 刚收到货,感觉 有 点 不 符 合 预 期 ,不 好
output 2: 刚收到货,感觉 挺 无 奈 的 送 货 速 度 不 太 行
...

In its initial state, the model generates continuations without any preference, which means it may well produce negative reviews (like the examples above).

Now we will use reinforcement learning (PPO) to train the generative model to produce positive reviews.

Whenever the model generates a sentence, we give it a score (reward) that reflects how "positive" the generated review is, as follows:

output 1: 刚收到货,感觉 有 点 不 符 合 预 期 ,不 好                -> score 0.2
output 2: 刚收到货,感觉 挺 无 奈 的 送 货 速 度 不 太 行            -> score 0.1
output 3: 刚收到货,感觉 有 些 惊 喜 于 货 物 质 量                  -> score 0.9
...

We then use these reward scores to update the generative model.

The whole process (rollout -> reward -> PPO update) is illustrated below.
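Since the three phases are walked through in detail in section 2, here is only a minimal structural sketch of one training iteration. It reuses the names that appear in the code excerpts below (gpt2_model, gpt2_tokenizer, sentiment_pipe, ppo_trainer, gen_kwargs); total_epochs and positive_score are hypothetical placeholders, not names from the repo:

import random
import torch

# minimal sketch of one RLHF iteration; assumes gpt2_model, gpt2_tokenizer, sentiment_pipe,
# ppo_trainer, gen_kwargs, device, prompts, batch_size and gen_len are set up as in section 2
for epoch in range(total_epochs):                                    # hypothetical outer loop
    # 1. Rollout: pick prompts and let the current policy generate continuations
    queries = [random.choice(prompts) for _ in range(batch_size)]
    query_tensors = [torch.tensor(gpt2_tokenizer.encode(q)).long().to(device) for q in queries]
    response_tensors = [
        gpt2_model.generate(q.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
        for q in query_tensors
    ]
    responses = [gpt2_tokenizer.decode(r) for r in response_tensors]

    # 2. Evaluation: score each prompt + response pair with the sentiment model
    texts = [q + r for q, r in zip(queries, responses)]
    rewards = [torch.tensor(positive_score(out)) for out in sentiment_pipe(texts)]   # positive_score: hypothetical helper

    # 3. Optimization: one PPO update on the sampled batch
    ppo_trainer.step(query_tensors, response_tensors, rewards)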

Using a discriminative model instead of manual scoring

This would be a very lengthy process if a human were to score each output.

If we can find a discriminative model that takes a sentence as input and outputs the probability that the sentence is a positive review,

then we can use that model's output directly as the reward for each generated sentence.

Therefore, we introduce a "sentiment classification model" to stand in for human scoring.

"Emotion recognition model" we use the built-in sentiment-analysis pipeline in transformers to implement.

This model was trained on a dataset of online product reviews and can classify the sentiment of a sentence as positive or negative, as follows:
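As a quick, hedged illustration (the exact label strings returned depend on the checkpoint's label mapping), running the pipeline on the two example reviews from section 1 might look like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# sentiment model fine-tuned on JD.com product reviews (binary positive/negative)
senti_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
senti_model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
sentiment_pipe = pipeline('sentiment-analysis', model=senti_model, tokenizer=senti_tokenizer)

print(sentiment_pipe('刚收到货,感觉有些惊喜于货物质量'))     # expect a high-confidence positive label
print(sentiment_pipe('刚收到货,感觉有点不符合预期,不好'))    # expect a high-confidence negative label
# each call returns something like [{'label': ..., 'score': ...}];
# the exact label strings depend on this checkpoint's config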

We use this sentiment model's output (a score between 0.0 and 1.0) as the reward for the GPT generation model, and use it to guide GPT's iterative updates through the PPO reinforcement learning algorithm.

2. Detailed training process

2.1 Generate samples (Rollout)

The purpose of the rollout phase is to let the current model generate a batch of samples.

To ensure diversity in the generated sentences, we set up a prompt pool; for each sample the model randomly picks a prompt from the pool and generates a continuation:

# prompt pool
prompts = [
    '刚收到货,感觉',
    '这部电影很',
    '说实话,真的很',
    '这次购物总的来说体验很'
]
...

for _ in range(config['batch_size']):
    random_prompt = random.choice(prompts)                                   # randomly pick a prompt
    tokens = gpt2_tokenizer.encode(random_prompt)
    batch['tokens'].append(tokens)
    batch['query'].append(random_prompt)
query_tensors = [torch.tensor(t).long().to(device) for t in batch["tokens"]]
...

for i in range(config['batch_size']):
    gen_len = config['gen_len']
    response = gpt2_model.generate(query_tensors[i].unsqueeze(dim=0),           # generate a continuation with the chosen prompt
                                   max_new_tokens=gen_len, **gen_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])
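The excerpt above assumes that gpt2_tokenizer, gpt2_model, device and gen_kwargs already exist. A plausible setup is sketched below; the checkpoint name and the sampling parameters are assumptions rather than the repo's exact choices, and GPT2HeadWithValueModel is the GPT-2-with-value-head class defined in section 2.3:

import torch
from transformers import AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = 'uer/gpt2-chinese-cluecorpussmall'                              # assumption: a Chinese GPT-2 checkpoint
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_name)
gpt2_model = GPT2HeadWithValueModel.from_pretrained(model_name).to(device)   # GPT-2 + Value Head (see section 2.3)

# sample freely during rollout so the policy can explore
gen_kwargs = {
    'min_length': -1,
    'top_k': 0,
    'top_p': 1.0,
    'do_sample': True,
}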

After this step, we have a batch of generated results from the model:

[
    '刚收到货,感觉 很 一 般',
    '这部电影很 俗 而 且 很 无 趣',
    '这次购物总的来说体验很 烂 不 是 我 想 要 的',
    ...
]

2.2 Reward Evaluation (Evaluation)

After obtaining the model's generations, we can score them with the sentiment classification model.

# initialize the sentiment classification model
senti_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
senti_model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
sentiment_pipe = pipeline('sentiment-analysis', model=senti_model, tokenizer=senti_tokenizer, device=pipe_device)
...


texts = [q + r for q, r in zip(batch['query'], batch['response'])]          # concatenate each prompt with its generated response
pipe_outputs = sentiment_pipe(texts)                                        # compute positive/negative sentiment scores
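Each element of pipe_outputs is a dict with a label and a confidence score; one hedged way to map it to a scalar reward per sentence (assuming this checkpoint's positive label contains the word "positive") is:

import torch

rewards = []
for output in pipe_outputs:
    if 'positive' in output['label']:                         # assumption about the label string
        rewards.append(torch.tensor(output['score']))         # confidently positive -> high reward
    else:
        rewards.append(torch.tensor(1.0 - output['score']))   # confidently negative -> low reward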

After running the code above, we get a reward score for each sentence:

[
    0.4,
    0.3,
    0.3,
    ...
]

2.3 Model iteration (Optimization)

In the model iteration phase, we will use PPO to update the model parameters. The update code only needs one line:

ppo_trainer.step(query_tensors, response_tensors, rewards)          # PPO Update
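For completeness, ppo_trainer has to be constructed before the training loop. The sketch below follows the older trl-style API that this repo builds on, so both the constructor signature and the hyperparameter values are assumptions, not the repo's exact code:

# hedged sketch: the exact PPOTrainer signature depends on the repo's own implementation
ppo_config = {
    'lr': 1.41e-5,            # learning rate
    'batch_size': 16,
    'gamma': 1.0,             # discount factor used in the advantage computation below
    'lam': 0.95,              # GAE lambda
    'cliprange': 0.2,         # PPO ratio clipping range
    'cliprange_value': 0.2,   # value clipping range
    'vf_coef': 0.1,           # weight of value_loss relative to pg_loss
    'init_kl_coef': 0.2,      # KL penalty keeping the policy close to the reference model
}

gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained(model_name)          # frozen reference model
ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, gpt2_tokenizer, **ppo_config)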

During each update, PPO computes two losses: pg_loss and value_loss:

loss_p, loss_v, train_stats  = self.loss(logprobs, values, rewards, query, response, model_input)
loss = loss_p + loss_v
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
...

pg_loss

pg_loss is the actor (policy) loss in PPO. It is built from the advantage of the current step (computed from the discounted reward) and the importance-sampling ratio:

$loss_{pg} = \frac{P_{\pi_{new}}(token)}{P_{\pi_{old}}(token)} \left( r + \gamma V_{next} - V_{current} \right)$

Here, the importance ratio is the probability ratio of the same token under the current (new) actor and the old actor that generated the rollout; this is the importance-sampling coefficient in PPO.

lastgaelam = 0
advantages_reversed = []
for t in reversed(range(gen_len)):
    nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
    delta = rewards[:, t] + self.ppo_params['gamma'] * nextvalues - values[:, t]          # TD error: r + γ * V_next - V
    lastgaelam = delta + self.ppo_params['gamma'] * self.ppo_params['lam'] * lastgaelam   # GAE, balances bias and variance
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1]).transpose(0, 1)

logits, _, vpred = self.model(model_input)                                  # run the model once to get logits for every token in the sentence
logprob = logprobs_from_logits(logits[:,:-1,:], model_input[:, 1:])         # log-probability of each chosen token
ratio = torch.exp(logprob - old_logprobs)                                   # subtracting logs = dividing probabilities (importance ratio)
pg_losses = -advantages * ratio                                             # unclipped policy-gradient loss
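The line above is only the unclipped surrogate term. In full PPO the ratio is also clipped, and the larger (more pessimistic) of the two losses is kept; a sketch in the same style, where cliprange is a hyperparameter (commonly around 0.2):

# clipped surrogate objective: limit how far the new policy may move from the old one
pg_losses2 = -advantages * torch.clamp(ratio,
                                       1.0 - self.ppo_params['cliprange'],
                                       1.0 + self.ppo_params['cliprange'])
pg_loss = torch.mean(torch.max(pg_losses, pg_losses2))                      # per-token max, then mean over the batch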

value_loss

value_loss is the critic loss in PPO; its purpose is to estimate the value of the sequence after each token is generated.

PPO needs a critic network, so we have to modify the GPT model accordingly.

We add a Value Head to GPT that maps each token's hidden_size-dimensional vector to a single scalar value:

class GPT2HeadWithValueModel(GPT2PreTrainedModel):
    """The GPT2HeadWithValueModel class implements a GPT2 language model with a secondary, scalar head."""
    def __init__(self, config):
        super().__init__(config)
        config.num_labels = 1
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.v_head = ValueHead(config)                                       # add the Value Head
        self.init_weights()
    ...

class ValueHead(nn.Module):
    """The ValueHead class implements a head for GPT2 that returns a scalar for each output token."""
    
    def __init__(self, config):
        super().__init__()
        self.summary = nn.Linear(config.hidden_size, 1)                        # (hidden_size -> 1)
    ...
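To see how the two classes fit together, here is a simplified sketch of GPT2HeadWithValueModel's forward pass; it matches the "logits, _, vpred = self.model(model_input)" calls used in the loss code, with the middle return value left as a placeholder:

    # simplified sketch of GPT2HeadWithValueModel.forward
    def forward(self, input_ids):
        hidden_states = self.transformer(input_ids)[0]        # (batch, seq_len, hidden_size)
        lm_logits = self.lm_head(hidden_states)               # (batch, seq_len, vocab_size)
        value = self.v_head(hidden_states).squeeze(-1)        # (batch, seq_len): one value per token
        return lm_logits, None, value                         # -> logits, _, vpred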

value_loss is the difference between the value V_pred predicted by the Value Head and the target value r + V_next:

$loss_{value} = \left\| V_{pred} - (r + V_{next}) \right\|$

returns = advantages + values                      # r + v_next - v + v => r + v_next
logits, _, vpred = self.model(model_input)         # run the language model once to get v_pred for every token
vf_losses1 = (vpred - returns) ** 2                # MSE
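In practice the value prediction is usually clipped against the rollout-time value estimate as well, and the final value loss is weighted before being added to pg_loss; another hedged sketch in the same style, with cliprange_value and vf_coef as hyperparameters:

# clip the new value prediction so it cannot move too far from the old value estimate
vpredclipped = torch.clamp(vpred,
                           values - self.ppo_params['cliprange_value'],
                           values + self.ppo_params['cliprange_value'])
vf_losses2 = (vpredclipped - returns) ** 2
vf_loss = 0.5 * torch.mean(torch.max(vf_losses1, vf_losses2))                # pessimistic max, then mean

loss = pg_loss + self.ppo_params['vf_coef'] * vf_loss                        # total loss passed to the optimizer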

3. Experimental results

The training curve shows that, as training progresses, the model's average reward rises from an initial 0.68 to about 0.85.

At the beginning of training, GPT generates fairly random continuations; the average reward is low and some "negative"-sentiment reviews appear among the samples.

As training progresses, GPT gradually learns to favor "positive"-sentiment reviews.

The full source code is here:

github.com/HarderThenHarder/transformers_tasks/tree/main/RLHF
