Reward Modelling (RM) and Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLM): A Preliminary Technical Exploration

1. Background of RLHF technology

The ChatGPT dialogue model launched by OpenAI has set off a new wave of AI. It handles a wide variety of questions and answers and seems to have blurred the boundary between machines and humans. Behind this work is a new training paradigm for Large Language Model (LLM) generation: RLHF (Reinforcement Learning from Human Feedback), which optimizes the language model with human feedback by means of reinforcement learning.

Various LLMs over the past few years have been impressive in their ability to generate diverse text from human input prompts. However, evaluating the generated results is subjective and context-dependent. For example, we may want the model to generate:

  • a creative story
  • a piece of factual, informative text
  • an executable code snippet

These results are difficult to measure with existing rule-based text generation metrics such as BLEU and ROUGE.

In addition to the evaluation metrics, existing models are usually trained to predict the next token with a simple loss function (such as cross-entropy), without explicitly incorporating human preferences and subjective opinions.

In order to solve the above problems, wouldn't it be better if we used human feedback on the generated text as a performance measure, or went a step further and used this feedback as a loss to optimize the model? This is the idea of RLHF: use reinforcement learning to directly optimize a language model with human feedback.

RLHF enables language models trained on general text corpora to align with complex human values.

2. RLHF Technology Decomposition 

RLHF is a complex concept involving multiple models and different training stages. Following OpenAI's formulation, RLHF is divided into three steps:

  1. Collect human feedback and pre-train/fine-tune a language model based on human-labeled data (prompt-completions pairs).
  2. Use multiple models (the initial model, a fine-tuned model, human writers, etc.) to give multiple answers to the same question, have humans rank these question-answer pairs by some criteria (readability, harmlessness, correctness, etc.), then aggregate the Q&A data and train a reward model (Reward Model, RM) for scoring. Some common questions about this step (see the Elo sketch after this list):
     1. Question 1: why not have humans score answers directly? Because scoring is subjective and needs to be normalized, while ranking tends to produce consensus: for the same question, most people agree on whether answer A or B is better. What humans feed back is not a standard answer but a preference for the better answer, expressed as a ranking. In fact, most questions have no single best answer.
     2. Question 2: given a set of partial orders (A>B, A>C, C>B), how do we get a reward score for each answer? The Hugging Face blog uses the Elo rating system for this step, familiar to anyone who plays ranked online games or follows football and basketball. Treat each pairwise comparison as a match and the reward score as a rating: Elo produces a complete ordering, and the reward scores are obtained after normalization.
     3. Question 3: what model is used for the RM? After scoring with Elo and normalizing, the scores can be regressed directly with an LM, trained either from scratch or by fine-tuning an existing LM. Interestingly, both answering and scoring need to read the full text, so the two models should have comparable capacity (comprehension ability); in practice, existing RLHF systems use two models of different sizes.
     4. Question 4: are there other ways to train the scoring model? Zhang Junlin points out that directly applying pairwise learning-to-rank to the partial orders is probably the more conventional approach; the actual effect depends on practice.
  3. Fine-tune the SFT LM with reinforcement learning (RL), guided by the reward model, to obtain the final RLHF LM.
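To make Question 2 concrete, here is a minimal sketch (not from the original post) of converting a set of pairwise preferences into normalized per-completion scores with Elo-style updates; the answer IDs, K-factor, and base rating below are illustrative assumptions.

def elo_scores(comparisons, k=32, base=1000.0):
    """comparisons: list of (winner_id, loser_id) pairs from human ranking."""
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))  # expected win probability
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    # normalize to zero mean / unit variance so the scores can serve as reward targets
    vals = list(ratings.values())
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
    return {cid: (r - mean) / std for cid, r in ratings.items()}

# e.g. the partial order A>B, A>C, C>B from Question 2
print(elo_scores([("A", "B"), ("A", "C"), ("C", "B")]))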

Reference link:

https://zhuanlan.zhihu.com/p/591474085
https://zhuanlan.zhihu.com/p/613315873?utm_id=0

3. Collect human feedback and pre-train/fine-tune a language model (SFT LLM) based on human-labeled data (prompt-completions pairs by human feedback)

There are two main categories of models that can be used to gather human feedback:

  • Pre-trained model (Base LLM): a model that has only been pre-trained on a text corpus, without fine-tuning
  • Supervised baseline model (SFT LLM): a model fine-tuned on a supervised dataset on top of the pre-trained model

Dedicated labelers evaluate the results produced by the above models as relatively good or bad, which finally yields the "prompt-completions pairs by human feedback". Next, an SFT language model can be trained using the classic fine-tuning method. For this step of the model,

  • OpenAI uses a smaller version of GPT-3 in its first popular RLHF model, InstructGPT
  • Anthropic uses Transformer models with 10 million to 52 billion parameters
  • DeepMind uses its own 280-billion-parameter model Gopher

This LM can be fine-tuned here with additional text or conditions, e.g.

  • OpenAI fine-tunes on "preferable" human-generated text
  • Anthropic distills the original LM on contextual cues according to the criteria of "helpful, honest, and harmless"

Note that training this sft-llm is just a starting point: we will next train an RM reward model and then continue to train the sft-llm with it.

When the RM reward model participates in SFT training, the human preference experience contained in the RM is injected into the SFT feedback. Ultimately, our goal is to obtain a high-quality RLHF LLM.
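As a rough illustration (not from the original post), the classic supervised fine-tuning step on prompt-completions pairs might look like the following sketch with Hugging Face Transformers; the model name ("gpt2"), the toy example, and the hyperparameters are placeholders.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# a single toy prompt-completion pair standing in for the human-labeled dataset
pairs = [{"prompt": "Explain RLHF in one sentence.",
          "completion": "RLHF optimizes a language model with a reward model learned from human preferences."}]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(example):
    # concatenate prompt and completion into one causal-LM training sequence
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["prompt", "completion"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-llm", num_train_epochs=1, per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()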

4. Training reward model (Reward Model)

Next, based on the sft-llm, we generate the data used to train the reward model (RM, also called a preference model), i.e., completions corresponding to each prompt, and introduce human preference information (scoring and ranking) in this step.

0x1: Why do we need a reward model?

The following figure shows the current development paradigm of GPT technology for specific task applications.

In general, SFT can already meet the needs of most scenarios (what we need to do is mainly data purification and data distillation), but if there is a higher demand for generation quality, reinforcement learning based on human feedback (RLHF) is required.

The SFT Model can already generate a variety of responses in different styles, but for reasons such as law, ethics, human values, and the task requirements of specific fields, we need to guide the SFT Model toward a specific style of answer. Therefore, we need a way to provide feedback to LLMs to help them understand what works and what doesn't, so that we can align their output with accepted human values such as honesty, helpfulness, and harmlessness.

In summary, we need to train an RM Model for the following reasons:

  • Although the basic SFT LLM meets basic quality requirements, it still does not fully capture human preferences and constraints regarding specific tasks, values, ethics, and laws
  • For workload reasons, it is impractical for humans to directly provide such feedback during training, so we need a model that can mimic human preferences to provide rewards when training aligned LLMs.
  • Whether during model tuning or in daily performance monitoring after the model goes online, we need an automated evaluation standard and process to continuously monitor the model's generalization and degradation.

The above is exactly the goal of the reward model in LLM alignment.

0x2: The challenge of building a reward model

  • Amount of feedback data : Generating the amount and variety of human feedback data required for sufficiently accurate reward models is challenging.
  • Feedback distribution : Ideally, we want the reward model to accurately predict rewards not only for the data the model has seen, but also for data outside the training data distribution (OOD).
  • Reward gaming: if the reward function has exploitable loopholes, the agent can exploit them to obtain more reward during RL without converging to the intended behavior.

0x3:Reward Modeling

The training of the RM is where RLHF begins to differ from the old paradigm. This model takes in a sequence of text (a prompt-completions pair) and returns a scalar reward (score) that numerically represents the human preference.

  • We can model this with an LM in an end-to-end fashion
  • Or with a modular system (e.g., rank the outputs and then convert the ranking into rewards); either way, the scalar reward value is crucial for seamless integration into existing RL algorithms

Regarding model selection,

  • RM can be another fine-tuned LM
  • It can also be an LM trained from scratch on preference data

For example, Anthropic proposes a special pre-training method, Preference Model Pretraining (PMP), to replace the fine-tuning process after general pre-training, because the former is considered to make more efficient use of the sample data. But the jury is still out on which kind of RM is better.

Regarding the training text, the RM's prompt (prompt) - generation (completions) pairs are texts augmented by human annotation with completion scores or pairwise completion rankings, for example as shown in the figure below.

Regarding the value of training rewards, it is necessary to manually score the answers generated by SFT-LM.

  • One idea is to train the RM directly on text-annotated scores, but these scores are uncalibrated and noisy because annotators have different values
  • Another idea is to compare, by ranking, the completions output by multiple models for the same prompt, and then use the Elo system to build a complete ranking. These ranking results are normalized into a scalar reward value for training.

Regarding the scalar score describing the quality of the text, the pairwise training objective can be written as:

loss(θ) = −E_{(x, y_w, y_l)∼D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

where

  • x denotes the prompt
  • y_w and y_l denote the preferred and the dispreferred completion, respectively
  • r_θ denotes the score given by the reward model with parameters θ
  • σ denotes the sigmoid function

The reward model takes in a sequence of text (a prompt-completions pair) and returns a scalar reward (score) that numerically represents the human preference.
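A minimal PyTorch sketch of this pairwise objective (assuming r_chosen and r_rejected are the scalar scores the RM assigns to the preferred and dispreferred completions of the same prompts):

import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()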

An interesting artifact of this process is that currently successful RLHF systems use reward models of varying sizes relative to the generative model, e.g.

  • OpenAI uses a 175B LM and a 6B RM
  • The LMs and RMs used by Anthropic range from 10B to 52B
  • DeepMind uses the 70B Chinchilla model for both the LM and the RM

One intuition is that preference and generative models need to have a similar ability to understand the text given to them, i.e. referees need to be as capable as players to accurately judge player performance.

0x4: Policy model training

First, the fine-tuning task of the initial language model is modeled as a reinforcement learning (RL) problem, so basic elements such as policy , action space, and reward function need to be defined .

  • The policy is the language model itself: it receives a prompt as input and outputs a sequence of text (or a probability distribution over texts)
  • The action space is all permutations and combinations of the vocabulary tokens over all output positions (a single position usually has about 50k candidate tokens)
  • The observation space is the set of possible input token sequences (prompts), which is obviously huge: all permutations and combinations of the vocabulary tokens over all input positions
  • The reward function is computed from the RM model we trained before to obtain an initial reward, to which a constraint term is added (the KL penalty described below)

The whole process looks like this: 

For the reinforcement learning algorithm, a common feasible solution is to use a policy-gradient RL algorithm, Proximal Policy Optimization (PPO), to fine-tune some or all parameters of the initial LM.

1. Reinforcement learning modeling of language model

Let the vocabulary be V and the language model be π. For a token sequence of length n, the model defines the probability

π(x) = ∏_{i=1}^{n} π(x_i | x_1, ..., x_{i-1})

  • input space X = V^m: the possible prompts, e.g., sequences of up to about 1000 tokens
  • output space Y = V^n: the possible completions, e.g., sequences of up to about 100 tokens

For an input prompt x ∈ X, the probability of a completion y ∈ Y generated from x can be expressed as:

π(y | x) = ∏_{i=1}^{n} π(y_i | x, y_1, ..., y_{i-1})

The policy is initialized as π = π^{SFT}, and the PPO algorithm is then used to update the policy π. With the reward function defined as r, the expected reward can be expressed as:

E_{x∼D, y∼π(·|x)} [ r(x, y) ]

Next, the PPO algorithm optimizes the reward function calculation steps as follows:

  • Input the prompt x into the initial LM and the current fine-tuned LM to obtain the output texts y1 and y2 respectively, and pass the text from the current policy to the RM to get a scalar reward r_θ
  • Compare the generated texts of the two models and compute a penalty term for their difference, usually designed as a scaled Kullback-Leibler (KL) divergence between the output token distributions, i.e. r = r_θ(x, y) − λ · D_KL( π^{RL}(y | x) ‖ π^{init}(y | x) )
  • This term penalizes RL policies that deviate far from the initial model in each training batch, ensuring that the model outputs reasonably coherent text. If this penalty term is removed, the model may generate garbled text during optimization in order to fool the reward model into giving high reward values.

Finally, following the PPO algorithm, we optimize with respect to the reward of the current batch of data (a consequence of PPO's on-policy nature). PPO is a trust-region optimization algorithm that uses gradient constraints to ensure that the update step does not destabilize the learning process. Alternatively, the A2C (synchronous advantage actor-critic) algorithm can be used to optimize the gradient. A minimal sketch of the KL-penalized reward computation follows.
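The sketch below is not the original implementation; rm_score, the two logits tensors, the completion token ids, and kl_coef are assumed inputs.

import torch
import torch.nn.functional as F

def rlhf_reward(rm_score: torch.Tensor,
                policy_logits: torch.Tensor,   # [seq_len, vocab] from the current policy
                ref_logits: torch.Tensor,      # [seq_len, vocab] from the frozen initial/SFT model
                completion_ids: torch.Tensor,  # [seq_len] generated token ids
                kl_coef: float = 0.2) -> torch.Tensor:
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    idx = completion_ids.unsqueeze(-1)
    pi_lp = policy_logprobs.gather(-1, idx).squeeze(-1)   # log-probs of generated tokens under the policy
    ref_lp = ref_logprobs.gather(-1, idx).squeeze(-1)     # log-probs under the initial model
    approx_kl = (pi_lp - ref_lp).sum()                    # sample-based KL estimate over the completion
    return rm_score - kl_coef * approx_kl                 # r = r_theta - lambda * KL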

0x5: The overall process of RM & strategy model training

  1. Starting from a Base LLM (such as GPT-3.5, LLaMA, Tongyi Qianwen), collect prompts and response answers (completions)
  2. Through manual feedback, the different completions for each prompt are compared and ranked pairwise, indicating human preferences between the different response answers (completions); the pairwise rankings are converted into scores for the different completions via algorithms such as Elo
  3. Train an RM model (usually an LLM) on the "prompt-completions pairs with score labels" dataset; the trained RM can output a score for any given prompt-completions pair
  4. Policy model training
     1. First, formulate the fine-tuning task as an RL problem. The policy is an LM that takes a prompt and returns a sequence of text (or a probability distribution over texts). The action space of this policy is all tokens in the LM vocabulary (generally on the order of 50k), and the observation space is the set of possible input token sequences, which is also very large (vocabulary size ^ number of input tokens). The reward function is a combination of the preference model and a policy-shift constraint.
     2. The reward used by the PPO algorithm is computed as follows:
        1. Input the prompt x into the initial LM and the current fine-tuned LM to obtain the output texts y1 and y2 respectively
        2. Pass the text from the current policy to the RM to obtain a scalar reward
        3. Finally, following the PPO algorithm, we optimize with respect to the reward of the current batch of data (a consequence of PPO's on-policy nature). PPO is a trust-region optimization algorithm that uses gradient constraints to ensure that the update step does not destabilize the learning process. DeepMind uses a similar reward setup for Gopher, but uses the A2C (synchronous advantage actor-critic) algorithm to optimize the gradient
  5. Finally, an RM network that reflects human preferences is obtained. The rewards output by the RM (scores for different completions) can then be used to automatically filter out the completions that better match human preferences, so as to continuously fine-tune and optimize the SFT LM (a minimal sketch of such filtering follows this list)
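A minimal sketch of such reward-based filtering (reward_model.score is a hypothetical scoring call, not an API from the post):

def filter_by_reward(prompt, completions, reward_model, top_k=1):
    # keep only the top_k completions with the highest RM scores for further SFT rounds
    scored = sorted(completions, key=lambda c: reward_model.score(prompt, c), reverse=True)
    return scored[:top_k]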

Reference link:

https://karpathy.ai/stateofgpt.pdf
https://zhuanlan.zhihu.com/p/616708590
https://openreview.net/forum?id=10uNUgI5Kl
https://huggingface.co/blog/zh/rlhf
https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons/viewer/CarperAI--openai_summarize_comparisons/train?row=0
https://zhuanlan.zhihu.com/p/450690041

  

5. Train a simple Reward Model

Select the WebGPT dataset as the corpus for the reward model. As shown below, each prompt corresponds to a list of completions.

(
    'The USA entered World War I because Germany attempted to enlist Mexico as an ally, and for what other reason?',
    [
        "The United States entered World War I because of Germany's use of submarine warfare against ships in the Atlantic Ocean, which was hurting American exports to Europe. Additionally, Germany tried to enlist Mexico as an ally against the United States, an event which convinced American businessmen and industrialists that the United States should enter the war.",
        'The USA entered World War I because Germany attempted to enlist Mexico as an ally and for the Zimmerman Telegram.'
    ]
)

The dataset augmented by human feedback is as follows:

 The processing logic for selecting the best answer for human feedback from the dataset is as follows:

import re
from collections import defaultdict

from datasets import load_dataset


class WebGPT:
    name = "openai/webgpt_comparisons"

    def __init__(self, split: str = "train"):
        super().__init__()
        self.split = split
        dataset = load_dataset(self.name, split=self.split)
        self.dataset_dict = defaultdict(dict)
        for item in dataset:
            post_id = item["question"]["id"]
            if post_id not in self.dataset_dict.keys():
                self.dataset_dict[post_id] = {
                    "full_text": item["question"]["full_text"],
                    "answers": [],
                }
                if item["score_0"] > 0:
                    answers = [item["answer_0"], item["answer_1"]]
                elif item["score_0"] < 0:
                    answers = [item["answer_1"], item["answer_0"]]
                else:
                    answers = []
                answers = [re.sub(r"\[\d+\]", "", answer) for answer in answers]
                answers = [
                    ".".join([sent.strip() for sent in answer.split(".")])
                    for answer in answers
                ]
                if answers:
                    self.dataset_dict[post_id]["answers"].extend(answers)
                else:
                    _ = self.dataset_dict.pop(post_id)

        self.post_ids = list(self.dataset_dict.keys())

    def __len__(self):
        return len(self.post_ids)

    def __getitem__(self, idx):
        question, answers = self.dataset_dict[self.post_ids[idx]].values()
        return question, answers
Then, before feeding the data into the model, a collator does additional data preparation such as tokenization and padding. Depending on the dataset, the number of completions per prompt may vary, so an additional variable batch_k_lens is maintained to indicate how many completions are available for each prompt in the batch. This will help us compute the loss.

from dataclasses import dataclass

import torch
from transformers import PreTrainedTokenizer

# assumed to be defined elsewhere in the project, e.g.:
# SPECIAL_TOKENS = {"prompter": "<|prompter|>", "assistant": "<|assistant|>"}

@dataclass
class RMDataCollator:
    tokenizer: PreTrainedTokenizer
    max_length: int = 512

    def format_example(self, example, eos, prompt=False):
        sp_token = SPECIAL_TOKENS["prompter"] if prompt else SPECIAL_TOKENS["assistant"]
        return "{}{}{}".format(sp_token, example, eos)

    def process_example(self, example):
        trunc_len = 0
        eos = self.tokenizer.eos_token
        prefix, outputs = example
        prefix = self.format_example(prefix, eos, prompt=True)  # format the prompt text (not the whole example tuple)
        outputs = [self.format_example(output, eos) for output in outputs]

        prefix_tokens = self.tokenizer.encode(prefix)
        input_ids, attention_masks = [], []
        for output in outputs:
            out_tokens = self.tokenizer.encode(
                output,
            )
            if len(prefix_tokens) + len(out_tokens) > self.max_length:
                trunc_len = max(
                    0, len(prefix_tokens) + len(out_tokens) - self.max_length
                )
            prefix_tokens = prefix_tokens[trunc_len:]
            out_tokens = prefix_tokens + out_tokens
            out_tokens = out_tokens[: self.max_length]
            pad_len = self.max_length - len(out_tokens)
            attn_masks = [1] * len(out_tokens) + [0] * pad_len
            out_tokens += [self.tokenizer.pad_token_id] * pad_len
            input_ids.append(out_tokens)
            attention_masks.append(attn_masks)
        return input_ids, attention_masks

    def __call__(self, examples):
        batch_k_lens = [0]
        input_ids, attention_masks = [], []
        for i, example in enumerate(examples):
            inp_ids, attn_masks = self.process_example(example)
            input_ids.extend(inp_ids)
            attention_masks.extend(attn_masks)
            batch_k_lens.append(batch_k_lens[i] + len(inp_ids))

        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_masks),
            "k_lens": batch_k_lens,
        }

For the reward model model architecture, there are two options:

  • Use an encoder-only model such as BERT or RoBERTa and add a linear layer on top. Any model that supports AutoModelForSequenceClassification will do.
  • Use a decoder-only architecture such as GPT and add a custom linear layer on top. Decoder-only models are more scalable. Any model that supports AutoModelForCausalLM will do.

I choose GPTNeoXModel for now; I will average-pool the last hidden states and add a custom head on top to produce a scalar output.

from dataclasses import dataclass

import torch
from torch import nn
from transformers import GPTNeoXModel, GPTNeoXPreTrainedModel
from transformers.utils import ModelOutput


@dataclass
class GPTNeoXRMOutput(ModelOutput):
    """
    Reward Model Output
    """

    logits: torch.FloatTensor = None


class GPTNeoXRM(GPTNeoXPreTrainedModel):
    """ """

    def __init__(
        self,
        config,
    ):
        super().__init__(config)
        self.gpt_neox = GPTNeoXModel(config)
        self.out_layer = nn.Linear(config.hidden_size, 1)

    def forward(
        self,
        input_ids,
        attention_mask,
        **kwargs,
    ):
        return_dict = kwargs.pop("return_dict", None)
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.gpt_neox(
            input_ids,
            attention_mask,
            return_dict=return_dict,
            **kwargs,
        )
        hidden_states = outputs[0]
        if attention_mask is None:
            hidden_states = hidden_states.mean(dim=1)
        else:
            hidden_states = (hidden_states * attention_mask.unsqueeze(-1)).sum(
                dim=1
            ) / attention_mask.sum(dim=1).unsqueeze(-1)
        lm_logits = self.out_layer(hidden_states)

        if not return_dict:
            return (lm_logits,) + outputs[1:]

        return GPTNeoXRMOutput(logits=lm_logits)

For the loss function, I will add an L2 regularization term to prevent overfitting. For the k completions generated for each prompt there are

k(k − 1) / 2

pairwise comparisons, and for each pair the loss is

loss = −log σ( r_positive − r_negative ) + β/2 · ( r_positive² + r_negative² )

The loss is computed individually for each prompt and then averaged to obtain the batch loss.

import torch
from torch import nn


class RMLoss(nn.Module):
    """ """

    def __init__(
        self,
        reduction=None,
        beta=0.001,
    ):
        super().__init__()
        self.reduction = reduction
        self.beta = beta

    def forward(
        self,
        logits,
        k_lens=None,
    ):
        total_loss = []
        indices = list(zip(k_lens[:-1], k_lens[1:]))
        for start, end in indices:
            combinations = torch.combinations(
                torch.arange(start, end, device=logits.device), 2
            )
            positive = logits[combinations[:, 0]]
            negative = logits[combinations[:, 1]]
            l2 = 0.5 * (positive**2 + negative**2)
            loss = (
                -1 * nn.functional.logsigmoid(positive - negative) + self.beta * l2
            ).mean()
            total_loss.append(loss)

        total_loss = torch.stack(total_loss)
        if self.reduction == "mean":
            total_loss = total_loss.mean()
        return total_loss

Finally, we'll pass all of this along with the training parameters to a custom trainer to train and evaluate our model.

Remember: our ultimate goal is to train a "referee" that embodies human feedback preferences and can score and rank prompt completions (essentially realizing training-set distillation).

Once a good "referee" is trained, the development of LLM SFT can enter a positive cycle. The overall development process is as follows:

  1. Run prompt engineering against the Base LLM to generate an initial dataset data_v1
  2. SFT the Base LLM on the initial dataset to get sft-llm_v1
  3. Have domain business experts label and rank the initial dataset to get an enhanced dataset data_v2
  4. Train a reward model reward_v1 on the enhanced dataset data_v2
  5. Run prompt engineering against sft-llm_v1 to get a new prompt-completions dataset data_v3
  6. Label and rank data_v3 with reward_v1 to get data_v4
  7. Have domain business experts label and rank data_v4 to get an enhanced dataset data_v5
  8. Train a reward model reward_v2 on the enhanced dataset data_v5
  9. Perform SFT on sft-llm_v1 with data_v5 to get sft-llm_v2
  10. .....
  11. Repeat the above steps, continuously improving the reward model and the SFT LLM through feedback from domain business experts
  12. Once the reward model's performance is roughly on par with human experts, subsequent training no longer requires manual intervention: the reward model can automatically score and rank the completions of the SFT LLM, and the whole training-optimization loop becomes fully automated

Reference link:

https://explodinggradients.com/reward-modeling-for-large-language-models-with-code
https://huggingface.co/datasets/openai/summarize_from_feedback/viewer/axis/test?row=0

6. Learn a complete RLHF development process through the rlhf case of trlx

Take the RLHF example from trlx as a case study to understand the whole process in depth.

0x1: Zero-shot cold start

For most domain-specific task LLMs, the initial stage of the project basically starts from a zero-shot cold start. Therefore, the first step of building a task LLM is data preparation.

We discuss the zero-shot startup process for two cases.

1. The base model generalizes poorly to the target task domain

  • We already have at least one base large model into which prompts can be fed to generate completions
  • The base model's generalization to the target task domain is relatively weak, and the generated completions do not meet the needs of the target task domain

When this is the case, we need to go through prompt engineering, sample purification and distillation, and similar processes, continuously expanding our base samples in iterative cycles.

  1. Step 1: prompt engineering
     1. Use the Base LLM to generate completions for the seed samples, and manually select and correct the results
     2. Repeat sub-step 1, continuously screening out a high-quality prompt instruction set
     3. Feed the best prompt instruction set into the general-purpose base model (e.g., Tongyi Qianwen) to get the "basic prompt-completions dataset"
  2. Step 2: sample distillation/purification
     1. Manually select good-case samples that meet the minimum quality requirements from the basic prompt-completions dataset
     2. Manually correct the completions of bad-case samples that do not meet the requirements so that they reach the minimum quality bar, keeping the overall sample size roughly unchanged
     3. The distillation/purification process can be run in batches and expanded incrementally to keep injecting generalization ability into the model; each iteration accumulates into an ever-growing "purified prompt-completions dataset"
  3. Step 3: SFT training (supervised fine-tuning)
     1. Fine-tune the Base LLM on the "purified prompt-completions dataset" to obtain the fine-tuned model sft-llm
  4. Step 4: RM reward model development & RLHF human-feedback training
     1. Build a web-based evaluation UI so that people can score the model by inspecting its outputs
     2. Label new samples with the fine-tuned sft-llm, generating two or more completions for the same prompt, to produce the "sft prompt-completions dataset"
     3. Manually select good and bad cases, rank the "sft prompt-completions dataset", convert the rankings of different completions into scores via Elo, and feed them to the RM to obtain a reward model
     4. Fine-tune sft-llm via PPO training to finally obtain RLHF-llm
  5. Step 5: Loop the prompt-engineering and RLHF process
     1. Use RLHF-llm as the base model of Step 1
     2. Run a new round of prompt engineering
     3. Run a new round of sample distillation/purification
     4. Run a new round of SFT training
     5. Run a new round of RM development & RLHF human-feedback training
  6. Step 6: auto RLHF
     1. The reward model can serve as an automatic evaluation and feedback mechanism after the model goes online
     2. Once the RM is good enough (it has fully fitted the human preference experience), manual intervention can be reduced, and the RM can assist the SFT model in a continuous fine-tuning loop, eventually yielding a SOTA RLHF model

2. The base model can already generate samples that basically meet the requirements of the target task domain

  • We already have at least one base large model into which prompts can be fed to generate completions
  • The base model generalizes well to the target task domain, and the generated completions are of high quality for the target-domain task

When this is the case, the sample distillation/purification step can basically be omitted, and the other steps remain unchanged.

The completions generated by the base model already meet the minimum quality requirements of the target task domain, so the key work shifts to building a stronger reward model and to RLHF fine-tuning.

0x2: Training the basic SFT model

We use "CarperAI/openai_summarize_tldr" and perform SFT based on "EleutherAI/gpt-j-6B":

# single GPU
cd sft/ && CUDA_VISIBLE_DEVICES=0 python3 train_gptj_summarize.py
# multi-GPU
cd sft/ && deepspeed train_gptj_summarize.py

Through SFT, we obtain an sft-llm aligned with the summarization task.

0x3: training of reward model (Reward Model)

1. Data set preparation (completions scoring, ranking)

In general project development, we need to hire data annotators or outsourcers to rank the completions generated by base-llm, sft-llm, and humans. This step is very time-consuming but is crucial to the quality of the final model.

Here we use the open-source "CarperAI/openai_summarize_comparisons" dataset on Hugging Face for demonstration.

2. Loading and preprocessing the Hugging Face dataset (a completions dataset whose rank ordering is already done)

Using the open source dataset, create a list of dictionaries, each with 3 keys:

  • prompt: the original prompt
  • chosen: the summary for this prompt that human labelers marked as "accepted", i.e. ranked higher
  • rejected: the summary for this prompt that human labelers marked as "rejected", i.e. ranked lower

from datasets import load_dataset
from tqdm import tqdm


def create_comparison_dataset(path="CarperAI/openai_summarize_comparisons", split="train"):
    dataset = load_dataset(path, split=split)
    pairs = []
    for sample in tqdm(dataset):
        pair = {}
        prompt = sample["prompt"]
        chosen_summary = sample["chosen"]
        rejected_summary = sample["rejected"]
        if chosen_summary == rejected_summary:
            continue
        if len(chosen_summary.split()) < 5 or len(rejected_summary.split()) < 5:
            continue
        pair["chosen"] = prompt + "\n" + chosen_summary
        pair["rejected"] = prompt + "\n" + rejected_summary
        pairs.append(pair)
    return pairs

Splice the prompt-completions pairs:

  • prompt + chosen
  • prompt + rejected

Then tokenize the processed pairs and build them into a training dataset:

from torch.utils.data import Dataset
from tqdm import tqdm


class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in tqdm(pairs):
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer(
                "<|startoftext|>" + chosen + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            rejected_encodings_dict = tokenizer(
                "<|startoftext|>" + rejected + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
            self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])
            self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
            self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )

The above data is inconvenient to feed into the model as-is, so it is further organized into the following form:

  • input_ids: concatenate the chosen and rejected input_ids along dimension 0
  • attention_mask: concatenate the chosen and rejected attention_mask along dimension 0
  • labels: set the chosen part to 0 and the rejected part to 1, then concatenate along dimension 0; this step turns the string label into a numeric one

Note that after this processing, the batch size becomes twice the original.

import torch


class DataCollatorReward:
    def __call__(self, data):
        batch = {}
        batch["input_ids"] = torch.cat([f[0] for f in data] + [f[2] for f in data])
        batch["attention_mask"] = torch.cat([f[1] for f in data] + [f[3] for f in data])
        batch["labels"] = torch.tensor([0] * len(data) + [1] * len(data))
        return batch

3. Build a reward model

The structure of the RM is relatively simple: a transformer body plus a linear scoring head.

  • The transformer is initialized from the "CarperAI/openai_summarize_tldr_sft" pre-trained LLM, and the first 70% of its layers are frozen (no parameter fine-tuning), i.e., the original sft-llm's text-understanding ability is preserved
  • A linear classifier head outputs a dim=1 score used to rate completions

Define the loss:

  • The RM assigns a scalar reward to both the chosen and the rejected prompt-completions pair of each comparison; the 0/1 labels from the collator simply mark which half of the batch is chosen vs. rejected
  • The optimization objective pushes the chosen completion's reward above the rejected one's, loss = −log σ( r_chosen − r_rejected ); the better the separation, the smaller the loss

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class GPTRewardModel(nn.Module):
    def __init__(self, model_path):
        super().__init__()
        model = AutoModelForCausalLM.from_pretrained(model_path)
        self.config = model.config
        # `gpt-neo(x)` models use `hidden_size` attribute names instead of `n_embd``
        self.config.n_embd = self.config.hidden_size if hasattr(self.config, "hidden_size") else self.config.n_embd
        self.transformer = model.transformer
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        mc_token_ids=None,
        labels=None,
        return_dict=False,
        output_attentions=False,
        output_hidden_states=False,
    ):
        loss = None
        transformer_outputs = self.transformer(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        hidden_states = transformer_outputs[0]

        rewards = self.v_head(hidden_states).squeeze(-1)
        chosen_end_scores = []
        rejected_end_scores = []

        # Split the inputs and rewards into two parts, chosen and rejected
        assert len(input_ids.shape) == 2
        bs = input_ids.shape[0] // 2
        chosen = input_ids[:bs]
        rejected = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]

        loss = 0
        inference = False
        for i in range(bs):
            if torch.all(torch.eq(chosen[i], rejected[i])).item():
                c_inds = (chosen[i] == self.PAD_ID).nonzero()
                c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
                chosen_end_scores.append(chosen_rewards[i, c_ind - 1])
                inference = True
                continue

            # Check if there is any padding otherwise take length of sequence
            c_inds = (chosen[i] == self.PAD_ID).nonzero()
            c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
            r_inds = (rejected[i] == self.PAD_ID).nonzero()
            r_ind = r_inds[0].item() if len(r_inds) > 0 else rejected.shape[1]
            end_ind = max(c_ind, r_ind)

            # Retrieve first index where trajectories diverge
            divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
            assert divergence_ind > 0

            # Index into the correct rewards
            c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
            r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]

            # Append the last rewards to the list of end scores
            chosen_end_scores.append(c_truncated_reward[-1])
            rejected_end_scores.append(r_truncated_reward[-1])

            # Compute loss based on truncated rewards (ignore padding)
            loss += -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
        loss = loss / bs

        if not inference:
            chosen_end_scores = torch.stack(chosen_end_scores)
            rejected_end_scores = torch.stack(rejected_end_scores)

        if inference:
            chosen_end_scores = torch.stack(chosen_end_scores)
            return {"chosen_end_scores": chosen_end_scores}

        return {
            "loss": loss,
            "chosen_end_scores": chosen_end_scores,
            "rejected_end_scores": rejected_end_scores,
        }

Combining the above parts, you can train RM

# Initialize the reward model from the (supervised) fine-tuned GPT-J
model = GPTRewardModel("CarperAI/openai_summarize_tldr_sft")

# Freeze the first 70% of the hidden layers of the reward model backbone
layers = model.transformer.h
num_layers = len(layers)
num_unfrozen = int(0.3 * num_layers)
for layer in layers[:-num_unfrozen]:
    layer.requires_grad_(False)

# Create the comparisons datasets
data_path = "CarperAI/openai_summarize_comparisons"
train_pairs = create_comparison_dataset(data_path, "train")
val_pairs = create_comparison_dataset(data_path, "test")

# Make pairwise datasets for training
max_length = 550
train_dataset = PairwiseDataset(train_pairs, tokenizer, max_length=max_length)
val_dataset = PairwiseDataset(val_pairs, tokenizer, max_length=max_length)

# Create the collator to gather batches of pairwise comparisons
data_collator = DataCollatorReward()

Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    eval_dataset=val_dataset,
    data_collator=data_collator,
).train()

4. Start BP training

cd reward_model/ && deepspeed train_reward_model_gptj.py

If you want to save time, you can also directly download the open-source trained reward model from Hugging Face:

mkdir reward_model/rm_checkpoint
wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O reward_model/rm_checkpoint/pytorch_model.bin

0x4: training of policy model (PPO)

Since the value function of the PPO algorithm can be a deep learning model, in this case a transformer, the basic idea of the policy-gradient method is to express the objective as a function of the policy parameters and then update it according to the reward values fed back by the RM.

1. Normalization

Because the raw reward scores have a large variance, they are normalized against the human-written reference summaries, i.e.

R_norm(x, y) = R(x, y) − R(x, y_ref)

where R(x, y) and R(x, y_ref) denote the reward-model score of the generated summary and of the human reference summary, respectively. The code is implemented as follows:

from typing import List

def reward_fn(samples: List[str]):
    # split off the post body (everything before "TL;DR") for each sample
    posts = [sample.split('TL;DR')[0] for sample in samples]
    # rebuild the reference samples using the human-written summaries
    ref_samples = [post + 'TL;DR' + post_summ_dict[post] for post in posts]
    samples_encodings = reward_tokenizer(samples)
    samples_scores = reward_model(**samples_encodings)  # reward-model scores for the generated samples
    ref_samples_encodings = reward_tokenizer(ref_samples)
    ref_samples_scores = reward_model(**ref_samples_encodings)  # scores for the corresponding reference samples
    norms_rewards = samples_scores - ref_samples_scores
    return norms_rewards

2. KL Divergence

When using PPO for fine-tuning, the summary is generated by the strategy (LLM). The generated summary is passed to the reward model to generate reward points, and then the strategy is updated. Since the above operations are batch-wise, and because RL training is very noisy, especially in the initial stage, these may lead to excessive policy deviation. To prevent this problem, KL divergence is introduced as a penalty term to avoid excessive deviation of the policy model.

The total reward then becomes

R(x, y) = r_θ(x, y) − β · log[ π^{RL}(y | x) / π^{SFT}(y | x) ]

where r_θ(x, y) is the output score of the reward model, β is the penalty coefficient, π^{RL}(y | x) is the policy model, and π^{SFT}(y | x) is the supervised (SFT) model.

3. Start PPO training

accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py

0x5: Results

SFT vs PPO (ROUGE scores):

Model | Rouge-1 | Rouge-2 | Rouge-L | Average
SFT   | 0.334   | 0.125   | 0.261   | 0.240
PPO   | 0.323   | 0.109   | 0.238   | 0.223

Reward scores:

Model | Average Reward | Reward Δ
SFT   | 2.729          | -0.181
PPO   | 3.291          | +0.411

Reference link:

https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons/viewer/CarperAI--openai_summarize_comparisons/train?row=0
https://github.com/CarperAI/trlx 
https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf

7. RL4LMs - A modular RL library to fine-tune language models to human preferences

References:

https://github.com/allenai/RL4LMs

8. Limitations of RLHF and future work

  • Models trained with the RLHF paradigm perform better, but may still output harmful or factually inaccurate text. This imperfection is a long-term challenge and optimization target for RLHF.
  • When training a model with the RLHF paradigm, the cost of human annotation is very high, and RLHF performance can ultimately only reach the knowledge level of the annotators. Moreover, the human labeling here mainly consists of ranking the output texts for the RM; if one wanted to train the model on manually written answers, the cost would be unimaginable. Yet for both the SFT LLM and the RLHF LLM, the truly valuable and important signal is human-written completions.
  • There are still many areas for improvement in the RLHF process, and improving the RL optimizer is particularly important. PPO is a relatively old RL algorithm based on trust-region optimization, but no better algorithm for optimizing RLHF has emerged yet.

9. Another Paradigm for Reward Model Development

The RM is used in two scenarios:

  1. Receive a prompt-completions pair and return a numerical score (or a multi-dimensional numerical vector defined by human experts)
  2. Assist reinforcement-learning training of the SFT LLM

In scenario 1, there is actually another paradigm: implement a "prompt-completions pair quality reasoning chain" by constructing a prompt template. The prompt template includes the following elements:

  • prompt-completions pair input
  • problem definition
  • Evaluation Criteria Definition
  • Evaluation result output (can be designed to be formatted)

An example is as follows:


You are a fair AI assistant for checking the quality of the answers of other two AI assistants. 

    [Question] 

    {data['query']}

    [The Start of Assistant 1's Answer]

    llama chains: {data['llama_chains']}
    llama answer: {data['llama_answer']}

    [The End of Assistant 1's Answer]

    [The Start of Assistant 2's Answer]

    chatgpt chains: {data['chatgpt_chains']}
    chatgpt answer: {data['chatgpt_answer']}

    [The End of Assistant 2's Answer] 

    We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. 
    Please first judge if the answer is correct based on the question, if an assistant gives a wrong answer, the score should be low.
    Please rate the quality, correctness, helpfulness of their responses based on the question.
    Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance, your scores should be supported by reasonable reasons. 
    Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. 
    The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias, and the order in which the responses were presented does not affect your judgement.
    If the two assistants perform equally well, please output the same score for both of them.
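Since the template asks the judge to output the two scores on the first line separated by a space, a minimal (hypothetical) parsing helper could look like this:

def parse_judge_reply(reply: str):
    # first line: "score_1 score_2"; the rest is the free-form explanation
    first_line, _, explanation = reply.strip().partition("\n")
    score_1, score_2 = (float(tok) for tok in first_line.split()[:2])
    return {"assistant_1": score_1, "assistant_2": score_2, "explanation": explanation.strip()}

print(parse_judge_reply("7 9\nAssistant 2's answer is more complete and accurate."))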

Original article: https://blog.csdn.net/qq_39970492/article/details/131250227