LLM Fine-Tuning (3) | An Analysis of RLHF + Reward Model + PPO in Large Models

        This article takes a closer look at RLHF (Reinforcement Learning with Human Feedback), the RM (reward model), and the PPO (Proximal Policy Optimization) algorithm. It then walks through code that uses RLHF to train your own large model and reward model (RM). Finally, it briefly covers model toxicity and hallucination, and how to build a product and a generative-AI lifecycle that are helpful, honest, harmless, reliable, and aligned with human feedback.

1. RLHF (Reinforcement Learning with Human Feedback)


       Let's start with a simple example: imagine we are building an LLM conversational AI product that offers therapy to people going through tough times. What happens if we train a large model but do not align it with human values? The model might suggest illegal ways for these individuals to feel better, such as substance abuse, which causes harm and makes the product unreliable and unhelpful. As OpenAI's CTO has noted, the field of large models is booming; to make models more reliable, more consistent, and less prone to hallucination, the practical path is to incorporate human feedback from diverse groups of people, together with techniques such as RAG and LangChain that ground responses in context. Across the generative-AI lifecycle, the goal is to maximize helpfulness, minimize harm, and avoid engaging with dangerous topics.

       Before diving into RLHF, let's first review the basic principles of reinforcement learning (RL), illustrated in the figure below:

[Figure: the agent-environment interaction loop in reinforcement learning]

     RL is a process of continuous interaction between an Agent and an Environment. The Agent observes the current state of the Environment and performs an action; the action affects the Environment, which transitions to a new state. If the action is good (it moves the Environment toward what we want), the Agent receives a positive reward; otherwise it receives a negative reward. The objective is to maximize the cumulative reward over the whole interaction.
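
To make the loop concrete, here is a minimal, purely illustrative sketch in Python; ToyEnvironment, ToyAgent, and the "walk to 10" task are hypothetical stand-ins, not part of any library or of the code later in this article.

import random

class ToyEnvironment:
    """Hypothetical environment: the state is an integer we want to drive to 10."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The action is +1 or -1; the reward is positive when we move toward the goal.
        old_distance = abs(10 - self.state)
        self.state += action
        new_distance = abs(10 - self.state)
        reward = 1 if new_distance < old_distance else -1
        done = (self.state == 10)
        return self.state, reward, done

class ToyAgent:
    """Hypothetical agent with a trivial random policy."""
    def act(self, state):
        return random.choice([1, -1])

env, agent = ToyEnvironment(), ToyAgent()
state, total_reward, done = env.state, 0, False
while not done:
    action = agent.act(state)               # the Agent acts in the current state
    state, reward, done = env.step(action)  # the Environment transitions and returns a reward
    total_reward += reward                  # objective: maximize the cumulative reward
print(f"cumulative reward: {total_reward}")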

2. Where is RL used in large models?


       The same components exist for a large model: an Agent, an Environment, and the current context. Here the policy is our pre-trained or fine-tuned LLM, and the goal is to generate text in a given domain. Taking an action means generating the next token (or a full completion) given the current context window and the environment's context, and that action earns a reward. Deciding how to reward the policy is exactly where human feedback comes in.
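
Informally, the correspondence can be written down as follows; this mapping is only an illustration, not code used anywhere in the training pipeline.

# Illustrative mapping of RL concepts onto the LLM setting (not library code).
rl_to_llm = {
    "agent / policy": "the pre-trained or fine-tuned (instruct) LLM",
    "environment":    "the current context window and the target domain",
    "state":          "the prompt plus the tokens generated so far",
    "action":         "generating the next token (or a full completion)",
    "reward":         "a score derived from human feedback via the reward model",
}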

3. Introduction to Reward Model

       A reward model is trained on human feedback data. Once trained, it is called during RLHF without any further human participation: for each user prompt, it assigns a reward to the completion generated by the policy. Generating a completion and scoring it in this way is called a "rollout".
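
As a rough sketch of a single rollout (policy_llm and reward_model below are hypothetical placeholders used only to show the data flow, not objects from this article's code):

def rollout(policy_llm, reward_model, prompt):
    """One rollout: the policy generates a completion and the reward model scores it.

    policy_llm.generate and reward_model.score are assumed, illustrative callables;
    note that no human is involved at this stage.
    """
    completion = policy_llm.generate(prompt)          # the policy produces a response
    reward = reward_model.score(prompt, completion)   # the RM assigns a scalar reward
    return completion, reward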

So how do you build a dataset of human feedback?

[Figure: the process of collecting human feedback]

The dataset format is as follows:

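A human-feedback (preference) dataset typically stores, for each prompt, the completion the labeler preferred and the one they rejected. The record below is illustrative only; the field names are an assumption, not the article's exact schema.

# One hypothetical preference record built from human rankings of two completions.
human_feedback_example = {
    "prompt":   "Summarize the following conversation. ...",
    "chosen":   "Tommy says he did not enjoy the movie.",   # ranked higher by the labeler
    "rejected": "The movie was about a person.",            # ranked lower by the labeler
}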

4. Reward Model Training

With the human feedback data set, we can train the RM model based on the following process:

[Figure: the reward model training process]
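
This process boils down to a pairwise ranking objective: the RM should score the chosen completion higher than the rejected one. Below is a minimal sketch of that loss in PyTorch, following the common Bradley-Terry style formulation; the rm module and the batch field names are assumptions, not the article's exact code.

import torch.nn.functional as F

def reward_model_loss(rm, batch):
    # rm(input_ids) is assumed to return one scalar reward per sequence.
    r_chosen = rm(batch["chosen_input_ids"])      # reward for the preferred completion
    r_rejected = rm(batch["rejected_input_ids"])  # reward for the rejected completion
    # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()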

5. Use RLHF (PPO & KL Divergence) for fine-tuning

  1. Feed a prompt dataset to the initial LLM;

  2. Feed a large number of prompts to the instruct LLM and collect its responses;

  3. Pass each prompt and its completion to the trained RM; the RM produces a score for each pair, and these scores are fed to the RL algorithm;

  4. The RL algorithm used here is PPO: it generates responses for the prompts, ranks them by their average reward, and uses backpropagation to update the instruct LLM so that higher-reward responses become more likely;

  5. After a few iterations, you end up with a model tuned to maximize the reward, but this approach has a downside.

PS: What if the model, trained over and over toward positive rewards, starts delivering weird, vague, un-human output?


        To solve this problem, we adopt the following process:

[Figure: RLHF fine-tuning with a frozen reference model and a KL-divergence penalty]

       First, we keep a reference model whose weights are all frozen; it serves as the anchor for the model being human-aligned. We then add a KL-divergence penalty, measured between the updated model and this reference, to the reward, so that when the model starts to hallucinate or drift it is pulled back toward the reference model and produces responses that are positive, but not strangely positive. We can train the PPO model through a PEFT adapter, so only a small fraction of the weights is updated and the model becomes more and more aligned over successive rollouts, as sketched below.
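
A condensed sketch of that loop with the trl library, using the ppo_model, ref_model, tokenizer, dataset, and toxicity reward model that Section 6 below builds. The hyperparameters, generation settings, and reward wiring here are illustrative assumptions; trl's PPOTrainer applies the KL penalty against ref_model internally, weighted by init_kl_coef.

def collator(data):
    # Simple collator: turn a list of dicts into a dict of lists (variable-length inputs).
    return {key: [d[key] for d in data] for key in data[0]}

config = PPOConfig(
    model_name=model_name,
    learning_rate=1.41e-5,
    ppo_epochs=1,
    mini_batch_size=4,
    batch_size=16,
    init_kl_coef=0.2,        # weight of the KL penalty against the frozen reference model
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,      # PEFT/LoRA-wrapped policy with a value head
                         ref_model=ref_model,  # frozen reference model for the KL penalty
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)

generation_kwargs = {"min_length": 5, "max_new_tokens": 100, "do_sample": True, "top_k": 0.0, "top_p": 1.0}

for batch in ppo_trainer.dataloader:
    prompt_tensors = batch["input_ids"]

    # Rollout: the policy generates a response for each prompt.
    response_tensors = [ppo_trainer.generate(p, **generation_kwargs).squeeze() for p in prompt_tensors]
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Score prompt + response with the reward model (here: the "nothate" logit).
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    reward_inputs = toxicity_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    reward_logits = toxicity_model(input_ids=reward_inputs.input_ids).logits
    rewards = [logit for logit in reward_logits[:, not_hate_index]]

    # PPO step: update the policy; trl adds the KL penalty w.r.t. ref_model on top of these rewards.
    stats = ppo_trainer.step(prompt_tensors, response_tensors, rewards)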

6. Fine-tuning practice with RLHF (PEFT + LoRA + PPO)

6.1 Install related packages

!pip install --upgrade pip
!pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

!pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
!pip install git+https://github.com/lvwerra/trl.git@25fa1bd

6.2 Import related packages

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

6.3 Load the LLaMA 2 model and the dataset

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")

huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)
dataset_original

6.4 Preprocess the dataset

# NOTE: model_name is not defined elsewhere in this article; the Seq2Seq/PEFT code below
# expects a FLAN-T5-style checkpoint, so one is assumed here.
model_name = "google/flan-t5-base"

def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):
    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """
    # Load the dataset (only the "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length,
                             batched=False)

    # Prepare the tokenizer. Setting device_map="auto" allows switching between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)

6.5 Count the trainable model parameters

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

6.6 Add the LoRA adapter to the original model, then build the PEFT model from it, setting is_trainable=True.

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       '/kaggle/input/generative-ai-with-llms-lab-3/lab_3/peft-dialogue-summary-checkpoint-from-s3/',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)
print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

ref_model = create_reference_model(ppo_model)
print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

  Use Meta AI's RoBERTa-based hate speech model (https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) as the reward model. This model outputs logits and predicts probabilities for two classes: nothate and hate. The logit of the nothate class is taken as the (positive) reward value, and these reward values are then used to fine-tune the LLM with PPO.

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')
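
For contrast, we can score a more hostile sentence in the same way; the sample text below is made up for illustration, and its "not hate" logit, i.e. its reward, should come out lower.

toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# A lower "not hate" logit means a lower reward for this completion.
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')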

6.7 Evaluating the toxicity of the model

toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement",
                                   toxic_label="hate")
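
As a quick sanity check, we can score a couple of sample completions with the evaluator (this call follows the standard evaluate measurement API; the inputs reuse the sample texts from above):

# Compute per-text toxicity scores for the two sample completions.
toxicity_scores = toxicity_evaluator.compute(predictions=[non_toxic_text, toxic_text])
print("Toxicity scores:", toxicity_scores["toxicity"])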

