Training a reward model for large-model reinforcement learning

In the previous post, "Summon Shenlong to Build Your Own ChatGPT", I showed how to use supervised fine-tuning (SFT) to train a GPT-2 model so that it can answer questions conversationally. In OpenAI's InstructGPT paper, SFT is only the first training stage. The second stage trains a reward model that scores the model's answers according to human preferences. In the third stage, reinforcement learning uses the rewards produced by this reward model to further train the model, so that it meets requirements such as safety and controllability. In this article I will describe how to train the reward model.

Dataset preparation

In the InstructGPT paper, OpenAI describes how the training data is prepared. A batch of prompts is collected, and for each prompt the first-stage SFT model generates several answers, for example nine. Human labelers then rank these nine answers by quality. Ranking is used rather than direct scoring because everyone applies different scoring standards to answers, whereas it is much easier to agree on which answer is better. Given a ranking, we can compute a pair-wise loss: the answers are compared two at a time, and the score gap between the better and the worse answer should be as large as possible. For example, if the quality of answer A is higher than that of answer B, the quality difference between the two can be expressed with the following term, where x is the prompt and y_a, y_b are the corresponding answers:

\log\left(\sigma\left(r_{\theta}(x, y_a) - r_{\theta}(x, y_b)\right)\right)

With k = 9 answers per prompt, pairing them gives \binom{k}{2} = 36 possible comparisons. The total loss is therefore:

loss\left(\theta\right) = -\frac{1}{\binom{k}{2}} E_{\left(x, y_a, y_b\right)\sim D}\left[\log\left(\sigma\left(r_{\theta}(x, y_a) - r_{\theta}(x, y_b)\right)\right)\right]

Minimizing this loss pushes the model to separate the scores of higher-quality and lower-quality answers as far as possible.
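To make the loss concrete, here is a minimal sketch of the pair-wise term in PyTorch, using made-up scalar scores for two chosen/rejected pairs (the numbers are purely illustrative):

import torch
import torch.nn.functional as F

# Made-up scores r(x, y_a) for the chosen answers and r(x, y_b) for the rejected ones.
reward_chosen = torch.tensor([1.2, 0.3])
reward_rejected = torch.tensor([-0.4, 0.9])

# -log(sigmoid(r_a - r_b)): small when the chosen answer scores higher than the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())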

OpenAI has not released the relevant dataset, and manually ranking answers is time-consuming, so I plan to use an open-source dataset to train the reward model. Anthropic provides a suitable one on Hugging Face: Anthropic/hh-rlhf · Datasets at Hugging Face. Anthropic was founded by former OpenAI employees who left over disagreements about how open the technology should be. Each record in the dataset has two fields, chosen and rejected, holding a better and a worse answer to the same prompt, which is exactly what we need for training.
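For reference, the dataset can be downloaded from the Hub and saved locally, presumably with something like the following sketch; the 'rlhf' directory is just the local path assumed by the conversion script below:

from datasets import load_dataset

# Download the comparison data; each record has a "chosen" and a "rejected" conversation.
ds = load_dataset("Anthropic/hh-rlhf")
print(ds)                                  # train/test splits
print(ds['train'][0]['chosen'][:200])      # peek at one chosen answer
ds.save_to_disk('rlhf')                    # the conversion script below loads this directory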

The following code converts the dataset: the Human and Assistant markers are replaced with Prompt and Response respectively, because that is the format my earlier SFT model was trained with, and an <|endoftext|> token is appended to the end of each sample:

from datasets import load_from_disk
import re
from tqdm import trange
from transformers import GPT2Tokenizer
import pickle

regex_human = re.compile(r'(\nHuman:)+')
regex_assistant = re.compile(r'(\nAssistant:)+')

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ds = load_from_disk('rlhf')

train_data = []
for i in trange(len(ds['train'])):
    chosen = ds['train'][i]['chosen']
    rejected = ds['train'][i]['rejected']
    chosen = re.sub(regex_human, '### Prompt:', chosen)
    chosen = re.sub(regex_assistant, '### Response:', chosen)
    chosen += '<|endoftext|>'
    rejected = re.sub(regex_human, '### Prompt:', rejected)
    rejected = re.sub(regex_assistant, '### Response:', rejected)
    rejected += '<|endoftext|>'
    chosen_ids = tokenizer.encode(chosen)
    rejected_ids = tokenizer.encode(rejected)
    train_data.append((chosen_ids, rejected_ids))

with open('reward_train.pkl', 'wb') as f:
    pickle.dump(train_data, f)

test_data = []
for i in trange(len(ds['test'])):
    chosen = ds['test'][i]['chosen']
    rejected = ds['test'][i]['rejected']
    chosen = re.sub(regex_human, '### Prompt:', chosen)
    chosen = re.sub(regex_assistant, '### Response:', chosen)
    chosen += '<|endoftext|>'
    rejected = re.sub(regex_human, '### Prompt:', rejected)
    rejected = re.sub(regex_assistant, '### Response:', rejected)
    rejected += '<|endoftext|>'
    chosen_ids = tokenizer.encode(chosen)
    rejected_ids = tokenizer.encode(rejected)
    test_data.append((chosen_ids, rejected_ids))

with open('reward_test.pkl', 'wb') as f:
    pickle.dump(test_data, f)
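As a quick sanity check (optional), one converted record can be decoded to confirm the ### Prompt: / ### Response: format and the trailing <|endoftext|> token:

from transformers import GPT2Tokenizer
import pickle

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
with open('reward_train.pkl', 'rb') as f:
    train_data = pickle.load(f)
# Print the first converted "chosen" sample.
print(tokenizer.decode(train_data[0][0]))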

Building the reward model

According to the InstructGPT paper, the reward model is best initialized from the SFT model, so I follow the same approach and build on the SFT model trained earlier. To produce a score from the input text, the final (hidden_dim, vocab_size) linear layer of the original model is removed and replaced with a (hidden_dim, 1) linear layer, which maps the model's last hidden state into a single score.
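Before showing the full model, here is a toy illustration of that head swap with made-up dimensions; the sequence score is read from the last token position:

import torch
from torch import nn

d_model = 768                                  # hidden size (GPT-2 small)
hidden = torch.randn(2, 10, d_model)           # (batch, seq_len, d_model) from the transformer
reward_head = nn.Linear(d_model, 1, bias=False)
scores = reward_head(hidden).squeeze(-1)       # (batch, seq_len) per-token scores
print(scores[:, -1])                           # score of each sequence at its final position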

The following code defines the reward model, following the structure of the GPT-2 model:

import torch
from torch import nn
from torch.nn import functional as F
import math
import inspect

class MHA(nn.Module):
    def __init__(self, d_model, num_heads, attn_pdrop, resid_pdrop):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.attn_pdrop = attn_pdrop
        self.resid_dropout = nn.Dropout(resid_pdrop)
        self.c_attn = nn.Linear(d_model, d_model*3)
        self.c_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask):
        B, T, C = x.size()
        x_qkv = self.c_attn(x)
        q, k, v = x_qkv.split(self.d_model, dim=2)
        q = q.view(B, T, self.num_heads, C//self.num_heads).transpose(1, 2)
        k = k.view(B, T, self.num_heads, C//self.num_heads).transpose(1, 2)
        v = v.view(B, T, self.num_heads, C//self.num_heads).transpose(1, 2)
        # scaled_dot_product_attention does not accept an explicit attn_mask together with
        # is_causal=True, so when a padding mask is supplied the causal constraint is folded
        # into it and passed as a single boolean mask (True = may attend).
        if attn_mask is not None:
            causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
            combined_mask = attn_mask.bool() & causal
            # Keep the diagonal unmasked so fully padded query rows are not all -inf (NaN).
            combined_mask = combined_mask | torch.eye(T, dtype=torch.bool, device=x.device)
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=combined_mask, dropout_p=self.attn_pdrop if self.training else 0)
        else:
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=self.attn_pdrop if self.training else 0, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        y = self.resid_dropout(y)
        return y

class FeedForward(nn.Module):
    def __init__(self, d_model, dff, dropout):
        super().__init__()
        self.c_fc = nn.Linear(d_model, dff)
        self.c_proj = nn.Linear(dff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.gelu = nn.GELU()

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x
    
class Block(nn.Module):
    def __init__(self, d_model, num_heads, dff, attn_pdrop, resid_pdrop, dropout):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = MHA(d_model, num_heads, attn_pdrop, resid_pdrop)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = FeedForward(d_model, dff, dropout)

    def forward(self, x, attn_mask):
        x = x + self.attn(self.ln_1(x), attn_mask)
        x = x + self.mlp(self.ln_2(x))
        return x

class RewardModel(nn.Module):
    def __init__(self, vocab_size, d_model, block_size, embed_pdrop, num_heads, dff, attn_pdrop, resid_pdrop, dropout, num_layer):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model, sparse=False)
        self.wpe = nn.Embedding(block_size, d_model, sparse=False)
        self.dropout_embed = nn.Dropout(embed_pdrop)
        self.h = nn.ModuleList([Block(d_model, num_heads, dff, attn_pdrop, resid_pdrop, dropout) for _ in range(num_layer)])
        self.num_layer = num_layer
        self.block_size = block_size
        #self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        #self.wte.weight = self.lm_head.weight
        self.reward_head = nn.Linear(d_model, 1, bias=False)
        self.ln_f = nn.LayerNorm(d_model)
        # With the GPT-2 vocabulary (50257 tokens) this is the id of <|endoftext|> (50256),
        # which marks the end of each answer and is used to locate the scoring position.
        self.PAD_ID = vocab_size - 1

        self.apply(self._init_weights)

        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * num_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids, reward_pos, attn_mask, return_loss=False):
        device = input_ids.device
        b, t = input_ids.size()
        pos = torch.arange(0, t, dtype=torch.long, device=device) 
        x = self.wte(input_ids) + self.wpe(pos)
        x = self.dropout_embed(x)
        for block in self.h:
            x = block(x, attn_mask)
        x = self.ln_f(x)
        rewards = self.reward_head(x).squeeze(-1)
        #x = torch.reshape(x, [b,t])
        #scores = torch.gather(x, dim=-1, index=reward_pos)

        chosen_end_scores = []
        rejected_end_scores = []

        bs = input_ids.shape[0] // 2
        chosen = input_ids[:bs]
        rejected = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]

        loss = 0
        for i in range(bs):
            if torch.all(torch.eq(chosen[i], rejected[i])).item():
                c_inds = (chosen[i] == self.PAD_ID).nonzero()
                c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
                chosen_end_scores.append(chosen_rewards[i, c_ind - 1])
                continue
            # Check if there is any padding otherwise take length of sequence
            c_inds = (chosen[i] == self.PAD_ID).nonzero()
            c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
            r_inds = (rejected[i] == self.PAD_ID).nonzero()
            r_ind = r_inds[0].item() if len(r_inds) > 0 else rejected.shape[1]
            # The end score of each answer is the reward at its last real token,
            # i.e. the position just before the first <|endoftext|>/padding token.
            chosen_end_scores.append(chosen_rewards[i, c_ind - 1])
            rejected_end_scores.append(rejected_rewards[i, r_ind - 1])

            if return_loss:
                end_ind = max(c_ind, r_ind)
                # Retrieve first index where trajectories diverge
                divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
                assert divergence_ind > 0

                # Index into the rewards from the divergence point up to the last real token
                c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
                r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]

                # Compute loss based on truncated rewards (ignore padding)
                loss += -F.logsigmoid(c_truncated_reward - r_truncated_reward).mean()
                #loss += -F.logsigmoid(chosen_rewards[i][c_ind-1]-rejected_rewards[i][r_ind-1])
        loss = loss / bs
        
        return chosen_end_scores, rejected_end_scores, loss

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")

        return optimizer
    
    # Carried over from the GPT-2 code; it is not used when training the reward model and
    # would need adapting first (forward() now takes different arguments and returns scores
    # rather than logits).
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None, block_size=512):
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

The main structure of the model is consistent with GPT-2. The key change is the added reward_head linear layer, which maps the model's final hidden state into a single value. When computing the loss, the difference between the scores of the chosen and the rejected answer is plugged into the pair-wise loss formula introduced above.

Training

There is nothing special about the training code itself. The main point is that when reading the data, each chosen sample and its corresponding rejected sample must be fed through the model together. The following code defines the dataset:

import torch
from torch.utils.data import Dataset
import pickle

class RewardDataset(Dataset):
    def __init__(self, dataset_file, block_size):
        with open(dataset_file, 'rb') as f:
            self.data = pickle.load(f)
        self.block_size = block_size
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        # Copy the cached token lists so the padding below does not mutate the stored data
        # and grow the sequences across epochs.
        chosen = list(self.data[index][0])
        rejected = list(self.data[index][1])
        delta_len = self.block_size - len(chosen)
        if delta_len >= 0:
            reward_pos_chosen = torch.IntTensor([len(chosen)-1])
            attn_mask_chosen = [1 for _ in range(len(chosen))]
            attn_mask_chosen.extend([0 for _ in range(delta_len)])
            chosen.extend([0 for _ in range(delta_len)])
        else:
            reward_pos_chosen = torch.IntTensor([self.block_size-1])
            chosen = chosen[:self.block_size]
            attn_mask_chosen = [1 for _ in range(self.block_size)]

        delta_len = self.block_size - len(rejected)
        if delta_len >= 0:
            reward_pos_rejected = torch.IntTensor([len(rejected)-1])
            attn_mask_rejected = [1 for _ in range(len(rejected))]
            attn_mask_rejected.extend([0 for _ in range(delta_len)])
            rejected.extend([0 for _ in range(delta_len)])
        else:
            reward_pos_rejected = torch.IntTensor([self.block_size-1])
            rejected = rejected[:self.block_size]
            attn_mask_rejected = [1 for _ in range(self.block_size)]

        chosen = torch.IntTensor(chosen)
        rejected = torch.IntTensor(rejected)
        attn_mask_chosen = torch.FloatTensor(attn_mask_chosen)
        attn_mask_rejected = torch.FloatTensor(attn_mask_rejected)

        return chosen, rejected, reward_pos_chosen, reward_pos_rejected, attn_mask_chosen, attn_mask_rejected

The following function loads the parameters of the previously trained SFT model into the reward model:

def load_sft(checkpointname, vocab_size, device):
    # GPT2 is the model class defined in the previous SFT post; its checkpoint supplies the weights.
    checkpoint = torch.load(checkpointname)
    config = checkpoint['config']
    config['num_heads'] = config['num_head']
    config.pop('num_head')
    config['vocab_size'] = vocab_size
    model_sft = GPT2(**config)
    model_sft = torch.compile(model_sft)
    model_sft.load_state_dict(checkpoint['model_state_dict'])
    sd_sft = model_sft.state_dict()
    sd_keys_sft = sd_sft.keys()
    sd_keys_sft = [k for k in sd_keys_sft if not k.endswith('lm_head.weight')] # skip lm_head: it is tied to wte and the reward model has no lm_head
    #sd_keys_sft = [k for k in sd_keys_sft if not k.endswith('.attn.bias')] # same, just the mask (buffer)

    model = RewardModel(**config)
    model.to(device)
    model = torch.compile(model)
    sd = model.state_dict()

    for k in sd_keys_sft:
        assert sd_sft[k].shape == sd[k].shape
        with torch.no_grad():
            sd[k].copy_(sd_sft[k])

    del model_sft, sd_sft

    return model
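For reference, a hypothetical call might look like the following; the checkpoint name, vocabulary size and optimizer hyperparameters are placeholders for whatever the SFT training produced:

model = load_sft('model_sft.pt', vocab_size=50257, device='cuda')
optimizer = model.configure_optimizers(weight_decay=0.1, learning_rate=6e-4,
                                       betas=(0.9, 0.95), device_type='cuda')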

The last step is the training loop. There is nothing special about the code:

    dataset = RewardDataset(args.dataset, 1024)
    dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=4)

    total_loss = 0

    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

    for epoch in range(start_epoch, start_epoch+args.num_epoch):
        start = time.time()
        for batch, (chosen,rejected,pos_chosen,pos_rejected,attn_mask_chosen,attn_mask_rejected) in enumerate(dataloader):
            optimizer.zero_grad()
            lr = get_lr(batch+epoch*args.steps_epoch, args.warmup_steps, args.learning_rate, args.steps_epoch*args.total_epochs)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr

            input_data = torch.cat((chosen, rejected), 0)
            pos = torch.cat((pos_chosen, pos_rejected), 0)
            input_data = input_data.to(args.device)
            pos = pos.long().to(args.device)
            attn_mask_temp = torch.cat((attn_mask_chosen, attn_mask_rejected), 0)
            attn_mask_temp = attn_mask_temp.to(args.device)
            attn_mask_temp = torch.unsqueeze(attn_mask_temp, -1)
            attn_mask = torch.bmm(attn_mask_temp, attn_mask_temp.transpose(1,2))
            attn_mask = torch.unsqueeze(attn_mask, 1)

            if mixed:
                # Note: this path passes attn_mask=None, so the model falls back to plain
                # causal attention and padded positions are not masked out.
                with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
                    reward_c, reward_r, loss = model(input_data, pos, None, return_loss=True)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                reward_c, reward_r, loss = model(input_data, pos, attn_mask, return_loss=True)
                loss.backward()
                optimizer.step()
            total_loss += loss.item()
            #total_accuracy += accuracy(logits, y)
            total_accuracy = 0
            if batch%100 == 0 and batch>0:
                line = f'Batch: {batch+epoch*args.steps_epoch}, Loss: {total_loss/100:.4f}, Learning_rate: {lr:.7f}'
                with open(args.logfile, 'a') as logfile:
                    logfile.write(line+'\n')
                print(line)
                total_loss = 0
                total_accuracy = 0
                if batch%args.steps_epoch == 0:
                    break

Summary

The above is the full process of training a reward model. Once trained, it can score the answers produced by the SFT model; this score serves as the reward in the final reinforcement learning stage, which adjusts the model so that its answers better match human preferences.
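To give a rough idea of how scoring could look, here is a hypothetical sketch; the checkpoint name and the way the config and weights are stored are placeholders, and the text must follow the same ### Prompt: / ### Response: format used during training:

import torch
from transformers import GPT2Tokenizer

device = 'cuda'
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
checkpoint = torch.load('reward_model.pt')            # hypothetical checkpoint file
model = RewardModel(**checkpoint['config'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

text = "### Prompt: How do I boil an egg?### Response: Put it in boiling water for about eight minutes.<|endoftext|>"
ids = torch.tensor(tokenizer.encode(text), dtype=torch.long, device=device).unsqueeze(0)
# forward() splits its batch into a chosen half and a rejected half, so duplicate the
# sequence; identical pairs take the early branch and only the end score is returned.
with torch.no_grad():
    chosen_scores, _, _ = model(ids.repeat(2, 1), None, None)
print(chosen_scores[0].item())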

Origin blog.csdn.net/gzroy/article/details/132630418