Discussion on the basic process of large model training

Discussion of three fine-tuning steps

Step 1: Supervised fine-tuning

Supervised fine-tuning (SFT) has indeed made significant progress in the field of large language models (LLMs). However, unexpected behavior may still occur, such as repeated content generation and inconsistencies between perplexity (PPL) scores and generation quality.

According to our testing, several settings affect generation behavior:

  • weight decay: The OPT model is pre-trained with weight decay, and fine-tuned models usually inherit this setting. However, it may not produce the model quality we want. In particular, for our OPT-1.3B example, we disabled weight decay.
  • dropout: Similarly, dropout is used in OPT pre-training, but SFT may not require it. In particular, for our OPT-1.3B example, we enabled dropout.
  • dataset: Using more data usually yields better model quality. However, if the sources of the datasets differ too much, performance may suffer. For our OPT-1.3B example, we use the following four datasets: Dahoas/rm-static, Dahoas/full-hh-rlhf, Dahoas/synthetic-instruct-gptj-pairwise, and yitingxie/rlhf-reward-datasets.
  • training epochs: Typically, to avoid overfitting, we choose fewer training epochs if they can achieve similar model quality (here we use PPL as the metric). However, similar to what InstructGPT points out, we found that even if longer training causes overfitting, longer training is still recommended for better generation quality. In particular, for our OPT-1.3B example, we use 16 epochs, even though 1 or 2 epochs of training can achieve the same PPL score.
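To make these knobs concrete, here is a minimal sketch (an illustration only, not the actual DeepSpeed-Chat script) of how they might map onto Hugging Face transformers settings; the batch size and learning rate are placeholder values:

from transformers import AutoConfig, AutoModelForCausalLM, TrainingArguments

model_name = "facebook/opt-1.3b"
config = AutoConfig.from_pretrained(model_name)   # OPT's config exposes `dropout`; leave it > 0 to keep dropout enabled
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)

training_args = TrainingArguments(
    output_dir="./sft-opt-1.3b",
    num_train_epochs=16,             # longer training despite the early PPL plateau
    weight_decay=0.0,                # weight decay disabled for this example
    per_device_train_batch_size=8,   # placeholder value
    learning_rate=1e-5,              # placeholder value
)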

Step 2: Reward model fine-tuning

Reward model (RM) fine-tuning is indeed similar to SFT; the main differences are: (1) the training dataset is different: RM requires both a good and a bad response to the same query; (2) the training loss is different: RM uses a ranking loss as the optimization target.

We provide two metrics for the reward model: (1) the reward score for accepted responses (and for rejected responses), and (2) accuracy, i.e., how often the accepted response receives a higher score than the rejected one. Sometimes we observe very high accuracy, yet the average reward score of the accepted answer is negative, or the score of the rejected answer is similar to that of the accepted answer. Does this affect the quality of the step-3 model? If we only look at the reward score gain in step 3, this may not be a problem. However, this machine-learning metric (the increase in reward score) does not really reflect the generation quality of the step-3 model, so we do not have a clear answer yet.
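As a minimal sketch of the ranking objective and the two metrics above (a PyTorch illustration assuming the reward model outputs one scalar score per response; not the actual pipeline code):

import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor):
    """chosen_scores / rejected_scores: shape (batch,), the scalar reward assigned
    to the accepted and rejected responses of the same prompts."""
    # ranking loss: -log(sigmoid(r_chosen - r_rejected)) pushes accepted above rejected
    loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
    # accuracy: how often the accepted response out-scores the rejected one
    accuracy = (chosen_scores > rejected_scores).float().mean()
    return loss, accuracy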

Here we share more about what we observed during our exploration:

  • weight decay: For our OPT-350m example, we enabled a weight decay of 0.1.
  • dropout: For our OPT-350m example, we disabled dropout.
  • dataset: For our OPT-350m example, we use the following four datasets: Dahoas/rm-static, Dahoas/full-hh-rlhf, Dahoas/synthetic-instruct-gptj-pairwise, and yitingxie/rlhf-reward-datasets.
  • training epochs: InstructGPT recommends fine-tuning the model for 1 epoch, because overfitting hurts step-3 performance. During our exploration we did not see overfitting when we increased the number of training epochs. Nevertheless, following the authors' instructions, we set the number of training epochs to 1.

Additionally, we share more of our explorations here, even though we have not yet exposed them as options or included them in the current pipeline:

  • multiple answers for one prompt: In InstructGPT, the authors specifically mention that using a single pair of rejected and accepted answers per prompt is not ideal for reward model training. Therefore, InstructGPT constructs its dataset with 4 to 9 answers per prompt. However, we did not find a good dataset with this property.
  • initialize RM with SFT or pre-trained checkpoint: We tested this internally and did not see much difference in either accuracy or reward score. Furthermore, in InstructGPT the authors found the same. However, users are encouraged to try it for their own use cases.
  • reward score calculation: We use the final token (or the first padding token) to obtain the reward score. However, this may not be the best choice; for example, users can try the average score over the entire answer instead (see the sketch after this list).
  • reward loss objective: We simply use ranking loss as the objective. However, other objectives, such as MSE, could also be an option.
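As noted in the reward score calculation item above, here is a small sketch contrasting the last-token score with an average over the answer; the per-token value tensor and attention mask are assumed inputs for illustration:

import torch

def reward_from_values(values: torch.Tensor, attention_mask: torch.Tensor,
                       pooling: str = "last") -> torch.Tensor:
    """values: (batch, seq_len) per-token outputs of the reward head;
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding."""
    if pooling == "last":
        # score of the last non-padding token (the choice described above)
        last_idx = attention_mask.long().sum(dim=1) - 1
        return values.gather(1, last_idx.unsqueeze(1)).squeeze(1)
    if pooling == "mean":
        # alternative: average over all non-padding tokens
        masked = values * attention_mask
        return masked.sum(dim=1) / attention_mask.sum(dim=1)
    raise ValueError(f"unknown pooling: {pooling}")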

Step 3: RLHF fine-tuning

RLHF fine-tuning is the most complex of the three training steps. Similar to SFT, the reward score does not truly reflect model generation quality. Furthermore, we observed that the reward score sometimes drops to its initial level at a certain point and then recovers quickly. To make matters worse, we also saw that training can diverge quite easily. We share our settings and observations here.

  • weight decay: For our OPT-1.3B/350m (actor/critic) example, we disabled weight decay for both models.
  • dropout: We disabled dropout for OPT-1.3B and enabled it for OPT-350m.
  • dataset: We use the following single dataset: Dahoas/rm-static.
  • training epochs: The reward score quickly plateaus, so we set the number of training epochs to 1 for our OPT-1.3B/350m (actor/critic) example. However, as with SFT, longer training may lead to better model quality.
  • ema checkpoint: We observe that the EMA checkpoint often leads to better model generation quality, as described in InstructGPT.
  • PPO related hyperparameters: There are many hyperparameters for PPO training; see here. Currently we have them hard-coded, but you may want to adjust them based on your own usage.
  • mix unsupervised training: InstructGPT recommends mixing PPO updates with unsupervised training to preserve the model's benchmark quality. However, when we directly applied the hyperparameters from InstructGPT, the model failed to converge, so we stopped exploring this. Users are nevertheless encouraged to test it and tune the hyperparameters for their own use.
  • diverging issue: We found that it is very unstable to use different generation batch sizes (--per_device_train_batch_size) and PPO training batch sizes (--per_device_mini_batch_size), multiple PPO training epochs (--ppo_epochs), or multiple generation batches (--generation_batch_numbers). These all point to the same problem: we cannot update the actor model multiple times on the same batch of generated experience. Therefore, in all our successful runs we set per_device_train_batch_size = per_device_mini_batch_size and ppo_epochs = generation_batch_numbers = 1. One of the most likely reasons for this instability is that the log_probs and old_log_probs used in actor_loss_fn diverge rapidly even within two consecutive iterations, which results in a huge ratio. Setting a strict upper bound (clipping) can alleviate the problem, but it cannot completely solve the convergence issue; a sketch of such a clipped loss follows below.
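For reference, a minimal sketch of a clipped PPO actor loss in the spirit of the discussion above (the function name, clipping constant, and tensor shapes are illustrative; this is not the exact DeepSpeed-Chat implementation):

import torch

def actor_loss_fn(log_probs, old_log_probs, advantages, mask, clip_eps=0.2):
    """log_probs / old_log_probs: (batch, seq) token log-probabilities under the
    current actor and the actor that generated the experience; advantages:
    (batch, seq); mask: 1 for response tokens that should contribute to the loss."""
    # the ratio exp(log_probs - old_log_probs) is what blows up when the actor
    # is updated many times on stale generations
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.sum(torch.max(unclipped, clipped) * mask) / mask.sum()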

Large model training can also be divided into four stages:

1. Pre-training: At this stage the model is mostly trained with unsupervised or weakly supervised learning, allowing it to become a well-read, knowledgeable generalist.

2. Model fine-tuning: This stage mainly adds a small amount of labels or knowledge to the pre-trained model, helping the generalist organize its knowledge into a coherent system.

3. Downstream task learning: This stage trains the model's specialized skills, making it more capable on top of its general knowledge, and it also reshapes the general knowledge system.

4. Alignment learning: The model is knowledgeable and capable, but it also needs to understand people better and communicate with them more easily, so it needs to be aligned. The current mainstream approach for this stage is RLHF.

The above stages are not carried out in a single round; many rounds of iteration are usually needed to achieve better model performance. In the first few rounds the division of labor above is carried out sequentially, with relatively clear boundaries. In later iterations, however, the boundaries become more blurred, and several methods are often used together. So it is enough to know that these stages and techniques exist; there is no need to worry about drawing clear boundaries between them.

Large model training method
finetune

The core idea of fine-tuning is to take a model pre-trained on a large dataset (such as ImageNet or COCO) and then adapt it with a smaller dataset (often with fewer samples than the model has parameters) [3]. Compared with training a model from scratch, fine-tuning saves a great deal of computing resources and time, improves computational efficiency, and can even improve accuracy [1][2].

Fine-tuning means further training a pre-trained model for a specific task to improve its performance. There are many concrete fine-tuning methods, but in general one can fine-tune by adjusting the number of layers of the model, the learning rate, the batch size, and so on [2].
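As a minimal illustration of this idea (assuming a torchvision ResNet pre-trained on ImageNet; the number of target classes and the learning rate are placeholders), one common recipe freezes the backbone and trains only a new task-specific head with a small learning rate:

import torch
import torchvision

# start from a model pre-trained on a large dataset (ImageNet here)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# freeze the backbone and fine-tune only a new classification head
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # e.g. 10 target classes

# fine-tuning typically uses a smaller learning rate than training from scratch
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-4)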

The advantage of fine-tuning is that it does not require retraining the entire model, which improves efficiency: the accuracy of a newly trained model generally starts from a very low value and rises slowly, whereas fine-tuning yields a good result after a relatively small number of iterations.

Although fine-tuning has many advantages, it also has shortcomings. For example, fine-tuning still requires a sufficiently large dataset to improve the model's performance, which can make some tasks difficult. In addition, the performance of fine-tuning depends largely on the quality and applicability of the pre-trained model; if there is a mismatch between the pre-trained model and the fine-tuning dataset, fine-tuning may not improve performance [1].

In the future, fine-tuning will continue to be widely used. On the one hand, as deep learning models keep developing and improving, the quality and applicability of pre-trained models will also improve, making them more suitable for fine-tuning. On the other hand, fine-tuning will help solve practical problems such as small datasets and the difficulty of dataset annotation [1][3].

prompt learn

The basic concept of prompt learning: prompt learning is a natural language processing technique that guides the model to complete different tasks by adding a short prompt text in front of the input of a pre-trained model [1]. These prompts are usually questions or instructions that tell the model how to interpret the input and generate the output. The advantage of prompt learning is that multiple tasks can be handled with a small amount of data [2].

Multi-prompt learning: multi-prompt learning is an extension of prompt learning that applies multiple prompts to one problem for data augmentation or problem decomposition [1]. Common multi-prompt methods include the parallel method, the augmentation method, and the composition method [2]. The parallel method runs multiple prompts in parallel and aggregates the results of the individual prompts by weighting or voting; the augmentation method feeds a case similar to the current problem together with the current input so that the model can predict more accurately; the composition method combines multiple prompts to train the model for more complex tasks [2]. A toy sketch of the parallel method follows below.
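A toy sketch of the parallel method, combining the label probabilities produced by several prompts through weighted averaging (the probabilities below are made-up placeholders):

import numpy as np

def ensemble_prompts(per_prompt_probs, weights=None):
    """per_prompt_probs: list of arrays of shape (num_labels,), one per prompt.
    Returns the label index chosen after weighted averaging."""
    probs = np.stack(per_prompt_probs)                    # (num_prompts, num_labels)
    weights = np.ones(len(probs)) if weights is None else np.asarray(weights)
    weights = weights / weights.sum()
    averaged = (weights[:, None] * probs).sum(axis=0)
    return int(np.argmax(averaged))

# three prompts scoring the labels [negative, positive]
print(ensemble_prompts([np.array([0.3, 0.7]),
                        np.array([0.4, 0.6]),
                        np.array([0.6, 0.4])]))           # -> 1 (positive)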

How to choose a suitable pre-trained model: choosing a suitable pre-trained model is one of the key steps in prompt learning. When selecting a model, factors such as task type, dataset, model size, and training time need to be considered [1]. Usually, the larger the pre-trained model, the better its performance on various tasks, but it also requires more computing resources [1].

How to adjust the prompt training strategy: another key step in prompt learning is how to adjust the training strategy. Prompts can simply be used to improve the model under full data, used as an auxiliary method in few-shot or zero-shot settings, or the pre-trained model can be frozen so that only the prompt is trained [1].

As shown in the figure above, the fine-tune approach uses PLMs as the basic encoder during pre-training. For downstream tasks, additional neural layers are added for the specific task and all parameters are tuned, so there is a gap between the pre-training and fine-tuning tasks.

As shown in the prompt-learning figure above, the same MLM task is used during pre-training and downstream fine-tuning, which bridges the gap between model tuning and pre-training and enhances few-shot learning capability. PLMs are used as the base encoder, additional context (a template) and a [MASK] position are added, and labels are projected onto label words (a verbalizer), closing the gap between pre-training and fine-tuning.

The figure above is a schematic of converting a user-review question into a prompt, including template selection, template packaging, selecting the MLM output word, and mapping that word to review sentiment.
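A minimal sketch of that pipeline, assuming a BERT-style masked language model (the model name, template, and label words are illustrative choices, not the ones from the figure):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # assumed masked LM for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

review = "The food arrived cold and the service was slow."
# template packaging: wrap the comment with a cloze-style template
text = f"{review} It was {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# find the [MASK] position and read out the label-word scores (verbalizer)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
verbalizer = {"positive": "great", "negative": "terrible"}
scores = {label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
          for label, word in verbalizer.items()}
print(max(scores, key=scores.get))  # predicted sentiment label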

template selection

Manual template design means that experts design a set of templates, based on their understanding of the problem, that convert the solution of a specific problem into an expression suited to natural language generation. Below is a hand-made structured template for QA problems, which turns QA into a problem where a generative model produces the output.

Automatic template search generates prompt templates by selecting a meta-template and then searching, guided by gradients over existing words, for the optimal prompt template.

T5 can be used to automatically generate templates for multiple input sentences. The general steps are: 1. use existing templates to train a T5 model, letting it learn from the corpus (flatten all the task inputs into one input sequence, with the final template as the output); 2. feed the task input to the trained model to generate a template.

Another option is to let the pre-trained model generate the template itself. The idea is as follows: fix the main pre-trained model, train it on the labeled task, and let it modify the input sentence embeddings as it learns. The original sentence is not changed; the model only changes the non-input (prompt) parts of the sequence, and in the end it automatically learns the best prompt template. Of course, such a template may not be human-readable.

P-tuning v1: apply prompts at the input layer (using reparameterization)

P-tuning v2: apply prompts at every layer (similar to prefix tuning)
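A rough sketch of the shared idea behind these methods: the pre-trained model stays frozen and a small set of prompt embeddings, prepended to the input embeddings, is all that is trained (the class name and sizes are illustrative assumptions):

import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend n_prompt trainable embeddings to the inputs of a frozen LM
    (roughly the prompt-tuning / P-tuning v1 idea)."""

    def __init__(self, base_model, n_prompt: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():   # freeze the pre-trained model
            p.requires_grad = False
        hidden = base_model.get_input_embeddings().embedding_dim
        self.prompt = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)

    def forward(self, input_ids, attention_mask=None):
        embeds = self.base_model.get_input_embeddings()(input_ids)
        prompt = self.prompt.unsqueeze(0).expand(embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, embeds], dim=1)
        if attention_mask is not None:
            # extend the mask to cover the prepended prompt positions
            pad = torch.ones(embeds.size(0), prompt.size(1),
                             dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat([pad, attention_mask], dim=1)
        return self.base_model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)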

Fill-in word selection

When designing a prompt task, it is natural to convert the task into a generation format, so there must be a mapping from what the model generates to the desired result. The design and selection of these intermediate words has a great impact on the final result, so the output label words need to be designed carefully.

Positive: great, wonderful, good.. Negative: terrible, bad, horrible…

Manual generation

Brainstorm a batch of keywords or phrases, then use an existing knowledge base to recall more related words, concepts, or phrases, and finally rank and select among the recalled candidates.

Automated generation

This is very similar to automatic template generation: the model is fixed and trained on annotated data, and during gradient backpropagation the input embedding words are updated.

delta learn

The overall idea is to make a large, expressive model controllable to learn and use by adding a small number of control parameters. As an analogy, in control theory simple linear control matrices are used to steer large, complex systems. The metaphor is not entirely accurate, because delta learning can actually be merged back into the original model; it is really a rearrangement of the knowledge the model has already learned.

In practice, this means using incremental (delta) tuning to drive models with billions of parameters while optimizing only a small subset of parameters.

The picture conveys the idea that "I am still me": with simple changes and a little learning the model can become many different versions of itself, while the pre-trained model itself does not move and only a few added parameters (the "eyes" and "decoration" in the picture) change. It illustrates the training process vividly, if not very precisely, but since the picture is widely circulated it is included here.

Addition: the method introduces additional trainable neural modules or parameters that do not exist in the original model;

Specification: the method specifies that certain parameters in the original model or process become trainable, while the other parameters are frozen;

Reparameterization: the method reparameterizes existing parameters into a parameter-efficient form through transformations.
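As a concrete illustration of the reparameterization (and addition) style, here is a minimal LoRA-like linear layer: the original weight stays frozen and only a low-rank update is trained (names and sizes are illustrative):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the original parameters stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + scale * B A x; only A and B (a tiny fraction of the
        # original parameter count) receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)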

Delta learning has three important design factors:

1. Where to insert: in series with the original network, or bridged in parallel.

2. How to insert: only at certain layers, or at every layer of the network.

3. How large the control matrices are: how many parameters are added to the control layers, whether a tiny amount or around 0.5% of the original parameter count.

Different insertion methods and parameter budgets lead to fairly large differences in model quality; you can experience this when actually fine-tuning a model. The table above is a mathematical abstraction of the different methods. When you run out of ideas during practical work, coming back to this table and thinking it over in combination with your problem can be helpful.

Practical part

This part uses the ChatGLM-6B model for the experiments. The specific code is at this link: GitHub - liangwq/Chatglm_lora_multi-gpu: chatglm multi-gpu with deepspeed and

The model does not have to be ChatGLM; LLaMA or any other model also works. Hugging Face's peft is used for delta learning, and DeepSpeed is used for multi-GPU distributed training.

Tested on 2x A100 80G and 8x A100 80G; the hardware configuration and speed are as follows:
500,000 self-instruct samples, 2 GPUs, 32-core CPU, 128 GB memory.

With a per-device batch size of 2 and gradient accumulation of 4, the effective batch size across the 2 GPUs is 16; training 2 epochs with lora_rank=8 (about 7M inserted parameters) takes about 20 hours.

With 8 GPUs, it takes about 5 hours.

During fine-tuning, the model converges very stably and the results are good.

Code explanation:

data processing logic

import torch

# `tokenizer` and `get_masks_and_position_ids` are defined elsewhere in the
# training script (the ChatGLM tokenizer and its mask/position-id helper).
def data_collator(features: list) -> dict:
    # pad every sample in the batch to the longest sequence (+1 for EOS)
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids) + 1
    input_ids = []
    attention_mask_list = []
    position_ids_list = []
    labels_list = []
    # sort by length (longest first); prompt tokens are masked with -100 so
    # only the response (plus EOS) contributes to the loss
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [-100] * (seq_len - 1)
            + ids[(seq_len - 1) :]
            + [tokenizer.eos_token_id]
            + [-100] * (longest - ids_l - 1)
        )
        ids = ids + [tokenizer.eos_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        attention_mask, position_ids = get_masks_and_position_ids(
            ids, seq_len, longest, _ids.device, gmask=False
        )
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
        attention_mask_list.append(attention_mask)
        position_ids_list.append(position_ids)
    # stack the per-sample tensors into batch tensors
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    attention_mask = torch.stack(attention_mask_list)
    position_ids = torch.stack(position_ids_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }

Inserting LoRA also allows loading LoRA weights trained on other data, which means LoRA can be trained separately on different parts of the data and, if necessary, the trained LoRA weights can be merged for joint training. This is very convenient, and definitely good news for people whose hardware configuration is limited.

# setup peft: wrap the base model with a LoRA adapter of rank lora_rank
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=finetune_args.lora_rank,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, peft_config)

# optionally resume from LoRA weights trained earlier (possibly on other data)
if finetune_args.is_resume and finetune_args.resume_path:
    print("=====>load lora pt from =====>:", finetune_args.is_resume, finetune_args.resume_path)
    model.load_state_dict(torch.load(finetune_args.resume_path), strict=False)


For the Accelerate integration part: because it does not retain checkpoints by itself, I hard-coded saving a checkpoint every 2000 steps. I have not yet had time to write the code that keeps only the latest two checkpoints, so many checkpoint folders will be generated. If you do not need this, you can comment it out, or write your own code to keep only two; I will update it later.

                # inside the training loop: every 2000 steps the main process
                # saves the current LoRA weights to a new checkpoint folder
                if i % 2000 == 0 and accelerator.is_main_process:
                    #accelerator.wait_for_everyone()
                    path = training_args.output_dir + '/checkpoint_{}'.format(i)
                    os.makedirs(path)
                    accelerator.save(lora.lora_state_dict(accelerator.unwrap_model(model)), os.path.join(path, "chatglm-lora.pt"))
                    #save_tunable_parameters(model, os.path.join(path, "chatglm-lora.pt"))
                i += 1
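Until the script keeps only the latest checkpoints itself, a small helper like the following could prune older checkpoint_* folders after each save; this is a hypothetical utility sketch, not code from the repository:

import os
import shutil

def keep_latest_checkpoints(output_dir: str, keep: int = 2):
    """Delete all checkpoint_* folders except the `keep` most recent ones."""
    ckpts = [d for d in os.listdir(output_dir) if d.startswith("checkpoint_")]
    # sort by the step number encoded in the folder name, newest last
    ckpts.sort(key=lambda d: int(d.split("_")[-1]))
    for old in ckpts[:-keep]:
        shutil.rmtree(os.path.join(output_dir, old))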


Origin blog.csdn.net/chaishen10000/article/details/131307332