Some pitfalls and judgment calls in large-model training

Until ChatGPT is fully reproduced, these are only Mr. Bao's judgments based on public information and hands-on experience. They are for reference only, and any of these conclusions may be overturned by new techniques.

1. A cold start can be a big deal.

Training a large model is the process of taking a language model learned from web-scale data and bringing it, step by step, closer to human language habits.

Pretraining is the cold start for SFT, and SFT is the cold start for RL.

Each individual stage also needs its own cold start. For example, LLaMA-2 mentions an SFT bootstrapping method, and an iterative rejection-sampling method in RL.

Pretraining gives the large model its basic knowledge reserve and the generalization of a language model. That gives SFT a good initialization, reduces the amount of data the SFT stage consumes, and lets the model begin to align with human habits.

SFT, in turn, is more of a cold start for sampling in the RL stage: it keeps the sampled outputs from straying outside the RM's range of discrimination, and keeps them as close as possible to the good/bad region the RM can actually judge.

However, building SFT data is very expensive. In China, for example, teams generally have GPT-generated data on hand, and distilling GPT through SFT is quite intuitive; if you lack the manpower and resources, it is also a workable approach.

Compared with RL training, the ceiling on SFT's generalization is relatively low: the data always runs out, and high-quality data is even harder to come by. Direct training acts more like a directional guide, and the SFT stage may itself need a cold start.

Finally, in the RL stage, unlimited data generation is handed over to sampling, and judging good from bad is handed over to the RM. Throughout this process the LLM and the RM need to evolve in sync, so the RM does not lose its judgment once the LLM becomes too strong. This is the iterative updating we see in LLaMA-2.
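To make that loop concrete, here is a minimal sketch of this kind of co-evolution, not LLaMA-2's exact procedure. Every helper below is a hypothetical stub; a real pipeline would plug in an actual decoder, reward model, preference labeling, and trainers.

```python
# Minimal sketch of an iterative LLM/RM co-evolution loop.
# All helpers are hypothetical stubs standing in for real components.
import random

def sample_responses(policy, prompt, k=4):
    # Stub: a real system would decode k responses from the current policy.
    return [f"{prompt} -> response {i} (policy v{policy})" for i in range(k)]

def rm_score(rm, prompt, response):
    # Stub: a real RM would score the (prompt, response) pair.
    return random.random() + rm  # `rm` acts as a dummy version offset

def finetune_policy(policy, sft_pairs):
    # Stub: rejection-sampling fine-tuning / RLHF step on the kept samples.
    return policy + 1

def refresh_rm(rm, preference_pairs):
    # Stub: retrain the RM on preferences collected from the *current* policy,
    # so its judgment keeps up as the policy improves.
    return rm + 1

policy, rm = 0, 0
prompts = ["prompt A", "prompt B"]

for round_idx in range(3):
    best, prefs = [], []
    for p in prompts:
        cands = sample_responses(policy, p)
        ranked = sorted(cands, key=lambda c: rm_score(rm, p, c), reverse=True)
        best.append((p, ranked[0]))               # top sample kept for fine-tuning
        prefs.append((p, ranked[0], ranked[-1]))  # chosen/rejected pair for the RM
    policy = finetune_policy(policy, best)
    rm = refresh_rm(rm, prefs)                    # RM evolves in sync with the policy
```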


2. The RM waters run deep

The purpose of the RL stage is clear, and the idea behind classic PPO is intuitive. The hard part is making RL stable when applied to an LLM.

Stability mostly comes down to concrete engineering measures, such as adding a reference model as an anchor during training so the updates are not too aggressive, which helps the model retain its overall LLM ability instead of simply fitting high scores.

But the RM is where the deep pits are, and a phenomenon called reward hacking shows up often.

The policy space of an LLM is wide open, unlike RL for games, where there are only a handful of button combinations like up, down, A, A, B, B.

Which token to pick from the vocabulary, and what sequence those tokens form, are all part of the policy.

Such open-ended decision making is very hard for a simulated scoring environment to handle, and it places extremely high demands on the RM's generalization.

Imagine a scenario: your LLM produces bad cases, so you label all the known bad cases as bad for the RM and label the normally annotated data as good.

Then you use this RM to penalize bad cases during reinforcement learning, hoping to eliminate them all. This intuitive idea hides huge pits.

You will find that what RL ultimately learns is some unknown high-scoring pattern: a new kind of bad case, different from the ones you labeled, that happens to score high.

It is like an ant wandering on a sheet of white paper surrounded by big pits, with only a small patch of safe ground. It walks randomly, and you keep marking an X over every bad direction it happens to pass.

It turns out the bad directions are endless, and you can never mark them all.

In the end, there is a good chance your model learns to output a pile of useless text that nonetheless scores very high on the RM.

This is reward hacking.

If you do not fundamentally improve the RM's near-omniscient scoring ability, then simply raising the KL-divergence penalty, adding value clipping, and so on only mitigates the problem rather than solving it.
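For reference, here is a minimal sketch of the KL-penalty mitigation, with placeholder tensors standing in for per-token log-probabilities and an assumed coefficient beta. The reward PPO actually optimizes is the RM score minus a penalty for drifting from the frozen SFT reference.

```python
# Sketch of the KL-penalty mitigation: the optimized reward is the RM score
# minus a penalty for drifting away from the reference (SFT) policy.
# All tensors are random placeholders.
import torch

beta = 0.1                          # KL penalty coefficient (assumed value)
logp_policy = torch.randn(4, 16)    # log pi(token) under the current policy
logp_ref = torch.randn(4, 16)       # log pi_ref(token) under the frozen SFT model
rm_score = torch.randn(4)           # sequence-level reward model score

per_token_log_ratio = logp_policy - logp_ref          # per-token KL estimate
shaped_reward = rm_score - beta * per_token_log_ratio.sum(dim=-1)
# `shaped_reward` is what PPO would maximize; raising beta tames reward
# hacking somewhat, but does not fix a weak RM.
```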

In the end, the burden still falls on the RM.

This shows up in the LLaMA-2 paper: the Meta team pays great attention to maintaining the RM's ability, and when the RM is found to be losing its judgment, it is updated and iterated promptly.

This keeps RL from encouraging weird outputs.

The RM's ability shows up not only in generalization but also in how sharply it discriminates, which is why we see Meta adding a margin term to the ranking loss.
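A minimal sketch of a pairwise ranking loss with a margin term in that spirit, using placeholder score tensors: the chosen response has to beat the rejected one by at least the margin, which pushes the RM toward sharper discrimination.

```python
# Pairwise ranking loss with a margin, in the spirit of the LLaMA-2 RM.
import torch
import torch.nn.functional as F

r_chosen = torch.randn(8)        # RM scores for preferred responses (placeholder)
r_rejected = torch.randn(8)      # RM scores for dispreferred responses (placeholder)
margin = torch.full((8,), 1.0)   # larger margin for more clearly separated pairs

loss = -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```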


3. The trade-off between efficiency and effectiveness

Beyond hardware and pipeline optimization, there are many such optimization points in sample construction and in the training process itself.

For example, the multi-turn dialogue construction trick mentioned earlier (a sample-construction trick for large-model fine-tuning) can greatly improve training efficiency, and we see the same idea in LLaMA-2.

LLaMA, however, does something more aggressive: it packs different sessions together and uses special tokens to separate the segments; this detail still needs to be confirmed.

My reading is that a special terminator separates different sessions, while a common terminator such as <eos> divides the turns.
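A toy sketch of what such packing could look like, with made-up token ids and a hypothetical session separator; only the assistant turns contribute to the loss.

```python
# Toy sketch of packing multi-turn data: turns within a session are joined by a
# common terminator (<eos>), sessions are joined by a special separator, and the
# loss mask keeps only the assistant turns. Token ids here are made up.
EOS, SEP = 2, 3   # hypothetical ids: <eos> ends a turn, SEP separates sessions

sessions = [
    [("user", [11, 12]), ("assistant", [13, 14])],
    [("user", [21]), ("assistant", [22, 23, 24])],
]

input_ids, loss_mask = [], []
for si, session in enumerate(sessions):
    if si > 0:
        input_ids.append(SEP)
        loss_mask.append(0)
    for role, toks in session:
        train_on = 1 if role == "assistant" else 0
        input_ids += toks + [EOS]
        loss_mask += [train_on] * (len(toks) + 1)

print(input_ids)  # one packed sequence containing both sessions
print(loss_mask)  # 1 only on assistant tokens (and their <eos>)
```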

Beyond data construction, there are also efficiency/effectiveness trade-offs in the training process itself. Methods like DPO can save the sampling time of PPO.

When aligning with DPO, the pressure on the RM and on sampling is converted into pressure on labeling data.

This approach can also improve training efficiency, but at the alignment stage it seems too hard to chase sheer data volume; most teams use relatively small but high-quality datasets and train with an RM built on top of the existing LLM.

DPO seems to go in the opposite direction: you have to spend enough money and label enough data, and whether it can reach PPO's ceiling remains to be verified.
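For reference, a minimal sketch of the DPO objective with placeholder log-probabilities and an assumed beta; it consumes labeled preference pairs directly, with no RM and no on-policy sampling, which is exactly why it leans so heavily on labeled data.

```python
# Sketch of the DPO objective against a frozen reference policy.
# The log-probs below are random placeholders for summed sequence log-likelihoods.
import torch
import torch.nn.functional as F

beta = 0.1                           # DPO temperature (assumed value)
logp_w = torch.randn(8)              # log pi_theta(chosen | prompt)
logp_l = torch.randn(8)              # log pi_theta(rejected | prompt)
ref_logp_w = torch.randn(8)          # log pi_ref(chosen | prompt)
ref_logp_l = torch.randn(8)          # log pi_ref(rejected | prompt)

logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
loss = -F.logsigmoid(logits).mean()  # needs enough labeled pairs to work well
```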

Efficiency and effectiveness are always a trade-off. In the final stage, LLaMA-2 chose to sacrifice efficiency for quality and used rejection sampling to keep the RL process from learning unexpected surprises.

This sample-many-and-select approach multiplies resource consumption roughly by the number of samples drawn.
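A minimal sketch of best-of-N selection, with hypothetical stubs for the generator and the RM; generation cost scales roughly linearly with N.

```python
# Sketch of best-of-N rejection sampling: draw N candidates per prompt, keep the
# RM's favorite for the next fine-tuning round. Generator and scorer are stubs.
import random

def generate(prompt):
    # Stub for policy sampling; a real system would decode from the LLM.
    return f"{prompt} :: candidate {random.randint(0, 999)}"

def rm_score(prompt, response):
    # Stub for the reward model.
    return random.random()

def best_of_n(prompt, n=8):
    candidates = [generate(prompt) for _ in range(n)]   # n times the sampling cost
    return max(candidates, key=lambda c: rm_score(prompt, c))

kept = [best_of_n(p) for p in ["prompt A", "prompt B"]]
```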

Overall, the later a stage sits in the pipeline, the more attention quality deserves. Of course, later stages also consume a relatively small share of total resources, so some efficiency can be sacrificed there.

Seen this way, DPO looks less sound, while rejection sampling is a comparatively reasonable solution.


4. Large-model evaluation is critical, and its waters run very deep

I have written before about large-model evaluation (evaluation is too hard, and training large models is too hard!), and some of the reasons are summarized there. The key point is that poor evaluation hurts experimental efficiency, and experimental efficiency translates directly into compute consumed per unit of time.

It follows that poor evaluation = burning money and time.

Running experiments slowly is equivalent to having fewer GPUs than everyone else, which is alarming enough on its own.

OpenAI not only has more cards, it also has a buff that multiplies its experimental efficiency, so its effective compute is cards × efficiency.

So far there is no publicly available automated evaluation method that is particularly reliable.


5. The waters of downstream fine-tuning run very deep

What everyone imagines is: I label some domain data, run SFT and alignment on it, and get extra in-domain capability.

There are two situations here. If you treat it as a single-scenario model, using it the way you would use BERT or T5 is fine.

But if you want it to keep the original large model's abilities while embedding additional knowledge, the difficulty is enormous.

In practice you will find it does not work that way at all. You basically pick up the sesame seeds and lose the watermelon, unless the sesame seeds are all you care about.

Having tried it, I found that training on domain data alone overfits very easily, and handling of OOD inputs becomes very poor.

If you want to preserve the original capabilities, the data-mixing ratio at each stage matters a great deal. Ideally you add the extra scenario data to data at the original scale and rerun part of the pipeline.
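A minimal sketch of that kind of mixing, assuming a stand-in general corpus and an assumed domain fraction; in reality the original mixture is usually unknown, which is exactly the difficulty described next.

```python
# Sketch of mixing domain data back into general data at a fixed ratio before
# re-running SFT, so the model is not trained on domain data alone.
# Both corpora and the ratio below are placeholders.
import random

general_sft = [f"general example {i}" for i in range(10_000)]  # stand-in corpus
domain_sft = [f"domain example {i}" for i in range(500)]

domain_fraction = 0.05   # assumed: keep domain data a small slice of the mix
n_general = int(len(domain_sft) * (1 - domain_fraction) / domain_fraction)

mixed = random.sample(general_sft, min(n_general, len(general_sft))) + domain_sft
random.shuffle(mixed)    # one shuffled training set, ~5% domain data
```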

The difficulty is that the original data follows a particular sampling distribution, while the model you received is a black box: no one gives you the original data distribution, let alone the carefully cleaned massive corpus behind it.

So the domain model you end up with is, in most cases, just a domain-specific generation model that has lost its general base capabilities.

And if you want to deepen its ability in one direction while keeping the original abilities from degrading, the overall cost is no less than building a general model from scratch.


Original post: blog.csdn.net/weixin_48827824/article/details/132165368