OpenAI's latest research, Let's Verify Step by Step: the process matters more than the result!


Deep Learning Natural Language Processing Original
Author: Winni


OpenAI's latest research, "Let's Verify Step by Step", was released yesterday and has attracted widespread attention. The idea is simple enough to sum up in one sentence:

For complex step-by-step reasoning problems, give a reward at each step rather than a single reward at the end based on the outcome. This denser reward signal achieves better results.

When we were young, our teachers told us to write down the problem-solving process in our homework, and that points would be deducted if we skipped the steps. It seems that idea really does make sense!


Previous studies have found that large language models (LLMs) can solve multi-step reasoning tasks via Chain of Thought (CoT). However, even state-of-the-art models often generate misinformation and make up false facts.

An effective solution is to train a reward model to distinguish good outputs from bad ones, and then optimize the language model against it with reinforcement learning. But the performance of the final model depends heavily on the quality of the reward model itself, so we need to investigate how to train reliable reward models efficiently.

To this end, OpenAI proposed process supervision: they trained a new reward model and achieved a new breakthrough in mathematical problem solving. Unlike outcome supervision, which only rewards a correct final result, process supervision rewards each reasoning step, and this leads to a significant performance improvement.

Process supervision not only improves performance but also has important implications for model alignment. In addition, this research mitigates the hallucination problem in GPT models, that is, the tendency to generate false information under uncertainty.

It should be noted that process supervision requires more manual annotation. OpenAI made their human feedback dataset publicly available: it contains 75,000 solutions to 12,000 MATH problems, with a total of 800,000 step-level labels.

If you want to learn more about this research by OpenAI, we have prepared a first-hand interpretation of the paper for you. Let's take a look at some details of the process supervision method!

Blog: https://openai.com/research/improving-mathematical-reasoning-with-process-supervision
Paper: https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf


Experimental Method

Experimental Setup

The authors conduct a comparative study of outcome supervision and process supervision, running experiments at two scales. Outcome supervision can check the final answer of a math problem automatically, without human involvement, while process supervision requires human annotators to mark the correctness of each step.

In the large-scale experiments, the authors fine-tune models from GPT-4, focusing on training the most reliable outcome reward model (ORM) and process reward model (PRM). However, the training sets for these reward models are not directly comparable. Therefore, the authors also train a series of small-scale models for their experiments, using the large models to supervise the training of the small models and thereby reduce the cost of human labeling.

At each model scale, a single fixed model is used as the generator to produce all solutions. Instead of trying to improve the generator with reinforcement learning (RL), the authors focus on training the most reliable reward model. The reliability of a reward model is evaluated by performing a best-of-N search over solutions produced by the generator, with automatic grading based on the final answer: a more reliable reward model will select a correct solution more often.
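As a minimal sketch of this best-of-N evaluation protocol, assuming hypothetical `generate`, `score`, and `extract_answer` helpers (none of these names come from the paper's code):

```python
# Minimal best-of-N evaluation sketch (illustrative; function names are hypothetical).
# A reward model is "reliable" if the solution it ranks highest is, more often than
# not, one whose final answer is correct.
from typing import Callable, List

def best_of_n_accuracy(
    problems: List[dict],                       # each: {"question": str, "answer": str}
    generate: Callable[[str, int], List[str]],  # generator: (question, n) -> n candidate solutions
    score: Callable[[str, str], float],         # reward model: (question, solution) -> scalar score
    extract_answer: Callable[[str], str],       # pulls the final answer out of a solution string
    n: int = 100,
) -> float:
    correct = 0
    for p in problems:
        candidates = generate(p["question"], n)
        # The reward model picks its single favorite solution out of the n candidates.
        best = max(candidates, key=lambda sol: score(p["question"], sol))
        # Automatic grading: compare the chosen solution's final answer to ground truth.
        if extract_answer(best) == p["answer"]:
            correct += 1
    return correct / len(problems)
```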

All large-scale models are fine-tuned from GPT-4. The base model was pre-trained only to predict the next token; it was not trained with any RL from human feedback (RLHF). The small-scale base models are similar in design to GPT-4 but use roughly 200 times less compute. As an additional pre-training step, the authors fine-tune all models on MathMix, a dataset of about 1.5 billion math-related tokens.

Generator

To make it easier to parse individual steps, the authors first train the generator to produce solutions in a newline-delimited, step-by-step format. Specifically, they generate a small number of solutions to MATH training problems, filter for solutions that reach the correct final answer, and fine-tune the GPT-4 model on this dataset. This step is not intended to teach the generator new skills, only to teach it to produce solutions in the desired format.
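As a rough illustration of this filter-and-format step (all function and field names here are hypothetical, not from the paper's code), the formatting fine-tune set could be assembled like this:

```python
# Hypothetical sketch: keep only generator solutions whose final answer matches the
# reference, so the fine-tuned model learns the newline-delimited step format rather
# than new math skills.
def build_format_finetune_set(train_problems, generate, extract_answer, k=4):
    examples = []
    for p in train_problems:
        for sol in generate(p["question"], k):          # a few samples per problem
            if extract_answer(sol) == p["answer"]:      # filter for correct final answers
                steps = [s for s in sol.split("\n") if s.strip()]
                examples.append({"prompt": p["question"],
                                 "completion": "\n".join(steps)})  # one step per line
    return examples
```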

Data Collection

To collect process supervision data, the authors show step-by-step solutions to MATH problems, sampled from the large-scale generator, to human data annotators. The annotator's task is to assign a positive, negative, or neutral label to each step in the solution, as shown in the figure below. A positive label indicates that the step is correct and reasonable, a negative label indicates that the step is incorrect or unreasonable, and a neutral label indicates ambiguity.

[Figure: the step-level annotation interface, where each step of a sampled solution receives a positive, negative, or neutral label]
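For concreteness, a record in such a step-labeled dataset might look roughly like the following (the field names and example problem are our own illustration, not the actual PRM800K schema):

```python
# Illustrative record shape for one step-labeled solution. Each step gets
# +1 (correct and reasonable), -1 (incorrect or unreasonable), or 0 (ambiguous/neutral).
labeled_solution = {
    "problem": "Simplify (2x + 3) + (4x - 1).",
    "steps": [
        {"text": "Group the x terms: 2x + 4x = 6x.",        "label": +1},
        {"text": "Group the constants: 3 - 1 = 2.",         "label": +1},
        {"text": "So the expression simplifies to 6x + 2.", "label": +1},
    ],
}
```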

The authors label only solutions from the large-scale generator, and call the collected step-level dataset PRM800K. The PRM800K training set contains 800,000 step-level labels across 75,000 solutions to 12,000 problems. To reduce overfitting, the authors include data from 4,500 MATH test problems in the PRM800K training set and evaluate the models only on the remaining 500 MATH test problems.

During data collection, candidate solutions must be shown to the annotators. The most straightforward strategy is to show solutions sampled uniformly from the generator, but the resulting human feedback is less valuable when an obviously wrong solution is shown. The research team therefore surfaced solutions selectively, preferring convincing wrong-answer solutions that are more likely to fool the current best reward model.

In addition, the authors iteratively retrain the process reward model (PRM) on the latest data at several points during data collection. In each iteration, N solutions are generated per question, and only the K most convincing wrong-answer solutions are shown to the data annotators. This top-K filtering is applied either at the problem level (K solutions per problem) or at the global level (K solutions in total, distributed unevenly across problems).
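One possible reading of this top-K surfacing heuristic, expressed as a small sketch (an interpretation, not the authors' code; field names are our assumptions):

```python
# Among wrong-answer solutions, surface the K that the current PRM scores highest,
# either per problem or as a single global pool.
def select_for_annotation(scored, k, level="problem"):
    # scored: list of {"problem_id": ..., "solution": ..., "prm_score": float, "is_correct": bool}
    wrong = [s for s in scored if not s["is_correct"]]  # convincing wrong answers only
    if level == "problem":
        by_problem = {}
        for s in wrong:
            by_problem.setdefault(s["problem_id"], []).append(s)
        return [s for sols in by_problem.values()
                  for s in sorted(sols, key=lambda x: -x["prm_score"])[:k]]
    # global: top-K over all problems, however unevenly they are distributed
    return sorted(wrong, key=lambda x: -x["prm_score"])[:k]
```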

Outcome Reward Model (ORM)

The authors uniformly sample a fixed number of solutions per problem from the generator and train the ORM to predict whether each solution is correct. In practice, correctness is determined by automatically checking the final answer. At test time, the ORM's prediction at the final token is used as the overall score for the solution. Note that the automatic grading used to determine ORM targets is not entirely reliable: false positives, solutions that reach the correct answer through flawed reasoning, will be misgraded.
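One way such last-token ORM scoring could look with a Hugging Face-style causal LM interface (an assumed setup for illustration, not the paper's implementation; the "correct" token is a simplifying assumption):

```python
# Hedged sketch of ORM scoring: the model reads "question + solution", and the
# probability it assigns to a designated "correct" token at the final position
# is used as the solution's score.
import torch

def orm_score(model, tokenizer, question: str, solution: str,
              correct_token: str = " correct") -> float:
    text = question + "\n" + solution
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits                 # [1, seq_len, vocab]
    last = logits[0, -1]                             # prediction at the final token
    # Simplification: take the first sub-token of the marker string.
    correct_id = tokenizer(correct_token, add_special_tokens=False)["input_ids"][0]
    probs = torch.softmax(last, dim=-1)
    return probs[correct_id].item()                  # P("correct") as the overall score
```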

Process Reward Model (PRM)

The authors train a process reward model (PRM) to predict the correctness of each step, with the prediction made at the last token of that step. The prediction takes the form of a single token, and the log-likelihood of these target tokens is maximized during training. The PRM can therefore be trained with a standard language-model pipeline, without any special accommodations. At test time, step-level predictions for an entire solution can be obtained with a single PRM forward pass. To compare multiple solutions, a single score per solution is needed. This step is important but straightforward: the authors define a solution's PRM score as the probability that every step is correct under the PRM, computed as the product of the per-step correctness probabilities.
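The solution-level score itself is easy to illustrate; a minimal sketch, assuming the PRM has already produced one correctness probability per step:

```python
# Turn per-step PRM probabilities into a single solution score: the product of
# the probabilities that each step is correct.
import math
from typing import List

def solution_prm_score(step_correct_probs: List[float]) -> float:
    """step_correct_probs[i] = PRM's P(step i is correct), one entry per step."""
    # Probability that *every* step is correct, treating steps independently.
    return math.prod(step_correct_probs)

# Example: a three-step solution where the PRM is confident in each step.
print(solution_prm_score([0.98, 0.95, 0.99]))   # ≈ 0.922
```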

The figure below shows large-scale PRM scores for two solutions to the same problem: the solution on the left is correct, the one on the right is wrong. A green background indicates a high PRM score and a red background a low one. The PRM correctly identifies the error in the wrong solution.

[Figure: step-level PRM scores for a correct solution (left) and an incorrect solution (right); green marks high-scoring steps, red marks low-scoring ones]

When providing process supervision, the authors deliberately supervise only up to the first incorrect step. This makes the comparison between outcome supervision and process supervision simpler and clearer. For correct solutions, both methods provide the same information: every step is correct. For incorrect solutions, both methods reveal that at least one error exists, and process supervision additionally reveals the exact location of the first error. If additional process supervision were provided past the first error, process supervision would have an even greater information advantage. This decision also keeps labeling costs similar for humans: without relying on an easy-to-check final answer, determining the correctness of a solution is equivalent to identifying its first mistake.
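A small sketch of this labeling convention (labels encoded here as +1/0/-1, which is our own encoding, not the released format):

```python
# Keep positive/neutral labels until the first negative step, then stop, so a
# wrong solution contributes exactly one located error.
def truncate_at_first_error(step_labels):
    """step_labels: list of ints in {+1, 0, -1}, one per step."""
    kept = []
    for label in step_labels:
        kept.append(label)
        if label == -1:       # first incorrect step found: supervision stops here
            break
    return kept

print(truncate_at_first_error([+1, +1, -1, +1, -1]))  # [1, 1, -1]
```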

Large-Scale Supervised Learning

The research team trains a large-scale PRM using the step-level labels in PRM800K. To ensure that the large-scale ORM baseline is as strong as possible, they train it on 100 uniform samples per problem from the generator. This means the ORM training set has no overlap with PRM800K and is an order of magnitude larger. Note that training the ORM only on the PRM800K solutions could be problematic, since the active learning strategy heavily biases that dataset towards wrong-answer solutions. The authors did try training the ORM on a superset of the PRM800K solutions by mixing in uniformly sampled solutions, but found no improvement in ORM performance.

The figure below shows how the best-of-N performance of each reward model varies with N. Since majority voting is known to be a strong baseline, it is also included as a point of comparison. While the ORM performs slightly better than the majority voting baseline, the PRM significantly outperforms both. Not only is the PRM's performance higher at all values of N, but the gap widens as N increases. This shows that the PRM is more effective than both the ORM and majority voting at searching over a large number of model-generated solutions.

[Figure: best-of-N test accuracy of the PRM, ORM, and majority voting as N grows]

The research team also tried RM-weighted voting to combine the advantages of the PRM and majority voting, but it did not significantly improve performance.
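For reference, the two voting baselines mentioned above can be sketched as follows (our own illustrative implementation, not released code):

```python
# Plain majority voting over final answers, and RM-weighted voting, where each
# solution's vote is weighted by its reward-model score.
from collections import defaultdict
from typing import List

def majority_vote(answers: List[str]) -> str:
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def rm_weighted_vote(answers: List[str], scores: List[float]) -> str:
    weighted = defaultdict(float)
    for a, s in zip(answers, scores):
        weighted[a] += s          # sum of reward-model scores per distinct answer
    return max(weighted, key=weighted.get)

print(majority_vote(["42", "42", "7"]))                      # "42"
print(rm_weighted_vote(["42", "42", "7"], [0.2, 0.3, 0.9]))  # "7" wins on total weight
```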

Small-Scale Synthetic Supervised Learning

To compare outcome and process supervision more fairly, two confounding factors need to be isolated. First, the training sets of the ORM and PRM are not directly comparable: the PRM training set was constructed with active learning, is biased towards wrong-answer solutions, and is an order of magnitude smaller. Second, final-answer grading assigns positive labels to spurious solutions that reach the correct final answer despite flawed reasoning. This may hurt ORM performance, an effect that should not necessarily be attributed to outcome supervision in general. Because collecting human feedback is expensive, these factors cannot easily be removed with human annotators. Instead, the authors use the large-scale PRM to supervise smaller models in the relevant ablation experiments; this setup simulates large-scale data collection at low cost.

Process Supervision vs. Outcome Supervision

The authors conduct experiments that directly compare process and outcome supervision. They first sample between 1 and 200 solutions per problem from a small-scale generator. For each resulting dataset, they provide three forms of supervision: process supervision from the large-scale PRM (hereafter PRMlarge), outcome supervision from PRMlarge, and outcome supervision from final-answer checking. The three series of reward models are trained on identical datasets; only the form of supervision differs.
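One way to picture how the three supervision signals could be derived for the same solutions (our interpretation of the setup, not released code; the 0.5 threshold and all names are assumptions):

```python
# Derive the three labels for one solution from PRMlarge step scores and a
# final-answer check.
def make_labels(prm_large_step_probs, final_answer_ok, threshold=0.5):
    # 1) Process supervision from PRMlarge: a per-step correct/incorrect label.
    process_labels = [p >= threshold for p in prm_large_step_probs]
    # 2) Outcome supervision from PRMlarge: the solution counts as correct only
    #    if PRMlarge considers every step correct.
    outcome_from_prm = all(process_labels)
    # 3) Outcome supervision from final-answer checking: correct iff the final
    #    answer matches, even if the reasoning happens to be flawed.
    outcome_from_answer = final_answer_ok
    return process_labels, outcome_from_prm, outcome_from_answer
```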

In panel (a) of the figure below, the authors evaluate each reward model by its best-of-500 selection. The results show that process supervision significantly outperforms both forms of outcome supervision at every data-collection scale. In panel (b), they evaluate the best-of-N performance of the best reward model in each series across different values of N.

[Figure: (a) best-of-500 performance of each form of supervision across data-collection scales; (b) best-of-N performance of the best reward model from each series]

The results also show that outcome supervision from PRMlarge is noticeably more effective than final-answer checking. This can be explained by PRMlarge providing better supervision for solutions that reach the correct final answer through incorrect reasoning. It is not clear whether PRMlarge or final-answer checking is the more appropriate baseline for outcome supervision. While final-answer supervision is more explicitly outcome-based, its main weakness, the existence of false positives, is arguably overemphasized in the MATH dataset. Outcome supervision from PRMlarge better represents outcome supervision in domains that are less prone to false positives.

Active Learning

Finally, the authors study the impact of active learning. They first train a small-scale reward model, PRMselector, on a single sample from each problem, and use this model to score 1,000 samples per problem. To train each larger reward model, they then select N samples per problem, of which 80% are the most convincing wrong-answer samples and 20% are the most convincing samples that remain (right- or wrong-answer). The selected samples are scored with PRMlarge, and training is done on those scores. This procedure ensures that all selected samples are relatively convincing under PRMselector, that most are known to contain at least one error, and that the overall dataset is not overly biased towards wrong-answer solutions.

The performance of this data-labeling scheme is shown in Figure 4a. By comparing the slopes of the best-fit lines with and without active learning, the authors estimate that this form of active learning is approximately 2.6 times more data-efficient than uniform data labeling. Note that the model trained on the largest active-learning dataset (200 samples per problem) appears to fall slightly below the trend line. The best explanation is that 200 samples represent a sizable fraction of the overall selection pool (1,000 samples), and this relative lack of diversity limits the possible upside of active learning.
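The selection rule itself can be sketched as follows (illustrative only; field names are assumptions, not the authors' code):

```python
# Per problem: 80% of the N surfaced samples are the most convincing wrong-answer
# solutions under PRMselector, 20% are the most convincing remaining solutions.
def active_select(samples, n):
    # samples: list of {"selector_score": float, "is_correct": bool}, e.g. 1000 per problem
    ranked = sorted(samples, key=lambda s: -s["selector_score"])
    n_wrong = int(0.8 * n)
    wrong = [s for s in ranked if not s["is_correct"]][:n_wrong]
    picked = set(map(id, wrong))
    # Fill the remaining slots with the most convincing samples not yet chosen,
    # regardless of whether their final answer is right or wrong.
    rest = [s for s in ranked if id(s) not in picked][: n - len(wrong)]
    return wrong + rest
```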

Generalization

To measure out-of-distribution generalization, the authors evaluate the large-scale ORM and PRM on a held-out set of 224 STEM questions taken from recent AP Physics, AP Calculus, AP Chemistry, AMC10, and AMC12 exams. These tests were published after the pre-training dataset was compiled, so the models are very unlikely to have seen the problems. The results show that the PRM outperforms both the ORM and majority voting, demonstrating that the PRM generalizes well to new test questions.

Discussion

Credit Assignment

A clear advantage of process supervision is that it provides more precise feedback than outcome supervision. A reward model trained with outcome supervision faces a difficult credit-assignment task: to generalize well, it must work out exactly what is wrong in an incorrect solution. This is especially hard for difficult problems, where most model-generated solutions contain some error, so the marginal value of a single negative outcome label is low. Process supervision, in contrast, provides a richer signal: it specifies both how many of the initial steps were correct and the precise location of the incorrect step.

Alignment Impact

Process supervision has several alignment-related advantages over outcome supervision. It is more likely to produce interpretable reasoning, because it encourages the model to follow a human-approved process, and it is inherently safer: it directly rewards an aligned chain of thought rather than relying on outcomes as a proxy for aligned behavior. Outcome supervision, by contrast, is harder to scrutinize and conveys less precise preferences. In the worst case, using the outcome as an imperfect proxy can lead the model to become misaligned once it learns to exploit the reward signal. In some cases, making AI systems safer reduces their performance, a cost known as the alignment tax. Any alignment tax tends to hinder the adoption of alignment methods, since there is always pressure to deploy the most capable model. The results here suggest that process supervision actually incurs a negative alignment tax, which could encourage its widespread adoption and would have positive side effects for alignment.

Test Set Contamination

The MATH test set contains problems that have been discussed in several online venues, and it is likely that some of them appear in the models' pre-training data. The authors used string-matching heuristics to remove all MATH problems from the MathMix dataset, but since people can post hard-to-detect rewordings of problems online, it is hard to give any strong guarantee about the overlap between MathMix and the MATH test set. When examining model-generated solutions, the authors found no clear signs of memorized MATH problems. However, subtle forms of memorization that escape manual inspection cannot be ruled out, and some level of contamination may still slightly inflate performance on the MATH test set. Even so, any contamination would be expected to affect all methods similarly, so the relative comparisons made throughout the work should be largely unaffected. In addition, the PRM regularly surfaces correct solutions to MATH problems that have a very low solve rate under the generator; this low solve rate is further evidence that the generator has not encountered those problems through test set contamination.

Summary

This work shows that, in the domain of mathematical reasoning, process supervision can be used to train more reliable reward models than outcome supervision. At the same time, active learning surfaces only the most valuable model completions for human feedback, reducing the cost of data collection. The research team released the full dataset, PRM800K, containing the human feedback data used to train their state-of-the-art reward model. They hope that removing this barrier will accelerate alignment research on large language models in related fields. They also believe the current study of process supervision is only a beginning and look forward to future work that more fully explores the generalizability of these methods.

