Can't large models improve themselves? ETH Zurich and Meta AI propose small helper models that significantly boost large-model performance

Some time ago, several researchers published articles arguing that large models cannot improve their own outputs: after attempted self-refinement, answer quality often declines noticeably.

The reason self-refinement fails is that the LLM cannot reliably judge whether its original answer is wrong and therefore whether it needs refinement at all.

Recently, ETH Zurich and Meta AI proposed a refinement strategy for large-model reasoning called ART: Ask, Refine, and Trust. The method decides whether the LLM needs to refine its original output by asking the necessary sub-questions, and it determines the final answer by comparing the preliminary output with the refined one. On the two multi-step reasoning tasks GSM8K and StrategyQA, ART outperforms previous model self-refinement methods by roughly 5 percentage points.


Paper title:
The ART of LLM Refinement: Ask, Refine, and Trust

Paper link:
https://arxiv.org/pdf/2311.07961.pdf

Method

Quick overview

The overall framework is shown in the figure below. The authors use task-related corpora to train two small models, an Asker and a Truster. The Asker poses sub-questions about the original question and the model's output and checks whether the output has actually answered each sub-question. If it has not, the sample flows to the next step, where the original output is refined. In the fourth step, the Truster decides which of the original output and the refined output is better and determines the final result.
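As a rough illustration of this flow, here is a minimal sketch, assuming generic `llm_generate`, `asker`, `refiner`, and `truster` callables; the names and interfaces are illustrative, not the authors' released code.

```python
def art_pipeline(question, llm_generate, asker, refiner, truster):
    """Minimal sketch of the ART loop: Ask, Refine, and Trust."""
    # Step 1: the base LLM produces an initial prediction.
    initial = llm_generate(question)

    # Step 2 (Ask): the small Asker model poses sub-questions and decides
    # whether the initial prediction has actually answered them.
    subquestions, needs_refinement = asker(question, initial)
    if not needs_refinement:
        return initial

    # Step 3 (Refine): the base LLM revises its answer, conditioned on the
    # question, the initial prediction, and the Asker's sub-questions.
    refined = refiner(question, initial, subquestions)

    # Step 4 (Trust): the small Truster model scores both candidates and
    # the higher-scoring one becomes the final answer.
    candidates = [initial, refined]
    scores = [truster(question, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```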

1. Generate the initial prediction

First, the LLM generates an initial prediction for the problem. Two prompting methods, chain-of-thought (CoT) and sub-question decomposition, are used to improve the correctness of this initial prediction.
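For instance, the two prompting styles could look roughly like this (hypothetical prompt templates, not the paper's exact wording):

```python
COT_PROMPT = (
    "Question: {question}\n"
    "Let's think step by step, then state the final answer."
)

DECOMP_PROMPT = (
    "Question: {question}\n"
    "Break the problem into sub-questions, answer each one, "
    "then state the final answer."
)

def initial_prediction(question, llm_generate, use_decomposition=False):
    # Pick chain-of-thought or sub-question decomposition prompting.
    template = DECOMP_PROMPT if use_decomposition else COT_PROMPT
    return llm_generate(template.format(question=question))
```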

2. Ask

If every sample is refined, the model is easily misled into turning correct results into incorrect ones, which ultimately lowers overall performance. The authors therefore use task-specific knowledge and the expected results to train a small model, the Asker, which judges whether a prediction is correct; only the samples the Asker flags are refined.

How is the training set for the Asker constructed? First, the LLM generates k predictions for each sample in the training set. The dataset's sub-questions are then attached so the model can check whether the original problem has actually been solved, and each prediction is labeled as needing refinement or not according to whether it is correct. An example is shown below.

▲Top: an example labeled as needing refinement; bottom: one that does not

In this way, the Asker learns to first ask the relevant sub-questions, map them against the prediction, and then decide whether the initial prediction answers all of them, which yields the decision of whether refinement is needed, as sketched below.
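A sketch of how such training examples could be assembled is given here. It assumes each training item carries the gold answer and the dataset's sub-questions, and that `extract_answer` is an assumed helper that pulls the final answer out of a generation; the exact data format the authors use may differ.

```python
def build_asker_dataset(train_set, llm_generate, extract_answer, k=4):
    """Sketch: label each sampled prediction as needing refinement or not."""
    examples = []
    for item in train_set:
        question = item["question"]
        subquestions = item["subquestions"]  # provided by the dataset
        gold_answer = item["answer"]
        for _ in range(k):
            prediction = llm_generate(question)
            # A prediction that misses the gold answer is labeled as
            # needing refinement; a correct one is not.
            needs_refine = extract_answer(prediction) != gold_answer
            examples.append({
                "input": f"{question}\n{prediction}",
                "target": " ".join(subquestions)
                          + (" [REFINE]" if needs_refine else " [NO REFINE]"),
            })
    return examples
```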

3. Refine

If the Asker's decision is "yes" (refinement needed), the LLM refines the original output based on the original input and the sub-questions generated by the Asker, as illustrated in the figure below.
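A hypothetical refinement prompt for this step might look like the following (a sketch under the assumption that the sub-questions are simply listed in the prompt; not the authors' exact template):

```python
REFINE_PROMPT = (
    "Question: {question}\n"
    "Previous answer: {initial}\n"
    "The following sub-questions were not clearly answered:\n"
    "{subquestions}\n"
    "Revise the solution so that every sub-question is addressed, "
    "then state the final answer."
)

def refine(question, initial, subquestions, llm_generate):
    # Condition the LLM on the original input, its first attempt,
    # and the Asker's sub-questions.
    prompt = REFINE_PROMPT.format(
        question=question,
        initial=initial,
        subquestions="\n".join(subquestions),
    )
    return llm_generate(prompt)
```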

4. Trust

At this point there are two predictions: the initial output and the refined output. To decide which one becomes the final answer, the authors train a Truster model.

About 80% of the refined answers are identical to the initial predictions. So that the Truster learns to identify the reasoning chain that leads to the correct final answer, rather than an intermediate reasoning chain of a particular style, the authors reuse the Asker's training data: for each input question x, samples with both a correct and an incorrect prediction are selected to form comparison pairs. The loss function is as follows.
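The equation itself is not reproduced in this write-up; a standard pairwise ranking objective consistent with the description above (stated as an assumption rather than the paper's exact formula), where y⁺ is a correct prediction, y⁻ an incorrect one, and r the Truster's score, would be:

$$\mathcal{L} = -\log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)$$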

Here r is the score assigned by the Truster model. At inference time, the prediction with the highest score is selected as the output.

Experiments

Datasets

Two multi-step reasoning tasks are used. GSM8K is a dataset of grade-school math word problems, with 7,473 training samples and 1,319 test samples; each problem takes 2 to 8 steps to solve. The dataset also provides sub-questions corresponding to the steps of the reference solution.

StrategyQA is a question-answering benchmark of open-domain questions that require implicit reasoning steps to solve. It contains 2,290 training examples; the authors use the first 20% as the test set and the remaining 80% for training.

Experimental setup

  1. First, LLaMA variants (7B, 13B, and 70B) are fine-tuned on the GSM8K and StrategyQA datasets respectively.

  2. The collected data is then used to train Asker models on top of the fine-tuned LLaMA variants, so that they learn to ask relevant sub-questions and decide when refinement is needed.

  3. Finally, a LLaMA 13B model is fine-tuned as the Truster model, which selects the final result between the original output and the refined output.

The training data size used in each stage is shown in the table below:

▲Comparison of the size of training data in each stage

Experimental results and analysis

The authors use LLaMA 70B (both the pretrained and chat versions), ChatGPT (Turbo and Instruct), and GPT-4 as base models for comparison. The experimental results are shown below.

In the tables, Initial Prediction refers to the initial results generated by the LLM, and Method is the reasoning strategy: chain-of-thought (CoT) or sub-question decomposition (Decomp). Refinement covers the Ask and Refine stages of ART, and Subquestions indicates whether sub-questions are used during refinement. Trust refers to the Trust stage of ART, where Self means self-refinement, Truster is the model trained in this paper, and Most Recent means always taking the refined output as the final result.

Yellow marks results from prior work, blue marks the authors' implementation of baseline methods, and green marks the method proposed in this paper.

Accuracy of different methods and refinement strategies on the GSM8K dataset:

Accuracy comparison of different models on StrategyQA:

1. LLMs lack self-refinement ability

Overall, on the GSM8K dataset, the performance of LLaMA 70B is much lower than that of the ChatGPT Turbo model.

Furthermore, with sub-question decomposition (Decomp), ChatGPT performs better than with CoT, while the opposite holds for LLaMA 70B. Since ChatGPT's training data and architecture are not public, the reasons for this gap are hard to pin down.

Self-refinement (Self in the tables) improves performance in some cases but degrades it in others. Combining Refinement with the Trust module, as this paper does, steadily improves on the initial predictions in almost all cases, which demonstrates the usefulness of the different components of the ART approach.

2. The importance of Ask

GSM8K:

  1. With ChatGPT as the base model, the Asker trained from LLaMA 7B (Asker7B) improves results by more than 2 points, and Asker13B by more than 4 points (78.62 → 82.18), compared with the self-refinement strategy (Self).

  2. The trend is similar with LLaMA 70B as the base model: adding the Asker module improves task accuracy and outperforms LLaMA 70B's own self-refinement (Self).

  3. For the GPT-4 model, the results also follow a similar trend, with the 7B (Asker7B) and 13B (Asker13B) models improving the initially generated results by approximately 2 points (91.88 → 93.72).

StrategyQA:

  1. Following a similar trend on StrategyQA, Asker7B improved the LLaMA 70B score by 1 point and the ChatGPT result by more than 3 points (70.52 → 73.84).

  2. The gains are even larger for the Asker13B model, with improvements of 3 points for LLaMA 70B and 5 points for ChatGPT, clearly demonstrating the importance of the Asker module for deciding when to refine.

3. Don't always trust the refined result

Blindly accepting the refined result also hurts performance in some cases. This is where the Truster module comes in: it ranks the initial prediction against the refined output and decides which one to keep as the final result, acting as a second safeguard.

Indeed, with the Truster module added, performance on GSM8K improves by roughly 4-7 percentage points whether LLaMA 70B or ChatGPT is the base model. For GPT-4 the gain is smaller, likely because its initial performance is already very high at 93.10; Truster still lifts it to 94.08.

For StrategyQA, the Trust module does not help much, possibly because it is hard to judge the original output against the refined one without access to the underlying facts.

4. Cost of fine-tuning LLMs vs. the cost of ART

Since GSM8K training samples are available, LLaMA 70B can also be fine-tuned directly, reaching 63.2% accuracy on GSM8K. This is close to the result obtained with ART, but ART's training cost and compute requirements are much lower.

As shown in the table below, the Truster is trained from a 13B model and the Asker from a 7B model, which takes far less time than directly fine-tuning the 70B model. In addition, direct fine-tuning usually overfits the model to the training set and reduces its in-context generalization, a problem the ART framework avoids.

▲Comparison of the costs required to train LLaMA models of different sizes on GSM8K

Conclusions and limitations

This paper proposes a refinement strategy called ART: Ask, Refine, and Trust. An Asker trained from a smaller model decides whether to refine, and a Truster decides whether to adopt the refined answer. The results show that carefully trained small models can outperform the self-refinement capability of large models.

However, the work still has some limitations:

  1. The Asker is trained on data from the GSM8K and StrategyQA training sets. For many tasks such training data may not be available; although LLMs can generate synthetic data whose performance is often close to that of real data, the paper does not test whether generated training data would work here.

  2. Additionally, for StrategyQA the authors use the facts provided by the dataset to support the model's decisions during refinement. In the real world these facts might have to be retrieved with tools or from external databases, and the authors have not tested whether that setting is feasible within the ART framework.

  3. Although the ART framework is effective, training the Asker and Truster stage by stage is cumbersome. The authors also tested completing the entire process in a single pass and found its performance to be lower than that of the staged ART framework.

This shows that generating the whole pipeline in one pass is still challenging for LLMs; effective end-to-end approaches are something to look forward to in future work~


Source: blog.csdn.net/xixiaoyaoww/article/details/134663019