Zero code to train GPT-5? MIT and Microsoft confirm that GPT-4 can self-correct its code through iteration




We all know that large models have the ability to introspect and can self-correct the code they write.

How does the mechanism behind this self-healing work?

To what extent does the model provide accurate feedback on why the code is wrong?

Recently, scholars from MIT and Microsoft found that, of GPT-4 and GPT-3.5, only GPT-4 shows effective self-repair. Moreover, GPT-4 can even provide feedback on programs generated by GPT-3.5.


Paper address: https://arxiv.org/pdf/2306.09896.pdf

NVIDIA scientist Jim Fan strongly recommends this research.

In his view, even the most professional human programmers cannot write programs correctly the first time. They need to look at the execution results, reason out the problem, give fixes, and try again and again. This is an agent loop: iteratively improving the code based on environmental feedback.

In all likelihood, OpenAI is training the next generation of GPT by hiring lots of software engineers. And they don't even need to write code - critique is all you need.


- The core reason GPT-4 can self-repair is its strong feedback ability: its capacity to reflect effectively on what is wrong with its code is something other models cannot match.


- The feedback model and the code generation model do not have to be the same. In fact, the feedback model is the bottleneck.


- Based on the feedback from GPT-4, GPT-3.5 is able to write better code.


- Based on feedback from professionals, GPT-4 itself is capable of writing better code.

Demystifying GPT self-repair for code generation


We all know that large language models have shown extraordinary capabilities in generating code.

However, they do not perform well on challenging programming tasks such as programming competitions and software engineering interviews.

Fortunately, many models can "introspect" through a self-repair workflow and correct the errors in their code.

Researchers are eager to know to what extent these models provide correct feedback and explain why the code they generate is wrong.

The figure shows the classic workflow of the self-repair approach.

First, given a specification, a program is sampled from the code generation model and then executed on a set of unit tests provided in the specification.


If the program fails any of the unit tests, the error message and the program are fed to a feedback generation model, which in turn outputs a short explanation of why the code failed.

Finally, the feedback is passed to a repair model that generates a fixed version of the program.

On the surface, this workflow seems ideal. It lets the system overcome errors caused by bad samples during decoding, easily incorporates feedback from symbolic systems (compilers, static analysis tools, execution engines, and so on) during the repair phase, and mimics the trial-and-error way human software engineers write code.
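To make the workflow concrete, here is a minimal Python sketch of the loop under some simplifying assumptions: `sample_llm` and `run_unit_tests` are hypothetical helpers standing in for the model-sampling and test-execution machinery, and the prompts are illustrative rather than the ones used in the paper.

```python
# Minimal sketch of the self-repair loop, not the paper's implementation.
# `sample_llm(prompt, n)` is assumed to return a list of n model completions;
# `run_unit_tests(program, tests)` is assumed to return (passed, error_message).
def self_repair(spec, tests, sample_llm, run_unit_tests, n_programs=2):
    # Stage 1: sample candidate programs from the code generation model
    programs = sample_llm(f"Write a Python program for:\n{spec}", n=n_programs)

    for program in programs:
        # Stage 2: execute the program on the unit tests from the specification
        passed, error = run_unit_tests(program, tests)
        if passed:
            return program  # a satisfactory program has been found

        # Stage 3: ask the feedback model to explain why the code failed
        feedback = sample_llm(
            f"Spec:\n{spec}\nProgram:\n{program}\nError:\n{error}\n"
            "Explain briefly why this program is wrong.", n=1)[0]

        # Stage 4: sample a repaired program conditioned on the feedback
        repair = sample_llm(
            f"Spec:\n{spec}\nBuggy program:\n{program}\nFeedback:\n{feedback}\n"
            "Write a corrected Python program.", n=1)[0]

        passed, _ = run_unit_tests(repair, tests)
        if passed:
            return repair

    return None  # no candidate within the budget passed all tests
```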


However, the workflow has a problem: self-healing requires more calls to the model, increasing the computational cost.

Moreover, the researchers discovered an interesting phenomenon: the effectiveness of a large model's self-repair depends not only on its ability to generate code, but also on its ability to identify how its code is wrong with respect to the task.

No work has investigated this in detail, so the authors investigate the self-healing effectiveness of GPT-3.5 and GPT-4 in solving competition-level code generation tasks.

The researchers propose a new evaluation strategy called pass@t, in which the pass rate for a task is measured against the total number of tokens sampled from the model.

Using pass@t instead of the traditional pass@k (which measures the pass rate against the number of samples) allows for fair comparisons with purely sampling-based approaches.
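For reference, pass@k is usually computed with the unbiased estimator popularized by the Codex paper; the snippet below is that standard estimator, not code from this study.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    is correct, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```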

From the experiments, the researchers found that:

1. Only GPT-4 achieves a performance improvement from self-repair; for GPT-3.5, the pass rate after repair is lower than or equal to that of the baseline no-repair approach under all budgets.


2. Even for the GPT-4 model, the performance improvement is modest at best (the pass rate increases from 66% to 71% with a budget of 7,000 tokens, roughly equivalent to 45 i.i.d. GPT-4 samples), and it depends on the initial programs being sufficiently diverse.


3. Replacing GPT-3.5's explanations of the errors with feedback generated by GPT-4 yields better self-repair performance, even surpassing the baseline no-repair GPT-3.5 approach (from 50% to 54% at 7,000 tokens).


4. Replacing GPT-4's own explanations with explanations provided by human programmers significantly improves repair: the number of repaired programs that pass the tests increases by 57%.

Four stages of self-repair


The self-repair approach involves four stages: code generation, code execution, feedback generation, and code repair. The researchers formally define these four stages as follows.


Phase 1: Code Generation

Given a specification $\psi$ and a code generation model $M_P$, first sample $n_p$ programs $p_1, \dots, p_{n_p}$.

Expressed as a formula:

$p_i \sim M_P(\psi), \quad i = 1, \dots, n_p$
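As a rough illustration (the prompt and the `sample_llm` helper are assumptions, not the paper's code), Stage 1 simply draws i.i.d. samples from the code model:

```python
# Stage 1 sketch: draw n_p i.i.d. program samples from the code model M_P.
def generate_programs(spec, sample_llm, n_p=10, temperature=0.8):
    prompt = f"Solve the following problem in Python:\n\n{spec}\n"
    return sample_llm(prompt, n=n_p, temperature=temperature)
```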

Phase 2: Code Execution

Each code sample is then executed on the test suite; the full test suite is assumed to be accessible in executable form.

If any sample passes all of the tests, the process stops, since a satisfactory program has been found.

Otherwise, the error messages $e_i$ returned by the execution environment are collected.

These error messages contain either compile-time/runtime error information, or example inputs on which the program's output differs from the expected output.
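A hedged sketch of this stage is shown below. It assumes APPS-style tasks where programs read stdin and write stdout; the function returns either success or an error message of the kind described above.

```python
import subprocess
import sys

# Stage 2 sketch: execute a candidate program on the unit tests and collect
# an error message if it fails (runtime error, timeout, or wrong output).
def execute_on_tests(program, tests, timeout=5.0):
    for test_input, expected_output in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=test_input, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False, f"Timeout on input:\n{test_input}"
        if result.returncode != 0:
            return False, result.stderr          # compile-time or runtime error
        if result.stdout.strip() != expected_output.strip():
            return False, (f"Input:\n{test_input}\n"
                           f"Expected output:\n{expected_output}\n"
                           f"Actual output:\n{result.stdout}")
    return True, ""
```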

Phase 3: Feedback Generation

Here, the researchers used a feedback model to generate more detailed explanations of errors.

At this stage, the feedback model $M_F$ generates $n_f$ feedback strings $f_{i1}, \dots, f_{i n_f}$ for each erroneous program $p_i$, as follows:

$f_{ij} \sim M_F(\psi; p_i; e_i), \quad j = 1, \dots, n_f$
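A minimal sketch of this stage, with an assumed prompt and the same hypothetical `sample_llm` helper:

```python
# Stage 3 sketch: sample n_f feedback strings for a failing program p_i
# with error message e_i.
def generate_feedback(spec, program, error, sample_llm, n_f=1):
    prompt = (f"Problem:\n{spec}\n\nIncorrect program:\n{program}\n\n"
              f"Error:\n{error}\n\n"
              "Explain concisely what is wrong with this program.")
    return sample_llm(prompt, n=n_f)
```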

Phase 4: Code Repair

In the final step, for each initial program $p_i$ and feedback string $f_{ij}$, $n_r$ candidate repairs $r_{ij1}, \dots, r_{i j n_r}$ are sampled from $M_P$:

$r_{ijk} \sim M_P(\psi; p_i; e_i; f_{ij}), \quad k = 1, \dots, n_r$
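And a corresponding sketch of the repair stage (again with assumed prompt wording):

```python
# Stage 4 sketch: sample n_r candidate repairs conditioned on the specification,
# the failing program, its error message, and the feedback.
def generate_repairs(spec, program, error, feedback, sample_llm, n_r=1):
    prompt = (f"Problem:\n{spec}\n\nIncorrect program:\n{program}\n\n"
              f"Error:\n{error}\n\nFeedback:\n{feedback}\n\n"
              "Write a corrected Python program.")
    return sample_llm(prompt, n=n_r)
```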

The researchers call the tree of interleaved text and programs produced by this process a repair tree $T$:

- It is rooted at the specification $\psi$, then branches into the initial programs $p_i$; each program branches into feedback strings $f_{ij}$, which in turn branch into the repairs $r_{ijk}$.

Specifically as shown in the figure:

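One possible in-memory representation of the repair tree (the class and field names below are illustrative, not taken from the paper's code):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeedbackNode:
    feedback: str                      # feedback string f_ij
    repairs: List[str] = field(default_factory=list)  # repairs r_ij1..r_ijnr

@dataclass
class ProgramNode:
    program: str                       # initial program p_i
    error: str                         # error message e_i from execution
    feedback_nodes: List[FeedbackNode] = field(default_factory=list)

@dataclass
class RepairTree:
    spec: str                          # specification at the root
    programs: List[ProgramNode] = field(default_factory=list)
```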

Since self-repair requires several dependent model calls with varying costs, pass@k (the likelihood of obtaining a correct program from $k$ i.i.d. samples) is not an appropriate metric for comparing and evaluating the various hyperparameter choices of self-repair in this setting.

Instead, the researchers measure the pass rate as a function of the total number of tokens sampled from the model, a metric they call pass@t.
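A minimal sketch of how pass@t could be estimated from logged runs; the record format here is an assumption for illustration only.

```python
# pass@t sketch: `records` is a list with one entry per task, where each entry
# is a list of (cumulative_tokens, passed) pairs logged as sampling proceeds.
# Returns the fraction of tasks solved within a budget of t tokens.
def pass_at_t(records, t):
    solved = sum(
        1 for task in records
        if any(passed and tokens <= t for tokens, passed in task)
    )
    return solved / len(records)
```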

Experimental procedure


The researchers went a step further and tested three questions:

1. For more challenging programming tasks, does self-repair with these models yield better results than i.i.d. sampling without repair?

2. Does a stronger feedback model improve a model's repair performance?

3. If humans participate in the self-repair loop of the most capable model and provide manual feedback, can better repair performance be unlocked?

First, the research team introduced a challenging set of programming problems: the tasks from the Automated Programming Progress Standard (APPS) dataset.

The tasks in this dataset range from introductory level to collegiate-competition level, and they can be used to evaluate the problem-solving and coding abilities of human programmers.

The researchers selected 300 tasks, including 60 introductory-level tasks and 60 competition-level tasks.


The researchers chose GPT-3.5 and GPT-4 as the models, and performed self-repair using template string concatenation and single-shot prompts.

The figure below shows one example of these prompts.

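Since the original prompt image is not reproduced here, the snippet below is only a hedged illustration of what a repair prompt built by template string concatenation might look like; the section headers and wording are assumptions, not the paper's actual template.

```python
# Illustrative only: a repair prompt assembled by simple string concatenation.
def build_repair_prompt(spec, program, error, feedback):
    return (
        "### QUESTION\n" + spec + "\n\n"
        "### INCORRECT PYTHON CODE\n" + program + "\n\n"
        "### ERROR MESSAGE\n" + error + "\n\n"
        "### FEEDBACK\n" + feedback + "\n\n"
        "### CORRECT PYTHON CODE\n"
    )
```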

Self-repair requires a strong model and diverse initial samples

The researchers first had a single model perform both the feedback generation and the repair generation for the code.

In the heatmaps shown on the right, the two hyperparameters are plotted along the axes; the value in each cell is the mean pass rate of self-repair, normalized by the mean pass rate of the baseline under the same token budget (that is, the same value of t in pass@t).


As the figure shows, for the GPT-3.5 model pass@t is lower than or equal to the corresponding baseline (black) under all settings, clearly indicating that self-repair is not an effective strategy for GPT-3.5.

For GPT-4 (figure below), there are several hyperparameter settings under which the self-repair pass rate is clearly better than the baseline.


The figure below shows pass@t for self-repair and for the baseline no-repair approach.

GPT-4 feedback improves GPT-3.5's repair results

The researchers then ran a further experiment to evaluate the effect of using a separate, stronger model to generate the feedback. The goal was to test the hypothesis that self-repair is hindered because the model cannot introspect on and debug its own code (for GPT-3.5, say).


The results of this experiment are shown in the figure above (bright blue).

In terms of absolute performance, the GPT-3.5/GPT-4 combination (GPT-3.5 generating the code, GPT-4 generating the feedback) does break through the performance barrier, and it is slightly more efficient than i.i.d. sampling from GPT-3.5.

This suggests that the textual feedback stage itself is crucial, and that improving it can relieve the bottleneck in GPT-3.5's self-repair.
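In terms of the earlier loop sketch, decoupling the two roles only requires routing the feedback call to a different, stronger model; `code_llm` and `feedback_llm` below are hypothetical sampling helpers, not the paper's code.

```python
# Sketch: a weaker model writes and repairs the code, while a stronger model
# only provides the textual feedback (e.g. GPT-3.5 for code, GPT-4 for feedback).
def cross_model_repair(spec, program, error, code_llm, feedback_llm):
    explanation = feedback_llm(
        f"Spec:\n{spec}\nProgram:\n{program}\nError:\n{error}\n"
        "Explain briefly why this program is wrong.", n=1)[0]
    repair = code_llm(
        f"Spec:\n{spec}\nBuggy program:\n{program}\nFeedback:\n{explanation}\n"
        "Write a corrected Python program.", n=1)[0]
    return explanation, repair
```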

Human feedback significantly improves the success rate of GPT-4's repairs

In the final experiment, the researchers wanted to study the effect of adding feedback from expert human programmers when performing repair with the stronger model (GPT-4).

The goal of the study was to understand how the model's ability to identify errors in code compares to that of humans, and how this affects the downstream performance of self-healing.

The researchers recruited 16 participants, including 15 graduate students and one professional machine learning engineer.

Each participant was given five different failing base programs, selected according to their experience with Python.

Each program is taken from a different task, and participants never see two different programs belonging to the same task.

Participants were then asked to explain in their own words what the program did wrong.

The experimental results are shown in the figure below:


The researchers found that when GPT-4's own feedback was replaced with feedback from the human participants, the overall success rate increased by more than 1.57×.

Not surprisingly, the relative difference grows as the problems get harder, suggesting that when the task (and code) becomes more complex, GPT-4's ability to generate accurate and useful feedback lags far behind that of the human participants.


