We all know that large language models can introspect and self-correct the code they write.
How does the mechanism behind this self-healing work?
To what extent does the model provide accurate feedback on why the code is wrong?
Recently, scholars from MIT and Microsoft found that between GPT-4 and GPT-3.5, only GPT-4 shows effective self-repair. Moreover, GPT-4 can even provide feedback on programs generated by GPT-3.5.
Paper address: https://arxiv.org/pdf/2306.09896.pdf
NVIDIA scientist Jim Fan strongly recommends this research.
In his view, even the most professional human programmers cannot write programs correctly the first time. They need to look at the execution results, reason out the problem, give fixes, and try again and again. This is an agent loop: iteratively improving the code based on environmental feedback.
In all likelihood, OpenAI is training the next generation of GPT by hiring lots of software engineers. And they don't need to output code - Critique is all you need.
- The core reason GPT-4 can self-repair is its strong feedback ability: its capacity to reflect effectively on what is wrong with a piece of code is something the other models cannot match.
- The feedback model and the code generation model do not have to be the same. In fact, the feedback model is the bottleneck.
- Based on the feedback from GPT-4, GPT-3.5 is able to write better code.
- Based on feedback from professionals, GPT-4 itself is capable of writing better code.
Demystifying GPT fixes for code generation
We all know that large language models have shown extraordinary capabilities in generating code.
However, they do not perform well on challenging programming tasks such as competitions and software engineering interviews.
Fortunately, many models will "introspect" through a self-healing workflow to self-correct errors in the code.
Researchers are eager to know to what extent these models provide correct feedback and explain why the code they generate is wrong.
The figure shows the classic workflow of the self-repair approach.
First, given a specification, a program is sampled from the code generation model and then executed on a set of unit tests provided in the specification.
If the program fails any of the unit tests, the error message and the program are fed to a feedback generation model, which in turn outputs a short explanation of why the code failed.
Finally, the feedback is passed to a repair model that generates a fixed version of the program.
On the surface, this workflow seems perfect. It lets the system overcome errors caused by bad samples during decoding, easily incorporates feedback from symbolic systems (compilers, static analysis tools, execution engines, etc.) during the repair phase, and mimics the trial-and-error way human software engineers write code.
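The four-stage workflow above can be sketched as a simple loop. This is a minimal illustration with hypothetical `generate`, `run_tests`, `get_feedback`, and `repair` callables standing in for the model and the execution sandbox; it is not the paper's implementation.

```python
# Sketch of the classic self-repair loop. The four callables are
# hypothetical stand-ins: in a real system, generate/get_feedback/repair
# would be LLM calls and run_tests would be a sandboxed executor.

def self_repair(spec, tests, generate, run_tests, get_feedback, repair,
                max_rounds=3):
    """Return the first program that passes all tests, or None."""
    program = generate(spec)                   # stage 1: code generation
    for _ in range(max_rounds):
        ok, error = run_tests(program, tests)  # stage 2: code execution
        if ok:
            return program
        feedback = get_feedback(spec, program, error)  # stage 3: feedback
        program = repair(spec, program, feedback)      # stage 4: repair
    return None
```

Note that each failed round costs additional model calls, which is exactly the extra computational cost the researchers account for with their token-budget metric.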
However, the workflow has a problem: self-healing requires more calls to the model, increasing the computational cost.
Moreover, the researchers discovered an interesting phenomenon: the effectiveness of large-model self-repair depends not only on the model's ability to generate code, but also on its ability to identify how the code is wrong with respect to the task.
No prior work had investigated this in detail, so the authors study the self-repair effectiveness of GPT-3.5 and GPT-4 on competition-level code generation tasks.
The researchers propose a new evaluation strategy called pass@t, in which the pass rate of a task is measured against the total number of tokens sampled from the model.
Using pass@t instead of the traditional pass@k (which measures the pass rate against the number of samples) allows a fair comparison with purely sampling-based approaches.
From the experiments, the researchers found that:
1. Only GPT-4 achieves a performance improvement from self-repair; for GPT-3.5, the pass rate after repair is lower than or equal to that of the baseline no-repair method under all budgets.
2. Even for the GPT-4 model, the improvement is modest at best (the pass rate rises from 66% to 71% with a budget of 7,000 tokens, roughly the cost of 45 i.i.d. GPT-4 samples), and it depends on the initial programs being sufficiently diverse.
3. Replacing GPT-3.5's explanations of the errors with feedback generated by GPT-4 yields better self-repair performance, even surpassing the baseline no-repair GPT-3.5 method (from 50% to 54% at 7,000 tokens).
4. Replacing GPT-4's own explanations with explanations provided by human programmers significantly improves repair: the number of repaired programs that pass the tests increases by 57%.
Four stages of self-healing
The self-repair approach involves four stages: code generation, code execution, feedback generation, and code repair. The researchers formally define these four stages as follows.
Phase 1: Code Generation
Given a specification ψ, a code generation model M_P first generates n_p samples i.i.d.
Expressed as a formula: p_1, ..., p_{n_p} ~ M_P(ψ).
Phase 2: Code Execution
Each sampled program is then executed on the test suite; the paper assumes the full test suite is accessible in executable form.
If any sample passes all the tests, the process stops, since a satisfactory program has been found.
Otherwise, the error messages returned by the execution environment are collected.
These error messages either contain compile/runtime error information, or example inputs on which the program's output differs from the expected one.
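A minimal sketch of what this execution stage might look like, assuming candidate programs expose a `solve` function and tests are input/output pairs. The names and structure here are illustrative, not the paper's actual harness.

```python
# Run a candidate program's solve() against input/output pairs and
# collect the kind of error string that would be passed on to the
# feedback model: either a compile/runtime traceback, or an example
# input on which the output differs from the expected one.

import traceback

def execute_candidate(source, test_cases):
    """Return (passed, error_message)."""
    namespace = {}
    try:
        exec(source, namespace)  # surfaces compile/runtime errors
    except Exception:
        return False, traceback.format_exc(limit=1)
    solve = namespace["solve"]
    for args, expected in test_cases:
        try:
            got = solve(*args)
        except Exception:
            return False, traceback.format_exc(limit=1)
        if got != expected:
            return False, f"solve{args!r} returned {got!r}, expected {expected!r}"
    return True, ""
```

(In practice, untrusted generated code would be run in a sandboxed subprocess rather than with `exec` in-process.)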
Phase 3: Feedback Generation
Here, the researchers use a feedback model M_F to generate a more detailed explanation of the errors.
At this stage, a feedback string f is generated for each erroneous program p and its error message e, as follows: f ~ M_F(ψ, p, e).
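In practice, this stage amounts to prompting a model with the specification, the failing program, and the error message. A hypothetical feedback-prompt template (the wording is illustrative, not taken from the paper):

```python
# A hypothetical prompt template for the feedback stage: the task
# specification, the failing program, and its error message are
# concatenated, and the feedback model completes the explanation.

FEEDBACK_TEMPLATE = """\
### Task
{spec}

### Incorrect program
{program}

### Error
{error}

### Explain concisely why the program is wrong:
"""

def build_feedback_prompt(spec, program, error):
    return FEEDBACK_TEMPLATE.format(spec=spec, program=program, error=error)
```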
Phase 4: Code Repair
In the final step, for each initial program p and feedback string f, candidate repairs are sampled from M_P: r ~ M_P(ψ, p, e, f).
The researchers call the resulting structure, which interleaves text and programs, the repair tree T:
- rooted at the specification ψ, it first branches into initial programs p; each program branches into feedback strings f, and each feedback string branches into repairs r.
Specifically, as shown in the figure:
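The repair tree can be sketched as a simple recursive structure, with the token count summed over the whole tree serving as the budget that the pass rate is measured against. A minimal sketch (the paper's budget accounting is more detailed):

```python
# Sketch of the repair tree T: the specification at the root, initial
# programs as its children, feedback strings below each program, and
# candidate repairs as leaves. Node names are illustrative.

from dataclasses import dataclass, field

@dataclass
class Node:
    text: str       # specification, program, feedback, or repair text
    tokens: int     # tokens sampled from the model to produce this node
    children: list = field(default_factory=list)

def total_tokens(node):
    """Total number of sampled tokens in the (sub)tree."""
    return node.tokens + sum(total_tokens(c) for c in node.children)
```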
Since self-repair requires several dependent model calls of varying cost, pass@k (the likelihood of obtaining a correct program within k samples) is not an appropriate metric in this setting for comparing and evaluating the various hyperparameter choices of self-repair.
Instead, the researchers measure the pass rate as a function of the total number of tokens sampled from the model, and call this metric pass@t.
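A simplified estimator of such a token-budget pass rate, where each independent trial records the total tokens it sampled and whether it found a passing program. This is an assumption-laden sketch; the paper's actual estimator is more involved.

```python
# Simplified pass@t estimator: the fraction of independent trials that
# found a passing program while staying within a budget of t tokens.
# One "trial" here stands for one full generate/repair attempt.

def pass_at_t(trials, t):
    """trials: list of (tokens_sampled, succeeded) pairs."""
    hits = sum(1 for tokens, ok in trials if ok and tokens <= t)
    return hits / len(trials)
```

Under this view, a repair strategy only wins if its extra model calls buy more successes per token than simply drawing more i.i.d. samples.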
Experimental procedure
The researchers went a step further and tested three questions:
1. For these more challenging programming tasks, does self-repair by these models give better sample efficiency than i.i.d. sampling without repair?
2. Does a stronger feedback model improve a model's repair performance?
3. If humans are kept in the self-repair loop of the strongest model to provide manual feedback, does that unlock better repair performance?
The research team first introduced a challenging benchmark: programming tasks from the Automated Programming Progress Standard (APPS) dataset.
The tasks in this dataset range from entry level to college competition level, and can be used to evaluate human programmers' problem-solving and coding abilities.
The researchers selected 300 tasks, including 60 entry-level tasks and 60 competition-level tasks.
They chose GPT-3.5 and GPT-4 as the models, and implemented self-repair with template string concatenation and single-shot prompts.
The figure below shows one example of these prompts.
Self-repair requires a strong model and diverse initial samples
The researchers had a single model perform both code repair and feedback generation.
The figure on the right shows a heatmap over two hyperparameters along the axes, where the value in each cell is the mean pass rate of self-repair, normalized by the mean pass rate of the baseline under the same token budget (i.e., the same value of t in pass@t).
The figure shows that for the GPT-3.5 model, pass@t is lower than or equal to the corresponding baseline (black) in all settings, clearly indicating that self-repair is not an effective strategy for GPT-3.5.
For GPT-4 (figure below), by contrast, there are several hyperparameter settings under which the self-repair pass rate is significantly better than the baseline.
The figure below shows the comparison with the baseline no-repair method.
GPT-4 feedback improves GPT-3.5's repair results
The researchers then ran a further experiment to evaluate the effect of using a separate, stronger model for feedback generation, in order to test a hypothesis: self-repair is held back by the model's (e.g., GPT-3.5's) inability to introspect and debug its own code.
The results of this experiment are shown in the figure above (light blue).
In absolute terms, the GPT-3.5 + GPT-4 combination does break through the performance barrier and is slightly more efficient than i.i.d. sampling from GPT-3.5.
This suggests that the textual feedback stage itself is crucial, and that improving it can relieve the bottleneck of GPT-3.5 self-repair.
Human feedback significantly improves the success rate of GPT-4 repairs
In the final experiment, the researchers studied the effect of adding feedback from expert human programmers when repairing with the stronger model, GPT-4.
The goal of the study was to understand how the model's ability to identify errors in code compares to that of humans, and how this affects the downstream performance of self-healing.
The researchers recruited 16 participants, including 15 graduate students and 1 professional machine learning engineer.
Each participant was shown five different base programs, assigned according to their Python experience.
Each program was taken from a different task, and participants never saw two different programs belonging to the same task.
Participants were then asked to explain in their own words what the program did wrong.
The experimental results are shown in the figure below:
The researchers found that when GPT-4's own feedback was replaced with feedback from the human participants, the overall success rate increased by more than 1.57×.
Not surprisingly, the relative gap widened as the problems got harder, suggesting that as tasks (and code) become more complex, GPT-4's ability to generate accurate and useful feedback lags far behind that of the human participants.