Microsoft announces Jigsaw, an AI tool that boosts code synthesis accuracy to over 80%

Microsoft has announced Jigsaw, a new tool that can improve the performance of large language models. As the company explains: "Large pre-trained language models (like GPT-3, Codex, etc.) can be tuned to generate code from a natural language specification of the programmer's intent. Such automated models have the potential to increase the productivity of every programmer in the world; but since these models may have difficulty understanding program semantics, the quality of the generated code cannot be guaranteed."

According to the announcement, Jigsaw deploys post-processing technology that understands program syntax and semantics, and then leverages user feedback to improve future performance; the tool is designed to synthesize code for the Python Pandas API using multimodal input. Pandas is a widely used API in data science, with hundreds of functions for manipulating dataframes (tables with rows and columns).
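To illustrate the kind of dataframe manipulation Jigsaw targets, here is a small, self-contained Pandas example (our own illustration, not taken from the paper):

```python
import pandas as pd

# A small dataframe of sales records
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales":  [100, 200, 150, 250],
})

# A typical Pandas transformation: total sales per region
totals = df.groupby("region", as_index=False)["sales"].sum()
print(totals)
```

A one-line call like this is exactly the sort of snippet a user might instead describe in English ("sum sales by region") and ask the system to synthesize.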

Microsoft says its experience shows that Jigsaw can play an important role in improving system accuracy as these large language models evolve to synthesize code based on intent.

Large language models like OpenAI's Codex are redefining the field of programming. When solving a programming task, a software developer can provide an English description of the expected code snippet, and Codex can synthesize that code in languages such as Python or JavaScript. But the synthesized code may not be correct, and may not even compile or run; Codex users are responsible for reviewing the code before using it. The Jigsaw team explained that with Project Jigsaw, its goal is to automate parts of that review to improve the productivity of developers who use large language models such as Codex for code synthesis.

Microsoft believes Jigsaw can "completely automate" the process of checking that code compiles, handling error messages, and testing that the code produces the output the developer expects. "Jigsaw takes as input an English description of the expected code along with an I/O example. In this way, it pairs the input with the associated output, and provides the quality assurance that the output Python code will compile on the provided input and produce the expected output."
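The compile-and-check step described above can be sketched as a small validation harness. This is our own illustration of the idea (the function name `check_candidate` and the `df`/`out` variable convention are hypothetical), not Jigsaw's actual implementation:

```python
import pandas as pd

def check_candidate(code: str, input_df: pd.DataFrame,
                    expected_df: pd.DataFrame) -> bool:
    """Run a synthesized snippet against the user's I/O example.

    The snippet is expected to read the variable `df` and leave its
    result in `out`. Returns True only if the code compiles, runs,
    and reproduces the expected output dataframe.
    """
    env = {"pd": pd, "df": input_df.copy()}
    try:
        exec(compile(code, "<candidate>", "exec"), env)  # compile + run
        return env["out"].equals(expected_df)            # compare to expected output
    except Exception:
        return False  # failed to compile or run: reject the candidate

# Example: does the candidate implement "drop duplicate rows"?
inp = pd.DataFrame({"a": [1, 1, 2]})
exp = pd.DataFrame({"a": [1, 2]})
ok = check_candidate("out = df.drop_duplicates().reset_index(drop=True)", inp, exp)
```

Candidates that raise an error or produce the wrong dataframe are rejected, which is the quality assurance the I/O example makes possible.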

In its ICSE 2022 paper "Jigsaw: Large Language Models meet Program Synthesis", Microsoft evaluated this approach on Python Pandas. Using Jigsaw, the user can provide an English description of the expected transformation, an input dataframe, and a corresponding output dataframe, and then let Jigsaw synthesize the expected code.

Jigsaw takes English queries and preprocesses them with the appropriate context to build inputs that can be fed into large language models. Microsoft found in experiments that this alone produced correct output about 30 percent of the time. If the code fails, the repair process begins in the post-processing stage.
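One common way to supply such context is to prepend related (description, code) pairs to the query as a few-shot prompt. The sketch below is our own assumption about what such preprocessing could look like; the examples and the `build_prompt` helper are hypothetical, and Jigsaw's actual context selection is more involved:

```python
# Hypothetical few-shot examples; Jigsaw's real context selection differs.
EXAMPLES = [
    ("drop rows with missing values", "out = df.dropna()"),
    ("sort the dataframe by column 'a'", "out = df.sort_values('a')"),
]

def build_prompt(query: str) -> str:
    """Prepend (description, code) pairs so the model sees the
    expected format before completing the new query."""
    parts = [f"# {desc}\n{code}" for desc, code in EXAMPLES]
    parts.append(f"# {query}\nout =")
    return "\n\n".join(parts)

prompt = build_prompt("keep only the first three rows")
print(prompt)
```

The model is then asked to complete the final `out =` line, and its completion becomes a candidate for the post-processing checks.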

During post-processing, Jigsaw applies three transformations to fix the code. Each of these transformations was motivated by failure modes the team observed in GPT-3 and Codex. Because GPT-3 and Codex fail in similar ways, Jigsaw's post-processing is useful for both.
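The article does not name the three transformations, but the general shape of such a repair is to mutate a failing candidate and re-validate it against the I/O example. The sketch below shows one deliberately simple, hypothetical repair of our own devising, flipping a Pandas `axis=` argument; the paper's transformations are more general:

```python
import pandas as pd

def repair_axis_argument(code: str, input_df: pd.DataFrame,
                         expected_df: pd.DataFrame):
    """Illustrative repair: if a snippet fails its I/O check, try
    flipping the `axis=` argument and re-check. Returns the fixed
    code, or None if no variant passes the I/O example."""
    for old, new in [("axis=0", "axis=1"), ("axis=1", "axis=0")]:
        if old in code:
            candidate = code.replace(old, new)
            env = {"pd": pd, "df": input_df.copy()}
            try:
                exec(candidate, env)              # run the mutated snippet
                if env["out"].equals(expected_df):
                    return candidate              # repair validated by the I/O example
            except Exception:
                pass
    return None

# The model dropped along the wrong axis; the repair flips it.
inp = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
exp = pd.DataFrame({"a": [1, 2]})
fixed = repair_axis_argument("out = df.drop('b', axis=0)", inp, exp)
```

The key design point is that the I/O example acts as an oracle: any mutation is acceptable only if the repaired code reproduces the expected output.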

Microsoft evaluated Codex and Jigsaw (with Codex) on various datasets and measured accuracy. Codex gives about 30% accuracy out of the box; Jigsaw increases the accuracy to over 60%, and with user feedback it can be increased to over 80%. Moving forward, the team will continue improving Jigsaw and work to extend its experience with the Python Pandas API to other APIs and other languages, with the goal of increasing programmer productivity through automation.

More details can be found on the official blog.

Origin www.oschina.net/news/190978/microsoft-jigsaw-ai-code-fix