Google uses machine learning to tackle code review comments

Code review is an important part of large-scale software development, and it consumes a significant amount of time from both code authors and code reviewers. In this process, reviewers inspect the code for issues and write comments asking the author to make changes. At Google, we see millions of reviewer comments each year, and authors spend on average about 60 minutes responding to them and proposing code changes based on the comment text. Our research finds that the time code authors must dedicate to addressing reviewer comments grows almost linearly with the number of comments. With the help of machine learning (ML), however, we can automate and simplify parts of the code review process, for example by automatically proposing code changes based on review comments.

Today, we are applying recent advances in sequence models to automatically resolve code review comments in Google's day-to-day development workflow (publication forthcoming). Google engineers can already address a substantial share of reviewer comments by applying an ML-suggested change, and we estimate this will save hundreds of thousands of hours of code review time at Google each year. The feedback has been very positive: developers report that ML-suggested code changes genuinely improve their productivity, letting them focus on more creative and complex tasks.

Predicting Code Modifications

We first trained a model to predict the code modification required to address a comment. The model is pretrained on various coding tasks and related developer activities (e.g., renaming a variable, fixing a broken build, editing a file). It is then fine-tuned for this specific task using reviewed code changes: the code under review, the reviewer's comment, and the modification the author made to address that comment.

An example of code refactoring based on an ML-suggested edit.
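As a rough illustration of how one such fine-tuning example might be assembled into an (input, target) pair for a sequence model, consider the minimal sketch below. The marker tokens, field names, and serialization are illustrative assumptions, not the production format:

```python
# Minimal sketch: turning one reviewed change into a fine-tuning example.
# The dataclass fields and <marker> tokens are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ReviewExample:
    file_path: str    # file the reviewer commented on
    code_before: str  # code snippet under review
    comment: str      # the reviewer's comment
    code_after: str   # the author's edit that addressed the comment


def to_model_io(ex: ReviewExample) -> tuple[str, str]:
    """Serialize an example into (input, target) strings for a seq2seq model."""
    model_input = (
        f"<file> {ex.file_path}\n"
        f"<code> {ex.code_before}\n"
        f"<comment> {ex.comment}"
    )
    return model_input, ex.code_after  # target is the post-edit code


example = ReviewExample(
    file_path="geo/distance.py",
    code_before="d = haversine(a, b)",
    comment="Please give `d` a more descriptive name.",
    code_after="distance_km = haversine(a, b)",
)
print(to_model_io(example)[0])
```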

Google uses a monorepo: a source-code management strategy in which all code and resources are stored in a single repository rather than scattered across many. This approach has many advantages, including code sharing and reuse, large-scale refactoring, collaboration, and dependency management. It also means our training dataset can contain the vast amount of code used to build Google's current software and its earlier versions.

To improve model quality, we continuously iterate on the training dataset. For example, we compared model performance on a dataset containing a single reviewer comment per file against one with multiple comments per file, and we used a classifier trained on a small, curated dataset to clean the training data, selecting the variant with the best offline precision and recall metrics.
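A minimal sketch of that cleaning step, assuming a simple text classifier scores whether an author's edit actually addresses the comment; the scikit-learn pipeline, toy data, and 0.5 cutoff are illustrative, not the production setup:

```python
# Sketch: filter fine-tuning examples with a quality classifier trained on a
# small curated dataset. Pipeline, toy data, and cutoff are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Curated (comment || edit) pairs labeled 1 (edit addresses comment) or 0 (noise).
curated_texts = [
    "please rename the variable || renamed the variable as requested",
    "fix the typo in this docstring || fixed the docstring typo",
    "nit: spacing || rewrote an unrelated function in another file",
    "add a null check || reverted a colleague's earlier change",
]
curated_labels = [1, 1, 0, 0]

quality_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_clf.fit(curated_texts, curated_labels)


def keep_example(comment: str, edit: str, cutoff: float = 0.5) -> bool:
    """Keep a training example only if the classifier believes the edit
    actually resolves the reviewer comment."""
    score = quality_clf.predict_proba([f"{comment} || {edit}"])[0, 1]
    return score >= cutoff
```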

Service Infrastructure and User Experience

We designed and implemented the feature on top of the trained model, focusing on the overall user experience and developer efficiency. As part of this, we explored different user experience (UX) alternatives through a series of user studies, then refined the feature based on insights from an internal beta (i.e., a test of the feature while still in development), including user feedback (e.g., a "Was this helpful?" button next to each suggested edit).

The final model is calibrated to a target precision of 50%. That is, we tuned the model and the suggestion filtering so that 50% of the suggested edits in our evaluation dataset are correct. In general, raising the target precision reduces the number of suggested edits shown, while lowering it surfaces more incorrect edits. Incorrect suggested edits cost developers time and erode their trust in the feature; we found that a 50% target precision strikes a good balance.
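The calibration itself reduces to a threshold search on a held-out set: sort suggestions by model confidence and pick the lowest cutoff at which the shown suggestions still meet the target precision. A minimal sketch, with made-up evaluation data:

```python
# Sketch: pick the confidence cutoff that meets a target precision while
# showing as many suggested edits as possible. Evaluation data is made up.

def calibrate_threshold(confidences, is_correct, target_precision=0.50):
    """Return the lowest confidence cutoff whose suggestions, taken together,
    reach the target precision."""
    pairs = sorted(zip(confidences, is_correct), reverse=True)
    shown, correct, cutoff = 0, 0, None
    for confidence, ok in pairs:
        shown += 1
        correct += ok
        if correct / shown >= target_precision:
            cutoff = confidence  # this prefix of suggestions is precise enough
    return cutoff


confs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40]
correct = [True, True, False, True, False, False, False]
print(calibrate_threshold(confs, correct))  # 0.55 -> 3 of 6 shown edits correct
```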

At a high level, for each new reviewer comment we generate the model input in the same format used during training, query the model, and generate the suggested code modification. If the model is sufficiently confident in its prediction and a few additional heuristic rules are satisfied, we send the suggested modification to downstream systems. The downstream systems, namely the code review frontend and the integrated development environment (IDE), present the suggested edit to the user and log interactions such as preview and apply events. A dedicated pipeline collects these logs and produces aggregate insights, such as the overall acceptance rates reported in this post.
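A minimal sketch of this serving flow, assuming hypothetical `model` and `downstream` interfaces; the threshold value and heuristics are illustrative stand-ins:

```python
# Sketch of the serving flow: build the model input, query the model, and
# forward the suggestion only if confidence and heuristic checks pass.
# All names and the threshold value are hypothetical stand-ins.

CONFIDENCE_THRESHOLD = 0.55  # calibrated offline against the precision target


def passes_heuristics(suggestion: str, original: str) -> bool:
    """Illustrative serving-time rules that filter obvious failure modes."""
    return (
        suggestion.strip() != ""                        # model produced an edit
        and suggestion != original                      # ...that changes the code
        and len(suggestion) < 4 * len(original) + 200   # ...and isn't a runaway
    )


def handle_new_comment(comment_id, comment_text, code, model, downstream):
    model_input = f"<code> {code}\n<comment> {comment_text}"  # training format
    suggestion, confidence = model.predict(model_input)
    if confidence >= CONFIDENCE_THRESHOLD and passes_heuristics(suggestion, code):
        # The code review frontend and IDE render the edit and log
        # preview/apply events for the aggregation pipeline.
        downstream.send(comment_id=comment_id, suggested_edit=suggestion)
```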

The architecture of the ML-suggested edits infrastructure. We process code and infrastructure from multiple services, obtain the model's predictions, and surface them in the code review tool and IDE.

Developers interact with ML-suggested edits in the code review tool and in the IDE. Based on insights from user studies, the integration into the code review tool works best for a streamlined review experience. The IDE integration provides additional functionality, supporting a three-way merge of the ML-suggested edit (left in the figure below) with the author's conflicting local changes on top of the reviewed code state (center) into a merge result (right).


Demonstration of a three-way merge in the IDE
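As an illustration of the merge operation itself (not the actual IDE implementation, which the post does not detail), the sketch below performs a three-way merge by delegating to the standard `git merge-file` tool:

```python
# Sketch: three-way merge of an ML-suggested edit with conflicting local
# changes, both derived from the reviewed code state. Uses `git merge-file`;
# the IDE's real merge implementation is not public.
import pathlib
import subprocess
import tempfile


def three_way_merge(base: str, ml_edit: str, local: str) -> str:
    """Merge `ml_edit` and `local` against their common ancestor `base`.
    Conflicting regions come back wrapped in standard conflict markers."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = {}
        for name, text in [("base", base), ("ours", ml_edit), ("theirs", local)]:
            path = pathlib.Path(tmp, name)
            path.write_text(text)
            paths[name] = str(path)
        # `git merge-file -p` prints the merged content to stdout.
        result = subprocess.run(
            ["git", "merge-file", "-p",
             paths["ours"], paths["base"], paths["theirs"]],
            capture_output=True, text=True,
        )
        return result.stdout
```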

Results

Offline evaluation shows that the model addresses 52% of comments at a target precision of 50%. Online metrics from the beta and the full internal launch confirm these offline numbers: the model's confidence exceeds the target threshold for roughly 50% of all relevant reviewer comments. Code authors applied 40% to 50% of all previewed suggested edits.

During the beta, we used the "not helpful" feedback to identify common model failure modes, and we implemented serving-time heuristic rules that filter these out, reducing the number of incorrect predictions shown. With these changes, we traded quantity for quality and observed higher real-world acceptance.

The beta rollout also revealed a discoverability problem: code authors previewed only about 20% of all generated suggested edits. We adjusted the user experience, introducing a prominent "Show ML-edit" button next to the reviewer comment (see the figure above), which raised the overall preview rate to about 40% at launch. We additionally found that suggested edits in the code review tool often could not be applied because of conflicting changes the author had made during the review. We addressed this with a button in the code review tool that opens a merge view for the suggested edit. We now observe that more than 70% of applied edits happen in the code review tool and fewer than 30% in the IDE.

All of these changes tripled the overall fraction of reviewer comments resolved with ML-suggested edits between the beta and the full internal launch. At Google scale, this helps us automatically resolve hundreds of thousands of reviewer comments each year.

The suggestion filtering funnel.
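As a sketch of how the aggregation pipeline might compute such funnel metrics from interaction logs (the event names and log shape here are assumptions):

```python
# Sketch: aggregate interaction logs into funnel metrics
# (shown -> previewed -> applied). Event names and log shape are assumptions.
from collections import defaultdict


def funnel_rates(events):
    """events: iterable of (suggestion_id, event_type) tuples."""
    stages = defaultdict(set)
    for suggestion_id, event_type in events:
        stages[event_type].add(suggestion_id)
    shown = max(len(stages["shown"]), 1)          # avoid division by zero
    previewed = max(len(stages["previewed"]), 1)
    return {
        "preview_rate": len(stages["previewed"]) / shown,
        "apply_rate_of_previewed": len(stages["applied"]) / previewed,
    }


log = [(1, "shown"), (2, "shown"), (1, "previewed"), (1, "applied")]
print(funnel_rates(log))  # {'preview_rate': 0.5, 'apply_rate_of_previewed': 1.0}
```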

In production, we see ML-suggested edits addressing a wide range of reviewer comments. This includes simple, localized refactorings as well as refactorings spread throughout the code, as shown in the examples above. The feature also handles comments that are long and informally worded and that require code generation, refactorings, and imports.

An example of a longer code review comment that requires code generation, refactoring, and imports.

The model can also respond to complex comments and produce large-scale code modifications (shown below). The generated test case follows the existing unit test pattern while changing the details described in the comment. Additionally, the suggested edit includes a comprehensive name for the test that reflects its semantics.

An example of the model responding to a complex comment with a large-scale code modification.

Conclusions and future work

In this post, we introduced an ML-assistance feature that reduces the time spent on code-review-related changes. At this time, 7.5% of all actionable code review comments in the programming languages supported at Google are resolved by applying ML-suggested edits. A 12-week A/B experiment across all Google developers will further measure the feature's impact on overall developer productivity.

We are optimizing across the entire technology stack. This includes improving the model's quality and recall, providing developers with a smoother experience, and improving discoverability (a clear interface and navigation that help users find the feature quickly) throughout the review process. As part of this, we are working on the ability to display ML-suggested edits to the reviewer while they draft a comment, and on extending the IDE integration so that code-change authors receive ML-suggested code modifications as soon as the reviewer's comment arrives.

Acknowledgements

This is the result of joint work by many people on Google's Core Systems & Experiences team, Google Research, and DeepMind. We especially thank Peter Choy for supporting our collaboration, and all members of our teams for their key contributions and helpful advice, including Marcus Revaj, Gabriela Surita, Maxim Tabachnyk, Jacob Austin, Nimesh Ghelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sascha Varkevisser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chenjie Gu, Petros Maniatis, Henryk Michalewski, Sara Wiltberger, Ambar Murillo, Satish Chandra, Madhura Dudhgaonkar, Niranjan Tulpule, Zoubin Ghahramani, Juanjo Carin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokowski, Kathy Nix, Mehdi Ghissassi, Luis C. Cobo, Yujia Li, David Choi, Kristóf Molnár, Vahid Meimand, Amit Patel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Hermann Loose, Jonas Mattes, Savinee Dancs.

Original link: https://ai.googleblog.com/2023/05/resolving-code-review-comments-with-ml.html
