Automatic Code Generation - Literature Review: "Code Generation Using Machine Learning: A Systematic Review"

Reading notes on automatic code generation, covering the survey "Code Generation Using Machine Learning: A Systematic Review".


1. Summary

1. The main research directions in the field of code generation: description-to-code, code-to-description, and code-to-code.
2. The most popular applications: code generation from natural language descriptions, documentation generation, and automatic program repair.
3. The most commonly used machine learning methods: RNN, Transformer, CNN.

2. Introduction

To improve the efficiency of software development, researchers have begun using neural networks for automated programming tasks, for example applying RNNs, Transformers, and similar architectures to generate code from comments, generate comments from code, translate between programming languages, and so on. Commercial products such as Tabnine and Copilot have also appeared.

3. Three main research directions in the field of code generation

(1) Description-to-code
(2) Code-to-description
(3) Code-to-code

4. Tokenization

Tokenization is a preprocessing step that splits the input string into chunks. These chunks (tokens) are mapped to numbers, which are then fed into the ML model.
There are three types of tokenizers:

1. Word-based
2. Character-based
3. Subword-based

Subword-based tokenization is a compromise between word-based and character-based, and its vocabulary includes all basic characters as well as frequently occurring character sequences.

However, tokenizing code poses a problem: the number of unique "words" in code is much larger than in natural language, partly because function and variable names are often formed by concatenating several English words. For example, a function that prints "hello world" to the console might be named "helloWorld", "hello_world", and so on. Word-based tokenization is therefore not ideal for code. Character-based tokenization is rarely used in the surveyed studies, while some of the literature adopts the subword-based approach.
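
As an illustration of subword tokenization on code, the following is a minimal sketch that trains a tiny BPE tokenizer with the Hugging Face tokenizers library; the corpus, vocabulary size, and library choice are illustrative assumptions rather than anything prescribed by the survey.

```python
# Minimal sketch of subword (BPE) tokenization for code using the Hugging Face
# "tokenizers" library. The corpus and vocabulary size are illustrative assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "def hello_world(): print('hello world')",
    "def helloWorld(): print('hello world')",
    "def print_sum(a, b): print(a + b)",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation first
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# An identifier never seen as a whole word is still covered by learned subword pieces.
print(tokenizer.encode("def hello_sum(): pass").tokens)
```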

Some studies use a custom tokenization process that preserves useful information while keeping the vocabulary small, for example the two strategies of token copying and token abstraction.
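
As a rough illustration of token abstraction, the sketch below replaces identifiers and literals in Python source with abstract placeholders; the placeholder scheme and the use of Python's tokenize module are assumptions for demonstration, not the exact procedure of the surveyed papers.

```python
# Hypothetical sketch of the "token abstraction" strategy for Python source:
# identifiers, string literals, and numbers are replaced with abstract placeholders
# so the vocabulary stays small. This exact scheme is an assumption for illustration.
import io
import keyword
import tokenize

def abstract_tokens(source):
    counters = {"VAR": 0, "STR": 0, "NUM": 0}
    mapping = {}  # original lexeme -> placeholder (kept so tokens could be copied back)

    def placeholder(kind, lexeme):
        if (kind, lexeme) not in mapping:
            mapping[(kind, lexeme)] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        return mapping[(kind, lexeme)]

    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append(placeholder("VAR", tok.string))
        elif tok.type == tokenize.STRING:
            result.append(placeholder("STR", tok.string))
        elif tok.type == tokenize.NUMBER:
            result.append(placeholder("NUM", tok.string))
        elif tok.type in (tokenize.OP, tokenize.NAME):  # operators and keywords kept as-is
            result.append(tok.string)
        # layout tokens (NEWLINE, INDENT, COMMENT, ...) are dropped for this illustration
    return result

print(abstract_tokens("def hello_world():\n    print('hello world')\n"))
# ['def', 'VAR_0', '(', ')', ':', 'VAR_1', '(', 'STR_0', ')']
```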

5. Data

There are three sources of training/test data: open-source code, manually written code, and machine-generated code.

How is the quality of training/test data evaluated?
Through automated assessment and manual assessment; a toy automated check is sketched below.
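
The survey only names automated assessment as a category; the following hypothetical example shows one such check on mined Python snippets (the specific filtering rules are assumptions for illustration).

```python
# Hypothetical automated assessment of mined training data:
# keep only snippets that parse as valid Python and drop exact duplicates.
import ast

def filter_snippets(snippets):
    seen, kept = set(), []
    for src in snippets:
        if src in seen:
            continue  # drop exact duplicates
        try:
            ast.parse(src)  # keep only syntactically valid code
        except SyntaxError:
            continue
        seen.add(src)
        kept.append(src)
    return kept

raw = ["print(1)", "print(1)", "def broken(:"]
print(filter_snippets(raw))  # ['print(1)']
```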

6. How to evaluate the quality of synthetic code

Synthesized code can be compared with fully correct reference code, analyzed statically, analyzed at runtime, and more. The main methods are shown in Table 6.
(1) Token match: NLP metrics such as BLEU, CIDEr, ROUGE, and METEOR (see the sketch after the table).
(2) Dynamic analysis: used less often; it evaluates executable code at runtime for functional correctness and time to completion (also sketched below).
(3) Static analysis: rarely used; it does not require the code to be executable, but considering syntax alone degrades the quality of the evaluation.
[Table 6: evaluation methods for synthesized code - image omitted]
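
The sketch below illustrates the token-match and dynamic-analysis methods on a single generated snippet; the reference solution, the test case, and the use of NLTK's BLEU implementation are illustrative assumptions, not the survey's prescribed setup.

```python
# Minimal sketch of two evaluation methods from Table 6, applied to one generated snippet.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add(a, b): return a + b".split()
candidate = "def add(x, y): return x + y".split()

# (1) Token match: BLEU over token sequences (smoothing avoids zero scores on short snippets).
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# (2) Dynamic analysis: execute the candidate and check functional correctness on a test case.
namespace = {}
exec("def add(x, y): return x + y", namespace)
print("functionally correct:", namespace["add"](2, 3) == 5)
```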

7. Future work

1. Improve the efficiency of language models, which helps reduce costs.
2. Ensemble learning.
3. Use ASTs (abstract syntax trees of the source code). Multiple studies discussed in the paper, such as [21] and [95], use AST representations of code in their models; a small example follows this list.
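
The following shows what an AST representation of a snippet looks like, using Python's built-in ast module; how [21] and [95] actually encode ASTs in their models differs, so this is only a minimal sketch.

```python
# Small illustration of an AST representation using Python's built-in ast module.
import ast

tree = ast.parse("def add(a, b):\n    return a + b\n")
print(ast.dump(tree, indent=2))

# Walking the tree yields a node sequence that a model could consume alongside tokens.
print([type(node).__name__ for node in ast.walk(tree)])
# e.g. ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]
```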
