Automatic Code Generation: Literature Reading and Learning
Literature Name: "A Deep Learning Model for Source Code Generation"
Authors: Raymond Tiwang, Timothy Oladunni, Weifeng Xu

PS: The blogger is new to this field, so much of the content is still unclear to me. If anything in this post is wrong, criticism and corrections are welcome~


Abstract: Inspired by models such as n-grams, the authors develop a model that analyzes source code via Abstract Syntax Trees (AST). The model is built on deep-learning-based Long Short-Term Memory (LSTM) and Multi-Layer Perceptron (MLP) architectures. Evaluation shows that it can effectively predict the token sequence of Python source code.

1. Introduction

In this model, two techniques are used to improve the regularity (repetitiveness) of the dataset: 1) generate an abstract syntax tree (AST) from the source code; 2) use the dump file (a pre-order traversal of the AST). LSTM and MLP models are then trained on this data.
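To make this concrete, here is a minimal sketch (my own illustration using Python's standard ast module, not the authors' exact pipeline) of both steps: building an AST from source code and taking its textual dump:

```python
# Build an AST from source code and print its dump; the dump renders each node
# before its children, which matches the pre-order idea described in the paper.
import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)      # 1) generate the abstract syntax tree from the source code
print(ast.dump(tree))         # 2) textual dump of the tree
```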

2. Related Work

2.1 Literature Review

The points that are most useful to me:
1) Roos [7] proposed an N-gram language-model method for fast and accurate API code completion. According to the author, the model can work in real time and complete code completion within seconds.

2) Li et al. [13] used neural attention and pointer networks to study code completion. They developed an attention mechanism capable of exploiting structured information on a program's abstract syntax tree.

3) Ginzberg et al. [14] implemented a standard LSTM model to complete the code-generation task.

2.2 Sentences as probability

How do we calculate the probability that a sentence appears?
Suppose the sentence is S = "the car runs fast", with words s1 through s4 being "the", "car", "runs" and "fast", respectively. The paper starts from the joint probability (P denotes probability):

$$P(S) = P(s_1, s_2, s_3, s_4)$$

According to the chain rule, this can be expanded into:

$$P(S) = P(s_1)\,P(s_2 \mid s_1)\,P(s_3 \mid s_1, s_2)\,P(s_4 \mid s_1, s_2, s_3)$$

That is, each word is conditioned on all of the words before it. However, conditional probabilities with long histories, such as the right half of formula (4) in the paper, are too difficult to estimate directly, so the formula is simplified using the Markov assumption. Under a first-order (bigram) assumption, for example, it reduces to:

$$P(S) \approx P(s_1)\,P(s_2 \mid s_1)\,P(s_3 \mid s_2)\,P(s_4 \mid s_3)$$
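To make the bigram simplification concrete, here is a toy example (my own, not from the paper) that estimates the conditional probabilities from counts and multiplies them along the sentence:

```python
# Toy bigram model: P(word | prev) is estimated from counts in a tiny corpus,
# and the sentence probability is the product of these terms.
from collections import Counter

corpus = "the car runs fast the car stops the dog runs fast".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # maximum-likelihood estimate of P(word | prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = ["the", "car", "runs", "fast"]
prob = unigram_counts[sentence[0]] / len(corpus)   # P(s1)
for prev, word in zip(sentence, sentence[1:]):
    prob *= p(word, prev)                          # P(s_i | s_{i-1})
print(prob)                                        # about 0.0909
```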

3. THEORETICAL BACKGROUND

This study is essentially a multi-class classification problem, where the number of classes equals the number of distinct tokens in the vocabulary. The classifier is represented by a set of discriminant functions gi(x), where i denotes the class and x is the feature vector; x is assigned to the class i whose gi(x) is largest. (I don't fully understand this part.)

Loss function used: categorical cross-entropy
Activation functions used: sigmoid / softmax
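For reference (my own addition, not quoted from the paper), the standard forms of the softmax output for class i and the categorical cross-entropy loss are:

$$g_i(x) = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \mathcal{L} = -\sum_i y_i \log g_i(x)$$

where z is the vector of network outputs and y is the one-hot target vector.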

4. METHODOLOGY

4.1 Natural Language Processing Approach

First, a simple LSTM network is trained and tested.
Dataset source: a GitHub repository containing 1,274 Python source files.

Preprocessing stage: the individual files containing the source code are spliced into one large file. The data is cleaned (whitespace and meaningless characters removed), then tokenized and organized into fixed-length sequences. These sequences are fed into the embedding layer of the model.
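A rough sketch of how I understand this preprocessing (the tokenizer and sequence length are my assumptions, not the authors' code):

```python
# Concatenate the source files, tokenize naively, and cut the token stream into
# fixed-length overlapping sequences for the embedding layer.
import re

files = ["a = 1\n", "def f(x):\n    return x\n"]      # stand-ins for the 1,274 source files
big_text = "\n".join(files)                            # splice everything into one large file
tokens = re.findall(r"\w+|[^\w\s]", big_text)          # naive tokenizer: identifiers and punctuation

seq_length = 5                                         # assumed fixed sequence length
sequences = [tokens[i:i + seq_length]
             for i in range(len(tokens) - seq_length + 1)]
print(sequences[0])
```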

Training and testing: after 15 epochs of training, the accuracy was 53.43% and the loss was 3.082. After 30 epochs, the accuracy was 53.36% and the loss was 3.3532.

Conclusion: the accuracy of this method for predicting the next token of the code is very low, indicating that traditional NLP methods have a limited range of application in source-code pattern recognition or knowledge discovery.

4.2 Proposed Approach

In view of the shortcomings of the above method, the abstract syntax tree (AST) is applied.
Dataset: the same as in 4.1.

Data preprocessing: the dataset is preprocessed with the abstract syntax tree. The preprocessed data is then loaded to define a reference dictionary for all tokens in the vocabulary, stored in a list where each row is a single token. Because the model only accepts integer values as input, the authors use a custom encoder to encode the tokens. Finally, the data is split into training and testing portions at a given ratio using sklearn.
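A hedged sketch of these steps (the use of train_test_split and the 80/20 ratio are my assumptions about "split ... using sklearn"):

```python
# Build a reference dictionary of unique tokens, integer-encode the token list with
# a simple custom encoder, and split the encoded data with sklearn.
from sklearn.model_selection import train_test_split

tokens = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]   # toy token list
vocab = sorted(set(tokens))
token_to_int = {tok: i for i, tok in enumerate(vocab)}               # reference dictionary
encoded = [token_to_int[tok] for tok in tokens]                      # custom integer encoder

train_data, test_data = train_test_split(encoded, test_size=0.2, random_state=0)
print(len(train_data), len(test_data))                               # 8 2
```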

4.3 Experimental design

The training dataset is divided into windows of size n; each window contains n tokens, where the first n-1 tokens are used as input and the nth token is used as output.
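A small sketch of this windowing (toy data, my own illustration):

```python
# Cut the integer-encoded token stream into windows of n tokens; the first n-1
# tokens of each window are the input and the nth token is the target.
import numpy as np

n = 5
encoded_tokens = np.arange(20)                       # stand-in for the encoded token stream
windows = np.array([encoded_tokens[i:i + n]
                    for i in range(len(encoded_tokens) - n + 1)])
X, y = windows[:, :-1], windows[:, -1]               # inputs: n-1 tokens; target: the nth token
print(X.shape, y.shape)                              # (16, 4) (16,)
```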

The experimental design mainly consists of three stages: the data processing stage, the data structuring stage, and the training and testing stage.
Data processing phase: the source code (data) is loaded into the model. For efficiency, the loaded files are combined into one large Python object f, and the AST is generated from this single file f.
Data structuring phase: organize the text into tokens and create a dictionary entry for each unique token.
Training and testing phase: split the text into training data and testing data and feed them into the model.

I don't quite understand the data structuring stage. According to the paper, it consists of the following steps:
① encode the token sequences
② load the processed data
③ define a reference dictionary for all unique tokens
④ organize the file into a sequence of tokens
⑤ split each line into individual tokens
[Figure in the paper: the data structuring steps]

4.4 AST Processing

In AST-processed source code, each node of the tree has a predetermined structure, depending on the type of keyword or operation involved. For example, leaf nodes are usually function names, object names, or parameter names. Each leaf node is identified by an identifier (id) and a context (ctx).
The id can be a string variable, a numeric literal, or a native function, while ctx indicates the task performed by that name. Possible values of ctx include Load, Store, Del, etc.
Because the structure of the nodes is fixed, prediction is easier with the AST than with the method in 4.1. Using the AST instead of plain-text source code to represent the program makes the program structure easier to predict, which in turn improves the accuracy of next-token generation.
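A small illustration of the id/ctx structure described above, using Python's ast module:

```python
# Each ast.Name node carries an identifier (id) and a context (ctx) such as Load or Store.
import ast

tree = ast.parse("x = y + 1")
for node in ast.walk(tree):
    if isinstance(node, ast.Name):
        print(node.id, type(node.ctx).__name__)
# prints: x Store   (x is being assigned)
#         y Load    (y is being read)
```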

4.4.1 RNN-LSTM Learning

The previously integer-encoded tokens are fed into the embedding layer of the Keras model. The embedded output sequence is then fed into the LSTM layer.
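A minimal sketch of this Embedding → LSTM → softmax architecture (the vocabulary size, embedding dimension, window length, and number of LSTM units are my assumptions, not values from the paper):

```python
# Embedding layer turns integer tokens into dense vectors; the LSTM reads the
# embedded window; the softmax output has one class per token in the vocabulary.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, embedding_dim, seq_length = 5000, 100, 49   # assumed values

model = Sequential([
    Input(shape=(seq_length,)),               # a window of n-1 integer-encoded tokens
    Embedding(vocab_size, embedding_dim),     # integer tokens -> dense vectors
    LSTM(128),                                # recurrent layer over the embedded sequence
    Dense(vocab_size, activation="softmax"),  # one class per token in the vocabulary
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Targets would be one-hot encoded (e.g. with keras.utils.to_categorical) to match the categorical cross-entropy loss named in section 3.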

4.4.2 MLP Learning

The same data is then also trained with an MLP.
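A comparable MLP sketch (again, the layer sizes are my assumptions; the paper does not give them here). The only structural change is that the embedded window is flattened instead of processed recurrently:

```python
# MLP variant of the same next-token classifier: no recurrence, so the embedded
# window is flattened before the dense layers.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

vocab_size, embedding_dim, seq_length = 5000, 100, 49   # assumed values

model_mlp = Sequential([
    Input(shape=(seq_length,)),
    Embedding(vocab_size, embedding_dim),
    Flatten(),                                # concatenate all embedded tokens in the window
    Dense(256, activation="relu"),            # hidden layer size is an assumption
    Dense(vocab_size, activation="softmax"),
])
model_mlp.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```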

5. RESULT

[Figure 9 in the paper: LSTM training accuracy and loss]

As shown in Figure 9, the accuracy of the LSTM model is 90.32% and the loss is 0.3505.
In addition to training on the AST dump data, the authors also use regularization. Running the model with an L2 regularization factor of 0.1 helps prevent overfitting to a certain extent.
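In Keras this might look as follows (where exactly the authors apply the penalty is not stated, so this is illustrative only):

```python
# Dense layer with an L2 weight penalty of 0.1, as mentioned in the text.
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

regularized_layer = Dense(256, activation="relu",
                          kernel_regularizer=regularizers.l2(0.1))  # penalizes large weights
```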

[Figure 10 in the paper: training accuracy and loss]

As shown in Figure 10, the accuracy rate is 90.11%, and the loss is 0.314.

6. MODEL PERFORMANCE

Comparison of this model with other models.
[Table in the paper: comparison of this model with other models]

7. CONCLUSION

Main contributions:
1. Design, development, and evaluation of the AST-LSTM/MLP code completion model.
2. Tokenized 1,274 Python source files.
3. Compared with the NLP and Pointer Mixture Network methods, accuracy improved by 69.5% and 29%, respectively.
4. The accuracies of the LSTM and MLP are 90.3% and 90.1%, respectively.

The significance of this study is as follows:
1. Traditional natural language processing methods have limited scope of application in source code pattern recognition or knowledge discovery.
2. LSTM and MLP learning algorithms have high code completion or generation accuracy.
3. AST-LSTM is an effective mechanism for Python code completion or generation. Although both the MLP and the LSTM reach accuracies above 90%, the LSTM is superior to the MLP on the code-completion task because it learns much faster than the MLP algorithm.

Summary

This paper implements Python code generation. In the data preprocessing stage, the abstract syntax tree (AST) is introduced to analyze the source code, and then two methods, LSTM and MLP, are used for training. The model can effectively predict the token sequence of the source code; finally, the astunparse module or astor can be used to convert the AST back into source code in a one-to-one correspondence.
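For example, the round trip from AST back to source can be sketched like this (ast.unparse requires Python 3.9+; astor is the third-party alternative):

```python
# Parse source into an AST, then regenerate equivalent Python source from the tree.
import ast

tree = ast.parse("def add(a, b):\n    return a + b\n")
print(ast.unparse(tree))          # regenerated Python source
# With astor instead: import astor; print(astor.to_source(tree))
```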

Doubts

  1. The data structuring part of the experimental design is not well understood. I feel the description in the paper is also vague. What exactly does this part do? Why can't we jump directly from data preprocessing to training and testing?
  2. I don't know much about ASTs yet. How is a program converted from code to an abstract syntax tree? How is a pre-order traversal performed on an AST? (i.e., how are Figures 5 and 6 in the paper generated?)
  3. So can this paper only achieve code extension and generation? Does it lack the ability to go from a description to code? (For example, like Copilot: write a comment, press Tab, and recommended code appears.)

Origin: blog.csdn.net/rellvera/article/details/129838485