Code-writing AI goes open source: writes C better than Codex and masters 12 programming languages

Source: QbitAI

An AI code generation model that writes C better than Codex is now open source!

AI-assisted coding has been all the rage lately, the most famous examples being OpenAI's Codex and DeepMind's AlphaCode.

Codex-based Copilot

However, neither of these two AI models is open source:

AlphaCode has only released some test samples, while Codex only exposes an API.

To fill this gap, several researchers from CMU built an AI code generation model based on GPT-2, called PolyCoder, and it is fully open source.

According to the researchers, although PolyCoder tops out at 2.7 billion parameters (versus Codex's 12 billion), the C code it writes is even better than Codex's.

What's the secret here?

Trained on code from 12 programming languages

Let's first look at the training dataset, which is one of PolyCoder's biggest distinguishing features.

Previously, AI code generation models such as Codex and CodeParrot were trained mainly on Python code.

For example, HumanEval, one of the datasets used to evaluate Codex, only measures the quality of generated Python code.

In contrast, PolyCoder is trained on code from 12 programming languages:

C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, and TypeScript.

Of these, C accounts for the largest share, at 221 GB; the amount of Python code, by contrast, is smaller than what Codex and CodeParrot were trained on.

PolyCoder uses public code from GitHub, mainly selecting the more popular repositories for each programming language, each with at least 50 stars.

According to the researchers, the repositories selected for each programming language were capped so that their total stars did not exceed about 25K, to keep the model's generated code from skewing too heavily toward the most popular languages (generally, the more popular a language, the more stars its repositories have).

After extracting files from these repositories and applying some simple processing (including removing duplicate code), about 254 GB of data remained for training.
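To make this concrete, here is a minimal sketch (not the authors' actual pipeline; the function names are purely illustrative) of the two filtering steps just described: keeping repositories with at least 50 stars and discarding exact-duplicate files.

```python
import hashlib

MIN_STARS = 50  # star threshold described above


def filter_repos(repos):
    """Keep only repositories with at least MIN_STARS stars.
    `repos` is an iterable of dicts like {"name": str, "stars": int}."""
    return [r for r in repos if r["stars"] >= MIN_STARS]


def deduplicate_files(files):
    """Drop files whose content exactly duplicates one seen earlier.
    `files` is an iterable of (path, content) pairs."""
    seen, kept = set(), []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, content))
    return kept
```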

Next comes the pre-training method.

There are usually three ways to pre-train language models.

The first is the left-to-right (autoregressive) language model, which predicts each token from the preceding context and is therefore well suited to tasks like code generation; the second is the masked language model, which predicts masked-out fragments from the surrounding context and is better suited to tasks like code classification; the third is the encoder-decoder model, which works best for tasks such as generating code comments.

PolyCoder mainly uses the first of these approaches.
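As a rough illustration of what left-to-right pre-training means in practice, the snippet below computes the standard next-token (causal) language-modeling loss with an off-the-shelf GPT-2 model from Hugging Face Transformers. This is a generic sketch of the objective, not PolyCoder's actual training code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Generic causal-LM objective: the model predicts each token from the
# tokens to its left. Passing labels=input_ids makes the library shift
# the labels internally and return the cross-entropy loss.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

code_snippet = 'int main() { printf("hello\\n"); return 0; }'
inputs = tokenizer(code_snippet, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"next-token loss: {outputs.loss.item():.3f}")
```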

Compared with CodeParrot and Codex, which are likewise GPT-style autoregressive models, PolyCoder's hyperparameter settings are also slightly different:

PolyCoder comes in three sizes, with 2.7 billion, 400 million, and 160 million parameters respectively, so researchers can pick the model that fits their needs and compute budget.
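To try one of the released checkpoints, a sketch along the following lines should work with Hugging Face Transformers; the checkpoint path below is a placeholder, so substitute whichever of the three model sizes you download from the project repository linked at the end of this article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; replace with the path or hub ID of the
# PolyCoder model size you actually downloaded (2.7B / 0.4B / 160M).
CHECKPOINT = "path/to/polycoder-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

# Prompt the model with a comment and let it complete the C code.
prompt = "// C: reverse a singly linked list\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.2,
    )

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```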

So how well does the final trained model actually generate code?

Especially good at C, but not at Python

The researchers compared PolyCoder with existing AI code generation models.

Since AlphaCode is hard to compare against (its interface is not public), the researchers mainly analyzed the following models: GPT-Neo, CodeParrot, and Codex.

In the chart below, blue marks open-source models and orange marks closed ones:

In terms of parameter count, PolyCoder is not the largest: even its biggest model, at 2.7 billion parameters, is less than a quarter the size of Codex.

The researchers first compared the models using perplexity, a metric commonly used to assess language models.

Perplexity measures how well a language model (LM) fits the data: the lower the perplexity, the less "surprised" the model is by the code, and the better its generations tend to be.
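Concretely, perplexity is just the exponential of the average per-token negative log-likelihood; a minimal sketch:

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    `token_log_probs` holds the model's natural-log probability of each token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Example: a model that assigns probability 0.5 to every token
# has a perplexity of exactly 2.
print(perplexity([math.log(0.5)] * 10))  # -> 2.0
```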

As the chart shows, PolyCoder unexpectedly achieves the best result (lowest perplexity) on C.

Training PolyCoder on a large amount of C code shows that, even when the overall architecture stays the same (GPT-2 based), simply changing the training corpus is enough to produce code generation models that excel at different languages.

Unfortunately, in other languages the generated code still falls well short of Codex:

For example, on HumanEval, which mainly evaluates Python code, PolyCoder lags far behind Codex:

According to the paper's analysis, this may be due to the smaller amount of Python training data and the smaller model size.
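For context on how HumanEval results like these are scored: models are compared with the pass@k metric, typically computed with the unbiased estimator from the Codex paper (generate n samples per problem and count the c that pass the unit tests). A small sketch:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n - c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example (illustrative numbers): 200 samples per problem, 13 pass.
print(pass_at_k(200, 13, 1), pass_at_k(200, 13, 10))
```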

The authors also note that the main purpose of building PolyCoder was to open-source an AI code generation model so that more people can study and use it.

The code and models are now open source, whether you want to use them directly or build new models on top of them.

If you're interested, give it a try~

About the authors

The first author, Frank Xu, is currently a Ph.D. student at CMU. His research focuses on NLP and information extraction, and he has published at top conferences including ICLR, ACL, and EMNLP. He received his bachelor's and master's degrees from Shanghai Jiao Tong University, where he was advised by Professor Zhu Qili.

Uri Alon is a postdoctoral fellow at CMU with research interests in programming language processing (PLP), NLP and deep learning.

Graham Neubig, an assistant professor at CMU, works on NLP, machine translation, and machine learning-based natural language understanding.

Vincent J. Hellendoorn is an assistant professor of computer science at CMU. His main research interests are software engineering and machine learning, and he focuses on using intelligent methods to help software developers spend less time on tedious tasks such as debugging and program optimization.

I wonder whether the authors are already using this AI to write their own code (just kidding).

Project address:
https://github.com/VHellendoorn/Code-LMs

Paper address:
https://arxiv.org/abs/2202.13169
