CodeGeeX paper published: demystifying the large model behind AI-assisted programming tools

Recently, version 1.5 of the CodeGeeX model was released, and users report that both output quality and usage efficiency have improved noticeably over previous versions.


Around the same time, the CodeGeeX team published a paper on arXiv detailing the architecture, training process, and inference acceleration of the large code generation model behind the CodeGeeX AI programming assistant.


In this post we walk through the core findings of the paper, so that more developers can understand how the large model behind CodeGeeX has evolved and can make better use of CodeGeeX as a new-generation productivity tool.

Transformer-based techniques show potential for code generation

Can a machine automatically generate an executable program from a description of human intent, such as "write a factorial function"? This is the problem of automatic programming, which has been explored in various ways since the early days of computer science in the 1960s.

From "LISP-based pioneering deductive synthesis approaches" to "Modern Program Synthesis Systems", to exploration through deep neural networks, in order to enable machines to automatically write correct programs, researchers have been looking for the right direction.

It wasn't until 2020 that Transformer-based techniques began to show the potential to automatically generate code that is both syntactically correct and contextually consistent. Soon after, large language models met large-scale open-source code data, and progress in code generation accelerated significantly.

Particularly worth noting is OpenAI's Codex, a model with 12 billion (12B) parameters. In 2021, it demonstrated for the first time the potential of large code generation models pre-trained on billions of lines of public code: using a generative pre-training (GPT) strategy, Codex solves entry-level Python programming problems with high probability. Since then, large pre-trained code models have been developed extensively.

3 important features of the CodeGeeX model

The protagonist of this article, the CodeGeeX model, is a pre-trained multi-programming-language code generation model with 13 billion parameters. It is implemented entirely on domestic (Chinese) platforms and frameworks, and it took two months to train on a code corpus covering more than 20 programming languages.

CodeGeeX's approach to automatic code generation follows the general direction set by large language models, but it differs from Codex in several distinctive ways:


First, both CodeGeeX itself and the recipe for pre-training code models at this scale are open source, which helps the community understand and advance pre-trained code generation models. CodeGeeX also supports cross-platform inference on both Ascend processors and NVIDIA GPUs.


Second, in addition to code generation and code completion as offered by Codex and similar tools, CodeGeeX also supports code explanation and code translation between multiple programming languages (a minimal prompt sketch follows this list).

Third, it shows consistent performance advantages over well-known multilingual code generation models of similar scale, including CodeGen-16B, GPT-NeoX-20B, InCoder-6.7B, and GPT-J-6B.
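To make the code translation task from the second point concrete, here is a minimal sketch of how translation can be framed as ordinary left-to-right generation: the source-language solution plus the target-language function signature form the prompt, and the model is asked to continue with the function body. The prompt template below is an illustrative assumption, not CodeGeeX's exact template.

```python
# Illustrative only: code translation framed as prompt completion.
# The template is an assumed format, not CodeGeeX's actual prompt.

python_solution = '''def factorial(n: int) -> int:
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
'''

cpp_signature = "int factorial(int n) {"

# Source-language solution plus target-language signature form the prompt;
# the model is expected to continue with the C++ function body.
translation_prompt = (
    "# Translate the following Python function into C++.\n"
    "# Python:\n"
    f"{python_solution}\n"
    "// C++:\n"
    f"{cpp_signature}\n"
)

print(translation_prompt)  # feed this to a code LM, e.g. as in the generation sketch below
```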


CodeGeeX also provides free plugins for several IDEs, including Visual Studio Code, JetBrains IDEs, and Tencent Cloud Studio (a Web IDE).

The plugins support several modes, including code auto-completion, function-level generation, code translation, code explanation, and customizable prompts, helping users complete programming tasks in real time.
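For readers who want to experiment with a comparable workflow programmatically, the sketch below shows how a causal code-generation checkpoint can be loaded and queried with the Hugging Face transformers library. This is not the official CodeGeeX plugin or API: the model identifier is a placeholder (substitute whatever checkpoint you have access to), and precision and device settings will vary with your hardware.

```python
# A minimal sketch of prompting a causal code LM with Hugging Face transformers.
# "your-org/your-codegeex-checkpoint" is a placeholder model ID, not an official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-codegeex-checkpoint"  # placeholder; see the project page for real weights
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit a 13B model on fewer GPUs
    device_map="auto",           # spread layers across available devices
    trust_remote_code=True,
)

prompt = (
    "# language: Python\n"
    "# Write a function that computes n factorial.\n"
    "def factorial(n: int) -> int:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,       # sampling yields diverse candidates, useful for pass@k evaluation
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```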


Since its release, CodeGeeX has had tens of thousands of daily active users, each issuing more than 250 API calls per working day on average. As of this writing, the CodeGeeX model generates 4.7 billion tokens per week. A user survey shows that 83.4% of users believe CodeGeeX has improved their programming efficiency.


CodeGeeX is fully validated on HumanEval-X

In addition to the CodeGeeX model, the team also built HumanEval-X, a code generation benchmark spanning multiple programming languages. HumanEval-X is the first multi-language, multi-task benchmark that supports functional-correctness evaluation. It contains 820 high-quality, human-written code generation problems with test cases and reference solutions, covers five programming languages (Python, C++, Java, JavaScript, and Go), and supports evaluation of both code generation and code translation. The capabilities of the CodeGeeX model have been fully validated on HumanEval-X.
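Functional correctness on benchmarks of this kind is typically reported as pass@k: for each problem, sample n candidate programs, run the test cases, count the c candidates that pass, and compute the unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k). Below is a small sketch of that estimator, following the formulation popularized by the Codex paper; the per-problem sampling and sandboxed test execution are outside its scope.

```python
# Unbiased pass@k estimator: probability that at least one of k samples
# drawn from n candidates (of which c passed the tests) is correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 17 of which pass the tests
print(round(pass_at_k(n=200, c=17, k=1), 4))   # equals c/n, the per-sample pass rate
print(round(pass_at_k(n=200, c=17, k=10), 4))  # chance a batch of 10 contains a solution
```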


See the full text of the paper: https://arxiv.org/abs/2303.17568

Click to read the original text, and learn about and try the AI programming assistants based on the CodeGeeX large model (VS Code plugin, JetBrains IDEs plugin): https://codegeex.cn

This article was published via OpenWrite, a multi-channel blogging platform.

Original post: blog.csdn.net/mp817/article/details/130104026