From 0 to 1: How to build a large-scale multilingual pre-trained code generation model

CodeGeeX, a homegrown AI-assisted programming tool, uses a large AI model as its foundation to help developers write code faster. From a comment alone, or a press of the Tab key, it can complete the writing of an entire function automatically. It has been trained on more than twenty programming languages, including Java, JavaScript, and Python, and its algorithms are optimized on large amounts of open-source code, official documentation, and code from public forums. As an original Chinese AI-assisted programming tool, CodeGeeX is now available to all developers free of charge and is completely open source. Programmers who use it generally report that their coding efficiency has more than doubled.

Core functions include: code generation and intelligent completion; automatic addition of Chinese and English comments to code; and accurate translation between different programming languages. The newly updated "Ask CodeGeeX" function integrates an intelligent question-and-answer mode into the actual development scenario, so developers can stay focused and immersed in programming: without leaving the current IDE, they can write code while talking with the AI and get intelligent answers to programming questions. You can try these core functions immediately, with no waitlist!

Here I recommend downloading and using the AI-assisted programming tool CodeGeeX for free.

Behind CodeGeeX is an open-source, large-scale multilingual code generation model. Its most distinctive feature is that it is fully home-grown. CodeGeeX turns natural language and code into an interactive process: users can generate specific code by writing comments, translate code in one language into another, or add comments to existing code. The open-source CodeGeeX plug-in has been free to use since September 2022. So far, more than 100,000 programmers have installed and used it, it has been downloaded more than 2.7 million times, and it generates millions of lines of code for programmers every day.
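To make the comment-driven interaction concrete, here is an illustrative example: given only the comment as a prompt, a tool like CodeGeeX can complete the function body (the completion shown is a plausible output, not a captured one).

```python
# Write a function that checks whether a string is a palindrome
def is_palindrome(s: str) -> bool:
    # A plausible model completion: normalize case, then compare
    # the string with its reverse.
    s = s.lower()
    return s == s[::-1]
```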

So, how is the large-scale multilingual pre-trained code generation model behind CodeGeeX built from 0 to 1? There are mainly the following steps:

First, large-scale code data collection. The training data comes from two sources: one is open-source datasets, such as the code subset of The Pile and CodeParrot (Python); the other is additionally crawled data, taken from high-quality open-source repositories on GitHub and cleaned according to a series of rules. The final corpus contains 23 programming languages, covering Python, Java, C++, JavaScript, C, Go, HTML, Rust, C#, and other mainstream languages, with more than 158 billion tokens in total. The data processing is quite simple: first, the code data is tokenized, that is, each code segment is split into a token sequence, and each token is then mapped to its ID in the vocabulary to obtain an ID sequence; second, a language tag is added to files in each language. With sufficient training, the model can fully master the grammar of more than 20 languages.
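A minimal sketch of this tokenization pipeline, using a GPT-2-style BPE tokenizer from Hugging Face's transformers for illustration (the comment-style language tag below is an assumption, not necessarily CodeGeeX's exact convention):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def preprocess(code: str, language: str) -> list[int]:
    # Step 1: prepend a language tag so the model can condition on the
    # language. (The "# language: ..." format is illustrative.)
    tagged = f"# language: {language}\n{code}"
    # Step 2: split the text into tokens, then map each token to its
    # ID in the vocabulary to obtain the ID sequence.
    tokens = tokenizer.tokenize(tagged)
    return tokenizer.convert_tokens_to_ids(tokens)

ids = preprocess("def add(a, b):\n    return a + b\n", "Python")
print(ids[:10])  # the first few IDs of the resulting sequence
```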

Second, the CodeGeeX model architecture. CodeGeeX is an autoregressive model based on the GPT architecture, consisting of 40 transformer layers with 13 billion parameters in total. It takes natural language or code tokens as input, outputs the probability of the next token, and supports various programming-language-related downstream tasks such as code generation, code completion, code translation, and code commenting. Many design decisions also went into implementing the architecture, including choosing the numerical precision for each operator to keep model training stable.
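A quick back-of-the-envelope check makes the scale concrete. Only the 40 layers and the roughly 13 billion total come from the description above; the hidden size and vocabulary size below are illustrative assumptions chosen so the arithmetic works out:

```python
# Rough parameter count for a GPT-style decoder (biases/LayerNorms omitted).
def gpt_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 8 * d_model * d_model        # two linear layers with 4x expansion
    embeddings = vocab_size * d_model  # token embedding table
    return n_layers * (attention + mlp) + embeddings

# d_model=5120 and vocab_size=52224 are assumptions, not from the article.
total = gpt_param_count(n_layers=40, d_model=5120, vocab_size=52224)
print(f"{total / 1e9:.1f}B parameters")  # ~12.9B, i.e. roughly 13B
```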

Third, CodeGeeX model training. CodeGeeX is implemented on Huawei's MindSpore framework and was trained for two months on 1,536 Ascend 910 AI processors in total, equivalent to more than 1,500 GPUs. For mixed-precision training, most parameters are kept in FP16, but past practice showed that if all parameters are FP16, some operators on some compute nodes are prone to precision overflow and the model crashes during training, so LayerNorm and Softmax use FP32 to ensure stability. Training also adopts a parallel strategy: 192-way data parallelism combined with 8-way model parallelism (192 × 8 = 1,536, matching the processor count). Over this long training run, CodeGeeX consumed 850 billion tokens and has essentially seen all of the code crawled from GitHub.
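Here is a minimal sketch of that precision rule, written in PyTorch for illustration (the actual implementation is in MindSpore): keep the bulk of the computation in FP16 but run overflow-prone operators such as LayerNorm in FP32.

```python
import torch
import torch.nn as nn

class FP32LayerNorm(nn.LayerNorm):
    """LayerNorm that upcasts to FP32 internally to avoid FP16 overflow."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute mean/variance in FP32, then cast back to the input dtype.
        return super().forward(x.float()).to(x.dtype)

x = torch.randn(2, 16, 5120, dtype=torch.float16)  # FP16 activations
ln = FP32LayerNorm(5120)                           # FP32 weights and math
print(ln(x).dtype)  # torch.float16 -- FP16 outside, FP32 inside
```

The same upcast-compute-downcast pattern applies to Softmax, whose exponentials exceed FP16's range especially easily.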

Fourth, CodeGeeX model evaluation. How should the performance of code generation be evaluated properly? Earlier multilingual code benchmarks such as CodeXGLUE and XLCoST used CodeBLEU or BLEU as the evaluation metric. These are essentially semantic-similarity measures; they cannot correctly reflect the quality of generated code on coding tasks and fall short of what evaluating today's code generation models requires. For its evaluation, CodeGeeX extends HumanEval, an existing Python dataset, to more languages, including C++, Java, JavaScript, and Go, forming HumanEval-X. The characteristic of this dataset is that the model's input includes the necessary context and a description of the task, possibly with one or two input/output examples, and the model must complete the function; pre-written test code and test cases are then run automatically against the output, which tells you immediately whether the code written by the model is correct. By this measure, CodeGeeX is currently the open-source multilingual code generation model with the best average performance.
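Benchmarks in the HumanEval family score functional correctness, typically reported as pass@k. The article does not spell out CodeGeeX's exact metric, so as a representative sketch, here is the standard unbiased pass@k estimator introduced with the original HumanEval (Chen et al., 2021):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k samples passes, given n generated samples of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical numbers: 200 samples per problem, 37 pass the unit tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185 (= 37/200)
print(round(pass_at_k(n=200, c=37, k=10), 3))  # ~0.876
```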

Fifth, the CodeGeeX code generation plug-in. To put the CodeGeeX model into real practical use, code generation plug-ins for VS Code and JetBrains IDEs have been developed. They offer multiple interaction modes; support code generation, completion, translation, commenting, and other functions; are free to use; and better assist programmers in development. We surveyed hundreds of users, covering front-end and back-end engineers, algorithm engineers, students, researchers, and others: 83.4% of users felt that the CodeGeeX plug-in helps improve programming efficiency, though quantifying the improvement needs further study. Performance also varies across languages; PHP, for example, is weaker. This is a goal for future improvement: achieving better results in more languages.

Sixth, the CodeGeeX open-source development plan. Although CodeGeeX was trained on Ascend hardware, it has also been ported to NVIDIA GPUs, making the model's training, fine-tuning, inference, and evaluation code cross-platform. Users can apply for the download on the official website and deploy locally a pipeline essentially identical to CodeGeeX's own.

AIGC applications like Microsoft Copilot, GitHub Copilot X, and CodeGeeX will keep multiplying, and they will greatly improve productivity. It is foreseeable that humanity is accelerating toward the AGI era, and more product forms will certainly appear in the coming months. Don't worry; just embrace the change.
