[Natural Language Processing] [Large Model] CodeGeeX: A Multilingual Pre-Trained Model for Code Generation

CodeGeeX: Multilingual Pretrained Models for Code Generation
《CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X》

Paper address: https://arxiv.org/pdf/2303.17568.pdf

Related Blogs
[Natural Language Processing] [Large Model] CodeGen: A Code Large Language Model for Multi-Turn Program Synthesis
[Natural Language Processing] [Large Model] CodeGeeX: A Multilingual Pre-Trained Model for Code Generation
[Natural Language Processing] [Large Model] LaMDA: Language Model for Conversational Applications
[Natural Language Processing] [Large Model] DeepMind's Large Model Gopher
[Natural Language Processing] [Large Model] Chinchilla: Large Language Models with Optimal Training and Compute Utilization
[Natural Language Processing] [Large Model] Inference Tool Test for the Large Language Model BLOOM
[Natural Language Processing] [Large Model] GLM-130B: An Open-Source Bilingual Pre-Trained Language Model
[Natural Language Processing] [Large Model] Introduction to 8-bit Matrix Multiplication for Large Transformers
[Natural Language Processing] [Large Model] BLOOM: A multilingual model with 176B parameters and open access
[Natural Language Processing] [Large Model] PaLM: A large language model based on Pathways
[Natural Language Processing] [ChatGPT series] Large Language Models Can Improve Themselves
[Natural Language Processing] [ChatGPT series] FLAN: Fine-tuned Language Models Are Zero-Shot Learners
[Natural Language Processing] [ChatGPT series] Where does the intelligence of ChatGPT come from?

1. Introduction

The goal of code generation is, given a description of human intent (e.g., "write a factorial function"), to automatically generate an executable program. The task has a long history and many solutions have been proposed for it. Recently, the quality of code generation has improved significantly by treating programs as language sequences and modeling them with deep transformer architectures, especially when large-scale open-source code data is combined with large language models.

OpenAI's 12B-parameter Codex demonstrated the potential of large models pre-trained on billions of lines of public code. Through generative pre-training, Codex can solve entry-level Python programming problems quite well. Research shows that 88% of GitHub Copilot users report increased programming productivity. Subsequently, a large number of large code language models have been developed, including DeepMind's AlphaCode, Salesforce's CodeGen, Meta's InCoder, and Google's PaLM-Coder-540B.

This paper proposes CodeGeeX, a multilingual code generation model with 13B parameters, pre-trained on 23 programming languages. The model was trained for two months on a cluster of 1,536 Ascend 910 AI processors, consuming a total of 850 billion tokens. CodeGeeX has the following characteristics: (1) Unlike Codex, both the model weights and the training code are open source, which helps in understanding and improving pre-trained code models; CodeGeeX also supports inference on different platforms such as Ascend and NVIDIA GPUs. (2) In addition to code generation and code completion, CodeGeeX supports code explanation and code translation. (3) Compared with well-known code generation models (CodeGen-16B, GPT-NeoX-20B, InCoder-6.7B, and GPT-J-6B), CodeGeeX consistently outperforms them.

This paper also develops the HumanEval-X benchmark for evaluating multilingual code models, because: (1) HumanEval and other benchmarks contain programming problems in a single language only; (2) existing multilingual datasets rely on string-similarity metrics such as BLEU rather than verifying the functional correctness of the generated code. Specifically, for each Python problem in HumanEval, its prompt, canonical solution, and test cases are manually rewritten in C++, Java, JavaScript, and Go. In total, HumanEval-X contains 820 handwritten problem-solution pairs. Furthermore, HumanEval-X supports the evaluation of both code generation and code translation.

2. CodeGeeX model

[Figure: overview of the CodeGeeX model architecture]

1. Model Architecture

Transformer backbone. CodeGeeX uses a decoder-only GPT architecture with autoregressive language modeling. Its core is a 39-layer transformer decoder; each transformer layer contains a multi-head self-attention mechanism, an MLP layer, layer normalization, and residual connections. CodeGeeX uses FastGELU, a GELU-like activation that is more efficient on the Ascend 910 AI processor:
$$\text{FastGELU}(x_i)=\frac{x_i}{1+\exp(-1.702\times|x_i|)\times\exp\big(0.851\times(x_i-|x_i|)\big)} \tag{1}$$
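A minimal NumPy sketch of Eq. (1), applying FastGELU element-wise; the function name is illustrative and not taken from the CodeGeeX codebase:

```python
import numpy as np

def fast_gelu(x: np.ndarray) -> np.ndarray:
    """FastGELU of Eq. (1), applied element-wise to the MLP pre-activations."""
    return x / (1.0 + np.exp(-1.702 * np.abs(x)) * np.exp(0.851 * (x - np.abs(x))))
```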
Generative pre-training objective. Following the GPT paradigm, the model is trained on large-scale unsupervised code data: it iteratively takes code tokens as input, predicts the next token, and compares it with the ground-truth token. Specifically, for any input sequence $\{x_1,x_2,\dots,x_n\}$ of length $n$, the output of CodeGeeX is the probability distribution of the next token:
$$\mathbb{P}(x_{n+1}\mid x_1,x_2,\dots,x_n,\Theta)=p_{n+1}\in[0,1]^{1\times v} \tag{2}$$
where $\Theta$ denotes all model parameters and $v$ is the vocabulary size. By comparing the predicted distribution with the true token, the cross-entropy loss can be optimized:
$$\mathcal{L}=-\sum_{n=1}^{N-1}y_{n+1}\log \mathbb{P}(x_{n+1}\mid x_1,x_2,\dots,x_n,\Theta) \tag{3}$$
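As an illustration of Eq. (3), here is a small NumPy sketch of the loss computation, assuming the model has already produced a probability distribution for every position (names and shapes are for illustration only):

```python
import numpy as np

def next_token_loss(probs: np.ndarray, targets: np.ndarray) -> float:
    """Cross-entropy loss of Eq. (3).

    probs:   shape (N-1, v), predicted distributions p_{n+1} for each position
    targets: shape (N-1,),   indices of the true next tokens x_{n+1}
    """
    # y_{n+1} is one-hot, so the sum over the vocabulary reduces to
    # minus the log-probability assigned to the true next token.
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.sum(np.log(picked + 1e-12)))
```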
Top query layer and decoding. The original GPT uses a pooler function to obtain the final output. CodeGeeX instead adds an additional query layer (also used by Huawei's Pangu) on top of all transformer layers to obtain the final embedding. As shown in the figure above, the query input of the top query layer is replaced by the query embedding of position $n+1$. The final output is multiplied by the transpose of the word embedding matrix to obtain the output probability distribution. For decoding strategies, CodeGeeX supports greedy search, temperature sampling, top-k sampling, top-p sampling, and beam search.
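As a sketch of two of the listed decoding strategies (temperature and top-p sampling) over the $1\times v$ output distribution, assuming a 1-D array of logits; this is illustrative, not the released inference code:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.95) -> int:
    """Temperature + top-p (nucleus) sampling over the output distribution."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                        # tokens sorted by probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]                               # smallest set covering top_p mass
    return int(np.random.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))
```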

2. Pre-training settings

[Figure: proportions of the 23 programming languages in the training corpus]

Code corpus. The training corpus consists of two parts. The first part is open-source code datasets: the Pile and CodeParrot. The second part is Python, Java, and C++ code crawled directly from GitHub to supplement the first part: repositories with at least one star and a total size under 10MB are selected, and files are then filtered out if they (1) average more than 100 characters per line; (2) are automatically generated; (3) have an alphabetic-character ratio below 40%; or (4) are larger than 100KB or smaller than 1KB. The figure above shows the proportion of the 23 programming languages in the training data. The training data is divided into segments of equal length. To help the model distinguish between languages, a language-specific tag is added before each segment, e.g. language: Python.
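A toy sketch of the language-tagging step, assuming the plain tag format quoted above; the exact tag and segment formatting in the released data pipeline may differ:

```python
def tag_segment(code_segment: str, language: str) -> str:
    """Prepend a language-specific tag so the model can tell languages apart."""
    return f"language: {language}\n{code_segment}"

print(tag_segment("def add(a, b):\n    return a + b", "Python"))
```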

Tokenization. Considering that code data contains a large number of natural-language comments and that the names of variables, functions, and classes are usually meaningful words, the code is treated as ordinary text and the GPT-2 tokenizer is used. The initial vocabulary size is 50,000, and runs of spaces are encoded as extra tokens to increase encoding efficiency: specifically, a run of L spaces is represented as <|extratoken_X|>, where X = 8 + L. Since the vocabulary contains tokens from various natural languages, CodeGeeX can also handle text in languages such as Chinese and French. The final vocabulary size is $v=52{,}224$.
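A toy sketch of the whitespace handling described above, mapping a run of L spaces to a single extra token; the cap on L and the surrounding tokenizer logic are omitted:

```python
def space_run_token(num_spaces: int) -> str:
    """Encode a run of L consecutive spaces as <|extratoken_X|> with X = 8 + L."""
    return f"<|extratoken_{8 + num_spaces}|>"

print(space_run_token(4))   # <|extratoken_12|>: a 4-space indent becomes one token
```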

Word and positional embeddings. The word embedding matrix is $W_{word}\in\mathbb{R}^{v\times h}$ and the position embedding matrix is $W_{pos}\in\mathbb{R}^{n_{max}\times h}$, where $h=5120$ and $n_{max}=2048$. Each token corresponds to a learnable word embedding $x_{word}\in\mathbb{R}^h$ and a learnable position embedding $x_{pos}\in\mathbb{R}^h$. The two embeddings are added to obtain the input embedding $x_{in}=x_{word}+x_{pos}$. Finally, the entire sequence is transformed into an embedding matrix $X_{in}\in\mathbb{R}^{n\times h}$, where $n$ is the sequence length.
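A minimal NumPy sketch of the embedding step, using toy dimensions (the real model uses $v=52{,}224$, $h=5120$, $n_{max}=2048$):

```python
import numpy as np

v, h, n_max = 1000, 64, 128                # toy sizes for illustration
W_word = 0.02 * np.random.randn(v, h)      # learnable word embedding matrix W_word
W_pos  = 0.02 * np.random.randn(n_max, h)  # learnable position embedding matrix W_pos

def embed(token_ids: np.ndarray) -> np.ndarray:
    """X_in = X_word + X_pos for a sequence of n token ids (n <= n_max)."""
    positions = np.arange(len(token_ids))
    return W_word[token_ids] + W_pos[positions]

X_in = embed(np.array([3, 17, 42]))        # shape (3, h)
```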

3. CodeGeeX training

Parallel training on Ascend 910. CodeGeeX is trained with MindSpore on a cluster of Ascend 910 AI processors (32GB). Training took two months on 1,536 processors across 192 nodes and consumed a total of 850B tokens, about 5 epochs (213,000 steps). To improve training efficiency, 8-way model parallelism and 192-way data parallelism are used, and the ZeRO-2 optimizer further reduces memory consumption. The micro-batch size on each node is 16, giving a global batch size of 3,072 (192 × 16).

Specifically, the Adam optimizer is used to optimize the loss. The model weights are kept in FP16, while layer normalization and softmax use FP32 for better precision and stability. The model occupies about 27GB of device memory. The initial learning rate is 1e-4, and a cosine learning-rate schedule is applied:
$$lr_{current}=lr_{min}+0.5\times(lr_{max}-lr_{min})\times\left(1+\cos\left(\frac{n_{current}}{n_{decay}}\pi\right)\right) \tag{4}$$
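A sketch of the schedule in Eq. (4); lr_max = 1e-4 is stated above, while lr_min and decay_steps here are illustrative placeholders (the actual values appear in the training-parameter table):

```python
import math

def cosine_lr(step: int, lr_max: float = 1e-4, lr_min: float = 1e-6,
              decay_steps: int = 213_000) -> float:
    """Cosine learning-rate decay of Eq. (4)."""
    progress = min(step, decay_steps) / decay_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(progress * math.pi))
```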
The detailed training parameters are shown in the table below.

[Table: detailed training hyperparameters]

Optimized training. The MindSpore framework was optimized to unleash the potential of the Ascend 910. Two techniques significantly improve training efficiency: (1) kernel fusion; (2) Auto Tune optimization. The table below compares efficiency before and after these optimizations.

[Table: training efficiency before and after the MindSpore optimizations]

4. Fast inference

Quantization. Post-training quantization is applied to reduce the memory consumption of CodeGeeX at inference time. Using absolute-maximum (absmax) quantization, the weights $W$ of all linear layers are converted from FP16 to INT8:
$$W_q=\text{Round}\left(\frac{W}{\lambda}\right),\qquad \lambda=\frac{\text{Max}(|W|)}{2^{b-1}-1} \tag{5}$$
where $b$ is the bit width ($b=8$) and $\lambda$ is the scaling factor.
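A NumPy sketch of Eq. (5) for a single weight matrix; the dequantization helper shows how the INT8 weights can be used approximately at inference time (illustrative, not the released implementation):

```python
import numpy as np

def absmax_quantize(W: np.ndarray, b: int = 8):
    """Eq. (5): symmetric absmax quantization of a weight matrix to INT8 (b = 8)."""
    lam = np.abs(W).max() / (2 ** (b - 1) - 1)    # scaling factor lambda
    W_q = np.round(W / lam).astype(np.int8)       # quantized weights
    return W_q, lam

def dequantize(W_q: np.ndarray, lam: float) -> np.ndarray:
    """Approximate reconstruction at inference time: W ~ lambda * W_q."""
    return (lam * W_q.astype(np.float32)).astype(np.float16)
```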

Acceleration. After INT8 quantization, a faster version of CodeGeeX was implemented using NVIDIA's FasterTransformer.

3. HumanEval-X benchmark

The HumanEval benchmark, like MBPP and APPS, contains only handwritten Python programming problems and cannot be directly used for the systematic evaluation of multilingual code generation. Therefore, this paper develops HumanEval-X, a multilingual variant of HumanEval. Each problem in HumanEval is defined in Python; its prompt, canonical solution, and test cases were manually rewritten in C++, Java, JavaScript, and Go. In total, HumanEval-X contains 820 problem-solution pairs.

Tasks. HumanEval-X evaluates two tasks: code generation and code translation. The code generation task takes a function declaration and a textual description as input and generates the implementation of the function. The code translation task takes a solution implemented in a source language as input and generates a corresponding implementation in a target language.

Metric. Test cases are used to evaluate the functional correctness of the generated code, measured by pass@k. Specifically, an unbiased estimator of pass@k is used:
$$\text{pass@k}:=\mathbb{E}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right],\qquad n=200,\ k\in\{1,10,100\} \tag{6}$$
where $n$ is the total number of generations per problem (200), $k$ is the number of samples considered, and $c$ is the number of samples that pass all test cases.
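A per-problem sketch of the unbiased estimator in Eq. (6), written in the numerically stable product form commonly used for pass@k:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0                                 # every size-k sample set contains a pass
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=13, k=1))                 # ≈ 0.065
```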

4. CodeGeeX Evaluation

  • Multilingual code generation

[Figure: multilingual code generation results (pass@k) on HumanEval-X]

  • Multilingual code translation

[Figure: multilingual code translation results on HumanEval-X]


Origin blog.csdn.net/bqw18744018044/article/details/130544322