CodeGeeX2: A more powerful multi-language code generation model

CodeGeeX2 is the second-generation model of the multilingual code generation model CodeGeeX (KDD'23). Unlike the first-generation CodeGeeX (which was trained entirely on the domestic Huawei Ascend chip platform), CodeGeeX2 is implemented on the ChatGLM2 architecture with additional code pre-training. Thanks to the stronger performance of ChatGLM2, CodeGeeX2 achieves improvements on multiple metrics (+107% over CodeGeeX); with only 6 billion parameters it exceeds the 15-billion-parameter StarCoder-15B by nearly 10%. Additional features include:

  • More powerful coding capabilities: Based on the ChatGLM2-6B base language model, CodeGeeX2-6B has been further pre-trained on 600B tokens of code data. Compared with the first-generation model, its coding capabilities are comprehensively improved, with significant gains on all six programming languages in the HumanEval-X evaluation set (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%), reaching a Pass@1 (one-shot pass rate) of 35.9% on Python and surpassing the larger StarCoder-15B.
  • Better model features: Inheriting the ChatGLM2-6B model features, CodeGeeX2-6B better supports both Chinese and English input, supports a maximum sequence length of 8192, and greatly improves inference speed over the first-generation CodeGeeX-13B. After quantization it needs only 6GB of GPU memory to run, supporting lightweight local deployment.
  • More comprehensive AI programming assistant: The CodeGeeX plug-in (VS Code, Jetbrains) backend has been upgraded to support more than 100 programming languages and adds practical features such as contextual completion and cross-file completion. Combined with the Ask CodeGeeX interactive AI programming assistant, it supports Chinese and English dialogue to solve various programming problems, including but not limited to code explanation, code translation, code error correction, and documentation generation, helping programmers develop more efficiently.

Quick start

Use transformers to quickly call CodeGeeX2-6B:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
model = model.eval()

# remember to add a language tag for better performance
prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_length=256, top_k=1)
response = tokenizer.decode(outputs[0])

>>> print(response)
# language: Python
# write a bubble sort function


def bubble_sort(list):
    for i in range(len(list) - 1):
        for j in range(len(list) - 1):
            if list[j] > list[j + 1]:
                list[j], list[j + 1] = list[j + 1], list[j]
    return list


print(bubble_sort([5, 2, 1, 8, 4]))

Start the Gradio demo:

python ./demo/run_demo.py

usage: run_demo.py [-h] [--model-path MODEL_PATH] [--example-path EXAMPLE_PATH] [--quantize QUANTIZE]
                   [--chatglm-cpp] [--fastllm] [--n-gpus N_GPUS] [--gpu GPU] [--cpu] [--auth] [--username yourname]
                   [--password yourpassword]
                   [--port PORT] [--listen ADDRESS]

# To enable authentication, first enable --auth, then define --username and --password, for example:
python run_demo.py --auth --username user --password password  # To listen on all addresses, specify --listen 0.0.0.0

Quantized inference acceleration with ChatGLM.cpp is supported:

python ./demo/run_demo.py --quantize 4 --chatglm-cpp

Start the FastAPI server:

python ./demo/fastapicpu.py
usage: fastapicpu.py [-h] [--model-path MODEL_PATH] [--listen ADDRESS] [--port PORT] [--workders NUM] [--cpu] [--half] [--quantize QUANTIZE] [--chatglm-cpp]
# --cpu enables CPU inference, --half enables .half()

Quantized inference acceleration with ChatGLM.cpp is also supported here; just add the --quantize 4 --chatglm-cpp parameters, as shown below.
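For example, combining the command above with the flags just mentioned:

python ./demo/fastapicpu.py --quantize 4 --chatglm-cpp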

API usage examples

curl -X POST "http://127.0.0.1:7860" \
    -H 'Content-Type: application/json' \
    -d '{"lang": "Python", "prompt": "# Write a quick sort function"}'

❗️Please note:

  • CodeGeeX2-6B is a base code generation model and does not have chat capabilities. Please use the plug-in to experience the more comprehensive Ask CodeGeeX chat features.

  • When using the completion function of CodeGeeX2-6B, the input prompt needs to follow a specific format for best results. For example, add a programming language tag at the beginning (e.g. # language: Python; see the complete language list) and write the instructions as comments. See the processing in run_demo.py for reference.

  • If the graphics card does not support the bfloat16 format, incorrect content will be output, and the model needs to be converted to the float16 format:

    model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True).half().cuda()
  • If you need to load the model across multiple graphics cards, you can replace the following code:

    tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
    model = model.eval()

    with the following:

    def get_model():
        tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
        from gpus import load_model_on_gpus
        # The gpus file is in the demo folder
        model = load_model_on_gpus("THUDM/codegeex2-6b", num_gpus=2)
        model = model.eval()
        return tokenizer, model
    
    tokenizer, model = get_model()

Code capability evaluation

As a multilingual code generation base model, CodeGeeX2's coding capability is greatly improved over the previous generation. Below are the evaluation results on the HumanEval, HumanEval-X, and DS1000 benchmarks (the evaluation metric Pass@k is defined as in the paper):

HumanEval (Pass@1,10,100)

| Model | Pass@1 | Pass@10 | Pass@100 | Notes | Organization |
|---|---|---|---|---|---|
| CodeGen-16B-multi | 19.2 | 34.6 | 55.2 | Open source | Salesforce |
| CodeGeeX-13B | 22.9 | 39.6 | 60.9 | Open source | Tsinghua University |
| Codex-12B | 28.8 | 46.8 | 72.3 | Descendant of GPT-3, not open source | OpenAI |
| CodeT5Plus-16B-mono | 30.9 | 51.6 | 76.7 | Open source | Salesforce |
| Code-Cushman-001 | 33.5 | 54.3 | 77.4 | | |
| LLaMA-65B | 23.7 | - | 79.3 | | |
| LLaMA2-70B | 29.9 | - | - | | |
| CodeGen2.5-7B-mono | 33.4 | 58.4 | 82.7 | Open source | Salesforce |
| StarCoder-15B | 33.2 | 61.0 | 84.7 | | BigCode |
| CodeGeeX2-6B | 35.9 | 62.6 | 88.3 | Open source | Tsinghua University |

Pass@1 uses n=20, t=0.2, top_p=0.95; Pass@10 and Pass@100 use n=200, t=0.8, top_p=0.95.
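For reference, the Pass@k numbers above follow the unbiased estimator from the Codex paper; a minimal sketch of that computation (n samples generated per problem, of which c pass the tests):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# example: 200 samples for one problem, 60 of them pass -> estimated Pass@10
print(pass_at_k(n=200, c=60, k=10))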

HumanEval-X (Pass@1)

| Model | Python | C++ | Java | JavaScript | Go | Rust | Overall |
|---|---|---|---|---|---|---|---|
| CodeGen-16B-multi | 19.2 | 18.1 | 15.0 | 18.4 | 13.0 | 1.8 | 14.2 |
| CodeGeeX-13B | 22.9 | 17.1 | 20.0 | 17.6 | 14.4 | 4.3 | 16.0 |
| Replit-code-v1-3B | 22.0 | 20.1 | 20.1 | 20.1 | 12.2 | 8.6 | 17.2 |
| CodeGen2.5-7B-multi | 30.6 | 24.3 | 29.0 | 27.5 | 18.9 | 20.1 | 25.1 |
| StarCoder-15B | 35.5 | 28.2 | 31.5 | 33.2 | 21.3 | 17.8 | 27.9 |
| CodeGeeX2-6B | 35.9 | 29.3 | 30.8 | 32.2 | 22.5 | 18.1 | 28.1 |

Pass@1 uses n=20, t=0.2, top_p=0.95.

The above results can be reproduced with the scripts/run_humanevalx.sh script. For environment configuration and instructions, see the evaluation environment.

DS1000 (Pass@1)

| Model | Matplotlib | Numpy | Pandas | Pytorch | SciPy | Scikit-learn | TensorFlow | Overall |
|---|---|---|---|---|---|---|---|---|
| # Samples | 155 | 220 | 291 | 68 | 106 | 115 | 45 | 1000 |
| CodeGen-16B-Mono | 31.7 | 10.9 | 3.4 | 7.0 | 9.0 | 10.8 | 15.2 | 11.7 |
| code-cushman-001 | 40.7 | 21.8 | 7.9 | 12.4 | 11.3 | 18.0 | 12.2 | 18.1 |
| Codex-001 | 41.8 | 26.6 | 9.4 | 9.7 | 15.0 | 18.5 | 17.2 | 20.2 |
| CodeGeeX2-6B | 40.5 | 25.5 | 14.5 | 17.3 | 19.3 | 24.0 | 23.0 | 23.1 |
| StarCoder-15B | 51.7 | 29.7 | 11.4 | 21.4 | 20.2 | 29.5 | 24.5 | 26.0 |
| Codex-002 | 57.0 | 43.1 | 26.5 | 41.8 | 31.8 | 44.8 | 39.3 | 39.2 |

Pass@1 uses n=40, t=0.2, top_p=0.5.

The above results can be reproduced using the DS1000 evaluation code.

Quantization and inference performance

CodeGeeX2 is more deployment-friendly than the previous generation. Thanks to Multi-Query Attention and Flash Attention, inference is faster, and after quantization it requires only 6GB of GPU memory to run:

Quantization

| Model | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|
| CodeGeeX-13B | 26.9 GB | 14.7 GB | - |
| CodeGeeX2-6B | 13.1 GB | 8.2 GB | 5.5 GB |
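As a rough illustration of reaching the INT4 footprint above, the model can be quantized at load time; this is a minimal sketch that assumes the model's remote code exposes a quantize() method in the ChatGLM2-6B style (CodeGeeX2 is based on ChatGLM2), not a confirmed API:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
# quantize(4) is assumed to be provided by the remote modeling code, as in ChatGLM2-6B
model = model.quantize(4).cuda().eval()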

Tests are based on PyTorch 2.0, which implements efficient attention computation via torch.nn.functional.scaled_dot_product_attention.
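For illustration, the PyTorch 2.0 operator mentioned above is invoked as follows; the tensor shapes are placeholders, not the model's actual dimensions:

import torch
import torch.nn.functional as F

# placeholder shapes: batch=1, heads=32, seq=2048, head_dim=64
q = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)  # dispatches to an efficient fused kernel when available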

Inference

| Model | Inference speed (characters/second) |
|---|---|
| CodeGeeX-13B | 32 |
| CodeGeeX2-6B | 94 |

batch_size=1, max_length=2048; both models use the acceleration framework; the test hardware is a GeForce RTX 3090.
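A rough way to reproduce a characters-per-second figure yourself is to time a single generation, as in the sketch below (results depend on hardware and on whether an acceleration framework is used):

import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda').eval()

prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(inputs, max_length=2048, top_k=1)
elapsed = time.time() - start

text = tokenizer.decode(outputs[0])
print(f"{len(text) / elapsed:.1f} characters/second")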
