CodeGeeX2: A more powerful multi-language code generation model

CodeGeeX2 is the second-generation model of the multilingual code generation model CodeGeeX (KDD'23). Unlike the first-generation CodeGeeX (which was trained entirely on the domestic Huawei Ascend chip platform), CodeGeeX2 is implemented on the ChatGLM2 architecture with additional code pre-training. Thanks to the stronger performance of ChatGLM2, CodeGeeX2 achieves improvements on multiple metrics (+107% over CodeGeeX); with only 6 billion parameters it exceeds the 15-billion-parameter StarCoder-15B by nearly 10%. Additional features include:

  • More powerful coding capabilities: Based on the ChatGLM2-6B base language model, CodeGeeX2-6B has been further pre-trained on 600B tokens of code data. Compared with the first-generation model, its coding capabilities are comprehensively improved, with significant gains on all six programming languages in the HumanEval-X evaluation set (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%), reaching a Pass@1 (one-shot pass rate) of 35.9% on Python and surpassing the larger StarCoder-15B.
  • Better model features: Inheriting the ChatGLM2-6B model features, CodeGeeX2-6B better supports both Chinese and English input, supports a maximum sequence length of 8192, and greatly improves inference speed over the first-generation CodeGeeX-13B. After quantization it needs only 6GB of GPU memory to run, supporting lightweight local deployment.
  • More comprehensive AI programming assistant: The CodeGeeX plug-in (VS Code, Jetbrains) backend has been upgraded to support more than 100 programming languages and adds practical features such as contextual completion and cross-file completion. Combined with the Ask CodeGeeX interactive AI programming assistant, it supports Chinese and English dialogue to solve various programming problems, including but not limited to code explanation, code translation, code error correction, and documentation generation, helping programmers develop more efficiently.

Quick start

Use transformers to quickly call CodeGeeX2-6B:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
model = model.eval()

# remember to add a language tag for better performance
prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_length=256, top_k=1)
response = tokenizer.decode(outputs[0])

>>> print(response)
# language: Python
# write a bubble sort function


def bubble_sort(list):
    for i in range(len(list) - 1):
        for j in range(len(list) - 1):
            if list[j] > list[j + 1]:
                list[j], list[j + 1] = list[j + 1], list[j]
    return list


print(bubble_sort([5, 2, 1, 8, 4]))

Start the Gradio demo:

python ./demo/run_demo.py

usage: run_demo.py [-h] [--model-path MODEL_PATH] [--example-path EXAMPLE_PATH] [--quantize QUANTIZE]
                   [--chatglm-cpp] [--fastllm] [--n-gpus N_GPUS] [--gpu GPU] [--cpu] [--auth] [--username yourname]
                   [--password yourpassword]
                   [--port PORT] [--listen ADDRESS]

# To enable authentication, first enable --auth, then define --username and --password, for example:
python run_demo.py --auth --username user --password password  # To listen on all addresses, specify --listen 0.0.0.0

Quantized inference acceleration with ChatGLM.cpp is supported:

python ./demo/run_demo.py --quantize 4 --chatglm-cpp

Start the FastAPI server:

python ./demo/fastapicpu.py
usage: fastapicpu.py [-h] [--model-path MODEL_PATH] [--listen ADDRESS] [--port PORT] [--workders NUM] [--cpu] [--half] [--quantize QUANTIZE] [--chatglm-cpp]
# --cpu enables CPU inference, --half enables .half()

Quantized inference acceleration with ChatGLM.cpp is also supported here; just add the --quantize 4 --chatglm-cpp parameters, as shown below.
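For example, combining the command above with the flags just mentioned:

python ./demo/fastapicpu.py --quantize 4 --chatglm-cpp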

API usage examples

curl -X POST "http://127.0.0.1:7860" \
    -H 'Content-Type: application/json' \
    -d '{"lang": "Python", "prompt": "# Write a quick sort function"}'

❗️Please note:

  • CodeGeeX2-6B is a base code generation model and does not have chat capabilities. Please use the plug-in to experience the more comprehensive Ask CodeGeeX chat features.

  • When using the completion function of CodeGeeX2-6B, the input prompt needs to follow a specific format for best results. For example, add a programming language tag at the beginning (e.g. # language: Python; see the complete language list) and write the instructions as comments. See the processing in run_demo.py for reference.

  • If the graphics card does not support the bfloat16 format, incorrect content will be output, and the model needs to be converted to the float16 format:

    model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True).half().cuda()
  • If you need to load the model across multiple graphics cards, you can replace the following code:

    tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
    model = model.eval()

    with the following:

    def get_model():
        tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
        from gpus import load_model_on_gpus
        # The gpus file is in the demo folder
        model = load_model_on_gpus("THUDM/codegeex2-6b", num_gpus=2)
        model = model.eval()
        return tokenizer, model
    
    tokenizer, model = get_model()

Code capability evaluation

As a multilingual code generation base model, CodeGeeX2's coding capability is greatly improved over the previous generation. Below are the evaluation results on the HumanEval, HumanEval-X, and DS1000 benchmarks (the evaluation metric Pass@k is defined as in the paper):

HumanEval (Pass@1,10,100)

| Model | Pass@1 | Pass@10 | Pass@100 | Notes | Organization |
|---|---|---|---|---|---|
| CodeGen-16B-multi | 19.2 | 34.6 | 55.2 | Open source | Salesforce |
| CodeGeeX-13B | 22.9 | 39.6 | 60.9 | Open source | Tsinghua University |
| Codex-12B | 28.8 | 46.8 | 72.3 | Descendant of GPT-3, not open source | OpenAI |
| CodeT5Plus-16B-mono | 30.9 | 51.6 | 76.7 | Open source | Salesforce |
| Code-Cushman-001 | 33.5 | 54.3 | 77.4 | | |
| LLaMA-65B | 23.7 | - | 79.3 | | |
| LLaMA2-70B | 29.9 | - | - | | |
| CodeGen2.5-7B-mono | 33.4 | 58.4 | 82.7 | Open source | Salesforce |
| StarCoder-15B | 33.2 | 61.0 | 84.7 | | BigCode |
| CodeGeeX2-6B | 35.9 | 62.6 | 88.3 | Open source | Tsinghua University |

Pass@1 uses n=20, t=0.2, top_p=0.95; Pass@10 and Pass@100 use n=200, t=0.8, top_p=0.95.
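For reference, the Pass@k numbers above follow the unbiased estimator from the Codex paper; a minimal sketch of that computation (n samples generated per problem, of which c pass the tests):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# example: 200 samples for one problem, 60 of them pass -> estimated Pass@10
print(pass_at_k(n=200, c=60, k=10))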

HumanEval-X (Pass@1)

| Model | Python | C++ | Java | JavaScript | Go | Rust | Overall |
|---|---|---|---|---|---|---|---|
| CodeGen-16B-multi | 19.2 | 18.1 | 15.0 | 18.4 | 13.0 | 1.8 | 14.2 |
| CodeGeeX-13B | 22.9 | 17.1 | 20.0 | 17.6 | 14.4 | 4.3 | 16.0 |
| Replit-code-v1-3B | 22.0 | 20.1 | 20.1 | 20.1 | 12.2 | 8.6 | 17.2 |
| CodeGen2.5-7B-multi | 30.6 | 24.3 | 29.0 | 27.5 | 18.9 | 20.1 | 25.1 |
| StarCoder-15B | 35.5 | 28.2 | 31.5 | 33.2 | 21.3 | 17.8 | 27.9 |
| CodeGeeX2-6B | 35.9 | 29.3 | 30.8 | 32.2 | 22.5 | 18.1 | 28.1 |

Pass@1 uses n=20, t=0.2, top_p=0.95.

The above results can be reproduced with the scripts/run_humanevalx.sh script. For environment configuration and instructions, see the evaluation environment.

DS1000 (Pass@1)

| Model | Matplotlib | Numpy | Pandas | Pytorch | SciPy | Scikit-learn | TensorFlow | Overall |
|---|---|---|---|---|---|---|---|---|
| # Samples | 155 | 220 | 291 | 68 | 106 | 115 | 45 | 1000 |
| CodeGen-16B-Mono | 31.7 | 10.9 | 3.4 | 7.0 | 9.0 | 10.8 | 15.2 | 11.7 |
| code-cushman-001 | 40.7 | 21.8 | 7.9 | 12.4 | 11.3 | 18.0 | 12.2 | 18.1 |
| Codex-001 | 41.8 | 26.6 | 9.4 | 9.7 | 15.0 | 18.5 | 17.2 | 20.2 |
| CodeGeeX2-6B | 40.5 | 25.5 | 14.5 | 17.3 | 19.3 | 24.0 | 23.0 | 23.1 |
| StarCoder-15B | 51.7 | 29.7 | 11.4 | 21.4 | 20.2 | 29.5 | 24.5 | 26.0 |
| Codex-002 | 57.0 | 43.1 | 26.5 | 41.8 | 31.8 | 44.8 | 39.3 | 39.2 |

Pass@1 uses n=40, t=0.2, top_p=0.5.

The above results can be reproduced using the DS1000 evaluation code.

Quantization and inference performance

CodeGeeX2 is more deployment-friendly than the previous generation. Thanks to Multi-Query Attention and Flash Attention, inference is faster, and after quantization it requires only 6GB of GPU memory to run:

Quantization

| Model | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|
| CodeGeeX-13B | 26.9 GB | 14.7 GB | - |
| CodeGeeX2-6B | 13.1 GB | 8.2 GB | 5.5 GB |
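As a rough illustration of reaching the INT4 footprint above, the model can be quantized at load time; this is a minimal sketch that assumes the model's remote code exposes a quantize() method in the ChatGLM2-6B style (CodeGeeX2 is based on ChatGLM2), not a confirmed API:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
# quantize(4) is assumed to be provided by the remote modeling code, as in ChatGLM2-6B
model = model.quantize(4).cuda().eval()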

Tests are based on PyTorch 2.0, which implements efficient attention computation via torch.nn.functional.scaled_dot_product_attention.
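For illustration, the PyTorch 2.0 operator mentioned above is invoked as follows; the tensor shapes are placeholders, not the model's actual dimensions:

import torch
import torch.nn.functional as F

# placeholder shapes: batch=1, heads=32, seq=2048, head_dim=64
q = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 2048, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)  # dispatches to an efficient fused kernel when available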

Inference

| Model | Inference speed (characters/second) |
|---|---|
| CodeGeeX-13B | 32 |
| CodeGeeX2-6B | 94 |

batch_size=1, max_length=2048; both models use the acceleration framework; the test hardware is a GeForce RTX 3090.
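A rough way to reproduce a characters-per-second figure yourself is to time a single generation, as in the sketch below (results depend on hardware and on whether an acceleration framework is used):

import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda').eval()

prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(inputs, max_length=2048, top_k=1)
elapsed = time.time() - start

text = tokenizer.decode(outputs[0])
print(f"{len(text) / elapsed:.1f} characters/second")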
