Some thoughts on commercialization of code generation

Code generation solutions

There are three major categories of solutions for generating project code:

1. Bottom-up, partial generation: the model generates one line, one method, or one small piece of code at a time. This is the basic idea behind IDE plug-in code generation.

2. Multi-agent generation: the large language model acts as agents with different software engineering roles. The user gives an idea, and the agents automatically divide the work to generate the code.

3. Framework-constrained generation: people abstract the process framework for code generation in a specific project, and a large model generates the module code within that framework.

The first technical route is the current mainstream for to-C products. The problem it actually solves is that a general-purpose language has a complex API, many kinds of functions and methods, a large volume of code, and highly uniform non-business logic code. Nobody can remember every method and function of a general-purpose language, and the amount of code to write is large. If programmers can describe what they want and directly get the required function, method, or short piece of general code that is not core business logic, development efficiency improves greatly. In essence, this route folds the resources programmers already rely on (development manuals, sample code, developer Q&A communities, and help with modifying code) into one large model that assists development through text Q&A.

The second route, role-playing by large models, has the advantage that a user can generate project code from nothing more than an idea. Its disadvantages are that the output is more random and less controllable, the code may not meet the requirements, and verifying the project code is relatively expensive. In essence, this route skips manual program development and lets the machine develop the code by following the software engineering process.

The third technical route is a compromise between the previous two. Many companies have begun to adopt low-code ideas and design business-specific code languages for their own needs; the syntax and built-in functions of these languages are much simpler than those of general-purpose languages. Most developers there are translating business logic into code, so the few-lines or small-snippet generation offered by the first route does not necessarily apply. The accuracy of the generated code is relatively low, and describing a business development problem is about as complex as writing the business code itself. To assist with business code, therefore, the first route is not enough: a small amount of input must produce at least a fully usable functional module, or an entire code project.

In theory, the delivery form of the second route matches what business developers expect. In practice, multi-role playing currently suffers from low stability, poor controllability, and generated code whose quality falls short of expectations. The best approach is therefore to combine the first two routes: the large model acts only as the programmer, while people abstract the overall process design and build the framework. The large model then solves problems inside a bounded framework, which makes the generated code more controllable. This approach has its own problems, such as rigid processes and the complexity of abstraction, framework design, and code generation prompt management, but these can be mitigated by automating the engineering system.
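The third route can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the module list, the prompt wording, and `fake_model` (a stand-in for a real model call) are all hypothetical and do not come from any specific product. The point it shows is that people fix the framework, and the model is asked for one bounded module at a time.

```python
from typing import Callable, Dict, List

# Hypothetical framework, fixed by people: the modules, their order, and their specs.
FRAMEWORK: List[Dict[str, str]] = [
    {"name": "load_orders", "spec": "Read order rows from the data source."},
    {"name": "validate_orders", "spec": "Drop rows that have no order id."},
    {"name": "export_report", "spec": "Write the remaining rows to a report."},
]


def build_prompt(module: Dict[str, str], requirement: str) -> str:
    """Build the bounded prompt for one module of the framework."""
    return (
        f"Module: {module['name']}\n"
        f"Spec: {module['spec']}\n"
        f"Business requirement: {requirement}\n"
        "Write only this function."
    )


def generate_project(requirement: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Ask the model for each module in turn; the model never changes the framework."""
    return {m["name"]: generate(build_prompt(m, requirement)) for m in FRAMEWORK}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns a stub for the requested module."""
    name = prompt.splitlines()[0].split("Module: ", 1)[1]
    return f"def {name}():\n    raise NotImplementedError\n"


project = generate_project("daily order report", fake_model)
```

Because the framework is plain data, adding or reordering modules is a human decision, while prompt management stays in one function.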

Multi-agent mode exploration

This part is exploratory work. Following the organizational structure and event flow of the software engineering development process (here, the waterfall model), the LLM plays different roles to simulate software development activities, and the final program is generated from the requirement points the user enters.

This picture describes the software development process under the waterfall model. It recasts the process as a conversation in which each stage produces intermediate artifacts that constrain the downstream stages and help ensure the reliability of the output code.

The picture above describes how the LLM actually executes this process to generate code: a flow chart of continuous dialogue that keeps iterating until executable code is output.
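The role-playing flow can be sketched as a simple pipeline. This is a minimal sketch, not the flow from the figures: the role names, the artifact names, and `echo_llm` (a stub in place of a real LLM call) are assumptions made for illustration. It shows how each stage's intermediate artifact is passed downstream as a constraint.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical waterfall roles and the artifact each one is asked to produce.
ROLES: List[Tuple[str, str]] = [
    ("product manager", "requirements document"),
    ("architect", "design document"),
    ("programmer", "source code"),
    ("tester", "test report"),
]


def run_waterfall(idea: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run the roles in order; every role sees all upstream artifacts."""
    artifacts: Dict[str, str] = {}
    context = f"User idea: {idea}"
    for role, artifact in ROLES:
        prompt = f"You are the {role}. Produce the {artifact}.\n{context}"
        output = llm(prompt)
        artifacts[artifact] = output
        # The new artifact constrains every downstream role.
        context += f"\n\n[{artifact}]\n{output}"
    return artifacts


def echo_llm(prompt: str) -> str:
    """Stub model (assumption): echoes the instruction line instead of calling an LLM."""
    return f"(stub output for: {prompt.splitlines()[0]})"


artifacts = run_waterfall("a to-do list app", echo_llm)
```

A real system would add a loop that feeds test failures back to the programmer role, which is where the verification cost mentioned earlier comes from.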

All three ideas for code generation described above require a powerful base model. The following sections introduce the available open source models and the training routes that can make a model more powerful.

Open source code generation model parameter comparison

The table below compares code generation models by parameter count and effect.

Model            Size     Architecture    pass@1
CodeT5+          -        T5              -
code-davinci-2   -        GPT             59.86%
CodeGeeX2        6B       GLM             35%
StarCoder        15.5B    decoder-only    38.2%
CodeGen-16B      16B      decoder-only    29.28%
InCoder-6.7B     6.7B     Fairseq         15.6%
PaLM-Coder       540B     -               36%
Code Llama       34B      decoder-only    43%

A dash marks a value the original table leaves blank.
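The source does not say which benchmark or sampling setup produced the pass@1 column. As a hedged illustration only, the sketch below uses the widely used unbiased pass@k estimator: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names are chosen for this example.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass the tests."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) for each problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of samples that pass.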

Code Llama is a language model given large-scale pre-training so that it can generate code from human instructions. It is strong at code generation, code completion, code comment generation, and semantic understanding. In some cases its ability to generate code from natural language exceeds that of ChatGPT, and it supports contexts of 16k tokens or longer.

StarCoder is a code model trained on fill-in-the-blank and comment generation tasks so that it can generate code from human instructions. It offers code generation, code commenting, code completion, and semantic understanding, and supports a context of 8k tokens.

StarCoder

StarCoder pre-training

Implemented with Megatron-LM

# Download Megatron-LM
git clone https://github.com/bigcode-project/Megatron-LM.git

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# Download the wiki_zh data
******/pleisto___json

# Preprocess the data
bash preprocess_santacoderpack.sh

# Run pre-training
bash pretraning_santacoderpack.sh

'''
Attachment: preprocess_santacoderpack.sh script
'''
INPUT=****/pleisto___json # merged JSONL datasets from commitpack-subset-cf
NAME=wiki_zh # name of the output dataset
TOKENIZER_FILE=******/starcoderplus/tokenizer.json
VOCAD=******/starcoderplus/vocab.json

# File Path setup
SCRIPT_REPO=******/Megatron-LM
pushd $SCRIPT_REPO

python tools/preprocess_data.py \
    --input $INPUT \
    --output-prefix $NAME \
    --dataset-impl mmap \
    --tokenizer-type TokenizerFromFile  \
    --tokenizer-file $TOKENIZER_FILE \
    --json-keys 'completion' \
    --workers 30 \
    --chunk-size 1000

'''
Attachment: pretraning_santacoderpack.sh script
'''
#! /bin/bash

# set -u # stop on unset variables

#export WANDB_API_KEY= # your wandb key
#export WANDB_PROJECT= # your wandb project name

NNODES=1 #$WORLD_SIZE  # Adjust
GPUS_PER_NODE=4
RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9001
export CUDA_DEVICE_MAX_CONNECTIONS=1


GPU_NUM=$(($GPUS_PER_NODE*$NNODES))
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

echo "================================================"
echo "GPU_NUM: $GPU_NUM"
echo "================================================"

DISTRIBUTED_ARGS="\
              --nproc_per_node $GPUS_PER_NODE \
              --nnodes $NNODES \
              --node_rank $RANK \
              --master_addr $MASTER_ADDR \
              --master_port $MASTER_PORT \
"

echo $DISTRIBUTED_ARGS

CHECKPOINT_PATH=****/starcoderplus  # Adjust: Directory to store the checkpoints 
DATA_PATH=******/Megatron-LM/wiki_zh_completion_document  # Adjust: Prefix of the preprocessed dataset.
TOKENIZER_FILE=******/starcoderplus/tokenizer.json  # Adjust: starcoder-tokenizer/tokenizer.json

GPT_ARGS="\
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
       --recompute-activations \
       --num-layers 24 \
       --hidden-size 2048 \
       --num-attention-heads 16 \
       --attention-head-type multiquery \
       --init-method-std 0.022 \
       --seq-length 8192 \
       --max-position-embeddings 8192 \
       --attention-dropout 0.1 \
       --hidden-dropout 0.1 \
       --micro-batch-size 2 \
       --global-batch-size 64 \
       --lr 0.0002 \
       --train-iters 250000 \
       --lr-decay-iters 600000 \
       --lr-decay-style cosine \
       --lr-warmup-fraction 0.02 \
       --weight-decay .1 \
       --adam-beta2 .95 \
       --clip-grad 1.0 \
       --bf16 \
       --log-interval 10 \
       --save-interval 1000 \
       --eval-interval 500 \
       --eval-iters 10 \
       --initial-loss-scale 65536 \
"

TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"

torchrun $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       $GPT_ARGS \
       --tokenizer-type TokenizerFromFile \
       --tokenizer-file $TOKENIZER_FILE \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       $TENSORBOARD_ARGS
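
The batch settings in the script above interact. In Megatron-LM, the data-parallel size is the world size divided by (tensor-parallel size × pipeline-parallel size), and the number of micro-batches accumulated per optimizer step is the global batch size divided by (micro-batch size × data-parallel size). The sketch below works through that arithmetic for the values in the script; `megatron_batch_plan` is a helper named for this example, not a Megatron-LM function.

```python
from typing import Dict


def megatron_batch_plan(
    world_size: int,
    tensor_parallel: int,
    pipeline_parallel: int,
    micro_batch_size: int,
    global_batch_size: int,
    seq_length: int,
) -> Dict[str, int]:
    """Derive data-parallel size, accumulation count, and tokens per optimizer step."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP x PP")
    data_parallel = world_size // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro batch x DP")
    return {
        "data_parallel_size": data_parallel,
        "micro_batches_per_step": global_batch_size // samples_per_pass,
        "tokens_per_step": global_batch_size * seq_length,
    }


# Values taken from the script above: 1 node x 4 GPUs, TP=1, PP=1.
plan = megatron_batch_plan(
    world_size=4,
    tensor_parallel=1,
    pipeline_parallel=1,
    micro_batch_size=2,
    global_batch_size=64,
    seq_length=8192,
)
```

With these values each GPU accumulates 8 micro-batches per step, and one step covers 524,288 tokens if every sequence is packed to the full length.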

StarCoder SFT

# Download the code
git clone https://github.com/bigcode-project/octopack.git

# Prepare the data
Given question-answer pairs, you can design a data structure such as:
{instruction:"",input:"",history:"",respond:""}

# Parameter settings
/mnt/user/caifu/252256/WizardLM/WizardCoder/configs/deepspeed_config.json
/mnt/user/caifu/252256/WizardLM/WizardCoder/WZsft.sh

# Run the script
bash WZsft.sh

'''
Attachment: WZsft.sh
'''
deepspeed --num_gpus 8 --master_port=9901 src/train_wizardcoder.py \
    --model_name_or_path "/mnt/user/caifu/252256/llm_model/starcode_instruct2/checkpoint-3600" \
    --data_path "/mnt/user/caifu/252256/LLaMA-Efficient-Tuning/data/alpaca_data_zh_51k.json" \
    --output_dir "/mnt/user/caifu/252256/llm_model/starcode_instruct3" \
    --num_train_epochs 3 \
    --model_max_length 8192 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 30 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
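
The question-answer data structure from the data preparation step can be built and serialized like this. It is a sketch under assumptions: the four keys come from the example structure above, but storing `history` as a list of earlier [question, answer] turns and writing one JSON object per line are choices made here, and whether the training script reads exactly this layout is not confirmed by the source.

```python
import json
from typing import Dict, List, Optional


def make_sft_record(
    instruction: str,
    response: str,
    input_text: str = "",
    history: Optional[List[List[str]]] = None,
) -> Dict[str, object]:
    """Build one SFT sample with the keys instruction, input, history, respond."""
    return {
        "instruction": instruction,
        "input": input_text,
        # Assumption: earlier dialogue turns as [question, answer] pairs.
        "history": history or [],
        "respond": response,
    }


def to_jsonl(records: List[Dict[str, object]]) -> str:
    """Serialize records as JSON Lines; ensure_ascii=False keeps Chinese text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


sample = make_sft_record("Write hello world in Python", "print('hello world')")
jsonl_text = to_jsonl([sample])
```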

StarCoder efficient inference

# Download the code
git clone https://github.com/bigcode-project/starcoder.cpp
cd starcoder.cpp

# Convert HF model to ggml
python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

# Build ggml libraries
# CPU-only inference build
make
# GPU and CPU inference build
make clean && LLAMA_CUBLAS=1 make -j

# quantize the model
./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

# run inference
./main -m models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2

Code Llama

Code Llama pre-training

# Download the LLaMA training code
git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git

# Prepare the pre-training data
wiki_zh
internal code data

# Download the models
******/CodeLlama-13b-Instruct-hf
******/CodeLlama-13b-hf
******/CodeLlama-34b-Instruct-hf
******/CodeLlama-34b-hf

# Set the parameters
******/LLaMA-Efficient-Tuning/ds_zero3_config.json
******/LLaMA-Efficient-Tuning/pretrain_finetune.sh

# Run the pre-training script
bash pretrain_finetune.sh

'''
Attachment: pretrain_finetune.sh
'''
deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed ds_zero3_config.json \
    --stage pt \
    --model_name_or_path ******/CodeLlama-13b-Instruct-hf \
    --do_train \
    --dataset wiki_zh_pre \
    --template default \
    --finetuning_type full \
    --lora_target q_proj,v_proj \
    --output_dir ******/CodeLlama-13b-Instruct-pre1 \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 5e-5 \
    --warmup_steps 30 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --report_to "tensorboard" \
    --fp16

'''
Attachment: ds_zero3_config.json
'''
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 5e-5,
          "betas": [
            0.9,
            0.999
          ],
          "eps": 1e-8,
          "weight_decay": 0
        }
    },
    "train_batch_size":256 ,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps":2,
    "wall_clock_breakdown": false
}
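
DeepSpeed requires train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. The sketch below checks the three batch values from the config above against a given GPU count; the helper names are made up for this example and are not part of DeepSpeed.

```python
from typing import Dict


def expected_train_batch_size(config: Dict[str, int], num_gpus: int) -> int:
    """Effective batch size implied by the per-GPU micro batch, accumulation, and GPU count."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


def is_batch_config_consistent(config: Dict[str, int], num_gpus: int) -> bool:
    """True when train_batch_size matches the product of the other three values."""
    return config["train_batch_size"] == expected_train_batch_size(config, num_gpus)


# Batch-related values copied from ds_zero3_config.json above.
ds_batch_config = {
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 2,
}
consistent_on_8_gpus = is_batch_config_consistent(ds_batch_config, num_gpus=8)
consistent_on_4_gpus = is_batch_config_consistent(ds_batch_config, num_gpus=4)
```

The config is consistent for the 8-GPU launch (16 × 2 × 8 = 256). On 4 GPUs the same file implies 128, so a 4-GPU run would need gradient_accumulation_steps of 4, or a smaller train_batch_size, to stay consistent.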

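Code Llama SFT

# Prepare the data
Data preparation is tightly coupled to task design.
This part is the key point.

# Run the script
bash code_llama_fintune.sh

# If training with LoRA, merge the parameters
bash export_lora_sft.sh

'''
Attachment: code_llama_fintune.sh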
'''
deepspeed --num_gpus 4 --master_port=9901 src/train_bash.py \
    --deepspeed ds_zero3_config.json \
    --stage sft \
    --model_name_or_path ******/CodeLlama-13b-Instruct-hf  \
    --do_train \
    --dataset code_alpaca \
    --template default \
    --finetuning_type full \
    --lora_target q_proj,v_proj \
    --output_dir ******/CodeLlama-13b-Instruct-full0 \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 5e-5 \
    --warmup_steps 30 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --report_to "tensorboard" \
    --fp16

'''
Attachment: export_lora_sft.sh
'''
python src/export_model.py \
    --model_name_or_path ******/CodeLlama-34b-hf \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir ******/CodeLlama-34b-Instruct-sft1 \
    --output_dir ******/CodeLlama-34b-Instruct-0
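
What merging LoRA parameters means can be shown on a toy matrix. This sketch uses the standard LoRA merge formula W' = W + (alpha / r) · B · A, where A (r × d_in) and B (d_out × r) are the trained low-rank factors. That the export script performs exactly this step is an assumption; the pure-Python helpers below are for illustration only and are far too slow for real weights.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Fold the low-rank update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2 x 2 base weight with a rank-1 adapter.
merged = merge_lora(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[3.0, 4.0]],
    lora_b=[[1.0], [2.0]],
    alpha=2.0,
    rank=1,
)
```

After merging, the adapter is no longer needed at inference time, which is why the merged model can be converted and quantized like any full checkpoint.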

Llama 2 DPO training

# Download the training code
git clone https://github.com/shibing624/MedicalGPT.git

# Prepare the data
{"question": "维胺酯维E乳膏能治理什么疾病?", "response_chosen": "痤疮;暴发性痤疮;寻常痤疮;婴儿痤疮;聚合性痤疮;沙漠疮;背痈;肺风粉刺;职业性痤疮", "response_rejected": "维埃胶可以治疗湿疹、荨麻疹和过敏性鼻炎等皮肤病。"}
******/MedicalGPT/data/reward/test.json

# Run the training script
bash run_dpo.sh

'''
Attachment: run_dpo.sh
'''
CUDA_VISIBLE_DEVICES=0,1,2,3 python   dpo_training.py \
    --model_type llama \
    --model_name_or_path ******/CodeLlama-13b-Instruct-hf \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --max_steps 10000 \
    --eval_steps 20 \
    --save_steps 1000 \
    --max_source_length 1024 \
    --max_target_length 1024 \
    --output_dir ******/CodeLlama-13b-Instruct-dpo1 \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --fp16 True \
    --device_map auto \
    --report_to tensorboard \
    --remove_unused_columns False \
    --gradient_checkpointing True \
    --cache_dir ./cache
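
Each training record pairs a question with a chosen and a rejected response, and DPO turns that pair into a loss. The sketch below computes the standard DPO loss for one pair from summed log-probabilities: loss = -log sigmoid(beta · ((log pi(chosen) - log ref(chosen)) - (log pi(rejected) - log ref(rejected)))). The value beta = 0.1 is an assumption, since the script above does not set it, and this function illustrates the objective rather than the code inside dpo_training.py.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more the policy prefers chosen over rejected than the reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet.
loss_at_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has moved toward the chosen response and away from the rejected one.
loss_after_update = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss starts at log 2 when the policy equals the reference and falls as the policy raises the chosen response relative to the rejected one.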

Llama efficient inference:

1. Download and build the llama.cpp project
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# CPU-only inference build
make
# GPU and CPU inference build
make clean && LLAMA_CUBLAS=1 make -j

2. Convert the model to ggml format for later inference with llama.cpp
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

# [Optional] for models using BPE tokenizers
python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize /mnt/qian.lwq/CodeLlama-34b-Instruct-0/ggml-model-f16.gguf /mnt/qian.lwq/CodeLlama-34b-Instruct-0/ggml-model-q4_0.gguf q4_0

3. Run inference with the converted model
# CPU-only inference
./main -m /mnt/qian.lwq/CodeLlama-34b-Instruct-0/ggml-model-f16.gguf -p "给下面这段代码添加注释 \n def querySymbolInfo = { row, symbolList, fields ->\n def symbolFacadeClient = row.get('symbolFacadeClient') as SymbolFacadeClient \n SymbolRequest req = new SymbolRequest() \t\n req.setSymbols(symbolList as List<String>) \n req.setFields(fields as String) \t\n Result<SymbolDTOWrapper> result = symbolFacadeClient.querySymbolInfo(req) \n AssertUtils.assertTrue(Status.SUCCESS.equals(result.getStatus()), ErrorCodeEnum.REMOTE_UNEXPECTED_RESULT,\t\n result.getStatus().getMessage()) \t\b return result.getData().getDatas() \t\n } \n " -n 512 -ngl 15 -e
# Mixed CPU and GPU inference
./main --color --interactive --model /mnt/qian.lwq/CodeLlama-34b-Instruct-0/ggml-model-f16.gguf --n-predict 512 --repeat_penalty 1.0 --n-gpu-layers 15 --reverse-prompt "User:" --in-prefix " " -p "Building a website can be done in 10 simple steps:\nStep 1:"

Follow-up

1. Data curation

2. CoT and multi-turn dialogue training adjustments

3. Speculative sampling to speed up inference

4. Research on multi-agent collaborative code generation

5. StarCoder DPO training

Summary

This article summarizes several ways code generation could be commercialized and describes the applicable scenarios and likely problems of each mode. At present the main product form is to-C: it folds the development manual lookup, community Q&A, code modification, and sample code that programmers rely on into a large model. For to-B code generation, which has greater commercial value, no product on the market yet generates accurate, controllable, and executable code modules or project code. The difficulty of to-B code generation usually lies not in the complexity of the code but in accurately generating code that is actually usable and truly reduces the workload of business developers. This is really software engineering automation, that is, automating software development. Software engineering is itself the automation of production, so this is the automation of automation.

The article also introduces training techniques for enhancing the capabilities of the StarCoder and Code Llama models.

Origin blog.csdn.net/liangwqi/article/details/132767257