Code generation overview

A large code generation model is a subcategory of LLMs and, in theory, a special case of a language model: code is itself a special kind of language, so a code model should handle both general natural language and code. There are two practical paths to building one: continue training an NLP LLM on code, or continue training a code LLM on an NLP corpus. In fact there is a third path as well: treat code as just another part of the NLP corpus and train on everything together, without distinguishing code from natural language.

The training corpora are the same kinds used for NLP LLMs, covering at least three types: pretraining corpora, instruction (supervised fine-tuning) corpora, and RLHF comparison corpora. In practice the first two, pretraining and instruction corpora, are the most commonly used.

Three things and one link matter most for model training. The three things are simply: model, data, and task design. The link refers to how many epochs each dataset is trained for, the mixing ratio between datasets, and the order in which they are added. Below we cover the three things. Why not the link? There is no secret being withheld; the problem is that it is hard to stabilize into a theory. It is a bit like "heat control" and tacit knowledge in traditional crafts: it is tightly bound to the concrete situation, hard to prescribe in advance, and even when it can be articulated, it is usually a hindsight summary of a decision made in the moment, one that may not transfer to the next case. There is no better way than to practice more and reflect more; you learn it by using it. Often an effective choice is an intuitive one, but only if you have encountered the problem, and thought about it, enough times.

Model

base model

| Model | Size | Architecture | pass@1 |
| --- | --- | --- | --- |
| CodeT5+ |  | T5 |  |
| code-davinci-002 |  | GPT | 59.86% |
| CodeGeeX2 | 6B | GLM |  |
| StarCoder | 15.5B | decoder-only |  |
| CodeGen-16B | 16B | decoder-only | 29.28% |
| InCoder-6.7B | 6.7B | Fairseq | 15.6% |
| PaLM-Coder | 540B |  | 36% |
| BLOOM |  |  |  |

instruction model

| Model | Size | Instruction set | pass@1 |
| --- | --- | --- | --- |
| OctoCoder | 16B | CommitPack, CommitPackFT | 35.5% |
| OctoGeeX | 6B | CommitPack, CommitPackFT | 30.9% |
| WizardCoder | 16B | Evol-Instruct | 57% |
| InstructCodeT5+ | 16B |  | 22.3% |
| PanGu-Coder2 | 15B | data extracted via the RRTF framework | 61% |
| Instruct-CodeGen-16B | 16B | Code Alpaca 250k | 37.1% |

Data

Pre-training data

 

The Stack(6TB)

Download link: https://huggingface.co/datasets/bigcode/the-stack

The Stack is a 3.1 TB corpus of legally open source code covering 30 programming languages (note: the latest version, The Stack v1.1, has been expanded to 308 languages and 6 TB of data).

CodeParrot github-code(500GB)

Download link: https://huggingface.co/datasets/codeparrot/github-code

PolyCoder(249GB)

Download link: https://github.com/VHellendoorn/Code-LMs

It uses public code from GitHub, mainly drawn from the most popular repositories in each programming language (each with at least 50 stars). Code from 12 programming languages in total is used for training.

Google BigQuery (2B file)

Download link: https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code

Google BigQuery provides snapshots of licensed repositories on GitHub that can be filtered via SQL queries. Several models (AlphaCode, BLOOM, InCoder, CodeGen) include this data in their pre-training sets.

CodeSearchNet(20GB)

Download link: https://github.com/github/CodeSearchNet

It contains about 6 million functions taken from open source code in six programming languages: Go, Java, JavaScript, PHP, Python and Ruby.

ProjectCodeNet (500 million lines)

Download link: https://github.com/IBM/Project_CodeNet

The dataset contains 14 million code samples totaling 500 million lines of code in 55 programming languages; C++ is the most represented language in the samples, with Python second.

CodeXGLUE

Download link: https://github.com/microsoft/CodeXGLUE

Open-sourced by Microsoft; it includes 10 tasks and 14 datasets.

The Pile

Download link: https://pile.eleuther.ai

The Pile also contains questions and answers from StackOverflow, but not the comments attached to them.

Instruction data

Code generation:

https://huggingface.co/datasets/TnT/Multi_CodeNet4Repair

Code cloze (fill in the blank):

https://huggingface.co/datasets/code_x_glue_cc_cloze_testing_all

Code Q&A:

https://huggingface.co/datasets/Dahoas/code-review-instruct-critique-revision-python

Code judgment:

https://huggingface.co/datasets/reshinthadith/pairwise-code-review-instruct-critique-revision-python

Code reading comprehension:

https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-python

Code testing:

https://huggingface.co/datasets/codeparrot/apps

https://huggingface.co/datasets/deepmind/code_contests

Comprehensive instructions:

https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1

https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K

https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca

Task design

Pre-training tasks

forward generation

Forward generation is the simplest task in terms of training form: you simply let the model read source code. The difficulty lies in choosing what to have it read, when to read it, and how many times.

1. Read annotated source code

2. Read source code together with project requirements

3. Language rules + code examples
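As a rough sketch (not from the original article), "letting the model read" amounts to concatenating annotations and code into one token stream and slicing it into fixed-length blocks for causal-LM training. The whitespace tokenizer and the `<|eos|>` token below are stand-ins for a real tokenizer:

```python
# Sketch: pack source files (with their annotations/requirements prepended)
# into fixed-length blocks for causal language-model pretraining.
# The whitespace "tokenizer" is a placeholder for a real BPE tokenizer.

def pack_for_pretraining(files, block_size=8, eos="<|eos|>"):
    """files: list of (annotation, source) pairs; returns token blocks."""
    stream = []
    for annotation, source in files:
        # Forward generation: the model simply reads annotation + code.
        stream.extend(annotation.split())
        stream.extend(source.split())
        stream.append(eos)  # document boundary marker
    # Slice the token stream into equal-sized training blocks.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

blocks = pack_for_pretraining(
    [("# add two numbers", "def add(a, b): return a + b")],
    block_size=4,
)
```

A real pipeline would also shuffle blocks and control how often each corpus is revisited, which is exactly the "link" (epochs, ratios, ordering) discussed earlier.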

mask fill

This task masks out part of the source code (a word or several words, one line or several lines) and has the model fill in the blanks. The mask can be random, or targeted at the code's keywords or variable names, or placed randomly to probe semantic understanding.

1. What is strategyDecisionDrm.__________?

2. What type is the return value of the getDefaultStrategyMap() method?

3. How to randomly select a specified number of strings from a List<String>?

4. Please add type declarations for the following variables: row, strategyCode, recommendCount

5. Which method in Java 8 can be used to limit the elements in a List<String> to a specified number?
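A minimal sketch of how such random-mask samples might be constructed; the `<mask>` sentinel and whitespace tokenization are illustrative placeholders, not a prescribed format:

```python
import random

# Sketch: build a mask-filling sample from a line of code by replacing a
# randomly chosen token with a sentinel; the target is the masked-out token.
def make_mask_sample(code, rng):
    tokens = code.split()
    i = rng.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = "<mask>"
    return " ".join(tokens), target

rng = random.Random(0)  # seeded for reproducibility
masked, target = make_mask_sample("def add(a, b): return a + b", rng)
```

The same scaffold extends to keyword masks or variable-name masks by restricting which indices are eligible for masking.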

segment generation

This is very similar to the mask-filling task above, but masks larger segments: the entire code, the core implementation, or some defined part of it. Alternatively you can mask a code comment or a function description and have the model fill it in.
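One common way to realize segment generation is a fill-in-the-middle (FIM) style transform, sketched below under that assumption; the `<PRE>`/`<SUF>`/`<MID>` sentinel names are illustrative:

```python
# Sketch: fill-in-the-middle (FIM) transform for segment generation.
# A contiguous span (e.g. a function body) is cut out and moved to the end,
# so the model learns to generate the missing segment given both the code
# before it and the code after it.
def to_fim(code, span_start, span_end):
    prefix = code[:span_start]
    middle = code[span_start:span_end]   # the masked segment to generate
    suffix = code[span_end:]
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle

code = "def square(x):\n    return x * x\n"
sample = to_fim(code, code.index("return"), len(code))
```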

judgment

Tasks here can be designed around code repair and code-complexity judgments: for example, deciding which implementation is correct, which implementation will run faster, which variable name is correct, or which test result is right.
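Such judgment tasks can be packaged as pairwise comparison samples. The chosen/rejected field names below follow a common convention for comparison data and are an assumption, not the article's format:

```python
# Sketch: a pairwise judgment sample for "which implementation is correct /
# faster" style tasks, in the common chosen/rejected comparison layout.
def make_judgment_sample(question, good, bad):
    return {"prompt": question, "chosen": good, "rejected": bad}

sample = make_judgment_sample(
    "Which implementation of sum_list is correct?",
    "def sum_list(xs): return sum(xs)",
    "def sum_list(xs): return len(xs)",  # deliberately wrong alternative
)
```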

Keyword extraction

Tasks here can be designed to extract the parameters matching a given description, identify keywords, and identify the relationship between a summary and the code.

Selection

Select suitable code to fill multiple variable slots, or select among multiple code segments to fill in a blank.

Summarization

Extract the code's structural framework, its execution flow, its core implementation, and its implementation logic.
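As an illustration, a structural "framework" target can be extracted mechanically, for instance with Python's `ast` module; this is a sketch of one way to build such summary pairs, not the article's tooling:

```python
import ast

# Sketch: extract a code-structure "framework" (class and function
# signatures) so that source code can be paired with a structural
# summary as a training target.
def extract_framework(source):
    tree = ast.parse(source)
    frame = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            frame.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            frame.append(f"class {node.name}")
    return frame

frame = extract_framework(
    "class Greeter:\n"
    "    def greet(self, name):\n"
    "        return 'hi ' + name\n"
)
```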

Instruction tasks

Single-point tasks

Generation:

NL --> code: generate code from text

NL --> NL + code: generate code plus a description from text

NL + code --> NL: generate the code's logic, process, and description; describe the function of the given code

Annotation:

code --> NL: generate comments for code

code --> NL + code: given code, generate implementation ideas and continue the code

code + NL --> code: code fill-in-the-blank, code rewriting

Q&A:

code + NL --> NL + code: given code, answer a question about it

Rewriting:

code + NL --> code: code rewriting

code + NL --> code: code error correction

code + NL --> code: code translation

code + NL --> code: adding functionality to code
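The single-point tasks above typically end up serialized as (instruction, input, output) records. A sketch assuming the common Alpaca-style JSONL layout; the concrete records are made-up examples:

```python
import json

# Sketch: serialize single-point task pairs into instruction-tuning JSONL.
# The (instruction, input, output) schema follows the widely used
# Alpaca-style format.
def to_jsonl(records):
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

records = [
    {"instruction": "Generate code for the description.",   # NL --> code
     "input": "Return the larger of two numbers.",
     "output": "def bigger(a, b):\n    return a if a > b else b"},
    {"instruction": "Add a comment to the code.",           # code --> NL
     "input": "def bigger(a, b): return a if a > b else b",
     "output": "# Return the larger of the two arguments."},
]
jsonl = to_jsonl(records)
```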

Multi-round dialogue / CoT

NL --> NL + code: generate a code framework from requirements

NL --> code: generate a class with multiple functional modules from a requirement description

NL + code --> NL + code: complete the details from a function description and a code framework

NL + code --> NL + code: generate a runnable small project from an input/output description
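A multi-round sample of the "framework first, details second" kind might look like the following; the role/content message schema is the common chat format, and the content is a made-up example:

```python
# Sketch: a multi-round dialogue sample that decomposes
# requirements -> framework -> filled-in details, mirroring the
# CoT progression described above.
conversation = [
    {"role": "user",
     "content": "Write a counter class with inc and value methods."},
    {"role": "assistant",
     "content": "Framework:\nclass Counter:\n    def inc(self): ...\n    def value(self): ..."},
    {"role": "user",
     "content": "Now fill in the details of the framework."},
    {"role": "assistant",
     "content": ("class Counter:\n"
                 "    def __init__(self):\n"
                 "        self.n = 0\n"
                 "    def inc(self):\n"
                 "        self.n += 1\n"
                 "    def value(self):\n"
                 "        return self.n")},
]
```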

{
"code generation prompt": [
"1. Generate code for the following requirement: a function that takes a strategy code and a recommendation count as parameters, fetches the corresponding stocks from the default strategy map, returns the preset stock codes if the strategy code is not found, and limits the result to the requested recommendation count."
],
"code completion prompt": [
"1. I have an unfinished piece of code and need you to complete it: def useDefaultSymbolPool = { row, strategyCode, recommendCount -> your completion goes here"
],
"code continuation prompt": [
"1. Given the code below, can you continue its logic: def ASSET_ID = 'finscprod.chooseStockCard'"
],
"code error-correction prompt": [
"1. There is a bug in this code; can you find and fix it: def useDefaultSymbolPool = { row, strategyCode, recommendCount -> def strategyDecisionDrm = row.get('strategyDecisionDrm') as StrategyDecisionStorm"
],
"code commenting prompt": [
"1. Please add comments to the following code: def useDefaultSymbolPool = { row, strategyCode, recommendCount -> def strategyDecisionDrm = row.get('strategyDecisionDrm') as StrategyDecisionDrm"
],
"code understanding prompt": [
"1. Please explain what this code does: def useDefaultSymbolPool = { row, strategyCode, recommendCount -> def strategyDecisionDrm = row.get('strategyDecisionDrm') as StrategyDecisionDrm"
],
"code Q&A prompt": [
"1. In this code, how does the getOrDefault method work?"
],
"comment-to-code prompt": [
"1. Generate code from the following comment: 'Use the default stock pool: fetch stock codes from the strategy-decision DRM according to the given strategy code and recommendation count; if none are found, return the preset stock codes, limited to the given count.'"
],
"function-description-to-code prompt": [
"1. Generate code from the following functional description: 'Create a function that takes three inputs (row, strategy code, recommendation count), fetches the default strategy map from strategyDecisionDrm, returns the default stock codes if the strategy code does not exist, and returns no more than the recommendation count.'"
],
"summary-to-code prompt": [
"1. Generate code from the following summary: 'This function uses the default stock pool: based on the input strategy code and recommendation count it extracts stock codes from the strategy-decision DRM; if no matching strategy code is found, it returns a predefined stock list, limited by the recommendation count.'"
],
"code function description prompt": [
"1. Provide a functional description for the code below: def useDefaultSymbolPool = { row, strategyCode, recommendCount -> def strategyDecisionDrm = row.get('strategyDecisionDrm') as StrategyDecisionDrm"
],
"code comment generation prompt": [
"1. Generate a code comment from the given functional description: 'This function fetches stock codes from the default strategy-decision DRM map; if none can be found for the given strategy code, it returns a preset array of stock codes, returning no more than the given recommendation count.'"
],
"function-description-to-code CoT prompt": [
"1. Convert the following functional description into concrete code tasks (CoT): 'We need a function that fetches the strategy code from the default strategy map; if no matching strategy code is found, return the preset stock codes, limited to the requested recommendation count.'"
]
}

example:

You can analyze a piece of code at the following three levels, design suitable prompts to extract and generate data, and thereby train the model's overall capabilities.

1. Requirement description, API description, code implementation, magic example
2. The API description includes: input, output, rpc, service
3. rpc is the call-chain link; service is the function implementing the service
4. The code implementation includes the imported packages, the functions implementing each functional block, the various small services supporting the main service, and the pipeline's serial execution flow
5. The magic example is a reference implementation of the service.
Questions:
a. What functions does this code implement? Which functional modules does it contain, and what are their calling sequence and logic?
b. What is the implementation logic of each functional module, and what are its input/output parameters and their types?
c. What parts does the API description include, and what is the role of each part?
d. Generate detailed comments for each function.
e. Given a service name or rpc name, extract its implementation code.
f. Generate a description of the code's execution flow.
g. Given a description, generate the source code.
h. Given the code's flow, generate framework code.
i. Given code, modify it as required.
j. Code Q&A: where is a given function implemented, what is its principle, and what parameters does it take?

summary:

This article comprehensively sorts out what is needed to put large code generation models into practice. It gives an actionable overview across model, data, and task design; lists the current mainstream code generation models, the data available for training, and the tasks to consider in order to make the model capable; and gives an example to illustrate how to design data.

1. The model is considered from two dimensions: pre-training vs. instruction training, and the differing needs of enterprise applications.

2. For data design, it covers two parts, pre-training data and instruction data, and subdivides each at a finer granularity.

3. For task design, it likewise covers two parts: single-point task design and multi-round dialogue/CoT.

4. Model training itself will be covered in the next article.

 


Origin blog.csdn.net/liangwqi/article/details/132416219