In-depth: a detailed explanation of the multi-task fine-tuning technology in the MFTCoder paper

Code LLMs have become a specialized research field in which pre-trained models are fine-tuned with code-related data to improve their coding capabilities. Previous fine-tuning methods are usually customized for specific downstream tasks or scenarios, which means that each task must be fine-tuned separately, requiring a large amount of training resources and making the resulting collection of models difficult to maintain and deploy. Furthermore, these methods fail to exploit the intrinsic connections between different coding tasks.

To overcome these limitations, we propose a multi-task fine-tuning framework - MFTCoder, which can perform fine-tuning on multiple tasks simultaneously and in parallel. By combining multiple loss functions, we effectively solve the common challenges in multi-task learning such as imbalanced data volume, varying difficulty, and inconsistent convergence speeds between tasks.

Extensive experimental results show that our multi-task fine-tuning method outperforms both fine-tuning each task individually and fine-tuning on a mixture of all tasks. In addition, MFTCoder offers efficient training features, including efficient data tokenization modes and support for PEFT fine-tuning, which significantly speed up fine-tuning and reduce resource requirements.

MFTCoder has been adapted to support multiple mainstream open-source LLMs, such as Llama-1/2, CodeLlama, Qwen, CodeGeeX2, StarCoder, Baichuan2, ChatGLM2/3, GPT-NeoX, etc. Using CodeLlama as the base model, CodeFuse-CodeLLama-34B fine-tuned with MFTCoder achieves a pass@1 score of 74.4% on HumanEval, exceeding the performance of GPT-4 (67%, zero-shot, March 2023).

The paper describing the technical details of MFTCoder has been released on arXiv: https://arxiv.org/pdf/2311.02303.pdf ; the corresponding code has also been open-sourced on GitHub: https://github.com/codefuse-ai/MFTCoder . This article provides a detailed technical interpretation of the MFTCoder paper.

1. Introduction

The emergence of ChatGPT and GPT-4 has triggered an explosion in the research and development of large language models (LLMs), which in turn has ignited a wave of work on applying large models to code generation and understanding. This branch is known as code LLMs. By pre-training on large amounts of code data (such as public GitHub data) and natural-language text, code LLMs can effectively handle a variety of code-related tasks, such as code completion, code generation from descriptions, adding comments to code, explaining code functionality, generating unit test cases, fixing code, translating code, and so on.

Although the pre-training phase of (code) LLMs is designed to ensure generalization to different downstream tasks, the subsequent fine-tuning phase is usually performed for a single specific task or scenario. This approach ignores two key challenges. First, it involves resource-intensive individual fine-tuning for each task, which hinders efficient deployment in production; second, the interrelatedness of code-domain tasks suggests that joint fine-tuning can improve performance compared to fine-tuning each task separately. It is therefore crucial to perform multi-task fine-tuning that handles all tasks simultaneously and uses the strengths of related tasks to enhance one another.

To illustrate, imagine we have two related tasks: code completion and code summarization. Code completion predicts the next line of code from a partial code snippet, while code summarization aims to generate a concise, readable summary of a given code snippet. Traditionally, each task is fine-tuned separately, resulting in resource-intensive duplication. Yet there is an inherent connection between the two: completing a code snippet relies on an understanding of its overall functionality and purpose, while generating an accurate summary requires understanding its structure, dependencies, and intended behavior. With multi-task learning, a single model can be trained on both tasks jointly, leveraging shared knowledge and patterns to improve performance on both. The model learns the contextual dependencies between code elements, which helps it predict the next code snippet and generate informative summaries. Multi-task learning also provides benefits beyond per-task performance: shared representations help mitigate overfitting, promote better generalization, and improve the model's ability to cope with task-specific data scarcity. If code completion has a larger training dataset than code summarization, the model can leverage the abundant completion data to improve summarization, effectively addressing the data-scarcity challenge. Multi-task learning even enables models to handle unseen but related tasks without specific training data. Overall, multi-task learning allows models to jointly learn multiple related tasks, benefit from shared knowledge, improve performance, enhance generalization, and cope with data scarcity.

Despite the importance of multi-task learning, only a few existing studies have explored it in natural language processing (Raffel et al., 2023; Aghajanyan et al., 2021; Aribandi et al., 2022). These studies combine multi-task data for large-model training without explicitly distinguishing between tasks. Worse, they tend to prioritize tasks with larger sample sizes and neglect those with smaller sample sizes. Furthermore, they fail to ensure comparable convergence speeds across tasks, resulting in some tasks being over-optimized and others under-optimized.

This article focuses on multi-task fine-tuning (MFT) for large models, aiming to give tasks with different sample sizes equal attention and comparable optimization. Although our method is not limited to code LLMs, we focus on them here because downstream tasks in the code domain are often closely related, which is also the origin of the name MFTCoder. We emphasize that MFTCoder can easily be extended to any set of related NLP tasks. To improve the efficiency of MFTCoder, we adopt parameter-efficient fine-tuning techniques, including LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023). Experimental results show that multi-task models trained with the MFT method outperform models fine-tuned for each task individually and models fine-tuned on merged data from multiple tasks. We further adapted and verified the effectiveness of MFTCoder on various popular pre-trained LLMs, such as Qwen, Baichuan2, Llama, Llama 2, StarCoder, CodeLlama, CodeGeeX2, etc. Notably, when using CodeLlama-34B-Python as the base model, the CodeFuse-CodeLLama-34B model fine-tuned with MFTCoder achieved a pass@1 score of 74.4% on the HumanEval evaluation set, exceeding the performance of GPT-4 (67%, zero-shot, March 2023).

The main contributions of the article are summarized as follows: 

  • We propose MFTCoder, which applies multi-task learning to the fine-tuning of code LLMs, focusing on the data imbalance and inconsistent convergence speeds that plague previous multi-task fine-tuning approaches.
  • Extensive experiments show that the MFT method outperforms both individual fine-tuning and mixed-task fine-tuning. Based on the CodeLlama-34B-Python base, the CodeFuse-CodeLLama-34B model fine-tuned with MFTCoder achieved a pass@1 score of 74.4% on the HumanEval evaluation set, exceeding GPT-4 (67%, zero-shot); we have open-sourced both the model and a high-quality instruction dataset.
  • We adapted and verified MFTCoder on multiple popular LLMs, including Qwen, Baichuan2, Llama, Llama 2, StarCoder, CodeLlama, CodeFuse, and CodeGeeX2, demonstrating its compatibility and scalability across different base models.

 

2. Method

Figure 1: MFT architecture diagram

 

2.1 Framework

The overall framework of MFTCoder is shown in Figure 1. It covers multi-task support, multi-model adaptation, high-quality dataset construction, efficient data usage, efficient training methods, and multi-task balancing design.

  • (Multi-task) MFTCoder is designed to seamlessly adapt LLMs to different scenarios and maximize their performance in each. When applying MFTCoder to a new scenario, the first step is to break the scenario down into smaller tasks that correspond to the target capabilities. For example, in the code-LLM domain, the overall goal of enhancing a model's coding ability can be decomposed into finer-grained tasks such as code completion, text-to-code generation, unit test case generation, code repair, code debugging, and even cross-language translation. Our extensive practical experience shows that MFTCoder handles anything from a single task to tens or even hundreds of tasks.
  • (Dataset construction and efficient training) After task decomposition, the next step is to collect and organize fine-tuning datasets for each task. However, data collection for some tasks can be challenging. To overcome this, MFTCoder uses Self-Instruct (Wang et al., 2022) and Agent techniques to generate instruction datasets. Multi-task fine-tuning usually means a larger amount of training data in a single fine-tuning run, so to keep training efficient, MFTCoder provides two efficient data tokenization modes and supports PEFT (Parameter-Efficient Fine-Tuning).
  • (Task-balancing design) To address the common challenges in multi-task learning, namely imbalanced data volume, varying difficulty, and inconsistent convergence speeds across tasks, MFTCoder introduces or adapts different loss functions to achieve task balance.
  • (Multi-model adaptation) Different large models have different strengths and capabilities. To support choosing a suitable base model on demand for the best fine-tuning results, MFTCoder has been adapted to several mainstream open-source LLMs, including Llama, Llama 2, CodeLlama, Qwen, Baichuan 1/2, ChatGLM 2, CodeGeeX 2, GPT-NeoX, CodeFuse-13B, StarCoder, AntLLM, and more. We continue to adapt it to new models.

 

2.2 Instruction data set construction

For tasks where data collection is difficult, we use the Self-Instruct technique to generate fine-tuning data for downstream code-related tasks in MFTCoder. This involves providing customized prompts to GPT-3.5 or GPT-4 that clearly describe our instruction-generation requirements, and letting the model generate instruction data. Furthermore, inspired by the PHI-1/1.5 work (Gunasekar et al., 2023), we also apply the Self-Instruct technique to generate high-quality code-exercise datasets for downstream code-related tasks.

In terms of implementation, we have two options. One is an autonomous multi-round dialogue scheme driven by Agents (such as Camel (Li et al., 2023c)); the other is a single-round dialogue scheme implemented by directly calling the ChatGPT API. In the multi-round dialogue scheme, we use Camel to launch two Agents, each assigned a specific role and task objective, and drive a dialogue between them to generate instruction data consistent with a given topic. For example, when generating Python exercise data, we designate the two Agents as a "teacher" (simulating the user role of ChatGPT) and a "student" (simulating the assistant role of ChatGPT). The teacher's responsibility is to issue instructions for generating practice problems, and the student's task is to provide solutions to those instructions. This iterative process continues to produce exercises until the task requirements are met or the maximum input length of ChatGPT is reached. Notably, to fit within ChatGPT's input-length limit, we cannot directly use broad topics as task topics. For example, when creating exercises to assess mastery of the Python language, we break the topic into smaller, specific Python knowledge points (such as binary search trees) and launch a separate Camel session for each knowledge point. A specific example is shown in Figure 2 below (extracted from Appendix A of the paper https://arxiv.org/pdf/2311.02303.pdf), and a simplified sketch of this loop is given after Figure 2.

Figure 2: Example system and user prompt settings for code exercise generation via Camel Agents
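To make the multi-round scheme concrete, here is a minimal sketch of the teacher/student role-play loop, written against the OpenAI chat API directly rather than the Camel library. The role prompts, model name, and stopping rule are illustrative assumptions, not the exact prompts used in the paper.

```python
# Minimal sketch of the teacher/student role-play loop (illustrative; the paper uses Camel Agents).
# Assumptions: OpenAI-compatible API, simplified prompts, fixed round limit instead of ChatGPT's length limit.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEACHER_SYSTEM = ("You are a teacher. For the topic '{topic}', give one Python exercise "
                  "instruction per turn. Reply DONE when enough exercises have been produced.")
STUDENT_SYSTEM = "You are a student. Write a complete Python solution for the given exercise."

def chat(system, history):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content

def generate_exercises(topic, max_rounds=5):
    samples, teacher_history = [], []
    for _ in range(max_rounds):
        # Teacher proposes the next exercise instruction for this knowledge point.
        instruction = chat(TEACHER_SYSTEM.format(topic=topic), teacher_history + [
            {"role": "user", "content": "Please give the next exercise."}])
        if "DONE" in instruction:
            break
        # Student answers the instruction; the (instruction, solution) pair becomes one sample.
        solution = chat(STUDENT_SYSTEM, [{"role": "user", "content": instruction}])
        samples.append({"instruction": instruction, "output": solution})
        teacher_history.append({"role": "assistant", "content": instruction})
    return samples

print(generate_exercises("binary search trees")[:1])
```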

 

The multi-round dialogue scheme offers a higher degree of automation, but it is costly because two Agents must be maintained and each needs to make multiple rounds of ChatGPT API calls. To alleviate this, we propose a more cost-effective single-round generation scheme, whose overall process is shown in Figure 3. We first create an initial seed set consisting of hundreds of basic Python knowledge points. These seeds are then combined with prepared fixed prompt templates to produce a set of patterned task prompts. To counter the loss of diversity caused by fixed templates and to ensure accurate prompt descriptions, we use Camel's task-prompt refinement feature to obtain accurate and diverse task prompts. Each task prompt is used to generate a set of instructions related to the corresponding seed (such as exercise problems about binary search trees). Using ChatGPT, we then generate the solutions to these instructions. Finally, we assemble and deduplicate the instructions and their corresponding solutions to obtain the training dataset. We have open-sourced the Python Code Exercises dataset built with this approach ( https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k ); a simplified sketch of the pipeline follows Figure 3.

 

Figure 3: Single-round generation pipeline for the Code Exercises instruction dataset
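And a minimal sketch of the single-round pipeline: seed knowledge points are combined with a fixed prompt template, one ChatGPT call returns the exercise plus its solution, and results are de-duplicated. The template wording, model name, record keys, and hash-based de-duplication are illustrative assumptions, not MFTCoder's actual data format.

```python
# Sketch of the single-round generation pipeline (Figure 3); prompt wording and dedup rule are illustrative.
import hashlib
from openai import OpenAI

client = OpenAI()

SEEDS = ["binary search tree", "list comprehension", "decorators"]  # in practice: hundreds of knowledge points
TEMPLATE = ("Create one Python exercise about '{seed}'. "
            "Return the problem statement, then a reference solution in a fenced Python code block.")

def generate_dataset(seeds):
    dataset, seen = [], set()
    for seed in seeds:
        prompt = TEMPLATE.format(seed=seed)  # optionally refined for diversity before use
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content
        # Deduplicate on a hash of the normalized text (simplified stand-in for the paper's dedup step).
        key = hashlib.md5(text.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            dataset.append({"seed": seed, "instruction": prompt, "response": text})
    return dataset

data = generate_dataset(SEEDS)
print(len(data))
```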

 

2.3 Efficient Tokenization Mode

Figure 4: Comparison of how the data in a batch is laid out under the three tokenization modes

 

In the pre-training and fine-tuning of LLMs, tokenization is a key step in which input and output text is split into smaller units for subsequent use. Together with the loss function, tokenization defines how the data is used during training and therefore plays a key role in both model effectiveness and training efficiency. In a typical SFT (Supervised Fine-Tuning) tokenization scheme, samples in the same batch are uniformly aligned to the model's maximum input length (seq-length), with additional padding tokens added when a sample is too short, as shown in Figure 4(a). In practice, however, we found that this approach produces a large number of padding tokens. For example, when using the tokenizer of CodeFuse-13B (Di et al., 2023) to process the data of 35 downstream tasks, the average proportion of padding tokens is 92.22% (with seq-length 4096). This means a large number of tokens exist only for alignment and contribute nothing to training, which reduces training efficiency and wastes the storage used for offline tokenization results. To solve this problem, we adopted and optimized two tokenization modes: dynamic padding mode and packing mode.

In dynamic padding mode, the micro-batch window size on each GPU is determined by the longest sample within that micro batch. Shorter samples are padded with additional padding tokens to align to that size, as shown in Figure 4(b). Although padding tokens do not affect the model's training results, they increase the computational overhead during training and therefore slow it down; dynamic padding effectively reduces the proportion of padding tokens and thus speeds up training. In our experience, this approach achieves roughly a 2x speedup compared to the traditional SFT tokenization mode (the actual gain depends on the dataset). Note that this mode only applies to online tokenization scenarios.
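A minimal PyTorch sketch of the dynamic padding idea, assuming right-padding and an illustrative pad token id: each micro batch is padded only to the length of its own longest sample instead of the full seq-length.

```python
# Dynamic padding: pad each micro batch only to its own longest sample (illustrative sketch).
import torch

def dynamic_padding_collate(batch_token_ids, pad_id=0):
    """batch_token_ids: list of token-id sequences of different lengths."""
    max_len = max(len(ids) for ids in batch_token_ids)      # window size = longest sample in this micro batch
    input_ids, attention_mask = [], []
    for ids in batch_token_ids:
        pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * pad)         # right-pad with pad tokens
        attention_mask.append([1] * len(ids) + [0] * pad)    # mask out pad positions
    return torch.tensor(input_ids), torch.tensor(attention_mask)

ids, mask = dynamic_padding_collate([[5, 6, 7], [8, 9], [1, 2, 3, 4]])
print(ids.shape)  # torch.Size([3, 4]) instead of [3, seq_length]
```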

Unlike dynamic padding mode, which improves efficiency by shrinking the micro-batch window, packing mode starts from the goal of maximizing utilization of the model's maximum input window (seq-length). It is similar to the SFT tokenization approach used for Llama 2 (Touvron et al., 2023b). In packing mode, multiple fine-tuning samples are packed sequentially into one window of length seq-length, with adjacent samples separated by an EOS token, as shown in Figure 4(c). There, samples 1-4 of Figure 4(a) are combined and placed sequentially in one window. If a sample cannot fit completely into the current window, it is placed in the next window and the remaining space is filled with padding tokens: in Figure 4(c), sample 5 is placed in the second window, whose tail is filled with padding tokens, and sample 6 is placed in the third window. Compared with dynamic padding, packing further reduces the proportion of padding tokens and thus further improves training speed. Our experience shows that on the 35 tasks mentioned above, this method reduces the average proportion of padding tokens to below 10%, significantly increasing training speed while maintaining training quality. It should be emphasized that MFTCoder supports both online and offline packing tokenization, serving not only the SFT stage but also the pre-training stage.
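The packing mode can be sketched as follows (an illustrative greedy packer, not MFTCoder's implementation): whole samples are appended into fixed seq-length windows separated by an EOS token, a sample that does not fit starts a new window, and leftover tail space is padded.

```python
# Packing mode sketch: pack whole samples into fixed seq_length windows, separated by EOS (illustrative).
def pack_samples(samples, seq_length, eos_id, pad_id):
    windows, current = [], []
    for ids in samples:                       # each sample is a list of token ids (already tokenized)
        piece = list(ids) + [eos_id]          # EOS separates adjacent samples
        if len(piece) > seq_length:
            continue                          # oversized samples would need truncation; skipped here
        if len(current) + len(piece) > seq_length:
            windows.append(current + [pad_id] * (seq_length - len(current)))  # pad the leftover tail
            current = []
        current.extend(piece)
    if current:
        windows.append(current + [pad_id] * (seq_length - len(current)))
    return windows

packed = pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_length=8, eos_id=2, pad_id=0)
# -> [[1, 2, 3, 2, 4, 5, 2, 0], [6, 7, 8, 9, 2, 0, 0, 0]]
```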

 

2.4 PEFT efficient fine-tuning

Currently popular open source LLMs usually contain billions or even tens of billions of parameters, and multi-task learning scenarios usually involve a large number of tasks, which means that a large number of fine-tuning samples will be involved in training. If we choose to use a large amount of data to fully fine-tune these large models, we will face two challenges: first, a large amount of storage and computing resources are required; second, we may face the risk of catastrophic forgetting during the training process. In order to solve these problems, MFTCoder adopts PEFT (Parameter-efficient fine-tuning) technology (Houlsby et al., 2019), which enables efficient fine-tuning to be achieved in a short time with minimal resource requirements.

Figure 5: Schematic of LoRA's core idea

 

Specifically, MFTCoder supports two PEFT methods: LoRA (Low-Rank Adaptation of large language models) (Hu et al., 2021) and QLoRA (quantized LoRA) (Dettmers et al., 2023). The basic idea of LoRA is simple, as shown in Figure 5: a bypass branch is added to the original model. During training, the original pre-trained weight matrix W ∈ R^(d×d) stays frozen, and only the down-projection matrix A ∈ R^(r×d) and the up-projection matrix B ∈ R^(d×r) are trainable. After training, the product BA is added to the original parameters W to obtain the newly trained weights. Since r is much smaller than d, the number of trainable parameters is drastically reduced. Building on LoRA, QLoRA uses a novel 4-bit quantization format called NF4 together with double quantization to quantize the pre-trained model to 4 bits, and then fine-tunes a set of learnable low-rank adapter weights by back-propagating gradients through the quantized weights. As a result, QLoRA can fine-tune larger models with fewer GPU resources; for example, MFTCoder can fine-tune a 70B model on a single NVIDIA A100 (80GB) GPU.
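To make the W + BA idea concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer following the standard LoRA formulation (the rank, scaling, and initialization choices are illustrative and not necessarily identical to MFTCoder's implementation).

```python
# Minimal LoRA linear layer: y = W x + (alpha/r) * B(A x), with W frozen and only A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():          # freeze the pretrained weight W (and bias)
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection, r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection, d_out x r (zero init => no change at start)
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65,536 instead of 4096 * 4096 ≈ 16.8M
```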

 

2.5 Multi-task balanced loss function

As a multi-task learning framework, MFTCoder faces major challenges such as unbalanced data volume, varying difficulty and different convergence speeds between tasks. To address these challenges, MFTCoder employs a set of loss functions specifically designed to mitigate these imbalance issues.

First, to address data imbalance, MFTCoder ensures that within a single epoch every sample from every task is used exactly once. To prevent the model from being biased toward tasks with more data, we introduce a weighting scheme into the loss computation. Specifically, we support two weighting schemes: one based on the number of samples per task, and the other based on the number of valid tokens per task that enter the loss computation. The former is more straightforward but can perform poorly on tasks with extreme mismatches between sample count and valid-token count (such as binary classification tasks with "yes" or "no" answers, or single-choice exam tasks). The weighting scheme based on the number of valid tokens entering the loss computation alleviates this problem. The weighted loss function is given in formula (1), where N is the total number of tasks, M_i is the number of samples of the i-th task, T_ij is the number of valid tokens (i.e., tokens involved in the loss computation) in the j-th sample of the i-th task, and t_ijk is the k-th valid token of the j-th sample of task i.
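Based on this description, the token-balanced loss of formula (1) can be written as follows (our LaTeX rendering of the description above, where p_θ(t_ijk) denotes the probability the model assigns to the k-th valid token):

$$\mathcal{L}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{M_i}\sum_{k=1}^{T_{ij}}-\log p_{\theta}(t_{ijk})}{\sum_{j=1}^{M_i}T_{ij}}$$

That is, the cross-entropy is first averaged over all valid tokens of a task, and each task then contributes equally (with weight 1/N) to the total loss, regardless of how many samples or tokens it contains.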

To address the varying difficulty of tasks, we borrow the idea of Focal Loss and incorporate it into MFTCoder. We implement Focal Loss at two different granularities: one operates at the sample level, as shown in Equation (2), and the other at the task level, as shown in Equation (3).
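For reference, the generic focal-loss form being borrowed is, for a predicted probability p of the correct target and a focusing parameter γ ≥ 0 (this is the standard focal loss, not a verbatim copy of the paper's equations (2) and (3), which apply the reweighting at the sample level and the task level respectively):

$$\mathrm{FL}(p)=-(1-p)^{\gamma}\log(p)$$

Hard samples or tasks (small p) keep nearly the full loss, while easy ones (p close to 1) are down-weighted by the factor (1-p)^γ.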

To address inconsistent convergence speeds, we borrow the idea of the FAMO (Fast Adaptive Multitask Optimization) method (Liu et al., 2023) and innovatively apply it to the validation loss. First, we assume each task (indexed by i) has its own original loss function L_i(θ). In the t-th iteration, we update the weight of each task according to the gradient of its validation loss, with the goal of maximizing the weight w_i of the task with the slowest convergence speed, as shown in formula (4), where g_t is the gradient of the weighted validation loss over all tasks, c_i(α, g_t) is the slope (gradient) of the i-th task's validation loss, θ_t denotes the network parameters at iteration t, α is the learning rate, and ε is a small constant to prevent division by zero. To further explain how convergence balance is achieved: to ensure tasks converge at similar speeds, we introduce a dynamic balancing mechanism. In each iteration, we update the task-specific weights based on the gradients of the tasks' validation losses. This gives more attention to tasks that converge more slowly, allowing them to exert a greater influence on the overall optimization. By dynamically adjusting task weights, we create a balanced convergence scenario in which all tasks progress toward their optimum at a similar rate, which effectively resolves the problem of differing convergence speeds and enhances the overall stability and performance of the MFTCoder framework.
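The following sketch illustrates only the spirit of this dynamic reweighting: tasks whose validation loss improves the least receive larger weights in the next iteration. It is a simplified illustration, not the exact FAMO update of formula (4).

```python
# Simplified illustration of convergence balancing: tasks whose validation loss improves
# the least get larger weights in the next iteration. NOT the exact FAMO update of formula (4).
import numpy as np

def update_task_weights(prev_val_losses, curr_val_losses, temperature=1.0):
    prev = np.asarray(prev_val_losses, dtype=float)
    curr = np.asarray(curr_val_losses, dtype=float)
    improvement = (prev - curr) / (prev + 1e-8)      # relative convergence speed per task
    logits = -improvement / temperature              # slower-converging tasks -> larger logits
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()                   # normalized task weights for the next step

# The third task barely improved (0.60 -> 0.59), so it receives the largest weight next iteration.
print(update_task_weights([1.00, 0.80, 0.60], [0.90, 0.75, 0.59]))
```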

By combining these loss functions, MFTCoder can effectively meet the needs of various multi-task scenarios and alleviate the challenges of imbalanced task data, uneven difficulty, and inconsistent convergence speeds that are common in existing large-scale multi-task learning work. As a flexible framework, MFTCoder provides effective solutions to these problems and supports the development of more efficient and accurate multi-task models.

 

3. Experiment

In this section, we will use MFTCoder to conduct multiple sets of experiments to verify the effectiveness and superiority of the MFT method. Specifically, we aim to answer the following three research questions:

  • RQ1: Is the MFT model obtained by fine-tuning multiple tasks using the MFT method better than the SFT-S (single task) model obtained by fine-tuning each task individually?
  • RQ2: Is the MFT model better than the SFT-Mixed (mixed task) model, which is obtained by fine-tuning multiple tasks as one task? 
  • RQ3: Is the MFT model better than the SFT-Mixed model in terms of generalization ability to unseen tasks?

Next, we first introduce the experimental setup. We then present and delve into the experimental results. Finally, we will summarize and answer the research questions posed in this section.

3.1 Experimental setup

To answer these three research questions, we selected 5 code-related downstream tasks and prepared the corresponding fine-tuning data, as shown in Table 1. Table 1 lists the target ability (column III) and sample size (column IV) of each task. For example, CODECOMPLETION-TASK aims to improve the model's code completion ability and contains 192,547 fine-tuning samples, while CODETRANS-TASK is designed to enhance the model's code translation ability and contains 307,585 fine-tuning samples. In total we trained 7 models (column I): the SFT-S-* models trained separately on each downstream task, the SFT-MIXED model trained on the data of all 5 tasks mixed together, and the MFT-5TASKS model trained with the MFT method.

 

In the experiments, all models used identical configurations except for the training data. The base model for all of them is CodeLlama-13B-Python (Rozière et al., 2023). Each model was trained on 16 A100 GPUs (80GB memory) with a micro batch size of 8 and a global batch size of 128, using the Adam optimizer (Kingma and Ba, 2017) with an initial learning rate of 2e-4 and a minimum learning rate of 1e-5. We used MFTCoder's QLoRA-INT4 mode for fine-tuning, with a trainable-parameter ratio of 2.52%, and the positions and initial values of the trainable parameters were also identical. All models used the data-balancing loss function (i.e., formula (1)) and the packing tokenization mode. Note that when there is only one task, this loss function reduces to the conventional loss used in standard GPT pre-training. To let each model converge as fully as possible, we terminate training when the validation loss of two consecutive epochs is higher than that of the immediately preceding epoch, and we select the checkpoint of the third-to-last epoch for evaluation.
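The stopping rule described above can be sketched as follows (a direct transcription of the stated criterion; the function name and bookkeeping are illustrative):

```python
# Early stopping rule from the text: stop when the validation loss of two consecutive epochs
# is higher than that of the epoch immediately before them, then keep the third-to-last checkpoint.
def select_checkpoint(val_losses):
    """val_losses[e] is the validation loss after epoch e; returns (stop_epoch, chosen_epoch) or None."""
    for e in range(2, len(val_losses)):
        if val_losses[e - 1] > val_losses[e - 2] and val_losses[e] > val_losses[e - 2]:
            return e, e - 2          # stop at epoch e, evaluate the checkpoint of epoch e-2
    return None                      # keep training

print(select_checkpoint([2.1, 1.7, 1.5, 1.6, 1.65]))  # -> (4, 2): epochs 3 and 4 are worse than epoch 2
```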

3.2 Evaluation set

In this article, we use publicly available and representative code evaluation benchmarks for comparison, including:

  • HumanEval (Chen et al., 2021) is a widely used Python code completion evaluation dataset carefully designed by researchers at OpenAI. 
  • HumanEval-X (Zheng et al., 2023) extends HumanEval to multiple programming languages through translation, enabling multilingual code completion evaluation. 
  • DS-1000 (Lai et al., 2022) focuses on evaluating a model's ability to perform data-science analysis with Python code, covering important libraries such as NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, and Matplotlib. 
  • MBPP (Austin et al., 2021) contains 1,000 Python programming problems constructed through crowdsourcing and mainly evaluates a model's grasp of basic Python. In this study, we selected the 500 problems with IDs 11-510 from MBPP to evaluate the model's ability to generate code from text descriptions.
  • CodeFuseEval ( https://github.com/codefuse-ai/codefuse-evaluation ), built on HumanEval and HumanEval-X, further expands the evaluation scope with Chinese code completion (Chinese docstrings), code translation, and unit test case generation; the corresponding subsets are called CodeFuseEval-CN, CodeFuseEval-CodeTrans, and CodeFuseEval-UnitTest respectively.

For all of the above evaluation sets, we use pass@1 as the evaluation metric in this article.
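For reference, pass@1 here follows the functional-correctness metric of Chen et al. (2021): with greedy decoding (one generation per problem) it is simply the fraction of problems whose generated solution passes all hidden test cases. In the general setting with n samples per problem, of which c pass, the unbiased estimator is:

$$\text{pass@}k=\mathbb{E}_{\text{problems}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$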

3.3 Experimental results

Next, we present the evaluation results of the 7 trained models. For each single-task SFT-S-* model, we focus on testing its specific target capability; for example, for the SFT-S-CODECOMPLETION model trained only on the code completion dataset, we only test its code completion performance. For the SFT-MIXED and MFT-5TASKS models, we evaluate their performance on every task and compare it with the corresponding SFT-S-* model. Specifically, we evaluated the seven models on capability dimensions including code completion, Text2Code, code comment generation, code translation, and unit test case generation.

3.3.1 Code completion

For the code completion task, we used the HumanEval and HumanEval-X evaluation datasets, with pass@1 as the metric. We evaluated 3 models: SFT-S-CODECOMPLETION, SFT-MIXED, and MFT-5TASKS. Their performance on the HumanEval dataset is summarized in Table 2 (column III). The results show that the MFT-5TASKS model trained with the MFT method outperforms the other two; compared with the SFT-MIXED model fine-tuned on mixed task data, its performance is 2.44% higher. Notably, the SFT-MIXED model does not perform as well as the SFT-S-CODECOMPLETION model trained solely on the code completion task.

In addition, we also evaluated the multi-language code completion capabilities of the three models on the HumanEval-X data set, as shown in Table 3 . The MFT-5TASKS model shows excellent performance on Java and Golang, while the SFT-MIXED model performs well on C++ and JavaScript. Overall, the MFT-5TASKS model performs better than the other two models, with an average improvement of 1.22% compared to the SFT-MIXED model.

Overall, in terms of code completion tasks, models trained using the MFT method outperformed models fine-tuned alone and models fine-tuned by mixing multiple tasks.

3.3.2 Text2Code

To evaluate the model's ability to generate code based on descriptions, we selected the MBPP evaluation set and used pass@1 as the evaluation metric. We tested and compared 3 models on the MBPP dataset: SFT-S-TEXT2CODE, SFT-MIXED and MFT-5TASKS, as shown in Table 2 (Column IV). Among these models, MFT-5TASKS shows the highest performance, 2.4% higher than the SFT-MIXED model. Similarly, in the text-to-code generation task, a model obtained by fine-tuning a mixture of multiple tasks showed lower performance than a model fine-tuned for this task alone.

Overall, in terms of text-to-code generation tasks, models trained using the MFT method outperformed models fine-tuned alone and models fine-tuned by mixing multiple tasks.

3.3.3 Code comment generation

The goal of the code comment generation task is to have the model add necessary comments to code, including line comments and interface comments, without modifying the input code itself, making the code more readable and user-friendly. To evaluate this capability, we constructed an evaluation set based on the 500 MBPP test problems (ids 11-510). For each problem, we had the SFT-S-CODECOMMENT, SFT-MIXED, and MFT-5TASKS models generate comments. We then used GPT-4, prompted with good code-commenting criteria, as a judge to decide which model performed best, outputting UNKNOWN when it could not decide. Finally, we counted the number of problems for which each model was judged the best and calculated the corresponding proportions, as shown in Table 4. The MFT-5TASKS model was judged best on 38.8% of the problems, surpassing the second-place SFT-MIXED model by 7.4% and the third-place SFT-S-CODECOMMENT model by 10.8%; for 1.8% of the problems GPT-4 could not decide.

Overall, the model trained with the MFT method exhibits the best performance on the code comment generation task.

3.3.4 Code Translation

The goal of the code translation task is to accurately translate a given source-code fragment into an equivalent fragment in the target language, i.e., ensuring that both implementations have the same functionality. Here we use the code translation subset of the CODEFUSEEVAL evaluation set, which supports bidirectional translation evaluation between Java, Python, and C++. To evaluate the accuracy and functional equivalence of the translation results, we use test cases that are semantically equivalent to the source program and check whether the generated code runs and passes them, i.e., the pass@1 standard. The results of the three models are shown in Table 5: the MFT-5TASKS model performs best on Python-to-Java, Python-to-C++, and C++-to-Java translation; the SFT-MIXED model does best on C++-to-Python, while the SFT-S-CODETRANS model leads on Java-to-Python and Java-to-C++. Overall, the MFT-5TASKS model is superior, averaging 0.93% higher than the SFT-MIXED model and 10.9% higher than the SFT-S-CODETRANS model.

In summary, on the task of code translation, the model trained using the MFT method is better than the models obtained by the other two training methods.

3.3.5 Unit test case generation

The unit test case generation task asks the model to generate a set of unit test cases for a given code fragment (such as a method or class) in order to verify whether the provided implementation is correct. We use the UNITTEST subset of the CODEFUSEEVAL evaluation set as our test dataset, again with pass@1 as the metric: if the model generates test cases for an input sample (code snippet) and the input sample passes all of them, the count of correctly handled samples is increased by 1. Greedy decoding is used throughout the evaluation.

We compared the unit test case generation capabilities of the three models for Python, Java, and JavaScript, as shown in Table 6. The results show that the MFT-5TASKS model is the best at generating unit test cases for Python, 5.73% higher than the second-place SFT-MIXED model and 10.19% ahead of the third-place SFT-S-UNITTEST model. For JavaScript, the MFT-5TASKS model also leads the other models, by 7.93%. For Java, however, MFT-5TASKS is 5.37% higher than SFT-S-UNITTEST but 5.44% lower than SFT-MIXED. Overall, the MFT-5TASKS model still shows the highest performance, with an average improvement of 2.74% over the SFT-MIXED model and 7.83% over the SFT-S-UNITTEST model.

In summary, models trained using the MFT method perform better than single-task models and mixed-task models.

3.3.6 Generalization performance on unseen tasks

In addition to evaluating the models on tasks with training data to answer RQ1 and RQ2, this article also tests whether the MFT model shows better generalization than the mixed-task model on unseen tasks (RQ3). To answer this question, we selected the text-to-SQL generation task as the test target. Data for this task is not included in the training of any of the 7 models; moreover, the task is clearly code-related yet distinct from the 5 existing downstream tasks.

The article uses two evaluation metrics: the BLEU score and the logical accuracy of the generated SQL. BLEU measures the textual similarity between the generated output and the reference answer, while logical accuracy handles cases where the SQL is expressed differently but means the same thing: it measures the proportion of test samples for which the generated SQL statement is syntactically correct and semantically equivalent to the reference answer.

 

The article selected 5 representative Text2SQL datasets: WikiSQL (Zhong et al., 2017), Spider (Yu et al., 2019b), CSpider (Min et al., 2019), CoSQL (Yu et al., 2019a), and BirdSQL (Li et al., 2023d), and randomly sampled 200 examples from each for evaluation. An example test case is shown in Table 7, where the first row shows the fine-tuning data format, similar to the OpenAI ChatML format. For each sampled dataset, the article measured the logical accuracy and BLEU score of the SFT-MIXED and MFT-5TASKS models, as shown in Table 8. According to Table 8, the BLEU score of the MFT-5TASKS model is higher than that of the SFT-MIXED model on every dataset, 2.78 times higher on average, indicating that the outputs of MFT-5TASKS are textually closer to the reference answers. This can also be observed in Table 7, where the MFT-5TASKS model generates cleaner results while the SFT-MIXED model provides more explanations (which may be preferred in some cases). Furthermore, MFT-5TASKS performs better in logical accuracy, with overall accuracy 2.18 times higher than the SFT-MIXED model and 4.67 times higher on the WikiSQL dataset.

 

Numerically, MFT-5TASKS outperforms SFT-MIXED, indicating that the MFT-trained model generalizes better to tasks not seen during training.

 

3.4 Experiment summary

This article selected 5 code-related downstream tasks and trained a total of 7 models: the SFT-S-* models fine-tuned separately on each task, the SFT-MIXED model fine-tuned on a mixture of all task data, and the MFT-5TASKS model trained with the MFT method. The article compares the performance of each model on its target capabilities, and also compares the generalization of the MFT method and the mixed-SFT method on unseen tasks. The conclusions are summarized as follows: 

  • Models trained using the MFT method outperformed models fine-tuned individually for each task, answering RQ1 in the affirmative. 
  • Models trained using the MFT method outperformed models fine-tuned using a mixture of multiple tasks, answering RQ2 in the affirmative. 
  • The model trained using the MFT method shows stronger generalization ability on new unseen tasks than the SFT model fine-tuned using a mixture of multiple tasks.

4. MFTCoder application

Given the excellent performance of the MFT training method, we have adapted MFTCoder to the current mainstream open-source LLMs, including Qwen, Baichuan 1/2, CodeGeeX2, Llama 1/2, CodeLlama, StarCoder, etc.

MFTCoder supports LoRA and QLoRA, which significantly reduces the number of parameters that need to be trained. When adapting and fine-tuning these models, we set the trainable parameters in the range of 0.1% to 5% of the total parameters. Extensive practice has shown that performance does not keep improving as the trainable-parameter ratio grows but quickly saturates; in practice a ratio of no more than 5% is usually enough to approach the performance of full fine-tuning. During these fine-tuning runs we configure 3 to 7 code-related tasks, and we usually use LoRA mode for models below 20B parameters and QLoRA mode for models above 20B. After fine-tuning, we evaluated these models on code completion and Text2Code tasks, as shown in columns III and IV of Table 9. The article also calculates the average improvement of MFT fine-tuning over the base model on the HumanEval and MBPP evaluation sets: as shown in column V, the improvement ranges from 6.26% to 12.75%, and the improvement on HumanEval exceeds that on MBPP.
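As an illustration of what such a LoRA/QLoRA setup typically looks like with the Hugging Face stack (generic transformers/peft/bitsandbytes usage, not MFTCoder's own configuration format; the base model id, rank, and target modules are example values):

```python
# Generic LoRA/QLoRA setup with Hugging Face transformers + peft + bitsandbytes (illustrative;
# not MFTCoder's own config format). Rank, alpha and target modules are example values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(                 # 4-bit NF4 quantization with double quantization (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-Python-hf",         # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(                        # low-rank adapters on the attention projections
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # typically well below 5% of total parameters
```

With QLoRA, the frozen base weights are held in 4-bit NF4 while the small adapter matrices are trained in higher precision, which is what allows models above roughly 20B parameters to be fine-tuned on limited GPU memory.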

 

 

In addition, the article evaluates the code completion performance of models fine-tuned with MFTCoder on the multilingual HumanEval-X benchmark, as shown in Table 10. Notably, the fine-tuned CodeFuse-CodeLLama-Python-MFT (34B) achieves an average pass@1 of 56.88% across four languages (Java, C++, JavaScript, and Golang).

 

In particular, Table 9 also lists the performance of some representative fine-tuned open-source models (such as OctoPack and WizardCoder-Python) and closed-source models (such as Claude 2 and GPT-4) on HumanEval and MBPP. Notably, our fine-tuned CodeFuse-CodeLLama-34B model achieved a pass@1 of 74.4% on HumanEval, surpassing all models listed in the table, including GPT-4 (67.00%, zero-shot, March 2023). For this model, the article also evaluated its performance on other benchmarks, including the multilingual HumanEval-X, MBPP, DS-1000, and CodeFuseEval, and compared it with GPT-3.5 and GPT-4, as shown in Figure 6. CodeFuse-CodeLLama-34B surpasses GPT-4 on CodeFuseEval-UnitTest and HumanEval and is on par with it in code translation, but is slightly inferior to GPT-4 on Chinese code completion (CodeFuseEval-CN), multilingual completion, data-science analysis (DS-1000), and text-to-code generation (MBPP). However, it performs no worse than GPT-3.5 on any of the evaluation datasets.

Figure 6: Radar chart comparing the performance of CodeFuse-CodeLLama-34B with GPT-3.5/GPT-4 on multiple code evaluation sets

 

Furthermore, we conducted an additional evaluation to assess the impact of fine-tuning with MFTCoder and code-related data on NLP performance. Taking CodeFuse-QWen-14B as an example, we compared its NLP performance with the base model QWen-14B and Alibaba Cloud's officially fine-tuned QWen-14B-Chat, as shown in Figure 7. Clearly, CodeFuse-QWen-14B's NLP ability has not degraded; on the contrary, compared with the other two models it performs better on language (AFQMC, CHID, Wic, WSC), reasoning (COPA, CMNLI, OCNLI, AX-b, AX-g, RTE), and understanding (CSL, C3, EPRSTMT) tasks. However, compared with the base model QWen-14B, its comprehensive-subject capabilities (MMLU, C-Eval, ARC-c) declined slightly, and a similar decline also appears in the QWen-14B-Chat model; the detailed data is shown in Table 11. Averaged over multiple tasks (including coding), CodeFuse-QWen-14B improved by 2.56% and 4.82% over QWen-14B and QWen-14B-Chat respectively, while QWen-14B-Chat decreased by 2.26% compared to QWen-14B.

Figure 7: Radar chart comparing the performance of CodeFuse-QWen-14B with Qwen-14B and QWen-14B-chat on NLP and Coding evaluation tasks

 

Table 11: Comparison of CodeFuse-QWen-14B, QWen-14B, and QWen-14B-Chat on NLP tasks.

5. Summary

This article introduced MFTCoder, which brings multi-task learning into the (code) fine-tuning stage of large models and effectively alleviates the challenges of uneven data volumes, varying difficulty, and inconsistent convergence speeds in multi-task learning by designing or applying multiple balanced loss functions. Extensive experimental results show that multi-task fine-tuned models perform better than models fine-tuned individually for each downstream task and models fine-tuned on mixed multi-task data. MFTCoder provides efficient training capabilities, including efficient data tokenization modes and PEFT support, as well as high-quality instruction-dataset construction schemes. In addition, MFTCoder has been adapted to many popular open-source large models; among them, CodeFuse-CodeLLama-34B, based on CodeLLama-34B-Python and fine-tuned with MFTCoder, achieved a pass@1 score of 74.4% on the HumanEval dataset, surpassing GPT-4 (67%, zero-shot, March 2023).

Reference

For full details, see the paper: https://arxiv.org/pdf/2311.02303.pdf . The following are the works cited in this article.

  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:cs.LG/1910.10683 
  • Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. 2021. Muppet: Massive Multi-task Representations with Pre-Finetuning. arXiv:cs.CL/2101.11038 
  • Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. 2022. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning. arXiv:cs.CL/2111.10952
  • Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:cs.CL/2106.09685 
  • Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:cs.LG/2305.14314 
  • Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560 (2022). 
  • Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need.  arXiv preprint arXiv:2306.11644  (2023). 
  • Guohao Li, Hassan Abed Al Kader Hammoud, Hani Itani, Dmitry Khizbullin, and Bernard Ghanem. 2023c. CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society. arXiv:cs.AI/2303.17760
  • Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li , Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, and Xianying Zhu. 2023. CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model. arXiv:cs.SE/2310.06266 
  • Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023). 
  • Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attarian, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. arXiv:cs.LG/1902.00751 
  • Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. 2023. FAMO: Fast Adaptive Multitask Optimization. arXiv:cs.LG/2306.03792 
  • Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023). 
  • Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:cs.LG/1412.6980 
  • Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. In  KDD . 
  • Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2022. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation.  ArXiv  abs/2211.11501 (2022). 
  • Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021). 
  • Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri and et al.. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:cs.LG/2107.03374 

Contact us

MFTCoder has been open-sourced, and the models and datasets mentioned in this article are being open-sourced as well. If you like our work, you are welcome to try it, help correct errors, and contribute code. If you can, please star our project to support us.

 

 
