Microsoft uses GPT-4 for instruction fine-tuning for the first time, further improving zero-shot performance on new tasks


Editors | Du Wei, Chen Ping
Source | Heart of the Machine

Instruction fine-tuning of large models keeps advancing, and this time Microsoft has turned to GPT-4.

We know that from Google's T5 to OpenAI's GPT series, large language models (LLMs) have demonstrated impressive generalization capabilities, such as in-context learning and chain-of-thought reasoning. At the same time, to make LLMs follow natural language instructions and complete real-world tasks, researchers have been exploring instruction fine-tuning methods for LLMs. This is done in two ways: one is to fine-tune models on a wide range of tasks using human-annotated prompts and feedback, and the other is supervised fine-tuning on public benchmarks and datasets augmented with manually or automatically generated instructions.

Among these methods, Self-Instruct fine-tuning is a simple and effective approach: it learns from instruction-following data generated by SOTA instruction-tuned teacher LLMs, so that the student LLMs become aligned with human intent. Instruction fine-tuning has proved to be an effective means of improving the zero-shot and few-shot generalization abilities of LLMs.

Recently, the success of ChatGPT and GPT-4 has opened up huge opportunities for using instruction fine-tuning to improve open-source LLMs. Meta's LLaMA is a family of open-source LLMs whose performance is comparable to proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-Instruct was quickly adopted thanks to its strong performance and low cost. For example, Stanford's Alpaca model uses 52K instruction-following samples generated by GPT-3.5, and the Vicuna model uses about 700K instruction-following samples from ShareGPT.

To advance the SOTA of instruction fine-tuning for LLMs, Microsoft Research used GPT-4 as the teacher model for Self-Instruct fine-tuning for the first time in its paper "Instruction Tuning with GPT-4".


Paper address:
https://arxiv.org/pdf/2304.03277.pdf

Project address:
https://instruction-tuning-with-gpt-4.github.io/

GitHub address:
https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

On the one hand, the researchers released the data generated by GPT-4, including a 52K instruction-following dataset in both Chinese and English, as well as feedback data generated by GPT-4 for rating the outputs of three instruction-tuned models.

On the other hand, an instruction-tuned LLaMA model and a reward model were developed based on the GPT-4-generated data. To assess the quality of the instruction-tuned LLMs, the researchers evaluated test samples with three metrics: human evaluation on three alignment criteria, automatic evaluation based on GPT-4 feedback, and ROUGE-L (an automatic summarization metric) on Unnatural Instructions.

Experimental results validate the effectiveness of using GPT-4-generated data for instruction fine-tuning of LLMs: the 52K Chinese-English instruction-following data generated by GPT-4 leads to better zero-shot performance on new tasks than previous SOTA models. The researchers have made the GPT-4-generated data and the related code public.

Dataset

The study used GPT-4 to generate the following four datasets:

  • English Instruction-Following Data: for the 52K instructions collected from Alpaca, an English GPT-4 answer is provided for each instruction. This dataset is mainly used to explore and compare the statistics of GPT-4 answers versus GPT-3 answers (a minimal sketch of this generation step follows the list).

  • Chinese Instruction-Following Data: the study uses ChatGPT to translate the 52K instructions into Chinese and asks GPT-4 to answer them in Chinese.

  • Comparison Data: GPT-4 is asked to rate its own responses on a scale from 1 to 10. In addition, GPT-4 is asked to compare and score the responses of three models: GPT-4, GPT-3.5, and OPT-IML. This dataset is mainly used to train the reward model.

  • Answers on Unnatural Instructions: GPT-4 answers are decoded on a core dataset of 68K instruction-input-output triples. This subset is used to quantify the gap between GPT-4 and instruction-tuned models.
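As a rough illustration of how the English answers in the first dataset could be collected, the sketch below queries GPT-4 once per Alpaca-style instruction through the OpenAI Python SDK. The file names and prompt format are assumptions made for illustration, not the authors' released pipeline.

```python
# A minimal sketch (not the authors' released pipeline) of collecting GPT-4
# answers for Alpaca-style instructions with the OpenAI Python SDK.
# File names and the prompt format are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(example: dict) -> str:
    """Format an Alpaca-style record (instruction + optional input) as one prompt."""
    if example.get("input"):
        return f"{example['instruction']}\n\n{example['input']}"
    return example["instruction"]

with open("alpaca_52k_instructions.json") as f:  # hypothetical local copy of the 52K instructions
    records = json.load(f)

outputs = []
for example in records:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(example)}],
        temperature=1.0,
        max_tokens=512,
    )
    outputs.append({
        "instruction": example["instruction"],
        "input": example.get("input", ""),
        "output": response.choices[0].message.content,
    })

with open("alpaca_gpt4_answers.json", "w") as f:
    json.dump(outputs, f, ensure_ascii=False, indent=2)
```

The same loop, with the instructions translated to Chinese first, would produce the Chinese instruction-following subset.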


Figure 1 compares the English output response sets of GPT-4 and GPT-3.5. Figures 1(a) and (b) show the verb-noun pairs with frequency higher than 10 in the two output sets, Figure 1(c) compares the 25 most frequent word pairs in the two sets, and Figure 1(d) compares the frequency distributions of sequence lengths, showing that GPT-4 tends to generate longer sequences than GPT-3.5.

[Figure 1: statistics of GPT-4 vs. GPT-3.5 English responses]

Instruction-tuning the language models

Based on the LLaMA 7B checkpoint, the study trained two models with supervised fine-tuning: (i) LLaMA-GPT4, trained on the 52K English instruction-following data generated by GPT-4; (ii) LLaMA-GPT4-CN, trained on the 52K Chinese instruction-following data generated by GPT-4.
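The sketch below shows what such a supervised fine-tuning run could look like with Hugging Face Transformers. It is not the paper's exact recipe; the checkpoint id, prompt template, and hyperparameters are assumptions, and it only illustrates standard causal-LM fine-tuning on instruction-response pairs.

```python
# A minimal supervised fine-tuning sketch with Hugging Face Transformers.
# Not the paper's exact training recipe; paths, the prompt template, and
# hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "huggyllama/llama-7b"          # assumed LLaMA 7B checkpoint id
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# 52K GPT-4 instruction-following records: {"instruction", "input", "output"}
data = load_dataset("json", data_files="alpaca_gpt4_answers.json")["train"]

def format_and_tokenize(example):
    # Concatenate instruction, optional input, and the GPT-4 response into one sequence.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n\n" + example["input"]
    text = prompt + "\n\n### Response:\n" + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-gpt4-sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # mlm=False -> causal-LM labels are the input ids (padding masked out)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```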

Reward model

Reinforcement learning from human feedback (RLHF) aims to align LLM behavior with human preferences, and reward modeling is one of its key components. The problem is often formulated as a regression task that predicts a reward given a prompt and a response. However, this method usually requires large-scale comparison data, and existing open-source models such as Alpaca, Vicuna, and Dolly do not involve RLHF because of the high cost of labeling comparison data. Meanwhile, recent research has shown that GPT-4 is able to identify and fix its own mistakes and accurately judge the quality of responses. To facilitate research on RLHF, this study therefore created comparison data using GPT-4, as described above.

To assess data quality, the study also trained a reward model based on OPT 1.3B for evaluation on this dataset. The distribution of the comparison data is shown in Figure 2.

[Figure 2: distribution of the comparison data]
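For illustration, the sketch below builds a reward model in the spirit described above: an OPT-1.3B backbone with a scalar value head, optimized with a pairwise ranking loss so that the response GPT-4 rated higher receives a higher reward. The scalar head, the loss, and the data layout are common RLHF choices, not necessarily the paper's exact setup.

```python
# A rough sketch of a pairwise reward model on top of OPT-1.3B.
# Architecture and loss are common RLHF choices, assumed here for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "facebook/opt-1.3b"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each (prompt, response) pair with the last non-padding token's state.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(prompt: str, chosen: str, rejected: str) -> float:
    """One pairwise step: the response GPT-4 rated higher should get a higher reward."""
    batch = tokenizer([prompt + "\n" + chosen, prompt + "\n" + rejected],
                      return_tensors="pt", padding=True, truncation=True, max_length=512)
    rewards = model(batch["input_ids"], batch["attention_mask"])
    loss = -F.logsigmoid(rewards[0] - rewards[1]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```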

Experiments

The study used three types of evaluation: human evaluation, GPT-4 evaluation, and evaluation on Unnatural Instructions. The results confirm that, compared with other machine-generated data, using GPT-4-generated data is an efficient and effective way to instruction-tune LLMs. Next, let's look at the specific experiments.

Human evaluation

Figure 3(a) shows the comparison of LLaMA-GPT4 vs. Alpaca: under the Helpfulness criterion, LLaMA-GPT4 wins with a score of 54.12%. Figure 3(b) shows the comparison of LLaMA-GPT4 vs. GPT-4, indicating that the performance of LLaMA fine-tuned with GPT-4 instructions is similar to that of the original GPT-4.

[Figure 3: human evaluation results]

Comparison with SOTA using automatic evaluation

The study used GPT-4 to automatically evaluate the responses of different models on 80 unseen questions. Answers were first collected from two chatbots, LLaMA-GPT4 (7B) and GPT-4, and published answers were obtained from other chatbots, including LLaMA (13B), Alpaca (13B), Vicuna (13B), Bard (Google, 2023), and ChatGPT. For each evaluation, GPT-4 was asked to rate the quality of the two models' responses on a scale from 1 to 10. The results are shown in Figure 4; a sketch of this GPT-4 scoring setup follows.
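Below is a hedged sketch of GPT-4-based pairwise scoring. The judging prompt and the score-parsing rule are illustrative assumptions, not the paper's exact template.

```python
# A sketch of GPT-4-as-judge pairwise scoring on unseen questions.
# The judging prompt and score parsing are assumptions, not the paper's template.
import re
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "You are a helpful and precise assistant for checking the quality of answers.\n"
    "Question: {question}\n\n"
    "Assistant 1's answer:\n{answer_1}\n\n"
    "Assistant 2's answer:\n{answer_2}\n\n"
    "Rate each assistant on a scale of 1 to 10. Reply with the two scores on the\n"
    "first line (e.g. '8 9'), followed by a short explanation."
)

def judge(question: str, answer_1: str, answer_2: str) -> tuple[int, int]:
    """Ask GPT-4 to score two chatbots' responses to the same question."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, answer_1=answer_1, answer_2=answer_2)}],
        temperature=0,
    ).choices[0].message.content
    # Assumes GPT-4 followed the format; pull the two integers from the first line.
    scores = re.findall(r"\d+", reply.splitlines()[0])
    return int(scores[0]), int(scores[1])
```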

[Figure 4: GPT-4 automatic evaluation of the chatbots]

Figure 4(c, d) compares all chatbots. LLaMA-GPT4 performs better: the 7B LLaMA-GPT4 outperforms the 13B Alpaca and LLaMA. However, there is still a gap between LLaMA-GPT4 and large commercial chatbots such as GPT-4.

The researchers further investigated the performance of all chatbots in Figure 5 below. GPT-4 is first used to translate the chatbots' English responses into Chinese, and GPT-4 is also used to translate the English questions into Chinese to obtain answers in Chinese. Comparisons against the Chinese responses translated and generated by GPT-4 are shown in 5(a) and 5(b), and 5(c) shows the results when all models are asked to answer in Chinese.

[Figure 5: evaluation of Chinese responses]

In Figure 6 below, the researchers compare LLaMA-GPT4 with GPT-4 and Alpaca on Unnatural Instructions. The results show that LLaMA-GPT4 and GPT-4 perform better as the ground-truth response length increases, which means they follow instructions better in more creative scenarios. When the sequence length is short, both LLaMA-GPT4 and GPT-4 generate responses that contain the simple ground-truth answer but add extra words to make the response more chat-like.

[Figure 6: comparison with GPT-4 and Alpaca on Unnatural Instructions]
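For reference, the sketch below computes ROUGE-L against the Unnatural Instructions ground truth and averages it within coarse ground-truth length buckets, mirroring the kind of analysis in Figure 6. The bucket edges and data layout are assumptions; it uses the `rouge-score` package.

```python
# A small sketch of ROUGE-L evaluation bucketed by ground-truth response length.
# Bucket edges and the record layout are assumptions for illustration.
from collections import defaultdict
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def bucket(length: int) -> str:
    # Coarse length buckets (in words) for the ground-truth responses.
    if length <= 4:
        return "0-4 words"
    if length <= 9:
        return "5-9 words"
    return "10+ words"

def rouge_by_length(examples):
    """examples: iterable of dicts with 'ground_truth' and 'model_response' strings."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ex in examples:
        key = bucket(len(ex["ground_truth"].split()))
        score = scorer.score(ex["ground_truth"], ex["model_response"])["rougeL"].fmeasure
        sums[key] += score
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```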

Please refer to the original paper for more technical and experimental details.
