Large language model fine-tuning: the difference and connection between Instruction and Question

1. Introduction

In the era of ChatGPT, anyone can easily use a powerful language model, and this has happened far faster than I could have imagined. The credit goes to large language model fine-tuning technology, even though it introduces few genuinely novel elements. With ChatGPT as a guide, many things become easier and simpler. In particular, building on the open-source LLaMA project, many models use LLaMA as the base model and approach alignment with ChatGPT by fine-tuning on specific instruction datasets.

2. The format of fine-tuning data

There are two main routes to reproducing ChatGPT. The first is Anthropic's route with Claude, which essentially replays the ChatGPT recipe from scratch. Its advantage is that, since Anthropic's founders are former OpenAI employees, the reproduction is indeed the most faithful in current evaluations; however, the accumulated expertise and cost this route requires are out of reach for most. The other route is rapid reproduction by distilling from ChatGPT itself, with Alpaca and Vicuna as representative models. Although both use LLaMA as the base model, Vicuna performs noticeably better than Alpaca, for several reasons.

First, Alpaca tries to simulate ChatGPT's generation process at the source, aiming for ChatGPT-like performance through instruction fine-tuning, and it borrows a large model's ability to create its dataset: Alpaca's fine-tuning data is mainly a set of synthetic instructions generated through Self-Instruct. In contrast, the instruction data behind ChatGPT comes entirely from real human users, which makes its instructions higher in quality and closer to the true distribution of user intent. Alpaca merely simulates the process by which such an instruction dataset is generated.

Vicuna, on the other hand, directly distills conversations between users and ChatGPT. This captures the distribution of real user intent, and because the data consists of multi-turn dialogue, the training and testing scenarios are more consistent. In terms of volume, the dataset Vicuna uses is also considerably larger than Alpaca's.
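For concreteness, here is a sample in the ShareGPT-style schema that Vicuna's training conversations are commonly distributed in; the field names ("conversations", "from", "value") follow that schema, while the conversation content itself is made up for illustration:

```python
# A hypothetical multi-turn sample in the ShareGPT-style schema
# ("from"/"value" keys, "human"/"gpt" roles). Content is invented.
sharegpt_sample = {
    "id": "example-001",
    "conversations": [
        {"from": "human", "value": "What is artificial intelligence?"},
        {"from": "gpt",   "value": "Intelligent behavior exhibited by man-made systems."},
        {"from": "human", "value": "Give me a one-sentence example."},
        {"from": "gpt",   "value": "A spam filter that learns to flag unwanted email."},
    ],
}
```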

Beyond these, many models follow these two training paradigms or even combine them; for more information, see LLMZoo. However, looking across the datasets used by the many reproductions of large language models, their formats are far from uniform. In particular, the models benchmarked against Alpaca vary widely in how their datasets are structured.

Some models train on raw question-answer datasets: the data is plain QA pairs, simply fine-tuned on a larger model. Some models use a training set with an Instruction part and an Input part, but with no substantive difference from plain QA, because the Instruction is identical across the entire dataset; this approach has its own problems. Still other models follow Alpaca's approach and contain a genuine Instruction part and Input part. What has always bothered me is the difference and connection between question answering and instruction: how do they differ in their effect on training the overall model? To make the comparison concrete, the three dataset shapes are sketched below.
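The field names in the third shape follow Alpaca's released alpaca_data.json; the first two shapes and all concrete examples are illustrative, not drawn from any real dataset:

```python
# Three dataset shapes commonly seen among ChatGPT reproductions.

# (1) Plain question-answer pairs: no task description at all.
qa_sample = {
    "question": "What is artificial intelligence?",
    "answer": "Intelligent behavior exhibited by man-made systems.",
}

# (2) Instruction + Input, but the instruction is identical on every
# row, so it carries no information beyond the QA format.
constant_instruction_sample = {
    "instruction": "Answer the following question.",  # same on every row
    "input": "What is artificial intelligence?",
    "output": "Intelligent behavior exhibited by man-made systems.",
}

# (3) Alpaca-style: the instruction describes the task and varies per
# sample; the input holds the object to be processed.
alpaca_style_sample = {
    "instruction": "Translate the following sentence into English.",
    "input": "我爱学习人工智能。",
    "output": "I love learning artificial intelligence.",
}
```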

3. The difference between Question and Instruction

Question answering (QA) and instruction following (Instruction) are two of the most common forms of human-computer interaction. QA is a one-question-one-answer format: the user asks a question and the model gives an answer. The Instruction format grows out of prompt engineering and divides a problem into two parts: the Instruction describes the task, and the Input describes the object to be processed. Here are examples of both forms:

The Question Answering (QA) format is used to train a model to provide an answer given a question. QA training data generally consists of questions paired with their corresponding answers. For example:

Q: What is artificial intelligence?
A: Artificial intelligence refers to intelligent behavior, exhibited by man-made systems, that was originally thought to be exhibitable only by humans.

This format is suitable for training question answering systems, or any task that requires a model to understand questions and provide accurate answers.
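As a minimal sketch of how such QA pairs typically become causal-LM training examples, assuming a HuggingFace tokenizer and the common convention of masking the loss on the question tokens; the "Q:/A:" delimiters and the gpt2 tokenizer are illustrative choices, not a fixed standard:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def build_qa_example(question: str, answer: str):
    prompt = f"Q: {question}\nA: "
    prompt_ids = tokenizer(prompt)["input_ids"]
    answer_ids = tokenizer(answer + tokenizer.eos_token)["input_ids"]
    input_ids = prompt_ids + answer_ids
    # Mask the question tokens with -100 so the loss is computed only
    # on the answer tokens, the usual supervised fine-tuning setup.
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}

example = build_qa_example(
    "What is artificial intelligence?",
    "Intelligent behavior exhibited by man-made systems.",
)
```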

The instruction format is used to train the model to perform tasks according to the given instructions. For example:

I: Translate the following sentence into English: 我爱学习人工智能。
O: I love learning artificial intelligence.

Training data in this format is suitable for training generative models, especially when the model is required to perform specific tasks such as translation, writing, code generation, etc.
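At training time, the instruction/input pair is usually rendered into a single prompt string. For reference, the template below is the one released with Stanford Alpaca (Alpaca also ships a second variant for samples that have no input):

```python
# The prompt template released with Stanford Alpaca, which renders an
# instruction/input pair into a single training string.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Translate the following sentence into English.",
    input="我爱学习人工智能。",
)
# The model is trained to continue this prompt with the output,
# e.g. "I love learning artificial intelligence."
```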

Therefore, the training data in the question answering (QA) format is usually used to train the model to answer knowledge-based questions, while the training data in the instruction (Instruction) format is more suitable for training the model to perform specific tasks.

4. The connection between Question and Instruction

However, this is not a hard requirement, as many tasks can be phrased as either questions or instructions. For example, the instruction "Translate the following sentence into English: I love learning artificial intelligence" can be recast as a question: "What is the English translation of the sentence 'I love learning artificial intelligence'?" Conversely, the question "What is artificial intelligence?" can be expressed as an instruction: "Explain the meaning of the following term: artificial intelligence."

Converting questions into instructions may help the model better understand the goal of a task, especially when the task requires specific actions to be performed. For example, the question "Please explain the difference between VC Yinqiao Tablets and Shuanghuanglian Oral Liquid" can be split into the following two parts:

Instruction: Please explain the difference between the following two medicines.
Input: VC Yinqiao Tablets and Shuanghuanglian Oral Liquid.

In this example, the model needs to explain the difference between two medicines. Turning the question into an instruction may make it easier for the model to identify the key parts of the task; here, the names of the two drugs to be compared.

Furthermore, the instruction form may help the model generalize better, because it emphasizes the nature of the task rather than a specific input. In the example "Please explain the difference between the following two medicines. VC Yinqiao Tablets and Shuanghuanglian Oral Liquid", the model may learn to compare and explain any given pair of medicines in the same way, not only VC Yinqiao Tablets and Shuanghuanglian Oral Liquid.

Of course, the choice of format usually depends on specific requirements, including the type of task, the expected behavior of the model, and the availability of training data. For some tasks, mixing training data in both formats may yield the best results; one common normalization is sketched below.
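One common way to mix the two formats is to normalize QA pairs into the instruction schema; Alpaca's own data does something similar, in that samples without a task object simply carry an empty input field. A minimal sketch:

```python
# Normalize a QA pair into the instruction schema so that both kinds
# of data can be mixed in one training set.
def qa_to_instruction(question: str, answer: str) -> dict:
    return {
        "instruction": question,  # the question itself acts as the task
        "input": "",              # no separate object to process
        "output": answer,
    }

mixed_dataset = [
    qa_to_instruction(
        "What is artificial intelligence?",
        "Intelligent behavior exhibited by man-made systems.",
    ),
    {
        "instruction": "Please explain the difference between the following two medicines.",
        "input": "VC Yinqiao Tablets and Shuanghuanglian Oral Liquid.",
        "output": "...",  # placeholder; a real sample would carry the full answer
    },
]
```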

5. Some thoughts on existing models

With the emergence of more and more "copycat" ChatGPT models, instruction fine-tuning has gradually become commonplace, and more and more people have begun to apply the method to downstream tasks for evaluation. Several issues still need further verification:

  1. Is instruction fine-tuning applicable to downstream tasks?
  2. Do existing large models need to be evaluated on the earlier natural language understanding tasks?
  3. Which aspects of model performance does instruction fine-tuning improve?

Regarding the first question, instruction fine-tuning may not be applicable to every downstream task. As the earlier analysis suggests, the purpose of instruction fine-tuning is to get a specific task done, not to do it better; the ability to actually perform the task still has to be learned through language modeling. In essence, current large models are not fundamentally different from earlier ones: even when they exhibit very advanced natural language understanding, that understanding is ultimately realized through language modeling, it just looks more like genuine comprehension of text. Instruction fine-tuning is therefore effective for making a model complete a specific task, but whether it improves performance on that task depends on the task's form. For natural language generation tasks, such as summarization, continuation, translation, or even generative information extraction, the objective is close to language modeling, so continued training helps improve performance. For discriminative tasks, such as classification or multiple-choice questions, a small number of training samples may suffice to make the model output results in the required format, but adding more training samples will not necessarily improve performance; it depends on how close the task is to a generation task. Of course, if all you want is a fixed output format at decoding time, you can use a tool like Guidance, provided you have access to the model's full output probability distribution. A sketch of the underlying idea follows.
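Rather than guessing at Guidance's exact API, the sketch below illustrates the underlying idea directly with HuggingFace transformers: for a discriminative task, score each allowed answer by its log-probability under the model and pick the best, instead of generating freely. The model name, prompt, and labels are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def score_candidate(prompt: str, candidate: str) -> float:
    """Sum the log-probabilities of the candidate's tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    cand_ids = tokenizer(candidate, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = logits.log_softmax(dim=-1)
    total = 0.0
    offset = prompt_ids.shape[1]
    for i in range(cand_ids.shape[1]):
        # The token at position offset+i is predicted by logits at offset+i-1.
        token_id = cand_ids[0, i]
        total += log_probs[0, offset + i - 1, token_id].item()
    return total

prompt = "Review: the movie was wonderful.\nSentiment:"
labels = [" positive", " negative"]
best = max(labels, key=lambda c: score_candidate(prompt, c))
```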

For the second question, I think the evaluation method depends on whether the model is chat-based or not, and on whether the evaluation targets knowledge or interaction. Personally, if the focus is knowledge, it is more important to evaluate the base (non-chat) model, because chat training only teaches the model to interact according to user instructions and says nothing about task quality. Training with reinforcement learning from human feedback may improve performance, provided the feedback is instructive rather than a mere alignment of values and preferences. Chat models, by contrast, should be evaluated on interaction and dialogue; pure question-answer evaluation may not suit them well. I am currently researching this issue as well.

As for the third question, which aspects of performance instruction fine-tuning improves: from the discussion above, my preliminary conclusion is that instruction fine-tuning significantly enhances a model's ability to perform tasks according to instructions, while the quality of task execution may still require continued language-modeling training. Regarding all three questions, I believe research on instruction tuning has already reached some conclusions, but they may become more pressing as the ChatGPT series of models is reproduced.

As for other research methods, you can check the papers accepted at recent conferences such as ACL and EMNLP to understand the current research progress.

6. Summary

Instruction fine-tuning in the ChatGPT era is developing rapidly, giving us a more precise and efficient way to train models to perform specific tasks. Some issues remain to be verified, such as the applicability of instruction fine-tuning to downstream tasks, how the performance of existing large models should be evaluated, and exactly which aspects of performance instruction fine-tuning improves, but these issues also open up opportunities for research and exploration.

Through continued practice and research, we can better understand the nature and potential limitations of instruction fine-tuning, and further explore its applicability across tasks and scenarios. This exploration will deepen our understanding and application of language models, opening up new possibilities for building smarter, more flexible models.

In future research, we can also keep an eye on the latest papers and results from conferences to follow new progress and innovative ideas on instruction fine-tuning. This will help push the technology forward and achieve greater breakthroughs in the field of language models.

In short, instruction fine-tuning has brought new opportunities and challenges in the ChatGPT era. Through continued research and exploration, we will better understand and apply this technology, opening broader prospects for the development and application of language models and further advancing human-computer interaction and natural language processing.


Source: blog.csdn.net/qq_35082030/article/details/130727016