How to Automatically Identify High-Quality Instruction Data in a Dataset: the IFD Metric

A few words up front

Hello everyone, my name is Liu Cong NLP.

In the era of large models, instruction fine-tuning has become an essential skill for algorithm engineers. During instruction fine-tuning, we typically optimize along two dimensions: data quantity and data quality. The LIMA work showed that a limited amount of manually curated, high-quality data is enough to improve a model's instruction-following ability. So can high-quality data be discovered automatically?

Today I bring you a paper on automatically identifying high-quality data within large instruction datasets: "From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning". Its core contribution is the Instruction-Following Difficulty (IFD) score, a metric used to filter the samples ("cherry data") with the most potential to improve LLM instruction tuning. A model fine-tuned on only 5%-10% of the original data, selected as cherry data, can match and even exceed the result of fine-tuning on the full dataset.

Paper: https://arxiv.org/abs/2308.12032
Github: https://github.com/MingLiiii/Cherry_LLM

PS: Errata for the new book "ChatGPT Principles and Practice" have been posted to GitHub; feel free to grab them there, and you are also welcome to report any bugs you catch via issues.

《ChatGPT原理与实战》:https://github.com/liucongg/ChatGPTBook

Method

The IFD score is used to automatically filter cherry data, and the cherry data is then used to instruction-tune the model, yielding a better fine-tuned model. The method involves three steps:

  • Learning from Brief Experience: Use a small amount of data to give the model an initial round of training;

  • Evaluating Based on Experience: Use the briefly trained model to compute the IFD score for every sample in the original dataset;

  • Retraining from Self-Guided Experience: Retrain the original model on the selected cherry data.

The overall pipeline is illustrated in the figure below.

[Figure: overview of the three-stage self-guided data selection pipeline]

Learning from Brief Experience

The reasons for using a small amount of data for initial model learning are as follows:

  • Some models are base models, which can only do text continuation and lack the ability to follow instructions;

  • LIMA has proven that high-quality data can enable models to follow instructions;

  • If a large amount of data is used for learning, the time cost and resource cost will be high.

For this small subset, 1k samples are selected. To ensure data diversity, the instructions are clustered with K-Means into 100 clusters, and 10 samples are taken from each cluster. The initial model is then trained on this subset for only 1 epoch to obtain the brief pre-experience model.
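As a minimal sketch of how such a diverse 1k subset could be assembled (the embedding model, the within-cluster selection rule, and the function name below are my assumptions, not necessarily the paper's exact setup):

```python
# Sketch of the "brief experience" subset: cluster instructions with K-Means
# and keep ~10 samples per cluster (100 clusters -> ~1k samples).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend
from sklearn.cluster import KMeans

def select_brief_experience_subset(instructions, n_clusters=100, per_cluster=10, seed=42):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    emb = embedder.encode(instructions, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(emb)

    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # Here the samples closest to each cluster centre are kept; the paper
        # may pick samples within a cluster differently.
        dist = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        selected.extend(idx[np.argsort(dist)[:per_cluster]].tolist())
    return selected  # indices into `instructions`
```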

Evaluating Based on Experience

The brief pre-experience model is then used to score every sample in the dataset: it predicts the answer conditioned on the instruction, and the discrepancy between the predicted answer and the reference answer, measured with cross entropy, gives the Conditioned Answer Score (CAS):

$$ s_\theta(A \mid Q) \;=\; \frac{1}{N}\sum_{i=1}^{N} -\log P\!\left(w_i^A \,\middle|\, Q,\, w_1^A, \ldots, w_{i-1}^A;\ \theta\right) $$

where the answer A consists of N tokens $w_1^A, \ldots, w_N^A$ and $\theta$ denotes the brief pre-experience model.

The magnitude of the CAS reflects how difficult it is for the model to generate answer A given instruction Q, but it is also influenced by how hard answer A is to generate on its own. To measure that, the cross entropy of the answer tokens is computed without conditioning on the instruction, which gives the Direct Answer Score (DAS):

$$ s_\theta(A) \;=\; \frac{1}{N}\sum_{i=1}^{N} -\log P\!\left(w_i^A \,\middle|\, w_1^A, \ldots, w_{i-1}^A;\ \theta\right) $$

A higher DAS indicates that the answer itself is intrinsically harder or more complex for the model to generate. To find the better instruction data, i.e., the instructions that have the greater influence on the model, the effect of the answer itself must be factored out, so the Instruction-Following Difficulty (IFD) score is defined as the ratio of the two:

$$ \mathrm{IFD}_\theta(Q, A) \;=\; \frac{s_\theta(A \mid Q)}{s_\theta(A)} $$

Filtering data with the IFD score dampens the influence of the model's inherent ability to fit the answer itself, and directly measures how much the given instruction helps the model generate the corresponding answer. A higher IFD score means the model struggles to align the answer with the instruction, indicating that the instruction is harder and therefore more valuable for instruction tuning.
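A hedged sketch of how CAS, DAS, and IFD could be computed with Hugging Face transformers follows; the checkpoint name and prompt template are placeholder assumptions, and the boundary between prefix and answer tokens is handled only approximately:

```python
# Sketch: per-sample CAS (answer loss given the instruction), DAS (answer loss
# alone), and their ratio, the IFD score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder; in practice use the brief pre-experience model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

@torch.no_grad()
def avg_answer_loss(prefix: str, answer: str) -> float:
    """Average cross entropy over the answer tokens, conditioned on `prefix`."""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + answer, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # ignore loss on the prefix (instruction) tokens
    return model(full_ids, labels=labels).loss.item()

def ifd_score(instruction: str, answer: str) -> float:
    cas = avg_answer_loss(instruction + "\n", answer)  # Conditioned Answer Score
    das = avg_answer_loss("", answer)                  # Direct Answer Score
    return cas / das                                   # IFD = CAS / DAS
```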

Retraining from Self-Guided Experience

The original dataset is sorted by IFD score, the highest-scoring samples are selected as cherry data, and the original model is fine-tuned on them to obtain the cherry model.
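For illustration, a small sketch of this selection step (the 5% ratio mirrors the experiments below; dropping samples with IFD ≥ 1, where the instruction provides no useful conditioning at all, is my reading of the score's definition rather than a quoted detail):

```python
# Sketch: rank all samples by IFD score and keep the top fraction as cherry data.
def select_cherry_data(samples, ifd_scores, ratio=0.05):
    """`samples` and `ifd_scores` are parallel lists; returns the top `ratio` by IFD."""
    # An IFD score >= 1 means the instruction does not make the answer any easier
    # to predict, so such samples are treated as uninformative and skipped here.
    ranked = sorted(
        (pair for pair in zip(samples, ifd_scores) if pair[1] < 1.0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [s for s, _ in ranked[: int(len(samples) * ratio)]]
```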

Discussion of results

The conclusion first: experiments with Llama-7B on two datasets, Alpaca and WizardLM, show that training on only 5% of the Alpaca data, selected as cherry data, outperforms training on the full dataset.


How can we tell that the IFD score itself is what makes the difference? Comparing random sampling, high-IFD sampling, low-IFD sampling, and high-CAS sampling for instruction fine-tuning shows that high-IFD sampling beats full-data fine-tuning across different data proportions, while the other sampling strategies fall below the full-data baseline.


The brief learning stage above uses 1,000 samples. How does the amount of data in this stage affect the result? A comparison across different amounts shows that even without brief learning, i.e., scoring directly with the base model, training on 10% cherry data is still effective, which demonstrates the validity of the IFD score itself. The main purpose of brief learning is to give the base model some basic instruction-following ability: with only 100 samples the trained model shows no benefit, while at 300 samples it already acquires a degree of instruction-following ability.


The sampling strategy used for the brief learning stage was also compared: selecting samples for diversity of distribution (the K-Means approach above) versus by instruction-following difficulty (IFD score). Both work, which suggests that having the brief learning stage at all matters more than exactly how its samples are chosen.


An analysis of high- and low-quality data shows that cherry data generally scores higher on scope, complexity, depth, and required knowledge, and lower on clarity and simplicity. There is also a clear boundary between high-difficulty and low-difficulty samples.


Summary

In the era of large models, most algorithm engineers have turned into data engineers. How to construct data so that the model performs better has become everyone's daily work. Don't underestimate this job; the details often determine success or failure.

Please follow "Liu Cong NLP" on Zhihu. If you have questions, you are also welcome to add me on WeChat ("logCong") for a private chat. Let's make friends, learn together, and make progress together. Our slogan is "Life is endless, and so is learning".

PS: The new book "ChatGPT Principles and Practice" has been released. You are welcome to buy a copy~~
