ALPAGASUS: TRAINING A BETTER ALPACA WITH FEWER DATA

Introduction

This paper demonstrates that data quality matters more than data quantity for instruction tuning. The authors use ChatGPT to filter the Alpaca 52k instruction dataset down to roughly 9k high-quality examples, fine-tune models on both versions, and find through experimental comparison that the 9k-trained model clearly outperforms the 52k-trained one.

The second contribution: the authors show that the 9k subset is the best-performing data size, and they further verify the approach against other open-source instruction data (Vicuna).

Method

The filtering method: ChatGPT is prompted via in-context learning to score each example of the Alpaca 52k dataset on a scale of 0 to 5.

The scoring prompt is shown as a figure in the paper.
Examples scoring 4.5 or higher are kept, forming the AlpaGasus 9k dataset.
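To make the pipeline concrete, here is a minimal sketch of the scoring-and-filtering step, assuming the `openai` Python client; `RATING_PROMPT`, the model name, and the regex-based score parsing are illustrative stand-ins, not the paper's exact prompt or setup.

```python
# Minimal sketch of the ChatGPT-based scoring/filtering step.
# Assumptions (not from the paper): the openai Python client is used,
# RATING_PROMPT is a stand-in for the paper's actual rating prompt,
# and scores are parsed with a simple regex.
import json
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RATING_PROMPT = (
    "Rate the quality of the response to the instruction on a scale of 0 to 5. "
    "Reply with 'Score: <number>' only.\n\n"
    "Instruction: {instruction}\nInput: {input}\nResponse: {output}"
)

def score_example(example: dict) -> float:
    """Ask ChatGPT to grade one Alpaca (instruction, input, output) triple."""
    prompt = RATING_PROMPT.format(**example)
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = reply.choices[0].message.content
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else 0.0

def filter_alpaca(path_in: str, path_out: str, threshold: float = 4.5) -> None:
    """Keep only examples whose ChatGPT score is >= threshold (4.5 in the paper)."""
    with open(path_in) as f:
        data = json.load(f)  # the Alpaca 52k JSON list
    kept = [ex for ex in data if score_example(ex) >= threshold]
    with open(path_out, "w") as f:
        json.dump(kept, f, indent=2)

# filter_alpaca("alpaca_data.json", "alpagasus_9k.json")
```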

Both datasets are then used to fine-tune LLaMA-7B and LLaMA-13B, and the resulting models are compared (the comparison figure is in the paper); the 9k models come out ahead.
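For the fine-tuning side, a minimal sketch using Hugging Face `transformers`/`datasets` and the standard Alpaca prompt template is shown below; the checkpoint name, file path, and hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of supervised fine-tuning on the filtered 9k data.
# Assumptions: Hugging Face transformers/datasets, the standard Alpaca
# prompt template, and illustrative hyperparameters.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "huggyllama/llama-7b"  # assumed checkpoint name; the paper uses LLaMA-7B/13B
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

def to_text(ex):
    # Alpaca-style prompt followed by the reference response.
    prompt = (f"### Instruction:\n{ex['instruction']}\n\n"
              f"### Input:\n{ex['input']}\n\n### Response:\n")
    return {"text": prompt + ex["output"] + tokenizer.eos_token}

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=512)

data = (load_dataset("json", data_files="alpagasus_9k.json")["train"]
        .map(to_text)
        .map(tokenize, remove_columns=["instruction", "input", "output", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpagasus-7b", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=data,
    # Collator pads batches and copies input_ids into labels for causal-LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```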

The authors also test 3k and 6k subsets and find that the 9k subset performs best.

Reference

https://arxiv.org/pdf/2307.08701.pdf
