Distilling Step-by-Step: you can outperform an LLM with less training data and a smaller model!
Introduction
The authors note that deploying large models brings challenges such as latency, memory footprint, and compute cost, so the current trend is to fine-tune or distill a smaller language model, such as Vicuna or Alpaca; however, obtaining labeled data for a specific downstream task is difficult and expensive.
To address these problems, the authors propose Distilling Step-by-Step, which can beat large models on the same datasets while using less data and a smaller model. (In the paper's experiments, a 770M T5 outperforms the 540B PaLM.)
Method
Distilling step-by-step is divided into two steps:
- Feed unlabeled data to an LLM with Chain-of-Thought (CoT) prompting so that it generates both labels and rationales (i.e., explanations of why those results are obtained).
- Fine-tune the small model on the resulting data.
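The first of these steps can be sketched roughly as follows. This is a minimal illustration, not the paper's actual prompt: the exemplar text, the `build_prompt`/`parse_response` helpers, and the output format are all assumptions for demonstration purposes.

```python
# Step 1 sketch: prompt an LLM with few-shot CoT exemplars so that, for each
# unlabeled input, it emits a rationale followed by a label. The exemplar and
# helper names below are illustrative assumptions, not the paper's exact setup.

COT_EXEMPLAR = (
    "Q: Jesse's room is 15 feet long and 11 feet wide. "
    "How much longer is it than it is wide?\n"
    "Rationale: The difference is 15 - 11 = 4 feet.\n"
    "A: 4\n\n"
)

def build_prompt(unlabeled_question: str) -> str:
    """Prepend CoT exemplars so the LLM imitates the rationale-then-answer format."""
    return COT_EXEMPLAR + f"Q: {unlabeled_question}\nRationale:"

def parse_response(continuation: str) -> tuple[str, str]:
    """Split the LLM continuation into (rationale, label)."""
    rationale, _, answer = continuation.partition("\nA:")
    return rationale.strip(), answer.strip()
```

Running `build_prompt` over each unlabeled example and `parse_response` over each LLM continuation yields the (input, rationale, label) triples used in the fine-tuning step.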
In the first step, CoT prompting lets the LLM produce, for each unlabeled example, both a label and a rationale. In this way, the small model can learn not only how to perform the task but also why the answer is correct, which deepens its understanding of the specific task.
Now, given xi (the original unlabeled input), ri (the rationale), and yi (the label), the authors tie the three together: the model takes the question as input, and the output is changed to produce both the answer and the rationale explaining it. The two outputs each contribute a loss term, and the total loss is a weighted sum of the two.
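The weighted objective above can be sketched as follows. This is an assumed form of the loss (label loss plus a weighted rationale loss); the `[label]`/`[rationale]` prefixes and the default weight `lam = 1.0` are illustrative choices, not confirmed details from the paper.

```python
# Sketch of the multi-task setup: the same input x is formatted twice, once
# per output head, and the two per-task losses are combined with a weight.
# Prefix strings and the default weight are assumptions for illustration.

def format_examples(x: str) -> tuple[str, str]:
    """Build the two training inputs for one example (assumed task prefixes)."""
    return f"[label] {x}", f"[rationale] {x}"

def distill_loss(label_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    """Total loss = L_label + lam * L_rationale (assumed weighting scheme)."""
    return label_loss + lam * rationale_loss
```

At inference time only the label branch is needed, so generating rationales adds no deployment cost to the small model.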
Experiment
Reference
https://arxiv.org/pdf/2305.02301.pdf