Distilling Step-by-Step

This paper proposes a step-by-step distillation (Distilling Step-by-Step) paradigm for training models. The method produces small task-specific models that outperform LLMs while using far less training data than traditional fine-tuning or distillation. The authors' 770M T5 model outperformed the 540B PaLM model on a benchmark task while using only 80% of the available data.

Although large language models (LLMs) exhibit impressive few-shot learning capabilities, such large-scale models are difficult to deploy in real-world applications. Dedicated infrastructure for serving an LLM with 175 billion parameters requires at least 350 GB of GPU memory, and today's state-of-the-art LLMs exceed 500 billion parameters, demanding even more memory and compute. Such requirements are out of reach for most product teams, let alone applications that require low latency.
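For context, the 350 GB figure corresponds to storing the model weights alone at 16-bit precision: 175 × 10⁹ parameters × 2 bytes ≈ 350 GB, before counting activations, KV caches, or any serving overhead.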

To work around the cost of large models, deployers often substitute smaller task-specific models, trained under one of two common paradigms: fine-tuning or distillation. Fine-tuning updates a small pre-trained model with human-annotated downstream data. Distillation trains a similarly small model on labels produced by a larger LLM. Unfortunately, both paradigms pay a price for the reduced model size: fine-tuning requires expensive human labels to approach LLM-level performance, and distillation requires large amounts of hard-to-obtain unlabeled data.

In a paper titled "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", researchers from the University of Washington and Google introduced a simple new mechanism, stepwise distillation (Distilling Step-by-Step), for training smaller models with less training data. The mechanism reduces the amount of training data required compared with fine-tuning and distillation, and it yields smaller models.

Paper link: https://arxiv.org/pdf/2305.02301v1.pdf

The core of this mechanism is a change of perspective: treat the LLM as an agent that can reason, rather than as a source of noisy labels. LLMs can generate natural-language rationales that explain and support the labels they predict. For example, when asked "A gentleman is carrying equipment for golf. What might he have? (a) clubs, (b) auditorium, (c) meditation center, (d) conference, (e) church", the LLM can answer "(a) clubs" via chain-of-thought (CoT) reasoning and justify the label with "the answer must be something used to play golf; of the above choices, only clubs are used for golf". The researchers use these rationales as additional, richer supervision to train smaller models in a multi-task setting that predicts both labels and rationales.
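As a rough illustration, eliciting such rationale-plus-label outputs can be done with a few-shot CoT prompt along the lines of the sketch below. The template and the parsing helper (`PROMPT_TEMPLATE`, `parse_rationale_and_label`) are hypothetical, not the paper's exact prompt:

```python
# Hypothetical few-shot CoT prompt: one worked exemplar whose answer ends in a
# fixed phrase ("The answer is ..."), followed by the new question. The exact
# template used in the paper may differ.
PROMPT_TEMPLATE = """Q: A gentleman is carrying equipment for golf. What might he have?
Answer choices: (a) clubs (b) auditorium (c) meditation center (d) conference (e) church
A: The answer must be something used to play golf. Of the above choices,
only clubs are used to play golf. The answer is (a).

Q: {question}
Answer choices: {choices}
A:"""

def parse_rationale_and_label(completion: str):
    """Split an LLM completion into (rationale, label).

    Assumes the completion mirrors the exemplar above and ends with
    'The answer is (x).'.
    """
    rationale, _, answer = completion.rpartition("The answer is")
    return rationale.strip(), answer.strip(" ().")
```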

As shown in Figure 1, stepwise distillation learns small task-specific models with fewer than 1/500 of the LLM's parameters, and it uses far fewer training examples than traditional fine-tuning or distillation.

The experiments yielded three promising findings across four NLP benchmarks:

  • First, compared to fine-tuning and distillation, the stepwise distillation model achieves better performance on every dataset while using on average over 50% fewer training instances (and up to over 85% fewer).

  • Second, the model outperforms the LLM at far smaller model sizes (up to 2,000 times smaller), greatly reducing the computational cost of deployment.

  • Third, stepwise distillation reduces not only the model size but also the amount of data needed to surpass the LLM: a 770M T5 model surpassed the 540B-parameter LLM while using only 80% of the labeled dataset required by existing fine-tuning methods.

When only unlabeled data is available, the small models still hold their own against the LLM: a mere 11B T5 model surpasses the performance of the 540B PaLM.

The study further shows that when a smaller model underperforms the LLM, stepwise distillation exploits additional unlabeled data more effectively than standard distillation, bringing the smaller model up to LLM-level performance.

Stepwise distillation

The researchers proposed the new step-by-step distillation paradigm, which leverages the LLM's ability to reason about its predictions to train smaller models in a data-efficient way. The overall framework is shown in Figure 2.

The paradigm has two simple steps. First, given an LLM and an unlabeled dataset, the LLM is prompted to generate output labels together with rationales justifying those labels. A rationale is a natural-language explanation that supports the model's predicted label (see Figure 2); generating such rationales is an emergent behavioral property of current self-supervised LLMs.

These rationales are then used, alongside the task labels, to train the smaller downstream model. Put plainly, rationales provide richer, more detailed information about why an input maps to a particular output label.
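A minimal sketch of this multi-task objective, using Hugging Face's T5 as the student: the same input is fed twice, once under a label-prediction prefix and once under a rationale-prediction prefix, and the two cross-entropy losses are summed as L = L_label + λ · L_rationale. The prefix strings and the λ value below are assumptions based on the paper's description, not its exact configuration:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
lam = 1.0  # weight on the rationale loss; a hyperparameter, 1.0 is an assumption

def step_by_step_loss(question: str, label: str, rationale: str) -> torch.Tensor:
    """Combined loss: predict the label and the rationale from the same input."""
    def seq2seq_loss(prefix: str, target: str) -> torch.Tensor:
        enc = tokenizer(prefix + question, return_tensors="pt", truncation=True)
        tgt = tokenizer(target, return_tensors="pt", truncation=True)
        return model(input_ids=enc.input_ids,
                     attention_mask=enc.attention_mask,
                     labels=tgt.input_ids).loss

    # L = L_label + lambda * L_rationale
    return seq2seq_loss("[label] ", label) + lam * seq2seq_loss("[rationale] ", rationale)
```

At inference time only the label prefix is used, so the deployed model pays no extra cost for having learned to generate rationales during training.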

Experimental results

The researchers verified the effectiveness of stepwise distillation experimentally. First, compared with standard fine-tuning and task distillation, stepwise distillation achieves better performance with far fewer training instances, greatly improving the data efficiency of learning small task-specific models.

Second, the study shows that stepwise distillation surpasses LLM performance at a much smaller model size, greatly reducing deployment cost compared with the LLM.

Finally, the researchers investigated the minimum resources stepwise distillation needs to outperform the LLM, in terms of both the number of training examples and model size. They show that stepwise distillation improves both data efficiency and deployment efficiency by using less data and smaller models.

 

 
