[Distill Series: Three] On the Efficacy of Knowledge Distillation

https://arxiv.org/pdf/1910.01348.pdf

A higher-performing teacher is not necessarily a better teacher for distillation (my own experiments gave the same result).
Stopping teacher training early (I have not tried this) and stopping distillation early (this did not work in my case) are both reported to improve the distillation result.

Method

An intuitive assumption: the higher the teacher's performance, the better the distillation result.

The paper's results show that as the teacher model gets larger, the accuracy of the distilled student does not increase monotonically.

The authors consider two possible explanations:

  1. The student manages to imitate the teacher, but this does not improve its performance.
  2. The student fails to imitate the teacher.
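
One rough way to tell these two cases apart is to measure how often the student's predictions agree with the teacher's, alongside the student's accuracy on the true labels. The sketch below is only an illustration: the function `imitation_report` and the toy logits are hypothetical and are not taken from the paper.

```python
def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda i: values[i])


def imitation_report(student_logits, teacher_logits, labels):
    """Hypothetical diagnostic: agreement with the teacher vs. accuracy.

    High agreement with no accuracy gain points to case 1,
    low agreement points to case 2.
    """
    n = len(labels)
    s_pred = [argmax(z) for z in student_logits]
    t_pred = [argmax(z) for z in teacher_logits]
    return {
        # fraction of samples where student and teacher pick the same class
        "agreement": sum(s == t for s, t in zip(s_pred, t_pred)) / n,
        # fraction of samples where each model matches the true label
        "student_acc": sum(s == y for s, y in zip(s_pred, labels)) / n,
        "teacher_acc": sum(t == y for t, y in zip(t_pred, labels)) / n,
    }


# Toy data (made up): three samples, two classes.
student = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]]
teacher = [[3.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
labels = [0, 1, 1]
report = imitation_report(student, teacher, labels)
print(report)
```

In practice the same numbers would be computed on a held-out set, so that agreement and accuracy are not inflated by memorized training samples.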

In the later stages of training, the student trained with distillation falls behind the same model trained without distillation. Two remedies follow from this:

  • Stop distillation early and continue training without it (this did not work in my case).
  • Stopping teacher training early, at a suitable point, can also improve the distillation result.
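
The first remedy can be sketched as a loss whose distillation weight drops to zero after a chosen epoch, so that only the cross-entropy on the true labels remains. This is a minimal sketch that assumes the usual temperature-softened KL formulation of the distillation loss. The names `distillation_weight` and `student_loss`, and the values `stop_epoch=30`, `alpha=0.9` and `temperature=4.0`, are illustrative assumptions rather than settings from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def cross_entropy(student_logits, label):
    """Cross-entropy of the student against a hard label (class index)."""
    return -log_softmax(student_logits)[label]


def kd_term(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def distillation_weight(epoch, stop_epoch, alpha=0.9):
    """Assumed schedule: weight alpha before stop_epoch, 0 from then on."""
    return alpha if epoch < stop_epoch else 0.0


def student_loss(student_logits, teacher_logits, label, epoch,
                 stop_epoch=30, alpha=0.9, temperature=4.0):
    """Per-sample loss with distillation switched off after stop_epoch."""
    w = distillation_weight(epoch, stop_epoch, alpha)
    ce = cross_entropy(student_logits, label)
    if w == 0.0:
        return ce  # distillation has stopped: plain cross-entropy only
    kd = kd_term(student_logits, teacher_logits, temperature)
    # T^2 keeps the gradient scale of the soft term comparable across temperatures
    return (1.0 - w) * ce + w * temperature ** 2 * kd


s, t = [1.0, 0.5, -0.2], [2.0, 0.1, -1.0]
early = student_loss(s, t, label=0, epoch=5)    # mixes CE and KD terms
late = student_loss(s, t, label=0, epoch=40)    # CE only
print(early, late)
```

The second remedy needs no change to the loss: it amounts to distilling from a teacher checkpoint saved before teacher training has fully converged.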

Origin blog.csdn.net/qq_31622015/article/details/105707495