https://arxiv.org/pdf/1910.01348.pdf
A higher-performing teacher is not necessarily a better teacher (my own experiments give the same result)
Early stopping of teacher training (not tried) and early stopping of distillation (did not work for me) are both reported to improve the distillation result
Method
An intuitive assumption: the higher the teacher's performance, the better the distillation result
The results show that as the teacher model gets larger, the distilled student's performance does not increase monotonically
The authors propose two possible explanations:
- The student can imitate the teacher, but its performance does not improve.
- The student fails to imitate the teacher.
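One rough way to tell the two cases apart is to measure, on held-out data, how closely the student's output distribution matches the teacher's (KL divergence) alongside the student's accuracy. A low KL with no accuracy gain points to the first case; a high KL points to the second. The sketch below is only an illustration in plain Python: the function names (`softmax`, `kl_divergence`, `diagnose`) and the choice of KL as the imitation measure are my assumptions, not code from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def diagnose(teacher_logits, student_logits, labels, temperature=1.0):
    """Return (mean KL(teacher || student), student accuracy) over a batch.

    Hypothetical helper: each *_logits argument is a list of per-sample
    logit lists, labels is a list of integer class indices.
    """
    kls, correct = [], 0
    for t, s, y in zip(teacher_logits, student_logits, labels):
        kls.append(kl_divergence(softmax(t, temperature), softmax(s, temperature)))
        pred = max(range(len(s)), key=lambda i: s[i])  # argmax class
        correct += int(pred == y)
    n = len(labels)
    return sum(kls) / n, correct / n
```

Comparing the two returned numbers across teachers of different sizes gives a quick picture of whether imitation or generalization is the bottleneck.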
In the later stage of training, the model trained with distillation performs worse than the model trained without it
- Stop distillation early (did not work for me)
- Stopping teacher training early, at a suitable point, can also improve the distillation result.
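Stopping distillation early can be sketched as a loss whose distillation weight drops to zero after a cutoff epoch, so the student finishes training on cross-entropy alone. This is a minimal plain-Python sketch under assumptions of mine: the loss form `(1 - alpha) * CE + alpha * T^2 * KL` is the common knowledge-distillation formulation, and `stop_epoch`, `alpha`, and `temperature` are hypothetical values, not settings taken from the paper or from my runs.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Standard cross-entropy of the student against the hard label."""
    return -math.log(softmax(student_logits)[label])

def kd_term(student_logits, teacher_logits, temperature, eps=1e-12):
    """T^2 * KL(teacher || student) on temperature-softened outputs."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log((pt + eps) / (ps + eps)) for pt, ps in zip(p_t, p_s))
    return temperature ** 2 * kl

def distill_weight(epoch, stop_epoch, alpha):
    """Distillation weight: alpha before the cutoff epoch, 0 from it onward."""
    return alpha if epoch < stop_epoch else 0.0

def total_loss(student_logits, teacher_logits, label, epoch,
               stop_epoch=30, alpha=0.9, temperature=4.0):
    """Per-sample loss with early-stopped distillation (assumed defaults)."""
    w = distill_weight(epoch, stop_epoch, alpha)
    ce = cross_entropy(student_logits, label)
    if w == 0.0:
        return ce  # after the cutoff: plain supervised training
    return (1 - w) * ce + w * kd_term(student_logits, teacher_logits, temperature)
```

Early stopping of the teacher needs no change to this loss: the teacher logits would simply come from an earlier teacher checkpoint instead of the fully trained one.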