Paper address: http://arxiv.org/abs/1911.07471
GitHub address: none
This paper proposes two supervision methods that improve knowledge distillation when the teacher network produces misclassified or ambiguous outputs, so that the student network always learns effective knowledge.
Methods
Bad phenomenon 1: Genetic errors
Meaning: the student network reproduces the same wrong prediction as the teacher network. When the teacher predicts incorrectly, the student can hardly correct that wrong knowledge on its own; the error is inherited, hence a "genetic error".
Method 1: Knowledge Adjustment (KA)
The author proposes to modify the original knowledge distillation loss. Specifically, the divergence between the student and teacher outputs is explicitly defined as KL divergence, a correction function is applied to the teacher's logits, and the cross-entropy term with the ground-truth labels is dropped.
The correction function A(·) corrects wrong logits and leaves correct logits unchanged. The author uses Label Smoothing Regularization (LSR) and the proposed Probability Shift (PS) as correction functions. LSR softens a wrongly predicted label by mixing in mass from the other class labels. PS swaps the confidence of the wrongly predicted class with that of the true label, ensuring the maximum confidence falls on the correct class.
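The PS correction described above can be sketched in a few lines. This is a minimal illustration on a probability vector; the function name and array API are mine, not the paper's:

```python
import numpy as np

def probability_shift(teacher_probs, true_label):
    """Probability Shift (PS) sketch: if the teacher's top-1 class is
    wrong, swap its confidence with the true class's confidence so the
    maximum confidence falls on the correct label. Correct predictions
    are returned unchanged."""
    corrected = teacher_probs.copy()
    pred = int(np.argmax(corrected))
    if pred != true_label:  # wrong prediction: swap the two confidences
        corrected[pred], corrected[true_label] = (
            corrected[true_label], corrected[pred])
    return corrected
```

For example, a teacher output of [0.1, 0.6, 0.3] with true label 2 becomes [0.1, 0.3, 0.6], while an already-correct output is left as-is.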
Bad phenomenon 2: Uncertainty of supervision
Meaning: when the teacher's softened logits form a relatively flat distribution (i.e., the temperature scaling is too large), the supervision signal may lose discriminative information; the student then learns a fuzzy logits distribution and produces wrong predictions.
Method 2: Dynamic Temperature Distillation (DTD)
The author proposes DTD, a distillation method that adjusts the temperature scaling dynamically, i.e., sets the parameter adaptively per sample (sample-wise).
The per-sample temperature τ_x is computed from a few quantities: N is the batch size, τ_0 and β are the base and bias terms, and ω_x is the weight of sample x after batch-wise normalization, which describes how confusing the sample is, i.e., how hard it is to classify. When x is hard to classify, ω_x is large and τ_x becomes small, so the softened logits stay discriminative, and vice versa.
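The exact temperature formula is given as an equation in the paper; the sketch below is one plausible form consistent with the description (hard samples get a smaller temperature), and the subtraction and clipping are my assumptions:

```python
import numpy as np

def dynamic_temperature(weights, tau_0=4.0, beta=1.0):
    """DTD per-sample temperature (sketch, assumed form).

    weights: 1-D array, one confusion weight per sample in the batch.
    tau_0:   base ("benchmark") temperature.
    beta:    bias/scale term controlling how strongly tau varies.
    """
    n = len(weights)
    normalized = n * weights / np.sum(weights)  # batch-wise mean becomes 1
    tau = tau_0 - beta * normalized             # harder sample -> smaller tau
    return np.clip(tau, 1.0, None)              # keep temperatures >= 1
```

With this form, the hardest sample in a batch receives the sharpest (lowest-temperature) soft targets, matching the behavior the text describes.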
The author proposes two ways to compute ω_x:
- Focal Loss Style Weights (FLSW)
Inspired by focal loss's up-weighting of hard samples, the author proposes the FLSW method, where v and t are the predictions of the student and the teacher, respectively, and γ is a hyperparameter.
- Confidence Weighted by Student Max (CWSM)
where v_max is the maximum value of the student's normalized logits; the author argues this value reflects the student's confidence on the sample.
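The two weighting schemes can be sketched as follows. The FLSW form (1 − v·t)^γ follows the focal-loss style the text describes, and the reciprocal form for CWSM matches "low confidence, large weight"; treat both exact formulas, and the softmax normalization, as assumptions rather than the paper's verbatim equations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def flsw_weights(student_logits, teacher_logits, gamma=1.0):
    """Focal Loss Style Weights (sketch): samples where the student
    already agrees with the teacher (large inner product of normalized
    predictions) get a small weight; disagreement gets a large one."""
    v = softmax(student_logits)
    t = softmax(teacher_logits)
    agreement = np.sum(v * t, axis=-1)
    return (1.0 - agreement) ** gamma

def cwsm_weights(student_logits):
    """Confidence Weighted by Student Max (sketch): v_max reflects the
    student's confidence, so low-confidence samples are up-weighted."""
    v_max = softmax(student_logits).max(axis=-1)
    return 1.0 / v_max
```

In both cases the resulting ω_x would then be normalized batch-wise before being turned into a per-sample temperature.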
Ultimately, combining the two methods addresses both problems at once: the overall loss becomes the distillation loss with the teacher's logits corrected by KA and the temperature set per sample by DTD.
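Putting the pieces together, the combined objective can be sketched as a KL divergence between the PS-corrected teacher distribution and the student distribution, each softened with the per-sample temperature. The combination below is my reading of the text, not the paper's exact loss:

```python
import numpy as np

def softened(logits, tau):
    z = logits / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dtd_ka_loss(student_logits, teacher_logits, labels, tau):
    """Combined KA + DTD loss (sketch): KL(teacher || student) with
    Probability Shift applied to the teacher logits and a per-sample
    temperature tau (shape [N]) from DTD."""
    corrected = teacher_logits.copy()
    for i, y in enumerate(labels):            # Probability Shift per sample
        pred = int(np.argmax(corrected[i]))
        if pred != y:
            corrected[i, pred], corrected[i, y] = (
                corrected[i, y], corrected[i, pred])
    p = softened(corrected, tau[:, None])     # corrected teacher targets
    q = softened(student_logits, tau[:, None])
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return kl.mean()
```

Note that, per the KA description, no cross-entropy term with the hard labels appears; the labels are only used to correct the teacher's logits.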
Experiments
Datasets: CIFAR-10, CIFAR-100, Tiny ImageNet
Compared methods: standard knowledge distillation (KD), attention transfer (AT), neuron selectivity transfer (NST)
Results