Preparing Lessons: Improve Knowledge Distillation with Better Supervision Paper Notes

Paper address: http://arxiv.org/abs/1911.07471
github address: None

This paper proposes two improved supervision methods for knowledge distillation, targeting the cases where the teacher network produces wrong predictions or ambiguous (low-discriminability) predictions, so that the student network always learns effective knowledge.

Methods

Bad phenomenon 1: Genetic errors
Meaning: the student network reproduces the teacher network's wrong predictions. When the teacher predicts incorrectly, it is hard for the student to correct that knowledge on its own, so the error is inherited; the paper calls this a genetic error.

Method 1: Knowledge Adjustment (KA)
The author proposes modifying the original knowledge-distillation loss: the discrepancy between the student and the teacher is explicitly measured with KL divergence, a correction function is applied to the teacher's logits, and the cross-entropy term with the ground-truth labels is dropped.
$$\mathcal{L}_{KA} = \mathrm{KL}\big(\sigma(A(t)/\tau)\,\big\|\,\sigma(v/\tau)\big)$$
where $\sigma(\cdot)$ is the softmax, $t$ and $v$ are the teacher's and student's logits, and $\tau$ is the temperature.
The correction function $A(\cdot)$ fixes wrong logits and leaves correct logits unchanged. The author uses Label Smoothing Regularization (LSR) and the proposed Probability Shift (PS) as correction functions. LSR softens a wrongly predicted output by mixing in the other class labels, while PS swaps the confidence of the wrongly predicted class with the confidence of the ground-truth class, so the maximum confidence always falls on the correct label.
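As a concrete illustration, here is a minimal PyTorch-style sketch of the Probability Shift idea (no official code is released, so the function name and tensor shapes are my own assumptions):

```python
import torch

def probability_shift(teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of Probability Shift (PS): where the teacher's top-1 prediction is
    wrong, swap that logit with the ground-truth class logit so the maximum value
    lands on the correct label; correctly predicted samples are left untouched."""
    corrected = teacher_logits.clone()                  # (batch, num_classes)
    pred = corrected.argmax(dim=1)                      # teacher's predicted class
    rows = (pred != labels).nonzero(as_tuple=True)[0]   # misclassified samples
    top_vals = corrected[rows, pred[rows]].clone()
    true_vals = corrected[rows, labels[rows]].clone()
    corrected[rows, pred[rows]] = true_vals             # swap the two entries
    corrected[rows, labels[rows]] = top_vals
    return corrected
```

An LSR-style correction would instead replace the wrong teacher outputs with a label-smoothed ground-truth distribution.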

Bad phenomenon 2: Uncertainty of supervision
Meaning: when the teacher network's softened output distribution is too flat (i.e., the temperature scaling is large), the supervision it provides loses discriminative information, so the student network learns a fuzzy distribution and produces wrong predictions.

Method 2: Dynamic Temperature Distillation (DTD)
The author proposes Dynamic Temperature Distillation (DTD), which dynamically adjusts the temperature scaling, i.e., sets it adaptively for each sample.
$$\tau_x = \tau_0 - \beta\,\omega_x, \qquad \omega_x = \frac{N\,w_x}{\sum_{x'} w_{x'}}$$
where $N$ is the batch size, $\tau_0$ and $\beta$ are the base and bias terms, and $\omega_x$ is the batch-wise normalized weight of sample $x$, which describes how confusing (hard to classify) the sample is. When $x$ is hard to classify, $\omega_x$ is large and $\tau_x$ becomes smaller, so the softened logits remain discriminative, and vice versa.

For the per-sample weight $w_x$ (batch-normalized into $\omega_x$ above), the author proposes two computation methods; a combined sketch follows the list below:

  1. Focal Loss Style Weights (FLSW)
    Inspired by focal loss's penalty weight adjustment for difficult samples, the author proposes the FLSW method.
    $w_x = (1 - v \cdot t)^{\gamma}$
    where $v$ and $t$ are the normalized logits (prediction vectors) of the student and the teacher, respectively, and $\gamma$ is a hyperparameter. The more the student's prediction agrees with the teacher's, the easier the sample and the smaller the weight.
  2. Confidence Weighted by Student Max (CWSM)
    $w_x = 1 / v_{\max}$
    where $v_{\max}$ is the maximum value of the student's normalized logits; the author argues that this value reflects the student's confidence on the sample, so less confident (harder) samples receive larger weights.
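A minimal sketch combining the two weighting schemes with the dynamic temperature, assuming softmax-normalized predictions and placeholder hyperparameter values (none of these names come from the paper's code):

```python
import torch
import torch.nn.functional as F

def flsw_weights(student_logits, teacher_logits, gamma=2.0):
    """FLSW sketch: the weight grows as student and teacher predictions disagree."""
    v = F.softmax(student_logits, dim=1)
    t = F.softmax(teacher_logits, dim=1)
    return (1.0 - (v * t).sum(dim=1)).pow(gamma)

def cwsm_weights(student_logits):
    """CWSM sketch: the weight shrinks as the student's maximum confidence grows."""
    v_max = F.softmax(student_logits, dim=1).max(dim=1).values
    return 1.0 / v_max

def dynamic_temperature(weights, tau_0=4.0, beta=1.0):
    """Batch-normalize the weights and map harder samples to lower temperatures."""
    omega = weights * weights.numel() / weights.sum()   # batch-wise normalization
    tau = tau_0 - beta * omega                          # harder sample -> smaller tau
    return tau.clamp(min=1.0)                           # guard against non-positive tau (my addition)
```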

Ultimately, combining the two methods addresses both problems at once, and the overall loss function becomes:
$$\mathcal{L} = \mathrm{KL}\big(\sigma(A(t)/\tau_x)\,\big\|\,\sigma(v/\tau_x)\big)$$
i.e., the KL divergence between the KA-corrected teacher logits and the student logits, both softened with the sample-wise temperature $\tau_x$.
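Putting the pieces together, here is a sketch of how the final loss could be assembled from the hypothetical helpers above (the KL direction follows standard KD practice; this is an illustration, not the paper's official implementation):

```python
import torch.nn.functional as F

def prepared_kd_loss(student_logits, teacher_logits, labels,
                     tau_0=4.0, beta=1.0, gamma=2.0):
    """Sketch: KA-corrected teacher targets + DTD sample-wise temperatures."""
    corrected = probability_shift(teacher_logits, labels)            # KA step (PS variant)
    w = flsw_weights(student_logits, teacher_logits, gamma).detach() # or cwsm_weights(student_logits)
    tau = dynamic_temperature(w, tau_0, beta).unsqueeze(1)           # (batch, 1)
    log_q = F.log_softmax(student_logits / tau, dim=1)
    p = F.softmax(corrected / tau, dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")                 # KL(teacher || student)
```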

Experiments

Datasets: CIFAR-10, CIFAR-100, Tiny ImageNet
Compared methods: standard knowledge distillation (KD), Attention Transfer (AT), Neuron Selectivity Transfer (NST)

Results

[Result tables on CIFAR-100, CIFAR-10, and Tiny ImageNet for the FLSW and CWSM variants; not reproduced here.]


Origin blog.csdn.net/qq_43812519/article/details/106040319