Knowledge distillation: study notes

Knowledge distillation (localization distillation and classification distillation)

1. What is knowledge distillation

Knowledge distillation refers to the idea of model compression: a larger, already-trained network is used to teach a smaller network exactly what to do.
The core idea of distillation is that the goal of a good model is not to fit the training data but to learn how to generalize to new data. The goal of distillation is therefore to let the student model learn the teacher model's generalization ability; in theory, the result will be better than a student model that simply fits the training data.
During distillation, the original large model is called the teacher model, the new small model is called the student model, the labels in the training set are called hard labels, the probability output predicted by the teacher model is called the soft label, and the temperature (T) is a hyperparameter used to adjust the soft labels.
	![在这里插入图片描述](https://img-blog.csdnimg.cn/4cd905865a37452dbae3b34466e861b9.png)

2. Why Knowledge Distillation

There is a mismatch between the models used for training and those used for deployment:
during training, we use complex models and large amounts of computing resources in order to extract information from very large and highly redundant data sets. In experiments, the best-performing models are often large in scale, and are sometimes even ensembles of multiple models. However, large models are inconvenient to deploy as services. The common bottlenecks are:
• Slow inference speed
• High resource requirements for deployment (memory, GPU memory, etc.)
• Strict restrictions on latency and computing resources at deployment time
Therefore, model compression (reducing the number of parameters of a model while preserving its performance) has become an important problem, and "model distillation" is one method of model compression.

3. Theoretical Basis of Knowledge Distillation

Teacher-Student model: the teacher is the provider of "knowledge" and the student is the receiver of "knowledge".
The process of knowledge distillation is divided into two stages:
• Original model training:
train the "teacher model" Net-T, which is relatively complex; no restrictions are placed on its architecture, number of parameters, or whether it is an ensemble. The only requirement is that its accuracy metrics (ROC/mAP/mIoU, etc.) show no obvious bias.
• Simplified model training:
train the "student model" Net-S, a single model with a small number of parameters and a relatively simple structure. Like the teacher, it produces the same kind of output;
for example, a classification student also outputs per-class probability values after softmax.

4. The key points of knowledge distillation

(1) A model with strong generalization ability can be obtained by increasing network capacity:
such a model reflects the relationship between inputs and outputs well on all data for a given problem, whether training data, test data, or any other (unseen) data belonging to that problem.
(2) When we use Net-T to distill and train Net-S, we can let Net-S learn the generalization ability of Net-T directly.
(3) A straightforward and efficient way to transfer this generalization ability is to
use the class probabilities output by the softmax layer as "soft targets".
[Comparison between the KD training process and the traditional training process]:
Traditional training process (hard targets): maximize the likelihood of the ground truth (recall the relationship between cross entropy and maximum likelihood).
KD training process (soft targets): use the class probabilities of the large model as soft targets.
(4) Maximum likelihood function
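For reference, the maximum-likelihood objective for a classification model with parameters $\theta$, and its relation to cross entropy:

$$
\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{N} p\left(y_i \mid x_i; \theta\right)
           = \arg\min_{\theta} \left(-\sum_{i=1}^{N} \log p\left(y_i \mid x_i; \theta\right)\right)
$$

Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which for one-hot labels is exactly the cross-entropy loss used below.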

(5) KL divergence
is used to measure the distance between two distributions P and Q
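The discrete form of the KL divergence is:

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}
$$

It is non-negative, equals zero only when $P = Q$, and is not symmetric in $P$ and $Q$.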

(6) Cross entropy and entropy
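For reference, the entropy of a distribution $P$ and the cross entropy between $P$ and $Q$:

$$
H(P) = -\sum_{i} P(i)\log P(i), \qquad
H(P, Q) = -\sum_{i} P(i)\log Q(i)
$$

They are linked through the KL divergence: $H(P,Q) = H(P) + D_{\mathrm{KL}}(P\,\|\,Q)$, so minimizing cross entropy against a fixed target $P$ is equivalent to minimizing the KL divergence.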

(7) Cross entropy loss function

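For a sample with one-hot label $y$ and predicted probabilities $q$ (from softmax), the cross-entropy loss is:

$$
L_{\mathrm{CE}} = -\sum_{c} y_c \log q_c = -\log q_{\hat{c}}
$$

where $\hat{c}$ is the ground-truth class.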

5. Why the KD training process is more effective

In the output of the softmax layer, the negative labels, not just the positive one, carry a lot of information; for example, the probabilities of some negative labels are much larger than those of others (e.g., for an image of a BMW, "garbage truck" gets a much larger probability than "rabbit").
In the traditional training process (hard targets), all negative labels are set to 0 by one-hot encoding, so the information they carry is discarded; in the end, only the positive-label information is used for training.
(Figure: in the traditional training process, only the positive-label information is retained after one-hot encoding)
The KD training method, by contrast, does not apply one-hot encoding, so the full output distribution of each sample is retained; the amount of information delivered to Net-S is therefore greater than with the traditional training method.
(Figure: the KD training process retains the information of all negative labels)

Example:
In the handwritten digit recognition task MNIST, there are 10 output categories. Suppose one input "2" looks more similar to a "3": the softmax output probability for "3" is 0.1, while the values for the other negative labels are very small. Another "2" looks more similar to a "7", so the probability for "7" is 0.1. The hard targets for the two "2"s are identical, but their soft targets are different, which shows that the soft target contains more information than the hard target. Moreover, when the entropy of the soft-target distribution is relatively high, the knowledge contained in the soft target is richer.
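A tiny illustration of this example with made-up teacher outputs (the exact numbers are hypothetical):

```python
import numpy as np

# Hypothetical softened teacher outputs for two different images of the digit "2".
hard_target = np.zeros(10)
hard_target[2] = 1.0                                      # identical for both images

soft_target_a = np.array([0.01, 0.01, 0.80, 0.10, 0.01,   # this "2" resembles a "3"
                          0.01, 0.01, 0.02, 0.02, 0.01])
soft_target_b = np.array([0.01, 0.01, 0.80, 0.02, 0.01,   # this "2" resembles a "7"
                          0.01, 0.01, 0.10, 0.02, 0.01])

# The hard targets are identical, but the soft targets differ and encode which
# wrong classes each image resembles -- extra information for the student.
```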

6. Softmax with "temperature"

First, review the original softmax function.
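With $z_i$ the logit for class $i$ and $q_i$ the resulting probability:

$$
q_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}
$$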

However, directly using the output of the softmax layer as the soft target brings another problem: when the entropy of the softmax output distribution is relatively small, the values of the negative labels are very close to 0 and their contribution to the loss function is negligibly small. This is where the "temperature" variable comes in handy.
The following formula is the softmax function after adding the temperature variable: 
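In the same notation as above, with temperature $T$:

$$
q_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
$$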

Here T is the temperature.
The original softmax function is the special case T = 1. The higher T is, the smoother the output probability distribution of the softmax and the greater its entropy; the information carried by the negative labels is relatively amplified, and training pays more attention to the negative labels.
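A small NumPy sketch (with made-up logits) illustrating how raising T smooths the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T gives a smoother distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 1.0, 0.5])   # hypothetical class logits
for T in (1.0, 3.0, 10.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# As T grows, probability mass shifts from the top class toward the negative
# classes: the distribution becomes smoother and its entropy increases.
```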

7. The specific method of knowledge distillation (classification network)

1. General knowledge distillation method:
(1) Train the teacher network Net-T well; 
(2) At high temperature T, distill the knowledge of Net-T to Net-S
The process of training Net-T is straightforward, so the following focuses on the second step: the high-temperature distillation process. Its objective function is a weighted sum of the distillation loss (corresponding to the soft target) and the student loss (corresponding to the hard target).

2. High temperature distillation process
(1) Loss function:
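Following the standard formulation in Hinton et al., "Distilling the Knowledge in a Neural Network", and writing the weights generically as $\alpha$ and $\beta$:

$$
L = \alpha \, L_{\mathrm{soft}} + \beta \, L_{\mathrm{hard}}
$$

In practice $L_{\mathrm{soft}}$ is usually scaled by $T^{2}$; the derivation below explains why.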

a) The first part of the loss function:
the transfer set is fed into both Net-T and Net-S at the same time (the training set used to train Net-T can be reused directly here). The softmax distribution produced by Net-T at high temperature is taken as the soft target, and the cross entropy between this soft target and the softmax output of Net-S at the same temperature T forms the first part of the loss function.
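With $v_i$ and $z_i$ denoting the teacher's and student's logits, this first term can be written as:

$$
L_{\mathrm{soft}} = -\sum_{i} p_i^{(T)} \log q_i^{(T)},
\qquad
p_i^{(T)} = \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)},\quad
q_i^{(T)} = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}
$$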

b) The second part of the loss function:
the cross entropy between the ground truth and the softmax output of Net-S at T = 1 forms the second part of the loss function.
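With $c_i$ the one-hot ground-truth label and $q_i^{(1)}$ the student's softmax output at T = 1:

$$
L_{\mathrm{hard}} = -\sum_{i} c_i \log q_i^{(1)}
$$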

c) Why the second part of the loss is needed:
Net-T also has a certain error rate, and using the ground truth can effectively reduce the chance that its errors are propagated to Net-S. By analogy: although a teacher's knowledge far exceeds the student's, the teacher can still make mistakes; if the student can also consult the standard answers alongside the teacher's instruction, the chance of being "led astray" by the teacher's occasional mistakes is effectively reduced.
d) Loss split analysis
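A minimal PyTorch-style sketch of how the two terms can be computed and combined (the names `kd_loss`, `T`, and `alpha`, and the use of `kl_div`, are illustrative choices, not taken from the original post):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of the soft (distillation) loss and the hard (student) loss."""
    # Soft part: KL divergence between the teacher's and student's distributions
    # at temperature T. This differs from the cross entropy above only by the
    # teacher's entropy, which is constant w.r.t. the student. Multiplying by
    # T^2 keeps its gradient magnitude comparable to the hard part.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross entropy with the ground-truth labels at T = 1.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```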

(2) Loss function derivation:

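A sketch of the key step, following Hinton et al.: the gradient of the soft term with respect to a student logit $z_i$ is

$$
\frac{\partial L_{\mathrm{soft}}}{\partial z_i}
= \frac{1}{T}\left(q_i^{(T)} - p_i^{(T)}\right)
= \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)
$$

If the temperature is large compared with the magnitude of the logits and the logits are zero-mean, this is approximately $\frac{1}{NT^{2}}(z_i - v_i)$ with $N$ the number of classes, i.e. high-temperature distillation roughly matches the student's logits to the teacher's. Because the gradient scales as $1/T^{2}$, the soft term is multiplied by $T^{2}$ when combined with the hard term, so the relative contributions of the two terms stay roughly constant as $T$ changes.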

8. Localization Distillation (LD)

1. Why localization distillation exists:
bbox distribution modeling and localization ambiguity.
To discuss LD, we first have to talk about bbox distribution modeling, which mainly comes from two papers: GFocalV1 (NeurIPS 2020) [1] and Offset-bin (CVPR 2020) [2].
A bbox is usually represented by 4 values. One form is the distances from a point to the four sides of the box, as in FCOS (tblr); the other is the offsets used in anchor-based detectors, i.e. the mapping from the anchor box to the GT box (encoded xywh).
GFocalV1 models the bbox distribution in the tblr form, while Offset-bin models it in the encoded xywh form. What they have in common is that they treat bbox regression as a classification problem, and the advantage of this is that the localization ambiguity of the bbox can be modeled.

Each edge is then described by n probability values, which express the model's uncertainty about that edge's position. A sharper distribution indicates that the position has almost no ambiguity (e.g., the upper boundary of an elephant), while a flatter distribution indicates strong ambiguity (e.g., the lower boundary of the elephant). Beyond flatness, the shape of the bbox distribution can also be unimodal, bimodal, or even multimodal.
2. Localization distillation:
The idea of LD is self-explanatory. Each edge of a bbox is represented by n logits, so a bbox has 4 × n logits. A temperature softmax is applied to each edge to soften the localization knowledge, and then the same KL loss as in classification KD is used to make the student's bbox distribution fit the teacher's bbox distribution.
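A minimal sketch of this idea (the tensor names and shapes are illustrative; a real GFocalV1-style head produces such per-edge bin logits at each location):

```python
import torch
import torch.nn.functional as F

def ld_loss(student_edge_logits, teacher_edge_logits, T=10.0):
    """Localization distillation: KL between softened per-edge bin distributions.

    Both inputs have shape (num_positions, 4, n_bins): 4 edges per bbox,
    each edge discretized into n_bins logits.
    """
    s = F.log_softmax(student_edge_logits / T, dim=-1)
    t = F.softmax(teacher_edge_logits / T, dim=-1)
    # Sum the KL over the 4 edges, average over positions; T^2 rescaling as in KD.
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```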
To verify the effectiveness of LD, we first apply it at the positive locations of the detector, i.e., wherever bbox regression is performed, LD is performed there. The teacher is a high-accuracy detector (e.g., with a ResNet-101 backbone) trained for 24 epochs, and the student can use ResNet-50. On the COCO dataset, only a slight adjustment of the temperature is needed to gain 1.0 AP over the GFocalV1 baseline, with the most significant improvement on AP75, indicating that LD does significantly improve localization accuracy.

Localization distillation (LD) and classification distillation (KD) have exactly the same form: both transfer knowledge through the output logits of a head, which provides a unified logit-mimicking framework for knowledge distillation in object detection.

9. Feature map distillation

1. Why feature map distillation:
Inefficiency of classification KD:
many previous works have pointed out that the distillation efficiency of classification KD is low (it yields only small gains), mainly for two reasons:
• the number of categories changes across datasets, and with fewer categories the soft labels may not provide much useful information to the student;
• for a long time, logit mimicking could only be applied to the classification head, not to the localization head, which naturally ignores the importance of transferring localization knowledge.
For these two reasons, attention turned to another promising knowledge distillation method: feature imitation. This method is mainly inspired by FitNet. In one sentence, besides doing logit mimicking on the classification head, the student is also made to fit the teacher at intermediate hidden layers (feature maps) by minimizing an L2 loss.
This gives the following knowledge distillation framework for object detection:

The classification head uses logit mimicking (classification KD), the feature map uses feature imitation (L2 loss between the teacher's and student's feature maps), and the localization head uses pseudo bbox regression, i.e., the teacher's predicted boxes are treated as additional regression targets.
Feature imitation imposes supervision on the feature maps of the teacher and the student. The most common approach is to first align the size of the student's feature map with the teacher's and then select some regions of interest as distillation regions: FitNet (ICLR 2015) [3] distills over the whole image; Fine-Grained (CVPR 2019) [4] distills at the locations of certain anchor boxes; DeFeat (CVPR 2021) [5] uses a small loss weight inside the GT boxes and a large loss weight outside them; and GI imitation (CVPR 2021) [6] uses dynamically selected distillation regions. Whatever region is selected, an L2 loss between the two feature maps is ultimately computed over that distillation region (see the sketch after the next paragraph).
2. The benefits of feature imitation:
under the multi-task learning framework, the feature map is like the root of a tree and each downstream head is like a leaf, so the feature map clearly contains the knowledge required by all the leaves. Feature imitation therefore naturally transfers classification knowledge and localization knowledge at the same time, whereas classification KD cannot transfer localization knowledge.
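A minimal sketch of the region-masked L2 feature-imitation loss described above (names, shapes, and the 1x1 adaptation layer are illustrative assumptions, not from the original post):

```python
import torch
import torch.nn.functional as F

def feature_imitation_loss(student_feat, teacher_feat, region_mask, adapt_conv):
    """L2 feature imitation over a selected distillation region.

    student_feat: (N, C_s, H, W); teacher_feat: (N, C_t, H, W);
    region_mask:  (N, 1, H, W) binary mask marking the distillation region;
    adapt_conv:   1x1 conv that maps the student's channels to the teacher's.
    """
    aligned = adapt_conv(student_feat)                     # align channel dimension
    diff2 = (aligned - teacher_feat).pow(2) * region_mask  # only supervise the region
    # Normalize by the number of supervised elements inside the region.
    return diff2.sum() / (region_mask.sum() * teacher_feat.size(1) + 1e-6)

# Example usage with hypothetical channel counts:
# adapt = torch.nn.Conv2d(in_channels=128, out_channels=256, kernel_size=1)
# loss = feature_imitation_loss(s_feat, t_feat, mask, adapt)
```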
3. Disadvantages of Feature imitation:
The drawback is precisely that it transfers classification knowledge and localization knowledge simultaneously at every location in the distillation region.
At first glance this seems to contradict the benefit above, so let me explain.
Classification knowledge is distributed differently from localization knowledge. This has been pointed out in previous work, such as Sibling Head (CVPR 2020) [7].
Because the two kinds of knowledge are distributed differently, it is not necessarily beneficial to transfer classification knowledge and localization knowledge at the same time everywhere: some regions may only be useful for transferring classification knowledge, while others may only be useful for transferring localization knowledge. In other words, we need to divide and conquer and transfer knowledge according to the local situation. This is something feature imitation obviously cannot do, because it only transfers the mixed knowledge.

So we use multi-task learning to naturally decouple the knowledge into different types, which allows us to perform knowledge distillation selectively within a region. To this end, the concept of the Valuable Localization Region (VLR) is introduced to help divide and conquer.

Different from previous feature imitation methods, the distillation is divided into two regions:
• Main distillation region: the positive locations of the detector, obtained through label assignment.
• VLR: obtained similarly to ordinary label assignment but covering a larger area that contains the Main region, with the Main region then removed. The VLR can be regarded as an outward expansion of the Main region.


Source: blog.csdn.net/weixin_43391596/article/details/128903746