Introduction to knowledge distillation

Knowledge distillation

Source: Huang Zhenhua, Yang Shunzhi, Lin Wei, et al. Review of knowledge distillation research [J]. Journal of Computer Science, 2022, 45(3): 30.

Concept

Knowledge distillation uses a teacher-student (Teacher-Student) training structure: a trained teacher model provides knowledge, and the student model acquires that knowledge through distillation training. At the cost of a slight performance loss, the knowledge of a complex teacher model can be transferred to a simple student model, reducing model complexity. In subsequent research, academia and industry expanded the application scope of knowledge distillation and proposed using it to enhance model performance. Accordingly, the application scenarios fall into two technical directions: model compression and model enhancement based on knowledge distillation. In model compression, the teacher network guides the training of the student network on the same labeled dataset to obtain a simple and efficient network model. Model enhancement emphasizes using other resources (such as unlabeled or cross-modal data) or optimized distillation strategies (such as mutual learning and self-learning) to improve the performance of a complex student model. For example, an unlabeled sample can serve as the input of both the teacher and student networks: a powerful teacher network can usually predict a label for the sample, and that label is then used to guide the training of the complex student network.
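As a concrete illustration, the following minimal sketch (assuming PyTorch; the temperature T, weight alpha, and model objects are illustrative, not taken from the review) shows the classic soft-target distillation loss that underlies this teacher-student structure.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Classic soft-target distillation: KL divergence between the softened
    teacher and student distributions, mixed with the usual cross-entropy."""
    # Temperature T > 1 softens the distributions and exposes "dark knowledge".
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```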

Knowledge types

Vanilla knowledge distillation [2] simply learns a lightweight student model from the soft targets output by the teacher model. However, when the teacher model becomes deeper, learning only the soft targets is not enough: besides the teacher's output, the student also needs to learn other knowledge implicit in the teacher model. The usable knowledge forms include output feature knowledge, intermediate feature knowledge, relational feature knowledge, and structural feature knowledge; the relationship among the four forms is illustrated in Figure 5 of the review. From the perspective of a student solving a problem, the four knowledge forms can be compared as follows: output feature knowledge provides the answer to the problem, intermediate feature knowledge provides the process of solving it, relational feature knowledge provides methods for solving it, and structural feature knowledge provides a complete knowledge system.
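The sketch below illustrates one common way to transfer intermediate feature knowledge, hint-style feature matching; it assumes PyTorch, and the 1x1 adapter and matching spatial sizes are illustrative assumptions rather than a prescription from the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """Match an intermediate student feature map to the teacher's (hint learning).
    A 1x1 convolution adapts the student's channel count to the teacher's;
    spatial sizes are assumed to match."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between adapted student features and frozen teacher features.
        return F.mse_loss(self.adapter(student_feat), teacher_feat.detach())
```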

Knowledge distillation methods

From the perspective of how knowledge is used, the main knowledge distillation methods can be summarized and analyzed as knowledge amalgamation, multi-teacher learning, teacher assistants, cross-modal distillation, mutual distillation, lifelong distillation, and self-distillation.

Knowledge amalgamation

Knowledge Amalgamation (KA) transfers the knowledge of multiple teachers, or of multiple tasks, into a single student model so that it can handle several tasks at the same time. The focus of knowledge amalgamation is how the student should combine the knowledge of multiple teachers to update the parameters of a single student model; after training, the student model can handle the original tasks of all the teacher models.
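A minimal sketch of how a single student might cover several teachers' tasks, assuming PyTorch; the shared backbone, per-task heads, and temperature are illustrative choices rather than the review's specific amalgamation method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmalgamatedStudent(nn.Module):
    """Shared backbone with one head per teacher task, so a single student
    can cover the original tasks of several teachers."""
    def __init__(self, backbone, feat_dim, classes_per_task):
        super().__init__()
        self.backbone = backbone  # any feature extractor returning feat_dim features
        self.heads = nn.ModuleList([nn.Linear(feat_dim, c) for c in classes_per_task])

    def forward(self, x):
        feat = self.backbone(x)
        return [head(feat) for head in self.heads]

def amalgamation_loss(student_logits_per_task, teacher_logits_per_task, T=4.0):
    # Each student head mimics the soft targets of its corresponding teacher.
    loss = 0.0
    for s, t in zip(student_logits_per_task, teacher_logits_per_task):
        loss += F.kl_div(F.log_softmax(s / T, dim=-1),
                         F.softmax(t / T, dim=-1).detach(),
                         reduction="batchmean") * (T * T)
    return loss
```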

Multi-teacher learning

Knowledge amalgamation and learning from multiple teachers (Learning from Multiple Teachers) both use a "multi-teacher, single-student" training structure. Their similarity is that both learn the knowledge of multiple teacher models, but their goals differ: knowledge amalgamation aims to let the student model handle the original tasks of multiple teacher models at the same time, whereas multi-teacher learning aims to improve the student model's performance on a single task.
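A small sketch of multi-teacher learning on a single task, assuming PyTorch; averaging the teachers' softened predictions is just one simple aggregation strategy, and the weights are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=4.0, weights=None):
    """Distill from several teachers on the same task by averaging their
    softened predictions (optionally weighted)."""
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    soft_target = sum(w * F.softmax(t / T, dim=-1)
                      for w, t in zip(weights, teacher_logits_list))
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_target.detach(), reduction="batchmean") * (T * T)
```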

Teacher assistant

Because of the large capacity difference between the teacher and student models, a "generation gap" exists between them. This gap can be alleviated by transferring the teacher's feature knowledge, or by using a Teacher Assistant network to assist the student model's learning.
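A rough sketch of teacher-assistant distillation, assuming PyTorch; the two-stage schedule (teacher to assistant, then assistant to student) and the loss weights are illustrative, and the model, loader, and optimizer objects are placeholders.

```python
import torch
import torch.nn.functional as F

def distill_into(student, teacher, loader, optimizer, T=4.0, alpha=0.5):
    """Generic distillation stage, run twice to bridge a large capacity gap:
    first teacher -> assistant, then assistant -> student."""
    teacher.eval()
    student.train()
    for x, y in loader:
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                      F.softmax(t_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
        loss = alpha * kd + (1.0 - alpha) * F.cross_entropy(s_logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: large teacher -> medium teacher assistant (TA).
# distill_into(assistant, teacher, loader, optimizer_ta)
# Stage 2: TA -> small student.
# distill_into(student, assistant, loader, optimizer_student)
```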

Cross-modal distillation

In many practical applications, data exists in multiple modalities, and some data in different modalities describe the same thing or event. Such synchronized modal information can be used for cross-modal distillation (Cross-Modal Distillation). A representative example is the cross-modal emotion recognition method proposed by Albanie et al. [60]: since a person's facial emotion and vocal emotion while speaking are consistent, unlabeled videos are used as input data, the video frames are fed into a pre-trained facial-emotion teacher model to generate soft targets, and these soft targets guide the training of the student speech model.
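A minimal sketch of such cross-modal distillation on unlabeled video, assuming PyTorch; face_teacher and speech_student stand for a pre-trained vision model and an audio model, and these names and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_step(face_teacher, speech_student, frames, audio, T=2.0):
    """One cross-modal distillation step on an unlabeled clip: the vision
    teacher's soft emotion prediction supervises the audio student."""
    with torch.no_grad():
        soft_target = F.softmax(face_teacher(frames) / T, dim=-1)
    log_student = F.log_softmax(speech_student(audio) / T, dim=-1)
    return F.kl_div(log_student, soft_target, reduction="batchmean") * (T * T)
```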

Mutual distillation (deep mutual learning)

Mutual distillation (Mutual Distillation) lets a group of untrained student models start learning at the same time and solve the task together. It is a form of online knowledge distillation, in which the teacher and student models are trained and updated simultaneously. The idea of mutual distillation was proposed by Zhang et al. [81] in 2017; its significance is that, in the absence of a strong teacher, student models can improve their performance through the ensemble of predictions they learn from each other. Mutual distillation avoids dependence on a strong teacher model, while the student models benefit from learning from one another. Through mutual distillation, an original combination of models can evolve into a new combination with better performance.
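A minimal sketch of deep mutual learning with two peer students, assuming PyTorch; the unit weighting between the supervised and mimicry terms is illustrative.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(logits_a, logits_b, labels):
    """Deep mutual learning sketch: two peer students trained together, each
    matching the other's prediction distribution in addition to the labels."""
    loss_a = (F.cross_entropy(logits_a, labels) +
              F.kl_div(F.log_softmax(logits_a, dim=-1),
                       F.softmax(logits_b, dim=-1).detach(),
                       reduction="batchmean"))
    loss_b = (F.cross_entropy(logits_b, labels) +
              F.kl_div(F.log_softmax(logits_b, dim=-1),
                       F.softmax(logits_a, dim=-1).detach(),
                       reduction="batchmean"))
    return loss_a, loss_b
```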

Lifelong distillation

When a deep learning network learns new tasks, its performance on old tasks drops sharply; this phenomenon is called catastrophic forgetting [85]. Lifelong learning, also called continual or incremental learning, is used to mitigate this effect. Some work realizes lifelong learning through knowledge distillation, called lifelong distillation: knowledge distillation is used to maintain performance on old tasks while adapting to new ones. Its focus is how to preserve old-task performance and mitigate catastrophic forgetting while training on new-task data. Knowledge distillation addresses this well by minimizing the difference between the old and new networks' responses to the old classes [85].
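A sketch in the spirit of Learning-without-Forgetting-style approaches, assuming PyTorch, that the old classes occupy the first num_old_classes output positions of the new network, and that labels index the combined old+new label space; the weighting lam and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def lifelong_distillation_loss(new_logits, old_logits, labels,
                               num_old_classes, T=2.0, lam=1.0):
    """On new-task data, supervise the new classes normally while keeping the
    new network's responses on the old classes close to the frozen old network's."""
    ce_new = F.cross_entropy(new_logits, labels)  # supervision for the new task
    distill_old = F.kl_div(
        F.log_softmax(new_logits[:, :num_old_classes] / T, dim=-1),
        F.softmax(old_logits[:, :num_old_classes] / T, dim=-1).detach(),
        reduction="batchmean") * (T * T)
    return ce_new + lam * distill_old
```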

Self-distillation

In self-distillation (Self-Distillation), a single network serves as both teacher and student, allowing one network model to improve its performance through knowledge distillation during self-learning. Self-distillation falls mainly into two categories. The first uses information from different samples for mutual distillation: soft labels of other samples can prevent over-confident predictions, and minimizing the prediction distributions between different samples can even reduce the intra-class distance [90]. Other work uses augmented sample information, for example exploiting the feature consistency of data under different distortions to promote robust intra-class learning [91]. The second category is self-distillation between the layers of a single network. The most common approach is to use the features of deep layers to guide the learning of shallow layers [92], where the deep-layer features include the soft targets output by the network. In sequence tasks, the knowledge in earlier frames can be transferred to the learning of later frames [93]. The learning among the blocks of a single network can also be bidirectional, with the blocks collaboratively guiding each other throughout training.
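A sketch of the layer-wise variant, assuming PyTorch and that auxiliary classifiers are attached to the shallow blocks of one network; the temperature and mixing weight are illustrative.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits_list, deep_logits, labels, T=3.0, alpha=0.3):
    """Layer-wise self-distillation sketch: auxiliary classifiers on the shallow
    blocks learn from the deepest block's soft targets plus the hard labels."""
    soft_deep = F.softmax(deep_logits / T, dim=-1).detach()
    loss = F.cross_entropy(deep_logits, labels)  # deepest head keeps normal supervision
    for logits in shallow_logits_list:
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1), soft_deep,
                      reduction="batchmean") * (T * T)
        loss += alpha * kd + (1.0 - alpha) * F.cross_entropy(logits, labels)
    return loss
```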

Integration of knowledge distillation and other technologies

Generative Adversarial Network

Combining knowledge distillation with GANs means introducing adversarial learning strategies into distillation. A GAN generates new images through adversarial learning, that is, it learns to generate images that the discriminator network cannot distinguish from real ones. Its main structure includes a generator and a discriminator: the generator produces samples as close as possible to the real data so that the discriminator cannot tell them apart. In knowledge distillation, the discriminator's main task is to distinguish the feature-map or data distributions of different networks, while the generator's main task is to produce the corresponding feature-map or data distribution for a given instance; the generator is generally the student model [94] or the teacher-student pair [95]. Specifically, GAN-based knowledge distillation uses the discriminator to make the knowledge of the teacher and student models converge, until the discriminator cannot tell whether a piece of knowledge comes from the teacher or the student.
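A minimal sketch of adversarial feature distillation, assuming PyTorch; the small MLP discriminator and binary cross-entropy losses are illustrative choices, not the specific designs of [94] or [95].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Small MLP that tries to tell teacher features from student features."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feat):
        return self.net(feat)  # raw logit: "is this a teacher feature?"

def adversarial_distillation_losses(disc, student_feat, teacher_feat):
    real = torch.ones(teacher_feat.size(0), 1, device=teacher_feat.device)
    fake = torch.zeros(student_feat.size(0), 1, device=student_feat.device)
    # Discriminator: separate teacher (real) from student (fake) features.
    d_loss = (F.binary_cross_entropy_with_logits(disc(teacher_feat.detach()), real) +
              F.binary_cross_entropy_with_logits(disc(student_feat.detach()), fake))
    # Student (generator role): fool the discriminator so its features become
    # indistinguishable from the teacher's.
    g_loss = F.binary_cross_entropy_with_logits(disc(student_feat), real)
    return d_loss, g_loss
```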

Neural Architecture Search

Neural architecture search (NAS) uses a search strategy to find the optimal network structure within a given search space. Compared with ordinary NAS, NAS based on knowledge distillation exploits soft targets, which contain more information; using this additional information can speed up the search for a network structure. The additional information produced by knowledge distillation can also guide NAS toward the best student architecture, that is, improve the student model's performance. The network structure itself is a very important form of knowledge.

Reinforcement learning

The purpose of reinforcement learning (Reinforcement Learning) is to enable an agent to learn the best policy from environment states, actions, and rewards. Once the reinforcement learning environment is set up, the knowledge transferred by distillation can produce a better-performing student agent. Knowledge distillation based on reinforcement learning serves two main purposes. The first is to strengthen the policies of deep reinforcement learning models: the main idea is to combine the policies of one or more teachers through knowledge distillation, with the teacher and student models both maximizing the student's return under the constraints of the environment. The student model can thus obtain higher-performing policies from multiple teacher agents [103], and student models with different policies can also distill from each other to keep strengthening their policies [104]. The second purpose is to obtain a more lightweight network model: combining reinforcement learning with knowledge distillation can transfer the policy knowledge of a reinforcement learning model into a lightweight single network [105], or use the teacher's policy knowledge to gradually remove redundancy from the student model [106].
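A minimal policy-distillation sketch, assuming PyTorch and discrete actions, where teacher_policy and student_policy are networks mapping states to action logits; the names and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_policy, teacher_policy, states, T=1.0):
    """Push the student's action distribution toward the teacher's on states
    sampled from the environment or a replay buffer."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_policy(states) / T, dim=-1)
    student_log_probs = F.log_softmax(student_policy(states) / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```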

Graph convolution

A graph convolutional network (GCN) is a convolutional network defined over a set of nodes and the relationships between them; it is widely used because of its powerful modeling ability. Since a graph convolutional model carries a topological structure, knowledge distillation can transfer the topological-structure knowledge of a GCN teacher model to the student model [107]. At the same time, graph convolution can also promote knowledge transfer: knowledge distillation can use the strong modeling ability of graph convolution to capture and transfer certain kinds of knowledge from the teacher model to the student model, such as spatial geometry [108], the interactions of targets in space and time, and complementary knowledge among multiple teachers.
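One simple way to transfer topological or relational knowledge is to match the pairwise similarity structure of node embeddings; the sketch below assumes PyTorch and N x d node-embedding matrices from the teacher and student, and is only one possible realization rather than the method of [107].

```python
import torch
import torch.nn.functional as F

def structure_distillation_loss(student_node_emb, teacher_node_emb):
    """Match the node-to-node cosine-similarity matrices of student and teacher
    embeddings, so the student inherits the teacher's relational structure."""
    s = F.normalize(student_node_emb, dim=-1)  # shape (N, d_s)
    t = F.normalize(teacher_node_emb, dim=-1)  # shape (N, d_t)
    return F.mse_loss(s @ s.T, (t @ t.T).detach())
```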

Autoencoder

An autoencoder (AE) is an unsupervised neural network model. Because of its good compression ratio and transfer-learning performance, it is widely used for dimensionality reduction and for training generative models. Through feature reconstruction, an autoencoder can automatically learn hidden features from data samples, and this ability can assist knowledge distillation in improving the performance of student networks. Similarly, knowledge distillation, as an auxiliary technique, can help autoencoders learn more robust feature representations.
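A sketch of how an autoencoder can assist distillation, in the spirit of compressing teacher features into a compact code that the student then mimics; it assumes PyTorch, and the code dimension and the student head that produces student_code are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherFeatureAE(nn.Module):
    """Autoencoder that compresses the teacher's feature vector into a compact
    code; reconstruction keeps the code informative."""
    def __init__(self, feat_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, feat_dim)

    def forward(self, feat):
        code = self.encoder(feat)
        return code, self.decoder(code)

def ae_assisted_distillation_losses(ae, student_code, teacher_feat):
    # student_code comes from a small head on the student with code_dim outputs.
    code, recon = ae(teacher_feat)
    recon_loss = F.mse_loss(recon, teacher_feat)              # trains the AE
    transfer_loss = F.mse_loss(student_code, code.detach())   # student mimics the code
    return recon_loss, transfer_loss
```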

Ensemble learning

The core idea of ensemble learning (Ensemble Learning) is that "three cobblers with their wits combined equal one Zhuge Liang": multiple networks process the same task, and their combined performance is usually better than that of a single network. One important direction for fusing ensemble learning with knowledge distillation is to make a simple model's performance comparable to that of an ensemble of multiple networks. In addition, knowledge distillation can be used to enhance an ensemble composed of multiple student networks, mainly by having the student networks guide and learn from each other from the beginning through peer teaching, and finally integrating them into an ensemble network with stronger inference performance.
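A sketch of online ensemble distillation, assuming PyTorch, where the averaged soft prediction of the student group acts as the teacher for each member; the temperature and weights are illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits_list, labels, T=3.0, alpha=0.5):
    """Each student in the group learns from the labels and from the ensemble's
    averaged soft prediction, which plays the role of the teacher."""
    ensemble_soft = torch.stack(
        [F.softmax(l / T, dim=-1) for l in student_logits_list]).mean(dim=0)
    total = 0.0
    for logits in student_logits_list:
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                      ensemble_soft.detach(), reduction="batchmean") * (T * T)
        ce = F.cross_entropy(logits, labels)
        total += alpha * kd + (1.0 - alpha) * ce
    return total
```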

Federated learning

Knowledge distillation can be used to reduce the bandwidth consumed by distributed federated learning training; it compresses cost by reducing the parameters or samples transmitted at each stage of federated learning. Some work transmits only the models' prediction information instead of the full model parameters: each participant applies knowledge distillation locally, and the server aggregates the prediction information into the global model to improve performance. By transmitting only the predictions of each local model and the global model, combining knowledge distillation with federated learning can reduce network bandwidth, allow model heterogeneity, and protect data privacy. Other work focuses on using knowledge distillation to efficiently fuse the heterogeneous local-model knowledge of the participants. For example, Shen et al. [128] use mutual distillation to train models of different architectures during the local-update phase of federated learning, and Lin et al. [129] use unlabeled data or pseudo-samples to aggregate the model knowledge of all heterogeneous participants.
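A minimal federated-distillation sketch, assuming PyTorch and a shared public or unlabeled dataset on which every client computes logits; only these logits are exchanged in this sketch, and the function names and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def aggregate_client_logits(client_logit_list):
    """Server step: average the soft predictions each client computed on the
    shared dataset (only logits are communicated, not model parameters)."""
    return torch.stack(client_logit_list).mean(dim=0)

def client_distill_loss(local_logits, global_logits, T=2.0):
    """Client step: pull the local model toward the aggregated global prediction."""
    return F.kl_div(F.log_softmax(local_logits / T, dim=-1),
                    F.softmax(global_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
```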
