[Knowledge Distillation] Detailed explanation of Knowledge Distillation technology

Reference paper: Knowledge Distillation: A Survey

1. Introduction

       In recent years, deep learning has achieved great success in both academia and industry, rooted in its scalability and its ability to encode large-scale data. However, a major challenge of deep learning is that large deep neural models are difficult to deploy on resource-constrained devices such as embedded and mobile devices. Consequently, a large number of model compression and acceleration techniques have emerged; knowledge distillation is a representative one, which effectively transfers knowledge from a large teacher model to a small student model. This paper reviews knowledge distillation from the aspects of knowledge categories, training schemes, teacher-student architectures, distillation algorithms, performance comparison, and applications.

       The survey notes the recent progress of deep learning in computer vision, reinforcement learning, and natural language processing, and points out that the main challenge lies in deploying large models in practical applications. To develop efficient deep models, research in recent years has mainly focused on (1) efficient and fast building blocks for deep neural networks, and (2) model compression and acceleration techniques, which mainly include the following:

  • Parameter pruning and sharing: this method focuses on removing parameters that are not essential to the model's performance;
  • Low-rank factorization: this method uses matrix and tensor decomposition to identify redundant parameters of deep neural networks;
  • Transferred/compact convolutional filters: this method removes redundant parameters by transferring or compressing convolutional filters;
  • Knowledge distillation (KD): this method distills the knowledge of a larger deep neural network into a smaller network.

       To solve the problem of deploying large models, Bucilua et al. (2006) first proposed model compression, which transfers the information in a large model or an ensemble of models into a small model to be trained, without a significant drop in accuracy. Knowledge transfer between a fully supervised teacher model and a student model using unlabeled data was also introduced in semi-supervised learning. The learning of a small model from a large model was later formally popularized as knowledge distillation (Hinton et al., 2015). The main idea of knowledge distillation is that the student model imitates the teacher model in order to obtain performance that is competitive with, or even superior to, that of the teacher. The key problem is how to transfer knowledge from the large teacher model to the small student model. A knowledge distillation system is composed of three key components: the knowledge, the distillation algorithm, and the teacher-student architecture.
       Despite its great success in practice, knowledge distillation is not well understood, either theoretically or empirically. On the theoretical side, Urner et al. used unlabeled data to prove that knowledge transfer from a teacher to a student is PAC-learnable; Phuong & Lampert obtained a justification of distillation by proving the fast convergence of the distillation-trained student network in a deep linear classifier setting, which answers what and how fast the student learns and reveals the factors that determine the success of distillation, namely the data distribution, the optimization bias of the distillation objective, and the strong monotonicity of the student classifier; Cheng et al. quantified the extraction of visual concepts from the intermediate layers of a deep neural network to explain knowledge distillation; and Ji and Zhu theoretically explained knowledge distillation on wide neural networks from three aspects: risk bound, data efficiency, and imperfect teacher. On the empirical side, Cho and Hariharan carried out a detailed analysis of the efficacy of knowledge distillation; the results of Mirzadeh et al. showed that, due to the model capacity gap, a teacher with a larger model is not necessarily better; and the experiments of Cho and Hariharan (2019) also showed that distillation can adversely affect student learning. However, empirical evaluations of different forms of knowledge, of distillation schemes, and of teacher-student interactions are still lacking. Knowledge distillation has also been used for label smoothing, for assessing the accuracy of the teacher, and for obtaining a prior over the optimal output-layer parameter distribution.
       Knowledge distillation is similar to the way humans learn. Inspired by this, recent research has extended the teacher-student framework to mutual learning, lifelong learning, and self-learning. Also, inspired by knowledge distillation for model compression, the idea of knowledge transfer has been applied to compressing training data, e.g., dataset distillation.

Article structure diagram

2. Knowledge

       In knowledge distillation, the knowledge types, the distillation strategies, and the teacher-student architectures all play a crucial role in the learning of the student model. Vanilla knowledge distillation uses the logits of a large deep model as the teacher knowledge (Hinton et al., 2015); the activations, neurons, or features of intermediate layers can also serve as knowledge to guide the learning of the student model, and the relationships between different activations, neurons, or pairs of samples contain rich information learned by the teacher model. Furthermore, the parameters of the teacher model (or the connections between layers) contain yet another kind of knowledge. This section discusses the following categories of knowledge: response-based knowledge, feature-based knowledge, and relation-based knowledge. The figure below gives an intuitive example of these different categories of knowledge within a teacher model.

[Figure: the different categories of knowledge in a teacher model]

2.1. Response-Based Knowledge

       Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to let the student model directly mimic the final prediction (the logits) of the teacher model. Let the vector of logits z be the output of the last fully connected layer of a model; then response-based knowledge distillation can be formulated as:
$$L_{ResD}(z_t, z_s) = L_R(z_t, z_s)$$

       Here, L_R(·) denotes the divergence loss between the two logit vectors (cross-entropy can also be used). The figure below shows the generic response-based knowledge distillation model.
[Figure: the generic response-based knowledge distillation model]
       Response-based knowledge can be used for different types of model predictions. For example, in object detection the response may contain the classification logits together with the offsets of the bounding boxes; in human pose estimation, the response of the teacher model may be a heat map for each landmark. The most popular response-based knowledge for image classification is known as the soft targets. Soft targets are the probabilities that the input belongs to each class, which can be estimated with a softmax function as:
$$p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
       Here z_i is the logit for the i-th class, and T is a temperature factor that controls how soft, and hence how informative, each target is. Soft targets carry the informative dark knowledge of the teacher model. Accordingly, the distillation loss for soft logits can be rewritten as:

$$L_{ResD}\big(p(z_t, T), p(z_s, T)\big) = L_R\big(p(z_t, T), p(z_s, T)\big)$$

Usually, L_R(·) is the KL divergence (Kullback-Leibler divergence, a measure of how different two probability distributions are). Optimizing this loss makes the student's logits match the teacher's logits. The following figure shows the specific architecture of benchmark knowledge distillation.

[Figure: the specific architecture of benchmark knowledge distillation]
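To make this concrete, the following is a minimal PyTorch sketch of the soft-target distillation loss described above. The temperature T, the weighting factor alpha, and the T² rescaling of the soft term follow common practice, but the specific values and the toy inputs are illustrative assumptions rather than part of the survey.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student predictions."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)   # student soft predictions
    p_t = F.softmax(teacher_logits / T, dim=1)           # teacher soft targets
    # multiplying by T^2 keeps the soft-loss gradients comparable to the hard loss
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    hard = F.cross_entropy(student_logits, labels)              # supervised term
    soft = soft_target_loss(student_logits, teacher_logits, T)  # distillation term
    return alpha * soft + (1.0 - alpha) * hard

# toy usage with random logits for a batch of 8 samples and 10 classes
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```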
However, response-based knowledge relies on the output of the last layer and therefore fails to exploit the intermediate-level supervision of the teacher model, which is very important for representation learning with very deep neural networks. Moreover, since soft logits are in effect class probability distributions, response-based knowledge distillation is limited to supervised learning.

2.2. Feature-Based Knowledge

Deep neural networks are good at learning representations at multiple levels of abstraction, so both intermediate layers and the output layer can serve as knowledge to train the student model: the output of the last layer and the outputs of intermediate layers (feature maps) can both be used to supervise the student's training. Feature-based knowledge from intermediate layers is a good complement to response-based knowledge; the main idea is to directly match the feature activations of the teacher and the student. In general, the feature-based knowledge transfer loss can be expressed as:

$$L_{FeaD}\big(f_t(x), f_s(x)\big) = L_F\big(\Phi_t(f_t(x)), \Phi_s(f_s(x))\big)$$

Here, f_t(x) and f_s(x) are the feature maps of intermediate layers of the teacher and student models, respectively. The transformation functions Φ_t(·) and Φ_s(·) are applied when the teacher and student feature maps are not of the same shape, and L_F(·) measures the similarity of the two feature maps; the L1 norm, L2 norm, and cross-entropy are commonly used. The figure below shows the generic architecture of a feature-based knowledge distillation model.
[Figure: the generic architecture of feature-based knowledge distillation]
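As an illustration of the feature matching described above, here is a minimal PyTorch sketch of feature-based distillation. The 1×1-convolution regressor used as the transformation, the bilinear resizing, and the L2 matching loss are illustrative choices in the spirit of FitNets-style hint learning, not the only possible ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """Matches a student feature map to a teacher feature map."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # transformation applied because the two feature maps differ in channel width
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, f_s, f_t):
        f_s = self.regressor(f_s)
        if f_s.shape[-2:] != f_t.shape[-2:]:
            # spatially align the maps before comparing them
            f_s = F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear", align_corners=False)
        return F.mse_loss(f_s, f_t.detach())   # L2 distance between the feature maps

# toy usage with made-up shapes: student has 64 channels, teacher has 256
distiller = FeatureDistiller(student_channels=64, teacher_channels=256)
loss_feat = distiller(torch.randn(8, 64, 16, 16), torch.randn(8, 256, 16, 16))
```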
Although feature-based knowledge transfer provides favorable information for the learning of the student model, how to effectively choose the hint layers in the teacher model and the guided layers in the student model remains to be investigated. Since the hint and guided layers can differ significantly in size, how to properly match the feature representations of teacher and student also needs to be explored.

2.3. Relation-Based Knowledge

Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model, whereas relation-based knowledge further explores the relationships between different layers or between data samples. In general, the relation-based distillation loss defined on feature-map relations can be expressed as:
$$L_{RelD}(f_t, f_s) = L_R\big(\Psi_t(\hat{f}_t, \tilde{f}_t), \Psi_s(\hat{f}_s, \tilde{f}_s)\big)$$
where $f_t$ and $f_s$ are the feature maps of the teacher and student models, respectively, and $(\hat{f}_t, \tilde{f}_t)$ and $(\hat{f}_s, \tilde{f}_s)$ denote groups (pairs) of feature maps chosen from the teacher and the student. $\Psi_t(\cdot)$ and $\Psi_s(\cdot)$ are the similarity functions computed over each pair of feature maps.
Traditional knowledge transfer methods often distill knowledge individually, with the teacher's soft targets distilled directly into the student. In fact, the distilled knowledge contains not only feature information but also the mutual relations between data samples. The relations considered in this section cover many aspects and the designs are very flexible; the original paper gives a clearer and more complete treatment.
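One concrete instance of relation-based knowledge is the pairwise distance structure between the samples of a batch, in the spirit of relational knowledge distillation: instead of matching features directly, the student matches the teacher's distance matrix. The sketch below is a minimal PyTorch illustration; the distance normalization and the smooth-L1 matching loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_distances(embeddings):
    """(N, D) embeddings -> (N, N) matrix of normalized Euclidean distances."""
    d = torch.cdist(embeddings, embeddings, p=2)
    # normalize by the mean non-zero distance so that the different scales of the
    # teacher and student embedding spaces do not dominate the loss
    return d / (d[d > 0].mean() + 1e-8)

def relation_loss(f_s, f_t):
    """Match the student's pairwise-distance structure to the teacher's."""
    return F.smooth_l1_loss(pairwise_distances(f_s), pairwise_distances(f_t.detach()))

# toy usage: a batch of 32 samples with different embedding sizes
loss_rel = relation_loss(torch.randn(32, 128), torch.randn(32, 512))
```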

3. Distillation Schemes

Depending on whether the teacher model is updated simultaneously with the student model, the learning schemes of knowledge distillation can be divided into offline distillation, online distillation, and self-distillation.

[Figure: the three knowledge distillation schemes: offline distillation, online distillation, and self-distillation]

3.1. Offline Distillation

Most previous knowledge distillation methods work offline. In vanilla knowledge distillation, knowledge is transferred from a pre-trained teacher model into the student model, so the whole training process has two stages: 1) the large teacher model is first trained on a set of training samples before distillation; 2) the teacher model is then used to extract knowledge in the form of logits or intermediate features, which guides the training of the student model during distillation. The structure of the teacher is usually predefined, and little attention is paid to the teacher's structure or to its relationship with the student model. Therefore, offline methods mainly focus on improving different parts of the knowledge transfer, including the design of the knowledge and the loss functions for feature matching or distribution matching. The main advantage of offline methods is that they are simple and easy to implement.
Offline distillation methods usually adopt one-way knowledge transfer and a two-stage training procedure. However, the long training time of a complex, high-capacity teacher model cannot be avoided, even though the training of the student model under the teacher's guidance in offline distillation is usually efficient. Moreover, the capacity gap between teacher and student always exists, and the student is often highly dependent on the teacher.

3.2. Online Distillation

To overcome the limitations of offline distillation, online distillation was proposed to further improve the performance of the student model, especially when a large-capacity, high-performance teacher model is not available. In online distillation, the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.
Online distillation is a one-stage, end-to-end training scheme that benefits from efficient parallel computing. However, existing online methods such as mutual learning usually fail to address the high-capacity teacher in the online setting, which makes it an interesting topic to further explore the relationship between the teacher and student models in online distillation.
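As an illustration of online distillation, the following is a minimal PyTorch sketch of deep mutual learning, where two peer networks are trained together and each one distills from the other's current soft predictions. The ResNet-18 peers, the SGD settings, and the temperature are placeholders, not choices made by the survey.

```python
import torch
import torch.nn.functional as F
from torchvision import models

net_a = models.resnet18(num_classes=10)
net_b = models.resnet18(num_classes=10)
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1, momentum=0.9)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1, momentum=0.9)

def mutual_step(x, y, T=3.0):
    """One online-distillation step: both peers act as teacher and student."""
    za, zb = net_a(x), net_b(x)
    kl_ab = F.kl_div(F.log_softmax(za / T, dim=1), F.softmax(zb.detach() / T, dim=1),
                     reduction="batchmean") * T * T
    kl_ba = F.kl_div(F.log_softmax(zb / T, dim=1), F.softmax(za.detach() / T, dim=1),
                     reduction="batchmean") * T * T
    loss_a = F.cross_entropy(za, y) + kl_ab   # A learns from B's soft targets
    loss_b = F.cross_entropy(zb, y) + kl_ba   # B learns from A's soft targets
    opt_a.zero_grad(); opt_b.zero_grad()
    loss_a.backward()
    loss_b.backward()
    opt_a.step(); opt_b.step()
```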

3.3. Self-Distillation

       In self-distillation, the teacher and the student use the same network, which can be regarded as a special case of online distillation. For example, the paper (Zhang, L., Song, J., Gao, A., Chen, J., Bao, C. & Ma, K. (2019b). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In ICCV.) distills the knowledge of the deeper sections of a network into its shallower sections.
Offline, online, and self-distillation can also be understood intuitively from the perspective of human teacher-student learning: offline distillation means a knowledgeable teacher teaching a student; online distillation means the teacher and the student learning together; and self-distillation means the student learning by themselves. Moreover, just as in human learning, the three schemes complement one another thanks to their respective merits.

4. Teacher-Student Architecture

       In knowledge distillation, the teacher-student architecture is the generic carrier of knowledge transfer. In other words, the teacher-student architecture determines the quality of the knowledge that the student model acquires from the teacher model. In terms of the habits of human learning, we would like the student to find the right teacher. Therefore, how to select or design an appropriate teacher-student architecture is an important yet difficult problem in knowledge distillation. At present, the teacher and student model settings during distillation are almost fixed in both size and structure, which easily leads to a model capacity gap; however, how to specifically design the architectures of the teacher and the student, and why these model settings determine their architectures, remains largely unexplored. The relationship between the two model settings is mainly as follows:

[Figure: the relationship between the teacher and student model settings]
       Knowledge distillation was originally used to compress models: Hinton et al. (Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.) designed knowledge distillation to compress an ensemble of deep neural networks. The complexity of a deep neural network mainly comes from two dimensions, depth and width, and it is usually necessary to transfer knowledge from deeper and wider neural networks to shallower and narrower ones. Common choices for the student network are: 1) a simplified version of the teacher network, with fewer layers and fewer channels per layer; 2) a quantized version of the teacher network that preserves its structure; 3) a small network with efficient basic operations; 4) a small network with an optimized global structure; or 5) a network with the same structure as the teacher.
The capacity gap between large and small networks hinders knowledge transfer. To transfer knowledge to the student network effectively, a variety of methods have been proposed to control or reduce model complexity. The paper (Mirzadeh, S. I., Farajtabar, M., Li, A. & Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant. In AAAI.) introduces a teacher assistant to bridge the training gap between the teacher model and the student model. Another line of work further narrows the gap through residual learning, using an auxiliary structure to learn the residual error. Other approaches instead focus on reducing the structural difference between the teacher and the student; for example, quantization can be combined with knowledge distillation, i.e., the student model is a quantized version of the teacher model.
       Previous studies mostly focused on designing the structure of the student model or the knowledge transfer scheme between teacher and student. To make a small student model match a large teacher model well, and thus improve distillation performance, an adaptive teacher-student learning architecture is needed. In recent years, the idea of neural architecture search has been introduced into knowledge distillation, i.e., jointly searching for the student structure and the knowledge transfer under the guidance of the teacher model, and this is likely to become a hot topic of future research. In addition, the idea of dynamically searching for the knowledge transfer mechanism has also appeared in knowledge distillation, for example using reinforcement learning to automatically remove redundant layers in a data-driven way and to find the optimal student network given a teacher network.

5. Distillation Algorithms

A simple yet effective way to transfer knowledge is to directly match the response-based knowledge, the feature-based knowledge, or the representation distributions in feature space between the teacher and the student models. Many different algorithms have been proposed to improve the knowledge transfer process in more complex settings.

5.1. Adversarial Distillation

In knowledge distillation, it is difficult for the teacher model to learn the true data distribution, and the student model, having a small capacity, cannot accurately mimic the teacher model. In recent years, adversarial training has achieved great success in generative modeling: the discriminator of a generative adversarial network (GAN) estimates the probability that a sample comes from the training data distribution, while the generator tries to fool the discriminator with generated data samples. Inspired by this, many adversarial knowledge distillation methods have been proposed to help the teacher and student networks better capture the true data distribution. As shown in the figure below, the use of adversarial training in knowledge distillation can be roughly divided into three categories.
[Figure: the three main categories of adversarial distillation]

a) Train an adversarial generator to generate synthetic data, which is used either directly as the training set or to augment the training set.

$$L_{KD} = L_G\big(F_t(G(z)), F_s(G(z))\big)$$
Here, F_t(·) and F_s(·) are the outputs of the teacher and student models, respectively; G(z) denotes a training sample generated by the generator G from a random input vector z; and L_G is the distillation loss that forces the student's predicted probability distribution to match the teacher's, with cross-entropy or KL divergence usually adopted as the distillation loss.
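The sketch below illustrates category (a) under simple assumptions: a generator maps noise to synthetic inputs, it is pushed toward samples on which the teacher and the student still disagree, and the teacher's outputs on those samples supervise the student. The generator architecture, the image shape, the optimizers, and the assumption that the teacher is frozen are all illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim = 100  # dimensionality of the noise input
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, 3 * 32 * 32), nn.Tanh())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

def adversarial_distillation_step(teacher, student, opt_s, batch_size=64, T=3.0):
    # the teacher is assumed to be pre-trained and frozen (requires_grad=False)
    # generator step: seek synthetic samples on which student and teacher still disagree
    x = G(torch.randn(batch_size, z_dim)).view(batch_size, 3, 32, 32)
    loss_g = -F.kl_div(F.log_softmax(student(x) / T, dim=1),
                       F.softmax(teacher(x) / T, dim=1), reduction="batchmean")
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # student step: imitate the teacher on freshly generated samples
    x = G(torch.randn(batch_size, z_dim)).view(batch_size, 3, 32, 32).detach()
    loss_s = F.kl_div(F.log_softmax(student(x) / T, dim=1),
                      F.softmax(teacher(x) / T, dim=1), reduction="batchmean")
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```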

b) Use a discriminator to distinguish, based on logits or features, whether a sample comes from the teacher or from the student model.

A representative method is the paper (Wang, Y., Xu, C., Xu, C. & Tao, D. (2018f). Adversarial learning of portable student networks. In AAAI.), in which the student network plays the role of the generator G, and a GAN loss L_GAN is combined with the distillation loss so that the outputs of the student and the teacher become as similar as possible.

c) Performed in an online manner: in each iteration, the teacher and the student are jointly optimized.

Knowledge distillation can also be used to compress GANs, with a small student GAN imitating a large teacher GAN through knowledge transfer. From the adversarial distillation methods above, three main conclusions can be drawn: 1) GANs are an effective tool for improving the student's learning ability via teacher knowledge transfer; 2) combining GANs with knowledge distillation can generate valuable data for improving distillation performance and overcomes the limitation of unavailable or inaccessible data; and 3) knowledge distillation can be used to compress GANs.

5.2. Multi-teacher Distillation

Different teacher architectures can provide their own useful knowledge to a student network. Multiple teacher networks can be used, individually or jointly, for distillation during the training of a single student network. In a typical teacher-student framework, the teacher is a large model or an ensemble of large models. The simplest way to transfer knowledge from multiple teachers is to use the averaged response of all teachers as the supervision signal. The general framework of multi-teacher distillation is shown in the figure below.

[Figure: the generic framework of multi-teacher distillation]

Multiple teacher networks typically use logits and feature representations as the knowledge. Besides averaging the logits of all teachers, there are other variants. The paper (Chen, X., Su, J. & Zhang, J. (2019b). A two-teacher framework for knowledge distillation. In ISNN.) uses two teacher networks, one transferring response-based knowledge to the student and the other transferring feature-based knowledge. The paper (Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J. & Ramabhadran, B. (2017). Efficient knowledge distillation from an ensemble of teachers. In Interspeech.) randomly selects one teacher from a pool of teacher networks at each iteration. In general, multi-teacher knowledge distillation can provide rich knowledge and tailor a versatile student model thanks to the diversity of knowledge from different teachers. However, how to effectively integrate different kinds of knowledge from multiple teachers requires further study.
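A minimal sketch of the simplest scheme mentioned above, in which the averaged soft response of all teachers supervises the student, is given below; the uniform averaging, the temperature, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def averaged_teacher_targets(x, teachers, T=4.0):
    """Average the temperature-softened predictions of all teacher models."""
    with torch.no_grad():
        probs = [F.softmax(t(x) / T, dim=1) for t in teachers]
    return torch.stack(probs, dim=0).mean(dim=0)

def multi_teacher_kd_loss(student_logits, x, teachers, labels, T=4.0, alpha=0.5):
    p_avg = averaged_teacher_targets(x, teachers, T)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1), p_avg,
                    reduction="batchmean") * T * T          # distill from the ensemble
    hard = F.cross_entropy(student_logits, labels)          # ordinary supervised term
    return alpha * soft + (1 - alpha) * hard
```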

5.3. Cross-Modal Distillation

       The data or labels of some modalities may be unavailable during training or testing, so knowledge needs to be transferred between different modalities. Using a teacher model pre-trained on one modality (e.g., RGB images) with a large number of well-annotated samples, the paper (Gupta, S., Hoffman, J. & Malik, J. (2016). Cross modal distillation for supervision transfer. In CVPR.) transfers knowledge from the teacher model to a student model that takes a new, unlabeled input modality, such as depth images or optical flow. Specifically, the method relies on unlabeled paired samples of the two modalities, i.e., RGB and depth images; the features that the teacher obtains from the RGB images are then used to supervise the training of the student. The idea behind paired samples is to transfer annotation or label information across modalities, and it has been widely used in cross-modal applications. Other examples of paired samples are: 1) RGB videos and skeleton sequences in human action recognition; 2) in visual question answering, knowledge transferred from a trilinear-interaction teacher model whose input is an image-question-answer triple to a bilinear student model whose input is an image-question pair. The framework of cross-modal distillation is shown below:

[Figure: the generic framework of cross-modal distillation]
       Cross-modal distillation methods are summarized in the table below, where ResK denotes response-based knowledge, FeaK feature-based knowledge, and RelK relation-based knowledge.
[Table: summary of cross-modal knowledge distillation methods]

5.4. Graph-Based Distillation

Most knowledge distillation algorithms focus on transferring knowledge about individual instances from the teacher to the student, while some recently proposed methods use graphs to explore the relationships within the data. The main ideas of these graph-based distillation methods are to 1) use a graph as the carrier of the teacher's knowledge, or 2) use a graph to control the transfer of the teacher's knowledge. Graph-based knowledge can be regarded as a form of relation-based knowledge. Graph-based knowledge distillation is illustrated in the figure below:
[Figure: the generic framework of graph-based distillation]

1) Using graphs as the carrier of teacher knowledge

In the paper (Zhang, C. & Peng, Y. (2018). Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In IJCAI.), each vertex represents a self-supervised teacher, and two graphs, built from logits and from intermediate features respectively, are used to transfer knowledge from multiple self-supervised teachers to the student.

2) Using graphs to control knowledge transfer

       The paper (Luo, Z., Hsieh, J. T., Jiang, L., Carlos Niebles, J. & Fei-Fei, L. (2018). Graph distillation for action detection with privileged modalities. In ECCV.) incorporates privileged information from the source domain into modality distillation. A directed graph is introduced to explore the relationships between different modalities: each vertex represents a modality, and each edge represents the connection strength between one modality and another.

5.5. Attention-Based Distillation

       The attention mechanism can well reflect the activations of neurons in a neural network, so attention has been introduced into knowledge distillation to improve the performance of the student model. The core of attention-based knowledge transfer lies in defining attention maps for the feature embeddings in the layers of the network; that is, knowledge about the feature embeddings is transferred using attention-map functions.
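The sketch below shows one common attention-map function, the channel-wise sum of squared activations followed by normalization, as used in activation-based attention transfer; the exponent p = 2 and the L2 matching loss are typical but still illustrative choices.

```python
import torch
import torch.nn.functional as F

def attention_map(feature, p=2):
    """(N, C, H, W) feature map -> (N, H*W) normalized spatial attention map."""
    a = feature.abs().pow(p).sum(dim=1).flatten(1)   # aggregate over channels
    return F.normalize(a, dim=1)                     # L2-normalize per sample

def attention_loss(f_s, f_t):
    """Match the student's attention map to the teacher's at a chosen pair of layers."""
    a_s, a_t = attention_map(f_s), attention_map(f_t.detach())
    return (a_s - a_t).pow(2).mean()

# toy usage: the two feature maps must share spatial size (resize beforehand otherwise)
loss_at = attention_loss(torch.randn(8, 64, 14, 14), torch.randn(8, 256, 14, 14))
```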

5.6. Data-Free Distillation

       Data-free distillation is motivated by the need to overcome the lack of data caused by privacy, legality, security, or confidentiality concerns. "Data-free" means that no original training data are available; the data are newly generated or synthesized. The new data can be generated with a GAN, while synthetic data are usually generated from the feature representations of the pre-trained teacher model.


       Although data-free distillation shows great potential when data are unavailable, how to generate high-quality, diverse training data that improves the generalization ability of the model remains a very challenging task.
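The following is a minimal sketch of one possible data-free recipe: random inputs are optimized so that the frozen teacher classifies them confidently into sampled target classes, and the synthesized batch is then used for ordinary soft-target distillation. Practical methods add much stronger priors (for example, matching batch-normalization statistics); the shapes, step counts, and regularization weight here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def synthesize_batch(teacher, num_classes, batch_size=32, steps=200, lr=0.1):
    """Optimize noise images until the (frozen) teacher classifies them confidently."""
    x = torch.randn(batch_size, 3, 32, 32, requires_grad=True)
    targets = torch.randint(0, num_classes, (batch_size,))
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(teacher(x), targets)  # push the teacher toward the targets
        loss = loss + 1e-4 * x.pow(2).mean()         # mild prior to keep pixel values bounded
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach(), targets

def data_free_distill_step(teacher, student, opt_s, num_classes=10, T=3.0):
    x, _ = synthesize_batch(teacher, num_classes)
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=1),
                    F.softmax(teacher(x) / T, dim=1), reduction="batchmean") * T * T
    opt_s.zero_grad(); loss.backward(); opt_s.step()
```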

5.7. Quantized Distillation

       Network quantization reduces the computational complexity of a neural network by converting a high-precision network (e.g., 32-bit floating point) into a low-precision one (e.g., 2-bit or 8-bit). Meanwhile, knowledge distillation aims to train a small model whose performance is comparable to that of a complex model. Several KD methods that make use of the quantization process within the teacher-student framework have been proposed, as shown in the figure below.

[Figure: the generic framework of quantized distillation]
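To give a flavor of how quantization and distillation can be combined, the sketch below fake-quantizes the student's weights with a straight-through estimator while a full-precision teacher provides soft targets. This is a simplified illustration rather than any specific published method.

```python
import torch
import torch.nn.functional as F

def fake_quantize(w, bits=8):
    """Uniform symmetric quantization with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax + 1e-8
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward: quantized value, backward: identity

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""
    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight), self.bias)

def quantized_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits, labels)
```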

5.8. Lifelong Distillation

       Lifelong learning, which includes continual learning, continuous learning, and meta-learning, aims to learn in a way similar to humans: it accumulates previously learned knowledge and transfers it to future learning. Knowledge distillation provides an effective way to preserve and transfer the learned knowledge without catastrophic forgetting. Recently, an increasing number of KD variants based on lifelong learning have been developed.

5.9. NAS-Based Distillation

       Neural architecture search (NAS), one of the most popular topics in automated machine learning (AutoML), aims to automatically identify deep neural models and adaptively learn suitable deep neural architectures. In knowledge distillation, the success of knowledge transfer depends not only on the teacher's knowledge but also on the student's architecture. However, there may be a capacity gap between the large teacher model and the small student model, which makes it hard for the student to learn well from the teacher. To address this issue, neural architecture search has been adopted to find a suitable student architecture.

6. Performance Comparison

       To better demonstrate the effectiveness of knowledge distillation, the classification performance of some typical KD methods on two popular image classification datasets, CIFAR-10 and CIFAR-100, is summarized. Both datasets have 50,000 training images and 10,000 test images, with the same number of training and test images per class. For a fair comparison, the reported classification accuracies (%) are taken directly from the corresponding original papers, which cover different types of knowledge, distillation schemes, and teacher/student model structures. The accuracies in brackets are the classification results of the teacher and student models trained individually.

[Table: classification accuracy (%) of typical KD methods on CIFAR-10 and CIFAR-100]

       From the performance comparison in the above table, the following points can be summarized:

  • Knowledge distillation can be implemented on different deep models;
  • Model compression of different depth models can be achieved through knowledge distillation;
  • Online knowledge distillation based on collaborative learning can significantly improve the performance of deep models;
  • Self-distillation can improve the performance of deep models very well;
  • Offline and online distillation methods usually transfer feature-based knowledge and response-based knowledge respectively;
  • The performance of lightweight deep models (students) can be improved by transferring knowledge from high-capacity teacher models.

       By comparing the performance of different knowledge distillation methods, it can be concluded that knowledge distillation is an effective and efficient deep model compression technique.

7. Applications

       Knowledge distillation, as an effective deep neural network compression and acceleration technique, has been widely used in various fields of artificial intelligence, including visual recognition, speech recognition, natural language processing (NLP) and recommendation systems. Furthermore, knowledge distillation can also be used for other purposes such as data privacy and as a defense against attacks. This section briefly reviews applications of knowledge distillation.

KD in NLP

       Existing language models, such as BERT, have complex and cumbersome structures and are time- and resource-consuming. Knowledge distillation has therefore been widely studied in natural language processing, with the aim of obtaining lightweight, efficient, and effective language models. More and more KD methods have been proposed for a wide range of NLP tasks. Among these KD-based NLP methods, most belong to natural language understanding (NLU), and many of them are designed as task-specific distillation or multi-task distillation.

       Here are some summaries of knowledge distillation in natural language processing.

  • Knowledge distillation provides efficient and effective lightweight deep language models. A high-capacity teacher model can transfer the rich knowledge contained in a large amount of diverse language data to a small-capacity student model, enabling the student to complete many language tasks quickly and effectively.
  • Since the knowledge in multilingual models can be transferred and shared across languages, teacher-student knowledge transfer can easily and effectively handle many multilingual tasks.
  • In deep language models, sequence knowledge can be effectively transferred from large networks to small ones.

8. Conclusion and Discussion

       This paper provides an overview of knowledge distillation from the perspectives of knowledge, distillation schemes, teacher-student architectures, distillation algorithms, performance comparison, and applications. Below, the main challenges of knowledge distillation are discussed and some insights into future research are offered.

8.1. Challenges

  • Assessing the importance of different knowledge sources, how they should be integrated, and how to model different types of knowledge within a unified and complementary framework remains a challenge;
  • To improve the effectiveness of knowledge transfer, the relationship between model complexity and existing distillation schemes (or other novel distillation schemes) needs further study;
  • How to design an effective student model, or how to construct a suitable teacher model, is still a challenging problem in knowledge distillation.

       A further challenge is the interpretability of knowledge distillation. Despite the abundance of knowledge distillation methods and applications, the understanding of knowledge distillation, including theoretical explanations and empirical evaluations, remains insufficient. Assuming linearized teacher and student models makes it possible to study theoretical explanations of the student's learning characteristics under distillation; however, it is still very difficult to deeply understand the generalization of knowledge distillation, especially how to measure the quality of the knowledge or the quality of the teacher-student architecture.

8.2. Future Directions

       To improve the performance of knowledge distillation, the most important factors are: what kind of teacher-student network architecture to use, what kind of knowledge to learn from the teacher network, and where to distill that knowledge into the student network.

  • Among existing knowledge distillation methods, few works discuss how to combine knowledge distillation with other model compression methods;
  • Beyond model compression for accelerating deep neural networks, knowledge distillation can also be used for other problems thanks to the natural knowledge transfer of the teacher-student architecture, for example data privacy, data augmentation, adversarial training, and multimodal learning.

Origin blog.csdn.net/Roaddd/article/details/129201010