[Neural Network] 2021-IJCAI-Learning from Concepts: Toward Pure Memory of Few-shot Learning

Learning from concepts: Towards pure memory with few-shot learning

Paper address

Abstract

Humans have strong generalization abilities and can recognize a new category after seeing only a few examples, because humans can draw on concepts that already exist in their minds. However, many existing few-shot methods fail to address the fundamental question of how to leverage previously learned knowledge to improve predictions on new tasks. In this paper, we propose a novel purified memory mechanism that simulates the human recognition process. This new memory update scheme enables the model to purify information from semantic labels and gradually learn consistent, stable, and expressive concepts as it is trained episode by episode. On this basis, a Graph Augmentation Module (GAM) built on a graph neural network is introduced to aggregate these concepts and the knowledge learned from new tasks, making predictions more accurate. Overall, our approach is model-agnostic, computationally efficient, and has negligible memory cost. Extensive experiments on several benchmarks show that the proposed method consistently outperforms a large number of state-of-the-art few-shot learning methods.

1 Introduction

The success of deep learning comes from large amounts of labeled data [Noh et al, 2017; Bertinetto et al, 2016; Long et al, 2015], while humans generalize well after seeing only a few examples. The gap between these two facts has drawn great attention to research on few-shot learning [Vinyals et al, 2016; Finn et al, 2017; Sung et al, 2018]. Unlike traditional deep learning scenarios, the goal of few-shot learning is not to classify unseen samples but to quickly adapt meta-knowledge to new tasks, given only a little labeled data and the knowledge gained from previous experience.

Recently, significant progress has been made on this problem by combining the idea of meta-learning with episodic training [Vinyals et al, 2016; Finn et al, 2017; Snell et al, 2017; Sung et al, 2018]. The intuition is to use an episodic sampling strategy, a promising trend that transfers knowledge from base categories (i.e., known categories with enough training examples) to novel categories (i.e., new categories with only a few examples), simulating the human learning process. In this framework, metric-based methods [Vinyals et al, 2016; Snell et al, 2017] and graph-based methods [Garcia and Bruna, 2017; Liu et al, 2018; Kim et al, 2019; Yang et al, 2020] are two representative families that mainly exploit transferable meta-knowledge. Graph-based methods, which extend pairwise query-support relationships to graph structures, often outperform metric-based methods thanks to their ability to learn effectively from graph data.

Although graph-based methods are effective [Kim et al, 2019; Yang et al, 2020], most of them ignore a key issue: how can the knowledge learned in the past be made useful for new tasks as episodes are trained one after another? Intuitively, for an unseen task, humans do not draw on all of their knowledge but use a few information-rich, related concepts to improve predictions on the new task. For example, a person who already understands the concepts of horse, tiger, and panda can easily recognize a zebra, since a zebra has the outline of a horse, the stripes of a tiger, and the black-and-white coloring of a panda. Inspired by this simple intuition, we hypothesize that few-shot learning models should explicitly establish relationships between episodes and fully exploit previously learned knowledge.

However, this raises two fundamental issues that hinder existing graph-based methods: 1) how to learn stable and consistent concepts when episodes arrive rapidly; 2) how learned concepts can further aid prediction when adapting to new tasks. In this paper, we propose a purified memory framework to address these two issues. Our basic idea is simple: simulate the human recognition process. To maintain stable and consistent concepts, we keep a memory bank during episodic training that learns the best prototype representation for each category from the perspective of the information bottleneck principle [Tishby and Zaslavsky, 2015]. By gradually purifying information from semantic labels, the stored knowledge becomes broadly expressive, consistent, and stable.

To make full use of the purified memory, we propose the Graph Augmentation Module (GAM) to mine meta-knowledge and establish correlations between different episodes. When processing a new task, GAM first retrieves the k-nearest-neighbor concepts, taking the class centers of the current task as queries. The retrieved concepts and the episode's training samples are then fed into a graph neural network (GNN) with an adaptive weighting scheme. Concepts learned in the past and knowledge learned from the new task are thereby aggregated, which enables our model to make accurate predictions. Notably, our method is model-agnostic and can be flexibly integrated into any advanced GNN method with negligible computational cost.

Our main contributions are threefold: (1) we propose a new memory purification mechanism that is efficient, consistent, and expressive; (2) the proposed GAM is able to mine meta-knowledge and capture the correlations between different episodes; (3) our approach produces state-of-the-art few-shot results, and our findings highlight the need to rethink the way meta-knowledge is used.

2. Method

This paper aims to solve the few-shot classification problem. The problem definition differs fundamentally from traditional classification in that our goal is not to classify unseen samples but to quickly adapt meta-knowledge to new tasks. Specifically, we are given a labeled dataset with sufficient training samples for the base classes $C^{base}$, and the goal is to learn concepts for a set of novel classes $C^{novel}$ from very limited data, where $C^{base}\cap C^{novel}=\emptyset$. An effective way to solve the few-shot problem is the episodic sampling strategy. In this framework, the units of meta-training and meta-testing are not individual samples but episodes $\left\{\mathcal{T}\right\}$, each containing $N$ classes (ways) and $K$ samples (shots) per class. In particular, for an $N$-way $K$-shot task, the support set $S=\left\{\left(x_i,\ y_i\right)\right\}_{i=1}^{N\times K}$ and the query set $Q=\left\{\left(x_i,\ y_i\right)\right\}_{i=N\times K+1}^{N\times K+T}$ are sampled. Here, $x_i$ is the $i$-th input and $y_i\in\left\{C_1,\ ...,\ C_N\right\}$ is its label; in meta-training, episodes are sampled from $C^{base}$. In meta-testing, test episodes of the same size are drawn from $C^{novel}$. The goal is to correctly classify the $T$ unlabeled samples in the query set into the $N$ categories.
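To make the episodic protocol concrete, the following minimal sketch samples one $N$-way $K$-shot episode from a labeled pool (our own illustration; the function and variable names are assumptions, not the paper's code):

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_per_class=1):
    """Sample one N-way K-shot episode (support set S and query set Q).

    `dataset` is assumed to be a list of (image, class_label) pairs; labels
    are re-indexed to 0..n_way-1 within the episode. Illustrative sketch only.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    classes = random.sample(sorted(by_class), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):
        picked = random.sample(by_class[c], k_shot + q_per_class)
        support += [(x, new_label) for x in picked[:k_shot]]  # K shots per class
        query += [(x, new_label) for x in picked[k_shot:]]    # T = n_way * q_per_class queries
    return support, query
```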

2.1. Framework overview

The framework of the proposed method is shown in Figure 1. It mainly consists of three parts: an encoder for discriminative feature extraction, a memory module for storing expressive meta-knowledge, and a graph augmentation module for comprehensive reasoning. Overall, our method can be summarized in three stages (pre-training, meta-training, and meta-testing).

Figure 1

Figure 1: Flowchart of the proposed method. Let's take the 2-way 2-shot setup as an example.

Stage 1: pre-training. We follow a simple baseline [Chen et al, 2020]: a supervised representation is learned on the meta-training set $C^{base}$, and a linear classifier is trained on top of this representation. It has been shown that this pre-training stage is beneficial for downstream few-shot tasks [Tian et al, 2020]. The trained feature extractor (e.g., ResNet-12 [He et al, 2016]) and classifier are then used to initialize the encoder and the memory bank, respectively.

Stage 2: meta-training. We first extract the features of the support and query samples as task-related embeddings $V^t$. To facilitate rapid adaptation, our approach maintains a memory bank that stores an expressive representation of the support set. The memory bank is optimized with a new update scheme that gradually purifies discriminative information (introduced in Section 2.2). Furthermore, the purified memory is combined with a graph augmentation module for robust prediction (introduced in Section 2.3). In this module, we mine relevant prototypes $V^m$, called meta-knowledge in this paper, and propagate the similarity between $V^t$ and $V^m$. As a result, our model can easily generalize to new tasks with negligible memory cost.

Stage 3: meta-testing. The meta-testing process is similar to meta-training and also uses the episodic sampling strategy. Unlike Stage 2, however, the memory bank and the other modules are not updated during the entire process; in other words, the update switch shown in Figure 1 is turned off.

2.2. Purified memory update

Meta-knowledge plays an important role in learning new concepts from unseen samples, and recent FSL advances [Ramalho and Garnelo, 2019] often utilize memory mechanisms to store this meta-knowledge. In the typical setting, the memory tries to retain as much information as possible (e.g., by storing all features). However, we argue that this strategy is both ineffective and inefficient. In the context of FSL, episodic sampling allows the feature extractor to quickly learn new concepts from few samples, which means that the features stored in memory were computed when the feature extractor was in a very different task context. From this perspective, representations learned from different tasks require a purification process before they can become stable concepts.

To alleviate the above issues, we propose to optimize the memory by learning the best prototype for each category. Specifically, consider an $N$-way $K$-shot task in FSL. We use $f_{sup}^l\in\mathbb{R}^{\left[N\times K,\ d\right]}$ to denote the feature representation of the support set in the $l$-th episode and $\mathbb{M}\in\mathbb{R}^{[C,\ d]}$ to denote the memory bank, where $C$ and $d$ are the total number of categories and the dimension of each prototype, respectively.

To gradually purify semantic information from the labels, we first perform class-wise averaging on $f_{sup}^l$ to obtain the centroids $f_{cen}^l\in\mathbb{R}^{[c,\ d]}$, and then concatenate each centroid with the prototype $f_p^l\in\mathbb{R}^{[c,\ d]}$ (stored in memory) of the same category. The concatenation $f_{cat}^l\in\mathbb{R}^{[c,\ 2\times d]}$ is forwarded to a fully connected layer for dimensionality reduction, and the output is used to refine the memory. Here we propose to use the information bottleneck (IB) principle to refine the concepts. The following constraint ensures that IB works as intended, i.e., retaining semantic label information while discarding task-irrelevant nuisances:

Formula (1):

$$\max_{f_p^l}\ \ I\left(f_p^l;\ Y\right)-\beta\, I\left(f_{cat}^l;\ f_p^l\right)$$

where $I\left(\cdot;\ \cdot\right)$ denotes mutual information, $Y$ denotes the label, and $\beta$ is the Lagrangian coefficient.

Specifically, Formula (1) aims to learn the maximum amount of information about the target $Y$ while maximally compressing $f_{cat}^l$ into the prototype $f_p^l$. However, Formula (1) requires estimating mutual information, which is intractable in such a high-dimensional space. Fortunately, since our goal is to purify the concepts, we show that a self-knowledge distillation loss is strictly consistent with Formula (1). The mathematical derivation is given in the supplementary material.

In practice, the following constraint is enforced to purify discriminative information and further refine the memory:

Formula (2):

$$\min_{\theta,\ \varphi}\ \ D_{KL}\left[\,p\left(y\mid f_{cat}^l\right)\,\big\|\,p\left(y\mid f_p^l\right)\right]$$

Here, $\theta$ and $\varphi$ denote the parameters of the encoder and the FC layer, $D_{KL}[\cdot\|\cdot]$ denotes the KL divergence, and $y$ denotes the label. Note that $p\left(y|f_{cat}^l\right)$ and $p\left(y|f_p^l\right)$ are conditional distributions; in practice they are the outputs of extra linear layers (see the supplementary material for details).
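As a rough illustration, the constraint described above can be implemented as a KL divergence between the outputs of two auxiliary linear classifiers. This is a sketch under the assumption that clf_cat and clf_proto are those extra linear heads; the direction of the KL term is our reading of the text, not the official implementation:

```python
import torch.nn.functional as F

def purification_loss(f_cat, f_proto, clf_cat, clf_proto):
    """KL constraint between p(y | f_cat^l) and p(y | f_p^l).

    f_cat:   [N, d] reduced concatenated features (output of the FC layer)
    f_proto: [N, d] prototypes read from the memory bank
    clf_cat / clf_proto: assumed auxiliary nn.Linear classification heads
    """
    p_cat = F.softmax(clf_cat(f_cat), dim=-1)                # p(y | f_cat^l)
    log_p_proto = F.log_softmax(clf_proto(f_proto), dim=-1)  # log p(y | f_p^l)
    # D_KL[ p(y|f_cat^l) || p(y|f_p^l) ], averaged over the N class centroids
    return F.kl_div(log_p_proto, p_cat, reduction="batchmean")
```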

The refinement of $\mathbb{M}$ essentially aggregates discriminative information iteratively and dilutes task-irrelevant interference. A naive solution would be to append the output of IB to $\mathbb{M}$ at every episode, but this incurs huge space and time costs and yields poor performance (see Section 3.4). Instead, we propose to refine the memory bank through momentum updates. Formally, $\mathbb{M}$ is updated via:

Formula (3):

$$\mathbb{M}[i]\ \leftarrow\ \lambda\,\mathbb{M}[i]+\left(1-\lambda\right)f_B^l$$

where $\lambda\in[0,\ 1)$ is a momentum coefficient, $f_B^l\in\mathbb{R}^d$ is the output of IB for a given class in the current episode, and $\mathbb{M}[i]$ is that class's prototype in the memory bank.
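A minimal sketch of this momentum refinement (assuming the memory rows of the episode's classes are blended with the corresponding IB outputs, in the spirit of Formula (3)):

```python
import torch

@torch.no_grad()
def update_memory(memory, class_ids, f_b, lam=0.99):
    """Momentum update of the memory bank M (shape [C, d]).

    memory:    [C, d] prototype memory bank
    class_ids: global class indices of the N classes in the current episode
    f_b:       [N, d] IB outputs f_B^l for those classes
    lam:       momentum coefficient; 0.99 is an illustrative value, the
               paper's exact lambda is not given in this summary.
    """
    memory[class_ids] = lam * memory[class_ids] + (1.0 - lam) * f_b
    return memory
```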

In this way, the memory becomes broadly expressive, consistent, and more efficient. The refined prototype representations are further combined with meta-knowledge mining and aggregation to facilitate reasoning in FSL, as described below.

2.3. Graph augmentation module

For an unseen task, humans do not use all of their knowledge, but use a few information-rich, related concepts to abstract the new task. Inspired by this, we propose a meta-knowledge mining method to simulate this behavior. The core idea is to aggregate similar features, rather than the entire memory bank, to help our model learn new concepts for unseen tasks. In particular, we use the Graph Augmentation Module (GAM) to capture the relationship between the specific task context and related concepts. Their similarities are then propagated through a graph neural network [Kim et al, 2019], where each layer performs node-feature and edge-feature updates to enable fast and comprehensive inference.

Meta-knowledge mining. For each class centroid $f_{cen}^l[i]$ in the $l$-th episode, we first compute the cosine similarity between $f_{cen}^l[i]$ and each prototype in the memory $\mathbb{M}$. We then select the $k$ nearest neighbors of $f_{cen}^l[i]$, denoted as $MK=\left\{m_1,\ m_2,\ ...,\ m_k\right\}$. To perform aggregation, we compute attention coefficients between the centroid $f_{cen}^l[i]$ and each selected embedding $m_j$:

Formula (4)

where $\left\langle\cdot,\ \cdot\right\rangle$ denotes the cosine similarity between two vectors and $\tau$ is a scalar parameter. Finally, the meta-knowledge node of each class is computed as:

Formula (5)

where $[\cdot;\ \cdot]$ is the concatenation operation and $f_{agg}\left(\cdot;\ \theta_{agg}\right)$, a fully connected layer with parameter set $\theta_{agg}$, transforms the concatenated features: $\mathbb{R}^{2d}\rightarrow\mathbb{R}^d$.
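The meta-knowledge mining step can be sketched as follows, assuming the attention coefficients of Formula (4) are a temperature-scaled softmax over cosine similarities and that Formula (5) concatenates each centroid with its attention-weighted neighbor sum before the fully connected layer $f_{agg}$ (our reading of the text, not the official code):

```python
import torch
import torch.nn.functional as F

def mine_meta_knowledge(centroids, memory, f_agg, k=6, tau=0.1):
    """Build one meta-knowledge node per class centroid.

    centroids: [N, d] class centers f_cen^l of the current episode
    memory:    [C, d] purified prototype memory bank M
    f_agg:     assumed nn.Linear(2 * d, d) aggregation layer
    k:         number of retrieved neighbors (k = 6 is the value recommended
               by the ablation study); tau is an illustrative temperature.
    Returns meta-knowledge nodes V^m of shape [N, d].
    """
    sim = F.normalize(centroids, dim=-1) @ F.normalize(memory, dim=-1).t()  # cosine similarities [N, C]
    topk_sim, topk_idx = sim.topk(k, dim=-1)                                # k nearest prototypes per centroid
    attn = F.softmax(topk_sim / tau, dim=-1)                                # assumed form of Formula (4)
    neighbors = memory[topk_idx]                                            # [N, k, d]
    weighted = (attn.unsqueeze(-1) * neighbors).sum(dim=1)                  # attention-weighted aggregation
    return f_agg(torch.cat([centroids, weighted], dim=-1))                  # assumed form of Formula (5)
```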

Augmented graph initialization. For an $N$-way $K$-shot task, given the features extracted from the encoder and the mined meta-knowledge, we construct a fully connected graph $G=\left(V,\ E\right)$, where $V=\left\{v_i^t\right\}_{i=1}^{N\times K+T}\cup\left\{v_i^m\right\}_{i=1}^{N}=\left\{v_i\right\}_{i=1}^{N\times\left(K+1\right)+T}$ and $E=\left\{e_{ij}\right\}_{i,\ j=1,...,\ |V|}$ denote the node set and edge set, respectively. The nodes are of two types: task-related nodes $V^t$ and meta-knowledge nodes $V^m$. An edge represents the similarity between two nodes and is initialized as:

Formula (6)

where $\hat{S}=S\cup V^m$ denotes the union of the support set and the augmented meta-knowledge. In this way, meta-knowledge is injected into the existing reasoning task and allows the model to adapt to new tasks by leveraging learned concepts.
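Formula (6) is not reproduced here, but a common EGNN-style initialization consistent with this description would assign 1 to same-class pairs within $\hat{S}$, 0 to different-class pairs within $\hat{S}$, and 0.5 to any pair involving an unlabeled query node; the sketch below assumes this convention:

```python
import torch

def init_edges(labels, known_mask):
    """Initialize edge features for the augmented graph G = (V, E).

    labels:     [|V|] episode class indices of all nodes
    known_mask: [|V|] bool, True for nodes in S-hat (support and
                meta-knowledge nodes), False for query nodes.
    Returns an edge matrix of shape [|V|, |V|] (assumed EGNN-style values).
    """
    n = labels.size(0)
    edges = torch.full((n, n), 0.5)                               # pairs with an unknown label
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # 1 if same class else 0
    both_known = known_mask.unsqueeze(0) & known_mask.unsqueeze(1)
    edges[both_known] = same[both_known]                          # labeled pairs get 0 or 1
    return edges
```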

Node feature update. Given the node features $v_i^{\ell-1}$ and edge features $\mathbf{e}_{ij}^{\ell-1}$ from layer $\ell-1$, the node feature $v_i^\ell$ at layer $\ell$ is updated through a neighborhood aggregation process. The aggregation is weighted by the edge similarity between two neighbors, and a feature transformation is applied to normalize the features. Mathematically, the node feature update is defined as:

Formula (7)

where $[\cdot;\ \cdot]$ is the concatenation operation and $f_{node}\left(\cdot;\ \theta_{node}\right)$ is a transformation block consisting of two convolutional layers [Glorot et al, 2011; Ioffe and Szegedy, 2015], a LeakyReLU activation, and a dropout layer.

Edge feature update. The edge feature update is based on the newly updated node features $v_i^\ell$. The similarity between each pair of nodes is recomputed, and each edge feature is updated by combining the previous edge feature $\mathbf{e}_{ij}^{\ell-1}$ with the newly computed similarity:

Formula (8)

where $f_{edge}\left(\cdot;\ \theta_{edge}\right)$ is a metric network parameterized by $\theta_{edge}$, which consists of four convolutional blocks, a batch normalization layer, a LeakyReLU activation, and a dropout layer. Notably, our GAM can be combined with any other GNN and significantly improves their performance.
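One plausible implementation of a single GAM layer is sketched below: the node update performs edge-weighted neighborhood aggregation followed by a transformation (in the spirit of Formula (7)), and the edge update recomputes pairwise similarities and blends them with the previous edge features (in the spirit of Formula (8)). The exact update forms are assumptions, and the paper uses convolutional blocks where this sketch uses linear layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMLayer(nn.Module):
    """Illustrative node/edge update layer (assumed forms, not the official code)."""

    def __init__(self, dim):
        super().__init__()
        self.f_node = nn.Sequential(nn.Linear(2 * dim, dim), nn.LeakyReLU(), nn.Dropout(0.1))
        self.f_edge = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, v, e):
        # v: [|V|, d] node features, e: [|V|, |V|] edge features
        agg = F.normalize(e, p=1, dim=-1) @ v                    # edge-weighted neighborhood aggregation
        v_new = self.f_node(torch.cat([v, agg], dim=-1))         # transform [v_i ; aggregated neighbors]
        diff = (v_new.unsqueeze(1) - v_new.unsqueeze(0)).abs()   # pairwise |v_i - v_j|, shape [|V|, |V|, d]
        sim = self.f_edge(diff).squeeze(-1)                      # recomputed pairwise similarity
        e_new = 0.5 * (e + sim)                                  # blend with previous edge features (assumed)
        return v_new, e_new
```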

2.4. Prediction and optimization

Once the optimization is complete, the predicted probability that node $v_i$ belongs to class $C_k$ can be expressed as:

Formula (9):

$$p\left(y_i=C_k\right)\ \propto\ \sum_{v_j\in\hat{S}}e_{ij}\,\delta\left(y_j=C_k\right)$$

where $\delta\left(y_j=C_k\right)$ is the Kronecker delta function, which equals 1 when $y_j=C_k$ and 0 otherwise, and $e_{ij}$ denotes the edge feature between the two nodes $v_i$ and $v_j$. This probability is then normalized with a softmax layer.
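In code, this prediction rule (summing final-layer edge weights toward the labeled nodes of each class and normalizing with a softmax) might look like the following sketch:

```python
import torch
import torch.nn.functional as F

def predict(edges, labels, known_mask, n_way):
    """Class probabilities for every node from the final-layer edge features.

    edges:      [|V|, |V|] final edge features e_ij
    labels:     [|V|] class indices (only meaningful where known_mask is True)
    known_mask: [|V|] bool, True for support / meta-knowledge nodes
    """
    scores = torch.zeros(edges.size(0), n_way)
    for k in range(n_way):
        cols = known_mask & (labels == k)             # delta(y_j = C_k)
        scores[:, k] = edges[:, cols].sum(dim=-1)     # sum_j e_ij * delta(y_j = C_k)
    return F.softmax(scores, dim=-1)                  # softmax normalization
```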

During the meta-training phase, our model is optimized by minimizing a binary cross-entropy (BCE) loss:

Formula (10)

where $e_i$ and ${\hat{y}}_i^\ell$ are the ground-truth edge label of the $i$-th query node and its edge prediction at layer $\ell$, respectively, and $\lambda_\ell$ is the coefficient of the $\ell$-th layer. To make the meta-knowledge nodes consistent with the predicted labels, we also introduce another binary cross-entropy loss $\mathcal{L}_m$ that measures the difference between the ground truth and the predicted edge labels of the meta-knowledge nodes.

Finally, the total loss $\mathcal{L}$ is defined as:

Formula (11):

$$\mathcal{L}=\mathcal{L}_q+\alpha\,\mathcal{L}_m$$

where $\alpha$ is a coefficient balancing $\mathcal{L}_q$ and $\mathcal{L}_m$. In our experiments, we fix $\alpha=0.2$ and $\beta=0.01$.
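Putting the objective together, a hedged sketch of the total loss (the per-layer coefficients and the exact BCE targets are simplified here; alpha = 0.2 follows the text above):

```python
import torch.nn.functional as F

def total_loss(edge_preds, edge_targets, meta_preds, meta_targets,
               layer_weights, alpha=0.2):
    """L = L_q + alpha * L_m, with L_q accumulated over GNN layers.

    edge_preds:    list of per-layer query-edge predictions in [0, 1]
    edge_targets:  ground-truth query-edge labels (same shape as each prediction)
    meta_preds / meta_targets: edge predictions / labels for meta-knowledge nodes
    layer_weights: per-layer coefficients lambda_l (illustrative values).
    """
    l_q = sum(w * F.binary_cross_entropy(p, edge_targets)
              for w, p in zip(layer_weights, edge_preds))
    l_m = F.binary_cross_entropy(meta_preds, meta_targets)
    return l_q + alpha * l_m
```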

3. Experiment

3.1. Experimental setup

Datasets. We evaluate our method on four few-shot learning benchmarks, following [Yang et al, 2020]: miniImageNet [Vinyals et al, 2016], tieredImageNet [Ren et al, 2018], CUB-200-2011 [Wah et al, 2011], and CIFAR-FS [Bertinetto et al, 2018]. miniImageNet and tieredImageNet are collected from ImageNet, and CIFAR-FS is a subset of CIFAR-100. Unlike these datasets, CUB-200-2011 is a fine-grained bird classification dataset.

Evaluation. All results are obtained under the standard few-shot classification protocol: 5-way 1-shot and 5-way 5-shot tasks. In both the 1-shot and 5-shot settings, only 1 query sample per class is used to test accuracy. We report the average accuracy (%) over 10K randomly generated episodes on the test set, along with the 95% confidence interval. Note that all hyperparameters are determined on the validation set.
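For reference, the mean accuracy and 95% confidence interval over the sampled test episodes are typically computed as follows (a standard evaluation utility, not taken from the paper):

```python
import numpy as np

def summarize(episode_accuracies):
    """Mean accuracy (%) and 95% confidence interval over test episodes."""
    acc = np.asarray(episode_accuracies, dtype=np.float64)
    mean = acc.mean() * 100.0
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc)) * 100.0
    return mean, ci95

# e.g. mean, ci = summarize(accuracies_over_10k_episodes)
```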

3.2. Implementation details

Network architecture. We use two networks as our encoder backbone (ConvNet and ResNet-12 [Kim et al, 2019; Lee et al, 2019]). ConvNet contains four blocks, each consisting of a 3x3 convolutional layer, a batch normalization layer, and a LeakyReLU activation. Similarly, ResNet-12 consists of four residual blocks; see [He et al, 2016] for details. The backbone is followed by a global average pooling layer and a fully connected layer that produce 128-dimensional instance embeddings.

Training. In the pre-training stage, following previous work [Chen et al, 2020], the baseline is trained from scratch with a batch size of 128 by minimizing the standard cross-entropy loss on the base classes. Afterwards, in the meta-training stage we randomly sample 40 episodes per iteration to train the ConvNet model. The sampling strategy for ResNet-12 is slightly different: for 5-way 5-shot tasks we only sample 20 episodes per iteration due to memory cost. The Adam optimizer is used in all experiments with an initial learning rate of $10^{-3}$. We decay the learning rate by a factor of 0.1 every 8,000 iterations and set the weight decay to $10^{-5}$. We train for a total of 50,000 iterations, and the encoder is frozen for the first 25,000 iterations.
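A sketch of the described optimization setup (Adam, initial learning rate $10^{-3}$, weight decay $10^{-5}$, learning rate decayed by 0.1 every 8,000 iterations); the model below is only a placeholder for the encoder plus GAM parameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # placeholder for the encoder + GAM being meta-trained

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000, gamma=0.1)

for iteration in range(50000):
    optimizer.zero_grad()
    # ... sample episodes, run the model, and compute the total loss L here ...
    loss = model(torch.randn(1, 128)).sum()  # dummy loss standing in for L
    loss.backward()
    optimizer.step()
    scheduler.step()
```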

3.3. Main results

In this section, we demonstrate the effectiveness of our approach relative to state-of-the-art methods. For a fair comparison, we adopt two representative few-shot graph neural networks, EGNN and DPGN, as our GAM module. Furthermore, using the two backbones ConvNet and ResNet-12, we report performance on all benchmark datasets in both the 5-way 1-shot and 5-way 5-shot settings for a comprehensive evaluation.

Results on general object recognition. For general object classification, we evaluate our method on miniImageNet, tieredImageNet, and CIFAR-FS and report the results in Table 1. The main observations are as follows: 1) The proposed method outperforms all competitors, demonstrating its effectiveness; moreover, the performance obtained with ResNet-12 is better than with ConvNet due to its stronger representation capacity. 2) Regardless of which graph neural network is used, the proposed method significantly outperforms the baseline with a clear margin. 3) In both the 1-shot and 5-shot settings, our method consistently achieves the best performance. Thanks to memory purification, the improvement is more significant in the 1-shot setting; our method therefore appears to be more effective when facing new tasks with fewer samples.

Results on fine-grained classification. For the fine-grained bird classification problem, Table 1 reports the results on CUB-200-2011. Our method also significantly outperforms the other competitors. Note that on this dataset, the choice of graph neural network and backbone has less impact on performance.

Discussion. Since the proposed method is built on the GNN framework, it can be flexibly integrated into any advanced GNN method. Our results show that the performance of GNNs is significantly improved by the purified memory and the GAM module.

Table 1

Table 1: Few-shot classification accuracy on four few-shot learning benchmarks. "+" indicates results we re-implemented using the official code. Red indicates the best performance, blue the second best. Bold font indicates our results.

3.4. Ablation studies

We provide experiments to confirm our main claims: 1) purified memory promotes rapid adaptation; 2) meta-knowledge and GAM boost existing GNN models. All experiments are conducted on tieredImageNet in the 5-way 1-shot setting with ResNet-12. Quantitative results for the 5-shot setting are given in the supplementary material.

Effect of purified memory. We compare four different memory banks, with results shown in Figure 2. Note that without memory, the baseline degenerates to EGNN. We can draw the following conclusions: 1) GAM improves the performance of GNN models even without the help of memory, e.g., when combined only with the class centers of the current episode. 2) The three different memory banks significantly outperform the memory-free baseline, showing the importance of meta-knowledge. 3) Prototype-based memory is more efficient, confirming our hypothesis that storing entire features is a suboptimal solution. 4) The experimental results support our motivation of learning the best prototype representation for each category. Meanwhile, the memory cost of the proposed method remains at the same level as the baseline.

Figure 2

Figure 2: Ablation study with different memory mechanisms. "B": our baseline (EGNN); "Non-Mem": meta-knowledge nodes implemented with the class centers of the current episode; "Naive-Mem": memory that stores all features; "PB-Mem": prototype-based memory.

Effect of GAM. To demonstrate the effectiveness of the graph augmentation module, we first visualize the embedding space in Figure 3. In particular, we randomly select 5 classes, each containing 200 samples, from tieredImageNet. We project the features trained by EGNN and by EGNN equipped with GAM onto a two-dimensional plane using t-SNE. The results show that the embedding space of EGNN is mixed, so the discriminative ability of the learned model is naturally limited. In contrast, our model distinguishes different classes with larger inter-class margins, which yields substantial improvements. This shows that, with the help of purified meta-knowledge, discriminative information can be further highlighted by GAM.

Furthermore, to visualize how meta-knowledge helps the prediction process, we select a test episode in which the ground-truth categories of the five query images do not overlap (i.e., 5-way 1-shot) and visualize the instance-level similarity, as shown in Figure 5. Specifically, we choose two instance-level similarity matrices to demonstrate the effectiveness of our method. Notably, the heatmaps show that, compared to EGNN, GAM improves the instance-level similarity matrix over successive layers and makes correct predictions for all five query samples in the last layer. We can also see that this improvement comes from the added meta-knowledge nodes. Because of the purified concepts, the heatmaps are inherently clean, so meta-knowledge provides a second source of strong supervision. These similarities are then further propagated through the graph neural network, enabling the model to exploit the concepts in memory together with the knowledge learned from the new task. This experimental result convincingly supports our hypothesis.

Influence of k-nearest neighbors. In the meta-knowledge mining stage, we retrieve the k most similar samples from memory to augment the graph. Here we discuss the influence of varying k. As shown in Figure 4, the few-shot recognition performance keeps improving as k increases; when k grows beyond a certain value, the accuracy starts to decrease on both datasets. As a rule of thumb, we therefore recommend setting k to 6.

Figure 3

Figure 3: t-SNE visualization results obtained from our method and EGNN. Different colors represent different categories.

Figure 4

Figure 4: Performance impact of k-nearest neighbors.

Figure 5

Figure 5: Visualization of edge predictions at each layer of our method. The subplots from left to right show the predictions of the graph neural network from layer 1 to layer 3. Darker colors represent higher scores; lighter colors represent lower confidence. The vertical axis indexes the 5 query images, and the horizontal axis indexes the 5 support classes or our meta-knowledge nodes.

4 Conclusion

In this work, we propose a new few-shot learning memory update scheme that gradually purifies semantic label information from an information-theoretic perspective. The purified memory is broadly expressive, consistent, and efficient, and naturally works together with the graph augmentation module. GAM further exploits meta-knowledge and the knowledge learned from new tasks to make accurate predictions. The solution is a model-agnostic module that can be flexibly integrated into any advanced GNN method.

References

[Bertinetto et al, 2016] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850– 865. Springer, 2016.
[Bertinetto et al, 2018] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[Chen et al, 2020] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
[Finn et al, 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126– 1135. JMLR. org, 2017.
[Garcia and Bruna, 2017] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[Glorot et al, 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
[He et al, 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[Kim et al, 2019] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D Yoo. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2019.
[Lee et al, 2019] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
[Liu et al, 2018] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002, 2018.
[Liu et al, 2020] Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. arXiv preprint arXiv:2003.12060, 2020.
[Long et al, 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431– 3440, 2015.
[Noh et al, 2017] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE international conference on computer vision, pages 3456–3465, 2017.
[Ramalho and Garnelo, 2019] Tiago Ramalho and Marta Garnelo. Adaptive posterior learning: few-shot learning with a surprise-based memory module. arXiv preprint arXiv:1902.02527, 2019.
[Ren et al, 2018] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
[Snell et al, 2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pages 4077–4087, 2017.
[Sung et al, 2018] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[Tian et al, 2020] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
[Tishby and Zaslavsky, 2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
[Vinyals et al, 2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
[Wah et al, 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[Wang et al, 2020] Zeyuan Wang, Yifan Zhao, Jia Li, and Yonghong Tian. Cooperative bi-path metric for few-shot learning. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1524–1532, 2020.
[Yang et al, 2020] Ling Yang, Liangliang Li, Zilun Zhang, Xinyu Zhou, Erjin Zhou, and Yu Liu. Dpgn: Distribution propagation graph network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13390–13399, 2020.
