[Paper Reading] Learning without Memorizing

Paper address: https://link.springer.com/chapter/10.1007/978-3-319-46493-0_37
Code: https://github.com/stony-hub/learning_without_memorizing
Published in: CVPR 19

Abstract

Incremental learning (IL) is an important task that aims to increase the capability of a trained model, i.e., the number of classes the model can recognize. The key problem in this task is the need to store data (e.g., images) associated with existing classes while teaching the classifier to learn new classes. However, this is impractical because it increases the memory requirement at every incremental step, which makes it impossible to implement IL algorithms on edge devices with limited memory. Therefore, we propose a novel approach, called "Learning without Memorizing (LwM)", to preserve the information of existing (base) classes without storing any of their data, while making the classifier learn new classes step by step. In LwM, we present an information-preserving penalty, the attention distillation loss ($L_{AD}$), and demonstrate that penalizing changes in the classifier's attention maps as new classes are added helps preserve information about the base classes. We show that adding $L_{AD}$ to the distillation loss, an existing information-preserving loss, consistently outperforms the state of the art on the iILSVRC-small and iCIFAR-100 datasets in terms of the overall accuracy of base and incrementally learned classes.

Method

The highlight of this paper is that it achieves good performance without relying on an exemplar set. Note, however, that the "surpassing SOTA" claimed in the paper only means surpassing exemplar-based baselines such as iCaRL (of course, class-incremental learning was not yet widely studied at the time). As for motivation, the paper argues from the perspective of attention maps: as new classes are added, the attention maps predicted for the old classes drift, which leads to the drop in performance.

[Figure: attention maps for old classes drifting as new classes are added]

An interesting point is that the traditional distillation loss cannot capture this attention-map drift. To address it, the paper designs an attention distillation loss, built on attention maps of the kind sketched below.
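The attention maps in question are class-specific saliency maps obtained from the classifier itself (the paper uses Grad-CAM). Below is a minimal, self-contained sketch of how such a map can be produced in PyTorch; the function name, the hook-based implementation, and the `feature_layer` argument are illustrative assumptions, not the official code.

```python
import torch
import torch.nn.functional as F

def gradcam_attention_map(model, image, target_class, feature_layer):
    """Vectorized Grad-CAM attention map for a single image (illustrative)."""
    activations, gradients = [], []
    # Capture the activations and gradients of the chosen conv layer.
    h1 = feature_layer.register_forward_hook(
        lambda m, inp, out: activations.append(out))
    h2 = feature_layer.register_full_backward_hook(
        lambda m, g_in, g_out: gradients.append(g_out[0]))

    model.zero_grad()
    score = model(image.unsqueeze(0))[0, target_class]  # score of the chosen class
    score.backward()                                     # gradients w.r.t. the features
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]           # shape (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)       # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1))            # (1, H, W) attention map
    return cam.flatten()                                  # vector of length l = H * W
```

During training, the student's map has to stay differentiable (e.g. by calling `backward(create_graph=True)`) so that the attention loss can update the student, while the frozen teacher's map can simply be detached.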

The basic class-incremental learning framework is shown below.

[Figure: the LwM teacher-student framework]

As you can see, this paper not only uses a distillation loss, it adopts the knowledge distillation framework (teacher-student network) wholesale. Why bring it in? Since we want to keep the attention maps from changing, the first problem is what to use as ground truth for supervision. Here, the ground truth is clearly the attention maps generated by the network from the previous incremental round, so we only need to design a loss that constrains the attention maps of the two consecutive rounds to be similar. Notice that this requires maintaining both the old-round and new-round networks at the same time, which is exactly why the knowledge distillation design is introduced: the previous-round model serves as a frozen teacher and the current model as the student.
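To make the structure concrete, here is a minimal sketch of one incremental phase under this teacher-student setup, assuming a PyTorch implementation; the `attention_distillation` helper, the loss weights `beta`/`gamma`, and the function signature are illustrative assumptions rather than the official repository's API.

```python
import torch
import torch.nn.functional as F

def incremental_step(teacher, student, loader, optimizer, num_old_classes,
                     beta=1.0, gamma=1.0):
    """One incremental phase (illustrative sketch).

    teacher: frozen model from the previous round (old classes only).
    student: copy of the teacher with its classifier head extended to the
             new classes; only the student's parameters are in `optimizer`.
    """
    teacher.eval()
    student.train()

    for images, labels in loader:            # new-class data only, no stored exemplars
        with torch.no_grad():
            old_logits = teacher(images)     # distillation targets from the old model

        logits = student(images)

        # classification loss on the new classes
        loss_c = F.cross_entropy(logits, labels)

        # standard knowledge-distillation loss on the old-class outputs
        loss_d = F.kl_div(
            F.log_softmax(logits[:, :num_old_classes], dim=1),
            F.softmax(old_logits[:, :num_old_classes], dim=1),
            reduction="batchmean",
        )

        # attention distillation loss comparing teacher/student attention maps
        # (per-image term defined by the L_AD formula below, averaged over the batch)
        loss_ad = attention_distillation(teacher, student, images)

        loss = loss_c + beta * loss_d + gamma * loss_ad
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```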

As for the specific loss, it is essentially a modified L1 loss computed on L2-normalized attention maps:

$$L_{AD}=\sum_{j=1}^{l}\left\|\frac{Q_{t-1, j}^{I_{n}, b}}{\left\|Q_{t-1}^{I_{n}, b}\right\|_{2}}-\frac{Q_{t, j}^{I_{n}, b}}{\left\|Q_{t}^{I_{n}, b}\right\|_{2}}\right\|_{1}$$

where $Q_{t-1}^{I_{n}, b}$ and $Q_{t}^{I_{n}, b}$ are the vectorized attention maps (of length $l$) produced by the previous-round (teacher) and current (student) models for input image $I_n$ with respect to base class $b$, and the subscript $j$ indexes their elements.
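Translated directly into code, assuming PyTorch and that the vectorized attention maps (e.g. the flattened Grad-CAM maps from the sketch above) are already available; the function name and the small `eps` stabilizer are my additions:

```python
import torch

def l_ad(q_teacher: torch.Tensor, q_student: torch.Tensor,
         eps: float = 1e-8) -> torch.Tensor:
    """Attention distillation loss for a single image.

    q_teacher, q_student: 1-D tensors of length l holding the vectorized
    attention maps of the previous-round (teacher) and current (student)
    models for the same image and class.
    """
    q_t = q_teacher / (q_teacher.norm(p=2) + eps)   # L2-normalize the teacher map
    q_s = q_student / (q_student.norm(p=2) + eps)   # L2-normalize the student map
    return (q_t - q_s).abs().sum()                  # sum of |difference| over the l elements
```

In a batched setting this per-image term would be averaged over the mini-batch before being added to the classification and distillation losses, which is what the `attention_distillation` helper assumed in the earlier sketch would do.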

Origin blog.csdn.net/qq_40714949/article/details/123692802