[Paper Reading] Learning Without Forgetting

Paper address: https://link.springer.com/chapter/10.1007/978-3-319-46493-0_37
Code: https://github.com/lizhitwo/LearningWithoutForgetting
Published in: ECCV 16

Abstract

When building a unified vision system or gradually adding new capabilities to a system, a common assumption is that training data for all tasks is always available. However, as the number of tasks increases, storing and retraining on this data becomes infeasible. A new problem arises when we add new capabilities to a Convolutional Neural Network (CNN) but the training data for its existing capabilities is unavailable. We propose a Learning without Forgetting approach, which uses only new-task data to train the network while preserving its original capabilities. Our method performs favorably compared to the commonly used feature extraction and fine-tuning techniques, and performs similarly to multi-task learning that uses the original task data we assume to be unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning as the standard practice for improving performance on new tasks.

Method

Like iCaRL, this paper is an early attempt at incremental learning in deep learning, and its method is simpler. The design idea can be briefly summarized by a figure from the original paper:
[Figure from the paper: comparison of fine-tuning, feature extraction, joint training, and Learning without Forgetting (LwF)]
From the figure, one can see that LwF is very similar to traditional joint training. The difference is that joint training requires the labels (and training data) of the old tasks, and maintaining and retraining on them is expensive. In contrast, LwF only supervises the network's responses on the old tasks. The concrete implementation borrows the idea of knowledge distillation and constructs a distillation loss that constrains the old-task responses to change as little as possible, thereby slowing down knowledge forgetting:

$$\mathcal{L}_{\text{old}}\left(\mathbf{y}_{o}, \hat{\mathbf{y}}_{o}\right)=-H\left(\mathbf{y}_{o}^{\prime}, \hat{\mathbf{y}}_{o}^{\prime}\right)=-\sum_{i=1}^{l} y_{o}^{\prime(i)} \log \hat{y}_{o}^{\prime(i)}$$

Here $\mathbf{y}_{o}$ are the old-task outputs recorded from the original network on the new-task data, $\hat{\mathbf{y}}_{o}$ are the current network's old-task outputs, and the primed quantities are their temperature-softened versions (the paper uses $T=2$). A later classic work, iCaRL, improves the classifier on top of the LwF distillation loss (replacing the FC classifier with a nearest-mean-of-exemplars classifier) and introduces the concept of exemplars.
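To make the loss concrete, below is a minimal PyTorch sketch of the distillation term and the combined training objective. The temperature $T=2$ and the loss-balance weight $\lambda_o$ follow the paper's description, but the function names and the code itself are illustrative assumptions, not the official MatConvNet implementation linked above.

```python
import torch
import torch.nn.functional as F


def lwf_distillation_loss(recorded_old_logits, current_old_logits, temperature=2.0):
    """LwF distillation term: modified cross-entropy between the old-task
    responses recorded from the original network (before training on the new
    task) and the current network's old-task responses, both softened with
    temperature T (the primed y' in the formula above; the paper uses T = 2)."""
    old_probs = F.softmax(recorded_old_logits / temperature, dim=1)      # y_o'
    new_log_probs = F.log_softmax(current_old_logits / temperature, dim=1)  # log y_hat_o'
    # -sum_i y_o'^(i) * log y_hat_o'^(i), averaged over the batch.
    return -(old_probs * new_log_probs).sum(dim=1).mean()


def lwf_total_loss(recorded_old_logits, current_old_logits,
                   new_task_logits, new_task_labels, lambda_o=1.0):
    """Combined objective: lambda_o * L_old + L_new, where L_new is the usual
    cross-entropy on the new task's labels (weight decay omitted here)."""
    l_old = lwf_distillation_loss(recorded_old_logits, current_old_logits)
    l_new = F.cross_entropy(new_task_logits, new_task_labels)
    return lambda_o * l_old + l_new
```

In the paper's procedure, the old-task responses are recorded once by running the original network over the new-task images before training; the new task head is then warmed up with the shared layers frozen, after which all weights are fine-tuned jointly with the combined loss above.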
