Semi-supervised field paper notes - Billion-scale semi-supervised learning for image classification

Publication information

2019, Facebook

Field

semi-supervised learning

Method

Main purpose

Improve existing models with unlabeled data

Method overview

Using a teacher/student learning scheme, the paper leverages billion-scale unlabeled data together with relatively small-scale labeled data to improve existing models on image classification tasks.

Background

  • In 2018, Facebook proposed a weakly supervised study, "Exploring the Limits of Weakly Supervised Pretraining", which used billion-scale weakly supervised data (images carrying hashtag labels, sourced from Instagram)
  • This method is inspired by several directions: self-training, distillation, and boosting.

Method introduction

  • Data used:

Lots of unlabeled data plus a relatively small amount of labeled data ("billions of unlabeled images along with a relatively smaller set of task-specific labeled data").

  • Specific process:
  1. Train a teacher model on the labeled dataset A

  2. Use the teacher to pseudo-label the unlabeled data, select data for each class (rank images by their pseudo-label prediction score and keep the top-K per class), and construct a new training set B

  3. Train a student model on dataset B as pre-training; the student model is smaller than the teacher

  4. Fine-tune the student model on the labeled dataset A
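Steps 2 and 3's data-construction part can be sketched in a few lines. The function below is a minimal illustration of the top-K per-class selection; the names (`build_dataset_b`, `teacher_probs`) are my own, not from the paper's code, and an image may appear in several class lists since each class is ranked independently.

```python
# Hypothetical sketch of step 2: rank unlabeled images by the teacher's
# per-class confidence and keep the top-K per class to build dataset B.
# `teacher_probs` maps an image id to its list of class probabilities.

def build_dataset_b(teacher_probs, num_classes, k):
    """Select the top-K most confident images for each class."""
    dataset_b = []
    for c in range(num_classes):
        # Rank every unlabeled image by its predicted probability for class c.
        ranked = sorted(teacher_probs.items(),
                        key=lambda item: item[1][c],
                        reverse=True)
        # Keep the K highest-scoring images, pseudo-labeled as class c.
        dataset_b.extend((img_id, c) for img_id, _ in ranked[:k])
    return dataset_b

# Toy usage: four unlabeled images, two classes, K = 1.
probs = {
    "img0": [0.9, 0.1],
    "img1": [0.2, 0.8],
    "img2": [0.7, 0.3],
    "img3": [0.4, 0.6],
}
print(build_dataset_b(probs, num_classes=2, k=1))
# → [('img0', 0), ('img1', 1)]
```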

  • Method variant:
  1.  

 

Article conclusion

Table 1 on the second page of the paper lists six recommendations from the authors on the large-scale semi-supervised learning process. It condenses the essence of many experiments in the paper and is well worth careful attention:

I interpret it in detail as follows:

 

 

Method advantages

 

  • Compared with weakly supervised methods

  1. It avoids the long-tail distribution problem. This method selects unlabeled data after pseudo-labeling, so the amount and class distribution of the data can be controlled manually (the same number of images is selected per label), avoiding imbalance across categories

  2. It avoids the label-noise problem of weak supervision. The paper notes a "significant amount of inherent noise in the labels due to non-visual, missing and irrelevant tags which can significantly hamper the learning of models"


 

 

Method Highlights

  • large data size

For the first time, billion-scale unlabeled data is used for semi-supervised learning ("semi-supervised learning with neural networks has not been explored before at this scale").

 

Details


Origin blog.csdn.net/s000da/article/details/109232063