The long-tail data problem in deep learning

Foreword

I've been busy with work and holidays recently, so I haven't had much time to write a blog post. Today I read the article "4D Review: How to create a data closed loop for autonomous driving?", which happens to be related to the object detection work I've been doing. I often run into the following problem in object detection: suppose I have a training set of 10,000 samples that is perfectly balanced, say 5 categories with 2,000 samples each. Now my lead asks me to add recognition of a new class, but only a few hundred samples of it are available. If those few hundred samples of the new class are added to the 10,000-sample training set, will recognition suffer on the original classes, or on the new one? This is the long-tail problem.

In traditional classification and recognition tasks, the distribution of the training data is usually balanced by hand, that is, there is no significant difference in the number of samples across categories. Balanced training samples have many advantages: they simplify the robustness requirements on the algorithm and, to a certain extent, guarantee the reliability of the resulting model. However, as the number of categories grows, keeping them balanced incurs exponentially increasing collection costs. If the samples are not deliberately balanced, the category distribution typically looks like the figure below. A classification system trained directly on long-tail data tends to overfit the head categories and ignore the tail categories at prediction time. Our concern is how to use unbalanced, long-tail data effectively to train a balanced classifier. From the perspective of industrial demand, this line of research would also greatly speed up data collection and significantly reduce its cost.
[Figure: a typical long-tail distribution, where a few head categories account for most of the samples]
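To make the long-tail shape concrete, here is a small sketch (my own illustration, not from the article) of the common exponential-imbalance protocol used to build long-tailed benchmarks such as CIFAR-LT, where class i keeps n_max * mu^i samples:

```python
def long_tail_counts(num_classes: int, n_max: int, imbalance_factor: float):
    """Per-class sample counts under exponential imbalance: the most
    frequent class keeps n_max samples, the rarest keeps
    n_max / imbalance_factor (the usual CIFAR-LT construction)."""
    mu = imbalance_factor ** (-1.0 / (num_classes - 1))
    return [int(n_max * mu ** i) for i in range(num_classes)]

# 5 balanced classes of 2000 samples, skewed with imbalance factor 10:
print(long_tail_counts(num_classes=5, n_max=2000, imbalance_factor=10))
# -> [2000, 1124, 632, 355, 200]: the head classes dominate the tail
```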

Basic methods

Re-sampling

Resampling mainly means undersampling the head categories and oversampling the tail categories; in essence, the sampling frequency of images is weighted inversely to the per-class sample count. The most common strategy is class-balanced sampling. "Class-balanced" is meant in contrast to the sample-balanced scheme of conventional training, where every image has the same probability of being selected regardless of its class. The core of class balancing is to weight each image's sampling frequency by the number of samples in its class, as in the sketch below.
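A minimal class-balanced sampling sketch in PyTorch (assuming a torchvision-style dataset object that exposes per-image labels via `dataset.targets`): per-sample weights inverse to the class counts are fed to `WeightedRandomSampler`.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# One class index per training image; `dataset.targets` is an assumption
# about your dataset object (torchvision-style).
labels = torch.tensor(dataset.targets)
class_counts = torch.bincount(labels).float()

# Weight each image inversely to its class frequency, so every class is
# drawn with roughly equal probability regardless of its size.
sample_weights = (1.0 / class_counts)[labels]

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```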

Resampling artificially balances the training samples the model sees when the raw data is imbalanced, which reduces overfitting to the head data to a certain extent. However, the small amount of tail data is learned repeatedly and lacks sufficient sample diversity, so the model is not robust enough there, while the abundant and diverse head data is often not fully exploited. Resampling is therefore not a perfect solution.

For resampling methods, please refer to:

  1. Decoupling Representation and Classifier for Long-Tailed Recognition, ICLR 2020 (code: classifier-balancing)

  2. BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition, CVPR 2020 (code: BBN)
    [Figure: accuracy of different combinations of representation learning and classifier learning]
    The figure shows that the best combination for long-tail classification is a backbone (representation) learned with cross-entropy loss on the original data together with a classifier learned with re-sampling; a two-stage sketch of this decoupled recipe follows this list.

  3. Dynamic Curriculum Learning for Imbalanced Data Classification, ICCV 2019
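Following up on papers 1 and 2 above, a minimal two-stage "decouple, then re-train the classifier" sketch; this is my own simplification, with `backbone`, `classifier` and the two data loaders standing in for your own modules and samplers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: learn the representation end-to-end with plain cross-entropy
# on the original, instance-balanced data.
model = nn.Sequential(backbone, classifier)  # your own modules
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
for images, targets in instance_balanced_loader:
    loss = F.cross_entropy(model(images), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2 (classifier re-training, "cRT"): freeze the backbone,
# re-initialize the classifier and retrain it on class-balanced batches.
for p in backbone.parameters():
    p.requires_grad_(False)
classifier.reset_parameters()  # assumes an nn.Linear head
opt = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
for images, targets in class_balanced_loader:  # e.g. the sampler above
    with torch.no_grad():
        feats = backbone(images)
    loss = F.cross_entropy(classifier(feats), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```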

Re-weighting

Re-weighting mainly operates on the classification loss, which distinguishes it from sampling. Because the loss computation is flexible and convenient, more complex tasks such as object detection and instance segmentation tend to use re-weighted losses to handle long-tail distributions, e.g. weighted cross-entropy loss, focal loss, and so on. A minimal weighted-loss sketch follows.
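As a small example of re-weighting (a sketch I added, reusing the toy numbers from the foreword: five classes of 2,000 samples plus a new class of 300), PyTorch's `CrossEntropyLoss` accepts per-class weights directly:

```python
import torch
import torch.nn as nn

# Per-class training counts: five head classes plus one rare class.
class_counts = torch.tensor([2000., 2000., 2000., 2000., 2000., 300.])

# Inverse-frequency weights, normalized to average 1 across classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 6)             # dummy batch of 8 predictions
targets = torch.randint(0, 6, (8,))
loss = criterion(logits, targets)      # errors on class 5 count ~6.7x more
```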

Related articles:

  1. Class-Balanced Loss Based on Effective Number of Samples, CVPR 2019 (code: class-balanced-loss); see the weight sketch after this list
  2. Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss, NeurIPS 2019 (code: https://github.com/kaidic/LDAM-DRW)
  3. Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from a Domain Adaptation Perspective, CVPR 2020
  4. Remix: Rebalanced Mixup, arXiv preprint 2020
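For paper 1, the class-balanced weights derived from the "effective number of samples" can be sketched as follows (my own reading of the paper's formula E_n = (1 - beta^n) / (1 - beta), not the authors' exact code):

```python
import torch

def class_balanced_weights(class_counts, beta=0.9999):
    """Per-class weights from the effective number of samples,
    E_n = (1 - beta^n) / (1 - beta); each class is weighted by 1 / E_n
    and the weights are normalized to sum to the number of classes."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = (1.0 - beta ** counts) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(counts) / weights.sum()

w = class_balanced_weights([2000, 2000, 2000, 2000, 2000, 300])
# Drop into nn.CrossEntropyLoss(weight=w), like the inverse-frequency case.
```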

More related work

  1. Learning to Segment the Tail, CVPR 2020 (code: https://github.com/JoyHuYY1412/LST_LVIS)
    This work has two core highlights. The first is to treat learning from a long-tail distribution as incremental learning: common objects (the head data) are learned first, and rare tail categories are then recognized on the basis of that knowledge of common categories. This is actually very close to how humans learn. The learning process of this work is shown in the figure below: categories are sorted by frequency of occurrence, divided into different learning stages, and learned in order from easy to hard.
    [Figure: categories ordered by frequency and split into incremental learning stages]
    The second highlight is efficient instance-level re-sampling. Resampling is a common method for long-tail classification and, unlike re-weighting, is easy to carry out on instance-level data. If each image were to contribute only one sampled instance, instance-level resampling would degenerate directly into image-level resampling, but that is not efficient enough. Based on observation of the data, the authors found that when an object appears in an image, other instances of the same category often appear alongside it, and they therefore propose the efficient instance re-sampling shown below:
    [Figure: the proposed efficient instance-level re-sampling scheme]

  2. Focal Loss for Dense Object Detection, ICCV 2017 (code: https://github.com/clcarwin/focal_loss_pytorch); a minimal focal-loss sketch follows this list

  3. Equalization Loss for Long-Tailed Object Recognition, CVPR 2020 (code: https://github.com/tztztztztz/eql.detectron2)

  4. Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax, CVPR 2020 (code: https://github.com/FishYuLi/BalancedGroupSoftmax)

  5. Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels, CVPR 2020
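For item 2 above, focal loss down-weights well-classified examples so training focuses on hard, often rare, samples. A minimal multi-class sketch with the standard defaults (alpha = 0.25, gamma = 2; my own simplification, not the linked repository's exact implementation):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t): the modulating
    factor (1 - p_t)^gamma shrinks the loss of well-classified examples
    (p_t near 1), focusing training on hard samples."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)))
```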

References

  1. Long-Tailed Classification (1): Introduction to classification problems under a long-tailed (imbalanced) distribution
