Data distribution - treatment of long-tailed distributions

Foreword

The term "long-tail distribution" (长尾分布) comes up frequently in classification tasks, because a long-tailed class distribution leads to high error rates during training, which in turn affects the experimental results.

What I want to stress here is that the long-tail distribution is a phenomenon; some sources call it a theory or a law, which I find inaccurate, because it is not universal: not all data follows this shape, and we should not force every distribution or phenomenon into the concept of a long-tail distribution.

The concept is fairly common in the IT industry, for example in e-commerce sales: in the brick-and-mortar world a few industry giants dominate, but the growth of e-commerce has made it possible to sell many niche or uncommon products, and their combined sales volume can match or even exceed that of the original giants. This is the long-tail phenomenon as it appears in the IT industry.

In practical machine learning and visual recognition applications, the long-tail distribution is, to some extent, a natural distribution that appears even more widely than the normal distribution. In reality it mainly means that a small number of individuals account for a large share of the total (a small number of categories contain a large proportion of the samples), and the often-mentioned Pareto principle (the 80/20 rule) is a vivid summary of the long-tail distribution.

The long-tail phenomenon is frequently encountered in image and vision tasks.


The same issue is now appearing in NLP as well. Here I want to mention a ranking law I came across: Zipf's law (齐夫定律). It applies to natural language processing and states that, in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table. So the most frequent word appears about twice as often as the second most frequent word, and the second most frequent word appears about twice as often as the fourth most frequent word. This law is used as a reference for anything related to power-law probability distributions.

Zipf's law is mentioned here because, like the long-tail distribution, it is a power-law probability distribution (幂定律概率分布). In natural language processing it likewise shows that very frequent words or tokens can easily have an undesirable effect on the model's results.

In the Brown corpus, "the", "of", and "and" are the three most frequent words, occurring 69,971, 36,411, and 28,852 times respectively. Out of roughly 1 million words in the corpus, that is about 7%, 3.6%, and 2.9%, which fits the description given by Zipf's law. The 135 most frequent words alone make up half of the Brown corpus.
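Just to make the numbers concrete, here is a tiny script of my own (using only the counts quoted above) that compares them with the ideal 1/rank pattern predicted by Zipf's law:

```python
# Compare the Brown-corpus counts quoted above with the ideal Zipf pattern,
# which predicts frequency(rank) ≈ frequency(1) / rank.
counts = {"the": 69971, "of": 36411, "and": 28852}
total_words = 1_000_000  # approximate size of the Brown corpus

for rank, (word, count) in enumerate(counts.items(), start=1):
    share = count / total_words           # share of the whole corpus
    ideal = counts["the"] / rank          # what a perfect 1/rank law would give
    print(f"rank {rank}: {word:>4}  count={count}  share={share:.1%}  "
          f"ideal Zipf count≈{ideal:,.0f}")
```

The second rank comes out very close to half of the top count, and the third rank roughly a third, so the corpus matches the idealized law only approximately.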

So far, it can be seen that balancing the data is a problem that needs to be considered in machine learning.

Processing method (related work)

The following terminology and processing methods come from the paper [Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks]. The methods in it are applied in CV, but I think they can also be transferred to other research directions.

Let me talk about some related vocabulary first:

  • CE - cross entropy;

  • Imbalance factor: the ratio of the number of samples in the largest class of the dataset to the number of samples in the smallest class;

  • CAM (class activation map) based augmentation: tailored for two-stage training; it generates discriminative images by transferring foregrounds while keeping backgrounds unchanged.

  • The fine-tuning methods of Cao et al. (2019) can be divided into two variants:

    deferred re-balancing by re-sampling (DRS) and deferred re-balancing by re-weighting (DRW). DRS and DRW are really two training schedules: DRS trains with the vanilla schedule in the first stage and fine-tunes with re-sampling in the second stage, while DRW fine-tunes with re-weighting in the second stage (a schedule sketch is given after this list).
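To make the DRS/DRW schedules concrete, here is a minimal training-loop sketch of my own (not code from Cao et al. 2019; `make_loader` is a hypothetical helper): both variants train with plain cross-entropy first and only defer the re-balancing to the second stage.

```python
# Hypothetical sketch of deferred re-balancing: DRS switches to a re-sampled
# loader in the second stage, DRW switches to a re-weighted loss.
import torch
import torch.nn as nn

def train(model, dataset, class_counts, total_epochs, switch_epoch, mode="DRW"):
    plain_loss = nn.CrossEntropyLoss()
    # Inverse-frequency weights used for the deferred re-weighting stage.
    w = 1.0 / torch.tensor(class_counts, dtype=torch.float)
    weighted_loss = nn.CrossEntropyLoss(weight=w / w.sum() * len(class_counts))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(total_epochs):
        deferred = epoch >= switch_epoch
        # make_loader is a placeholder: balanced=True means class-balanced re-sampling.
        loader = make_loader(dataset, balanced=(mode == "DRS" and deferred))
        criterion = weighted_loss if (mode == "DRW" and deferred) else plain_loss
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
```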

The classic machine learning approach

Re-sampling method (Re-Sampling)

There are two methods here: Over-Sampling and Under-Sampling.

  • Over-Sampling: during training, samples from the classes that make up only a small proportion of the dataset are drawn multiple times, so that this data is used repeatedly and the long-tail problem is alleviated.
  • Under-Sampling: part of the data from the classes with a high proportion is discarded during training, so that the amount of data per class becomes balanced and the long-tail problem is alleviated.

[Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks] summarizes several such methods: class-balanced sampling, random under-sampling, progressively-balanced sampling, and so on, all of which essentially modify the probability P of selecting a sample.
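As a rough illustration of what "modifying the probability P" means (my own sketch, following the common formulation in this line of work), the probability of picking a sample from class j can be written as p_j ∝ n_j^q, where n_j is the class size: q = 1 gives plain instance-balanced sampling, q = 0 gives class-balanced sampling, and progressively-balanced sampling interpolates between the two as training proceeds.

```python
# Sketch of the class-selection probabilities behind these sampling schemes.
def sampling_probs(class_counts, q):
    powered = [n ** q for n in class_counts]
    total = sum(powered)
    return [p / total for p in powered]

def progressively_balanced_probs(class_counts, epoch, total_epochs):
    t = epoch / total_epochs                      # goes from 0 to 1 over training
    instance = sampling_probs(class_counts, q=1)  # dominated by head classes
    balanced = sampling_probs(class_counts, q=0)  # uniform over classes
    return [(1 - t) * pi + t * pb for pi, pb in zip(instance, balanced)]

# Example with a hypothetical long-tailed 3-class dataset.
print(sampling_probs([5000, 500, 50], q=1))
print(sampling_probs([5000, 500, 50], q=0))
print(progressively_balanced_probs([5000, 500, 50], epoch=5, total_epochs=10))
```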

Loss Reweighting (Re-Weight)

During training, increasing the weight of the tail data in the loss means weighting each sample's loss according to how "sparse" its category is: categories with more samples are usually assigned a lower weight, in order to balance the contribution of different categories to the loss function. However, this method struggles with real-world data, and when the long-tail imbalance is severe it can easily cause optimization problems.

Weight re-distribution determines the penalty coefficient of the loss according to the number of samples in each category: classes with few samples should receive a larger loss penalty. The usual practice is to attach a per-class weight coefficient to the cross-entropy loss, with the coefficient typically defined as the reciprocal of the class sample size.
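For illustration, a minimal PyTorch sketch of cross-entropy weighted by the reciprocal of the class size (the class counts below are made up) could look like this:

```python
# Minimal sketch: cross-entropy weighted by inverse class frequency.
import torch
import torch.nn as nn

class_counts = torch.tensor([5000.0, 500.0, 50.0])      # hypothetical long-tailed counts
weights = 1.0 / class_counts                            # rare classes get larger weights
weights = weights / weights.sum() * len(class_counts)   # normalize so weights average to 1

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)           # fake batch of logits for 3 classes
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)     # mistakes on tail classes now cost more
```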

[Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks] summarizes the relevant algorithms, essentially modifying the loss to achieve a balanced effect.

Why do the above two methods have certain effects?

In the article "BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition", it is mentioned that although such a common method is very good in classification, it is not very good in representation ability. The original text is as follows:

We firstly discover that these rebalancing methods achieving satisfactory recognition accuracy owe to classifier learning of deep networks. However, at the same time, they will unexpectedly damage the representative ability of the learned deep features to some extent.

In other words, such balancing methods lose some of the data's characteristics; the ability to represent features is reduced:

In this paper, we reveal that the mechanism of these strategies is to significantly promote classifier learning but will unexpectedly damage the representative ability of the learned deep features to some extent.

The specific effect is as follows:

[Figure: effect of re-balancing on the learned features]

As the figure shows, the features learned after re-balancing start to become dispersed.

Further experimental analysis produced the results in the following figure:

[Figure: results for different combinations of representation-learning and classifier-learning manners on two datasets]

The two graphs correspond to two datasets. Taking the left graph as an example, here is what they illustrate:

  • First, for ease of analysis, the authors divide training into two stages:
    • representation learning, i.e., training the feature extractor (the forward and backward passes through the backbone, excluding the final fully connected layer);
    • classifier learning, i.e., training the classifier (the last fully connected layer);
  • There are three training manners: plain training with the cross-entropy loss (CE; in my view it plays the role of a control experiment, i.e., the result obtained without any special treatment), re-sampling (RS), and re-weighting (RW);
  • Looking at a single column (i.e., fixing the representation-learning manner), RS gives the best classification results;
  • Looking at a single row (likewise, fixing the classifier-learning manner), plain CE training gives the best representations;

Existing problems

There are still problems with the re-balance method, as mentioned in the article:

  • re-sampling
    • Premise: the data is extremely imbalanced;
    • over-sampling: prone to over-fitting;
    • under-sampling: prone to under-fitting;
  • re-weight
    • disrupt the distribution of the original data;

Deep learning methods

Two-stage fine-tuning strategy

BBN below is just one of the models that use a so-called two-stage fine-tuning strategy. The two stages are: 1) imbalanced training; 2) balanced fine-tuning.

BBN

This approach divides training into two stages: the first stage trains as usual to learn representations, and the second stage fine-tunes the network in a re-balanced manner with a smaller learning rate.
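A rough sketch of this generic two-stage recipe (my own illustration, not BBN itself; the data loaders, epoch counts, and learning rates are placeholders):

```python
# Hypothetical two-stage fine-tuning: stage 1 trains the whole network on the
# imbalanced data; stage 2 fine-tunes with a class-balanced loader and a smaller LR.
import torch

def two_stage_finetune(model, imbalanced_loader, balanced_loader, criterion,
                       stage1_epochs=90, stage2_epochs=10):
    # Stage 1: vanilla training to learn representations.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(stage1_epochs):
        for x, y in imbalanced_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()

    # Stage 2: re-balanced fine-tuning with a much smaller learning rate.
    opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    for _ in range(stage2_epochs):
        for x, y in balanced_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
```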

Here I want to mention a recently proposed method I came across that implements a framework for dealing with long-tailed distributions.

In the article "BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition", the BBN network is proposed to better deal with long-tailed distributions.

[Figure: overall structure of the BBN model]

As shown in the figure above, the authors organize their BBN model into three modules:

  1. conventional learning branch
  2. re-balancing branch
  3. cumulative learning (a parameter α is continuously adjusted as the training epochs increase and is used to integrate the two branches above; see the sketch after this list)
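As I understand the cumulative-learning idea, it can be sketched roughly as follows; the parabolic decay of α is my reading of the paper and should be treated as an assumption.

```python
# Sketch of BBN-style cumulative learning: alpha starts near 1 (conventional branch
# dominates) and decays towards 0 (re-balancing branch dominates) as training proceeds.
import torch.nn.functional as F

def alpha_schedule(epoch, total_epochs):
    return 1.0 - (epoch / total_epochs) ** 2       # assumed parabolic decay

def bbn_loss(logits_conv, logits_rebal, y_conv, y_rebal, alpha):
    # The two branches' logits are blended by alpha, and the loss mixes the labels
    # drawn by each branch's sampler with the same coefficient.
    logits = alpha * logits_conv + (1 - alpha) * logits_rebal
    return alpha * F.cross_entropy(logits, y_conv) + \
           (1 - alpha) * F.cross_entropy(logits, y_rebal)
```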

Of course, there are other methods that work well on long-tailed distributions, such as LDAM and CB-Focal.
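For reference, here is a rough sketch, written from my memory of those papers, of the per-class quantities they rely on: CB-Focal weights classes by the inverse "effective number" of samples, and LDAM assigns larger classification margins to rarer classes.

```python
# Rough sketch of the per-class quantities behind CB-Focal and LDAM (my summary,
# not code taken from either paper).
def class_balanced_weights(class_counts, beta=0.999):
    # Effective number of samples: (1 - beta^n) / (1 - beta); weight = its inverse.
    effective_num = [(1 - beta ** n) / (1 - beta) for n in class_counts]
    weights = [1.0 / e for e in effective_num]
    scale = len(weights) / sum(weights)            # normalize to average 1
    return [w * scale for w in weights]

def ldam_margins(class_counts, max_margin=0.5):
    # Margin proportional to n^(-1/4); the rarest class gets the largest margin.
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]

print(class_balanced_weights([5000, 500, 50]))
print(ldam_margins([5000, 500, 50]))
```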

Mixup method

In [Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks], two kinds of mixup usage are discussed. One is the existing Mixup methods:

  • Input Mixup
  • Manifold Mixup

The other is the "fine-tuning after mixup training" method proposed by the authors, which again has two stages: the first stage trains with mixup, and the second stage fine-tunes the model (the article does not explain exactly how this is implemented).
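Input Mixup itself is easy to write down; below is a minimal sketch of the standard formulation (the fine-tuning stage is not shown, since the article does not detail it):

```python
# Minimal sketch of Input Mixup: blend pairs of samples and their one-hot labels
# with a coefficient drawn from a Beta distribution.
import numpy as np
import torch

def input_mixup(x, y_onehot, alpha=0.2):
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))        # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[index]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[index]
    return mixed_x, mixed_y

# Usage: x is a batch of images, y_onehot the one-hot labels;
# train on (mixed_x, mixed_y) with a soft-label loss.
```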

Long Tail Distribution Problem in Text Classification

The solutions above are all aimed at the image/CV field, but long-tailed distribution problems also exist in NLP. For text classification with a long-tailed label distribution, one line of work proposes a new loss function:

[Figure: the proposed loss function]


Origin blog.csdn.net/c___c18/article/details/131154250