How to solve the problem of imbalanced data classes (Data with Imbalanced Classes)

Class imbalance means that, in a classification task, the numbers of samples belonging to the different classes in the data set differ greatly.

 

Class imbalance has the following consequence: the skewed distribution of the data often biases the classifier's output toward the majority class. Predicting the majority class yields a high overall accuracy, but performance on the minority class, which is usually the one we care about, is poor. For example, if 99% of the samples belong to the majority class, a classifier that always predicts the majority class reaches 99% accuracy while never recognizing a single minority-class sample.

 

There are usually three ways to deal with this problem:

 

1. Undersampling

Remove some majority-class (negative) samples so that the numbers of positive and negative samples become close, and then train on the reduced set. Because many majority-class samples are discarded, the classifier's training set is much smaller than the initial training set. The drawback of undersampling is that important information may be lost. For this reason it is often combined with an ensemble learning mechanism: the majority-class samples are divided into several subsets used by different learners, so that each individual learner is undersampled while, from a global point of view, no important information is lost.
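As a concrete illustration, here is a minimal sketch of plain random undersampling, assuming NumPy and a 0/1 label convention (0 = majority, 1 = minority); the function name and parameters are illustrative, not part of the original post.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Randomly drop majority-class samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    # Keep only as many majority samples as there are minority samples.
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

The reduced set is then used to train any ordinary classifier; this is exactly where important information can be lost, which motivates the ensemble variant described next.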


Representative algorithm: EasyEnsemble

EasyEnsemble uses the ensemble mechanism described above: the majority-class samples are divided into several subsets, each given to a different learner, so that every learner is trained on an undersampled set while, globally, no important information is lost.

How the algorithm works:

  • First, randomly draw several independent subsets from the majority class.
  • Combine each subset with all of the minority-class samples to train a base classifier.
  • Finally, combine these base classifiers into an ensemble learning system.

The subset sampling in EasyEnsemble is regarded as an unsupervised step, so the majority-class samples for each subset can be drawn by independent random sampling with replacement.
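A minimal sketch of this idea, assuming NumPy and scikit-learn, is given below. Note that the published EasyEnsemble algorithm trains an AdaBoost ensemble on each subset; the single decision tree per subset used here, and the 0/1 label convention, are simplifications for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_fit(X, y, n_subsets=10, majority_label=0, random_state=0):
    """Train one base classifier per randomly drawn majority-class subset."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    models = []
    for _ in range(n_subsets):
        # Draw a majority subset the size of the minority class, with replacement.
        sub = rng.choice(maj_idx, size=len(min_idx), replace=True)
        idx = np.concatenate([sub, min_idx])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    # Majority vote over all base classifiers (labels assumed to be 0 and 1).
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```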

 

2. Oversampling

Apply "oversampling" to the minority-class (positive) samples in the training set: add positive samples until the numbers of positive and negative samples are close, and then train. However, the positive samples are not simply copied, because direct copying easily leads to overfitting. The representative algorithm is SMOTE, which generates additional positive samples by interpolating between existing positive samples in the training set. The drawback of oversampling is that, because positive samples are added, the training set becomes much larger than the initial one, so the training cost is much higher than for undersampling.


Representative algorithm: SMOTE (Synthetic Minority Oversampling Technique)

SMOTE generates additional positive samples by interpolating between existing positive samples in the training set. It uses the k-nearest-neighbor algorithm to analyze the existing minority-class samples and synthesizes new minority samples in feature space.

How the algorithm works:

The SMOTE algorithm is built on the assumption that samples lying close together in feature space still belong to the minority class; it synthesizes artificial data by exploiting the similarity between existing minority samples in feature space. Below is a brief walkthrough of the SMOTE idea.

Consider the data set shown in the original figure: the number of blue (majority-class) samples is much larger than the number of red (minority-class) samples. A general classification model would then tend to ignore the influence of the red samples and emphasize only the classification accuracy on the blue samples, so we need to add red samples to balance the data set.

The steps (illustrated by figures in the original post) are:

  • First, randomly choose a minority-class sample.
  • Then find its m nearest minority-class neighbors.
  • Randomly pick one of those m nearest minority-class neighbors.
  • Finally, pick a random point on the line segment between the two chosen samples; this point is the new synthetic sample.
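A minimal sketch of this interpolation procedure, assuming NumPy and scikit-learn's NearestNeighbors, is given below; the function name and parameters are illustrative, and for real use the imbalanced-learn library provides a tested SMOTE implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic samples from the minority-class samples X_min."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbors because every point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbors = nn.kneighbors(X_min)
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # randomly chosen minority sample
        j = rng.choice(neighbors[i][1:])      # one of its k nearest minority neighbors
        lam = rng.random()                    # random point on the connecting segment
        new_samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_samples)
```

The synthetic samples are appended to the original training set before the classifier is trained.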

 

3. Threshold moving

Train on the original training set as usual, but when using the trained classifier for prediction, embed the rescaling formula into the decision process; this is called "threshold moving."


In a binary classification task, let p denote the predicted probability that a sample belongs to the positive class, so the probability that it belongs to the negative class is 1 - p. When p / (1 - p) > 1, we classify the sample as positive. Under balanced conditions, that is, when the ratio of positive to negative samples is close to 1, this corresponds to a classification threshold of 0.5. If the samples are imbalanced, we need to adjust the classification threshold at prediction time.

Suppose the data set contains m positive samples and n negative samples; then the observed odds of positive to negative samples are m / n (in the balanced case these odds would be 1). When classifying, we assign a sample to the positive class only if its predicted odds p' / (1 - p') are greater than the observed odds m / n. Rearranging this condition, m / (m + n) replaces 0.5 as the new classification threshold.
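A minimal sketch of this decision rule, assuming the positive-class probabilities come from any trained classifier's predict_proba output, is given below.

```python
import numpy as np

def predict_with_moved_threshold(proba_pos, n_pos, n_neg):
    """Classify as positive only when p' / (1 - p') exceeds the observed odds m / n,
    i.e. when p' > m / (m + n)."""
    threshold = n_pos / (n_pos + n_neg)   # replaces the default threshold of 0.5
    return (np.asarray(proba_pos) > threshold).astype(int)

# Example: with 100 positive and 900 negative training samples the threshold becomes
# 100 / 1000 = 0.1, so a sample with predicted p' = 0.2 is classified as positive
# even though 0.2 < 0.5.
```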

 

Source: www.cnblogs.com/HuZihu/p/11039627.html