Class Imbalance in Detail

This article covers the following topics:

  • 1 What is class imbalance?
  • 2 Why does class imbalance make classification difficult?
  • 3 What are the solutions to class imbalance?
  • 4 How to choose evaluation metrics for imbalanced learning?
  • 5 Suggestions for choosing a method
  • 6 Summary

1 What is class imbalance?

Class imbalance (class-imbalance), also known as data skew or data imbalance, refers to classification tasks in which the numbers of training examples in different classes differ greatly. Imbalanced classes are common in real-world classification tasks, for example fraud detection, advertising click-through-rate prediction, and malicious-script detection. Imbalance can also arise when a multi-class problem is decomposed into binary subproblems: even if the original classes each have a considerable number of training examples, the binary classification tasks produced by OvR (One vs. Rest) or MvM (Many vs. Many) strategies may still be imbalanced. Standard machine learning algorithms typically assume that the classes contain similar numbers of samples, so class imbalance can render a learning algorithm ineffective. It is therefore worth understanding the basic techniques for handling it.

2 Why does class imbalance make classification difficult?

Typically, the higher the degree of class imbalance, the harder the data set is to classify, but this is not always the case. Several imbalanced scenarios are listed below:

  • 1) The positive and negative samples differ greatly in their features, with a wide boundary between them (this is the best case);
  • 2) The minority class splits into multiple sub-concepts (sub-clusters), each containing only a few samples, causing a sparse distribution;
  • 3) Too many outliers (i.e., too many minority samples appearing in dense regions of the majority class);
  • 4) Overlapping distributions between classes (i.e., samples of different classes appear densely in the same regions of feature space);
  • 5) Inherent noise in the data, especially noise in the minority class.

Case 1 is the best case: applying a suitable model directly to such data sets (models insensitive to class imbalance, such as SVMs and decision trees) generally yields good classification results. From this point of view, class imbalance itself is not the source of classification difficulty; the cases below are.

Case 2 is also known as the small disjuncts problem. The reason it makes classification hard is straightforward: in the same feature space, compared with a minority class distributed as a single simple cluster, a minority class composed of multiple sub-concepts requires the model to produce a more complex decision boundary in order to predict well. With model capacity held fixed, classification performance deteriorates as the number of sub-concepts grows. The remedy is correspondingly simple: use a higher-capacity model (in deep learning terms: wider, deeper, stronger).

Cases 3 and 4 present similar difficulties: some or all of the minority samples are embedded in dense regions of majority samples, blurring the class boundary and making classification hard, and the difficulty grows further as the imbalance increases.

Case 5 hardly needs explanation: noise is a nuisance at any time, and noise in the minority class is even worse.

The figure below gives an intuitive visualization of the relationship between the imbalance ratio and class overlap: at the same imbalance ratio, overlapping and non-overlapping data sets exhibit very different classification difficulty. Dark blue points represent samples the model can classify well, and dark red points represent samples the model cannot classify correctly at all. Here "hardness" refers to the residual between the trained classifier's output probability and the ground-truth label, i.e. \( |f(x) - y| \).

In panel (a), the data set is generated from two non-overlapping two-dimensional Gaussian distributions, representing case 1. We can observe that a growing imbalance ratio does not affect the classification difficulty of this data set (panel (c)). In panel (b), the data set is generated from a mixture of two overlapping two-dimensional Gaussians, representing cases 2, 3, and 4. As the imbalance ratio increases, what starts as a relatively easy task becomes a very difficult one (panel (d)).

Moreover, in real industrial applications these difficulty factors often appear together, compounded by other practical issues such as missing feature values and huge data set scale.

3 What are the solutions to class imbalance?

Each solution has its own scope of application; choose according to the situation. Sometimes no special treatment is needed at all, as in these two cases:

  • The problem already specifies ROC-AUC as the metric. In that case handling versus not handling the imbalance makes little difference, because ROC is insensitive to class imbalance. (Note: this situation may arise in competitions; if you are defining the problem yourself, use ROC with caution.)
  • Positive and negative samples matter equally in the task, i.e. a correct prediction on a positive sample is worth as much as one on a negative sample, so there is no concern about positives being drowned out and no treatment is needed.

But if we have a particularly strong demand for recall, meaning we care more about the positive samples, then doing nothing makes it hard to get the results we want. The following sections introduce several families of solutions: low-effort methods, data-level methods, algorithm-level methods, and ensemble methods.

Note: in this article, positive samples (positive examples) are the minority class and negative samples (negative examples) are the majority class; under-sampling is also called down-sampling, and over-sampling is also called up-sampling.

3.1 Low-effort methods

Before introducing the three more sophisticated families of methods, here are a few unglamorous but very low-effort approaches.

Actively collecting data

When the minority class has few samples, try to expand that part of the data set as much as possible, or enrich its features to increase the data's diversity (ideally converting the problem into case 1). For example, if the project is sentiment analysis and negative samples (negative sentiment) make up only a small proportion of the data, we can collect more negative samples from the site, or even pay for more; after all, too little data brings many latent problems.

Converting the task to anomaly detection

If the minority class has too few samples, they may fail to represent the minority distribution well, and neither data-level rebalancing nor algorithm-level adjustment is guaranteed to work. If the minority samples are scattered haphazardly in feature space, the situation is even worse. In such cases it may be better to recast the problem as unsupervised anomaly detection and stop worrying about rebalancing the data.

Adjusting the weights

You can simply set class weights in the loss function so that the model penalizes errors on the minority class more heavily and thus pays it more attention. In Python's scikit-learn, the class_weight parameter sets these weights.

The weighting approach also suits situations where different types of errors have different consequences. In medical diagnosis, misdiagnosing a healthy person as sick leads only to the inconvenience of further examination, but misdiagnosing a sick person as healthy may forfeit the best chance to save a life. Similarly, an access-control system that wrongly blocks an authorized person merely degrades the user experience, while one that wrongly admits a stranger causes a serious security incident; in credit-card fraud detection, flagging normal use as fraud may annoy the user, but treating fraud as normal use leaves the user with enormous losses. To weigh the different losses caused by different types of errors, errors can be assigned "unequal costs".
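As a minimal sketch of weight adjustment, assuming scikit-learn is available (the data set here is synthetic and the parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Toy imbalanced data set: roughly 5% positives (the minority class).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# class_weight='balanced' reweights the loss by inverse class frequency,
# so mistakes on the minority class are penalized more heavily.
plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X, y)

print("recall without weights:", recall_score(y, plain.predict(X)))
print("recall with weights:   ", recall_score(y, weighted.predict(X)))
```

Higher minority-class recall typically comes at the cost of some precision; the weights only shift the trade-off.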

Threshold adjustment (threshold moving)

Learn directly from the original training set, but when predicting with the trained classifier, adjust the decision threshold from the default 0.5 to \( \frac{|P|}{|P| + |N|} \). (Negative samples are the majority, so the classifier tends to give low scores.)
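A minimal sketch of threshold moving, again assuming scikit-learn and synthetic data (the names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Move the threshold from the default 0.5 to |P| / (|P| + |N|):
# scores above the positive-class prior now count as positive.
prior = y.mean()          # |P| / (|P| + |N|)
proba = clf.predict_proba(X)[:, 1]
pred_default = (proba >= 0.5).astype(int)
pred_moved = (proba >= prior).astype(int)

print("positives flagged at 0.5:  ", pred_default.sum())
print("positives flagged at prior:", pred_moved.sum())
```

Lowering the threshold to the positive-class prior can only flag more (or equally many) samples as positive, which is exactly the point when the classifier's scores skew low.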

3.2 Data-level methods

Data-level methods (also known as resampling methods) form the earliest-studied, most influential, and most widely used family in imbalanced learning. They modify the training set so that a standard learning algorithm can train effectively on it. By implementation they can be further divided into:

  • Removing samples from the majority class (under-sampling, e.g. ENN, Tomek links, NearMiss)
  • Generating new samples for the minority class (over-sampling, e.g. SMOTE, Borderline-SMOTE, ADASYN)
  • Combining the two (over-sampling followed by denoising under-sampling, e.g. SMOTE + ENN)

Random under-sampling may discard samples that carry important information, while random over-sampling (naively copying minority samples) may cause severe over-fitting, and crudely synthesized minority samples may be meaningless or even harmful. A series of more advanced methods has therefore been developed that use distributional information to preserve the original data structure while resampling.

Note: during resampling, keep the distributions of the training and test samples as close as possible. Violating the i.i.d. assumption is likely to produce poor results.

3.2.1 Under-sampling

Three typical under-sampling methods are described below. The common idea is that samples near the class boundary increase classification difficulty; removing majority-class samples near the boundary widens the margin between classes and eases classification.

Edited Nearest Neighbor (ENN)

For each majority-class sample, if most of its k nearest neighbors belong to a different class, the sample lies on the boundary between classes, so we remove it.

Repeated Edited Nearest Neighbor (RENN)

This method simply repeats the ENN deletion process until no further samples can be deleted.

Tomek Link Removal

If two samples of different classes are each other's nearest neighbor, that is, A's nearest neighbor is B and B's nearest neighbor is A, then (A, B) form a Tomek link. Tomek Link Removal deletes these pairs: for each Tomek link, if one of the two samples belongs to the majority class, that majority sample is removed. This pushes the positive and negative samples further apart, as shown below.
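A toy version of the ENN rule with NumPy and scikit-learn's NearestNeighbors (libraries such as imbalanced-learn provide production-ready implementations of ENN and Tomek link removal; this sketch is illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)

k = 5
# k+1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
neighbor_labels = y[idx[:, 1:]]          # drop the self-neighbor

# ENN rule: drop a majority-class (y == 0) sample if the majority of
# its k nearest neighbors belong to a different class.
disagree = (neighbor_labels != y[:, None]).sum(axis=1)
drop = (y == 0) & (disagree > k / 2)
X_res, y_res = X[~drop], y[~drop]

print("removed", drop.sum(), "boundary majority samples")
```

Note that only majority samples are removed; the minority class is left untouched.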

3.2.2 Over-sampling

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is an improved version of random over-sampling that produces additional minority samples by interpolation. The basic idea: for each minority sample, randomly select one of its k nearest neighbors (which is also a minority sample), then randomly pick a point on the line segment between them as the new synthetic minority sample.
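The interpolation step can be sketched as follows (a toy implementation with NumPy and scikit-learn; the function and variable names are made up for illustration, and real implementations such as imbalanced-learn's SMOTE handle the details):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate n_new synthetic minority samples by SMOTE interpolation."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)        # idx[:, 0] is the point itself
    new = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))           # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]     # one of its k minority neighbors
        lam = rng.random()                     # random point on the segment
        new[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return new

# Each synthetic point lies on a segment between two real minority samples.
X_min = np.random.default_rng(0).normal(size=(30, 2))
synthetic = smote_sample(X_min, k=5, n_new=50, rng=1)
print(synthetic.shape)
```

Because each new point is a convex combination of two real minority samples, all synthetic points stay inside the bounding box of the minority class.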

SMOTE selects the minority samples used for synthesis at random, without regard to the surrounding samples, which easily leads to two problems:

  • If the neighbors of the chosen minority sample are themselves minority samples, the newly synthesized sample provides little useful information, just as points far from the margin have little effect on a support vector machine's decision boundary.
  • If the neighbors of the chosen minority sample are mostly majority samples, the sample may be noise, and the newly synthesized sample will largely overlap with the surrounding majority samples, making classification harder.

Overall, we want the newly synthesized minority samples to lie near the boundary between the two classes, where they usually provide the most information for classification. That is exactly what the Borderline-SMOTE algorithm does.

Borderline SMOTE

Process:

  • First, divide all minority samples into three categories: 1. Noise: all k nearest neighbors belong to the majority class; 2. Danger: more than half of the k nearest neighbors belong to the majority class; 3. Safe: more than half of the k nearest neighbors belong to the minority class. As shown below.
  • The Danger points lie on the border. Use them as seeds and generate new samples with the SMOTE algorithm; if the other endpoint chosen for interpolation belongs to the Safe or Noise set, place the interpolated sample closer to the seed endpoint.
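The Noise/Danger/Safe partition can be sketched as follows (a toy illustration with NumPy and scikit-learn; real implementations such as imbalanced-learn's BorderlineSMOTE take care of the seeding and interpolation steps):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[y == 1])            # neighbors of minority points
maj_count = (y[idx[:, 1:]] == 0).sum(axis=1)  # majority neighbors per point

noise = maj_count == k                  # all k neighbors are majority
danger = (maj_count > k / 2) & ~noise   # more than half, but not all
safe = maj_count <= k / 2               # at most half are majority

print("noise:", noise.sum(), "danger:", danger.sum(), "safe:", safe.sum())
```

Only the Danger points would then be used as SMOTE seeds.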

3.2.3 Combining under-sampling and over-sampling

In practice the combination does not seem to bring much additional improvement; perhaps the data distribution gets diluted too much?

  • SMOTE + Tomek Link Removal
  • SMOTE + ENN

3.2.4 Pros and cons of data-level methods

Advantages:

  • They can remove noise and balance the class distribution: training on a resampled data set improves the performance of some classifiers.
  • Under-sampling shrinks the data set, which can reduce the computational cost of training.

Disadvantages:

  • The sampling process is computationally inefficient: these methods typically extract distributional information from neighbor relations (usually via k-nearest neighbors), which is expensive to compute.
  • Susceptible to noise: nearest-neighbor computations are easily corrupted by noise, so the distributional information may be inaccurate and the resulting resampling strategy unreasonable.
  • Over-sampling generates extra data: it further enlarges the training set, increases the computational overhead, and may lead to over-fitting.
  • Not applicable to complex data sets where distances cannot be computed: industrial data sets often contain categorical features (e.g. user IDs) or missing values, making it hard to define a reasonable distance metric.

3.3 Algorithm-level methods

Algorithm-level methods modify existing standard machine learning algorithms to correct their bias toward the majority class. The most popular branch of this family is cost-sensitive learning.

Cost-sensitive learning extends the idea of weight adjustment. Weight adjustment is typically used for binary classification, but cost-sensitive learning takes the idea further: a cost matrix can be specified for multi-class problems, and standard algorithms can be modified to use that cost matrix so they adapt to imbalanced data. In decision trees, for example, the cost matrix can enter in three places: the decision threshold, the split criterion, and pruning.
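As one concrete illustration of how costs feed into the decision threshold (a standard expected-cost calculation; the cost values here are hypothetical):

```python
import numpy as np

# Hypothetical costs: a false negative (missing a sick patient) is ten
# times as costly as a false positive (flagging a healthy person).
cost_fp, cost_fn = 1.0, 10.0

# Predict positive when the expected cost of doing so is lower:
#   (1 - p) * cost_fp < p * cost_fn
# which rearranges to the threshold p > cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)

p = np.array([0.05, 0.09, 0.10, 0.50, 0.95])  # classifier output probabilities
print("threshold:", threshold)
print("predict positive:", p > threshold)
```

With unequal costs the optimal threshold drops well below 0.5, so the classifier flags positives much more readily.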

Advantages of cost-sensitive learning:

  • It does not increase the training complexity and can be applied directly to multi-class problems.

Disadvantages of cost-sensitive learning:

  • It requires domain knowledge: the cost matrix must be supplied by domain experts based on prior knowledge of the task, which is simply unavailable for many real-world problems. In practice the cost matrix is therefore usually set to the normalized inverse ratio of the class sizes, which does not guarantee optimal classification performance.
  • It is unsuitable for some classifiers: for models trained in mini-batches (such as neural networks), only a few minority samples appear in each batch, so the gradient-descent updates of the non-convex optimization quickly fall into saddle points and the network cannot learn effectively.

3.4 Ensemble methods

Strength in numbers: there is always an ensemble method for you. Two are highlighted here: the EasyEnsemble algorithm and the BalanceCascade algorithm.

EasyEnsemble algorithm

An approach similar to Bagging. From the negative examples N (the majority class), repeatedly draw with replacement a subset N' whose size equals that of the positive examples P (the minority class); train one base classifier on each N' combined with P; finally combine the base classifiers into an ensemble (by arithmetic or weighted mean).
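A rough EasyEnsemble sketch using only NumPy and scikit-learn (imbalanced-learn ships a full EasyEnsembleClassifier; here the base learner is a shallow decision tree rather than the AdaBoost learner used in the original algorithm, purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

# EasyEnsemble: draw several balanced subsets of the majority class,
# train one base classifier per subset, then average their predictions.
models = []
for _ in range(10):
    sub = rng.choice(neg, size=len(pos), replace=True)  # |N'| == |P|
    idx = np.concatenate([pos, sub])
    models.append(DecisionTreeClassifier(max_depth=4, random_state=0)
                  .fit(X[idx], y[idx]))

# Ensemble prediction: arithmetic mean of base-classifier probabilities.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
print("minority recall:", (pred[y == 1] == 1).mean())
```

Each base learner sees a balanced view of the data, while the ensemble as a whole still uses every majority sample across its subsets.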

BalanceCascade algorithm

Based on AdaBoost, with AdaBoost used as the base classifier. In each round, train an AdaBoost base classifier on the positive set P together with a negative subset N' of equal size; then use this classifier to predict the full negative set N, choosing the classification threshold so that the false positive rate (FPR) is f, and delete from N all negatives it classifies correctly; repeat for T iterations. (Note: each deletion keeps a fraction f of the remaining negatives, so after T-1 iterations |N| * f^(T-1) negatives remain, equal to the number of positives |P|, and then the final iteration is run.)

Advantages:

  • Usually effective: there is no problem an ensemble cannot solve; if there is, add one more base learner. In past experience, ensemble learning remains the most effective family of methods for imbalanced learning.
  • Feedback from the iterative process can be used for dynamic adjustment: in each iteration, BalanceCascade discards the majority samples that the current classifier already handles well, embodying the idea of dynamic resampling.

Disadvantages:

  • The weaknesses of the chosen base learner are easily carried into the ensemble;
  • The computational overhead increases further;
  • BalanceCascade is not robust to noise: blindly retaining hard-to-classify samples may cause later iterations to over-fit noise and outliers.

4 How to choose evaluation metrics for imbalanced learning?

Because the classes are imbalanced and we care more about the scarce positive samples:

  • Use metrics that focus on the positive class, such as the PR curve and F1;
  • Precision assumes the classification threshold is 0.5, so if you use precision, adjust the threshold first. By comparison, precision@n is more meaningful.

Notes:

  • Try not to use accuracy; it is meaningless here.
  • Try not to use ROC: the ROC curve's insensitivity to class imbalance is both its strength and its weakness. When the classes are imbalanced and we care more about the minority class, the overly optimistic picture painted by the ROC curve is misleading. With ROC as the metric it is easy to be satisfied by a high AUC value while overlooking the fact that the minority samples are actually classified far from ideally.
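To see the contrast in practice (a sketch with scikit-learn on synthetic data; the exact numbers depend on the data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# ROC-AUC is insensitive to the class ratio and tends to look optimistic;
# average precision (the area under the PR curve) reflects minority-class
# performance much more directly.
print("ROC-AUC:          ", roc_auc_score(y, scores))
print("average precision:", average_precision_score(y, scores))
```

On strongly imbalanced data the ROC-AUC often looks near-perfect while the average precision reveals how much minority-class performance is actually missing.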

5 Suggestions for choosing a method

Because high-dimensional feature spaces are hard to visualize, it is often impossible to inspect the data's characteristics and pick the right method by eye, yet every method has its own applicable range; that is the hard part. If I had to recommend one more general-purpose, easy-to-use approach, it would be this: first try random under-sampling + Bagging.

6 Summary

  • There are five common class-imbalance scenarios, of which case 1 is the best; it only requires a suitable classifier and evaluation metric;
  • Solutions must fit the situation (sometimes no special treatment is needed) and include low-effort methods, data-level methods, algorithm-level methods, and ensemble methods;
  • For evaluation, choose metrics that focus on the positive class, not ROC or accuracy;
  • Random under-sampling + Bagging is a solid general-purpose choice.



Origin www.cnblogs.com/inchbyinch/p/12642760.html