Handling Class Imbalance

An underlying assumption of classification is that each category of data has its own distribution. When a category is so rare that its structure is hard to observe, it can sometimes be better to discard that category, learn a clearer model of the majority class only, and treat samples that do not fit the majority pattern as anomalies / minority-class samples. At that point the problem reduces to anomaly detection.

For general machine learning methods, the most common evaluation metric is classification accuracy (ACC). ACC is very intuitive and works well under normal circumstances, but for an imbalanced classification task it cannot reflect the classifier's real performance. Consider the following situation: a data set contains 10,000 samples of which only 1% belong to the minority class; a classifier that labels every sample as the majority class achieves 99% accuracy. That looks like a high score, yet giving such a high score to a classifier that cannot distinguish a single minority sample is clearly unreasonable. Because of this property of ACC and similar metrics, skewed data often pushes the classifier's output toward the majority class: predicting the majority class most of the time yields high accuracy, while performance on the minority class, which is what we actually care about, remains poor.
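To make this concrete, here is a quick sketch of the 99%-accuracy trap, using scikit-learn's DummyClassifier as the "always predict the majority class" model on a synthetic data set (purely illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data set: 10,000 samples, only 1% of which are minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = np.zeros(10_000, dtype=int)
y[:100] = 1  # 100 minority samples

# "Classifier" that labels every sample as the majority class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # 0.99 -- looks excellent
print(recall_score(y, y_pred))    # 0.0  -- every minority sample is missed
```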

There are more reasonable evaluation criteria for class-imbalanced problems. These metrics are usually computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the binary case, all four counts can be read off the confusion matrix.

Based on the confusion matrix, we can use precision and recall to evaluate a model's classification performance on imbalanced data. The F1 score is the harmonic mean of precision and recall, and the G-mean (GM) is the geometric mean of the recall on each class (sensitivity and specificity) [4,5]. MCC [6] (Matthews correlation coefficient) takes all four counts into account and can be used for evaluation whether the classes are balanced or not. AUCPRC [7] (Area Under the Precision-Recall Curve) is the area under the precision-recall curve. These criteria are not dominated by the class sizes, are therefore often considered "unbiased", and can be used under class imbalance. Note that the common metric AUCROC (Area Under the Receiver Operating Characteristic curve) is actually biased and is not suitable for model evaluation in imbalanced settings.
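As a rough sketch (assuming scikit-learn, and approximating AUCPRC by average precision), these metrics can be computed from the confusion matrix like this:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

def imbalance_report(y_true, y_pred, y_score):
    """y_score is the predicted probability of the positive (minority) class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "g_mean": np.sqrt(sensitivity * specificity),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "aucprc": average_precision_score(y_true, y_score),
    }
```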

Typically, the higher the degree of class imbalance, the harder the data set is to classify. However, some work has found highly imbalanced data sets on which standard learning models (e.g., SVM, decision trees) still achieve good classification results without any modification. Clearly, class imbalance itself is not the source of the difficulty; the real reasons require a closer look at the data distribution and at how the model behaves during training.

Some studies try to explain what makes imbalanced data sets intrinsically hard to classify; they attribute the difficulty to certain essential properties of the data distribution, such as:

  • Too many minority samples lying inside dense regions of the majority class [8]

  • Overlapping class distributions (i.e., samples of different classes appearing densely in the same regions of feature space) [8]

  • Noise inherent in the data, especially noise within the minority class [9]

  • Sparsity of the minority-class distribution, and the splitting of the minority class into multiple sub-concepts (sub-clusters) that this sparsity causes, each containing only a few samples [10]

Standard machine learning algorithms assume that the numbers of samples in the different classes are roughly comparable. The skewed class distribution therefore creates difficulties for applying standard learning algorithms to imbalanced data sets: the implicit goal behind their design is to optimize accuracy over the whole data set, which leads them to favor the majority class, where most of the samples are. Most imbalance learning algorithms aim to correct exactly this "preference for the majority class":

  • Data-level approach

Data-level methods are the earliest, most influential, and most widely used family of techniques in imbalance learning, also known as resampling methods. They focus on modifying the training set so that standard learning algorithms can train on it effectively. Depending on how they modify the data, data-level methods can be further divided into:

1. Methods that remove majority-class samples (undersampling, e.g., RUS, NearMiss [11], ENN [12], Tomek links [13], etc.)
2. Methods that generate new minority-class samples (oversampling, e.g., SMOTE [14], ADASYN [15], Borderline-SMOTE [16], etc.)
3. Hybrid methods that combine the two (oversampling + de-noising undersampling, e.g., SMOTE+ENN [17], etc.); a short usage sketch follows the list.
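A minimal usage sketch of the three families, assuming the imbalanced-learn package (any library with these samplers would do):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Synthetic 95% / 5% binary data set.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
print("original:", Counter(y))

# 1. Undersampling: randomly drop majority-class samples (RUS).
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)

# 2. Oversampling: synthesize minority samples along k-NN line segments (SMOTE).
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

# 3. Hybrid: SMOTE followed by Edited Nearest Neighbours de-noising.
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)

print("RUS:", Counter(y_rus), "SMOTE:", Counter(y_sm), "SMOTE+ENN:", Counter(y_se))
```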

The basic random resampling methods pick the samples to be processed at random. However, random selection may discard samples that carry important information (random undersampling) or introduce meaningless or even harmful new samples (random oversampling). A series of more advanced methods therefore try to resample based on the distribution of the data while preserving the structure of the original data.

Strength: 
1. Can remove noise / balance the class distribution: training on the resampled data set can improve the classification performance of some classifiers.
2. Undersampling shrinks the data set: undersampling removes part of the majority-class samples, which may reduce the computational cost of model training.

Weakness: 
1. Computationally inefficient resampling: the "advanced" resampling methods usually extract distribution information from the distances between samples (typically via k-nearest neighbors). The drawback is that the distance between every pair of samples must be computed, and the number of distances grows quadratically with the size of the data set, so on large data sets these methods can become extremely slow (see the sketch after this list).
2. Susceptible to noise: in industrial data sets that are both highly imbalanced and noisy, the samples may not represent the structure of the minority-class distribution well. The nearest-neighbor procedures that these resampling methods rely on are easily misled by noise, so the extracted distribution information may be inaccurate and lead to an unreasonable resampling strategy.
3. Oversampling generates too much data: when applied to large-scale, highly imbalanced data sets, oversampling may have to synthesize a huge number of minority samples to balance the classes. This further enlarges the training set, increases the computational overhead, slows down training, and may cause overfitting.
4. Not applicable to data sets without a computable distance: most importantly, these resampling methods rely on a well-defined distance measure, which makes them unusable on some data sets. In practice, industrial data sets often contain categorical features (features that do not live in a continuous space, such as a user ID) or missing values, and the ranges of different features can differ enormously. Defining a reasonable distance metric on such data sets is very difficult.
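A back-of-the-envelope illustration of the quadratic cost in weakness 1, assuming a dense pairwise distance computation in float64:

```python
# Number of pairwise distances and the memory a dense float64 distance matrix would need.
for n in (10_000, 100_000, 1_000_000):
    pairs = n * (n - 1) // 2
    print(f"n={n:>9,}  distance pairs={pairs:.2e}  dense matrix ~ {8 * n * n / 1e9:,.1f} GB")
```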

 

  • Algorithm-level approach

Algorithm-level methods focus on modifying existing standard machine learning algorithms to correct their preference for the majority class. The most popular branch in this family is cost-sensitive learning [18,19], and it is the only one we discuss here. Cost-sensitive learning assigns a higher misclassification cost to minority-class samples and a lower one to majority-class samples. In this way it artificially raises the importance of the minority class during training, thereby reducing the classifier's bias toward the majority class.
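In practice, cost-sensitive learning is most often accessed through per-class weights; a minimal sketch assuming scikit-learn's class_weight parameter as the cost mechanism:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Explicit costs: misclassifying the minority class (label 1) costs 10x more.
clf_manual = LogisticRegression(class_weight={0: 1.0, 1: 10.0}).fit(X, y)

# Heuristic costs: weights inversely proportional to class frequencies,
# i.e. the "normalized class-size ratio" fallback mentioned in the weaknesses below.
clf_auto = SVC(class_weight="balanced").fit(X, y)
```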

Strength: 
1. No increase in training complexity: such modified algorithms usually perform better and do not increase the computational complexity of training.
2. Directly usable for multi-class problems: since these algorithms only change the misclassification costs, they extend directly to multi-class classification.

Weakness: 
1. Requires domain prior knowledge: note that the cost matrix used in cost-sensitive learning must be provided by domain experts based on prior knowledge of the task, which is simply unavailable in many real-world problems. In practice, the cost matrix is therefore usually set directly to the normalized inverse ratio of the class sizes. Without guidance from domain knowledge, a cost matrix set this way cannot guarantee optimal classification performance.
2. Does not generalize across tasks: a cost matrix designed for one particular problem can only be used for that task; there is no guarantee of good performance when it is reused elsewhere.
3. Depends on the specific classifier: for models trained in mini-batches, such as neural networks, minority samples appear in only a few mini-batches while most batches contain only majority samples. This is disastrous for neural network training: the non-convex optimization driven by gradient descent quickly falls into a local minimum / saddle point (zero gradient), and the network cannot learn effectively. Cost-sensitive learning through sample weighting does not solve this problem (a rough illustration follows this list).
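A rough illustration of the mini-batch problem in weakness 3, with a hypothetical 0.1% minority rate and batch size 128:

```python
minority_rate = 0.001   # hypothetical: 0.1% of samples are minority class
batch_size = 128

# Probability that a randomly drawn mini-batch contains zero minority samples.
p_no_minority = (1 - minority_rate) ** batch_size
print(f"P(mini-batch has no minority sample) ~= {p_no_minority:.2f}")  # ~0.88
```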

  • Ensemble learning

Ensemble methods focus on combining a data-level or algorithm-level method with ensemble learning to obtain a powerful ensemble classifier. Because of their good performance on class-imbalanced tasks, ensemble methods have become more and more popular in practical applications. Most of them embed another imbalance learning method (e.g., SMOTE [14]) into a specific ensemble learning algorithm (e.g., Adaptive Boosting [20]).

e.g., SMOTE+Boosting=SMOTEBoost [21]; SMOTE+Bagging=SMOTEBagging [22];

Some other methods use ensemble learners themselves as the base learners of another ensemble (e.g., EasyEnsemble, BalanceCascade [23]), so the final classifier is an "ensemble of ensembles".
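Several of these combinations are packaged in imbalanced-learn (an assumed library choice; the class names below follow its API):

```python
from sklearn.datasets import make_classification
from imblearn.ensemble import (BalancedBaggingClassifier, EasyEnsembleClassifier,
                               RUSBoostClassifier)

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# EasyEnsemble: each base AdaBoost learner is trained on a random balanced subset.
easy = EasyEnsembleClassifier(n_estimators=10, random_state=42).fit(X, y)

# Bagging with random undersampling inside every bootstrap sample.
bag = BalancedBaggingClassifier(n_estimators=10, random_state=42).fit(X, y)

# Boosting with random undersampling at each boosting round.
rusboost = RUSBoostClassifier(n_estimators=10, random_state=42).fit(X, y)
```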

"Integrated Integration" does not mean that there will be better performance, as an integrated learning method based learners also affect the classification performance. The above two methods using as AdaBoost-based classifier, Boosting class method itself is sensitive to noise, plus BalanceCascade itself has the same problem, the use of non-integrated classifier may effect but better (e.g., directly C4.5).
PS, using these two methods do AdaBoost-based learning is a high probability of reason to rub hot (around 2010).

Strength: 
1. Usually the most effective: there is no problem an ensemble cannot solve, and if there is, add one more base learner. In my experience, (suitably modified) ensemble learning is still the most effective way to tackle imbalance learning problems.
2. The iterative process can use feedback for dynamic adjustment: a few ensemble methods embody the idea of dynamic resampling. For example, in each iteration BalanceCascade discards the majority-class samples that the current classifier already classifies correctly (on the assumption that these samples no longer carry information useful to the model). In practice this gives it a faster convergence rate than other undersampling methods, so it can reach better classification performance with relatively fewer base classifiers (a simplified sketch follows).
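For intuition, here is a much-simplified sketch of the BalanceCascade idea (a loose simplification, not the exact algorithm of [23], which trains an AdaBoost ensemble per round and adjusts its decision threshold):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balance_cascade_sketch(X, y, n_rounds=5, seed=0):
    """Minority class is assumed to be labeled 1, majority 0."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    learners = []
    for _ in range(n_rounds):
        # Train on the minority class plus a random balanced majority subset.
        sub = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sub])
        clf = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
        learners.append(clf)
        # Discard majority samples the current learner already classifies correctly;
        # only the "hard" majority samples survive into the next round.
        pred = clf.predict(X[majority])
        majority = majority[pred != 0]
        if len(majority) < len(minority):
            break
    return learners  # final prediction would combine the learners, e.g. by voting
```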

Weakness: 
1. Inherits the drawbacks of the imbalance learning method it uses: since most imbalanced ensemble learning methods still embed a standard data-level / algorithm-level method in their pipeline, the drawbacks of those two families described above are also present in the ensemble methods that use them.
2. Oversampling + ensembling further increases the computational overhead: even though ensembling improves the performance of most methods on practical tasks, the data-level / algorithm-level components keep their drawbacks of low computational efficiency, limited applicability, and sensitivity to noise. For example, SMOTE already introduces extra training samples when used alone; using SMOTE to generate even more data for training even more classifiers makes the whole training process much slower.
3. Not robust to noise: BalanceCascade is a very meaningful exploration, but its strategy of blindly retaining hard-to-classify samples can cause it to overfit noise / outliers in later iterations and ultimately degrade the ensemble classifier. In other words, it is not robust to noise.

  • Undersampling: generally used to balance the data set and remove noise. Random undersampling / NearMiss balance the data set and are fast to sample from and train on. Random undersampling can be used in almost any setting, but on highly imbalanced data sets it inevitably discards most of the majority-class samples and loses information. NearMiss is extremely sensitive to noise and basically breaks down in its presence. There are also many de-noising methods, such as Tomek links and AllKNN, which require a well-defined distance metric and a large amount of computation on big data sets. De-noising helps some classifiers and does nothing for others (see the sketch after this list).

  • Oversampling: do not use random oversampling under any circumstances; it easily leads to overfitting. SMOTE and ADASYN are worth trying on small data sets. On large-scale, highly imbalanced data they generate a huge number of synthetic samples and need a lot of extra computation. Moreover, such methods synthesize samples from the structure of the minority class; when the minority samples represent that class poorly, the effect can even be negative, i.e. worse than training directly without oversampling.

  • Hybrid sampling: in theory, adding a de-noising undersampler (ENN and the like) cleans the data set after oversampling. In actual use I did not feel much difference; the only difference was that everything got even slower after adding the de-noising step.

  • Cost-sensitive: usable when the imbalance of the data set is not severe, and training efficiency is no different from training on the raw data. The drawback is that you generally have to act as the "domain expert" yourself and set the cost matrix (more hyperparameters to tune), and a badly set matrix usually will not give the desired results. In addition, when the imbalance is severe, since the data set itself is not modified, neural network training will collapse: many consecutive mini-batches contain only majority-class samples, and training gets trapped in a local optimum / saddle point.

  • Ensembles: random undersampling + ensembling needs more base learners to work well at high imbalance ratios. Note that Boosting is easily affected by noise, while Bagging is a true all-rounder: increasing the number of base learners generally does not hurt the results. Advanced undersampling + ensembling is worth a try, runs slower, and is not guaranteed to beat the random version. Advanced oversampling + ensembling: as above, on large-scale, highly imbalanced data the number of training samples explodes, and the ensemble then also needs a fair number of base learners. BalanceCascade uses the information efficiently and achieves good results with only a few base learners, but it is not robust to noise.
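A short sketch of the de-noising undersamplers mentioned in the undersampling bullet above (Tomek links, AllKNN), again assuming the imbalanced-learn package:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import AllKNN, TomekLinks

# Synthetic noisy, imbalanced data (flip_y injects label noise).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], flip_y=0.05, random_state=0)

X_tl, y_tl = TomekLinks().fit_resample(X, y)  # remove majority samples that form Tomek links
X_ak, y_ak = AllKNN().fit_resample(X, y)      # repeated edited-nearest-neighbour cleaning
print(Counter(y), Counter(y_tl), Counter(y_ak))
```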
