Deep Understanding of Machine Learning - Imbalanced Learning: Overview of Common Techniques

Since researchers began to pay attention to the class imbalance problem in the late 1990s, a variety of learning techniques have been developed to address it. The main ones are the following.

Sample Sampling Techniques

Sample sampling, also known as the data-level approach, obtains a relatively balanced training set, as the name implies, by either adding minority-class samples or removing majority-class samples, thereby alleviating the class imbalance problem. Adding minority-class samples is called oversampling, and removing majority-class samples is called undersampling (or downsampling). Taking the imbalanced dataset from the previous article as an example, the figure below shows the sample distribution of the dataset after undersampling and after oversampling, respectively.

Figure: downsampling and oversampling

Random Over-Sampling (ROS) and Random Under-Sampling (RUS) are the simplest and most commonly used sampling techniques, but each has its own shortcomings: the former increases the space and time cost of classifier training and easily causes the classifier to overfit, while the latter discards a large amount of classification information, which can lead to a significant drop in classification performance.
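
As a concrete illustration, here is a minimal NumPy sketch of ROS and RUS (the array names `X` and `y` and the convention that label 1 marks the minority class are assumptions made for this example, not part of any particular paper):

```python
import numpy as np

def random_oversample(X, y, minority_label=1, rng=None):
    # Duplicate randomly chosen minority samples until both classes have equal size.
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label=1, rng=None):
    # Keep all minority samples plus an equally sized random subset of majority samples.
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([min_idx, kept_maj])
    return X[keep], y[keep]
```

The imbalanced-learn library offers ready-made `RandomOverSampler` and `RandomUnderSampler` classes that implement the same idea.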

To overcome the shortcomings of random sampling, Chawla et al. proposed a new oversampling method in 2002: the Synthetic Minority Oversampling Technique (SMOTE). Unlike random oversampling, SMOTE alleviates the overfitting problem of ROS by generating synthetic samples along the line segments between neighboring minority-class samples. Han et al. observed that most misclassified samples lie near the boundary between the two classes, so they improved SMOTE and proposed two algorithms, BSO1 and BSO2 (Borderline-SMOTE1 and Borderline-SMOTE2). BSO1 applies SMOTE only to the minority-class samples close to the boundary, while BSO2 applies it to both the minority-class and majority-class samples near the boundary. A related method is One-Sided Selection (OSS), which is close in spirit to BSO2: it removes noisy, redundant, and borderline samples from the majority class so that the decision region shrinks effectively. Another well-known sampling method is ADASYN, which automatically decides how many synthetic samples to generate according to the local distribution density of the samples. The SBC method proposed by Yen and Lee draws on the idea of clustering and automatically decides how many majority-class samples to remove from each cluster.
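
The core of SMOTE is a simple interpolation step. Below is a minimal sketch of that step only (not the authors' reference implementation; `X_min` holds the minority-class samples, and the number of neighbours `k` and the number of synthetic points `n_new` are assumed parameters):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, rng=None):
    # Generate n_new synthetic minority samples by interpolating between each
    # chosen minority sample and one of its k nearest minority-class neighbours.
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbour
    _, neighbours = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                     # pick a random minority sample
        j = neighbours[i][rng.integers(1, k + 1)]        # pick one of its k true neighbours
        gap = rng.random()                               # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Because the synthetic points fall inside the region already occupied by the minority class, SMOTE avoids the exact-duplication overfitting of ROS.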

In addition, more advanced sampling algorithms have been proposed in recent years. For example, the ACOSampling algorithm proposed by Yu et al. uses ant colony optimization search to assess the information content and importance of majority-class samples, so that the most informative majority-class samples are retained as far as possible during undersampling. Zhang and Li proposed an oversampling method based on random walks, whose advantage is that the newly generated training set still follows the distribution of the original training set in every feature dimension. From the perspective of preserving the original probability density of the minority class, Das et al. proposed two oversampling methods, RACOG and wRACOG, which achieve better performance than traditional oversampling methods.

It can be said that sample sampling is an effective class-imbalance learning technique. Its biggest advantage is that the sampling process and the training of the classifier are independent of each other, which makes it highly versatile.

Cost-Sensitive Learning Techniques

Cost-sensitive learning is another commonly used technique for solving class imbalance problems. Unlike sample sampling, which directly changes the sample distribution by adding or deleting samples, cost-sensitive learning changes the training principle of the classifier: instead of minimizing the training error, it takes minimizing the overall misclassification cost as the training goal. That is, during training, errors on minority-class samples receive a large penalty, while errors on majority-class samples receive a relatively small one. The specific penalty coefficients are given in the form of a cost matrix. Again taking the imbalanced dataset from the previous article as an example, the figure below shows the corresponding cost-sensitive weighting. It is easy to see that the weights of the minority-class samples are enlarged proportionally, so it can be expected that some minority-class samples in the region where the two classes overlap will now be classified correctly.

Figure: cost-sensitive learning

The essence of cost-sensitive learning is to integrate the cost matrix into a traditional classifier model in order to correct the classification surface, so it is a classifier-level method. Within the cost-sensitive learning family there are various ways to perform this fusion. For example, a naive Bayes classifier multiplies the original posterior probabilities by the cost coefficients to obtain modified posteriors, thereby changing its decisions; decision tree algorithms usually take cost into account throughout training, including attribute selection and pruning; support vector machines and extreme learning machines weight their penalty factors by cost, so that the trained classifier adapts directly to the imbalanced sample distribution (a minimal sketch of this cost-matrix fusion is given after the list below). Traditional cost-sensitive learning depends heavily on the cost matrix, which easily causes a serious problem: **all samples of the same class receive the same misclassification cost, and their positions in the feature space are ignored, so there is still considerable room for improving classification performance.** There are two main ways to address this problem:

  • Mine and quantify the prior distribution information of the samples, and then design a corresponding fuzzy cost-weighting matrix.
  • Combine the classifier with a Boosting ensemble learning model, and improve the generalization performance of the classifier by continuously adjusting the cost weight of each sample.
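
As a concrete illustration of the cost-matrix fusion described above, here is a minimal sketch in the spirit of the naive Bayes case (the cost values and the assumption that label 1 is the minority class are illustrative only; any classifier exposing `predict_proba` could be substituted): instead of predicting the most probable class, the classifier predicts the class with the lowest expected misclassification cost.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative cost matrix C[true, predicted]: missing a minority (class 1) sample is
# assumed to cost ten times as much as a false alarm on the majority class (class 0).
C = np.array([[0.0, 1.0],
              [10.0, 0.0]])

def cost_sensitive_predict(model, X, cost=C):
    # Predict the class that minimizes the expected misclassification cost.
    posteriors = model.predict_proba(X)      # P(class | x), shape (n_samples, 2)
    expected_cost = posteriors @ cost        # column j: expected cost of predicting class j
    return expected_cost.argmin(axis=1)

# Usage sketch:
#   clf = GaussianNB().fit(X_train, y_train)
#   y_pred = cost_sensitive_predict(clf, X_test)
```

The SVM-style weighting of penalty factors mentioned above can be approximated in scikit-learn by passing a class-weight dictionary to the classifier, e.g. `SVC(class_weight={0: 1, 1: 10})`.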

Compared with sample sampling techniques, cost-sensitive learning techniques may be more complex in their construction mechanism, but they are more flexible, and their performance is often better on many specific classification tasks.

Decision Output Compensation Techniques

Decision output compensation can also be regarded as a classifier-level method. It corrects the originally biased classification surface by applying a compensation directly to the final decision output, that is, by translating the classification surface. The figure below illustrates the basic principle of this kind of technique, where the distance between the original classification surface and the revised classification surface is the compensation value applied to the decision output.

Figure: decision output compensation

Zhou et al. were the first to explore the feasibility of this kind of technique. They restricted the output of a backpropagation neural network to [0, 1], normalized the output after training, and then multiplied the normalized output by different thresholds to translate the classification surface. The disadvantage of this method is that the threshold is set empirically, so the position of the translated surface is not guaranteed to be optimal. Lin and Chen proposed a decision output compensation algorithm based on support vector machines, in which the compensation value is jointly determined by the sizes of the majority and minority classes; its disadvantage is that the compensation value is again empirical. Yu et al. improved on this by taking the prior distribution information of the training samples into account and comparing candidate positions of the classification surface, so that the optimal compensation value can be determined adaptively and classification performance maximized. Building on this work, Yu et al. further studied decision output compensation for extreme learning machines and proposed the ODOC-ELM algorithm, which uses golden-section search and particle swarm optimization to find the optimal compensation value for binary and multi-class imbalanced problems, respectively, and achieves better performance.
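
The simplest version of this idea is threshold moving on a probabilistic output. The sketch below only illustrates the mechanism (the fixed compensation value of 0.2 is arbitrary; the methods above choose it empirically or by optimization search rather than by hand):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compensated_predict(model, X, compensation=0.2, minority_label=1):
    # Shift the decision output toward the minority class by a fixed compensation
    # value, which corresponds to translating the classification surface.
    score = model.predict_proba(X)[:, minority_label]  # original output in [0, 1]
    return (score + compensation >= 0.5).astype(int)   # compensated decision

# Usage sketch:
#   clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
#   y_pred = compensated_predict(clf, X_test, compensation=0.2)
```

In the spirit of the adaptive methods above, the compensation value would instead be found by searching over candidate values on a validation set and keeping the one that maximizes an imbalance-aware metric.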

Decision output compensation is also an effective class-imbalance learning technique. Its advantage is that determining the compensation value is independent of training the classifier; the difficulty is that the optimal compensation value is not easy to find, and even when it can be found, it only translates the classification surface in parallel and cannot change its orientation, so the achievable performance improvement is limited.

Ensemble Learning Techniques

As is well known, ensemble learning has been one of the research hotspots in machine learning in recent years. It can effectively overcome the performance limitations of a single learning algorithm and greatly improve the generalization performance of the learner. Recently it has also been deeply integrated with class-imbalance learning, and a large number of efficient algorithms have been proposed. The most classic ensemble imbalanced-learning algorithm is Asymmetric Bagging (asBagging), which combines random undersampling (RUS) with the Bagging ensemble model and effectively alleviates the unstable classification performance caused by RUS accidentally deleting highly informative samples. Sun et al. proposed an ensemble learning algorithm similar to asBagging: it first randomly partitions the majority-class samples into multiple disjoint subsets, each roughly equal in size to the minority class, then builds multiple balanced training subsets and combines the resulting classifiers; notably, their experiments compared five different ensemble decision rules. Yu et al. combined asBagging with the feature subspace method to classify high-dimensional imbalanced bioinformatics data and achieved good results.
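
A rough sketch of the asBagging idea follows (a simplified reading rather than the authors' exact implementation; decision trees as base learners, ten bags, and 0/1 labels with 1 as the minority class are assumptions): each base classifier sees all minority samples plus an independently drawn, equally sized random subset of the majority class, and the ensemble predicts by majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def as_bagging_fit(X, y, n_estimators=10, minority_label=1, rng=None):
    # Train one base classifier per balanced bag: all minority samples plus a
    # freshly drawn random majority subset of the same size.
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    ensemble = []
    for _ in range(n_estimators):
        bag = np.concatenate([min_idx,
                              rng.choice(maj_idx, size=len(min_idx), replace=False)])
        ensemble.append(DecisionTreeClassifier().fit(X[bag], y[bag]))
    return ensemble

def as_bagging_predict(ensemble, X):
    # Majority vote over the base classifiers (assumes 0/1 labels).
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```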

As an important member of the ensemble learning family, the Boosting model has also been used to solve the class imbalance problem. Combining SMOTE with Boosting, Chawla et al. proposed SMOTEBoost, which follows the sample-weighting idea of the traditional Boosting algorithm but first oversamples the original training set with SMOTE before reweighting. Seiffert et al. combined Boosting with random undersampling and proposed RUSBoost, which was found to perform better than SMOTEBoost. Combining random undersampling, Bagging, and AdaBoost, Liu et al. proposed two ensemble learning methods, EasyEnsemble and BalanceCascade, which have the advantages of low time complexity and high data utilization.
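
A rough sketch of the RUSBoost idea is given below (a simplified AdaBoost-style variant for illustration only; the published algorithm is built on a different Boosting formulation, and decision stumps, 0/1 labels, and twenty rounds are assumptions here). Boosting weights are maintained over the full training set, but each round's weak learner is fitted on a balanced random undersample.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rusboost_sketch(X, y, n_rounds=20, minority_label=1, rng=None):
    rng = np.random.default_rng(rng)
    n = len(y)
    w = np.full(n, 1.0 / n)                          # boosting weights over the full set
    y_pm = np.where(y == minority_label, 1, -1)      # labels recoded as +/-1
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Balanced random undersample for this round only; weights stay global.
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, size=len(min_idx), replace=False)])
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X[sub], y[sub], sample_weight=w[sub])
        pred_pm = np.where(stump.predict(X) == minority_label, 1, -1)
        err = np.clip(w[pred_pm != y_pm].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # AdaBoost-style learner weight
        w *= np.exp(-alpha * y_pm * pred_pm)         # misclassified samples get heavier
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    def predict(X_new):
        score = sum(a * np.where(m.predict(X_new) == minority_label, 1, -1)
                    for m, a in zip(learners, alphas))
        return np.where(score >= 0, minority_label, 1 - minority_label)
    return predict
```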

Random forests can also be used to solve class imbalance problems, either by constructing balanced random forests through sample sampling or by constructing weighted random forests through cost-sensitive learning. In addition, it is well known that for ensemble learning to work at its best, the following two conditions must be met:

  • Individual classifiers should be as accurate as possible
  • The difference between individual classifiers should be as large as possible

To this end, Diez-Pastor et al. especially emphasized the importance of maintaining diversity among individual classifiers in their work. Compared with a single technique, ensemble learning usually offers higher classification accuracy and stronger generalization, and it has shown clear advantages in solving class imbalance problems. It is undeniable, however, that some ensemble learning algorithms still suffer from excessive time complexity, so in practical applications one should judge whether to adopt this technology according to the specific situation.

Active Learning Techniques

Like ensemble learning, active learning is one of the important branches of machine learning. Its core idea is as follows: first, human domain experts manually label the class labels of some samples to form an initial training set and train a classifier; then, according to some sample selection strategy, a small number of unlabeled samples that are highly informative or highly uncertain are selected and submitted to the experts for annotation, expanding the training set; finally, the classifier is retrained on the expanded training set. This process is repeated until a preset stopping condition is met. It is easy to see that the advantage of active learning is that it can effectively reduce the number of training samples without sacrificing classification performance, thereby reducing the time, money, and labor spent on labeling.
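
A minimal sketch of this pool-based loop is shown below (uncertainty sampling with logistic regression is just one possible selection strategy; the batch size and number of rounds are arbitrary, and the labels that would normally come from a human expert are read from `y_pool` to keep the example self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_sketch(X_labeled, y_labeled, X_pool, y_pool,
                           n_rounds=10, batch_size=5):
    # Repeatedly query the most uncertain pool samples, label them, and retrain.
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_pool)
        uncertainty = 1.0 - proba.max(axis=1)          # low top-class probability = uncertain
        query = np.argsort(uncertainty)[-batch_size:]  # the most informative samples
        # In real active learning these labels are supplied by a human expert.
        X_labeled = np.vstack([X_labeled, X_pool[query]])
        y_labeled = np.concatenate([y_labeled, y_pool[query]])
        X_pool = np.delete(X_pool, query, axis=0)
        y_pool = np.delete(y_pool, query)
    return model.fit(X_labeled, y_labeled)
```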

Ertekin et al. found that active learning can effectively alleviate the negative impact of an imbalanced class distribution on classifiers. Consider a classification task with a high imbalance ratio in which each class roughly follows a Gaussian distribution: in the boundary region where the two classes overlap, the imbalance ratio is often much lower than that of the whole training set. For an active learning algorithm, the samples in the boundary region usually carry the most information, so they have a high probability of being selected and labeled during active learning. As a result, the training set selected by active learning usually does not have a high imbalance ratio, and its harm to classifier performance is greatly reduced. In their work, Ertekin et al. used a support vector machine as the classifier and, in each round of active learning, selected the samples closest to the classification hyperplane for labeling, until no unlabeled samples remained within the margin of the support vectors. In addition, to speed up the learning process, they proposed the "59 sampling" principle to achieve a trade-off between space-time complexity and classification performance.

Since active learning can be used to solve the class imbalance problem, does this mean that an imbalanced class distribution has no negative impact on active learning itself? Obviously not. Existing work has shown that the performance of active learning algorithms still suffers when the imbalance ratio of the training set is high. Therefore, certain strategies, such as sample sampling or cost weighting, need to be adopted during active learning to keep the learning process fair.

One-Class Classification Techniques

Different from traditional classification techniques, one-class classification trains the classifier using samples from only one class. It is usually applied in extreme scenarios where the training set contains only normal samples and abnormal samples are unavailable. The technique has also been used to solve extremely imbalanced classification problems, in which traditional class-imbalance learning methods often fail to achieve good results. The most commonly used one-class classifiers include methods based on Gaussian probability density estimation, Parzen windows, autoencoders, clustering, and K-nearest neighbors, as well as one-class support vector machines, support vector data description (SVDD), and one-class extreme learning machines. Each of these methods learns a coverage model that describes the distribution of the normal samples and distinguishes them from abnormal ones. The figure below shows a schematic diagram of a one-class classifier: all training samples belong to the same class, and the one-class classifier must find a coverage model that separates this class from abnormal samples that never appear during training. In particular, to ensure the generalization performance of the classifier, a certain proportion of outlying samples may be allowed to be misclassified.
Figure: one-class classification
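
A minimal sketch using the one-class support vector machine from scikit-learn is shown below (the synthetic data, the RBF kernel, and `nu=0.05`, which allows roughly 5% of training samples to fall outside the learned boundary, are illustrative choices echoing the point about tolerating a few outliers):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# X_normal stands for a training set containing only normal-class samples.
X_normal = np.random.default_rng(0).normal(size=(500, 2))

# nu bounds the fraction of training samples allowed outside the coverage model.
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

X_test = np.array([[0.1, -0.2],    # close to the normal cloud
                   [6.0, 6.0]])    # far from it: likely abnormal
print(occ.predict(X_test))         # +1 = inside the coverage model, -1 = outside
```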

Source: blog.csdn.net/hy592070616/article/details/124235223