Deep Understanding of Machine Learning - Imbalanced Learning: The Basics

Traditional classification techniques share a serious weakness: when they are trained on data whose classes are very unevenly distributed (for example, building a disease diagnosis model from the physical examination records of 99 healthy people and 1 patient, or a network intrusion detection model from 99,990 normal packets and 10 virus packets), they tend to be biased toward the larger class. The resulting classifier falls short of the desired performance and, in severe cases, may fail completely.

In machine learning and data mining, this is known as the class imbalance problem, and the algorithms devised to address it are collectively referred to as class imbalance learning algorithms. Since the late 1990s, class imbalance learning has been one of the hot and difficult research topics in the field, and many mainstream conferences and journals have devoted special issues or workshops to it, including AAAI'00, ICML, ACM SIGKDD Explorations Newsletter, and PAKDD. At the ICDM conference, the class imbalance problem was listed as one of the top ten problems to be solved in data mining. Research enthusiasm in both academia and industry has not subsided and, with the rise of big data, is gradually heating up again.

Over the past ten years, the number of papers published on class imbalance has kept rising year by year, and since 2012 more than 120 papers have appeared each year. In fact, given the limited choice of keywords and literature databases used for the count, these figures substantially underestimate the real volume. Class imbalance learning has thus gradually developed into an important branch of machine learning and data mining. The remaining parts of the "In-depth Understanding of Machine Learning - Class Imbalance Learning" series will briefly introduce the basic concepts of the class imbalance problem, the commonly used class imbalance learning techniques, and the application areas where they apply, so that readers can gain a preliminary understanding of class imbalance learning and lay a foundation for practical machine learning projects.

Class imbalance refers to the situation in which the numbers of training samples of different classes in a classification task differ greatly. Without loss of generality, assume the training set contains only two classes, i.e. the task is a binary classification problem, and, for the sake of visualization, that each sample has exactly two features. The figure below compares a balanced sample set with an imbalanced one. In the balanced set, each of the two classes has 500 samples: class 1 is uniformly distributed over the interval [0, 0.7] of feature 1 and [0, 1] of feature 2, while class 2 is uniformly distributed over [0.5, 1] of feature 1 and [0, 1] of feature 2. The imbalanced set also contains 1000 samples in total, but class 1 is assigned 900 of them while class 2 has only 100, with each class following exactly the same distribution as in the balanced set.
(Figure: (a) balanced sample set vs. (b) imbalanced sample set)
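A minimal sketch of how such synthetic sets could be generated (assuming NumPy is available; the function name, seed, and label encoding are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the two sets are reproducible

def make_set(n_class1, n_class2):
    """Class 1: feature 1 ~ U[0, 0.7]; class 2: feature 1 ~ U[0.5, 1];
    feature 2 ~ U[0, 1] for both classes, as described above."""
    class1 = np.column_stack([rng.uniform(0.0, 0.7, n_class1),
                              rng.uniform(0.0, 1.0, n_class1)])
    class2 = np.column_stack([rng.uniform(0.5, 1.0, n_class2),
                              rng.uniform(0.0, 1.0, n_class2)])
    X = np.vstack([class1, class2])
    y = np.concatenate([np.zeros(n_class1, dtype=int),   # label 0 = class 1
                        np.ones(n_class2, dtype=int)])   # label 1 = class 2
    return X, y

X_bal, y_bal = make_set(500, 500)   # balanced set: 500 samples per class
X_imb, y_imb = make_set(900, 100)   # imbalanced set: 900 vs. 100 samples
```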
An interesting phenomenon is easy to observe: judging purely by eye, the two classes appear to split at completely different positions on the balanced and imbalanced training sets, which suggests different classification boundaries, even though we know the within-class distributions are identical in the two sets. Is this merely an illusion caused by human vision? Not at all, because most traditional classification algorithms make the same mistake as the human eye. Although traditional classifiers differ in how they are constructed, almost all of them follow a common principle: minimizing the training error. On a balanced training set this principle gives the best possible result, but on an imbalanced set insisting on it has serious consequences. Looking back at panel (b) of the figure above, the two classes overlap in the [0.5, 0.7] interval of feature 1, and in that interval class 1 (the majority class) has far more samples than class 2 (the minority class). Under the training error minimization principle, the minority-class samples in this interval will be misclassified, so the classification accuracy of the minority class ends up far lower than that of the majority class, the quality of the trained model is greatly compromised, and the model may even fail entirely. This is the challenge that the class imbalance problem poses to traditional classification algorithms.
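This effect is easy to reproduce empirically. Below is a minimal sketch assuming scikit-learn is available and reusing the X_bal/X_imb arrays from the sketch above; the decision tree is only a stand-in for any classifier trained by minimizing the training error:

```python
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

for name, (X, y) in {"balanced": (X_bal, y_bal),
                     "imbalanced": (X_imb, y_imb)}.items():
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    pred = clf.predict(X)
    # Per-class accuracy (recall). On the imbalanced set the minority class
    # (label 1, i.e. class 2) is expected to lose most of the overlap region
    # [0.5, 0.7] of feature 1, so its recall drops well below that of class 1.
    print(name,
          "class-1 recall:", round(recall_score(y, pred, pos_label=0), 3),
          "class-2 recall:", round(recall_score(y, pred, pos_label=1), 3))
```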

In the class imbalance problem, the class with more samples is conventionally called the negative class and the class with fewer samples the positive class. Another important concept is the imbalance ratio (IR), i.e. the ratio of the number of negative samples to the number of positive samples. In general, the larger the IR, the more harmful the imbalance is to the performance of traditional classifiers. Consider a training set with an IR of 99: even if every positive sample is misclassified as negative when the classifier is built, the classification accuracy still reaches 99%. To a traditional algorithm built on the training error minimization principle such accuracy is perfectly acceptable, yet the resulting model is of almost no practical use.
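The accuracy paradox in this IR = 99 example can be checked in a few lines (a minimal sketch; the counts 9,900 and 100 are chosen only to match the ratio above):

```python
import numpy as np

n_negative, n_positive = 9900, 100                  # imbalance ratio IR = 9900 / 100 = 99
y_true = np.array([0] * n_negative + [1] * n_positive)
y_pred = np.zeros_like(y_true)                      # degenerate classifier: always predict negative

print("IR:", n_negative / n_positive)               # 99.0
print("accuracy:", (y_true == y_pred).mean())       # 0.99 -- yet not a single positive sample is found
```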

Depending on the criterion used, the class imbalance problem can be further divided into several categories:

  • By the number of classes: the class imbalance problem is divided into the one-class, two-class (binary), and multi-class imbalance problems. The two-class imbalance problem is the most common in practice and currently the most thoroughly studied; the multi-class imbalance problem is the most complex and remains a research hotspot and difficulty in this field; the one-class imbalance problem is rather special and already has several effective solutions (one simple way to quantify imbalance in the multi-class case is sketched after this list).
  • By the IR value: the class imbalance problem is divided into mildly imbalanced and extremely imbalanced problems. The former has a small IR and has little impact on the performance of traditional classifiers, whereas the latter poses a much greater threat to traditional algorithms and, in extreme cases, makes them fail completely.
  • By the scope in which the imbalance acts: the class imbalance problem is divided into the within-class (intra-class) imbalance problem and the between-class (inter-class) imbalance problem. The former, also known as the within-class sub-cluster or small disjuncts problem, is mainly caused by an uneven distribution of samples of the same class across the feature space, while the latter is the class imbalance problem in the traditional sense. The two are distinct but interrelated, and when they occur together they make the learning task considerably more difficult.
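As mentioned in the first bullet, the binary IR does not carry over directly to the multi-class case. One simple, if crude, way to summarize multi-class imbalance is the ratio of the largest to the smallest class count, sketched below with made-up class sizes:

```python
from collections import Counter

# Hypothetical label vector for a three-class task (class sizes are made up).
y = [0] * 850 + [1] * 120 + [2] * 30
counts = Counter(y)
print(counts)                                        # Counter({0: 850, 1: 120, 2: 30})

# One crude summary of multi-class imbalance: largest class count over smallest.
multi_class_ir = max(counts.values()) / min(counts.values())
print(multi_class_ir)                                # ~28.3
```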

Besides the concepts above, readers should also understand the difference and connection between class imbalance learning and cost-sensitive learning. The two are in fact distinct concepts and belong to different branches of machine learning. In cost-sensitive learning, cost can be defined in many ways, including misclassification cost, test cost, query cost, sample cost, and computation cost. Only when the misclassification cost is considered can cost-sensitive learning be linked to class imbalance learning and used as a class imbalance learning method. Readers should therefore be careful not to confuse the two.
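To make the misclassification-cost connection concrete, many implementations let a per-class misclassification cost be expressed as a class weight. The sketch below uses scikit-learn's LogisticRegression together with the X_imb/y_imb data from the earlier sketch; the weight of 99 is purely illustrative, not a recommendation:

```python
from sklearn.linear_model import LogisticRegression

# Treating a missed positive (minority) sample as 99 times more costly than a
# false alarm turns the misclassification-cost view into a weighted training
# objective -- the point at which cost-sensitive learning meets class imbalance.
clf = LogisticRegression(class_weight={0: 1, 1: 99})   # weights are illustrative only
clf.fit(X_imb, y_imb)                                   # data from the earlier sketch

# class_weight="balanced" would instead derive the weights from class frequencies.
```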

Source: blog.csdn.net/hy592070616/article/details/124230533