1. Introduction
Ensemble learning accomplishes a learning task by building and combining multiple learners. The general structure is: first generate a group of "individual learners", then combine them with some strategy. The main combination strategies are averaging, voting, and learning-based methods.
In ensemble learning, the difference among the individual learners is called "ensemble diversity". Understanding ensemble diversity is the holy grail problem of this learning paradigm, that is, an elusive yet meaningful goal. Existing diversity measures fall into two main categories: pairwise measures, computed between pairs of individual learners, and non-pairwise (unpaired) measures, computed over the whole ensemble at once. This article discusses and summarizes the latter category.
2. Preparation
This section fixes some basic notation, since the metrics below are computed from the predictions of the individual learners. Individual learner set: $H = \{h_1, h_2, \dots, h_L\}$; dataset: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i$ and $y_i$ are the samples and class labels respectively, and $y_i \in \{c_1, c_2, \dots, c_C\}$.
3. Unpaired diversity measures
1. Kohavi-Wolpert variance, referred to as the KW measure, was proposed by Kohavi and Wolpert in 1996. It is computed as
$$KW = \frac{1}{N L^2} \sum_{i=1}^{N} l(x_i)\,\bigl(L - l(x_i)\bigr)$$
where $N$ is the number of samples, $L$ is the number of individual learners, and $l(x_i)$ is the number of individual learners that classify the sample $x_i$ correctly.
It can be seen from the equation that $N$ and $L$ are constants; the critical term is $l(x_i)\,(L - l(x_i))$. When $l(x_i) = L/2$ for every sample, the KW measure reaches its maximum, and diversity is largest; when $l(x_i)$ is 0 or $L$ for every sample, the KW measure reaches its minimum (zero), and diversity is smallest. This is easy to understand: if $l(x_i)$ is 0 or $L$ on every sample, then all individual learners give the same prediction on it; conversely, if $l(x_i) = L/2$, the predictions of the individual learners may differ. Note that they may differ, not that they are guaranteed to differ. Because of this, the KW measure has certain limitations as a diversity measure.
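As a minimal sketch of the formula above, assuming the ensemble's predictions are stored in an $L \times N$ array `preds` and the true labels in `y` (names of my own choosing, not from the original):

```python
import numpy as np

def kw_variance(preds, y):
    """Kohavi-Wolpert variance: the larger, the more diverse the ensemble.

    preds: (L, N) array of predicted labels, one row per learner.
    y:     (N,) array of true labels.
    """
    L, N = preds.shape
    l = (preds == y).sum(axis=0)                 # correct votes per sample
    return float(np.sum(l * (L - l)) / (N * L ** 2))

# Toy ensemble: 3 learners, 4 samples (labels are illustrative only).
y = np.array([0, 1, 0, 1])
preds = np.array([[0, 1, 0, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
print(kw_variance(preds, y))
```

When all learners predict identically, every $l(x_i)$ is 0 or $L$, so the measure is exactly zero.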
2. Interrater agreement, that is, the $\kappa$ measure. The $\kappa$ measure is used to analyze the consistency of a set of classifiers; it is defined as
$$\kappa = 1 - \frac{\frac{1}{L}\sum_{i=1}^{N} l(x_i)\,\bigl(L - l(x_i)\bigr)}{N(L-1)\,\bar{p}\,(1-\bar{p})}$$
where $\bar{p} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\mathbb{I}\bigl(h_j(x_i) = y_i\bigr)$ is the average classification accuracy of the individual learners, and $\mathbb{I}(\cdot)$ is the indicator function, which returns 1 when the condition in parentheses is true and 0 otherwise.
The $\kappa$ measure mainly reflects the consistency of the prediction results among the individual learners. When the predictions are completely consistent, $\kappa = 1$; if the learners agree less than would be expected by chance, then $\kappa \le 0$ (the most extreme case: each sample is correctly classified by exactly half of the individual learners and the average accuracy $\bar{p}$ is 0.5). Therefore, the larger the $\kappa$ value, the more consistent the predictions of the individual learners, and the smaller the diversity; and vice versa.
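The same $L \times N$ prediction-array convention gives a short sketch of $\kappa$ (again, `preds` and `y` are names I am assuming, not from the original):

```python
import numpy as np

def interrater_kappa(preds, y):
    """Interrater agreement: 1 = complete agreement, <= 0 = worse than chance.

    preds: (L, N) array of predicted labels, one row per learner.
    y:     (N,) array of true labels.
    """
    L, N = preds.shape
    l = (preds == y).sum(axis=0)      # correct votes per sample
    p_bar = l.sum() / (N * L)         # average accuracy of the learners
    return float(1 - (l * (L - l)).sum() / L
                 / (N * (L - 1) * p_bar * (1 - p_bar)))

y = np.array([0, 1, 0, 1])
agree = np.array([[0, 1, 1, 1]] * 3)   # three identical learners -> kappa = 1
mixed = np.array([[0, 1, 0, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
print(interrater_kappa(agree, y))      # 1.0
print(interrater_kappa(mixed, y))
```

Note the formula is undefined when $\bar{p}$ is 0 or 1 (all learners always wrong or always right), so a production version would need to guard that case.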
3. Entropy. The entropy measure proposed by Cunningham and Carney in 2000 is computed as
$$E_{cc} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{C} -P(c_k \mid x_i)\,\log P(c_k \mid x_i)$$
where $P(c_k \mid x_i)$ is the proportion of individual learners that predict $x_i$ as class $c_k$ (the denominator of the proportion is $L$), with the convention $0 \log 0 = 0$. Note that this measure does not require knowing the accuracy of the individual learners.
The entropy measure proposed by Shipp and Kuncheva in 2002 is computed as
$$E_{sk} = \frac{1}{N}\sum_{i=1}^{N} \frac{\min\{l(x_i),\ L - l(x_i)\}}{L - \lceil L/2 \rceil}$$
where $\lceil \cdot \rceil$ denotes rounding up: if $L/2$ is an integer, then $\lceil L/2 \rceil = L/2$; if not, it is the integer part of $L/2$ plus 1. The value range of $E_{sk}$ is [0, 1]: 0 means the learners are completely consistent, and 1 means the diversity is largest. It is worth noting that no logarithm appears here, so this is not entropy in the classical sense. Nevertheless, this equation is used more often because it is easier to implement and faster to compute.
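Both entropy variants can be sketched under the same assumed `preds`/`y` convention; note that $E_{cc}$ needs only the vote counts, while $E_{sk}$ needs the true labels:

```python
import math
import numpy as np

def entropy_cc(preds):
    """Cunningham-Carney entropy: uses only the vote proportions per class.

    np.unique drops classes with zero votes, which implements the
    0*log(0) = 0 convention implicitly.
    """
    L, N = preds.shape
    total = 0.0
    for i in range(N):
        _, counts = np.unique(preds[:, i], return_counts=True)
        p = counts / L                      # vote proportion of each class
        total += float(-(p * np.log(p)).sum())
    return total / N

def entropy_sk(preds, y):
    """Shipp-Kuncheva 'entropy' in [0, 1]: 0 = full agreement, 1 = max diversity."""
    L, N = preds.shape
    l = (preds == y).sum(axis=0)            # correct votes per sample
    return float(np.minimum(l, L - l).sum() / (N * (L - math.ceil(L / 2))))

y = np.array([0, 1, 0, 1])
preds = np.array([[0, 1, 0, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
print(entropy_cc(preds))
print(entropy_sk(preds, y))   # 0.75
```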
4. Difficulty. Let the proportion of individual learners that correctly classify a randomly drawn sample $x$ be recorded as a random variable $X$; the difficulty is then computed as
$$\theta = \mathrm{Var}(X)$$
where the random variable $X$ takes values in $\{0, 1/L, 2/L, \dots, 1\}$, and the probability distribution of $X$ can be estimated by having the $L$ classifiers predict on the dataset $D$. The distribution of the random variable $X$ is therefore

| $X$ | $0$ | $1/L$ | $2/L$ | ... | $1$ |
| $P$ | $p_0$ | $p_1$ | $p_2$ | ... | $p_L$ |

where $p_i$ is the fraction of samples that are correctly classified by exactly $i$ individual learners.
$\theta$ measures the classification difficulty of the samples for the ensemble: the smaller $\theta$ is, the greater the diversity. If a histogram is used to visualize the distribution above, then when the samples are difficult to classify, the mass of the histogram concentrates on the left, and when the samples are easy to classify, it concentrates on the right.
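A sketch of the difficulty measure under the same assumed `preds`/`y` layout, estimating the distribution of $X$ directly from the dataset:

```python
import numpy as np

def difficulty(preds, y):
    """Difficulty theta = Var(X), where X is the fraction of learners
    that classify a randomly drawn sample correctly."""
    L, N = preds.shape
    x = (preds == y).sum(axis=0) / L   # one realization of X per sample
    return float(x.var())              # population variance over the dataset

y = np.array([0, 1, 0, 1])
preds = np.array([[0, 1, 0, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
print(difficulty(preds, y))
```

`ndarray.var` uses the population variance (divides by $N$), which matches treating the dataset as the full estimated distribution of $X$.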
5. Generalized diversity. This measure is calculated as
$$GD = 1 - \frac{p(2)}{p(1)}$$
with
$$p(1) = \sum_{i=1}^{L} \frac{i}{L}\, p_i, \qquad p(2) = \sum_{i=1}^{L} \frac{i}{L}\cdot\frac{i-1}{L-1}\, p_i$$
where $p_i$ is the probability that exactly $i$ of the $L$ classifiers fail on a randomly selected sample, $p(1)$ is the probability that a randomly selected classifier fails on a randomly selected sample, and $p(2)$ is the probability that two randomly selected distinct classifiers both fail on it. The value range of the measure is [0, 1]; when $GD = 0$, the diversity is smallest. This measure captures the idea that diversity is greatest when one classifier's error is accompanied by another classifier being correct: if failures never coincide, then $p(2) = 0$ and $GD = 1$, whereas if whenever one classifier fails a second randomly chosen one also fails, then $p(2) = p(1)$ and $GD = 0$.
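The two extremes above can be checked with a small sketch (again under my assumed `preds`/`y` array convention):

```python
import numpy as np

def generalized_diversity(preds, y):
    """GD = 1 - p2/p1 in [0, 1]; 0 = minimum diversity, 1 = maximum."""
    L, N = preds.shape
    fails = (preds != y).sum(axis=0)                 # failing learners per sample
    p = np.bincount(fails, minlength=L + 1) / N      # p[i] = P(exactly i fail)
    i = np.arange(L + 1)
    p1 = float((i / L * p).sum())                    # one random learner fails
    p2 = float((i * (i - 1) / (L * (L - 1)) * p).sum())  # two both fail
    return 1 - p2 / p1

y = np.array([0, 1, 0, 1])
solo = np.array([[0, 1, 0, 1],
                 [0, 1, 1, 0],
                 [0, 0, 0, 1]])       # every failure is a lone failure -> GD = 1
joint = np.array([[0, 1, 1, 1]] * 3)  # failures always coincide -> GD = 0
print(generalized_diversity(solo, y))
print(generalized_diversity(joint, y))
```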
6. Coincident failure diversity, referred to as CFD. This measure is a modified version of generalized diversity, calculated as
$$CFD = \begin{cases} 0, & p_0 = 1 \\[4pt] \dfrac{1}{1 - p_0} \displaystyle\sum_{i=1}^{L} \frac{L - i}{L - 1}\, p_i, & p_0 < 1 \end{cases}$$
When all classifiers give the same prediction on every sample (they are all correct together or all wrong together), $CFD = 0$; when every misclassified sample is misclassified by exactly one classifier, $CFD = 1$.
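A sketch of CFD under the same assumed array layout, covering both boundary behaviors:

```python
import numpy as np

def coincident_failure_diversity(preds, y):
    """CFD: 0 when learners always predict identically, 1 when every
    misclassified sample is missed by exactly one learner."""
    L, N = preds.shape
    fails = (preds != y).sum(axis=0)
    p = np.bincount(fails, minlength=L + 1) / N   # p[i] = P(exactly i fail)
    if p[0] == 1.0:                               # no learner ever fails
        return 0.0
    i = np.arange(1, L + 1)                       # sum starts at i = 1
    return float(((L - i) / (L - 1) * p[1:]).sum() / (1 - p[0]))

y = np.array([0, 1, 0, 1])
solo = np.array([[0, 1, 0, 1],
                 [0, 1, 1, 0],
                 [0, 0, 0, 1]])       # each failure unique to one learner -> 1
joint = np.array([[0, 1, 1, 1]] * 3)  # learners always agree -> 0
print(coincident_failure_diversity(solo, y))
print(coincident_failure_diversity(joint, y))
```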
4. Summary
The diversity measures above are all computed from the outputs of the classifiers. Among them, apart from the interrater agreement $\kappa$ and the difficulty $\theta$, which decrease as diversity increases, the other measures grow with the ensemble diversity.
In fact, the author is just getting started in the field of ensemble learning, and there is still much I do not understand; corrections and advice are welcome. If anything is unclear, you are also welcome to leave a message in the comment area to discuss the unpaired diversity measures of ensemble learning.
5. References
1. Baidu Encyclopedia: Ensemble Learning
2. Zhou Zhihua. Ensemble Learning: Foundations and Algorithms [M]. Publishing House of Electronics Industry, 2020.