In the introduction above, we intentionally ignored the "number" column. If "number" is also used as a candidate splitting attribute, its information gain, calculated with the information gain formula, is 0.9182, which is far greater than that of the other candidate attributes.
The reason is that when we calculate the conditional entropy of this attribute, every branch contains exactly one sample and is therefore "pure", so the attribute's conditional entropy is 0 and its information gain equals the full category entropy, 0.9182. However, such a split obviously has no generalization ability: it cannot make effective predictions on new samples.
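A short sketch of why a unique-id column maximizes information gain (the toy data below is hypothetical, not the dataset from the text): every branch holds exactly one sample, so the conditional entropy is 0 and the gain equals the full category entropy.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(
        labels.count(c) / n * math.log2(labels.count(c) / n)
        for c in set(labels)
    )

def info_gain(attr_values, labels):
    """Information gain of splitting `labels` by `attr_values`."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Hypothetical toy data: 9 positive and 5 negative samples, plus a
# "number" column in which every sample has a unique id.
labels = ["yes"] * 9 + ["no"] * 5
number = list(range(len(labels)))

# Each branch holds one sample, so the conditional entropy is 0 and
# the gain of "number" equals the full category entropy.
print(info_gain(number, labels) == entropy(labels))
```

(With this 9/5 split the entropy is 0.940 rather than the 0.9182 of the text's dataset; the point is only that the id column's gain always equals the category entropy.)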
In fact, the information gain criterion has a preference for attributes with many possible values. To reduce the possible adverse effects of this preference, the well-known C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly, but instead uses the "gain ratio" to select the optimal splitting attribute.
Gain ratio: the gain ratio is defined as the ratio of the information gain Gain(D, a) introduced earlier to the "intrinsic value" IV(a) of attribute a [Quinlan, 1993]:
Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = −Σ_v (|D^v|/|D|)·log2(|D^v|/|D|) and D^v is the subset of samples taking value v on attribute a.
The more possible values attribute a has (that is, the larger V is), the larger IV(a) tends to be.
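A minimal sketch of this effect: IV(a) is just the entropy of the attribute's own value distribution, so for evenly distributed values it equals log2(V) and grows with the number of values V.

```python
import math

def intrinsic_value(attr_values):
    """IV(a) = -sum_v (|D_v|/|D|) * log2(|D_v|/|D|): the entropy of
    the attribute's own value distribution."""
    n = len(attr_values)
    return -sum(
        attr_values.count(v) / n * math.log2(attr_values.count(v) / n)
        for v in set(attr_values)
    )

# 12 samples split evenly into 2, 3, and then 12 distinct values:
# IV grows as the number of values V grows (log2(V) for even splits).
print(intrinsic_value(["a", "b"] * 6))        # 2 values -> 1.0
print(intrinsic_value(["a", "b", "c"] * 4))   # 3 values -> log2(3) ≈ 1.585
print(intrinsic_value(list(range(12))))       # 12 values -> log2(12) ≈ 3.585
```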
Case 1
a. Calculate category information entropy
b. Calculate the information entropy of each attribute (gender, activity)
c. Calculate the information gain of each attribute (gender, activity)
d. Calculate attribute split information metrics
The split information measure takes into account the number and size of the branches produced when an attribute is split on. We call this the intrinsic information of the attribute. The gain ratio is information gain divided by intrinsic information, so the importance of an attribute decreases as its intrinsic information increases (that is, if the attribute itself is very uncertain, we are less inclined to choose it). This can be regarded as compensation for the bias of using information gain alone.
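Putting steps a–e together, here is a hedged Python sketch of the gain-ratio computation. The gender/activity columns below are made-up stand-ins, not the actual table from the case:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(
        labels.count(c) / n * math.log2(labels.count(c) / n)
        for c in set(labels)
    )

def gain_ratio(attr_values, labels):
    """Gain ratio = information gain / split information of the attribute."""
    n = len(labels)
    cond_entropy = 0.0   # attribute (conditional) entropy, step b
    split_info = 0.0     # intrinsic information of the attribute, step d
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        w = len(subset) / n
        cond_entropy += w * entropy(subset)
        split_info -= w * math.log2(w)
    gain = entropy(labels) - cond_entropy          # step c
    return gain / split_info                       # step e

# Hypothetical mini-table: a binary "gender" column and a 3-valued
# "activity" column against the same class labels.
labels   = ["lost", "lost", "stay", "stay", "stay", "lost"]
gender   = ["m", "m", "m", "f", "f", "f"]
activity = ["high", "high", "high", "mid", "mid", "low"]
print(gain_ratio(gender, labels), gain_ratio(activity, labels))
```

On this toy data, activity comes out with the higher gain ratio, mirroring the case's conclusion.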
e. Calculate the information gain rate
The information gain rate of activity is higher, so activity is preferred as the splitting attribute when building the decision tree.
In this way, the selection preference for attributes with many values is reduced when choosing nodes.
Case 2
As shown in the table below, the first column is the weather, the second the temperature, the third the humidity, the fourth the wind speed, and the last whether the activity is carried out.
We want to solve: according to the data in the table, judge whether the activity will be carried out under the corresponding weather.
The data set has four attributes, the attribute set A = {weather, temperature, humidity, wind speed}, and two class labels, the label set L = {proceed, cancel}.
a. Calculate category information entropy
Category information entropy represents the total uncertainty of the categories over all samples. By the concept of entropy, the greater the entropy, the greater the uncertainty and the more information is needed to resolve it.
Ent(D) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
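The value 0.940 can be checked directly from the class counts (9 "proceed" and 5 "cancel" out of 14 samples):

```python
import math

# Ent(D) for 14 samples: 9 "proceed" and 5 "cancel".
p_yes, p_no = 9 / 14, 5 / 14
ent_d = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(ent_d, 3))  # 0.94
```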
b. Calculate the information entropy of each attribute
The information entropy of each attribute is equivalent to a conditional entropy. It represents the total uncertainty of the categories under each value of a given attribute. The greater an attribute's information entropy, the less "pure" the sample categories within that attribute are.
c. Calculate information gain
Information gain = entropy − conditional entropy; here it is the category information entropy minus the attribute information entropy, and it represents the reduction in information uncertainty. If an attribute's information gain is larger, splitting the samples on that attribute reduces the uncertainty of the resulting subsets more, so choosing it completes our classification goal faster and better.
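As an illustration, assume the classic 14-sample weather data where "weak" wind covers 6 proceed / 2 cancel and "strong" wind covers 3 / 3 (these branch counts are an assumption, not stated in the text). The gain of wind speed then works out as:

```python
import math

def entropy(pos, neg):
    """Binary entropy from class counts (empty classes contribute 0)."""
    total = pos + neg
    return -sum(
        c / total * math.log2(c / total) for c in (pos, neg) if c > 0
    )

# Assumed branch counts for "wind speed" in the classic 14-sample
# weather dataset: weak -> 6 proceed / 2 cancel, strong -> 3 / 3.
ent_d = entropy(9, 5)                                   # category entropy
cond = 8 / 14 * entropy(6, 2) + 6 / 14 * entropy(3, 3)  # attribute entropy
gain = ent_d - cond                                     # information gain
print(round(gain, 3))  # 0.048
```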
Information gain is the feature selection index of ID3 algorithm.
e. Calculate the information gain rate
Weather has the highest information gain rate, so weather is selected as the splitting attribute. After the split, the category is "pure" when the weather is "cloudy", so that branch becomes a leaf node; the branches that are not "pure" are selected to continue splitting.
Repeat steps 1~5 in the child nodes until all leaf nodes are "pure".
Now let's summarize the algorithm flow of C4.5:
while (the current node is "impure"):
1. Compute the category entropy of the current node (over the class values)
2. Compute the attribute entropy at the current node (class values under each attribute value)
3. Compute the information gain
4. Compute the split information measure of each attribute
5. Compute the information gain rate of each attribute
end while
Set the current node as a leaf node
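The loop above can be sketched as a recursive function. This is a minimal illustration under assumed toy data, not a full C4.5 implementation (no continuous attributes, missing values, or pruning):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of `attr` divided by its split information."""
    n = len(labels)
    cond, split = 0.0, 0.0
    for v in {r[attr] for r in rows}:
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        w = len(sub) / n
        cond += w * entropy(sub)     # attribute (conditional) entropy
        split -= w * math.log2(w)    # split information measure
    gain = entropy(labels) - cond
    return gain / split if split > 0 else 0.0

def build_tree(rows, labels, attrs):
    """C4.5-style sketch: stop when the node is 'pure' or no attributes
    remain; otherwise split on the attribute with the best gain ratio."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    tree = {}
    for v in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = build_tree([rows[i] for i in idx],
                             [labels[i] for i in idx],
                             [a for a in attrs if a != best])
    return {best: tree}

# Hypothetical mini-dataset with two attributes.
rows = [
    {"weather": "sunny",  "wind": "weak"},
    {"weather": "sunny",  "wind": "strong"},
    {"weather": "cloudy", "wind": "weak"},
    {"weather": "rain",   "wind": "strong"},
]
labels = ["cancel", "cancel", "proceed", "cancel"]
print(build_tree(rows, labels, ["weather", "wind"]))
```

On this toy data, weather has the higher gain ratio, so it becomes the root, and every branch is already "pure" after one split.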