The basis for decision tree splitting: the information gain ratio

In the above introduction we intentionally ignored the "number" column. If "number" is also treated as a candidate splitting attribute, its information gain computed with the information gain formula is 0.9182, far larger than that of the other candidate attributes.

The reason is that when we compute the conditional entropy of this attribute, every branch contains a single sample and is therefore pure, so the attribute's conditional entropy is 0 and its information gain reaches the maximum value 0.9182. Obviously, however, such a split has no generalization ability: it cannot make effective predictions on new samples.
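To make this concrete, here is a minimal Python sketch (my own illustration, not from the original post) of why a unique "number"-style column maximizes information gain: every value occurs exactly once, so each branch is pure, the conditional entropy is 0, and the gain equals the class entropy of the whole data set.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of splitting `labels` by the attribute values in `values`."""
    total = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        cond += len(subset) / total * entropy(subset)
    return entropy(labels) - cond

# Hypothetical toy data: 6 samples, a unique "number" column and a binary label.
labels = ["yes", "yes", "yes", "no", "no", "no"]
number = [1, 2, 3, 4, 5, 6]          # every value is unique, so every branch is pure
print(entropy(labels))                # 1.0
print(info_gain(number, labels))      # 1.0: the gain equals the class entropy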

In fact, the information gain criterion has a built-in preference for attributes with many possible values. To reduce the possible adverse effects of this preference, the well-known C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly; instead it uses the "gain ratio" to select the optimal splitting attribute.

Gain ratio: the gain ratio is defined as the ratio of the information gain Gain(D, a) introduced earlier to the "intrinsic value" IV(a) of attribute a [Quinlan, 1993].

Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = −Σ_{v=1}^{V} (|D^v| / |D|) · log₂(|D^v| / |D|)

The more possible values attribute a has (that is, the larger V is), the larger IV(a) tends to be.
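Translating the two formulas above directly into Python gives the sketch below (reusing the entropy and info_gain helpers from the previous snippet; the function names are mine, not the post's):

def intrinsic_value(values):
    """IV(a): entropy of the attribute's own value distribution."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(values, labels):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    iv = intrinsic_value(values)
    return info_gain(values, labels) / iv if iv > 0 else 0.0

For the unique "number" column above, IV(a) = log2(6) ≈ 2.585, so although its raw gain is maximal, its gain ratio drops to about 0.39, which is exactly the correction C4.5 is after.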

Case one

a. Calculate category information entropy

b. Calculate the information entropy of each attribute (gender, activity)

c. Calculate the information gain of each attribute (gender, activity)

d. Calculate the split information measure of each attribute

The split information measure accounts for the number and size of the branches produced when splitting on an attribute. We call this the intrinsic information of the attribute. The information gain ratio is information gain / intrinsic information, which makes an attribute less important as its intrinsic information grows (that is, if the attribute itself is highly uncertain, we become less inclined to choose it). This can be regarded as a compensation for using information gain alone.

[Figure: split information calculations for gender and activity]

e. Calculate the information gain ratio

[Figure: information gain ratio calculations for gender and activity]

Activity has the higher information gain ratio, so it is chosen first when building the decision tree.

In this way, when selecting split nodes we can reduce the preference for attributes with many possible values.

Case two

As shown in the table below, the first column is the weather, the second is the temperature, the third is the humidity, the fourth is the wind speed, and the last column records whether the activity was carried out.

The task to solve: based on the data in the table below, predict whether the activity will be carried out under the given weather conditions.

[Table: 14 weather samples with the attributes weather, temperature, humidity, and wind speed, and the class label activity (proceed / cancel)]

The data set has four attributes, the attribute set A = {weather, temperature, humidity, wind speed}; there are two class labels, the label set L = {proceed, cancel}.

a. Calculate category information entropy

Category information entropy represents the total uncertainty of the class labels over all samples. By the concept of entropy, the greater the entropy, the greater the uncertainty, and the more information is needed to resolve it.

Ent(D) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940
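As a check, the same value can be reproduced with the entropy helper introduced earlier (a small sketch; the 9 "proceed" and 5 "cancel" samples come from the table):

ent_D = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)
print(round(ent_D, 3))                                        # 0.940
print(round(entropy(["proceed"] * 9 + ["cancel"] * 5), 3))    # 0.940, same result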

b. Calculate the information entropy of each attribute

The information entropy of each attribute is equivalent to a conditional entropy. It represents the sum of the uncertainties of the class labels under each value of that attribute. The greater an attribute's information entropy, the less "pure" the sample categories within that attribute are.

[Figure: conditional entropy calculations for weather, temperature, humidity, and wind speed]
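For one concrete branch of this calculation, the sketch below computes the conditional entropy of the weather attribute. The per-branch class counts used here (sunny: 2 proceed / 3 cancel, cloudy: 4 / 0, rainy: 3 / 2) are my reading of the classic data set the table appears to follow, not numbers quoted from the post.

def conditional_entropy(branches, total):
    """Weighted class entropy over an attribute's branches.
    `branches` maps each attribute value to its (proceed, cancel) class counts."""
    cond = 0.0
    for proceed, cancel in branches.values():
        n = proceed + cancel
        ent = 0.0
        for c in (proceed, cancel):
            if c > 0:
                ent -= (c / n) * math.log2(c / n)
        cond += (n / total) * ent
    return cond

# Assumed class counts per weather value: (proceed, cancel).
weather_counts = {"sunny": (2, 3), "cloudy": (4, 0), "rainy": (3, 2)}
print(round(conditional_entropy(weather_counts, 14), 3))   # about 0.694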

c. Calculate information gain

Information gain = entropy − conditional entropy; here it is the category information entropy minus the attribute information entropy, and it measures how much the uncertainty is reduced. The greater an attribute's information gain, the more splitting the samples on that attribute reduces the uncertainty of the resulting subsets, so choosing that attribute reaches the classification goal faster and better.

Information gain is the feature selection criterion of the ID3 algorithm.
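Continuing the weather example under the same assumed branch counts, the gain is just the difference of the two quantities computed above (a sketch, not the post's own numbers):

gain_weather = ent_D - conditional_entropy(weather_counts, 14)
print(round(gain_weather, 3))   # about 0.247 under the assumed counts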

d. Calculate the split information measure of each attribute

e. Calculate the information gain ratio

Weather has the highest information gain ratio, so weather is selected as the splitting attribute. After the split, the class is already "pure" when the weather is "cloudy", so that branch is made a leaf node, and the branches that are not "pure" are selected for further splitting.

[Figure: the tree after the first split on weather; the "cloudy" branch is a pure leaf, the other branches are split further]
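The selection step itself is just an argmax over the gain ratios of the remaining attributes. The numbers below are illustrative placeholders consistent with weather winning, not values taken from the post:

# Illustrative gain-ratio values for the four attributes.
gain_ratios = {"weather": 0.156, "temperature": 0.019, "humidity": 0.152, "wind speed": 0.049}
best_attribute = max(gain_ratios, key=gain_ratios.get)
print(best_attribute)   # "weather" is chosen as the splitting attribute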

Repeat steps 1–5 on the child nodes until all leaf nodes are "pure".

Now let's summarize the algorithm flow of C4.5:

while (the current node is "impure"):
    1. Compute the class entropy of the current node (over the class label values)
    2. Compute the attribute entropy of the current node (class values under each attribute value)
    3. Compute the information gain of each attribute
    4. Compute the split information measure of each attribute
    5. Compute the information gain ratio of each attribute
end while
Set the current node as a leaf node
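Below is a compact recursive Python sketch of this loop, reusing the helpers defined earlier (entropy, info_gain, gain_ratio). It is only an illustration of the flow above, not a complete C4.5 implementation: there is no pruning, no handling of continuous attributes, and no missing-value logic.

def build_tree(rows, labels, attributes):
    """rows: list of dicts mapping attribute name -> value; labels: parallel class labels."""
    # Leaf: the current node is already "pure".
    if len(set(labels)) == 1:
        return labels[0]
    # Leaf: no attributes left, fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-5: pick the attribute with the highest information gain ratio.
    best = max(attributes, key=lambda a: gain_ratio([r[a] for r in rows], labels))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Split the samples on the chosen attribute's values and recurse into each branch.
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining)
    return tree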

Origin blog.csdn.net/cz_00001/article/details/132041633