Machine learning --- Decision tree splitting criteria: entropy, information gain, information gain ratio, Gini value, and Gini index

1. Entropy

In physics, entropy is a measure of "chaos".

The more ordered a system is, the lower its entropy; the more chaotic or dispersed it is, the higher its entropy.

In 1948, Shannon proposed the concept of information entropy.

Viewed in terms of information completeness: when the degree of order in the system is held fixed, the more concentrated the data, the smaller the entropy; the more dispersed the data, the larger the entropy.

Viewed in terms of the orderliness of the information: when the amount of data is held fixed, the more ordered the system, the lower the entropy; the more chaotic or fragmented the system, the higher the entropy.

"Information entropy" is the most commonly used indicator to measure the purity of a sample set. Assume that the current sample set D has

The proportion of k class samples is pk (k = 1, 2,. . . , |y|), , D is the total number of samples, and Ck is the kth class sample

quantity. Then the information entropy of D is defined as (log is based on 2, lg is based on 10):

Among them: The smaller the value of Ent(D), the higher the purity of D.

Example: suppose we did not watch the World Cup but want to know which team won the championship. We may only ask whether a certain team (or group of teams) is the champion, and the audience answers yes or no. We want to minimize the number of questions. What method should we use?

Answer: binary search (dichotomy).

If there are 16 teams, number them 1-16. First ask whether the champion is among teams 1-8; if so, ask whether it is among teams 1-4, and so on, until the champion is determined. With 16 teams, 4 questions are enough to get the final answer, so the information entropy of the message "who is the world champion" is 4.

So how is this information entropy of 4 calculated?

Ent(D) = -(p1*log2(p1) + p2*log2(p2) + ... + p16*log2(p16)), where p1, ..., p16 are the probabilities of each of the 16 teams winning the championship. When every team is equally likely to win, each probability is 1/16 and Ent(D) = -(16 * 1/16 * log2(1/16)) = 4.

When all outcomes are equally likely, the entropy is maximal and the event is most uncertain.
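As a quick sanity check, here is a minimal Python sketch of the entropy formula applied to the 16-team example (the helper name entropy is just for illustration):

import math

def entropy(probabilities):
    # Information entropy Ent(D) = -sum(p * log2(p)), ignoring zero probabilities.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 16 equally likely teams: entropy is 4 bits, matching the binary-search argument.
print(entropy([1 / 16] * 16))              # 4.0

# A skewed distribution is less uncertain, so its entropy is lower.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75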

2. Information gain

Information gain is the difference in entropy before and after splitting the data set on a certain feature. Entropy measures the uncertainty of a sample set: the larger the entropy, the greater the uncertainty. Therefore the difference between the set's entropy before and after the split can be used to measure how well the current feature splits the sample set D.

Information gain = entropy(before) - entropy(after)

Information gain represents the degree to which knowing the value of feature X reduces the uncertainty (information entropy) of class Y.

Assume that discrete attribute a has V possible values {a^1, a^2, ..., a^V}.

If a is used to split the sample set D, V branch nodes are produced, where the v-th branch node contains all samples in D whose value on attribute a is a^v; denote this subset by D^v. We can compute the information entropy of each D^v with the formula given above, and then, since different branch nodes contain different numbers of samples, weight each branch node by |D^v| / |D|, so that branches with more samples have more influence. This gives the "information gain" obtained by splitting sample set D on attribute a.

The information gain Gain(D, a) of feature a with respect to training set D is defined as the difference between the information entropy Ent(D) of the set D and the conditional entropy Ent(D|a) of D given feature a, that is:

Gain(D, a) = Ent(D) - Ent(D|a) = Ent(D) - Σ_{v=1}^{V} (|D^v| / |D|) Ent(D^v)

Detailed explanation of the formula.

Information entropy: Ent(D) = -Σ_{k=1}^{|y|} (|C_k| / |D|) log2(|C_k| / |D|)

Conditional entropy: Ent(D|a) = Σ_{v=1}^{V} (|D^v| / |D|) Ent(D^v), where Ent(D^v) = -Σ_{k=1}^{|y|} (|C_kv| / |D^v|) log2(|C_kv| / |D^v|)

Here |D^v| is the number of samples in the v-th branch of attribute a, and |C_kv| is the number of samples of the k-th class among the samples of that v-th branch.
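These two formulas translate directly into a short Python sketch (the helper names are illustrative, not part of the original post):

import math
from collections import Counter

def class_entropy(labels):
    # Ent(D): entropy of the class-label distribution of a sample set.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    # Ent(D|a): entropy of each branch D^v, weighted by |D^v| / |D|.
    total = len(labels)
    result = 0.0
    for value in set(feature_values):
        branch = [lab for f, lab in zip(feature_values, labels) if f == value]
        result += len(branch) / total * class_entropy(branch)
    return result

def information_gain(feature_values, labels):
    # Gain(D, a) = Ent(D) - Ent(D|a).
    return class_entropy(labels) - conditional_entropy(feature_values, labels)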

Generally speaking, the larger the information gain, the greater the "purity improvement" obtained by splitting on attribute a. We can therefore use information gain to select the splitting attributes of a decision tree. The famous ID3 decision tree learning algorithm [Quinlan, 1986] selects its splitting attributes by the information gain criterion (the "ID" in the name ID3 is short for Iterative Dichotomiser).

For example: the first column is the user number, the second column is gender, the third column is activity level, and the last column is whether the user has churned. We need to answer the question: which of the two features, gender or activity, has a greater influence on user churn?

This question can be answered by computing information gains from the counts in the table, where Positive denotes positive samples (churned), Negative denotes negative samples (not churned), and the remaining values are the corresponding counts under each split. The following quantities are needed:

① Calculate the class information entropy (overall entropy)
② Calculate the information entropy of the gender attribute
③ Calculate the information gain of gender
④ Calculate the information entropy of the activity attribute
⑤ Calculate the information gain of activity

The information gain of activity is greater than the information gain of gender; in other words, activity has a greater influence on user churn than gender. When doing feature selection or data analysis, we should therefore focus on the activity indicator.
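To make the comparison concrete, here is a usage sketch of the information_gain helper above on a small invented churn table; the numbers are hypothetical stand-ins, since the original post's table was an image:

# Hypothetical churn data: 1 = churned (Positive), 0 = stayed (Negative).
churned  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
gender   = ["M", "M", "M", "F", "F", "M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
activity = ["low", "low", "low", "low", "mid", "mid", "mid", "mid", "high", "high",
            "high", "high", "high", "high", "mid"]

print("gain(gender)   =", round(information_gain(gender, churned), 4))
print("gain(activity) =", round(information_gain(activity, churned), 4))
# The feature with the larger gain (here activity) reduces churn uncertainty more.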

3. Information gain ratio

In the discussion above we deliberately ignored the "number" column. If "number" were also treated as a candidate splitting attribute, its information gain computed with the information gain formula would be 0.9182, far larger than that of the other candidate attributes. The reason is that during the conditional-entropy calculation, the conditional entropy of this attribute turns out to be 0, so its information gain equals the class entropy, 0.9182. But it is obvious that such a split has no generalization ability and cannot effectively predict new samples.

In fact, the information gain criterion has a built-in preference for attributes with many possible values. To reduce the possible adverse effect of this preference, the famous C4.5 decision tree algorithm does not use information gain directly, but instead selects the optimal splitting attribute with the "gain ratio".

Gain ratio: the gain ratio is defined as the ratio of the information gain Gain(D, a) introduced above to the "intrinsic value" IV(a) of attribute a:

Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = -Σ_{v=1}^{V} (|D^v| / |D|) log2(|D^v| / |D|)

The more possible values attribute a has (that is, the larger V is), the larger IV(a) usually is.

The split information metric IV(a) accounts for the number and size of the branches produced when an attribute is split on; this quantity is also called the intrinsic information of the attribute. Because the gain ratio is information gain divided by intrinsic information, the importance of an attribute decreases as its intrinsic information increases (that is, if the attribute itself is highly uncertain, we are less inclined to choose it). This can be seen as a compensation for the bias of pure information gain.

For the example above, we compute the split information metric of each attribute and then the gain ratios.

The gain ratio of activity is the higher one, so this criterion is preferred when selecting nodes while building a decision tree: it reduces the selection bias toward attributes with many values.
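A sketch of the gain ratio on top of the earlier helpers; the function names are illustrative, and the final two prints reuse the hypothetical churn lists from the sketch above:

import math
from collections import Counter

def intrinsic_value(feature_values):
    # IV(a): entropy of the split itself, i.e. of the branch-size distribution.
    total = len(feature_values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(feature_values).values()
    )

def gain_ratio(feature_values, labels):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a).
    iv = intrinsic_value(feature_values)
    return information_gain(feature_values, labels) / iv if iv > 0 else 0.0

# Reuses the hypothetical gender/activity/churned lists from the churn sketch above.
print("gain_ratio(gender)   =", round(gain_ratio(gender, churned), 4))
print("gain_ratio(activity) =", round(gain_ratio(activity, churned), 4))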

Example 2: the first column is weather (outlook), the second column is temperature, the third is humidity, the fourth is wind speed, and the last column is whether the activity takes place. Based on the data in the table below, determine whether the activity will take place under the corresponding weather.

This data set has four attributes, the attribute set A = {weather, temperature, humidity, wind speed}, and two class labels, the class set L = {proceed, cancel}.

① Compute the class information entropy. The class information entropy represents the overall uncertainty of the class labels across all samples. By the notion of entropy, the larger the entropy, the greater the uncertainty and the more information is needed to resolve it.

② Compute the information entropy of each attribute. The information entropy of each attribute is a kind of conditional entropy: it represents the uncertainty of the class labels that remains under the condition of knowing that attribute's value. The larger an attribute's conditional entropy, the less "pure" the sample classes contained within its branches.

③ Compute the information gain. Information gain = entropy - conditional entropy, here the class information entropy minus the attribute's conditional entropy, and it represents how much the uncertainty is reduced. The larger the information gain of an attribute, the better a split on that attribute reduces the uncertainty of the resulting subsets, and the faster and better that attribute can complete the classification goal. Information gain is the feature selection criterion of the ID3 algorithm.

Assume that a column "number" is added in front of the data in Table 1 above, with a value of (1--14). If "number" is also used as a candidate classification attribute

property, according to the previous steps: in the process of calculating the information entropy of each attribute, the value of the attribute is 0, that is, its information gain is

0.940. But it is obvious that with this classification, the final result does not have a generalization effect. At this time, it is impossible to choose based on the information gain

Effective classification features. Therefore, C4.5 chooses to use the information gain rate to improve ID3. 

④ Compute the split information metric of each attribute. The split information metric accounts for the number and size of the branches produced when an attribute is split on; this quantity is called the intrinsic information of the attribute. The information gain ratio is information gain / intrinsic information, which makes the importance of an attribute decrease as its intrinsic information increases (that is, the more uncertain the attribute itself is, the less inclined we are to choose it); this can be seen as a compensation for pure information gain.

⑤ Compute the information gain ratio.

The weather (outlook) attribute has the highest gain ratio, so weather is chosen as the splitting attribute. After the split we find that under the condition weather = "overcast" the class is already "pure" (all samples proceed), so that branch is made a leaf node, while the branches that are not yet "pure" are selected to continue splitting.

C4.5 algorithm flow:

while (the current node is "impure"):
    1. Compute the class entropy of the current node (over the class values)
    2. Compute the attribute entropy of the current node (from the class values under each attribute value)
    3. Compute the information gain of each attribute
    4. Compute the split information metric of each attribute
    5. Compute the information gain ratio of each attribute and split on the attribute with the largest gain ratio
end while
Set the current node as a leaf node
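For intuition, here is a compact Python sketch of this loop as a recursive splitter, reusing the gain_ratio helper from earlier; it is a simplified illustration, not the full C4.5 (no pruning, no continuous attributes, no missing-value handling):

from collections import Counter

def build_c45_node(rows, labels, attributes):
    # rows: list of dicts {attribute: value}; returns a nested dict "tree" or a class label.
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest gain ratio and branch on its values.
    best = max(attributes, key=lambda a: gain_ratio([r[a] for r in rows], labels))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [lab for _, lab in subset]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_c45_node(sub_rows, sub_labels, remaining)
    return tree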

Advantages of C4.5:

① Using information gain rate to select attributes overcomes the disadvantage of using information gain to select attributes that tend to select attributes with more values.

②A post-pruning method is used to avoid the uncontrolled growth of the tree and avoid overfitting the data.

③ Handling of missing values. In some cases the available data may lack the values of some attributes. Suppose <x, c(x)> is a training instance in the sample set S, but the value A(x) of its attribute A is unknown. One strategy for handling the missing attribute value is to assign it the most common value of that attribute among the training instances at node n; another, more elaborate strategy is to assign a probability to each possible value of A. C4.5 uses this latter method to handle missing attribute values.

4. Gini value and Gini index

The CART decision tree [Breiman et al., 1984] uses the "Gini index" to select splitting attributes. CART is short for Classification And Regression Tree, a well-known decision tree learning algorithm that can be used for both classification and regression tasks.

Gini value Gini(D): the probability that two samples drawn at random from data set D have inconsistent class labels. The smaller Gini(D) is, the higher the purity of data set D. The purity of D can therefore be measured by the Gini value:

Gini(D) = Σ_{k=1}^{|y|} Σ_{k'≠k} p_k p_{k'} = 1 - Σ_{k=1}^{|y|} p_k^2, where p_k = |C_k| / |D|, |D| is the total number of samples and C_k is the set of samples of the k-th class.

Gini index Gini_index(D, a): the size-weighted Gini value of the subsets produced by splitting on attribute a; generally, the attribute with the smallest Gini index after the split is selected as the optimal splitting attribute:

Gini_index(D, a) = Σ_{v=1}^{V} (|D^v| / |D|) Gini(D^v)
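Both quantities are easy to express in Python (illustrative helper names, standard library only):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_k^2).
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_index(feature_values, labels):
    # Gini_index(D, a): Gini value of each branch D^v, weighted by |D^v| / |D|.
    total = len(labels)
    result = 0.0
    for value in set(feature_values):
        branch = [lab for f, lab in zip(feature_values, labels) if f == value]
        result += len(branch) / total * gini(branch)
    return result

# Example: a node with 3 positive and 7 negative records (illustrative numbers).
print(round(gini([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]), 2))  # 1 - (0.3**2 + 0.7**2) = 0.42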

1. Compute the Gini index of each non-serial-number attribute of the data set {owns a house, marital status, annual income}, and take the attribute with the smallest Gini index as the root node attribute of the decision tree.

2. The Gini value of the root node is:

3. When splitting on the "owns a house" attribute, the Gini index is computed as:

4. When splitting on the marital status attribute, which has three possible values {married, single, divorced}, we compute the Gini index after each of the three binary groupings: {married} | {single, divorced}, {single} | {married, divorced}, {divorced} | {single, married}.

Comparing the results, when splitting the root node on the marital status attribute we take the grouping with the smallest Gini index as the split, namely {married} | {single, divorced}.

5. In the same way, the Gini index for annual income can be obtained. Since annual income is a numerical attribute, the data must first be sorted in ascending order, and then the samples are split into two groups at the midpoints between adjacent values. For example, for the two annual income values 60 and 70, the midpoint is 65, and the Gini index is computed with 65 as the split point.

According to the calculations, among the three attributes considered for the root node, two share the smallest Gini index: annual income and marital status, both with an index of 0.3. In this case the split that appears first, marital status with the grouping {married} | {single, divorced}, is chosen as the first split.
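A hedged sketch of this numeric-split search, reusing the gini helper above (the function name and return convention are illustrative):

def best_numeric_split(values, labels):
    # Try midpoints between adjacent sorted values; return (threshold, weighted Gini).
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    total = len(pairs)
    best = (None, float("inf"))
    for i in range(1, total):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = len(left) / total * gini(left) + len(right) / total * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best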

6. Next, the remaining attributes are handled in the same way. At this node the Gini value is computed over the records it contains (here "defaulted on the loan" and "did not default" each have 3 records).

7. For the "owns a house" attribute we obtain:

8. For the annual income attribute:

After the above process, the decision tree constructed is as shown below:

CART algorithm flow:

while (the current node is "impure"):
    1. Enumerate every possible split of every variable and find the best split point
    2. Split the node into two child nodes N1 and N2
end while
Stop once every node is sufficiently "pure"
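In practice, scikit-learn's DecisionTreeClassifier implements a CART-style tree; here is a minimal usage sketch on a built-in dataset (assumes scikit-learn is installed; parameters like max_depth=3 are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" selects splits by the Gini index, as in CART;
# criterion="entropy" would use information gain instead.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))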

Origin blog.csdn.net/weixin_43961909/article/details/132576668