Python machine learning (6): Decision trees (Part 1) — building the tree, information entropy, information gain, the CART algorithm, and pruning

decision tree algorithm

Consider the process of a blind date as an example. In a blind-date decision diagram, a man first filters candidates by gender and then gets to know the other party based on information such as age, appearance, income, and occupation.
The decision diagram shows that we face all kinds of choices in life. When we lay out the judgment logic behind our experience and needs as a structured diagram, a tree-like structure emerges: the so-called decision tree.
Building a decision tree involves two stages: construction and pruning.

Constructing the tree

Construction means generating a complete decision tree. Put simply, construction is the process of deciding which attributes are used as nodes. During construction there are three types of nodes:

  • Root node: located at the top of the tree; it provides the split that best separates the classes and helps us filter the data
  • Internal node: a node in the middle of the tree; it is also a child node of the node above it
  • Leaf node: the final decision of a branch; it has no child nodes

So in the process of constructing the tree, we have to solve three important problems:

  • Which attribute to choose as the root node
  • Which attributes to choose as internal nodes (child nodes)
  • When to stop and reach the target state (leaf nodes)

Different choices of root node lead to different trees.
Question: how can a program choose the root node?
Goal: use a quantitative measure to evaluate the splits produced by different candidate features, pick the best one as the root node, and then repeat the process for the nodes below it.
The root node should split the data as cleanly as possible, and the internal nodes then subdivide the data further.

Information entropy (Entropy)

Information entropy is a measure of the uncertainty of a random variable. Entropy comes from physics, where it indicates the degree of disorder inside a system. For example, there are many kinds of lipstick, so the probability of walking into a mall and easily buying a completely satisfactory one is very low: the greater the uncertainty, the greater the entropy. If instead you want to buy a Huawei phone, you can simply buy it in a Huawei store: the greater the certainty, the smaller the entropy.

The relationship between entropy and classification

[Figure: two partitions of the same data — Figure 1 is mixed after classification, Figure 2 is cleanly separated]
Figure 1 is more chaotic after classification: the greater the disorder, the greater the uncertainty and the larger the entropy. After the classification in Figure 2, the data is relatively pure, the classes are clear and regular, and the entropy is smaller.
To implement this in a program, the notion of disorder has to be quantified.

Measure of information entropy

Unit: the bit
A coin has two sides. When tossing a fair coin, the probability of heads is 50% and the probability of tails is also 50%: the simplest binary classification problem. The uncertainty of one coin toss is recorded as 1 bit, and the number of possible outcomes grows exponentially with the number of coins:

1 coin: 2 uncertain outcomes (heads, tails)
2 coins: 4 uncertain outcomes (HH, TT, HT, TH)
3 coins: 8 uncertain outcomes
n coins: 2^n uncertain outcomes

Uniform distribution with equal probability

4 uncertain outcomes $= 2^2$, the entropy is 2 bits, $2=\log_2 4$
8 uncertain outcomes $= 2^3$, the entropy is 3 bits, $3=\log_2 8$
$m$ uncertain outcomes $= 2^n$, the entropy is $n$ bits, $n=\log_2 m$
When all outcomes are equally probable, $n=\log_2 m$, where $m$ is the number of uncertain outcomes.

Unequal probability distribution

[Figure: $D$ divided into 6 equally likely outcomes]
Suppose $D$ is divided into 6 equally likely situations, so the number of uncertain outcomes is $m=6$ and the entropy of $D$ is $Ent(D)=\log_2 6$.
The real situation may be more complicated, as shown in the figure below: the outcomes of $D$ are grouped into three classes $A$, $B$, $C$. $A$ covers three of the outcomes and has probability 1/2, $B$ covers two outcomes with probability 1/3, and $C$ covers one outcome with probability 1/6. Within $A$ there are three possibilities, so $m(A)=3$ and the entropy of $A$ taken alone is $\log_2 3$. The data inside each class is purer than $D$ as a whole; the contribution of a class is obtained by subtracting the entropy of the purer subset from the entropy of the less pure whole, weighted by the class probability.
The entropy contributed by $A$ is $Ent(A)=\frac{1}{2}(\log_2 6-\log_2 3)$.
$B$ and $C$ are handled in the same way:
$Ent(B)=\frac{1}{3}(\log_2 6-\log_2 2)$
$Ent(C)=\frac{1}{6}(\log_2 6-\log_2 1)$
[Figure: $D$ split into the classes $A$, $B$, $C$]
The entropy of $D$ is $Ent(D)=Ent(A)+Ent(B)+Ent(C)$. Expanding and simplifying gives the general formula
$Ent(D)=-\displaystyle\sum_{k=1}^{|y|}P_k\log_2 P_k$
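As a quick check with the three probabilities above:
$Ent(D)=-\left(\tfrac{1}{2}\log_2\tfrac{1}{2}+\tfrac{1}{3}\log_2\tfrac{1}{3}+\tfrac{1}{6}\log_2\tfrac{1}{6}\right)=\tfrac{1}{2}\log_2 2+\tfrac{1}{3}\log_2 3+\tfrac{1}{6}\log_2 6\approx 1.46\ \text{bits},$
which matches $Ent(A)+Ent(B)+Ent(C)$ computed term by term.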
The figure below plots the entropy contribution of a single class against its probability: the curve passes through the point $(1,0)$, the abscissa is $P_k$, and the ordinate is the corresponding entropy value. When $P_k=1$ the probability is 1 and the entropy is 0, meaning the data is completely pure. Since $P_k$ lies between 0 and 1, there is no part of the graph beyond 1.
[Figure: entropy as a function of $P_k$, passing through the point $(1,0)$]
In other words, the greater the probability, the lower the degree of chaos, and the lower the entropy.

  • Exercise 1

In the figure below, the entropy of Figure 2 is large and the degree of disorder is high, while the data in Figure 1 is relatively pure (all circles): the smaller the uncertainty, the smaller the entropy.
[Figure: Exercise 1 — Figure 1 contains only circles, Figure 2 contains a mixture of shapes]

  • Exercise 2
    Set A: [1,2,3,4,5,6,7,8,9,10]
    Set B: [1,1,1,1,1,1,1,1,9,10]
    The entropy of set B is smaller: B's data is relatively pure, the value 1 occurs with high probability, so the entropy is low and the set is more stable.
    When building a decision tree, the lower the entropy, the purer the data and the clearer the classification. As the tree grows we therefore want the entropy to keep decreasing.
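A minimal sketch (my own illustration, not code from the original post) that computes the entropy of the two sets above in plain Python:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 1, 1, 1, 1, 1, 1, 1, 9, 10]

print(entropy(A))  # ten equally likely values: log2(10) ≈ 3.32 bits
print(entropy(B))  # mostly 1s: ≈ 0.92 bits, far less uncertainty
```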

information gain

Entropy can represent the uncertainty of the sample set, the greater the entropy, the greater the uncertainty of the sample. Therefore, the difference in entropy of the set before and after the division can be used to measure the quality of the division result of the sample set Y using the current feature.
Information gain represents the degree to which feature X reduces the uncertainty of class Y, and determines the choice of a node.
In the figure below, the entropy of the root node $Y$ is relatively large at 0.9, i.e. the data is fairly chaotic. $Y$ is now divided into $Y_1$ and $Y_2$, whose entropies are 0.2 and 0.5 respectively. The split giving $Y_1$ clearly classifies better, since $0.9-0.2 > 0.9-0.5$.
[Figure: root node $Y$ with entropy 0.9 split into $Y_1$ (entropy 0.2) and $Y_2$ (entropy 0.5)]

The information gain $g(D,A)$ of feature $A$ on training set $D$ is defined as the difference between the information entropy $H(D)$ of the set $D$ and the conditional entropy $H(D|A)$ of $D$ given feature $A$. In formula form: $g(D,A)=H(D)-H(D|A)$.
It is essentially the gap between the entropy before the split and the conditional entropy after the split.

  • Example of Information Gain Calculation
    As shown in the figure below, we judge whether to play golf based on the four characteristics of weather, temperature, humidity, and wind.
    [Table: sample records with weather, temperature, humidity, windy, and whether golf was played]
    Features: weather, temperature, humidity, windy
    Label: whether to play golf
    Each feature affects the label; the impact of each feature on the final result is shown in the figure below:
    [Figure: how each feature splits the records with respect to the play / don't-play label]
    Calculation process:
    It is impossible to tell which feature should be the root node just by looking at the data, so information gain is used to choose it. The larger the information gain, the better: a larger gain means a more effective split that moves the data from impure toward pure.
    • 1. Calculate the initial entropy
    • 2. Calculate the entropy after splitting on each feature
    • 3. Take the difference to obtain the information gain
    • 4. Select the feature with the largest information gain as the root node

[Figures: entropy and information-gain calculations for each feature]
The calculation shows that temperature has the largest information gain, so temperature is selected as the root node; the child nodes are then chosen among humidity, weather, and windy in the same way, building up the complete decision tree.
Building a decision tree by repeatedly choosing the feature with the highest information gain is known as the ID3 algorithm.
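As a minimal sketch (using a tiny hand-made dataset, not the exact table from the figures above), the gain $g(D,A)=H(D)-H(D|A)$ for a categorical feature can be computed like this:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D): Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, label):
    """g(D, A) = H(D) - H(D|A) for one categorical feature."""
    base = entropy([r[label] for r in rows])
    conditional = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[label] for r in rows if r[feature] == value]
        conditional += len(subset) / len(rows) * entropy(subset)
    return base - conditional

# Hypothetical mini dataset, for illustration only.
data = [
    {"weather": "sunny",    "windy": "yes", "play": "no"},
    {"weather": "sunny",    "windy": "no",  "play": "no"},
    {"weather": "overcast", "windy": "no",  "play": "yes"},
    {"weather": "rainy",    "windy": "no",  "play": "yes"},
    {"weather": "rainy",    "windy": "yes", "play": "no"},
]

for feature in ("weather", "windy"):
    print(feature, round(information_gain(data, feature, "play"), 3))
```

ID3 would pick whichever feature prints the larger gain as the root node, then repeat the calculation on each resulting subset.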

Flaws of ID3 Algorithm

[Table: the same data with an added ID column]
Suppose a column of IDs is added to the table above, with every ID different, and the ID feature is used for classification. Each ID value forms its own category containing only itself, so the data in each category is perfectly pure: the probability of each value within its category is 100%, and whether the ID is 0 or 6, the entropy of that branch is 0. If information gain is used, the ID column would be chosen as the root node, even though it has nothing to do with whether or not the game is played.
The ID3 algorithm is therefore flawed: information gain is biased toward features with many distinct values, and such noisy features can distort the overall classification and even the structure of the whole tree. To address this flaw, the information gain ratio was proposed.

Information Gain Rate (C4.5)

When ID3 calculates information gain it tends to favour attributes with many distinct values. To avoid this, C4.5 selects attributes by the information gain ratio: gain ratio = information gain / attribute entropy, where the attribute entropy measures how the attribute (the feature) itself spreads the samples across its values. If an attribute takes many values, the data is cut into many small parts: the information gain increases, but the attribute entropy increases as well, so the ratio changes far less. The more values a split produces, the larger both the information gain and the attribute entropy become; because the two move in the same direction, dividing one by the other cancels out much of the effect and curbs the artificial growth of the information gain.
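In formula form (a standard way to write C4.5's criterion, using the notation above): $GainRatio(D,A)=\dfrac{g(D,A)}{H_A(D)}$, where $H_A(D)=-\displaystyle\sum_{v=1}^{V}\frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|}$ is the attribute entropy and $D_v$ is the subset of $D$ that takes value $v$ on attribute $A$.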

CART algorithm

The CART algorithm, short for Classification And Regression Tree, differs from ID3 and C4.5 in the trees it produces: ID3 and C4.5 can generate binary or multiway trees, while CART only supports binary trees. At the same time the CART decision tree is quite versatile: it can be used both as a classification tree and as a regression tree.
The CART classification tree is similar to C4.5, except that the Gini index is used as the attribute-selection measure instead of information gain or the information gain ratio. The Gini coefficient is also a common indicator for measuring income inequality within a country; here it reflects the uncertainty of the samples. A smaller Gini index means the samples within a node are more alike and the uncertainty is lower. Classification itself is a purification process, so when the CART algorithm builds a classification tree it chooses the split with the smallest Gini index, whereas with information gain we chose the split with the largest value.
Gini index formula: $Gini(D)=1-\displaystyle\sum_{k=1}^{|y|}P_k^2$
If some class has probability 1, the data is completely pure and the Gini index is 0. For a two-class problem with class probabilities $p$ and $1-p$, $Gini(D)=1-p^2-(1-p)^2=2p(1-p)$.
Building a decision tree with the Gini index:

  • Calculate the initial Gini coefficient
  • Calculate the Gini coefficient of each feature separately
  • Calculate the Gini gain by taking the difference
    [Figures: Gini index calculations for each feature]
    According to the calculation, temperature gives the largest Gini gain (the greatest reduction in the Gini index), so it can be selected as the root node.
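A minimal sketch (my own illustration, not code from the post) of the CART-style Gini computation with binary "one value vs. the rest" splits on a categorical feature:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_binary_gini_split(rows, feature, label):
    """Try each CART-style binary split ('value' vs. the rest) on `feature`
    and return the split value with the largest Gini gain."""
    base = gini([r[label] for r in rows])
    best_value, best_gain = None, -1.0
    for value in set(r[feature] for r in rows):
        left = [r[label] for r in rows if r[feature] == value]
        right = [r[label] for r in rows if r[feature] != value]
        if not left or not right:
            continue  # a valid split must put samples on both sides
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        gain = base - weighted
        if gain > best_gain:
            best_value, best_gain = value, gain
    return best_value, best_gain

# e.g. best_binary_gini_split(data, "weather", "play") on the mini dataset above
```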

Handling continuous data

In the actual process, we have many continuous values. For example, the annual income in the figure below is a continuous value, which is actually a numerical attribute. How do we calculate the Gini coefficient?
[Table: records with a continuous annual-income attribute and a loan label]
The process is as follows:

  • Sort the unordered values from smallest to largest
  • Split the continuous values with successive binary thresholds
    [Figure: sorted income values with candidate split points at the midpoints of adjacent values]
    Take the midpoint of each adjacent pair of values: for example, the midpoint of 60 and 70 is 65, the midpoint of 70 and 75 is 72.5, and so on. Taking 65 as the split point, only one record is below 65, one tenth of the data; of the remaining nine tenths, 6 have no loan and 3 have a loan. The Gini gain works out to 0.02. Then 72.5 is used as the split point and the calculation is repeated, giving the Gini gains shown in the figure above; the split point with the best Gini gain is chosen.
    [Figure: Gini gain for each candidate split point]
    Handling continuous data is thus a process of discretizing the continuous values.
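A minimal sketch (my own illustration) of this midpoint-threshold search, with incomes and loan labels that are hypothetical stand-ins for the table in the figures:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Sort the values, form candidate thresholds at the midpoints of adjacent
    values, and return the threshold with the largest Gini gain."""
    pairs = sorted(zip(values, labels))
    base = gini([lab for _, lab in pairs])
    best_threshold, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal adjacent values give no new threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Hypothetical incomes and loan labels, for illustration only.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels  = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_numeric_split(incomes, labels))
```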

pruning

Pruning slims down the decision tree. The goal is to reach good results without making too many judgments, much like gradient descent run for 10,000 iterations when the best result may already have appeared after a few hundred. Pruning prevents overfitting: we stop once a good enough result is obtained. If the tree fits the training set too perfectly, its results on the test set will be disappointing.
Pruning is needed to keep overfitting from happening. Without pruning, the tree keeps splitting until 100% of the training data is classified correctly, as in the figure below, where the splitting is carried all the way down.
[Figure: a tree grown until every leaf is pure]

Continuing to split all the way down often gives the same effect as stopping partway, so there is no need to keep dividing; doing so only wastes computing resources.

pre-pruning

Set constraints before the branches are constructed:

  • Limit the height (depth) of the tree. For example, if the tree in the figure above is split 4 times and the maximum depth is set to 3, the 4th split is not performed
  • Set the minimum number of samples a node must contain. For example, if the minimum is 3, a node with fewer than 3 samples is not split further
  • Set a threshold on a node's entropy. For example, if the threshold is 0.2 and a node's entropy has already dropped to 0.2, that node is not split further
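A minimal sketch (assuming scikit-learn and its bundled iris dataset purely for illustration) of how these pre-pruning constraints map onto DecisionTreeClassifier parameters:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree while it is being grown.
clf = DecisionTreeClassifier(
    criterion="entropy",        # entropy-based impurity, as in the text above
    max_depth=3,                # limit the height of the tree
    min_samples_split=3,        # nodes with fewer samples are not split
    min_impurity_decrease=0.01, # require a minimum gain before splitting
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```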

post-pruning

Post-pruning works on the fully grown, possibly overfitted decision tree: branches that contribute little are cut back afterwards, yielding a simplified version of the tree.
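One common post-pruning approach is cost-complexity pruning, which scikit-learn exposes via ccp_alpha; the sketch below (an illustration, not necessarily the specific method the post has in mind) grows a full tree and then refits it with increasing amounts of pruning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full (potentially overfitted) tree, then examine its pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha prunes more aggressively, producing smaller trees.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```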

Origin blog.csdn.net/hwwaizs/article/details/131942399