Decision tree - (1) Generation and measurement indicators

**Note: This post is my reading notes on Li Hang's "Statistical Learning Methods" and Zhou Zhihua's "Machine Learning". Although it contains some of my own understanding, much of the text is excerpted from the two books.**

A decision tree is a basic classification and regression method.

This chapter focuses on decision trees for classification.

The decision tree model has a tree structure. For classification, it can be regarded as a conditional probability distribution defined on the feature space and the class space. Its main advantages are that the model is readable and classification is fast.

During learning, a decision tree model is built from the training data according to the principle of minimizing a loss function.

During prediction, new data is classified with the learned decision tree model.

Decision tree learning usually includes 3 steps: feature selection, decision tree generation and decision tree pruning.

The ID3 algorithm was proposed by Quinlan in 1986 and the C4.5 algorithm in 1993; the CART algorithm was proposed by Breiman et al. in 1984.

Generation of decision tree

        Decision tree learning is usually a process of recursively selecting the optimal feature and partitioning the training data according to that feature, so that each sub-dataset gets the best possible classification. This process corresponds to a partition of the feature space and also to the construction of the decision tree.

        To start, build the root node and place all the training data in it. Select an optimal feature and divide the training data set into subsets according to this feature, so that each subset has the best classification under the current conditions. If these subsets are already basically correctly classified, construct leaf nodes and assign the subsets to the corresponding leaf nodes;

        if some subsets still cannot be basically correctly classified, select new optimal features for them, continue to partition them, and build the corresponding nodes. This goes on recursively until all training data subsets are basically correctly classified, or until no suitable features remain. Finally, each subset is assigned to a leaf node with a definite class. This produces a decision tree.
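As a rough illustration (my own sketch, not the books' pseudocode), the following Python function follows this recursive procedure. It assumes the training data is a list of feature dictionaries with a parallel list of class labels, and that select_best is any one of the feature selection criteria introduced in the next section; the field names are hypothetical.

from collections import Counter

def build_tree(data, labels, features, select_best):
    """Recursively grow a decision tree.

    data:        list of dicts, e.g. {"age": "youth", "has_job": "no"} (hypothetical fields)
    labels:      list of class labels, parallel to data
    features:    names of features still available for splitting
    select_best: criterion function(data, labels, features) -> feature name
    """
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: the node is pure, or there is no feature left to split on.
    if len(set(labels)) == 1 or not features:
        return majority
    best = select_best(data, labels, features)          # choose the optimal feature
    node = {"feature": best, "branches": {}, "default": majority}
    remaining = [f for f in features if f != best]
    for value in set(x[best] for x in data):
        idx = [i for i, x in enumerate(data) if x[best] == value]
        # Recursively partition each subset with the remaining features.
        node["branches"][value] = build_tree([data[i] for i in idx],
                                             [labels[i] for i in idx],
                                             remaining, select_best)
    return node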

Feature selection

        Feature selection means selecting features that have the ability to classify the training data; this can improve the efficiency of decision tree learning. If the result of classifying by a feature is not significantly different from random classification, the feature is said to have no classification ability. Empirically, discarding such features has little effect on the accuracy of the decision tree. The usual criteria for feature selection are the classification error rate, information gain, information gain ratio, and the Gini index.

These criteria are introduced below, with an example to make them easier to understand.

Example: a loan application dataset D (Figure 1.1) with 15 samples, of which 9 are approved and 6 are not; the age feature splits it into three subsets D1, D2, D3 of 5 samples each.

Figure 1.1 (loan application dataset)

Classification error rate

        The classification error rate refers to the probability that any example in the set is classified into the wrong class. It is the most direct measure of impurity.

In classification, for any leaf node t, the predicted category should be the category that contains the most samples in t.

Because all examples in t are predicted to be the category with the highest probability of occurrence, the classification error rate of t, written Error(t), is

Error(t)=1-\max_{i}[p(i|t)]

When all the samples in t belong to the same category, i.e. the node is purest, Error(t) takes its minimum value 0; when the samples in t are evenly distributed over all n categories, i.e. the node is most impure, Error(t) takes its maximum value (1-1/n).

For example, taking Figure 1.1, splitting on the age feature divides D into three subsets D1, D2, D3.

The classification error rate of the dataset D is:

Error(D)=1-max[\frac{6}{15} ,\frac{9}{15}]=\frac{6}{15}

The classification error rates of its subsets are:

Error(D_{1})=1-max[\frac{2}{5} ,\frac{3}{5}]=\frac{2}{5}             

Error(D_{2})=1-max[\frac{3}{5} ,\frac{2}{5}]=\frac{2}{5}

Error(D_{3})=1-max[\frac{4}{5} ,\frac{1}{5}]=\frac{1}{5}

So after splitting, the classification error rate is reduced by
Error(D)-\frac{5}{15}Error(D_{1})-\frac{5}{15}Error(D_{2})-\frac{5}{15}Error(D_{3})=\frac{6}{15}-(\frac{5}{15}\cdot \frac{2}{5}+\frac{5}{15}\cdot \frac{2}{5}+\frac{5}{15}\cdot \frac{1}{5})=\frac{1}{15}
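As a quick check of the numbers above, here is a small Python sketch (my own, not from the books) that computes Error(D), the weighted error of the subsets, and the reduction from their class counts:

def error_rate(counts):
    """Error(t) = 1 - max_i p(i|t), computed from the class counts at node t."""
    total = sum(counts)
    return 1 - max(counts) / total

# Class counts (approved, not approved) read off Figure 1.1.
D = [9, 6]                            # Error(D) = 6/15
subsets = [[2, 3], [3, 2], [4, 1]]    # D1, D2, D3 after splitting on age

n = sum(D)
weighted = sum(sum(s) / n * error_rate(s) for s in subsets)
print(error_rate(D) - weighted)       # 0.0666... = 1/15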

In the same way, the reduction in classification error rate is calculated for each feature; the feature with the largest reduction is chosen for the split, and the above steps are repeated to continue splitting.

Shortcoming:

The classification error rate is not sensitive enough to changes in the class probabilities, which can lead to an inefficient decision tree.

Information gain

        Entropy is a concept from thermodynamics that measures the degree of disorder in a system: the more disordered and chaotic the system, the greater its entropy. In information theory, entropy measures the uncertainty of a random variable: the greater the entropy, the greater the uncertainty of the random variable, that is, the greater the amount of information.

        Let X be a discrete random variable with a finite number of values, and let its probability distribution be

P(X=x_{i})=p_{i},\quad i=1,2,\cdots,n

Then the entropy of the random variable X is defined as

H(X)=-\sum_{i=1}^{n}p_{i}\log p_{i}

When the random variable takes only two values, e.g. 1 and 0, its distribution is P(X=1)=p, P(X=0)=1-p, and the entropy is

H(p)=-p\log_{2}p-(1-p)\log_{2}(1-p)

The conditional entropy H(Y|X) represents the uncertainty of the random variable Y given that the random variable X is known. It is defined as the mathematical expectation, over X, of the entropy of the conditional probability distribution of Y given X:

H(Y|X)=\sum_{i=1}^{n}p_{i}H(Y|X=x_{i}),\quad p_{i}=P(X=x_{i})

When the probability in entropy and conditional entropy is obtained by data estimation (especially maximum likelihood estimation), the corresponding entropy and conditional entropy are called empirical entropy and empirical conditional entropy, respectively.

If any probability is 0, we adopt the convention 0\log 0=0.

Information gain refers to the degree to which the uncertainty of the information of class Y is reduced by knowing the information of feature X.

Information gain: the information gain g(D, A) of feature A with respect to the training data set D is defined as the difference between the empirical entropy H(D) of the set D and the empirical conditional entropy H(D|A) of D given feature A, i.e.

g(D,A)=H(D)-H(D|A)

In general, the difference between the entropy H(Y) and the conditional entropy H(Y|X) is called mutual information. The information gain in decision tree learning is equivalent to the mutual information between the class and the feature in the training dataset.

For example, taking Figure 1.1, splitting on the age feature divides D into three subsets D1, D2, D3.

The empirical entropy of the dataset D is:

H(D)=-\frac{6}{15}\log_{2}\frac{6}{15}-\frac{9}{15}\log_{2}\frac{9}{15}

The empirical entropies of its subsets are:

Entropy(D_{1})=-\frac{2}{5}\log_{2}\frac{2}{5}-\frac{3}{5}\log_{2}\frac{3}{5}

Entropy(D_{2})=-\frac{3}{5}\log_{2}\frac{3}{5}-\frac{2}{5}\log_{2}\frac{2}{5}

Entropy(D_{3})=-\frac{4}{5}\log_{2}\frac{4}{5}-\frac{1}{5}\log_{2}\frac{1}{5}

Then the decrease in entropy after splitting, that is, the information gain of the age feature, is:

g(D,age)=H(D)-\frac{5}{15}Entropy(D_{1})-\frac{5}{15}Entropy(D_{2})-\frac{5}{15}Entropy(D_{3})

The information gain of the other features is calculated in the same way, and the feature with the largest information gain is chosen.
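The same calculation can be sketched in Python (a minimal version of my own, using the class counts from Figure 1.1):

import numpy as np

def entropy(counts):
    """Empirical entropy H = -sum_k p_k log2 p_k, from class counts."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return float(-(p * np.log2(p)).sum())

D = [9, 6]                            # class counts of the whole dataset
subsets = [[2, 3], [3, 2], [4, 1]]    # D1, D2, D3 after splitting on age

n = sum(D)
H_D = entropy(D)
H_D_given_age = sum(sum(s) / n * entropy(s) for s in subsets)
print(round(H_D, 3), round(H_D - H_D_given_age, 3))   # 0.971 0.083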

        The size of the information gain value is relative to the training data set and has no absolute meaning. When the classification problem is difficult, that is, when the empirical entropy of the training data set is large, the information gain tends to be large; conversely, it tends to be small. This problem can be corrected by using the information gain ratio, which is another criterion for feature selection.

      

Information gain ratio

        In fact, the information gain criterion has a preference for attributes with a large number of possible values. To reduce the possible adverse effects of this preference, the famous C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly; instead, it selects the optimal partitioning attribute using the gain ratio, defined as

GainRatio(D,a)=\frac{Gain(D,a)}{IV(a)}

where

IV(a)=-\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}

is called the "intrinsic value" of attribute a [Quinlan, 1993], also known as the split information. The larger the number of possible values of attribute a (that is, the larger V is), the larger IV(a) tends to be.

For example, taking Figure 1.1 again, splitting on the age feature divides D into three subsets D1, D2, D3.

The empirical entropy of the dataset D is:

H(D)=-\frac{6}{15}\log_{2}\frac{6}{15}-\frac{9}{15}\log_{2}\frac{9}{15}

The empirical entropies of its subsets are:

Entropy(D_{1})=-\frac{2}{5}\log_{2}\frac{2}{5}-\frac{3}{5}\log_{2}\frac{3}{5}

Entropy(D_{2})=-\frac{3}{5}\log_{2}\frac{3}{5}-\frac{2}{5}\log_{2}\frac{2}{5}

Entropy(D_{3})=-\frac{4}{5}\log_{2}\frac{4}{5}-\frac{1}{5}\log_{2}\frac{1}{5}

Then the information gain of the age feature is:

g(D,age)=H(D)-\frac{5}{15}Entropy(D_{1})-\frac{5}{15}Entropy(D_{2})-\frac{5}{15}Entropy(D_{3})

The intrinsic value of the age split is:

IV(age)=-\frac{5}{15}\log_{2}\frac{5}{15}-\frac{5}{15}\log_{2}\frac{5}{15}-\frac{5}{15}\log_{2}\frac{5}{15}

So the gain ratio of the age feature is:

GainRatio(D,age)=\frac{g(D,age)}{IV(age)}
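The gain ratio can be sketched the same way (my own code, again using the class counts from Figure 1.1):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gain_ratio(D, subsets):
    """Gain ratio = information gain g(D, A) divided by the intrinsic value IV(A)."""
    n = sum(D)
    gain = entropy(D) - sum(sum(s) / n * entropy(s) for s in subsets)
    iv = entropy([sum(s) for s in subsets])   # IV(A) = -sum_v |D_v|/|D| log2 |D_v|/|D|
    return gain / iv

# Age split: gain ≈ 0.083, IV = log2(3) ≈ 1.585, so the gain ratio ≈ 0.052.
print(round(gain_ratio([9, 6], [[2, 3], [3, 2], [4, 1]]), 3))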

It should be noted that the gain ratio criterion has a preference for attributes with a small number of possible values. Therefore, the C4.5 algorithm does not directly select the candidate partitioning attribute with the largest gain ratio; instead it uses a heuristic [Quinlan, 1993]: first find the attributes whose information gain is above average among the candidate partitioning attributes, and then select the one with the highest gain ratio among them.

Gini index

The Gini index is a value between 0 and 1; the larger the value, the higher the impurity. It measures the impurity of a node.

The CART decision tree [Breiman et al., 1984] uses the Gini index to select partitioning attributes. For a data set D with K classes, where p_{k} is the proportion of samples belonging to class k, the purity of D can be measured by the Gini value:

Gini(D)=\sum_{k=1}^{K}p_{k}(1-p_{k})=1-\sum_{k=1}^{K}p_{k}^{2}

Intuitively, Gini(D) reflects the probability that two samples randomly drawn from D have inconsistent class labels. Therefore, the smaller Gini(D) is, the higher the purity of the dataset D.
 

For example, taking Figure 1.1, splitting on the age feature divides D into three subsets D1, D2, D3.

The Gini index of the dataset D is:

Gini(D)=1-(\frac{9}{15})^{2}-(\frac{6}{15})^{2}

After splitting on age, the Gini values of the subsets D1, D2, D3 are:

Gini(D_{1})=1-(\frac{2}{5})^{2}-(\frac{3}{5})^{2}

Gini(D_{2})=1-(\frac{3}{5})^{2}-(\frac{2}{5})^{2}

Gini(D_{3})=1-(\frac{1}{5})^{2}-(\frac{4}{5})^{2}

Then the decrease in the Gini index after splitting is:

Gini(D)-\frac{5}{15}Gini(D_{1})-\frac{5}{15}Gini(D_{2})-\frac{5}{15}Gini(D_{3})
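And a matching sketch for the Gini index (my own code, same class counts as before):

def gini(counts):
    """Gini(D) = 1 - sum_k p_k^2, computed from class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

D = [9, 6]
subsets = [[2, 3], [3, 2], [4, 1]]

n = sum(D)
weighted = sum(sum(s) / n * gini(s) for s in subsets)
print(round(gini(D), 3), round(gini(D) - weighted, 3))   # 0.48 0.053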

The relationship between the Gini index, half the entropy, and the classification error rate in binary classification
 

        Taking binary classification as an example, the relationship between the Gini index Gini(p), half the entropy H(p)/2 (in bits), and the classification error rate is shown in the figure produced by the code below. The abscissa is the probability p and the ordinate is the loss. It can be seen that the curves of the Gini index and half the entropy are very close, and both can approximately represent the classification error rate.

Implementation in Python:

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# Font settings (only needed if you use CJK labels); also show minus signs correctly.
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False

# Probability p of the positive class, avoiding p = 0 and p = 1 where log2 is undefined.
p = np.linspace(0.0001, 0.9999, 50)
Gini = 2 * p * (1 - p)                                    # Gini index for binary classification
H = (-p * np.log2(p) - (1 - p) * np.log2(1 - p)) / 2.0    # half the entropy
# Classification error rate min(p, 1 - p), drawn as two line segments.
x1 = np.linspace(0, 0.5, 50)
y1 = x1
x2 = np.linspace(0.5, 1, 50)
y2 = 1 - x2

plt.figure(figsize=(10, 5))
plt.plot(p, Gini, 'r-', label='Gini index')
plt.plot(p, H, 'b-', label='Half entropy')
plt.plot(x1, y1, 'g-', label='Classification error rate')
plt.plot(x2, y2, 'g-')
plt.legend()
plt.xlim(-0.01, 1.01)
plt.ylim(0, 0.51)
plt.show()

The next article will introduce several decision tree algorithms. Follow my articles.


Reference: Li Hang, "Statistical Learning Methods"

Reference: Zhou Zhihua "Machine Learning"
