Machine Learning Notes 02 -- Decision Tree Algorithm (explained step by step): information entropy, information gain, gain ratio, Gini index

Table of contents

1. What is a decision tree

2. How to calculate information entropy and information gain

 2.1 Information entropy:

 2.2 Information gain:

 2.3 Gain ratio:

 2.4 Gini index


        Introduction: Setting the formal concept aside, we can guess the meaning from the name itself. "Decision" refers to making a decision, and the "tree" can be understood as the tree from data structures: branches grow outward from a root node until they reach leaf nodes, and each leaf node corresponds to a decision result. A decision tree therefore starts at the root node and, according to the conditions it is given, decides which branch to follow until it reaches a leaf node, at which point the round of decision-making is over.

1. What is a decision tree

       Now for the concept. As mentioned in the introduction, once we have samples, the attributes (features) of the samples, and the categories of the samples, and organize them into a tree, a decision tree is fully defined. Suppose we have a data set containing three samples A, B, and C, and each sample has four features F(1), F(2), F(3), F(4). The feature values of sample A are F(A1), F(A2), F(A3), F(A4); those of sample B are F(B1), F(B2), F(B3), F(B4); and those of sample C are F(C1), F(C2), F(C3), F(C4). The data set contains two categories, i.e. the three samples belong to two classes. Suppose we build a decision tree using feature F(1) as the root node: we can split the three samples according to the different values they take on F(1). After the split, all samples within a subset share the same value on that feature, but they do not necessarily belong to the same class, so we keep splitting the subsets on other features until either every sample in a subset belongs to the same class, or all features have been used up while a subset still contains different classes; in the latter case we label the leaf node with the class that appears most often in the subset.

       We can take the watermelon data set from the machine-learning "watermelon book", shown in the figure below, as an example; it will make the description above easier to follow.

       First of all, we need to know that a decision tree has one root node, several leaf nodes, and several non-leaf nodes, i.e. intermediate nodes that are neither the beginning nor the end.

       Let's first look at the structure of the watermelon data set in the figure. It contains 17 samples, one per row. The first row {color, root, knock, texture, navel, touch} lists the 6 features (attributes) of each sample, and each feature (attribute) takes several different values. For example, the feature (attribute) color has three possible values {green, jet black, light white}.

       Suppose we build a decision tree with the feature color as the root node. We can then split the 17 samples of data set D into three subsets: the first, with color value green, is D1 = {1, 4, 6, 10, 13, 17}; the second, with color value jet black, is D2 = {2, 3, 7, 8, 9, 15}; the third, with color value light white, is D3 = {5, 11, 12, 14, 16}, as shown in the figure below.

        After the split, D1, D2, and D3 are three independent subsets. Consider D1, shown in the figure below: three of its samples are good melons and three are not, i.e. the samples in D1 do not all belong to the same class, so we need to split D1 again.
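        To make the split concrete, here is a minimal sketch in Python (my own illustration, not code from the original post) that groups the 17 sample numbers by their color value; the partition helper and the variable names are my own.

```python
from collections import defaultdict

# color value of each sample number, taken from the watermelon data set described above
color = {
    1: "green", 4: "green", 6: "green", 10: "green", 13: "green", 17: "green",
    2: "jet black", 3: "jet black", 7: "jet black", 8: "jet black",
    9: "jet black", 15: "jet black",
    5: "light white", 11: "light white", 12: "light white",
    14: "light white", 16: "light white",
}

def partition(sample_ids, feature_value):
    """Group sample ids by the value each sample takes on the chosen feature."""
    groups = defaultdict(list)
    for s in sample_ids:
        groups[feature_value[s]].append(s)
    return dict(groups)

D = sorted(color)                  # the full data set D = {1, ..., 17}
subsets = partition(D, color)
print(subsets["green"])            # D1 -> [1, 4, 6, 10, 13, 17]
print(subsets["jet black"])        # D2 -> [2, 3, 7, 8, 9, 15]
print(subsets["light white"])      # D3 -> [5, 11, 12, 14, 16]
```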

        Suppose we split data set D1 on the feature (attribute) texture. D1 is then divided into two subsets: D11 = {1, 4, 6, 10}, whose texture value is clear, and D12 = {13, 17}, whose texture value is slightly blurry. As shown below.

        Looking at D11 and D12 after the split, as shown in the figure below: every sample in D12 is not a good melon, while in D11 three samples are good melons and one is not.

Dataset D11

 Dataset D12

       So data set D12 does not need to be split any further, because all of its samples belong to the same class. The current node can be marked as a leaf node whose class is: not a good melon. As shown below.

       Let's continue and split data set D11 on the feature (attribute) root. Since the root feature has three different values, D11 can be divided into three subsets Da, Db, Dc: Da = {1, 4}, whose root value is curled up; Db = {6}, whose root value is slightly curled; and Dc = {10}, whose root value is stiff, as shown in the figure below.

         Looking at the current data sets Da, Db, and Dc, as shown in the figure below: all samples in Da and Db are good melons, and all samples in Dc are not good melons. So we can mark the nodes of Da, Db, and Dc as leaf nodes, each corresponding to a class result.

 Da data set

 Db data set

 Dc data set

Note: The red serial numbers above mark the order in which we selected the features; marking them makes the selection order easier to see.

       Since the subsets Da, Db, and Dc each contain only one class after this split, they do not need to be divided any further. The resulting decision tree is shown in the figure below. Here we have only built the branch under D1 of the original root node; the branches for D2 and D3 are constructed in exactly the same way, so the whole construction process can be viewed as recursive.
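       To summarize the recursion, here is a rough sketch in Python (my own illustration, not code from the post). It assumes each sample is a (feature_dict, label) pair and leaves the choice of splitting feature to a choose_feature function, which is exactly the question the next part of this article answers.

```python
from collections import Counter

def build_tree(samples, features, choose_feature):
    """samples: list of (feature_dict, label) pairs;
    features: names of the features still available for splitting;
    choose_feature(samples, features): returns the feature to split on
    (e.g. by information gain, introduced in the next section)."""
    labels = [y for _, y in samples]
    # case 1: all samples share one class -> this node is a leaf
    if len(set(labels)) == 1:
        return labels[0]
    # case 2: no features left -> leaf labelled with the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # otherwise split on the chosen feature and recurse into each subset
    best = choose_feature(samples, features)
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {s[best] for s, _ in samples}:
        subset = [(s, y) for s, y in samples if s[best] == value]
        branches[value] = build_tree(subset, remaining, choose_feature)
    return {best: branches}
```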


        Looking at the construction above, some readers may ask: why use color for the root node, texture for the second split, and root for the third? When I chose those features I was simply assuming them, and assumptions alone are not convincing. So how should we choose which feature to use as the splitting condition for the current data set?

This leads to our following concepts - information entropy (entropy), information gain.

2. How to calculate information entropy and information gain

        2.1 Information entropy:

        Information entropy is a measure of the purity of a data set. Purity describes how consistent the classes of the samples in a data set are. When building a decision tree, we naturally want the subsets produced by each split to be as pure as possible.

      Suppose the current sample set D contains |Y| classes, and the proportion of samples belonging to the k-th class in the whole set is pk (k = 1, 2, ..., |Y|). The information entropy of set D is then defined as:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

        The formula says: for the current set D, multiply the proportion pk of each class (the number of samples of that class divided by the total number of samples in D) by the logarithm of pk, sum over all classes, and finally take the negative. Think about it: if the proportion pk of class k in D is very large, so that pk is close to 1, then the value the formula gives is close to 0, and 0 is the minimum this formula can take. Therefore, the smaller the information entropy Ent(D), the higher the purity of D.
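        As a small illustration (my own sketch, not from the original post), Ent(D) can be computed directly from a list of class labels; the 8-good / 9-bad split of the full watermelon data set gives approximately 0.998.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), computed from a list of class labels."""
    n = len(labels)
    # classes with zero samples never appear in Counter, so 0*log(0) is never evaluated
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# the full watermelon data set has 8 good melons and 9 bad ones
print(entropy(["good"] * 8 + ["bad"] * 9))   # ~0.998
```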

        2.2 Information gain:

        From the analysis of the watermelon data set D in the first part, we know that a feature (attribute) can take several different values; for example, the feature (attribute) color has three possible values {green, jet black, light white}.

        Suppose the current data set has several features (attributes), and one feature (attribute) a has V possible values {a1, a2, ..., aV}. If we split data set D on feature a, we generate V branches, just as splitting on color in the first part produced three branches. The v-th branch node contains the subset Dv of the samples in D whose value on feature a is av. For each branch node we can compute the information entropy Ent(Dv) of its subset Dv using the formula above. Because the branch nodes contain different numbers of samples, we give each node a weight |Dv| / |D|, i.e. the ratio of the number of samples in Dv to the number of samples in D before the split, so that a branch containing more samples carries more weight. With this we can compute the information gain obtained by splitting on feature (attribute) a. The formula is as follows.

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\mathrm{Ent}(D_v)$$

 The greater the information gain, the greater the improvement in purity obtained by splitting on feature (attribute) a. Based on this, for the current node we compute the information gain of every candidate feature (attribute) and choose the one with the largest information gain as the splitting feature (attribute) of that node.

The above is the famous ID3 decision tree learning algorithm.

If you want a worked example, see the one on page 75 of the machine-learning watermelon book. Since the calculations involve a lot of notation, I won't repeat them here; you can work through them using the theory above.
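A minimal sketch of the information-gain computation, reusing the entropy() helper sketched in Section 2.1; samples are again assumed to be (feature_dict, label) pairs, and the function names are my own.

```python
from collections import defaultdict

# assumes the entropy() helper from the sketch in Section 2.1 is in scope

def information_gain(samples, feature):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v),
    where samples is a list of (feature_dict, label) pairs."""
    total = entropy([y for _, y in samples])
    groups = defaultdict(list)
    for s, y in samples:
        groups[s[feature]].append(y)
    weighted = sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return total - weighted

def id3_choose(samples, features):
    """ID3: split on the candidate feature with the largest information gain."""
    return max(features, key=lambda f: information_gain(samples, f))
```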

 2.3 Gain ratio:

        Why do we need a gain ratio after information gain? If we treat the ID number of each watermelon as a feature (attribute) and split on it, we get an information gain of 0.998, far higher than any other candidate attribute. This is because splitting on the ID produces as many branches as there are IDs, and each branch contains only one sample. However, a decision tree built this way has no generalization ability and cannot predict new samples.

        Information gain is biased toward features (attributes) with many possible values. To reduce the harmful effect of this bias, we introduce the gain ratio, which is the criterion used by the famous C4.5 decision tree algorithm.

Calculated as follows:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$

Here, the larger the number of possible values of attribute a, the larger IV(a) tends to be.

Note that the gain-ratio criterion is biased toward attributes with fewer possible values, so the C4.5 algorithm does not simply pick the feature with the largest gain ratio. Instead it uses a heuristic: first select, from the candidate features, those whose information gain is above average, and then choose the one with the highest gain ratio among them.
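A sketch of the gain ratio and of the C4.5-style heuristic described above (my own illustration), building on the information_gain() helper from the previous sketch; the function names and the guard for a feature that takes only one value are my own choices.

```python
import math
from collections import Counter

# assumes information_gain() from the previous sketch is in scope

def intrinsic_value(samples, feature):
    """IV(a) = -sum_v |D_v|/|D| * log2(|D_v|/|D|)."""
    n = len(samples)
    counts = Counter(s[feature] for s, _ in samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(samples, feature):
    iv = intrinsic_value(samples, feature)
    if iv == 0:   # the feature takes a single value here, so it cannot split D
        return 0.0
    return information_gain(samples, feature) / iv

def c45_choose(samples, features):
    """C4.5 heuristic: keep candidates with above-average information gain,
    then pick the one with the highest gain ratio among them."""
    gains = {f: information_gain(samples, f) for f in features}
    average = sum(gains.values()) / len(gains)
    candidates = [f for f in features if gains[f] >= average]
    return max(candidates, key=lambda f: gain_ratio(samples, f))
```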

2.4 Gini index

        With the introduction of the Gini index comes a new algorithm: the CART decision tree.

        Here the purity of data set D is no longer measured by information entropy but by the Gini value. The formula is:

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

        From this formula we can see that Gini(D) reflects the probability that two samples drawn at random from data set D have inconsistent class labels. The smaller this value, the higher the purity.

        The Gini index of attribute a can be defined as the following formula:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\mathrm{Gini}(D_v)$$

        Therefore, we compute the Gini index of each feature (attribute) and use the feature (attribute) with the smallest Gini index as the basis for the current split.
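        Finally, a sketch of the Gini value and Gini index under the same (feature_dict, label) sample representation used in the earlier sketches (again my own illustration); the CART selection step is simply "pick the smallest Gini index", as described above.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2, computed from a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(samples, feature):
    """Gini_index(D, a) = sum_v |D_v|/|D| * Gini(D_v)."""
    groups = defaultdict(list)
    for s, y in samples:
        groups[s[feature]].append(y)
    return sum(len(g) / len(samples) * gini(g) for g in groups.values())

def cart_choose(samples, features):
    """CART: split on the candidate feature with the smallest Gini index."""
    return min(features, key=lambda f: gini_index(samples, f))
```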

        These are some of the theoretical foundations for selecting attributes when constructing a decision tree. In practice we also run into the problem of overfitting, which brings in the pruning of the decision tree, i.e. deciding which branches to remove and which to keep.

This summary took a day to write. If you reprint it, please credit the source. If you find it useful, share it with those who need it and let's learn together.


Origin blog.csdn.net/BaoITcore/article/details/125142150