Decision Tree Series: CART

CART (Classification And Regression Tree) is a decision tree algorithm optimized on the basis of ID3. When learning CART, remember the following key points:

(1) CART can be either a classification tree or a regression tree;

(2) when CART is a classification tree, the Gini value is used as the criterion for node splitting; when CART is a regression tree, the minimum sample variance is used as the criterion for node splitting;

(3) CART is a binary tree.

Next, CART will be introduced with a practical example:

Table 1: Original data

| Watch TV time | Marital status | Profession | Age |
|---|---|---|---|
| 3 | unmarried | student | 12 |
| 4 | unmarried | student | 18 |
| 2 | married | teacher | 26 |
| 5 | married | office worker | 47 |
| 2.5 | married | office worker | 36 |
| 3.5 | unmarried | teacher | 29 |
| 4 | married | student | 21 |

Understand CART through the following questions:

Classification tree or regression tree?

A classification tree predicts the category an object belongs to based on the object's features; a regression tree predicts an attribute of an object, expressed as a numeric value, based on the object's information.

CART can be either a classification tree or a regression tree. Taking the table above as an example: if we want to predict whether a person is married, the CART we construct will be a classification tree; if we want to predict a person's age, the CART we construct will be a regression tree.

How do classification trees and regression trees make decisions? Suppose we build two decision trees: one to predict whether a user is married, and one to predict the user's actual age, as shown in Figure 1 and Figure 2:

Figure 1: Decision tree for predicting marital status. Figure 2: Decision tree for predicting age. (Figures not reproduced here.)

Figure 1 shows a classification tree: the output of a leaf node is an actual category, in this example marital status (married or unmarried). The category with the largest proportion among the leaf node's samples is chosen as the output category.

Figure 2 shows a regression tree: it predicts the user's actual age, a concrete numeric output. How is this output value obtained? Normally the median, mean, or mode of the leaf node's samples is used; Figure 2 uses the mean of the ages at each leaf node as the output value.

How does CART choose the split attribute?

The purpose of splitting is to make the data purer, so that the output of the decision tree is closer to the true value. How does CART evaluate node purity? If it is a classification tree, CART uses the Gini value to measure node purity; if it is a regression tree, it uses the sample variance. The less pure a node, the worse its classification or prediction.

The Gini value is calculated as:

$$\mathrm{Gini} = 1 - \sum_{i} p_i^2$$

where $p_i$ is the proportion of the node's samples that belong to class $i$.

The less pure the node, the larger the Gini value. Take two classes as an example: if all the data in a node belong to a single class, then $\mathrm{Gini} = 1 - 1^2 = 0$; if the two classes are present in equal numbers, then $\mathrm{Gini} = 1 - 0.5^2 - 0.5^2 = 0.5$.

The regression deviation measure (based on the sample variance) is calculated as:

$$\sigma = \sqrt{\sum_{i}\left(x_i - \mu\right)^2}$$

where $x_i$ are the target values of the node's samples and $\mu$ is their mean.

The larger $\sigma$ is, the more scattered the node's data and the worse the prediction. If all the data in a node are identical, then $\sigma = 0$ and the node's output value can be relied on; if the node's data vary widely, the output value may differ greatly from the actual values.

Therefore, whether it is a classification tree or a regression tree, CART chooses the splitting scheme that makes the Gini value or the deviation measure of the child nodes smallest. That is, for a classification tree, minimize:

$$\mathrm{Gini}_{split} = \sum_{k}\frac{N_k}{N}\,\mathrm{Gini}_k$$

or, for a regression tree, minimize:

$$\sigma_{split} = \sum_{k}\sigma_k$$

where the sums run over the two child nodes, $N_k$ is the number of samples in child $k$, $N$ is the number of samples in the parent node, and $\mathrm{Gini}_k$ and $\sigma_k$ are the child-node measures defined above.

How does CART split into a binary tree?

     The splitting of nodes is divided into two situations, continuous data and discrete data.

CART's handling of continuous attributes is similar to C4.5's: it finds the optimal split point by minimizing the post-split Gini value or sample variance and divides the node in two. This is not repeated here; see the C4.5 article for details.

For discrete attributes, in principle a node should split into as many children as the attribute has distinct values. But CART is a binary tree: each split produces exactly two nodes. What to do? It is simple: take one discrete value on its own as one child node, and group all the other discrete values into the other child node. There are therefore as many candidate partitions as the attribute has distinct values. A simple example: if a discrete attribute has three values X, Y, Z, the candidate partitions are {X}, {Y, Z}; {Y}, {X, Z}; and {Z}, {X, Y}. The Gini value or deviation measure of each partition is calculated to determine the best one, as sketched below.
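To make the enumeration concrete, here is a tiny Java sketch (the values X, Y, Z are just the ones from the example; the class name is mine) that lists the one-value-versus-rest partitions of a discrete attribute:

```java
import java.util.ArrayList;
import java.util.List;

public class DiscretePartitions {
    public static void main(String[] args) {
        List<String> values = List.of("X", "Y", "Z");
        // One candidate binary partition per discrete value: that value
        // alone in one child, all remaining values in the other child.
        for (String v : values) {
            List<String> rest = new ArrayList<>(values);
            rest.remove(v);
            System.out.println("{" + v + "} vs " + rest);
        }
    }
}
```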

Take the attribute "profession" as an example. It has three discrete values: "student", "teacher", and "office worker", so there are three candidate partitions: {"student"}, {"teacher", "office worker"}; {"teacher"}, {"student", "office worker"}; and {"office worker"}, {"student", "teacher"}. We calculate the Gini value or the deviation measure of the child nodes under each partition and select the best one, as follows:

The first partition: {"student"}, {"teacher", "office worker"}

Predict whether married (classification). From Table 1, the three students are 1 married and 2 unmarried; the other four records are 3 married and 1 unmarried:

$$\mathrm{Gini}_{split} = \frac{3}{7}\Big(1-\big(\tfrac{1}{3}\big)^2-\big(\tfrac{2}{3}\big)^2\Big)+\frac{4}{7}\Big(1-\big(\tfrac{3}{4}\big)^2-\big(\tfrac{1}{4}\big)^2\Big) \approx \frac{3}{7}\times 0.444+\frac{4}{7}\times 0.375 \approx 0.405$$

Predict age (regression). The students' ages {12, 18, 21} have mean 17 and total squared deviation 42; the other ages {26, 47, 36, 29} have mean 34.5 and total squared deviation 261:

$$\sigma_{split} = \sqrt{42}+\sqrt{261} \approx 6.48+16.16 = 22.64$$

 

The second partition: {"teacher"}, {"student", "office worker"}

Predict whether married (classification). The two teachers are 1 married and 1 unmarried (Gini 0.5); the other five records are 3 married and 2 unmarried (Gini 0.48):

$$\mathrm{Gini}_{split} = \frac{2}{7}\times 0.5+\frac{5}{7}\times 0.48 \approx 0.486$$

Predict age (regression). The teachers' ages {26, 29} have mean 27.5 and total squared deviation 4.5; the other ages {12, 18, 21, 47, 36} have mean 26.8 and total squared deviation 822.8:

$$\sigma_{split} = \sqrt{4.5}+\sqrt{822.8} \approx 2.12+28.68 \approx 30.81$$

The third partition: {"office worker"}, {"student", "teacher"}

Predict whether married (classification). Both office workers are married (Gini 0); the other five records are 2 married and 3 unmarried (Gini 0.48):

$$\mathrm{Gini}_{split} = \frac{2}{7}\times 0+\frac{5}{7}\times 0.48 \approx 0.343$$

Predict age (regression). The office workers' ages {47, 36} have mean 41.5 and total squared deviation 60.5; the other ages {12, 18, 21, 26, 29} have mean 21.2 and total squared deviation 178.8:

$$\sigma_{split} = \sqrt{60.5}+\sqrt{178.8} \approx 7.78+13.37 = 21.15$$

In summary, the third partition {"office worker"}, {"student", "teacher"} yields both the smallest Gini value (≈ 0.343) for predicting marital status and the smallest total deviation (≈ 21.15) for predicting age, so it is the partition chosen in both cases. The sketch below can be used to check these numbers.
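A self-contained Java sketch (class counts and ages hardcoded from Table 1; class and method names are mine) that recomputes the weighted Gini value and the total deviation of each partition:

```java
import java.util.Arrays;

public class ProfessionSplitCheck {

    // Gini value of a node from its class counts: 1 - sum of p_i^2.
    static double gini(int... counts) {
        int n = Arrays.stream(counts).sum();
        double g = 1.0;
        for (int c : counts) {
            double p = (double) c / n;
            g -= p * p;
        }
        return g;
    }

    // Deviation measure of a node: sqrt(sum((x_i - mean)^2)).
    static double sigma(double... xs) {
        double mean = Arrays.stream(xs).average().orElse(0.0);
        double sse = 0.0;
        for (double x : xs) sse += (x - mean) * (x - mean);
        return Math.sqrt(sse);
    }

    public static void main(String[] args) {
        // Married/unmarried counts and ages per group, taken from Table 1.
        // Partition 1: {student} vs {teacher, office worker} -> 0.405 / 22.64
        System.out.printf("P1: gini=%.3f sigma=%.2f%n",
                3.0 / 7 * gini(1, 2) + 4.0 / 7 * gini(3, 1),
                sigma(12, 18, 21) + sigma(26, 47, 36, 29));
        // Partition 2: {teacher} vs {student, office worker} -> 0.486 / 30.81
        System.out.printf("P2: gini=%.3f sigma=%.2f%n",
                2.0 / 7 * gini(1, 1) + 5.0 / 7 * gini(3, 2),
                sigma(26, 29) + sigma(12, 18, 21, 47, 36));
        // Partition 3: {office worker} vs {student, teacher} -> 0.343 / 21.15
        System.out.printf("P3: gini=%.3f sigma=%.2f%n",
                2.0 / 7 * gini(2, 0) + 5.0 / 7 * gini(2, 3),
                sigma(47, 36) + sigma(12, 18, 21, 26, 29));
    }
}
```

Running it prints the three Gini values and three deviation totals used in the comparison above.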

 

How to prune?

CART adopts the CCP (cost-complexity pruning) method. CCP selects the non-leaf node with the smallest surface error rate gain value and deletes its left and right children, turning it into a leaf. If several non-leaf nodes share the same smallest gain value, the one whose subtree contains the most leaf nodes is pruned.

The procedure can be described as follows. Let the non-leaf nodes of the decision tree be $T_1, T_2, \ldots, T_n$:

a) Calculate the surface error rate gain value $\alpha$ of every non-leaf node;

b) Select the non-leaf node with the smallest $\alpha$ (if several non-leaf nodes share the smallest value, select the one whose subtree has the most leaf nodes);

c) Prune it.

The surface error rate gain value is calculated as:

$$\alpha = \frac{R(t) - R(T_t)}{N(T_t) - 1}$$

where:

$R(t)$ is the error cost of node $t$ if its subtree is pruned and $t$ becomes a leaf: $R(t) = r(t)\,p(t)$, with $r(t)$ the error rate of node $t$ and $p(t)$ the proportion of all data that reaches node $t$;

$R(T_t)$ is the error cost of the subtree rooted at $t$: $R(T_t) = \sum_i r_i\,p_i$, where $r_i$ is the error rate of leaf $i$ of the subtree and $p_i$ is the proportion of all data that reaches leaf $i$;

$N(T_t)$ is the number of leaf nodes of the subtree.
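As a concrete illustration of the formula, here is a small Java sketch; the counts in main are made up for illustration and do not come from the original post's figure:

```java
public class SurfaceErrorRateGain {

    /**
     * Surface error rate gain of a non-leaf node t.
     *
     * @param nodeErrors    misclassified samples at t if t were made a leaf
     * @param subtreeErrors total misclassified samples over the leaves of t's subtree
     * @param totalData     total number of samples in the whole tree
     * @param leafCount     number of leaf nodes in t's subtree
     */
    static double alpha(int nodeErrors, int subtreeErrors, int totalData, int leafCount) {
        // R(t) = r(t) * p(t) simplifies to (errors at t) / (total data);
        // R(T_t) likewise simplifies to (subtree leaf errors) / (total data).
        double rT = (double) nodeErrors / totalData;
        double rTt = (double) subtreeErrors / totalData;
        return (rT - rTt) / (leafCount - 1);
    }

    public static void main(String[] args) {
        // Hypothetical subtree: pruning it to a leaf would misclassify 5 of 40
        // samples; keeping it misclassifies 2 of 40 across its 3 leaves.
        System.out.printf("alpha = %.4f%n", alpha(5, 2, 40, 3));
    }
}
```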

Example:

The figure in the original showed one of the subtrees (not reproduced here); let the total data volume of the decision tree be 40. The surface error rate gain value of that subtree is computed with the formula above. Once the surface error rate gain values of all the other subtrees are obtained in the same way, the decision tree can be pruned.

 

Actual program and source code

Flow chart: (figure not reproduced here)

(1) Data processing

The original data is digitized and stored as a two-dimensional array. Each row represents a record; the first n−1 columns hold the attributes and the last column holds the label.

         The data in Table 1 can be transformed into Table 2:

Table 2: Data after initialization

| Watch TV time | Marital status | Profession | Age |
|---|---|---|---|
| 3 | 1 | 1 | 12 |
| 4 | 1 | 1 | 18 |
| 2 | 2 | 2 | 26 |
| 5 | 2 | 3 | 47 |
| 2.5 | 2 | 3 | 36 |
| 3.5 | 1 | 2 | 29 |
| 4 | 2 | 1 | 21 |

Here, for the "marital status" attribute, the numbers {1, 2} represent {unmarried, married} respectively; for the "profession" attribute, {1, 2, 3} represent {student, teacher, office worker} respectively.

The code is as follows:

    static double[][] allData;            // stores the data used for training
    static List<String>[] featureValues;  // the discrete values of each discrete attribute

featureValues is an array of lists: the array's length is the number of attributes, and each element is a list of the discrete values of the corresponding attribute.
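A minimal sketch of how these two structures might be initialized from the digitized data (the initialization method and any names beyond the two fields declared above are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class DataInit {
    static double[][] allData;            // training data, one row per record
    static List<String>[] featureValues;  // observed values per attribute column

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        // Digitized Table 2: TV time, marital status (1=unmarried, 2=married),
        // profession (1=student, 2=teacher, 3=office worker); the last
        // column (age) is treated as the label, as in the regression example.
        allData = new double[][] {
            {3, 1, 1, 12}, {4, 1, 1, 18}, {2, 2, 2, 26}, {5, 2, 3, 47},
            {2.5, 2, 3, 36}, {3.5, 1, 2, 29}, {4, 2, 1, 21},
        };
        int attributeCount = allData[0].length - 1;
        featureValues = new List[attributeCount];
        for (int j = 0; j < attributeCount; j++) {
            // In the real program only discrete attributes would get value
            // lists; the continuous TV-time column is included for brevity.
            featureValues[j] = new ArrayList<>();
            for (double[] row : allData) {
                String v = String.valueOf(row[j]);
                if (!featureValues[j].contains(v)) featureValues[j].add(v);
            }
        }
        System.out.println(featureValues[2]); // distinct codes of "profession"
    }
}
```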

(2) Two classes: the tree node and the split information

a) Node class Node

This class represents a node. Its attributes include the split attribute chosen by the node, the node's output class, the child nodes, and the depth. Note that, compared with ID3, two new attributes are added: leafWrong and leafNode_Count, which record the total number of misclassified samples over the subtree's leaf nodes and the number of leaf nodes, respectively; they exist mainly to make pruning convenient.

Tree node
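The original post shows the node class as an image. A minimal sketch of what it might look like, based on the description above; apart from leafWrong and leafNode_Count, the field names are my assumptions:

```java
import java.util.List;

// Sketch of a tree node: the chosen split attribute, the node's output,
// its children, and bookkeeping fields used by pruning.
public class Node {
    int splitAttribute = -1;  // index of the attribute this node splits on
    double splitValue;        // threshold (continuous) or value code (discrete)
    double output;            // majority class (classification) or mean (regression)
    List<Node> children;      // the two child nodes; null for a leaf
    int depth;                // depth of this node in the tree
    int leafWrong;            // misclassified samples over this subtree's leaves
    int leafNode_Count;       // number of leaf nodes in this subtree
}
```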

b) The split information class, which stores the splitting information of a node, including the row coordinates of the data in each child node, the number of samples of each class in each child node, the attribute the node splits on, the type of that attribute, and so on.

Split information
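The split information class is likewise shown as an image in the original. A sketch based on the description above (field names are assumptions):

```java
import java.util.List;

// Sketch of the split information: which rows go to each child,
// per-child class counts, and what the split was made on.
public class SplitInfo {
    int splitAttribute;             // index of the attribute chosen for the split
    boolean isDiscrete;             // type of the attribute (discrete vs continuous)
    double splitValue;              // split point (continuous) or value code (discrete)
    List<List<Integer>> childRows;  // row indices of the data in each child node
    int[][] childClassCounts;       // per-child counts of each class label
}
```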

(3) The main method findBestSplit(Node node, List<Integer> nums, int[] isUsed), which splits a node, where:

node is the node about to be split;

nums is the list of row coordinates of the node's data;

isUsed records which attributes have already been used on the path down to this node.

The findBestSplit method has the following main components:

1) Deciding when a node stops splitting

The conditions under which a node stops splitting are checked first; the source code is as follows:

Conditions for stopping division
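The original shows this check as an image, and the exact conditions are not recoverable here; the sketch below uses typical stopping conditions (pure node, too little data, no unused attributes) as assumptions:

```java
import java.util.List;

public class StopCheck {
    // Sketch: a node stops splitting and becomes a leaf when its data is
    // already pure, when too few samples remain, or when every attribute
    // has been used. These are typical conditions, not the original's.
    static boolean shouldStop(List<Integer> nums, int[] isUsed, double[][] allData) {
        if (nums.size() <= 1) return true;       // too little data to split
        int labelCol = allData[0].length - 1;
        double firstLabel = allData[nums.get(0)][labelCol];
        boolean pure = true;                     // are all labels identical?
        for (int row : nums) {
            if (allData[row][labelCol] != firstLabel) { pure = false; break; }
        }
        if (pure) return true;
        for (int used : isUsed) {
            if (used == 0) return false;         // at least one attribute is free
        }
        return true;                             // all attributes used up
    }
}
```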

2) Finding the optimal split attribute

Finding the optimal split attribute requires calculating the post-split Gini value or sample deviation for every candidate split. The formulas were given above; the Gini value calculation code is as follows:

GINI value calculation
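The Gini computation is shown as an image in the original. A sketch of how the Gini value of a candidate child node might be computed from row indices into allData, assuming as above that the last column holds the class label:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GiniCalc {
    // Gini value of the rows that would land in one child node:
    // 1 - sum over classes of (count / total)^2.
    static double gini(List<Integer> rows, double[][] allData) {
        Map<Double, Integer> counts = new HashMap<>();
        int labelCol = allData[0].length - 1;
        for (int r : rows) {
            counts.merge(allData[r][labelCol], 1, Integer::sum);
        }
        double g = 1.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            g -= p * p;
        }
        return g;
    }
}
```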

3) Splitting and processing the child nodes recursively

This is in fact a recursive process: the findBestSplit method is executed again on each child node.

findBestSplit source code:

Node selection attributes and split
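The full findBestSplit is an image in the original. The skeleton below sketches its three components in order, reusing the Node and SplitInfo sketches above; all helper methods are stubs standing in for code shown elsewhere:

```java
import java.util.List;

public class CartSketch {
    // Skeleton of findBestSplit: stop check, best-split search, then
    // recursive splitting of the two children.
    static void findBestSplit(Node node, List<Integer> nums, int[] isUsed) {
        // 1) Stop? Then turn the node into a leaf and record its output.
        if (shouldStop(nums, isUsed)) {
            makeLeaf(node, nums);
            return;
        }
        // 2) Try every unused attribute (one-value-vs-rest partitions for
        //    discrete ones, cut points for continuous ones) and keep the
        //    split with the smallest child impurity.
        SplitInfo best = searchBestSplit(nums, isUsed);
        applySplit(node, best);
        // 3) Recurse into the two children.
        if (node.children != null) {  // the stubs below don't build children
            int[] childUsed = markUsed(isUsed, best);
            findBestSplit(node.children.get(0), best.childRows.get(0), childUsed);
            findBestSplit(node.children.get(1), best.childRows.get(1), childUsed);
        }
    }

    // Stubs standing in for the pieces shown separately above.
    static boolean shouldStop(List<Integer> nums, int[] isUsed) { return nums.size() <= 1; }
    static void makeLeaf(Node node, List<Integer> nums) { }
    static SplitInfo searchBestSplit(List<Integer> nums, int[] isUsed) { return new SplitInfo(); }
    static void applySplit(Node node, SplitInfo s) { }
    static int[] markUsed(int[] isUsed, SplitInfo s) { return isUsed.clone(); }
}
```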

(4) Pruning

Cost complexity pruning method (CCP):

CCP cost complexity pruning
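The pruning code is also an image in the original. A sketch of one CCP pass as described in the pruning section, reusing the Node sketch above (nodeErrorsAsLeaf is a stub, since it depends on the data):

```java
import java.util.ArrayList;
import java.util.List;

public class CcpSketch {
    // One CCP pruning pass: find the internal node with the smallest
    // surface error rate gain and turn it into a leaf.
    static void pruneOnce(Node root, int totalData) {
        List<Node> internals = new ArrayList<>();
        collectInternal(root, internals);
        Node best = null;
        double bestAlpha = Double.MAX_VALUE;
        for (Node t : internals) {
            double rLeaf = (double) nodeErrorsAsLeaf(t) / totalData;  // R(t)
            double rSub = (double) t.leafWrong / totalData;           // R(T_t)
            double alpha = (rLeaf - rSub) / (t.leafNode_Count - 1);
            // Ties: prefer the node whose subtree has more leaves.
            if (alpha < bestAlpha ||
                (alpha == bestAlpha && best != null
                        && t.leafNode_Count > best.leafNode_Count)) {
                bestAlpha = alpha;
                best = t;
            }
        }
        if (best != null) best.children = null;  // prune: node becomes a leaf
    }

    static void collectInternal(Node n, List<Node> out) {
        if (n == null || n.children == null) return;
        out.add(n);
        for (Node c : n.children) collectInternal(c, out);
    }

    // Stub: misclassified samples at t if t were a leaf (depends on data).
    static int nodeErrorsAsLeaf(Node t) { return 0; }
}
```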

The complete core code of CART:

CART core code

To sum up:

(1) CART is a binary tree: each split produces two child nodes. Continuous data is handled with a method similar to C4.5's; for discrete data, the best one-value-versus-rest combination of the discrete values is chosen.

(2) CART can be either a classification tree or a regression tree. If it is a classification tree, the split that minimizes the post-split Gini value of the child nodes is selected; if it is a regression tree, the split that minimizes the total sample deviation of the two child nodes is selected.

(3) Like C4.5, CART requires pruning; it uses CCP (the cost-complexity pruning method).


Origin: blog.csdn.net/qq_41587243/article/details/87783966