Three methods of the decision tree classification algorithm

1. A basic introduction to classification

  The problem of classification has been part of our lives since ancient times. Classification is an important branch of data mining, and it has a wide range of applications, such as medical disease identification, spam email filtering, spam SMS interception, customer analysis, and so on. Classification problems can be divided into two categories:

  •   Categorization: Categorization refers to the classification of discrete data, such as judging whether a person is male or female based on that person's handwriting. There are only two categories here, and they form the discrete set {male, female}.
  •   Prediction: Prediction refers to the classification of continuous data, such as predicting the humidity of the weather at 8 o'clock tomorrow. Humidity changes continuously over time, and the humidity at 8 o'clock is a specific value that does not belong to a finite set. Prediction is also called regression analysis and is widely used in the financial field.

  Although the processing methods for discrete data and continuous data differ, the two can be transformed into each other. For example, instead of deciding male or female directly, we can compare some feature value against a threshold: if the value is greater than 0.5 it is judged male, and if it is less than or equal to 0.5 it is judged female, which turns the discrete problem into a continuous one; conversely, segmenting humidity values into ranges turns continuous data into discrete data.
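  As a toy illustration of the conversion in both directions, the sketch below thresholds a continuous score into a discrete label and bins a continuous humidity value into discrete categories; the 0.5 threshold and the bin edges are illustrative assumptions, not values taken from the article.

```python
# A minimal sketch of the two conversions described above. The threshold
# (0.5) and the humidity bin edges are illustrative assumptions.

def score_to_class(score, threshold=0.5):
    """Discretize a continuous score into one of two labels."""
    return "male" if score > threshold else "female"

def humidity_to_bin(humidity):
    """Discretize a continuous humidity percentage into coarse bins."""
    if humidity < 40:
        return "dry"
    elif humidity < 70:
        return "moderate"
    return "humid"

print(score_to_class(0.73))   # -> male
print(humidity_to_bin(55.2))  # -> moderate
```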

  Data classification is divided into two steps:

  1. Construct a model and use the training data set to train the classifier;
  2. Use the built classifier model to classify the test data.

  A good classifier has good generalization ability; that is, it not only achieves high accuracy on the training data set, but also achieves high accuracy on test data it has never seen before. If a classifier performs well only on the training data but poorly on the test data, the classifier has over-fitted: it has merely memorized the training data and has not captured the characteristics of the entire data space.
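  As a quick way to see over-fitting in practice, the sketch below compares training and test accuracy of a decision tree using scikit-learn; the library, dataset, and parameters are assumptions for illustration and are not part of the original article.

```python
# Compare training vs. test accuracy to check generalization.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))
# A large gap between the two numbers is a sign of over-fitting.
```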

 

2. Decision tree classification

The decision tree algorithm performs classification by means of the branching structure of a tree. The figure below is an example of a decision tree: an internal node of the tree represents a test on some attribute, the branches of the node are the corresponding test outcomes, and a leaf node represents a class label.

[Figure: a decision tree for predicting whether a person will buy a computer, with age tested at the root]

  The figure above shows a decision tree that predicts whether a person will buy a computer. Using this tree, we can classify new records. Starting from the root node (age): if a person is middle-aged, we immediately conclude that the person will buy a computer; if the person is a teenager, we further test whether he or she is a student; if the person is elderly, we further test his or her credit rating, and so on until a leaf node determines the class of the record.
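  To make the walk from the root to a leaf concrete, here is a small sketch that encodes the tree from the figure as nested dictionaries and classifies one record; the dictionary layout and attribute values are assumptions chosen to match the example, not code from the article.

```python
# Classify a record by walking a hand-built decision tree.
# Internal nodes test one attribute; leaves hold the class label.
tree = {
    "age": {
        "teens":       {"student": {"yes": "buys", "no": "does not buy"}},
        "middle aged": "buys",
        "elderly":     {"credit": {"general": "buys", "good": "does not buy"}},
    }
}

def classify(node, record):
    if not isinstance(node, dict):
        return node                      # reached a leaf: return the label
    attribute = next(iter(node))         # attribute tested at this node
    branch = node[attribute][record[attribute]]
    return classify(branch, record)

print(classify(tree, {"age": "teens", "student": "yes"}))  # -> buys
```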

  One advantage of the decision tree algorithm is that it generates rules that people can understand directly, something that algorithms such as Bayesian classifiers and neural networks do not offer. The accuracy of decision trees is also relatively high, and no background knowledge about the data is required, so it is a very effective classification algorithm. There are many variants of the decision tree algorithm, including ID3, C4.5, C5.0, and CART, but the basic idea is the same. Let's take a look at the basic procedure of the decision tree algorithm:

  • Algorithm: GenerateDecisionTree(D, attributeList) generates a decision tree from the training data set D.
  • Input:
    • The data record set D, containing class-labeled training records;
    • The attribute list attributeList, the set of candidate attributes used for the tests at internal nodes;
    • The attribute selection method AttributeSelectionMethod(), which chooses the best split attribute.
  • Output: a decision tree.
  • Process:
    1. Construct a node N;
    2. If all records in D have the same class label (denote it class C):
      • Then mark node N as a leaf node with label C, and return node N;
    3. If the attribute list is empty:
      • Then mark node N as a leaf node labeled with the majority class in D, and return node N;
    4. Call AttributeSelectionMethod(D, attributeList) to select the best split criterion splitCriterion;
    5. Mark node N with the best split criterion splitCriterion;
    6. If the split attribute is discrete and the decision tree is allowed to split into multiple branches:
      • Remove the split attribute from the attribute list: attributeList -= splitAttribute;
    7. For each value j of the split attribute:
      • Let Dj be the set of records in D whose split attribute takes value j;
      • If Dj is empty:
        • Then create a new leaf node F, label it with the majority class in D, and attach node F under N;
      • Otherwise:
        • Recursively call GenerateDecisionTree(Dj, attributeList) to obtain a subtree node Nj, and attach Nj under N;
    8. Return node N;

   Steps 1, 2, and 3 of the algorithm are straightforward. The attribute selection function in step 4 will be introduced later; for now it is enough to know that it finds a criterion that makes the classes in the subtrees produced by the test node as pure as possible, where "pure" means containing only one class label. Step 5 sets the test expression of node N according to the split criterion. In step 6, when constructing a multi-way decision tree, a discrete attribute may be used only once within node N and its subtrees, so it is deleted from the list of available attributes after use. For example, in the earlier figure, the attribute selection function determines that the best split attribute is age; age has three values, each corresponding to one branch, and the age attribute is not used again afterwards. The time complexity of the algorithm is O(k*|D|*log(|D|)), where k is the number of attributes and |D| is the number of records in the data set D.
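  A compact Python sketch of the procedure above is given below. It assumes discrete attributes, represents the tree as nested dictionaries, and leaves AttributeSelectionMethod as a pluggable function; the function and variable names are illustrative, not from the article.

```python
# A minimal sketch of GenerateDecisionTree for discrete attributes.
from collections import Counter

def majority_class(D, target):
    """Most frequent class label in D (used in steps 3 and 7)."""
    return Counter(rec[target] for rec in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, target, select_attribute):
    labels = {rec[target] for rec in D}
    if len(labels) == 1:                          # step 2: pure node
        return labels.pop()
    if not attribute_list:                        # step 3: no attributes left
        return majority_class(D, target)
    best = select_attribute(D, attribute_list, target)      # step 4
    remaining = [a for a in attribute_list if a != best]    # step 6
    node = {best: {}}                             # step 5: test node for `best`
    for value in {rec[best] for rec in D}:        # step 7: one branch per value
        Dj = [rec for rec in D if rec[best] == value]
        node[best][value] = generate_decision_tree(
            Dj, remaining, target, select_attribute)
    return node                                   # step 8
```

  Here D is a list of dictionaries mapping attribute names to values and target is the name of the class-label column; since the loop only visits values that actually occur in D, the empty-partition case of step 7 never arises in this sketch.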

3. Attribute selection methods

  The attribute selection method always chooses the best attribute as the split attribute, that is, the attribute that makes the records in each branch as pure as possible. It ranks the attributes in the attribute list according to some criterion in order to select the best one. There are many attribute selection methods; here I introduce three commonly used ones: information gain, gain ratio, and the Gini index.

  • Information gain

  Information gain is based on Shannon's information theory. The attribute R it selects has the following property: the information gain obtained by splitting on R is larger than that of any other attribute. The amount of information before splitting is defined as follows:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  where m is the number of classes C_i in the data set D, and p_i is the probability that an arbitrary record in D belongs to class C_i, computed as p_i = (number of records of class C_i in D) / |D|. Info(D) represents the amount of information needed to separate the different classes of the data set D.

  If you know information theory, you will recognize that the quantity Info above is the entropy of information theory. Entropy is a measure of uncertainty: the more uncertain the class of a data set is, the larger its entropy. For example, throw a cube A into the air and let f1 denote the face that lands facing down; f1 takes values in {1, 2, 3, 4, 5, 6}, so entropy(f1) = -(1/6·log2(1/6) + ... + 1/6·log2(1/6)) = -log2(1/6) ≈ 2.58. Now replace cube A with a regular tetrahedron B and let f2 denote the face that lands on the ground; f2 takes values in {1, 2, 3, 4}, so entropy(f2) = -(1/4·log2(1/4) + 1/4·log2(1/4) + 1/4·log2(1/4) + 1/4·log2(1/4)) = -log2(1/4) = 2. If we instead use a ball C and let f3 denote the "face" that touches the ground, then no matter how we throw it the result is always the same, that is, f3 takes only the value {1}, so entropy(f3) = -1·log2(1) = 0. We can see that the more faces there are, the larger the entropy; for the ball, with only one face, the entropy is 0, meaning the uncertainty is 0: the face that lands down is certain.
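  The three entropy values in this example are easy to verify with a few lines of code (a quick check using base-2 logarithms):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/6] * 6))  # cube:        ~2.585
print(entropy([1/4] * 4))  # tetrahedron:  2.0
print(entropy([1.0]))      # ball:         0.0
```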

  With this basic understanding of entropy, let's return to information gain. Suppose we choose attribute R as the split attribute. In data set D, R has k different values {V1, V2, ..., Vk}, so D can be divided into k groups {D1, D2, ..., Dk} according to the value of R. After splitting on R, the amount of information needed to separate the different classes of the data set D is:

  Info_R(D) = \sum_{j=1}^{k} \frac{|D_j|}{|D|} Info(D_j)

  Information gain is defined as the difference between the two amounts of information before and after splitting:

  Gain(R) = Info(D) - Info_R(D)

  The information gain Gain(R) measures the amount of information that attribute R contributes to the classification. We look for the attribute with the largest Gain so that the resulting partition is as pure as possible, that is, so that the different classes are separated as well as possible. Note, however, that Info(D) is the same for every attribute, so maximizing Gain is equivalent to minimizing Info_R(D). Info(D) is introduced here only to explain the underlying principle; we do not need to compute it during implementation. As an example, consider the following data set D:

| Record ID | Age         | Income | Student | Credit rating | Buys computer |
|-----------|-------------|--------|---------|---------------|---------------|
| 1         | teens       | high   | no      | general       | no            |
| 2         | teens       | high   | no      | good          | no            |
| 3         | middle aged | high   | no      | general       | yes           |
| 4         | elderly     | medium | no      | general       | yes           |
| 5         | elderly     | low    | yes     | general       | yes           |
| 6         | elderly     | low    | yes     | good          | no            |
| 7         | middle aged | low    | yes     | good          | yes           |
| 8         | teens       | medium | no      | general       | no            |
| 9         | teens       | low    | yes     | general       | yes           |
| 10        | elderly     | medium | yes     | general       | yes           |
| 11        | teens       | medium | yes     | good          | yes           |
| 12        | middle aged | medium | no      | good          | yes           |
| 13        | middle aged | high   | yes     | general       | yes           |
| 14        | elderly     | medium | no      | good          | no            |

  This data set determines whether a person will buy a computer based on his or her age, income level, whether he or she is a student, and credit rating; that is, the last column, "Buys computer", is the class label. Now we use information gain to select the best split attribute, first calculating the amount of information after splitting by age:

  Info_age(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.694

  The formula consists of three terms. The first term is for teenagers: 5 of the 14 records are teenagers, of which 2 (2/5) buy a computer and 3 (3/5) do not. The second term is for the middle-aged group and the third for the elderly. Computing the same quantity for the other attributes gives:

  Info_income(D) ≈ 0.911,  Info_student(D) ≈ 0.788,  Info_credit(D) ≈ 0.892

  We can conclude that Info_age(D) is the smallest, that is, splitting by age yields the purest partition. Age therefore becomes the test attribute of the root node, and the data set is divided into three branches for teenagers, the middle-aged, and the elderly.

  Note that once the age attribute has been used, it is no longer needed in subsequent steps; that is, age is deleted from attributeList. The decision subtrees corresponding to D1, D2, and D3 are then constructed in the same way. The ID3 algorithm selects attributes using exactly this information gain method.
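  The numbers above can be reproduced directly from the table; the sketch below transcribes the 14 records and computes Info_R(D) for each attribute (the tuple encoding of the records is an assumption for illustration):

```python
import math
from collections import Counter

# (age, income, student, credit, buys) for the 14 records in the table.
records = [
    ("teens", "high", "no", "general", "no"),          ("teens", "high", "no", "good", "no"),
    ("middle aged", "high", "no", "general", "yes"),   ("elderly", "medium", "no", "general", "yes"),
    ("elderly", "low", "yes", "general", "yes"),       ("elderly", "low", "yes", "good", "no"),
    ("middle aged", "low", "yes", "good", "yes"),      ("teens", "medium", "no", "general", "no"),
    ("teens", "low", "yes", "general", "yes"),         ("elderly", "medium", "yes", "general", "yes"),
    ("teens", "medium", "yes", "good", "yes"),         ("middle aged", "medium", "no", "good", "yes"),
    ("middle aged", "high", "yes", "general", "yes"),  ("elderly", "medium", "no", "good", "no"),
]
attributes = ["age", "income", "student", "credit"]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_after_split(records, attr_index, label_index=4):
    """Info_R(D): weighted entropy of the partitions induced by one attribute."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[attr_index], []).append(rec[label_index])
    n = len(records)
    return sum(len(g) / n * entropy(g) for g in groups.values())

for i, name in enumerate(attributes):
    print(name, round(info_after_split(records, i), 3))
# age 0.694, income 0.911, student 0.788, credit 0.892 -> age is smallest
```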

  • Gain ratio

  The information gain criterion has a major flaw: it tends to favor attributes with many distinct values. Suppose we add a name attribute to the data set above and assume that each of the 14 records has a different name. Information gain would then choose name as the best attribute, because after splitting by name each group contains only one record, and each record belongs to only one class (either buys a computer or not), so the purity is highest and the node that tests name has 14 branches below it. But such a split is meaningless: it has no generalization ability at all. The gain ratio improves on this by introducing the split information:

  SplitInfo_R(D) = -\sum_{j=1}^{k} \frac{|D_j|}{|D|} \log_2\frac{|D_j|}{|D|}

  The gain ratio is defined as the ratio of information gain to split information:

  GainRatio(R) = \frac{Gain(R)}{SplitInfo_R(D)}

  We choose the attribute with the largest GainRatio as the best split attribute. If an attribute has many values, SplitInfo_R(D) will be large, which makes GainRatio(R) small. However, the gain ratio also has a drawback: SplitInfo_R(D) can be 0, in which case the ratio is undefined, and when SplitInfo_R(D) approaches 0 the value of GainRatio(R) becomes unreliable. A remedy is to add a smoothing term to the denominator, here the average of all the split information values.
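  Reusing entropy(), info_after_split(), and the records list from the previous snippet, the gain ratio can be computed as below; the optional smoothing argument follows the averaging remedy described above, and its exact form is an assumption.

```python
import math
from collections import Counter

def split_info(records, attr_index):
    """SplitInfo_R(D): entropy of the partition sizes themselves."""
    counts = Counter(rec[attr_index] for rec in records)
    n = len(records)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(records, attr_index, label_index=4, smoothing=0.0):
    info_d = entropy([rec[label_index] for rec in records])       # Info(D)
    gain = info_d - info_after_split(records, attr_index, label_index)
    return gain / (split_info(records, attr_index) + smoothing)

# Example: gain ratio of the "age" attribute (index 0) in the table above.
print(round(gain_ratio(records, 0), 3))  # ~0.156
```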

  • Gini index

  The Gini index is another measure of the impurity of data, defined as follows:

  Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

  Here m again denotes the number of classes C_i in the data set D, and p_i denotes the probability that an arbitrary record in D belongs to class C_i, computed as p_i = (number of records of class C_i in D) / |D|. If all records belong to the same class, then p_1 = 1 and Gini(D) = 0, the lowest possible impurity. This is the Gini index before splitting, computed from the class labels of D.

In the CART (Classification and Regression Tree) algorithm, the Gini index is used to construct a binary decision tree. For each attribute, the non-empty proper subsets of its values are enumerated; the Gini index after splitting on attribute R is:

Gini_R(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

This is the Gini index after splitting; different binary partitions of the attribute's values give different Gini index values.

  D1 is a non-empty proper subset of D and D2 is its complement in D, that is, D1 + D2 = D. For attribute R there are multiple such proper subsets, so Gini_R(D) has multiple possible values, and we choose the smallest one as the Gini index of R. Finally:

  ΔGini(R) = Gini(D) - Gini_R(D)

  We take the attribute with the largest reduction ΔGini(R) as the best split attribute.
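  Continuing with the records list from the information-gain snippet, the sketch below enumerates the binary partitions of one attribute's values and keeps the smallest Gini_R(D), which is the essence of how CART evaluates a discrete attribute (a sketch of the idea, not a full CART implementation):

```python
from itertools import combinations
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_gini(records, attr_index, label_index=4):
    values = sorted({rec[attr_index] for rec in records})
    best = None
    # Enumerate non-empty proper subsets up to half the values
    # (every binary partition {D1, D2} is covered at least once).
    for r in range(1, len(values) // 2 + 1):
        for subset in combinations(values, r):
            left = [rec[label_index] for rec in records if rec[attr_index] in subset]
            right = [rec[label_index] for rec in records if rec[attr_index] not in subset]
            n = len(records)
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best

# Example: the best binary split of the "age" attribute (index 0).
print(best_binary_gini(records, 0))  # -> (~0.357, {'middle aged'})
```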

4. Summary of the formulas

1. Information gain

Information entropy before splitting:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information entropy after splitting:

Info_R(D) = \sum_{j=1}^{k} \frac{|D_j|}{|D|} Info(D_j)

Information gain is defined as the difference between the two amounts of information before and after splitting:

Gain(R) = Info(D) - Info_R(D)

2. Gain ratio

Split information:

SplitInfo_R(D) = -\sum_{j=1}^{k} \frac{|D_j|}{|D|} \log_2\frac{|D_j|}{|D|}

The gain ratio is defined as the ratio of information gain to split information:

GainRatio(R) = \frac{Gain(R)}{SplitInfo_R(D)}

3. Gini Index

Gini index before the split:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

The Gini index after splitting on attribute R is:

Gini_R(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

The attribute with the largest increment ΔGini(R) = Gini(D) - Gini_R(D) is taken as the best split attribute.
