Machine Learning (3): Decision Trees

1. What is a decision tree

1.1 The basic idea of decision trees

A picture is the easiest way to see the fundamental difference between the LR model and the decision tree model. Consider a decision problem: whether to go on a blind date. A girl's mother wants to introduce a potential partner to her daughter.

[Figure: a simple decision tree for the blind-date example]

Everyone can see it clearly: the LR model throws all the features into the learning process at once, while a decision tree is more like if-else in a programming language, making one conditional judgment after another. That is the fundamental difference.

1.2 The growth process of the "tree"

A decision tree makes decisions based on a "tree" structure, so we have to face two problems:

  • How does the "tree" grow?
  • When does the "tree" stop growing?

Once these two problems are understood, the model is basically established. The overall process of a decision tree follows the idea of "divide and conquer": it is a recursive process from the root to the leaves, and at each intermediate node it looks for a "split" attribute, which corresponds to a feature. Next, let's solve these two problems one by one.

When will this "tree" stop growing?

  • All samples at the current node belong to the same class, so no further split is needed. For example: every sample decides to go on the blind date, so they all belong to the same class; no matter how the features change, the result does not change, so there is no need to split.
  • The current attribute set is empty, or all samples take the same values on all remaining attributes, so the node cannot be split. For example: all the sample features are identical, which makes a split impossible; the training set is too uniform.
  • The sample set at the current node is empty, so it cannot be split. (A recursion skeleton with these three stopping cases is sketched below.)
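These three stopping conditions fit naturally into the recursive "divide and conquer" process mentioned above. Below is a minimal Python sketch, assuming samples are (feature_dict, label) pairs; the helper names (`build_tree`, `majority_class`) and the pluggable `choose_best_attribute` function are illustrative, not a specific library API.

```python
from collections import Counter

def majority_class(samples):
    """Most common label among (feature_dict, label) pairs."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def build_tree(samples, attributes, domains, choose_best_attribute):
    """Recursive divide-and-conquer tree construction.

    domains maps each attribute to its set of possible values;
    choose_best_attribute is the split criterion (see the sections below).
    """
    labels = {label for _, label in samples}

    # Stop 1: all samples at this node belong to the same class.
    if len(labels) == 1:
        return {"leaf": next(iter(labels))}

    # Stop 2: no attributes left, or samples identical on every remaining attribute.
    if not attributes or all(len({x[a] for x, _ in samples}) == 1 for a in attributes):
        return {"leaf": majority_class(samples)}

    # Otherwise: pick a split attribute and recurse on each of its values.
    best = choose_best_attribute(samples, attributes)
    node = {"split_on": best, "children": {}}
    for value in domains[best]:
        subset = [(x, y) for x, y in samples if x[best] == value]
        if not subset:
            # Stop 3: empty branch -> leaf labeled with the parent's majority class.
            node["children"][value] = {"leaf": majority_class(samples)}
        else:
            rest = [a for a in attributes if a != best]
            node["children"][value] = build_tree(subset, rest, domains, choose_best_attribute)
    return node
```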

1.3 How does the "tree" grow

In life we run into many decisions: where to eat, which gadget to buy, where to travel, and so on. You will find that these choices usually follow what most people choose, that is, they follow the popular choice. It is the same in a decision tree: when most of the samples at a node belong to the same class, the decision has essentially been made.

We can abstract this "choice of the majority" into a concept called purity: the more the samples agree, the higher the purity. Going a bit deeper, it comes down to one sentence: the lower the information entropy, the higher the purity. You have probably heard of "entropy" before; information entropy measures the amount of information contained in the data. If all samples have the same attributes, the data feels simple and undifferentiated; if the samples differ in their attributes, they carry much more information.

I got a headache when I got here, because the formula for information entropy is about to appear, but it is actually very simple:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

where $p_k$ is the proportion of samples of the $k$-th class in the current sample set $D$.
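As a concrete illustration, the entropy formula above takes only a few lines of Python (the `entropy` helper and the string labels are just for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure sample set has entropy 0; an evenly mixed binary set has entropy 1.
print(entropy(["date", "date", "date"]))      # 0.0
print(entropy(["date", "no", "date", "no"]))  # 1.0
```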

Information gain

Without further ado, here is the formula:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

where $D^v$ is the subset of samples in $D$ that take the $v$-th value of attribute $a$.

If you don't fully understand it, don't worry. In one sentence: information gain = the information entropy before the split minus the information entropy after the split. It measures how big a "step" the split takes toward higher purity.
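In code, that one-sentence description is literally a subtraction. The sketch below reuses the `entropy` helper from the previous snippet and keeps the same illustrative (feature_dict, label) representation:

```python
def information_gain(samples, attribute):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    n = len(samples)
    entropy_before = entropy([label for _, label in samples])
    entropy_after = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [label for x, label in samples if x[attribute] == value]
        entropy_after += len(subset) / n * entropy(subset)
    return entropy_before - entropy_after
```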

Well, with the knowledge above, we can let the "tree" start growing.

1.3.1 ID3 algorithm

Explanation: first compute the information entropy at the root node, then split by each attribute in turn and compute the entropy of the resulting child nodes. Information gain = root node entropy minus the weighted entropy of the attribute's child nodes. Sort the attributes in descending order of information gain; the one ranked first is the first split attribute, and so on. This gives the shape of the decision tree, that is, how it "grows".

If you don't understand, you can look at the example pictures I shared; combined with what was said above, you should be able to follow it. (A code sketch of the ID3 selection step follows the picture list below.)

  [Figures 1-4: step-by-step pictures of the ID3 worked example]
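A minimal sketch of the ID3 selection step just described, reusing `information_gain` from the snippet above (function names are illustrative): rank the candidate attributes by information gain and split on the top one.

```python
def id3_choose_attribute(samples, attributes):
    """ID3: sort candidate attributes by information gain, descending; take the first."""
    ranking = sorted(attributes,
                     key=lambda a: information_gain(samples, a),
                     reverse=True)
    return ranking[0]

# Plugged into the recursive skeleton from section 1.2, this grows an ID3 tree:
# tree = build_tree(samples, attributes, domains, id3_choose_attribute)
```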

However, information gain has a problem: it prefers attributes with many possible values. For example, think of an "ID number" column used as an attribute. To solve this problem, another algorithm, C4.5, was introduced.

1.3.2 C4.5

To fix this problem with information gain, the information gain ratio is introduced:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where:

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

The more possible values attribute $a$ has (i.e., the larger $V$ is), the larger $\mathrm{IV}(a)$ tends to be. **The essence of the information gain ratio: multiply the information gain by a penalty factor ($1/\mathrm{IV}(a)$). When the feature has many possible values the penalty factor is small; when it has few possible values the penalty factor is large.** However, there is a drawback:

  • Drawback: the information gain ratio is biased towards features with fewer values.

How the gain ratio is used: because of this drawback, C4.5 does not simply select the feature with the largest gain ratio. Instead, it first picks out the candidate features whose information gain is above average, and then chooses the one with the highest gain ratio among them.
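A sketch of the C4.5 criterion under the same assumptions as the earlier snippets: `intrinsic_value` is the IV(a) penalty term, and the selection heuristic first keeps the candidates whose information gain is above average, then picks the largest gain ratio among them.

```python
import math

def intrinsic_value(samples, attribute):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(samples)
    values = {x[attribute] for x, _ in samples}
    ratios = [sum(1 for x, _ in samples if x[attribute] == v) / n for v in values]
    return -sum(r * math.log2(r) for r in ratios)

def gain_ratio(samples, attribute):
    iv = intrinsic_value(samples, attribute)
    return information_gain(samples, attribute) / iv if iv > 0 else 0.0

def c45_choose_attribute(samples, attributes):
    """C4.5 heuristic: above-average information gain first, then highest gain ratio."""
    gains = {a: information_gain(samples, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(samples, a))
```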

1.3.3 CART algorithm

The mathematicians are really clever and thought of yet another way to express purity, called the Gini index (another annoying formula):

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

It represents the probability that two samples drawn at random from the data set belong to different classes. For example, if a bag contains balls of 3 colors and you reach in and take out 2 balls, Gini(D) is the probability that the two balls have different colors. The smaller Gini(D) is, the higher the purity of the data set D.
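Computing the Gini index is just as short; the ball-drawing illustration below uses the same illustrative label representation as before:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the chance that two independently drawn
    samples belong to different classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["red"] * 6))                    # 0.0   (perfectly pure)
print(gini(["red", "green", "blue"] * 2))   # 0.667 (evenly mixed, 3 classes)
```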

For example:

Suppose there is a feature "education" with three possible values: "Undergraduate", "Master", "Doctor".

When the feature "education" is used to split the sample set D, there are three possible split points and therefore three possible splits. The resulting subsets are as follows:

1. Split point "Undergraduate", resulting subsets: {Undergraduate}, {Master, Doctor}

2. Split point "Master", resulting subsets: {Master}, {Undergraduate, Doctor}

3. Split point "Doctor", resulting subsets: {Doctor}, {Undergraduate, Master}

For each of the above splits, we can compute the purity obtained by dividing the sample set D into two subsets according to whether the feature equals that split value:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

Therefore, for a feature with more than two possible values, we need to compute the Gini index $\mathrm{Gini}(D, A_i)$ of the split obtained by taking each value $A_i$ as the dividing point (where $A_i$ denotes a possible value of feature $A$).

Then, among all these candidate splits, find the one with the smallest Gini index $\mathrm{Gini}(D, A_i)$; its dividing point is the best dividing point for splitting sample set D on feature A. At this point the "tree" can grow into a big one.
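A sketch of the procedure just described for the "education" example, reusing the `gini` helper above (the sample representation is the same illustrative one as before): try each value as the split point, compute the weighted Gini index of the two subsets, and keep the smallest.

```python
def gini_index(samples, attribute, value):
    """Gini(D, A = value) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(samples)
    d1 = [label for x, label in samples if x[attribute] == value]
    d2 = [label for x, label in samples if x[attribute] != value]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_cart_split(samples, attribute):
    """Try every value of the feature as the split point; keep the smallest Gini index."""
    values = {x[attribute] for x, _ in samples}
    return min(values, key=lambda v: gini_index(samples, attribute, v))

# For the "education" feature this compares {Undergraduate} vs {Master, Doctor},
# {Master} vs {Undergraduate, Doctor}, and {Doctor} vs {Undergraduate, Master}.
```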

1.3.4 Three different decision trees

  • ID3: an attribute with many possible values makes it easier to obtain purer subsets, so its information gain is larger.

    The resulting tree tends to be wide and shallow: unreasonable.

  • C4.5: uses the information gain ratio instead of the information gain.

  • CART: uses the Gini index instead of entropy, minimizing impurity rather than maximizing information gain (a scikit-learn comparison of the criteria is sketched after this list).
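In practice these criteria are rarely hand-rolled. As a rough practical counterpart to the comparison above, scikit-learn's `DecisionTreeClassifier` (an optimized CART implementation) lets you switch the impurity measure through its `criterion` parameter; the iris dataset here is just a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# CART-style tree with Gini impurity (the scikit-learn default).
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Same growing procedure, but splits scored by information entropy instead.
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_tree.get_depth(), entropy_tree.get_depth())
```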

2. Why don't tree models need normalization?

Because numerical scaling does not change the position of any split point, it does not change the structure of the tree model: sorting the samples by a feature's value gives the same order before and after scaling, so the branches and split points stay the same. Moreover, tree models cannot use gradient descent anyway, because a tree (e.g., a regression tree) finds the optimum by searching for the best split point; the resulting model is a step function, which is not differentiable at the split points, so taking derivatives is meaningless. Hence normalization is unnecessary.

Since tree models (such as decision trees and random forests) do not need normalization, why do non-tree models such as AdaBoost, SVM, LR, KNN, and K-Means need it?

For a linear model, when the feature scales differ greatly, the loss contours are elongated ellipses and gradient descent needs many iterations to reach the optimum. After normalization, the contours become close to circles, so (stochastic) gradient descent heads much more directly toward the optimum and needs fewer iterations.
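A quick way to check the claim for trees: fit the same tree on raw and on standardized features and compare the predictions. This is only an illustrative experiment with scikit-learn, not a proof:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters; one tree sees raw features, the other standardized ones.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds move with the scaling, but the tree structure and
# the resulting predictions are unchanged.
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))  # True
```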

3. The difference between classification decision tree and regression decision tree

Classification And Regression Tree (CART) is a type of decision tree. The CART algorithm can be used to build both a classification tree and a regression tree; the two construction processes differ slightly.

Regression tree:

The CART regression tree is assumed to be a binary tree, built by repeatedly splitting on features. For example, if the current node splits on the $j$-th feature at value $s$, samples whose feature value is less than $s$ go to the left subtree and samples whose value is greater than $s$ go to the right subtree.

The CART regression tree essentially partitions the sample space along the feature dimensions, and optimizing this partition is an NP-hard problem, so heuristic methods are used to build the decision tree model. The objective a typical CART regression tree minimizes is the squared error within each region:

$$\sum_{x_i \in R_m} (y_i - f(x_i))^2$$

Therefore, to find the optimal splitting feature $j$ and the optimal splitting point $s$, we solve the following objective:

$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

So as long as we traverse all candidate cut points of all features, we can find the optimal splitting feature and splitting point, and finally obtain the regression tree.
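A brute-force sketch of that search in plain NumPy (illustrative and unoptimized): for each feature $j$ and each candidate cut point $s$, split into $R_1$ and $R_2$, use the mean of $y$ in each region as $c_1$ and $c_2$, and keep the pair with the smallest total squared error.

```python
import numpy as np

def best_regression_split(X, y):
    """Return the (feature index j, cut point s) minimizing the CART objective
    min_{c1} sum_{R1}(y - c1)^2 + min_{c2} sum_{R2}(y - c2)^2,
    where the optimal c1, c2 are simply the means of y in each region."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
print(best_regression_split(X, y))  # (0, 3.0): cut between the two clusters
```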

Reference article: Detailed explanation of classic algorithms-CART classification decision tree, regression tree and model tree

4. How to prune a decision tree

The basic pruning strategies of decision trees are Pre-Pruning and Post-Pruning.

  • Pre-pruning: the core idea is that before each node is actually split further, the validation set is used to check whether the split improves validation accuracy. If it does not, the node is marked as a leaf and splitting stops there; if it does, child nodes continue to be generated recursively.
  • Post-pruning: first generate a complete decision tree from the training set, then examine the non-leaf nodes from the bottom up. If replacing the subtree rooted at a node with a leaf improves generalization performance, the subtree is replaced with a leaf (see the scikit-learn sketch after this list).
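As a concrete post-pruning example (referenced in the list above), scikit-learn exposes minimal cost-complexity pruning through the `ccp_alpha` parameter: the full tree is grown first and then collapsed, with a validation split choosing how aggressively to prune. This is a specific post-pruning variant, not the exact reduced-error procedure described above; the dataset and split are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the complete tree, then list the candidate pruning strengths (alphas).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Post-pruning: keep the alpha whose pruned tree scores best on the validation set.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(pruned.get_n_leaves())
```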

Reference article: Decision Tree and Decision Tree Generation and Pruning


Source: www.cnblogs.com/dhName/p/12737856.html