Briefly describe the relationship between decision trees, random forests and XGBOOST

Original link: https://www.it610.com/article/1281962579127713792.htm


This article mainly covers decision trees, random forests and XGBoost, with a brief explanation of AdaBoost and GBDT.

1. Decision tree

All of these algorithms are built on the decision tree, or on various modified versions of it, so the decision tree must be understood clearly first. A decision tree is a common machine learning model. Its purpose is to learn simple decision rules (IF-THEN rules) from the feature attributes of the sample data and build a model that predicts the value of the target variable. Take the watermelon example: given features such as color, root and knocking sound, how do we judge whether a watermelon is a good one?

A decision tree for the watermelon problem

Through a series of judgments, the tree finally concludes that a watermelon with [color = green, root = curled, knock = muffled] is a good melon.

Generally speaking, a decision tree contains one root node, several internal nodes and several leaf nodes. The leaf nodes correspond to the decision results, and every other node corresponds to an attribute test. The key question is how to choose the best attribute to split on. Why is color tested first and not the root? This leads to the various splitting criteria below.

1.1 Information gain (ID3 algorithm)

The current decision tree algorithms mainly include ID3, C4.5 and CART. ID3 selects the splitting attribute according to the information gain criterion. We want the samples contained in each branch node of the decision tree to belong to the same class as far as possible, that is, the 'purity' of the nodes should keep increasing.

'Information entropy' is the most commonly used indicator to measure the purity of a sample set.

Assuming that the proportion of the k-th class in the current sample set D is p_{k} (k = 1, 2, ..., |y|), the information entropy of D is defined as:

Ent(D) = -\sum_{k=1}^{|y|}p_{k}\log_{2}p_{k}

Here |y| is the number of classes in D. For example, in the watermelon case the labels are good melon and bad melon, so |y| = 2.

Information entropy is an inherent property of a node, which is a fixed value for a certain data set .
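As a minimal sketch (not code from the original article), the entropy of a node can be computed directly from its class counts; a node with 8 positive and 9 negative samples, like the watermelon root node worked out below, comes out at about 0.998:

```python
import math

def entropy(class_counts):
    """Information entropy Ent(D) of a node, given the class counts inside it."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

# A node with 8 positive and 9 negative samples (the watermelon root node):
print(entropy([8, 9]))   # ~= 0.998
```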

With information entropy defined, we can define information gain. Suppose attribute a has V possible values; splitting on a therefore divides D into V branch node data sets. Since different branch nodes contain different numbers of samples, each branch is given a relative weight: the more samples it contains, the larger its weight. The 'information gain' obtained by splitting the sample set D on attribute a is then:

Gain(D, a) = Ent(D) -\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}Ent(D^{v})

<1> The greater the information gain, the greater the improvement in purity obtained by splitting on attribute a.

<2>Ent(D) is information entropy, which is an inherent property of this data set, and is a fixed value

 

<3> \frac{|D^{v}|}{|D|} is the proportion of samples that fall into branch node v; it acts as a weight, so the more samples a branch contains, the larger its influence. A small code sketch of this computation follows.
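As a rough illustration of the formula (a sketch, not from the article), the information gain can be computed from the class counts of the parent node and of each branch:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """Gain(D, a): parent entropy minus the size-weighted entropy of the branches.

    parent_counts: class counts of D, e.g. [8, 9]
    branch_counts: one list of class counts per value of attribute a,
                   e.g. for color: [[3, 3], [4, 2], [1, 4]]
    """
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# Splitting the watermelon set on 'color' (worked out by hand below) gives ~0.109:
print(information_gain([8, 9], [[3, 3], [4, 2], [1, 4]]))
```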

Let us work through an example, again using the watermelon data:

 Watermelon dataset

The watermelon dataset contains 17 samples in total, and clearly |y| = 2. Among them there are 8 positive examples and 9 negative examples. According to the formula above, the information entropy of the root node is:

Ent(D) = -\sum_{k=1}^{|y|}p_{k}\log_{2}p_{k} = -\left ( \frac{8}{17}\log_{2}\frac{8}{17} + \frac{9}{17}\log_{2}\frac{9}{17} \right ) = 0.998

Then we split on each candidate attribute. Taking color as an example, it has three values: green, dark and light. Denote the three resulting subsets D1{color = green}, D2{color = dark} and D3{color = light}. D1 contains 6 samples, 3 positive and 3 negative; D2 contains 6 samples, 4 positive and 2 negative; D3 contains 5 samples, 1 positive and 4 negative. The information entropies of the three branch nodes obtained by splitting on 'color' are:

Ent(D^{1}) = -\left ( \frac{3}{6}\log_{2}\frac{3}{6} + \frac{3}{6}\log_{2}\frac{3}{6} \right ) = 1.000

Ent(D^{2}) = -\left ( \frac{4}{6}\log_{2}\frac{4}{6} + \frac{2}{6}\log_{2}\frac{2}{6} \right ) = 0.918

Ent(D^{3}) = -\left ( \frac{1}{5}\log_{2}\frac{1}{5} + \frac{4}{5}\log_{2}\frac{4}{5} \right ) = 0.722

Therefore, the information gain for a = color is:

Gain(D, color) = 0.998 - \left ( \frac{6}{17}\times 1.000 + \frac{6}{17}\times 0.918 + \frac{5}{17}\times 0.722 \right ) = 0.109
That is, splitting on the color attribute yields an information gain of 0.109. Computed in the same way, the information gains of the remaining attributes are:

Gain(D, root) = 0.143, Gain(D, knock) = 0.141, Gain(D, texture) = 0.381, Gain(D, navel) = 0.289, Gain(D, touch) = 0.006.

Obviously, the information gain of 'texture' is the largest, so it is selected as the splitting attribute. The process then repeats on each branch, until the samples at the current node all belong to the same class.

However, the table above contains one attribute we ignored: the sample ID. If we split on the ID, the information gain is 0.998, but such a split is meaningless. This exposes the shortcoming of the information gain criterion: it prefers attributes with a large number of values. The gain ratio criterion explained below addresses this.

1.2 Gain ratio criterion (C4.5 algorithm)

To reduce the information gain criterion's preference for attributes with many values, C4.5 uses the gain ratio to select the optimal splitting attribute. The gain ratio is defined as:

Gain_ratio(D, a) = \frac{Gain(D, a)}{IV\left ( a \right )}

where: IV(a) = -\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}\log_{2}{\frac{|D^{v}|}{|D|}}

<1> IV(a) is called the 'intrinsic value' of attribute a. The more possible values attribute a has (the larger V is), the larger IV(a) tends to be. For example, IV(ID) = log2(17) ≈ 4.088 and IV(color) = 1.580.

<2> Compared with ID3, C4.5 divides the information gain by IV(a), which penalizes attributes with a large number of values.

<3> In the C4.5 algorithm, the attribute with the largest gain ratio is not selected directly. Instead, the candidate attributes whose information gain is above average are found first, and then the one with the highest gain ratio is selected from among them. A small code sketch of IV and the gain ratio follows.
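A hedged sketch of the same idea in code (the attribute names and counts are the watermelon ones used above):

```python
import math

def iv(branch_sizes):
    """Intrinsic value IV(a), computed from the number of samples in each branch."""
    n = sum(branch_sizes)
    return -sum(s / n * math.log2(s / n) for s in branch_sizes if s > 0)

def gain_ratio(gain, branch_sizes):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return gain / iv(branch_sizes)

print(iv([6, 6, 5]))                 # IV(color) ~= 1.580 (3 branches of sizes 6, 6, 5)
print(iv([1] * 17))                  # IV(ID)    ~= 4.088 (17 branches of one sample each)
print(gain_ratio(0.109, [6, 6, 5]))  # gain ratio of the color split
```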

1.3 Gini Index (CART Decision Tree)

In the CART decision tree, the Gini index is used to select attributes. First, the Gini value of data set D is defined:

Gini(D) = \sum_{k=1}^{|y|}\sum_{k'\neq k}p_{k}p_{k'} = 1 - \sum_{k=1}^{|y|}p_{k}^{2}

To put it vividly, the Gini value is the probability that two samples randomly drawn from D belong to different classes. With the Gini value in hand, the Gini index can be defined on top of it:

Gini_index(D, a) = \sum_{v=1}^{V}\frac{|D^V|}{|D|}Gini\left ( D^v \right )

Therefore, we choose the attribute that minimizes the Gini index after division as the optimal division attribute.
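A minimal sketch of the Gini criterion, mirroring the two formulas above (illustrative only):

```python
def gini(class_counts):
    """Gini value of a node: the chance that two randomly drawn samples differ in class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_index(branch_counts):
    """Gini index of a split: the size-weighted Gini value of its branches."""
    n = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / n * gini(b) for b in branch_counts)

# Splitting the watermelon set on 'color': branches with (3+, 3-), (4+, 2-), (1+, 4-)
print(gini_index([[3, 3], [4, 2], [1, 4]]))
# CART chooses the attribute with the smallest Gini index.
```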

The three main decision tree algorithms have now been explained. There are other techniques for refining decision trees, such as pruning, handling of continuous and missing values, and multivariate decision trees; they are not covered one by one here.

1.4 Pruning
Pruning is an important method for combating overfitting in decision trees. It mainly comes in the following two forms (a short code sketch follows after the list):

Pre-pruning: This strategy estimates each split before it is made. If the split cannot improve the generalization accuracy of the decision tree, the node is not split and is made a leaf. Generalization accuracy is measured by holding out part of the training data as a validation set and comparing its prediction accuracy before and after each candidate split.
Advantages: Reduce the risk of overfitting and reduce the time required for training.
Disadvantages: Pre-pruning is a greedy operation. Some splits may not improve accuracy immediately but enable later splits that do, so there is a risk of underfitting.
Post-pruning: This strategy first grows a complete decision tree and then prunes it. Visiting the internal nodes in the reverse order of a breadth-first traversal, an internal node is pruned (its subtree replaced by a leaf) if doing so improves generalization performance.
Advantages: Lower risk of overfitting and lower risk of underfitting; post-pruned trees usually generalize better than pre-pruned ones.
Disadvantage: The time cost is much larger.
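The article measures pruning by accuracy on a held-out split; as one practical stand-in (an assumption, not the article's exact procedure), scikit-learn exposes post-pruning through cost-complexity pruning and pre-pruning through growth limits such as max_depth:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # any labelled dataset works here
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Post-pruning: grow a full tree, then pick the pruning strength (ccp_alpha)
# that scores best on the held-out validation split.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
print(best_alpha, best_acc)

# Pre-pruning in spirit: stop the tree early by capping its depth.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(shallow.score(X_val, y_val))
```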

2. Ensemble learning

The tree-based algorithms derived from ensemble learning mainly include random forest, AdaBoost and XGBoost.

First, a word on ensemble learning: by combining multiple weak learners and voting on their results, an ensemble can usually obtain better performance than any single learner.

If the base learners are independent of each other, then as the number of learners increases, the error rate of the ensemble decreases exponentially and eventually approaches 0 (provided each base learner is better than random guessing, i.e. for binary classification its accuracy is above 50%). The mathematical derivation is not given here; see Zhou Zhihua's Watermelon Book if interested.
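To make the exponential decrease concrete, here is a small illustrative calculation (a toy under the same idealized independence assumption, with an assumed base error rate of 0.4 and combination by majority vote):

```python
from math import comb

def majority_vote_error(eps, T):
    """Error rate of a majority vote over T independent learners, each with error eps (T odd)."""
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k) for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 11, 25, 51):
    print(T, round(majority_vote_error(0.4, T), 4))   # shrinks quickly toward 0
```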

However, when facing the same problem, the individual learners are trained on the same samples and cannot be independent of each other! In other words, no matter how many decision trees we ensemble, we may well not achieve the ideal result described above.

Therefore, there are currently two main families of ensemble learning methods:

1. Boosting: there are strong dependencies between the individual learners, which are generated serially (a sequential method). Representative algorithms: AdaBoost, GBDT, XGBoost.

2. Bagging: there are no strong dependencies between the individual learners, which can be generated in parallel. Representative algorithm: random forest.

2.1 Bagging

We first introduce bagging; random forest is an extended variant of it. Since 'mutual independence' is impossible when all learners face the same problem, the idea of bagging is to make the individual learners differ from each other as much as possible. Given a data set of N samples, the bootstrap method is used: sample N times with replacement to form a training set of the same size (some samples appear more than once, others not at all). Repeating this T times gives T different training sets; a base learner is trained on each, and their outputs are combined by voting (classification) or averaging (regression).
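A minimal bagging sketch, assuming integer class labels and using decision trees as the base learner (scikit-learn also ships an off-the-shelf version as BaggingClassifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_estimators=10):
    """Train n_estimators trees, each on a bootstrap sample drawn with replacement."""
    n = len(X)
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)        # bootstrap: n indices, with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Combine the trees by majority vote (assumes integer class labels 0, 1, ...)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```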

2.1.1 Random Forest

Random forest is an extended variant of bagging. On top of bagging, it further introduces random attribute selection into the training process. Specifically, bagging only randomly samples the training examples, while a random forest, in addition to randomly sampling the examples, randomly selects k attributes out of the d available attributes at each split. It is generally recommended to take k = log2(d).

The advantages of random forest: simple, easy to implement, and low computational overhead.
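As a hedged, off-the-shelf illustration (the dataset is chosen only for the example), scikit-learn's RandomForestClassifier exposes the number of attributes considered per split through max_features, and max_features="log2" matches the k = log2(d) recommendation above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each split considers only log2(d) randomly chosen attributes, on top of bootstrap sampling.
forest = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```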

2.2 Boosting

From a bias-variance perspective, Boosting mainly focuses on reducing bias.

There are strong dependencies between the individual learners, which are generated serially. The general mechanism of this family of methods is: first train a base learner on the initial training set; then adjust the samples according to that learner's performance and train the next base learner on the adjusted samples; repeat until the number of base learners reaches a pre-specified number T; finally combine the T base learners by weight.

2.2.1 AdaBoost

1. AdaBoost changes the weights of the training data, i.e. the probability distribution over the samples. The idea is to focus on the samples that were misclassified: decrease the weights of the samples that were correctly classified in the previous round and increase the weights of those that were misclassified. A base learner, for example a decision tree, is then trained on the re-weighted data.

2. AdaBoost combines the base learners by weighted majority voting: weak classifiers with a small classification error rate receive a larger weight, and those with a large error rate receive a smaller weight. This is easy to understand: a weak classifier with high accuracy should have a bigger say in the final strong classifier.

Here is a brief description of the AdaBoost process.

Given a training set D with m samples, a number of rounds T, and a base learning algorithm R:

Adaboost process:

1. Initialize the sample weights: in the first round all weights are equal to 1/m; the weighted sample distribution is D1.

2.for t = 1...T:

3. Train a base learner h_t on the training set under the current distribution D_t.

4. Calculate the error rate of this learner: \varepsilon _t = P_{x \sim D_t}\left ( h_t(x) \neq f(x) \right )

5. If \varepsilon _t > 0.5, exit the loop

6. Update the weight of the classifier (not the sample weights): \alpha _t = \frac{1}{2}\ln\left ( \frac{1- \varepsilon _t}{ \varepsilon _t} \right )

7. Update the sample weights: D_{t+1}(x) = \frac{D_{t}(x)\exp\left ( -\alpha_{t}f(x)h_{t}(x) \right )}{Z_{t}}, where Z_t is a normalization factor that ensures the weights sum to 1.

8.end

Output:

The final output of AdaBoost is a weighted vote over all base learners: the smaller a base learner's error rate, the larger its weight. In each round, misclassified samples receive larger weights and correctly classified samples receive smaller weights, which changes the probability distribution over the samples; the re-weighted distribution is then used as the input of the next round.

Step 5 simply checks whether the current base learner meets the basic requirement of being better than random guessing.
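A minimal sketch of the loop above, assuming labels coded as -1/+1 and using depth-1 trees (decision stumps) as the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """Sketch of the AdaBoost loop above; y must be in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                       # step 1: uniform sample weights
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)   # step 3
        pred = h.predict(X)
        eps = D[pred != y].sum()                  # step 4: weighted error rate
        if eps > 0.5:                             # step 5: worse than random guessing
            break
        eps = max(eps, 1e-10)                     # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 6: weight of this classifier
        D = D * np.exp(-alpha * y * pred)         # step 7: re-weight the samples
        D = D / D.sum()                           #         normalize (this is Z_t)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Weighted vote of the base learners."""
    agg = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(agg)
```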

The specific formula derivation can refer to Zhou Zhihua's Watermelon Book.

2.2.2 GBDT

To be continued


Origin blog.csdn.net/ch206265/article/details/108324808