Tree models in machine learning

Tree models include ID3, C4.5, C5.0, OC1, CART, and others. The most widely used is the CART tree, and the decision tree model in sklearn is based on CART.

Before introducing the tree models, we first explain information entropy, information gain, and the Gini index.

Information entropy: entropy measures the uncertainty of things; the more uncertain something is, the greater its entropy.

Information gain: it measures how much the uncertainty about the class is reduced once the current feature is known. The larger the information gain, the more the uncertainty is reduced and the easier it is to determine the class.

The Gini coefficient has the same nature as information entropy: it measures the uncertainty of a random variable.

The Gini index Gini(D) represents the uncertainty of the set D, and the Gini index Gini(D, A) represents the uncertainty of D after it is partitioned by A = a. The larger the Gini index, the greater the uncertainty of the sample.

The Gini coefficient has the following characteristics (see: https://blog.csdn.net/yeziand01/article/details/80731078):

1) the fewer the categories, the lower the Gini coefficient;
2) for the same number of categories, the more concentrated the categories, the lower the Gini coefficient.
In short: fewer categories and higher category concentration give a lower Gini coefficient; more categories and lower category concentration give a higher Gini coefficient.
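As a minimal illustration of these definitions (a sketch of my own, not from the original post; the helper names are arbitrary), entropy and the Gini index of a label set can be computed like this:

```python
import numpy as np

def entropy(labels):
    """Information entropy H(D) = -sum(p_k * log2(p_k)) over the classes in D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index Gini(D) = 1 - sum(p_k^2); larger means more uncertainty."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A concentrated label set has lower entropy and lower Gini than a more mixed one.
print(entropy([1, 1, 1, 1, 0]), gini([1, 1, 1, 1, 0]))  # ~0.722, 0.32
print(entropy([1, 1, 0, 0, 1]), gini([1, 1, 0, 0, 1]))  # ~0.971, 0.48
```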

1. ID3

The ID3 algorithm uses information gain to decide which feature should be used to build the current node of the decision tree: the feature with the largest computed information gain is chosen for the current node.

A concrete example of computing the information entropy follows:

For example, suppose we have a set D of 15 samples whose output is 0 or 1; 9 of them output 1 and 6 output 0. There is a feature A that takes the values A1, A2, and A3. Among the samples with value A1, 3 output 1 and 2 output 0; among the samples with value A2, 2 output 1 and 3 output 0; among the samples with value A3, 4 output 1 and 1 outputs 0.
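A rough sketch of this calculation in Python (the numbers are taken from the example above; the helper name is just for illustration):

```python
import numpy as np

def H(counts):
    """Entropy of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

# Whole set D: 9 samples output 1, 6 samples output 0.
H_D = H([9, 6])                                                       # ~0.971

# Conditional entropy H(D|A):
# A1 -> (3 ones, 2 zeros), A2 -> (2 ones, 3 zeros), A3 -> (4 ones, 1 zero)
H_D_given_A = 5/15 * H([3, 2]) + 5/15 * H([2, 3]) + 5/15 * H([4, 1])  # ~0.888

gain = H_D - H_D_given_A   # ~0.083, the information gain ID3 would use for feature A
print(H_D, H_D_given_A, gain)
```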

 

Problems with ID3: First, it does not handle continuous features; values such as length or density are continuous and cannot be used by ID3, which greatly limits its applicability. Second, ID3 prefers the feature with the largest information gain when building a node, but under otherwise equal conditions a feature with many values has a larger information gain than a feature with few values. (Why information gain favors features with many values is actually simple. By definition, information gain measures how much the uncertainty of the output is reduced once the feature is known. A feature with more values generally reduces the uncertainty more, because with more values there are fewer samples per value, and the samples are more likely to be scattered across the different values, so the uncertainty within each value is smaller.) Third, ID3 does not handle missing values. Fourth, it does not address overfitting.

2. C4.5

C4.5 improves on the shortcomings of ID3. First, continuous values are handled. For example, for a feature with m continuous values, sort the m values, then take the midpoint between each pair of adjacent values as a candidate discrete split point, thereby converting the continuous feature into a discrete one. When the conditional entropy is computed for a candidate split value, samples less than or equal to that value form one branch and samples greater than it form the other. Moreover, after this computation the feature can still take part in the selection process when producing child nodes further down the tree.

An example of handling continuous values is given here (see: https://blog.csdn.net/u012328159/article/details/79396893).
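A minimal sketch of how such candidate split points could be generated (illustrative code, not C4.5's actual implementation; the feature values below are arbitrary examples):

```python
import numpy as np

def candidate_splits(values):
    """Sort the m continuous values and take midpoints of adjacent pairs as candidate split points."""
    v = np.unique(values)          # sorted unique values
    return (v[:-1] + v[1:]) / 2.0  # m-1 candidate thresholds

density = [0.243, 0.245, 0.343, 0.360, 0.437]   # arbitrary example values of a continuous feature
print(candidate_splits(density))
# Each threshold t splits the samples into {x <= t} and {x > t}; C4.5 evaluates the
# conditional entropy of every such binary split and keeps the best threshold.
# Unlike a used-up discrete feature, this continuous feature can be split again
# in descendant nodes with a different threshold.
```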

Second, C4.5 introduces the information gain ratio to decide which feature should be used to build the current node of the tree: the feature with the largest computed information gain ratio is chosen for the current node.

Information gain ratio = information gain / feature entropy

An example of computing the feature entropy follows:

Feature A has three values A1, A2, and A3. Among the samples for which A is not missing, 2 take the value A1, 3 take A2, and 4 take A3, so the corresponding weights are 2/9, 3/9, and 4/9. The feature entropy is therefore -2/9 * log(2/9) - 3/9 * log(3/9) - 4/9 * log(4/9).
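A sketch of the gain-ratio computation (the split_info helper below is what the text calls the "feature entropy"; the 0.083 gain is borrowed purely for illustration from the ID3 example earlier, which is an assumption on my part):

```python
import numpy as np

def split_info(branch_sizes):
    """Feature (intrinsic) entropy: entropy of how the samples split across the feature's values."""
    w = np.asarray(branch_sizes, dtype=float)
    w = w / w.sum()
    return -np.sum(w * np.log2(w))

# Feature A splits the 9 non-missing samples into groups of 2, 3 and 4.
feature_entropy = split_info([2, 3, 4])   # -2/9*log(2/9) - 3/9*log(3/9) - 4/9*log(4/9) ~ 1.53

# Gain ratio = information gain / feature entropy.
information_gain = 0.083                  # illustrative value, e.g. from the ID3 example above
gain_ratio = information_gain / feature_entropy
print(feature_entropy, gain_ratio)
```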

Third, C4.5 provides handling of missing values. Fourth, it provides pruning to deal with overfitting of the decision tree.

The biggest difference between C4.5's handling of continuous values and of discrete values is that a continuous feature, after being discretized via the pairwise midpoints, can still take part in the selection process for producing child nodes later on, whereas a discrete feature cannot be reused.

Disadvantages of C4.5:

Because C4.5 still uses the entropy model, it involves many time-consuming log operations, and if there are continuous values it also requires many costly sorting operations.

C4.5 can only be used for classification, and the multi-way trees it builds make the model more complex.

3. CART

CART can be used for both classification and regression, and it is the decision tree method used by sklearn.

An example of computing the Gini coefficient:
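The original example figure is not reproduced here, but as an illustration (reusing the 15-sample set from the ID3 example above, which is an assumption on my part), Gini(D) and Gini(D, A=A1) could be computed like this:

```python
import numpy as np

def gini_counts(counts):
    """Gini(D) = 1 - sum(p_k^2) for a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

# Reusing the 15-sample set above: 9 samples of class 1, 6 of class 0.
gini_D = gini_counts([9, 6])                                          # 0.48

# Gini(D, A=A1): split D into the 5 samples with A=A1 (3 ones, 2 zeros)
# and the 10 remaining samples (6 ones, 4 zeros), then take the weighted sum.
gini_D_A1 = 5/15 * gini_counts([3, 2]) + 10/15 * gini_counts([6, 4])  # 0.48
print(gini_D, gini_D_A1)
```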

CART classification tree

When handling continuous values, the CART classification tree follows the same idea as C4.5, converting the continuous values into discrete ones; but when deciding which feature should be used to build the current node, it selects by the Gini coefficient (the Gini coefficient represents the impurity of the model). The smaller the Gini coefficient, the more the split on that feature reduces the uncertainty of the samples, and the better the feature.

The CART classification tree also handles discrete values differently from C4.5: it uses the idea of splitting a discrete feature in two without exhausting it. Recall that in ID3 or C4.5, if a feature A with the three categories A1, A2, and A3 is selected at a node, a three-way split is created, which makes the decision tree a multi-way tree. The CART classification tree uses a different method: in this example it considers splitting A into {A1} and {A2, A3}, {A2} and {A1, A3}, or {A3} and {A1, A2}, and finds the combination with the smallest Gini coefficient, say {A2} and {A1, A3}. The binary node then has one child for the samples with value A2 and the other child for the samples with values {A1, A3}. Because the values of A are not fully separated, we still have the chance to select feature A at a descendant node to further split A1 from A3. This differs from ID3 and C4.5, where within one subtree a discrete feature is used to build a node only once.
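A sketch of that enumeration, again on the assumed 15-sample example (the partition with the smallest weighted Gini would be chosen):

```python
import numpy as np

def gini_counts(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# Class counts (ones, zeros) per value of feature A, as in the example above.
counts = {"A1": (3, 2), "A2": (2, 3), "A3": (4, 1)}
n = 15

# Enumerate the binary partitions {A1}|{A2,A3}, {A2}|{A1,A3}, {A3}|{A1,A2}.
for left in ("A1", "A2", "A3"):
    right = [k for k in counts if k != left]
    left_c = counts[left]
    right_c = tuple(sum(c[i] for k, c in counts.items() if k != left) for i in range(2))
    g = (sum(left_c) / n) * gini_counts(left_c) + (sum(right_c) / n) * gini_counts(right_c)
    print(f"{{{left}}} vs {right}: Gini = {g:.3f}")
# The split with the smallest Gini becomes the binary node; since the other two values
# stay together on one side, feature A can be split again further down that branch.
```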

In summary, the differences between the CART classification tree and C4.5 are that C4.5 uses the information gain ratio as the splitting criterion when dividing a node while CART uses the Gini coefficient, and that C4.5 builds multi-way trees while CART builds binary trees through repeated binary splits.

CART's handling of discrete values and continuous values:

For continuous values, the CART classification tree follows the same idea as C4.5: the continuous feature is discretized. The only difference is the metric used to choose the split point: C4.5 uses the information gain ratio, while the CART classification tree uses the Gini coefficient.

For discrete values, the CART classification tree uses the binary-split idea already described above.

CART regression tree

How CART solves the regression problem:

First of all, we need to understand what a regression tree is and what a classification tree is. The difference lies in the output of the samples: if the sample output is a discrete value, it is a classification tree; if the sample output is a continuous value, it is a regression tree.

Apart from this difference in concept, the CART regression tree differs from the CART classification tree in building the tree and making predictions mainly in the following two points:

1) different handling of continuous values;

2) different way of making predictions after the decision tree is built.

For continuous values, we know that the CART classification tree uses the Gini coefficient to measure how good each candidate split point of a feature is. That is suitable for classification models, but for regression models the common metric is the variance. The objective of the CART regression tree is: for any feature A and any split point s, split the data into sets D1 and D2, compute the variance of D1 and of D2, and choose the feature and split point for which the sum of the variances of D1 and D2 is minimal.

A concrete illustration: if a CART regression tree has 4 variables and each variable has 10 (continuous) values, then for each variable the mean squared error at every candidate split point must be computed when choosing the best split, which in fact requires 4 * 9 = 36 calculations to compare.
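A rough sketch of that split search (illustrative code, not sklearn's implementation; the random data below is just a placeholder): for each feature and each candidate threshold, split the samples into D1 and D2 and keep the split minimizing the total squared error.

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (the variance criterion, unscaled)."""
    return float(np.sum((y - y.mean()) ** 2))

def best_regression_split(X, y):
    """For every feature j and every midpoint threshold s, split into D1 = {x_j <= s}
    and D2 = {x_j > s}; return the (j, s) minimizing sse(D1) + sse(D2)."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        v = np.unique(X[:, j])
        for s in (v[:-1] + v[1:]) / 2.0:            # midpoints between adjacent values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            err = sse(left) + sse(right)
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err

# With 4 continuous features of 10 distinct values each, this evaluates 4 * 9 = 36 splits.
rng = np.random.default_rng(0)
X = rng.random((10, 4))
y = rng.random(10)
print(best_regression_split(X, y))
# The leaf of a fitted regression tree then predicts the mean (or median) of its samples.
```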

As for making predictions after the tree is built, the CART classification tree mentioned above predicts, for samples falling into a leaf node, the category with the highest probability in that leaf. The regression tree does not output a category; it uses the mean or median of the samples in the leaf as the final prediction.

Apart from the points above, the CART regression tree is built and used for prediction in the same way as the CART classification tree.

CART tree pruning strategy:

The pruning strategies of the CART classification tree and the CART regression tree are exactly the same, except that the regression tree uses variance as the loss measure while the classification tree uses the Gini coefficient; the basic algorithm is identical, and that algorithm is what we focus on here.

Because decision tree algorithms very easily overfit the training set, leading to poor generalization, we need to prune the CART tree. This is analogous to regularization in linear regression and increases the generalization ability of the decision tree. But there are many possible prunings, so which should we choose? CART uses a post-pruning approach: first generate the decision tree, then generate all possible pruned CART subtrees, use cross-validation to evaluate the effect of each pruning, and select the pruning strategy with the best generalization.

In other words, the CART pruning algorithm can be summarized in two steps: the first step is to generate the sequence of pruned subtrees from the original decision tree, and the second step is to test the predictive power of the pruned subtrees by cross-validation and select the subtree with the best generalization as the final CART tree.
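This two-step procedure can be sketched with sklearn's cost-complexity pruning API (the dataset and parameter choices below are just placeholders, not from the original post): first obtain the sequence of pruned subtrees via the effective alphas, then pick the alpha with the best cross-validated score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: grow a full tree and get the sequence of effective alphas; each alpha
# corresponds to one pruned subtree of the original tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Step 2: cross-validate each pruned subtree and keep the alpha that generalizes best.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in ccp_alphas]
best_alpha = ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, max(scores))
```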

CART disadvantages:

1) You may have noticed that whether it is ID3, C4.5, or CART, feature selection chooses a single optimal feature to make the classification decision. But in most cases the classification decision should not be made by one feature alone; it should be decided by a group of features, which yields a more accurate decision tree. Such a tree is called a multivariate decision tree (multi-variate decision tree). When selecting the optimal feature, a multivariate decision tree does not pick a single best feature but an optimal linear combination of features to make the decision. A representative algorithm is OC1, which is not introduced here.

2) If the samples change a little (this refers to fluctuations in the data distribution; decision trees tolerate outliers well, and a few outlier feature values do not greatly affect the final prediction), the tree structure can change drastically. This can be addressed by ensemble learning methods such as random forests.

 
