Data mining: model selection - Tree Model

Decision Tree

A decision tree is a nonparametric supervised learning method: it summarizes decision rules from a set of features and labels and presents those rules as a tree diagram, and it can be used for both classification and regression problems.
As an intuitive example, consider the data table below, which records the conditions used to decide whether to go out and play.
[Figure: data table for the play / don't-play example]
Given this past data, we want to decide which situation a new record falls into and draw a conclusion. But reading the raw table directly feels particularly cumbersome, so people use a decision tree to handle this: its if-then structure is intuitive and easy to understand.
[Figure: decision tree built from the play / don't-play table]
The figure below introduces the node-related concepts of a decision tree.
[Figure: decision tree node terminology]
Decision tree learning typically comprises three steps: feature selection, decision tree generation, and pruning. Feature selection and pruning are the core issues.

Feature Selection

Impurity

At each node, a decision tree needs to find the best feature and the best way to branch, and the indicator used to measure "best" is called impurity, which is computed on the leaf nodes.
[Figure: impurity is evaluated on the leaf (child) nodes]
Because a classification tree decides at each leaf by majority vote, if one class accounts for 90% of a leaf's samples we predict that class and the probability of misclassifying a sample there is very low; if one class accounts for 51% and the other for 49%, the probability of misclassification is high. Therefore, the lower the impurity, the better the decision tree fits the training set. How, then, do we measure impurity? This is where entropy comes in.

Entropy and Gini index

Entropy is a measure of information, indicating how disordered it is: the more ordered the information, the lower the entropy. The entropy formula is given below, where t denotes a node of the decision tree and p(i|t) denotes the proportion of samples at node t that belong to class i; the higher this proportion, the purer the node.

Entropy(t) = -\sum_{i} p(i \mid t) \log_2 p(i \mid t)

Another indicator is the Gini index, used mainly to measure impurity in CART decision trees: the larger the Gini index, the greater the uncertainty in the sample set. Its formula is:

Gini(t) = 1 - \sum_{i} p(i \mid t)^2
The following example illustrates how these impurity measures are calculated:
[Figure: worked entropy and Gini calculation]
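As a minimal illustration (the class proportions below are hypothetical, since the original figure is not reproduced here), the Python snippet below computes the entropy and Gini index of a node from its class proportions, matching the two formulas above.

```python
import math

def entropy(proportions):
    """Entropy of a node: -sum_i p(i|t) * log2 p(i|t)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def gini(proportions):
    """Gini index of a node: 1 - sum_i p(i|t)^2."""
    return 1 - sum(p ** 2 for p in proportions)

# Hypothetical nodes: a fairly pure node (90% / 10%) and an impure one (51% / 49%).
for props in [(0.9, 0.1), (0.51, 0.49)]:
    print(props, "entropy =", round(entropy(props), 3), "gini =", round(gini(props), 3))
```

As expected, the 90/10 node has much lower entropy and Gini than the 51/49 node.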
The overall optimization goal of the tree is to make the total impurity of the leaf nodes as low as possible, i.e. to minimize the chosen impurity measure.
Different ways of measuring impurity produce different decision tree algorithms: ID3, C4.5, and CART (classification and regression trees).

ID3

ID3 uses information entropy to measure impurity, and its goal is to minimize the total entropy of all leaf nodes. When selecting a split point, ID3 chooses the feature whose split produces child nodes with the smallest entropy; in other words, the drop in entropy from the parent node to its children should be as large as possible. This difference is the information gain:

Gain(A) = Entropy(parent) - \sum_{v} \frac{N_v}{N} \, Entropy(v)

where the sum runs over the child nodes produced by splitting on feature A and N_v / N is the fraction of the parent's samples that fall into child v.
For example, consider predicting whether a customer will buy a computer.
[Figure: buys-computer training data]
Before any split, first compute the total entropy of the dataset, where s1 and s2 denote the number of samples in each class.
[Figure: total entropy calculation]
With no split, the total entropy is 0.940. Taking age as the split point as an example:
[Figure: entropy of each age branch]
the weighted entropy after splitting on age is 0.694.
[Figure: weighted entropy calculation for the age split]
The information gain is the difference between the two, 0.940 - 0.694 = 0.246:
[Figure: information gain calculation for age]
Next, compute the information gain of the other features in the same way:
[Figure: information gain of the remaining features]
Age has the largest information gain, so age is selected as the first splitting feature. The remaining splits are computed in the same way, and we finally obtain a complete decision tree (a sketch of the gain calculation follows the figure below).
[Figure: resulting ID3 decision tree]
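This calculation can be reproduced with a short sketch. The class counts below are an assumption based on the commonly used 14-sample "buys computer" example (9 positive and 5 negative samples overall; the three age branches holding 5, 4, and 5 samples), chosen to be consistent with the 0.940, 0.694, and 5/4/5 figures quoted above.

```python
import math

def entropy_from_counts(counts):
    """Entropy computed from raw class counts at a node."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Assumed class counts (yes, no) -- the standard 14-sample example.
root = (9, 5)
age_branches = [(2, 3), (4, 0), (3, 2)]   # youth, middle-aged, senior

n = sum(sum(b) for b in age_branches)
parent_entropy = entropy_from_counts(root)                # ~0.940
split_entropy = sum(sum(b) / n * entropy_from_counts(b)   # weighted child entropy, ~0.694
                    for b in age_branches)
gain = parent_entropy - split_entropy                     # information gain, ~0.246

print(round(parent_entropy, 3), round(split_entropy, 3), round(gain, 3))
```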

Greedy construction and limitations of ID3

The decision tree is a typical greedy model (at each step it splits on the feature with the largest information gain). The overall goal would be a globally optimal solution, but the search space grows exponentially with the number of features, so the global optimum is hard to obtain efficiently.
We therefore settle for second best and derive the result step by step through local optimization: as long as each split maximizes the information gain, we can obtain a good, if sub-optimal, model. Of course, the local optima do not necessarily add up to the global optimum.
Limitations of ID3:
1. A feature with many levels, such as an ID, yields a very large information gain when split on, yet an ID has no practical meaning for us, so such a split is problematic.
2. Continuous variables behave like the ID case: the data is split into many fragments and the information gain becomes very large. Even if the feature is meaningful, the comparison is unfair to the other features.
3. Missing values affect the calculation of information gain.
4. ID3 keeps looking for information gain and stops only when a leaf is pure or the tree reaches a preset depth. The tree therefore fits the training set very well, but new data differs from the training set, so performance on the test set is relatively poor.
[Figure: overfitting caused by ID3's unrestricted growth]

C4.5

In the information-gain calculation, C4.5 adds a penalty term, based on the number of levels of the categorical variable, to the computation over the child nodes' total entropy.
In the entropy formula, p(i|t), the proportion of samples of a given class among all samples, is replaced by P(v), the proportion of the parent node's samples that fall into child node v.
This branching index makes the algorithm automatically avoid features with too many levels when choosing split points, reduces the excessive influence that such entropy drops have on the model, and thereby reduces overfitting.
IV (the intrinsic value) is calculated as follows:

IV = -\sum_{v} P(v) \log_2 P(v)

The more levels a feature has, the smaller each P(v) is and the larger IV becomes.
[Figure: IV grows with the number of branches]
In C4.5, the field to split on is chosen using the earlier information gain divided by this measure of branching, i.e. the information gain ratio. In essence, it favors a column with a large information gain and few branches (purity improves quickly, but not by relying on features that merely have many levels). The larger the IV, i.e. the more levels a column has, the more heavily the gain ratio is penalized. As before, the larger the GR, the better.

GR(A) = \frac{Gain(A)}{IV(A)}
For example, let us compute the information gain ratio for age.
[Figure: age branches used in the gain-ratio calculation]
Splitting on age produces three branches with 5, 4, and 5 samples, so P(v) is 5/14, 4/14, and 5/14. Substitute these into the IV formula to get IV, then divide the previously computed information gain by IV to obtain the gain ratio. The feature with the largest GR is selected for the split (a sketch of this calculation follows the figure below).
[Figure: IV and gain-ratio calculation for age]
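Continuing under the same assumption, the snippet below computes IV and the gain ratio for age from the 5/14, 4/14, 5/14 branch proportions and the information gain of roughly 0.246 obtained earlier.

```python
import math

def intrinsic_value(branch_sizes):
    """IV = -sum_v P(v) * log2 P(v), where P(v) is a branch's share of the parent's samples."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes if s > 0)

age_branch_sizes = [5, 4, 5]    # the three age branches
gain_age = 0.246                # information gain of age computed earlier

iv = intrinsic_value(age_branch_sizes)   # ~1.577
gain_ratio = gain_age / iv               # ~0.156
print(round(iv, 3), round(gain_ratio, 3))
```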
How C4.5 handles continuous data:
[Figure: continuous-variable handling in C4.5]
When C4.5 processes a continuous variable, it first sorts the values and then takes the midpoint of each pair of adjacent values as a candidate split point. With N values of age, this does not turn the feature into N-1 categories the way an ID would; instead it is converted into N-1 binary splitting schemes, i.e. N-1 binary discrete variables, and the information gain ratio is computed for each to choose the split.
[Figure: candidate split points generated from the sorted ages and their evaluation]
Therefore, when a dataset containing continuous variables goes into a tree model, building the tree consumes more computing resources. And since the tree splits where impurity is lowest, the chosen split points are the ones that matter most for the final classification, which also gives us guidance on how to bin continuous variables.
As shown above, splitting age at 36.5 still classifies the target field (gender) well.
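A minimal sketch of how the N-1 candidate thresholds for a continuous feature can be generated (the age values below are made up for illustration): sort the distinct values and take the midpoint of each adjacent pair; each midpoint then defines one binary split whose gain ratio would be evaluated.

```python
# Hypothetical ages; only the split-point generation is shown here.
ages = [25, 45, 30, 36, 52, 38]

values = sorted(set(ages))
# N distinct values give N-1 candidate thresholds (midpoints of adjacent values).
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(thresholds)   # each threshold t defines the binary split: age <= t vs. age > t
```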

CART

The CART algorithm splits each feature into two parts at every split, so a CART decision tree is a binary tree model.

For regression problems, CART selects features using the mean squared error; for classification problems, it uses the Gini index.

CART classification tree algorithm flow

The output is a discrete value. The process is almost the same as C4.5, except that the feature-selection indicator is replaced by the Gini index. The prediction at a leaf is made by majority vote.
[Figure: CART classification tree algorithm flow]
When a CART classification tree processes continuous data, it works the same way as C4.5: first sort the data, then take the midpoints of adjacent values as candidate split points, compute the Gini index for each split, and select the split with the smallest Gini index as the cut point.
[Figure: continuous-variable splits in a CART classification tree]
When a CART classification tree processes discrete data, it keeps splitting the feature's levels two ways, enumerates the different combinations, computes the Gini index for each, and splits the feature at the combination with the minimum Gini index (a sketch follows the figure below).
[Figure: two-way groupings of a categorical feature in CART]
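The sketch below, on a made-up toy dataset, enumerates the two-way groupings of a categorical feature's levels and picks the grouping with the smallest weighted Gini index, which is how a CART classification tree splits a discrete feature as described above. (Mirror groupings are evaluated twice here, which is redundant but harmless.)

```python
from itertools import combinations

def gini(labels):
    """Gini index of a set of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left_labels, right_labels):
    """Size-weighted Gini of a binary split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

# Hypothetical data: (category level, class label)
data = [("low", 0), ("low", 0), ("medium", 1), ("medium", 0), ("high", 1), ("high", 1)]
levels = sorted({x for x, _ in data})

best = None
for r in range(1, len(levels)):                 # all non-trivial two-way groupings
    for group in combinations(levels, r):
        left = [y for x, y in data if x in group]
        right = [y for x, y in data if x not in group]
        g = weighted_gini(left, right)
        if best is None or g < best[0]:
            best = (g, set(group))

print("best split:", best[1], "vs. the rest, weighted Gini =", round(best[0], 3))
```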

CART regression tree algorithm flow

The output is a continuous value.
[Figure: CART regression tree algorithm flow]
When a CART regression tree processes continuous data, it does not use the Gini index; instead, features and split points are selected by minimizing the mean squared error.
Take age as an example. For a candidate split of a node, take the mean of the left node as its predicted value c1 and the mean of the right node as its predicted value c2.
Then compute the squared errors between each yi and the corresponding cm in the left and right nodes, sum the two squared errors, and look for the split point that makes this sum of squared errors smallest.
Once the split point is found, the data is divided in two: the left node predicts c1 and the right node predicts c2, i.e. the mean of the true values within each node.
The final prediction of a leaf is its mean (or median).

\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]
For example: first enumerate the candidate split points and compute the sum of squared errors m(s) for each. The minimum of m(s) occurs when the split point is x = 6.5. The data is therefore divided at this point: for values less than 6.5 the prediction is 6.24 (the mean of the left side), and likewise for the right side.
Then compute the residuals and continue splitting on them.
[Figure: first split at x = 6.5 and the resulting residuals]
After the first split, compute the residuals and their sum of squared errors.
Repeat the steps above, but this time on the residuals. The split point found is 3.5, so a split at 3.5 is added on top of the previous split.
Splitting continues in this way until, at some split, the residual sum of squares meets the requirement, and the splitting stops (a sketch of the split search follows the figures below).
[Figure: successive splits on the residuals]
[Figure: final regression tree]
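A minimal sketch of the split search described above, on made-up (x, y) pairs rather than the data from the original figures: for each candidate split s, each side predicts its own mean, and the split with the smallest total squared error m(s) is kept.

```python
def best_split(xs, ys):
    """Return (split_point, total_squared_error, left_mean, right_mean) minimizing m(s)."""
    pairs = sorted(zip(xs, ys))
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        s = (xs[i - 1] + xs[i]) / 2                       # candidate split point
        left, right = ys[:i], ys[i:]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        m = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or m < best[1]:
            best = (s, m, c1, c2)
    return best

# Hypothetical training data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.5, 5.7, 5.9, 6.4, 6.8, 7.1, 8.9, 8.7, 9.0, 9.1]
print(best_split(x, y))   # split point, m(s), left mean c1, right mean c2
```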
Question one: why don't tree models have the high data-quality requirements of linear models, such as normalization, missing-value handling, and outlier handling?
First, a decision tree is a partition of the space. If x takes values like 1, 10, 100, 1000, ..., the gaps are large, but these values are only split points; they are not used in the actual calculation, which uses the labels behind them. So a decision tree does not need much data preprocessing.
Question two: back to the data fed into a decision tree. The descriptions above treat discrete and continuous data separately, but what if the data contains both at the same time? Does a decision tree even distinguish discrete values from continuous values?
Again, whether the incoming data is discrete or continuous, the values are only split points; discrete data simply has fewer candidate split points, and continuous data has more.
Question three: compared with the multi-way tree of C4.5, what are the advantages of CART's binary tree?
For example, for a continuous feature with N values, C4.5's multi-way handling generates N-1 binary discrete variables, so the amount of computation grows. CART instead generates just two groups each time: the chosen one and everything else, and subsequent calculations continue on the "everything else" part. This is the core idea of CART and greatly reduces the amount of computation.

[Figure: comparison of C4.5 multi-way splits and CART binary splits]

Tree pruning

Because of how the decision tree algorithm works (it drives the total impurity of the leaf nodes as low as possible), decision trees are prone to overfitting. Pruning can be used to control the overfitting problem.
Pruning is divided into pre-pruning and post-pruning.

Pre-pruning

In pre-pruning, thresholds are set for the parameters before the tree is grown; once a threshold is reached, the tree stops growing. Typical controls include the following (a scikit-learn sketch follows the figure below):

1. Limit the depth of the tree.
2. Require a minimum number of samples in a node.
3. Compute the information gain ratio or the Gini decrease, and stop splitting when it falls below a certain value.
[Figure: effect of pre-pruning on tree growth]
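In scikit-learn, for example, these pre-pruning controls correspond to constructor parameters of DecisionTreeClassifier; the specific values below are arbitrary and only for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: thresholds are fixed before training, and growth stops when they are hit.
clf = DecisionTreeClassifier(
    max_depth=4,                 # 1. limit the depth of the tree
    min_samples_leaf=10,         # 2. minimum number of samples required in a leaf
    min_impurity_decrease=0.01,  # 3. stop splitting when the impurity decrease is too small
    criterion="gini",            # CART-style impurity (use "entropy" for information gain)
)
# clf.fit(X_train, y_train)  # fit on your own data
```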

Post-pruning

First grow a full decision tree, then generate all possible pruned subtrees, and select the tree with the best generalization ability by cross-validation. CART's pruning method consists of two steps (a scikit-learn sketch follows the figure below):
[Figure: the two steps of CART cost-complexity pruning]
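For post-pruning, scikit-learn implements CART's minimal cost-complexity pruning: cost_complexity_pruning_path enumerates the effective alphas (one per candidate pruned subtree), and cross-validation then picks the alpha whose subtree generalizes best. A minimal sketch, assuming a feature matrix X and labels y are already defined:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Step 1: grow a full tree and enumerate the pruning path (one alpha per candidate subtree).
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)      # assumes X, y are defined

# Step 2: pick the alpha whose pruned subtree has the best cross-validated accuracy.
best_alpha, best_score = max(
    ((alpha, cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                             X, y, cv=5).mean())
     for alpha in path.ccp_alphas),
    key=lambda t: t[1],
)
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```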

Comparison of the three tree models

[Figure: comparison table of ID3, C4.5, and CART]

Pros and cons of the decision tree

Advantages:
1. A decision tree can be drawn and shown to people intuitively.
7. A decision tree's processing of data is similar to binning.
[Figure: full list of decision tree advantages]
Disadvantages:
1. Because of the decision tree's objective (minimizing the total impurity), the algorithm overfits easily; decision tree parameter tuning therefore aims at fitting less.
5. By its nature, a decision tree leans toward classes with a large sample proportion (for example, a classification tree outputs the majority class at each leaf), so it does not work very well on imbalanced data.
[Figure: full list of decision tree disadvantages]

