Decision Trees: discrete and continuous values, and how to construct a decision tree

A detailed write-up on decision trees: https://blog.csdn.net/suipingsp/article/details/41927247

1, what is a decision tree:

    A decision tree classifies data through a series of rules; it provides a rule-like description of what value to expect under what conditions. Decision trees come in two types, classification trees and regression trees: a classification tree is a decision tree for discrete target variables, and a regression tree is a decision tree for continuous target variables.
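As a concrete illustration (a minimal sketch assuming scikit-learn is available; not part of the original post), the two kinds of trees correspond to two different estimators:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: discrete target labels.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit([[0, 0], [1, 1], [1, 0], [0, 1]], [0, 1, 1, 0])
print(clf.predict([[0.9, 0.9]]))   # predicts a class label

# Regression tree: continuous target values.
reg = DecisionTreeRegressor(max_depth=3)
reg.fit([[1], [2], [3], [4]], [1.1, 1.9, 3.2, 3.8])
print(reg.predict([[2.5]]))        # predicts a numeric value
```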

2, the decision tree generation process is mainly divided into the following three parts:

Feature selection: selecting one feature from the many features of the training data as the splitting criterion for the current node. There are many different quantitative evaluation criteria for how to select this feature, and they give rise to different decision tree algorithms.

Decision tree generation: according to the chosen feature evaluation criterion, child nodes are generated recursively from top to bottom, and the tree stops growing when the data set can no longer be split. Recursion is the easiest way to understand the tree structure.

Pruning: decision trees over-fit easily, so they generally need to be pruned to reduce the size of the tree structure and alleviate over-fitting. There are two kinds: pre-pruning and post-pruning. (A from-scratch sketch of all three steps follows below.)
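A minimal from-scratch sketch of the three steps just described (Gini-based feature selection, top-down recursive generation, and a simple depth limit acting as pre-pruning); the function names are illustrative, not taken from any particular library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Feature selection: pick the (feature, threshold) with the lowest weighted Gini."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]

def build_tree(X, y, depth=0, max_depth=3):
    # Pre-pruning style stop: pure node or depth limit reached.
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    f, t = best_split(X, y)
    if f is None:
        return Counter(y).most_common(1)[0][0]
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i in range(len(X)) if i not in left_idx]
    return {"feature": f, "threshold": t,
            "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx], depth + 1, max_depth),
            "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth)}
```

For example, build_tree([[2, 3], [1, 1], [3, 1], [2, 2]], [0, 1, 1, 0]) returns a nested dict of splits, here {'feature': 1, 'threshold': 1, 'left': 1, 'right': 0}.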

3, three decision tree algorithms based on information theory

ID3: the ID3 algorithm selects features according to the information gain criterion from information theory, each time choosing the feature with the largest information gain as the splitting node. Using information gain has a drawback: it is biased toward attributes with many distinct values. That is, in the training set, the more distinct values an attribute takes, the more likely it is to be chosen as the splitting attribute, which is sometimes meaningless. In addition, ID3 cannot handle continuously distributed features, which is why the C4.5 algorithm exists. The CART algorithm also supports continuously distributed features.
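Written out (standard definitions, not specific to the linked post), the quantities ID3 uses are the entropy of a data set D with class proportions p_k, and the information gain of splitting D on attribute A:

```latex
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k,
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```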

C4.5 is an improved algorithm that inherits the advantages of ID3. The C4.5 algorithm selects attributes using the information gain ratio, which overcomes the bias toward attributes with many values that arises when attributes are selected by information gain; it can prune during tree construction; it can handle both discrete and continuous attributes; and it can handle incomplete (missing) data. The classification rules generated by C4.5 are easy to understand and accurate; however, it is inefficient, because the tree construction process requires multiple sequential scans and sorts of the data set. Also, because the data set has to be scanned many times, C4.5 is only suitable for data sets that fit in memory.
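The gain ratio used by C4.5 normalizes the information gain by the split information of the attribute itself, which penalizes attributes with many values (standard definitions):

```latex
\mathrm{SplitInfo}(D, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|},
\qquad
\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}
```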

CART stands for Classification And Regression Tree. It uses the Gini index (choosing the feature with the smallest Gini index) as the splitting criterion, and it also includes a pruning step. Although the ID3 and C4.5 algorithms can extract a great deal of information from the training set, the trees they generate have many branches and are large in scale. To simplify the decision tree and make its generation more efficient, CART selects the splitting attribute based on the Gini index.
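For reference, the Gini index of a data set D and of a binary split of D into D1 and D2 on attribute A are (standard definitions); CART chooses the split with the smallest Gini(D, A):

```latex
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2,
\qquad
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```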

4, entropy and the Gini index:

https://www.cnblogs.com/muzixi/p/6566803.html

Entropy: see the link above for the definition and a worked example.

Gini index: https://blog.csdn.net/e15273/article/details/79648502 (a worked example of how to calculate the Gini index)
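A small self-contained calculation (a made-up toy example, not the one from the linked post) showing how the two impurity measures are computed:

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]          # 4 positives, 6 negatives
left, right = [1, 1, 1, 0], [1, 0, 0, 0, 0, 0]   # a candidate binary split

print(round(entropy(parent), 3), round(gini(parent), 3))   # 0.971 0.48
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(round(weighted, 3))                                   # weighted Gini after the split
```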

5, the CART tree, how it handles continuous values, and its loss function

The CART tree is divided into the classification tree (for discrete target variables) and the regression tree (for continuous target variables). CART builds a binary tree, whereas ID3 builds a multi-way tree.

The important foundations of the CART algorithm include the following three aspects:

(1) Binary split (Binary Split): in each decision step, the observed variable is split into two parts.

    The CART algorithm uses a binary recursive partitioning technique: the current sample set is always divided into two sub-sets, so that every non-leaf node of the generated decision tree has exactly two branches. The decision tree generated by CART is therefore a simple binary tree. This means CART applies whether the sample features are discrete or continuous; for continuous features the processing is similar to that of the C4.5 algorithm.

(2) Univariate split (Split Based on One Variable): each optimal split is made on a single variable.

(3) Pruning strategy: a key point of the CART algorithm, and also a key step in any tree-based algorithm.

The pruning process is particularly important and plays a central role in obtaining an optimal decision tree. Studies have shown that the pruning process matters more than the generation process: for maximal trees (Maximum Tree) generated under different splitting criteria, the most important splitting attributes retained after pruning do not differ much. Rather, it is the pruning method that is critical for producing the optimal tree.

6, handling continuous-valued features
C4.5 can handle both discrete attributes and continuous attributes. When selecting the attribute for a branch node, discrete attributes are handled in the same way as in ID3. For continuously distributed features, the approach is:

First, discretize the continuous attribute and then process it further. Although the attribute's values are essentially continuous, the sampled data are finite and discrete: if there are N distinct sample values, there are N-1 possible discretizations, assigning values <= vj to the left subtree and values > vj to the right subtree, and the maximum information gain is computed over these N-1 cases. In addition, a continuous attribute must be sorted (in ascending order), and only positions where the decision attribute (i.e. the class) changes need to be considered as cut points, which can significantly reduce the amount of computation. It has been shown that information gain should be used to determine the cut point of a continuous feature (because if the gain ratio were used, the split information would distort the measurement at the split point; a cut point that happens to split the continuous feature into two equal halves is penalized the most), while the gain ratio should be used when choosing among attributes, so that the best feature for classification is selected.

In C4.5, a continuous attribute is handled as follows:
1. Sort the values of the feature in ascending order.

2. The midpoint between each pair of adjacent values is a possible split point that divides the data set into two parts; compute the information gain (InfoGain) for each possible split point. As an optimization, only split points where the class label changes need to be evaluated.

3. Select the split point with the largest corrected information gain (InfoGain) as the best split point for this feature.

4. Compute the gain ratio (Gain Ratio) of the best split point as the feature's Gain Ratio. Note that the information gain of the best split point must be corrected by subtracting log2(N-1)/|D| (where N is the number of distinct values of the continuous feature and D is the number of training examples). This correction exists because, when discrete and continuous attributes coexist, C4.5 would otherwise tend to choose the best split point of a continuous feature as the splitting node of the tree. A sketch of these steps follows.
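A rough sketch of steps 1-4 under the description above (information gain evaluated at midpoint thresholds, with the log2(N-1)/|D| correction applied to the winning split; the class-change optimization from step 2 is omitted for brevity, and the function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_c45(values, labels):
    """Return (best midpoint threshold, corrected information gain)."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    base = entropy(ys)
    n_distinct = len(set(xs))
    best_gain, best_t = -1.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # same value: not a valid cut point
        t = (xs[i] + xs[i - 1]) / 2.0     # midpoint threshold
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_gain, best_t = gain, t
    # Correction so continuous features are not unfairly favoured over discrete ones.
    corrected = best_gain - math.log2(n_distinct - 1) / len(ys) if n_distinct > 1 else 0.0
    return best_t, corrected

print(best_threshold_c45([70, 90, 85, 95, 65], [0, 1, 1, 1, 0]))  # threshold 77.5
```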

The result produced by a classification tree is a class value (for example 0 or 1). The result produced by a regression tree is a numeric value, the average of the target over the corresponding region. Accordingly, the loss functions are:

https://www.cnblogs.com/liuwu265/p/4688403.html

Regression Trees: Use a minimum squared error criterion.

Classification trees: use the Gini index minimization criterion.
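In formula form (the standard CART least-squares criterion, consistent with the description above): for each splitting variable j and split point s defining the two regions R1(j, s) and R2(j, s), the regression tree chooses j and s to minimize the squared error, with each region predicted by its average:

```latex
\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2
                   + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right],
\qquad
c_m = \mathrm{ave}\left(y_i \mid x_i \in R_m(j, s)\right)
```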

Other indicators, such as AUC and KS, can also be used to evaluate how good the model is.

7, CART pruning
In the recursive process of building a classification or regression tree, there is essentially a problem of over-fitting the data. When the decision tree is constructed, noise or outliers in the training data cause many branches to reflect anomalies in the training data, and using such a tree to classify unknown data gives poor accuracy. We therefore try to detect and remove such branches; the process of detecting and removing them is called tree pruning. Tree pruning addresses the problem of over-adaptation to the data. Typically, statistical measures are used to cut off the least reliable branches, which leads to faster classification and improves the tree's ability to correctly classify data independent of the training set. There are two common pruning approaches: pre-pruning (Pre-Pruning) and post-pruning (Post-Pruning). Pre-pruning stops the growth of the tree early according to some principle, for example when the tree reaches a user-specified depth, when the number of samples at a node falls below a specified number, or when the largest drop in the impurity index no longer exceeds a user-specified threshold. Post-pruning is applied to a fully grown tree, cutting branches by deleting the nodes of a subtree; various post-pruning methods exist, such as cost-complexity pruning, minimum-error pruning, and pessimistic-error pruning.

CART usually uses post-pruning: the second key step in building the decision tree is to prune the tree grown on the training set using an independent validation data set.
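As an illustration of cost-complexity post-pruning selected on held-out data (a minimal sketch using scikit-learn's cost_complexity_pruning_path and ccp_alpha, which follow the CART-style cost-complexity approach; assumes a reasonably recent scikit-learn):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree and get the candidate cost-complexity alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha that performs best on an independent validation set.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```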

 

8, how decision trees handle missing values:

A decision tree does not require missing values to be handled in advance; of course, if a feature has too many missing values, part of the information in that feature may be lost.

How XGBoost handles missing values: https://blog.csdn.net/qq_19446965/article/details/81637199 , https://www.zhihu.com/question/58230411

How does xgboost handle missing values?
xgboost handles missing values differently from other tree models. According to the author Tianqi Chen's description in section 3.4 of the paper [1], xgboost treats the data as a sparse matrix and does not use the missing values themselves when a node is split. Instead, the examples with missing values are assigned to the left subtree and to the right subtree in turn, the loss is computed for each case, and the better choice is kept as the default direction. If there are no missing values during training but missing values appear at prediction time, they are sent to the right subtree by default. See references [2,3] for details.
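A minimal sketch (assuming the xgboost Python package is installed; the toy data are made up) showing that NaN entries can be fed in directly, with each split learning a default direction for missing values:

```python
import numpy as np
import xgboost as xgb

# NaN entries are treated as missing; each split learns a default direction for them.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.0],
              [4.0, 5.0]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 2}, dtrain, num_boost_round=5)
print(booster.predict(xgb.DMatrix(X, missing=np.nan)))
```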

What kinds of models are more sensitive to missing values?
Tree models have low sensitivity to missing values and can usually still be used when data are missing.
When a distance measurement (distance measurement) is involved, for example computing the distance between two points, missing data matter more: because the notion of distance is involved, poorly handled missing values lead to bad results, as in K-nearest neighbours (KNN) and support vector machines (SVM).
The cost function (loss function) of a linear model often involves a distance calculation, i.e. the difference between the predicted value and the true value, which tends to make it sensitive to missing values.
Neural networks are robust and not very sensitive to missing data, but there is usually not that much data available to use them.
Bayesian models are also fairly stable with missing data; when the amount of data is very small, a Bayesian model is a good first choice.
In summary, for data with missing values, after the missing values have been handled:

with a small amount of data, use naive Bayes;
with a medium or large amount of data, use a tree model, preferably xgboost;
with a very large amount of data, a neural network can also be used;
avoid models that rely on a distance metric, such as KNN and SVM.

Further reading: in-depth understanding of machine learning, decision-tree-based models (1): classification trees and regression trees

https://blog.csdn.net/hy592070616/article/details/81628956


Source: https://blog.csdn.net/x_iesheng/article/details/93665998