[Data Algorithm Engineer] Decision Trees

Decision trees have a lot of branches (pun intended), so today I'm going to sort out this tricky topic. Tree algorithms actually work very well, much like deep learning does today; GBDT, for example, is something everyone has run into =.=

 

Handling attribute values

  Continuous values : emmm, discretize them; the midpoints between adjacent sorted values make good candidate split points (see the sketch after this list).

  Missing values : missing values are actually not a problem during tree generation, since the split criterion is computed only on the samples whose value is known. When classifying, a sample with an unknown value at a node is sent down all of the node's branches at the same time, each branch carrying a weight equal to the proportion of known-value training samples that went that way; the final prediction compares the magnitudes of these accumulated weights. A small sketch of both ideas follows this list.
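A minimal toy sketch of both ideas in Python (the arrays, the branch fractions, and the helper name `candidate_split_points` are made up for illustration):

```python
import numpy as np

def candidate_split_points(values):
    """Midpoints between adjacent sorted unique values of a continuous feature."""
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2.0

# Continuous feature: try each midpoint as a binary split threshold.
heights = np.array([1.5, 1.6, 1.7, 1.8])
print(candidate_split_points(heights))   # [1.55 1.65 1.75]

# Missing values (C4.5-style idea): a sample with an unknown value at a node
# is sent down every branch with a weight equal to that branch's share of the
# known-value training samples; the final class compares accumulated weights.
branch_fractions = {"left": 0.7, "right": 0.3}   # from non-missing training data
sample_weight = 1.0
weights_per_branch = {b: sample_weight * f for b, f in branch_fractions.items()}
print(weights_per_branch)                        # {'left': 0.7, 'right': 0.3}
```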

 

tree pruning

Pruning uses the training set and a validation set at the same time: the training set is used to find the feature with the largest information gain, and the validation set is used to compute the accuracy of the tree with and without the branch that the chosen feature would create.

  Pre-pruning : accuracy is checked each time a feature node is about to be added to the tree, and the node is added only when the validation accuracy with it is higher than that of the tree without it. Because this cuts off many branches before they can prove useful, underfitting is the main problem of pre-pruning.

  Post-pruning : after the full decision tree has been generated, its feature nodes are examined bottom-up. A node is deleted when removing it improves the validation accuracy compared with the current tree; if the accuracy is unchanged, the node can be kept. In general, post-pruning is the better approach, at the cost of growing the full tree first.
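For a concrete feel, here is a hedged scikit-learn sketch. Note that sklearn's built-in post-pruning is cost-complexity pruning rather than the validation-accuracy pruning described above, so the alpha is simply chosen on a validation set to mimic that idea:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)
print("pre-pruned  val acc:", pre.score(X_val, y_val))

# Post-pruning: grow the full tree, then pick the cost-complexity alpha
# that does best on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("post-pruned val acc:", best.score(X_val, y_val))
```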

 

single decision tree

  ID3 : the tree most books use when explaining decision-tree theory. Nodes are chosen directly by information gain (don't forget that, when computing information gain, each branch's entropy is weighted by the proportion of samples falling into it). Using raw information gain has a problem, though: when an attribute has many distinct values (so each value covers only a few samples), its information gain comes out very high by the formula, which leads to overfitting and reduced generalization.

  C4.5 : improves how ID3 picks nodes by using the information gain ratio instead. As the formula shows (I won't paste the math here; maybe I'll add it later =.=), the gain ratio has the opposite bias and prefers attributes with few values, so C4.5 uses a heuristic: among the attributes whose information gain is above average, pick the one with the highest gain ratio.

  CART : Classification And Regression Trees. CART is really a recipe for building the tree: first, the tree is binary; second, nodes are judged not by information gain but by the Gini index (the formula changes, the idea does not =.=). As for the regression tree, each leaf covers a region of the input space, and the split is chosen so that the MSE inside the resulting regions is smallest; the classification tree uses the Gini index (or information gain and the like), so only the criterion changes. When the tree reaches its size limit or no split satisfies the splitting requirement, the regression tree is done. A small sketch of these impurity measures follows this list.
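A small NumPy sketch of the three measures just mentioned (information gain for ID3, gain ratio for C4.5, Gini for CART); the toy arrays are made up for illustration:

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y, attr):
    """ID3 criterion: parent entropy minus the size-weighted child entropies."""
    return entropy(y) - sum(
        (attr == v).mean() * entropy(y[attr == v]) for v in np.unique(attr)
    )

def gain_ratio(y, attr):
    """C4.5 criterion: information gain divided by the attribute's split information."""
    iv = entropy(attr)  # intrinsic value of the attribute's value distribution
    return information_gain(y, attr) / iv if iv > 0 else 0.0

# Toy data: 8 samples, binary class label, one attribute with 3 values.
y    = np.array([0, 0, 1, 1, 1, 0, 1, 0])
attr = np.array([0, 0, 1, 1, 2, 2, 1, 0])
print(information_gain(y, attr), gain_ratio(y, attr), gini(y))
```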

 

Decision Trees in Ensemble Learning

  AdaBoost : not much to say here. Train serially on the whole training set, raise the weights of the samples misclassified in each round, and output a weighted linear combination of the resulting trees -> the effect is OK.

  GBDT : gradient boosted trees. A lot of the theory around GBDT looks complicated, but the principle is the same as boosting classification trees, except that classification raises the weights of misclassified samples while regression iterates on the MSE (or another loss function). The individual trees follow the low-variance, high-bias principle (shallow trees; because the method is iterative, each tree must not hurt generalization), and the overall goal of GBDT is to minimize the loss function; a minimal sketch of the residual-fitting idea appears after this list. Here I have to mention XGBoost, which improves on GBDT in several ways. One is the base learner: GBDT uses CART, while XGBoost can also use a linear classifier (at which point it can hardly be called a tree =.=). Another is the loss function: XGBoost uses second-order derivative information (the details I'm not too clear on =.=) and adds regularization (such as an L2 norm) to the loss, giving a simpler and more robust model. Another is something called shrinkage, similar to eta (the learning rate): the leaf scores of each tree are multiplied by this factor, which reduces each tree's influence and leaves more learning room for the weak learners generated in later iterations. The last one is the parallel mechanism: boosting itself is serial, and XGBoost's parallelism refers to pre-sorting the features in advance (finding the best split point is the most time-consuming part), so every weak learner trains on the same pre-sorted data and the split search can be multi-threaded =.=

  Random Forest : the counterpart of AdaBoost. Randomly sample the training set (with or without replacement) and train the trees in parallel -> a linear combination of decision trees. It generally works better than AdaBoost, and the OOB (out-of-bag) error can be used to estimate the generalization ability; see the scikit-learn comparison after this list.

  Extremely Randomized Trees : extremely random trees, with even more randomness than a Random Forest. ET trains every tree on the full data set, but the choice at each node is random. How to understand this? Instead of using the Gini index or similar measures to decide whether to take a particular split, the split is set at random (within a certain range for each node, of course). Every tree generated this way is different; a single tree's result is inaccurate, but the combination works very well and generalizes strongly.
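A minimal sketch of the residual-fitting idea behind GBDT with squared loss: every round a shallow tree is fit to the residuals of the current ensemble and added with a shrinkage factor eta. The data and hyperparameters here are toy values, not a recommendation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta, n_rounds = 0.1, 100          # shrinkage and number of boosting rounds
pred = np.full_like(y, y.mean())  # start from the mean (the "zeroth" model)
trees = []

for _ in range(n_rounds):
    residual = y - pred                        # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2)  # shallow: low variance, high bias
    tree.fit(X, residual)
    pred += eta * tree.predict(X)              # shrinkage limits each tree's influence
    trees.append(tree)

print("train MSE:", np.mean((y - pred) ** 2))
```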
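And a rough scikit-learn comparison of the ensembles above (AdaBoost, GBDT, Random Forest with its OOB estimate, Extremely Randomized Trees); XGBoost is left as a comment since it lives in a separate package, and the parameter values are just placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "AdaBoost":      AdaBoostClassifier(n_estimators=200, random_state=0),
    "GBDT":          GradientBoostingClassifier(learning_rate=0.1, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, oob_score=True,
                                            random_state=0),
    "Extra Trees":   ExtraTreesClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:15s} cv accuracy: {acc:.3f}")

# OOB (out-of-bag) score as a generalization estimate for the forest:
rf = models["Random Forest"].fit(X, y)
print("Random Forest OOB score:", rf.oob_score_)

# With the xgboost package installed, the boosted-tree counterpart would look like:
# from xgboost import XGBClassifier
# xgb = XGBClassifier(learning_rate=0.1, reg_lambda=1.0, n_estimators=200)
```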

 

 

That's all for now, confetti~~~
