Understanding the CART decision tree

CART algorithm

Principle

CART stands for Classification and Regression Tree.

Regression Trees

Compared with ID3, CART iterates over all features and all feature values and splits the data with a binary segmentation, so each node splits into exactly two branches. For each candidate split it computes the total variance of the resulting data subsets as a measure of impurity: the smaller the total variance, the purer the subset. The feature and feature value whose split gives the smallest total variance are chosen, and the binary split sends samples whose feature value is less than or equal to the split value to one branch and samples whose feature value is greater than it to the other. The total variance is usually computed as the variance of each data subset multiplied by its number of samples. The output of a leaf node is the median or the mean of the samples that fall into it.
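
To make the idea concrete, here is a small sketch (not the scikit-learn implementation) that scans one feature for the binary split value minimizing the total variance of the two resulting subsets; the toy arrays are invented purely for the example:

import numpy as np

def total_variance(y):
    # "total variance" as described above: variance of the subset * number of samples
    return np.var(y) * len(y) if len(y) > 0 else 0.0

def best_split_for_feature(x, y):
    # x: values of one feature, y: regression targets
    best_value, best_score = None, np.inf
    for v in np.unique(x):
        left, right = y[x <= v], y[x > v]   # binary split: <= v versus > v
        score = total_variance(left) + total_variance(right)
        if score < best_score:
            best_value, best_score = v, score
    return best_value, best_score

# toy data: one feature, one continuous target
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.9, 5.1])
print(best_split_for_feature(x, y))   # the split value 3.0 separates the two groups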

Classification tree

Compared with ID3, CART generally uses Gini impurity instead of information gain to measure the impurity of a data subset; the lower the Gini impurity, the purer the data. Compared with regression trees, classification trees handle both discrete and continuous features with the same binary splitting approach, but when measuring how mixed a data subset is they use Gini impurity in place of total variance.

Definition of Gini impurity: the probability that an item chosen at random from a data set would be incorrectly assigned to one of the other groups.

There is no need to rush to understand this sentence; first look at the expressions for Gini impurity explained below.

  • Suppose a data set has K classes and the probability of the k-th class is p_k. The expression for the Gini coefficient is:

    Gini(p) = sum_{k=1}^{K} p_k * (1 - p_k) = 1 - sum_{k=1}^{K} p_k^2

    In the formula above, p_k is the probability that the k-th class appears, so 1 - p_k is the probability that any class other than the k-th appears in the current data set. Their product is the probability that both the k-th class and some other class appear; summed over all classes, the higher this probability, the more impure the data set.
    Looking back at the definition above, it should now be easier to understand.
  • For a given sample set D with K classes, let C_k denote the set of samples belonging to the k-th class. The Gini coefficient of D is:

    Gini(D) = 1 - sum_{k=1}^{K} (|C_k| / |D|)^2

  • For the sample set D, if feature A and one of its values split D into two parts D1 and D2, then under feature A the Gini coefficient of D is (a short code sketch of these expressions follows the list):

    Gini(D, A) = (|D1| / |D|) * Gini(D1) + (|D2| / |D|) * Gini(D2)
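
A minimal Python sketch of these two expressions, using a made-up label array purely for illustration:

import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum_k (|C_k| / |D|)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels_d1, labels_d2):
    # Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)
    n1, n2 = len(labels_d1), len(labels_d2)
    n = n1 + n2
    return n1 / n * gini(labels_d1) + n2 / n * gini(labels_d2)

labels = np.array([0, 0, 0, 1, 1, 2])
print(gini(labels))                        # impurity of the whole set
print(gini_split(labels[:3], labels[3:]))  # impurity after one binary split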

Algorithm library calls

The decision tree in the scikit-learn library uses an optimized version of the CART algorithm; it can be used for both classification and regression.

Code

The classification tree corresponds to DecisionTreeClassifier and the regression tree to DecisionTreeRegressor. Sample code and parameters are as follows:

from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()                 # CART classification tree with default parameters
clf_dt.fit(train_X, train_y)                      # train_X / train_y: training features and labels
predictions_dt = clf_dt.predict(test_X)[:,None]   # predict on test_X and add a column axis
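
A regression tree is used in exactly the same way; as in the snippet above, train_X, train_y, and test_X are placeholders for your own data:

from sklearn.tree import DecisionTreeRegressor
reg_dt = DecisionTreeRegressor()          # the default criterion is mean squared error
reg_dt.fit(train_X, train_y)              # train_y holds continuous target values here
predictions_dt = reg_dt.predict(test_X)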

Parameters

The parameters below apply to both DecisionTreeClassifier and DecisionTreeRegressor; where the two differ, the difference is noted.

  • criterion: the feature selection criterion. For DecisionTreeClassifier it can be "gini" or "entropy"; the former is the Gini coefficient and the latter is information gain. In general the default "gini" is fine, which corresponds to the CART algorithm; use "entropy" only if you prefer ID3/C4.5-style feature selection. For DecisionTreeRegressor it can be "mse" or "mae"; the former is the mean squared error and the latter is the sum of absolute differences from the mean. The default "mse" is recommended and is generally more accurate than "mae", unless you specifically want to compare the effect of the two.
  • splitter: the criterion for choosing the split point of a feature. It can be "best" or "random". The former searches all candidate split points of the feature for the optimal one; the latter randomly samples some split points and picks a locally optimal one. The default "best" is suitable when the sample size is small; if the sample size is very large, "random" is recommended when building the tree. Same for both classes.
  • max_features: the maximum number of features to consider when splitting. Several kinds of values can be used. The default "None" means all features are considered at each split; "log2" means at most log2(N) features are considered; "sqrt" or "auto" means at most sqrt(N) features are considered. An integer gives the absolute number of features to consider, and a float gives the fraction of features to consider (rounded), where N is the total number of features. In general, if the number of features is not large, say fewer than 50, just use the default "None"; if the number of features is very large, use one of the other values to control how many features are considered at each split and thus the time needed to build the tree. Same for both classes.
  • max_depth: the maximum depth of the decision tree. If nothing is entered, the depth of the subtrees is not limited when the tree is built. This can generally be left alone when there is little data or few features. If the model has many samples and many features, limiting the maximum depth is recommended; the value depends on the distribution of the data, with common values between 10 and 100. Same for both classes.
  • min_samples_split: the minimum number of samples required to split an internal node. This value limits whether a subtree continues to be split: if a node has fewer than min_samples_split samples, the algorithm no longer tries to choose the best feature to split it. The default is 2. If the sample size is small, leave this value alone; if the sample size is of a very large magnitude, increase it. In one project with about 100,000 samples, min_samples_split = 10 was chosen when building the tree; it can serve as a reference. Same for both classes.
  • min_samples_leaf: the minimum number of samples of a leaf node. If a leaf node ends up with fewer samples than this value, it is pruned together with its sibling. The default is 1; you can enter an integer for the minimum number of samples, or a fraction of the total number of samples. If the sample size is small, leave this value alone; if the sample size is of a very large magnitude, increase it. The 100,000-sample project mentioned above used min_samples_leaf = 5, for reference only. Same for both classes.
  • min_weight_fraction_leaf: the minimum weighted fraction of samples of a leaf node. This value limits the minimum sum of the weights of all samples in a leaf node; if the sum is below this value, the node is pruned together with its sibling. The default is 0, i.e. weights are not considered. In general, if many samples have missing values, or the class distribution of the classification samples is strongly skewed, sample weights are introduced and this value then needs attention. Same for both classes.
  • max_leaf_nodes: the maximum number of leaf nodes. Limiting the number of leaf nodes can prevent overfitting. The default "None" means the number of leaf nodes is not limited. If a limit is set, the algorithm builds the optimal tree within that number of leaf nodes. If there are not many features, this value can be ignored; if there are many, it can be restricted, and a concrete value can be found by cross-validation. Same for both classes.
  • class_weight: the weight of each class of samples, used mainly to keep the training set from containing too many samples of certain classes, which would bias the trained tree toward those classes. You can specify the weight of each class yourself or use "balanced", in which case the algorithm computes the weights itself, giving higher weights to classes with fewer samples. If the class distribution of your samples has no obvious skew, you can ignore this parameter and keep the default "None". Not applicable to regression trees.
  • min_impurity_split: the minimum impurity for splitting a node. This value limits the growth of the tree: if the impurity of a node (Gini coefficient, information gain, mean squared error, or absolute error) is below this threshold, the node no longer generates children and becomes a leaf node. Same for both classes.
  • presort: whether to presort the data. This is a boolean; the default is False, i.e. no presorting. In general, if the sample size is small or the tree is limited to a small depth, setting it to True can make split-point selection and tree building faster. If the sample size is very large, it brings no benefit, and when the sample size is small the build is not slow anyway, so this value can usually be ignored. Same for both classes.
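
As an illustrative example of combining a few of these parameters for a large, imbalanced classification data set, something like the following could be used; the specific values are assumptions for the sketch, not recommendations:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",          # CART-style Gini impurity
    max_depth=10,              # cap the depth when samples and features are numerous
    min_samples_split=10,      # do not split nodes with fewer than 10 samples
    min_samples_leaf=5,        # prune leaves with fewer than 5 samples
    class_weight="balanced",   # compensate for a skewed class distribution
)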

Decision tree visualization

Install graphviz

Visualization requires graphviz to be installed first. Graphviz is an open-source graph visualization tool; its official site is https://graphviz.gitlab.io. Just download the file that suits your operating system.

If you are on Windows, you can download the installer package from the site. After the installation is complete, remember to set the environment variable: add xxx/Graphviz2.38/bin/ to PATH.

Install the Python packages:

pip install graphviz
pip install pydotplus
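
A quick way to confirm that the Graphviz binaries are reachable (assuming bin/ was added to PATH as described above) is to ask the dot command for its version:

import subprocess
# "dot -V" prints the installed Graphviz version (usually to stderr)
result = subprocess.run(["dot", "-V"], capture_output=True, text=True)
print(result.stderr or result.stdout)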

Code example

from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus
import os
# If the program cannot find the Graphviz commands, add the line below and change the path to your actual Graphviz/bin directory
os.environ["PATH"] += os.pathsep + 'H:/program_files/Graphviz2.38/bin'
# Train the model
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
# Generate iris.pdf with pydotplus
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")
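
If a PDF is inconvenient, the same graph object can also be written in other formats, for example a PNG image:

graph.write_png("iris.png")   # same dot data as above, rendered as a PNG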

Open the generated iris.pdf to view the tree.

References

https://www.cnblogs.com/pinard/p/6056319.html

https://blog.csdn.net/JJBOOM425/article/details/79997440

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx

OK, that is all for this article. Thanks for reading O(∩_∩)O.
