10 - Decision Tree

Introduction

Decision tree: start from the root node and walk step by step down to a leaf node (the decision).
All data eventually falls into some leaf node; a decision tree can do both classification and regression.
[Figure: example of a decision tree]
Nodes:
Adding a node is equivalent to making one more cut in the data.
Training and testing a decision tree:
Training phase: construct a tree from the given training set (starting from the root node, select features and decide how to split on them).
Testing phase: walk each sample from top to bottom through the constructed tree.
Once the decision tree is built, the classification or prediction task is simple: just walk the sample through it. The difficulty lies in how to construct the tree in the first place, which is not so easy and requires thinking about quite a few issues.

When a piece of data is to be split, it can be split according to different features; when there are many features, how do we choose a suitable one? The feature selected for the root node should be the best one (the one that separates the data most cleanly); for the nodes further down, the splitting effect gradually decreases.

Objective: define a measurement criterion, compute how well the data is separated when branching on each candidate feature, pick the best feature as the root node, and continue in the same way for the nodes below.

Measurement criterion: entropy

Entropy: entropy is a measure of the uncertainty of a random variable. (In plain terms, it is the degree of disorder: a market stall that sells a bit of everything is certainly messy, while a shop that sells only one brand is much more orderly.)
[Figure: entropy formula H(X) = −Σ pᵢ log pᵢ, with two example sets A and B]
In the example above, the entropy of set A is clearly lower: it contains only two categories and is relatively stable, whereas set B contains many different categories, so its entropy is much larger.
In the formula above, each pᵢ is a probability that gets multiplied by its logarithm. When the class probabilities pᵢ are small (many classes mixed together), log pᵢ is strongly negative and the computed entropy is large; when one probability pᵢ is close to 1 (an almost pure set), log pᵢ is close to 0 and the entropy is small.
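
As a rough illustration of the formula (a minimal sketch of my own, not code from the original post), the snippet below computes the entropy of a list of class labels; a set spread over many categories, like the grocery market, gets a higher value:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy: H = -sum(p_i * log2(p_i)) over the class frequencies
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    # A set dominated by one class is "purer" than one spread over many classes
    print(entropy(["A", "A", "A", "B"]))    # ~0.811
    print(entropy(["A", "B", "C", "D"]))    # 2.0, maximum disorder for 4 distinct items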

Information gain

Information gain expresses how much feature X reduces the uncertainty of class Y. (It measures how pure the branches are after the split: we want samples of the same class to end up together.)
For example, suppose the entropy is 10 after the first node's test and 8 after the second node's test; the information gain of that second split is then 10 − 8 = 2.
So selecting feature X comes down to information gain: the feature with the largest information gain is placed at the root node, and the nodes further down correspond to successively smaller gains.
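
To make the idea of "entropy before the split minus weighted entropy after the split" concrete, here is a small self-contained sketch (my own illustration, not from the post):

    import math
    from collections import Counter

    def entropy(values):
        total = len(values)
        return -sum(c / total * math.log2(c / total) for c in Counter(values).values())

    def information_gain(labels, feature):
        # Group the labels by the value of the candidate feature,
        # then compare the entropy before and after the split.
        total = len(labels)
        groups = {}
        for y, x in zip(labels, feature):
            groups.setdefault(x, []).append(y)
        weighted_after = sum(len(g) / total * entropy(g) for g in groups.values())
        return entropy(labels) - weighted_after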

A worked example of constructing a decision tree

Data: playing records for 14 days
Features: 4 environmental attributes
Goal: construct a decision tree
[Figure: the 14-day dataset with its 4 features and the play / no-play label]
The data above has 4 features, so which one should be used as the root node? On the surface any of them looks usable, so we use information gain to decide.
[Figure: candidate splits on the 4 features]
First we need the entropy of the 14 samples before any split is made (9 days with play, 5 days without):
[Figure: entropy of the root node, −(9/14)log₂(9/14) − (5/14)log₂(5/14)]
We can see that the current entropy is 0.94.
Next the 4 features are analysed one by one, starting with the outlook feature; a reconstruction of this calculation follows the figures below:
[Figure: splitting the 14 days by outlook]
[Figure: weighted entropy after the outlook split and the resulting information gain]
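
Assuming the counts in the figures follow the classic play-golf data (outlook = sunny: 2 play / 3 not, overcast: 4 / 0, rainy: 3 / 2), the calculation can be reconstructed like this (my own sketch):

    import math

    def H(pos, neg):
        # Entropy of a two-class node given the counts of the two classes
        total = pos + neg
        return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

    root = H(9, 5)                                    # ~0.940, the value in the post
    # Weighted entropy after splitting on outlook (assumed sunny/overcast/rainy counts)
    after_outlook = 5/14 * H(2, 3) + 4/14 * H(4, 0) + 5/14 * H(3, 2)
    gain_outlook = root - after_outlook               # ~0.247
    print(round(root, 3), round(after_outlook, 3), round(gain_outlook, 3))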

What is wrong with information gain (ID3)?

When a decision tree is built this way, if one of the given columns is an ID column, every ID value matches exactly one sample, so each ID subset is completely pure (its entropy is 0) and the algorithm will decide that splitting on the ID is best. But the ID is only a serial number and has no real influence on the outcome.
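
A quick numerical illustration of the problem (my own sketch): an ID-like feature with one unique value per row puts every sample into its own pure subset, so its information gain equals the full entropy of the labels and beats any genuine feature:

    import math
    from collections import Counter

    def gain(labels, feature):
        # information gain = entropy(labels) - weighted entropy of the subsets per feature value
        H = lambda vs: -sum(c / len(vs) * math.log2(c / len(vs)) for c in Counter(vs).values())
        subsets = {v: [y for y, x in zip(labels, feature) if x == v] for v in set(feature)}
        return H(labels) - sum(len(s) / len(labels) * H(s) for s in subsets.values())

    labels = ["yes", "yes", "no", "no", "yes", "no"]
    print(gain(labels, [1, 2, 3, 4, 5, 6]))              # ID column: 1.0, the maximum possible
    print(gain(labels, ["a", "a", "a", "b", "b", "b"]))  # a real feature: much smaller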

Information gain ratio (C4.5)

Built on top of information gain, it fixes the ID problem by also taking the feature's own entropy into account (the gain is divided by the entropy of the feature itself).
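
A minimal sketch of the C4.5 idea (my own illustration): dividing the information gain by the feature's own entropy penalizes features such as an ID column that fragment the data into many tiny subsets:

    import math
    from collections import Counter

    def entropy(values):
        total = len(values)
        return -sum(c / total * math.log2(c / total) for c in Counter(values).values())

    def gain_ratio(labels, feature):
        total = len(labels)
        groups = {}
        for y, x in zip(labels, feature):
            groups.setdefault(x, []).append(y)
        gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
        intrinsic = entropy(feature)      # the feature's own entropy ("split information")
        return gain / intrinsic if intrinsic > 0 else 0.0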

CART

It uses the Gini index as the measurement criterion.
[Figure: Gini index formula Gini(p) = Σ pₖ(1 − pₖ) = 1 − Σ pₖ²]
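
For comparison, a small sketch of the Gini index used by CART (assuming the standard formula Gini(p) = 1 − Σ pₖ²):

    from collections import Counter

    def gini(labels):
        # Gini index: 1 - sum(p_k^2); 0 means the node is completely pure
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459 for the 9-play / 5-no-play root node above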

Decision tree pruning strategies

Why prune: a decision tree has a high risk of overfitting (it performs well on the training set but poorly on the test set); in theory it can separate the training data perfectly.
Pruning strategies: pre-pruning and post-pruning.
Pre-pruning: prune while the decision tree is being built (more practical).
Post-pruning: prune after the decision tree has been fully built.
[Figure: illustration of pre-pruning and post-pruning]
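
In scikit-learn terms, pre-pruning corresponds to constraining the tree while it grows (max_depth, min_samples_leaf, min_samples_split), while post-pruning is available as cost-complexity pruning via ccp_alpha in newer versions; the values below are only illustrative:

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: restrict the tree while it is being built
    pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)

    # Post-pruning: grow the tree fully, then prune weak branches by cost-complexity
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)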

Hands-on practice

We use one of the datasets that ship with sklearn.

[Figure: the built-in datasets under sklearn.datasets]
As shown above, sklearn.datasets ships with some simple datasets; here we can use the housing-price dataset among them.
[Figure: importing the tree model, instantiating it with max_depth=2 and calling fit]
sklearn provides this kind of tree model. First import the tree module from sklearn; then the tree model has to be instantiated, and parameters are passed at instantiation time. max_depth assigns a depth to the tree model, meaning the tree about to be created will have a depth of 2. The .fit call only needs two arguments: the x values and the y values.
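
The figure most likely shows code along the following lines (my reconstruction: the California housing dataset and the choice of the latitude/longitude columns are assumptions; only max_depth=2 and the fit(x, y) call are described in the text):

    from sklearn import tree
    from sklearn.datasets import fetch_california_housing

    housing = fetch_california_housing()

    # Instantiate the tree model with a depth of 2, then fit it on (x, y)
    dtr = tree.DecisionTreeRegressor(max_depth=2)
    dtr.fit(housing.data[:, [6, 7]], housing.target)   # columns 6, 7 are latitude / longitude
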
For the tree model parameters, the common ones are as follows (a combined example follows the list):
1. criterion, gini or entropy: the splitting criterion of the decision tree; either gini or entropy can be used.
2. splitter, best or random: the former searches for the best split point among all features, the latter among a random subset of features (useful when the data is large); the default is best and this value is rarely changed.
3. max_features, None (all), log2, sqrt, N: when there are fewer than 50 features, all of them are normally used.
4. max_depth: with little data or few features this value can be ignored; if the model has many samples and many features, it is worth trying to limit the depth.
5. min_samples_split: if the number of samples at a node is less than min_samples_split, the node will not be split further on the best feature. If the sample size is not large, this value can be ignored; if the sample size is very large, increasing this value is recommended.
6. min_samples_leaf: limits the minimum number of samples in a leaf node; if a leaf ends up with fewer samples than this, it is pruned together with its sibling. If the sample size is not large, this value can be ignored; for larger datasets, for example around 100,000 samples, a value such as 5 can be tried.
7. min_weight_fraction_leaf: limits the minimum weighted fraction of all sample weights at a leaf node; leaves below this value are pruned together with their siblings. The default is 0, i.e. sample weights are not considered. Generally, if many samples have missing values, or if the class distribution of the samples is strongly skewed, sample weights are introduced and this value then needs attention.
8. max_leaf_nodes: limiting the maximum number of leaf nodes can prevent overfitting; the default is None, i.e. no limit. If a limit is set, the algorithm builds the best tree it can within that number of leaves. If there are not many features, this value can be ignored; if there are many features, a suitable limit can be found via cross-validation.
9. class_weight: specifies the weight of each class, mainly to prevent classes with too many training samples from biasing the tree towards those classes. You can specify the weights yourself, or use "balanced", in which case the algorithm computes the weights automatically and classes with few samples get higher weights.
10. min_impurity_split: limits the growth of the tree; if a node's impurity (Gini index, information gain, mean squared error, or mean absolute error) is below this threshold, the node is not split further and becomes a leaf.
11. n_estimators: the number of trees to build (a parameter of ensemble models such as random forests rather than of a single decision tree).
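
Putting a few of these parameters together (the values are illustrative, not taken from the post):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        criterion="gini",        # or "entropy"
        splitter="best",
        max_depth=5,
        min_samples_split=10,
        min_samples_leaf=5,
        max_features=None,
        class_weight="balanced",
    )
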
[Figures: code for visualizing the fitted decision tree and the resulting tree diagram]
The above completes the visualization of a decision tree. If you want to save such a figure, you can do it in the following way:
[Figure: code for saving the tree visualization to a file]
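
The saving step is probably based on export_graphviz; a common pattern, continuing from the regression sketch above and assuming pydotplus is installed, looks like this (my reconstruction, not necessarily the code in the figure):

    from sklearn.tree import export_graphviz
    import pydotplus

    # dtr and housing come from the fitting sketch earlier in this section
    dot_data = export_graphviz(
        dtr,
        out_file=None,
        feature_names=housing.feature_names[6:8],
        filled=True,
        rounded=True,
    )
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png("decision_tree.png")    # save the rendered tree as an image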


Source: blog.csdn.net/Escid/article/details/91038389