Decision tree summary - DecisionTreeClassifier (1)

Organized following the order of the course material, so that it is easy to remember and understand

Decision tree in sklearn

The module sklearn.tree

The decision tree classes in sklearn all live under the "tree" module, which contains five classes and functions in total:

tree.DecisionTreeClassifier    classification tree
tree.DecisionTreeRegressor     regression tree
tree.export_graphviz           exports the fitted decision tree in DOT format, used for drawing the tree
tree.ExtraTreeClassifier       highly randomized version of the classification tree
tree.ExtraTreeRegressor        highly randomized version of the regression tree

The basic modeling process of sklearn


from sklearn import tree             # import the required module
clf = tree.DecisionTreeClassifier()  # instantiate
clf = clf.fit(X_train, y_train)      # train the model on the training set
result = clf.score(X_test, y_test)   # feed in the test set and query the interface for the information we need

DecisionTreeClassifier and the wine dataset

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

Important parameters

criterion
  • In order to turn a table of data into a tree, the decision tree needs to find the best nodes and the best way to split. For classification trees, the metric that measures this "best" is called "impurity". In general, the lower the impurity, the better the decision tree fits the training set. The branching methods used by current decision-tree algorithms mostly revolve around optimizing some impurity-related index.
  • Impurity is computed per node: every node in the tree has an impurity value, and a child node's impurity is always lower than its parent's. In other words, on the same decision tree, the leaf nodes must have the lowest impurity.

The criterion parameter is exactly what determines how impurity is calculated. sklearn offers two options:

  • Enter "entropy" to use information entropy (Entropy)

    • $Entropy(t) = -\sum_{i=0}^{c-1} p(i|t)\log_2 p(i|t)$
  • Enter "gini" to use the Gini coefficient (Gini Impurity)

    • $Gini(t) = 1 - \sum_{i=0}^{c-1} p(i|t)^2$

Here t denotes a given node, i denotes any class of the label, and p(i|t) denotes the proportion of samples of class i at node t. Note that when information entropy is chosen, sklearn actually computes the entropy-based information gain, i.e. the difference between the entropy of the parent node and the entropy of its child nodes.

Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes it more strongly. In practice, however, the two usually give essentially the same results. Computing entropy is somewhat slower than computing the Gini coefficient, because the Gini calculation involves no logarithms. Also, because entropy is more sensitive to impurity, a tree grown with entropy tends to be more "fine-grained"; on high-dimensional or very noisy data, entropy therefore overfits easily and the Gini coefficient usually works better. When the model is underfitting, i.e. it performs poorly on both the training set and the test set, use information entropy. Of course, none of this is absolute.
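
As a quick numeric illustration of the two formulas above (a small sketch, not part of the original article), consider a hypothetical node whose samples are split 70%/30% between two classes:

import numpy as np

# Hypothetical class proportions p(i|t) at a node t: 70% class 0, 30% class 1
p = np.array([0.7, 0.3])

entropy = -np.sum(p * np.log2(p))  # information entropy, about 0.881
gini = 1 - np.sum(p ** 2)          # Gini impurity, about 0.42

print(entropy, gini)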

Parameter: criterion
How does it affect the model? It determines how impurity is calculated, which helps find the best nodes and the best splits; the lower the impurity, the better the decision tree fits the training set.
What are the possible inputs? If not filled in, it defaults to the Gini coefficient; fill in "gini" to use the Gini coefficient, or "entropy" to use information gain.
How do you choose the parameter? Usually use the Gini coefficient. Use the Gini coefficient when the data dimensionality is high or the data is very noisy; when the dimensionality is low and the data is relatively clean, there is little difference between information entropy and the Gini coefficient. When the decision tree underfits, use information entropy. When in doubt, try one, and if it does not work well, switch to the other.
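
Following the "try both" advice above, here is a minimal sketch (not part of the original code) that fits the tree with each criterion on one fixed split of the wine dataset used later in this article:

# Sketch: compare the two impurity criteria on the same split of the wine dataset
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtr, Xte, Ytr, Yte = train_test_split(wine.data, wine.target, test_size=0.3, random_state=0)

for criterion in ["gini", "entropy"]:
    clf = tree.DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf = clf.fit(Xtr, Ytr)
    print(criterion, clf.score(Xte, Yte))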

At this point, the basic process of the decision tree can be briefly summarized as follows:

[Figure: flowchart of the basic decision-tree building process]

The decision tree stops growing when no more features are available, or when the overall impurity-related metric can no longer be improved.

Build a tree

The code for this part is shown below:

Import required algorithm libraries and modules
# Import packages
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Explore data
# Use one of sklearn's built-in datasets --- the wine dataset
wine = load_wine()

# Show all attributes of the dataset (as a whole, a dictionary-like object)
wine

# We can see the feature data: data, the label data: target, the label names: target_names,
# the description: DESCR, and the feature names: feature_names
wine.keys()

# Shape of the data, normally two-dimensional: 13 features, 178 samples
wine.data.shape

# Show the data: accessed the same way as a dictionary
wine.data

# Labels
wine.target

# Tidy the data above into a DataFrame for easier viewing
import pandas as pd
df = pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)

# Note the way this is written: you can replace all columns at once with a matching list,
# or rename single columns with rename. The features come first in the concat and the
# target column is last, so the "label" name must go at the end.
df.columns = wine.feature_names + ["label"]

# Show the data
df

# List of feature names
wine.feature_names

# List of class names
wine.target_names
Split into training and test sets
# Pass the data in directly
# Split the data into a training set and a test set at a 7:3 ratio, specified by test_size
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)

Xtrain.shape
Xtest.shape
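
Note that the split above is purely random, so both the class proportions and the resulting score can change from run to run. A variant not in the original article that fixes the random seed and keeps the class proportions identical in both sets (the seed value 420 is arbitrary):

# Optional: fix the random seed and stratify by class so the split is reproducible
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420, stratify=wine.target
)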
Modeling
# In practice only three steps are needed
# Step 1: instantiate the decision tree model
# Step 2: feed in the training data and train with fit
# Step 3: evaluate with score (we have the true labels, so the predicted labels can be
#         compared with the true labels to compute the accuracy)
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)  # returns the prediction accuracy
score
# ≈ 0.9259 (the value varies from run to run)
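
Besides score, a few other commonly used interfaces of the fitted tree (a short addition, reusing clf and Xtest from above):

# Predicted class label for each test sample
clf.predict(Xtest)

# Predicted probability of each class for each test sample
clf.predict_proba(Xtest)

# Index of the leaf node that each test sample ends up in
clf.apply(Xtest)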
Draw a tree
# Readable names for the 13 wine features, used to label the plot
feature_name = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
                'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
                'OD280/OD315 of diluted wines', 'Proline']

import graphviz
dot_data = tree.export_graphviz(clf
                             , out_file=None
                             , feature_names=feature_name
                             , class_names=["Gin", "Sherry", "Vermouth"]  # nicknames for the three wine classes
                             , filled=True
                             , rounded=True
)
graph = graphviz.Source(dot_data)
graph

[Figure: the trained decision tree rendered by graphviz]

Explore decision trees
# Feature importances: how important each feature is for the splits of the decision tree
clf.feature_importances_

# Pair each feature with its importance, showing how much each feature contributes to
# reducing the impurity of the tree
[*zip(feature_name, clf.feature_importances_)]
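
To read the importances more easily, they can be sorted in descending order (a small addition, reusing the pandas import and feature_name from above):

# Rank the features by their importance, most important first
pd.Series(clf.feature_importances_, index=feature_name).sort_values(ascending=False)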

We have built a complete decision tree with only a single parameter specified. But go back to step 4, building the model: the score fluctuates around a certain value, so the tree drawn in step 5 is different every time. Why is it unstable? Would it still be unstable on other datasets?

As mentioned before, no matter how decision-tree models evolve, the essence of branching is to optimize some impurity-related metric, and, as we said, impurity is computed per node. In other words, when a decision tree is built, it pursues an optimal tree by optimizing one node at a time; but can the best node at each step guarantee the best tree? Ensemble methods are used to solve this problem: sklearn's reasoning is that since a single tree cannot be guaranteed to be optimal, it builds several different trees and takes the best of them. How can different trees be built from the same dataset? At each split, instead of using all the features, a subset of features is randomly selected, and the one with the best impurity-related metric among them is chosen for the split. This way, the generated tree is different each time.
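
To see this instability directly, one can refit the tree several times without fixing random_state and watch the score move around (a minimal sketch reusing the Xtrain/Xtest split from earlier; the exact numbers will differ between runs):

# Without a fixed random_state, the tree (and therefore the score) can differ on every fit
for i in range(5):
    clf = tree.DecisionTreeClassifier(criterion="entropy")
    clf = clf.fit(Xtrain, Ytrain)
    print(clf.score(Xtest, Ytest))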

Summary

The reasons for the randomness:

  • The first layer of randomness: the randomness of the random seed (random_state);
  • The second layer of randomness: many trees are planted and the one with the best performance is kept;
  • The third layer of randomness: the randomness of the features, each split selects a different subset of all the features, which ensures that the trees differ;

# With random_state fixed, the same tree is built every time, so the score no longer fluctuates
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)  # returns the prediction accuracy

score


Origin juejin.im/post/7085552825316409352