Notes organized from Caicai's (菜菜) course, for easy recall and understanding
Decision tree in sklearn
The sklearn.tree module
All of sklearn's decision tree classes live in the "tree" module. The module provides the following five items (four estimator classes and one export function):
Name | Description |
---|---|
tree.DecisionTreeClassifier | Classification tree |
tree.DecisionTreeRegressor | Regression tree |
tree.export_graphviz | Exports a fitted decision tree to DOT format, for drawing |
tree.ExtraTreeClassifier | Highly randomized version of the classification tree |
tree.ExtraTreeRegressor | Highly randomized version of the regression tree |
The basic modeling process of sklearn
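from sklearn import tree #import the required module
clf = tree.DecisionTreeClassifier() #instantiate the model
clf = clf.fit(X_train,y_train) #train the model on the training set
result = clf.score(X_test,y_test) #feed in the test set and read the result (here, accuracy) from the interface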
DecisionTreeClassifier and wine dataset
class sklearn.tree.DecisionTreeClassifier (criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
Important parameters
criterion
- To turn a table of data into a tree, the decision tree has to find the best node and the best way to branch at it. For classification trees, the metric that defines "best" is called "impurity". In general, the lower the impurity, the better the tree fits the training set. Most decision tree algorithms in use today build their branching method around optimizing some impurity-related metric.
- Impurity is computed per node, so every node in the tree has its own impurity. A split is chosen so that the weighted average impurity of the child nodes is no higher than the impurity of the parent node; as a result, impurity generally decreases down the tree and is lowest at the leaf nodes.
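The criterion parameter determines how impurity is calculated. sklearn offers two options:
- "entropy": use information entropy (Entropy)
$$Entropy(t) = -\sum_{i=0}^{c-1} p(i|t)\log_2 p(i|t)$$
- "gini": use Gini impurity (Gini Impurity)
$$Gini(t) = 1 - \sum_{i=0}^{c-1} p(i|t)^2$$
Here t is the given node, i is any class of the label (out of c classes), and p(i|t) is the proportion of class i among the samples at node t. Note that when entropy is used, sklearn actually computes the information gain based on entropy, that is, the difference between the entropy of the parent node and the (weighted) entropy of the child nodes.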
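As a small worked example of the two impurity measures, the sketch below computes entropy, Gini impurity, and information gain by hand for a toy node. The function names (`class_proportions`, `entropy`, `gini`, `information_gain`) and the label lists are made up for illustration; this is not how sklearn implements the calculation internally.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p(i|t) of each class among the samples at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Information entropy of a node, using log base 2."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity of a node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# A toy node with two balanced classes, split into two less mixed children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print("entropy:", entropy(parent))
print("gini:", gini(parent))
print("information gain:", information_gain(parent, [left, right]))
```

For a perfectly balanced two-class node, entropy reaches its maximum of 1 and Gini impurity its maximum of 0.5; both drop to 0 for a pure node.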
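Compared with Gini impurity, information entropy is more sensitive to impurity and penalizes it most strongly. In practice, however, the two give essentially the same results. Entropy is somewhat slower to compute, because Gini impurity does not involve logarithms. In addition, because entropy is more sensitive to impurity, a tree grown with entropy tends to be more "fine-grained", so on high-dimensional or very noisy data entropy overfits easily, and Gini impurity often works better in that situation. When the model underfits, that is, when it performs poorly on both the training set and the test set, use entropy. Of course, none of this is absolute.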
Parameter | criterion |
---|---|
How does it affect the model? | Determines how impurity is calculated, which is used to find the best nodes and branches. The lower the impurity, the better the decision tree fits the training set. |
What are the possible inputs? | Defaults to Gini impurity if left unset; pass "gini" to use Gini impurity, or "entropy" to use information gain. |
How to choose? | Gini impurity is the usual choice. With high-dimensional or very noisy data, use Gini impurity. With low-dimensional, relatively clean data, entropy and Gini impurity make no real difference. When the decision tree underfits, use entropy. Try both; if one does not work well, switch to the other. |
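The "try both" advice can be sketched as a quick comparison. The example below is one possible way to do it, not the author's procedure: it scores each criterion with 5-fold cross-validation on the wine dataset. The choices `cv=5` and `random_state=0` are assumptions made here only to keep the comparison repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# Mean cross-validated accuracy for each impurity criterion.
scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores[criterion] = cross_val_score(clf, wine.data, wine.target, cv=5).mean()

for name, mean_accuracy in scores.items():
    print(f"{name}: {mean_accuracy:.3f}")
```

On a small, clean dataset like this one the two criteria usually land close together, which is consistent with the guidance in the table.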
At this point, the basic process of the decision tree can be briefly summarized as follows:
The decision tree keeps growing until no more features are available or the overall impurity metric is already optimal, and then it stops.
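To make one step of this growth process concrete, the sketch below searches a single feature for the threshold that minimizes the size-weighted Gini impurity of the two child nodes. It is a simplified illustration with hypothetical helper names (`gini`, `best_threshold`) and toy data; sklearn's real splitter is an optimized implementation that differs in detail.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # baseline: impurity of the node itself
    for k in range(1, len(pairs)):
        low, high = pairs[k - 1][0], pairs[k][0]
        if low == high:
            continue  # identical feature values cannot be separated
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = ((low + high) / 2, weighted)
    return best


values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node and stops when no split lowers the impurity any further or a stopping parameter is reached.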
Build a tree
The code for this part can be viewed at this location:
Import required algorithm libraries and modules
# Imports
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Explore data
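# This uses one of sklearn's built-in datasets: the wine dataset
wine = load_wine()
# Show the whole dataset object (a dictionary-like Bunch)
wine
# Available keys: data (features), target (labels), target_names (class names), DESCR (description), feature_names (feature names)
wine.keys()
# Shape of the feature matrix: 178 samples, 13 features
wine.data.shape
# Feature data, accessed like a dictionary entry
wine.data
# Labels
wine.target
# Combine features and labels into one DataFrame for a nicer display
import pandas as pd
df = pd.concat([pd.DataFrame(wine.data),pd.DataFrame(wine.target)],axis=1)
# Rename all columns at once with a list of equal length (rename can be used to change a single column).
# The label column was concatenated last, so "label" goes at the end of the list.
df.columns = wine.feature_names + ["label"]
# Show the data
df
# List of feature names
wine.feature_names
# List of class names
wine.target_names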
Divide training set and test set
# Pass the data in directly
# Split the data into a training set and a test set; test_size=0.3 gives a 70/30 split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data,wine.target,test_size=0.3)
Xtrain.shape
Xtest.shape
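Because `train_test_split` shuffles the data, the split itself changes from run to run unless a seed is fixed. The sketch below shows one way to make it repeatable and to keep the class proportions equal in both parts. The seed value `0` and the use of `stratify` are choices made for this illustration; the original split uses neither.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# random_state fixes the shuffle; stratify keeps class proportions similar
# in the training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data,
    wine.target,
    test_size=0.3,
    random_state=0,
    stratify=wine.target,
)

print(Xtrain.shape, Xtest.shape)
```

With 178 samples and `test_size=0.3`, the test set holds 54 samples and the training set 124.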
Modeling
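# Only three steps are needed
# Step 1: instantiate the decision tree model
# Step 2: train it on the training data with fit
# Step 3: score it with score (possible because we have the true labels, so the predicted labels can be compared with them to compute accuracy)
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest) # returns the prediction accuracy
score
# e.g. 0.925925 (about 92.6%) in the author's run; the exact value varies from run to run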
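To show what `score` measures for a classifier, the sketch below compares it with an accuracy computed by hand from `predict`. The fixed `random_state=0` values are assumptions added here so the example is repeatable; they are not part of the original code.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(Xtrain, Ytrain)

# Accuracy is the fraction of test samples whose predicted label is correct.
predicted = clf.predict(Xtest)
manual_accuracy = (predicted == Ytest).mean()

print(clf.score(Xtest, Ytest))
print(manual_accuracy)
print(accuracy_score(Ytest, predicted))
```

All three printed values agree, since `score` on a classifier returns the mean accuracy on the given data.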
Draw a tree
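feature_name = ['alcohol','malic acid','ash','alcalinity of ash','magnesium','total phenols','flavanoids','nonflavanoid phenols','proanthocyanins','color intensity','hue','OD280/OD315 of diluted wines','proline']
import graphviz
# class_names are arbitrary display labels for the three wine classes
dot_data = tree.export_graphviz(clf
                                ,out_file = None
                                ,feature_names= feature_name
                                ,class_names=["Gin","Sherry","Vermouth"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph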
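If Graphviz is not installed, a plain-text view of the tree is a possible alternative. The sketch below uses `sklearn.tree.export_text` (available in scikit-learn 0.21 and later) instead of the author's `export_graphviz` approach. Limiting the tree with `max_depth=3` and fixing `random_state=0` are choices made here only to keep the output short and repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(wine.data, wine.target)

# Render the fitted tree as indented if/else rules.
rules = export_text(clf, feature_names=list(wine.feature_names))
print(rules)
```

Each indented line is a test on one feature, and each leaf line reports the predicted class.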
Explore decision trees
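# Feature importances: how much each feature contributed to the splits in the tree
clf.feature_importances_
# Pair each feature name with its importance, showing how much each feature helped reduce impurity
[*zip(feature_name,clf.feature_importances_)]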
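The paired list is easier to read when sorted. The sketch below ranks the features from most to least important and checks that the importances add up to 1. It refits a tree on the full dataset with `random_state=0`, which is an assumption made here so the example is self-contained and repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(wine.data, wine.target)

# Sort (name, importance) pairs from most to least important.
ranked = sorted(
    zip(wine.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:30s} {importance:.3f}")

total_importance = sum(clf.feature_importances_)
print("total:", total_importance)
```

Features that were never used in a split have an importance of 0, so only a handful of the 13 features usually carry weight.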
We have built a complete decision tree while knowing only one parameter. However, if we go back to the modeling step and rebuild the model, the score fluctuates around a certain value, and the tree drawn in the following step is different each time. Why is it unstable? Would it still be unstable on other datasets?
As mentioned before, however the decision tree model evolves, the essence of branching is to optimize some impurity-related metric, and impurity is calculated per node. In other words, a decision tree tries to reach an optimal tree by optimizing one node at a time. But do optimal nodes guarantee an optimal tree? Ensemble algorithms were designed to address this problem. sklearn's position is that, since a single tree cannot be guaranteed to be optimal, we should build many different trees and take the best of them. How can different trees be built from the same dataset? At each split, instead of using all the features, a random subset of the features is drawn, and the feature with the best impurity-related metric among them is used for the split. This way, a different tree is generated each time.
Summary
The sources of randomness:
- The first layer of randomness: the random number generator itself, which gives different results on each run unless random_state is fixed;
- The second layer of randomness: many trees are grown and the one with the best performance is kept;
- The third layer of randomness: the randomness of features, where each split draws a different subset of all the features so that the trees differ.
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest) # returns the prediction accuracy
score
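The effect of `random_state` can be checked directly. The sketch below, a minimal illustration rather than the author's experiment, fits the same tree several times on one fixed split: repeated fits with the same seed give the same score, while different seeds may or may not give different scores. The helper name `fit_and_score` and the split seed `0` are introduced here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
# Fix the split so that only the tree's own randomness varies.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)


def fit_and_score(seed):
    """Train an entropy-based tree with the given seed and return test accuracy."""
    clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    return clf.fit(Xtrain, Ytrain).score(Xtest, Ytest)


same_seed = [fit_and_score(30) for _ in range(3)]
other_seeds = [fit_and_score(seed) for seed in range(5)]

print("same seed:", same_seed)
print("different seeds:", other_seeds)
```

Fixing `random_state` on the classifier makes the tree repeatable, but the score still depends on how the data was split, so the split needs its own fixed seed as well.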