Machine Learning Notes - Sklearn Implementation of Classification Decision Trees

In this post, we learn to use Python's sklearn library to implement a classification decision tree. The data set used is sklearn's built-in breast cancer classification data set.

1. Basic implementation

from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
cancer=datasets.load_breast_cancer()
print(cancer.keys())

The data set is stored as a dictionary: data holds the feature matrix, target holds the labels, target_names gives the actual meaning of each label (1/0), and feature_names gives the name of each feature.

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
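
For instance, a quick look at the individual pieces (a small sketch; the breast cancer set has 569 samples and 30 features):

print(cancer['data'].shape)        # feature matrix, (569, 30)
print(cancer['target'].shape)      # label vector, (569,)
print(cancer['target_names'])      # ['malignant' 'benign'], the meaning of labels 0/1
print(cancer['feature_names'][:3]) # names of the first three features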

We use default parameters for model training.

clf=dtc()  # use default parameters
clf.fit(cancer['data'],cancer['target'])
DecisionTreeClassifier()

Visualize the tree with sklearn's built-in plotting function. As you can see below, the resulting picture does not look very good.

from sklearn import tree
plt.figure(figsize=(20,12))
tree.plot_tree(clf)
plt.show()

output_4_0.png

So we use the graphviz package instead. To install it, it is not enough to pip install the Python package; you also need to download the installer from the official graphviz website and add its installation path to your environment variables. For details you can refer to other blog posts; it is not covered here.

import graphviz  # package used to draw the decision tree
xx=tree.export_graphviz(clf,
                        out_file=None,
                        feature_names=cancer['feature_names'],  # feature names
                        class_names=cancer['target_names'],  # class names
                        filled=True,  # fill nodes with color
                        rounded=False,  # whether node corners are rounded
                       )
graph1 = graphviz.Source(xx)
#graph1.render('111')  # export to pdf; it seems images can be exported as well
graph1

output_6_0.svg

2. Parameter learning

In this section, we learn some common parameters of classification decision trees.

def graph(clf):
    clf.fit(cancer['data'],cancer['target'])
    xx=tree.export_graphviz(clf,
                        out_file=None,
                        feature_names=cancer['feature_names'],
                        class_names=cancer['target_names'],
                        filled=True,  # fill nodes with color
                        rounded=False,  # whether node corners are rounded
                       )
    graph1 = graphviz.Source(xx) 
    return graph1

2.1 Splitting criterion

criterion: the splitting criterion; the optional values are "gini" and "entropy", corresponding to the Gini impurity and information gain. According to sklearn's documentation, the decision tree uses the CART algorithm by default, whose splitting measure is the Gini impurity. ID3 splits on information gain (so does setting the criterion to entropy make it behave like ID3?), and C4.5 uses the information gain ratio, which is not provided here.
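
As a quick reminder of what the two criteria compute, here is a small sketch that evaluates both impurity measures on the root-node class distribution of this data set (357 positive and 212 negative samples, as counted later in this post):

# class proportions at the root node: 357 benign (1) vs. 212 malignant (0)
p = np.array([357, 212]) / (357 + 212)
gini = 1 - np.sum(p ** 2)          # Gini impurity: 1 - sum_k p_k^2
entropy = -np.sum(p * np.log2(p))  # entropy: -sum_k p_k * log2(p_k)
print(gini, entropy)               # roughly 0.47 and 0.95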

The figure below differs from the Gini-based tree above, presumably because a different splitting criterion leads to different split points and split features.

clf2=dtc(criterion='entropy')  # use information gain
graph2 = graph(clf2)
graph2

output_10_0.svg

2.2 Splitter

The decision tree supports two splitter strategies, "best" and "random"; the default is "best".

With "best", the best split point is selected at each node: although the tree introduces some randomness when branching, it still preferentially picks the more important features. With "random", the split point is chosen at random each time.

lran=[]
lbes=[]
for i in range(100):
    clf3=dtc(splitter="random")
    clf4=dtc(splitter="best")
    random=cross_val_score(clf3, cancer['data'], cancer['target'], cv=5)
    best=cross_val_score(clf4, cancer['data'], cancer['target'], cv=5)
    lran.append(random.mean())
    lbes.append(best.mean())
plt.figure(figsize=(16,6))
plt.plot(lran,label='random',c='darkblue')
plt.axhline(np.mean(lran),c='green')
plt.plot(lbes,label='best',c='darkred')
plt.axhline(np.mean(lbes),c='orange')
plt.legend()
plt.grid()
plt.show()

output_13_0.png

As the figure above shows, splitting with "random" is less stable than splitting with "best".

clf3_1=dtc(splitter="random",random_state=3)
clf4_1=dtc(splitter="best",random_state=3)
graph3_1 = graph(clf3_1) 
graph3_1

output_15_0.svg

graph4_1 = graph(clf4_1) 
graph4_1

output_16_0.svg

From the two trees above, we can see that the "random" splitter produces a deeper and larger decision tree.
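
To check this numerically rather than by eye, the depth and leaf count of the two fitted trees can be compared directly (a small sketch reusing the same random_state as above):

clf3_1.fit(cancer['data'],cancer['target'])
clf4_1.fit(cancer['data'],cancer['target'])
# the randomly split tree is typically deeper and has more leaves
print("random:",clf3_1.get_depth(),clf3_1.get_n_leaves())
print("best:  ",clf4_1.get_depth(),clf4_1.get_n_leaves())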

2.3 Maximum depth

max_depth: int, default None. The maximum depth of the tree. If None, nodes are expanded until every leaf contains fewer than min_samples_split samples. I think this parameter is a rather blunt pruning method: once the maximum depth is reached, splitting stops immediately.

lmd=[]
for i in range(1,10):
    ll=0
    for j in range(100):
        clf5=dtc(max_depth=i)
        ll+=cross_val_score(clf5, cancer['data'], cancer['target'], cv=5).mean()
    lmd.append(ll/100)
plt.plot(lmd,marker='o',c='darkblue',mfc='orange')
plt.show()

output_19_0.png

clf5=dtc(max_depth=3)
graph5= graph(clf5) 
graph5

output_20_0.svg

2.4 Minimum number of samples to split

min_samples_split: the minimum number of samples required to split an internal node; an int or a float, default 2. If it is an integer n, a node can only be split if it contains at least n samples; if it is a float p, the minimum number of samples is n_samples*p.

Taking the integer case as an example, the meaning of this parameter is that a node is split, provided the splitting conditions are met, only when it contains at least min_samples_split samples.

lms=[]
for i in range(3,50,3):
    ll=0
    for j in range(100):
        clf6=dtc(min_samples_split=i)
        ll+=cross_val_score(clf6, cancer['data'], cancer['target'], cv=5).mean()
    lms.append(ll/100)
plt.plot([i for i in  range(3,50,3)],lms,marker='8',mfc='pink')
plt.show()

output_23_0.png

clf6=dtc(min_samples_split=20)
graph6= graph(clf6) 
graph6

output_24_0.svg

2.5 Minimum number of samples per leaf

min_samples_leaf

int or float, default 1

The minimum number of samples required at a leaf node. A split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then min_samples_leaf is the minimum number of samples.
  • If a float p, it is a fraction, and the minimum number of samples per node is min_samples_leaf = p * n_samples.

lml=[]
for i in range(3,50,3):
    ll=0
    for j in range(100):
        clf7=dtc(min_samples_leaf=i)
        ll+=cross_val_score(clf7, cancer['data'], cancer['target'], cv=5).mean()
    lml.append(ll/100)
plt.plot([i for i in  range(3,50,3)],lml,marker='o',mfc='lightgreen')
plt.show()

output_27_0.png

clf7=dtc(min_samples_leaf=20)
graph7= graph(clf7) 
graph7

output_28_0.svg

2.6 Maximum number of features

max_features: the number of features to consider when looking for the best split.

int: if an integer n is given, n features are considered.

float: if a float p is given, the number of features considered is n_features*p.

auto: max_features=sqrt(n_features)

sqrt: max_features=sqrt(n_features)

log2: max_features=log2(n_features)

None: max_features=n_features

My understanding is that this parameter controls how many features are considered each time a splitting feature is chosen: if it is set to 9, then at every split the splitting feature is picked from some set of nine features.

lmscore=[]
for i in range(1,31):
    ll=0
    for j in range(100):
        clf8=dtc(max_features=i)
        ll+=cross_val_score(clf8, cancer['data'], cancer['target'], cv=5).mean()
    lmscore.append(ll/100) 
plt.plot([i for i in  range(1,31)],lmscore,marker='o',mfc='green')
plt.show()

output_31_0.png

dep=[]
for i in range(1,31):
    ll=0
    for j in range(100):
        clf8=dtc(max_features=i)
        clf8.fit(cancer['data'], cancer['target'])
        ll+=clf8.get_depth()  # get the tree depth
    dep.append(ll/100)    
plt.plot([i for i in  range(1,31)],dep,marker='o',mfc='green')
plt.title("depth")
plt.show()

output_33_0.png

As we can see, the depth of the decision tree is related to max_features, and the relationship is roughly negative.

2.7 Class weights

class_weight: a dict, a list of dicts, or "balanced". Note that for multi-output problems (including multilabel), the weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification, the weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] rather than [{1:1}, {2:5}, {3:1}, {4:1}].

print("1的数量",sum(cancer['target']))
print("0的数量",len(cancer['target'])-sum(cancer['target']))
Number of 1s: 357
Number of 0s: 212

First, we can see that this data set is not balanced. We vary the weights of the two classes and compute the mean score; when the weight of class 1 is roughly 0.4, the validation score is highest. Roughly speaking, this weight just balances the two classes, so it could also be replaced by class_weight="balanced".

lmcw=[]
for i in range(1,10):
    ll=0
    for j in range(100):
        clf9=dtc(class_weight={1:i/10,0:(1-i/10)})
        ll+=cross_val_score(clf9, cancer['data'], cancer['target'], cv=5).mean()
    lmcw.append(ll/100)
plt.plot([i/10 for i in range(1,10)],lmcw,marker='o',mfc='red')
plt.show()

output_37_0.png
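
As a quick check of that claim, class_weight="balanced" can also be scored directly and compared with the default weighting (a small sketch; the exact numbers differ from run to run):

lbal=[]
ldef=[]
for j in range(100):
    clf_bal=dtc(class_weight="balanced")
    clf_def=dtc()
    lbal.append(cross_val_score(clf_bal, cancer['data'], cancer['target'], cv=5).mean())
    ldef.append(cross_val_score(clf_def, cancer['data'], cancer['target'], cv=5).mean())
print(np.mean(lbal),np.mean(ldef))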

2.8 Other parameters

min_impurity_decrease: float, default 0.0. A node is split if the split induces a decrease in impurity greater than or equal to this value.

max_leaf_nodes: int, default None. Grow a tree with at most max_leaf_nodes leaves in best-first fashion, where the best nodes are defined by the relative reduction in impurity. If None, the number of leaf nodes is unlimited.

min_weight_fraction_leaf: float, default 0.0. The minimum weighted fraction of the total weight of all input samples required to be at a leaf node. When sample_weight is not provided, all samples have equal weight.

random_state: int, RandomState instance or None, default None. Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm randomly selects max_features features at each split before finding the best split among them. However, the best split found may differ across runs even when max_features = n_features; this happens when the improvement of the criterion is identical for several splits and one of them has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
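
These parameters were not tested above; the following small sketch only shows how they might be set (the values are arbitrary, chosen purely for illustration):

clf10=dtc(min_impurity_decrease=0.01,  # split only if impurity decreases by at least 0.01
          max_leaf_nodes=10,           # grow at most 10 leaves, best-first
          random_state=0)              # fix the randomness for reproducible trees
clf10.fit(cancer['data'],cancer['target'])
print(clf10.get_depth(),clf10.get_n_leaves())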

3. Attributes and methods

3.1 Attributes

classes_: ndarray of shape (n_classes,) or list of ndarray. The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).

**feature_importances_: ndarray of shape (n_features,)** The impurity-based importance of each feature; the higher the value, the more important the feature.

max_features_: the inferred value of max_features.

**n_classes_** int or list of int: the number of classes (for a single-output problem), or a list containing the number of classes for each output (for multi-output problems).
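
A small sketch of reading these attributes from a fitted tree, reusing the clf trained with default parameters in section 1:

print(clf.classes_)        # class labels, here [0 1]
print(clf.n_classes_)      # number of classes, here 2
print(clf.max_features_)   # inferred value of max_features
# impurity-based importances, one value per feature; only a few are non-zero
for name,imp in zip(cancer['feature_names'],clf.feature_importances_):
    if imp>0:
        print(name,round(imp,3))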

3.2 Methods

Bold indicates the commonly used methods.

decision_path(X[, check_input]): return the decision path of the samples in the tree.

fit(X, y[, sample_weight, check_input, ...]): train the model on the training set.

get_depth(): return the depth of the decision tree.

get_n_leaves(): return the number of leaf nodes of the decision tree.

get_params([deep]): return the parameters of the classifier.

predict(X[, check_input]): return the predicted class (classification) or predicted value (regression) for the test samples.

predict_log_proba(X): return the log of the class probabilities of the test samples.

predict_proba(X[, check_input]): return the class probabilities of the test samples.

score(X, y[, sample_weight]): return the mean accuracy on the given test samples and labels.
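
A small sketch of the most common methods, again reusing clf from section 1 (the training data is reused as "test" data here purely for illustration):

X,y=cancer['data'],cancer['target']
print(clf.get_depth(),clf.get_n_leaves())  # depth and number of leaves
print(clf.predict(X[:5]))                  # predicted classes of the first five samples
print(clf.predict_proba(X[:5]))            # class probabilities of the first five samples
print(clf.score(X,y))                      # mean accuracy on (X, y)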

References

sklearn.tree.DecisionTreeClassifier — scikit-learn 1.0.2 documentation

1.10. Decision Trees — scikit-learn 1.0.2 documentation


Origin juejin.im/post/7078890210276147231