[sklearn classification & regression algorithms] DecisionTreeClassifier (classification tree)



DecisionTreeClassifier classification tree

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
class_weight=None, presort=False)

① Important parameter: criterion (impurity calculation)

♦ Basic concepts

In order to turn a table of data into a tree, the decision tree has to find the best node and the best branching method. For classification trees, the measure of "best" is called impurity. Generally speaking, the lower the impurity, the better the decision tree fits the training set. The core of the branching methods used in today's decision tree algorithms mostly revolves around optimizing some impurity-related index.

Impurity is computed per node: every node in the tree has an impurity value, and a child node's impurity must be lower than its parent's. In other words, within the same decision tree, the leaf nodes must have the lowest impurity.

The criterion parameter determines how impurity is calculated. sklearn provides two options:

  • Enter "entropy" to use information entropy (Entropy)
  • Enter "gini" to use the Gini coefficient (Gini Impurity)
    Entropy(t) = -Σ_i p(i|t) · log2 p(i|t)
    Gini(t) = 1 - Σ_i p(i|t)²

Here t denotes a given node, i denotes any class of the label, and p(i|t) denotes the proportion of samples with label i at node t.

Note: when information entropy is used, sklearn actually computes the information gain (Information Gain) based on it, i.e. the difference between the entropy of the parent node and the entropy of its child nodes. Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes impurity most strongly. In practice, however, the two give essentially the same results. Computing information entropy is slower than computing the Gini coefficient, because the Gini calculation involves no logarithms. In addition, because information entropy is more sensitive to impurity, the tree grows more "finely" when entropy is the criterion, so for high-dimensional or very noisy data entropy easily overfits, and the Gini coefficient is usually the better choice in those cases.
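
To make the two measures concrete, here is a minimal sketch (not part of the original example) that computes both impurities for a hypothetical node whose label proportions are 0.8 and 0.2:

import numpy as np

# Hypothetical node t with two classes: p(0|t) = 0.8, p(1|t) = 0.2
p = np.array([0.8, 0.2])

entropy = -np.sum(p * np.log2(p))   # information entropy of the node
gini = 1 - np.sum(p ** 2)           # Gini impurity of the node

print(entropy)  # about 0.72
print(gini)     # 0.32; both measures drop to 0 for a perfectly pure node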

For more background on information entropy, please refer to the related blog post.



♦ Selection method

(Figure: guidelines for choosing between the Gini coefficient and information entropy.)



♦ Basic calculation process

(Figure: the basic calculation process for growing a decision tree.)

  • The decision tree stops growing once no more features are available, or the overall impurity index is already optimal.



♦ Example

import pandas as pd
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset
wine_data = load_wine()
x = pd.DataFrame(wine_data.data)
y = wine_data.target
feature = wine_data.feature_names
x.columns = feature

# Split into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

# Build the model
clf = DecisionTreeClassifier(criterion="entropy").fit(xtrain, ytrain)
# Return the prediction accuracy
score = clf.score(xtest, ytest)  # 0.9629629629629629

feature_name = ['酒精','苹果酸','灰','灰的碱性','镁','总酚','类黄酮','非黄烷类酚类','花青素','颜色强度','色调','od280/od315稀释葡萄酒','脯氨酸']
import graphviz
dot_data = tree.export_graphviz(clf
                                ,feature_names= feature_name
                                ,class_names=["琴酒","雪莉","贝尔摩德"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph  # renders inline in a notebook; use graph.render() to save to a file
  • filled=True: fill the nodes with colors according to their class

  • rounded=True: draw the node boxes with rounded corners

  • In the rendered tree, each node shows the entropy value produced by the information-entropy impurity calculation mentioned earlier; the smaller the value, the lower the impurity. This verifies the earlier claim: impurity is computed per node, every node in the tree has an impurity value, and a child node's impurity is always lower than its parent's.

  • Check the feature importances and pair them with the feature names (analogous to the coefficients in a regression model).

# Feature importances
feature_im = clf.feature_importances_
match = [*zip(feature_name,clf.feature_importances_)]
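
As a small follow-up (a sketch, not in the original post), the (feature, importance) pairs can be sorted so the most influential features come first:

# Sort the pairs built above by importance, largest first
match_sorted = sorted(match, key=lambda pair: pair[1], reverse=True)
for name, importance in match_sorted:
    print(f"{name}: {importance:.4f}")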



② Important parameters: random_state & splitter

random_state is the parameter that sets the random seed used for branching. The default is None; the randomness is more obvious with high-dimensional data, while for low-dimensional data (such as the iris dataset) it hardly shows. Pass in any integer and the same tree will always be grown, which stabilizes the model.

splitter is also used to control the random options in the decision tree, and it takes two values. With "best", the tree still branches randomly but gives priority to the more important features when splitting (importance can be inspected through the attribute feature_importances_). With "random", the tree branches even more randomly; it tends to grow deeper and larger because it incorporates more unnecessary information about the training set, which lowers the fit to the training data. This is also a way to prevent overfitting. When you expect your model to overfit, these two parameters can help reduce that possibility; of course, once the tree is built, we still rely on the pruning parameters to prevent overfitting.

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                 ,random_state=30
                                 ,splitter="random"
                                 )
clf = clf.fit(xtrain, ytrain)
score = clf.score(xtest, ytest)
score
import graphviz
dot_data = tree.export_graphviz(clf
                               ,feature_names= feature_name
                               ,class_names=["琴酒","雪莉","贝尔摩德"]
                               ,filled=True
                               ,rounded=True
                               )  
graph = graphviz.Source(dot_data)
graph
  • Clearly, when the splitter parameter is set to "random", the fitted tree is deeper than with "best" (the screenshot is incomplete because the image is large).

③ Pruning parameters

Without restrictions, a decision tree will grow until the impurity index is optimal or no more features are available. Such a tree tends to overfit: it performs very well on the training set but poorly on the test set. The sample data we collect can never be fully consistent with the overall population, so when a decision tree explains the training data too well, the rules it finds necessarily include noise from the training samples, and its fit to unknown data becomes insufficient. (Nothing is absolutely perfect: the more perfectly a model fits the training set, with every detail in place, the more limited it becomes, and its ability to adapt to new data may actually be worse.)

To give the decision tree better generalization, we have to prune it. The pruning strategy has a huge influence on the decision tree, and the right pruning strategy is the core of optimizing the algorithm. sklearn provides us with several pruning parameters:

▶ max_depth

max_depth limits the maximum depth of the tree; all branches beyond the set depth are cut off. This is the most widely used pruning parameter and is very effective with high dimensionality and a small sample size. Every extra layer of growth roughly doubles the demand for samples, so limiting the depth of the tree effectively limits overfitting. It is also very useful in ensemble algorithms. In practice it is recommended to start from 3, check the fit, and then decide whether to increase the depth.
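
A minimal sketch of how max_depth could be applied to the wine model above (the value 3 simply follows the "start from 3" advice; the resulting score depends on the split):

clf_depth = tree.DecisionTreeClassifier(criterion="entropy"
                                       ,random_state=30
                                       ,max_depth=3   # branches deeper than 3 are cut off
                                       ).fit(xtrain, ytrain)
print(clf_depth.get_depth())           # never exceeds 3
print(clf_depth.score(xtest, ytest))   # compare with the unpruned score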

▶ min_samples_leaf & min_samples_split

  • min_samples_leaf specifies that after a node is split, each of its children must contain at least min_samples_leaf training samples; otherwise the split will not happen, or the split will be made in a direction that gives every child at least min_samples_leaf samples. It is generally used together with max_depth and works wonders in regression trees, where it makes the model smoother. Setting it too small causes overfitting, while setting it too large prevents the model from learning the data. As a rule of thumb, start from 5; if the number of samples per leaf varies a lot, it is recommended to pass a float, interpreted as a fraction of the sample size. This parameter also guarantees a minimum size for each leaf and helps avoid low-variance, overfit leaf nodes in regression problems. For classification problems with few classes, 1 is usually the best choice.

  • min_samples_split specifies that a node must contain at least min_samples_split training samples before it is allowed to be split; otherwise no split takes place (a combined sketch of both parameters follows below).
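
A combined sketch of the two parameters on the same wine data (the values 10 and 25 are purely illustrative):

clf_samples = tree.DecisionTreeClassifier(criterion="entropy"
                                         ,random_state=30
                                         ,min_samples_leaf=10    # each leaf keeps at least 10 samples
                                         ,min_samples_split=25   # a node needs at least 25 samples to split
                                         ).fit(xtrain, ytrain)
print(clf_samples.score(xtest, ytest))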

▶ max_features & min_impurity_decrease

  • These two parameters are generally used together with max_depth to "fine-tune" the tree.
  • max_features limits the number of features considered when branching; features beyond the limit are discarded. Like max_depth, max_features is a pruning parameter for limiting overfitting on high-dimensional data, but it is more brute-force: it directly caps the number of features the decision tree may use and forces it to stop. Without knowing how important each feature is to the tree, forcing this parameter may lead to underfitting. If you want to prevent overfitting through dimensionality reduction, it is better to use PCA, ICA, or the dimensionality-reduction methods in the feature-selection module.
  • min_impurity_decrease limits the size of the information gain: a split whose information gain is below the set value will not be made. (As noted earlier, when entropy is used, sklearn actually computes the information gain, i.e. the difference between the parent node's entropy and the children's entropy.) In other words, the larger the entropy difference between a parent node and its children, the more that split contributes to the fit of the whole tree. A sketch of these two parameters follows below.
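
A minimal sketch of these two parameters on the wine model (the values 8 and 0.01 are illustrative only):

clf_limited = tree.DecisionTreeClassifier(criterion="entropy"
                                         ,random_state=30
                                         ,max_features=8              # consider at most 8 features per split
                                         ,min_impurity_decrease=0.01  # skip splits whose gain is below 0.01
                                         ).fit(xtrain, ytrain)
print(clf_limited.score(xtest, ytest))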

▶ Determining the optimal pruning parameters

  • So how do we determine the value of each parameter? Here we use a hyperparameter learning curve, continuing with the trained decision tree model clf. A hyperparameter learning curve plots the hyperparameter value on the horizontal axis and the model's evaluation metric on the vertical axis; it measures how the model performs under different values of the hyperparameter. For the decision tree we built, the metric is score.
import matplotlib.pyplot as plt

scores = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(criterion="entropy"
                                     ,random_state=30
                                     ,splitter="random"
                                     ,max_depth=i+1
                                     ).fit(xtrain, ytrain)
    score = clf.score(xtest, ytest)
    scores.append(score)
fig = plt.figure(figsize=(10,6))
plt.plot(range(1,11),scores)
plt.show()

Here we draw the learning curve over different max_depth values. The plot shows that the model's accuracy peaks when the tree depth is 5, drops at 6, and then stays flat, which means that depths beyond 6 have almost no further effect on the model.
(Figure: learning curve of score versus max_depth.)

▶ Target weight parameters: class_weight & min_weight_fraction_leaf

  • class_weight is the parameter for balancing sample labels. Sample imbalance means that in a dataset one label class naturally takes up a large proportion. For example, a bank needs to judge "whether a credit-card holder will default", and the ratio of yes to no may be 1%:99%. In such a situation, even if the model does nothing and predicts "no" for everything, its accuracy is still 99%. Therefore we use the class_weight parameter to balance the sample labels to some extent, giving more weight to the minority labels so the model leans toward the minority class and models in the direction of capturing it. The parameter defaults to None, which automatically gives all labels in the dataset the same weight.

  • With weights, the sample size is no longer simply the number of records but is affected by the input weights, so pruning should be done with the weight-based pruning parameter min_weight_fraction_leaf. Also note that weight-based pruning parameters (such as min_weight_fraction_leaf) are less biased toward the dominant class than criteria that are unaware of sample weights (such as min_samples_leaf). If the samples are weighted, it is easier to optimize the tree structure with a weight-based pre-pruning criterion, which ensures that each leaf contains at least a small fraction of the total sample weight.
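
A minimal sketch of how these two parameters might be passed (the "balanced" mode and the 0.01 fraction are illustrative; the credit-card scenario above is only an analogy, and the wine data used here is fairly balanced):

# "balanced" re-weights classes inversely to their frequencies in ytrain;
# an explicit dict such as {0: 1, 1: 10} can also be supplied.
clf_weighted = tree.DecisionTreeClassifier(criterion="entropy"
                                          ,random_state=30
                                          ,class_weight="balanced"
                                          ,min_weight_fraction_leaf=0.01  # each leaf keeps >= 1% of total weight
                                          ).fit(xtrain, ytrain)
print(clf_weighted.score(xtest, ytest))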



④ Important attributes and interfaces

Attributes are the properties of the model that can be queried after training. For a decision tree the most important one is feature_importances_, which shows how important each feature is to the model. Many sklearn interfaces are similar across algorithms; for example, fit and score, which we used earlier, exist for almost every algorithm. Besides these two, the most commonly used decision-tree interfaces are apply and predict: apply takes the test set and returns the index of the leaf node each test sample falls into, and predict takes the test set and returns the predicted label for each test sample.
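
A short sketch of these interfaces using the model and split from above:

# fit and score were used earlier; apply and predict take the same test set.
leaf_index = clf.apply(xtest)    # index of the leaf each test sample ends up in
y_pred = clf.predict(xtest)      # predicted label for each test sample
print(leaf_index[:5])
print(y_pred[:5])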


Summary

Seven parameters:

  • criterion: impurity calculation
  • Two randomness-related parameters (random_state, splitter)
  • Four pruning parameters (max_depth, min_samples_leaf, max_features, min_impurity_decrease)

One attribute: feature_importances_

Four interfaces: fit, score, apply, predict



Source: blog.csdn.net/qq_45797116/article/details/113352201