Decision Tree study notes

Types of Decision Trees

Module sklearn.tree

tree.DecisionTreeClassifier: classification tree

tree.DecisionTreeRegressor: regression tree

tree.export_graphviz: exports a fitted decision tree in DOT format, which is dedicated to drawing, so the tree can be visualized (e.g. with Graphviz)

tree.ExtraTreeClassifier: a highly randomized version of the classification tree

tree.ExtraTreeRegressor: a highly randomized version of the regression tree

The process of decision tree modeling

sklearn arrives at the final decision tree in a way that resembles an ensemble of classifiers: it repeatedly selects a random subset of features and grows a candidate tree from them, and after several candidate trees have been obtained, the one with the best fit (highest score) is kept; a minimal sketch of this idea follows.
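A minimal sketch of this selection process, assuming Xtrain, Ytrain, Xtest, Ytest already exist (they are only defined later in these notes): build several trees with different random seeds and keep the highest-scoring one.

from sklearn import tree

best_score, best_clf = -1, None
for seed in range(5):
    clf = tree.DecisionTreeClassifier(random_state=seed)   # each seed fixes a different random feature selection
    clf = clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    if score > best_score:                                  # keep the tree with the best fit
        best_score, best_clf = score, clf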

Each node is split using either entropy or the Gini index as the criterion; the chosen measure gives the node's impurity. The lower the impurity, the more uniform the labels of the records that fall into that node. For example, if a node is classified as "apple" but still contains a small number of pears and bananas, its impurity is above 0; an impurity of 0 means every record in the node has the same label, which is why pure leaf nodes have impurity 0. A small worked example follows.
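As an illustration (not in the original notes), for a node containing 8 apples, 1 pear and 1 banana, the Gini impurity is 1 minus the sum of squared class proportions, and the entropy is the negative sum of p * log2(p):

import numpy as np

p = np.array([8, 1, 1]) / 10           # class proportions within the node
gini = 1 - np.sum(p ** 2)              # 1 - (0.64 + 0.01 + 0.01) = 0.34
entropy = -np.sum(p * np.log2(p))      # about 0.92 bits; a pure node would give 0 for both measures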

Relevant code:

modeling code

clf = tree.DecisionTreeClassifier(
    criterion='entropy',      # impurity measure; the default is 'gini'
    random_state=30,          # fixes the random feature selection, so the same input always produces the same tree
    splitter='random',        # 'best' branches on the more important of the randomly considered features; 'random' branches more randomly, lowering the fit on the training set, and is often used to reduce over-fitting
    max_depth=3,              # maximum depth of the tree (not counting the root level); too large over-fits, too small under-fits
    min_samples_leaf=10,      # each child node must contain at least 10 training samples; too large under-fits, too small over-fits
    min_samples_split=10)     # a node must contain at least 10 samples before it may be split

clf=clf.fit(Xtrain,Ytrain)

score=clf.score(Xtest,Ytest)
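The notes never define Xtrain, Ytrain, Xtest, Ytest. As an illustrative assumption (not stated in the original), they could come from sklearn's wine dataset:

from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()                       # hypothetical dataset choice, for illustration only
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3, random_state=420)
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)          # mean accuracy on the test set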

visualization code

import graphviz

feature_name=['a','b','c']

dot_data = tree.export_graphviz(clf, feature_names=feature_name, class_names=['a','b','v'], filled=True, rounded=True)

graph=graphviz.Source(dot_data)

Note: feature_names lists the feature names of the data set, class_names lists the class labels of the records, filled controls whether each node is colored according to its class, and rounded draws nodes with rounded corners.
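To actually view or save the figure, one option (using the graphviz package's standard render method) is:

graph.render('decision_tree')   # writes decision_tree.pdf in the current directory
# In a Jupyter notebook, simply evaluating `graph` displays the tree inline.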

# View the feature importances of the decision tree

clf.feature_importances_

list(zip(feature_name, clf.feature_importances_))   # pair each feature name with its importance

Choosing the best parameters

import matplotlib.pyplot as plt

test=[]

for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth=i + 1,
                                      criterion='entropy',
                                      random_state=30)
    clf = clf.fit(Xtrain, Ytrain)       # fit, score and append must happen inside the loop
    score = clf.score(Xtest, Ytest)
    test.append(score)

plt.plot(range(1,11),test,color='red',label='max_depth')
plt.legend()
plt.show()

Loop over the candidate values of a parameter and pick the one with the best score; the same approach applies not only to max_depth but also to the other parameters. A grid-search sketch covering several parameters at once follows.
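A hedged sketch of tuning several parameters at once with sklearn's GridSearchCV (not mentioned in the original notes); the parameter ranges here are illustrative:

from sklearn import tree
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': list(range(1, 11)),      # illustrative candidate values
              'min_samples_leaf': [5, 10, 20],
              'criterion': ['gini', 'entropy']}
gs = GridSearchCV(tree.DecisionTreeClassifier(random_state=30), param_grid, cv=5)
gs = gs.fit(Xtrain, Ytrain)
gs.best_params_   # best parameter combination found
gs.best_score_    # its cross-validated score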

important interfaces

clf.apply(Xtest)    # returns the index of the leaf node each sample falls into

clf.predict(Xtest)  # returns the predicted class (classification) or value (regression) for each sample
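For completeness (illustrative, assuming the classifier trained above), a classification tree also exposes predict_proba:

leaf_index = clf.apply(Xtest)        # one leaf id per test sample
y_pred = clf.predict(Xtest)          # one predicted label per test sample
y_proba = clf.predict_proba(Xtest)   # class probability estimates (classification trees only)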


Origin blog.csdn.net/xzhu4571/article/details/125298576