Classification Algorithms—Decision Tree Exercise

1) What do Xtrain and Xtest mean, respectively?

Xtrain is the portion of the data set's feature columns assigned to the training split; it is the feature data used to fit the model. Xtest is the portion of the feature columns assigned to the test split; it is the feature data used to evaluate how well the trained model performs.

2) What do Ytrain and Ytest represent?

Ytrain is the portion of the data set's label column assigned to the training split; it supplies the class labels used to fit the model. Ytest is the portion of the label column assigned to the test split; it supplies the class labels used when evaluating the trained model.
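A minimal sketch of how these four arrays are produced with scikit-learn's train_test_split (the toy data here is assumed for illustration; the exercise itself uses its own data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 feature columns
y = np.array([0, 1] * 5)           # 10 class labels

# test_size=0.3 sends 30% of the rows to the test split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=30
)
print(Xtrain.shape, Ytrain.shape)  # (7, 2) (7,) - used for fitting
print(Xtest.shape, Ytest.shape)    # (3, 2) (3,) - used for scoring
```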

3) What do floating-point and integer values of the test_size parameter of the train/test splitter represent, respectively?

When test_size is a floating-point value, it gives the proportion of the data set placed in the test split; when test_size is an integer, it gives the absolute number of test samples.
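A quick check of the two forms, using a 10-sample toy array (assumed for illustration): a float of 0.2 and an integer of 2 yield the same test-split size here because 0.2 × 10 = 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10, 2))
y = np.zeros(10)

# float: fraction of the data assigned to the test split
_, Xtest_f, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
# int: absolute number of test samples
_, Xtest_i, _, _ = train_test_split(X, y, test_size=2, random_state=0)

print(len(Xtest_f), len(Xtest_i))  # 2 2
```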

4) Explanation of relevant parameters of the decision tree model

1. Explain the difference between the hyperparameters min_samples_leaf and min_samples_split

min_samples_leaf is the minimum number of samples a leaf node may contain. Since the sample set shrinks as the decision tree divides top-down, setting this threshold means a candidate split is only kept if every resulting child node holds at least min_samples_leaf samples; a split that would leave a child below the threshold is not made, so such under-sized branches never appear in the tree.

min_samples_split is the minimum number of samples an internal node must contain before it may be subdivided. Once this threshold is set, a node whose sample count falls below it will not be split further.

The difference between the two is where the threshold is applied: min_samples_leaf constrains the size of the nodes a split would create (the children), while min_samples_split constrains the size of the node being split (the parent). In scikit-learn, both act as early-stopping (pre-pruning) conditions during tree growth, so the tree simply stops expanding wherever either constraint would be violated.
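A small sketch verifying the leaf-size guarantee on scikit-learn's built-in wine data set (assumed here as a stand-in for the exercise's data): with min_samples_leaf=10, every leaf of the fitted tree must contain at least 10 training samples.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

clf_leaf = DecisionTreeClassifier(min_samples_leaf=10, random_state=30).fit(X, y)

# leaves are the nodes with no left child (-1 in the tree_ arrays)
is_leaf = clf_leaf.tree_.children_left == -1
leaf_sizes = clf_leaf.tree_.n_node_samples[is_leaf]
print(leaf_sizes.min() >= 10)  # True: no leaf smaller than min_samples_leaf
```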

2. Draw the learning curve for the decision tree hyperparameter min_samples_leaf

from sklearn import tree
import matplotlib.pyplot as plt

test1 = []
for i in range(10):
    clf1 = tree.DecisionTreeClassifier(min_samples_leaf=i + 2, criterion="entropy",
                                       random_state=30, splitter="random")
    clf1 = clf1.fit(Xtrain, Ytrain)
    score1 = clf1.score(Xtest, Ytest)
    test1.append(score1)
# x-axis shows the actual min_samples_leaf values tried (2 through 11)
plt.plot(range(2, 12), test1, label="min_samples_leaf")
plt.legend()
plt.show()

Result:

In the resulting plot, the model's test accuracy is highest when min_samples_leaf is 2, stays roughly flat between 2 and 4, and begins to decline after 4. The fit is therefore best when min_samples_leaf lies in the range 2 to 4.

3. Finalized model

clf1 = tree.DecisionTreeClassifier(criterion="entropy", splitter="random", random_state=10,
                                   max_depth=5, min_samples_leaf=10, min_samples_split=10)
clf1 = clf1.fit(Xtrain, Ytrain)
score1 = clf1.score(Xtrain, Ytrain)
print("Training set score:", score1)
Ypredict = clf1.predict(Xtest)
print("Test set predictions:", Ypredict)
scoretest = clf1.score(Xtest, Ytest)
print("Test set score:", scoretest)
test2 = []
for i in range(10):
    clf2 = tree.DecisionTreeClassifier(min_samples_split=i + 2, criterion="entropy",
                                       random_state=30, splitter="random")
    clf2 = clf2.fit(Xtrain, Ytrain)
    score2 = clf2.score(Xtest, Ytest)
    test2.append(score2)
# x-axis shows the actual min_samples_split values tried (2 through 11)
plt.plot(range(2, 12), test2, label="min_samples_split")
plt.legend()
plt.show()

import graphviz
dot_data2 = tree.export_graphviz(clf2, out_file=None, feature_names=feature_name,
                                 class_names=["Gin", "Sherry", "Vermouth"],
                                 filled=True, rounded=True)
graph2 = graphviz.Source(dot_data2)
print(graph2)  # prints the raw DOT source
graph2.render(filename=r"tree_test2", format="jpg")


Origin blog.csdn.net/m0_52051577/article/details/130353371