The predict_proba(self, X) of DecisionTreeClassifier


While debugging some code today, I realized my understanding of how decision trees work is still not deep enough, so I'm writing this down.

The question:

Intuitively, calling predict_proba(X_test) on a DecisionTreeClassifier should return, for each sample, the probability of belonging to each class, with the probabilities summing to 1. But when I printed the output of predict_proba(X_test), every row was a one-hot vector: with 6 classes, for example, the probabilities for a single sample come out as [0, 0, 0, 1, 0, 0]. That was confusing at the time: where did the graded probabilities go, and why did they collapse to a fixed 0/1 value?

Here is the sample code and solution:
Sample code:

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Load the dataset
iris = load_iris()
X, y = iris["data"], iris["target"]
# Standardize the features (zero mean, unit variance)
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
# Define the decision tree
clf = DecisionTreeClassifier(random_state=0)
# After the fix:
# clf = DecisionTreeClassifier(random_state=0, min_samples_leaf=10)
# Fit on the training set
clf.fit(X_train, y_train)
# Export the tree for visualization
with open("D:/PycharmProjects/forestSee\GrowingDcTree/BreastCancer.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)
# Predicted class probabilities on the test set
content = clf.predict_proba(X_test)
print(content)
# Save the predict_proba output to CSV
from pandas import DataFrame
data = DataFrame(content)
data.to_csv('D:\PycharmProjects/forestSee\Label/BreastCancer.csv')

Part of the resulting output:

The header row below gives the class labels; the first column is the sample index.
,0,1,2
0,0.0,1.0,0.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,1.0,0.0,0.0
6,0.0,1.0,0.0
7,0.0,0.0,1.0
8,0.0,0.0,1.0
9,0.0,1.0,0.0
10,0.0,0.0,1.0
11,1.0,0.0,0.0
12,1.0,0.0,0.0
13,1.0,0.0,0.0
14,1.0,0.0,0.0
15,0.0,1.0,0.0
16,0.0,0.0,1.0
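
A quick sanity check (a small sketch of my own, not part of the script above) confirms what the table shows: every row still sums to 1, but the only values that ever appear are 0 and 1.

import numpy as np

proba = clf.predict_proba(X_test)
# Every row is still a valid probability distribution...
print(np.allclose(proba.sum(axis=1), 1.0))  # True
# ...but the only probabilities the fully grown tree ever outputs are 0 and 1
print(np.unique(proba))  # [0. 1.]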

Now take a look at the trained tree:

[Figure: the fully grown decision tree exported above]

Do you see the problem?

At first glance nothing looks wrong, but a closer look shows that every leaf node of this tree contains samples of a single class: the gini value of every leaf is 0. In other words, the tree has been grown to full depth. So when a new test sample is run through the tree, the leaf it lands in is guaranteed to be pure, and the probability vector for a single sample becomes something like [0, 0, 0, 1, 0, 0]: the sample is assigned unambiguously to one class, so the probability for that class is of course the maximum value, 1.
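
You can verify this mechanism directly on the fitted tree (a sketch using sklearn's public tree_ attributes; the variable names are my own). clf.apply(X) returns the index of the leaf each sample falls into, and clf.tree_.value holds the per-class training statistics of every node; depending on the sklearn version these are raw counts or already-normalized fractions, so normalizing covers both cases. predict_proba is exactly those normalized leaf statistics.

import numpy as np

# Leaf node index that each test sample falls into
leaf_ids = clf.apply(X_test)
# Per-class training statistics stored at those leaves
stats = clf.tree_.value[leaf_ids, 0, :]
# Normalizing the leaf statistics reproduces predict_proba exactly
proba = stats / stats.sum(axis=1, keepdims=True)
print(np.allclose(proba, clf.predict_proba(X_test)))  # True
# For the fully grown tree, every leaf the test samples hit is pure
print(np.allclose(clf.tree_.impurity[leaf_ids], 0.0))  # True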

To avoid this problem, you can constrain the tree with min_samples_leaf=10. Let's look at the predict_proba output and the fitted tree again.

The probability output is as follows:

,0,1,2
0,0.0,0.7142857142857143,0.2857142857142857
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,0.0,0.7142857142857143,0.2857142857142857
4,0.0,0.7142857142857143,0.2857142857142857
5,1.0,0.0,0.0
6,0.0,1.0,0.0
7,0.0,0.1,0.9
8,0.0,0.7142857142857143,0.2857142857142857
9,0.0,1.0,0.0
10,0.0,0.1,0.9
11,1.0,0.0,0.0
12,1.0,0.0,0.0
13,1.0,0.0,0.0
14,1.0,0.0,0.0
15,0.0,0.7142857142857143,0.2857142857142857
16,0.0,0.0,1.0
17,0.0,1.0,0.0
18,0.0,0.7142857142857143,0.2857142857142857

The resulting decision tree is as follows:

[Figure: the tree after constraining overfitting with min_samples_leaf=10]

This way, when a new sample is used for testing, the leaf node it falls into does not necessarily contain samples of a single class, so genuine class probabilities can be computed rather than a constant 1.
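
The fractions are consistent with min_samples_leaf=10: 0.7142857... is exactly 5/7, which (one plausible reading, since every leaf must now hold at least 10 training samples) fits a leaf of 14 samples split 10/4 between classes 1 and 2, and the 0.9/0.1 rows fit a 10-sample leaf split 9/1. A small sketch to list the actual leaf compositions of the constrained tree:

import numpy as np

# Leaves are nodes with no children
is_leaf = clf.tree_.children_left == -1
# Number of training samples in each leaf: all >= 10 thanks to min_samples_leaf
print(clf.tree_.n_node_samples[is_leaf])
# Per-class composition of each leaf, normalized to give the probabilities above
value = clf.tree_.value[is_leaf, 0, :]
print(value / value.sum(axis=1, keepdims=True))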
After today's practice, I feel my ability to spot and solve problems has improved. Keep working hard!
