Naive Bayesian and Probabilistic Model Evaluation Indicators

Table of contents

1. Gaussian Naive Bayes GaussianNB

1.1 Understanding Gaussian Naive Bayes

1.2 Exploring Bayes: Fitting Performance and Speed of Gaussian Naive Bayes

2. Probabilistic model evaluation indicators

2.1 Brier Score

2.2 Logarithmic likelihood function Log Loss 

2.3 Reliability Curve

2.3.1 Draw a calibration curve on Bayesian using the reliability curve class

2.3.2 How does the curve change under different n_bins values

2.3.3 Build more models

2.4 Prediction probability histogram

2.5 Calibration reliability curve

2.5.1 Wrapper functions

2.5.2 Function-based plotting

2.5.3 Viewing changes in Naive Bayes accuracy based on calibration results

2.5.4 Calibration of SVC


1. Gaussian Naive Bayes GaussianNB

1.1 Understanding Gaussian Naive Bayes

        class sklearn.naive_bayes.GaussianNB(priors=None, var_smoothing=1e-09)

        Gaussian Naive Bayes estimates the conditional probability P(x_{i}|Y) of each feature under each class by assuming that it obeys a Gaussian (i.e. normal) distribution. For each feature value, Gaussian Naive Bayes uses the following formula:

P(x_{i}|Y)=f(x_{i};\mu_{y},\sigma_{y})\cdot\varepsilon=\frac{1}{\sqrt{2\pi\sigma_{y}^{2}}}\exp\left(-\frac{(x_{i}-\mu_{y})^{2}}{2\sigma_{y}^{2}}\right)

For each value of Y, Bayes aims to maximize P(x_{i}|Y), so that it can compare under which label a sample is more likely. With maximizing P(x_{i}|Y) as the goal, Gaussian Naive Bayes solves the formula for the parameters \mu_{y} and \sigma_{y}. Once the parameters are solved, plugging in a value x_{i} yields the probability P(x_{i}|Y). The class has two parameters (priors and var_smoothing), but when instantiating it we usually do not need to pass anything; it is a very lightweight class, so Bayes has few hyperparameters to tune.

# Import the libraries and data we need
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits=load_digits() # handwritten digits dataset
x,y=digits.data,digits.target
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=420)
# Build the model
gnb=GaussianNB().fit(xtrain,ytrain)
acc_score=gnb.score(xtest,ytest)  # returns the prediction accuracy
acc_score
0.8592592592592593
# View the predicted labels
y_pred=gnb.predict(xtest)
y_pred
array([6, 1, 3, 0, 4, 5, 0, 8, 3, 8, 6, 8, 7, 8, 8, 8, 5, 9, 5, 6, 5, 4,
       7, 4, 8, 2, 7, 2, 8, 9, 2, 8, 3, 6, 0, 3, 8, 8, 1, 5, 2, 8, 8, 9,
       2, 2, 0, 7, 3, 6, 7, 2, 8, 0, 5, 4, 1, 9, 4, 0, 5, 8, 9, 1, 7, 8,
       7, 5, 8, 2, 4, 4, 8, 2, 6, 1, 2, 1, 7, 8, 8, 5, 9, 4, 3, 6, 9, 7,
       4, 2, 4, 8, 0, 5, 7, 7, 7, 4, 7, 8, 8, 7, 0, 7, 2, 1, 9, 9, 8, 7,
       1, 5, 1, 8, 0, 4, 8, 9, 5, 6, 4, 8, 3, 8, 0, 6, 8, 6, 7, 6, 1, 8,
       5, 0, 8, 2, 1, 8, 8, 6, 6, 0, 2, 4, 7, 8, 9, 5, 9, 4, 7, 8, 8, 6,
       7, 0, 8, 4, 7, 2, 2, 6, 4, 4, 1, 0, 3, 4, 3, 8, 7, 0, 6, 9, 7, 5,
       5, 3, 6, 1, 6, 6, 2, 3, 8, 2, 7, 3, 1, 1, 6, 8, 8, 8, 7, 7, 2, 5,
       0, 0, 8, 6, 6, 7, 6, 0, 7, 5, 5, 8, 4, 6, 5, 1, 5, 1, 9, 6, 8, 8,
       8, 2, 4, 8, 6, 5, 9, 9, 3, 1, 9, 1, 3, 3, 5, 5, 7, 7, 4, 0, 9, 0,
       9, 9, 6, 4, 3, 4, 8, 1, 0, 2, 9, 7, 6, 8, 8, 0, 6, 0, 1, 7, 1, 9,
       5, 4, 6, 8, 1, 5, 7, 7, 5, 1, 0, 0, 9, 3, 9, 1, 6, 3, 7, 2, 7, 1,
       9, 9, 8, 3, 3, 5, 7, 7, 7, 3, 9, 5, 0, 7, 5, 5, 1, 4, 9, 2, 0, 6,
       3, 0, 8, 7, 2, 8, 1, 6, 4, 1, 2, 5, 7, 1, 4, 9, 5, 4, 2, 3, 5, 9,
       8, 0, 0, 0, 0, 4, 2, 0, 6, 6, 8, 7, 1, 1, 8, 1, 1, 7, 8, 7, 8, 3,
       1, 4, 6, 1, 8, 1, 6, 6, 7, 2, 8, 5, 3, 2, 1, 8, 7, 8, 5, 1, 7, 2,
       1, 1, 7, 8, 9, 5, 0, 4, 7, 8, 8, 9, 5, 5, 8, 5, 5, 8, 1, 0, 4, 3,
       8, 2, 8, 5, 7, 6, 9, 9, 5, 8, 9, 9, 1, 8, 6, 4, 3, 3, 3, 3, 0, 8,
       0, 7, 7, 6, 0, 8, 9, 8, 3, 6, 6, 8, 7, 5, 8, 4, 5, 8, 6, 7, 6, 7,
       7, 8, 0, 8, 2, 2, 0, 5, 7, 3, 0, 2, 8, 2, 0, 2, 3, 6, 8, 1, 7, 5,
       7, 1, 7, 7, 2, 7, 5, 2, 6, 5, 8, 0, 0, 8, 1, 3, 7, 6, 1, 5, 6, 2,
       0, 1, 5, 7, 8, 0, 3, 5, 0, 7, 5, 4, 4, 1, 5, 9, 5, 3, 7, 1, 7, 3,
       5, 8, 5, 8, 5, 6, 1, 6, 7, 4, 3, 7, 0, 5, 4, 9, 3, 3, 6, 3, 5, 2,
       9, 8, 9, 3, 9, 7, 3, 4, 9, 4, 3, 1])
# View the predicted probabilities
prob=gnb.predict_proba(xtest)
prob.shape # 10 columns, one for the probability under each label class
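# Sanity check: each row of predict_proba is a distribution over the 10 classes,
# so every row should sum to 1
prob[0,:].sum()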
# Use a confusion matrix to inspect the Bayesian results
from sklearn.metrics import confusion_matrix as CM
CM(ytest,y_pred)
array([[47,  0,  0,  0,  0,  0,  0,  1,  0,  0],
       [ 0, 46,  2,  0,  0,  0,  0,  3,  6,  2],
       [ 0,  2, 35,  0,  0,  0,  1,  0, 16,  0],
       [ 0,  0,  1, 40,  0,  1,  0,  3,  4,  0],
       [ 0,  0,  1,  0, 39,  0,  1,  4,  0,  0],
       [ 0,  0,  0,  2,  0, 58,  1,  1,  1,  0],
       [ 0,  0,  1,  0,  0,  1, 49,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 54,  0,  0],
       [ 0,  3,  0,  1,  0,  0,  0,  2, 55,  0],
       [ 1,  1,  0,  1,  2,  0,  0,  3,  7, 41]], dtype=int64)
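For a quick look at what the fit actually estimated, the learned per-class parameters can be inspected directly (a small sketch: theta_ holds the per-class feature means, while the per-class variances are stored as var_ in recent sklearn versions and sigma_ in older ones, so the attribute name below is version-dependent):

# Inspect the Gaussian parameters estimated for each class
gnb.classes_      # the 10 digit labels
gnb.class_prior_  # P(Y), estimated from the training set
gnb.theta_.shape  # (10, 64): mean of each feature under each class
variances = gnb.var_ if hasattr(gnb,"var_") else gnb.sigma_  # attribute name depends on the sklearn version
variances.shape   # (10, 64): variance of each feature under each class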

 1.2 Exploring Bayes: Fitting Performance and Speed of Gaussian Naive Bayes

        By plotting the learning curve of Gaussian Naive Bayes and comparing it with the learning curves of a classification tree, a random forest and a support vector machine, we explore how the Naive Bayes algorithm behaves in terms of fitting. We use learning_curve, which comes with sklearn, to draw the learning curves: it performs cross-validation and returns the training and test accuracy under different sample sizes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve # draws the learning curve
from sklearn.model_selection import ShuffleSplit # sets the cross-validation scheme
from time import time
import datetime
# Given the classifier, the data, the plotting parameters, the cross-validation scheme and any
# other possible arguments, draw all the learning curves in one pass
# learning_curve finds the x/y coordinates needed for each plot
# plt.figure() provides the canvas that holds the subplots

def plot_learning_curve(estimator,title,x,y,
                       ax, # which subplot to draw on
                       ylim=None, # y-axis limits
                       cv=None # cross-validation scheme
                       ):
    train_sizes,train_scores,test_scores=learning_curve(estimator, # classifier
                                                        x,y, # feature matrix and labels
                                                        cv=cv # cross-validation mode
                                                        )
    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim) # keep the y-axis range consistent across subplots for comparison
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.grid() # show a background grid (optional)
    ax.plot(train_sizes,np.mean(train_scores,axis=1),'o-',color="r",label="Training score")
    ax.plot(train_sizes,np.mean(test_scores,axis=1),'o-',color="g",label="Test score")
    ax.legend(loc="best")
    return ax
digits=load_digits()
x,y=digits.data,digits.target
title=["Naive Bayes","DecisionTree","SVM,rbf kernel","RandomForest","Logistic"]
model=[GaussianNB(),DTC(),SVC(gamma=0.001),RFC(n_estimators=50),LR(C=0.1,solver="lbfgs")]
cv=ShuffleSplit(n_splits=50, # number of cross-validation splits
                test_size=0.2, # 20% of the data is used as the test set in each of the 50 splits
                random_state=0) # random state used when drawing the splits
fig,axes=plt.subplots(1,5,figsize=(30,6))
for ind,title_,estimator in zip(range(len(title)),title,model):
    times=time()
    plot_learning_curve(estimator,title_,x,y,
                       ax=axes[ind],ylim=[0.7,1.05],cv=cv)
    print("{}:{}".format(title_,datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f")))
Naive Bayes:00:01:562608
DecisionTree:00:02:984543
SVM,rbf kernel:00:25:783674
RandomForest:00:26:049973
Logistic:00:11:874073

        It can be seen that among these models, Bayes takes the least time, but its accuracy on the test set is not as good as that of the other models. For the other models, as the number of training samples increases, the training accuracy stays close to 100% and the test accuracy also rises (when the sample size is small, the model easily overfits and test accuracy is low), which matches intuition. For Bayes, however, as the training sample size grows the test accuracy does increase, but the training accuracy keeps falling. We can predict that even with a sufficiently large sample, the Bayes score will top out at roughly 85% (the test score is rarely higher than the training score).

2. Probabilistic model evaluation indicators

        The confusion matrix and accuracy can help us understand the Bayesian classification results. However, when we choose a Bayesian classifier, much of the time we are not only after the classification performance, but also want to see the predicted probabilities themselves. These probabilities express how credible a prediction is, so for probabilistic models we can use evaluation metrics specific to them. Let's look at several evaluation metrics unique to probability models.

2.1 Brier Score

        The degree to which probability predictions are accurate is known as "calibration", a way of measuring the difference between the probabilities predicted by an algorithm and the true outcomes. In binary classification, the most commonly used indicator is the Brier score, computed as the mean squared error of the predicted probability relative to the test sample, expressed as:

Brier\ Score=\frac{1}{N}\sum_{i=1}^{N}(p_{i}-o_{i})^{2}

Here N is the number of samples, p_{i} is the probability predicted by Naive Bayes, and o_{i} is the true outcome for the sample, which can only be 0 or 1: 1 if the event occurs, 0 if it does not. This metric measures how far our probabilities are from the true labels and in fact looks a lot like a mean squared error. The Brier score ranges from 0 to 1, with higher scores indicating poorer predictions. Since it essentially measures a loss, it is named brier_score_loss in sklearn. It can be imported from the metrics module to evaluate our model:
from sklearn.metrics import brier_score_loss
import pandas as pd  # needed for get_dummies below
ytest=pd.get_dummies(ytest).loc[:,1]  # keep only the 0/1 indicator column for class 1
brier_score_loss(ytest,prob[:,1],pos_label=1)
# Note: the first argument is the true labels, the second is the predicted probabilities
# Set pos_label to match the column index used in prob to get the Brier score for that class
0.032619662406118764
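Since the Brier score is just the mean squared difference between the predicted probability and the 0/1 outcome, the value above can be reproduced by hand (a small sketch reusing the ytest and prob objects from above):

# Brier score by hand: mean squared error between predicted probability and the 0/1 outcome
np.mean((prob[:,1]-ytest)**2)  # should match brier_score_loss above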

        The Brier score can be used with any model that exposes probabilities through the predict_proba interface. Next, let's compare logistic regression, SVC and Gaussian Naive Bayes on the handwritten digit data set:

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_digits

digits=load_digits() # handwritten digits dataset
x,y=digits.data,digits.target
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=420)


gnb=GaussianNB().fit(xtrain,ytrain)
logi=LR(C=1.0,solver='lbfgs',max_iter=3000,multi_class="auto").fit(xtrain,ytrain)
svc=SVC(kernel="linear",gamma=1).fit(xtrain,ytrain)

brier_score_loss(ytest,logi.predict_proba(xtest)[:,1],pos_label=1)
0.01142992902197938
# SVC's confidence scores are not probabilities; to make them comparable we min-max
# normalize the decision-function "distances" into the [0,1] range
svc_prob=(svc.decision_function(xtest)-svc.decision_function(xtest).min())/(svc.decision_function(xtest).max()-svc.decision_function(xtest).min())
brier_score_loss(ytest,svc_prob[:,1],pos_label=1)
0.24286480465579566
# Visualize the Brier score of each classifier under each label class:
import pandas as pd
name=["Bayes","Logistic","SVC"]
color=["red","black","orange"]
df=pd.DataFrame(index=range(1,10),columns=name)
for i in range(1,10):
    df.loc[i,name[0]]=brier_score_loss(ytest,prob[:,i],pos_label=i)
    df.loc[i,name[1]]=brier_score_loss(ytest,logi.predict_proba(xtest)[:,i],pos_label=i)
    df.loc[i,name[2]]=brier_score_loss(ytest,svc_prob[:,i],pos_label=i)
for i in range(df.shape[1]):
    plt.plot(range(1,10),df.iloc[:,i],c=color[i],label=name[i])
plt.legend()
plt.show()

        It can be observed that the Brier score of logistic regression has an overwhelming advantage, while SVC is clearly weaker than both Bayes and logistic regression (SVC has to squeeze its confidence scores into probabilities through a sigmoid-like mapping, so the probabilities it outputs are not very reliable). Bayes sits between logistic regression and SVC: it works reasonably well, but is still less accurate and less stable than logistic regression.

2.2 Logarithmic likelihood function Log Loss 

         Another commonly used probabilistic loss is the log loss, also known as log-likelihood loss, logistic loss or cross-entropy loss. It is the loss function used in logistic regression (including its multinomial form) and in some extensions of it such as neural networks. It is defined as the negative logarithm of the likelihood of the true labels given a probabilistic classifier's predicted probabilities. Since it is a loss, a smaller value of the log-likelihood function is better, indicating a more accurate probability estimate and a more ideal model. Note that log loss can only be used to evaluate classification models.

        For a single sample whose true label y_{true} takes a value in {0,1}, and whose estimated probability of belonging to class 1 is y_{pred}, the loss for that sample is (the log here is the natural logarithm with base e):

-\log P(y_{true}|y_{pred})=-\left(y_{true}\cdot\log(y_{pred})+(1-y_{true})\cdot\log(1-y_{pred})\right)
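To get a feel for the numbers (a small illustrative check, assuming the true label is 1 so the loss reduces to -log(y_pred)):

import numpy as np
-np.log(0.9)   # about 0.105: a confident and correct prediction is barely penalized
-np.log(0.1)   # about 2.303: a confident but wrong prediction is penalized heavily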

In sklearn, the log-likelihood function can be imported from the metrics module:

from sklearn.metrics import log_loss
score1=[]
for i in range(10):
    score1.append(log_loss(ytest,prob[:,i]))
sum(score1)/len(score1)
5.358355031493363
score2=[]
for i in range(10):
    score2.append(log_loss(ytest,logi.predict_proba(xtest)[:,i]))
sum(score2)/len(score2)
0.05586444292410096
score3=[]
for i in range(10):
    score3.append(log_loss(ytest,svc_prob[:,i]))
sum(score3)/len(score3)
0.7065400146541764

The first argument of log_loss is the true labels and the second is the predicted probabilities. Notice that the conclusion drawn from log_loss is inconsistent with the one drawn from the Brier score. When the Brier score was the criterion, SVC estimated worst while logistic regression and Bayes were close; under log loss, logistic regression is still the best, but Bayes now does worse than SVC. This is because logistic regression and SVC are both algorithms that fit the model by solving an optimization problem and then classify, whereas Naive Bayes has no such process. The log-likelihood points directly at the direction of model optimization; it is even the loss function of logistic regression itself, so it naturally favors logistic regression and SVC.

        In practice, the log-likelihood is the gold-standard metric for probability models and is often the first choice for evaluating them. But it has drawbacks. First, it is unbounded, unlike the Brier score, whose upper limit gives a reference point for model quality. Second, it is less interpretable than the Brier score. Third, it clearly favors models that are fit by optimization. Finally, it cannot accept predicted probabilities of exactly 0 or 1, because the logarithm diverges there (consider taking the natural log of 0). In general, the following rules of thumb apply:

Measurement target: use log-likelihood to compare multiple models, or different variations of one model; prefer the Brier score to measure the performance of a single model.
Interpretability: use log-likelihood for exchanges between machine-learning and deep-learning experts and in academic papers; prefer the Brier score for business reporting and the evaluation of business models.
Optimization direction: use log-likelihood for logistic regression and SVC; prefer the Brier score for Naive Bayes.
Numerical issues: log-likelihood requires probabilities that can only approach 0 or 1 and never equal them; the Brier score can handle probabilities of exactly 0 or 1.

        Bayes does not perform as well as the other models, its principle is simple, and it has few tunable parameters. However, algorithms that output probabilities have their own way of being tuned: adjusting the degree of probability calibration. The better calibrated the probabilities, the more accurate the model's probability predictions, the more confident the algorithm is in its judgements, and the more stable the model. If we want the model's predicted probabilities to be as close to the true probabilities as possible, we can use the reliability curve to assess and adjust the calibration.

2.3 Reliability Curve

        The reliability curve, also called the probability calibration curve or reliability diagram, is a curve with the predicted probability on the x-axis and the true probability on the y-axis. We want the predicted probabilities to be as close to the true probabilities as possible, so the calibration curve is also one of the model evaluation tools. Like the Brier score, the probability calibration curve is defined for one label class at a time, so each class has its own curve; alternatively, the average over classes can represent the calibration of the whole model. In general, though, the curve is most often used for binary classification.

        Similar to the ROC curve, in sklearn we use calibration_curve to obtain the x and y coordinates and then draw with matplotlib. When drawing the reliability curve, the y-axis is the true probability, which is not actually observable in reality. Instead, we can approximate it from the class labels. A simple method: bin the data by predicted probability, then take the proportion of the true minority class in each bin as the true probability trueproba of that bin, and the mean predicted probability in the bin as its predicted probability predproba. Plotting predproba on the x-axis against trueproba on the y-axis gives the reliability curve.
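The binning step itself can be sketched by hand (a rough illustration of the idea rather than sklearn's exact implementation; bin edges and the handling of empty bins may differ slightly from calibration_curve):

import numpy as np

def manual_reliability_points(y_true, y_prob, n_bins=10):
    # cut [0,1] into equal-width bins and assign each sample to a bin by its predicted probability
    edges = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    trueproba, predproba = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue  # skip empty bins
        trueproba.append(y_true[mask].mean())  # proportion of the positive (minority) class in the bin
        predproba.append(y_prob[mask].mean())  # mean predicted probability in the bin
    return np.array(trueproba), np.array(predproba)

Calling this on binary labels and predicted probabilities should give points close to those returned by calibration_curve below.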

As an example, compare an image of a set of data plotted without binning with the image of the same data after binning:

It can be seen that after binning, the sample points are aggregated together and the curve becomes clearly monotonic and smooth. The binning operation is essentially a kind of smoothing. In sklearn, this approach is implemented by calibration_curve which, similar to the ROC machinery, returns the x and y coordinates for us to plot with matplotlib. It takes the following parameters:

Parameters:

y_true: the true labels.
y_prob: the predicted probability, or confidence, for the positive class.
normalize: boolean, default False; whether to normalize the values in y_prob into [0,1].
n_bins: an integer, the number of bins; a larger number of bins requires more data.

Returns:

trueproba: the y-coordinates of the reliability curve, shape (n_bins,); the proportion of the minority class (Y=1) in each bin.
predproba: the x-coordinates of the reliability curve, shape (n_bins,); the mean predicted probability in each bin.

2.3.1 Draw a calibration curve on Bayesian using the reliability curve class

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import make_classification as mc
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Create the dataset
x,y=mc(n_samples=100000,n_features=20 # 20 features in total
       ,n_classes=2 # binary labels
       ,n_informative=2 # two of them carry most of the information
       ,n_redundant=10 # 10 redundant features
       ,random_state=42)
xtrain,xtest,ytrain,ytest=train_test_split(x,y,
                                           test_size=0.99, # very small training set, very large test set
                                           random_state=0)
# Build the model and plot
gnb=GaussianNB()
gnb.fit(xtrain,ytrain)
y_pred=gnb.predict(xtest) # predicted labels
prob_pos=gnb.predict_proba(xtest)[:,1] # predicted probability of class 1

from sklearn.calibration import calibration_curve
# Get the x and y coordinates from calibration_curve
trueproba,predproba=calibration_curve(ytest,prob_pos,n_bins=10)
fig=plt.figure()
ax1=plt.subplot()
ax1.plot([0,1],[0,1],"k:",label="Perfectly calibrated") # plot the diagonal as a reference
ax1.plot(predproba,trueproba,"s-",label="Bayes")
ax1.set_ylabel("True probability for class 1")
ax1.set_xlabel("mean predicted probability")
ax1.set_ylim([-0.05,1.05])
ax1.legend()
plt.show()

2.3.2 How does the curve change under different n_bins values

fig,axes=plt.subplots(1,3,figsize=(18,4))
for ind,i in enumerate([3,10,100]): # i takes the values 3, 10 and 100
    ax=axes[ind]
    ax.plot([0,1],[0,1],"k:",label="Perfectly calibrated") # plot the diagonal as a reference
    trueproba,predproba=calibration_curve(ytest,prob_pos,n_bins=i)
    ax.plot(predproba,trueproba,"s-",label="n_bins={}".format(i))
    ax.set_ylabel("True probability for class 1")
    ax.set_xlabel("mean predicted probability")
    ax.set_ylim([-0.05,1.05])
    ax.legend()
plt.show()

 It can be clearly seen that the larger n_bins is, the more bins there are and the more precise the probability calibration curve becomes, but a curve that is too precise is not smooth enough to be compared against the ideal (perfectly calibrated) line. The smaller n_bins is, the fewer bins there are and the coarser the curve; although it lies close to the ideal line, it cannot truly show the model's probability predictions. Therefore we should choose a number of bins that is neither too large nor too small, so that the curve is neither too precise nor too coarse but reasonably smooth, while still reflecting the trend of the model's probability predictions.

2.3.3 Build more models

name=["GaussianBayes","Logistic","SVC"]
gnb=GaussianNB()
logi=LR(C=1.0,solver='lbfgs',max_iter=3000,multi_class="auto")
svc=SVC(kernel="linear",gamma=1) # returns confidence scores rather than probabilities
fig,ax1=plt.subplots(figsize=(8,6))
ax1.plot([0,1],[0,1],"k:",label="Perfectly calibrated")
for clf,name_ in zip([gnb,logi,svc],name):
    clf.fit(xtrain,ytrain)
    y_pred=clf.predict(xtest)
    # hasattr(obj,name): check whether the object obj has an interface called name; returns True if it does
    if hasattr(clf,"predict_proba"):
        prob_pos=clf.predict_proba(xtest)[:,1]
    else: # use the decision function
        prob_pos=clf.decision_function(xtest)
        prob_pos=(prob_pos-prob_pos.min())/(prob_pos.max()-prob_pos.min())
    # compute the Brier score
    clf_score=brier_score_loss(ytest,prob_pos,pos_label=y.max()) # positive class label
    trueproba,predproba=calibration_curve(ytest,prob_pos,n_bins=10)
    ax1.plot(predproba,trueproba,"s-",label="%s(%1.3f)" % (name_,clf_score))

ax1.set_ylabel("True probability for class 1")
ax1.set_xlabel("mean predicted probability")
ax1.set_ylim([-0.05,1.05])
ax1.legend()
ax1.set_title('Calibration plots (reliability curve)')
plt.show()

From the image it is clear that the probability estimates of logistic regression are closest to the perfect calibration line, so logistic regression performs best. By contrast, Gaussian Naive Bayes and the support vector machine perform poorly: the support vector machine shows a sigmoid-like shape, while Gaussian Naive Bayes shows the mirror image of a sigmoid.

Sigmoid function image:

        For Bayes, a probability calibration curve shaped like a mirrored sigmoid indicates that the features in the data set are not conditionally independent of each other: the "naive" assumption of Naive Bayes, mutual conditional independence of the features, is violated (this is in fact something we set up ourselves: we created 10 redundant features, which are noise and cannot possibly be fully independent), so Bayes does not perform well enough.

        The calibration curve of the support vector machine, on the other hand, shows the typical behaviour of an under-confident classifier: a large number of sample points lie near the decision boundary, so their confidence hovers around 0.5; even when the decision boundary classifies them correctly, the model itself is not very sure about the result. By contrast, points far from the decision boundary have high confidence, because they are unlikely to be misclassified. When faced with heavily mixed data, the support vector machine has this inherent weakness of insufficient confidence.

2.4 Prediction probability histogram

        We can view the distribution of the model's predicted probabilities by plotting a histogram: the x-axis is the binned predicted probability of the samples, and the y-axis is the number of samples in each bin. (Note: this binning is different from the binning used by the reliability curve; here we simply divide the predicted probability range evenly into intervals.)

fig,ax2=plt.subplots(figsize=(8,6))
for clf,name_ in zip([gnb,logi,svc],name):
    clf.fit(xtrain,ytrain)
    y_pred=clf.predict(xtest)
    # hasattr(obj,name): check whether the object obj has an interface called name; returns True if it does
    if hasattr(clf,"predict_proba"):
        prob_pos=clf.predict_proba(xtest)[:,1]
    else: # use the decision function
        prob_pos=clf.decision_function(xtest)
        prob_pos=(prob_pos-prob_pos.min())/(prob_pos.max()-prob_pos.min())
    ax2.hist(prob_pos # predicted probabilities
             ,bins=10
             ,label=name_
             ,histtype="step" # draw unfilled, outline-only bars
             ,lw=2 # line width of each bar's outline
    )
ax2.set_ylabel("Distribution of probability")
ax2.set_xlabel("mean predicted probability")
ax2.set_xlim([-0.05,1.05])
ax2.set_xticks([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
ax2.legend(loc=9)
plt.show()

        It can be seen that the probability distribution of Gaussian Naive Bayes is very high at both ends and very low in the middle: almost 90% of the samples lie near 0 or 1, so it is the most confident of the algorithms. Yet its Brier score is worse than that of logistic regression, which shows that some of the Bayes samples near 0 and 1 are misclassified. The support vector machine is exactly the opposite: clearly high in the middle and low at both ends, resembling a normal distribution, which confirms what we said above, that most samples lie near the decision boundary with confidence hovering around 0.5. Logistic regression sits between Gaussian Naive Bayes and the support vector machine: it does not push too many samples towards 0 and 1, nor does it pile them up around 0.5 like the support vector machine. The histogram of logistic regression is what a relatively healthy probability distribution of the positive class looks like.

Avoiding Confusion: Probability Density Curves and Probability Distribution Histograms


Probability density curve: the x-axis is the value of the sample (the feature value X), and the y-axis is the number of samples falling within each range of X; it describes how the samples are distributed over the values of X, and the Gaussian distribution is such a distribution over X.

Probability distribution histogram: the abscissa is the probability value [0,1], and the ordinate is the number of samples falling within this probability value range, which measures how many samples are within each probability value range. This distribution makes no assumptions.

We have seen that Naive Bayes and SVC are less effective than logistic regression at predicting probabilities. In that case, how can we help these models become more confident in their predictions and raise their degree of calibration? We can correct the probabilistic algorithm using probability calibration, for example with isotonic regression.

2.5 Calibration reliability curve

        Two kinds of regression can be used for probability calibration: a parametric method based on Platt's sigmoid model, and a non-parametric method based on isotonic regression (isotonic calibration). Probability calibration should be done on data the model has not seen, such as a held-out set. Mathematically, the principles behind these two calibration methods are fairly involved, and in sklearn we cannot intervene in that process.
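To see the intuition behind Platt's method, here is a rough sketch only: it fits a one-dimensional logistic (sigmoid) map from decision scores to probabilities, using LogisticRegression as a stand-in for sklearn's internal sigmoid calibrator; the real implementation differs in its details, and here the map is fitted and applied on the same held-out split purely for illustration.

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

svc=SVC(kernel="linear",gamma=1).fit(xtrain,ytrain)   # classifier fitted on the training split
scores=svc.decision_function(xtest).reshape(-1,1)     # raw confidence scores on held-out data
platt=LogisticRegression().fit(scores,ytest)          # 1-D sigmoid fit: P(y=1 | score)
prob_platt=platt.predict_proba(scores)[:,1]           # sigmoid-calibrated probabilities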

class sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv='warn')

This is a probability calibration class with cross-validation. It uses a cross-validation generator: for each split, it estimates the model parameters on the training fold and calibrates the probabilities on the test fold, and then returns the best set of parameter estimates and calibration results, averaging the predicted probabilities over the folds. Note that CalibratedClassifierCV has no decision_function interface; to view the probabilities produced by the calibrated model, you must call predict_proba.

Parameters:

base_estimator: the classifier whose output decision function needs to be calibrated; it must have a predict_proba or decision_function interface. If the parameter cv="prefit", the classifier must already have been fitted to data.

method: the probability calibration method; enter "sigmoid" or "isotonic".
Enter "sigmoid" to calibrate with the Platt sigmoid model.
Enter "isotonic" to calibrate with isotonic regression.
When the calibration sample is too small (for example, fewer than about 1000 test samples), isotonic regression is not recommended because it tends to overfit; with small samples use the sigmoid, i.e. Platt calibration.

cv: an integer or a strategy that determines the cross-validation. Possible inputs are:

1) None, to use the default 3-fold cross-validation, or an integer giving the number of folds. For integer and None inputs, if the estimator is a binary classifier, sklearn.model_selection.StratifiedKFold is used automatically to split the folds; if y is a continuous variable, sklearn.model_selection.KFold is used instead.
2) A cross-validation scheme or generator already built with another class.
3) An iterable of already-split test-set and training-set index arrays.
4) "prefit", which assumes the classifier has already been fitted. In this mode, the user must manually ensure that the data used to fit the classifier does not overlap with the data to be calibrated.
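As a minimal sketch of the "prefit" mode described above (variable names are illustrative; it assumes the base classifier was fitted on one split and is then calibrated on a separate split, as required; note that newer sklearn releases may deprecate cv="prefit"):

from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

# split off a calibration set that the base classifier never sees
xfit,xcalib,yfit,ycalib=train_test_split(xtrain,ytrain,test_size=0.5,random_state=0)
svc_fitted=SVC(kernel="linear",gamma=1).fit(xfit,yfit)               # already-fitted base estimator
calibrated=CalibratedClassifierCV(svc_fitted,cv="prefit",method="sigmoid")
calibrated.fit(xcalib,ycalib)                                        # only the calibration step happens here
prob_calibrated=calibrated.predict_proba(xtest)[:,1]                 # calibrated probabilities on the test set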

2.5.1 Wrapper functions

First, we wrap the earlier code for plotting reliability curves and histograms into a function. The function's parameters are: the models, the model names, the data, and the number of bins.

import matplotlib.pyplot as plt
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve
def plot_calib(models,name,xtrain,xtest,ytrain,ytest,n_bins=10):
    fig,(ax1,ax2)=plt.subplots(1,2,figsize=(20,6))
    ax1.plot([0,1],[0,1],"k:",label="Perfectly calibrated")
    for clf,name_ in zip(models,name):
        clf.fit(xtrain,ytrain)
        y_pred=clf.predict(xtest)
        # hasattr(obj,name): check whether the object obj has an interface called name; returns True if it does
        if hasattr(clf,"predict_proba"):
            prob_pos=clf.predict_proba(xtest)[:,1]
        else: # use the decision function
            prob_pos=clf.decision_function(xtest)
            prob_pos=(prob_pos-prob_pos.min())/(prob_pos.max()-prob_pos.min())
        # compute the Brier score
        clf_score=brier_score_loss(ytest,prob_pos,pos_label=ytest.max()) # positive class label
        trueproba,predproba=calibration_curve(ytest,prob_pos,n_bins=n_bins) # use the n_bins argument
        ax1.plot(predproba,trueproba,"s-",label="%s(%1.3f)" % (name_,clf_score))
        ax2.hist(prob_pos,bins=10,label=name_,histtype="step",lw=2)

    ax2.set_ylabel("Distribution of probability")
    ax2.set_xlabel("mean predicted probability")
    ax2.set_xlim([-0.05,1.05])
    ax2.legend(loc=9)
    ax1.set_ylabel("True probability for class 1")
    ax1.set_xlabel("mean predicted probability")
    ax1.set_ylim([-0.05,1.05])
    ax1.legend(loc=9)
    ax1.set_title('Calibration plots (reliability curve)')
    plt.show()

2.5.2 Function-based plotting

from sklearn.calibration import CalibratedClassifierCV
name=["GaussianBayes","Logistic","Bayes+isotonic","Bayes+sigmoid"]
gnb=GaussianNB()
models=[gnb,
       LR(C=1.0,solver='lbfgs',max_iter=3000,multi_class="auto"),
        # the two calibration methods
       CalibratedClassifierCV(gnb,cv=2,method='isotonic'),
       CalibratedClassifierCV(gnb,cv=2,method='sigmoid')]
plot_calib(models,name,xtrain,xtest,ytrain,ytest)

Judging from the corrected Naive Bayes results, isotonic calibration greatly improves the shape of the curve, making Bayes almost as good as logistic regression, and the Brier score also drops to 0.095, a touch lower than that of logistic regression. Sigmoid calibration also improves the curve a little, but the effect is not obvious. From the histogram, isotonic calibration brings Gaussian Naive Bayes close to logistic regression, while the sigmoid-calibrated result still looks closer to the original Gaussian Naive Bayes. It appears that when the features of the data are not independent of each other, calibrating the probabilities with the isotonic method gives good results.

2.5.3 Viewing changes in Naive Bayes accuracy based on calibration results

gnb=GaussianNB().fit(xtrain,ytrain)
gnb.score(xtest,ytest)
0.8639292929292929
brier_score_loss(ytest,gnb.predict_proba(xtest)[:,1],pos_label=1)
0.11749080113888638
gnbisotonic=CalibratedClassifierCV(gnb,cv=2,method='isotonic').fit(xtrain,ytrain)
gnbisotonic.score(xtest,ytest)
0.8637272727272727
brier_score_loss(ytest,gnbisotonic.predict_proba(xtest)[:,1],pos_label=1)
0.09780399457496632

2.5.4 Calibration of SVC

name_svc=["SVC","Logistic","SVC+isotonic","SVC+sigmoid"]
svc=SVC(kernel="linear",gamma=1)
models_svc=[svc,
           LR(C=1.0,solver='lbfgs',max_iter=3000,multi_class="auto"),
            # the two calibration methods
           CalibratedClassifierCV(svc,cv=2,method='isotonic'),
           CalibratedClassifierCV(svc,cv=2,method='sigmoid')]
plot_calib(models_svc,name_svc,xtrain,xtest,ytrain,ytest)

      

name_svc=["SVC","SVC+isotonic","SVC+sigmoid"]
svc=SVC(kernel="linear",gamma=1)
models_svc=[svc,
           CalibratedClassifierCV(svc,cv=2,method='isotonic'),
           CalibratedClassifierCV(svc,cv=2,method='sigmoid')]
for clf,name_ in zip(models_svc,name_svc):
        clf.fit(xtrain,ytrain)
        y_pred=clf.predict(xtest)
        # hasattr(obj,name): check whether the object obj has an interface called name; returns True if it does
        if hasattr(clf,"predict_proba"):
            prob_pos=clf.predict_proba(xtest)[:,1]
        else: # use the decision function
            prob_pos=clf.decision_function(xtest)
            prob_pos=(prob_pos-prob_pos.min())/(prob_pos.max()-prob_pos.min())
        # compute the Brier score
        clf_score=brier_score_loss(ytest,prob_pos,pos_label=y.max()) # positive class label
        score=clf.score(xtest,ytest)
        print("{}:".format(name_))
        print("\tBrier:{:.4f}".format(clf_score))
        print("\tAccuracy:{:.4f}".format(score))
SVC:
	Brier:0.1632
	Accuracy:0.8665
SVC+isotonic:
	Brier:0.0982
	Accuracy:0.8641
SVC+sigmoid:
	Brier:0.0976
	Accuracy:0.8654

        It can be seen that after calibrating the probabilities, the Brier score becomes noticeably smaller, but the overall accuracy drops slightly. This shows that after calibration, although the probability predictions are more accurate, the model's classification judgement becomes slightly weaker. Think about it: the Brier score measures how accurate the probability predictions are, and the lower it is, the closer the model's probabilities are to the true probabilities. After calibration, samples whose label is 1 should get probabilities closer to 1, and samples whose label is 0 should get probabilities closer to 0; there seems to be no reason why the Brier score should improve while the model's accuracy drops. Yet our results show that the model's accuracy and the correctness of its probability predictions do not always move together. Why is this?

        The reason differs between types of probabilistic models. For models such as SVC and decision trees, the "probability" is not a real probability but more of a confidence (roughly, how the sample sits relative to the decision boundary). So for these models there can be situations where the predicted probability of class 1 is 0.4 and yet the sample is still classified as 1: the model is not confident that the sample belongs to class 1, but it still insists on assigning it label 1. In this case, probability calibration may adjust the probability in an even more wrong direction (for example, pulling the 0.4 point closer to 0 and causing the final judgement to flip to an error), so the Brier score and the accuracy can move in opposite directions.

        For a model like Naive Bayes, it's a different story. Note that in Naive Bayes we make all sorts of assumptions: besides the "naive" assumption, we also assume particular probability distributions (say, the Gaussian). These assumptions make the probability estimates that Bayes produces effectively biased estimates; that is, they are not that accurate to begin with. Through calibration we make the predicted probabilities closer to the true probabilities, which in essence brings the algorithm statistically closer to our estimate of the overall population. Such a calibration may show an increase in accuracy on one data set and a drop on another, depending on how close the test set is to our estimate of what the true population looks like. This string of biased estimates is what makes it possible for the Brier score and the accuracy to move in opposite directions after probability calibration.

        Of course, there may be deeper reasons, such as how the mathematical details of the calibration procedure affect the result, how calibration_curve bins the data, and how the true labels and predictions are turned into the curve's coordinates; these details may also push the Brier score and the accuracy in different directions.

        When the two contradict each other, accuracy should be taken as the standard. But that does not mean the Brier score and the probability calibration curve are useless. Probabilistic models have almost no hyperparameters to tune; apart from switching models, there are few good ways to improve their performance, and probability calibration is one of the rare methods that can help us improve a probability-based model.

        In practice we can choose which direction to tune the model in: we do not always have to pursue the highest accuracy or the best probability fit, and can adjust according to our own needs. For probabilistic models, since there are very few tunable parameters, we tend to pursue the probability fit and use probability calibration to adjust the model. If you really want higher accuracy and recall, consider logistic regression, which is an inherently well-calibrated probability model, or a support vector machine classifier, which has many other tunable parameters besides probability calibration.

Origin blog.csdn.net/weixin_60200880/article/details/129206820