Evaluating Binary Classification Models with AUC

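A binary classification problem can be modeled either with a classifier or with a regressor. For example, the same problem can be handled with SVC or with SVR. In Python's sklearn, SVC and SVR are separate classes; in R's e1071 package both live in the svm function, which by default builds an SVC only when the y variable is a factor and builds an SVR otherwise.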

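There are many evaluation metrics for binary classification models; this post covers only AUC. The underlying theory is not repeated here; see the relevant literature, for example this reasonably good article: ROC和AUC介绍以及如何计算AUC (an introduction to ROC and AUC and how to compute AUC).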

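If you build a regression model, you can pass the predicted values and the true labels straight into the AUC function: the program sweeps over the predicted values as candidate cutoffs, and each cutoff gives one point on the ROC curve.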
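The cutoff sweep described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sklearn's implementation: the helper names roc_points and trapezoid_auc and the four-sample data are made up for the example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by using each distinct score as a cutoff."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Sweep cutoffs from the highest score down; predict positive when score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and continuous predictions (e.g. regression outputs)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc_value = trapezoid_auc(fpr, tpr)
print(fpr, tpr, auc_value)
```

Only the ordering of the scores matters here, which is why raw regression outputs can be used without converting them to probabilities first.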

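With a classifier, however, predict returns hard labels (0 or 1). These give only a single operating point rather than a ranking of the samples, so they are not suitable for computing AUC; you need the "predicted probability" instead, which in sklearn is usually obtained with predict_proba. The same applies to logistic regression: in Python its predict also returns 0 or 1, so predict_proba is needed to get the corresponding probabilities. There is also decision_function: when its value is greater than 0, the sample is predicted as the positive class. R does not run into these issues; it is simple and easy to use.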
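To see what is lost when hard 0/1 labels are used in place of probabilities, here is a small pure-Python sketch. It uses the pairwise-ranking view of AUC (the fraction of positive/negative pairs ordered correctly, ties counted as half); the function name pairwise_auc and the data are assumptions made for illustration, not part of sklearn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]         # hypothetical positive-class probabilities
hard = [1 if p >= 0.5 else 0 for p in proba]   # hard labels at a 0.5 cutoff

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_hard = pairwise_auc(y_true, hard)
print(auc_from_proba, auc_from_hard)
```

The hard labels collapse most pairs into ties, so the resulting number reflects one cutoff only, while the probabilities keep the full ranking.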

Example of computing AUC:

test_auc = metrics.roc_auc_score(y_test,y_test_pre)

Example of computing the ROC curve:

fpr, tpr, thresholds = metrics.roc_curve(y_test,y_test_pre)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % test_auc)

Other evaluation metrics for classifiers

Accuracy: can be computed with classifier.score. It is the proportion of correctly classified samples among all samples, i.e. (TP+TN)/Total.

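In a binary problem where one class is far smaller than the other, a high accuracy can be obtained simply by assigning every sample to the majority class. In that situation accuracy alone is not a sufficient evaluation.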
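A tiny worked example of this effect, using made-up data with 5 positives and 95 negatives and a "model" that always predicts the majority class:

```python
# Hypothetical imbalanced labels: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)

print("accuracy:", accuracy)
print("recall on the positive class:", recall)
```

Accuracy looks high even though not a single positive sample is found.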

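Precision (查准率): denoted P. It is defined with respect to one class. For example, if the goal of the model is to find positives, precision is the number of true positives divided by everything the model flagged as positive, i.e. TP/(TP+FP).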

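Recall (查全率): also called sensitivity or the true positive rate (TPR), denoted R. It is also defined with respect to one class. If the goal of the model is to find positives, recall is the number of true positives divided by all actual positives, i.e. TP/(TP+FN). In addition, the false negative rate FNR (miss rate) = FN/(TP+FN), so FNR = 1-R.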

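True negative rate TNR: also called specificity, TNR = TN/(TN+FP). False positive rate FPR: also called the misdiagnosis rate, FPR = FP/(TN+FP). TNR+FPR = 1. (Recall that FPR is the horizontal axis of the ROC curve.)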
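The definitions above can be checked on a small example. The sketch below counts TP, FP, FN and TN by hand; the helper name confusion_counts and the ten-sample data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels and hard predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)      # also TPR / sensitivity
fnr = fn / (tp + fn)         # miss rate, equals 1 - recall
tnr = tn / (tn + fp)         # specificity
fpr = fp / (tn + fp)         # equals 1 - TNR; x-axis of the ROC curve

print(tp, fp, fn, tn)
print(accuracy, precision, recall, fnr, tnr, fpr)
```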

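In short, precision is about finding correctly, and recall is about finding completely.

Put another way, when you ask a model whether items belong to some class, precision is the probability that an item really is in the class when the model says it is. Recall is the fraction of the actual members that the model says yes to, so the model misses a fraction of (1 - recall).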

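F1 score: the harmonic mean of precision and recall, i.e. 2/F1 = 1/precision + 1/recall. Fβ is the more general form, which weights precision and recall; F1 is the special case that treats precision and recall as equally important. Extensions include macro-P, macro-R, macro-F1 and micro-P, micro-R, micro-F1.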
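As a small illustration, the function below implements the standard Fβ definition, Fβ = (1+β²)·P·R / (β²·P + R), which reduces to the harmonic mean at β = 1. The function name f_beta and the sample precision and recall values are chosen for this example only.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0  # convention used here to avoid division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


p, r = 0.75, 0.6   # hypothetical precision and recall

f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)       # leans toward recall
f_half = f_beta(p, r, beta=0.5)   # leans toward precision

print(f1, f2, f_half)
# Check the harmonic-mean identity 2/F1 = 1/P + 1/R
print(2 / f1, 1 / p + 1 / r)
```

Because recall is the lower of the two values in this example, the recall-weighted F2 comes out below F1 and the precision-weighted F0.5 comes out above it.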

A figure (from: 【机器学习】分类性能度量指标 : ROC曲线、AUC值、正确率、召回率、敏感度、特异度)

There is also a figure on Wikipedia:

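Precision and recall affect each other. Ideally both are high, but in general when precision is high recall tends to be low, and when recall is high precision tends to be low. If both are low, something has gone wrong somewhere.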
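The trade-off can be seen by moving the decision threshold over a fixed set of scores. The following is a minimal sketch on made-up data; the helper precision_recall_at is hypothetical, and on real data the two curves are not guaranteed to move this cleanly.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

thresholds = (0.25, 0.55, 0.75)
results = [precision_recall_at(y_true, scores, t) for t in thresholds]
for t, (p, r) in zip(thresholds, results):
    print("threshold=%.2f precision=%.3f recall=%.3f" % (t, p, r))
```

Raising the threshold makes the model more selective, so on this data precision rises while recall falls.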

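For search, the aim is to raise precision while keeping recall guaranteed; for disease monitoring or anti-spam, the aim is to raise recall while keeping precision guaranteed.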

So when both need to be high, F1 can be used as the measure.

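P/R and ROC are two different evaluation metrics with different computations. In general, the former is used for retrieval and the latter for classification, recognition, and similar tasks.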

A fairly good blog post for reference: 【机器学习】分类性能度量指标 : ROC曲线、AUC值、正确率、召回率、敏感度、特异度

Some code snippets

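# AUC/ROC computation for SVR and SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR,SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# X, Y: feature matrix and binary (0/1) labels, assumed to be loaded beforehand
x_train, x_test, y_train, y_test = train_test_split(X, Y,  train_size=0.7)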

print("------------------------------ SVC ------------------------------------------")
clf = SVC(kernel='rbf', C=100, gamma=0.0001, probability=True)
clf.fit(x_train, y_train)

y_train_pre = clf.predict(x_train)
y_test_pre = clf.predict(x_test)
print("Accuracy: "+str(clf.score(x_train,y_train)))  

y_train_predict_proba = clf.predict_proba(x_train)  # probability of each class
false_positive_rate, recall, thresholds = roc_curve(y_train, y_train_predict_proba[:, 1])
train_auc=auc(false_positive_rate,recall)
print("train AUC: "+str(train_auc))

print("------------------------------------")
print("Accuracy: "+str(clf.score(x_test,y_test)))

y_test_predict_proba = clf.predict_proba(x_test)  # probability of each class
false_positive_rate, recall, thresholds = roc_curve(y_test, y_test_predict_proba[:, 1])
test_auc=auc(false_positive_rate,recall)
print("test AUC: "+str(test_auc))

plt.figure(0)
plt.title('ROC of SVM in test data')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % test_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

print("--------------------------- SVR ------------------------------------------")

reg = SVR(kernel='rbf', C=100, gamma=0.0001)
reg.fit(x_train, y_train)
y_train_pre = reg.predict(x_train)
y_test_pre = reg.predict(x_test)
train_auc = metrics.roc_auc_score(y_train,y_train_pre)
print("train AUC: "+str(train_auc))

print("--------------------------------")

test_auc = metrics.roc_auc_score(y_test,y_test_pre)
print("test AUC: "+str(test_auc))
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_test_pre)

plt.figure(1)
plt.title('ROC of SVR in test data')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % test_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

Logistic regression snippet

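# Logistic regression
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import train_test_split

# Input: X and y (y must be binary for the ROC code below)

# Lowercase names so that they match the fit/score/predict_proba calls below
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# clf = LogisticRegression(random_state=0, solver='lbfgs' ,multi_class='multinomial')
clf = LogisticRegression()
clf.fit(x_train, y_train)

# The next few lines only illustrate what these functions do; in practice, do not validate on a handful of arbitrary samples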
logi_pre=clf.predict(X[:5, :])
logi_pro=clf.predict_proba(X[:5, :]) 
logi_accuracy=clf.score(x_test, y_test)
logi_deci=clf.decision_function(X[-5:,:])
print(y)
print("prediction of first 5 samples: ",end=" ")
print(logi_pre)
print("prediction probability of first 5 samples: ")
print(logi_pro)
print("decision_function of last 5 samples (positive class is predicted when > 0): ",end=" ")
print(logi_deci)
print("prediction accuracy of test data: ",end=" ")
print(logi_accuracy)

predictions=clf.predict_proba(x_test)  # probability of each class
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc=auc(false_positive_rate,recall)
plt.title('ROC of logistic regression in test data')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

Example of AUC/ROC computation from the official sklearn site

print(__doc__)
# ROC for model

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
# from scipy import interp  # not used in this excerpt; numpy.interp is the equivalent if interpolation is needed

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Aside: an example of SVM parameter settings

import numpy as np
from sklearn.svm import SVR,SVC
import matplotlib.pyplot as plt

# #############################################################################
# Generate sample data
X = np.sort(16 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()

# #############################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(16))

# Fit regression model
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)  # SVR, not SVC: the targets here are continuous
# svr_rbf = SVR(kernel='rbf', C=1e3, gamma=100) # may overfit
# svr_lin = SVR(kernel='linear', C=1e3)
# svr_poly = SVR(kernel='poly', C=1e3, degree=2)
y_rbf = svr_rbf.fit(X, y).predict(X)

# Look at the results
lw = 2
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_rbf, color='navy', lw=lw, label='RBF model')
# plt.plot(X, y_lin, color='c', lw=lw, label='Linear model')
# plt.plot(X, y_poly, color='cornflowerblue', lw=lw, label='Polynomial model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
