There are many metrics for classification problems; here we cover only a few commonly used ones.
1. Accuracy score
Let \(y\) denote the true labels and \(\hat{y}\) the predicted labels. The accuracy score is computed as:
\(accuracy(y,\hat{y}) = \dfrac 1 m \displaystyle\sum_{i=1}^m 1(y_i = \hat{y}_i)\), where \(1(\cdot)\) is the indicator function and \(m\) is the number of samples. For example:
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
Parameter note:
With normalize=False, the function returns the number of correctly classified samples instead of the fraction.
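As a sanity check, the formula above can be evaluated directly with NumPy; this is a minimal sketch using the same data as the example:

```python
import numpy as np

y_true = np.array([0, 1, 2, 3])
y_pred = np.array([0, 2, 1, 3])

# mean of the indicator (y_i == y_hat_i) -> fraction correct
accuracy = np.mean(y_true == y_pred)
# sum of the indicator -> count correct, like normalize=False
n_correct = np.sum(y_true == y_pred)

print(accuracy)   # 0.5
print(n_correct)  # 2
```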
2. Confusion matrix
For example:
# multi-class
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
For a binary problem, you can also call ravel() on the matrix to obtain TN, FP, FN and TP directly; the order of these four values is simply the matrix read row by row.
>>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
>>> (tn, fp, fn, tp)
(0, 2, 1, 1)
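Once the four values are unpacked, the common rates follow directly from their definitions. A minimal sketch with the same data as above:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()

tpr = tp / (tp + fn)  # true positive rate (recall): 1 / (1 + 1) = 0.5
fpr = fp / (fp + tn)  # false positive rate: 2 / (2 + 0) = 1.0

print(tpr, fpr)  # 0.5 1.0
```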
3. Classification report
classification_report produces a text report of precision, recall, F1-score and related values. For example:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

   micro avg       0.60      0.60      0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
4. Precision, recall and F-measures
These concepts are standard, so we go straight to the code:
>>> import sklearn.metrics
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>>> y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
>>> confusion_matrix(y_true, y_pred)
array([[4, 3],
       [1, 3]], dtype=int64)
>>> sklearn.metrics.precision_score(y_true, y_pred)
0.5
>>> sklearn.metrics.recall_score(y_true, y_pred)
0.75
>>> sklearn.metrics.f1_score(y_true, y_pred)
0.6
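To make the definitions concrete, the three scores above can be recomputed by hand from the confusion-matrix entries; a minimal sketch:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 3) = 0.5
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.5 0.75 0.6
```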
5. Log loss
Log loss is also known as logistic regression loss or cross-entropy loss. For binary classification, let \(y\) be the true label and \(p = P(y=1)\) the predicted probability; then the log loss is:
\(L_{\log}(y,p) = -y\log p-(1-y)\log(1-p)\). For example:
>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)
0.1738...
Code note: y_pred above holds each sample's predicted probabilities for class 0 and class 1. Passing only the class-1 probabilities, y_pred = [0.1, 0.2, 0.7, 0.99], yields the same loss.
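The reported value can be verified by evaluating the formula above per sample and averaging; a minimal sketch:

```python
import numpy as np

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.7, 0.99])  # predicted P(y = 1)

# per-sample log loss from the formula, averaged over samples
loss = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

print(round(loss, 4))  # 0.1738
```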
6. Receiver operating characteristic (ROC)
The ROC curve plots FPR on the horizontal axis against TPR on the vertical axis. For example:
>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])
Here, pos_label=2 tells roc_curve to treat label 2 as the positive class. The function returns three arrays, fpr, tpr and thresholds; to draw the ROC curve, simply call plt.plot(fpr, tpr).
The related roc_auc_score function computes the AUC, i.e. the area under the ROC curve:
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
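Since AUC is literally the area under the (fpr, tpr) curve, the same value can also be obtained by feeding the roc_curve output to sklearn's generic trapezoidal-area helper, sklearn.metrics.auc; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# trapezoidal area under the ROC curve equals roc_auc_score here
fpr, tpr, _ = roc_curve(y_true, y_scores)
auc_value = auc(fpr, tpr)

print(auc_value)  # 0.75
```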