There are many metrics for classification problems; here we cover only a few commonly used ones.
1. Accuracy score
Let \(y\) denote the true labels and \(\hat{y}\) the predicted labels. The accuracy score is computed as:
\(accuracy(y,\hat{y}) = \dfrac 1 m \displaystyle\sum_{i=1}^m 1(y_i = \hat{y}_i)\), where \(1(\cdot)\) is the indicator function and \(m\) is the number of samples. For example:
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
Parameter note:
With normalize=False, the function returns the number of correctly classified samples instead of the fraction.
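As a sanity check, the formula above can be evaluated directly with NumPy; this is a minimal sketch using the same data as the example:

```python
import numpy as np

y_true = np.array([0, 1, 2, 3])
y_pred = np.array([0, 2, 1, 3])

# mean of the indicator (y_i == y_hat_i) -> fraction correct
accuracy = np.mean(y_true == y_pred)
# sum of the indicator -> count correct, like normalize=False
n_correct = np.sum(y_true == y_pred)

print(accuracy)   # 0.5
print(n_correct)  # 2
```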
2. Confusion matrix
For example:
# multi-class
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
For a binary problem, you can also call ravel() on the matrix to obtain TN, FP, FN and TP directly; the order of these four values is simply the matrix read row by row.
>>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
>>> (tn, fp, fn, tp)
(0, 2, 1, 1)
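Once the four values are unpacked, the common rates follow directly from their definitions. A minimal sketch with the same data as above:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()

tpr = tp / (tp + fn)  # true positive rate (recall): 1 / (1 + 1) = 0.5
fpr = fp / (fp + tn)  # false positive rate: 2 / (2 + 0) = 1.0

print(tpr, fpr)  # 0.5 1.0
```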
3. Classification report
classification_report produces a text report of precision, recall, F1-score and related values. For example:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

   micro avg       0.60      0.60      0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
4. Precision, recall and F-measures
These concepts are standard, so we go straight to the code:
>>> import sklearn.metrics
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>>> y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
>>> confusion_matrix(y_true, y_pred)
array([[4, 3],
       [1, 3]], dtype=int64)
>>> sklearn.metrics.precision_score(y_true, y_pred)
0.5
>>> sklearn.metrics.recall_score(y_true, y_pred)
0.75
>>> sklearn.metrics.f1_score(y_true, y_pred)
0.6
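To make the definitions concrete, the three scores above can be recomputed by hand from the confusion-matrix entries; a minimal sketch:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 3) = 0.5
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.5 0.75 0.6
```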
5. Log loss
Log loss is also known as logistic regression loss or cross-entropy loss. For binary classification, let \(y\) be the true label and \(p = P(y=1)\) the predicted probability; then the log loss is:
\(L_{\log}(y,p) = -y\log p-(1-y)\log(1-p)\). For example:
>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)
0.1738...
Code note: y_pred above holds each sample's predicted probabilities for class 0 and class 1. Passing only the class-1 probabilities, y_pred = [0.1, 0.2, 0.7, 0.99], yields the same loss.
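The reported value can be verified by evaluating the formula above per sample and averaging; a minimal sketch:

```python
import numpy as np

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.7, 0.99])  # predicted P(y = 1)

# per-sample log loss from the formula, averaged over samples
loss = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

print(round(loss, 4))  # 0.1738
```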
6. Receiver operating characteristic (ROC)
The ROC curve plots FPR on the horizontal axis against TPR on the vertical axis. For example:
>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])
Here, pos_label=2 tells roc_curve to treat label 2 as the positive class. The function returns three arrays, fpr, tpr and thresholds; to draw the ROC curve, simply call plt.plot(fpr, tpr).
The related roc_auc_score function computes the AUC, i.e. the area under the ROC curve:
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
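Since AUC is literally the area under the (fpr, tpr) curve, the same value can also be obtained by feeding the roc_curve output to sklearn's generic trapezoidal-area helper, sklearn.metrics.auc; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# trapezoidal area under the ROC curve equals roc_auc_score here
fpr, tpr, _ = roc_curve(y_true, y_scores)
auc_value = auc(fpr, tpr)

print(auc_value)  # 0.75
```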