1. Introduction
This post is a compilation of the performance measures used in the paper "The Impact of Automated Parameter Optimization on Defect Prediction Models".
When I first read this paper, I was in awe for quite a while. From a pool of 101 datasets, the authors selected 18 spanning multiple languages and domains, and used 12 performance measures to examine how much parameter optimization improves the performance of popular machine-learning classifiers (models). The whole experimental procedure is rigorous. The awe only faded when I read the same lab's 2017 paper, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models": the methodology is the same and the comparative experiments are set up much alike, like products rolling off an assembly line. I shed envious tears (only established experts in a field can publish survey-style papers like these).
2. Summary of Performance Measures
These two blog posts are worth reading alongside this one:
- "sklearn: a complete guide to evaluation metrics"
- "Machine learning performance evaluation metrics"
Most of these metrics are derived from the binary-classification confusion matrix, whose four cells are the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
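As a small illustration with hypothetical labels (1 = defective, 0 = clean), the four confusion-matrix counts can be tallied directly in plain Python:

```python
# Hypothetical ground-truth labels and predictions: 1 = defective, 0 = clean
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Tally each cell of the 2x2 confusion matrix
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(tp, tn, fp, fn)  # → 3 3 1 1
```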
| Metric | Formula | Meaning | References |
|---|---|---|---|
| Precision | $P=\frac{TP}{TP+FP}$ | Proportion of modules predicted defective that are actually defective | [3] |
| Recall (TPR) | $R=\frac{TP}{TP+FN}$ | Proportion of positive (defective) modules correctly classified | [3] |
| F-measure ($F_1$) | $F=2 \times \frac{P \times R}{P+R}$ | Harmonic mean of precision and recall | [3] |
| Specificity (TNR) | $S=\frac{TN}{TN+FP}$ | Proportion of negative (clean) modules correctly classified | |
| False positive rate (FPR) | $FPR=\frac{FP}{TN+FP}$ | Proportion of negative (clean) modules misclassified | [4] |
| $G_{mean}$ | $G\text{-}mean=\sqrt{R \times S}$ | Geometric mean of R and S | |
| $G_{measure}$ | $G_{measure}=\frac{2 \times pd \times (1-pf)}{pd+(1-pf)}$ | Harmonic mean of pd and $1-pf$, where pd = TPR and pf = FPR | |
| Balance | $1-\sqrt{\frac{(0-pf)^{2}+(1-pd)^{2}}{2}}$ | Normalized Euclidean distance from the ideal point (pf = 0, pd = 1), where pd = TPR and pf = FPR | [5], [6] |
| Matthews correlation coefficient (MCC) | $MCC=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | Correlation coefficient between the actual and predicted classifications | [7] |
| AUC | Area under the ROC curve | Discrimination ability across all classification thresholds | [8], [9], [10], [11], [12], [13] |
| Brier | $\frac{1}{N} \sum_{i=1}^{N}\left(p_{i}-y_{i}\right)^{2}$ | Mean squared difference between predicted probabilities and actual outcomes | [14], [15] |
| LogLoss | $logloss=-\frac{1}{N} \sum_{i=1}^{N}\left(y_{i} \log \left(p_{i}\right)+\left(1-y_{i}\right) \log \left(1-p_{i}\right)\right)$ | Classification loss function (binary cross-entropy) | |
Notes:
- MCC: essentially a correlation coefficient between the actual and predicted classifications. Its range is [−1, 1]: 1 indicates a perfect prediction, 0 indicates a prediction no better than random guessing, and −1 indicates total disagreement between prediction and reality.
- Brier score: $p_{i}$ is the predicted probability and $y_{i}$ is the true label (0 or 1). The range is [0, 1]: 0 is the best possible performance, 1 the worst, and 0.25 corresponds to random (chance-level) prediction.
- LogLoss: $p_{i}$ is the predicted probability and $y_{i}$ is the true label (0 or 1); a standard performance measure in Kaggle competitions.
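As a sanity check on the formulas above, here is a minimal plain-Python sketch (the function names are my own, not from the paper):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the threshold-based measures from the table,
    given the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # pd, the true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    fpr = fp / (tn + fp)               # pf, the false positive rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": fpr,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        # harmonic mean of pd and (1 - pf)
        "g_measure": 2 * recall * (1 - fpr) / (recall + (1 - fpr)),
        # distance from the ideal point (pf = 0, pd = 1), normalized to [0, 1]
        "balance": 1 - math.sqrt(((0 - fpr) ** 2 + (1 - recall) ** 2) / 2),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels):
    """Binary cross-entropy between predicted probabilities and labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(labels)
```

With tp = tn = 3 and fp = fn = 1, for instance, precision, recall, F1, G-mean, G-measure, and balance all come out to 0.75, and MCC to 0.5.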
3. References
[1] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "The Impact of Automated Parameter Optimization on Defect Prediction Models," IEEE Trans. Softw. Eng., vol. 45, no. 7, pp. 683–711, Jul. 2019, doi: 10.1109/TSE.2018.2794977.
[2] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models," IEEE Trans. Softw. Eng., vol. 43, no. 1, pp. 1–18, Jan. 2017, doi: 10.1109/TSE.2016.2584050.
[3] W. Fu, T. Menzies, and X. Shen, "Tuning for software analytics: is it really necessary?" Information and Software Technology, vol. 76, pp. 135–146, 2016.
[4] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," IEEE Transactions on Software Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[5] H. Zhang and X. Zhang, "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'," IEEE Transactions on Software Engineering (TSE), vol. 33, no. 9, pp. 635–636, 2007.
[6] A. Tosun, "Ensemble of Software Defect Predictors: A Case Study," in Proceedings of the International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 318–320.
[7] M. Shepperd, D. Bowes, and T. Hall, "Researcher Bias: The Use of Machine Learning in Software Defect Prediction," IEEE Transactions on Software Engineering (TSE), vol. 40, no. 6, pp. 603–616, 2014.
[8] S. den Boer, N. F. de Keizer, and E. de Jonge, "Performance of prognostic models in critically ill cancer patients - a review," Critical Care, vol. 9, no. 4, pp. R458–R463, 2005.
[9] F. E. Harrell Jr., Regression Modeling Strategies, 1st ed. Springer, 2002.
[10] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[11] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Transactions on Software Engineering (TSE), vol. 34, no. 4, pp. 485–496, 2008.
[12] E. W. Steyerberg, Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media, 2008.
[13] E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, N. Obuchowski, M. J. Pencina, and M. W. Kattan, "Assessing the performance of prediction models: a framework for some traditional and novel measures," Epidemiology, vol. 21, no. 1, pp. 128–138, 2010.
[14] G. W. Brier, "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, vol. 78, no. 1, pp. 25–27, 1950.
[15] K. Rufibach, "Use of Brier score to assess binary predictions," Journal of Clinical Epidemiology, vol. 63, no. 8, pp. 938–939, 2010.