Common performance metrics for machine learning

1. Introduction

This article compiles the performance metrics used in the paper "The Impact of Automated Parameter Optimization on Defect Prediction Models" [1].

When I first read this paper, I was stunned for quite a while. From 101 candidate data sets, the authors selected 18 spanning multiple languages and domains, and used 12 performance metrics to examine how much automated parameter optimization improves popular machine learning classifiers (models) for defect prediction. The whole experimental process is rigorous. The impression only wore off when I read another paper from the same lab, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models" [2], published in 2017: the methods are the same and the comparison experiments are set up similarly, like products coming off an assembly line. I could only shed tears of envy (this kind of survey-style paper can really only be written by experts in the field).

2. Summary of performance metrics

These two blog posts are worth reading alongside this article:
  • sklearn—Comprehensive Evaluation Indicators
  • Machine Learning Performance Evaluation Indicators

Most of these metrics are derived from the binary-classification confusion matrix, whose layout is:

|  | Predicted positive (defective) | Predicted negative (defect-free) |
| --- | --- | --- |
| Actually positive (defective) | TP (true positive) | FN (false negative) |
| Actually negative (defect-free) | FP (false positive) | TN (true negative) |
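
As a quick sanity check, here is a minimal sketch (with made-up labels, not data from the paper) showing how the four counts TP, FP, FN and TN can be obtained with scikit-learn:

```python
# Minimal sketch: build the 2x2 confusion matrix for a hypothetical
# set of defect predictions (labels are invented for illustration).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]  # 1 = defective module, 0 = defect-free
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]  # hypothetical classifier output

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # -> 5 1 1 3
```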

| Metric | Formula | Meaning | References |
| --- | --- | --- | --- |
| Precision (P) | $P=\frac{TP}{TP+FP}$ | Proportion of modules predicted as defective that are actually defective | [3] |
| Recall (R, TPR) | $R=\frac{TP}{TP+FN}$ | Proportion of positive classes (defective modules) that are correctly classified | [3] |
| F-measure ($F_{1}$) | $F=2\times\frac{P\times R}{P+R}$ | Harmonic mean of precision and recall | [3] |
| Specificity (S, TNR) | $S=\frac{TN}{TN+FP}$ | Proportion of negative classes (defect-free modules) that are correctly classified | |
| False positive rate (FPR) | $FPR=\frac{FP}{TN+FP}$ | Proportion of negative classes (defect-free modules) that are misclassified | [4] |
| G-mean | $G_{mean}=\sqrt{R\times S}$ | Geometric mean of R and S | |
| G-measure | $G_{measure}=\frac{2\times pd\times(1-pf)}{pd+(1-pf)}$ | Harmonic mean of pd and (1 − pf), where pd = TPR and pf = FPR | |
| Balance | $Balance=1-\sqrt{\frac{(0-pf)^{2}+(1-pd)^{2}}{2}}$ | Normalized Euclidean distance from the ideal point (pf = 0, pd = 1), where pd = TPR and pf = FPR | [5], [6] |
| Matthews Correlation Coefficient (MCC) | $MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | Correlation coefficient between actual and predicted classification | [7] |
| AUC | — | Area under the ROC curve | [8]–[13] |
| Brier score | $Brier=\frac{1}{N}\sum_{i=1}^{N}(p_{i}-y_{i})^{2}$ | Gap between predicted probabilities and actual outcomes | [14], [15] |
| LogLoss | $logloss=-\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right)$ | Classification loss function (binary cross-entropy) | |
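
To make the formulas above concrete, the sketch below computes the confusion-matrix-based metrics in plain Python from the counts obtained earlier. The function name threshold_metrics is my own shorthand, not something defined in the paper, and it assumes none of the denominators are zero:

```python
# Sketch of the confusion-matrix-based metrics from the table above.
# Assumes tp, fp, tn, fn > 0 so no denominator is zero.
from math import sqrt

def threshold_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # pd, TPR
    specificity = tn / (tn + fp)         # TNR
    fpr = fp / (tn + fp)                 # pf
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = sqrt(recall * specificity)
    g_measure = 2 * recall * (1 - fpr) / (recall + (1 - fpr))
    balance = 1 - sqrt(((0 - fpr) ** 2 + (1 - recall) ** 2) / 2)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision, "recall": recall, "specificity": specificity,
        "fpr": fpr, "f_measure": f_measure, "g_mean": g_mean,
        "g_measure": g_measure, "balance": balance, "mcc": mcc,
    }

# Using the hypothetical counts from the earlier example (tp=3, fp=1, tn=5, fn=1):
print(threshold_metrics(tp=3, fp=1, tn=5, fn=1))
```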

Supplementary notes (a scikit-learn sketch of these metrics follows the list):

  • MCC: MCC is essentially a correlation coefficient between the actual and the predicted classification. Its range is [-1, 1]: 1 means a perfect prediction, 0 means the prediction is no better than random guessing, and -1 means the predicted and actual classes disagree completely.
  • Brier score: $p_{i}$ is the predicted probability and $y_{i}$ is the true label (0 or 1). The score ranges over [0, 1]: 0 is the best possible performance, 1 the worst, and 0.25 corresponds to a random (uninformative) classifier.
  • LogLoss: $p_{i}$ is the predicted probability and $y_{i}$ is the true label (0 or 1); it is the standard performance metric in Kaggle competitions.
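
A minimal scikit-learn sketch of these probability-based metrics (MCC, Brier score, LogLoss, plus AUC from the table), again on made-up labels and predicted probabilities:

```python
# Sketch: probability-based metrics via scikit-learn (data is invented).
from sklearn.metrics import (
    matthews_corrcoef, brier_score_loss, log_loss, roc_auc_score,
)

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]          # true labels (1 = defective)
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1,
          0.7, 0.3, 0.6, 0.2, 0.1]               # predicted P(defective)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # hard labels at threshold 0.5

print("MCC    :", matthews_corrcoef(y_true, y_pred))  # in [-1, 1], 1 is perfect
print("Brier  :", brier_score_loss(y_true, y_prob))   # 0 best, 0.25 ~ random
print("LogLoss:", log_loss(y_true, y_prob))           # lower is better
print("AUC    :", roc_auc_score(y_true, y_prob))      # 0.5 ~ random ranking
```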

3. References

[1] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "The Impact of Automated Parameter Optimization on Defect Prediction Models," IEEE Transactions on Software Engineering (TSE), vol. 45, no. 7, pp. 683–711, Jul. 2019, doi: 10.1109/TSE.2018.2794977.
[2] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models," IEEE Transactions on Software Engineering (TSE), vol. 43, no. 1, pp. 1–18, Jan. 2017, doi: 10.1109/TSE.2016.2584050.
[3] W. Fu, T. Menzies, and X. Shen, "Tuning for software analytics: Is it really necessary?" Information and Software Technology, vol. 76, pp. 135–146, 2016.
[4] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," IEEE Transactions on Software Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[5] H. Zhang and X. Zhang, "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'," IEEE Transactions on Software Engineering (TSE), vol. 33, no. 9, pp. 635–636, 2007.
[6] A. Tosun, "Ensemble of Software Defect Predictors: A Case Study," in Proceedings of the International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 318–320.
[7] M. Shepperd, D. Bowes, and T. Hall, "Researcher Bias: The Use of Machine Learning in Software Defect Prediction," IEEE Transactions on Software Engineering (TSE), vol. 40, no. 6, pp. 603–616, 2014.
[8] S. den Boer, N. F. de Keizer, and E. de Jonge, "Performance of prognostic models in critically ill cancer patients - a review," Critical Care, vol. 9, no. 4, pp. R458–R463, 2005.
[9] F. E. Harrell Jr., Regression Modeling Strategies, 1st ed. Springer, 2002.
[10] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[11] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Transactions on Software Engineering (TSE), vol. 34, no. 4, pp. 485–496, 2008.
[12] E. W. Steyerberg, Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media, 2008.
[13] E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, N. Obuchowski, M. J. Pencina, and M. W. Kattan, "Assessing the performance of prediction models: a framework for some traditional and novel measures," Epidemiology, vol. 21, no. 1, pp. 128–138, 2010.
[14] G. W. Brier, "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, vol. 78, no. 1, pp. 25–27, 1950.
[15] K. Rufibach, "Use of Brier score to assess binary predictions," Journal of Clinical Epidemiology, vol. 63, no. 8, pp. 938–939, 2010.
