A Comprehensive Guide to Evaluation Metrics for Machine Learning Classification and Regression Algorithms

This is the third article in the machine learning series. After reading it, you will have a solid grasp of the evaluation metrics for classification and regression algorithms.

PS: there are exercises at the end of the article.

After reading the previous article on machine learning basics, you already know what underfitting and overfitting are, as well as Bayes error, bias, and variance. This article introduces a number of metrics for evaluating the offline performance of machine learning models.

Once we have trained several models, how do we measure their performance? We need a standard for judging how "good" or "bad" a model is, which is what we call an evaluation metric. When comparing different models, different evaluation metrics often lead to different conclusions, which means that how well a model performs is relative.

Different types of learning tasks call for different evaluation metrics. Here we introduce some of the most common metrics for evaluating classification and regression algorithms.

Classification Metrics

Many real-world problems are binary classification problems, so we use binary classification as an example to illustrate the classification metrics below.

Before formally introducing the metrics, let's clarify some basic terms: "positive", "true", "positive class", and "1" all refer to the same thing, and "negative", "false", "negative class", and "0" likewise all refer to the same thing. For example, if a model predicts 1 for a sample, we can equally say the model predicts the sample to be true, or the positive class, or positive; they all mean the same thing.

Confusion matrix

The confusion matrix is a commonly used tool for evaluating classification problems. For a k-class classification problem, it is a k×k table that records the classifier's predictions. For the common binary case, the confusion matrix is 2×2.

In binary classification, each sample can be categorized as a true positive (TP), true negative (TN), false positive (FP), or false negative (FN) according to the combination of its true label and the model's prediction. From the TP, TN, FP, and FN counts we can build the binary confusion matrix.
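As a minimal sketch (assuming NumPy and scikit-learn are available; the labels below are made up for illustration), the four counts and the 2×2 confusion matrix can be obtained like this. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]] when labels=[0, 1].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1], the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP =", tp, "TN =", tn, "FP =", fp, "FN =", fn)
```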

Accuracy

Accuracy is the proportion of samples the model predicts correctly (positive samples predicted as positive and negative samples predicted as negative) out of the total number of samples, that is,

$$\text{Accuracy} = \frac{n_{correct}}{n_{total}}$$

where $n_{correct}$ is the number of samples the model classifies correctly and $n_{total}$ is the total number of samples.

In binary classification, accuracy can be computed with the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy is one of the simplest and most intuitive evaluation metrics for classification, but it has limitations. For example, in a binary problem where negative samples account for 99% of the data, a model that predicts every sample as negative achieves 99% accuracy. Although that accuracy looks high, the model is in fact useless, because it cannot find a single positive sample.

Precision

Precision is the proportion of samples that are actually positive among all the samples the model predicts as positive, that is,

$$\text{Precision} = \frac{TP}{TP + FP}$$

To illustrate: suppose the police round up 10 people and 6 of them turn out to be thieves; the precision is 6/10 = 0.6.

Recall

Recall is the proportion of samples the model predicts as positive among all samples that are actually positive, that is,

$$\text{Recall} = \frac{TP}{TP + FN}$$

Continuing the example: the police arrest 10 people, 6 of whom are thieves, while another 3 thieves get away; the recall is 6 / (6 + 3) ≈ 0.67.

F1 Score / Fα Score

In general, precision and recall trade off against each other: when precision is high, recall tends to be low, and when recall is high, precision tends to be low. The F1 score is therefore designed to take both precision and recall into account. F1 is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In some scenarios we do not care about precision and recall equally. In that case the Fα score, a more general form of the F1 score, can be used. Fα is defined as

$$F_\alpha = \frac{(1 + \alpha^2) \times \text{Precision} \times \text{Recall}}{\alpha^2 \times \text{Precision} + \text{Recall}}$$

where α measures the relative importance of recall with respect to precision: α > 1 gives recall more weight, α < 1 gives precision more weight, and α = 1 recovers F1.
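A minimal sketch of these formulas with scikit-learn, encoding the police-and-thieves example above as hypothetical labels (10 people arrested, 6 of them thieves, 3 thieves missed, plus 7 people correctly left alone). Note that scikit-learn's `fbeta_score` uses the name beta for what this article calls α.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Predicted 1 = arrested, true 1 = thief (hypothetical encoding of the example)
y_true = np.array([1] * 6 + [0] * 4 + [1] * 3 + [0] * 7)
y_pred = np.array([1] * 10 + [0] * 10)

print("precision:", precision_score(y_true, y_pred))      # 6/10 = 0.6
print("recall:   ", recall_score(y_true, y_pred))          # 6/9  ≈ 0.667
print("F1:       ", f1_score(y_true, y_pred))              # harmonic mean ≈ 0.632
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))   # weights recall more heavily
```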

Multi-classification

In practice we often face multi-class problems, where every pairwise combination of classes corresponds to a binary confusion matrix. Suppose there are n binary confusion matrices; how do we average the n results?

Macro Average

The first approach is to compute the metrics (e.g. precision and recall) on each confusion matrix separately and then average the results; this is called the macro average.

Micro Average

Besides the macro average, we can also average the corresponding elements of the binary confusion matrices to obtain average TP, TN, FP, and FN counts, and then compute the metrics from these averaged counts; this is called the micro average.
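A brief sketch of the difference using scikit-learn (the three-class labels are made up): with `average='macro'` the per-class scores are computed first and then averaged, while `average='micro'` pools the TP/FP/FN counts across classes before computing the score, which gives the same result as averaging the counts, since the common factor cancels in the ratio.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical 3-class labels
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 1]

print("macro precision:", precision_score(y_true, y_pred, average='macro'))
print("micro precision:", precision_score(y_true, y_pred, average='micro'))
print("macro recall:   ", recall_score(y_true, y_pred, average='macro'))
print("micro recall:   ", recall_score(y_true, y_pred, average='micro'))
```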

ROC

The metrics described above (accuracy, precision, recall, etc.) all require the model's predicted label (positive or negative class). Many models, however, output the probability that a sample belongs to the positive class, so a threshold must be specified: samples above the threshold are predicted as positive, the rest as negative. The choice of threshold directly determines the model's generalization ability.

The Receiver Operating Characteristic (ROC) curve is an evaluation metric that does not require specifying a threshold. The vertical axis of the ROC curve is the true positive rate (TPR) and the horizontal axis is the false positive rate (FPR).

The true positive rate and the false positive rate are computed as follows:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$

Notice that the formula for TPR is the same as that for recall. So how is the ROC curve drawn? The curve consists of a series of (FPR, TPR) points, but a given model with a single classification result yields only one (FPR, TPR) pair, i.e. one point on the ROC curve. How do we get more points?

We sort the model's predicted values for all samples (the probability of belonging to the positive class) in descending order, and then use each predicted probability in turn as the threshold. Under each threshold the model produces a prediction (how many samples fall into the positive and negative classes), from which one (FPR, TPR) pair, i.e. one point, is computed. Finally, connecting all the points gives the ROC curve. Clearly, the more thresholds we try, the more (FPR, TPR) pairs we generate and the smoother the ROC curve looks. In other words, the smoothness of the ROC curve is directly tied to the number of thresholds used, and not necessarily to the number of samples. In practice, most ROC curves we draw are not smooth.
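A minimal sketch of this procedure with scikit-learn (the labels and scores are hypothetical): `roc_curve` walks through the predicted scores as thresholds and returns one (FPR, TPR) point per threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# Area under the curve traced out above
print("AUC =", roc_auc_score(y_true, y_score))
```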

The closer the ROC curve is to the top-left corner, the better the model performs. The top-left corner is the point (0, 1), i.e. FPR = 0 and TPR = 1, which means FP (false positives) = 0 and FN (false negatives) = 0. That is a perfect model, since it classifies every sample correctly. All points on the diagonal (y = x) of the ROC plot indicate that the model's discriminating ability is no better than random guessing.

AUC

AUC (Area Under Curve) is defined as the area under the ROC curve. Obviously, AUC cannot exceed 1, and since the ROC curve usually lies above the line y = x, the AUC value is generally between 0.5 and 1.

How should we interpret AUC? Randomly pick a positive sample (P) and a negative sample (N), and let the model predict the probability that each belongs to the positive class. After ranking the samples by these probabilities, the probability that the positive sample is ranked ahead of the negative sample is exactly the AUC.

AUC can also be computed with the following formula:

$$AUC = \frac{\sum_{i \in \text{positives}} rank_i - \frac{|P|(|P| + 1)}{2}}{|P| \times |N|}$$

where $rank_i$ is the position (counting from 1) of the i-th positive sample after sorting all samples by the model's predicted probability in ascending order, |P| is the number of positive samples, and |N| is the number of negative samples.

Note that if several samples receive the same predicted probability, their rank is taken as the average of their original ranks. So for samples with tied scores, whether positive or negative, it does not matter which comes first.
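A small sketch of the rank formula on hypothetical data containing tied scores, cross-checked against scikit-learn's `roc_auc_score`; `scipy.stats.rankdata` averages the ranks of tied scores, matching the rule described above.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted positive-class probabilities (with ties)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.65, 0.6, 0.6, 0.3, 0.3, 0.1])

# Ranks start at 1 in ascending order of score; ties get the average rank
ranks = rankdata(y_score)
n_pos = int((y_true == 1).sum())
n_neg = int((y_true == 0).sum())

auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print("AUC (rank formula):", auc)                     # 0.6875 for this data
print("AUC (sklearn):     ", roc_auc_score(y_true, y_score))
```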

Log Loss

Log loss (logistic loss, logloss) is a likelihood-based measure of the predicted probabilities. Its standard form is:

$$logloss = -\log P(Y \mid X)$$

Minimizing the log loss essentially means using the known distribution observed in the samples to find the model parameters that make that distribution most probable, i.e. maximum likelihood estimation.

For binary classification, log loss is computed as:

$$logloss = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)$$

where N is the number of samples, $y_i$ is the true label of the i-th sample, and $p_i$ is the predicted probability that the i-th sample belongs to class 1.

Log loss can also be used for multi-class problems, where it is computed as:

$$logloss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log p_{ij}$$

where N is the number of samples, C is the number of classes, $y_{ij}$ indicates whether the i-th sample belongs to class j, and $p_{ij}$ is the predicted probability that the i-th sample belongs to class j.

Log loss measures the discrepancy between the predicted probability distribution and the true distribution; the smaller the value, the better.
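A small sketch of the binary formula on hypothetical labels and probabilities, cross-checked against scikit-learn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # predicted probability of class 1

# Binary log loss computed directly from the formula above
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print("manual logloss :", manual)
print("sklearn logloss:", log_loss(y_true, p_pred))
```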

Regression Metrics

Regression tasks also have their own evaluation metrics; let's take a look at them.

Mean Absolute Error

The formula for Mean Absolute Error (MAE) is:

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where N is the number of samples, $y_i$ is the true value of the i-th sample, and $\hat{y}_i$ is its predicted value.

Mean Squared Error

The formula for Mean Squared Error (MSE) is:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Mean Absolute Percentage Error

The formula for Mean Absolute Percentage Error (MAPE) is:

$$MAPE = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

MAPE expresses prediction quality as an absolute percentage error; the smaller, the better. If MAPE = 10, the predictions deviate from the true values by 10% on average.

Because MAPE is independent of scale, results on different problems are somewhat comparable in certain scenarios. Its drawbacks are also obvious: it is undefined wherever the true value $y_i = 0$. Also note that MAPE penalizes negative errors (predictions above the true value) more heavily than positive ones: if a hotel bill is predicted to be 200 yuan, the MAPE is larger when the true value is 150 yuan (|150 − 200| / 150 ≈ 33%) than when it is 250 yuan (|250 − 200| / 250 = 20%).
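A quick numeric check of this asymmetry, using a hypothetical MAPE helper in plain NumPy:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined when y_true contains 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Hotel-bill example: the same prediction of 200 yuan
print(mape([150], [200]))   # true value 150 -> 100 * 50/150 ≈ 33.3
print(mape([250], [200]))   # true value 250 -> 100 * 50/250 = 20.0
```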

Root Mean Squared Error

The formula for Root Mean Squared Error (RMSE) is:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

RMSE represents the sample standard deviation of the differences between predicted and true values. Compared with MAE, RMSE penalizes large errors more heavily. One drawback is that RMSE is sensitive to outliers, which can make its value extremely large.

A commonly used variant of RMSE is the Root Mean Squared Logarithmic Error (RMSLE), whose formula is:

$$RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2}$$

RMSLE penalizes under-predictions more than over-predictions: for example, if the average hotel bill is 200 yuan, predicting 150 yuan is penalized more than predicting 250 yuan.

R2

The formula for R2 (R-squared) is:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the mean of the true values.

R2 measures the proportion of the variance in the dependent variable that can be explained by the independent variables, and usually ranges from 0 to 1. The closer R2 is to 1, the larger the share of the total sum of squares accounted for by the regression, the closer the regression line fits the observed points, the more of the variation in y is explained by changes in x, and the better the fit.
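For completeness, a sketch that computes the regression metrics above with NumPy and scikit-learn on made-up data; `mean_squared_log_error` returns the squared log error, so its square root gives RMSLE.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)

# Hypothetical true and predicted values
y_true = np.array([150.0, 250.0, 300.0, 180.0, 220.0])
y_pred = np.array([160.0, 240.0, 280.0, 200.0, 210.0])

print("MAE  :", mean_absolute_error(y_true, y_pred))
print("MSE  :", mean_squared_error(y_true, y_pred))
print("RMSE :", np.sqrt(mean_squared_error(y_true, y_pred)))
print("RMSLE:", np.sqrt(mean_squared_log_error(y_true, y_pred)))
print("R2   :", r2_score(y_true, y_pred))
```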

Exercises

Now that you have finished the article, try a few exercises to check what you have learned:

  1. Why is the smoothness of the ROC curve not strictly related to the number of samples?

  2. If a model's AUC is below 0.5, what might be the cause?

  3. In a traffic-forecasting scenario, you try several regression models but all of them give a very high RMSE. What might be the reason?

  4. In a binary classification problem, the true labels of 15 samples are [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0] and the model's predictions are [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]. Compute the accuracy, precision, recall, and F1 score.

  5. In a binary classification problem, the true labels of 7 samples [A, B, C, D, E, F, G] are [1, 1, 0, 0, 1, 1, 0] and the model's predicted probabilities are [0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.3]. Compute the AUC.

To learn more about artificial intelligence, follow the WeChat official account AI派.

I will publish the answers to all the exercises above in my Zhishi Xingqiu (Knowledge Planet) group, so the material is kept in one place for later review. If you have any questions about the article or want to study and discuss further, you are welcome to join my Knowledge Planet (join by scanning the QR code below or clicking "Read the original").

References:

[1] Zhou Zhihua. Machine Learning. Chapter 2, Section 3 (Performance Measures)
[2] Meituan Algorithm Team. Meituan Machine Learning in Practice. Chapter 1, Section 1 (Evaluation Metrics)
[3] https://blog.csdn.net/qq_22238533/article/details/78666436
[4] https://blog.csdn.net/u013704227/article/details/77604500

 
