Common evaluation indicators related to machine learning

Table of Contents

1. Accuracy
2. Confusion Matrix
3. Precision and Recall
4. Fβ Score
5. PR curve and ROC curve
    5.1 PR curve
    5.2 ROC curve
6. AUC
7. Regression metrics
    7.1 Mean Absolute Error (MAE)
    7.2 Mean Squared Error (MSE)
8. Object detection metrics
    8.1 IoU (Intersection over Union)
    8.2 NMS (Non-Maximum Suppression)
    8.3 AP and mAP


When using machine learning models, we inevitably have to ask how to judge whether a model is good or bad. And when reading papers, terms such as "precision" and "recall" keep coming up and are easy to mix up and forget. This post therefore records the common evaluation metrics.

1. Accuracy

Accuracy is the number of correctly predicted samples divided by the total number of samples. It is generally used to evaluate the overall correctness of a model, but the information it carries is limited and it cannot fully characterize a model's performance.
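As a quick illustration, here is a minimal NumPy sketch of the accuracy calculation; the function name accuracy and the toy labels are made up for this example.

import numpy as np

def accuracy(y_true, y_pred):
    # fraction of samples whose predicted label matches the true label
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# 4 of the 5 toy predictions are correct, so accuracy = 0.8
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))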

2. Confusion Matrix

For a binary classification task, a classifier's predictions fall into four categories:

  • TP : true positive, a positive sample predicted as positive
  • TN : true negative, a negative sample predicted as negative
  • FP : false positive, a negative sample predicted as positive
  • FN : false negative, a positive sample predicted as negative

                          Actual positive          Actual negative
Predicted positive        TP (True Positive)       FP (False Positive)
Predicted negative        FN (False Negative)      TN (True Negative)
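The four counts can be computed directly from the labels. Below is a minimal sketch; the helper name confusion_counts and the toy labels are illustrative, assuming 1 marks the positive class.

import numpy as np

def confusion_counts(y_true, y_pred):
    # TP, FP, FN, TN for binary labels where 1 = positive, 0 = negative
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)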

3. Precision and Recall

Accuracy breaks down on severely imbalanced data. Take email as an example: suppose we have 1000 emails, of which 999 are normal and only 1 is spam. A model that labels every email as normal then reaches an accuracy of 99.9%. The model looks good, yet it cannot identify a single spam email, which is exactly what it was trained to do. So accuracy alone is clearly not the evaluation metric we need here, although in practice it is still a useful first rough check of a model.

From the above we know that accuracy alone cannot adequately measure a model, so we introduce Precision and Recall. Both apply only to binary classification problems. Their definitions are as follows:

Precision

  • Definition : the ratio of the number of correctly classified positive samples to the number of samples classified as positive; also called the precision rate.
  • Calculation formula: Precision=\frac{TP}{TP+FP}

Recall (recall rate)

  • Definition : the ratio of the number of correctly classified positive samples to the actual number of positive samples; also known as the true positive rate or sensitivity.
  • Calculation formula: Recall=\frac{TP}{TP+FN}

Ideally, both precision and recall would be as high as possible. In practice, however, the two are often in tension: when precision is high, recall tends to be low, and when precision is low, recall tends to be high. This trade-off is easy to see on the PR curve. For example, in web search, returning only the single most relevant page gives 100% precision but very low recall, while returning every page gives 100% recall but very low precision. Which metric matters more therefore depends on the actual application.
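The two formulas translate directly into code. A small sketch, reusing the TP/FP/FN counts from the confusion-matrix example above (the function name precision_recall is just illustrative):

def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# with TP = 2, FP = 1, FN = 1 both metrics are 2/3
print(precision_recall(2, 1, 1))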

4. Fβ Score

Normally, precision and recall influence each other: if precision is high, recall tends to be low, and if recall is high, precision tends to be low. Is there a single number that balances the two? This is where the F1 score comes in. F1 is a commonly used metric defined as the harmonic mean of precision and recall, that is:

                                                  \frac{2}{F_{1}}=\frac{1}{P}+\frac{1}{R}

                                                 F_{1}=\frac{2PR}{P+R}

Of course, the F value can be generalized by assigning different weights to precision and recall in a weighted harmonic mean:

                                                 F_{\beta }=\frac{ \left ( 1 + \beta ^{2} \right ) PR}{\beta ^{2}P+R}

Here, when β>1 recall carries more weight; when β=1 the formula reduces to the standard F1; when β<1 precision carries more weight.
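A minimal sketch of the Fβ formula above (the function name f_beta is illustrative; β = 1 reproduces F1):

def f_beta(precision, recall, beta=1.0):
    # weighted harmonic mean of precision and recall
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.75, 0.6))            # standard F1
print(f_beta(0.75, 0.6, beta=2.0))  # beta > 1: recall weighs more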

5. PR curve and ROC curve

5.1 PR curve

The PR curve plots Precision on the vertical axis against Recall on the horizontal axis.

Such a plot makes the trade-off between precision and recall easy to see: the closer the curve gets to the upper-right corner, the better the performance. The area under the PR curve is called the AP score, and to some extent it reflects how well the model achieves both high precision and high recall. This area is not convenient to compute, however, so when both precision and recall matter, the F1 score or the AUC is generally used instead (the ROC curve is easy to draw, and the area under it is easier to compute).

Generally speaking, when using PR curves to compare models on the same test set, look at whether each curve is smooth and which curve lies above the other: the curve that lies higher usually corresponds to the better model.
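For drawing the curve in practice, here is a minimal sketch using scikit-learn and matplotlib (both assumed to be installed); y_true holds the 0/1 labels, y_score the predicted probabilities for the positive class, and the toy values are made up for illustration.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.55]

# one (precision, recall) point per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.show()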

5.2 ROC curve

Many machine learning models output a predicted probability, and metrics such as precision and recall require a classification threshold on that probability: a prediction above the threshold counts as positive, and one below it as negative. This gives the model one more hyperparameter, and that hyperparameter affects the model's generalization ability.

The ROC (Receiver Operating Characteristic) curve does not need such a threshold and is commonly used to assess binary classifiers. Its horizontal axis is the false positive rate (FPR), the proportion of negative samples wrongly judged positive; its vertical axis is the true positive rate (TPR), the proportion of positive samples correctly judged positive.

  • True Positive Rate (TPR): TPR=\frac{TP}{TP+FN}, the proportion of all positive samples that the classifier predicts correctly (equal to Recall).
  • False Positive Rate (FPR): FPR=\frac{FP}{FP+TN}, the proportion of all negative samples that the classifier predicts incorrectly (as positive).

The ROC curve is therefore the relationship between FPR and TPR, i.e. cost versus benefit: the higher the benefit and the lower the cost, the better the model. The better the model, the more its ROC curve bends toward the upper-left corner, similar to the PR curve. If one model's ROC curve completely encloses another's, the first model is better.
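Analogously to the PR sketch above, the ROC curve and its area can be obtained with scikit-learn (assumed installed); y_true and y_score are the same toy labels and scores as before.

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.55]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_true, y_score))     # area under the ROC curve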

It is also worth noting that the way AUC is computed takes into account the learner's ability to separate positive and negative examples, so it can still give a reasonable evaluation when the classes are imbalanced. AUC is insensitive to class imbalance, which is one reason it is commonly used to evaluate learners on imbalanced data.

The AUC score is the area under the ROC curve, and the larger it is, the better the classifier. This area is obviously never greater than 1, and because the ROC curve generally lies above the line y=x, AUC typically falls between 0.5 and 1. AUC is used as an evaluation criterion because the ROC curve itself often does not show clearly which classifier is better, whereas a single number does: the classifier with the larger AUC performs better.

 Since there are so many evaluation criteria, why use ROC and AUC? Because the ROC curve has a very good feature: when the distribution of positive and negative samples in the test set changes, the ROC curve can remain unchanged. In actual data sets, there is often a class imbalance phenomenon, that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may also change over time.

6. AUC

AUC is a model evaluation metric that applies only to binary classification. There are many other metrics for binary classifiers, such as logloss, accuracy, and precision. If you follow data mining competitions such as Kaggle, you will find that AUC and logloss are basically the most common evaluation metrics. Why are they used more often than accuracy? Because many models output class probabilities; to compute accuracy, those probabilities must be converted into class labels by manually choosing a threshold: a sample whose predicted probability exceeds the threshold goes into one class, otherwise into the other. That threshold therefore strongly affects the computed accuracy. Using AUC or logloss avoids converting predicted probabilities into classes.

AUC is an acronym for Area Under Curve: the area under the ROC curve, a single-number performance measure of how good a learner is. By definition, AUC can be obtained by summing the areas of the pieces under the ROC curve. Assuming the ROC curve is formed by connecting the points (x1,y1),...,(xm,ym) in order, the AUC can be estimated as:

                                              AUC=\frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_{i})(y_{i}+y_{i+1})

The AUC value is the area covered by the ROC curve . Obviously, the larger the AUC, the better the classification effect of the classifier.

  • AUC = 1: a perfect classifier.
  • 0.5 < AUC < 1: better than random guessing; the model has predictive value.
  • AUC = 0.5: the same as random guessing (for example, flipping a coin); no predictive value.
  • AUC < 0.5: worse than random guessing, although simply inverting its predictions would make it better than random.

The physical meaning of AUC is the probability that a positive sample receives a higher prediction score than a negative sample, so AUC reflects the classifier's ability to rank samples. It is also worth noting that AUC is insensitive to class imbalance, which is one reason it is usually used to evaluate classifiers on imbalanced data.

How to calculate AUC?

  • Method 1: AUC is the area under the ROC curve, so we can compute that area directly as the sum of small trapezoids under the curve. The accuracy of this calculation depends on the granularity of the thresholds.
  • Method 2: Following the physical meaning of AUC, compute the probability that a positive sample's prediction is greater than a negative sample's prediction. Form all n1*n0 pairs (n1 positive samples, n0 negative samples), compare the two predictions within each pair, count a pair as correct when the positive sample scores higher, and divide the number of correct pairs by the total number of pairs to get the AUC. The time complexity is O(n1*n0).
  • Method 3: First sort all samples by score and assign ranks: the sample with the highest score gets rank n (n = n0 + n1, where n0 is the number of negative samples and n1 the number of positive samples), the next gets n-1, and so on. For the positive sample with the largest rank, rank_max, there are n1-1 other positive samples with lower scores, so (rank_max - 1) - (n1 - 1) negative samples have lower scores; for the next positive sample it is (rank_second - 1) - (n1 - 2), and so on. Summing these and dividing by n0*n1 gives the probability that a positive sample outranks a negative one:

                                                AUC=\frac{\sum rank(score) - \frac{n_{1}*(n_{1}+1)}{2}}{n_{0}*n_{1}}

where:

n0, n1: the numbers of negative and positive samples, respectively.

rank(score): the rank of a positive sample when all scores are sorted in ascending order (the smallest score gets rank 1). The numerator sums the ranks of all positive samples.

This article focuses on the third method, and it is the recommended one when you need to compute AUC yourself. For a worked example of the calculation, see the link below; a small implementation sketch follows it.

Reference link: AUC calculation method
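Below is a minimal sketch of the rank-based formula (Method 3); the function name auc_by_rank is illustrative, and ties between scores are not given the usual averaged-rank treatment here.

import numpy as np

def auc_by_rank(y_true, y_score):
    # AUC = (sum of positive-sample ranks - n1*(n1+1)/2) / (n0*n1),
    # with scores ranked ascending starting from rank 1
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(y_score)                    # indices of scores, ascending
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)  # rank 1 = smallest score
    n1 = int(np.sum(y_true == 1))                  # number of positive samples
    n0 = len(y_true) - n1                          # number of negative samples
    rank_sum = ranks[y_true == 1].sum()
    return (rank_sum - n1 * (n1 + 1) / 2) / (n0 * n1)

print(auc_by_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75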

Why can both ROC and AUC be applied to unbalanced classification problems?

The ROC curve depends only on its horizontal coordinate (FPR) and its vertical coordinate (TPR). TPR is the probability of a correct prediction among positive samples only, and FPR is the probability of an incorrect prediction among negative samples only, so neither depends on the ratio of positive to negative samples. The ROC curve is therefore unaffected by the actual class ratio and can be used for both balanced and imbalanced problems. Since AUC is geometrically the area under the ROC curve, it is likewise unaffected by the class ratio.

7. Regression metrics

7.1 Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is also called L1 norm loss.

                                             MAE(y,\hat{y})=\frac{1}{m}\sum_{i=1}^{m}\left | y_{i}-\hat{y}_{i} \right |

Although MAE measures the quality of a regression model well, the absolute value makes the function non-smooth and non-differentiable at some points. Replacing the absolute value with the squared residual gives the mean squared error.

7.2 Mean Squared Error (MSE)

Mean Squared Error (MSE) is also called L2 norm loss.

                                             MSE(y,\hat{y})=\frac{1}{m}\sum_{i=1}^{m}\left ( y_{i}-\hat{y}_{i} \right )^{2}
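Both regression metrics are one-liners in NumPy; a minimal sketch with made-up values (mae and mse are illustrative names):

import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error (L1)
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    # Mean Squared Error (L2)
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y_true, y_pred = [3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]
print(mae(y_true, y_pred))  # 0.5
print(mse(y_true, y_pred))  # 0.375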

8. Object detection metrics

8.1 IoU (Intersection over Union)

IoU stands for Intersection over Union: the ratio of the area of intersection to the area of union of the predicted box and the ground-truth box. IoU is a simple metric that can be used to evaluate any model whose output is a bounding box.

Calculation formula:

                                             IoU=\frac{\left | A\cap B \right |}{\left | A\cup B \right |}

where A is the predicted box and B is the ground-truth box.
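A minimal sketch of the formula for axis-aligned boxes given as (x0, y0, x1, y1); unlike the NMS code in 8.2 below, this version uses continuous coordinates without the +1 pixel offset, and the function name iou is illustrative.

def iou(box_a, box_b):
    # boxes are (x0, y0, x1, y1) with x1 > x0 and y1 > y0
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)          # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143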

 

8.2 NMS (Non-Maximum Suppression)

Non-Maximum Suppression (NMS) is a classic post-processing step in detection algorithms and is crucial to final detection performance. A raw detector usually predicts a large number of boxes, many of them false, overlapping, or inaccurate, and filtering them takes a lot of computation. Handled poorly, this can greatly reduce the algorithm's performance, so an effective procedure for removing redundant boxes and keeping the best predictions is especially important.

The essence of the NMS algorithm is a search for local maxima. In object detection, NMS takes the predicted detection boxes together with their confidence scores and uses an overlap threshold to delete boxes that overlap a higher-scoring box too much.

NMS is generally applied to remove redundant boxes after model prediction, typically with an IoU threshold such as nms_threshold = 0.5. The procedure is as follows:

  1. Among the boxes of the current class, select the one with the highest score, record it as box_best, and keep it.
  2. Compute the IoU between box_best and each of the remaining boxes.
  3. If the IoU > 0.5, discard that box (the two boxes likely cover the same object, so keep the one with the higher score).
  4. Among the remaining boxes, again find the one with the highest score, and repeat.

The following is an NMS implementation in Python (NumPy):

import numpy as np

def nms(bbox_list, thresh):
    '''
    Non-maximum suppression.
    :param bbox_list: array of boxes, one per row: x0, y0, x1, y1, conf
    :param thresh: IoU threshold above which a lower-scoring box is discarded
    :return: indices of the boxes to keep
    '''
    x0 = bbox_list[:, 0]
    y0 = bbox_list[:, 1]
    x1 = bbox_list[:, 2]
    y1 = bbox_list[:, 3]
    conf = bbox_list[:, 4]

    areas = (x1 - x0 + 1) * (y1 - y0 + 1)      # box areas (pixel convention: +1)
    order = conf.argsort()[::-1]               # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]                           # highest-scoring remaining box
        keep.append(i)

        # intersection of box i with every other remaining box
        xx1 = np.maximum(x0[i], x0[order[1:]])
        yy1 = np.maximum(y0[i], y0[order[1:]])
        xx2 = np.minimum(x1[i], x1[order[1:]])
        yy2 = np.minimum(y1[i], y1[order[1:]])

        w = np.maximum(xx2 - xx1 + 1, 0)
        h = np.maximum(yy2 - yy1 + 1, 0)

        inter = w * h
        over = inter / (areas[i] + areas[order[1:]] - inter)   # IoU with box i
        inds = np.where(over <= thresh)[0]     # keep only boxes that do not overlap box i too much
        order = order[inds + 1]                # +1 because over[] starts at order[1]

    return keep
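A quick usage sketch with made-up boxes in the format [x0, y0, x1, y1, conf]; the second box heavily overlaps the first, so only the higher-scoring one of that pair should survive.

import numpy as np

boxes = np.array([
    [100, 100, 210, 210, 0.72],
    [105, 105, 215, 215, 0.80],   # near-duplicate of the first box, higher score
    [150, 150, 270, 290, 0.60],
    [400, 400, 480, 480, 0.90],   # far from the others
])
print(nms(boxes, 0.5))  # indices of the kept boxes; the 0.72 box is suppressed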

8.3 AP and mAP

AP (Average Precision) is computed from precision and recall, which were introduced above, so they are not repeated here. AP is the average of the precision values obtained at different recall levels, i.e. the area under the PR curve.

This leads to another concept, mAP (mean Average Precision): compute the AP for each category separately, then take the average over all categories.

Therefore, AP is for a single category, and mAP is for all categories.

In object detection, the line between AP and mAP is often blurred in practice. When a detection box is counted as correct if its IoU with a ground-truth box exceeds 0.5, the resulting metric is usually written AP0.5, and in the end both AP and mAP express the accuracy of multi-class detection, with the emphasis on the quality of the detection boxes.
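The exact AP definition varies by benchmark. As one common example (not necessarily the convention used here), below is a sketch of VOC-2007-style 11-point interpolated AP, assuming recall and precision arrays already computed by sweeping the detection score threshold; COCO-style evaluation integrates the full PR curve instead.

import numpy as np

def ap_11point(recall, precision):
    # average, over recall levels 0.0, 0.1, ..., 1.0, of the maximum
    # precision achieved at recall >= that level
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# mAP is then just the mean of the per-class AP values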



Origin blog.csdn.net/wxplol/article/details/93471838