Commonly Used Machine Learning Evaluation Metrics, for Whenever You Need Them

Guide

In machine learning, the early stages involve data collection and data cleaning, the middle stages involve feature analysis and feature selection, and in the later stages the processed dataset is split into a training set, a validation set, and a test set. Finally, models are trained and tuned on these splits, and the best-performing model is selected. So how do we evaluate the performance of our model? That is where the commonly used machine learning evaluation metrics come in. Let's get started~~
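As a side note (not in the original post), here is a minimal sketch of the split described above using scikit-learn; the toy dataset and the roughly 60/20/20 proportions are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary-classification data standing in for the cleaned, feature-selected dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Carve off 20% as the test set, then split the remainder into train / validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200
```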

Machine Learning Evaluation Metrics

Commonly used metrics for evaluating model performance in machine learning include accuracy, precision, recall, the PR curve, the F1 score, the ROC curve, AUC, and the confusion matrix. We first explain them using binary classification as an example; they can then be extended to multi-class classification.
In binary classification, we call the two classes of samples positive and negative. After the model is trained, we let it make predictions on the test set and evaluate the results. A few concepts first:

  • True Positive (TP): a positive sample predicted as positive by the model;
  • False Positive (FP): a negative sample predicted as positive by the model;
  • True Negative (TN): a negative sample predicted as negative by the model;
  • False Negative (FN): a positive sample predicted as negative by the model.

The metrics mentioned above are all computed from these four counts; each is introduced separately below (a small counting sketch follows).
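Here is a minimal sketch (not from the original post) of computing the four counts directly with NumPy; `y_true` and `y_pred` are hypothetical example arrays:

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # positives predicted as positive
FP = np.sum((y_pred == 1) & (y_true == 0))  # negatives predicted as positive
TN = np.sum((y_pred == 0) & (y_true == 0))  # negatives predicted as negative
FN = np.sum((y_pred == 0) & (y_true == 1))  # positives predicted as negative

print(TP, FP, TN, FN)  # 3 1 3 1
```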

  1. Accuracy
    Accuracy is the most basic evaluation metric for classification problems. It reflects the proportion of correctly predicted samples among all samples and is defined as:
    $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
    From this definition we can see the catch: if the class distribution in the dataset is unbalanced, say 99 out of 100 samples are positive and only one is negative, the model can easily overfit to the positive class and learn nothing about the negative class. If it simply predicts every sample as positive, its accuracy is still 99%, yet we all know such a model is meaningless (it has no ability to judge negative samples). Therefore, looking at accuracy alone is one-sided and not objective; it needs to be judged together with other metrics.
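A minimal sketch of the imbalanced scenario above (the numbers are assumed for illustration); accuracy can be computed with scikit-learn's `accuracy_score` or directly from the formula:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 99 positive samples and 1 negative sample; the model blindly predicts "positive".
y_true = np.array([1] * 99 + [0])
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))          # 0.99
print((y_true == y_pred).sum() / len(y_true))  # same value, straight from the formula
```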

  2. Precision

Precision reflects the probability that a sample predicted as positive is actually positive, that is, among all samples predicted as positive, how many are truly positive. It is defined as:
$$Precision = \frac{TP}{TP + FP}$$
Accuracy reflects the model's predictive ability on the overall data (both positive and negative samples), while precision reflects how accurate the model is when it predicts a sample as positive.
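A minimal sketch (not from the original post) using hypothetical label arrays:

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))

print(TP / (TP + FP))                   # 0.75, straight from the formula
print(precision_score(y_true, y_pred))  # same result via scikit-learn
```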

  3. Recall

The precision metric reflects how accurately the model predicts positive samples. Recall, in turn, measures the model's ability to find the positive samples: the proportion of all truly positive samples that the model predicts as positive. It is defined as:
$$Recall = \frac{TP}{TP + FN}$$
Recall and precision are metrics that trade off against each other. For example, in a typical cat-and-dog classification task, if we want to raise the model's recall for dogs, some cats may also be judged as dogs, and precision drops. In practical engineering we often have to trade these two metrics off and find a balance point so that the model's behavior fits the specific business scenario.
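A minimal sketch of the recall formula (the arrays are assumed examples, not from the original post):

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FN = np.sum((y_pred == 0) & (y_true == 1))

print(TP / (TP + FN))                # 0.5, straight from the formula
print(recall_score(y_true, y_pred))  # same result via scikit-learn
```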

  • PR curve (Precision-Recall Curve)
    The PR curve (Precision-Recall Curve) describes how precision and recall change together. It is drawn as follows: sort the test samples by the model's prediction score (usually a real value or probability) so that the samples most likely to be positive come first and the least likely come last; then, in that order, predict the samples as positive one by one, computing the current precision (P) and recall (R) each time, as shown in the figure below.
    [Figure: PR curves of several models]
    How do we compare models using the PR curve? If the PR curve of model B is completely enclosed by the PR curve of another model A (that is, A's precision and recall are both greater than B's), then A performs better than B. If the curves of A and B cross, the one with the larger area under the curve performs better. In general, however, that area is hard to estimate, so the Break-Even Point (BEP) is used instead: the value at which P = R. The higher the break-even point, the better the performance, as shown in the figure.
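A minimal sketch of drawing a PR curve from predicted probabilities with scikit-learn; the toy dataset and logistic-regression model are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy data and model standing in for a real task.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# precision_recall_curve sweeps the decision threshold over the sorted scores for us.
precision, recall, thresholds = precision_recall_curve(y_test, scores)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.show()
```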
  • F1 score (F1-Score)
    As mentioned above, precision and recall sometimes move in opposite directions: a higher precision often means a lower recall. In some scenarios both must be taken into account, and the most common way to do so is the F-Measure, also known as the F-Score. The F-Measure is the weighted harmonic mean of P and R, defined as:
    $$\frac{1}{F_\beta} = \frac{1}{1 + \beta^{2}} \left(\frac{1}{P} + \frac{\beta^{2}}{R}\right) \qquad (1)$$
    $$F_\beta = \frac{(1 + \beta^{2}) \cdot P \cdot R}{\beta^{2} \cdot P + R} \qquad (2)$$
    When β = 1, this is the familiar F1 score:
    $$F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (3)$$
    From the formulas above we can see that the F1 score is the harmonic mean of precision and recall; the higher the F1 score, the better the model's performance.
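A minimal sketch computing F1 and a general F_β with scikit-learn (the label arrays are assumed examples):

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

P, R = 2 / 3, 1 / 2  # precision and recall for these arrays
print(2 * P * R / (P + R))                  # F1 from formula (3)
print(f1_score(y_true, y_pred))             # same value via scikit-learn
print(fbeta_score(y_true, y_pred, beta=2))  # beta = 2 weights recall more heavily
```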
  • ROC curve (ROC Curve)
    We talked about accuracy, precision, and recall above. They complement each other and evaluate the model from multiple angles, and you might think that is enough. It is not, because when the proportion of positive and negative samples in the test set changes, these metrics change with it, so their robustness is limited. To avoid the impact of shifts in the test-set distribution on the evaluation, we introduce the ROC curve and AUC.
    Like the PR curve above, the ROC curve is an evaluation tool that does not depend on a single threshold. For a model that outputs probabilities, comparing models only with accuracy, precision, and recall requires fixing a threshold; under different thresholds each model's metric values differ, so it is hard to obtain a high-confidence comparison. Before introducing the ROC curve, we need a few more concepts:
  • Sensitivity, also known as the True Positive Rate (TPR)
    $$TPR = \frac{\text{number of positive samples predicted correctly}}{\text{total number of positive samples}} = \frac{TP}{TP + FN}$$
  • Specificity, also known as the True Negative Rate (TNR)
    $$TNR = \frac{\text{number of negative samples predicted correctly}}{\text{total number of negative samples}} = \frac{TN}{TN + FP}$$
  • False Positive Rate (FPR)
    $$FPR = \frac{\text{number of negative samples predicted incorrectly}}{\text{total number of negative samples}} = \frac{FP}{TN + FP}$$
  • False Negative Rate (FNR)
    $$FNR = \frac{\text{number of positive samples predicted incorrectly}}{\text{total number of positive samples}} = \frac{FN}{TP + FN}$$

From these definitions we can see that sensitivity is the recall of the positive class and specificity is the recall of the negative class, while the false negative rate and the false positive rate equal 1 - TPR and 1 - TNR respectively. All four of these quantities are computed within a single class (positive or negative), so they are not sensitive to whether the overall sample is balanced. Take an unbalanced example: suppose 90% of the samples are positive and 10% are negative. Using accuracy for evaluation here is not scientific, but TPR and TNR still work, because TPR only looks at how many of the 90% positive samples are predicted correctly and has nothing to do with the 10% negative samples; likewise, FPR only looks at how many of the 10% negative samples are predicted wrongly and has nothing to do with the 90% positive samples. This avoids the problem of sample imbalance.
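A minimal sketch of the four rates under the 90/10 imbalance described above (the blind-positive predictions are assumed for illustration):

```python
import numpy as np

# 90% positive, 10% negative; a model that blindly predicts "positive".
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones(100, dtype=int)

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
TN = np.sum((y_pred == 0) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

TPR = TP / (TP + FN)  # 1.0 -> looks perfect on the positive class
TNR = TN / (TN + FP)  # 0.0 -> exposes that the negative class is never recognised
FPR = FP / (TN + FP)  # 1.0 == 1 - TNR
FNR = FN / (TP + FN)  # 0.0 == 1 - TPR
print(TPR, TNR, FPR, FNR)
```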
As shown in the figure, the two axes of the ROC curve are TPR and FPR. Like the PR curve, the ROC curve is drawn from the TPR and FPR values obtained at different thresholds. The difference is that the PR curve's axes, precision and recall, shift when the class proportions of the test set change, whereas TPR and FPR are each computed within a single class, so evaluating with the ROC curve does not hinge on picking one threshold and the curve is not affected by class imbalance: no matter how the distribution and proportion of positive and negative samples change, the ROC curve stays essentially the same. This can be verified experimentally.
[Figure: ROC curve, with TPR on the vertical axis and FPR on the horizontal axis]

  • AUC (Area Under Curve)
    AUC (Area Under Curve) is the area under the ROC curve, that is, the area enclosed by the ROC curve and the coordinate axes, with TPR on the vertical axis and FPR on the horizontal axis. Since TPR is the fraction of positive samples predicted correctly, we want it as high as possible, and since FPR is the fraction of negative samples predicted wrongly, we want it as low as possible. It follows that the steeper the ROC curve (rising toward the top-left corner), the better the model performs and the larger the AUC.
    If the model is perfect, its AUC = 1, meaning every positive example is ranked ahead of every negative example. If the model is a simple random-guessing binary classifier, its AUC = 0.5. If one model is better than another, the area under its curve is larger and so is its AUC.
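A minimal sketch of computing the ROC curve and AUC from predicted probabilities with scikit-learn; the toy dataset and logistic-regression model are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data and model standing in for a real task.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)  # TPR/FPR at every threshold
auc = roc_auc_score(y_test, scores)               # area under that curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")          # the AUC = 0.5 random-guess diagonal
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()
```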

  • Confusion Matrix

The confusion matrix intuitively reflects the model's classification results; as its name suggests, it shows where the model's predictions get confused. The entry in row i and column j of the matrix is the number of samples whose true label is class i but which were predicted as class j, so the diagonal holds the counts of correctly predicted samples. In deep-learning image classification tasks the confusion matrix is a fairly general evaluation tool that reflects the model's discriminative ability and learning effect for each individual class.
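A minimal sketch using scikit-learn's `confusion_matrix` on a hypothetical three-class example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical three-class labels: 0 = cat, 1 = dog, 2 = bird.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0, 2, 2, 1])

# Row i, column j counts samples of true class i predicted as class j;
# correct predictions lie on the diagonal.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1 0]
#  [1 2 0]
#  [0 1 2]]
```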

Summary

In this article we introduced the metrics commonly used for model evaluation in machine learning: accuracy (Accuracy), precision (Precision), recall (Recall), the PR curve, the ROC curve, AUC, and the confusion matrix. In practice, you can evaluate with whatever combination fits your own business scenario and task, together with the more mainstream metrics in your field. I hope this article helps you when you need it; writing it is also my own way of reviewing, and making progress together with everyone is a very happy thing. If you have any questions or comments, feel free to discuss in the comment section. If the article helped you, don't forget to leave a like before you go~ Friends working in machine learning can bookmark it; you will always find a use for it.

Origin blog.csdn.net/Just_do_myself/article/details/118631495