An in-depth guide to evaluating machine learning models

After training a model, we need to evaluate its performance objectively so that we can make sound practical decisions. Model evaluation mainly covers prediction error, degree of fitting, and model stability. Some scenarios also impose requirements on prediction speed (throughput), computing resource consumption, interpretability, and so on, which are not covered here.

1. Assessing prediction error

Prediction error is usually the focus of model evaluation. A model should not only fit the training data well during learning, but also predict well on new data (generalization ability). We therefore typically evaluate a model's generalization performance through its metric scores on a test set.

Prediction error is often measured with the loss function itself, such as the mean squared loss for regression. However, the loss function has limitations as an evaluation metric and is not very intuitive (for classification tasks, for example, f1-score is commonly used because it directly shows how well each class is identified). Below, we walk through the error metrics commonly used for regression and classification tasks.

1.1 Error Evaluation Indicators for Regression Tasks

Evaluating the error of a regression model is relatively straightforward: take the differences between the true values and the predicted values, make them positive (by squaring them or taking absolute values), and average them. The common variants are as follows:

  • Mean Squared Error (MSE): the average of the squared differences between the actual and predicted values, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

  • Root mean square error (RMSE)

The root mean squared error (RMSE) is the square root of the MSE:

$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

  • Mean Absolute Error (MAE)

The mean absolute error (MAE) is the average of the absolute errors between the predicted values and the true values:

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$

Since MAE uses absolute values (which are not differentiable at zero), it is rarely used as a training loss function, but it is perfectly usable for evaluating the final model.

  • Root mean square logarithmic error (RMSLE)

$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2}$

Comparison of the above indicators:

① Sensitivity to outliers: MAE reflects the actual prediction error, while MSE and RMSE square the errors, which amplifies the influence of samples with larger errors (they are more sensitive to outliers). If there are a few outliers with very large deviations, even a small number of them will make these two metrics look very poor. To reduce the influence of outliers, RMSLE can be used: it focuses on the relative (proportional) prediction error, so even when outliers are present their influence is dampened.

② Units: Unlike MSE, whose unit is the square of the target's unit, RMSE (square, then square root) and MAE stay in the original unit of the target, which is more intuitive. Although RMSE and MAE share the same unit, RMSE is always at least as large as MAE: because RMSE sums the squared errors before taking the square root, it also widens the gap between the errors.

③ Unit differences across tasks: In practice, RMSE and MAE have a drawback: their scale depends on the task. For example, a stock-price model may have an error of 10 yuan while a house-price model has an error of 10,000 yuan; across such different tasks we cannot tell which model is better. The R^2 score introduced next further normalizes these errors and gives a unified evaluation standard.
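As a quick illustration of these definitions, here is a minimal sketch (toy data; assumes numpy and scikit-learn) that computes MSE, RMSE, MAE and RMSLE:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                        # same unit as the target
mae = mean_absolute_error(y_true, y_pred)
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))    # requires non-negative values

print(mse, rmse, mae, rmsle)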

  • R^2 score

The R^2 score is often used to evaluate how well a (linear) regression fits, and is defined as follows:

$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$
The R^2 score can be read as 1 minus the ratio between the mean squared error of our model and the mean squared error obtained when the mean of the actual values is used as the prediction (a baseline model).
This scales the R^2 score into the range [0, 1] for any model that is no worse than the baseline. A value of 0 means our model adds nothing: it performs just like the baseline guess. A value of 1 means the model fits best, i.e. it makes no error at all.

As an aside, when the model is a linear regression and its R^2 value is 0, this also indirectly indicates that there is no linear relationship between the features and the label.
This is also the principle behind the commonly used collinearity index VIF: take each feature in turn as the label, fit a linear model on the remaining features, obtain its R^2, and compute VIF = 1 / (1 - R^2). A VIF of 1 means there is no collinearity at all between that feature and the others (collinearity hurts the stability and interpretability of linear models; VIF < 10 is often used as the threshold in practice). A minimal sketch of this calculation follows below.
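A minimal sketch of the VIF calculation described above (assumptions: the features sit in a pandas DataFrame `X`, and the helper name `vif_table` is purely illustrative):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Regress each feature on the others; VIF = 1 / (1 - R^2)."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        # .score() returns the R^2 of the fitted linear model
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name='VIF')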

1.2 Error Evaluation Indicators for Classification Models

The classification error of a classification model can be evaluated with a loss function such as cross entropy. For classification, cross entropy is more suitable than MSE: roughly speaking, MSE weights the gap between the predicted and true probabilities of every class equally, while cross entropy only cares about the predicted probability of the correct class:

$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\log(p_{ic})$, where $y_{ic}$ is 1 when sample $i$ belongs to class $c$ (and 0 otherwise) and $p_{ic}$ is the predicted probability of that class.
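For instance, a minimal sketch (toy binary labels and probabilities, assuming scikit-learn) of the cross-entropy loss via `log_loss`:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.8]    # predicted probability of the positive class

print(log_loss(y_true, y_prob))  # penalizes low probability on the true class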

However, evaluating classification with a loss function is not very intuitive. Therefore f1-score, precision, and recall are commonly used for classification tasks, since they directly show how well each class is identified.

  • precision, recall, f1-score, accuracy

(Confusion matrix: rows are the actual classes and columns the predicted classes; the cells are the true positives TP, false positives FP, false negatives FN, and true negatives TN.)

Accuracy: the proportion of correct predictions (TP + TN) out of all samples (TP + FP + TN + FN);

Precision (P): the proportion of samples predicted as Positive by the classifier that are actually positive, i.e. TP / (TP + FP);

Recall (R): the proportion of actual positive samples that the classifier predicts as Positive, i.e. TP / (TP + FN).

F1-score is the harmonic mean of precision P and recall R:

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$

Summary of the above indicators:

① Covering every class: Accuracy is the most direct description of classification error, but when the positive and negative classes are unbalanced it has little reference value. For example, in a fraud-detection scenario with 950 normal-user samples (negatives) and 50 abnormal users (positives), a model that predicts every sample as a normal user still reaches an accuracy of 95%, yet its classification effect is actually very poor. Accuracy cannot express the misclassification of the minority class, so F1-score, which combines precision and recall, is more commonly used.

② Balancing precision and recall: precision and recall are often a pair of conflicting metrics, and sometimes we must choose between "more accurate" and "more complete" according to the business bias (in fraud detection, for instance, we usually lean towards identifying positive cases "more completely" even at the cost of more false alarms: "better to wrongly flag a hundred than to let one slip through"). The trade-off between the two can then be made with the precision-recall curve (PR curve), which plots precision against recall at different classification thresholds.
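A minimal sketch of this threshold trade-off (toy scores, assuming scikit-learn): `precision_recall_curve` returns the precision/recall pair at every candidate threshold, which is exactly what the PR curve plots:

from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have one more entry than thresholds, hence the padding
for p, r, t in zip(precision, recall, list(thresholds) + [None]):
    print(f"threshold={t}  precision={p:.2f}  recall={r:.2f}")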

  • kappa value

    Kappa is a measure of agreement (for classification, "agreement" means whether the model's predictions agree with the actual classes). It is also computed from the confusion matrix, and it is a metric that **penalizes prediction bias**: by its formula, the more unbalanced the confusion matrix (i.e., the larger the gap between the prediction accuracies of different classes), the lower the kappa value.

$\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed overall accuracy and $p_e$ is the accuracy expected by chance (computed from the marginal totals of the confusion matrix).

The formula can be read as the improvement of the total accuracy over the random (chance) accuracy, divided by the improvement a perfect model would achieve over that same random accuracy.

Kappa ranges from -1 to 1 and is usually greater than 0. It is commonly split into five bands of agreement: 0.0~0.20 slight, 0.21~0.40 fair, 0.41~0.60 moderate, 0.61~0.80 substantial, and 0.81~1.00 almost perfect.
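A minimal sketch of the kappa formula above (toy labels; `cohen_kappa_score` from scikit-learn is used only as a cross-check):

import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
p_o = np.trace(cm) / n                                  # observed accuracy
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # chance accuracy from the marginals
kappa = (p_o - p_e) / (1 - p_e)

print(kappa, cohen_kappa_score(y_true, y_pred))         # the two values should match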

  • ROC curve, AUC

    The ROC curve (Receiver Operating Characteristic curve) is in effect the combined result of many confusion matrices. If instead of fixing one threshold we sort the model's predicted scores from high to low and use each probability value in turn as a dynamic threshold, we obtain a series of confusion matrices.

For each confusion matrix we calculate two indicators: TPR (true positive rate), TPR = TP / (TP + FN), which is the recall, and FPR (false positive rate), FPR = FP / (FP + TN), the proportion of actual negative samples that are predicted positive. Plotting FPR on the x-axis against TPR on the y-axis gives the ROC curve. The area under the ROC curve, the AUC (Area Under Curve), gives an intuitive measure of classifier quality: it is usually between 0.5 and 1, and the larger the better.
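A minimal sketch (toy scores, assuming scikit-learn): `roc_curve` sweeps every predicted score as a threshold and returns the resulting (FPR, TPR) pairs, and `auc` integrates the area under them:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # area under the ROC curve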

Analysis and summary of AUC indicators:

  • Because the ROC curve sweeps a "dynamic threshold", AUC does not depend on any particular classification threshold; it evaluates the classification effect free of the limitation of a fixed threshold.

  • The ROC curve is drawn from the (FPR, TPR) pairs at different thresholds. A larger area (AUC) means a larger TPR at a smaller FPR, and a smaller FPR means a larger 1 - FPR = TN / (TN + FP) = TNR. AUC is therefore a joint consideration of TPR (also called recall or sensitivity) and TNR (also called specificity).

  • From the confusion matrix we can see that the TNR (i.e., 1 - FPR) and TPR used by AUC do not depend on the actual ratio of positive to negative samples; each only measures how completely its own actual class is recognized (unlike precision, which mixes the two actual classes). In short, AUC is insensitive to class imbalance: even if the positive-to-negative ratio changes greatly, the area under the ROC curve changes little.

  • AUC is the area under the ROC curve. Its value has a concrete meaning: given one randomly chosen positive sample and one randomly chosen negative sample, it is the probability that the positive sample receives the higher predicted score. In other words, AUC is a "ranking" measure of discrimination ability (it is enough that positive samples tend to score above negative ones); it is insensitive to the actual probability values and ignores how well the model is calibrated. For a good model, however, we also want the predicted probabilities of positive and negative samples to be clearly separated. If a model scores all negatives 0.49 and all positives 0.51, its AUC is 1, yet the scores are so close that any small disturbance will flip predictions. We would prefer the gap to be as large as possible, e.g. negatives below 0.1 and positives above 0.8: the AUC is the same, but such a model fits better and is more robust (see the sketch after this list).

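A minimal sketch of this ranking interpretation (toy data; `roc_auc_score` from scikit-learn only as a cross-check): the pairwise comparison of positive and negative scores reproduces the AUC, and squeezing all scores towards 0.5 leaves it unchanged:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55])

pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs), roc_auc_score(y_true, y_score))        # identical values

# Squeezing all scores towards 0.5 changes the fit quality, but not the AUC.
print(roc_auc_score(y_true, 0.5 + 0.01 * (y_score - 0.5)))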

Differences between AUC and F1-score:

  • AUC does not depend on the classification threshold, while F1-score requires a threshold to be specified, and different thresholds give different results;

  • When the ratio of positive to negative samples changes, AUC is barely affected, while F1-score is affected considerably (because precision mixes the two actual classes);

  • Both incorporate recall (how completely positive samples are recognized) and account for FP (negative samples misidentified as positive), and both require a balance between being "complete" and being "accurate";

  • F1-score can flexibly trade off recall against precision by adjusting the threshold, whereas AUC only gives an overall picture.

# The metrics above can be computed directly with sklearn.metrics
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_curve, auc, cohen_kappa_score,mean_squared_error
...
yhat = model.predict(x)

f1_score(y, yhat)

2. Model Fitting Degree

The degree of fit of a model is usually described as underfitting, good fitting, or overfitting. Generally, a well-fitted model generalizes better and performs better on unseen data (the test set).

We can assess the degree of fit from the training and validation errors (e.g. the loss). Viewed over the whole training process, both the training error and the validation error are high while the model underfits, and both decrease as training time and model complexity increase. Past the point of best fit, the training error keeps decreasing while the validation error starts to rise: the model has entered the overfitting region. Underfitting is usually not the hard problem in practice; it can be addressed with stronger features and more complex models. Overfitting, i.e. how to reduce the generalization error and improve generalization ability, is usually the focus when optimizing a model. Common remedies are to improve the quality and quantity of the data and to adopt an appropriate regularization strategy.
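A minimal sketch of this diagnosis (synthetic data, assuming scikit-learn): track training vs. validation accuracy as model complexity, here the depth of a decision tree, increases; a widening gap signals overfitting:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
depths = range(1, 16)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Mean score per depth across the cross-validation folds
for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train_acc={tr:.3f}  valid_acc={va:.3f}")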

3. Model stability

If a model in production is unstable, it is effectively uncontrollable and undermines the soundness of the decisions built on it. For the business this is an uncertainty risk, which is unacceptable (especially in the risk-averse field of risk control).

We usually use the Population Stability Index (PSI) to evaluate model stability by measuring whether the score distribution of future (test) samples is consistent with that of the training samples. In the same way, PSI can be applied to feature values to measure distribution shift and assess stability at the feature level.

The PSI calculation takes the model scores on the training sample as the stability reference (the expected distribution, E) and measures how far the actual distribution of future predicted scores (A) deviates from it. Summing over the score bins,

$\mathrm{PSI} = \sum_{i}(A_i - E_i)\ln\frac{A_i}{E_i}$

where $A_i$ and $E_i$ are the actual and expected proportions of samples in bin $i$. The specific calculation steps and sample code are as follows:

Step 1: Discretize the expected distribution (the development/training data set) into bins and count the proportion of samples falling into each bin.

Step 2: Using the same bin boundaries, count the proportion of samples in each bin for the actual distribution (the test set).

Step 3: For each bin, compute A - E and ln(A / E), then index = (actual proportion - expected proportion) * ln(actual proportion / expected proportion).

Step 4: Sum the per-bin indexes to obtain the final PSI.


import math
import numpy as np
import pandas as pd

def calculate_psi(base_list, test_list, bins=20, min_sample=10):
    try:
        base_df = pd.DataFrame(base_list, columns=['score'])
        test_df = pd.DataFrame(test_list, columns=['score']) 
        
        # 1. After dropping missing values, count the sample sizes of the two distributions
        base_notnull_cnt = len(list(base_df['score'].dropna()))
        test_notnull_cnt = len(list(test_df['score'].dropna()))

        # samples falling into the missing-value bin
        base_null_cnt = len(base_df) - base_notnull_cnt
        test_null_cnt = len(test_df) - test_notnull_cnt
        
        # 2. Number of bins, capped so that each bin has at least min_sample base samples
        q_list = []
        if type(bins) == int:
            bin_num = min(bins, int(base_notnull_cnt / min_sample))
            q_list = [x / bin_num for x in range(1, bin_num)]
            break_list = []
            for q in q_list:
                bk = base_df['score'].quantile(q)
                break_list.append(bk)
            break_list = sorted(list(set(break_list))) # deduplicate and sort
            score_bin_list = [-np.inf] + break_list + [np.inf]
        else:
            score_bin_list = bins
        
        # 3. Count the samples in each bin
        base_cnt_list = [base_null_cnt]
        test_cnt_list = [test_null_cnt]
        bucket_list = ["MISSING"]
        for i in range(len(score_bin_list)-1):
            left  = round(score_bin_list[i+0], 4)
            right = round(score_bin_list[i+1], 4)
            bucket_list.append("(" + str(left) + ',' + str(right) + ']')
            
            base_cnt = base_df[(base_df.score > left) & (base_df.score <= right)].shape[0]
            base_cnt_list.append(base_cnt)
            
            test_cnt = test_df[(test_df.score > left) & (test_df.score <= right)].shape[0]
            test_cnt_list.append(test_cnt)
        
        # 4. Summarize the per-bin counts and proportions
        stat_df = pd.DataFrame({
            "bucket": bucket_list, "base_cnt": base_cnt_list, "test_cnt": test_cnt_list})
        stat_df['base_dist'] = stat_df['base_cnt'] / len(base_df)
        stat_df['test_dist'] = stat_df['test_cnt'] / len(test_df)
        
        def sub_psi(row):
            # 5. PSI contribution of a single bin
            base_dist = row['base_dist']
            test_dist = row['test_dist']
            # handle bins where either distribution has zero samples
            if base_dist == 0 and test_dist == 0:
                return 0
            elif base_dist == 0 and test_dist > 0:
                base_dist = 1 / base_notnull_cnt
            elif base_dist > 0 and test_dist == 0:
                test_dist = 1 / test_notnull_cnt

            return (test_dist - base_dist) * np.log(test_dist / base_dist)
        
        stat_df['psi'] = stat_df.apply(lambda row: sub_psi(row), axis=1)
        stat_df = stat_df[['bucket', 'base_cnt', 'base_dist', 'test_cnt', 'test_dist', 'psi']]
        psi = stat_df['psi'].sum()
        
    except Exception as e:
        print('calculate_psi error:', e)
        psi = np.nan 
        stat_df = None
    return psi, stat_df

## Alternatively, compute PSI directly with the toad package
# prob_dev: model scores on the training sample; prob_test: model scores on the test sample
import toad
psi = toad.metrics.PSI(prob_dev, prob_test)
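A small usage sketch for the calculate_psi function above (simulated scores; the shifted test distribution should produce a noticeably positive PSI):

import numpy as np

np.random.seed(0)
prob_dev = np.random.beta(2, 5, size=10000)    # "expected" scores from the training sample
prob_test = np.random.beta(2, 4, size=10000)   # "actual" scores with a mild shift

psi_value, psi_detail = calculate_psi(prob_dev, prob_test, bins=10)
print(psi_value)
print(psi_detail)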

Looking into the principle of the PSI indicator, a little algebra shows that it equals the KL divergence from the actual distribution (A) to the expected distribution (E) plus the KL divergence from E to A. KL divergence measures the information-theoretic difference in one direction only (it is asymmetric), so the symmetric sum above describes the difference between the two distributions more comprehensively.

$\mathrm{PSI} = \sum_{i}(A_i - E_i)\ln\frac{A_i}{E_i} = \sum_{i}A_i\ln\frac{A_i}{E_i} + \sum_{i}E_i\ln\frac{E_i}{A_i} = \mathrm{KL}(A \parallel E) + \mathrm{KL}(E \parallel A)$

The smaller the PSI (a common rule of thumb is < 0.1), the smaller the difference between the two distributions and the more stable the model. The practical appeal of PSI is that it is easy to compute, but note that the result is affected by many factors: the number of bins and the binning method, the sample size per bin, and actual business policy. In particular, for small samples or periods of drastic business change, PSI often exceeds the usual rule-of-thumb levels, so it must be interpreted together with the actual business and data context.

Origin blog.csdn.net/qq_34160248/article/details/130757269