Machine Learning Evaluation Metrics

Regression algorithm metrics

(Figure: schematic diagram of linear regression with one variable and with two variables.)

How do we measure the quality of a regression model?

We might naturally think of using the mean of the residuals (the differences between the actual values \(y_i\) and the predicted values \(\hat y_i\)) as the measure, that is:
\[ \text{residual}(y,\hat y)= \frac{1}{m} \sum_{i=1}^m (y_i- \hat y_i) \]
Question ①: Is it reasonable to use the mean of residuals?

Mean Absolute Error

Using the mean of the residuals is unreasonable: positive and negative residuals easily cancel each other out.

The mean absolute error MAE (Mean Absolute Error) is also known as the \(l_1\)-norm loss:
\[ \text{MAE}(y,\hat y) = \frac{1}{m} \sum_{i=1}^m |y_i- \hat y_i| \]
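As a quick illustration of Question ① and of MAE, here is a minimal NumPy sketch (the arrays are made-up values used only for demonstration): the mean residual cancels positive and negative errors to zero, while MAE reports the average error magnitude.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.5, 3.0, 6.5])    # errors: +0.5, -0.5, -0.5, +0.5

mean_residual = np.mean(y_true - y_pred)   # 0.0 -> positive and negative errors cancel
mae = np.mean(np.abs(y_true - y_pred))     # 0.5 -> average size of the error

print(mean_residual, mae)
```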
Question ②: What are the shortcomings of MAE?

Mean Squared Error

The absolute value function used in MAE is not smooth (it is not differentiable at zero), which is inconvenient for optimization.

The mean squared error MSE (Mean Squared Error) is also known as the \(l_2\)-norm loss:
\[ \text{MSE}(y,\hat y) = \frac{1}{m} \sum_{i=1}^m (y_i-\hat y_i)^2 \]
Question ③: Is there a more reasonable indicator than MSE?

Root Mean Squared Error

Recalling the definitions of variance and standard deviation, we can bring the evaluation metric back to the same units as the target variable:

\[ \text{RMSE}(y,\hat y) = \sqrt{\frac{1}{m} \sum_{i=1}^m (y_i- \hat y_i)^2} \]
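A minimal NumPy sketch of MSE and RMSE using the same made-up data as above; RMSE is simply the square root of MSE, so it shares the units of \(y\).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.5, 3.0, 6.5])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)     # squared errors, in units of y^2
rmse = np.sqrt(mse)                       # back in the units of y

print(mse, rmse)   # 0.25, 0.5
```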

Question ④: Is there a normalized (dimensionless) metric?

Coefficient of Determination

For example, an RMSE of 1000 for Shanghai house prices may be acceptable, but the same error would be huge for house prices in fourth- or fifth-tier cities. We therefore want a metric bounded between 0 and 1 that can be compared across different scenarios and models.

The coefficient of determination, also known as the \(R^2\) score, reflects the proportion of the variation in the dependent variable that can be explained by the independent variables through the regression relationship. Variation and variance are closely related: the variance measures how dispersed the sample points are around the mean of the actual values \(\overline y\), while the variation of the actual values \(y\) is \(m\) times the variance, namely the following total sum of squares SST:
\[ \text{SST} = \sum_{i=1}^m (y_i-\overline y)^2 \quad \text{(SST: total sum of squares)} \]
The regression sum of squares SSR is the variation of the predicted values:
\[ \text{SSR} = \sum_{i=1}^m (\hat y_i-\overline y)^2 \quad \text{(SSR: sum of squares due to regression)} \]
The residual sum of squares SSE measures the difference between the actual and predicted values:
\[ \text{SSE} = \sum_{i=1}^m (\hat y_i-y_i)^2 \quad \text{(SSE: sum of squares due to error)} \]

You will find that \(\text{SSE}+\text{SSR}=\text{SST}\), which can be proved using the principle of least squares (writing the fitted line as \(\hat y = a + bx\)):
\[
y-\overline y=(y-\hat y)+(\hat y-\overline y) \\
\Rightarrow \sum (y-\overline y)^2 = \sum (y-\hat y)^2+\sum(\hat y-\overline y)^2+2\sum (y-\hat y)(\hat y-\overline y)
\]
\[
\begin{align*}
\because \sum (y-\hat y)(\hat y-\overline y) &= \sum (y-\hat y)(a+bx-\overline y) \\
&= \sum (y-\hat y)[(a-\overline y)+bx] \\
&= (a-\overline y)\sum (y-\hat y)+b\sum (y-\hat y)x \\
&= (a-\overline y)\sum (y-a-bx)+b\sum (y-a-bx)x
\end{align*}
\]
According to the principle of least squares, we have:
\[
\sum (y-a-bx) = 0, \qquad \sum (y-a-bx)\,x=0 \\
\therefore \sum (y-\hat y)(\hat y-\overline y) = 0 \\
\therefore \sum (y-\overline y)^2 = \sum (y-\hat y)^2+\sum(\hat y-\overline y)^2
\]
Having proved \(\text{SSE}+\text{SSR}=\text{SST}\), we define \(R^2(y,\hat y)=\frac{\text{SSR}}{\text{SST}}\), which reflects the predictive ability of the model. A value of 0 means the regression explains none of the variation of the dependent variable; a value of 1 means the relationship is an exact functional one; values between 0 and 1 indicate how good the model is, independent of the usage scenario.

Simplifying the above formula, the numerator becomes the mean square error MSE, and the denominator becomes the variance:
\[
\begin{align*}
R^2(y,\hat y) &= \frac{\text{SSR}}{\text{SST}} = 1- \frac{\text{SSE}}{\text{SST}} \\
&= 1 - \frac{\sum_{i=1}^m(\hat y_i-y_i)^2}{\sum_{i=1}^m(y_i-\overline y)^2} \\
&= 1 - \frac{\sum_{i=1}^m(\hat y_i-y_i)^2/m}{\sum_{i=1}^m(y_i-\overline y)^2/m} \\
&= 1 - \frac{\text{MSE}(\hat y,y)}{\text{Var}(y)}
\end{align*}
\]
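A small sanity-check sketch (made-up one-dimensional data, fitted by ordinary least squares via `np.polyfit`): for an OLS fit, \(R^2\) computed as \(\text{SSR}/\text{SST}\) coincides with \(1-\text{MSE}/\text{Var}(y)\), as the derivation above guarantees.

```python
import numpy as np

# Hypothetical one-dimensional data, fitted by ordinary least squares
# (degree-1 polyfit), so that SSE + SSR = SST holds as proved above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)           # slope b, intercept a
y_hat = a + b * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)       # residual sum of squares

r2_from_ssr = ssr / sst
r2_from_mse = 1 - np.mean((y - y_hat) ** 2) / np.var(y)

print(r2_from_ssr, r2_from_mse)      # the two values agree for an OLS fit
```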
In actual work a single metric is rarely used on its own; multiple metrics are weighed together.

Classification algorithm metrics

Accuracy

Accuracy is the proportion of correctly predicted samples among all samples. Its range is \([0,1]\); the larger the value, the better the model's predictive ability.
\[ \text{Acc}(y,\hat y) = \frac{1}{m} \sum_{i=1}^m \operatorname{sign}(\hat y_i,y_i) \]
where \(\operatorname{sign}(\hat y_i,y_i) = \begin{cases} 1 & \hat y_i=y_i \\ 0 & \hat y_i \neq y_i \end{cases}\)
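A minimal sketch of the accuracy computation (made-up label arrays; the mean of the equality mask plays the role of the indicator above).

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # hypothetical predicted labels

accuracy = np.mean(y_true == y_pred)    # fraction of correctly predicted samples
print(accuracy)                         # 5/6 ≈ 0.833
```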

Question ⑤: When will the accuracy indicator fail?

Confusion Matrix

The confusion matrix, known as a matching matrix in unsupervised learning, is so named because it lets us see at a glance whether the learner confuses the classes of the samples. Each column of the matrix represents the classifier's prediction, and each row represents the true class of the samples. As shown below:

Confusion matrix Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix

Three things are involved here: the actual value, the predicted value, and the relationship between them (whether the prediction is correct). Knowing any two of these determines the third. The cells of the matrix are usually named by combining whether the prediction matches the actual value with the predicted class, as in the following list (a small counting sketch follows the list):

  • True positive (TP)

The true value is Positive and the prediction is correct (the predicted value is Positive).

  • True negative (TN)

The true value is Negative and the prediction is correct (the predicted value is Negative).

  • False positive (FP)

The true value is Negative and the prediction is wrong (the predicted value is Positive); this is a Type I error.

  • False negative (FN)

The true value is Positive and the prediction is wrong (the predicted value is Negative); this is a Type II error.
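A minimal counting sketch for the four cells (made-up binary label arrays, with 1 meaning Positive):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical true labels (1 = Positive)
y_pred = np.array([1, 0, 0, 1, 1, 1])   # hypothetical predicted labels

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives (Type I error)
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives (Type II error)

print(tp, tn, fp, fn)   # 3, 1, 1, 1
```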

Metrics derived from the confusion matrix:

In practice only a few of these are commonly used; interested readers can look up the other metrics themselves.

Precision

Precision is the proportion of the samples predicted as positive that are actually positive:
\[ \text{Precision}=\frac{\sum\text{True positive}}{\sum \text{Predicted condition positive}}= \frac{\text{TP}}{\text{TP}+\text{FP}} \]

Recall

Recall is the proportion of all actual positive samples that the classifier correctly identifies:
\[ \text{Recall}=\frac{\sum\text{True positive}}{\sum \text{Condition positive}}= \frac{\text{TP}}{\text{TP}+\text{FN}} \]
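Continuing the hypothetical counts from the confusion-matrix sketch above, a minimal computation of the two metrics:

```python
# Hypothetical counts carried over from the confusion-matrix sketch above.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)   # 3 / 4 = 0.75: how many predicted positives are real
recall = tp / (tp + fn)      # 3 / 4 = 0.75: how many real positives are found

print(precision, recall)
```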

\(F_1\) Score and \(F_\beta\) Score

Precision and recall affect each other; ideally both would be high, but in practice it is hard to raise both at once. To balance the two, their harmonic mean, the \(F_1\) score, is often used:
\[ F_1=\frac{2}{\frac{1}{\text{recall}}+\frac{1}{\text{precision}}} = 2\,\frac{\text{precision}\cdot \text{recall}}{\text{precision} + \text{recall}} \]
More generally, the weighted harmonic mean \(F_\beta\) Score is used:
\[ F_\beta=(1+\beta^2)\,\frac{\text{precision}\cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \]
where \(\beta\) represents the weight; it can also be expressed as:
\[
F_\beta=\frac{1}{\frac{\beta^2}{(1+\beta^2)\,\text{recall}}+\frac{1}{(1+\beta^2)\,\text{precision}}}
= \frac{1}{\frac{1}{(1+\frac{1}{\beta^2})\,\text{recall}}+\frac{1}{(1+\beta^2)\,\text{precision}}}
\]
When \(\beta \to 0\), \(F_\beta \approx \text{precision}\); when \(\beta \to \infty\), \(F_\beta \approx \text{recall}\); when \(\beta = 1\), \(F_\beta\) reduces to the \(F_1\) score.
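A small sketch of \(F_\beta\) as a plain function (the precision/recall values are made up just to show the effect of \(\beta\)):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall, as defined above."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Hypothetical precision/recall values, chosen only to show the effect of beta.
p, r = 0.9, 0.6

print(f_beta(p, r, beta=1))     # F1 score: 0.72
print(f_beta(p, r, beta=0.01))  # beta -> 0: close to precision (0.9)
print(f_beta(p, r, beta=100))   # beta -> infinity: close to recall (0.6)
```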

Question ⑥: Are there any indicators that do not depend on the threshold?
