15 - Evaluating Regression Algorithms: R Square

  In the previous blog, three metrics for evaluating regression algorithms were introduced: MSE, RMSE, and MAE. In fact, all of these metrics share a problem. Recall that when we studied classification, the evaluation metric was very simple: classification accuracy. Accuracy lies between 0 and 1; a value of 1 means the classification is 100% correct, the best possible result, while 0 is the worst. This standard is very clear, and precisely because accuracy is always between 0 and 1, we can easily compare results even across different classification problems.

  But RMSE and MAE have no such property. For example, suppose I am predicting housing prices and the RMSE or MAE I end up with is 5, meaning the error is 50,000 yuan, while for predicting students' exam scores the error is 10, i.e. the prediction is off by 10 points. Does our algorithm work better on housing prices or on student scores? We cannot tell, because the 5 and the 10 measure completely different kinds of things and cannot be compared directly. This is the limitation of RMSE and MAE.

  This problem can be solved by introducing a new metric: R Squared, usually just called R Square. It is calculated with the following formula (sums over i = 1, ..., m):

R^2 = 1 - Σ(ŷ_i - y_i)^2 / Σ(ȳ - y_i)^2

  Don't panic when you see this formula. The numerator is the error produced by our model's predictions ŷ_i, and the denominator is the error produced by the Baseline Model, which simply predicts the mean ȳ for every sample. With the formula written out like this, I believe we can already program the computation of R Square.

  However, what exactly does the R Square metric mean, and why is it a good metric? Since the fraction subtracted from 1 is the error of our model divided by the error of the Baseline Model, examining the values R Square can take leads to the following conclusions:

  • R^2 <= 1
  • The larger R^2 is, the better. When our prediction model makes no errors at all, R^2 reaches its maximum value of 1.
  • When our model performs exactly the same as the Baseline Model, R^2 equals 0.
  • If R^2 < 0, the model we learned is worse than the Baseline Model. In that case, it is very likely that the data has no linear relationship at all.

  Therefore, a very important advantage of using R^2 is that it essentially maps the result of a regression problem into the range 0 to 1, where 1 is the best and 0 is the worst. With this metric we can conveniently compare how well the same regression algorithm performs on different problems.

  However, unlike classification accuracy, R^2 can be less than 0. R^2 < 0 means the error of our model is larger than the error of the Baseline Model. In other words, after all the effort of training, our model predicts worse than simply using the Baseline Model; we would have been better off not training at all. When working with real data you may indeed encounter R^2 < 0, and when you do, be careful: it shows that the trained model is really poor and you might as well use the Baseline Model directly.

  Such a result usually also means that the data has no linear relationship at all. What we are learning here is linear regression, which rests on an important assumption: that there is some linear relationship in the data. That relationship can be a positive or a negative correlation, i.e. the slope can be greater than or less than 0. But if the data has no linear relationship whatsoever, the resulting R^2 may well be negative, and you should consider that linear regression may not be suitable for the problem.
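  To make the R^2 < 0 case concrete, here is a minimal sketch (the toy arrays below are made up purely for illustration): a deliberately bad set of predictions has a larger error than always predicting the mean ȳ, so R^2 comes out negative.

import numpy as np

# hypothetical toy data: true values, a reasonable prediction, and a bad one
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8, 5.1])  # close to the truth
y_bad = np.array([5.0, 1.0, 5.0, 1.0, 5.0])   # worse than always guessing the mean

def r_square(y, y_hat):
    """R^2 = 1 - (error of our model) / (error of the Baseline Model y = mean(y))"""
    ss_residual = np.sum((y_hat - y) ** 2)
    ss_total = np.sum((np.mean(y) - y) ** 2)
    return 1 - ss_residual / ss_total

print(r_square(y_true, y_good))  # about 0.99, close to the maximum of 1
print(r_square(y_true, y_bad))   # -2.0, worse than the Baseline Model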

  Finally, at the level of concrete implementation, let's take a closer look at the R Square formula. Does it look a bit familiar? If we divide both the numerator and the denominator of the fraction subtracted from 1 by m, the numerator becomes exactly the MSE and the denominator becomes the variance of y:
R^2 = 1 - (Σ(ŷ_i - y_i)^2 / m) / (Σ(ȳ - y_i)^2 / m) = 1 - MSE(ŷ, y) / Var(y)
  Written in this form, R Square is very easy to compute. The formula also hints that there is deeper statistical meaning behind R Square; since I am still a beginner, I will not dig into it here, and interested readers can look up the relevant material themselves.


Programming Implementation

  Now let's implement this very important metric, R Square, in code.

  We continue writing code in the Jupyter Notebook from the previous blog.
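  As a rough sketch, assuming the variables y_test and y_predict are still available from the previous blog's notebook (these names are carried over, not defined here), R Square can be computed directly:

import numpy as np

# assuming y_test and y_predict already exist from the previous blog's notebook
1 - np.sum((y_predict - y_test) ** 2) / np.sum((np.mean(y_test) - y_test) ** 2)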

  Below, we encapsulate this computation as a function, r2_score, in the metrics.py file we wrote earlier.

import numpy as np
from math import sqrt

def accuracy_score(y_true, y_predict):
    """Compute the classification accuracy between y_true and y_predict"""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return sum(y_true == y_predict) / len(y_true)

# MSE: mean squared error
def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

# RMSE: root mean squared error
def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""
    return sqrt(mean_squared_error(y_true, y_predict))

# MAE: mean absolute error
def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)

# R Square
def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict"""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)

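  Back in the notebook, a minimal usage sketch (again assuming y_test and y_predict from the previous blog) would be:

from metrics import r2_score

r2_score(y_test, y_predict)  # should match the value computed manually above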

  Finally, it is worth mentioning that in sklearn the linear regression algorithm is encapsulated in the LinearRegression class. However, LinearRegression directly supports multiple linear regression, and since we have only just learned simple linear regression, we will not use this class for now.

  Recall that the kNN algorithm we learned before directly encapsulates a score function to measure the performance of the algorithm itself. The same is true for regression algorithms in sklearn: they also provide a score function.
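  For example, here is a rough sketch using sklearn's KNeighborsRegressor on synthetic single-feature data (the data set is made up only for illustration); its score method returns exactly the R Square value:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# synthetic single-feature data with a roughly linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 4.0 + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)

print(knn_reg.score(X_test, y_test))              # R Square on the test set
print(r2_score(y_test, knn_reg.predict(X_test)))  # the same value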
  We can see that the score function directly returns the R Square value. This also shows how important R Square is; it is the most widely used evaluation standard for regression. So let us also add a score function to our SimpleLinearRegression.

# SimpleLinearRegression.py

import numpy as np
from metrics import r2_score

# use vectorized operations
class SimpleLinearRegression:
    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the model with the training sets x_train and y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        num = (x_train - x_mean).dot(y_train - y_mean)  # numerator as a dot product
        d = (x_train - x_mean).dot(x_train - x_mean)    # denominator as a dot product

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):  # x_predict is a vector
        """Given a data set x_predict, return the vector of predicted results"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):  # x_single is a single number
        """Given a single data point x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def score(self, x_test, y_test):
        """Evaluate the current model with the test sets x_test and y_test"""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegression()"

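  Using it in the notebook might look roughly like this (x_train, y_train, x_test, y_test are assumed to be the single-feature arrays prepared in the previous blog):

from SimpleLinearRegression import SimpleLinearRegression

reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)  # R Square of the model on the test set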

  So far, our study of simple linear regression comes to an end. The algorithm itself is actually quite simple; it is questions such as how to use the algorithm and how to evaluate it that leave plenty worth studying.

  Later, I will drop the restriction in simple linear regression that each sample has only one feature. Allowing each sample to have n features gives the more general form, which corresponds to the multiple linear regression problem.


For the full code, see 14 Evaluation of Regression Algorithm.ipynb
