Polynomial regression, R2 and RMSE

Main points:

  • Building a polynomial regression model


1. A brief introduction

R2 (coefficient of determination) and RMSE (root mean square error) are commonly used evaluation metrics for regression models; they measure how well the model fits the observed data and how accurate its predictions are. Here's how they're calculated:

  1. R2 (coefficient of determination): R2 measures how much of the variation in the dependent variable the model explains. Its value ranges from 0 to 1, and the closer it is to 1, the better the model fits the data.

    The calculation formula is: R2 = 1 - (SSR / SST), where SSR is the residual sum of squares (Sum of Squares of Residuals), which measures the fitting error of the regression model, and SST is the total sum of squares (Total Sum of Squares), which measures the overall dispersion of the data.

    The closer R2 is to 1, the better the model explains the variation in the observed data and the better the fit.

  2. RMSE (root mean square error): RMSE measures the difference between the model's predicted values and the actual observed values, and is used to assess the model's prediction accuracy.

    The calculation formula is: RMSE = sqrt(MSE), where MSE (Mean Squared Error) is the average of the squared errors between the predicted values and the observed values.

    The smaller the RMSE, the higher the prediction accuracy of the model and the smaller the gap between the predicted and observed values (a short NumPy sketch computing both metrics follows this list).
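
The two formulas above can be checked directly with NumPy. The following is a minimal sketch in which y_true and y_pred are made-up arrays used purely for illustration:

import numpy as np

# Made-up observed and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.2, 7.1, 9.4, 10.6])

ssr = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ssr / sst

mse = np.mean((y_true - y_pred) ** 2)        # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
print('R2 = %.4f, RMSE = %.4f' % (r2, rmse))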

2. Evaluation methods

When using a machine learning framework or statistical software package for modeling, the R2 and RMSE values of the model can usually be obtained through corresponding functions or methods.

Taking the common Python machine learning library Scikit-learn as an example, the following methods can be used to obtain R2 and RMSE values:

  • R2 (coefficient of determination):

from sklearn.metrics import r2_score

# y_true is the actual observed value, y_pred is the model's predicted value
r2 = r2_score(y_true, y_pred)

This calculates the R2 value for the model, where y_true is the actual observed value and y_pred is the model's predicted value.

  • RMSE (root mean square error):

from sklearn.metrics import mean_squared_error

# y_true is the actual observed value, y_pred is the model's predicted value
rmse = mean_squared_error(y_true, y_pred, squared=False)

This calculates the RMSE value for the model, where y_true is the actual observed value and y_pred is the model's predicted value. The squared=False parameter indicates that the root mean squared error is returned instead of the mean squared error.
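
Newer scikit-learn releases also expose a dedicated function for this, and the squared= argument has been deprecated there. If your installed version provides root_mean_squared_error, the following sketch is equivalent (same assumed y_true and y_pred as above):

from sklearn.metrics import root_mean_squared_error

# Equivalent to mean_squared_error(y_true, y_pred, squared=False)
rmse = root_mean_squared_error(y_true, y_pred)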

This is just an example; the exact methods and functions may vary depending on the tool or library used. For specific usage, consult the documentation of your tool or the corresponding function reference to obtain the model's R2 and RMSE values.

3. Nonlinear fitting

Nonlinear regression refers to the application of a nonlinear model to a regression problem where the relationship between the dependent and independent variables is not linear. The choice of nonlinear regression function usually depends on the nature of the problem and the distribution of the data.

Here are some common nonlinear regression functions:

  1. Polynomial Regression : Polynomial regression introduces a polynomial function of the independent variables into the regression model, such as a quadratic, cubic, or higher degree polynomial. Its form is:

    y = a0 + a1*x + a2*x^2 + a3*x^3 + ...
    

    where x is the independent variable, y is the dependent variable, and a0, a1, a2, a3, ... are the regression coefficients.

  2. Exponential Regression : Exponential regression introduces the independent variable into the regression model as an exponential function. Its form is:

    y = a*exp(b*x)
    

    where x is the independent variable, y is the dependent variable, and a and b are the regression coefficients.

  3. Logarithmic Regression : Logarithmic regression introduces the independent variable into the regression model as a logarithmic function. Its form is:

    y = a + b*log(x)
    

    where x is the independent variable, y is the dependent variable, and a and b are the regression coefficients.

  4. Sigmoid regression (logistic regression): Sigmoid regression is a commonly used nonlinear method for binary classification; it uses the sigmoid function to convert a linear combination of the inputs into a probability value. Its form is:

    y = 1 / (1 + exp(-(a + b*x)))
    

    where x is the independent variable, y is the probability value, and a and b are the regression coefficients.

These are just some examples of nonlinear regression functions; which one to choose depends on the specific problem and the characteristics of the data. Many other nonlinear functions are available, and you can adapt and extend them to suit your problem (a short curve-fitting sketch for the exponential form above follows).
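
As a concrete illustration of nonlinear fitting, here is a minimal sketch that fits the exponential form y = a*exp(b*x) with SciPy's curve_fit; the data is synthetic and used only for illustration:

import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, b):
    return a * np.exp(b * x)

# Synthetic data generated from a known exponential plus noise
x = np.linspace(0, 4, 30)
y = 2.5 * np.exp(0.8 * x) + np.random.normal(0, 1.0, x.size)

params, _ = curve_fit(exp_model, x, y, p0=(1.0, 0.5))  # p0 is the initial guess
a_hat, b_hat = params
print('a = %.3f, b = %.3f' % (a_hat, b_hat))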

4. Polynomial regression


Linear regression studies the relationship between a target variable and a single independent variable, but in many practical problems the target variable is affected by more than one independent variable. For example, a sheep's wool yield is influenced simultaneously by its weight, chest circumference, and body length, so we need regression analysis between one target variable and several independent variables, that is, multiple regression analysis. Moreover, since a straight line does not suit all data, we sometimes need a curve to fit the data. Many curved relationships in the real world can be captured by adding polynomial terms, for example a quadratic model:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2021-07-03
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import numpy as np

# X is the enterprise cost, Y is the enterprise profit
X = [[400], [450], [486], [500], [510], [525], [540], [549], [558], [590], [610], [640], [680], [750], [900]]
Y = [[80], [89], [92], [102], [121], [160], [180], [189], [199], [203], [247], [250], [259], [289], [356]]
print('Dataset X: ', X)
print('Dataset Y: ', Y)

# Step 1: linear regression analysis
clf = LinearRegression()
clf.fit(X, Y)
X2 = [[400], [750], [950]]
Y2 = clf.predict(X2)
print(Y2)
res = clf.predict(np.array([1200]).reshape(-1, 1))[0, 0]
print('Predicted profit for a cost of 1200: $%.1f' % res)
plt.plot(X, Y, 'ks')    # scatter plot of the training data
plt.plot(X2, Y2, 'g-')  # straight line through the linear predictions

# Step 2: polynomial regression analysis
xx = np.linspace(350, 950, 100)  # 100 evenly spaced points from 350 to 950
quadratic_featurizer = PolynomialFeatures(degree=2)  # quadratic polynomial features
x_train_quadratic = quadratic_featurizer.fit_transform(X)  # transform X into quadratic features
X_test_quadratic = quadratic_featurizer.transform(X2)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(x_train_quadratic, Y)

# Apply the fitted polynomial featurizer to the grid of points to build the design matrix
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), "r--",
         label="$y = ax^2 + bx + c$", linewidth=2)
plt.legend()
plt.show()

Here we use R-squared to evaluate the polynomial regression predictions. R-squared, also called the coefficient of determination, indicates how well the model fits the real data, and it can be calculated in several ways. In simple (one-variable) linear regression, R-squared equals the square of the Pearson product-moment correlation coefficient, so the value calculated this way is always between 0 and 1. Another way is to use the method provided by the Scikit-learn library. The R-squared calculation code is as follows:

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2021-07-03
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import numpy as np

# X is the enterprise cost, Y is the enterprise profit
X = [[400], [450], [486], [500], [510], [525], [540], [549], [558], [590], [610], [640], [680], [750], [900]]
Y = [[80], [89], [92], [102], [121], [160], [180], [189], [199], [203], [247], [250], [259], [289], [356]]
print('Dataset X: ', X)
print('Dataset Y: ', Y)

# Step 1: linear regression analysis
clf = LinearRegression()
clf.fit(X, Y)
X2 = [[400], [750], [950]]
Y2 = clf.predict(X2)
print(Y2)
res = clf.predict(np.array([1200]).reshape(-1, 1))[0, 0]
print('Predicted profit for a cost of 1200: $%.1f' % res)
plt.plot(X, Y, 'ks')    # scatter plot of the training data
plt.plot(X2, Y2, 'g-')  # straight line through the linear predictions

# Step 2: polynomial regression analysis
xx = np.linspace(350, 950, 100)  # 100 evenly spaced points from 350 to 950
quadratic_featurizer = PolynomialFeatures(degree=5)  # degree-5 polynomial features (variable names keep the original 'quadratic')
x_train_quadratic = quadratic_featurizer.fit_transform(X)  # transform X into polynomial features
X_test_quadratic = quadratic_featurizer.transform(X2)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(x_train_quadratic, Y)
# Apply the fitted polynomial featurizer to the grid of points to build the design matrix
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), "r--",
         label="degree-5 polynomial fit", linewidth=2)
plt.legend()
plt.show()
print('1 r-squared', clf.score(X, Y))
print('5 r-squared', regressor_quadratic.score(x_train_quadratic, Y))

# 1 r-squared 0.9118311887769025
# 5 r-squared 0.98087802460869788
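
To tie this back to section 2, the following sketch (assuming clf, regressor_quadratic, X, Y, and x_train_quadratic from the script above are still in scope) also reports the training-set RMSE of both models:

from sklearn.metrics import mean_squared_error

# RMSE of the linear model and the degree-5 polynomial model on the training data
rmse_linear = np.sqrt(mean_squared_error(Y, clf.predict(X)))
rmse_poly = np.sqrt(mean_squared_error(Y, regressor_quadratic.predict(x_train_quadratic)))
print('linear RMSE: %.2f' % rmse_linear)
print('degree-5 RMSE: %.2f' % rmse_poly)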

Origin blog.csdn.net/March_A/article/details/131041070