[AI underlying logic] - "Mathematical Waltz" linear regression (code test)

Table of contents

1. Hands-on test of one-variable linear regression code

2. Statistical analysis

   1. statsmodels library
   2. Calculate various statistics
   3. F test, t test
   4. Confidence interval, prediction interval
   5. Residual normality test
   6. Autocorrelation detection


1. Hands-on test of one-variable linear regression code

① Import related modules

First, import the necessary modules. We mainly use the rich collection of algorithm modules that ship with Python's sklearn library. You can look up what each library does on your own; here I only explain the ideas.

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn
from pylab import rcParams
rcParams['figure.figsize'] = 10, 8
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression, Ridge, Lasso  # the last two are ridge regression and lasso regression, not needed yet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import KFold  # the old cross_validation module was renamed model_selection
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; unused in this post
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from scipy.interpolate import make_interp_spline  # the old spline function was renamed make_interp_spline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge

② Load and view the data

# Read the data
regression_data = pd.read_csv('simple_regression_data.csv')  # Load the CSV dataset: 1000 rows x 2 columns (one column 'Volume', one column 'Price')
# Draw a scatter plot to inspect the data distribution
plt.scatter(regression_data['Volume'], regression_data['Price'])
plt.xlabel('Volume')
plt.ylabel('Price')
plt.title('Price-Volume Data')
plt.show()
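To "view" the data beyond the scatter plot, a minimal sketch (assuming the CSV loaded as described above) is to print the first few rows and the summary statistics:

# Peek at the first rows and the basic summary statistics
print(regression_data.head())
print(regression_data.describe())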

③ Split into training and test sets

# Split the data into an 80% training set and a 20% test set; the split is random on every run
# (pass any fixed integer as the random_state parameter to make it reproducible)
X_train, X_test, Y_train, Y_test = train_test_split(
    regression_data['Volume'], regression_data['Price'], test_size=0.20)

④ Create the algorithm object and train it (fit)

# Create the linear regression object
simple_linear_regression = LinearRegression()
# Train on the training set
X_train = X_train.values
Y_train = Y_train.values
X_test = X_test.values
simple_linear_regression.fit(pd.DataFrame(X_train), pd.DataFrame(Y_train))  # The fit method trains the model to find the pattern

⑤ Predict on the test set and plot the result

# Predict on the test set
Y_predict = simple_linear_regression.predict(pd.DataFrame(X_test))  # The predict method predicts the test data
# Plot the fitted result
plt.scatter(X_test, Y_test, color='blue')              # Scatter plot of the test set
plt.plot(X_test, Y_predict, color='red', linewidth=2)  # Fitted line over the test set
plt.show()
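The imports above already include several evaluation metrics, so it is worth inspecting the fitted coefficients and scoring the predictions. A minimal sketch, reusing the names defined above:

# Fitted parameters: y = slope * x + intercept
print("Slope:", simple_linear_regression.coef_[0][0])
print("Intercept:", simple_linear_regression.intercept_[0])
# Score the test-set predictions with the metrics imported earlier
print("R^2:", r2_score(Y_test, Y_predict))
print("MSE:", mean_squared_error(Y_test, Y_predict))
print("MAE:", mean_absolute_error(Y_test, Y_predict))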

2. Statistical analysis

1. statsmodels library

Linear regression: comparing the statsmodels library to the scikit-learn library

The code above uses LinearRegression() from the scikit-learn library to implement linear regression. Next we will use sm.OLS, the ordinary least squares (OLS) method of the statsmodels library, for linear regression modeling. The main difference between the two lies in the functionality and output they provide:

(1) Functionality:
   LinearRegression() is scikit-learn's linear regression; its main purpose is machine learning tasks such as prediction. scikit-learn's model design is simpler and suits a wide range of machine learning workflows.
   The OLS model in statsmodels focuses on statistical analysis and provides much more detailed information about the statistical properties of the model, such as p-values and confidence intervals. It is commonly used for regression analysis and statistical inference, and suits data analysis that requires detailed statistics.

(2) Output:
   A LinearRegression() object does not directly expose statistical tests; it provides the model's coefficients, intercept, and so on. For statistics, you need other methods or libraries.
   The OLS model in statsmodels provides a very detailed statistical summary via model.summary(), including the ANOVA-style table, coefficients, standard errors, t statistics, p-values, etc.

In general:

If your main focus is prediction and machine learning tasks, LinearRegression() may be more convenient. If you care about the statistics and detailed properties of the model, or are performing regression analysis, the statsmodels OLS model is more suitable. In some cases both are used in the same analysis, choosing the appropriate tool for each need.
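To make the comparison concrete, here is a minimal sketch fitting the same line with both libraries, reusing X_train and Y_train from above; the two should agree on slope and intercept up to floating-point error:

import statsmodels.api as sm

# scikit-learn: coefficients only
lr = LinearRegression().fit(pd.DataFrame(X_train), pd.DataFrame(Y_train))
print("sklearn slope/intercept:", lr.coef_[0][0], lr.intercept_[0])

# statsmodels: the same line, plus a full statistical summary
ols = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print("statsmodels intercept/slope:", ols.params)  # [const, slope]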

Steps ①–③ of the code above remain unchanged, though some of the modules imported in ① are no longer used.

④ Use sm.OLS, the ordinary least squares method of the statsmodels library, for linear regression fitting

import statsmodels.api as sm
# Add an intercept term
X_train_sm = sm.add_constant(X_train)  # Convert the data to the format statsmodels expects
X_test_sm = sm.add_constant(X_test)

# Create the OLS model (ordinary least squares)
model = sm.OLS(Y_train, X_train_sm).fit()

View the model summary, which already contains most of the statistics we need.

# Model summary
summary = model.summary()
print("Ordinary least squares (OLS) regression results:")
print(summary)

2. Calculate various statistics

You can also call anova_table = sm.stats.anova_lm(model) to compute the analysis-of-variance table, but during the experiment it raised AttributeError: 'PandasData' object has no attribute 'design_info'. The likely cause: anova_lm only works on models fitted through the formula interface (statsmodels.formula.api.ols), which attaches the patsy design information that anova_lm needs; a model built with sm.OLS on plain arrays does not carry it.
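A minimal sketch of the formula-based fit that makes anova_lm work, assuming regression_data is the DataFrame loaded earlier (note this fits on the full DataFrame; to match the earlier split you could pass a training-subset DataFrame instead):

import statsmodels.formula.api as smf

# Fit through the formula interface so the patsy design information is attached
formula_model = smf.ols('Price ~ Volume', data=regression_data).fit()
anova_table = sm.stats.anova_lm(formula_model)
print(anova_table)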

# Compute various statistics
y_pred = model.predict(X_test_sm)
residuals = Y_test - y_pred  # Compute the residuals
print("Residuals:")
print(residuals)

# Total sum of squares (SST)
total_sum_of_squares = np.sum((Y_test - np.mean(Y_test))**2)
print("Total sum of squares (SST): %f" % total_sum_of_squares)

# Regression sum of squares (SSR)
regression_sum_of_squares = np.sum((y_pred - np.mean(Y_test))**2)
print("Regression sum of squares (SSR): %f" % regression_sum_of_squares)

# Residual sum of squares (SSE)
residual_sum_of_squares = np.sum(residuals**2)
print("Residual sum of squares (SSE): %f" % residual_sum_of_squares)

# Total degrees of freedom (DFT)
dft = len(Y_test) - 1  # number of non-NaN samples - 1
print("Total degrees of freedom (DFT): %d" % dft)

# Regression degrees of freedom (DFR)
dfr = 2 - 1  # number of model parameters - 1
print("Regression degrees of freedom (DFR): %d" % dfr)

# Residual degrees of freedom (DFE)
dfe = dft - dfr  # number of non-NaN samples - number of model parameters
print("Residual degrees of freedom (DFE): %d" % dfe)


# Mean total sum of squares (MST)
mean_total_sum_of_squares = total_sum_of_squares / dft
print("Mean total sum of squares (MST): %f" % mean_total_sum_of_squares)

# Mean regression sum of squares (MSR)
mean_regression_sum_of_squares = regression_sum_of_squares / dfr
print("Mean regression sum of squares (MSR): %f" % mean_regression_sum_of_squares)

# Mean squared error (MSE)
mean_residual_sum_of_squares = residual_sum_of_squares / dfe
print("Mean squared error (MSE): %f" % mean_residual_sum_of_squares)

# Root mean squared error (RMSE)
root_mean_square_residual = np.sqrt(mean_residual_sum_of_squares)
print("Root mean squared error (RMSE): %f" % root_mean_square_residual)

# Coefficient of determination R^2 (one-variable linear regression, so no adjusted R^2 needed)
R2 = regression_sum_of_squares / total_sum_of_squares  # SSR/SST, or equivalently 1 - SSE/SST on the training data
print("R^2: %.2f" % R2)

# Log-likelihood
log_likelihood = -0.5 * len(residuals) * (1 + np.log(2 * np.pi * mean_residual_sum_of_squares))
print("Log-likelihood: %f" % log_likelihood)

3. F test, t test

Question: why are the test statistics above different from those in the model summary? The main reason: model.summary() reports statistics computed on the training data the model was fitted to, while the values above are computed on the test set. Moreover, the decomposition SST = SSR + SSE holds exactly only on the data used to fit the model, so on the test set the two formulas for R^2 no longer agree.

# F test
from scipy.stats import f_oneway
# Concatenate actual and predicted values into one array (illustration only; not used by f_oneway below)
all_values = np.concatenate([Y_test, y_pred])
# Build the matching group labels
group_labels = ['Actual'] * len(Y_test) + ['Predicted'] * len(y_pred)
# One-way ANOVA between the actual and predicted groups
f_statistic, p_value = f_oneway(Y_test, y_pred)
print(f"F statistic: {f_statistic}")
print(f"F-test p-value: {p_value}")

# t test
t_test_results = model.t_test([0, 1])  # The contrast [0, 1] tests the hypothesis that the slope coefficient is 0
# Extract the t statistic and p-value
t_statistic = t_test_results.tvalue[0, 0]
t_p_value = t_test_results.pvalue.item()
print(f"t statistic: {t_statistic}")
print(f"t-test p-value: {t_p_value}")

If the p-value of the F test is small (less than the chosen significance level, e.g. 0.05), the null hypothesis can be rejected, indicating that the regression is significant and the overall fit is good, i.e. the independent variable has a significant effect on the dependent variable. Note, however, that f_oneway above performs a one-way ANOVA comparing the means of the actual and predicted groups, which is a different hypothesis from the regression F test reported in the model summary; that is another reason the numbers do not match.
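A minimal sketch of the regression F test itself, recomputed from the training-set sums of squares stored on the results object and checked against scipy (F = MSR / MSE with DFR and DFE degrees of freedom):

from scipy.stats import f as f_dist

# F = (explained SS / DFR) / (residual SS / DFE), on the training data
F = (model.ess / model.df_model) / (model.ssr / model.df_resid)
p = f_dist.sf(F, model.df_model, model.df_resid)
print("Regression F:", F, "p-value:", p)  # should match model.fvalue and model.f_pvalue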

4. Confidence interval, prediction interval

Question: is there a problem with the shaded fill in the figure? Most likely yes: fill_between assumes the x values are in increasing order, but X_test comes out of the random split unsorted, so the bands are drawn as overlapping back-and-forth polygons. Sorting everything by X_test before plotting, as done below, fixes the fill.

# Compute the confidence interval and the prediction interval
pred_results = model.get_prediction(X_test_sm)
confidence_interval = pred_results.conf_int()           # interval for the mean response
prediction_interval = pred_results.conf_int(obs=True)   # interval for new observations
# fill_between needs sorted x values, so sort everything by X_test
order = np.argsort(X_test)
# Visualize the fit
plt.scatter(X_test, Y_test, color='blue', label='Test Data')  # Scatter plot of the test set
plt.plot(X_test[order], y_pred[order], color='red', linewidth=2, label='Fit Line')  # Fitted line over the test set
# Shade the confidence interval
plt.fill_between(X_test[order], confidence_interval[order, 0], confidence_interval[order, 1], color='gray', alpha=0.2, label='Confidence Interval')
# Shade the prediction interval
plt.fill_between(X_test[order], prediction_interval[order, 0], prediction_interval[order, 1], color='orange', alpha=0.2, label='Prediction Interval')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.title('Price-Volume Data with Confidence and Prediction Intervals')
plt.legend()
plt.show()

5. Residual normality test

① Plot the residual distribution

import seaborn as sns
from scipy.stats import probplot
# Plot the distribution of the residuals
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

② Plot the Q-Q plot of the standardized residuals

In statistics, the Q-Q plot (Quantile-Quantile plot) is a graphical tool used to check whether a sample distribution follows a theoretical distribution. For linear regression models, the Q-Q plot of the standardized residuals is used to check whether the model's residuals are approximately normally distributed.

# Q-Q plot of the standardized residuals
standardized_residuals = (residuals - residuals.mean()) / residuals.std()  # standardize so the code matches the description above
probplot(standardized_residuals, plot=plt)
plt.title('Q-Q Plot of Standardized Residuals')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.show()

Standardized residuals are the residuals divided by their standard deviation, which puts them on a consistent scale. The Q-Q plot of the standardized residuals displays their distribution by comparing their sample quantiles to the theoretical quantiles of a standard normal distribution. If the points lie roughly on a straight line, the residuals approximately follow a normal distribution; curvature or systematic deviation suggests they do not.

In the code above, the probplot function draws the Q-Q plot of the standardized residuals. The points represent the observed standardized residuals; if they fall on a straight line, the residuals are approximately normally distributed. The plot is an intuitive way to check residual normality and a useful tool for judging whether the model meets the normality assumption.

③ Omnibus test

# Omnibus test of residual normality
omnibus_test = sm.stats.omni_normtest(residuals)
print("Omnibus test result:")
print(omnibus_test)
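As a cross-check, scipy offers normality tests too. A minimal sketch using scipy.stats (the Jarque-Bera statistic also appears in the model summary, computed there on the training residuals):

from scipy.stats import shapiro, jarque_bera

# Two more normality tests on the same residuals
stat_sw, p_sw = shapiro(residuals)
stat_jb, p_jb = jarque_bera(residuals)
print(f"Shapiro-Wilk: statistic={stat_sw}, p-value={p_sw}")
print(f"Jarque-Bera:  statistic={stat_jb}, p-value={p_jb}")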

6. Autocorrelation detection

① Plot the autocorrelation diagram

An autocorrelation plot is a graphical tool used to examine the autocorrelation of time-series data. The x-axis shows the lag order, and the y-axis shows the autocorrelation coefficient at that lag. The plot helps you see whether the series exhibits lagged correlation.

from statsmodels.graphics.tsaplots import plot_acf
# Plot the autocorrelation of the residuals
plot_acf(residuals, lags=20)  # The lags parameter sets the maximum lag order
plt.title('Autocorrelation Plot of Residuals')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.show()

Interpreting the autocorrelation plot:

  • If the points fall inside the shaded band, there is no significant autocorrelation among the residuals.
  • If points fall outside the shaded band, there may be lagged correlation among the residuals.
  • If the points alternate above and below zero at particular lags, this may indicate seasonality.

The shaded band in the autocorrelation plot typically marks a 95% confidence interval: autocorrelation coefficients inside the band are not significant. Here the points are clearly inside the band, indicating no significant autocorrelation among the residuals.
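For reference, the approximate 95% significance bound for the ACF of a white-noise series is ±1.96/√n; a minimal sketch of computing it by hand:

# Approximate 95% significance bound for the ACF of white noise
n = len(residuals)
bound = 1.96 / np.sqrt(n)
print(f"95% bound: ±{bound:.4f}")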

② Durbin-Watson autocorrelation test

# Durbin-Watson autocorrelation test
durbin_watson_statistic = sm.stats.durbin_watson(residuals)
print(f"Durbin-Watson statistic: {durbin_watson_statistic}")

A DW value near 2 indicates that the series has no autocorrelation; see my earlier blog posts for the full range of values and their meaning.
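The statistic itself is easy to recompute by hand; a minimal sketch, using the definition DW = Σ(e_t − e_{t−1})² / Σe_t² and the rule of thumb DW ≈ 2(1 − r₁), where r₁ is the lag-1 autocorrelation of the residuals:

# Recompute the Durbin-Watson statistic from its definition
e = np.asarray(residuals)
dw = np.sum(np.diff(e)**2) / np.sum(e**2)
# Lag-1 autocorrelation, for the DW ≈ 2(1 - r1) rule of thumb
r1 = np.corrcoef(e[:-1], e[1:])[0, 1]
print(f"DW by hand: {dw}, 2*(1 - r1): {2 * (1 - r1)}")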

Summary:

While writing this code-test post I ran into quite a few problems. I asked for help on platforms such as GPT and CSDN but could not resolve them all, so I hope knowledgeable readers will share their insights! Also, friends who need the measured-data xls file can like the post and ask in the comments section!


Origin blog.csdn.net/weixin_51658186/article/details/135049600