Multiple linear regression with statsmodels

Import the dataset

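import numpy as np
import pandas as pd
import random
# Load the modules needed for linear regression
import statsmodels.api as sm  # least squares
from statsmodels.formula.api import ols  # OLS model with the formula interface
# Raw string so the backslashes in the Windows path are not read as escapes
house_data = pd.read_csv(r"D:\Download\house_prices.csv")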

Split and tidy the dataset

# Split into training and test sets
random.seed(123)  # set the random seed
a = random.sample(range(len(house_data)), round(len(house_data) * 0.3))
house_test = house_data.iloc[a]  # 30% of the rows, column dtypes preserved
house_train = house_data.drop(a)

# Reset the index of both sets
house_test = house_test.reset_index(drop=True)
house_train = house_train.reset_index(drop=True)
house_test.head()
house_train.head()

   house_id neighborhood  area  bedrooms  bathrooms      style    price
0       491            B  3512         5          3  victorian  1744259
1      3525            A  1940         4          2      ranch   493675
2      5108            B  2208         6          4  victorian  1101539
3      7507            C  1785         4          2      lodge   455235
4      7627            C  3263         5          3  victorian   821931
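
The manual sampling above can also be written with pandas' own sampling. The following is only a minimal sketch: `split_frame` is a hypothetical helper name, `demo` is synthetic stand-in data rather than house_prices.csv, and `DataFrame.sample` will not pick the same rows as `random.sample` even with the same seed.

```python
import pandas as pd


def split_frame(df, test_frac=0.3, seed=123):
    """Return (train, test), each with a fresh 0..n-1 index."""
    test = df.sample(frac=test_frac, random_state=seed)  # random rows for testing
    train = df.drop(test.index)                          # everything else
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Synthetic stand-in data (not the real dataset)
demo = pd.DataFrame({"area": range(100, 110), "price": range(1000, 1010)})
train, test = split_frame(demo)
print(len(train), len(test))
```

Resetting the index inside the helper keeps later positional lookups such as `house_test.price[i]` working.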

Train and display the linear regression model

# Train the model
lm = ols('price ~ area + bedrooms + bathrooms', data=house_train).fit()
lm.summary()
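                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     2958.
Date:                Mon, 01 Nov 2021   Prob (F-statistic):               0.00
Time:                        22:11:39   Log-Likelihood:                -59155.
No. Observations:                4220   AIC:                         1.183e+05
Df Residuals:                    4216   BIC:                         1.183e+05
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9332.4402   1.23e+04      0.756      0.449   -1.49e+04    3.35e+04
area         344.9919      8.607     40.082      0.000     328.117     361.866
bedrooms   -2934.4763   1.22e+04     -0.240      0.810   -2.69e+04     2.1e+04
bathrooms   9679.0188   1.69e+04      0.573      0.567   -2.34e+04    4.28e+04
==============================================================================
Omnibus:                      276.310   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              236.334
Skew:                           0.505   Prob(JB):                     4.79e-52
Kurtosis:                       2.431   Cond. No.                     1.15e+04
==============================================================================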


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
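
Note [2] warns about a large condition number. As far as I understand, statsmodels reports the condition number of the design matrix, which grows both when predictors are correlated and when they sit on very different scales (area in the thousands, room counts in single digits). The sketch below uses synthetic data with made-up relationships between the columns to separate the two effects; none of the numbers come from house_prices.csv.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: room counts are driven mostly by area (an assumption)
area = rng.uniform(500, 4000, n)
bedrooms = np.round(area / 700 + rng.normal(0, 0.5, n))
bathrooms = np.round(bedrooms / 2 + rng.normal(0, 0.3, n))

# Design matrix with an intercept column, as in the formula 'price ~ ...'
X = np.column_stack([np.ones(n), area, bedrooms, bathrooms])

# Pairwise correlations between the three predictors
corr = np.corrcoef(X[:, 1:], rowvar=False)

# Condition number on the raw scale
cond_raw = np.linalg.cond(X)

# Condition number after standardising the predictors (removes the scale effect)
Z = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
cond_std = np.linalg.cond(np.column_stack([np.ones(n), Z]))

print(corr.round(2))
print(cond_raw, cond_std)
```

If the standardised condition number is still large, correlation between predictors is the likelier cause; if it drops sharply, the warning was mostly about scale.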

Goodness of fit of the linear model

Compute R² on the test set:

# Evaluate the model on the test set
house_test.loc[:, "pread"] = lm.predict(house_test)
# Compute R squared
## Residual sum of squares
error2 = (house_test["pread"] - house_test["price"]) ** 2
## Total sum of squares
sst = (house_test["price"] - house_test["price"].mean()) ** 2
R2 = 1 - np.sum(error2) / np.sum(sst)
print("R squared:", R2)

R squared: 0.6784013021595324
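
The calculation above is R² = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean of the observed values. A small sketch of the same formula as a reusable function; `r_squared` is a hypothetical helper name, not part of statsmodels.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Predicting the mean everywhere explains none of the variance
print(r_squared([1, 2, 3], [2, 2, 2]))
```

On held-out data this value can be negative when the model does worse than predicting the mean.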

Visualizing the predictions

# Plot predicted vs. actual prices
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(range(len(house_test.pread)),sorted(house_test.price),c="black",label= "target_data")
plt.plot(range(len(house_test.pread)),sorted(house_test.pread),c="red",label = "Predict")
plt.legend()
plt.show()

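[Figure: sorted actual prices (black, "target_data") and sorted predicted prices (red, "Predict") on the test set]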

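As the plot shows, the two curves agree well at the low end and in the middle but diverge noticeably at the high end, which pulls R² down. Note that both series are sorted independently, so the plot compares the two distributions rather than the error for each individual house.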
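
The plotting code sorts `price` and `pread` separately, which can make a poor fit look good. One way to keep each prediction attached to its own house is to sort by the actual price and reorder the predictions with the same permutation. This is a sketch with a hypothetical helper `sort_by_actual` and three made-up values.

```python
import numpy as np


def sort_by_actual(actual, predicted):
    """Sort both arrays by the actual values, keeping pairs together."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    order = np.argsort(actual)  # permutation that sorts the actual values
    return actual[order], predicted[order]


# Made-up example: the predictions are badly wrong for two of three houses
actual = [300.0, 100.0, 200.0]
predicted = [120.0, 310.0, 190.0]
a_sorted, p_paired = sort_by_actual(actual, predicted)
print(a_sorted, p_paired)
```

Passing `a_sorted` and `p_paired` to `plt.plot` instead of two independently sorted lists would show the per-house mismatch that independent sorting hides.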

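Multiple linear regression with the Excel Data Analysis tool

Load the required add-ins

Go to File -> More -> Options -> Add-ins and look for Analysis ToolPak and Analysis ToolPak - VBA. If they are not loaded, enable them.

Run the linear regression analysis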

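On the Data tab, open Data Analysis and choose Regression. Select the Y range (the dependent variable); the other range holds the independent variables. The independent-variable columns must be adjacent and must not contain non-numeric entries. The results are as follows: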
[Figure: Excel regression output (image-2.png)]
R² is 0.673421718254268
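
Excel's Regression tool fits ordinary least squares, so its R² can be cross-checked in Python by solving the same least-squares problem directly. The sketch below uses a hypothetical helper `ols_fit` on synthetic data with a known exact relationship; to compare against Excel you would pass the same rows and columns that were selected in the dialog.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares fit with intercept; returns (coefficients, R squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimise ||A @ coef - y||
    fitted = A @ coef
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Synthetic data following y = 5 + 2*x1 + 3*x2 exactly
X_demo = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
y_demo = 5 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]
coef, r2 = ols_fit(X_demo, y_demo)
print(coef, r2)
```

The first entry of `coef` is the intercept, followed by one slope per column in the order the columns were passed.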

Multiple linear regression with scikit-learn

Import and split the dataset

The code is as follows:

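import pandas as pd
import numpy as np
from sklearn import linear_model  # linear models
from sklearn.model_selection import train_test_split
data = pd.read_csv(r'D:\Download\house_prices.csv')  # read the data
data.head()  # preview the data
x_data = data.iloc[:, 2:5]  # area, bedrooms, bathrooms
y_data = data.iloc[:, 6]  # price
# Assumption: the split call is missing from the original post. A 30% test
# share and random_state=123 mirror the statsmodels section and may not
# reproduce the coefficients and score printed further down.
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=123)
print(x_data, y_data)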
      area  bedrooms  bathrooms
0     1188         3          2
1     3512         5          3
2     1134         3          2
3     1940         4          2
4     2208         6          4
...    ...       ...        ...
6023   757         0          0
6024  3540         5          3
6025  1518         2          1
6026  2270         4          2
6027  3355         5          3

[6028 rows x 3 columns] 0        598291
1       1744259
2        571669
3        493675
4       1101539
         ...   
6023     385420
6024     890627
6025     760829
6026     575515
6027     844747
Name: price, Length: 6028, dtype: int64
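
Selecting columns by position with `iloc[:, 2:5]` and `iloc[:, 6]` breaks silently if the CSV column order changes. A sketch of the same selection by column name, using two synthetic rows that only copy the column layout shown earlier (the values are made up):

```python
import pandas as pd

# Synthetic rows with the same column layout as house_prices.csv
demo = pd.DataFrame({
    "house_id": [1, 2],
    "neighborhood": ["A", "B"],
    "area": [1200, 3500],
    "bedrooms": [3, 5],
    "bathrooms": [2, 3],
    "style": ["ranch", "victorian"],
    "price": [600000, 1700000],
})

feature_cols = ["area", "bedrooms", "bathrooms"]
x_by_name = demo[feature_cols]   # same columns as demo.iloc[:, 2:5]
y_by_name = demo["price"]        # same column as demo.iloc[:, 6]

print(x_by_name.equals(demo.iloc[:, 2:5]), y_by_name.equals(demo.iloc[:, 6]))
```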

Fit and evaluate the model

The code is as follows:

# Fit the model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print('Regression equation: price=', model.coef_[0], '*area +', model.coef_[1], '*bedrooms +', model.coef_[2], '*bathrooms +', model.intercept_)
print(model.score(X_test, y_test))  # R squared on the test set
Coefficients: [  349.35836736  2928.93579408 -5312.7045231 ]
Intercept: 8659.165446003317
Regression equation: price= 349.35836735851404 *area + 2928.9357940750974 *bedrooms + -5312.704523101768 *bathrooms + 8659.165446003317

The R² on the test set is
0.6677698767114112
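
To make the regression equation concrete, the sketch below plugs a house into the coefficients printed above. The coefficients are copied from that output; the example house (2000 area units, 4 bedrooms, 2 bathrooms) and the helper name `predict_price` are made up for illustration.

```python
# Coefficients copied from the printed regression equation
COEF_AREA = 349.35836735851404
COEF_BEDROOMS = 2928.9357940750974
COEF_BATHROOMS = -5312.704523101768
INTERCEPT = 8659.165446003317


def predict_price(area, bedrooms, bathrooms):
    """Evaluate price = b1*area + b2*bedrooms + b3*bathrooms + intercept."""
    return (COEF_AREA * area
            + COEF_BEDROOMS * bedrooms
            + COEF_BATHROOMS * bathrooms
            + INTERCEPT)


# Hypothetical house, not a row from the dataset
example = predict_price(area=2000, bedrooms=4, bathrooms=2)
print(round(example, 2))
```

With a fitted scikit-learn model, `model.predict` performs the same arithmetic for every row of its input.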

Reposted from: blog.csdn.net/weixin_45747542/article/details/121092799