Experiment 09 Linear Regression and Boston Housing Price Prediction
1. Purpose of the experiment
- Master the basic concepts of machine learning
- Master the implementation process of linear regression
- Use LinearRegression to perform regression prediction
- Understand the evaluation criteria and formulas for regression algorithms
- Understand the causes of, and remedies for, overfitting and underfitting
2. Experimental equipment
- Jupyter Notebook
3. Experimental content
People often encounter classification and prediction problems in daily life. A target variable may be affected by multiple factors, and the importance of each factor can be judged from its correlation coefficient, just as a patient's disease may be caused by many factors acting together.
Housing is indispensable to everyone, and house prices are likewise affected by many factors: whether the city is first-tier or second-tier, how convenient the surrounding transportation is, whether there are hospitals or schools nearby, and so on.
"Regression" was proposed by the famous British biologist and statistician Galton (Francis Galton, 1822-1911, cousin of the biologist Darwin) when he was studying human heredity. In the 19th century, Gauss systematically proposed the least square estimation, which made the regression analysis flourish.
The Boston house price data comes from an American economics journal that analyzed and studied the Boston housing market. Each row of the data set describes a house in a town or suburb around Boston. In this experiment, the Boston housing price data set is used as the linear regression case data to train a model and predict Boston housing prices.
3.1 Understanding the data
First import the required packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import preprocessing
Load the Boston housing price dataset. (Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this experiment assumes an older scikit-learn version.)
data = load_boston()
data_pd = pd.DataFrame(data.data, columns=data.feature_names)
data_pd['price'] = data.target
After getting the data, first check the data types, whether there are null values, the descriptive statistics, and so on.
It can be seen that all the columns are numeric.
# View the descriptive statistics of the data
data_pd.describe()
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
Next, check whether there are any null values in the data.
# Check for missing values
data_pd.isnull().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
price 0
dtype: int64
It can be seen that there are no missing values in the dataset.
# Check the shape of the data
data_pd.shape
(506, 14)
The dataset has 506 rows and 14 columns.
View the first 5 rows of the data; the meaning of each feature is given afterwards.
data_pd.head()
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
The following variable descriptions make it easier to understand the meaning of each column.
- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for large lots
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- NOX: nitric oxides concentration
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built before 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: property tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: a transformed measure of the proportion of Black residents by town
- LSTAT: percentage of the population with lower socioeconomic status
- price: median value of owner-occupied homes (in $1000s)
3.2 Analyzing data
Calculate the correlation coefficient between each feature and price
data_pd.corr()['price']
CRIM -0.388305
ZN 0.360445
INDUS -0.483725
CHAS 0.175260
NOX -0.427321
RM 0.695360
AGE -0.376955
DIS 0.249929
RAD -0.381626
TAX -0.468536
PTRATIO -0.507787
B 0.333461
LSTAT -0.737663
price 1.000000
Name: price, dtype: float64
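As a sanity check, the Pearson correlation that DataFrame.corr() computes can be reproduced by hand with NumPy. The sketch below uses a small made-up data set (the columns x and y and their values are hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd

# Small made-up data set (hypothetical values, for illustration only)
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

x = df["x"].to_numpy()
y = df["y"].to_numpy()

# Pearson correlation by definition: cov(x, y) / (std(x) * std(y))
manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

# pandas computes the same quantity
assert abs(manual - df.corr().loc["x", "y"]) < 1e-12
print(round(manual, 6))
```

Because the normalization factors cancel, it does not matter whether the population or sample covariance is used; the result is the same dimensionless value in [-1, 1] that the table above reports for each feature against price.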
Draw and display the features with the absolute value of the correlation coefficient greater than 0.5:
corr = data_pd.corr()
corr = corr['price']
corr[abs(corr)>0.5].sort_values().plot.bar()
It can be seen that the absolute values of the correlation coefficients of LSTAT, PTRATIO, and RM with price are greater than 0.5. Scatter plots of these three features against price are drawn below.
(1) Scatter plot of LSTAT and price
data_pd.plot(kind="scatter",x="LSTAT",y="price")
(2) Scatter plot of PTRATIO and price
data_pd.plot(kind="scatter",x="PTRATIO",y="price")
(3) Scatter plot of RM and price
data_pd.plot(kind="scatter",x="RM",y="price")
It can be seen that each of the three features has a clear, roughly linear relationship with price.
3.3 Model building
(1) Prediction using a single variable
(1) Simple linear regression with LSTAT
First build the training set and test set.
# Build the training and test data
feature_cols = ['LSTAT']
X = data_pd[feature_cols]
y = data_pd['price']
# Split into training and test sets
train_X, test_X, train_Y, test_Y = train_test_split(X, y)
y.describe()
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: price, dtype: float64
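Note that train_test_split shuffles the rows randomly, so the error values reported in this experiment will differ from run to run. Passing random_state (and optionally test_size; by default 25% of the rows go to the test set) makes the split reproducible, as this small sketch on made-up data shows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (made-up data)
y = np.arange(10)

# Fixing random_state makes the split reproducible across calls
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_tr1.shape, X_te1.shape)  # 7 training rows, 3 test rows
assert (X_te1 == X_te2).all() and (y_te1 == y_te2).all()
```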
# Load the model
linreg = LinearRegression()
# Fit the data
linreg.fit(train_X, train_Y)
print(linreg.intercept_)
# pair the feature names with the coefficients
b = list(zip(feature_cols, linreg.coef_))
b
63.81849572918555
[('LSTAT', -2.2442477329043706)]
# Make predictions
y_predict = linreg.predict(test_X)
# Compute the mean squared error
print("Mean squared error =", metrics.mean_squared_error(test_Y, y_predict))
Mean squared error = 74.6287048997467
Plot the fitted regression line with seaborn (seaborn provides a higher-level interface on top of matplotlib):
sns.lmplot(x='LSTAT', y='price', data=data_pd, aspect=1.5, scatter_kws={'alpha':0.2})
(2) Simple linear regression with PTRATIO
# Build the training and test data
feature_cols = ['PTRATIO']
X = data_pd[feature_cols]
y = data_pd['price']
# Split into training and test sets
train_X, test_X, train_Y, test_Y = train_test_split(X, y)
# Load the model
linreg = LinearRegression()
# Fit the data
linreg.fit(train_X, train_Y)
print(linreg.intercept_)
# pair the feature names with the coefficients
b = list(zip(feature_cols, linreg.coef_))
b
61.54376809966996
[('PTRATIO', -2.1175617470715635)]
# Make predictions
y_predict = linreg.predict(test_X)
# Compute the mean squared error
print("Mean squared error =", metrics.mean_squared_error(test_Y, y_predict))
Mean squared error = 54.541969092283985
Plot the fitted regression line:
sns.lmplot(x='PTRATIO', y='price', data=data_pd, aspect=1.5, scatter_kws={'alpha':0.2})
(3) Simple linear regression with RM
# Build the training and test data
feature_cols = ['RM']
X = data_pd[feature_cols]
y = data_pd['price']
# Split into training and test sets
train_X, test_X, train_Y, test_Y = train_test_split(X, y)
# Load the model
linreg = LinearRegression()
# Fit the data
linreg.fit(train_X, train_Y)
print(linreg.intercept_)
# pair the feature names with the coefficients
b = list(zip(feature_cols, linreg.coef_))
b
-32.662292886508155
[('RM', 8.738014969584246)]
# Make predictions
y_predict = linreg.predict(test_X)
# Compute the mean squared error
print("Mean squared error =", metrics.mean_squared_error(test_Y, y_predict))
Mean squared error = 51.81438126437724
Plot the fitted regression line:
sns.lmplot(x='RM', y='price', data=data_pd, aspect=1.5, scatter_kws={'alpha':0.2})
Comparing the models by mean squared error
Answer: the simple regression on RM has the smallest mean squared error of the three single-variable models, so it is the best of them (note that the exact values vary with the random train/test split).
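A note on naming: metrics.mean_squared_error returns the mean squared error (MSE), not its square root; the root mean square error (RMSE) mentioned in the experiment aims is sqrt(MSE). A small sketch on made-up arrays:

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = metrics.mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# MSE by its definition: the mean of the squared residuals
assert np.isclose(mse, np.mean((y_true - y_pred) ** 2))
print(mse, rmse)
```

RMSE is often preferred for reporting because it has the same units as the target variable (here, thousands of dollars), but ranking models by MSE or RMSE gives the same order.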
(2) Prediction using multiple linear regression
Use LSTAT, PTRATIO, and RM together in a multiple linear regression. First build the training set and test set.
# Build the training and test data
feature_cols = ['LSTAT','PTRATIO','RM']
X = data_pd[feature_cols]
y = data_pd['price']
# Split into training and test sets
train_X, test_X, train_Y, test_Y = train_test_split(X, y)
# Load the model
linreg = LinearRegression()
# Fit the data
linreg.fit(train_X, train_Y)
print(linreg.intercept_)
# pair the feature names with the coefficients
b = list(zip(feature_cols, linreg.coef_))
b
24.145147504479777
[('LSTAT', -0.6077646658186993),
 ('PTRATIO', -0.9890097312795556),
 ('RM', 3.894020674969254)]
# Make predictions
y_predict = linreg.predict(test_X)
# Compute the mean squared error
print("Mean squared error =", metrics.mean_squared_error(test_Y, y_predict))
Mean squared error = 22.06146178562167
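For a fitted LinearRegression, predict simply evaluates the linear formula intercept_ + X · coef_. This can be verified on a small made-up data set constructed to follow an exact linear rule:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data satisfying y = 1 + 2*x1 + 3*x2 exactly
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# predict() is just the linear formula intercept_ + X @ coef_
manual = model.intercept_ + X @ model.coef_
assert np.allclose(manual, model.predict(X))

# With noiseless linear data the true parameters are recovered
assert np.allclose(model.coef_, [2.0, 3.0])
print(model.intercept_, model.coef_)
```

The same formula explains the coefficients printed above: each coefficient is the predicted change in price for a one-unit change in that feature, holding the other features fixed.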
Plot a comparison of the predicted values and the actual values on the test set.
fig = plt.figure(figsize=(10,6))  # Create a blank figure of the given size
# Use different colors for the two series
plt.plot(range(test_Y.shape[0]), test_Y, color="blue", linewidth=1.5, linestyle="-")
plt.plot(range(test_Y.shape[0]), y_predict, color="red", linewidth=1.5, linestyle="-.")
plt.legend(['Actual', 'Predicted'])
plt.show()  # Display the figure
Comparing the models by mean squared error
Answer: the multiple linear regression has the smallest mean squared error of all the models tried (22.06, versus 51.81 for the best single-variable model), so it is the best model.
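Beyond MSE, scikit-learn provides the other common regression metrics mentioned in the experiment aims, such as the mean absolute error (MAE) and the coefficient of determination R² (1.0 means a perfect fit, 0 means no better than predicting the mean). A sketch on made-up arrays:

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted house prices, for illustration only
y_true = np.array([22.0, 30.5, 15.0, 27.0, 19.5])
y_pred = np.array([21.0, 29.0, 16.5, 26.0, 20.0])

mae = metrics.mean_absolute_error(y_true, y_pred)
mse = metrics.mean_squared_error(y_true, y_pred)
r2 = metrics.r2_score(y_true, y_pred)

# R^2 by its definition: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
assert np.isclose(r2, 1 - ss_res / ss_tot)

print(mae, mse, r2)
```

Reporting R² alongside MSE is useful because it is scale-free, which makes it easier to compare models across different targets or data sets.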