[Hands-On Machine Learning: From Getting Started to Giving Up] Starting with Linear Regression

Finally starting a new series~

Linear regression fits the data to a model of the form $y = a_1x_1 + a_2x_2 + a_3x_3 + \dots + a_nx_n + b + \epsilon$.

Training the model yields the parameters $a_1, a_2, \dots, a_n, b$,

so that y can be predicted for new values of x.
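
To make the formula concrete, here is a minimal sketch that recovers the parameters by ordinary least squares with NumPy. The data are synthetic and the "true" values (3, -2, 5) are made up for illustration; this is not the housing data used below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x1 - 2*x2 + 5 + noise (all values are made up)
X = rng.normal(size=(200, 2))
true_a = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_a + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept b is estimated together with the slopes
X_design = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a_hat, b_hat = params[:2], params[2]

# Predict y for a new x using the fitted parameters
x_new = np.array([1.0, 2.0])
y_new = x_new @ a_hat + b_hat
print(a_hat, b_hat, y_new)
```

scikit-learn's LinearRegression, used later in this post, solves the same least-squares problem behind a fit/predict interface.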

Let's get started. This time the goal is to predict house prices in Melbourne~

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Linear regression
from sklearn.linear_model import LinearRegression
# Train/test split
from sklearn.model_selection import train_test_split
from datetime import date

1. Dataset description

Melbourne Housing Market

Some Key Details

Suburb: Suburb

Address: Address

Rooms: Number of rooms

Price: Price in dollars

Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2 : Scraped # of Bedrooms (from different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year the house was built

CouncilArea: Governing council for the area

Lattitude: Self explanatory

Longtitude: Self explanatory

import os
print(os.listdir('datasets'))
['BrentOilPrices.csv', '.DS_Store', 'Iris', 'Lending club loan data', 'Adult', 'Melbourne_housing_extra_data.csv']

2. A first look at the data

org_data = pd.read_csv('datasets/Melbourne_housing_extra_data.csv')
org_data.head(10)
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 3/09/2016 2.5 3067.0 ... 1.0 1.0 126.0 NaN NaN Yarra -37.8014 144.9958 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 3/12/2016 2.5 3067.0 ... 1.0 1.0 202.0 NaN NaN Yarra -37.7996 144.9984 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 4/02/2016 2.5 3067.0 ... 1.0 0.0 156.0 79.0 1900.0 Yarra -37.8079 144.9934 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 4/02/2016 2.5 3067.0 ... 2.0 1.0 0.0 NaN NaN Yarra -37.8114 145.0116 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 4/03/2017 2.5 3067.0 ... 2.0 0.0 134.0 150.0 1900.0 Yarra -37.8093 144.9944 Northern Metropolitan 4019.0
5 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 4/03/2017 2.5 3067.0 ... 2.0 1.0 94.0 NaN NaN Yarra -37.7969 144.9969 Northern Metropolitan 4019.0
6 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 4/06/2016 2.5 3067.0 ... 1.0 2.0 120.0 142.0 2014.0 Yarra -37.8072 144.9941 Northern Metropolitan 4019.0
7 Abbotsford 16 Maugie St 4 h NaN SN Nelson 6/08/2016 2.5 3067.0 ... 2.0 2.0 400.0 220.0 2006.0 Yarra -37.7965 144.9965 Northern Metropolitan 4019.0
8 Abbotsford 53 Turner St 2 h NaN S Biggin 6/08/2016 2.5 3067.0 ... 1.0 2.0 201.0 NaN 1900.0 Yarra -37.7995 144.9974 Northern Metropolitan 4019.0
9 Abbotsford 99 Turner St 2 h NaN S Collins 6/08/2016 2.5 3067.0 ... 2.0 1.0 202.0 NaN 1900.0 Yarra -37.7996 144.9989 Northern Metropolitan 4019.0

10 rows × 21 columns

# Check the variable types:
org_data.dtypes
Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

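3. To keep the model simple, we keep only properties of Type h (house) and select the variables Rooms, Date, Distance, Landsize, Bedroom2, Bathroom, and YearBuilt, plus the target Price

# Select rows and columns in one .loc call; .copy() gives an independent frame to modify later
dataframe = org_data.loc[org_data["Type"] == 'h', ["Rooms","Date","Distance","Landsize","Bedroom2","Bathroom","YearBuilt","Price"]].copy()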

dataframe.head()
Rooms Date Distance Landsize Bedroom2 Bathroom YearBuilt Price
0 2 3/09/2016 2.5 126.0 2.0 1.0 NaN NaN
1 2 3/12/2016 2.5 202.0 2.0 1.0 NaN 1480000.0
2 2 4/02/2016 2.5 156.0 2.0 1.0 1900.0 1035000.0
4 3 4/03/2017 2.5 134.0 3.0 2.0 1900.0 1465000.0
5 3 4/03/2017 2.5 94.0 3.0 2.0 NaN 850000.0
dataframe.shape
(12992, 8)

4. Drop the rows where Price is null

dataframe = dataframe.dropna(subset=['Price'])

dataframe.head()
Rooms Date Distance Landsize Bedroom2 Bathroom YearBuilt Price
1 2 3/12/2016 2.5 202.0 2.0 1.0 NaN 1480000.0
2 2 4/02/2016 2.5 156.0 2.0 1.0 1900.0 1035000.0
4 3 4/03/2017 2.5 134.0 3.0 2.0 1900.0 1465000.0
5 3 4/03/2017 2.5 94.0 3.0 2.0 NaN 850000.0
6 4 4/06/2016 2.5 120.0 3.0 1.0 2014.0 1600000.0
dataframe.shape
(9944, 8)
# Count missing values
dataframe.isnull().describe()
Rooms Date Distance Landsize Bedroom2 Bathroom YearBuilt Price
count 9944 9944 9944 9944 9944 9944 9944 9944
unique 1 1 2 2 2 2 2 1
top False False False False False False True False
freq 9944 9944 9939 7780 7978 7978 5352 9944
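
Reading the table above takes a little arithmetic: Landsize is False (present) in 7780 of the 9944 rows, so 2164 values are missing, while YearBuilt is True (missing) in 5352 rows. A more direct way to get these counts is isnull().sum(), sketched here on a small made-up frame (the toy values are illustrative only).

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up data)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 120.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, np.nan, np.nan],
})

# Number of missing values per column
missing_count = toy.isnull().sum()
# Share of missing values per column (mean of a boolean column)
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share)
```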

5. Convert Date to the number of days since the earliest date

dataframe["Date"] = pd.to_datetime(dataframe["Date"],dayfirst=True)

# Days elapsed since the earliest sale date, computed once for the whole column
dataframe["Days"] = (dataframe["Date"] - dataframe["Date"].min()).dt.days

dataframe = dataframe.drop(["Date"], axis=1)
dataframe.head()
Rooms Distance Landsize Bedroom2 Bathroom YearBuilt Price Days
1 2 2.5 202.0 2.0 1.0 NaN 1480000.0 310
2 2 2.5 156.0 2.0 1.0 1900.0 1035000.0 7
4 3 2.5 134.0 3.0 2.0 1900.0 1465000.0 401
5 3 2.5 94.0 3.0 2.0 NaN 850000.0 401
6 4 2.5 120.0 3.0 1.0 2014.0 1600000.0 128
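
The dayfirst=True argument matters here because the dates are written day/month/year, so 3/12/2016 is 3 December, not 12 March. A small sketch of the conversion follows. The earliest date, 28/01/2016, is an assumption: it is not stated in the post, but it is consistent with the Days values shown above (7 for 4/02/2016 and 310 for 3/12/2016).

```python
import pandas as pd

# First entry is an assumed earliest date; the other two come from the table above
dates = pd.Series(["28/01/2016", "4/02/2016", "3/12/2016"])

# dayfirst=True reads the strings as day/month/year
parsed = pd.to_datetime(dates, dayfirst=True)

# Subtracting the minimum gives timedeltas; .dt.days extracts whole days
days = (parsed - parsed.min()).dt.days
print(days.tolist())
```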

6. Convert YearBuilt to the age of the building in years, relative to the current year (2019 at the time of writing)

year_from_now = [(2019 - x) for x in dataframe["YearBuilt"]]

dataframe["YearBuilt"]=year_from_now

dataframe.head()
Rooms Distance Landsize Bedroom2 Bathroom YearBuilt Price Days
1 2 2.5 202.0 2.0 1.0 NaN 1480000.0 310
2 2 2.5 156.0 2.0 1.0 119.0 1035000.0 7
4 3 2.5 134.0 3.0 2.0 119.0 1465000.0 401
5 3 2.5 94.0 3.0 2.0 NaN 850000.0 401
6 4 2.5 120.0 3.0 1.0 5.0 1600000.0 128

7. Look at the distribution of the non-null values of each variable

sns.kdeplot(dataframe["Price"])
<matplotlib.axes._subplots.AxesSubplot at 0x1a188c06d8>

[Figure: kernel density plot of Price]

sns.kdeplot(dataframe["Distance"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a189709b0>

[Figure: kernel density plot of Distance]

sns.kdeplot(dataframe["Landsize"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18a85710>

[Figure: kernel density plot of Landsize]

# Check the outliers
dataframe[dataframe["Landsize"]>70000]
Rooms Distance Landsize Bedroom2 Bathroom YearBuilt Price Days
1198 3 9.2 75100.0 3.0 1.0 NaN 2000000.0 213
17293 3 34.6 76000.0 3.0 2.0 NaN 1085000.0 485
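
The 70000 cut-off above was picked by eye from the density plot. One common alternative, which is not what the post does, is to flag values above the upper Tukey fence (third quartile plus 1.5 times the interquartile range). A sketch on a few Landsize values taken from the tables above:

```python
import pandas as pd

# A handful of Landsize values from the tables above, including one extreme value
landsize = pd.Series([126.0, 202.0, 156.0, 134.0, 94.0, 120.0, 400.0, 75100.0])

# Upper Tukey fence: Q3 + 1.5 * IQR
q1, q3 = landsize.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are flagged as potential outliers
outliers = landsize[landsize > upper_fence]
print(upper_fence, outliers.tolist())
```

Whether a flagged value should be dropped is a judgment call; a 75100 square metre block may be a genuine rural property rather than a data error.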
sns.kdeplot(dataframe["Days"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18c4b390>

[Figure: kernel density plot of Days]

sns.kdeplot(dataframe["YearBuilt"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18bfbe10>

[Figure: kernel density plot of YearBuilt]

YearBuilt has too many missing values (5352 of the 9944 rows) and poor data quality, so we drop this column.

8. Handling missing values

Distance = dataframe["Distance"]
Distance.fillna(Distance.mean(),inplace=True)
Distance.isnull().describe()

Bedroom2 = dataframe["Bedroom2"]
Bedroom2.fillna(Bedroom2.mean(), inplace=True)
Bedroom2.isnull().describe()

Bathroom = dataframe["Bathroom"]
Bathroom.fillna(Bathroom.mean(), inplace=True)
Bathroom.isnull().describe()

Landsize = dataframe["Landsize"]
Landsize.fillna(Landsize.mean(), inplace=True)
Landsize.isnull().describe()

dataframe = dataframe.drop(["Distance","Landsize","Bedroom2","Bathroom","YearBuilt"], axis=1)

dataframe = pd.concat([dataframe,Distance,Landsize,Bedroom2,Bathroom],axis=1)

dataframe.head()
Rooms Price Days Distance Landsize Bedroom2 Bathroom
1 2 1480000.0 310 2.5 202.0 2.0 1.0
2 2 1035000.0 7 2.5 156.0 2.0 1.0
4 3 1465000.0 401 2.5 134.0 3.0 2.0
5 3 850000.0 401 2.5 94.0 3.0 2.0
6 4 1600000.0 128 2.5 120.0 3.0 1.0
dataframe.isnull().describe()
Rooms Price Days Distance Landsize Bedroom2 Bathroom
count 9944 9944 9944 9944 9944 9944 9944
unique 1 1 1 1 1 1 1
top False False False False False False False
freq 9944 9944 9944 9944 9944 9944 9944
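
The same mean imputation can be written more compactly by filling several columns at once, without pulling each column out and concatenating it back. A sketch on made-up data (the toy frame and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two feature columns (made-up data)
toy = pd.DataFrame({
    "Distance": [2.5, np.nan, 9.2],
    "Bathroom": [1.0, 2.0, np.nan],
    "Price": [1480000.0, 1035000.0, 1465000.0],
})

cols = ["Distance", "Bathroom"]
filled = toy.copy()
# fillna with a Series of column means fills each column with its own mean
filled[cols] = filled[cols].fillna(filled[cols].mean())
print(filled)
```

Mean imputation produces fractional values for count-like columns such as Bedroom2 and Bathroom; the median would be one alternative if that matters.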

9. Plot a scatter-plot matrix to inspect the relationships between variables

sns.pairplot(dataframe)
<seaborn.axisgrid.PairGrid at 0x1a18ab1128>

[Figure: pairwise scatter-plot matrix of the variables]

10. Plot a heatmap to inspect the correlations between variables

fig, ax = plt.subplots(figsize=(15,15)) 
sns.heatmap(dataframe.corr(), annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a8d06d8>

[Figure: correlation heatmap of the variables]

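Days and Landsize show little correlation with Price, so they are candidates for removal. Note that the code below does not actually drop them: both still appear in the coefficient table in step 13.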
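
If one did want to drop weakly correlated columns, a minimal sketch might look like the following. The toy data and the 0.1 threshold are hypothetical, chosen only for illustration; the post does not state a cut-off.

```python
import pandas as pd

# Made-up data: Price depends on Rooms, while Days is unrelated to Price
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "Days": [0, 100, 200, 100, 0, 0, 100, 200, 100, 0],
})
noise = [10000, -10000, 0, -10000, 10000, -10000, 10000, 0, 10000, -10000]
toy["Price"] = 250000 * toy["Rooms"] + noise

# Correlation of every feature with the target
corr_with_price = toy.corr()["Price"].drop("Price")

threshold = 0.1  # hypothetical cut-off, not taken from the post
weak = corr_with_price[corr_with_price.abs() < threshold].index.tolist()

# Drop the weakly correlated columns before building X
reduced = toy.drop(columns=weak)
print(weak, list(reduced.columns))
```

A low linear correlation does not prove a feature is useless, so this is a rough filter at best.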

11. Split into training and test sets

X=dataframe.drop(["Price"], axis=1)
y=dataframe["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
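
Because the split above is random, rerunning the notebook gives slightly different coefficients and metrics each time. Passing random_state makes the split reproducible. A sketch on toy arrays (the seed 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same split on every call
split_a = train_test_split(X, y, test_size=0.4, random_state=42)
split_b = train_test_split(X, y, test_size=0.4, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
```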

12. Fit a linear regression model

lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

13. Inspect the fitted coefficients

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
# Sort the coefficients from largest to smallest
ranked_coeffs = coeff_df.sort_values("Coefficient", ascending = False)
ranked_coeffs
Coefficient
Bathroom 287227.794348
Rooms 256960.542449
Days 208.567181
Landsize 28.455846
Bedroom2 -40640.077882
Distance -48979.633196
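
The coefficients above are in dollars per unit of each feature, so their sizes are not directly comparable: Days is measured in single days and Rooms in whole rooms. One way to compare them on a common scale is to standardize the features before fitting. This is a sketch on synthetic data, not a rerun of the model above, and all numbers in it are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features on very different scales (made-up values)
rooms = rng.integers(1, 6, size=n).astype(float)
days = rng.integers(0, 600, size=n).astype(float)
X = np.column_stack([rooms, days])
y = 250_000 * rooms + 200 * days + rng.normal(scale=50_000, size=n)

# Raw fit: coefficients are per room and per day
raw = LinearRegression().fit(X, y)

# Standardized fit: coefficients are per one standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

print(raw.coef_, scaled.coef_)
```

Each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation, so the predictions are unchanged and only the units differ.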

14. Predict and visualize the predictions

predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.ylim([200000,1000000])
plt.xlim([200000,1000000])
(200000, 1000000)

[Figure: scatter plot of predicted vs. actual prices on the test set]

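# Look at the distribution of the residuals
sns.distplot((y_test-predictions),bins=50)
# The result looks reasonable: the residuals are concentrated in a fairly sharp peak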

[Figure: distribution of the residuals]

15. Compute RMSE (root mean squared error), MSE (mean squared error), and MAE (mean absolute error)

from sklearn import metrics
# 1.0 is the best possible value; lower is worse
print("score:", metrics.explained_variance_score(y_test, predictions))
print("MAE:", metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("r^2:", metrics.r2_score(y_test, predictions))
score: 0.31016009739486505
MAE: 406517.11675773124
MSE: 347933550457.3135
RMSE: 589858.9241990948
r^2: 0.31015583638325983
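
For reference, these metrics are simple functions of the prediction errors: RMSE is the square root of MSE (here about 589,859 dollars), and r^2 compares the squared errors with the variance of the true values. A small worked example on made-up numbers, checked against sklearn:

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

errors = y_true - y_pred                 # [-10, 10, -30, 30]
mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error
# r^2 = 1 - (sum of squared errors) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
print(metrics.r2_score(y_true, y_pred))
```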

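This is a basic linear regression model. Along the way we filled in missing values and looked at the distributions of some variables. The final result is mediocre (r^2 of about 0.31), limited by the simplicity of a linear model and by the fact that no variable transformations were applied. It is meant only as a first data analysis project, for getting familiar with the workflow.

I hope this is helpful to readers.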



Reposted from blog.csdn.net/yao09605/article/details/102724516