Data Science Assignment 2: House Transaction Price Prediction

        This is Assignment 2 from the data science course I took last year. It was taught by Xiao Ruoxiu at the time, but I have heard that from this year on the computer science, IoT, and information security majors will all be taught at the same level of difficulty. This article may therefore be little more than a record, and it may not be of much help to junior students. When I took the course, Mr. Xiao did not take attendance, and after the four assignments I ended up with a fairly good score.

Previous link:

                Data Science Assignment 1


Table of contents

1. Assignment description

2. Procedure

    1. Import related libraries

    2. Read data

    3. Feature Scaling

    4. Gradient descent linear regression

    5. Model evaluation with mean squared error

    6. Plotting

    7. Output

3. Formula derivation

4. Visualization results

5. Source code

6. Reflections


1. Assignment description

This assignment provides one data set of US house transaction prices, data2.CSV, containing 10,000 records and 14 fields. The main fields are described as follows:

Column 1 "Sale date": the date the house was sold, between May 2014 and May 2015

Column 2 "Sale price": the transaction price of the house in US dollars; this is the prediction target

Column 3 "Number of bedrooms": the number of bedrooms in the house

Column 4 "Number of bathrooms": the number of bathrooms in the house

Column 5 "House area": the living area of the house

Column 6 "Parking area": the area of the parking apron

Column 7 "Number of floors": the number of floors of the house

Column 8 "House rating": the overall rating of the house under the King County housing rating system

Column 9 "Building area": the building area of the house, excluding the basement

Column 10 "Basement area": the area of the basement

Column 11 "Year built": the year the house was built

Column 12 "Year of renovation": the year the house was last renovated

Column 13 "Latitude": the latitude of the house

Column 14 "Longitude": the longitude of the house

This assignment aims to predict house prices accurately through a multi-dimensional analysis of the practicality and comfort of houses. It mainly tests students' understanding and application of regression algorithms.

Specific requirements:

(1) Using the given basic information and sale information for each house, build a regression model that predicts the sale price. One, several, or all of the features may be used.

(2) [Optional, bonus] Using only basic math libraries such as NumPy and pandas, implement the gradient descent optimization for the regression model above, and analyze how the step size affects the speed of convergence.

(3) Split the data sensibly (training set, validation set, test set, etc.), choose reasonable evaluation metrics, and analyze the performance of the resulting regression model.

2. Procedure

1. Import related libraries

# Import the required Python libraries
import os
import numpy as np
import pandas as pd

# Set the random seed
np.random.seed(36)

# Use matplotlib for plotting
import matplotlib
import seaborn
import matplotlib.pyplot as plot

from sklearn import datasets

2. Read data

housing = pd.read_csv('kc_train.csv')   # training features
target = pd.read_csv('kc_train2.csv')   # sale prices (prediction target)
t = pd.read_csv('kc_test.csv')          # test data

3. Feature Scaling

from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
minmax_scaler.fit(housing)   # fit the scaler; this updates its internal parameters
scaler_housing = minmax_scaler.transform(housing)
scaler_housing = pd.DataFrame(scaler_housing, columns=housing.columns)

# Scale the test data with the scaler fitted on the training data,
# so that both sets are mapped with the same per-column min and max
scaler_t = minmax_scaler.transform(t)
scaler_t = pd.DataFrame(scaler_t, columns=t.columns)
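Min-max scaling maps each feature to the range [0, 1] using the minimum and maximum observed in the data the scaler was fitted on. The following is a minimal NumPy sketch of that idea; the helper names `minmax_fit` and `minmax_transform` and the toy arrays are hypothetical and are not part of the assignment data or of scikit-learn.

```python
import numpy as np

def minmax_fit(X):
    """Learn the per-column minimum and range from X."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant columns would give a zero range; use 1 to avoid dividing by zero
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def minmax_transform(X, col_min, col_range):
    """Apply previously learned statistics to any data set."""
    return (X - col_min) / col_range

# Toy data standing in for training and test features
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 200.0]])
test = np.array([[2.0, 400.0]])

col_min, col_range = minmax_fit(train)
train_scaled = minmax_transform(train, col_min, col_range)
test_scaled = minmax_transform(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

Because the statistics come from the training rows only, a test value larger than the training maximum is mapped above 1, which is expected behavior rather than an error.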

4. Gradient descent linear regression

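from sklearn.linear_model import LinearRegression

# Note: LinearRegression solves the ordinary least squares problem directly;
# it does not run gradient descent internally
LR_reg = LinearRegression()

# Fit the model
LR_reg.fit(scaler_housing, target)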
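To make concrete what a direct least-squares fit computes, here is a small NumPy sketch on toy data. It assumes a made-up one-feature data set generated from y = 2x + 1, and it uses `np.linalg.lstsq` as a stand-in for the solver; it is an illustration of the technique, not the internals of scikit-learn.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (not the assignment data)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Append a column of ones so the intercept is learned as an extra weight
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve min ||A @ theta - y||^2 in one step, with no iteration over step sizes
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:-1], theta[-1]
print("weights:", w, "intercept:", b)
```

A gradient descent implementation should converge to the same weights and intercept on this kind of problem, which makes the direct solution a convenient reference for checking one.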

5. Model evaluation with mean squared error

from sklearn.metrics import mean_squared_error

preds = LR_reg.predict(scaler_housing)    # predict on the (training) input data
mse = mean_squared_error(target, preds)   # arguments are (y_true, y_pred); print mse to inspect the score
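An error measured on the rows used for fitting tends to be optimistic, which is why requirement (3) asks for a data split. The sketch below shows one possible way to hold out validation and test rows and to report MSE and R² for each part. It uses synthetic data and NumPy only; the 60/20/20 split, the helper names, and the data itself are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(36)

# Synthetic stand-in for scaled features and prices (not the assignment data)
n_rows = 200
X = rng.uniform(0.0, 1.0, size=(n_rows, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=n_rows)

# Shuffle the row indices, then split 60% / 20% / 20%
order = rng.permutation(n_rows)
parts = {
    "train": order[:120],
    "validation": order[120:160],
    "test": order[160:],
}

def fit_least_squares(X, y):
    """Return [weights..., intercept] minimizing the squared error."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict(X, theta):
    return X @ theta[:-1] + theta[-1]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Fit on the training rows only, then score every partition
theta = fit_least_squares(X[parts["train"]], y[parts["train"]])
scores = {}
for name, rows in parts.items():
    y_hat = predict(X[rows], theta)
    scores[name] = {"mse": mse(y[rows], y_hat), "r2": r2(y[rows], y_hat)}
    print(f"{name}: MSE={scores[name]['mse']:.4f}, R2={scores[name]['r2']:.4f}")
```

A validation error far above the training error would suggest overfitting; the test rows are kept aside for a single final check.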

6. Plotting

plot.figure(figsize=(10, 7))       # figure size
num = 100
x = np.arange(1, num + 1)          # compare the first 100 samples
plot.plot(x, target[:num], label='target')   # true values
plot.plot(x, preds[:num], label='preds')     # predicted values
plot.legend(loc='upper right')     # legend position
plot.show()

7. Output

result = LR_reg.predict(scaler_t)      # predict prices for the test data
df_result = pd.DataFrame(result)
df_result.to_csv("result.csv")         # write the predictions to result.csv

3. Formula derivation

        The parameters w and b are obtained by constructing a loss function and finding the values that minimize it.

        Here ŷ is the predicted value. The independent variable x and the dependent variable y are known, and the aim is to predict the y that corresponds to a new x. To construct this functional relationship, we use the known data points to solve for the two parameters w and b of the linear model.

        To solve for the optimal parameters, we need a criterion for measuring the result. We therefore define an objective function that quantifies the error, so that the computer can keep optimizing it during the solution process.

Any fitted model yields a set of predicted values ŷ that can be compared with the known true values y. With n data rows, the loss function can be defined as the mean of the squared differences between each prediction and its true value.
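Written out, and assuming the usual mean-squared-error form (the row index i is a notational convention introduced here), the loss is:

```latex
\[
L = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
\]
```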

        That is, the loss is the average squared distance between the predicted values and the true values, which is generally called the MSE (mean squared error) in statistics. Substituting the linear model ŷ = wx + b into the loss function, and treating the unknown parameters w and b as the independent variables of the function L, gives the objective to be minimized.
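Assuming the single-feature linear model with weight w and intercept b, so that the prediction for row i is w x_i + b, the substituted objective can be written as:

```latex
\[
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2
\]
```

With several features, the product w x_i becomes a dot product between a weight vector and the feature vector of row i.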

        The core of gradient descent is to update the independent variables repeatedly, using the partial derivatives of the objective with respect to w and b, so that the objective function steadily approaches its minimum.
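For the mean-squared-error loss, the partial derivatives are ∂L/∂w = (2/n) Σ (ŷ_i − y_i) x_i and ∂L/∂b = (2/n) Σ (ŷ_i − y_i), and each step moves w and b a small distance against them. The sketch below is one possible NumPy-only implementation of that update, in the spirit of optional requirement (2). The function name `gradient_descent`, the learning rates, the iteration count, and the synthetic data are all illustrative assumptions, not the assignment data.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE loss."""
    n_rows, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        err = X @ w + b - y                    # residuals: prediction minus truth
        grad_w = (2.0 / n_rows) * (X.T @ err)  # dL/dw
        grad_b = (2.0 / n_rows) * err.sum()    # dL/db
        w -= lr * grad_w                       # step against the gradient
        b -= lr * grad_b
    final_mse = float(np.mean((X @ w + b - y) ** 2))
    return w, b, final_mse

# Noise-free toy data from y = 3*x1 - 2*x2 + 1, with features already in [0, 1]
rng = np.random.default_rng(36)
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 1.0

# Same number of iterations, different step sizes
results = {}
for lr in (0.01, 0.1, 0.5):
    w, b, final_mse = gradient_descent(X, y, lr=lr)
    results[lr] = (w, b, final_mse)
    print(f"lr={lr}: w={w}, b={b:.4f}, final MSE={final_mse:.3e}")
```

With a fixed iteration budget, the smallest step size is still some distance from the minimum while the larger ones have effectively converged. A step size that is too large makes the updates overshoot and diverge, so the usable range depends on the scale of the features, which is one reason to scale them first.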

4. Visualization results

5. Source code

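'''
This assignment provides one data set of US house transaction prices, data2.CSV, containing 10,000 records and 14 fields. The main fields are described as follows:
Column 1 "Sale date": the date the house was sold, between May 2014 and May 2015
Column 2 "Sale price": the transaction price of the house in US dollars; this is the prediction target
Column 3 "Number of bedrooms": the number of bedrooms in the house
Column 4 "Number of bathrooms": the number of bathrooms in the house
Column 5 "House area": the living area of the house
Column 6 "Parking area": the area of the parking apron
Column 7 "Number of floors": the number of floors of the house
Column 8 "House rating": the overall rating of the house under the King County housing rating system
Column 9 "Building area": the building area of the house, excluding the basement
Column 10 "Basement area": the area of the basement
Column 11 "Year built": the year the house was built
Column 12 "Year of renovation": the year the house was last renovated
Column 13 "Latitude": the latitude of the house
Column 14 "Longitude": the longitude of the house

This assignment aims to predict house prices accurately through a multi-dimensional analysis of the practicality and comfort of houses. It mainly tests students' understanding and application of regression algorithms.
Specific requirements:
(1) Using the given basic information and sale information for each house, build a regression model that predicts the sale price. One, several, or all of the features may be used.
(2) [Optional, bonus] Using only basic math libraries such as NumPy and pandas, implement the gradient descent optimization for the regression model above, and analyze how the step size affects the speed of convergence.
(3) Split the data sensibly (training set, validation set, test set, etc.), choose reasonable evaluation metrics, and analyze the performance of the resulting regression model.
Submission: a compressed archive with the ZIP file extension, where:
(1) The root directory of the archive contains a PDF file named "<student ID>-<name>-Assignment 2.pdf". This is the assignment report, covering the procedure, formula derivation, code explanation, model training process and results, visualization charts, etc.
(2) The archive contains a src subdirectory holding all related code. The raw data does not need to be included in the archive.
'''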

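from __future__ import print_function

# Import the required Python libraries
import os
import numpy as np
import pandas as pd

# Set the random seed
np.random.seed(36)

# Use matplotlib for plotting
import matplotlib
import seaborn
import matplotlib.pyplot as plot

from sklearn import datasets


# Read the data
housing = pd.read_csv('kc_train.csv')   # training features
target = pd.read_csv('kc_train2.csv')   # sale prices (prediction target)
t = pd.read_csv('kc_test.csv')          # test data

# Data preprocessing
#housing.info()    # check for missing values

# Feature scaling
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaler.fit(housing)   # fit the scaler; this updates its internal parameters
scaler_housing = minmax_scaler.transform(housing)
scaler_housing = pd.DataFrame(scaler_housing, columns=housing.columns)

# Scale the test data with the scaler fitted on the training data,
# so that both sets are mapped with the same per-column min and max
scaler_t = minmax_scaler.transform(t)
scaler_t = pd.DataFrame(scaler_t, columns=t.columns)


# Linear regression model
# Note: LinearRegression solves the ordinary least squares problem directly;
# it does not run gradient descent internally
from sklearn.linear_model import LinearRegression
LR_reg = LinearRegression()
# Fit the model
LR_reg.fit(scaler_housing, target)


# Evaluate the model with the mean squared error
from sklearn.metrics import mean_squared_error
preds = LR_reg.predict(scaler_housing)    # predict on the (training) input data
mse = mean_squared_error(target, preds)   # arguments are (y_true, y_pred); print mse to inspect the score

# Plot targets against predictions
plot.figure(figsize=(10, 7))       # figure size
num = 100
x = np.arange(1, num + 1)          # compare the first 100 samples
plot.plot(x, target[:num], label='target')   # true values
plot.plot(x, preds[:num], label='preds')     # predicted values
plot.legend(loc='upper right')     # legend position
plot.show()


# Write out predictions for the test data
result = LR_reg.predict(scaler_t)
df_result = pd.DataFrame(result)
df_result.to_csv("result.csv")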

6. Reflections

      In this assignment I estimated house prices with linear regression and studied the gradient descent algorithm, which gave me a new understanding of regression models. I learned a lot from it.


Origin blog.csdn.net/weixin_48144018/article/details/124871453