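A First Kaggle Competition (House Prices)

Posing the Question

When you buy a house, which factors ultimately drive your decision?

Understanding the Data

The competition dataset contains a large number of fields describing each house. The fields are divided into discrete and continuous ones, and the discrete ones are preprocessed.

1) Import the data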

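# Import the required modules
import pandas as pd
import numpy as np

train = pd.read_csv('C:\\Users\\1\\Desktop\\house price  prediction\\train.csv')
test = pd.read_csv('C:\\Users\\1\\Desktop\\house price  prediction\\test.csv')
# Stack train and test so both are cleaned together
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
full = pd.concat([train, test], ignore_index=True)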

2) Inspect the data

① Check the data types

full.info()

The output is long, so only part of it is shown here.

② Check the missing values
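As a rough illustration of the discrete/continuous split mentioned earlier, the sketch below separates columns by dtype. The `demo` frame and its values are made up for illustration and stand in for `full`; treating every non-numeric column as discrete is an assumption, since some numerically coded fields may also be categorical.

```python
import pandas as pd

# Toy stand-in for the combined `full` DataFrame (values are illustrative only)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "MSZoning": ["RL", "RL", "RM"],
    "Alley": [None, "Grvl", None],
})

# Numeric columns are treated as continuous
continuous_cols = demo.select_dtypes(include="number").columns.tolist()
# Everything else (string columns) is treated as discrete
discrete_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("continuous:", continuous_cols)
print("discrete:", discrete_cols)
```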

aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)

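The dataset has a lot of missing data: the fields from "PoolQC" through "LotFrontage" in the sorted output each have a missing rate above 15%. Fields with this much missing data would normally be dropped outright; here they are kept and imputed as a preprocessing exercise.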
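The 15% figure is a rate rather than a count, so it can help to compute the fraction of missing rows directly. A minimal sketch on a made-up `demo` frame (the column names mirror the dataset, but the values are invented):

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (illustrative values only)
demo = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() gives the fraction of missing rows per column
missing_rate = demo.isnull().mean()
missing_rate = missing_rate[missing_rate > 0].sort_values(ascending=False)

# Columns whose missing rate exceeds the 15% threshold mentioned in the text
high_missing = missing_rate[missing_rate > 0.15].index.tolist()

print(missing_rate)
print(high_missing)
```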

Data Cleaning

1) Preprocessing

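# Numeric fields where a missing value means "none present": fill with 0
cols = ["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    # Assign the result back; inplace=True on a column selection is unreliable in recent pandas
    full[col] = full[col].fillna(0)

# Fields where a missing value means the feature is absent: fill with the string "None"
# (note that GarageYrBlt is numeric, so this turns it into an object column)
cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col] = full[col].fillna("None")

# Fields with only a few missing values: fill with the most frequent value
cols2 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType","Exterior1st", "Exterior2nd"]
for col in cols2:
    full[col] = full[col].fillna(full[col].mode()[0])

# LotFrontage: fill with the overall median
full['LotFrontage'] = full['LotFrontage'].fillna(full['LotFrontage'].median())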

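# Missing SalePrice values (the test rows): group by dwelling class (MSSubClass) and fill with the group median
full['SalePrice'] = full.groupby(['MSSubClass'])['SalePrice'].transform(lambda x: x.fillna(x.median()))
# Any group with no known price is still NaN; fill with the number 0 (not the string '0')
full['SalePrice'] = full['SalePrice'].fillna(0)
full['SalePrice'] = full['SalePrice'].astype('int')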
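To see what the group-wise median fill does, and why a second fill is still needed afterwards, here is a small worked example. The `demo` frame is invented; class 150 is included only to show a group in which every price is missing.

```python
import numpy as np
import pandas as pd

# Toy data: class 150 has no known price at all (illustrative values only)
demo = pd.DataFrame({
    "MSSubClass": [20, 20, 20, 60, 60, 150],
    "SalePrice": [100000.0, 140000.0, np.nan, 200000.0, np.nan, np.nan],
})

# Fill each missing price with the median price of its own MSSubClass group
filled = demo.groupby("MSSubClass")["SalePrice"].transform(
    lambda s: s.fillna(s.median())
)

# A group whose prices are all missing has a NaN median, so it stays NaN
still_missing = int(filled.isnull().sum())

print(filled.tolist())
print("still missing:", still_missing)
```

Rows in a group with at least one known price receive that group's median, while the all-missing group is left untouched until a later fill handles it.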

2) Feature engineering

① One-hot encode the discrete features:

BldgTypeDf = pd.get_dummies( full['BldgType'] , prefix='BldgType' )
full = pd.concat([full,BldgTypeDf],axis=1)
full.drop('BldgType',axis=1,inplace=True)

BsmtCondDf = pd.get_dummies( full['BsmtCond'] , prefix='BsmtCond' )
full = pd.concat([full,BsmtCondDf],axis=1)
full.drop('BsmtCond',axis=1,inplace=True)

BsmtExposureDf = pd.get_dummies( full['BsmtExposure'] , prefix='BsmtExposure' )
full = pd.concat([full,BsmtExposureDf],axis=1)
full.drop('BsmtExposure',axis=1,inplace=True)

BsmtFinType1Df = pd.get_dummies( full['BsmtFinType1'] , prefix='BsmtFinType1' )
full = pd.concat([full,BsmtFinType1Df],axis=1)
full.drop('BsmtFinType1',axis=1,inplace=True)

BsmtFinType2Df = pd.get_dummies( full['BsmtFinType2'] , prefix='BsmtFinType2' )
full = pd.concat([full,BsmtFinType2Df],axis=1)
full.drop('BsmtFinType2',axis=1,inplace=True)
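The same three-line pattern repeats for every column, so it could also be written as a loop. The sketch below is one possible way to do that on a made-up `demo` frame; the `dummy_frames` dictionary is a hypothetical helper that keeps each dummy frame available for later feature selection.

```python
import pandas as pd

# Toy stand-in for `full` (illustrative values only)
demo = pd.DataFrame({
    "BldgType": ["1Fam", "TwnhsE", "1Fam"],
    "BsmtCond": ["TA", "Gd", "None"],
    "GrLivArea": [1710, 1262, 1786],
})

dummy_frames = {}  # hypothetical helper: dummy frame per original column name
for col in ["BldgType", "BsmtCond"]:
    # One indicator column per category, named "<column>_<category>"
    dummy_frames[col] = pd.get_dummies(demo[col], prefix=col)
    # Append the indicator columns and drop the original column
    demo = pd.concat([demo, dummy_frames[col]], axis=1).drop(columns=col)

print(demo.columns.tolist())
```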

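② Feature selection

Compute each feature's correlation coefficient with SalePrice and sort the results:

# numeric_only=True skips string columns, which pandas 2.0+ would otherwise reject
corrDf = full.corr(numeric_only=True)
corrDf['SalePrice'].sort_values(ascending=False)

Positive correlations: [screenshot]

Negative correlations: [screenshot]

Select the features inside the red boxes in the screenshots above and recombine them: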

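# MSZoningDf is not created in the one-hot step shown earlier, so build it here the same way
MSZoningDf = pd.get_dummies(full['MSZoning'], prefix='MSZoning')

full_X = pd.concat([full['OverallQual'],
                    full['GrLivArea'],
                    full['GarageCars'],
                    full['YearBuilt'],
                    full['GarageArea'],
                    full['FullBath'],
                    full['YearRemodAdd'],
                    full['TotRmsAbvGrd'],
                    full['TotalBsmtSF'],
                    full['1stFlrSF'],
                    MSZoningDf,
                    BsmtExposureDf], axis=1)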
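The features above were picked by eye from the sorted correlations. As an alternative, the strongest features could be selected programmatically. The sketch below is not the method used in this post; it runs on synthetic data, and the coefficients, the `Noise` column, and the cutoff of two features are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on quality and area, but not on the noise column
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 300, size=n)
demo = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, size=n),
    "SalePrice": 20000 * quality + 150 * area + rng.normal(0, 5000, size=n),
})

# Correlation of every numeric feature with the target, excluding the target itself
corr = demo.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Rank by absolute value so strong negative correlations are kept too
top_features = corr.abs().sort_values(ascending=False).head(2).index.tolist()

print(corr)
print(top_features)
```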

Building the Model

1) Build the training and prediction datasets

sourceRow=1460
data_X=full_X.loc[0:sourceRow-1,:]
data_y=full.loc[0:sourceRow-1,'SalePrice']
pred_X=full_X.loc[sourceRow:,:]
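The `sourceRow-1` in the slices above is there because `.loc` slices by label and includes both endpoints. A tiny example on a made-up frame shows the behavior:

```python
import pandas as pd

# Five rows with the default integer index 0..4 (illustrative only)
demo = pd.DataFrame({"x": range(5)})
n_train = 3

# .loc is label-based and inclusive: labels 0, 1 and 2 are selected
train_part = demo.loc[0:n_train - 1, :]
# Everything from label 3 onward
rest_part = demo.loc[n_train:, :]

print(len(train_part), len(rest_part))
```

Writing `demo.loc[0:n_train, :]` instead would select four rows, leaking the first remaining row into the training part.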

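2) Choose an algorithm

House price prediction is a regression problem, so the most commonly used model, linear regression, is chosen (its advantage is that it has no hyperparameters to tune):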

# Import the linear regression model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# Train the model on all of the training data (no separate validation split is made here)
regressor.fit(data_X, data_y)

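3) Evaluate the solution

# Score the fitted model; for LinearRegression, score() returns R²
regressor.score(data_X, data_y)

The R² value is 0.795. It is computed on the same data the model was trained on, so it measures fit rather than generalization. The model's performance is mediocre; the features will be refined later, and other models with tunable hyperparameters will be tried, in order to get a better fit.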
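One way to get a less optimistic estimate would be to hold out part of the training data and score the model on it. The sketch below uses synthetic data rather than the competition data, and the 80/20 split and `random_state` are arbitrary choices, so its numbers say nothing about the 0.795 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem (stands in for data_X and data_y)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=300)

# Hold out 20% of the rows for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)   # R² on data the model has seen
valid_r2 = model.score(X_valid, y_valid)   # R² on held-out data

print(f"train R²: {train_r2:.3f}, validation R²: {valid_r2:.3f}")
```

A large gap between the two scores would suggest overfitting; similar scores suggest the training-set R² is a fair estimate.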


Reposted from blog.csdn.net/c710473510/article/details/88993811