Data Mining in Practice: Kaggle Getting-Started Competition on House Price Prediction (EDA and Feature Engineering)


House price prediction competition link


1. About the House Price Prediction Competition

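  This is a house price regression task: given 79 variables that describe a house (area, location, surroundings, and so on), predict its price. Your job is to predict the sale price of each house. For each Id in the test set, you must predict the value of the SalePrice variable.
  The metric required by this competition is the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.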
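The metric described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the official scoring code; the function name `rmse_of_logs` and the sample prices are made up for the example.

```python
import math


def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed) sale prices."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("need at least one price")
    squared_errors = [
        (math.log(p) - math.log(o)) ** 2
        for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices, for illustration only.
observed = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 320000.0]
print(f"{rmse_of_logs(predicted, observed):.5f}")
```

Because the error is taken on logarithms, only the ratio between prediction and observation matters: predicting 200000 for a 100000 house costs the same as predicting 2 for a price of 1.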
  The submission file should be formatted as follows:

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
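A file in this format can be produced with Python's standard `csv` module. The sketch below is only one way to do it; the helper name `write_submission` is hypothetical, and the ids and prices are the sample values shown above rather than real predictions.

```python
import csv
import io


def write_submission(out, ids, prices):
    """Write a header plus one Id,SalePrice row per house to a text file object."""
    if len(ids) != len(prices):
        raise ValueError("ids and prices must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "SalePrice"])
    for house_id, price in zip(ids, prices):
        writer.writerow([house_id, price])


# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission(buffer, [1461, 1462, 1463], [169000.1, 187724.1233, 175221])
print(buffer.getvalue())
```

To write to disk, open the target with `open("submission.csv", "w", newline="")` and pass that file object as `out`.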

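Dataset overview

There are four files: train.csv is the training set, test.csv is the test set, data_description.txt describes the features, and sample_submission.csv is a sample submission file.
Field meanings in train.csv (paraphrased from data_description.txt):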
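SalePrice: target variable, the sale price
MSSubClass: feature variable, the building class; this is really a categorical variable, but it is stored as numeric codes in the data
MSZoning: feature variable, the general zoning classification; categorical
LotFrontage: feature variable, linear feet of street connected to the property (has missing values)
LotArea: feature variable, lot size in square feet
Street: feature variable, type of road access (almost all values are Pave)
Alley: feature variable, type of alley access
LotShape: feature variable, general shape of the property
LandContour: feature variable, flatness of the property
Utilities: feature variable, type of utilities available (almost all values are AllPub)
LotConfig: feature variable, lot configuration
LandSlope: feature variable, slope of the property; categorical
Neighborhood: feature variable, physical location within Ames city limits
Condition1: feature variable, proximity to a main road or railroad
Condition2: feature variable, proximity to a main road or railroad (if a second is present)
BldgType: feature variable, type of dwelling
HouseStyle: feature variable, style of dwelling
OverallQual: feature variable, overall material and finish quality
OverallCond: feature variable, overall condition rating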
YearBuilt: feature variable, original construction date
YearRemodAdd: feature variable, remodel date
RoofStyle: feature variable, type of roof
RoofMatl: feature variable, roof material
Exterior1st: feature variable, exterior covering on the house
Exterior2nd: feature variable, exterior covering on the house (if more than one material)
MasVnrType: feature variable, masonry veneer type
MasVnrArea: feature variable, masonry veneer area in square feet
ExterQual: feature variable, exterior material quality
ExterCond: feature variable, present condition of the material on the exterior
Foundation: feature variable, type of foundation
BsmtQual: feature variable, height of the basement
BsmtCond: feature variable, general condition of the basement
BsmtExposure: feature variable, walkout or garden level basement walls
BsmtFinType1: feature variable, quality of the basement finished area
BsmtFinSF1: feature variable, Type 1 finished square feet
BsmtFinType2: feature variable, quality of the second finished area (if present)
BsmtFinSF2: feature variable, Type 2 finished square feet
BsmtUnfSF: feature variable, unfinished square feet of basement area
TotalBsmtSF: feature variable, total square feet of basement area
Heating: feature variable, type of heating
HeatingQC: feature variable, heating quality and condition
CentralAir: feature variable, central air conditioning
Electrical: feature variable, electrical system
1stFlrSF: feature variable, first floor square feet
2ndFlrSF: feature variable, second floor square feet
LowQualFinSF: feature variable, low quality finished square feet (all floors)
GrLivArea: feature variable, above grade (ground) living area in square feet
BsmtFullBath: feature variable, basement full bathrooms
BsmtHalfBath: feature variable, basement half bathrooms
FullBath: feature variable, full bathrooms above grade
HalfBath: feature variable, half baths above grade
Bedroom: feature variable, number of bedrooms above basement level (the column in train.csv is named BedroomAbvGr)
Kitchen: feature variable, number of kitchens (the column in train.csv is named KitchenAbvGr)
KitchenQual: feature variable, kitchen quality
TotRmsAbvGrd: feature variable, total rooms above grade (does not include bathrooms)
Functional: feature variable, home functionality rating
Fireplaces: feature variable, number of fireplaces
FireplaceQu: feature variable, fireplace quality
GarageType: feature variable, garage location
GarageYrBlt: feature variable, year the garage was built
GarageFinish: feature variable, interior finish of the garage
GarageCars: feature variable, size of the garage in car capacity
GarageArea: feature variable, size of the garage in square feet
GarageQual: feature variable, garage quality
GarageCond: feature variable, garage condition
PavedDrive: feature variable, paved driveway
WoodDeckSF: feature variable, wood deck area in square feet
OpenPorchSF: feature variable, open porch area in square feet
EnclosedPorch: feature variable, enclosed porch area in square feet
3SsnPorch: feature variable, three season porch area in square feet
ScreenPorch: feature variable, screen porch area in square feet
PoolArea: feature variable, pool area in square feet
PoolQC: feature variable, pool quality
Fence: feature variable, fence quality
MiscFeature: feature variable, miscellaneous feature not covered in other categories
MiscVal: feature variable, dollar value of the miscellaneous feature
MoSold: feature variable, month sold
YrSold: feature variable, year sold
SaleType: feature variable, type of sale
SaleCondition: feature variable, condition of sale
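Two points from the field list lend themselves to a quick check: MSSubClass is a category stored as numbers, and some columns such as LotFrontage have missing values. The sketch below uses only the standard library on a tiny made-up excerpt (the rows and prices are hypothetical, and the real file has many more columns). It assumes missing entries appear as the literal string NA in the raw CSV. Note that for some columns, such as Alley, data_description.txt uses NA to mean "no alley access" rather than an unknown value, so a raw NA count needs interpreting per column.

```python
import csv
import io
from collections import Counter

# Hypothetical four-row excerpt, for illustration only.
SAMPLE = """Id,MSSubClass,LotFrontage,Alley,SalePrice
1,60,70,NA,100000
2,20,85,NA,120000
3,60,NA,NA,150000
4,70,55,Pave,90000
"""


def count_missing(rows, missing_token="NA"):
    """Count, per column, how many rows hold the missing-value token."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == missing_token:
                counts[column] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)

# Keep MSSubClass codes as strings so they act as category labels, not numbers.
subclass_counts = Counter(row["MSSubClass"] for row in rows)

print(dict(missing))
print(dict(subclass_counts))
```

With pandas, the same checks are usually written with `isnull().sum()` and `astype(str)`; the standard-library version is used here only to keep the example dependency-free.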

2. EDA and Feature Engineering

  I did not run this locally; I worked directly on Kaggle. The code is at the following link:



If you have any suggestions, feel free to leave a comment!


Reposted from blog.csdn.net/weixin_46649052/article/details/114890489