Machine learning | A simple practice based on machine learning: Boston housing price prediction analysis

This article uses the Boston HousePrice data set from Kaggle to show the usual process of building a machine learning model, which includes the following phases:

  • Data collection
  • Data cleaning
  • Exploratory data analysis
  • Feature engineering
  • Model building
  • Model integration

The label variable (the house price) is log-transformed so that it is closer to a normal distribution. From 12 candidate models, the 6 with the best predictive performance are selected: Lasso, Ridge, SVR, KernelRidge, ElasticNet, and BayesianRidge. They are combined in two ways, by weighted averaging and by stacking, and stacking turns out to perform better. The novel step is to add the stacking predictions to the original training set as extra features and retrain the stacking model, which improves performance again. This retrained model is used as the final prediction model, and its predictions scored well after submission to Kaggle. Because training time was limited, the hyperparameter search space is small and leaves room for improvement.

Data collection

Kaggle's website hosts a large number of machine learning data sets. This article uses the Boston HousePrice data set, available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. The download contains train.csv, test.csv, data_description.txt, and sample_submission.csv. As the names suggest, train.csv is the training set used to fit the model, test.csv is the test set used to check the model's accuracy, data_description.txt describes the fields of train.csv, and sample_submission.csv shows the format of the final submission file. The training set has 1460 samples and 81 fields, including an Id field and the label field SalePrice. The test set has 1459 samples and 80 fields.
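As a minimal sketch of this step, the two CSV files can be read with pandas. The helper name load_house_prices and the tiny in-memory tables are assumptions made for illustration; with the real download you would pass the paths "train.csv" and "test.csv" instead.

```python
import io

import pandas as pd


def load_house_prices(train_source, test_source):
    """Read the train and test CSVs into DataFrames.

    Each source may be a file path such as "train.csv" or a file-like object.
    """
    train = pd.read_csv(train_source)
    test = pd.read_csv(test_source)
    return train, test


# Tiny in-memory stand-ins for train.csv and test.csv (made-up rows).
demo_train_csv = io.StringIO("Id,LotArea,SalePrice\n1,8000,200000\n2,9000,180000\n")
demo_test_csv = io.StringIO("Id,LotArea\n3,10000\n")

train, test = load_house_prices(demo_train_csv, demo_test_csv)

# The label column exists only in the training data.
label_only_columns = set(train.columns) - set(test.columns)
print(train.shape, test.shape, label_only_columns)
```

Comparing the column sets is a quick way to confirm that the only field missing from the test set is the label.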

The task provides 79 features describing each house and asks us to predict its final sale price, that is, to give a predicted SalePrice for the Id of every house in the test set. It mainly tests skills in data cleaning, feature engineering, model building, and tuning. This is a typical regression problem. The evaluation metric is the root mean square error (RMSE). So that errors on expensive and cheap houses affect the result equally, the RMSE is computed between the logarithm of the predicted value and the logarithm of the actual value. A preliminary look at the features:

Feature name | Description | Type | Unit
--- | --- | --- | ---
SalePrice | House price, the label we want to predict | Numerical | USD
MSSubClass | Building class | Categorical |
MSZoning | Zoning classification | Categorical |
LotFrontage | Linear feet of street connected to the property | Numerical | Feet
LotArea | Lot area | Numerical | Square feet
Street | Type of street access | Categorical |
Alley | Type of alley access | Categorical |
LotShape | General shape of the property | Categorical |
LandContour | Flatness of the property | Categorical |
Utilities | Type of utilities available | Categorical |
LotConfig | Lot configuration | Categorical |
LandSlope | Slope of the property | Categorical |
Neighborhood | Physical location within the city | Categorical |
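The evaluation metric described before the table, RMSE computed on logarithms, can be sketched as follows. The function name rmse_of_logs is an assumption for illustration, and the prices are made up.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between the logarithms of the actual and predicted prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))


# The same 10% overestimate on a cheap house and on an expensive house.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])
print(cheap, expensive)
```

Because the error depends only on the ratio between prediction and actual value, both houses contribute the same amount, which is the "equal effect" the metric is designed for.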


Data cleaning

Data cleaning is an indispensable step in the whole data analysis process, and its quality directly affects the model's performance and the final conclusions. Data cleaning mainly deals with outliers, missing values, duplicate values, and data conversion.

Outliers

Outliers usually concern numerical variables. A scatter plot of the feature GrLivArea against SalePrice shows two abnormal points at the bottom right. A very large living area with a very low price is implausible, so these two points are deleted.

[Figure: scatter plot of GrLivArea against SalePrice]
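A minimal sketch of removing such points with a boolean mask is shown here. The article does not state the cut-off values, so area_threshold and price_threshold are illustrative assumptions, as are the function name and the made-up rows.

```python
import pandas as pd


def drop_large_cheap_houses(train, area_threshold=4000, price_threshold=300000):
    """Drop rows with a very large living area but a low sale price.

    Both thresholds are illustrative assumptions, not values from the article.
    """
    is_outlier = (train["GrLivArea"] > area_threshold) & (
        train["SalePrice"] < price_threshold
    )
    return train.loc[~is_outlier].reset_index(drop=True)


# Made-up rows: the last two mimic the "large but cheap" points.
demo = pd.DataFrame(
    {
        "GrLivArea": [1500, 2100, 3200, 4700, 5600],
        "SalePrice": [180000, 250000, 420000, 185000, 160000],
    }
)
cleaned = drop_large_cheap_houses(demo)
print(len(demo), "->", len(cleaned))
```

Only the training set is filtered this way, since the test set has no SalePrice column and every test house needs a prediction.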

Variable conversion

SalePrice is the target variable we need to predict, so we analyze it first. Fitting a normal distribution to SalePrice and drawing its normal probability plot shows that the target variable is right-skewed.

Linear models tend to fit better when the target is close to normally distributed, so a log transformation is applied to the target variable. Repeating the steps above after the transformation shows that the skewness is significantly reduced and the distribution is close to normal.

[Figure: distribution and normal probability plot of SalePrice]
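The effect of the log transformation on skewness can be checked numerically. The sketch below uses synthetic log-normal prices rather than the real SalePrice column, so the printed numbers are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal), used only for illustration.
rng = np.random.default_rng(0)
sale_price = pd.Series(
    np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)), name="SalePrice"
)

log_price = np.log(sale_price)
skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")

# Predictions made on the log scale are mapped back with the inverse transform.
recovered = np.exp(log_price)
```

A model trained on the transformed target predicts log prices, so the inverse transform is needed before writing the submission file.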

Missing values

The training set and the test set are combined into one data frame so that missing values are handled together. The missing data is analyzed as shown in the figure below.

[Figure: missing data by feature]

Points to consider:

Fill in missing values according to the actual meaning of each feature. A heat map is a convenient way to analyze the correlation between the features and SalePrice.

[Figure: correlation heat map of the features]

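A white line runs along the diagonal, which reflects that each feature is perfectly correlated with itself. More notable are two small square blocks: TotalBsmtSF with 1stFlrSF, and GarageArea with GarageCars. The total basement area TotalBsmtSF is strongly positively correlated with the first-floor area 1stFlrSF, and the garage area GarageArea is strongly positively correlated with the garage capacity GarageCars. This gives us a basis for filling missing values: we can drop one of the two redundant features outright, or use one to fill in the other.

For the feature PoolQC, the missing rate is very high. NA means the house has no pool, and common sense says most houses have none, so all missing values are filled with None.

For the features MiscFeature, Alley, Fence, FireplaceQu and MSSubClass, NA means the house lacks what the feature describes, so missing values are filled with None.

For the feature LotFrontage, according to the feature description, a house very likely has the same connected street frontage as its neighbors, so the data is grouped by the feature Neighborhood and missing values are filled with the mode of each group.

For the features GarageType, GarageFinish, GarageQual and GarageCond, missing values are filled with None.

For the features GarageYrBlt, GarageArea and GarageCars, all numerical variables, a missing value means there is no garage, so all are filled with 0.

For the features BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath, all numerical variables, a missing value means there is no basement, so all are filled with 0.

For the features BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2, all categorical variables, a missing value means there is no basement, so all are filled with None.

For the features MasVnrArea and MasVnrType, NA most likely means there is no masonry veneer, so they are filled with 0 and None respectively.

For the features MSZoning, Electrical, KitchenQual, Exterior1st, Exterior2nd and SaleType, the missing rate is low, so all are filled with the mode.

For the feature Functional, NA means typical, so it is filled with Typ.

For the feature Utilities, apart from two NA values and one NoSeWa, every value is AllPub, and the NoSeWa record belongs to the training set. The feature therefore carries no information for training the model and should be dropped.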
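The filling rules above can be sketched with pandas as follows. Only a handful of the listed columns are included to keep the example short, and the function name fill_missing and the demo rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Apply a few of the filling rules described above to a copy of df."""
    df = df.copy()
    # Categorical features where NA means "not present".
    for col in ["PoolQC", "Alley", "GarageType", "BsmtQual"]:
        df[col] = df[col].fillna("None")
    # Numerical features where NA means "not present".
    for col in ["GarageArea", "TotalBsmtSF"]:
        df[col] = df[col].fillna(0)
    # Low missing rate: fill with the most frequent value of the column.
    for col in ["MSZoning", "KitchenQual"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    # LotFrontage: fill with the mode of the same Neighborhood.
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.mode().iloc[0]) if s.notna().any() else s
    )
    # NA in Functional means typical.
    df["Functional"] = df["Functional"].fillna("Typ")
    # Utilities is almost constant, so it is dropped.
    return df.drop(columns=["Utilities"])


# Made-up rows standing in for the combined train and test data.
demo = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B"],
        "LotFrontage": [60.0, 60.0, np.nan, 80.0, np.nan],
        "PoolQC": [np.nan, "Gd", np.nan, np.nan, np.nan],
        "Alley": [np.nan, "Pave", np.nan, "Grvl", np.nan],
        "GarageType": ["Attchd", np.nan, "Detchd", "Attchd", np.nan],
        "BsmtQual": ["Gd", "TA", np.nan, "Gd", "TA"],
        "GarageArea": [400.0, np.nan, 300.0, 500.0, np.nan],
        "TotalBsmtSF": [800.0, 900.0, np.nan, 1000.0, 700.0],
        "MSZoning": ["RL", "RL", np.nan, "RM", "RL"],
        "KitchenQual": ["TA", np.nan, "TA", "Gd", "TA"],
        "Functional": ["Typ", np.nan, "Min1", "Typ", "Typ"],
        "Utilities": ["AllPub", "AllPub", np.nan, "AllPub", "NoSeWa"],
    }
)
filled = fill_missing(demo)
print(filled.isna().sum().sum())
```

Note that "None" here is the string label for "feature absent", not Python's None, so the filled columns stay ordinary categorical columns.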

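Encoding

  1. Some features are categorical in meaning but are stored as numbers, for example the month of sale MoSold, and need to be converted to categorical features. These variables are MSSubClass, BsmtFullBath, BsmtHalfBath, HalfBath, BedroomAbvGr, KitchenAbvGr, MoSold, YrSold, YearBuilt, YearRemodAdd, LowQualFinSF and GarageYrBlt.
  2. Group SalePrice by each categorical variable, then map the categories to ordered values. Taking the variable MSSubClass as an example, its categories can be mapped according to the mean SalePrice of each group, as shown in the figure below.

[Figure: mapping of MSSubClass categories by mean SalePrice]

The same group-then-map procedure is applied in turn to the variables MSSubClass, MSZoning, Neighborhood, Condition1, BldgType, HouseStyle, Exterior1st, MasVnrType, ExterQual, Foundation, BsmtQual, BsmtExposure, Heating, HeatingQC, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, PavedDrive, SaleType and SaleCondition.

  3. Next, the features are encoded with LabelEncoder and one-hot encoding. The three year-related variables YearBuilt, YearRemodAdd and GarageYrBlt are label-encoded first. Highly skewed variables are then log1p-transformed, after which one-hot encoding is applied. The data set is then split back into training and test sets in the original proportions. Because a large number of outliers may remain in both sets, and for the sake of model robustness, RobustScaler is used to scale all the data.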
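The encoding steps above can be sketched on a toy table as follows. The "o" prefix for mapped columns follows the names used later in the article (for example oMSZoning); the skewness threshold of 1, the choice of columns and the demo rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Made-up training rows.
train = pd.DataFrame(
    {
        "MSSubClass": [20, 60, 20, 30, 60, 30],
        "YearBuilt": [1995, 2005, 1995, 1950, 2008, 1950],
        "LotArea": [8000, 9500, 8200, 7000, 150000, 7500],
        "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
        "SalePrice": [150000, 250000, 160000, 90000, 300000, 100000],
    }
)
features = train.drop(columns=["SalePrice"]).copy()

# 1. Numeric codes that are really categories become strings.
features["MSSubClass"] = features["MSSubClass"].astype(str)

# 2. Map each category to the rank of its mean SalePrice (1 = cheapest group).
group_means = train["SalePrice"].groupby(features["MSSubClass"]).mean().sort_values()
rank_by_mean = {cat: rank for rank, cat in enumerate(group_means.index, start=1)}
features["oMSSubClass"] = features["MSSubClass"].map(rank_by_mean)

# 3. Label-encode a year-related variable.
features["YearBuilt"] = LabelEncoder().fit_transform(features["YearBuilt"])

# 4. log1p-transform strongly skewed numeric columns (threshold is an assumption).
for col in ["LotArea"]:
    if abs(features[col].skew()) > 1:
        features[col] = np.log1p(features[col])

# 5. One-hot encode the categorical columns.
encoded = pd.get_dummies(features, columns=["MSSubClass", "Street"], dtype=float)

# 6. Scale with statistics that are robust to outliers (median and IQR).
scaled = RobustScaler().fit_transform(encoded)
print(rank_by_mean, scaled.shape)
```

In a real pipeline the scaler would be fitted on the training rows and then applied to the test rows, so that both are scaled with the same statistics.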

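With preprocessing complete, we move on to feature engineering.

Feature engineering

A saying circulates widely in the industry: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. As the name suggests, feature engineering is essentially an engineering activity whose aim is to extract as many useful features as possible from the raw data for algorithms and models to use. It is generally summarized as consisting mainly of feature creation and feature selection.

Feature selection

Feature selection has two main purposes: first, to reduce the number of features and the dimensionality so that the model generalizes better; second, to reduce overfitting and improve understanding of the relationship between features and their values. Features are chosen mainly on the following two criteria:

1. Whether the feature diverges. If a feature does not diverge, for example its variance is close to 0, the samples barely differ on this feature and it is of little use for telling samples apart.

2. The correlation between the feature and the target. This is fairly obvious: features highly correlated with the target should be preferred. Apart from the variance method, the other methods mentioned in this article all start from correlation.

Based on these two criteria, common feature selection methods include removing low-variance features, the chi-squared (Chi2) test, the Pearson correlation coefficient, mutual information and the maximal information coefficient, the distance correlation coefficient, Wrapper methods, and Embedded methods.

Because the number of features is large, the penalty-based feature selection method from the Embedded family is chosen. The main idea of Embedded methods is to train certain machine learning algorithms and models to obtain a weight coefficient for each feature, then select features by coefficient from largest to smallest. This resembles the Filter approach, except that the quality of each feature is determined through training. In other words, the attributes that matter most to the model are picked out while the model is being fitted. Thanks to its L1 regularization term, LASSO regression performs feature selection and dimensionality reduction at the same time and is particularly suited to sparse data. Since the encoding above inflated the number of features and made the data sparse, LASSO regression is chosen to screen the features. Applying LASSO regression to the training set gives the importance of all features, as follows.

[Figure: LASSO feature importance of all features]

Next, the features whose importance is not 0 are selected, as shown in the figure below.

[Figure: features with non-zero importance]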
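A minimal sketch of LASSO-based screening is given here, taking the fitted coefficients as the feature importance and keeping the non-zero ones. The data is synthetic and the value of alpha was picked for this toy example; neither comes from the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples = 200
X = pd.DataFrame(
    rng.normal(size=(n_samples, 6)), columns=[f"f{i}" for i in range(6)]
)
# Only f0 and f1 influence the synthetic target; the rest are noise.
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.1, size=n_samples)

lasso = Lasso(alpha=0.2)  # alpha chosen for this toy data
lasso.fit(X, y)

# Coefficients act as feature importance; the L1 penalty drives some to exactly 0.
importance = pd.Series(lasso.coef_, index=X.columns)
selected = importance[importance != 0].index.tolist()
print(importance.round(3).to_dict())
print("selected:", selected)
```

A larger alpha removes more features, so the penalty strength controls how aggressive the screening is.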

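Feature creation

Based on feature importance, some new features can be created. For example, buyers may care about the total area of the house, so existing features can be combined into a new one.

For instance, X["TotalArea"]=X["TotalBsmtSF"]+X["1stFlrSF"]+X["2ndFlrSF"]+X["GarageArea"]. In the same way, based on the original feature descriptions and their practical meaning, the following new features can be combined.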

X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]

X["TotalArea"]=X["TotalBsmtSF"]+X["1stFlrSF"]+X["2ndFlrSF"]+X["GarageArea"]

X["+_TotalHouse_OverallQual"]=X["TotalHouse"]*X["OverallQual"]

X["+_GrLivArea_OverallQual"]=X["GrLivArea"]*X["OverallQual"]

X["+_oMSZoning_TotalHouse"]=X["oMSZoning"]*X["TotalHouse"]

X["+_oMSZoning_OverallQual"]=X["oMSZoning"]+X["OverallQual"]

X["+_oMSZoning_YearBuilt"]=X["oMSZoning"]+X["YearBuilt"]

X["+_oNeighborhood_TotalHouse"]=X["oNeighborhood"]*X["TotalHouse"]

X["+_oNeighborhood_OverallQual"]=X["oNeighborhood"]+X["OverallQual"]

X["+_oNeighborhood_YearBuilt"]=X["oNeighborhood"]+X["YearBuilt"]

X["+_BsmtFinSF1_OverallQual"]=X["BsmtFinSF1"]*X["OverallQual"]

X["-_oFunctional_TotalHouse"]=X["oFunctional"]*X["TotalHouse"]

X["-_oFunctional_OverallQual"]=X["oFunctional"]+X["OverallQual"]

X["-_LotArea_OverallQual"]=X["LotArea"]*X["OverallQual"]

X["-_TotalHouse_LotArea"]=X["TotalHouse"]+X["LotArea"]

X["-_oCondition1_TotalHouse"]=X["oCondition1"]*X["TotalHouse"]

X["-_oCondition1_OverallQual"]=X["oCondition1"]+X["OverallQual"]

X["Bsmt"]=X["BsmtFinSF1"]+X["BsmtFinSF2"]+X["BsmtUnfSF"]

X["Rooms"] = X["FullBath"]+X["TotRmsAbvGrd"]

X["PorchArea"]=X["OpenPorchSF"]+X["EnclosedPorch"]+X["3SsnPorch"]+X["ScreenPorch"]

X["TotalPlace"]=X["TotalBsmtSF"]+X["1stFlrSF"]+X["2ndFlrSF"]+X["GarageArea"]+X["OpenPorchSF"]+X["EnclosedPorch"]+X["3SsnPorch"]+X["ScreenPorch"]

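Dimensionality reduction

After the encoding and feature creation above, the feature matrix is very large, which leads to heavy computation and long training time, so reducing the dimensionality of the feature matrix is also necessary. In this case, however, reducing to 40 principal components with PCA performed worse than keeping 400 principal components. Since 400 is close to the full feature dimension, this suggests that the model is not overfitting much.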
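The trade-off between the number of principal components and the information kept can be sketched with scikit-learn's PCA. The data below is synthetic (50 features driven by 5 latent factors), so the component counts differ from the 40 and 400 used in the article.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 300 samples, 50 features driven by 5 latent factors plus noise.
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 50))

retained_by_k = {}
for n_components in (2, 5, 40):
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    # Share of the total variance kept by the first n_components components.
    retained_by_k[n_components] = pca.explained_variance_ratio_.sum()
    print(n_components, X_reduced.shape, round(retained_by_k[n_components], 4))
```

Keeping too few components discards variance that the models could have used, which is one plausible reading of why 40 components did worse than 400 in the article's experiment.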

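Model fusion and evaluation

Model fusion and the search for higher-level features are two important ways to improve machine learning performance. There are many model fusion methods, such as bagging, stacking, boosting, weighted averaging, and voting. This article uses weighted averaging and stacking. Twelve base models are considered for fusion: LinearRegression, Ridge, Lasso, Random Forest, Gradient Boosting Tree, Support Vector Regression, Linear Support Vector Regression, ElasticNet, Stochastic Gradient Descent, BayesianRidge, KernelRidge, and ExtraTreesRegressor.

Evaluation function

Because this is a typical regression problem, a distance-based evaluation function suits it best. This article uses the mean squared error and calls the cross_val_score function from scikit-learn to evaluate each model. cross_val_score performs K-fold cross-validation: the training samples are split into K parts, one part is held out to validate the model (test set), and the other K-1 parts are used for training (train set). The procedure is repeated K times so that each part is used for validation exactly once, and the K results are averaged, or combined in some other way, into a single estimate. The advantage of this method is that every sample is used for both training and validation, which is useful when samples are scarce. Cross-validation estimates a model's predictive performance, in particular how a trained model behaves on new data. It can reduce overfitting to some extent and extracts as much useful information as possible from limited data. cross_val_score was applied to compute the score of each candidate model.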
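A minimal sketch of such an evaluation function is shown below. It assumes the score is reported as RMSE, obtained by taking the square root of the negated "neg_mean_squared_error" values that cross_val_score returns; the helper name rmse_cv, the 5 folds and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def rmse_cv(model, X, y, n_splits=5):
    """Return the per-fold RMSE of model under K-fold cross-validation."""
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=n_splits
    )
    return np.sqrt(-neg_mse)


# Synthetic regression data standing in for the processed training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

scores = rmse_cv(Ridge(alpha=1.0), X, y)
print(f"mean RMSE: {scores.mean():.4f}, std: {scores.std():.4f}")
```

scikit-learn scorers follow a "higher is better" convention, which is why the mean squared error comes back negated and has to be flipped before taking the root.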

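Hyperparameter tuning

Hyperparameters are parameters whose values are set before the learning process begins, rather than parameters obtained through training. They usually need to be optimized: a good set of hyperparameters is chosen for the learner to improve its performance. Many of the 12 candidate models have hyperparameters that must be set by hand, and it is not obvious at first how to set them. We use grid search to find the best parameters. Before searching, a parameter grid is prepared for each model, and GridSearchCV from scikit-learn is then called to search for the optimal or near-optimal parameters. Take Kernel Ridge regression as an example. Four hyperparameters of KernelRidge() are considered: alpha, kernel, degree and coef0. Based on experience, the parameter grid is set to param_grid={'alpha':[0.2,0.3,0.4],'kernel':["polynomial"],'degree':[3],'coef0':[0.8,1.0]}.

The best parameters in this grid are alpha: 0.2, coef0: 1, degree: 3, kernel: polynomial. Note that grid search cannot find the globally optimal parameters. It only finds the best combination within the specified grid, so the result may be suboptimal. The best hyperparameters of the other models can be found in the same way, as follows.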

Model | Best hyperparameters
--- | ---
Lasso | alpha: 0.005, max_iter: 10000
Ridge | alpha: 60
SVR | C: 13, epsilon: 0.009, gamma: 0.0004, kernel: rbf
ElasticNet | alpha: 0.005, l1_ratio: 0.08, max_iter: 10000
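The grid search described for Kernel Ridge regression can be sketched as follows, using the parameter grid quoted in the text. The synthetic data, the 5 folds and the scoring choice are assumptions, so the best parameters printed here need not match the ones reported in the article.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic data with a mild non-linearity.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

param_grid = {
    "alpha": [0.2, 0.3, 0.4],
    "kernel": ["polynomial"],
    "degree": [3],
    "coef0": [0.8, 1.0],
}
search = GridSearchCV(
    KernelRidge(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(best_rmse, 4))
```

The grid has 3 x 1 x 1 x 2 = 6 combinations, each fitted 5 times, which shows how quickly the cost grows when the search space is widened.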


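Model integration

Next comes model fusion. The weighted average method is used first: the 6 best-scoring candidate models are selected and assigned weights according to their scores. The models are:

Model | Weight
--- | ---
lasso = Lasso(alpha=0.0005, max_iter=10000) | 0.02
ridge = Ridge(alpha=60) | 0.2
svr = SVR(gamma=0.0004, kernel='rbf', C=13, epsilon=0.009) | 0.25
ker = KernelRidge(alpha=0.2, kernel='polynomial', degree=3, coef0=0.8) | 0.3
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000) | 0.03
bay = BayesianRidge() | 0.2

The final score after fusion is 0.1077, better than the score of any single model.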
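A weighted average of this kind can be sketched as a small scikit-learn style estimator. The class name WeightedAverage is an assumption for illustration, the data is synthetic, and coef0=0.8 for KernelRidge is taken from the search grid as an assumption; the other model settings and the weights follow the list above.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, Ridge
from sklearn.svm import SVR


class WeightedAverage(RegressorMixin, BaseEstimator):
    """Average the predictions of several regressors with fixed weights."""

    def __init__(self, models, weights):
        self.models = models
        self.weights = weights

    def fit(self, X, y):
        # Fit a fresh copy of every model on the same data.
        self.fitted_ = [clone(model).fit(X, y) for model in self.models]
        return self

    def predict(self, X):
        # One column of predictions per model, combined by the weights.
        predictions = np.column_stack([m.predict(X) for m in self.fitted_])
        return predictions @ np.asarray(self.weights)


models = [
    Lasso(alpha=0.0005, max_iter=10000),
    Ridge(alpha=60),
    SVR(gamma=0.0004, kernel="rbf", C=13, epsilon=0.009),
    KernelRidge(alpha=0.2, kernel="polynomial", degree=3, coef0=0.8),  # coef0 assumed
    ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000),
    BayesianRidge(),
]
weights = [0.02, 0.2, 0.25, 0.3, 0.03, 0.2]

# Synthetic data standing in for the processed training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=150)

ensemble = WeightedAverage(models, weights).fit(X, y)
blended = ensemble.predict(X)
print(blended[:3])
```

The weights sum to 1, so the blended prediction stays on the same scale as the individual predictions.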

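Next, the Stacking ensemble method is used. The Stacking procedure can be divided into three steps:

1. Train each model separately. First, cross-validation plus grid search gives the best hyperparameters of each sub-model. Then, with these hyperparameters, each round of cross-validation trains one model, and that model predicts both the validation fold and the test set. Over K rounds this yields one complete set of training-set predictions (each validation fold is predicted once) and K sets of test-set predictions. Averaging the K test-set predictions gives one complete training-set prediction and one test-set prediction per model.

2. Build the new training and test sets. Training the n sub-models separately gives n sets of training-set predictions (not averaged), which serve as n-dimensional features for the second-layer model. Likewise, the n sub-models give n sets of test-set predictions, which are the second-layer model's input at prediction time.

3. Train the second-layer model. Fit it on the training set made of the new features and use it to predict on the test set made of the new features.

After applying Stacking to the 6 sub-models above, the score is 0.1066, which is better than the weighted average method.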
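The three steps can be sketched as follows. The function name out_of_fold_features, the three first-layer models, the Ridge second-layer model and the synthetic data are all assumptions for illustration; the article uses its six tuned sub-models instead.

```python
import numpy as np
from sklearn.base import clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold


def out_of_fold_features(models, X, y, X_test, n_splits=5):
    """Build second-layer features: one column per first-layer model."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    train_features = np.zeros((X.shape[0], len(models)))
    test_features = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        test_fold_preds = np.zeros((X_test.shape[0], n_splits))
        for k, (fit_idx, val_idx) in enumerate(kfold.split(X)):
            fitted = clone(model).fit(X[fit_idx], y[fit_idx])
            # Each training row is predicted by a model that never saw it.
            train_features[val_idx, j] = fitted.predict(X[val_idx])
            test_fold_preds[:, k] = fitted.predict(X_test)
        # Average the K test-set predictions into one column.
        test_features[:, j] = test_fold_preds.mean(axis=1)
    return train_features, test_features


# Synthetic data standing in for the processed training and test sets.
rng = np.random.default_rng(0)
coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_train = rng.normal(size=(150, 5))
y_train = X_train @ coef + rng.normal(scale=0.1, size=150)
X_test = rng.normal(size=(40, 5))

first_layer = [
    Lasso(alpha=0.01),
    Ridge(alpha=1.0),
    KernelRidge(alpha=0.2, kernel="polynomial", degree=3, coef0=1.0),
]
meta_model = Ridge(alpha=1.0)  # second-layer model; this choice is an assumption

train_meta, test_meta = out_of_fold_features(first_layer, X_train, y_train, X_test)
meta_model.fit(train_meta, y_train)
stacked_prediction = meta_model.predict(test_meta)
print(train_meta.shape, test_meta.shape, stacked_prediction.shape)
```

Using out-of-fold predictions as second-layer inputs matters because predictions made on rows a model was trained on would look better than they are and mislead the second-layer model.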

After stacking, a prediction matrix of size (1458, 6) is obtained. This matrix helps when predicting on the whole test set: adding it to the training set expands the number of features, and the Stacking ensemble model is then retrained on the expanded training set. After retraining, the score is 0.1018. The retrained model clearly performs better, so it is used as the final prediction model. Its result on the test set is 0.1178. After the predictions were submitted, the project made it into the top 3% on Kaggle.
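The feature-expansion step can be sketched like this. It uses scikit-learn's cross_val_predict as a shortcut to obtain out-of-fold predictions, which is an assumption about how the matrix could be produced; the two base models and the synthetic data are also illustrative, so the shapes differ from the (1458, 6) matrix in the article.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the processed training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(
    scale=0.1, size=100
)

base_models = [Lasso(alpha=0.01), Ridge(alpha=1.0)]

# One column of out-of-fold predictions per base model: (n_samples, n_models).
oof_matrix = np.column_stack(
    [cross_val_predict(model, X_train, y_train, cv=5) for model in base_models]
)

# Append the prediction columns to the original features.
X_augmented = np.hstack([X_train, oof_matrix])
print(X_train.shape, oof_matrix.shape, X_augmented.shape)
```

The test set has to be expanded in the same way, with its own prediction columns, so that the retrained model sees the same number of features at prediction time.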


Origin blog.51cto.com/11855672/2558522