kaggle房价实战总结

简介：
通过描述爱荷华州埃姆斯住宅的79个特征属性，预测每个住宅的价格。
数据来源:House Prices: Advanced Regression Techniques
参考文献:Stacked Regressions to predict House Prices
COMPREHENSIVE DATA EXPLORATION WITH PYTHON
首先总结下kaggle流程：
1.识别问题
2.探索数据
3.数据清洗
4.特征工程
5.模型建立
6.集成学习
7.预测
识别问题是指具体研究什么；探索数据是对各列数据初步了解所代表的含义，是否存在缺失，可以使用各种方法对数据进行分析；数据清洗一般是针对缺失值和错误数据进行处理，使数据完整科学；特征工程表示把清洗后的数据转化为建立模型时所需要的数据类型和结构；模型建立是通过python的相关机器学习包搭建模型；集成学习是指对模型进行整合；最后对测试数据进行预测得出结论。

一、识别问题

预测房价

二、探索数据

首先导入所需库，%matplotlib inline功能是可以内嵌绘图，并且可以省略掉plt.show()这一步

%matplotlib inline  
import numpy as np # 基础计算
import pandas as pd # 数据处理   
import matplotlib.pyplot as plt  #绘图库
import seaborn as sns            #高级绘图库
color = sns.color_palette()      # 设置调色板，默认  
sns.set_style('darkgrid')       # 设置网格类型，默认
from scipy import stats         #统计函数库
from scipy.stats import norm, skew

导入数据

tr=open('E:\kaggle\房价\\train.csv')
te=open('E:\kaggle\房价\\test.csv')
train=pd.read_csv(tr)
test=pd.read_csv(te)
train.head(5)

test.head(5)

另外，从上面可以知道Id特征对分类没有影响，但最后我们得到结果以后提交的时候需要，所以先把Id列单独取出来，从训练集和测试集中去掉

print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
train_ID = train['Id']
test_ID = test['Id']
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))

通过以上操作对数据有了大体的认识，有1460组训练数据和1459组测试数据，数据的特征列有79个，其中35个是数值类型的，44个类别类型。
注意：以下属于进一步进行数据分析，目的是对数据有更多了解，可略过直接到数据处理部分。
待更新。。。

三、数据清洗

数据处理过程也可以看做是数据探索的延伸，可以边探索边处理
1.离群点处理：
即针对不合理的错误数据处理，以地上面积大小GrLivArea和房价的关系为例

fig, ax = plt.subplots()     #建立画布，默认一幅图
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

在数据集中有个别离群点，等于噪音，可将其删除掉，如果偏离的不是特别厉害要保留，保证泛化能力。如图可见，右下角的两个点明显与其他的点不在一个节奏上，这两个点是面积最大的两座房子，但是价格很低，虽然理论上有可能是在郊区，但只有两个属于小众划为离群点先删掉。

train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

2. 线性的模型需要正态分布的目标值才能发挥最大的作用。
先做房价的分布图

sns.distplot(train['SalePrice'] , fit=norm)     #displot集合了直方图和拟合曲线
(mu, sigma) = norm.fit(train['SalePrice'])     #求出正太分布的均值和标准差
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#legend目的显示图例
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')     
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

从上图可以看出整体向左偏，接着使用probplot函数绘制正态概率图，正态概率图用于检测一组数据是否服从正态分布，越趋向于一条直线越近似正态分布，具体可以参考： https://wenku.baidu.com/view/03c56baddd3383c4bb4cd2ae.html

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

由此可以看出是右偏态分布（或者通过分布图马尾偏向得出），关于偏态分布参考： https://baike.baidu.com/item/%E5%81%8F%E6%80%81%E5%88%86%E5%B8%83/445413?fr=aladdin
由于偏度较大，需要对目标值做log转换，以恢复目标值的正态性。

train["SalePrice"] = np.log1p(train["SalePrice"])   #log1p(x) := log(1+x)
sns.distplot(train['SalePrice'] , fit=norm);
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

这时可以看出偏度已正常，房价近似于正态分布。
3.在解决错误数据和目标值的分布问题后，接下来一直到特征工程的过程是对缺失值进行处理：
为了方便处理数据，先将训练集和测试集先进行合并

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

算上训练集1460和测试集1459组数据，再减到离群点的2组数据，对剩下的2917组数据进行缺失值处理
（1）首先整体看下各个特征缺失情况

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

再可视化一下

f, ax = plt.subplots(figsize=(10, 6))   #figsize画布大小
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

可见，类别型数据泳池质量PoolQC、类别数据未涉及特征MiscFeature、类别数据胡同类型Alley、类别数据栅栏质量Fence、类别数据壁炉质量FireplaceQu这五个特征缺失较大。同时也注意到车库Garage、地下室Bsmt这两个大类缺失数据的特征较多，后续缺失值处理时可按大类处理。

all_data_na[train.dtypes=='object'].index

有包括PoolQC在内的23个类别型数据缺失。

all_data_na[train.dtypes!='object'].index

11个数值型数据缺失

（2）接下来看看各个特征与房价的相关性，针对37个数据型特征，相关性的分析最好使用热力图

corrmat = train.corr()
plt.subplots(figsize=(10,10))
sns.heatmap(corrmat, vmax=0.9, square=True)

如图，颜色越浅相关性越高，其中相关性较高的亮点有：TotaLBsmtSF和1stFlrSF、GarageArea和GarageCar处，表明全部建筑面积TotaLBsmtSF与一层建筑面积1stFlrSF成强正相关，车库区域GarageArea和车库车辆GarageCar成强正相关。在填补缺失值的时候可以直接删掉一个多余的特征或者使用一个填补另一个。
（3）缺失值处理
缺失值处理时最好要看下特征有哪些属性，使得填充科学，通常方法：
缺失较大：类别型用NA替代，数据型用均值众数等填充（视特征而定）
缺失较小：类别用最多的属性替代，数据型使用均值众数等替代（视特征而定）
PoolQC 、MiscFeature 、Alley 、Fence 、FireplaceQu 这5个缺失较大的类别型数据用NA来代替

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")

对于街道连接房子之间的面积LotFrontage（数值型），这个可能与周边其他房子相似，缺失可以使用邻居的中值来替代。方法是利用groupby和transform（lambda），使用方法见数据分析-pandas常用方法个人总结的分组部分。

all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

针对缺失值相对较小的车库Garage、地下室Bsmt这两个大类，类别使用NA填充，数据用0填充

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

公共设施utilities这一列只有一组是NoSewr，所以研究意义不大，直接删掉该列。

all_data = all_data.drop(['Utilities'], axis=1)

余下都去缺失值不足1%的数据特征，直接无脑输出统一使用出现最多的属性填充。使用.mode()[0]实现

all_data['MasVnrArea'] = all_data['MasVnrArea'].fillna(all_data['MasVnrArea'].mode()[0])
all_data['MasVnrType'] = all_data['MasVnrType'].fillna(all_data['MasVnrType'].mode()[0])
all_data['Functional'] = all_data["Functional"].fillna(all_data['Functional'].mode()[0])
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

最后检查下还有没有缺失值

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

四、特征工程

目的是通过处理数据使得适合模型
针对类别数据此处有两种编码方式，LabelEncoder和OneHotEncoder。简单说LabelEncoder就是把每类特征的标签数字化，将标签值统一转换成对应range(标签类个数-1)范围内的固定值，比如0、1、2...；使用pd.get_dummies()的OneHotEncoder独热编码，可以将原特征列按标签分成多列，每列数据转化成0或1。总的来说，要是one hot encoding的类别数目不太多，建议优先考虑。
有的数据型特征实际上是类别型的特征。比如MSSubClass，是评价房子种类的一个特征，给出的是10-100的数字，但实际上是类别，所以将其转化为字符串类别。

all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

为了减少onehot编码后的特征空间，可以先把一部分类别特征使用LabelEncoder转化为数值(也可以省略直接进行哑变量转化)。LabelEncoder优点是方便不会产生额外的列，但是场景限制很多。比如有一列 [dog,cat,dog,mouse,cat]，我们把其转换为[1,2,1,3,2]。这里就产生了一个奇怪的现象：dog和mouse的平均值是cat。

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))
print('Shape all_data: {}'.format(all_data.shape))

接下来添加一个特征，实际在购买房子的时候会考虑总面积大小，目前数据集中并没有包含此特征。总面积等于地下室面积+1层面积+2层面积。

all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

接下来看下各数值型数据的分布偏度情况

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(15)

对不符合正态分布我们将其log转换，使其符合正态分布。

skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)

最后对剩下的类别特征进行哑变量转化，至此特征工程结束。

all_data = pd.get_dummies(all_data)
print(all_data.shape)

五、模型建立

首先训练和测试集分离，导入库

train = all_data[:ntrain]
test = all_data[ntrain:]
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

定义交叉验证函数，具体使用可以参考交叉验证在sklearn中的实现

n_folds = 5
def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

使用Lasso Regression套索回归、ElasticNet回归、 Ridge Regression岭回归、Gradient Boosting Regression、XGBoost、LightGBM六种回归技术建模
简单了解下各模型：

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

验证下各个模型的得分情况

score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

六、集成

这里使用Stacking模型融合：

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso,model_xgb,model_lgb,GBoost))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

可以看到有提高。
另外，还有一种Meta-model Stacking的方法，可以降低均方误差到0.078左右，效果更好，可以有效提高kaggle排名，具体代码可以参考开头提到的文献，此处不再分析了。

七、预测

averaged_models.fit(train.values, y_train)
stacked_pred = np.expm1(averaged_models.predict(test.values))
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = stacked_pred
sub.to_csv('E:\submission.csv',index=False)