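Data analysis regression problem: King County (USA) house price prediction training competition

This is a basic regression problem from the DC competition site (DC竞赛网): the King County, USA house price prediction training competition.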

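Task: using the given basic house information and sales information, build a regression model that predicts each house's sale price.

Data:

The data consists mainly of house sale prices and basic house information for King County, USA, from May 2014 to May 2015.

It is split into training data and test data, stored in kc_train.csv and kc_test.csv respectively.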

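The training data contains 10000 records with 14 fields, described below:

Column 1, "sale date": the date the house was sold, between May 2014 and May 2015

Column 2, "sale price": the transaction price in US dollars; this is the prediction target

Column 3, "bedrooms": number of bedrooms in the house

Column 4, "bathrooms": number of bathrooms in the house

Column 5, "floor space": living area of the house

Column 6, "parking space": area of the parking lot

Column 7, "floors": number of floors in the house

Column 8, "grade": overall grade given to the house by the King County grading system

Column 9, "covered area": building area of the house excluding the basement

Column 10, "basement area": area of the basement

Column 11, "build year": year the house was built

Column 12, "repair year": year the house was last repaired

Column 13, "latitude": latitude of the house

Column 14, "longitude": longitude of the house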

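The test data contains 3000 records with 13 fields. Unlike the training data, it does not include the sale price: the model built from the training data is applied to the test data to produce the corresponding predicted sale prices.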

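Scoring:

The quality of a regression model is measured by its mean prediction error; the smaller the error, the better the model. The mean prediction error is computed as follows:

$$\mathrm{mse} = \frac{1}{10000\,m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Here mse is the mean prediction error, m is the number of test records (3000), \hat{y}_i is the predicted price submitted by the participant and y_i is the true sale price of the corresponding house. The factor 10000 follows the mse_func shown in section 4.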
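As a quick illustration of the metric, here is a minimal NumPy sketch. The function name scaled_mse and the prices are made up for this example, and the division by 10000 is an assumption taken from the mse_func defined later in this post rather than from the competition page.

```python
import numpy as np


def scaled_mse(y_true, y_pred, scale=10000):
    """Mean squared error divided by `scale` (assumed to mirror mse_func)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / scale


# Two hypothetical houses, each mispredicted by 10,000 dollars
score = scaled_mse([300000, 450000], [310000, 440000])
print(score)  # 10000.0
```

Because the errors are squared, a few badly mispredicted expensive houses can dominate the score.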

1. Main function: load the data, preprocess it, build the prediction model and predict, then write out the predictions, in that order.

from kc_data_import import read_data
from kc_data_preprocessing import preprocessing
from kc_data_prediction import predict


def main():
    # Read the data; per the field list, latitude (column 13) comes before longitude (column 14)
    columns_test = ['date', 'bedroom', 'bathroom', 'floor space', 'parking space', 'floor', 'grade',
                    'covered area', 'basement area', 'build year', 'repair year', 'latitude', 'longitude']
    columns_train = ['date', 'price', 'bedroom', 'bathroom', 'floor space', 'parking space', 'floor', 'grade',
                     'covered area', 'basement area', 'build year', 'repair year', 'latitude', 'longitude']
    test = read_data('kc_test.csv', columns_test)
    train = read_data('kc_train.csv', columns_train)

    # Preprocess the data
    train_data, test_data = preprocessing(train, test)

    # Build the model and predict
    pred_y = predict(train_data, test_data, is_shuffle=False)

    # Write out the predictions
    pred_y.to_csv('./kc_pred_0925.csv', index=False, header=['price'])


if __name__ == '__main__':
    main()

2. Importing the data

The "sale date" values have the form 20150302; passing parse_dates=[0] to pd.read_csv converts them to date values when the file is read.
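To show what that conversion produces, here is a small self-contained sketch. The two rows are invented and only mimic the first columns of kc_train.csv. It uses an explicit pd.to_datetime call with a format string instead of parse_dates, an alternative that makes the expected layout of the date column explicit.

```python
import io

import pandas as pd

# Hypothetical two-row sample: sale date, price, bedrooms
sample = io.StringIO("20150302,545000,3\n20140512,221900,2\n")

# Read the date column as text so the format string can be applied to it
df = pd.read_csv(sample, header=None, names=['date', 'price', 'bedroom'],
                 dtype={'date': str})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

print(df['date'].dt.year.tolist())     # [2015, 2014]
print(df['date'].dt.quarter.tolist())  # [1, 2]
```

Once the column is a datetime, the year, quarter and month used during preprocessing are available through the .dt accessor.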

import os
import pandas as pd


def assert_msg(condition, msg):
    if not condition:
        raise Exception(msg)


def read_data(filename, columns):
    # Build the path relative to this module
    file_path = os.path.join(os.path.dirname(__file__), filename)

    # Check that the file actually exists
    assert_msg(os.path.exists(file_path), f'File not found: {file_path}')

    # Read the header-less CSV file
    return pd.read_csv(file_path,
                       header=None,
                       parse_dates=[0],  # e.g. 20150101 becomes the date 2015-01-01
                       names=columns
                       )

3. Data preprocessing

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
# pd.set_option('display.max_rows', None)


# Polar-coordinate conversion relative to the origin (x_min, y_min)
def polar_coordinates(x, y, x_min, y_min):
    dx, dy = x - x_min, y - y_min

    # Polar radius
    radius = np.sqrt(dx ** 2 + dy ** 2)

    # Polar angle in degrees; arctan2 stays defined when dx == 0
    angle = np.arctan2(dy, dx) * 180 / np.pi

    return radius, angle


# Polar coordinates of every location, measured from the minimum x and y
def get_radius_angle(loc_x, loc_y):
    loc_x = np.asarray(loc_x, dtype=float)
    loc_y = np.asarray(loc_y, dtype=float)

    # The conversion is vectorized, so no per-point loop is needed
    return polar_coordinates(loc_x, loc_y, loc_x.min(), loc_y.min())


def preprocessing(train, test):

    # Target: sale price
    n_train = len(train)
    temp_target = train.pop('price')

    # Concatenate the training and test sets so both get identical features
    data_all = pd.concat([train, test], ignore_index=True)

    # Raw features that are kept as they are
    columns = ['bedroom', 'bathroom', 'floor', 'grade',
               'floor space', 'parking space', 'covered area', 'basement area',
               ]
    temp_all = data_all[columns].copy()

    # Sale year, quarter and month
    temp_all['year'] = data_all['date'].dt.year
    temp_all['quarter'] = data_all['date'].dt.quarter
    temp_all['month'] = data_all['date'].dt.month

    # Whether the house has been repaired (a repair year of 0 means never)
    repaired = data_all['repair year'] > 0
    temp_all['is_repair'] = repaired.astype(int)

    # Whether the house has a basement
    temp_all['have_basement'] = (data_all['basement area'] > 0).astype(int)

    # Age of the building at the time of sale
    temp_all['building_age'] = temp_all['year'] - data_all['build year']

    # Years since the last repair; fall back to the building age if never repaired
    temp_all['repair_age'] = (temp_all['year'] - data_all['repair year']).where(
        repaired, temp_all['building_age'])

    # Bedroom / bathroom ratio (zeros are replaced by 1 to avoid division by zero)
    bedroom = data_all['bedroom'].replace(0, 1)
    bathroom = data_all['bathroom'].replace(0, 1)
    temp_all['b_b_ratio'] = bedroom / bathroom

    # Floor space / covered area ratio
    temp_all['f_c_ratio'] = temp_all['floor space'] / temp_all['covered area']

    # Floor space / parking space ratio
    temp_all['f_p_ratio'] = temp_all['floor space'] / temp_all['parking space']

    # Convert longitude and latitude to polar coordinates
    loc_x = data_all['longitude'].values
    loc_y = data_all['latitude'].values
    radius, angle = get_radius_angle(loc_x, loc_y)
    temp_all['radius'] = radius.round(decimals=8)
    temp_all['angle'] = angle.round(decimals=8)

    # One-hot encode the categorical features with get_dummies
    temp_all = pd.get_dummies(temp_all, columns=['year', 'quarter', 'month',
                                                 'bedroom', 'bathroom', 'floor',
                                                 'is_repair', 'have_basement'
                                                 ])

    # Split back into training and test sets by the original training size
    temp_train = temp_all.iloc[:n_train].copy()
    temp_test = temp_all.iloc[n_train:].copy()
    temp_train['price'] = temp_target.values

    return temp_train, temp_test
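The polar-coordinate feature used in the preprocessing step can be checked on a handful of points. The sketch below is a standalone illustration with made-up offsets, not real King County coordinates, and the helper name to_polar is hypothetical. It uses np.arctan2 and np.hypot, which stay well defined for the point that sits exactly at the origin.

```python
import numpy as np


def to_polar(x, y):
    """Radius and angle (degrees) measured from the minimum x and minimum y."""
    dx = np.asarray(x, dtype=float) - np.min(x)
    dy = np.asarray(y, dtype=float) - np.min(y)
    radius = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx))
    return radius, angle


# Made-up offsets: the origin, a 3-4-5 triangle corner, a point straight "up"
x = np.array([0.0, 3.0, 0.0])
y = np.array([0.0, 4.0, 2.0])
radius, angle = to_polar(x, y)

print(radius)  # radii 0, 5 and 2
print(angle)   # angles 0, about 53.13 and 90 degrees
```

Since every offset is non-negative, all angles fall between 0 and 90 degrees, so the pair (radius, angle) is just another encoding of where a house sits relative to the south-west corner of the data.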

4. Building the regression model

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler, LabelEncoder, Normalizer
from sklearn.feature_extraction import DictVectorizer
import warnings
warnings.filterwarnings('ignore')


def feat_standard(data):
    st_scaler = StandardScaler()
    st_scaler.fit(data)
    data = st_scaler.transform(data)
    return data


def feat_normalizer(data):
    no_scaler = Normalizer()
    no_scaler.fit(data)
    data = no_scaler.transform(data)
    return data


def feat_encoder(data, cols):
    for c in cols:
        lbl = LabelEncoder()
        lbl.fit(list(data[c].values))
        data[c] = lbl.transform(list(data[c].values))

    return data


def feat_dictvectorizer(train_x, valid_x):
    # Fit on the training rows only, then apply the same mapping to the validation rows
    dict_vec = DictVectorizer(sparse=False)
    train_x = dict_vec.fit_transform(train_x.to_dict(orient='records'))
    valid_x = dict_vec.transform(valid_x.to_dict(orient='records'))

    return train_x, valid_x


def mse_func(y_true, y_predict):
    assert isinstance(y_true, list), 'y_true must be type of list'
    assert isinstance(y_predict, list), 'y_predict must be type of list'

    m = len(y_true)
    squared_error = 0
    for i in range(m):
        error = y_true[i] - y_predict[i]
        squared_error = squared_error + error ** 2
    mse = squared_error / (10000 * m)
    return mse


def predict(train_, valid_, is_shuffle=True):
    print(f'data shape:\ntrain--{train_.shape}\nvalid--{valid_.shape}')
    # KFold only accepts a random_state when shuffle=True
    folds = KFold(n_splits=5, shuffle=is_shuffle,
                  random_state=1024 if is_shuffle else None)
    pred = [k for k in train_.columns if k not in ['price']]
    sub_preds = np.zeros((valid_.shape[0], folds.n_splits))
    print(f'Use {len(pred)} features ...')
    res_e = []

    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_, train_['price']), start=1):
        print(f'the {n_fold} training start ...')
        train_x, train_y = train_[pred].iloc[train_idx], train_['price'].iloc[train_idx]
        valid_x, valid_y = train_[pred].iloc[valid_idx], train_['price'].iloc[valid_idx]

        # Standardization is currently disabled
        feat_st_cols = ['floor space', 'parking space', 'covered area', 'building_age']
        # train_x[feat_st_cols] = feat_standard(train_x[feat_st_cols])
        # valid_x[feat_st_cols] = feat_standard(valid_x[feat_st_cols])

        # Vectorize all three splits with the same fitted DictVectorizer so the
        # column order at prediction time matches the order used for training
        dict_vec = DictVectorizer(sparse=False)
        train_x = dict_vec.fit_transform(train_x.to_dict(orient='records'))
        valid_x = dict_vec.transform(valid_x.to_dict(orient='records'))
        test_x = dict_vec.transform(valid_[pred].to_dict(orient='records'))

        base_tree = DecisionTreeRegressor(max_depth=30,
                                          min_samples_split=15,
                                          min_samples_leaf=10,
                                          max_features=50,
                                          random_state=11,
                                          max_leaf_nodes=350)

        # The base tree is passed positionally because its keyword name changed
        # between scikit-learn releases (base_estimator became estimator)
        reg = AdaBoostRegressor(base_tree, n_estimators=100)

        reg.fit(train_x, train_y)

        valid_pred = reg.predict(valid_x)
        tmp_score = mse_func(list(valid_y), list(valid_pred))
        res_e.append(tmp_score)

        sub_preds[:, n_fold - 1] = reg.predict(test_x)

    print('Mean score over 5 folds:', np.mean(res_e))
    # Average the per-fold test predictions without modifying the input frame
    return valid_.assign(price=np.mean(sub_preds, axis=1))['price']
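The fold-averaging pattern used by predict can be tried in isolation on synthetic data. The sketch below is only an illustration under stated assumptions: the random features, the target and the small tree and ensemble sizes are all invented so that it runs quickly, and they are not the settings or the data used in this post.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(20, 5))

folds = KFold(n_splits=5, shuffle=True, random_state=1024)
fold_preds = np.zeros((X_test.shape[0], folds.n_splits))
fold_scores = []

for k, (train_idx, valid_idx) in enumerate(folds.split(X)):
    # One boosted model per fold, trained on that fold's training rows only
    reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4, random_state=11),
                            n_estimators=20, random_state=0)
    reg.fit(X[train_idx], y[train_idx])

    # Validation error on the held-out rows of this fold
    valid_pred = reg.predict(X[valid_idx])
    fold_scores.append(np.mean((y[valid_idx] - valid_pred) ** 2))

    # Each fold's model also predicts the full test set
    fold_preds[:, k] = reg.predict(X_test)

# The final prediction is the average over the five fold models
final_pred = fold_preds.mean(axis=1)
print(final_pred.shape, np.mean(fold_scores))
```

Averaging the five models usually gives a smoother prediction than any single fold's model, and the mean validation score gives a rough estimate of the error to expect on submission.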

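5. Submit the predictions. (Figure: the score obtained by this model.) The data processing and the prediction model are still fairly rough and need further refinement.

Source code: my GitHub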

Fargo的火
