Used Car Transaction Price Prediction - [Feature Engineering]

This article is a record of my participation in a Tianchi competition. The series covers four parts: [Exploratory Data Analysis (EDA)], [Feature Engineering], [Modeling and Parameter Tuning], and [Model Ensembling]; this is the second part.

Competition link: https://tianchi.aliyun.com/competition/entrance/231784/information

Tutorial link: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.6.1cd8593aw4bbL5&postId=95501

I. Study Notes

 

Common feature engineering steps include:

  1. Outlier handling:
    • Detect and remove outliers with box plots (or the 3-sigma rule) (see the sketch after this list);
    • Box-Cox transformation (to handle skewed distributions);
    • Long-tail truncation;
  2. Feature normalization/standardization:
    • Standardization (transform to a standard normal distribution);
    • Min-max normalization (scale to the [0, 1] interval);
    • For power-law distributions, the transform log((1 + x) / (1 + median)) can be applied (also covered in the sketch after this list)
  3. Data binning:
    • Equal-frequency binning;
    • Equal-width binning;
    • Best-KS binning (similar to binary splitting with the Gini index);
    • Chi-square binning;
  4. Missing value handling:
    • Leave as-is (for tree models such as XGBoost);
    • Drop (when too much data is missing);
    • Impute, e.g. with the mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, or matrix completion;
    • Binning, with missing values placed in their own bin;
  5. Feature construction:
    • Aggregate statistics such as counts, sums, ratios, and standard deviations;
    • Time features, including relative and absolute time, holidays, weekends, etc.;
    • Geographic information, via binning, distribution encoding, and similar methods;
    • Non-linear transforms such as log, square, and square root;
    • Feature combinations and feature crosses;
    • Beyond that, it is largely a matter of judgment.
  6. Feature selection
    • Filter methods: select features first, then train the learner; common methods include Relief, variance thresholding, correlation coefficients, the chi-square test, and mutual information;
    • Wrapper methods: use the performance of the final learner directly as the criterion for evaluating feature subsets; a common method is LVW (Las Vegas Wrapper);
    • Embedded methods: combine ideas from filter and wrapper methods, performing feature selection automatically during learner training; Lasso regression is a common example;
  7. Dimensionality reduction
    • PCA / LDA / ICA;
    • Feature selection is itself a form of dimensionality reduction.
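As a concrete illustration of items 1 and 2 above, here is a minimal sketch of box-plot (IQR) based outlier clipping and the log((1 + x) / (1 + median)) transform. The helper names clip_outliers_iqr and log_over_median are my own, and the commented usage lines are only indicative; the tutorial's outliers_proc helper referenced later in the code works along similar lines.

import numpy as np

def clip_outliers_iqr(series, scale=3):
    # Box-plot style clipping: values outside [Q1 - scale*IQR, Q3 + scale*IQR]
    # are clipped to those boundaries instead of being dropped
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - scale * iqr, upper=q3 + scale * iqr)

def log_over_median(series):
    # log((1 + x) / (1 + median)) transform for power-law features
    return np.log1p(series) - np.log1p(series.median())

# Hypothetical usage on a numeric column such as 'power':
#Train_data['power'] = clip_outliers_iqr(Train_data['power'], scale=3)
#Train_data['power_log'] = log_over_median(Train_data['power'])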

Common data cleaning methods (summary figure missing from the original post):

Common feature construction methods (summary figure missing from the original post):

Common feature selection methods (summary figure missing from the original post); note that the larger a feature's variance, the more information it carries. A minimal variance-filter sketch is given below:
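The sketch below is my own illustration of that variance rule of thumb, using sklearn's VarianceThreshold as a simple filter; the 0.1 cutoff is arbitrary and the numeric_df name does not come from the original code.

from sklearn.feature_selection import VarianceThreshold

# Keep only the numeric columns (excluding the target 'price') and fill NaNs
# so the variance computation does not fail
numeric_df = Train_data.select_dtypes(include='number').drop(columns=['price']).fillna(0)
selector = VarianceThreshold(threshold=0.1)  # arbitrary cutoff, for illustration only
selector.fit(numeric_df)
print(numeric_df.columns[selector.get_support()])  # features whose variance exceeds the cutoff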

II. Code Practice

For this part of the study, I rewrote the training code along three lines: data preprocessing, data cleaning, and feature selection. Details below:

## .describe() shows summary statistics for the numeric columns
#print(Train_data.describe())  # power's maximum exceeds the upper limit given in the data description; seller and offerType are severely skewed

## Inspect the power column; long-tail values will be replaced with their mean
Train_data['power'].plot.hist()
#print(Train_data[Train_data['power']>1000].describe())  ## only 106 rows have power > 1000, and their mean is 4026, so 4026 is used as the replacement value
print(Train_data.info())  # notRepairedDamage is of object dtype and needs handling

# Next, inspect the distributions of the remaining features one by one
#Train_data['model'].plot.hist()
## Data preprocessing

## Outlier handling
Train_data.loc[Train_data['power']>1000,'power']=4026  ## open [question]: should the same replacement be applied to the Test set?

# Putting the training and test sets together makes feature construction easier
#Train_data['train']=1
#Test_data['train']=0
#data = pd.concat([Train_data, Test_data], ignore_index=True)

# Usage time: data['creatDate'] - data['regDate'] reflects how long the car has been in use; price is generally inversely related to usage time
# Note that some dates are malformed, so errors='coerce' is needed
Train_data['used_time'] = (pd.to_datetime(Train_data['creatDate'], format='%Y%m%d', errors='coerce') - \
                     pd.to_datetime(Train_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

Test_data['used_time'] = (pd.to_datetime(Test_data['creatDate'], format='%Y%m%d', errors='coerce') - \
                     pd.to_datetime(Test_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

# Extract city information from the region code, which injects some prior knowledge
Train_data['city'] = Train_data['regionCode'].apply(lambda x : str(x)[:-3])
Test_data['city'] = Test_data['regionCode'].apply(lambda x : str(x)[:-3])

#Train_data = outliers_proc(Train_data, 'power', scale=3)  # remove outliers using the outlier-processing helper
#Train_data['power'].plot.hist()

# Outlier handling
#print(Train_data['notRepairedDamage'].value_counts())  ## the column contains a '-' value; convert it to NaN first
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True)  # apply the same treatment to Test_data

# Drop the severely skewed columns
#del Train_data['seller']
#del Train_data['offerType']
#del Test_data['seller']
#del Test_data['offerType']

# Drop rows where price exceeds 30,000
#Train_data = Train_data[Train_data['price']<30000]
#Train_data.shape

## Convert the date columns and combine them into a car-age difference
#Train_data.loc[:, 'car_age'] = np.ceil(Train_data['creatDate']/10000-Train_data['regDate']/10000)
#Test_data.loc[:, 'car_age'] = np.ceil(Test_data['creatDate']/10000-Test_data['regDate']/10000)

# Drop columns that are no longer useful
#del Train_data['creatDate']
#del Train_data['regDate']
#del Test_data['creatDate']
#del Test_data['regDate']



# Compute per-brand sales statistics; statistics of other features can be computed similarly
# Note that the statistics must be computed from the training data only
#Train_gb = Train_data.groupby("brand")
#all_info = {}
#for kind, kind_data in Train_gb:
#    info = {}
#    kind_data = kind_data[kind_data['price'] > 0]
#    info['brand_amount'] = len(kind_data)
#    info['brand_price_max'] = kind_data.price.max()
#    info['brand_price_median'] = kind_data.price.median()
#    info['brand_price_min'] = kind_data.price.min()
#    info['brand_price_sum'] = kind_data.price.sum()
#    info['brand_price_std'] = kind_data.price.std()
#    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
#    all_info[kind] = info
#brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
#Train_data = Train_data.merge(brand_fe, how='left', on='brand')

# Data binning
power_bins = [i*10 for i in range(31)]
Train_data['power_bin'] = pd.cut(Train_data['power'], power_bins, labels=False)
Train_data[['power_bin', 'power']].head()

Test_data['power_bin'] = pd.cut(Test_data['power'], power_bins, labels=False)
Test_data[['power_bin', 'power']].head()

# Drop columns that are no longer needed
Train_data = Train_data.drop(['seller','offerType','creatDate', 'regDate', 'regionCode'], axis=1)
Test_data = Test_data.drop(['seller','offerType','creatDate', 'regDate', 'regionCode'], axis=1)

## Type conversion
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()

Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].fillna(-1)
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].fillna(-1)
Train_data['city'] = Train_data['city'].fillna(-1)
Test_data['city'] = Test_data['city'].fillna(-1)

Train_data['notRepairedDamage'] = lbl.fit_transform(Train_data['notRepairedDamage'].astype(float))
Test_data['notRepairedDamage'] = lbl.transform(Test_data['notRepairedDamage'].astype(float))  # reuse the encoder fitted on the training data


Train_data['city'] = Train_data['city'].apply(pd.to_numeric, errors='coerce').fillna(0.0)
Test_data['city'] = Test_data['city'].apply(pd.to_numeric, errors='coerce').fillna(0.0)


## Feature screening via Spearman correlation
# Spearman correlation of each candidate feature with price
feature_cols = ['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
                'power', 'kilometer', 'notRepairedDamage',
                'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
                'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14',
                'used_time', 'city']
# The brand statistics and power_bin can be appended once those features have been constructed:
#feature_cols += ['brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min',
#                 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin']
for col in feature_cols:
    print(col, Train_data[col].corr(Train_data['price'], method='spearman'))
SaleID -0.00043134439580398443
name -0.030872303402763084
model 0.10213736216208198
brand -0.10271577846270857
bodyType 0.1977078836415173
fuelType 0.33182858168936524
gearbox 0.3078480902335971
power 0.5773424318201822
kilometer -0.4097783640876424
notRepairedDamage 0.02841297002425253
v_0 0.8732246379713053
v_1 0.150471931826564
v_2 0.597624478928404
v_3 -0.9253051921242788
v_4 -0.14991114897234445
v_5 0.3536813781126098
v_6 0.42342376387355557
v_7 0.021379110886270435
v_8 0.83603428527076
v_9 -0.23560640645423808
v_10 -0.5071786096504789
v_11 -0.41002285614493744
v_12 0.8600648218739475
v_13 0.052532215258837014
v_14 0.18917541925746592
used_time -0.7847721137740148
city 0.005998112604842056
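A possible follow-up (my own sketch, not part of the tutorial) is to drop the features whose absolute Spearman correlation with price is negligible, such as SaleID and city above, reusing the feature_cols list defined earlier; the 0.01 cutoff is arbitrary.

# Hypothetical filter: drop features whose |Spearman correlation| with price falls below a cutoff
corr_with_price = {col: Train_data[col].corr(Train_data['price'], method='spearman')
                   for col in feature_cols}
weak_features = [col for col, corr in corr_with_price.items() if abs(corr) < 0.01]
print('dropping:', weak_features)
Train_data = Train_data.drop(columns=weak_features)
Test_data = Test_data.drop(columns=[c for c in weak_features if c in Test_data.columns])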
