These are my notes from a Tianchi competition. The series covers four parts: exploratory data analysis (EDA), feature engineering, modeling and parameter tuning, and model ensembling. This article is part two, on feature engineering.
Competition link: https://tianchi.aliyun.com/competition/entrance/231784/information
Tutorial link: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.6.1cd8593aw4bbL5&postId=95501
1. Study Notes
Common feature engineering steps include:
- Outlier handling:
  - remove outliers via box-plot (or 3-sigma) analysis;
  - Box-Cox transform (for skewed distributions);
  - long-tail truncation;
- Feature normalization / standardization:
  - standardization (transform to a standard normal distribution);
  - min-max normalization (scale to the [0, 1] interval);
  - for power-law distributions, use the formula log((1 + x) / (1 + median));
- Data binning:
  - equal-frequency binning;
  - equal-width binning;
  - Best-KS binning (similar to binary splitting with the Gini index);
  - chi-square binning;
- Missing-value handling:
  - leave as-is (for tree models such as XGBoost);
  - delete (when too much data is missing);
  - impute, e.g. with the mean / median / mode, model-based prediction, multiple imputation, compressed sensing, or matrix completion;
  - binning, with missing values as their own bin;
- Feature construction:
  - statistical features, including counts, sums, ratios, standard deviations, etc.;
  - time features, including relative and absolute time, holidays, weekends, etc.;
  - geographic information, including binning, distribution encoding, and similar methods;
  - non-linear transforms, including log, square, square root, etc.;
  - feature combinations and feature crosses;
  - beyond that, it is a matter of judgment and experience.
- Feature selection:
  - filter: select features first, then train the learner; common methods include Relief, variance threshold, correlation coefficient, chi-square test, and mutual information;
  - wrapper: use the performance of the final learner itself as the criterion for evaluating feature subsets; a common method is LVW (Las Vegas Wrapper);
  - embedded: a combination of filter and wrapper, where the learner performs feature selection automatically during training; a common example is lasso regression.
- Dimensionality reduction:
  - PCA / LDA / ICA;
  - feature selection is itself a form of dimensionality reduction.
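Several of the steps above can be sketched on a toy series. The data below is synthetic and purely illustrative; the thresholds (3×IQR, 3 buckets) are the conventional defaults, not values from the competition:

```python
import numpy as np
import pandas as pd

# Toy series standing in for a skewed numeric feature (synthetic)
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 6, 100], dtype=float)

# Box-plot rule: flag points outside [Q1 - 3*IQR, Q3 + 3*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
clean = s[(s >= q1 - 3 * iqr) & (s <= q3 + 3 * iqr)]

# Standardization: zero mean, unit variance
standardized = (clean - clean.mean()) / clean.std()

# Min-max normalization to [0, 1]
normalized = (clean - clean.min()) / (clean.max() - clean.min())

# Power-law / long-tail transform: log((1 + x) / (1 + median))
log_scaled = np.log1p(s) - np.log1p(s.median())

# Equal-width binning into 3 buckets
bucketed = pd.cut(clean, 3, labels=False)
```

The 100 is dropped by the IQR rule, and the remaining values scale into a bounded range before binning.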
Common data cleaning methods: [image missing]
Common feature construction methods: [image missing]
Common feature selection methods (the larger the variance, the more information): [image missing]
2. Code Practice
For this lesson, I rewrote the parts of the training code that handle data preprocessing, data cleaning, and feature selection. Details below:
import numpy as np
import pandas as pd
## .describe() shows summary statistics for the numeric columns
#print(Train_data.describe()) # power's max exceeds the upper limit stated in the data sheet; seller and offerType are heavily skewed
## Inspect the power column; replace the long-tail values with their mean
Train_data['power'].plot.hist()
#print(Train_data[Train_data['power']>1000].describe()) ## only 106 rows exceed 1000, and their mean is 4026, so use 4026
print(Train_data.info()) # notRepairedDamage is of type object and needs handling
# Next, check the distributions of the remaining features one by one
#Train_data['model'].plot.hist()
## Data preprocessing
## Outlier handling
Train_data.loc[Train_data['power']>1000,'power']=4026 ## open question: should the test set get the same replacement?
# Concatenating train and test makes feature construction easier
#Train_data['train']=1
#Test_data['train']=0
#data = pd.concat([Train_data, Test_data], ignore_index=True)
# Usage time: data['creatDate'] - data['regDate'] reflects how long the car has been in use;
# generally, price is inversely related to usage time.
# Note that some dates in the data are malformed, hence errors='coerce'
Train_data['used_time'] = (pd.to_datetime(Train_data['creatDate'], format='%Y%m%d', errors='coerce') - \
pd.to_datetime(Train_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
Test_data['used_time'] = (pd.to_datetime(Test_data['creatDate'], format='%Y%m%d', errors='coerce') - \
pd.to_datetime(Test_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
# Extract the city from the region code, which injects prior knowledge
Train_data['city'] = Train_data['regionCode'].apply(lambda x : str(x)[:-3])
Test_data['city'] = Test_data['regionCode'].apply(lambda x : str(x)[:-3])
#Train_data = outliers_proc(Train_data, 'power', scale=3) # remove outliers with an outlier-handling helper
#Train_data['power'].plot.hist()
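The `outliers_proc` helper referenced in the commented-out line is not shown in this post; the sketch below is my guess at its behavior (a box-plot rule with a configurable IQR multiple), on a synthetic frame:

```python
import pandas as pd

def outliers_proc_sketch(data, col_name, scale=3):
    """Hypothetical stand-in for the tutorial's outliers_proc:
    drop rows whose col_name value lies outside [Q1 - scale*IQR, Q3 + scale*IQR]."""
    q1 = data[col_name].quantile(0.25)
    q3 = data[col_name].quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - scale * iqr, q3 + scale * iqr
    kept = data[(data[col_name] >= low) & (data[col_name] <= high)]
    return kept.reset_index(drop=True)

# Synthetic power-like column with one extreme value
df = pd.DataFrame({'power': [60, 75, 80, 90, 95, 100, 110, 120, 19312]})
df_clean = outliers_proc_sketch(df, 'power', scale=3)
```

Unlike the mean replacement used above, this variant deletes the offending rows, which is why the tutorial applies it to the train set only.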
# Outlier handling
#print(Train_data['notRepairedDamage'].value_counts()) ## the column contains '-' entries; convert them to NaN first
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True) # apply the same treatment to Test_data
# Drop heavily skewed columns
#del Train_data['seller']
#del Train_data['offerType']
#del Test_data['seller']
#del Test_data['offerType']
# Drop rows with price above 30,000
#Train_data = Train_data[Train_data['price']<30000]
#Train_data.shape
## Convert the date columns and combine them into a time difference
#Train_data.loc[:, 'car_age'] = np.ceil(Train_data['creatDate']/10000-Train_data['regDate']/10000)
#Test_data.loc[:, 'car_age'] = np.ceil(Test_data['creatDate']/10000-Test_data['regDate']/10000)
# Drop columns that are no longer useful
#del Train_data['creatDate']
#del Train_data['regDate']
#del Test_data['creatDate']
#del Test_data['regDate']
# Compute sales statistics per brand; statistics over other features can be added the same way
# The statistics must be computed on the train data only
#Train_gb = Train_data.groupby("brand")
#all_info = {}
#for kind, kind_data in Train_gb:
# info = {}
# kind_data = kind_data[kind_data['price'] > 0]
# info['brand_amount'] = len(kind_data)
# info['brand_price_max'] = kind_data.price.max()
# info['brand_price_median'] = kind_data.price.median()
# info['brand_price_min'] = kind_data.price.min()
# info['brand_price_sum'] = kind_data.price.sum()
# info['brand_price_std'] = kind_data.price.std()
# info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
# all_info[kind] = info
#brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
#Train_data = Train_data.merge(brand_fe, how='left', on='brand')
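The commented-out block above merges `brand_fe` into `Train_data` only; if these features are used, the same train-derived table also needs to be merged into `Test_data`, never recomputed on test prices. A minimal sketch on synthetic frames (the column names follow the block above; the data is made up):

```python
import pandas as pd

# Synthetic stand-ins for Train_data / Test_data (illustration only)
train = pd.DataFrame({'brand': [0, 0, 1, 1, 1], 'price': [100, 300, 200, 400, 600]})
test = pd.DataFrame({'brand': [0, 1, 1]})

# Statistics computed on the training set only, to avoid target leakage
stats = train.groupby('brand')['price'].agg(
    brand_amount='count',
    brand_price_max='max',
    brand_price_median='median',
    brand_price_min='min',
    brand_price_sum='sum',
    brand_price_std='std',
    brand_price_average='mean',
).reset_index()

# Merge the same train-derived table into both sets
train = train.merge(stats, how='left', on='brand')
test = test.merge(stats, how='left', on='brand')
```

Brands that appear only in the test set would come through as NaN after the left merge and need a fill strategy.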
# Data binning (bins is a list of bucket edges; avoid shadowing the built-in bin)
bins = [i*10 for i in range(31)]
Train_data['power_bin'] = pd.cut(Train_data['power'], bins, labels=False)
Train_data[['power_bin', 'power']].head()
Test_data['power_bin'] = pd.cut(Test_data['power'], bins, labels=False)
Test_data[['power_bin', 'power']].head()
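`pd.cut` above produces equal-width buckets; equal-frequency binning, also listed in the notes, is available via `pd.qcut`. A toy comparison on synthetic values:

```python
import pandas as pd

# Synthetic power-like column (illustration only)
power = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Equal-width bins: each bucket spans the same value range
width_bins = pd.cut(power, 5, labels=False)

# Equal-frequency bins: each bucket holds (roughly) the same number of rows
freq_bins = pd.qcut(power, 5, labels=False)
```

Equal-frequency bins are often preferable for long-tailed columns like power, where equal-width buckets leave most rows in the first bucket.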
# Drop columns that are no longer needed
Train_data = Train_data.drop(['seller','offerType','creatDate', 'regDate', 'regionCode'], axis=1)
Test_data = Test_data.drop(['seller','offerType','creatDate', 'regDate', 'regionCode'], axis=1)
## Type conversion
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].fillna(-1)
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].fillna(-1)
Train_data['city'] = Train_data['city'].fillna(-1)
Test_data['city'] = Test_data['city'].fillna(-1)
Train_data['notRepairedDamage'] = lbl.fit_transform(Train_data['notRepairedDamage'].astype(float))
Test_data['notRepairedDamage'] = lbl.transform(Test_data['notRepairedDamage'].astype(float)) # reuse the train-fitted mapping so labels stay consistent across the two sets
Train_data['city'] = Train_data['city'].apply(pd.to_numeric, errors='coerce').fillna(0.0)
Test_data['city'] = Test_data['city'].apply(pd.to_numeric, errors='coerce').fillna(0.0)
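LabelEncoder mappings must stay consistent between train and test: fitting separately on each set can map the same category to different integers. A minimal illustration of the fit-on-train / transform-on-test pattern (toy strings, not the competition columns):

```python
import pandas as pd
from sklearn import preprocessing

# Toy categorical columns (illustration only)
train_col = pd.Series(['0.0', '1.0', '-1', '0.0'])
test_col = pd.Series(['1.0', '-1'])

lbl = preprocessing.LabelEncoder()
train_enc = lbl.fit_transform(train_col)  # learn the category -> integer mapping on train
test_enc = lbl.transform(test_col)        # reuse the exact same mapping on test
```

`transform` raises on categories unseen during `fit`, which is usually the desired behavior: it surfaces train/test mismatches instead of silently re-numbering.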
## Feature screening via Spearman correlation
feature_cols = ['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
                'power', 'kilometer', 'notRepairedDamage',
                'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',
                'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'used_time', 'city']
# Uncomment to also check the brand statistics features once they are merged in:
#feature_cols += ['brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min',
#                 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin']
for col in feature_cols:
    print(col, Train_data[col].corr(Train_data['price'], method='spearman'))
SaleID -0.00043134439580398443
name -0.030872303402763084
model 0.10213736216208198
brand -0.10271577846270857
bodyType 0.1977078836415173
fuelType 0.33182858168936524
gearbox 0.3078480902335971
power 0.5773424318201822
kilometer -0.4097783640876424
notRepairedDamage 0.02841297002425253
v_0 0.8732246379713053
v_1 0.150471931826564
v_2 0.597624478928404
v_3 -0.9253051921242788
v_4 -0.14991114897234445
v_5 0.3536813781126098
v_6 0.42342376387355557
v_7 0.021379110886270435
v_8 0.83603428527076
v_9 -0.23560640645423808
v_10 -0.5071786096504789
v_11 -0.41002285614493744
v_12 0.8600648218739475
v_13 0.052532215258837014
v_14 0.18917541925746592
used_time -0.7847721137740148
city 0.005998112604842056
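A common follow-up to this printout is filtering features by absolute correlation. The sketch below hardcodes a rounded subset of the values printed above and applies a threshold of 0.1; the threshold is my own choice for illustration, not a value from the tutorial:

```python
# Spearman correlations with price, copied (rounded) from the output above
corrs = {
    'SaleID': -0.0004, 'name': -0.0309, 'model': 0.1021, 'brand': -0.1027,
    'bodyType': 0.1977, 'fuelType': 0.3318, 'gearbox': 0.3078,
    'power': 0.5773, 'kilometer': -0.4098, 'notRepairedDamage': 0.0284,
    'v_0': 0.8732, 'v_3': -0.9253, 'v_8': 0.8360, 'v_12': 0.8601,
    'used_time': -0.7848, 'city': 0.0060,
}

# Keep features whose absolute correlation exceeds the chosen threshold
threshold = 0.1  # a judgment call, not prescribed by the tutorial
selected = [f for f, r in corrs.items() if abs(r) >= threshold]
dropped = [f for f, r in corrs.items() if abs(r) < threshold]
```

On these numbers, SaleID, name, notRepairedDamage, and city would be dropped, which matches the intuition from the printout that they carry little rank information about price.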