Datawhale Data Mining for Beginners - Task 3: Feature Engineering

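3. Feature Engineering Goals

Tip: This is the Task 3 (feature engineering) part of the Data Mining for Beginners series. It walks through a range of feature engineering and analysis methods; feedback and discussion are welcome.

Competition: Data Mining for Beginners - Used Car Price Prediction

Link: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

3.1 Feature Engineering Goals

  • Analyze the features further and process the data accordingly.

  • Complete the feature engineering analysis, summarize the data with charts or text, and check in.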

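3.2 Overview

Common feature engineering steps include:

  1. Outlier handling:
    • Remove outliers identified with a box plot (or the 3-sigma rule);
    • Box-Cox transform (for skewed distributions);
    • Long-tail truncation;
  2. Feature normalization/standardization:
    • Standardization (rescale to zero mean and unit variance);
    • Normalization (rescale to the [0, 1] interval);
    • For power-law distributions, the transform $\log\left(\frac{1+x}{1+\mathrm{median}}\right)$ can be used;
  3. Data binning:
    • Equal-frequency binning;
    • Equal-width binning;
    • Best-KS binning (similar to binary splitting with the Gini index);
    • Chi-square binning;
  4. Missing-value handling:
    • Leave as is (for tree models such as XGBoost);
    • Delete (when too much data is missing);
    • Imputation, including mean/median/mode/model-based prediction/multiple imputation/compressed-sensing completion/matrix completion;
    • Binning, with missing values placed in their own bin;
  5. Feature construction:
    • Statistical features, including counts, sums, ratios, standard deviations, etc.;
    • Time features, including relative and absolute time, holidays, weekends, etc.;
    • Geographic information, including binning, distribution encoding, etc.;
    • Nonlinear transforms, including log, square, square root, etc.;
    • Feature combination and feature crossing;
    • Beyond these, it is a matter of judgment and creativity.
  6. Feature selection:
    • Filter: select features first, then train the learner; common methods include Relief, variance threshold, correlation coefficient, chi-square test, and mutual information;
    • Wrapper: use the performance of the final learner directly as the evaluation criterion for a feature subset; a common method is LVW (Las Vegas Wrapper);
    • Embedded: combines the filter and wrapper ideas, with feature selection happening automatically while the learner is trained; a common example is Lasso regression;
  7. Dimensionality reduction:
    • PCA/LDA/ICA;
    • Feature selection is also a form of dimensionality reduction.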
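
As a rough illustration of items 2 and 3 above, the sketch below applies standardization, min-max normalization, the log transform, and the two simplest binning schemes to a made-up series. The values in `x` and the choice of 4 bins are assumptions for illustration only and do not come from the competition data.

```python
import numpy as np
import pandas as pd

# Toy, made-up values with a long right tail (not taken from the competition data).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])

# Standardization: zero mean, unit variance (population std, ddof=0).
x_std = (x - x.mean()) / x.std(ddof=0)

# Min-max normalization to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Transform for power-law style data: log((1 + x) / (1 + median)).
x_log = np.log((1 + x) / (1 + x.median()))

# Equal-width binning: 4 bins of identical width over [min, max].
# The single large value pushes every other sample into the first bin.
equal_width = pd.cut(x, bins=4, labels=False)   # [0, 0, 0, 0, 0, 0, 0, 3]

# Equal-frequency binning: 4 bins holding the same number of samples.
equal_freq = pd.qcut(x, q=4, labels=False)      # [0, 0, 1, 1, 2, 2, 3, 3]

print(equal_width.tolist())
print(equal_freq.tolist())
```

With a long-tailed feature, equal-width bins tend to leave most bins empty, which is one reason equal-frequency binning or a log transform is often tried first.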

3.3 Code Examples

3.3.0 Loading the Data

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

%matplotlib inline
path = './data/'
train = pd.read_csv(path+'train.csv', sep=' ')
test = pd.read_csv(path+'testA.csv', sep=' ')
print(train.shape)
print(test.shape)
(150000, 31)
(50000, 30)
train.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
test.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')

3.3.1 Removing Outliers

# A reusable wrapper for outlier handling; call it on any column.
def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses the box-plot rule with scale=3
    :param data: a pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiple of the IQR used as the cut-off
    :return: a copy of data with the outlier rows removed
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule
        :param data_ser: a pandas.Series
        :param box_scale: multiple of the IQR used as the cut-off
        :return: (low mask, high mask), (lower bound, upper bound)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))  # box_scale times the box height (IQR)
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)  # boolean mask, usable for filtering rows
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()  # work on a copy of the data
    data_series = data_n[col_name]  # select the requested column
    rule, value = box_plot_outliers(data_series, box_scale=scale)  # masks and bounds for the outliers
    # Positions of the outliers; this assumes the default RangeIndex, so positions equal labels
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)  # drop the outlier rows
    data_n.reset_index(drop=True, inplace=True)  # rows were removed, so rebuild the index
    print("Now column number is: {}".format(data_n.shape[0]))  # despite the wording, this is the number of rows left
    index_low = np.arange(data_series.shape[0])[rule[0]]  # positions of the low-side outliers
    outliers = data_series.iloc[index_low]  # values of the low-side outliers
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())  # summary of the low-side outliers
    index_up = np.arange(data_series.shape[0])[rule[1]]  # positions of the high-side outliers
    outliers = data_series.iloc[index_up]  # values of the high-side outliers
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())  # summary of the high-side outliers

    # Box plots before (left) and after (right) removing the outliers
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
pd.set_option('display.max_columns', 100 )
train.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 149999.000000 150000.000000 145494.000000 141320.000000 144019.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.0 1.500000e+05 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000
mean 74999.500000 68349.172873 2.003417e+07 47.129021 8.052733 1.792369 0.375842 0.224943 119.316547 12.597160 2583.077267 0.000007 0.0 2.016033e+07 5923.327333 44.406268 -0.044809 0.080765 0.078833 0.017875 0.248204 0.044923 0.124692 0.058144 0.061996 -0.001000 0.009035 0.004813 0.000313 -0.000688
std 43301.414527 61103.875095 5.364988e+04 49.536040 7.864956 1.760640 0.548677 0.417546 177.168419 3.919576 1885.363218 0.002582 0.0 1.067328e+02 7501.998477 2.457548 3.641893 2.929618 2.026514 1.193661 0.045804 0.051743 0.201410 0.029186 0.035692 3.772386 3.286071 2.517478 1.288988 1.038685
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 0.000000 0.000000 0.0 2.015062e+07 11.000000 30.451976 -4.295589 -4.470671 -7.275037 -4.364565 0.000000 0.000000 0.000000 0.000000 0.000000 -9.168192 -5.558207 -9.639552 -4.153899 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 1018.000000 0.000000 0.0 2.016031e+07 1300.000000 43.135799 -3.192349 -0.970671 -1.462580 -0.921191 0.243615 0.000038 0.062474 0.035334 0.033930 -3.722303 -1.951543 -1.871846 -1.057789 -0.437034
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 2196.000000 0.000000 0.0 2.016032e+07 3250.000000 44.610266 -3.052671 -0.382947 0.099722 -0.075910 0.257798 0.000812 0.095866 0.057014 0.058484 1.624076 -0.358053 -0.130753 -0.036245 0.141246
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 3843.000000 0.000000 0.0 2.016033e+07 7700.000000 46.004721 4.000670 0.241335 1.565838 0.868758 0.265297 0.102009 0.125243 0.079382 0.087491 2.844357 1.255022 1.776933 0.942813 0.680378
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 19312.000000 15.000000 8120.000000 1.000000 0.0 2.016041e+07 99999.000000 52.304178 7.320308 19.035496 9.854702 6.829352 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418
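# We can drop some outlier rows, taking power as an example.
# Whether to drop them at all is a judgment call.
# Note, however, that rows in the test set must never be dropped; doing so would only be fooling ourselves.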
fig, ax = plt.subplots(1,1, figsize=(10, 7))
#train.describe()
#sns.boxplot(y=data['power'],data=train, palette="Set1", ax=ax)
train = outliers_proc(train, 'power', scale=3)


pd.set_option('display.max_columns', 100 )
train.describe()
Delete number is: 963
Now column number is: 149037
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64
SaleID name regDate model brand bodyType fuelType gearbox power kilometer regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 149037.000000 149037.000000 1.490370e+05 149036.000000 149037.000000 144543.000000 140405.000000 143083.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.0 1.490370e+05 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000 149037.000000
mean 75000.810040 68266.301730 2.003396e+07 46.969712 8.028973 1.785503 0.377380 0.221564 114.615686 12.611959 2583.189544 0.000007 0.0 2.016033e+07 5759.707328 44.386358 -0.040915 0.077332 0.095597 0.024832 0.248081 0.044970 0.124692 0.057905 0.062213 0.004712 0.022959 -0.017478 0.006048 -0.000010
std 43312.158963 61114.029665 5.361493e+04 49.347667 7.845709 1.754134 0.548392 0.415300 64.189762 3.909222 1885.675848 0.002590 0.0 1.070463e+02 6998.871286 2.445414 3.639592 2.934615 2.016049 1.191609 0.045855 0.051710 0.201773 0.029033 0.035631 3.771117 3.284766 2.502568 1.288631 1.034682
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 0.000000 0.000000 0.0 2.015062e+07 11.000000 30.451976 -4.295589 -4.470671 -7.275037 -4.364565 0.000000 0.000000 0.000000 0.000000 0.000000 -8.798810 -5.403044 -9.639552 -4.153899 -6.546556
25% 37485.000000 11093.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 1018.000000 0.000000 0.0 2.016031e+07 1300.000000 43.127305 -3.191726 -0.973892 -1.438553 -0.913525 0.243519 0.000035 0.062308 0.035211 0.034192 -3.720560 -1.938099 -1.880377 -1.054625 -0.435075
50% 74985.000000 51489.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 109.000000 15.000000 2196.000000 0.000000 0.0 2.016032e+07 3200.000000 44.595651 -3.051661 -0.388056 0.114803 -0.067013 0.257691 0.000807 0.095763 0.056789 0.058789 1.637443 -0.347436 -0.148725 -0.028004 0.140485
75% 112532.000000 118779.000000 2.007110e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 3843.000000 0.000000 0.0 2.016033e+07 7500.000000 45.979786 4.000020 0.234213 1.573818 0.873620 0.265204 0.101998 0.125148 0.079051 0.087643 2.849450 1.262799 1.747968 0.947472 0.678217
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 375.000000 15.000000 8120.000000 1.000000 0.0 2.016041e+07 99999.000000 52.304178 7.320308 19.035496 9.854702 6.829352 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418

[Figure: output_13_2.png]

[Figure: output_13_3.png]
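
The overview in 3.2 mentions the 3-sigma rule as an alternative to the box plot. The following is a minimal sketch comparing the two on made-up values; the series `s` and the threshold `scale = 3` are assumptions for illustration, not the competition data.

```python
import pandas as pd

# Made-up values for illustration; 500 plays the role of an implausible reading.
s = pd.Series([60, 75, 90, 100, 110, 120, 150, 500])

# Box-plot rule: flag points further than `scale` * IQR beyond the quartiles.
scale = 3
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
box_mask = (s < q1 - scale * iqr) | (s > q3 + scale * iqr)

# 3-sigma rule: flag points more than 3 standard deviations from the mean.
sigma_mask = (s - s.mean()).abs() > 3 * s.std()

print(s[box_mask].tolist())    # [500]
print(s[sigma_mask].tolist())  # []
```

On a sample this small the 3-sigma rule flags nothing, because the extreme value inflates the very standard deviation it is measured against. The quartile-based rule is less affected, which is one argument for using it as the default.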

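3.3.2 Feature Construction

# Put the training and test sets together to make feature construction easier
train['train']=1
test['train']=0
data = pd.concat([train, test], ignore_index=True, sort=False)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same preview
pd.concat([data.head(), data.tail()])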
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1
199032 199995 20903 19960503 4.0 4 4.0 0.0 0.0 116 15.0 0.0 3219 0 0 20160320 NaN 45.621391 5.958453 -0.918571 0.774826 -2.021739 0.284664 0.130044 0.049833 0.028807 0.004616 -5.978511 1.303174 -1.207191 -1.981240 -0.357695 0
199033 199996 708 19991011 0.0 0 0.0 0.0 0.0 75 15.0 0.0 1857 0 0 20160329 NaN 43.935162 4.476841 -0.841710 1.328253 -1.292675 0.268101 0.108095 0.066039 0.025468 0.025971 -3.913825 1.759524 -2.075658 -1.154847 0.169073 0
199034 199997 6693 20040412 49.0 1 0.0 1.0 1.0 224 15.0 0.0 3452 0 0 20160305 NaN 46.537137 4.170806 0.388595 -0.704689 -1.480710 0.269432 0.105724 0.117652 0.057479 0.015669 -4.639065 0.654713 1.137756 -1.390531 0.254420 0
199035 199998 96900 20020008 27.0 1 0.0 0.0 1.0 334 15.0 0.0 1998 0 0 20160404 NaN 46.771359 -3.296814 0.243566 -1.277411 -0.404881 0.261152 0.000490 0.137366 0.086216 0.051383 1.833504 -2.828687 2.465630 -0.911682 -2.057353 0
199036 199999 193384 20041109 166.0 6 1.0 NaN 1.0 68 9.0 0.0 3276 0 0 20160322 NaN 43.731010 -3.121867 0.027348 -0.808914 2.116551 0.228730 0.000300 0.103534 0.080625 0.124264 2.914571 -1.135270 0.547628 2.094057 -1.552150 0
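
Once features have been built on the combined frame, the `train` flag is what lets the two parts be separated again. A minimal sketch of that round trip, using tiny made-up frames (`train_demo`, `test_demo`, and the `power_k` feature are hypothetical names for illustration):

```python
import pandas as pd

# Tiny stand-ins for the real train/test frames (values are made up).
train_demo = pd.DataFrame({"power": [60, 163], "price": [1850, 6222]})
test_demo = pd.DataFrame({"power": [116, 75]})

# Tag each row with its origin, then stack the two frames.
train_demo["train"] = 1
test_demo["train"] = 0
both = pd.concat([train_demo, test_demo], ignore_index=True, sort=False)

# Feature construction on the combined frame would go here.
both["power_k"] = both["power"] / 1000  # hypothetical feature, for illustration only

# Recover the two parts; test rows carry NaN in price, so drop that column there.
train_back = both[both["train"] == 1].drop(columns="train")
test_back = both[both["train"] == 0].drop(columns=["train", "price"])

print(train_back.shape, test_back.shape)
```

Features that use the target (for example, per-brand price statistics) should be computed from the training rows only, since the test rows have no price.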
# Usage time: data['creatDate'] - data['regDate'] gives the age of the car in days; in general, the longer a car has been in use, the lower its price
# Note that some dates in the data are malformed, so we need errors='coerce'
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                            pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
#data['test1']=pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce')
#data['test2']=pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')
#data['test']=(pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce'))
pd.concat([data.head(), data.tail()])
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0
199032 199995 20903 19960503 4.0 4 4.0 0.0 0.0 116 15.0 0.0 3219 0 0 20160320 NaN 45.621391 5.958453 -0.918571 0.774826 -2.021739 0.284664 0.130044 0.049833 0.028807 0.004616 -5.978511 1.303174 -1.207191 -1.981240 -0.357695 0 7261.0
199033 199996 708 19991011 0.0 0 0.0 0.0 0.0 75 15.0 0.0 1857 0 0 20160329 NaN 43.935162 4.476841 -0.841710 1.328253 -1.292675 0.268101 0.108095 0.066039 0.025468 0.025971 -3.913825 1.759524 -2.075658 -1.154847 0.169073 0 6014.0
199034 199997 6693 20040412 49.0 1 0.0 1.0 1.0 224 15.0 0.0 3452 0 0 20160305 NaN 46.537137 4.170806 0.388595 -0.704689 -1.480710 0.269432 0.105724 0.117652 0.057479 0.015669 -4.639065 0.654713 1.137756 -1.390531 0.254420 0 4345.0
199035 199998 96900 20020008 27.0 1 0.0 0.0 1.0 334 15.0 0.0 1998 0 0 20160404 NaN 46.771359 -3.296814 0.243566 -1.277411 -0.404881 0.261152 0.000490 0.137366 0.086216 0.051383 1.833504 -2.828687 2.465630 -0.911682 -2.057353 0 NaN
199036 199999 193384 20041109 166.0 6 1.0 NaN 1.0 68 9.0 0.0 3276 0 0 20160322 NaN 43.731010 -3.121867 0.027348 -0.808914 2.116551 0.228730 0.000300 0.103534 0.080625 0.124264 2.914571 -1.135270 0.547628 2.094057 -1.552150 0 4151.0
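#data['test'] = data['test'].fillna(data['test'].mean())  # this seems to be the right call
# Fill the rows whose dates could not be parsed with the mean usage time
data['used_time'] = data['used_time'].fillna(data['used_time'].mean())
pd.concat([data.head(), data.tail()])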
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.000000
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.000000
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.000000
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.000000
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.000000
199032 199995 20903 19960503 4.0 4 4.0 0.0 0.0 116 15.0 0.0 3219 0 0 20160320 NaN 45.621391 5.958453 -0.918571 0.774826 -2.021739 0.284664 0.130044 0.049833 0.028807 0.004616 -5.978511 1.303174 -1.207191 -1.981240 -0.357695 0 7261.000000
199033 199996 708 19991011 0.0 0 0.0 0.0 0.0 75 15.0 0.0 1857 0 0 20160329 NaN 43.935162 4.476841 -0.841710 1.328253 -1.292675 0.268101 0.108095 0.066039 0.025468 0.025971 -3.913825 1.759524 -2.075658 -1.154847 0.169073 0 6014.000000
199034 199997 6693 20040412 49.0 1 0.0 1.0 1.0 224 15.0 0.0 3452 0 0 20160305 NaN 46.537137 4.170806 0.388595 -0.704689 -1.480710 0.269432 0.105724 0.117652 0.057479 0.015669 -4.639065 0.654713 1.137756 -1.390531 0.254420 0 4345.000000
199035 199998 96900 20020008 27.0 1 0.0 0.0 1.0 334 15.0 0.0 1998 0 0 20160404 NaN 46.771359 -3.296814 0.243566 -1.277411 -0.404881 0.261152 0.000490 0.137366 0.086216 0.051383 1.833504 -2.828687 2.465630 -0.911682 -2.057353 0 4441.030582
199036 199999 193384 20041109 166.0 6 1.0 NaN 1.0 68 9.0 0.0 3276 0 0 20160322 NaN 43.731010 -3.121867 0.027348 -0.808914 2.116551 0.228730 0.000300 0.103534 0.080625 0.124264 2.914571 -1.135270 0.547628 2.094057 -1.552150 0 4151.000000
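# Check for missing values. About 15k samples have invalid dates; we could delete them or leave them alone.
# Deleting is not recommended, because they make up too large a share of the samples (7.5%).
# Tree models such as XGBoost can handle missing values on their own, so leaving them as NaN is also an option.
# The count here is 0 because used_time was already filled with its mean above.
data['used_time'].isnull().sum()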
0
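
To see what `errors='coerce'` does, here is a small sketch using two `regDate` values that appear in the tables above: `20040402` is valid, while `20020008` has month "00". The series names are made up for the example, and the `astype(str)` step is an extra precaution rather than part of the original code.

```python
import pandas as pd

# Dates in the dataset's YYYYMMDD integer form; the second one has month "00".
reg = pd.Series([20040402, 20020008])
created = pd.Series([20160404, 20160404])

# Convert to strings first, then parse; unparseable dates become NaT.
reg_dt = pd.to_datetime(reg.astype(str), format="%Y%m%d", errors="coerce")
created_dt = pd.to_datetime(created.astype(str), format="%Y%m%d", errors="coerce")

# Subtracting NaT yields NaT, so .dt.days gives NaN for the malformed row.
used_days = (created_dt - reg_dt).dt.days
print(used_days.tolist())  # first row is 4385 days, second row is NaN

# Option used in this post: fill the gaps with the mean of the valid rows.
filled = used_days.fillna(used_days.mean())
print(filled.isna().sum())
```

Without `errors='coerce'` the malformed date would raise an exception instead of becoming a missing value.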
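# Extract city information from the postal code. The data comes from Germany, so we follow the German postal code layout, which amounts to adding prior knowledge
data['city'] = data['regionCode'].apply(lambda x: int(str(x)[:-3]) if len(str(x)) > 3 else 0)
data.info()
pd.concat([data.head(), data.tail()])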
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199037 entries, 0 to 199036
Data columns (total 34 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             199037 non-null  int64  
 1   name               199037 non-null  int64  
 2   regDate            199037 non-null  int64  
 3   model              199036 non-null  float64
 4   brand              199037 non-null  int64  
 5   bodyType           193130 non-null  float64
 6   fuelType           187512 non-null  float64
 7   gearbox            191173 non-null  float64
 8   power              199037 non-null  int64  
 9   kilometer          199037 non-null  float64
 10  notRepairedDamage  199037 non-null  object 
 11  regionCode         199037 non-null  int64  
 12  seller             199037 non-null  int64  
 13  offerType          199037 non-null  int64  
 14  creatDate          199037 non-null  int64  
 15  price              149037 non-null  float64
 16  v_0                199037 non-null  float64
 17  v_1                199037 non-null  float64
 18  v_2                199037 non-null  float64
 19  v_3                199037 non-null  float64
 20  v_4                199037 non-null  float64
 21  v_5                199037 non-null  float64
 22  v_6                199037 non-null  float64
 23  v_7                199037 non-null  float64
 24  v_8                199037 non-null  float64
 25  v_9                199037 non-null  float64
 26  v_10               199037 non-null  float64
 27  v_11               199037 non-null  float64
 28  v_12               199037 non-null  float64
 29  v_13               199037 non-null  float64
 30  v_14               199037 non-null  float64
 31  train              199037 non-null  int64  
 32  used_time          199037 non-null  float64
 33  city               199037 non-null  int64  
dtypes: float64(22), int64(11), object(1)
memory usage: 51.6+ MB
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.000000 1
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.000000 4
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.000000 2
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.000000 0
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.000000 6
199032 199995 20903 19960503 4.0 4 4.0 0.0 0.0 116 15.0 0.0 3219 0 0 20160320 NaN 45.621391 5.958453 -0.918571 0.774826 -2.021739 0.284664 0.130044 0.049833 0.028807 0.004616 -5.978511 1.303174 -1.207191 -1.981240 -0.357695 0 7261.000000 3
199033 199996 708 19991011 0.0 0 0.0 0.0 0.0 75 15.0 0.0 1857 0 0 20160329 NaN 43.935162 4.476841 -0.841710 1.328253 -1.292675 0.268101 0.108095 0.066039 0.025468 0.025971 -3.913825 1.759524 -2.075658 -1.154847 0.169073 0 6014.000000 1
199034 199997 6693 20040412 49.0 1 0.0 1.0 1.0 224 15.0 0.0 3452 0 0 20160305 NaN 46.537137 4.170806 0.388595 -0.704689 -1.480710 0.269432 0.105724 0.117652 0.057479 0.015669 -4.639065 0.654713 1.137756 -1.390531 0.254420 0 4345.000000 3
199035 199998 96900 20020008 27.0 1 0.0 0.0 1.0 334 15.0 0.0 1998 0 0 20160404 NaN 46.771359 -3.296814 0.243566 -1.277411 -0.404881 0.261152 0.000490 0.137366 0.086216 0.051383 1.833504 -2.828687 2.465630 -0.911682 -2.057353 0 4441.030582 1
199036 199999 193384 20041109 166.0 6 1.0 NaN 1.0 68 9.0 0.0 3276 0 0 20160322 NaN 43.731010 -3.121867 0.027348 -0.808914 2.116551 0.228730 0.000300 0.103534 0.080625 0.124264 2.914571 -1.135270 0.547628 2.094057 -1.552150 0 4151.000000 3
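
The string slicing above drops the last three digits of `regionCode`. For non-negative integer codes, integer division by 1000 gives the same result; the sketch below checks this on the first five `regionCode` values shown in the table (the variable names are made up for the example).

```python
import pandas as pd

# regionCode values from the first five rows shown above.
codes = pd.Series([1046, 4366, 2806, 434, 6977])

# String-slicing version, as in the post: drop the last three digits.
city_str = codes.apply(lambda x: int(str(x)[:-3]) if len(str(x)) > 3 else 0)

# Arithmetic version: integer division by 1000 (assumes non-negative integers).
city_div = codes // 1000

print(city_str.tolist())  # [1, 4, 2, 0, 6]
print(city_div.tolist())  # [1, 4, 2, 0, 6]
```

The arithmetic version is vectorized, so it avoids a Python-level function call per row, which can matter on a frame of roughly 200,000 rows.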
#train.groupby("brand")["price"].describe()
train.groupby("brand").describe()
SaleID name regDate model bodyType fuelType gearbox ... v_9 v_10 v_11 v_12 v_13 v_14 train
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
brand
0 31429.0 75135.898979 43216.003175 14.0 37611.00 75259.0 112411.00 149988.0 31429.0 68280.543002 61481.259318 0.0 10298.00 51720.0 119576.00 196811.0 31429.0 2.002938e+07 57745.773061 19910001.0 19981111.00 20030204.0 20071101.00 20151208.0 31429.0 26.238124 34.731275 0.0 0.0 8.0 44.0 230.0 30295.0 1.612642 1.576731 0.0 0.0 1.0 3.0 7.0 29572.0 0.425064 0.534715 0.0 0.0 0.0 1.0 6.0 30017.0 0.143985 ... 0.070044 0.181968 31429.0 0.202655 3.819637 -8.251269 -3.755349 1.812951 2.969662 12.285299 31429.0 0.130102 3.414541 -5.114536 -1.856988 -0.342067 1.237629 18.379089 31429.0 0.010158 2.442354 -8.679290 -1.839407 -0.197639 1.741408 12.157611 31429.0 -0.226931 0.860948 -2.125022 -1.041485 -0.223171 0.366677 2.289283 31429.0 0.258134 0.921088 -3.960301 -0.080196 0.288050 0.872530 2.300021 31429.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
1 13656.0 74957.974297 43258.259449 1.0 37470.00 74582.5 112659.50 149994.0 13656.0 62886.934827 61452.088129 45.0 6693.00 40935.0 113516.50 196781.0 13656.0 2.004154e+07 57048.978974 19910001.0 20000102.00 20050111.0 20081104.00 20151210.0 13656.0 53.608890 29.804153 1.0 40.0 49.0 65.0 247.0 13359.0 1.591586 1.634055 0.0 0.0 2.0 2.0 7.0 13043.0 0.524649 0.531108 0.0 0.0 1.0 1.0 6.0 13210.0 0.340424 ... 0.043838 0.193907 13656.0 -0.931170 3.883604 -8.051120 -4.793600 0.988510 2.287106 12.103094 13656.0 -0.416706 2.959515 -5.161408 -2.353655 -0.733364 1.107900 18.408500 13656.0 1.170914 2.357446 -5.969378 -0.683091 1.265707 2.820971 12.973057 13656.0 -1.061234 0.851262 -2.521308 -1.719965 -1.109461 -0.533658 4.907355 13656.0 -0.049340 0.875329 -3.683050 -0.452126 0.086691 0.566010 2.021592 13656.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
2 318.0 78087.817610 43396.063117 1460.0 39411.25 80894.5 116589.50 148937.0 318.0 78339.481132 55498.550938 855.0 29776.00 70464.5 117160.00 196276.0 318.0 2.004285e+07 62012.317580 19910304.0 20000028.00 20050158.5 20090405.25 20151112.0 318.0 84.937107 96.134973 1.0 2.0 19.0 197.0 207.0 314.0 5.990446 0.169300 3.0 6.0 6.0 6.0 6.0 306.0 0.725490 0.619204 0.0 0.0 1.0 1.0 2.0 312.0 0.753205 ... 0.063097 0.132520 318.0 -1.465617 3.744038 -7.448232 -5.287060 0.740907 1.678765 11.312121 318.0 -1.997598 2.749805 -5.040389 -3.657794 -2.923828 -0.379690 17.834444 318.0 1.534046 1.989458 -6.188436 0.151433 1.262676 2.837065 8.071152 318.0 -0.528789 0.664283 -1.792006 -1.012964 -0.696783 -0.122290 1.082124 318.0 -1.543233 1.369426 -4.953890 -2.186592 -1.535352 -0.716360 0.996607 318.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
3 2461.0 76863.838683 43479.432350 25.0 38348.00 78782.0 115192.00 149975.0 2461.0 70155.264120 59515.861583 4.0 15034.00 56292.0 116235.00 196716.0 2461.0 2.006858e+07 44660.632387 19920107.0 20040206.00 20070705.0 20100910.00 20151208.0 2461.0 58.330760 50.853115 1.0 3.0 87.0 87.0 193.0 2422.0 1.678778 1.275721 0.0 1.0 2.0 2.0 7.0 2374.0 0.396799 0.512061 0.0 0.0 0.0 1.0 3.0 2413.0 0.120182 ... 0.102473 0.190810 2461.0 -0.587122 3.517722 -7.081028 -4.048563 1.250506 2.386922 11.367486 2461.0 -0.329133 2.608295 -4.698429 -2.096946 -0.678162 1.083719 17.760794 2461.0 1.075896 2.009365 -5.889687 -0.304789 1.022185 2.460299 10.826623 2461.0 0.920820 0.938277 -1.944913 0.419845 0.990531 1.442011 3.774539 2461.0 -1.307435 1.164774 -3.975906 -2.162526 -1.171653 -0.430467 1.954672 2461.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
4 16575.0 74424.163801 43141.145888 6.0 37016.50 74228.0 111890.00 149986.0 16575.0 64769.137376 62658.750071 18.0 5555.00 45038.0 117141.50 196812.0 16575.0 2.003306e+07 53330.969493 19910001.0 19990706.00 20040012.0 20071009.00 20151212.0 16575.0 18.339970 30.471381 1.0 4.0 4.0 13.0 245.0 16210.0 1.648982 1.986454 0.0 0.0 0.0 2.0 7.0 15762.0 0.440426 0.541446 0.0 0.0 0.0 1.0 6.0 16130.0 0.356293 ... 0.036444 0.138871 16575.0 -0.644018 3.854532 -8.192090 -4.906936 1.398704 2.331440 12.010943 16575.0 -0.641841 2.964420 -5.088259 -2.582502 -1.397307 1.004037 18.341317 16575.0 0.968379 2.217971 -7.882603 -0.743230 0.931882 2.488241 13.083661 16575.0 -1.268478 0.589530 -2.908192 -1.820144 -1.293616 -0.770327 1.549990 16575.0 -0.000511 0.672059 -4.476282 -0.350878 0.005552 0.455194 2.166396 16575.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
5 4662.0 74953.461819 43625.801338 4.0 37394.75 76202.5 112649.75 149969.0 4662.0 66926.721793 60586.323126 6.0 10626.50 48965.0 117049.50 196770.0 4662.0 2.003742e+07 44883.289281 19910008.0 20010004.00 20040206.5 20070404.50 20150907.0 4662.0 25.874517 40.397238 1.0 5.0 5.0 19.0 117.0 4542.0 1.999339 1.520976 0.0 1.0 1.0 4.0 7.0 4386.0 0.262882 0.482746 0.0 0.0 0.0 0.0 6.0 4507.0 0.060794 ... 0.093820 0.173234 4662.0 0.296861 3.435354 -6.538353 -3.091683 1.859932 3.009183 12.148416 4662.0 0.341170 3.056326 -4.226201 -1.461243 0.024974 1.517538 18.630773 4662.0 -0.890986 2.047665 -7.233860 -2.242361 -1.035290 0.325174 11.369898 4662.0 0.835170 0.803478 -1.567229 0.421948 0.914513 1.355521 3.062267 4662.0 0.413911 0.571758 -2.512437 0.074772 0.404666 0.816989 2.228885 4662.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
6 10193.0 74865.558030 43552.630106 0.0 37048.00 74727.0 112624.00 149990.0 10193.0 67759.671049 61047.953072 7.0 10249.00 50267.0 117899.00 196763.0 10193.0 2.003343e+07 49126.243235 19910004.0 20000005.00 20030504.0 20070307.00 20151210.0 10193.0 50.601001 32.602429 1.0 30.0 46.0 69.0 236.0 9823.0 1.739998 1.477240 0.0 1.0 1.0 2.0 7.0 9451.0 0.331605 0.507588 0.0 0.0 0.0 1.0 5.0 9741.0 0.079355 ... 0.088849 0.174175 10193.0 0.453570 3.751521 -7.417392 -3.318933 2.097469 3.290343 12.357011 10193.0 0.647936 3.466200 -4.644868 -1.297772 0.104708 1.730303 18.765443 10193.0 -1.049689 2.415648 -8.293423 -2.783272 -1.408243 0.606464 11.741638 10193.0 0.503864 0.827696 -1.663949 -0.018258 0.465792 1.047989 3.503605 10193.0 0.562719 0.793285 -4.100143 0.163964 0.617502 1.076853 2.467846 10193.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
7 2360.0 75250.205932 43678.252285 9.0 37552.25 74878.0 113342.25 149850.0 2360.0 69299.372034 58127.088681 9.0 17252.75 51459.5 116960.75 196742.0 2360.0 2.002641e+07 50934.788003 19910002.0 19990307.75 20030005.0 20060903.00 20151201.0 2360.0 67.324576 58.199505 1.0 7.0 78.0 90.0 195.0 2293.0 2.193197 1.888402 0.0 0.0 2.0 4.0 7.0 2243.0 0.236737 0.472855 0.0 0.0 0.0 0.0 3.0 2278.0 0.056190 ... 0.097156 0.154929 2360.0 0.043372 3.698984 -7.335358 -3.682216 1.797291 2.898247 11.363643 2360.0 -0.079306 3.112975 -4.762878 -1.830181 -0.447594 1.017011 18.478554 2360.0 -0.608090 2.274225 -6.211157 -2.247694 -0.774330 0.897483 12.433202 2360.0 0.569155 0.857917 -1.487128 0.167047 0.624866 1.245085 2.403954 2360.0 -0.725277 0.844597 -3.576179 -1.277176 -0.659159 -0.143988 1.800145 2360.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
8 2070.0 74586.202415 44205.548259 120.0 35745.75 76107.5 111959.50 149992.0 2070.0 72592.696618 58963.388886 13.0 18045.00 59220.0 120232.75 196568.0 2070.0 2.003449e+07 56140.318328 19910002.0 19990807.50 20030605.5 20080482.50 20151111.0 2070.0 80.477778 62.427835 1.0 32.0 32.0 129.0 204.0 2007.0 2.112606 2.169138 0.0 1.0 1.0 3.0 7.0 1936.0 0.264979 0.478513 0.0 0.0 0.0 1.0 5.0 1999.0 0.118559 ... 0.126626 0.182241 2070.0 0.497674 3.536123 -7.035689 -2.837964 1.518881 3.211395 11.426592 2070.0 0.052955 3.427019 -4.865708 -1.648288 -0.278306 1.369966 18.504017 2070.0 -0.549322 2.654042 -7.585131 -2.729013 -1.041358 1.662843 9.340888 2070.0 1.136608 1.208165 -1.755694 0.667204 1.377709 1.960064 3.639345 2070.0 -0.559776 1.068399 -4.548068 -1.071647 -0.407986 0.125283 1.869232 2070.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
9 7299.0 75216.050418 43212.153514 10.0 38410.50 74754.0 112200.50 149987.0 7299.0 70244.821619 60841.985920 14.0 13946.00 56217.0 119169.50 196713.0 7299.0 2.002628e+07 45428.588454 19910003.0 19990809.00 20020603.0 20051004.50 20151208.0 7299.0 42.144266 36.923623 1.0 10.0 22.0 66.0 123.0 7019.0 1.718906 1.372095 0.0 1.0 1.0 3.0 7.0 6722.0 0.235347 0.522330 0.0 0.0 0.0 0.0 6.0 6973.0 0.073139 ... 0.111572 0.184351 7299.0 0.933918 3.472885 -6.263700 -2.706162 2.477623 3.492634 11.894036 7299.0 0.854147 3.520184 -4.321031 -1.089517 0.122418 1.932959 18.819042 7299.0 -1.572967 2.195678 -7.407455 -3.127826 -1.945874 -0.323995 9.711988 7299.0 1.297567 0.794103 -1.347980 0.834339 1.297944 1.770357 3.632836 7299.0 0.234535 0.751815 -2.813847 -0.220190 0.332027 0.719093 2.278235 7299.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
10 13994.0 75205.874803 43439.758881 3.0 37610.50 75121.0 112569.50 149998.0 13994.0 67331.798342 60753.607940 16.0 11242.00 48541.0 117395.00 196800.0 13994.0 2.003084e+07 52059.493889 19910002.0 19991205.00 20030405.5 20070309.00 20151207.0 13994.0 38.545591 37.370739 1.0 17.0 31.0 33.0 226.0 13713.0 1.962809 2.024122 0.0 0.0 2.0 3.0 7.0 13398.0 0.505896 0.560648 0.0 0.0 0.0 1.0 6.0 13564.0 0.575420 ... 0.044004 0.125604 13994.0 -0.787944 3.789618 -8.436945 -4.676338 1.234128 2.326256 12.118343 13994.0 -0.595754 2.864221 -5.366580 -2.480332 -0.995418 0.903215 18.455196 13994.0 0.860639 2.138816 -7.195385 -0.722816 0.663825 2.270567 13.562011 13994.0 -1.111276 0.690632 -2.828945 -1.641285 -1.178183 -0.596863 1.446954 13994.0 -0.215205 0.902067 -6.113291 -0.552021 -0.095289 0.356898 2.113267 13994.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
11 2944.0 75153.740829 42823.474531 83.0 38343.75 75868.5 112185.00 149997.0 2944.0 69966.151834 61340.869319 19.0 12643.00 52956.0 119103.75 196785.0 2944.0 2.004489e+07 52721.427797 19910003.0 20000510.50 20040611.5 20090404.50 20151205.0 2944.0 83.091712 49.708246 1.0 60.0 60.0 116.0 184.0 2852.0 1.104839 1.168938 0.0 0.0 1.0 1.0 7.0 2785.0 0.303052 0.482534 0.0 0.0 0.0 1.0 3.0 2843.0 0.046782 ... 0.113617 0.189313 2944.0 0.246968 3.590947 -6.711672 -3.468983 1.772204 3.089890 11.675463 2944.0 0.439107 3.252243 -3.911240 -1.528835 0.028555 1.630662 18.484508 2944.0 -0.205660 2.523470 -6.919479 -2.195230 -0.213088 1.690328 11.443174 2944.0 1.330700 0.964266 -1.619776 0.704863 1.250365 2.050045 3.896473 2944.0 -0.648424 0.922214 -4.576412 -0.930180 -0.530655 -0.100611 1.702045 2944.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
12 1108.0 74192.030686 42167.462987 345.0 38334.50 73457.5 109422.75 149629.0 1108.0 73052.243682 59664.008784 21.0 17032.00 61923.5 119556.75 196793.0 1108.0 2.001690e+07 55599.544665 19910003.0 19970411.00 20010810.5 20060904.00 20151210.0 1108.0 55.358303 61.001523 1.0 15.0 15.0 131.0 176.0 1067.0 2.029053 2.268162 0.0 0.0 1.0 5.0 7.0 1044.0 0.182950 0.625487 0.0 0.0 0.0 0.0 5.0 1082.0 0.118299 ... 0.093852 0.139933 1108.0 0.301023 3.609198 -6.522588 -3.271745 2.008425 2.908080 11.364633 1108.0 0.069176 3.483745 -4.505058 -1.983762 -0.510518 1.050628 18.213683 1108.0 -0.546464 2.234541 -5.240744 -2.385848 -0.719024 1.012884 9.901890 1108.0 0.538105 0.925150 -1.562079 0.036841 0.437971 1.314252 2.635682 1108.0 -0.436398 0.925910 -3.297603 -0.823304 -0.329716 0.170607 1.689872 1108.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
13 3813.0 75807.485969 43369.185052 45.0 37548.00 75584.0 113804.00 149917.0 3813.0 69418.422502 60943.542910 22.0 10870.00 54468.0 120432.00 196756.0 3813.0 2.003626e+07 51050.488935 19910003.0 20000301.00 20030512.0 20080402.00 20151204.0 3813.0 75.765539 75.052396 1.0 16.0 19.0 164.0 228.0 3647.0 1.538799 1.379087 0.0 1.0 1.0 2.0 7.0 3536.0 0.228224 0.526208 0.0 0.0 0.0 0.0 6.0 3616.0 0.024336 ... 0.124263 0.176603 3813.0 0.974184 3.544076 -6.729835 -2.633895 2.398146 3.467563 11.974500 3813.0 0.905273 3.601515 -3.991513 -1.144972 0.195962 1.960395 18.423083 3813.0 -1.115010 2.419075 -6.660521 -2.940837 -1.543928 0.554868 10.251221 3813.0 1.420437 1.222494 -1.314333 0.772731 1.583979 2.442350 3.990237 3813.0 0.146962 0.693859 -2.795332 -0.176193 0.238874 0.552159 2.039379 3813.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
14 16073.0 74965.405400 43273.533270 7.0 37602.00 74681.0 112612.00 149970.0 16073.0 68175.404343 61311.451932 27.0 11440.00 50531.0 118903.00 196795.0 16073.0 2.002211e+07 50572.857718 19910001.0 19981201.00 20011107.0 20060301.00 20151101.0 16073.0 53.728862 38.714159 1.0 26.0 48.0 73.0 217.0 15518.0 1.621601 1.457703 0.0 1.0 1.0 2.0 7.0 14883.0 0.251092 0.516146 0.0 0.0 0.0 0.0 6.0 15382.0 0.101547 ... 0.095271 0.177625 16073.0 0.628787 3.556379 -6.866449 -3.019136 2.207823 3.344779 12.034736 16073.0 0.635476 3.371117 -4.474906 -1.263805 0.108876 1.680342 18.746229 16073.0 -1.346439 2.356970 -9.639552 -3.098923 -1.703890 0.123554 10.681064 16073.0 0.897454 0.792761 -1.401177 0.255511 0.882386 1.475575 3.462262 16073.0 0.416238 0.985198 -3.412186 -0.002721 0.585602 1.044119 2.743993 16073.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
15 1458.0 75836.777778 42919.175888 2.0 39051.25 74694.0 113072.75 149878.0 1458.0 54366.361454 59170.040930 496.0 5185.00 25150.0 99853.00 196730.0 1458.0 2.007449e+07 38355.204312 19911010.0 20050202.25 20080107.5 20100909.00 20151101.0 1458.0 91.209191 52.526146 1.0 20.0 115.0 115.0 208.0 1430.0 1.951049 1.551219 0.0 1.0 1.0 4.0 7.0 1421.0 0.104856 0.308765 0.0 0.0 0.0 0.0 2.0 1434.0 0.068340 ... 0.081460 0.158983 1458.0 -1.638239 3.886271 -6.725212 -5.291464 -3.185584 2.086631 10.866295 1458.0 -0.200208 2.910238 -4.449206 -1.998758 -0.043786 1.805868 18.026409 1458.0 2.254994 1.428281 -5.134526 1.352491 2.064085 3.018247 10.737935 1458.0 -0.021208 0.878520 -2.068686 -0.602893 -0.168820 0.539850 2.368825 1458.0 -0.584835 1.093781 -4.302939 -1.218293 -0.244310 0.129999 1.787471 1458.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
16 2219.0 73605.734114 43241.191320 24.0 36326.50 73669.0 110490.50 149967.0 2219.0 69268.566021 62853.985952 32.0 7904.00 53724.0 122929.00 196810.0 2219.0 2.005215e+07 42439.839640 19950003.0 20020007.50 20050309.0 20081210.00 20151203.0 2219.0 35.250113 43.977668 1.0 21.0 21.0 21.0 169.0 2179.0 2.031666 1.535143 0.0 1.0 1.0 4.0 7.0 2128.0 0.184680 0.446698 0.0 0.0 0.0 0.0 6.0 1828.0 0.833698 ... 0.079165 0.175786 2219.0 0.232958 3.355363 -6.007778 -3.303033 1.975813 2.850004 12.062635 2219.0 0.023905 2.749959 -3.747783 -1.521359 -0.554427 1.770565 17.938531 2219.0 -0.006882 1.636191 -4.818190 -1.197193 -0.141742 1.050875 9.422496 2219.0 0.356979 0.919330 -1.609028 -0.338461 0.323656 0.799167 3.687518 2219.0 0.283270 0.975365 -3.641332 0.062668 0.456052 0.922392 1.673653 2219.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
17 913.0 76391.495071 43001.469264 127.0 42395.00 76124.0 114306.00 149902.0 913.0 74632.959474 59677.582227 63.0 19057.00 60976.0 123652.00 196611.0 913.0 2.003280e+07 40929.411229 19910009.0 20001201.00 20030312.0 20060701.00 20151012.0 913.0 53.189485 44.684545 1.0 19.0 35.0 55.0 234.0 876.0 1.410959 1.752078 0.0 0.0 1.0 2.0 7.0 845.0 0.340828 0.503355 0.0 0.0 0.0 1.0 2.0 879.0 0.067122 ... 0.112651 0.162548 913.0 0.416094 3.722685 -6.531455 -3.334165 2.140332 3.061162 12.180503 913.0 0.266060 3.476105 -4.446526 -1.645925 -0.416363 1.441443 18.198690 913.0 -0.719356 2.331593 -6.479402 -2.384324 -1.261515 0.950344 9.375095 913.0 0.910572 1.047373 -1.642076 0.148259 1.109650 1.731515 2.921252 913.0 -0.056988 0.883488 -1.889225 -0.860490 -0.030175 0.765552 1.649960 913.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
18 315.0 77633.352381 42433.451781 189.0 41542.00 77918.0 116743.50 149618.0 315.0 82412.244444 58198.407943 67.0 30605.00 70827.0 137309.50 193771.0 315.0 2.001261e+07 51348.748716 19910411.0 19971208.00 20000609.0 20050109.50 20150808.0 315.0 100.771429 70.724403 1.0 37.0 72.0 149.0 211.0 302.0 1.645695 1.573334 0.0 0.0 2.0 2.0 7.0 291.0 0.154639 0.491448 0.0 0.0 0.0 0.0 3.0 306.0 0.160131 ... 0.110953 0.157512 315.0 0.887018 3.563810 -6.671605 -2.337864 2.286829 3.274892 11.002059 315.0 0.166556 3.711119 -4.157351 -1.944996 -0.785210 1.349590 18.702045 315.0 -0.595573 2.602414 -5.592622 -2.584811 -1.188427 1.283593 10.152891 315.0 0.851147 0.982307 -1.331271 0.144258 0.752683 1.595511 3.357300 315.0 -0.977489 1.076947 -3.405637 -1.797545 -0.772322 -0.296763 1.913713 315.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
19 1386.0 75560.020924 43706.641419 108.0 36555.50 76038.5 113820.75 149928.0 1386.0 78210.521645 59147.648804 141.0 25283.50 67916.5 126183.00 196712.0 1386.0 2.002140e+07 52415.137034 19910002.0 19980705.00 20010802.0 20060705.00 20150905.0 1386.0 111.703463 74.065212 1.0 38.0 59.0 178.0 233.0 1361.0 2.072006 1.587391 0.0 2.0 2.0 2.0 6.0 1322.0 0.482602 0.627469 0.0 0.0 0.0 1.0 5.0 1345.0 0.318216 ... 0.100461 0.168084 1386.0 0.458372 3.322205 -7.524217 -2.880910 1.835827 3.014082 12.169001 1386.0 -0.467656 2.853088 -4.801876 -2.113042 -0.981589 1.185698 17.911345 1386.0 -0.300360 2.549155 -6.156476 -2.391214 -0.695172 1.675446 10.368790 1386.0 0.314171 1.097748 -1.889330 -0.408416 0.170611 0.989450 3.056540 1386.0 -0.829898 1.467238 -4.522631 -2.368066 -0.222489 0.294714 2.166740 1386.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
20 1235.0 72852.993522 42770.279495 270.0 36730.00 71808.0 109085.00 149923.0 1235.0 75610.621053 57519.555046 69.0 25344.00 62591.0 124960.50 196521.0 1235.0 2.002248e+07 49976.251607 19910005.0 19990006.00 20010911.0 20060301.00 20150912.0 1235.0 96.965182 72.403099 1.0 19.0 71.0 148.0 225.0 1175.0 1.939574 2.228954 0.0 0.0 1.0 3.0 7.0 1142.0 0.237303 0.518451 0.0 0.0 0.0 0.0 5.0 1190.0 0.164706 ... 0.111390 0.177199 1235.0 0.785034 3.588094 -7.092171 -2.418497 2.126236 3.257914 11.325653 1235.0 0.307363 3.950357 -4.688188 -1.583725 -0.531796 1.171405 18.635290 1235.0 -1.070952 2.549985 -9.223993 -3.036032 -1.425503 0.578787 8.568987 1235.0 0.920612 1.223262 -1.451778 -0.168708 1.165518 1.948650 3.089669 1235.0 -0.227102 1.388666 -4.633527 -0.666089 0.063963 0.836313 1.857025 1235.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
21 1546.0 74411.115136 43042.507428 17.0 37334.75 75272.5 110606.00 149710.0 1546.0 68804.486417 57792.723668 83.0 19015.00 54175.0 113818.50 196158.0 1546.0 2.007134e+07 46009.516333 19921211.0 20040403.00 20071209.0 20110307.75 20151206.0 1546.0 65.705045 45.426932 1.0 19.0 82.0 82.0 191.0 1521.0 2.564103 2.358356 0.0 1.0 1.0 6.0 7.0 1486.0 0.337147 0.557827 0.0 0.0 0.0 1.0 5.0 1503.0 0.126414 ... 0.070734 0.177306 1546.0 -0.475685 3.523654 -6.666113 -3.958469 1.187783 2.397824 11.487130 1546.0 -0.432908 2.722832 -4.713837 -2.167777 -0.668103 0.938675 17.991023 1546.0 0.525370 2.047954 -6.774670 -1.059614 0.709748 2.010039 11.683460 1546.0 0.311608 1.191928 -1.841718 -0.594075 0.300731 0.677414 3.113212 1546.0 0.095564 1.064292 -3.496441 -0.488067 0.231581 0.948296 2.017902 1546.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
22 1085.0 74106.372350 43456.470274 286.0 36621.00 73611.0 111236.00 149966.0 1085.0 75645.853456 57361.844785 354.0 24505.00 64182.0 123326.00 196675.0 1085.0 2.006849e+07 42384.456746 19940404.0 20040502.00 20060910.0 20100704.00 20151110.0 1085.0 88.134562 53.247892 1.0 58.0 95.0 118.0 187.0 1069.0 2.827877 2.350458 0.0 1.0 2.0 6.0 7.0 1041.0 0.480307 0.569958 0.0 0.0 0.0 1.0 2.0 1064.0 0.219925 ... 0.119286 0.180232 1085.0 -0.068847 3.459512 -7.377229 -3.491510 1.554127 2.618712 11.358895 1085.0 -0.900918 2.624135 -4.631540 -2.373848 -1.220473 0.308024 17.850123 1085.0 0.652282 2.106158 -5.786973 -0.849187 0.660569 2.171080 9.804298 1085.0 0.990317 1.206859 -1.828631 0.235624 1.122912 1.895986 3.488786 1085.0 -1.325744 1.051142 -3.483484 -2.099871 -1.535887 -0.535032 1.632716 1085.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
23 183.0 71463.065574 42681.617690 981.0 37411.50 72602.0 106686.50 149073.0 183.0 91299.289617 56983.542288 200.0 42267.50 84050.0 142807.00 194901.0 183.0 2.001607e+07 47984.952170 19910601.0 19980557.00 20010012.0 20041205.00 20150712.0 183.0 141.879781 76.034237 1.0 147.0 147.0 198.0 246.0 177.0 1.361582 1.008089 0.0 1.0 1.0 2.0 5.0 170.0 0.241176 0.505073 0.0 0.0 0.0 0.0 2.0 174.0 0.132184 ... 0.151933 0.195777 183.0 1.859704 2.923402 -4.659997 -1.039074 2.913294 3.682503 10.600353 183.0 0.413718 3.339626 -3.996334 -1.139900 -0.230774 1.316077 18.134975 183.0 -1.458836 2.478543 -6.134694 -3.138181 -2.029954 -0.044675 5.744553 183.0 1.685663 1.572607 -1.319020 0.641498 1.953308 2.837793 4.207282 183.0 -0.584089 0.954694 -3.119766 -0.832251 -0.573239 -0.049845 1.813045 183.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
24 630.0 77544.577778 43472.950598 104.0 41483.00 77434.0 116516.75 149770.0 630.0 68077.846032 63207.834220 754.0 6010.00 48344.0 123193.25 196698.0 630.0 2.004602e+07 55756.401134 19910101.0 20010705.00 20050706.0 20081009.75 20151004.0 630.0 141.434921 57.203244 1.0 135.0 167.0 167.0 196.0 609.0 4.722496 0.940763 0.0 4.0 5.0 5.0 7.0 596.0 0.115772 0.364408 0.0 0.0 0.0 0.0 2.0 605.0 0.454545 ... 0.010444 0.096948 630.0 -2.181525 4.474586 -8.563886 -7.114687 0.431129 0.936029 11.517281 630.0 -2.068203 3.619911 -5.403044 -4.436895 -3.733340 -0.299474 17.321064 630.0 4.232475 1.772247 -2.501647 3.090258 3.970751 5.314654 13.847792 630.0 -2.606416 1.006631 -4.153899 -3.483702 -2.789147 -1.705373 -0.369503 630.0 -1.547405 1.520931 -4.990900 -2.049624 -1.381883 -0.481947 1.464521 630.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
25 2059.0 74127.034483 43501.338220 37.0 37139.00 74159.0 112051.50 149751.0 2059.0 78456.486644 59066.256877 270.0 22920.00 68579.0 126283.00 196732.0 2059.0 2.004729e+07 45101.627596 19910301.0 20011109.00 20050512.0 20080610.00 20151205.0 2059.0 77.587664 59.440144 1.0 19.0 74.0 107.0 213.0 2003.0 1.918622 1.544501 0.0 1.0 1.0 3.0 7.0 1964.0 0.402749 0.550286 0.0 0.0 0.0 1.0 6.0 1990.0 0.137186 ... 0.129609 0.179783 2059.0 0.631146 3.364486 -5.958284 -3.020590 2.142861 2.997208 11.318870 2059.0 -0.011960 3.091690 -4.229087 -1.762297 -0.722704 1.159274 18.672101 2059.0 -0.291924 2.100629 -7.142263 -1.774015 -0.336165 1.088873 9.376154 2059.0 0.970131 1.428631 -1.543721 -0.544231 1.109999 2.224228 3.841097 2059.0 -0.709130 0.910082 -2.964917 -1.563683 -0.554168 -0.095176 1.545084 2059.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
26 878.0 77493.378132 43487.023734 347.0 40607.75 76995.5 115285.25 149956.0 878.0 88815.642369 58856.666895 319.0 37001.00 84736.5 138247.25 196809.0 878.0 2.003434e+07 57256.432799 19910012.0 20000002.00 20030909.0 20077805.25 20151211.0 878.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0 736.0 3.501359 2.472158 0.0 1.0 4.0 6.0 7.0 712.0 0.667135 1.085715 0.0 0.0 0.0 1.0 6.0 714.0 0.523810 ... 0.031951 0.115086 878.0 1.585720 4.774101 -7.936043 -2.071304 1.999438 3.112465 12.319303 878.0 1.122851 6.295803 -4.734807 -2.794448 -1.101424 0.872000 18.563847 878.0 1.514269 2.664720 -6.849828 -0.248631 1.420006 2.910000 12.964502 878.0 -1.450474 0.622820 -2.683358 -2.035405 -1.523719 -0.929298 -0.012991 878.0 0.794297 0.602347 -1.421036 0.360552 0.820368 1.257269 2.228209 878.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
27 2049.0 73845.705710 43240.478883 15.0 36399.00 73615.0 110825.00 149922.0 2049.0 69949.441191 59332.550713 456.0 15265.00 52631.0 117151.00 196757.0 2049.0 2.004516e+07 52069.929164 19910002.0 20010306.00 20050901.0 20080906.00 20150611.0 2049.0 119.454856 58.426871 1.0 111.0 136.0 160.0 219.0 2013.0 1.959762 1.921902 0.0 1.0 1.0 3.0 7.0 1969.0 0.376333 0.797652 0.0 0.0 0.0 1.0 4.0 1996.0 0.132766 ... 0.117030 0.171262 2049.0 -0.349411 3.401541 -8.385159 -3.775698 1.388747 2.568366 11.438472 2049.0 -0.220547 2.787389 -4.900100 -1.958291 -0.485905 1.291661 17.955811 2049.0 0.433205 1.927001 -5.487213 -1.092251 0.502035 1.570864 9.238937 2049.0 0.909657 1.208100 -1.819820 0.267599 1.155180 1.789408 3.036236 2049.0 -1.148241 1.261924 -4.288604 -1.986162 -1.035812 -0.208064 1.791176 2049.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
28 633.0 74738.657188 43356.425965 267.0 37797.00 76432.0 112074.00 149999.0 633.0 75250.409163 57674.072361 328.0 21680.00 63202.0 118631.00 196106.0 633.0 2.006528e+07 52284.870060 19910003.0 20050609.00 20070810.0 20100907.00 20140312.0 633.0 83.764613 79.789794 1.0 19.0 19.0 177.0 238.0 625.0 2.417600 2.278666 0.0 1.0 1.0 5.0 7.0 603.0 0.406302 0.726179 0.0 0.0 0.0 1.0 5.0 619.0 0.268174 ... 0.145658 0.182138 633.0 0.083255 3.220259 -6.585369 -2.903576 1.680640 2.600357 11.081233 633.0 -0.673638 2.632410 -4.493970 -2.368460 -1.022317 0.951035 17.505573 633.0 0.662696 1.697063 -4.261720 -0.609220 0.507734 1.879061 8.397233 633.0 0.913849 1.738917 -1.765759 -0.652658 0.216346 2.616105 3.267012 633.0 -0.725659 1.529047 -4.800336 -0.651846 -0.316054 0.056592 1.681307 633.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
29 406.0 74621.884236 43236.633955 228.0 37153.50 73950.5 111987.25 149794.0 406.0 70945.206897 56446.897666 364.0 19719.00 55950.5 115637.75 191417.0 406.0 2.010308e+07 22641.540390 19990307.0 20090311.00 20101007.0 20120607.75 20150801.0 406.0 145.527094 53.246485 1.0 97.0 153.0 203.0 220.0 402.0 2.766169 2.209716 0.0 1.0 2.0 6.0 7.0 387.0 0.410853 0.643167 0.0 0.0 0.0 1.0 3.0 396.0 0.000000 ... 0.154961 0.180703 406.0 -0.790372 3.346689 -6.638600 -4.004475 1.189307 2.337868 9.911740 406.0 -0.815966 2.505750 -4.278372 -2.240155 -1.316043 1.003902 18.215280 406.0 1.264502 1.273312 -1.366109 0.238120 1.313790 2.125154 7.968338 406.0 2.348716 0.891323 -1.267262 1.340144 2.740750 2.959609 3.670654 406.0 -1.569358 1.180541 -4.288898 -2.280369 -2.059071 -0.173354 1.671054 406.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
30 940.0 75172.069149 44258.585908 223.0 36734.00 74946.0 114503.50 149916.0 940.0 70067.446809 57915.463701 368.0 20215.00 52366.0 116240.00 196292.0 940.0 2.003896e+07 55861.120829 19910005.0 19990907.50 20050009.5 20080806.00 20151207.0 940.0 71.619149 69.901983 1.0 19.0 19.0 137.0 194.0 914.0 2.840263 2.467461 0.0 1.0 1.0 6.0 7.0 885.0 0.148023 0.388769 0.0 0.0 0.0 0.0 2.0 905.0 0.081768 ... 0.112232 0.177734 940.0 -0.126412 3.654306 -6.275393 -3.720092 1.457661 2.748777 11.517406 940.0 0.005554 3.280222 -4.358551 -1.716405 -0.350104 1.357139 18.455192 940.0 -0.173341 2.100211 -6.730851 -1.817245 -0.051923 1.165228 9.049487 940.0 0.576410 1.292044 -1.367847 -0.634741 0.557611 1.816135 3.403661 940.0 -0.826164 1.106440 -4.161693 -1.277038 -0.576715 -0.192327 1.843991 940.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
31 318.0 79211.823899 44752.440924 986.0 38857.25 81918.5 119649.75 149436.0 318.0 72795.921384 59748.139844 19.0 19852.00 55504.5 124235.25 196533.0 318.0 2.002378e+07 41200.588927 19920009.0 19991135.25 20030156.5 20060481.00 20120903.0 318.0 122.251572 66.370955 1.0 100.0 100.0 150.0 241.0 299.0 1.515050 1.537760 0.0 1.0 1.0 1.0 7.0 285.0 0.021053 0.204472 0.0 0.0 0.0 0.0 2.0 300.0 0.110000 ... 0.154065 0.197007 318.0 1.544940 3.692167 -5.214919 -2.356698 2.989602 3.932320 11.722887 318.0 1.596768 4.075336 -3.331715 -0.364359 0.453222 2.303626 18.434987 318.0 -2.145505 2.219577 -7.476219 -3.721772 -2.628344 -0.437923 5.561065 318.0 2.790518 1.253698 -0.601478 2.628097 3.147084 3.627427 4.574046 318.0 -0.098571 1.047429 -3.530033 -0.350457 0.116856 0.605951 1.781382 318.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
32 588.0 77254.421769 43933.469591 785.0 38649.50 79253.0 116300.50 149906.0 588.0 80697.831633 60478.938739 532.0 23408.00 73592.0 133244.25 196673.0 588.0 2.002166e+07 37778.359656 19910012.0 19991207.50 20020457.5 20050602.25 20150007.0 588.0 101.000000 74.183804 1.0 19.0 120.0 173.0 185.0 568.0 2.457746 1.557247 0.0 2.0 3.0 3.0 7.0 551.0 0.471869 0.631469 0.0 0.0 0.0 1.0 2.0 571.0 0.553415 ... 0.109876 0.148974 588.0 0.433381 3.612880 -6.647469 -3.321114 2.084440 2.931624 11.528951 588.0 -0.221359 3.340916 -4.232155 -2.162154 -1.201379 1.061553 17.634468 588.0 -0.626338 2.320674 -7.328804 -2.191988 -1.058943 0.740408 9.804773 588.0 0.564083 1.066244 -1.316508 -0.479304 0.759701 1.385450 2.695913 588.0 -1.007297 1.020659 -4.544762 -1.584126 -0.847772 -0.353823 1.404809 588.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
33 201.0 74372.567164 46552.749727 213.0 34590.00 68461.0 118484.00 149600.0 201.0 81124.049751 57825.076347 726.0 30423.00 70408.0 127691.00 196467.0 201.0 2.002732e+07 49375.949773 19910111.0 20000403.00 20021211.0 20061008.00 20150808.0 201.0 111.517413 80.162903 1.0 19.0 179.0 181.0 181.0 198.0 0.555556 1.364690 0.0 0.0 0.0 0.0 5.0 195.0 0.420513 0.589941 0.0 0.0 0.0 1.0 2.0 198.0 0.792929 ... 0.101085 0.122254 201.0 -0.382310 3.667147 -6.699000 -4.361581 1.701981 2.388892 9.788658 201.0 -0.963164 2.475391 -4.042492 -2.448779 -1.813465 0.695686 14.265191 201.0 1.162601 1.828595 -1.835974 -0.105305 0.825324 2.461365 8.332826 201.0 -0.212318 1.028219 -2.181042 -1.122769 0.195374 0.548407 1.846387 201.0 -1.430185 1.426232 -3.699757 -2.725109 -1.730609 -0.153571 1.100867 201.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
34 227.0 67690.295154 45058.232029 821.0 26593.50 63861.0 108637.00 149962.0 227.0 86537.942731 58063.238676 733.0 39316.00 75063.0 135185.00 196322.0 227.0 2.001670e+07 25854.786441 19940305.0 20000306.50 20021009.0 20040357.00 20090806.0 227.0 103.903084 63.722708 1.0 92.0 92.0 141.0 216.0 214.0 1.116822 1.130207 0.0 1.0 1.0 1.0 7.0 206.0 0.106796 0.450747 0.0 0.0 0.0 0.0 2.0 218.0 0.073394 ... 0.180723 0.213617 227.0 2.336748 3.069714 -3.994545 -0.675070 3.451118 3.842282 11.482525 227.0 1.297720 3.818776 -1.979440 -0.628523 0.084570 2.241391 18.004163 227.0 -2.717730 2.046480 -7.591009 -3.716544 -2.843971 -2.143089 5.869691 227.0 3.593484 1.683694 0.175089 3.653462 4.219027 4.847437 5.249750 227.0 -0.478732 0.482595 -1.466562 -0.833232 -0.483200 -0.137173 0.978073 227.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
35 180.0 75761.288889 42658.662014 1239.0 42427.25 76158.0 109473.00 149835.0 180.0 92504.377778 56658.832239 1985.0 43453.50 89256.0 137603.50 193029.0 180.0 1.999566e+07 26177.571767 19920202.0 19980753.00 19991104.0 20010412.00 20081012.0 180.0 20.794444 31.562600 1.0 19.0 19.0 19.0 240.0 174.0 1.137931 1.774400 0.0 0.0 0.0 1.0 7.0 163.0 0.171779 0.438785 0.0 0.0 0.0 0.0 2.0 171.0 0.128655 ... 0.072706 0.111065 180.0 2.698322 2.546303 -5.300156 2.679139 3.218441 3.602274 11.216075 180.0 -0.132526 3.185104 -4.492133 -1.452299 -0.933803 0.004659 15.908381 180.0 -2.118162 1.833393 -7.024219 -3.163407 -2.343118 -1.363827 4.675358 180.0 -0.310187 0.406609 -0.710799 -0.534089 -0.451840 -0.195753 2.305511 180.0 -0.269934 0.703368 -4.304987 -0.470896 -0.285957 0.087244 1.152364 180.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
36 228.0 73029.513158 41254.861198 699.0 36801.25 74890.0 106929.50 147599.0 228.0 95786.000000 58318.848570 2310.0 45353.25 93729.5 145547.00 196115.0 228.0 2.000187e+07 43634.811252 19910001.0 19971006.50 20000703.5 20030533.75 20120201.0 228.0 68.736842 84.506711 1.0 19.0 19.0 205.0 232.0 226.0 2.075221 2.010772 0.0 0.0 2.0 4.0 7.0 210.0 0.280952 0.604702 0.0 0.0 0.0 0.0 5.0 219.0 0.210046 ... 0.088218 0.178423 228.0 1.428753 2.682799 -5.395128 1.532496 2.541814 2.996332 10.525269 228.0 -1.140752 2.139584 -3.857012 -2.153057 -1.579656 -0.685238 17.097484 228.0 -0.905136 1.914516 -5.375736 -2.362553 -1.033816 0.415038 4.007988 228.0 -0.026382 0.938113 -1.156501 -0.848372 -0.379545 1.012891 2.450867 228.0 -0.661150 0.820463 -2.472309 -1.188085 -0.478278 -0.047181 0.935694 228.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
37 331.0 73778.833837 44279.708500 57.0 34365.50 73581.0 111356.00 149816.0 331.0 82216.232628 58785.168955 1747.0 30364.50 69461.0 136735.50 192926.0 331.0 2.005355e+07 53794.575230 19910106.0 20010907.50 20060112.0 20091008.00 20151210.0 330.0 190.275758 28.105528 1.0 189.0 200.0 202.0 206.0 324.0 5.978395 0.355622 0.0 6.0 6.0 6.0 7.0 326.0 0.861963 0.371228 0.0 1.0 1.0 1.0 2.0 324.0 0.478395 ... 0.106693 0.222787 331.0 -1.205835 3.866438 -8.798810 -5.546108 0.586353 1.297289 10.207696 331.0 -2.449968 3.002255 -5.391154 -4.375784 -3.313433 -0.703420 14.724600 331.0 2.681481 2.293266 -2.854245 0.874165 2.821037 4.536863 9.983565 331.0 0.091308 1.051330 -2.255182 -0.374834 0.060246 0.455534 11.147669 331.0 -4.089576 1.169474 -6.546556 -4.561176 -4.118236 -3.682716 8.658418 331.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
38 65.0 79568.169231 46074.928132 570.0 35584.00 98020.0 115964.00 147553.0 65.0 69180.092308 53590.168691 2632.0 21519.00 56816.0 111613.00 189410.0 65.0 2.006727e+07 42559.876836 19950008.0 20050701.00 20080205.0 20091003.00 20150205.0 65.0 171.600000 83.524847 1.0 214.0 214.0 214.0 242.0 60.0 4.666667 2.282332 0.0 2.0 6.0 6.0 6.0 60.0 0.350000 0.755208 0.0 0.0 0.0 0.0 2.0 60.0 0.000000 ... 0.156860 0.210402 65.0 0.925148 4.186859 -5.543358 -3.030515 1.955995 2.588624 11.613387 65.0 -0.364350 4.634212 -3.655055 -2.905654 -1.627678 0.216899 16.388386 65.0 -0.025510 2.116160 -3.477392 -1.090534 -0.339330 0.525587 8.297318 65.0 2.623626 1.387277 -0.887793 2.542658 3.255139 3.416094 4.599514 65.0 -1.855674 1.140203 -3.268925 -2.505578 -2.328672 -1.755200 0.878431 65.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
39 9.0 70365.222222 39637.457808 5144.0 49803.00 76258.0 99049.00 127071.0 9.0 74224.666667 59824.384652 22825.0 38387.00 50778.0 65608.00 181810.0 9.0 2.000169e+07 75637.080552 19910707.0 19950009.00 19981203.0 20040710.00 20150402.0 9.0 86.000000 118.727629 1.0 1.0 19.0 244.0 244.0 7.0 2.571429 2.439750 0.0 0.5 2.0 4.5 6.0 6.0 0.166667 0.408248 0.0 0.0 0.0 0.0 1.0 7.0 0.000000 ... 0.072366 0.184668 9.0 1.515873 2.985496 -6.214428 1.717704 2.118526 3.142073 3.399385 9.0 2.910655 8.745510 -3.436782 -1.548087 -1.059497 0.697001 18.197699 9.0 0.760231 2.972780 -2.862146 -1.400021 -0.249955 1.914129 5.452985 9.0 0.496650 1.477420 -0.897785 -0.674703 0.075403 1.532134 3.577009 9.0 -0.899515 1.813946 -3.620942 -2.306780 0.000658 0.258403 1.005671 9.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0

40 rows × 240 columns

# Compute per-brand sales statistics; you can compute statistics for other features the same way.
# The statistics must be computed from the train data only.
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # Keep only rows with a positive price
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    # Smoothed mean: dividing by (count + 1) avoids a division by zero when a group is empty
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
# Left-join the brand statistics onto data, so every row receives its brand's statistics
data = data.merge(brand_fe, how='left', on='brand')
brand_fe.head()
brand brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average
0 0 31429.0 68500.0 3199.0 13.0 173719698.0 6261.371627 5527.19
1 1 13656.0 84000.0 6399.0 15.0 124044603.0 8988.865406 9082.86
2 2 318.0 55800.0 7500.0 35.0 3766241.0 10576.224444 11806.40
3 3 2461.0 37500.0 4990.0 65.0 15954226.0 5396.327503 6480.19
4 4 16575.0 99999.0 5999.0 12.0 138279069.0 8089.863295 8342.13
data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0 1 10193.0 35990.0 1800.0 13.0 36457518.0 4562.233331 3576.37
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 4366 0 0 20160309 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0 4 13656.0 84000.0 6399.0 15.0 124044603.0 8988.865406 9082.86
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0 2 1458.0 45000.0 8500.0 100.0 14373814.0 5425.058140 9851.83
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0 0 13994.0 92900.0 5200.0 15.0 113034210.0 8244.695287 8076.76
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0 6 4662.0 31500.0 2300.0 20.0 15414322.0 3344.689763 3305.67
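The per-brand loop above can also be written with pandas named aggregation. The following is a minimal sketch on a made-up toy frame (`toy_train` and its values are assumptions for illustration, not competition data); it reproduces the same columns, including the `sum / (count + 1)` smoothed mean.

```python
import pandas as pd

# Toy stand-in for the training set; the values are made up for illustration.
toy_train = pd.DataFrame({
    "brand": [0, 0, 0, 1, 1, 2],
    "price": [1000, 3000, 0, 5000, 7000, 2000],
})

# Keep only rows with a positive price, as the loop does.
valid = toy_train[toy_train["price"] > 0]

# Named aggregation: each keyword becomes one output column.
brand_stats = (
    valid.groupby("brand")["price"]
    .agg(
        brand_amount="count",
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)

# Same smoothed mean as the loop: sum / (count + 1), rounded to 2 decimals.
brand_stats["brand_price_average"] = (
    brand_stats["brand_price_sum"] / (brand_stats["brand_amount"] + 1)
).round(2)

print(brand_stats)
```

Because the `price > 0` filter already drops missing prices, `count` here matches `len(kind_data)` in the loop. A brand with a single row gets a NaN standard deviation in both versions.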
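# Data bucketing (binning), using power as the example.
# Note: pd.cut leaves missing values as NaN. With the edges below the intervals are
# right-closed, (0, 10], ..., (290, 300], so power = 0 and power > 300 also become NaN.
# Why bucket the data? There are several reasons:
# 1. After discretization, inner products over sparse vectors are faster to compute, and the results are easy to store and extend (an advantage of one-hot encoding);
# 2. Discretized features are more robust to outliers: if age > 30 maps to 1 and everything else to 0, an age of 200 no longer disturbs the model much;
# 3. LR is a generalized linear model with limited expressive power; after discretization each bucket gets its own weight, which introduces non-linearity and improves the fit (an advantage of one-hot encoding);
# 4. Discretized features can be crossed, turning M+N variables into M*N variables, which adds further non-linearity and expressive power (an advantage of one-hot encoding);
# 5. The model becomes more stable: with age ranges, for example, a prediction does not change just because a user gets one year older.

# There are other reasons too. LightGBM, for example, buckets feature values as one of its changes relative to XGBoost, which can also help generalization.

bins = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()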

power_bin power
0 5.0 60
1 NaN 0
2 16.0 163
3 19.0 193
4 6.0 68
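The NaN for `power = 0` in the output above follows from how `pd.cut` treats interval edges. A small sketch on made-up values (the series below is an assumption for illustration) shows the default right-closed behaviour and the effect of `include_lowest=True`:

```python
import pandas as pd

# Made-up power values, including the edge cases 0, 300, 301 and a missing value.
power = pd.Series([0, 5, 10, 60, 300, 301, None])
edges = [i * 10 for i in range(31)]  # 0, 10, ..., 300

# Default: intervals are (0, 10], (10, 20], ..., (290, 300].
# 0 and 301 fall outside every interval and become NaN, as does the missing value.
default_bins = pd.cut(power, edges, labels=False)

# include_lowest=True closes the first interval on the left, so 0 lands in bucket 0.
with_zero = pd.cut(power, edges, labels=False, include_lowest=True)

print(pd.DataFrame({"power": power, "default": default_bins, "with_zero": with_zero}))
```

Whether zero power should share a bucket with values 1 to 10 or stay separate is a modelling decision; the sketch only shows what each option does.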
data['power_bin']  # Because labels=False was passed, power_bin holds the index of the bucket each power value fell into
0          5.0
1          NaN
2         16.0
3         19.0
4          6.0
          ... 
199032    11.0
199033     7.0
199034    22.0
199035     NaN
199036     6.0
Name: power_bin, Length: 199037, dtype: float64
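Tree models can consume the NaN bucket indices directly, but linear models cannot. One option, sketched here under the assumption that a sentinel value of `-1` is acceptable as a "missing or out of range" bucket (the sentinel and the toy series are illustrative choices, not part of the original notebook), is to fill the NaN values and one-hot encode the result:

```python
import pandas as pd

# Toy bucket indices mirroring the head of the output above.
power_bin = pd.Series([5.0, None, 16.0, 19.0, 6.0])

# Hypothetical sentinel: -1 marks values that fell into no bucket.
power_bin_filled = power_bin.fillna(-1).astype(int)

# One indicator column per bucket, including the sentinel bucket.
one_hot = pd.get_dummies(power_bin_filled, prefix="power_bin")

print(one_hot)
```

Each bucket then gets its own weight in a linear model, which is the effect described in reason 3 above.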
# Once the raw columns have been used to build features, they can be dropped
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(199037, 39)

Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
       'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
       'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
       'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
       'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')
# The data in its current form can already be fed to tree models, so export it
data.to_csv('data_for_tree.csv', index=False)
data.head()
SaleID name model brand bodyType fuelType gearbox power kilometer notRepairedDamage seller offerType price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average power_bin
0 0 736 30.0 6 1.0 0.0 0.0 60 12.5 0.0 0 0 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0 1 10193.0 35990.0 1800.0 13.0 36457518.0 4562.233331 3576.37 5.0
1 1 2262 40.0 1 2.0 0.0 0.0 0 15.0 - 0 0 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0 4 13656.0 84000.0 6399.0 15.0 124044603.0 8988.865406 9082.86 NaN
2 2 14874 115.0 15 1.0 0.0 0.0 163 12.5 0.0 0 0 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0 2 1458.0 45000.0 8500.0 100.0 14373814.0 5425.058140 9851.83 16.0
3 3 71865 109.0 10 0.0 0.0 1.0 193 15.0 0.0 0 0 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0 0 13994.0 92900.0 5200.0 15.0 113034210.0 8244.695287 8076.76 19.0
4 4 111080 110.0 5 1.0 0.0 0.0 68 5.0 0.0 0 0 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0 6 4662.0 31500.0 2300.0 20.0 15414322.0 3344.689763 3305.67 6.0
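# We can build a second feature set for models such as LR and NN
# We build it separately because different models have different requirements for the data
# Let's look at the distribution: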
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2870d506978>

[Figure: histogram of data['power'] (output_31_1.png)]

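# We already removed outliers from train, but the distribution still looks this odd because of the power outliers in test.
# So it would have been better not to delete the power outliers in train; long-tail truncation can be used instead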
train['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x287001c6f60>

[Figure: histogram of train['power'] (output_32_1.png)]

# Take the log of power, then normalize it
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
data['power'] = np.log(data['power'] + 1) 
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2870022bac8>

[Figure: histogram of data['power'] after log transform and min-max normalization (output_33_1.png)]
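The cell above imports MinMaxScaler but then scales by hand. As a rough sketch on toy values (the numbers are made up), the two routes give the same result:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy power column (values are illustrative)
power = pd.DataFrame({'power': [0.0, 60.0, 163.0, 193.0, 68.0, 600.0]})

# Manual version: log(1 + x), then min-max scaling to [0, 1]
logged = np.log1p(power['power'])
manual = (logged - logged.min()) / (logged.max() - logged.min())

# Same transformation with scikit-learn; the scaler expects 2-D input
scaler = MinMaxScaler()
scaled = scaler.fit_transform(np.log1p(power[['power']])).ravel()

print(np.allclose(manual.to_numpy(), scaled))
```

One practical difference is that a fitted scaler object remembers the minimum and maximum, so the same mapping can be reused on new data with scaler.transform.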

# kilometer looks fairly normal; it has probably been binned already
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x287002b1358>

[Figure: histogram of data['kilometer'] (output_34_1.png)]

data['kilometer'].value_counts()  # this shows directly why kilometer was judged to be binned already: it takes only these few distinct values
15.0    128682
12.5     20958
10.0      8506
9.0       6992
8.0       6043
7.0       5442
6.0       4886
5.0       4197
4.0       3576
3.0       3309
2.0       3034
0.5       2431
1.0        981
Name: kilometer, dtype: int64
data['power_bin'].plot.hist()  # plot power after binning to see the effect
<matplotlib.axes._subplots.AxesSubplot at 0x287004c1ac8>

[Figure: histogram of data['power_bin'] (output_36_1.png)]

# So kilometer can be normalized directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / 
                        (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x28700291e10>

[Figure: histogram of data['kilometer'] after min-max normalization (output_37_1.png)]

# Besides these, there are the statistical features we built earlier:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# We won't analyze them one by one here; we apply min-max scaling directly.
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Apply the same min-max scaling to every brand statistic
for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])
data.head()

SaleID name model brand bodyType fuelType gearbox power kilometer notRepairedDamage seller offerType price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average power_bin
0 0 736 30.0 6 1.0 0.0 0.0 0.415091 0.827586 0.0 0 0 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0 1 0.324125 0.340786 0.032075 0.002064 0.209684 0.207660 0.081655 5.0
1 1 2262 40.0 1 2.0 0.0 0.0 0.000000 1.000000 - 0 0 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0 4 0.434341 0.835230 0.205623 0.004128 0.713985 0.437002 0.257305 NaN
2 2 14874 115.0 15 1.0 0.0 0.0 0.514954 0.827586 0.0 0 0 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0 2 0.046117 0.433578 0.284906 0.091847 0.082533 0.252362 0.281834 16.0
3 3 71865 109.0 10 0.0 0.0 1.0 0.531917 1.000000 0.0 0 0 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0 0 0.445099 0.926889 0.160377 0.004128 0.650591 0.398447 0.225212 19.0
4 4 111080 110.0 5 1.0 0.0 0.0 0.427535 0.310345 0.0 0 0 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0 6 0.148090 0.294545 0.050943 0.009288 0.088524 0.144579 0.073020 6.0
# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])
data.head()
SaleID name power kilometer seller offerType price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average model_0.0 model_1.0 model_2.0 model_3.0 model_4.0 model_5.0 model_6.0 model_7.0 model_8.0 model_9.0 model_10.0 model_11.0 model_12.0 model_13.0 model_14.0 model_15.0 model_16.0 model_17.0 ... bodyType_0.0 bodyType_1.0 bodyType_2.0 bodyType_3.0 bodyType_4.0 bodyType_5.0 bodyType_6.0 bodyType_7.0 fuelType_0.0 fuelType_1.0 fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0 power_bin_0.0 power_bin_1.0 power_bin_2.0 power_bin_3.0 power_bin_4.0 power_bin_5.0 power_bin_6.0 power_bin_7.0 power_bin_8.0 power_bin_9.0 power_bin_10.0 power_bin_11.0 power_bin_12.0 power_bin_13.0 power_bin_14.0 power_bin_15.0 power_bin_16.0 power_bin_17.0 power_bin_18.0 power_bin_19.0 power_bin_20.0 power_bin_21.0 power_bin_22.0 power_bin_23.0 power_bin_24.0 power_bin_25.0 power_bin_26.0 power_bin_27.0 power_bin_28.0 power_bin_29.0
0 0 736 0.415091 0.827586 0 0 1850.0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0 1 0.324125 0.340786 0.032075 0.002064 0.209684 0.207660 0.081655 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 2262 0.000000 1.000000 0 0 3600.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0 4 0.434341 0.835230 0.205623 0.004128 0.713985 0.437002 0.257305 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 14874 0.514954 0.827586 0 0 6222.0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0 2 0.046117 0.433578 0.284906 0.091847 0.082533 0.252362 0.281834 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 71865 0.531917 1.000000 0 0 2400.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0 0 0.445099 0.926889 0.160377 0.004128 0.650591 0.398447 0.225212 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 4 111080 0.427535 0.310345 0 0 5200.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0 6 0.148090 0.294545 0.050943 0.009288 0.088524 0.144579 0.073020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 370 columns
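In the output above, row 1 has a missing power_bin and all of its power_bin_* columns are 0. A small sketch on toy data (the frame below is made up) shows this default behaviour of pd.get_dummies and the optional dummy_na flag:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'gearbox': [0.0, 0.0, 1.0],
    'power_bin': [5.0, np.nan, 16.0],
})

# Default: a missing category produces a row of zeros in that feature's columns
encoded = pd.get_dummies(toy, columns=['gearbox', 'power_bin'])

# dummy_na=True adds an explicit indicator column for missing values
encoded_with_nan = pd.get_dummies(toy, columns=['gearbox', 'power_bin'], dummy_na=True)

print(encoded.columns.tolist())
print(encoded_with_nan.columns.tolist())
```

Whether a separate "missing" column helps depends on the model; for a linear model it lets missingness carry its own weight.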

print(data.shape)
data.columns
(199037, 370)





Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'price',
       'v_0', 'v_1', 'v_2',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=370)
# This version of the data can be used for LR
data.to_csv('data_for_lr.csv', index=0)

3.3.3 Feature Selection

1) Filter

# Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.259066833880992
0.38691042393409447
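The per-column prints above can also be written as a single ranking. This is a minimal sketch on synthetic data (the feature names are borrowed from the notebook, but the values and coefficients are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'power': rng.uniform(0, 1, n),
    'kilometer': rng.uniform(0, 1, n),
    'noise': rng.uniform(0, 1, n),
})
# Synthetic target: rises with power, falls with kilometer, ignores noise
toy['price'] = 5000 * toy['power'] - 3000 * toy['kilometer'] + rng.normal(0, 300, n)

features = ['power', 'kilometer', 'noise']

# Spearman correlation of every feature with the target in one call
spearman = toy[features].corrwith(toy['price'], method='spearman')

# Rank by absolute value so strong negative correlations are not overlooked
ranking = spearman.abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value matters here: kilometer has a sizeable negative correlation in the real output above as well.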
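# We can of course also look at a plot directly
# (note: 'price' is not included in data_numeric, so this heatmap shows correlations among the features only)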
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 
                     'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x28700594b70>

[Figure: correlation heatmap of the numeric features (output_47_1.png)]
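A correlation matrix is also useful for two filter-style checks: reading off each feature's correlation with the target, and spotting nearly redundant feature pairs. The sketch below uses synthetic data, and the 0.9 threshold is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)
toy = pd.DataFrame({
    # Two features built from the same signal, so they are nearly redundant
    'brand_price_average': base + rng.normal(scale=0.1, size=n),
    'brand_price_median': base + rng.normal(scale=0.1, size=n),
    'kilometer': rng.normal(size=n),
})
toy['price'] = 2.0 * base - 1.0 * toy['kilometer'] + rng.normal(scale=0.5, size=n)

corr = toy.corr()

# Correlation of each feature with the target
price_corr = corr['price'].drop('price')

# Feature pairs whose absolute correlation exceeds the (assumed) threshold
features = [c for c in toy.columns if c != 'price']
redundant_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(price_corr)
print(redundant_pairs)
```

For a pair flagged this way, keeping only one of the two features is often enough for a linear model.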

x.columns
Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'v_0',
       'v_1', 'v_2', 'v_3',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=369)
# from sklearn.model_selection import cross_val_score, ShuffleSplit
# from sklearn.datasets import load_boston
# from sklearn.ensemble import RandomForestRegressor
# import numpy as np

# # Load boston housing dataset as an example
# boston = load_boston()
# X = x
# Y = y
# # names = x[0]

# rf = RandomForestRegressor(n_estimators=20, max_depth=4)
# scores = []
# # Model each feature on its own and cross-validate
# for i in range(X.shape[1]):
#     score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",  # note the difference between X[:, i] and X[:, i:i+1]
#                             cv=ShuffleSplit(len(X), 3, .3))
#     scores.append((format(np.mean(score), '.3f')))#, names[i]
# print(sorted(scores, reverse=True))
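The commented-out cell above depends on load_boston, which was removed in scikit-learn 1.2, and on an older ShuffleSplit signature. A runnable sketch of the same idea with the current API, on synthetic data (the feature values and coefficients are made up), could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'power': rng.uniform(0, 1, n),
    'kilometer': rng.uniform(0, 1, n),
    'noise': rng.uniform(0, 1, n),
})
y = 5000 * X['power'] - 2000 * X['kilometer'] + rng.normal(0, 200, n)

rf = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

scores = []
for name in X.columns:
    # X[[name]] keeps the 2-D shape (n_samples, 1) that scikit-learn expects
    score = cross_val_score(rf, X[[name]], y, scoring='r2', cv=cv)
    scores.append((round(float(np.mean(score)), 3), name))

# Highest mean R^2 first
scores.sort(reverse=True)
print(scores)
```

On the real data this loop would run over 369 columns, so a smaller forest or a feature subset may be needed to keep the runtime reasonable.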

2) Wrapper

# !pip install mlxtend
def fill(x):
    if not x:
        x = 0
    return int(x)
data['city'] = data['city'].map(fill)
data.groupby("city").describe()
#data['city'].isnull().sum()
data['city'].value_counts()
0    48645
1    42188
2    35133
3    27325
4    19945
5    13462
6     8313
7     3887
8      139
Name: city, dtype: int64
data['price'][0:train.shape[0]].isnull().sum()
0
print (train.shape[0])
print (test.shape[0])
149037
50000
# A large k_features is very slow to run; with no server available, the run was interrupted early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
           k_features=20,
           forward=True,
           floating=False,
           scoring = 'r2',
           cv = 0)
x = data.drop(['price'], axis=1)
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_ 
('kilometer',
 'v_3',
 'v_4',
 'v_6',
 'v_13',
 'v_14',
 'used_time',
 'brand_price_average',
 'model_44.0',
 'model_105.0',
 'model_113.0',
 'model_167.0',
 'brand_16',
 'bodyType_6.0',
 'gearbox_1.0',
 'power_bin_6.0',
 'power_bin_18.0',
 'power_bin_24.0',
 'power_bin_25.0',
 'power_bin_26.0')
# A large k_features is very slow to run; with no server available, the run was interrupted early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
           k_features=10,
           forward=True,
           floating=False,
           scoring = 'r2',
           cv = 0)
x = data.drop(['price','city'],axis=1)
#x.head()
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_ 
('kilometer',
 'v_3',
 'v_4',
 'v_13',
 'v_14',
 'used_time',
 'brand_price_average',
 'model_167.0',
 'gearbox_1.0',
 'power_bin_24.0')
# k_feature = sfs.get_metric_dict()
# for fea in k_feature:
#     fea = k_feature[fea]
#     print(f"Feature Name: {fea['feature_names']},")
#     print(f"Avg_Score: {fea['avg_score']}")
x.head()
#print(train.shape[0])
SaleID name power kilometer seller offerType v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14 train used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average model_0.0 model_1.0 model_2.0 model_3.0 model_4.0 model_5.0 model_6.0 model_7.0 model_8.0 model_9.0 model_10.0 model_11.0 model_12.0 model_13.0 model_14.0 model_15.0 model_16.0 model_17.0 model_18.0 ... bodyType_0.0 bodyType_1.0 bodyType_2.0 bodyType_3.0 bodyType_4.0 bodyType_5.0 bodyType_6.0 bodyType_7.0 fuelType_0.0 fuelType_1.0 fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0 power_bin_0.0 power_bin_1.0 power_bin_2.0 power_bin_3.0 power_bin_4.0 power_bin_5.0 power_bin_6.0 power_bin_7.0 power_bin_8.0 power_bin_9.0 power_bin_10.0 power_bin_11.0 power_bin_12.0 power_bin_13.0 power_bin_14.0 power_bin_15.0 power_bin_16.0 power_bin_17.0 power_bin_18.0 power_bin_19.0 power_bin_20.0 power_bin_21.0 power_bin_22.0 power_bin_23.0 power_bin_24.0 power_bin_25.0 power_bin_26.0 power_bin_27.0 power_bin_28.0 power_bin_29.0
0 0 736 0.415091 0.827586 0 0 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762 1 4385.0 1 0.324125 0.340786 0.032075 0.002064 0.209684 0.207660 0.081655 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 2262 0.000000 1.000000 0 0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522 1 4757.0 4 0.434341 0.835230 0.205623 0.004128 0.713985 0.437002 0.257305 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 14874 0.514954 0.827586 0 0 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963 1 4382.0 2 0.046117 0.433578 0.284906 0.091847 0.082533 0.252362 0.281834 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 71865 0.531917 1.000000 0 0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699 1 7125.0 0 0.445099 0.926889 0.160377 0.004128 0.650591 0.398447 0.225212 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 4 111080 0.427535 0.310345 0 0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482 1 1531.0 6 0.148090 0.294545 0.050943 0.009288 0.088524 0.144579 0.073020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 369 columns

# Plot it to see the diminishing marginal gain
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

[Figure: sequential feature selection score versus number of features (output_60_1.png)]

pd.DataFrame.from_dict(sfs.get_metric_dict()).T
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
1 (9,) [0.5580593794194673] 0.558059 (v_3,) NaN 0 NaN
2 (9, 30) [0.6253563249806938] 0.625356 (v_3, brand_price_average) NaN 0 NaN
3 (3, 9, 30) [0.6614119003709955] 0.661412 (kilometer, v_3, brand_price_average) NaN 0 NaN
4 (3, 9, 30, 335) [0.6712706106724942] 0.671271 (kilometer, v_3, brand_price_average, gearbox_... NaN 0 NaN
5 (3, 9, 30, 198, 335) [0.6801326459700268] 0.680133 (kilometer, v_3, brand_price_average, model_16... NaN 0 NaN
6 (3, 9, 22, 30, 198, 335) [0.686927264547389] 0.686927 (kilometer, v_3, used_time, brand_price_averag... NaN 0 NaN
7 (3, 9, 19, 22, 30, 198, 335) [0.6941981569972937] 0.694198 (kilometer, v_3, v_13, used_time, brand_price_... NaN 0 NaN
8 (3, 9, 10, 19, 22, 30, 198, 335) [0.6990798224753535] 0.69908 (kilometer, v_3, v_4, v_13, used_time, brand_p... NaN 0 NaN
9 (3, 9, 10, 19, 22, 30, 198, 335, 363) [0.7036045618841336] 0.703605 (kilometer, v_3, v_4, v_13, used_time, brand_p... NaN 0 NaN
10 (3, 9, 10, 19, 20, 22, 30, 198, 335, 363) [0.7073329162983002] 0.707333 (kilometer, v_3, v_4, v_13, v_14, used_time, b... NaN 0 NaN
11 (3, 9, 10, 12, 19, 20, 22, 30, 198, 335, 363) [0.7116225332630737] 0.711623 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
12 (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... [0.7152839215589477] 0.715284 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
13 (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... [0.7183533830790547] 0.718353 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
14 (3, 9, 10, 12, 19, 20, 22, 30, 144, 198, 295, ... [0.7210323407042653] 0.721032 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
15 (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... [0.7235732490774848] 0.723573 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
16 (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... [0.726091372443646] 0.726091 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
17 (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... [0.7286164680329102] 0.728616 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
18 (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... [0.7309480347784469] 0.730948 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
19 (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... [0.7332378240942985] 0.733238 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
20 (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... [0.7352137419490058] 0.735214 (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... NaN 0 NaN
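The table above is what greedy forward selection produces: at each step, the feature that raises the score the most is added. The sketch below is a simplified reimplementation for illustration, not mlxtend's own code; forward_select is a hypothetical helper, the data is synthetic, and the score is R^2 on the training data, which appears to match the cv=0 setting used above (a single score per step would also explain the zero std_dev and NaN ci_bound columns).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def forward_select(X, y, k_features):
    """Greedy forward selection scored by training R^2 (no cross-validation)."""
    selected, history = [], []
    remaining = list(X.columns)
    while remaining and len(selected) < k_features:
        best_name, best_score = None, -np.inf
        for name in remaining:
            cols = selected + [name]
            score = LinearRegression().fit(X[cols], y).score(X[cols], y)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
        history.append((tuple(selected), best_score))
    return selected, history

# Synthetic data: three informative features and one pure-noise feature
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(rng.normal(size=(n, 4)),
                 columns=['v_3', 'kilometer', 'used_time', 'noise'])
y = (3.0 * X['v_3'] - 2.0 * X['kilometer'] + 1.0 * X['used_time']
     + rng.normal(scale=0.5, size=n))

selected, history = forward_select(X, y, k_features=3)
for names, score in history:
    print(names, round(score, 4))
```

Each step fits one model per remaining feature, so the cost grows quickly with the number of columns; that is consistent with the note above that a large k_features is hard to run on 369 columns.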

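3) Embedded

# Covered in the next chapter: Lasso regression and decision trees can perform embedded feature selection
# In most cases, feature selection is done with embedded methods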
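As a preview of the embedded approach, here is a minimal Lasso sketch on synthetic data. The alpha value, feature names, and coefficients are assumptions chosen only to make the effect visible:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(rng.normal(size=(n, 5)),
                 columns=['v_3', 'kilometer', 'used_time', 'noise_a', 'noise_b'])
y = (3.0 * X['v_3'] - 2.0 * X['kilometer'] + 1.0 * X['used_time']
     + rng.normal(scale=0.5, size=n))

# Standardize first so the L1 penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty drives the coefficients of uninformative features to exactly zero
lasso = Lasso(alpha=0.2).fit(X_scaled, y)
coef = pd.Series(lasso.coef_, index=X.columns)

kept = coef[coef != 0].index.tolist()
print(coef)
print(kept)
```

The selection happens as a side effect of training: features whose coefficient ends up at zero are effectively dropped, with no separate search loop.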

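3.4 Lessons Learned

Feature engineering is the most critical part of a competition. In traditional competitions in particular, everyone's models tend to be similar and the gain from hyperparameter tuning is very limited, whereas the quality of the feature engineering often determines the final ranking and score.

The main purpose of feature engineering is to transform the data into features that better represent the underlying problem, and thereby improve machine learning performance. For example, outlier handling removes noise, and filling in missing values can inject prior knowledge.

Feature construction is also part of feature engineering; its goal is to make the data more expressive.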

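In some competitions the features are anonymized, so we do not know how they relate to one another. In that case we can only work with the features themselves: binning, groupby, agg, and similar operations to compute feature statistics. We can also apply further transforms such as log and exp, or combine several features with arithmetic (such as the usage time we computed above) or polynomial combinations, and then select among the results. Anonymity does limit much of what can be done with the features, although using an NN to extract features sometimes works unexpectedly well.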
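The operations listed in this paragraph can be sketched on a toy frame. The column names follow the notebook, but the values are made up, and the product v_0 * v_3 is just one arbitrary example of an arithmetic combination:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'brand': [1, 1, 2, 2, 2],
    'price': [1000.0, 3000.0, 500.0, 700.0, 900.0],
    'v_0': [43.4, 45.3, 46.0, 45.7, 44.4],
    'v_3': [2.2, 1.4, -1.0, 0.9, -1.6],
})

# groupby + agg: per-brand statistics of the target
brand_stats = (
    toy.groupby('brand')['price']
       .agg(brand_amount='count', brand_price_max='max', brand_price_median='median')
       .reset_index()
)

# Attach the statistics back to every row of the same brand
toy = toy.merge(brand_stats, on='brand', how='left')

# Nonlinear transform and an arithmetic combination of two anonymous features
toy['log_price'] = np.log1p(toy['price'])
toy['v_0_times_v_3'] = toy['v_0'] * toy['v_3']

print(toy)
```

When the statistic is built from the target, as here, it should be computed on training rows only and then merged onto the test rows, since test rows have no price.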

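When the meaning of the features is known (non-anonymous), especially in industrial-style competitions, features with more practical meaning are built from signal processing, frequency-domain extraction, kurtosis, skewness, and so on. This is feature construction grounded in domain background. The same holds in recommender systems: click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. This kind of feature construction usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features.

Feature engineering is in practice tied to the model, which is why we bin and normalize features for LR and NN. The effect of a feature transformation and the importance of a feature usually have to be verified through a model.

Overall, feature engineering is easy to get started with but very hard to master.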

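Task 3 - Feature Engineering END.

By: 阿泽

PS: Graduate student in computer science at Fudan University
Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge summaries aimed mainly at beginners)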

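About Datawhale:

Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages people to present themselves authentically, to be open and inclusive, to trust and help one another, and to dare to experiment and take responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, empowering talent development, supporting people's growth, and building connections between people, between people and knowledge, between people and companies, and between people and the future.

For this data mining learning path, topic material will be shared on Tianchi; follow Datawhale for details.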

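Study Notes

  1. Feature engineering means turning data into features that better represent the underlying problem; feature engineering sets the upper bound on your predictions.

  2. Understanding the data: knowing which data are qualitative and which are quantitative makes the data easier to process.

  3. Data cleaning (improving data quality): we handled missing values and outliers. For features with a long-tailed distribution, truncation can replace outright deletion of outliers. If a linear model is to be used, the data also need to be standardized or normalized.

  4. Feature construction (to make the data more expressive and add prior knowledge): here we built a time-difference feature, but found that it contains some NaT values; my approach was to fill those missing values with the mean of the newly built time-difference column. We also binned power and added the new feature power_bin. As for why kilometer is presumed to have been binned already, looking at the raw data shows clearly that kilometer takes only a limited number of values. For the geographic information, we have prior knowledge, so we strip the last three digits to keep the city; since this can leave empty values, we replace them with 0 to form a new category. We applied one-hot encoding as a kind of nonlinear transformation; its benefits are described above.

  5. Feature selection:
    Filter (select features by their correlation with price)
    Wrapper (use a greedy search to find a relatively good subset of features)
    Embedded (the learner selects features automatically)
    Personally, I feel this problem does not have that many features, so feature selection is not really necessary.

  6. Also remember to take the log of data that is not normally distributed, to bring it as close to a normal distribution as possible.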
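Note 4 can be sketched as follows. The first row uses the values shown by train.head(); the other rows are made up, and the integer division is an alternative way to strip the last three digits that goes straight to 0 for short codes:

```python
import pandas as pd

toy = pd.DataFrame({
    'regDate': [20040402, 20030301, 20070009],   # the last date has an invalid month "00"
    'creatDate': [20160404, 20160309, 20160402],
    'regionCode': [1046, 4366, 419],
})

# errors='coerce' turns unparseable dates into NaT instead of raising
reg = pd.to_datetime(toy['regDate'].astype(str), format='%Y%m%d', errors='coerce')
creat = pd.to_datetime(toy['creatDate'].astype(str), format='%Y%m%d', errors='coerce')

# Usage time in days; rows with NaT become NaN and are filled with the column mean
toy['used_time'] = (creat - reg).dt.days
toy['used_time'] = toy['used_time'].fillna(toy['used_time'].mean())

# City: drop the last three digits of regionCode; codes below 1000 map to 0
toy['city'] = toy['regionCode'] // 1000

print(toy)
```

Filling with the mean keeps the rows usable for linear models; for tree models, leaving the missing values as they are is also an option, as listed under missing-value handling in 3.2.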


Reposted from blog.csdn.net/weixin_45569785/article/details/105165801