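Datawhale: Data Mining for Beginners - Baseline

Baseline v1.0

Tip: This is a first-cut baseline, offered as a starting point. It provides a basic baseline and a basic walkthrough of the competition workflow. Discussion and feedback are welcome.

Competition: Data Mining for Beginners - Used Car Transaction Price Prediction

URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX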

# List the data files in the datalab directory
!ls datalab/

'ls' is not recognized as an internal or external command,
operable program or batch file.
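
The `!ls` call above fails in a Windows shell, where `ls` is not available. A minimal, portable sketch using Python's standard `pathlib` is shown below; the helper name `list_data_files` is an illustrative assumption and not part of the original notebook.

```python
from pathlib import Path

def list_data_files(directory="datalab"):
    """Return the sorted entry names in `directory` (empty list if it does not exist)."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())

# Behaves the same on Windows, macOS and Linux.
print(list_data_files("datalab"))
```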

Step 1: Import the libraries

## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time

warnings.filterwarnings('ignore')
%matplotlib inline

## Models used for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

Step 2: Load the data

## Read the data with pandas (pandas is a friendly library for loading data)
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=',')
TestA_data = pd.read_csv('used_car_testA_20200313.csv', sep=',')

## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)

Train data shape: (150000, 31)
TestA data shape: (50000, 30)

1) A quick look at the data

## Use .head() to preview the loaded data
Train_data.head()

	SaleID	name	regDate	model	brand	bodyType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30	6	1	0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914763
1	1	2262	20030301	40	1	2	0	0	15	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115	15	1	0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109	10	0	1	193	15	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110	5	1	0	68	5	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482

5 rows × 31 columns

2) Inspect the data information

## .info() gives a quick view of the column names and of missing (NaN) values
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                150000 non-null int64
brand                150000 non-null int64
bodyType             150000 non-null int64
fuelType             150000 non-null float64
gearbox              150000 non-null object
power                150000 non-null object
kilometer            150000 non-null object
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null float64
creatDate            150000 non-null float64
price                150000 non-null float64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 148531 non-null float64
v_13                 146417 non-null float64
v_14                 135884 non-null float64
dtypes: float64(19), int64(8), object(4)
memory usage: 35.5+ MB

## Use .columns to view the column names
Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')

TestA_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null int64
brand                50000 non-null int64
bodyType             50000 non-null int64
fuelType             50000 non-null float64
gearbox              50000 non-null object
power                50000 non-null object
kilometer            50000 non-null object
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null float64
offerType            50000 non-null float64
creatDate            50000 non-null float64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 49548 non-null float64
v_13                 48880 non-null float64
v_14                 45356 non-null float64
dtypes: float64(19), int64(7), object(4)
memory usage: 11.4+ MB
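
The `.info()` output shows that `v_12`, `v_13` and `v_14` have fewer non-null entries than the other columns. A small sketch of how the missing counts could be tabulated directly is given below. The toy frame and the helper name `missing_summary` are assumptions for illustration; with the real data one would pass `Train_data` or `TestA_data` instead.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and ratio of missing values for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    return pd.DataFrame({"n_missing": counts, "ratio": counts / len(df)})

# Toy data standing in for the real training set.
toy = pd.DataFrame({"v_0": [1.0, 2.0, 3.0, 4.0],
                    "v_14": [0.5, np.nan, np.nan, 1.5]})
print(missing_summary(toy))
```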

3) Browse summary statistics

## Use .describe() to view summary statistics of the numerical columns
Train_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	regionCode	seller	offerType	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	150000.000000	150000.000000	1.500000e+05	150000.000000	150000.000000	150000.000000	150000.000000	1.500000e+05	1.500000e+05	1.500000e+05	...	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	148531.000000	146417.000000	135884.000000
mean	74999.500000	68349.172873	2.003417e+07	47.128953	8.052527	1.870747	1.394827	1.997783e+05	2.841478e+05	1.415701e+06	...	0.246643	0.062381	0.174574	0.296920	0.406928	-0.164371	-0.446352	-0.085471	0.022190	0.008456
std	43301.414527	61103.875095	5.364988e+04	49.535881	7.864603	5.221312	15.676749	1.985073e+06	2.376421e+06	5.151321e+06	...	0.116636	0.133581	0.927042	1.773396	1.962003	3.758661	2.002930	2.257730	1.267140	1.050417
min	0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000	0.000000	-0.273510	-8.206004	-8.399672	-9.168192	-9.404106	-9.639552	-6.113291	-6.546556
25%	37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	7.390000e+02	0.000000e+00	0.000000e+00	...	0.241064	0.000161	0.055272	0.036050	0.035225	-3.666042	-2.026105	-1.745234	-0.999703	-0.426907
50%	74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	2.010000e+03	0.000000e+00	0.000000e+00	...	0.256928	0.001547	0.090081	0.058523	0.063335	1.240603	-0.457218	-0.160305	0.008602	0.155026
75%	112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	3.719000e+03	0.000000e+00	0.000000e+00	...	0.265170	0.104255	0.120590	0.081996	0.094738	2.691063	1.115744	1.572130	0.929041	0.700543
max	149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	999.000000	3500.000000	2.016041e+07	2.016041e+07	2.016041e+07	...	1.401999	1.387847	12.357011	18.819042	18.801218	18.802072	13.562011	11.147669	8.658418	2.743993

8 rows × 27 columns

TestA_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	regionCode	seller	offerType	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	50000.000000	50000.000000	5.000000e+04	50000.000000	50000.000000	50000.00000	50000.000000	5.000000e+04	5.000000e+04	5.000000e+04	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	49548.000000	48880.000000	45356.000000
mean	174999.500000	68542.223280	2.003393e+07	46.844520	8.056240	1.82978	1.310420	1.846027e+05	2.693423e+05	1.420901e+06	...	0.245942	0.062157	0.167631	0.278298	0.394680	-0.180498	-0.441886	-0.093695	0.018620	0.010720
std	14433.901067	61052.808133	5.368870e+04	49.469548	7.819477	4.38703	11.319999	1.907934e+06	2.314646e+06	5.160173e+06	...	0.113548	0.132116	0.889236	1.705061	1.904670	3.749435	2.011260	2.264558	1.262383	1.039922
min	150000.000000	0.000000	1.991000e+07	0.000000	0.000000	0.00000	0.000000	0.000000e+00	0.000000e+00	-4.137733e+00	...	0.000000	0.000000	0.000000	-7.481381	-8.088973	-9.160049	-8.916949	-8.249206	-5.881834	-6.112667
25%	162499.750000	11203.500000	1.999091e+07	10.000000	1.000000	0.00000	0.000000	7.500000e+02	0.000000e+00	0.000000e+00	...	0.241186	0.000157	0.055647	0.035926	0.034963	-3.656769	-2.033607	-1.744112	-0.999841	-0.428772
50%	174999.500000	52248.500000	2.003091e+07	29.000000	6.000000	1.00000	0.000000	2.025000e+03	0.000000e+00	0.000000e+00	...	0.257005	0.005279	0.090068	0.058519	0.063502	1.208642	-0.447549	-0.165747	0.009142	0.152347
75%	187499.250000	118856.500000	2.007110e+07	65.000000	13.000000	3.00000	1.000000	3.739000e+03	0.000000e+00	0.000000e+00	...	0.265163	0.104231	0.120749	0.081606	0.094614	2.675705	1.136973	1.567727	0.924715	0.702953
max	199999.000000	196805.000000	2.015121e+07	246.000000	39.000000	500.00000	610.000000	2.016041e+07	2.016041e+07	2.016041e+07	...	1.339804	1.332522	12.338872	18.761276	18.811053	18.856218	12.950498	7.430223	5.228962	2.624622

8 rows × 26 columns

Step 3: Build the features and the label

1) Extract the names of the numerical feature columns

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1',
       'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11',
       'v_12', 'v_13', 'v_14'],
      dtype='object')

categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)

Index(['gearbox', 'power', 'kilometer', 'notRepairedDamage'], dtype='object')
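
Columns with dtype `object` are left out of the numerical feature list built here. If such columns actually hold numbers stored as text, one possible way to recover them is `pd.to_numeric` with `errors='coerce'`. The sketch below uses made-up values, including a `'-'` placeholder, which is an assumption about how an unparseable entry might look; it is not a step taken by this baseline.

```python
import pandas as pd

# Toy values only; the real column contents may differ.
toy = pd.DataFrame({"power": ["60", "0", "163"],
                    "notRepairedDamage": ["0.0", "-", "1.0"]})

# Coerce text to numbers; anything unparseable (such as '-') becomes NaN.
converted = toy.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
print(converted)
```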

2) Build the training and test samples

## Select the feature columns
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]

## Extract the feature columns and the label column to build the training and test samples
X_data = Train_data[feature_cols]
Y_data = Train_data['price']

X_test  = TestA_data[feature_cols]

print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
print('Y data shape:',Y_data.shape)

X train shape: (150000, 15)
X test shape: (50000, 15)
Y data shape: (150000,)

## Define a helper that prints summary statistics, for reuse later
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))

3) Basic distribution statistics of the label

print('Sta of label:')
Sta_inf(Y_data)

Sta of label:
_min -4.236904217
_max: 99999.0
_mean 5635.615473908368
_ptp 100003.236904217
_std 7481.058399218618
_var 55966234.77251944
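
The statistics above (a mean of about 5,636 against a maximum of 99,999) suggest a long right tail. A common option for such targets, which this baseline does not use, is to train on `log1p(price)` and invert predictions with `expm1`. The sketch below only demonstrates the round trip on made-up prices. Note that `log1p` needs values greater than -1, so a negative minimum like the one above would have to be handled first; clipping at 0 here is an assumption.

```python
import numpy as np

# Made-up prices, including one negative value.
prices = np.array([-4.2, 1850.0, 3600.0, 6222.0, 99999.0])

# Clip negatives (assumed to be noise) so log1p is defined, then transform.
clipped = np.clip(prices, 0, None)
log_prices = np.log1p(clipped)

# Invert the transform to get back to the original price scale.
restored = np.expm1(log_prices)
print(log_prices.round(3))
print(restored.round(1))
```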

## Plot a histogram of the label to inspect its distribution
plt.hist(Y_data)
plt.show()
plt.close()

[Figure: histogram of the label values (output_24_0.png); image not available]

4) Fill missing values with -1

X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
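
Filling with a constant hides which rows were originally missing. One optional variation, sketched below under assumed names, also keeps a 0/1 indicator column for each affected feature. The helper `fill_with_flag` and the `_missing` suffix are hypothetical and not part of the baseline.

```python
import numpy as np
import pandas as pd

def fill_with_flag(df, value=-1):
    """Fill NaNs with `value` and add a 0/1 `<col>_missing` column per affected column."""
    out = df.copy()
    for col in df.columns[df.isnull().any()]:
        out[col + "_missing"] = df[col].isnull().astype(int)
    return out.fillna(value)

# Toy frame standing in for X_data.
toy = pd.DataFrame({"v_13": [0.8, np.nan], "v_14": [0.9, 0.2]})
print(fill_with_flag(toy))
```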

Step 4: Model training and prediction

1) Use 5-fold cross-validation with xgb to check how the model parameters perform



Y_data=Y_data.astype(int)
print(Y_data)

0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int32

# print(X_data)

## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'

scores_train = []
scores = []

## 5-fold cross-validation
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)

for train_ind,val_ind in sk.split(X_data,Y_data):
    print(train_ind,val_ind)
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

# Average over all folds (scores_train); score_train alone holds only the last fold's MAE
print('Train mae:',np.mean(scores_train))
print('Val mae',np.mean(scores))

[     1      2      4 ... 149994 149996 149997] [     0      3     11 ... 149995 149998 149999]
[16:52:52] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      2 ... 149997 149998 149999] [     6      7     10 ... 149987 149990 149993]
[16:53:23] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      3 ... 149997 149998 149999] [     2      4      8 ... 149980 149986 149991]
[16:53:54] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      2      3 ... 149997 149998 149999] [     1      9     14 ... 149971 149973 149983]
[16:54:25] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      2 ... 149995 149998 149999] [     5     16     22 ... 149994 149996 149997]
[16:54:56] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Train mae: 658.4843255196383
Val mae 751.1093641226322
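
`StratifiedKFold` is designed for classification targets; it accepts the prices here because they were cast to integers and are therefore treated as class labels. For a continuous target, plain `KFold` is the more usual choice. The sketch below runs the same kind of loop with `KFold` on synthetic data, with `GradientBoostingRegressor` standing in for `XGBRegressor` so that it runs without xgboost. The data and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X_data / Y_data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores_train, scores_val = [], []
for train_ind, val_ind in kf.split(X):
    # A fresh model per fold, so no state carries over between folds.
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_ind], y[train_ind])
    scores_train.append(mean_absolute_error(y[train_ind], model.predict(X[train_ind])))
    scores_val.append(mean_absolute_error(y[val_ind], model.predict(X[val_ind])))

# Average over all folds, not just the last one.
print('Train mae:', np.mean(scores_train))
print('Val mae:', np.mean(scores_val))
```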

2) Define the xgb and lgb model functions

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
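
By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which for regressors is R². Since this baseline evaluates models with MAE, one option is to pass `scoring='neg_mean_absolute_error'` so that the search optimizes the same metric. The sketch below uses synthetic data, and `GradientBoostingRegressor` is a stand-in for `LGBMRegressor` so that it runs without lightgbm; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=150)

param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X, y)

# best_score_ is the negated MAE of the best candidate; flip the sign to report MAE.
print(search.best_params_, -search.best_score_)
```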

3) Split the dataset (Train, Val) for model training, evaluation and prediction

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)

print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)

Train lgb...
MAE of val with lgb: 733.8377157667358
Predict lgb...
Sta of Predict lgb:
_min -4653.411185581689
_max: 89013.32963882982
_mean 5640.487279687575
_ptp 93666.7408244115
_std 7358.320067626515
_var 54144874.21763509

print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)

print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)

Train xgb...
[16:56:15] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MAE of val with xgb: 759.8445985911052
Predict xgb...
[16:56:47] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Sta of Predict xgb:
_min -847.2674
_max: 90753.81
_mean 5640.081
_ptp 91601.08
_std 7335.928
_var 53815840.0

4) Weighted blending of the two models' results

## Here we use a simple weighted blend
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a real price cannot be negative, so we correct them after the fact
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))

MAE of val with Weighted ensemble: 732.8346989009693
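
The weighting formula gives each model the other model's share of the total MAE, so the model with the lower error gets the larger weight and the two weights sum to 1. With the validation MAEs logged above, this works out to roughly 0.509 for lgb and 0.491 for xgb. The sketch below reproduces that arithmetic. The helper name `mae_weights` and the toy predictions are assumptions, and `np.clip` is shown as one alternative to assigning a constant to negative predictions.

```python
import numpy as np

def mae_weights(mae_a, mae_b):
    """Weight each model by the other model's share of the total MAE."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total

MAE_lgb, MAE_xgb = 733.8377157667358, 759.8445985911052  # values from the logs above
w_lgb, w_xgb = mae_weights(MAE_lgb, MAE_xgb)
print(round(w_lgb, 3), round(w_xgb, 3))

# Toy predictions: blend, then clip negative prices to 0.
pred_lgb = np.array([-50.0, 1000.0, 5000.0])
pred_xgb = np.array([-10.0, 1100.0, 4800.0])
blend = np.clip(w_lgb * pred_lgb + w_xgb * pred_xgb, 0, None)
print(blend)
```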

sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb

## Check the distribution of the predictions
plt.hist(sub_Weighted)
plt.show()
plt.close()

[Figure: histogram produced by the cell above; image not available]

5) Write out the results

sub = pd.DataFrame()
sub['SaleID'] = TestA_data.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)

sub.head()

	SaleID	price
0	150000	40535.122371
1	150001	349.752006
2	150002	6698.887589
3	150003	11373.901882
4	150004	562.043999
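
Before uploading, it can help to sanity-check the submission frame. The sketch below mirrors the two-column layout written above and checks row count, unique IDs and finite prices. The helper name `check_submission` and the specific checks are assumptions, not requirements stated by the competition.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_rows):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(sub.columns) == ['SaleID', 'price'], 'unexpected columns'
    assert len(sub) == expected_rows, 'unexpected number of rows'
    assert sub['SaleID'].is_unique, 'duplicate SaleID values'
    assert np.isfinite(sub['price']).all(), 'NaN or inf in price'
    return True

# Toy frame standing in for `sub`; with the real data, expected_rows would be len(TestA_data).
toy_sub = pd.DataFrame({'SaleID': [150000, 150001], 'price': [40535.12, 349.75]})
print(check_submission(toy_sub, expected_rows=2))
```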

Baseline END.

— By: ML67

    Email: [email protected]
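    PS: Graduate student at Huazhong University of Science and Technology and a long-time regular on Tianchi and similar platforms; happy to exchange ideas with everyone.
    github: https://github.com/mlw67 (will soon be organizing some book derivations and code)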

— By: AI蜗牛车

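    PS: Graduate student at Southeast University; research focuses mainly on spatio-temporal sequence prediction and time series data mining
    WeChat official account: AI蜗牛车
    Zhihu: https://www.zhihu.com/people/seu-aigua-niu-che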
    github: https://github.com/chehongshu

— By: 阿泽

    PS: Graduate student in computer science at Fudan University
    Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge summaries aimed mainly at beginners)

— By: 小雨姑娘

    PS: Data mining enthusiast with multiple top finishes in competitions.
    Zhihu: 小雨姑娘's machine learning notes: https://zhuanlan.zhihu.com/mlbasic

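About Datawhale:

Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and has gathered a group of team members with an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and readiness to take responsibility. Datawhale also applies open-source ideas to explore open-source content, open-source learning and open-source solutions, to empower and support talent development, and to build connections between people, between people and knowledge, between people and companies, and between people and the future.

For this data mining learning path, the topic materials will be shared on Tianchi; follow Datawhale for details: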

Data Mining for Beginners - Baseline