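Datawhale Data Mining for Beginners: Baseline

Baseline v1.0

Tip: This is a first, minimal baseline meant as a starting point. It provides a basic baseline and a short walkthrough of a typical competition workflow. Feedback and discussion are welcome.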

Competition: Data Mining for Beginners - Used Car Price Prediction

URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

# List the data files
!ls datalab/
'ls' is not recognized as an internal or external command,
operable program or batch file.
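
The error above appears because `ls` is not available in the Windows command shell that this notebook ran on. A minimal, portable sketch for listing the data directory from Python is shown here; the helper name `list_data_files` is hypothetical, and the `datalab` directory name is taken from the cell above.

```python
from pathlib import Path

def list_data_files(directory="datalab"):
    """Return the sorted entry names in `directory`, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())

# Works the same on Windows, macOS, and Linux.
print(list_data_files("datalab"))
```

Returning an empty list for a missing directory keeps the cell from failing when the data lives somewhere else.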

Step 1: Import the toolbox

## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time

warnings.filterwarnings('ignore')
%matplotlib inline

## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

Step 2: Load the data

## Read the data with pandas (a convenient library for loading tabular data)
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=',')
TestA_data = pd.read_csv('used_car_testA_20200313.csv', sep=',')

## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)

1) A quick look at the data

## Use .head() to preview the first rows of the loaded data
Train_data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30 6 1 0.0 0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914763
1 1 2262 20030301 40 1 2 0.0 0 0 15 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115 15 1 0.0 0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109 10 0 0.0 1 193 15 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110 5 1 0.0 0 68 5 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482

5 rows × 31 columns

2) Inspect the data info

## .info() lists the column names, dtypes, and non-null counts (which reveal missing values)
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                150000 non-null int64
brand                150000 non-null int64
bodyType             150000 non-null int64
fuelType             150000 non-null float64
gearbox              150000 non-null object
power                150000 non-null object
kilometer            150000 non-null object
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null float64
creatDate            150000 non-null float64
price                150000 non-null float64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 148531 non-null float64
v_13                 146417 non-null float64
v_14                 135884 non-null float64
dtypes: float64(19), int64(8), object(4)
memory usage: 35.5+ MB
## Use .columns to view the column names
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
TestA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null int64
brand                50000 non-null int64
bodyType             50000 non-null int64
fuelType             50000 non-null float64
gearbox              50000 non-null object
power                50000 non-null object
kilometer            50000 non-null object
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null float64
offerType            50000 non-null float64
creatDate            50000 non-null float64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 49548 non-null float64
v_13                 48880 non-null float64
v_14                 45356 non-null float64
dtypes: float64(19), int64(7), object(4)
memory usage: 11.4+ MB
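
In the .info() output above, v_12, v_13 and v_14 have fewer non-null entries than rows (for example 135884 of 150000 for v_14 in the training set). A small sketch for summarizing missing values per column is shown here; the helper name `missing_summary` and the toy frame are hypothetical, and the same call would apply to `Train_data` or `TestA_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and ratio of missing values per column, only for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"n_missing": counts, "ratio": counts / len(df)})
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)

# Toy data standing in for the real training set.
toy = pd.DataFrame({
    "v_12": [0.1, np.nan, 0.3, 0.4],
    "v_14": [np.nan, np.nan, 1.0, 2.0],
    "power": [60, 0, 163, 193],
})
print(missing_summary(toy))
```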

3) Browse the summary statistics

## .describe() shows summary statistics for the numeric feature columns
Train_data.describe()
SaleID name regDate model brand bodyType fuelType regionCode seller offerType ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 150000.000000 150000.000000 150000.000000 150000.000000 1.500000e+05 1.500000e+05 1.500000e+05 ... 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 148531.000000 146417.000000 135884.000000
mean 74999.500000 68349.172873 2.003417e+07 47.128953 8.052527 1.870747 1.394827 1.997783e+05 2.841478e+05 1.415701e+06 ... 0.246643 0.062381 0.174574 0.296920 0.406928 -0.164371 -0.446352 -0.085471 0.022190 0.008456
std 43301.414527 61103.875095 5.364988e+04 49.535881 7.864603 5.221312 15.676749 1.985073e+06 2.376421e+06 5.151321e+06 ... 0.116636 0.133581 0.927042 1.773396 1.962003 3.758661 2.002930 2.257730 1.267140 1.050417
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000 0.000000 -0.273510 -8.206004 -8.399672 -9.168192 -9.404106 -9.639552 -6.113291 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 7.390000e+02 0.000000e+00 0.000000e+00 ... 0.241064 0.000161 0.055272 0.036050 0.035225 -3.666042 -2.026105 -1.745234 -0.999703 -0.426907
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 2.010000e+03 0.000000e+00 0.000000e+00 ... 0.256928 0.001547 0.090081 0.058523 0.063335 1.240603 -0.457218 -0.160305 0.008602 0.155026
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 3.719000e+03 0.000000e+00 0.000000e+00 ... 0.265170 0.104255 0.120590 0.081996 0.094738 2.691063 1.115744 1.572130 0.929041 0.700543
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 999.000000 3500.000000 2.016041e+07 2.016041e+07 2.016041e+07 ... 1.401999 1.387847 12.357011 18.819042 18.801218 18.802072 13.562011 11.147669 8.658418 2.743993

8 rows × 27 columns

TestA_data.describe()
SaleID name regDate model brand bodyType fuelType regionCode seller offerType ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 50000.000000 50000.000000 5.000000e+04 50000.000000 50000.000000 50000.00000 50000.000000 5.000000e+04 5.000000e+04 5.000000e+04 ... 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 49548.000000 48880.000000 45356.000000
mean 174999.500000 68542.223280 2.003393e+07 46.844520 8.056240 1.82978 1.310420 1.846027e+05 2.693423e+05 1.420901e+06 ... 0.245942 0.062157 0.167631 0.278298 0.394680 -0.180498 -0.441886 -0.093695 0.018620 0.010720
std 14433.901067 61052.808133 5.368870e+04 49.469548 7.819477 4.38703 11.319999 1.907934e+06 2.314646e+06 5.160173e+06 ... 0.113548 0.132116 0.889236 1.705061 1.904670 3.749435 2.011260 2.264558 1.262383 1.039922
min 150000.000000 0.000000 1.991000e+07 0.000000 0.000000 0.00000 0.000000 0.000000e+00 0.000000e+00 -4.137733e+00 ... 0.000000 0.000000 0.000000 -7.481381 -8.088973 -9.160049 -8.916949 -8.249206 -5.881834 -6.112667
25% 162499.750000 11203.500000 1.999091e+07 10.000000 1.000000 0.00000 0.000000 7.500000e+02 0.000000e+00 0.000000e+00 ... 0.241186 0.000157 0.055647 0.035926 0.034963 -3.656769 -2.033607 -1.744112 -0.999841 -0.428772
50% 174999.500000 52248.500000 2.003091e+07 29.000000 6.000000 1.00000 0.000000 2.025000e+03 0.000000e+00 0.000000e+00 ... 0.257005 0.005279 0.090068 0.058519 0.063502 1.208642 -0.447549 -0.165747 0.009142 0.152347
75% 187499.250000 118856.500000 2.007110e+07 65.000000 13.000000 3.00000 1.000000 3.739000e+03 0.000000e+00 0.000000e+00 ... 0.265163 0.104231 0.120749 0.081606 0.094614 2.675705 1.136973 1.567727 0.924715 0.702953
max 199999.000000 196805.000000 2.015121e+07 246.000000 39.000000 500.00000 610.000000 2.016041e+07 2.016041e+07 2.016041e+07 ... 1.339804 1.332522 12.338872 18.761276 18.811053 18.856218 12.950498 7.430223 5.228962 2.624622

8 rows × 26 columns

Step 3: Build features and labels

1) Extract the names of the numeric feature columns

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1',
       'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11',
       'v_12', 'v_13', 'v_14'],
      dtype='object')
categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)
Index(['gearbox', 'power', 'kilometer', 'notRepairedDamage'], dtype='object')
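
Because gearbox, power, kilometer and notRepairedDamage are read as object dtype here, they are left out of `numerical_cols` and therefore out of the baseline's features. The source does not say why they are object-typed. The sketch below assumes the columns hold numeric strings plus an occasional non-numeric placeholder such as '-' (an assumption, as are the toy values and the helper name `coerce_numeric`), and converts them so they could be used as features.

```python
import pandas as pd

def coerce_numeric(df, cols):
    """Return a copy of df with `cols` converted to numbers; unparseable entries become NaN."""
    out = df.copy()
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Hypothetical values: '-' stands in for a non-numeric placeholder.
toy = pd.DataFrame({"power": ["60", "0", "-"], "kilometer": ["12.5", "15", "5"]})
converted = coerce_numeric(toy, ["power", "kilometer"])
print(converted.dtypes)
```

Entries that fail to parse become NaN, so they would then be handled by the same missing-value filling used for the other features.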

2) Build the training and test samples

## Select the feature columns
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]

## Extract the feature and label columns to build the training and test samples
X_data = Train_data[feature_cols]
Y_data = Train_data['price']

X_test  = TestA_data[feature_cols]

print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
print('Y data shape:',Y_data.shape)
X train shape: (150000, 15)
X test shape: (50000, 15)
Y data shape: (150000,)
## A small helper that prints summary statistics, reused below
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))

3) Basic statistics of the label distribution

print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min -4.236904217
_max: 99999.0
_mean 5635.615473908368
_ptp 100003.236904217
_std 7481.058399218618
_var 55966234.77251944
## Plot a histogram of the label to inspect its distribution
plt.hist(Y_data)
plt.show()
plt.close()

[Figure: histogram of the label (price) distribution (output_24_0.png)]
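
The statistics above show a long right tail (mean about 5636, maximum 99999). One common option for such targets, which this baseline does not use, is to train on log(1 + price) and invert the transform on the predictions. The sketch below is only an illustration under that assumption; the helper names are hypothetical, and clipping at 0 is an added choice to handle the slightly negative minimum seen in the label.

```python
import numpy as np

def to_log_target(y):
    """Compress a long-tailed target with log(1 + y), clipping negatives to 0 first."""
    return np.log1p(np.clip(y, 0, None))

def from_log_target(y_log):
    """Invert to_log_target for non-negative inputs."""
    return np.expm1(y_log)

prices = np.array([11.0, 1850.0, 5200.0, 99999.0])
log_prices = to_log_target(prices)
print(log_prices)
print(from_log_target(log_prices))
```

A model would be fitted on `to_log_target(Y_data)` and its predictions passed through `from_log_target` before computing MAE on the original scale.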

4) Fill missing values with -1

X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)

Step 4: Model training and prediction

1) Use xgb with 5-fold cross-validation to check how the parameters perform



Y_data=Y_data.astype(int)
print(Y_data)
0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int32
# print(X_data)
## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'

scores_train = []
scores = []

## 5-fold cross-validation
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)

for train_ind,val_ind in sk.split(X_data,Y_data):
    print(train_ind,val_ind)
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

print('Train mae:',np.mean(scores_train))  # mean over all folds
print('Val mae',np.mean(scores))
[     1      2      4 ... 149994 149996 149997] [     0      3     11 ... 149995 149998 149999]
[16:52:52] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      2 ... 149997 149998 149999] [     6      7     10 ... 149987 149990 149993]
[16:53:23] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      3 ... 149997 149998 149999] [     2      4      8 ... 149980 149986 149991]
[16:53:54] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      2      3 ... 149997 149998 149999] [     1      9     14 ... 149971 149973 149983]
[16:54:25] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[     0      1      2 ... 149995 149998 149999] [     5     16     22 ... 149994 149996 149997]
[16:54:56] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Train mae: 658.4843255196383
Val mae 751.1093641226322
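
StratifiedKFold is intended for classification targets, and scikit-learn rejects a continuous float target there, which is presumably why the notebook casts Y_data to int first. For a regression target, plain KFold is the more usual choice. The sketch below shows that alternative with `cross_val_score` and MAE scoring; it uses synthetic data and scikit-learn's GradientBoostingRegressor as a stand-in so that it runs without the competition files or xgboost, and none of it is the notebook's own code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_data / Y_data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=20, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports MAE as a negative score so that larger is better.
neg_mae = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_absolute_error")
val_mae = -neg_mae
print("Val MAE per fold:", val_mae)
print("Val MAE mean:", val_mae.mean())
```

With KFold the float prices can be used directly, so the integer cast is no longer needed for the split.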

2) Define the xgb and lgb model functions

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
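
When no `scoring` argument is given, GridSearchCV ranks candidates with the estimator's own `score` method, which is R² for regressors, while this notebook evaluates with MAE. The sketch below shows how the search could be aligned with MAE. It uses scikit-learn's GradientBoostingRegressor and synthetic data as stand-ins for LGBMRegressor and the competition data, and `cv=3` is an arbitrary choice to keep it fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for x_train / y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=150)

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid={"learning_rate": [0.01, 0.05, 0.1, 0.2]},
    scoring="neg_mean_absolute_error",  # pick the learning rate with the lowest MAE
    cv=3,
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```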

3) Split the data (Train, Val) for model training, evaluation, and prediction

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)

print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 733.8377157667358
Predict lgb...
Sta of Predict lgb:
_min -4653.411185581689
_max: 89013.32963882982
_mean 5640.487279687575
_ptp 93666.7408244115
_std 7358.320067626515
_var 54144874.21763509
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)

print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
Train xgb...
[16:56:15] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MAE of val with xgb: 759.8445985911052
Predict xgb...
[16:56:47] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Sta of Predict xgb:
_min -847.2674
_max: 90753.81
_mean 5640.081
_ptp 91601.08
_std 7335.928
_var 53815840.0

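4) Weighted fusion of the two models' results

## Here we use a simple weighted average: each model's weight is 1 minus its share of the total MAE
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a real price cannot be negative, so we apply this post-correction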
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 732.8346989009693
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb

## Inspect the distribution of the predicted values
plt.hist(sub_Weighted)
plt.show()
plt.close()

[Figure: histogram output of the cell above]
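
The weighting used in this step can be written as a small function. With two models, the weight 1 - MAE_a/(MAE_a + MAE_b) equals MAE_b/(MAE_a + MAE_b), so the two weights sum to 1 and the model with the lower validation MAE gets the larger weight. The sketch below is a restatement under that reading; the function name `mae_weighted_blend` is hypothetical, the floor of 10 mirrors the post-correction in the notebook, and the prediction arrays are toy values.

```python
import numpy as np

def mae_weighted_blend(pred_a, pred_b, mae_a, mae_b, floor=10.0):
    """Blend two prediction arrays with weights inversely related to each model's MAE."""
    total = mae_a + mae_b
    w_a = 1.0 - mae_a / total  # equals mae_b / total
    w_b = 1.0 - mae_b / total  # equals mae_a / total
    blended = w_a * np.asarray(pred_a, dtype=float) + w_b * np.asarray(pred_b, dtype=float)
    blended[blended < 0] = floor  # a price cannot be negative
    return blended, (w_a, w_b)

# Validation MAEs reported above for lgb and xgb; the predictions are made up.
blended, (w_lgb, w_xgb) = mae_weighted_blend(
    [100.0, -50.0], [200.0, -30.0], 733.8377157667358, 759.8445985911052
)
print("weights:", w_lgb, w_xgb)
print("blended:", blended)
```

Applying the same function to the test-set predictions would also apply the negative-price correction there, which the notebook only applies to the validation predictions.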

5) Write the results

sub = pd.DataFrame()
sub['SaleID'] = TestA_data.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
sub.head()
SaleID price
0 150000 40535.122371
1 150001 349.752006
2 150002 6698.887589
3 150003 11373.901882
4 150004 562.043999
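
Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below assumes the two-column layout built above (SaleID, price); the specific checks and the helper name `check_submission` are illustrative assumptions, not the competition's official validation rules.

```python
import pandas as pd

def check_submission(sub, expected_rows):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    if list(sub.columns) != ["SaleID", "price"]:
        return ["columns should be exactly ['SaleID', 'price']"]
    problems = []
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(sub)}")
    if sub["SaleID"].duplicated().any():
        problems.append("duplicate SaleID values")
    if sub["price"].isnull().any():
        problems.append("missing prices")
    if (sub["price"] < 0).any():
        problems.append("negative prices")
    return problems

# Toy submission with one deliberately negative price.
toy = pd.DataFrame({"SaleID": [150000, 150001, 150002],
                    "price": [40535.1, -12.0, 6698.9]})
print(check_submission(toy, expected_rows=3))
```

For the real file the call would be `check_submission(sub, expected_rows=len(TestA_data))`.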

Baseline END.

— By: ML67

    Email: [email protected]
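    PS: Graduate student at Huazhong University of Science and Technology and a long-time participant on Tianchi and similar platforms; happy to exchange ideas.
    github: https://github.com/mlw67 (will be organizing some book derivations and code in the near future)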

— By: AI蜗牛车

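    PS: Graduate student at Southeast University; research focuses on spatio-temporal sequence prediction and time-series data mining
    WeChat official account: AI蜗牛车
    Zhihu: https://www.zhihu.com/people/seu-aigua-niu-che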
    github: https://github.com/chehongshu

— By: 阿泽

    PS: Graduate student in computer science at Fudan University
    Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge summaries aimed mainly at beginners)

— By: 小雨姑娘

    PS: Data mining enthusiast with multiple top finishes in competitions.
    Zhihu: 小雨姑娘的机器学习笔记: https://zhuanlan.zhihu.com/mlbasic

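About Datawhale:

Datawhale is an open-source organization focused on data science and AI. It brings together strong learners from universities and well-known companies across many fields, and its members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, willingness to experiment, and taking responsibility. Datawhale also applies open-source principles to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future.

Topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details.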


Reposted from blog.csdn.net/dualvencsdn/article/details/108982772