Getting Started with Data Mining

Used Car Price Prediction

Baseline + EDA Stage

Workflow: understand the competition background → download the data → import the necessary packages → read the data (and take a quick look at it) → EDA → feature engineering → model building → parameter tuning → model ensembling → output the results

  • First, from the competition background, the fields are:
    SaleID - sale sample ID
    name - car code
    regDate - car registration date
    model - model code
    brand - brand
    bodyType - body type
    fuelType - fuel type
    gearbox - gearbox
    power - engine power
    kilometer - kilometers driven
    notRepairedDamage - whether the car has unrepaired damage
    regionCode - region code of where the car is viewed
    seller - seller type
    offerType - offer type
    creatDate - time the ad was published
    price - car price (the prediction target)
    v_0 - v_14: 15 anonymous features. All values are desensitized and label-encoded, i.e. purely numeric.
    Download the data and take a first look at its basic characteristics:
import pandas as pd
import numpy as np

# Load the data
Train_data = pd.read_csv("used_car_train_20200313.csv",sep=" ")
Test_data = pd.read_csv("used_car_testA_20200313.csv",sep=" ")

print("Train data shape:",Train_data.shape)
print("Test data shape:",Test_data.shape)
Train_data.head()      ### peek at a few concrete rows
Train_data.describe()  ### distribution stats (min, max, quartiles, etc.)
Train_data.info()      ### column dtypes and non-null counts

Building the Baseline:

  1. Import the basic toolkits:
### Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display,clear_output
import time

warnings.filterwarnings('ignore')
%matplotlib inline

## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import  RandomForestRegressor,GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation (KFold added here: it is needed for the regression CV loop below)
from sklearn.model_selection import GridSearchCV,cross_val_score,KFold,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error
  2. Read in the data
# Load the data
Train_data = pd.read_csv("used_car_train_20200313.csv",sep=" ")
Test_data = pd.read_csv("used_car_testA_20200313.csv",sep=" ")

print("Train data shape:",Train_data.shape)
print("Test data shape:",Test_data.shape)
  3. Inspect the data's summary statistics
Train_data.head()
Train_data.info()
Train_data.columns
Test_data.info()
Train_data.describe()
Test_data.describe()
  4. Feature engineering
### Extract the names of the numeric feature columns
numerical_cols = Train_data.select_dtypes(exclude='object').columns
print(numerical_cols)
categorical_cols = Train_data.select_dtypes(include= 'object').columns
print(categorical_cols)
### Select the feature columns
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate',
                                                               'price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]

X_data = Train_data[feature_cols]
Y_data = Train_data['price']

X_test = Test_data[feature_cols]
print(X_data.columns)
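A quick sanity check (a small sketch, not part of the original flow): list which numeric columns the two filters above actually excluded.

# Sketch: columns dropped by the ID/date filter and the 'Type' filter
dropped = [col for col in numerical_cols if col not in feature_cols]
print('Excluded numeric columns:', dropped)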
### Define a small statistics helper, reused for later summaries
def Sta_inf(data):
    print('min',np.min(data))
    print('max',np.max(data))
    print('mean',np.mean(data))
    print('ptp [range: max - min]',np.ptp(data))
    print('std',np.std(data))
    print('var',np.var(data))
### Distribution of the prediction target
Sta_inf(Y_data)
plt.hist(Y_data)
plt.show()
plt.close()
### Fill missing values with -1 (a sentinel value that tree models can split on)
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)

  5. Model building

### Model training and prediction
### 1) XGBoost model, evaluated with 5-fold cross-validation
xgr = xgb.XGBRegressor(n_estimators=120,learning_rate=0.1,gamma=0,subsample=0.8,colsample_bytree=0.9,max_depth=7)
scores_train = []
scores = []

# Use KFold here: the target is continuous, and StratifiedKFold (meant for
# classification) rejects a continuous y in recent scikit-learn versions
sk = KFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)
print('Train mae:',np.mean(scores_train))
print('Val mae:',np.mean(scores))
### 2) Define xgb and lgb model-building functions
def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
# Split off a training/validation set
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
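Note that build_model_lgb returns the fitted GridSearchCV object itself, running with its defaults: k-fold cross-validation (5 folds in recent scikit-learn) scored by the regressor's own R², then refitting the best configuration on all of the passed-in data, which is why .predict can be called on it directly. A small sketch of inspecting what the search chose:

# Sketch: inspect the grid-search outcome of the fitted model_lgb above
print(model_lgb.best_params_)   # the selected learning_rate
print(model_lgb.best_score_)    # mean cross-validated score (R^2 by default)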


print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)

print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
  6. Model ensembling
## Here we use a simple weighted-average ensemble (a generalized sketch follows the plots below)
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions come out negative, but a real price can never be negative, so we apply a corresponding post-hoc correction
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
plt.hist(val_Weighted)
plt.show()
plt.close()
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb

## Check the distribution of the ensembled predictions
plt.hist(sub_Weighted)
plt.show()
plt.close()
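The weights above sum to 1 and are inversely related to each model's validation MAE: for two models, 1 - MAE_lgb/(MAE_lgb+MAE_xgb) equals MAE_xgb/(MAE_lgb+MAE_xgb), so the model with the smaller error gets the larger share. A hedged sketch generalizing the same inverse-error idea to any number of models (the helper name is made up):

# Hypothetical helper: normalized inverse-error weights for k models.
# Lower MAE -> larger weight; for k = 2 this reproduces the formula above.
def inverse_error_weights(maes):
    inv = [1.0 / m for m in maes]
    total = sum(inv)
    return [w / total for w in inv]

print(inverse_error_weights([MAE_lgb, MAE_xgb]))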
  7. Output the results
# Write the submission file
sub = pd.DataFrame()
sub['SaleID'] = Test_data.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
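A quick check of the written file (a sketch, not in the original): the row count should match the test set.

# Sketch: verify the submission file before uploading
check = pd.read_csv('./sub_Weighted.csv')
print(check.shape)   # should be (len(Test_data), 2)
print(check.head())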

Summary: a baseline is already a complete data-mining pipeline in miniature; the work that follows is essentially a series of "add, remove, revise" operations around it. Every step here looks rough, but it gives the coming work a direction and a line of thinking.

EDA Stage
The exploratory data analysis stage, which can also be seen as data preprocessing. It should basically cover inspecting and handling missing values, examining distributions (taking the log/exponential etc. to approximate a normal distribution), consistency checks, and detecting erroneous values and outliers.

#coding:utf-8
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# Count missing values

Train_data.isnull().sum()
Test_data.isnull().sum()
# Visualize the NaN counts
missing = Train_data.isnull().sum() 
missing = missing[missing > 0] 
missing
missing.sort_values(inplace=True) 
missing.plot.bar()
# Visualize the missingness patterns (use a sample if the full set is slow)
# msno.matrix(Train_data.sample(250))
msno.matrix(Train_data)
msno.bar(Train_data)
# Handle erroneous values: notRepairedDamage uses '-' as a stand-in for missing
Train_data['notRepairedDamage'].value_counts()
Train_data['notRepairedDamage'].replace('-',np.nan,inplace=True)
Train_data['notRepairedDamage'].value_counts()
Test_data['notRepairedDamage'].value_counts()
Test_data['notRepairedDamage'].replace('-',np.nan,inplace=True)
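An optional follow-up (a sketch; the original stops at the replacement): with '-' mapped to NaN, the remaining entries in this column are numeric strings, so it can be cast to a numeric dtype that models can consume directly.

# Sketch: make notRepairedDamage numeric now that '-' is NaN
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].astype('float64')
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].astype('float64')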
# Severely skewed (near-constant) features can be considered for removal
Train_data['seller'].value_counts()
Train_data["offerType"].value_counts()
del Train_data["seller"] 
del Train_data["offerType"] 
del Test_data["seller"] 
del Test_data["offerType"]
# Look at the distribution of the prediction target
# Overall distribution
# Fit the target against candidate distributions: fit=st.johnsonsu, fit=st.norm, fit=st.lognorm
import scipy.stats as st
y=Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y,kde=False,fit=st.johnsonsu)
sns.distplot(y);
# Skewness: how asymmetric the distribution is
# Kurtosis: how peaked or flat the distribution is
print("skewness:%f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Train_data.skew(),Train_data.kurt()
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
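For reference, skewness and kurtosis are the standardized third and fourth central moments; pandas' .kurt() reports excess kurtosis, so a normal distribution scores 0 (pandas also applies small-sample bias corrections on top of these population formulas):

$$\mathrm{skew}(X)=\mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right],\qquad \mathrm{kurt}(X)=\mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right]-3$$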
## 3) Look at the raw frequency counts of the target
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()
# After a log transform the distribution is much more even, so predicting on the log of the target is viable; this is a common trick in prediction problems
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()
sns.distplot(np.log(y))
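If the model is trained on the log of the target, remember to invert the transform before building the submission. A minimal sketch (pred_log is a hypothetical prediction made in log space):

# Sketch: log1p/expm1 are exact inverses; log1p also handles price == 0 safely
y_log = np.log1p(Train_data['price'])
# pred = np.expm1(pred_log)   # hypothetical: invert before writing sub['price']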

This is as far as I've gotten for now; the deeper analysis, visualization, and feature engineering are still to come.

PS: A few extra points picked up from the live lecture:

  • Look at the distribution of the test data; if it is skewed, consider transforming the training data so that the two distributions match as closely as possible.
  • Keep the offline evaluation metric consistent with the online one, and understand the differences between the various metrics, i.e. MSE, RMSE, MAE, R2, accuracy, precision, recall, etc. (a toy sketch follows below).
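A toy illustration of the regression metrics mentioned above (the numbers are made up; this competition itself is scored with MAE):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_true = np.array([3000., 5000., 12000.])
y_pred = np.array([2800., 5500., 11000.])

mae = mean_absolute_error(y_true, y_pred)    # average absolute error, robust to outliers
mse = mean_squared_error(y_true, y_pred)     # squaring punishes large misses harder
rmse = np.sqrt(mse)                          # back in the target's units
r2 = r2_score(y_true, y_pred)                # 1 = perfect, 0 = no better than the mean
print(mae, rmse, r2)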

Reposted from blog.csdn.net/wenkang2261/article/details/105081477