Used Car Transaction Price Prediction: Exploratory Data Analysis (EDA)

This article is a record of my participation in a Tianchi competition. The series consists of four parts: exploratory data analysis (EDA), feature engineering, modeling and parameter tuning, and model fusion. This article is the first part.

Competition link: https://tianchi.aliyun.com/competition/entrance/231784/information

Tutorial link: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.12.1cd8593aw4bbL5&postId=95457

1. Competition Data

The competition task is to predict the transaction price of used cars. The dataset is visible and downloadable after registration and comes from a used-car trading platform, with more than 400,000 transaction records and 31 variables, 15 of which are anonymous. To ensure fairness, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B; fields such as name, model, brand, and regionCode are desensitized.

The table fields are as follows:

Field              Description
SaleID             Transaction ID, a unique encoding
name               Car trade name, desensitized
regDate            Car registration date, e.g. 20160101 means January 1, 2016
model              Model code, desensitized
brand              Car brand, desensitized
bodyType           Body type: luxury sedan: 0, mini car: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, commercial vehicle: 6, mixer truck: 7
fuelType           Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6
gearbox            Transmission: manual: 0, automatic: 1
power              Engine power, range [0, 600]
kilometer          Kilometers driven, in units of 10,000 km
notRepairedDamage  Car has unrepaired damage: yes: 0, no: 1
regionCode         Area code, desensitized
seller             Seller: individual: 0, non-individual: 1
offerType          Offer type: offer: 0, request: 1
creatDate          Time the car went online, i.e. when it began to be sold
price              Used car transaction price (the prediction target)
v_0 to v_14        15 anonymous features

2. Evaluation Metric

The evaluation metric is MAE (Mean Absolute Error):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the true price of the $i$-th car and $\hat{y}_i$ is the predicted price. The smaller the MAE, the more accurate the model's predictions.
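
As a quick illustration (a minimal sketch using scikit-learn's mean_absolute_error, which is my addition rather than part of the competition code):

# Sketch: computing MAE with scikit-learn.
# The prices below are made-up illustrative values, not competition data.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1250, 3600, 800])   # hypothetical true prices
y_pred = np.array([1100, 3900, 750])   # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # (150 + 300 + 50) / 3 = 166.67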

 

3. EDA: Exploratory Data Analysis

3.1 EDA Objectives

Exploratory data analysis should be the first step of a data-mining project. Because data mining typically involves large amounts of data, hidden relationships are hard to discover by direct observation alone. With EDA we can visualize the relationships between the training features and the target field and detect anomalies in the data, which lays important groundwork for the subsequent data preprocessing and feature engineering.

 

3.2 Code Examples

My system environment is Windows 10, and I recommend Anaconda as the development environment: you can create a separate test environment and add the various dependencies directly in the GUI, which is very convenient. For packages not found in the list, activate the virtual environment in cmd and install them with pip; for example, the missingno package required for this experiment must be installed this way (pip install missingno).

3.2.1 Loading the Data

## Step 1: import the data-science and visualization libraries

#coding:utf-8
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

## Step 2: load the data
# read the data with pandas
Train_data = pd.read_csv('data/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('data/used_car_testA_20200313.csv', sep=' ')

## Step 3: briefly inspect the first and last 5 rows
Train_data.head().append(Train_data.tail())
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0 ... 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0 ... 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0 ... 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0 ... 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5 ... 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

10 rows × 31 columns

## Step 4: check the data volume and feature count with shape
Train_data.shape
(150000, 31)    # the training set has 150,000 records and 31 columns
## Step 5: inspect the test set the same way
Test_data.head().append(Test_data.tail())
Test_data.shape

head() and shape give an intuitive view of the basic state of the data; I recommend running them yourself.

3.2.2 Data Overview

There are two ways to get an overview of the data, describe() and info():

  1. describe() shows statistics for each column: the total count, mean, standard deviation std, minimum min, the 25%/50%/75% quantiles, and maximum max. This information is mainly used to get an immediate grasp of the value range of each column and to judge abnormal values; note that values such as 999, 9999, or -1 are sometimes just another way of encoding NaN, which should be kept in mind (see the sketch after this list).
  2. info() shows the data type of each column and, in addition, helps reveal whether there are NaN values or special symbols.
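
If describe() does reveal such sentinel values, a minimal sketch of the usual fix (the sentinel codes here are hypothetical, not observations from this dataset):

# Hypothetical sketch: map sentinel codes that actually mean "missing" to np.nan.
# In practice, restrict the replacement to the columns where a sentinel applies.
sentinels = [-1, 999, 9999]                      # assumed, dataset-specific codes
cleaned = Train_data.replace(sentinels, np.nan)  # pandas treats np.nan as missing
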
## Step 1: use describe() to view basic statistics of the training set
Train_data.describe()
  SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 149999.000000 150000.000000 145494.000000 141320.000000 144019.000000 150000.000000 150000.000000 ... 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000
mean 74999.500000 68349.172873 2.003417e+07 47.129021 8.052733 1.792369 0.375842 0.224943 119.316547 12.597160 ... 0.248204 0.044923 0.124692 0.058144 0.061996 -0.001000 0.009035 0.004813 0.000313 -0.000688
std 43301.414527 61103.875095 5.364988e+04 49.536040 7.864956 1.760640 0.548677 0.417546 177.168419 3.919576 ... 0.045804 0.051743 0.201410 0.029186 0.035692 3.772386 3.286071 2.517478 1.288988 1.038685
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.168192 -5.558207 -9.639552 -4.153899 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243615 0.000038 0.062474 0.035334 0.033930 -3.722303 -1.951543 -1.871846 -1.057789 -0.437034
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 ... 0.257798 0.000812 0.095866 0.057014 0.058484 1.624076 -0.358053 -0.130753 -0.036245 0.141246
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265297 0.102009 0.125243 0.079382 0.087491 2.844357 1.255022 1.776933 0.942813 0.680378
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 19312.000000 15.000000 ... 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418

8 rows × 30 columns

## Step 2: use info() to check the data type of each column
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
## Step 3: inspect the test set the same way
Test_data.describe()
Test_data.info()

3.2.3 Missing Values and Anomalies

 

## Step 1: count missing values per column
Train_data.isnull().sum()
Test_data.isnull().sum()

## only the training-set output is shown below
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

You can see that the bodyType, fuelType, and gearbox columns have a large number of missing values, and model has a single missing record.

## Step 2: visualize missing-value counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

This gives a direct view of which columns contain NaN and prints how many there are. The main question is whether the number of NaNs is really large: if it is small, we generally fill the gaps; if we use a tree model such as LightGBM, we can leave the gaps and let the tree optimize around them itself; but if there are too many NaNs, consider dropping the column. A sketch of these strategies follows.
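
A minimal sketch of the fill-or-drop strategies above (the 50% threshold and the mode fill are my illustrative assumptions, not tutorial code; the copy avoids mutating Train_data):

# Illustrative sketch: fill sparsely missing columns, drop heavily missing ones.
df = Train_data.copy()
missing_ratio = df.isnull().mean()                # fraction missing per column
for col in missing_ratio[missing_ratio > 0].index:
    if missing_ratio[col] > 0.5:                  # mostly missing: drop the column
        df.drop(columns=col, inplace=True)
    else:                                         # sparsely missing: fill with the mode
        df[col] = df[col].fillna(df[col].mode()[0])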

## Step 2 (continued): missingness matrix on a sample
msno.matrix(Train_data.sample(250))

 

## Step 2 (continued): missingness bar chart on a sample
msno.bar(Train_data.sample(1000))

The same approach can be used to inspect the test set. Four columns contain missing values, with fuelType missing the most.

## Step 3: detect anomalous values
Train_data.info()
# (output omitted: identical to the info() listing in section 3.2.2 above)

Only notRepairedDamage is of object type; all other columns are numeric. value_counts() shows its distinct values and their counts.

## Step 3 (continued): inspect the object column
Train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

From the field table at the beginning, notRepairedDamage means "the car has unrepaired damage: yes: 0, no: 1", so "-" is also a missing value. Since many models handle NaN directly, we do no further processing here and simply replace "-" with NaN.

## Step 4: handle the anomalous value
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

Only the two expected value types remain, which matches the data specification.

## Step 4 (continued): recheck missing values
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64

There are now five columns with missing values, with notRepairedDamage missing the most.

## Step 4 (continued): apply the same treatment to the test set
Test_data['notRepairedDamage'].value_counts()
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
  • The seller and offerType columns show severe data skew, which value_counts() in Step 5 below reveals; they can be deleted.
## Step 5: drop the severely skewed columns
Train_data["seller"].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    150000
Name: offerType, dtype: int64

del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]

3.2.4 Distribution of the Target Variable

## Step 1: inspect the distribution of the target
Train_data['price']
Train_data['price'].value_counts()  # the counts reveal a large number of rows with price == 1

# compare candidate best-fit distributions for the target field by plotting
import scipy.stats as st
y = Train_data['price']

plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)

plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)

plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)

…………
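
Since price is heavily right-skewed, a common follow-up (my own sketch, not part of the truncated tutorial output above) is to inspect the target after a log transform, which usually brings it much closer to normal:

# Sketch: distribution of the log-transformed target (uses the imports above).
y_log = np.log1p(Train_data['price'])   # log(1 + price) handles price == 0 safely
plt.figure(); plt.title('log1p(price)')
sns.distplot(y_log, kde=False, fit=st.norm)
plt.show()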

3.2.5 Feature Analysis

# separate the label, i.e. the prediction target
Y_train = Train_data['price']

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]

# distinct-value (nunique) counts of each categorical feature in the training set
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

# distinct-value (nunique) counts of each categorical feature in the test set
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())

3.2.6 Numeric Feature Analysis

numeric_features.append('price')

## 1) correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 
f , ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)

del price_numeric['price']

## 2) skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )
## 3) visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

 

## 4) visualize pairwise relationships among the numeric features
sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns], size=2, kind='scatter', diag_kind='kde')
plt.show()
## 5) visualize the regression of price on each selected feature
features = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
for feature, ax in zip(features, axes.flatten()):
    # pair the target with one feature and draw a scatter plot with a fitted line
    scatter_data = pd.concat([Y_train, Train_data[feature]], axis=1)
    sns.regplot(x=feature, y='price', data=scatter_data, scatter=True, fit_reg=True, ax=ax)

3.2.7 Categorical Feature Analysis

This covers visualizations of the unique-value distribution, box plots, violin plots, bar charts, and category frequencies; a sketch of the plot types follows.
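
A minimal sketch of two of these plots (my own illustration of the listed plot types; the choice of brand as the example column is an assumption, not the tutorial's exact code):

# Sketch: distribution of price within each brand, as a box plot and a violin plot.
plt.figure(figsize=(16, 6))
sns.boxplot(x='brand', y='price', data=Train_data)
plt.show()

plt.figure(figsize=(16, 6))
sns.violinplot(x='brand', y='price', data=Train_data)
plt.show()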

3.2.8 Generating a Data Report with pandas_profiling

import pandas_profiling
# build a full profiling report for the training set and export it as a standalone HTML file
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

4. Summary

Following this tutorial I learned many data-visualization methods, much of which helps greatly with data preprocessing and feature selection: using describe() and info() for a data overview; isnull() to check missing values and value_counts() to check data skew; and the idea of splitting the features into numeric and categorical groups.

The course involves many data-visualization methods, but in my view not all of them are needed when actually training a model; they serve as auxiliary means of verification and of discovering special cases. When doing EDA you must first understand the fields, because some fields only become meaningful after post-processing: for example, creatDate (when the car went on sale) and regDate (when the car was registered) only reflect the effect of usage time on price once you take the difference between the two dates, as sketched below. Likewise, when using visualization methods you need to understand what the chart means in order to filter the data appropriately.
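
A minimal sketch of that usage-time idea (my own illustration, not tutorial code; errors='coerce' guards against malformed dates, an assumption about this dataset):

# Sketch: usage time in days = on-sale date minus registration date.
used_time = (pd.to_datetime(Train_data['creatDate'].astype(str), format='%Y%m%d', errors='coerce')
             - pd.to_datetime(Train_data['regDate'].astype(str), format='%Y%m%d', errors='coerce')).dt.days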

Following this study, my approach for the competition is to use the EDA methods above to clean the data (when handling special values, refer back to the field table: for example, engine power is defined with range [0, 600], so values outside it are anomalies), then classify, merge, and delete features (merging must also respect the business scenario), and afterwards proceed to feature engineering, model selection, model tuning, and model fusion.

 
