DW-Task 2: Data exploratory analysis (EDA)

Task2: Data exploratory analysis (EDA)

  • What is EDA
  • EDA goals
  • The main work
  • Import and observe data
  • Data overview
  • Relevant statistics
  • type of data
  • Data detection
  • Missing value detection
  • Outlier detection
  • Forecast distribution
  • General distribution overview (unbounded Johnson distribution, etc.)
  • View skewness and kurtosis
  • Check the specific frequency of the predicted value
  • Characteristics
  • Category characteristics
  • unique distribution
  • Visualization
  • Box plot
  • Violin illustration
  • Bar chart
  • Visualize the feature frequency of each category
  • Digital features
  • Correlation analysis
  • Feature skewness and peak
  • Digital feature distribution visualization
  • Visualization of digital features
  • Visualization of multivariate mutual regression relationship

What is EDA

       Exploratory Data Analysis (EDA) refers to the exploration of existing data (especially the original data obtained from investigations or observations) with as few a priori assumptions as possible , through mapping, tabulation, A data analysis method that explores the structure and laws of data by means of equation fitting and calculation of feature quantities . Especially when we are faced with all kinds of messy "dirty data", we are often at a loss and don't know where to start to understand the data currently in hand, exploratory data analysis is very effective. Exploratory data analysis was proposed in the 1960s, and its method was proposed by American statistician John Tukey.

  1. Qualitative data: Descriptive nature
    a) Classification: Classification by name-blood type, city
    b) Sequence: Ordered classification-performance (ABC)
  2. Quantitative data: describe the quantity
    a) fixed distance: can add and subtract-temperature, date
    b) fixed ratio: can multiply and divide-price, weight

Check the Tianchi documentation to see the practical significance of the data.

Field Description
SaleID Transaction ID, unique code
name Automobile transaction name, desensitized
regDate Car registration date, such as 20160101, January 1, 2016
model Model code, desensitized
brand Car brand, desensitized
bodyType Body type: luxury car: 0, mini car: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, commercial vehicle: 6, mixer truck: 7
fuelType Fuel type: gasoline: 0, diesel: 1, liquefied petroleum gas: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6
gearbox Transmission: manual: 0, automatic: 1
power Engine power: range [0, 600]
kilometer The car has traveled kilometers, the unit is ten thousand kilometers
notRepairedDamage The car has unrepaired damage: Yes: 0, No: 1
regionCode Area code, desensitized
seller Seller: Individual: 0, Non-individual: 1
offerType Quotation Type: Offer: 0, Request: 1
creatDate When the car goes online, that is, when it starts to sell
price Used car transaction price (forecast target)
v series features Anonymous features, 15 anonymous features including v0-14

EDA goals


     1. Understand the data set and make preliminary analysis;
     2. Understand the relationship between the variables in the comparison data set;
     3. Data processing/feature engineering;
     4. Make a graphic summary of the data set and analyze the correlation between features and labels.

Exploratory analysis steps

  1. Form hypotheses, determine themes to explore
  2. Clean up data, deal with dirty data
  3. Evaluate the quality of the data, you can weight the data of different quality
  4. data report
  5. Explore and analyze each variable
  6. Explore the relationship between each independent variable and the dependent variable
  7. Explore the correlation between each independent variable
  8. Analyze data from different dimensions

The main work

Import and observe data

Import related libraries

import warnings
warnings.filterwarnings('ignore')
# 避免出现:代码正常运行,但提示警告

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
## 通过Pandas载入数据
Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')
# 源文件指定为' '分隔符,以防报错
Train_data.head(3).append(Train_data.tail(3))
# 查看训练集前3行和后3行的数据,结果略
Train_data.shape # 输出:(150000, 31)
Test_data.shape # 输出:(50000, 30)
# 训练集比测试集多1列:'price'	
Train_data.describe()
# 通过describe查看相关统计量,掌握数据的大概范围以及对异常值的判断
Train_data.info()
# 熟悉数据类型,其中'notRepairedDamage'是object类型,其余都是数值
Train_data.isnull().sum()
# 查看每个字段的缺失情况(汇总)

 

 

About missing values :
1. If the number of missing values ​​is small, generally choose to fill;
2. If you use tree models such as lgb, you can directly vacate, let the tree optimize by itself;
3. If there are too many nans , you can delete them.

Train_data['notRepairedDamage'].value_counts()
# 对'notRepairedDamage'的值类型进行计数
# 输出:0.0    111361
#      -       24324
#      1.0     14315

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
# 将‘ - ’替换成nan

Among them, we found that the "seller" and "offerType" features are severely skewed and generally will not help predictions, so we delete them first:

Train_data["seller"].value_counts()
Train_data["offerType"].value_counts()

View the distribution of the predicted value'price':

# 查看偏度和峰度
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
# 输出:Skewness: 3.346487
#      Kurtosis: 18.995183

å¨è¿éæå ¥ å¾çæè¿ °

Plot the probability distribution of the skewness and kurtosis of each variable in the training set:

sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtness')

# 查看预测值price的具体频数
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.title( 'price')
plt.show()

The distribution after logarithmic transformation is more uniform:

plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.title( 'log-price')
plt.show()

1. Category characteristics

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]

2. Digital features

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]

3. Feature nunique distribution

# 特征nunique分布
for cat_fea in categorical_features:
    print(cat_fea + "的特征分布如下:")
    print("{}特征有{}个不同的值".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

result:

name特征有99662个不同的值
model特征有248个不同的值
brand特征有40个不同的值
bodyType特征有8个不同的值
fuelType特征有7个不同的值
gearbox特征有2个不同的值
notRepairedDamage特征有2个不同的值
regionCode特征有7905个不同的值

View the category feature distribution of the training set:

for cat_fea in categorical_features:
    print(cat_fea + "的特征分布如下:")
    print("{}特征有个{}不同的值".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

Relevance visualization:

price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Visualization of the distribution of various digital features:

f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

Visualization of the relationship between various digital features:

sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns],size = 2 ,kind ='scatter',diag_kind='kde')
plt.show()

Generate a data preview report:

Use pandas_profiling to generate data reports

Use pandas_profiling to generate a more comprehensive visualization and data report (relatively simple and convenient) and finally open the html file

import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

 

 

 

Guess you like

Origin blog.csdn.net/adminkeys/article/details/108673400