Data Mining Task 2: Data Exploration

Exploratory Data Analysis (EDA)

Goals of EDA
Become familiar with the data set, validate it, and make sure it can be used for machine learning.
Understand the relationships between the variables, and between the variables and the target value to be predicted.
Summarize the findings of the data exploration in text or charts.

Outline of the steps
1. Load the data processing and visualization libraries:
pandas, numpy, scipy
matplotlib, seaborn

2. Load the data:
observe the data with head and shape

3. Overview of the data:
become familiar with the data through summary statistics (describe) and the data types (info)

4. Determine missing and abnormal data:
detect missing values and outliers

5. Understand the distribution of the target value:
overall distribution profile, skewness and kurtosis

6. Split the features into categorical and numeric features, and view the unique value distribution of the categorical features

7. Numeric feature analysis:
correlation analysis, visualization of the distribution of each numeric feature, visualization of the relationships between numeric features

8. Categorical feature analysis:
unique value distribution, plus box plot, violin plot, and bar chart visualizations by category

9. Generate a data report with pandas_profiling

Sample code
Load the relevant libraries

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Load the data

path = 'C:/Users/lenovo/Desktop/data/'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv',sep=' ')

2. Observe the head and tail of the data and its dimensions

print(pd.concat([Train_data.head(), Train_data.tail()]))  # head and tail together; DataFrame.append was removed in pandas 2.0
print(Train_data.shape)

3. Overview of the data

overview = Train_data.describe()
Train_data.info()  # prints column dtypes and non-null counts (returns None)

describe gives summary statistics for each column: count, mean, std (standard deviation), min, the 25%/50%/75% quantiles, and max.
info shows the data type of each column and helps spot special symbols used for missing values other than nan.
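
As a small sketch (my own addition, not from the original post), listing the object-typed columns and previewing their distinct values is one way to spot such placeholder symbols:

object_cols = Train_data.dtypes[Train_data.dtypes == 'object'].index.tolist()
for col in object_cols:
    print(col, Train_data[col].unique()[:10])  # preview up to 10 distinct values per object column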

4. Determine missing and abnormal data

# Count missing values with isnull
missing = Train_data.isnull().sum()
print(Train_data.isnull().sum())
miss = missing[missing > 0]
miss.sort_values(inplace=True)
miss.plot.bar()
# Check for abnormal values
print(Train_data.info())
print(Train_data['notRepairedDamage'].value_counts())
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
print(Train_data['notRepairedDamage'].value_counts())
Train_data.isnull().sum()
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
# Drop severely skewed features
print(Train_data['seller'].value_counts())
print(Train_data['offerType'].value_counts())
del Train_data['seller']
del Train_data['offerType']
del Test_data['seller']
del Test_data['offerType']
print(Train_data.shape)
print(Test_data.shape)

The '-' placeholder is replaced with nan. Features whose values are heavily skewed toward a single category (such as seller and offerType here) carry little information and can be deleted.
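
As a minimal sketch (my own addition, not part of the original code), near-constant columns like these can be flagged automatically before deciding whether to drop them:

def near_constant_columns(df, threshold=0.99):
    # return columns whose most frequent value covers at least `threshold` of the rows
    flagged = []
    for col in df.columns:
        top_ratio = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_ratio >= threshold:
            flagged.append((col, round(top_ratio, 4)))
    return flagged

print(near_constant_columns(Train_data))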

5. Understand the distribution of the target value

print(Train_data['price'])
print(Train_data['price'].value_counts())
# Overall distribution profile
import scipy.stats as st
y = Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# Skewness and kurtosis of the target (price) distribution
sns.distplot(Train_data['price'])
print('skewness:%f'%Train_data['price'].skew())
print('kurtosis:%f'%Train_data['price'].kurt())
# Skewness and kurtosis of all feature distributions
Train_data.skew()
Train_data.kurt()
sns.distplot(Train_data.skew(),color='blue',axlabel='Skewness')
sns.distplot(Train_data.kurt(),color='orange',axlabel='Kurtosis')
# Frequency histogram of the target value
plt.hist(Train_data['price'], orientation='vertical', histtype='bar',color='red')
plt.show()
# Log-transform the target value
plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()

Taking the logarithm of the target value brings its distribution much closer to a normal distribution.
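
A minimal sketch (my own addition): comparing the skewness before and after the transform confirms the effect; np.log1p is used here as a safe variant in case a price of 0 ever appears.

price = Train_data['price']
print('skewness before log: {:.2f}'.format(price.skew()))
print('skewness after  log: {:.2f}'.format(np.log1p(price).skew()))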

6. Split features into numeric and categorical, and view the unique value distribution of categorical features

Y_train = Train_data['price']
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
# Unique value distribution of each categorical feature:
for cat_fea in categorical_features:
    print('Distribution of the {} feature:'.format(cat_fea))
    print('{} has {} distinct values'.format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

7. Numeric feature analysis

numeric_features.append('price')
# Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending=False), '\n')
f,ax = plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
del price_numeric['price']
# Skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )
# Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')
# Pairwise visualization of numeric features
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')  # 'size' was renamed to 'height' in newer seaborn
plt.show()

8. Categorical feature analysis

# Unique value distribution
for fea in categorical_features:
    print(Train_data[fea].unique())
# Box plot visualization of categorical features
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)
f=pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g=sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)
g=g.map(boxplot, "value", "price")
# Violin plot visualization of categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
# Count plot visualization of categorical features
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)
f=pd.melt(Train_data, value_vars=categorical_features)
g=sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)
g=g.map(count_plot, "value")

9. Generate a visual data report with pandas_profiling

import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
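
Note: in newer releases the pandas_profiling package has been renamed to ydata-profiling, so the import and class names may differ depending on the installed version, and generating the full report can be slow on large data sets.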
