Data Mining TASK2: Data Exploration

Exploratory Data Analysis (EDA)

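Goals of EDA
Get familiar with the dataset, validate it, and confirm that it is suitable for machine learning.
Understand how the variables relate to each other and to the prediction target.
Complete the exploratory analysis and summarize it in text or charts.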

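Outline
1. Load the data-processing and visualization libraries
pandas, numpy, scipy
matplotlib, seaborn

2. Load the data
Inspect the data with head and shape

3. Data overview
Use describe to review the summary statistics and info to review the data types

4. Check for missing and abnormal data
Detect abnormal values

5. Understand the distribution of the target
Overall distribution; check skewness and kurtosis

6. Split the features into categorical and numeric features; check the unique-value distribution of the categorical features

7. Numeric feature analysis
Correlation analysis, distribution plot for each numeric feature, plots of the relationships between numeric features

8. Categorical feature analysis
Unique-value distribution; box plots, violin plots, and bar charts of the categorical features

9. Generate a data report with pandas_profiling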

Code example
Load the libraries

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Load the data

path = 'C:/Users/lenovo/Desktop/data/'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv',sep=' ')

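2. Inspect the head, tail, and shape of the data

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
print(pd.concat([Train_data.head(), Train_data.tail()]))
print(Train_data.shape)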

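3. Data overview

overview = Train_data.describe()
print(overview)
# info() prints its summary and returns None, so there is nothing to assign
Train_data.info()

describe reports the summary statistics of each numeric column: count, mean, standard deviation (std), minimum (min), the 25%, 50% (median) and 75% quartiles, and maximum (max).
info shows the dtype and non-null count of each column, which helps reveal placeholder symbols other than NaN that mark abnormal values.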
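
As a rough illustration of how a dtype check can expose such placeholders, the sketch below uses a small made-up frame (not the competition data) in which a column that should be numeric contains the symbol '-'. The frame, the column values, and the `suspicious` dictionary are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical toy frame standing in for Train_data.
df = pd.DataFrame({
    "power": [60, 75, 110, 150, 90],
    "notRepairedDamage": ["0.0", "-", "1.0", "-", "0.0"],
})

# describe() summarizes numeric columns only by default.
summary = df.describe()
print(summary.loc[["count", "mean", "std", "50%"]])

# A column that should be numeric but is not stored as a number often hides
# a placeholder symbol. Collect the values that fail numeric conversion.
suspicious = {}
for col in df.select_dtypes(exclude="number").columns:
    converted = pd.to_numeric(df[col], errors="coerce")
    bad_values = df.loc[converted.isna() & df[col].notna(), col].unique().tolist()
    if bad_values:
        suspicious[col] = bad_values

print(suspicious)
```

Any column that shows up in `suspicious` is a candidate for the kind of placeholder cleanup done in the next step.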

4. Check for missing and abnormal data

# Count missing values per column with isnull
missing = Train_data.isnull().sum()
print(missing)
miss = missing[missing > 0].sort_values()
miss.plot.bar()
# Look for abnormal values
Train_data.info()
print(Train_data['notRepairedDamage'].value_counts())
# Assign the result back; inplace=True on a column selection is unreliable in recent pandas
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
print(Train_data['notRepairedDamage'].value_counts())
print(Train_data.isnull().sum())
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)
# Drop severely skewed features
print(Train_data['seller'].value_counts())
print(Train_data['offerType'].value_counts())
del Train_data['seller']
del Train_data['offerType']
del Test_data['seller']
del Test_data['offerType']
print(Train_data.shape)
print(Test_data.shape)

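Replace the '-' placeholder with NaN. Features whose values are severely skewed toward a single category carry almost no information and can be dropped.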
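
One possible way to make "severely skewed" concrete is to measure the share of rows taken by the most frequent value. The sketch below is only an illustration: the helper `dominant_share`, the toy data, and the 0.95 threshold are assumptions, not something taken from the original analysis.

```python
import pandas as pd


def dominant_share(series: pd.Series) -> float:
    """Return the fraction of non-missing rows taken by the most frequent value."""
    counts = series.value_counts(normalize=True, dropna=True)
    return float(counts.iloc[0]) if len(counts) else 0.0


# Hypothetical toy data: 'seller' is almost constant, 'gearbox' is not.
df = pd.DataFrame({
    "seller": [0] * 99 + [1],
    "gearbox": [0] * 60 + [1] * 40,
})

threshold = 0.95  # assumed cut-off; tune it for the dataset at hand
skewed_cols = [col for col in df.columns if dominant_share(df[col]) > threshold]
print(skewed_cols)

# Drop the near-constant columns in one call instead of several del statements.
df = df.drop(columns=skewed_cols)
```

The same list of columns should then be dropped from the test set so that both frames keep identical features.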

5. Understand the distribution of the target

print(Train_data['price'])
print(Train_data['price'].value_counts())
# Overall distribution
import scipy.stats as st
y = Train_data['price']
# Note: sns.distplot is deprecated in recent seaborn releases; it is kept here because it supports fit=
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# Skewness and kurtosis of the target price
plt.figure()
sns.distplot(Train_data['price'])
print('skewness:%f'%Train_data['price'].skew())
print('kurtosis:%f'%Train_data['price'].kurt())
# Skewness and kurtosis of all numeric features (non-numeric columns raise an error in pandas 2.x without numeric_only)
feature_skew = Train_data.skew(numeric_only=True)
feature_kurt = Train_data.kurt(numeric_only=True)
plt.figure()
sns.distplot(feature_skew,color='blue',axlabel='Skewness')
plt.figure()
sns.distplot(feature_kurt,color='orange',axlabel='Kurtosis')
# Frequency counts of the target
plt.hist(Train_data['price'], orientation='vertical', histtype='bar',color='red')
plt.show()
# Apply a log transform to the target
plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()

Log-transforming the target brings its distribution much closer to a normal distribution.
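
The effect can be checked numerically by comparing the skewness before and after the transform. The sketch below uses synthetic log-normal prices as a stand-in for the real target, and it swaps in np.log1p with np.expm1 as its inverse (instead of the plain np.log used above) so that a zero price would not produce -inf. Both the data and that swap are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; an assumption standing in for Train_data['price'].
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

raw_skew = price.skew()

# log1p(x) = log(1 + x) stays finite at x = 0.
log_price = np.log1p(price)
log_skew = log_price.skew()

print(f"skewness before: {raw_skew:.2f}, after: {log_skew:.2f}")

# Predictions made on the log scale are mapped back with the inverse transform.
restored = np.expm1(log_price)
```

A model trained on the transformed target predicts on the log scale, so its output has to go through the inverse transform before it is compared with real prices.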

6. Check the unique-value distribution of the categorical features

Y_train = Train_data['price']
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
# Unique-value distribution of each categorical feature:
for cat_fea in categorical_features:
    print('Distribution of feature ' + cat_fea + ':')
    # nunique() gives the number of distinct values; unique() would print the values themselves
    print('{} has {} distinct values'.format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

7. Numeric feature analysis

numeric_features.append('price')
# Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending=False), '\n')
f,ax = plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
del price_numeric['price']
# Skewness and kurtosis of each feature
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )
# Distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')
# Pairwise relationships between numeric features
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
# The `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(Train_data[columns],height = 2 ,kind ='scatter',diag_kind='kde')
plt.show()
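
Sorting the signed correlations in descending order, as done above, pushes strongly negative correlations to the bottom of the list, where they are easy to overlook. A small sketch of ranking by absolute correlation follows; the synthetic columns `strong`, `negative`, and `weak` are hypothetical and only serve to make the ordering visible.

```python
import numpy as np
import pandas as pd

# Synthetic features with a known relationship to a made-up 'price' column.
rng = np.random.default_rng(42)
n = 500
strong = rng.normal(size=n)

df = pd.DataFrame({
    "strong": strong,
    "negative": -strong + rng.normal(scale=0.5, size=n),
    "weak": rng.normal(size=n),
    "price": 3.0 * strong + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_price = df.corr()["price"].drop("price")

# Rank by magnitude so strong negative relationships are not buried at the bottom.
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```

The sign of each relationship is still available in `corr_with_price` once the most relevant features have been picked from `ranked`.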

8. Categorical feature analysis

# Unique values
for fea in categorical_features:
    print(Train_data[fea].unique())
# Box plots of the categorical features
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)
f=pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g=sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)
g=g.map(boxplot, "value", "price")
# Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
# Frequency (count) plots of the categorical features
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)
f=pd.melt(Train_data, value_vars=categorical_features)
g=sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)
g=g.map(count_plot, "value")

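9. Generate a data report with pandas_profiling

# Newer releases of this package are published as ydata-profiling (import ydata_profiling)
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")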

北海星
