Data mining - EDA (exploratory data analysis)

Visualizing the data gives us an intuitive feel for it, but so far I had only done this with simple scatter plots and histograms.

Data preprocessing is also a prerequisite, using some of the common statistical methods in pandas.

Here is an expanded summary of my new gains and learnings:

  • Understand the distribution of the predicted value

    • Overall distribution overview (unbounded Johnson distribution, etc.)
    • View skewness and kurtosis
    • Check the frequency of the predicted value
  • Split the features into categorical features and numeric features, and check the unique-value distribution of the categorical features

  • Numeric feature analysis

    • Correlation analysis
    • Check the skewness and kurtosis of several features
    • Visualize the distribution of each numeric feature
    • Visualize the relationships between numeric features and the multivariate pairwise regression relationships
  • Categorical feature analysis

    • unique-value distribution
    • Box plot visualization of categorical features
    • Violin plot visualization of categorical features
    • Bar chart visualization of categorical features
    • Frequency visualization of each category (count_plot)
  • Use pandas_profiling to generate a data report

Loading the data and handling missing values

dataframe.head(10)
# Check the dimensions (shape is an attribute, not a method)
dataframe.shape
# Detailed information: dtypes, non-null counts, memory usage
dataframe.info()
# Column names
dataframe.columns
# Summary statistics (note: descriptive statistics of the numeric columns)
dataframe.describe()

The data is rarely perfect; it is normal for it to contain NaN values.

# Use isnull from pandas to detect and count missing values
dataframe.isnull().sum()
# Visualize the missing counts as a bar chart
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

# When modifying an object: inplace=True modifies the original object directly without creating a new one; inplace=False leaves the original untouched and creates and returns a new object that carries the modified result.
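As a quick illustration of the difference (a minimal sketch on a throw-away DataFrame):

import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2]})

# inplace=False (the default): df is untouched, the sorted result comes back as a new object
sorted_copy = df.sort_values('a')

# inplace=True: df itself is modified and nothing useful is returned
df.sort_values('a', inplace=True)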

missingno

The missing-value visualization package missingno provides more powerful tools for inspecting NULLs and is used together with pandas.

import missingno as msno

# Take a sample of rows and visualize where the missing values sit
msno.matrix(Train_data.sample(250))

msno.bar(Train_data.sample(1000))

Missing values are not always marked explicitly as NaN (here a '-' placeholder is used), so check the value counts before replacing them.

Train_data['notRepairedDamage'].value_counts()

Missing value processing

import numpy as np

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
# Check again whether the replacement succeeded
Train_data['notRepairedDamage'].value_counts()

If a feature is severely skewed, with almost every row taking the same value, it will likely contribute nothing to our prediction.

Train_data["seller"].value_counts()

Understand the distribution of the predicted value

General distribution overview (unbounded Johnson distribution, etc.)

# Overall distribution overview (unbounded Johnson SU distribution, etc.)
# (sns.distplot is deprecated in newer seaborn; displot/histplot are its replacements)
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)

View skewness and kurtosis

# View the skewness and kurtosis of the target
sns.distplot(Train_data['price'])
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())

# Skewness and kurtosis of every column (on newer pandas, pass numeric_only=True if non-numeric columns are present)
Train_data.skew(), Train_data.kurt()

Visualize with seaborn

sns.distplot(Train_data.skew(), color='blue', axlabel='Skewness')
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')


Check the frequency of the predicted value

plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
# The distribution after a log transform is much more even, so the target can be
# log-transformed before prediction; this is a common trick for regression problems
plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()
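In practice the transform is usually done with np.log1p so that zero values cannot break the logarithm, and predictions are mapped back with np.expm1. A minimal sketch (the model object is hypothetical):

import numpy as np

# Train against the log-transformed target...
y_log = np.log1p(Train_data['price'])

# ...then map predictions back to the original price scale:
# preds = np.expm1(model.predict(X_test))  # 'model' is a hypothetical fitted regressor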

Split the features into categorical features and numeric features, and check the unique-value distribution of the categorical features

Pandas nunique() returns the number of distinct values. Its dropna parameter defaults to True, so NULL values are excluded when counting unique values.
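A quick illustration of that dropna behaviour on a toy Series:

import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, np.nan])
print(s.nunique())              # 2: NaN is excluded by default (dropna=True)
print(s.nunique(dropna=False))  # 3: NaN is counted as its own value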

# nunique distribution of each categorical feature (training set)
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
# nunique distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())

Numeric feature analysis

Correlation analysis

corr() returns the correlation-coefficient matrix, which makes it a natural fit for heat-map visualization.

price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()

print(correlation['price'].sort_values(ascending=False), '\n')

f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)

Check the skewness and kurtosis of several features

for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt()))

Visualize the distribution of each numeric feature

f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

Visualize the relationships between numeric features

sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# 'size' was renamed to 'height' in newer versions of seaborn
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()

Categorical feature analysis

Box plot

A box plot is built from five numbers: the minimum (min), the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum (max); the mean can also be added. The lower quartile, median, and upper quartile form the "box", and the lines extending from the box out toward the extreme values are called the "whiskers".

Real data always contains some "dirty" points, which show up as outliers. So that a few outliers do not distort the overall picture, they are drawn separately as individual points, and the two ends of the whiskers are set to the minimum and maximum non-outlier observations instead.
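A minimal sketch of those numbers for the price column, using the common 1.5 * IQR rule for the whisker limits (this rule is my assumption, not spelled out in the original post, though it is also seaborn's default):

q1, median, q3 = Train_data['price'].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr  # points below this are drawn as individual outliers
upper_whisker = q3 + 1.5 * iqr  # points above this are drawn as individual outliers
print(q1, median, q3, lower_whisker, upper_whisker)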

# Because the name and regionCode categories are too sparse, only the
# non-sparse categorical features are plotted here
categorical_features = ['model',
                        'brand',
                        'bodyType',
                        'fuelType',
                        'gearbox',
                        'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# 'size' was renamed to 'height' in newer versions of seaborn
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")

Violin plot visualization of categorical features

The violin plot shows the distribution and probability density of several groups of data. It combines the features of a box plot and a density plot and is mainly used to show the shape of a distribution. It is similar to a box plot but displays the density level better, and it is especially useful when there is too much data to show point by point.

catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

Bar chart visualization of categorical features

def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")

Frequency visualization of each category of the categorical features (count_plot)

def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")

Use pandas_profiling to generate data reports

import pandas_profiling

pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
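On newer environments pandas_profiling has been renamed; a sketch assuming the successor package ydata-profiling is installed:

from ydata_profiling import ProfileReport  # successor to pandas_profiling

pfr = ProfileReport(Train_data)
pfr.to_file("./example.html")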


Origin blog.csdn.net/qq_45175218/article/details/105079559