A First Look at Machine Learning, Hand in Hand (1): Exploratory Data Analysis (EDA)

Foreword

  • It has been a long time since I updated this blog, which also means it has been a long time since I sat down to learn something new. I spent most of my second year of graduate school on competitions, and while I gained some results, in the later stages I came to feel how shallow my knowledge really was. Before starting work, I hope to learn some genuinely useful things here: ROS2, embedded Linux development, and machine learning. Even if I only reach a basic understanding, the learning process is what matters, and I will keep updating GitHub and this blog along the way.

Main references for article ideas:
[1] Kaggle classic project - housing price forecast
[2] Exploratory data analysis EDA (1) - variable identification and analysis
[3] School mathematical modeling training courses

Looking back at the overall machine learning workflow we learned earlier: data processing, feature selection, model selection, model training, model testing, and model prediction. EDA never seemed to receive much emphasis there, probably because almost all the datasets used at the learning stage had already been cleaned. For most real-world data, it is hard to see trends and structure directly, so EDA (Exploratory Data Analysis) should be the first step of any practical data analysis problem. Does it come before data processing, or is it itself the first stage of data processing? In my understanding, the role of EDA is to understand the data clearly before starting to process it. Its main functions are:

  • An overall summary of the data
  • The distribution of the data
  • The correlations within the data
  • Visual analysis

Here I use the classic Kaggle house price prediction dataset for study and analysis. The dataset and code can be downloaded from my GitHub: GITHUB .

1. Library import

# Import the required libraries
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

import missingno as mnso

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

2. Data import

#train = pd.read_csv('dataset\\train.csv')  # load the data with pandas (Windows-style path)
train = pd.read_csv('./dataset/train.csv')  # two equivalent ways of writing the file path
test = pd.read_csv('./dataset/test.csv')

3. Data scale, type and distribution

3.1 Data scale view

print('Shape of the training set:', train.shape)
print("----------------------------------")
print('Shape of the test set:', test.shape)
test.head()  # show the first five rows

So in this dataset, the training set has 1460 records and 81 features, while the test set has 1459 records and 80 features (the training set contains one extra column, the label).

3.2 View data type

train.info()
#test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Taking the training set as an example, the column data types include int64, float64, and object.
Note also that any column whose non-null count is less than 1460 contains missing values.

3.3 View missing null values

3.3.1 Numerical statistics of missing null values

isnull_array = train.isnull()  # a boolean DataFrame of the same shape as train
train.isnull().sum(axis=0)     # count of missing values per column

Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0

BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
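
As a numeric companion to the per-column counts above, the missing count and missing ratio can be tabulated together and sorted worst-first. A minimal sketch on a toy DataFrame (the column names mirror a few real features, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; values are made up for illustration
df = pd.DataFrame({
    'PoolQC':      [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 60.0],
    'SalePrice':   [208500, 181500, 223500, 140000],
})

# Missing count and ratio per column, worst first
missing = df.isnull().sum().sort_values(ascending=False)
report = pd.DataFrame({'count': missing, 'ratio': (missing / len(df)).round(2)})
print(report)
```

Running the same two lines on the real `train` frame ranks PoolQC, MiscFeature, Alley, and Fence at the top, matching the counts listed above.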

3.3.2 Visualization of missing null values

Visualizing the missing values shows the gaps in the data more intuitively. This can be done with a missing-value matrix plot and a missing-value bar chart.

# Missing-value matrix plot; white marks missing values
# requires: pip install missingno
mnso.matrix(train)
plt.show()


# Missing-value bar chart: non-null record count per column
mnso.bar(train)
plt.show()


3.4 View the scale of numerical data

train.describe().T  # summary statistics of the numeric columns: count, mean, std, min/max, quartiles, median


4. Analysis of research objectives

Generally speaking, the research target is our label data, which usually falls into two kinds: numeric and categorical.

  • A single numeric variable can be visualized intuitively with histograms , box plots , violin plots , etc.
  • Multiple numeric variables can be characterized with sns.pairplot(data) and correlation coefficients to show the relationships between them ( linear or nonlinear , whether an obvious correlation exists )
  • Categorical data can be counted with value_counts and visualized with sns.countplot(data['col'])
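
For the categorical case in the last bullet, a minimal sketch of the value_counts route, using a hypothetical column standing in for something like train['MSZoning']:

```python
import pandas as pd

# Hypothetical categorical column; the real one would come from the dataset
zoning = pd.Series(['RL', 'RM', 'RL', 'FV', 'RL', 'RM'])

counts = zoning.value_counts()  # tally of each category, largest first
print(counts)
# sns.countplot(x=zoning) would draw these same tallies as a bar chart
```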

The research target of this dataset, the house price, is a single numeric variable, so we analyze it with a frequency histogram:

4.1 Distribution of research targets

# Plot the distribution of the target value
sns.distplot(train['SalePrice'])  # deprecated in newer seaborn; sns.histplot(..., kde=True) is the modern equivalent
train['SalePrice'].describe()

count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
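
The mean sitting well above the median and the large max hint at a long right tail. A common follow-up in the referenced Kaggle write-ups is to measure the skewness and compress the tail with a log transform; a sketch on toy prices echoing the min/quartile/max values above:

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices standing in for train['SalePrice']
prices = pd.Series([34900, 129975, 163000, 180000, 214000, 755000], dtype=float)

print('skew before log:', prices.skew())
log_prices = np.log1p(prices)  # log(1 + x) compresses the long right tail
print('skew after log :', log_prices.skew())
```

A near-zero skew after the transform is why many solutions model log(SalePrice) rather than the raw price.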

4.2 Relationship between research objectives and characteristics

Next, we need to analyze the pairwise relationships between the feature data and the research target. The features themselves split into numeric and categorical data; we start by analyzing the numeric features against the target:

4.2.1 Delete the ID column

# Drop the Id column
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
print('The shape of training data:', train.shape)
print('The shape of testing data:', test.shape)

4.2.2 Separation of numerical and categorical data in feature data

# Separate numeric and categorical features
num_features = []
cate_features = []
for col in test.columns:  # use the test set so we need not exclude the training label column
    if test[col].dtype == 'object':
        cate_features.append(col)
    else:
        num_features.append(col)
print('number of numeric features:', len(num_features))
print('number of categorical features:', len(cate_features))

(Output: 36 numeric features and 43 categorical features.)

4.2.3 View the relationship between numerical data and research objectives

# Relationship between each numeric feature and the target value
plt.figure(figsize=(16, 20))
plt.subplots_adjust(hspace=0.3, wspace=0.3)
for i, feature in enumerate(num_features):
    plt.subplot(9, 4, i+1)
    sns.scatterplot(x=feature, y='SalePrice', data=train, alpha=0.5)
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
plt.show()

From these scatter plots we can roughly see whether each numeric feature has an obvious linear relationship with the research target.
We also observe that features such as YrSold and FullBath show no obvious relationship with the target; a box plot allows a closer look.
Take YrSold as an example:

plt.figure(figsize=(16, 10))
sns.boxplot(x='YrSold', y='SalePrice', data=train)
plt.xlabel('YrSold', fontsize=14)
plt.ylabel('SalePrice', fontsize=14)
plt.xticks(rotation=90, fontsize=12)

Sure enough, YrSold has little effect on the price, so this feature can be dropped later.

4.2.4 View the relationship between categorical data and the research target

# Effect of each categorical feature on the target value
plt.figure(figsize=(30, 40))
plt.subplots_adjust(hspace=0.3, wspace=0.3)
for i, feature in enumerate(cate_features):
    plt.subplot(11, 4, i+1)
    sns.boxplot(x=feature, y='SalePrice', data=train)
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
plt.show()

These box plots give a rough picture of how each categorical feature affects the research target. If a feature shows no obvious effect, as YrSold did, it can be considered for removal during feature screening.

4.3 Study the relationships between the (numeric) variables

Use a correlation coefficient heat map to characterize the correlations among the numeric variables in the data:

corrs = train.corr(numeric_only=True)  # numeric_only restricts the computation to numeric columns (required in newer pandas)
plt.figure(figsize=(16, 16))
sns.heatmap(corrs)

Further, we can analyze the ten numeric variables most correlated with the target value:

# The ten variables most correlated with the target value
cols_10 = corrs.nlargest(11, 'SalePrice')['SalePrice'].index  # 11 because SalePrice itself ranks first
corrs_10 = train[cols_10].corr()
plt.figure(figsize=(6, 6))
sns.heatmap(corrs_10, annot=True)
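
The same ranking can also be read off numerically by sorting the target's correlation column. A self-contained sketch on synthetic data (column names borrowed from the dataset; the values are generated so that GrLivArea and OverallQual drive the price while YrSold is pure noise, mimicking what the heat map shows):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train`: price driven by area and quality, YrSold is noise
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'GrLivArea':   rng.normal(1500, 400, n),
    'OverallQual': rng.integers(1, 10, n).astype(float),
    'YrSold':      rng.integers(2006, 2011, n).astype(float),
})
df['SalePrice'] = 100 * df['GrLivArea'] + 20000 * df['OverallQual'] \
                  + rng.normal(0, 20000, n)

corrs = df.corr(numeric_only=True)
top = corrs['SalePrice'].abs().sort_values(ascending=False)  # target first, then strongest features
print(top)
```

On the real data, `corrs['SalePrice'].sort_values(ascending=False).head(11)` gives exactly the columns selected by `nlargest(11, 'SalePrice')` above.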

Seaborn is in fact a higher-level API built on top of matplotlib that makes plotting easier. Its pairplot mainly shows the relationships between variables ( linear or nonlinear , whether an obvious correlation exists ):

sns.pairplot(train[cols_10])  # diagonal: histogram (distribution) of each attribute; off-diagonal: scatter plots between attribute pairs

After visualizing the overall correlations, outliers can be excluded preliminarily. For example:
[1] In the Kaggle classic house price prediction project , the author found outliers in 'TotalBsmtSF' and 'GrLivArea' by inspecting the overall plots and removed them manually.
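
A sketch of this kind of manual exclusion, on a toy frame whose last record mimics the well-known large-area, low-price outliers; the thresholds below are illustrative, not the original author's exact cutoffs:

```python
import pandas as pd

# Toy frame: the last record mimics a large-area / low-price outlier
df = pd.DataFrame({
    'GrLivArea': [1500, 1800, 2200, 4700],
    'SalePrice': [180000, 210000, 260000, 160000],
})

# Illustrative thresholds: very large living area at a suspiciously low price
mask = (df['GrLivArea'] > 4000) & (df['SalePrice'] < 300000)
df_clean = df.drop(df[mask].index)
print(df_clean.shape)
```

The same `drop` pattern applies to the real training set once the scatter plots reveal which points to exclude.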


5. Summary

Exploratory data analysis (EDA) helps us further understand the scale, trends, and rough correlations of the data, which paves the way for the next steps of data cleaning and feature selection.
For the house price problem in this article, the EDA process clarified:

  • The data scale and distribution
  • The missing data
  • The data correlations
  • The data anomalies

Based on this analysis, the next steps to focus on are:

  • Identifying and removing outliers
  • Handling missing values
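
As a preview of the missing-value step, one common convention (not the only valid choice) is to fill categorical gaps with an explicit 'None' level, since a missing PoolQC usually means "no pool", and numeric gaps with the median. A minimal sketch on toy columns:

```python
import numpy as np
import pandas as pd

# Toy columns standing in for the real ones
df = pd.DataFrame({
    'PoolQC':      [np.nan, 'Gd', np.nan],   # categorical: NaN usually means "no pool"
    'LotFrontage': [65.0, np.nan, 80.0],     # numeric: impute with the median
})

df['PoolQC'] = df['PoolQC'].fillna('None')                      # explicit "absent" level
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
print(df)
```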

Origin blog.csdn.net/ONERYJHHH/article/details/125466645