Task 2 data analysis
Tip: This is Task 2, the exploratory data analysis (EDA) part of the zero-based introduction to data mining. It walks you through understanding the data, getting familiar with it, and "making friends" with it. Follow-up discussion is welcome.
Competition task: multi-class prediction of ECG heartbeat signals
2.1 EDA goals
- The value of EDA lies mainly in becoming familiar with the data set, understanding it, and verifying that it can be used for subsequent machine learning or deep learning.
- After understanding the data set, the next step is to understand the relationships among variables and between the variables and the target.
- EDA guides the steps of data processing and feature engineering, so that the structure and feature set of the data make the subsequent prediction problem more reliable.
- Complete the exploratory analysis and record it ("check in") with charts or a short written summary.
2.2 Content introduction
- Load various data science and visualization libraries:
- Data science libraries pandas, numpy, scipy;
- Visualization libraries matplotlib, seaborn;
- Load data:
- Load training set and test set;
- Observe the data briefly (head()+shape);
- Data overview:
- Use describe() to familiarize yourself with the relevant statistics of the data
- Get familiar with data types through info()
- Determine missing and abnormal data
- View the existence of nan in each column
- Outlier detection
- Understand the distribution of predicted values
- Overall distribution overview
- View skewness and kurtosis
- Check the specific frequency of the predicted value
2.3 Code example
2.3.1 Load various data science and visualization libraries
#coding:utf-8
# Import the warnings package and use a filter to ignore warning messages.
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
2.3.2 Load training set and test set
Import the training set train.csv
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Train_data = pd.read_csv('./train.csv')
Import test set testA.csv
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Test_data = pd.read_csv('./testA.csv')
All features are anonymized (desensitized) for public release:
- id - unique identifier of each heartbeat signal record
- heartbeat_signals - the heartbeat signal sequence
- label - heartbeat signal category (0, 1, 2, 3)
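Because heartbeat_signals is stored as one comma-separated string per row, a common first step before any numeric analysis is to expand it into one float column per sampling point. A minimal sketch on a toy frame mimicking the layout above (the values here are illustrative, not from the real data):

```python
import pandas as pd

# Toy frame mimicking the competition's layout
df = pd.DataFrame({
    'id': [0, 1],
    'heartbeat_signals': ['0.99,0.94,0.76', '0.97,0.93,0.57'],
    'label': [0.0, 0.0],
})

# Split each signal string on commas, expand to one column per sample point,
# and convert the resulting strings to floats
signals = df['heartbeat_signals'].str.split(',', expand=True).astype(float)
print(signals.shape)   # one row per record, one column per sampling point
```

On the real training set this yields a (100000, 205) numeric matrix suitable for modeling, assuming every row has the same number of sample points.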
pd.concat([data.head(), data.tail()])
——Observe the first and last rows (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
data.shape
——Observe the rows and columns of the data set
Observe the first and last data of train
pd.concat([Train_data.head(), Train_data.tail()])
          id                                  heartbeat_signals  label
0          0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
1          1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
3          3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
4          4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...    0.0
99996  99996  0.9268571578157265,0.9063471198026871,0.636993...    2.0
99997  99997  0.9258351628306013,0.5873839035878395,0.633226...    3.0
99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...    2.0
99999  99999  0.9259994004527861,0.916476635326053,0.4042900...    0.0
Observe the row and column information of the train data set
Train_data.shape
(100000, 3)
Observe the beginning and end data of testA
pd.concat([Test_data.head(), Test_data.tail()])
id heartbeat_signals
0 100000 0.9915713654170097,1.0,0.6318163407681274,0.13...
1 100001 0.6075533139615096,0.5417083883163654,0.340694...
2 100002 0.9752726292239277,0.6710965234906665,0.686758...
3 100003 0.9956348033996116,0.9170249621481004,0.521096...
4 100004 1.0,0.8879490481178918,0.745564725322326,0.531...
19995 119995 1.0,0.8330283177934747,0.6340472606311671,0.63...
19996 119996 1.0,0.8259705825857048,0.4521053488322387,0.08...
19997 119997 0.951744840752379,0.9162611283848351,0.6675251...
19998 119998 0.9276692903808186,0.6771898159607004,0.242906...
19999 119999 0.6653212231837624,0.527064114047737,0.5166625...
Observe the rows and columns of the testA data set
Test_data.shape
(20000, 2)
Develop the habit of looking at head() and shape after each step. This keeps you confident that every operation did what you expected and helps you avoid a cascade of errors later. If you are unsure what a pandas operation did, it is recommended to inspect the intermediate result; this effectively deepens your understanding of the functions and operations involved.
2.3.3 Data overview
- describe() gives, for each numeric column, the count, mean, standard deviation std, minimum min, the 25%/50%/75% quantiles (50% being the median), and the maximum max. This information is mainly for grasping the approximate range of the data at a glance and for spotting abnormal values: sometimes values such as 999, 9999, or -1 are actually another way of expressing nan, which deserves attention.
- info() shows the type of each column, which helps to reveal whether there are special symbols or exceptions other than nan.
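When such sentinel codes do appear, converting them to proper NaN lets describe() and isnull() treat them as missing. A minimal sketch (the sentinel list here is illustrative; inspect your own data to find the actual codes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [0.5, 999, -1, 0.8]})

# Replace sentinel codes with NaN so downstream missing-value checks catch them
sentinels = [999, 9999, -1]          # hypothetical codes, not from this competition
df['value'] = df['value'].replace(sentinels, np.nan)

print(df['value'].isnull().sum())
```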
data.describe()
——Get the relevant statistics of the data
data.info()
——Get data type
Get relevant statistics of train data
Train_data.describe()
id label
count 100000.000000 100000.000000
mean 49999.500000 0.856960
std 28867.657797 1.217084
min 0.000000 0.000000
25% 24999.750000 0.000000
50% 49999.500000 0.000000
75% 74999.250000 2.000000
max 99999.000000 3.000000
Get train data type
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   id                 100000 non-null  int64
 1   heartbeat_signals  100000 non-null  object
 2   label              100000 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
Get relevant statistics of testA data
Test_data.describe()
id
count 20000.000000
mean 109999.500000
std 5773.647028
min 100000.000000
25% 104999.750000
50% 109999.500000
75% 114999.250000
max 119999.000000
Get testA data type
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20000 non-null int64
1 heartbeat_signals 20000 non-null object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
2.3.4 Determine missing and abnormal data
data.isnull().sum()
——View the existence of nan in each column
View the existence of nan in each column of train
Train_data.isnull().sum()
id 0
heartbeat_signals 0
label 0
dtype: int64
View the existence of nan in each column of testA
Test_data.isnull().sum()
id 0
heartbeat_signals 0
dtype: int64
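The content list above also promises outlier detection. One common approach is Tukey's IQR rule: flag points that fall outside 1.5×IQR of the quartiles. A minimal sketch on toy data (the 1.5 multiplier is a convention, not a value from the source):

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.15, 0.18, 5.0])  # toy data with one obvious outlier

# Flag points outside 1.5×IQR of the quartiles (Tukey's rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())   # → [5.0]
```

For this competition the signal values live inside the heartbeat_signals strings, so in practice the rule would be applied after expanding them into numeric columns.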
2.3.5 Understanding the distribution of predicted values
Train_data['label']
0 0.0
1 0.0
2 4.0
3 0.0
4 0.0
...
99995 4.0
99996 0.0
99997 0.0
99998 0.0
99999 1.0
Name: label, Length: 100000, dtype: float64
Train_data['label'].value_counts()
0.0 58883
4.0 19660
2.0 12994
1.0 6522
3.0 1941
Name: label, dtype: int64
## 1) Overall distribution overview (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['label']
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# 2) View skewness and kurtosis
sns.distplot(Train_data['label']);
print("Skewness: %f" % Train_data['label'].skew())
print("Kurtosis: %f" % Train_data['label'].kurt())
Skewness: 0.917596
Kurtosis: -0.825276
Train_data.skew(), Train_data.kurt()
(id 0.000000
label 0.917596
dtype: float64, id -1.200000
label -0.825276
dtype: float64)
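As a sanity check on these statistics: skew() is positive when the distribution has a longer right tail (as the label column does here), and pandas' kurt() reports excess kurtosis, which is approximately 0 for a normal distribution. A toy sketch, with made-up numbers:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])  # right-skewed toy sample

print(s.skew() > 0)    # True: long right tail pulls skewness positive
print(s.kurt())        # excess kurtosis (normal distribution ≈ 0)
```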
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')
## 3) View the specific frequency of the predicted values
plt.hist(Train_data['label'], orientation='vertical', histtype='bar', color='red')
plt.show()
2.3.6 Generate a data report with pandas_profiling
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
2.4 Summary
Exploratory data analysis is the stage where we gain a preliminary understanding of the data and become familiar with it in preparation for feature engineering. In many cases the features found during EDA can even be used directly as rules, which shows how important EDA is. The main work at this stage is to use simple statistics to understand the data as a whole, analyze the relationships among the variables, and visualize them with appropriate charts for intuitive observation. I hope the content in this section helps beginners, and I look forward to your suggestions on its shortcomings.