Datawhale Zero-Based Introduction to Data Mining: Data Analysis

Task 2: Data Analysis

Tip: this part is the Task 2 EDA (exploratory data analysis) section of the zero-based introduction to data mining. It will take you through understanding the data, getting familiar with the data, and becoming friends with the data. Follow-up exchanges are welcome.

Competition task: multi-class prediction of ECG heartbeat signals

2.1 EDA goals

  • The value of EDA lies mainly in becoming familiar with the data set, understanding it, and validating it, so as to confirm that it can be used for subsequent machine learning or deep learning.
  • Once we understand the data set, the next step is to understand the relationships among the variables and between the variables and the predicted value.
  • EDA guides data science practitioners through data processing and feature engineering, so that the structure and feature set of the data make the subsequent prediction problem more reliable.
  • Complete the exploratory analysis of the data, summarize it with some charts or text, and submit the check-in.

2.2 Content introduction

  1. Load various data science and visualization libraries:
    • Data science libraries pandas, numpy, scipy;
    • Visualization libraries matplotlib, seaborn;
  2. Load data:
    • Load training set and test set;
    • Observe the data briefly (head()+shape);
  3. Data overview:
    • Use describe() to familiarize yourself with the relevant statistics of the data
    • Get familiar with data types through info()
  4. Determine missing and abnormal data
    • View the existence of nan in each column
    • Outlier detection
  5. Understand the distribution of predicted values
    • Overall distribution overview
    • View skewness and kurtosis
    • Check the specific frequency of the predicted value

2.3 Code example

2.3.1 Load various data science and visualization libraries

#coding:utf-8
# Import the warnings package and use a filter to ignore warning messages.
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np

2.3.2 Load training set and test set

Import the training set train.csv

import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Train_data = pd.read_csv('./train.csv')

Import test set testA.csv

import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt 
Test_data = pd.read_csv('./testA.csv')

All features have been desensitized (anonymized) for release:

  • id: the unique identifier assigned to each heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence (stored as a comma-separated string; see the parsing sketch below)
  • label: the heartbeat signal category (0, 1, 2, 3)
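Since heartbeat_signals is stored as one comma-separated string per row, it loads with dtype object; before any numeric analysis it needs to be parsed into floats. A minimal sketch (assuming, as the rows shown below suggest, that all sequences have the same length):

# Parse the comma-separated signal strings into a numeric 2-D array.
import numpy as np
import pandas as pd

Train_data = pd.read_csv('./train.csv')
signals = Train_data['heartbeat_signals'].apply(
    lambda s: np.array(s.split(','), dtype=np.float32)
)
# Stack into an (n_samples, signal_length) matrix; assumes equal lengths.
signal_matrix = np.vstack(signals.values)
print(signal_matrix.shape)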

data.head().append(data.tail()): view the first and last rows of the data

data.shape: view the number of rows and columns of the data set

Observe the first and last rows of train

Train_data.head().append(Train_data.tail())
          id                                  heartbeat_signals  label
0          0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
1          1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
3          3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
4          4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...    0.0
99996  99996  0.9268571578157265,0.9063471198026871,0.636993...    2.0
99997  99997  0.9258351628306013,0.5873839035878395,0.633226...    3.0
99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...    2.0
99999  99999  0.9259994004527861,0.916476635326053,0.4042900...    0.0

Observe the row and column information of the train data set

Train_data.shape
(100000, 3)

Observe the first and last rows of testA

Test_data.head().append(Test_data.tail())
id	heartbeat_signals
0	100000	0.9915713654170097,1.0,0.6318163407681274,0.13...
1	100001	0.6075533139615096,0.5417083883163654,0.340694...
2	100002	0.9752726292239277,0.6710965234906665,0.686758...
3	100003	0.9956348033996116,0.9170249621481004,0.521096...
4	100004	1.0,0.8879490481178918,0.745564725322326,0.531...
19995	119995	1.0,0.8330283177934747,0.6340472606311671,0.63...
19996	119996	1.0,0.8259705825857048,0.4521053488322387,0.08...
19997	119997	0.951744840752379,0.9162611283848351,0.6675251...
19998	119998	0.9276692903808186,0.6771898159607004,0.242906...
19999	119999	0.6653212231837624,0.527064114047737,0.5166625...

Observe the rows and columns of the testA data set

Test_data.shape
(20000, 2)

Develop the habit of checking the head() and shape of a data set at every step. This keeps you confident in each operation and prevents a cascade of errors in the next one. If you are not sure about a pandas (or other) operation, it is recommended to run it step by step and look at the result; this effectively deepens your understanding of the functions and operations.
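One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the head().append(tail()) idiom above only works on older versions. On newer pandas, an equivalent sketch:

# Equivalent of head().append(tail()) on pandas >= 2.0.
import pandas as pd

Train_data = pd.read_csv('./train.csv')
print(pd.concat([Train_data.head(), Train_data.tail()]))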

2.3.3 Data overview

  1. describe() gives statistics for each numeric column: count, mean, std (standard deviation), min, the 25%/50%/75% quantiles (the 50% quantile is the median), and max. This information mainly helps you grasp the approximate range of the data at a glance and judge abnormal values in each column. For example, values such as 999, 9999, or -1 are sometimes just another way of expressing NaN, so they deserve attention (a quick scan for such codes is sketched right after this list).
  2. info() shows the type of each column, which helps reveal whether there are special symbols or anomalies other than NaN.
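As mentioned in point 1, sentinel codes sometimes encode missing values without showing up as NaN. A quick illustrative scan (the sentinel list here is an example, not specific to this competition's data):

# Count suspicious sentinel codes that often stand in for missing values.
import pandas as pd

Train_data = pd.read_csv('./train.csv')
sentinels = [999, 9999, -1]
for col in Train_data.select_dtypes(include='number').columns:
    hits = Train_data[col].isin(sentinels).sum()
    if hits:
        print(f'{col}: {hits} suspicious value(s)')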

data.describe(): get the relevant statistics of the data

data.info(): get the data types

Get relevant statistics of train data

Train_data.describe()
id	label
count	100000.000000	100000.000000
mean	49999.500000	0.856960
std	28867.657797	1.217084
min	0.000000	0.000000
25%	24999.750000	0.000000
50%	49999.500000	0.000000
75%	74999.250000	2.000000
max	99999.000000	3.000000

Get train data types

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 100000 non-null  int64  
 1   heartbeat_signals  100000 non-null  object 
 2   label              100000 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB

Get relevant statistics of testA data

Test_data.describe()
id
count	20000.000000
mean	109999.500000
std	5773.647028
min	100000.000000
25%	104999.750000
50%	109999.500000
75%	114999.250000
max	119999.000000

Get testA data type

Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 20000 non-null  int64 
 1   heartbeat_signals  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB

2.3.4 Determine missing and abnormal data

data.isnull().sum(): check how many NaN values each column contains

Check the NaN values in each column of train

Train_data.isnull().sum()
id                   0
heartbeat_signals    0
label                0
dtype: int64

Check the NaN values in each column of testA

Test_data.isnull().sum()
id                   0
heartbeat_signals    0
dtype: int64

2.3.5 Understanding the distribution of predicted values

Train_data['label']
0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64
Train_data['label'].value_counts()
0.0    64327
3.0    17912
2.0    14199
1.0     3562
Name: label, dtype: int64
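The counts reveal a clear class imbalance, which is worth quantifying as proportions since it will matter later for validation splits and loss weighting. A small sketch:

# Class proportions instead of raw counts; highlights the imbalance.
import pandas as pd

Train_data = pd.read_csv('./train.csv')
print(Train_data['label'].value_counts(normalize=True))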
## 1) Overall distribution overview (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['label']
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)

(Figures: 'Default', 'Normal', and 'Log Normal' fits of the label distribution)

# 2) View skewness and kurtosis
sns.distplot(Train_data['label']);
print("Skewness: %f" % Train_data['label'].skew())
print("Kurtosis: %f" % Train_data['label'].kurt())
Skewness: 0.871005
Kurtosis: -1.009573

(Figure: distribution plot of the label)
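As a cross-check, scipy computes the same statistics; pandas uses bias-corrected estimators while scipy's defaults do not, but at 100,000 samples the difference is negligible:

# Cross-check with scipy; kurtosis() reports excess kurtosis by default,
# matching pandas' kurt() up to bias correction.
import scipy.stats as st
import pandas as pd

Train_data = pd.read_csv('./train.csv')
print('Skewness:', st.skew(Train_data['label']))
print('Kurtosis:', st.kurtosis(Train_data['label']))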

Train_data.skew(), Train_data.kurt()
(id       0.000000
 label    0.871005
 dtype: float64, id      -1.200000
 label   -1.009573
 dtype: float64)
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')

(Figure: distribution of kurtosis values across columns)

## 3) Check the specific frequency of the predicted values
plt.hist(Train_data['label'], orientation='vertical', histtype='bar', color='red')
plt.show()

(Figure: histogram of label frequencies)

2.3.6 Generate a data report with pandas_profiling

import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
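Note that the pandas-profiling package has since been renamed; on newer installs the same report is generated with ydata-profiling:

# pandas-profiling was renamed to ydata-profiling; equivalent usage:
from ydata_profiling import ProfileReport
import pandas as pd

Train_data = pd.read_csv('./train.csv')
pfr = ProfileReport(Train_data)
pfr.to_file('./example.html')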

2.4 Summary

Exploratory data analysis is the stage where we build a preliminary understanding of the data and become familiar with it, in preparation for feature engineering. In many cases the features extracted during EDA can even be used directly as rules, which shows how important EDA is. The main work at this stage is to use simple statistics to get an overall picture of the data, analyze the relationships among the various variables, and visualize them with appropriate charts for intuitive observation. I hope the content of this section helps beginners, and I look forward to your suggestions on its shortcomings.
