Data Mining Tianchi Competition: Multi-class Prediction of ECG Heartbeat Signals (Task 2: Data Analysis)

1. EDA goals

  • The main value of EDA is to become familiar with the data set, understand it, and verify that it can actually be used for subsequent machine learning or deep learning.
  • Once we understand the data set, the next step is to explore the relationships among the variables and between the variables and the predicted value.
  • EDA guides the data-processing and feature-engineering steps, so that the structure and feature set of the data make the subsequent prediction problem more reliable.

2. Content introduction

  1. Load various data science and visualization libraries:
    • Data science libraries pandas, numpy, scipy;
    • Visualization libraries matplotlib, seaborn;
  2. Load data:
    • Load training set and test set;
    • Observe the data briefly (head()+shape);
  3. Data overview:
    • Use describe() to familiarize yourself with the relevant statistics of the data
    • Get familiar with data types through info()
  4. Determine missing and abnormal data
    • View the existence of nan in each column
    • Outlier detection
  5. Understand the distribution of predicted values
    • Overall distribution overview
    • View skewness and kurtosis
    • Check the specific frequency of the predicted value

3. Code practice

3.1 Load various libraries

#coding:utf-8
# Import the warnings package and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')
import missingno as msno  # missing-value visualization
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np

3.2 Load training set and test set

path='./data/'
train_data=pd.read_csv(path+'train.csv')
test_data=pd.read_csv(path+'testA.csv')

3.3 First look at the data

Observe the shape of the data (numbers of rows and columns)

print('Train_data shape:',train_data.shape)
print('Test_data shape:',test_data.shape)
Train_data shape: (100000, 3)
Test_data shape: (20000, 2)

View the beginning and end data

# First and last five rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([train_data.head(), train_data.tail()])

(screenshot: first and last five rows of train_data)

  • id: unique identifier of each heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence
  • label: heartbeat signal category (0, 1, 2, 3)
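Since heartbeat_signals is stored as a single text field, it has to be split into numbers before any modeling. A minimal sketch, assuming the samples are comma-separated (as the head() output suggests); the one-row demo frame is hypothetical:

```python
import numpy as np
import pandas as pd

def parse_signals(df, col='heartbeat_signals'):
    """Split each comma-separated signal string into a float array."""
    return df[col].apply(lambda s: np.array(s.split(','), dtype=float))

# Hypothetical one-row frame in the same layout as train.csv
demo = pd.DataFrame({'id': [0],
                     'heartbeat_signals': ['0.99,0.94,0.76,0.25,0.0'],
                     'label': [0.0]})
signals = parse_signals(demo)
print(signals[0].shape)  # number of sampling points in the first signal
```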

3.5 Data overview

train_data.describe()

(screenshot: output of train_data.describe())

  • data.describe() — returns summary statistics of the data

  • For each numeric column, describe() reports the count, mean, standard deviation (std), minimum, the 25%/50%/75% quantiles, and the maximum. These figures give an immediate sense of the approximate range of the data and help spot anomalous values; for example, values such as 999, 9999 or -1 are sometimes just another way of encoding nan, which is worth watching for.
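A hedged sketch of the sentinel-value check described above: replacing codes such as 999, 9999 or -1 with real nan so that later missing-value handling catches them. The series here is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column where 999, 9999 and -1 are really "missing" codes
s = pd.Series([3.2, 999, 4.1, -1, 9999])
cleaned = s.replace([999, 9999, -1], np.nan)
print(cleaned.isna().sum())  # 3
```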

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 100000 non-null  int64  
 1   heartbeat_signals  100000 non-null  object 
 2   label              100000 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
  • data.info() — shows the data types

  • info() reveals the type of each column, which helps detect special placeholder values other than nan. Here heartbeat_signals is stored as an object (a string), so it will need to be parsed into numbers later.

3.6 Judging data missing and abnormal

train_data.isnull().sum()
id                   0
heartbeat_signals    0
label                0
dtype: int64
test_data.isnull().sum()
id                   0
heartbeat_signals    0
dtype: int64
  • Neither the training set nor the test set contains missing values
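Beyond isnull().sum(), the missing ratio per column can be computed directly, and the missingno library imported earlier offers a visual overview. A small sketch on a hypothetical frame with one hole:

```python
import pandas as pd

# Hypothetical frame with one missing cell, standing in for train_data
df = pd.DataFrame({'id': [0, 1, 2],
                   'heartbeat_signals': ['0.1,0.2', None, '0.3,0.4'],
                   'label': [0.0, 1.0, 0.0]})

# Missing ratio per column, worst first
miss_ratio = df.isnull().mean().sort_values(ascending=False)
print(miss_ratio)

# For a visual overview, the missingno library imported above can be used:
# import missingno as msno
# msno.matrix(df)  # white gaps mark missing cells
```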

3.7 Distribution of predicted values

1> Frequency analysis of the predicted value

  • Frequency statistics
train_data['label'].value_counts()
0.0    64327
3.0    17912
2.0    14199
1.0     3562
Name: label, dtype: int64
  • Visualized statistics of histogram and pie chart
# Visualize the distribution of the target variable
fig, axs = plt.subplots(1, 2, figsize=(14, 7))
# Bar chart
sns.countplot(x='label', data=train_data, ax=axs[0])
axs[0].set_title('Frequency of each Class')

# Pie chart
train_data['label'].value_counts().plot(kind='pie', ax=axs[1], autopct='%1.2f%%')
axs[1].set_title('Percentage of each Class')

(figure: bar chart and pie chart of the class distribution)
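The counts above show a strong class imbalance (class 0 covers roughly 64% of the samples). One common response, sketched here, is to derive inverse-frequency class weights, which many learners accept; the formula below follows scikit-learn's "balanced" heuristic, n_samples / (n_classes * count):

```python
import pandas as pd

# Label counts copied from the value_counts() output above
counts = pd.Series({0.0: 64327, 3.0: 17912, 2.0: 14199, 1.0: 3562})

freq = counts / counts.sum()                     # relative frequency per class
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency ("balanced") weights
print(freq.round(4))
print(weights.round(2))
```

The rare class 1 receives the largest weight, which counteracts the dominance of class 0 during training.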

2> Distribution of the predicted value

  • Why analyze the distribution of the predicted value?
  • To check whether the data follow a normal distribution.
  • Why transform towards a normal distribution?
  • When the data follow a normal distribution, the mean and variance are independent of each other, the distribution has many convenient properties, and many models assume normality. Linear regression, for example, assumes that the errors are normally distributed, so the likelihood of each sample point can be written as a normal density; multiplying the likelihoods of all training samples and taking the logarithm gives the conditional log-likelihood of the training set, and maximizing it is exactly the problem linear regression solves. The resulting expression is the familiar sum of squared errors. In short, many ML models assume that data or parameters follow a normal distribution.
  • Transformation methods
    • Linear transformation (z-scores)
    • Box-Cox transformation
    • Yeo-Johnson transformation
    Note: blindly assuming that a variable follows a normal distribution can lead to inaccurate results; the assumption must be combined with analysis. For example, a stock price cannot be assumed to be normal, because the price cannot be negative; we instead assume it follows a log-normal distribution to guarantee values ≥ 0. A stock return, however, can be negative, so the return may be assumed to follow a normal distribution.
  • The unbounded Johnson distribution?
    When sample data show that a quality characteristic is not normally distributed, methods based on the normal distribution will give incorrect judgments. The Johnson distribution family consists of the probability distributions of random variables that become normal after a Johnson transformation. It comprises three families: the bounded SB, the log-normal SL, and the unbounded SU.
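The linear-regression argument above can be compressed into one derivation. Assuming Gaussian errors, i.e. $y_i = w^\top x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the log-likelihood of the training set is

```latex
\log L(w) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)
          = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
            - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2
```

so maximizing the likelihood in $w$ is exactly minimizing the sum of squared errors $\sum_i (y_i - w^\top x_i)^2$.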
import scipy.stats as st
y = train_data['label']
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# Note: distplot is deprecated since seaborn 0.11; histplot/displot are its replacements.

(figures: label distribution with the default kernel estimate, a fitted normal, and a fitted log-normal)
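The Box-Cox and Yeo-Johnson transformations listed above are both available in scipy.stats. Note that they apply to continuous skewed variables; the label here is a category code, so this sketch uses synthetic right-skewed data instead:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # right-skewed, strictly positive data

x_bc, lam_bc = stats.boxcox(x)       # Box-Cox requires strictly positive input
x_yj, lam_yj = stats.yeojohnson(x)   # Yeo-Johnson also handles zero/negative values

# Skewness should shrink towards 0 after either transformation
print(round(stats.skew(x), 2), round(stats.skew(x_bc), 2), round(stats.skew(x_yj), 2))
```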
3> View skewness and kurtosis

Skewness:

  • A measure of the direction and degree of asymmetry of a statistical distribution. By definition, skewness is the third standardized moment of the sample.
  • A normal distribution has skewness = 0; a right-skewed (positively skewed) distribution has skewness > 0; a left-skewed (negatively skewed) distribution has skewness < 0.

Kurtosis:

  • Kurtosis (also called the coefficient of kurtosis) characterizes how peaked the probability density curve is at the mean; intuitively, it reflects the sharpness of the peak. The kurtosis of a random variable is the ratio of its fourth central moment to the square of its variance.
  • Under this definition a normal distribution has kurtosis = 3, heavy tails give kurtosis > 3, and light tails give kurtosis < 3. Note that pandas' kurt() reports excess kurtosis (kurtosis minus 3), so normal data score about 0; the negative value below indicates lighter tails than a normal distribution.
print('Skewness: %f' % train_data['label'].skew())
print('Kurtosis: %f' % train_data['label'].kurt())
Skewness: 0.871005
Kurtosis: -1.009573
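pandas' kurt() follows Fisher's convention (excess kurtosis), which explains why roughly normal data score near 0 rather than 3. A quick check on synthetic data, contrasting the two conventions via scipy:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=100_000))

print(round(x.kurt(), 2))                         # pandas: excess kurtosis, ~0 for normal data
print(round(stats.kurtosis(x, fisher=False), 2))  # Pearson convention, ~3 for normal data
```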

4> Use pandas_profiling to generate a data report

import pandas_profiling
pfr = pandas_profiling.ProfileReport(train_data)
pfr.to_file('./example.html')

Origin blog.csdn.net/weixin_45666566/article/details/115022787