Data Mining: Heart Disease Prediction (Evaluation Index; EDA)

Table of contents

1. Preliminary preparation

2. Practical exercises

2.1 Example of classification index evaluation calculation

2.2 Exploratory Data Analysis (EDA)

2.2.1 Import Function Toolbox

2.2.2 View related data such as data information

Judging missing and abnormal data

Visualization of the relationship between numeric features

Categorical feature analysis (boxplots, violin plots, histograms)

2.2.3 Feature and label construction

2.3 Model Training and Prediction

2.3.1 Use xgb to perform five-fold cross-validation to view the parameter effect of the model

2.3.2 Define xgb and lgb model functions

2.3.3 Segment the data set (Train, Val) for model training, evaluation and prediction

2.3.4 Perform weighted fusion of the results of the two models


This post continues from the previous chapter, "Data Mining: Automobile Transaction Price Prediction (Evaluation Indicators; EDA)" (Niu Da Le 2023 Blog, CSDN Blog), with a practical exercise.

1. Preliminary preparation

The data set is the heart disease data set I posted earlier as a resource. You can split it into a training set and a test set by hand.

https://download.csdn.net/download/m0_62237233/87694444?spm=1001.2014.3001.5503

2. Practical exercises

2.1 Example of classification index evaluation calculation

import pandas as pd
import numpy as np
import os, PIL, random, pathlib
data_dir = './data/'
data_dir = pathlib.Path(data_dir)
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)
data_paths = list(data_dir.glob('*'))
# path.name works on every OS; splitting on "\\" only works on Windows
classeNames = [path.name for path in data_paths]
print(classeNames)

Train_data = pd.read_csv('data/trainC.csv', sep=',')
Test_data = pd.read_csv('data/testC.csv', sep=',')
print('Train data shape:', Train_data.shape)  # includes the label, so one extra column
print('Test data shape:', Test_data.shape)

Print the related metrics

from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('ACC:',accuracy_score(y_true, y_pred))
## Precision,Recall,F1-score
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision',metrics.precision_score(y_true, y_pred))
print('Recall',metrics.recall_score(y_true, y_pred))
print('F1-score:',metrics.f1_score(y_true, y_pred))
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:',roc_auc_score(y_true, y_scores))
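
To see where these numbers come from, the Precision/Recall/F1 example above can be recomputed by hand from the confusion-matrix counts. This is only a sketch; the helper name `prf_from_counts` is made up for illustration and is not part of sklearn.

```python
import numpy as np

def prf_from_counts(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Same labels as in the Precision/Recall/F1 example
precision, recall, f1 = prf_from_counts([0, 1, 0, 1], [0, 1, 0, 0])
print(precision, recall, f1)
```

With one true positive, no false positives and one false negative, precision is 1/1 and recall is 1/2, so F1 is their harmonic mean.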

Regression metrics are computed in the same way:

# coding=utf-8
import numpy as np
from sklearn import metrics
 
# MAPE is implemented by hand here
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))
 
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
 
# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))

2.2 Exploratory Data Analysis (EDA)

2.2.1 Import Function Toolbox

## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from scipy.special import jn
from IPython.display import display, clear_output
import time

# Use a warnings filter to ignore warning messages
warnings.filterwarnings('ignore')

## Models
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

2.2.2 View related data such as data information

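## 2) Take a quick look at the data (first and last rows)
# DataFrame.append was removed in pandas 2.0, so use pd.concat
print(pd.concat([Train_data.head(), Train_data.tail()]))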

Familiarize yourself with the relevant statistics of the data through describe().

describe() reports summary statistics for each column: the count, the mean, the standard deviation (std), the minimum (min), the quartiles (25%, 50%, 75%; the 50% value is the median), and the maximum (max). This gives a quick view of the approximate range of each column and helps spot outliers.

This data set has no problems, but these preliminary checks are still worth doing as a habit.
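
As a small illustration of what describe() returns, here is a sketch on a toy column. The values are made up and are not taken from the heart disease data.

```python
import pandas as pd

# Toy data standing in for one numeric column (values are made up)
df = pd.DataFrame({"age": [40, 50, 60, 70, 80]})
summary = df["age"].describe()
print(summary)

# 'std' is the sample standard deviation, i.e. the square root of the sample variance
std_matches_var = abs(summary["std"] ** 2 - df["age"].var()) < 1e-9
# '50%' is the median
median_matches = summary["50%"] == df["age"].median()
print(std_matches_var, median_matches)
```

On the real data the same call would be `Train_data.describe()`, which produces one such column of statistics per numeric feature.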

Judging missing and abnormal data

print(Train_data.isnull().sum())

No values are missing.
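
The isnull() check only covers missing values. For abnormal values, one common rule of thumb (an assumption here, not something used in this post) is to flag points that lie more than 1.5 IQR outside the quartiles. The helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious outlier
values = pd.Series([1.0, 1.1, 0.9, 1.2, 1.0, 9.4])
mask = iqr_outlier_mask(values)
print(values[mask])
```

A flagged value is not necessarily an error, so it is worth checking against the meaning of the feature before dropping or clipping it.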

Understanding the Predicted Value Distribution

Analyze the target value, compute its statistics, and check its distribution, taking the time variable as an example.

## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
 
y = Train_data['time']
plt.figure(1);
plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2);
plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3);
plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
plt.show()

 

The time variable fits a normal distribution reasonably well. A transformation can still be applied before regression: the log transformation does a good job, but the best fit is the unbounded Johnson (Johnson SU) distribution.
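
The comparison above is made by eye. As a rough numerical check, candidate distributions can be fitted with scipy.stats and compared by log-likelihood. The sketch below uses made-up right-skewed data rather than the time column, and compares only the normal and log-normal fits; st.johnsonsu could be added in the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=4.0, sigma=0.8, size=500)  # made-up right-skewed data

# Fit each candidate by maximum likelihood, then sum the log-density over the sample
norm_params = stats.norm.fit(sample)
lognorm_params = stats.lognorm.fit(sample, floc=0)  # keep the location fixed at 0

loglik = {
    "norm": float(np.sum(stats.norm.logpdf(sample, *norm_params))),
    "lognorm": float(np.sum(stats.lognorm.logpdf(sample, *lognorm_params))),
}
best = max(loglik, key=loglik.get)
print(loglik, best)
```

A higher log-likelihood means a closer fit, although distributions with more parameters have an advantage in this comparison.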

View the frequency distribution

plt.hist(Train_data['time'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

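# After a log transformation the distribution is more even, so the log-transformed value can be predicted instead; this is a common trick in prediction problems
plt.hist(np.log(Train_data['time']), orientation = 'vertical',histtype = 'bar', color ='red')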
plt.show()

After the log transformation, time is closer to a normal distribution, so the transformed value can be used as the prediction target.
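
If the model is trained on the log-transformed target, its predictions have to be mapped back to the original scale. A minimal sketch with made-up values, using np.log1p and np.expm1 instead of the plain np.log above so that a zero in the target would not produce -inf:

```python
import numpy as np

y = np.array([4.0, 30.0, 115.0, 285.0])  # made-up positive target values
y_log = np.log1p(y)        # log(1 + y), safe when y contains zeros
y_back = np.expm1(y_log)   # inverse transform: exp(y_log) - 1
print(y_log)
print(y_back)
```

The transform is monotonic, so the ordering of the samples is preserved and only the spacing between them changes.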

Features are divided into categorical features and numeric features. First, look at the distribution of unique values for each categorical feature.

# Separate the label, i.e. the value to predict
Y_train = Train_data['time']

# Splitting by dtype works for data that has not been label-encoded.
# It does not apply here: the features have to be separated by hand according to their meaning.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=[object])
# categorical_features.columns

# Numeric features
numeric_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
# Categorical features
categorical_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking',  'DEATH_EVENT']

# Number of unique values and value counts for each categorical feature
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

Correlation analysis:

## 1) Correlation analysis

numeric_features.append('DEATH_EVENT')
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['DEATH_EVENT'].sort_values(ascending = False),'\n')

 

Visualization of the relationship between numeric features

## 4) Visualize the pairwise relationships between numeric features
sns.set()
columns = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
# pairplot takes height (the older size argument was renamed)
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()

Plot time against each of the other numeric features

## 5) Visualize the regression relationship between time and the other variables
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['age', 'creatinine_phosphokinase' , 'ejection_fraction', 'platelets', 'serum_creatinine',  'time']
age_scatter_plot = pd.concat([Y_train, Train_data['age']], axis=1)
sns.regplot(x='age', y='time', data=age_scatter_plot, scatter=True, fit_reg=True, ax=ax1)

creatinine_phosphokinase_scatter_plot = pd.concat([Y_train, Train_data['creatinine_phosphokinase']], axis=1)
sns.regplot(x='creatinine_phosphokinase', y='time', data=creatinine_phosphokinase_scatter_plot, scatter=True,
            fit_reg=True, ax=ax2)

ejection_fraction_scatter_plot = pd.concat([Y_train, Train_data['ejection_fraction']], axis=1)
sns.regplot(x='ejection_fraction', y='time', data=ejection_fraction_scatter_plot, scatter=True, fit_reg=True, ax=ax3)

platelets_scatter_plot = pd.concat([Y_train, Train_data['platelets']], axis=1)
sns.regplot(x='platelets', y='time', data=platelets_scatter_plot, scatter=True, fit_reg=True, ax=ax4)

serum_creatinine_scatter_plot = pd.concat([Y_train, Train_data['serum_creatinine']], axis=1)
sns.regplot(x='serum_creatinine', y='time', data=serum_creatinine_scatter_plot, scatter=True, fit_reg=True, ax=ax5)

# time_scatter_plot = pd.concat([Y_train, Train_data['time']], axis=1)
# sns.regplot(x='time', y='time', data=time_scatter_plot, scatter=True, fit_reg=True, ax=ax6)

plt.show()

Categorical feature analysis (boxplots, violin plots, histograms)

# Plot the categorical features (every one of them is binary in this data set)
categorical_features = ['anaemia',
                        'diabetes',
                        'high_blood_pressure',
                        'sex',
                        'smoking']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')


def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)


f = pd.melt(Train_data, id_vars=['DEATH_EVENT'], value_vars=categorical_features)  # DEATH_EVENT is the value plotted on the y axis
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "DEATH_EVENT")
plt.show()

Because every categorical feature here is 0/1 data, as is DEATH_EVENT, the box plots are not informative, so the other plot types (violin plots, histograms) are not shown.
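
For binary features and a binary label, a table is often easier to read than a box plot. The sketch below is one possible alternative, not something from the original analysis; the rows are made up and only the column names follow the data set.

```python
import pandas as pd

# Made-up rows; the column names follow the data set used above
toy = pd.DataFrame({
    "smoking":     [0, 0, 0, 1, 1, 1, 1, 0],
    "DEATH_EVENT": [0, 1, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 label within a group is the share of positive cases in that group
death_rate = toy.groupby("smoking")["DEATH_EVENT"].mean()
print(death_rate)

# Counts of each (feature value, label) combination
table = pd.crosstab(toy["smoking"], toy["DEATH_EVENT"])
print(table)
```

The same two calls can be repeated for each entry of categorical_features to compare the groups side by side.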

Per-category frequency visualization of categorical features (count_plot)

##  5) Visualize the frequency of each category of the categorical features (count_plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)
 
f = pd.melt(Train_data,  value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")
plt.show()

2.2.3 Feature and label construction

  • Extract the numeric and categorical feature column names
numerical_cols = Train_data.select_dtypes(exclude='object').columns
print(numerical_cols)
 
 
categorical_cols = Train_data.select_dtypes(include='object').columns
print(categorical_cols)

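  • Build training and testing samples
## Select the feature columns and the label column to build the training and test samples
# Assumption: feature_cols is not defined anywhere above. Taking every column in
# numerical_cols matches the 13-column shapes printed below, but it also keeps
# 'time' (the label) among the features.
feature_cols = list(numerical_cols)
X_data = Train_data[feature_cols]
Y_data = Train_data['time']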

X_test = Test_data[feature_cols]

print('X train shape:', X_data.shape)
print('X test shape:', X_test.shape)

X train shape: (209, 13)
X test shape: (90, 13)

  • Basic distribution statistics of the label
## Define a helper that prints summary statistics, for reuse later
def Sta_inf(data):
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))
    print('_std', np.std(data))
    print('_var', np.var(data))


print('Sta of label:')
Sta_inf(Y_data)
## Plot a histogram of the label to view its distribution
plt.hist(Y_data)
plt.show()

2.3 Model Training and Prediction

2.3.1 Use xgb to perform five-fold cross-validation to view the parameter effect of the model

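## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'
# 120 trees, learning rate 0.1, max depth 7
scores_train = []
scores = []
 
## 5-fold cross-validation to guard against overfitting
# Note: StratifiedKFold treats each distinct value of 'time' as a class; KFold is the usual choice for a regression target
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)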
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)
 
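print('Train mae:',np.mean(scores_train))  # mean over all folds
print('Val mae',np.mean(scores))

Output from the original run, in which the train figure was taken from the last fold only:

Train mae: 0.04781590756915864
Val mae 1.206481080991189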
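
The same train/validation comparison can be written more compactly with sklearn's cross_validate. The sketch below makes several substitutions that are assumptions, not the setup used above: synthetic data, GradientBoostingRegressor standing in for xgb.XGBRegressor, and KFold instead of StratifiedKFold because the target is continuous.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(model, X, y, cv=cv,
                        scoring="neg_mean_absolute_error",
                        return_train_score=True)

# The scorer returns negated MAE, so flip the sign back
train_mae = -result["train_score"].mean()
val_mae = -result["test_score"].mean()
print("Train mae:", train_mae)
print("Val mae:", val_mae)
```

A large gap between the two numbers, as in the output above, usually points to overfitting, so it is worth looking at both rather than at the validation score alone.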

2.3.2 Define xgb and lgb model functions

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model
 
def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)  # grid search
    gbm.fit(x_train, y_train)
    return gbm

Grid search tunes the parameters automatically: edit the entries in param_grid to choose which values are tried. Besides the learning rate, other parameters such as n_estimators can be added:

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [100, 140, 120, 130],
}
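
After fitting, the chosen combination can be read back from the search object. A minimal sketch on synthetic data, with GradientBoostingRegressor standing in for LGBMRegressor and a smaller grid than the one above; the scoring and cv settings are assumptions made here to keep the example short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=120)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

Every extra parameter multiplies the number of fits (grid size times the number of folds), so the grid is best kept small on larger data.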

2.3.3 Segment the data set (Train, Val) for model training, evaluation and prediction

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

The split here is 7:3. A 4:1 split also works, i.e. test_size=0.2.
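
Without a fixed seed the split changes on every run, and so does the validation MAE. A small sketch on made-up arrays showing a 4:1 split that is repeatable through random_state (the value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 made-up samples, 2 features
y = np.arange(20)

# 4:1 split; random_state fixes the shuffle so the split can be repeated
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
x_tr2, x_va2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(x_tr.shape, x_va.shape)
```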

print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
 
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)

print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
 
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)

2.3.4 Perform weighted fusion of the results of the two models

## Here we use a simple weighted fusion
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # The smallest predictions can be negative, but a negative target value cannot occur in reality, so they are corrected afterwards
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))

MAE of val with Weighted ensemble: 3.1147994422143657
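
The weighting rule can be checked on small numbers: each model's weight is the other model's share of the total MAE, so the weights sum to 1 and the model with the lower error gets the larger weight. The helper name `mae_weights` and all values below are made up.

```python
import numpy as np

def mae_weights(mae_a, mae_b):
    """Return the two fusion weights: 1 - own MAE / total MAE for each model."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total

# Made-up validation errors and predictions
w_a, w_b = mae_weights(2.0, 6.0)
pred_a = np.array([10.0, 20.0])
pred_b = np.array([14.0, 28.0])
blended = w_a * pred_a + w_b * pred_b
print(w_a, w_b, blended)
```

Since the blend is a convex combination, its predictions always lie between those of the two models.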

Origin blog.csdn.net/m0_62237233/article/details/130292887