Machine learning in practice: Stroke patient prediction based on three major classification models

WeChat official account: You Er Hut
Author: Peter
Editor: Peter

Hello everyone, my name is Peter~

What I'm sharing with you today is a modeling walkthrough of a stroke-case dataset from Kaggle. The main content of the article follows the mind map below:

Original data address: www.kaggle.com/datasets/fe…

Import libraries
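The import cell is not shown in the excerpt; a minimal sketch covering everything used later in the article (pandas/numpy, the plotting libraries, scikit-learn, and imbalanced-learn's SMOTE) would look roughly like this:

import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, f1_score)

# oversampling for the imbalanced target
from imblearn.over_sampling import SMOTE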

basic data

First import the data and take a look at its basic information.
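The data-loading cell is not reproduced in the excerpt; assuming the CSV keeps its Kaggle file name (healthcare-dataset-stroke-data.csv, an assumption), it would be:

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # file name assumed from the Kaggle dataset
df.head()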

In [3]:

df.shape

Out[3]:

(5110, 12)

In [4]:

df.dtypes

Out[4]:

id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object  # object (string) type
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

In [5]:

df.describe()  # descriptive statistics

Out[5]:

Field distribution

gender statistics

In [6]:

plt.figure(1, figsize=(12,5))

sns.countplot(y="gender", data=df)
plt.show()

age distribution

In [7]:

px.violin(y=df["age"])

fig = px.histogram(df,
                   x="age",
                   color_discrete_sequence=['firebrick'])

fig.show()

ever_married

In [9]:

plt.figure(1, figsize=(12,5))

sns.countplot(y="ever_married", data=df)

plt.show()

There are about twice as many married people in this dataset as unmarried people.

work_type

Count the number of people in each work type.

In [10]:

plt.figure(1, figsize=(12,8))

sns.countplot(y="work_type", data=df)

plt.show()

Residence_type

In [11]:

plt.figure(1, figsize=(12,8))

sns.countplot(y="Residence_type", data=df)

plt.show()

avg_glucose_level

Distribution of blood sugar levels

fig = px.histogram(df,
                   x="avg_glucose_level",
                   color_discrete_sequence=['firebrick'])

fig.show()

Most people's average glucose level is below 100, which is within the normal range.

bmi

Distribution of bmi indicators
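The plotting code for this figure is not included in the excerpt; a histogram analogous to the avg_glucose_level plot above would be:

fig = px.histogram(df,
                   x="bmi",
                   color_discrete_sequence=['firebrick'])

fig.show()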

The bmi values have a mean of about 28 and are roughly normally distributed.

smoking_status

Statistics on smoking

plt.figure(1, figsize=(12,8))

sns.countplot(y="smoking_status", data=df)

plt.show()

Relatively few people in the dataset currently smoke or formerly smoked.

Missing values

Missing value statistics

df.isnull().sum()
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
201 / len(df)  # proportion of missing values
0.03933463796477495

Missing value visualization
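The visualization code is not shown; one common option (assuming the missingno package is installed) is a matrix plot, where white gaps mark the missing entries (here only in bmi):

import missingno as msno

msno.matrix(df)
plt.show()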

Missing value handling

Impute the missing BMI values with decision tree regression: predict bmi from age and gender, training on the rows where bmi is present.

dt_bmi = Pipeline(steps=[("scale",StandardScaler()), # standardize the data
                         ("lr",DecisionTreeRegressor(random_state=42))
                        ])

Select the three relevant columns for the imputation:

X = df[["age","gender","bmi"]].copy()

dic = {"Male":0, "Female":1, "Other":-1}

X["gender"] = X["gender"].map(dic).astype(np.uint8)
X.head()

Use the rows without missing values for training:

# rows with missing bmi
missing = X[X.bmi.isna()]

# rows with bmi present
X = X[~X.bmi.isna()]
y = X.pop("bmi")
# model training

dt_bmi.fit(X,y)
Pipeline(steps=[('scale', StandardScaler()),
                ('lr', DecisionTreeRegressor(random_state=42))])

In [23]:

# model prediction

y_pred = dt_bmi.predict(missing[["age","gender"]])
y_pred[:5]

Out[23]:

array([29.87948718, 30.55609756, 27.24722222, 30.84186047, 33.14666667])

Convert the predictions to a Series, keeping the index of the missing rows so the values line up:

predict_bmi = pd.Series(y_pred, index=missing.index)
predict_bmi
1       29.879487
8       30.556098
13      27.247222
19      30.841860
27      33.146667
          ...    
5039    32.716000
5048    28.313636
5093    31.459322
5099    28.313636
5105    28.476923
Length: 201, dtype: float64

Fill the predictions back into the original DataFrame:

df.loc[missing.index, "bmi"] = predict_bmi

After this imputation, we check again and find that no missing values remain:

df.isnull().sum()
id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

Data EDA

variables = [variable for variable in df.columns if variable not in ['id','stroke']]

# all fields except id and the target stroke
variables
['gender',
 'age',
 'hypertension',
 'heart_disease',
 'ever_married',
 'work_type',
 'Residence_type',
 'avg_glucose_level',
 'bmi',
 'smoking_status']

Continuous variables
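The distribution plots themselves are not reproduced here; a simple sketch (assuming seaborn 0.11+ for histplot) covering the three continuous columns would be:

continuous = ["age", "avg_glucose_level", "bmi"]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, continuous):
    sns.histplot(df[col], kde=True, ax=ax)  # histogram plus kernel density estimate
    ax.set_title(col)
plt.show()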

A few conclusions:

  • Age: The overall distribution is relatively balanced, and the differences in the number of different age groups are small
  • Blood sugar level: mainly concentrated below 100
  • BMI indicator: showing a certain normal distribution

Stroke vs. no stroke

Above we looked at the distributions of the continuous variables; bmi in particular shows a clearly skewed distribution. Next we compare the stroke and non-stroke cases:
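The density plots are not included in the excerpt; a sketch (again assuming seaborn 0.11+) that draws one density per class for each continuous variable:

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, ["age", "avg_glucose_level", "bmi"]):
    sns.kdeplot(data=df, x=col, hue="stroke", common_norm=False, ax=ax)  # separate densities for stroke / no stroke
    ax.set_title(col)
plt.show()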

From the three density plots, age clearly stands out as the most important factor for whether someone has a stroke.

Blood glucose and BMI across age groups

px.scatter(df,x="age",
           y="avg_glucose_level",
           color="stroke",
           trendline='ols'
          )

Relationship between age, blood glucose and bmi

px.scatter(df,x="age",
           y="bmi",
           color="stroke",
           trendline='ols'
          )

Age and stroke probability

From the scatter plots, age does look like an important factor, and it has some relationship with BMI and average glucose level.

It seems the risk increases with age. Is that really the case?
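The probability plot the article refers to here is missing from the excerpt; a sketch that reconstructs the idea by computing the observed stroke rate at each age (my assumption of how the original figure was built):

# observed stroke rate per age value
stroke_rate = df.groupby("age", as_index=False)["stroke"].mean()

px.scatter(stroke_rate,
           x="age",
           y="stroke",
           trendline="ols",          # requires statsmodels, as in the earlier scatter plots
           labels={"stroke": "stroke rate"})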

The figure above shows two things:

  1. The older a person is, the higher the probability of stroke.
  2. The overall probability of stroke is very low (the y-axis values are small), which reflects the imbalance between stroke and non-stroke samples.

Of the roughly 5,000 samples in the original data, only 249 are stroke cases, a ratio of close to 1:20.

Class imbalance

Overall attribute distributions

First we remove the records where gender is Other.

In [34]:

str_only = df[df['stroke'] == 1]   # stroke
no_str_only = df[df['stroke'] == 0]  # no stroke

In [35]:

len(str_only) 

Out[35]:

249

In [36]:

# remove Other
no_str_only = no_str_only[(no_str_only['gender'] != 'Other')]

Compare the stroke and non-stroke groups across the different attributes:
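The comparison figures are not reproduced in the excerpt; a sketch for a single attribute (smoking_status is only an example) that puts the two groups side by side:

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
sns.countplot(y="smoking_status", data=str_only, ax=axes[0])
axes[0].set_title("stroke")
sns.countplot(y="smoking_status", data=no_str_only, ax=axes[1])
axes[1].set_title("no stroke")
plt.show()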

Modeling

Model baseline

In [38]:

len(str_only)

Out[38]:

249

In [39]:

249 / len(df)  

Out[39]:

0.0487279843444227

This shows that 249 people in the data had a stroke. With len(df) records in total, the expression above gives the baseline for this task.

In other words, stroke patients make up only about 4.9% of the samples; this is the baseline to keep in mind when judging how well a model recalls the positive (stroke) cases.

Field encoding

Encode the 4 string-type fields:

In [40]:

df['gender'] = df['gender'].replace({'Male':0,
                                     'Female':1,
                                     'Other':-1}
                                   ).astype(np.uint8)

df['Residence_type'] = df['Residence_type'].map({'Rural':0,
                                                 'Urban':1}
                                               ).astype(np.uint8)

df['work_type'] = df['work_type'].map({'Private':0,
                                       'Self-employed':1,
                                       'Govt_job':2,
                                       'children':-1,
                                       'Never_worked':-2}
                                     ).astype(np.uint8)

df['ever_married'] = df['ever_married'].map({'No':0,'Yes':1}).astype(np.uint8)

df.head()

One-hot encode the smoking status:

In [41]:

df["smoking_status"].value_counts()

Out[41]:

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64

In [42]:

df = df.join(pd.get_dummies(df["smoking_status"]))
df.drop("smoking_status",axis=1,inplace=True)

Data splitting

In [43]:

# features
X  = df.drop("stroke",axis=1)
# target variable
y = df['stroke']
from sklearn.model_selection import train_test_split

# 3:7 split (30% train, 70% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)

Oversampling

As noted above, the ratio of stroke to non-stroke cases is close to 1:20, so here we apply SMOTE-based oversampling to the training set.

In [44]:

oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train.ravel())

In [45]:

len(y_train_smote)

Out[45]:

2914

In [46]:

len(X_train_smote)

Out[46]:

2914

Modeling

We build three different classification models: Random Forest, SVM, and Logistic Regression.

In [47]:

rf_pipeline = Pipeline(steps = [('scale',StandardScaler()), # standardization
                                ('RF',RandomForestClassifier(random_state=42))]  # model
                      )
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),
                                 ('SVM',SVC(random_state=42))])
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
                                    ('LR',LogisticRegression(random_state=42))])

10-fold cross-validation
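The cross-validation cell itself is missing from the excerpt; the scores printed below presumably come from something like this (using the SMOTE training set and the default accuracy scoring is my assumption):

rf_cv = cross_val_score(rf_pipeline, X_train_smote, y_train_smote, cv=10)
svm_cv = cross_val_score(svm_pipeline, X_train_smote, y_train_smote, cv=10)
logreg_cv = cross_val_score(logreg_pipeline, X_train_smote, y_train_smote, cv=10)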

Score comparison of the three models

In [49]:

print('Random Forest:', rf_cv.mean())
print('SVM:',svm_cv.mean())
print('Logistic Regression:', logreg_cv.mean())
Random Forest: 0.9628909366701726
SVM: 0.9363667907023254
Logistic Regression: 0.8859930523017683

Clearly, the random forest performs best!

Model training (fit)

In [50]:

rf_pipeline.fit(X_train_smote,y_train_smote)

svm_pipeline.fit(X_train_smote,y_train_smote)

logreg_pipeline.fit(X_train_smote,y_train_smote)

Out[50]:

Pipeline(steps=[('scale', StandardScaler()),
                ('LR', LogisticRegression(random_state=42))])

In [51]:

# predictions from the 3 models

rf_pred =rf_pipeline.predict(X_test)
svm_pred = svm_pipeline.predict(X_test)
logreg_pred = logreg_pipeline.predict(X_test)

Evaluation metrics

In [52]:

# 1. Confusion matrices

rf_cm  = confusion_matrix(y_test, rf_pred )
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm  = confusion_matrix(y_test, logreg_pred)

In [53]:

print(rf_cm)
print("----")
print(svm_cm)
print("----")
print(logreg_cm)
[[3338   66]
 [ 164    9]]
----
[[3196  208]
 [ 148   25]]
----
[[3138  266]
 [ 116   57]]
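The cell that computes rf_f1, svm_f1 and logreg_f1 is not shown; the printed values match the test-set F1 scores of the three models, so it was presumably something like:

# 2. F1 score on the test set (positive / stroke class)
rf_f1 = f1_score(y_test, rf_pred)
svm_f1 = f1_score(y_test, svm_pred)
logreg_f1 = f1_score(y_test, logreg_pred)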

print('RF F1:',rf_f1)
print('SVM F1:',svm_f1)
print('LR F1:',logreg_f1)
RF F1: 0.07258064516129033
SVM F1: 0.1231527093596059
LR F1: 0.22983870967741934

Classification report for the random forest model:

from sklearn.metrics import plot_confusion_matrix, classification_report, accuracy_score

print(classification_report(y_test,rf_pred))

print('Accuracy Score: ',accuracy_score(y_test,rf_pred))
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.12      0.05      0.07       173

    accuracy                           0.94      3577
   macro avg       0.54      0.52      0.52      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9357003075202683

Random forest model tuning

Grid search-based parameter tuning:

from sklearn.model_selection import GridSearchCV

n_estimators =[64,100,128,200]
max_features = [2,3,5,7]
bootstrap = [True,False]

param_grid = {'n_estimators':n_estimators,
             'max_features':max_features,
             'bootstrap':bootstrap}

rfc = RandomForestClassifier()
grid = GridSearchCV(rfc,param_grid)

grid.fit(X_train,y_train)
GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'bootstrap': [True, False],
                         'max_features': [2, 3, 5, 7],
                         'n_estimators': [64, 100, 128, 200]})
grid.best_params_  # the best parameter combination found
{'bootstrap': False, 'max_features': 3, 'n_estimators': 200}
# rebuild the random forest with the tuned parameters

rfc = RandomForestClassifier(
    max_features=3,
    n_estimators=200,
    bootstrap=False)

rfc.fit(X_train_smote,y_train_smote)

rfc_tuned_pred = rfc.predict(X_test)
# classification report for the tuned model

print(classification_report(y_test,rfc_tuned_pred))

print('Accuracy Score: ',accuracy_score(y_test,rfc_tuned_pred))
print('F1 Score: ',f1_score(y_test,rfc_tuned_pred))
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.05      0.02      0.03       173

    accuracy                           0.94      3577
   macro avg       0.50      0.50      0.50      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9362594352809617
F1 Score:  0.025641025641025644

Logistic regression model tuning

penalty = ['l1','l2']
C = [0.001, 0.01, 0.1, 1, 10, 100] 

log_param_grid = {'penalty': penalty, 
                  'C': C}

logreg = LogisticRegression()
grid = GridSearchCV(logreg,log_param_grid)
grid.fit(X_train_smote,y_train_smote)
GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'penalty': ['l1', 'l2']})
grid.best_params_
{'C': 1, 'penalty': 'l2'}
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
                                    ('LR',LogisticRegression(C=1,penalty='l2',random_state=42))])

logreg_pipeline.fit(X_train_smote,y_train_smote)

Out[65]:

Pipeline(steps=[('scale', StandardScaler()),
                ('LR', LogisticRegression(C=1, random_state=42))])

In [66]:

logreg_new_pred   = logreg_pipeline.predict(X_test) # new predictions

In [67]:

print(classification_report(y_test,logreg_new_pred))

print('Accuracy Score: ',accuracy_score(y_test,logreg_new_pred))
print('F1 Score: ',f1_score(y_test,logreg_new_pred))
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      3404
           1       0.18      0.33      0.23       173

    accuracy                           0.89      3577
   macro avg       0.57      0.63      0.59      3577
weighted avg       0.93      0.89      0.91      3577

Accuracy Score:  0.8932065977075762
F1 Score:  0.22983870967741934

Support vector machine tuning

In [68]:

svm_param_grid = {
            'C': [0.1, 1, 10, 100, 1000],  
            'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
            'kernel': ['rbf']} 

svm = SVC(random_state=42)

grid = GridSearchCV(svm, svm_param_grid)

In [69]:

grid.fit(X_train_smote,y_train_smote)

Out[69]:

GridSearchCV(estimator=SVC(random_state=42),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']})

In [70]:

grid.best_params_

Out[70]:

{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}

In [71]:

svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),('SVM',SVC(C=100,gamma=0.0001,kernel='rbf',random_state=42))])

svm_pipeline.fit(X_train_smote,y_train_smote)

svm_tuned_pred   = svm_pipeline.predict(X_test)

In [72]:

print(classification_report(y_test,svm_tuned_pred))

print('Accuracy Score: ',accuracy_score(y_test,svm_tuned_pred))
print('F1 Score: ',f1_score(y_test,svm_tuned_pred))
              precision    recall  f1-score   support

           0       0.96      0.93      0.94      3404
           1       0.16      0.27      0.20       173

    accuracy                           0.90      3577
   macro avg       0.56      0.60      0.57      3577
weighted avg       0.92      0.90      0.91      3577

Accuracy Score:  0.8951635448700028
F1 Score:  0.19700214132762314

Conclusion

  1. In cross-validation, the random forest performed best.
  2. Comparing the three models on the test set: the random forest has the best accuracy, but its F1-score on the stroke class is the lowest.
  3. A likely characteristic of these models: they are much better at predicting who will not have a stroke than who will.

Source: juejin.im/post/7114303398379257870