Public number: You Er Hut
Author: Peter
Editor: Peter
Hello everyone, my name is Peter~
What I am sharing with you today is the data set modeling of stroke disease cases on kaggle. The main content of the article refers to the map:
Original data address: www.kaggle.com/datasets/fe…
import library
basic data
Import the data first and view the basic information of the data
Let's see the basic information of the data below
In [3]:
df.shape
Out[3]:
(5110, 12)
In [4]:
df.dtypes
Out[4]:
id int64
gender object
age float64
hypertension int64
heart_disease int64
ever_married object # 字符型
work_type object
Residence_type object
avg_glucose_level float64
bmi float64
smoking_status object
stroke int64
dtype: object
In [5]:
df.describe() # 描述统计信息
Out[5]:
Field distribution
gender statistics
In [6]:
plt.figure(1, figsize=(12,5))
sns.countplot(y="gender", data=df)
plt.show()
age distribution
In [7]:
px.violin(y=df["age"])
fig = px.histogram(df,
x="age",
color_discrete_sequence=['firebrick'])
fig.show()
ever_married
In [9]:
plt.figure(1, figsize=(12,5))
sns.countplot(y="ever_married", data=df)
plt.show()
There are about twice as many married people in this dataset as unmarried people.
work-type
View the number of people with different work status
In [10]:
plt.figure(1, figsize=(12,8))
sns.countplot(y="work_type", data=df)
plt.show()
Residence_type
In [11]:
plt.figure(1, figsize=(12,8))
sns.countplot(y="Residence_type", data=df)
plt.show()
avg_glucose_level
Distribution of blood sugar levels
fig = px.histogram(df,
x="avg_glucose_level",
color_discrete_sequence=['firebrick'])
fig.show()
It can be seen that most people's blood sugar is still below 100, indicating that it is normal
bmi
Distribution of bmi indicators
The mean of the bmi indicator is about 28, showing a certain normal distribution
smoking_status
Statistics on smoking
plt.figure(1, figsize=(12,8))
sns.countplot(y="smoking_status", data=df)
plt.show()
It can be seen that there are relatively few people who smoke or have ever smoked
Missing value case
Missing value statistics
df.isnull().sum()
id 0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 201
smoking_status 0
stroke 0
dtype: int64
201 / len(df) # 缺失比例
0.03933463796477495
Missing value visualization
Missing value handling
Using Decision Tree Regression to Predict Missing BMI Values: Prediction Fill by Age, Gender, and Existing BMI Values
dt_bmi = Pipeline(steps=[("scale",StandardScaler()), # 数据标准化
("lr",DecisionTreeRegressor(random_state=42))
])
Take out 3 indicators for prediction filling:
X = df[["age","gender","bmi"]].copy()
dic = {"Male":0, "Female":1, "Other":-1}
X["gender"] = X["gender"].map(dic).astype(np.uint8)
X.head()
Take out the part of non-missing values for training:
# 缺失值部分
missing = X[X.bmi.isna()]
# 非缺失值部分
X = X[~X.bmi.isna()]
y = X.pop("bmi")
# 模型训练
dt_bmi.fit(X,y)
Pipeline(steps=[('scale', StandardScaler()),
('lr', DecisionTreeRegressor(random_state=42))])
In [23]:
# 模型预测
y_pred = dt_bmi.predict(missing[["age","gender"]])
y_pred[:5]
Out[23]:
array([29.87948718, 30.55609756, 27.24722222, 30.84186047, 33.14666667])
Convert the predicted value to a Series, and note the index number:
predict_bmi = pd.Series(y_pred, index=missing.index)
predict_bmi
1 29.879487
8 30.556098
13 27.247222
19 30.841860
27 33.146667
...
5039 32.716000
5048 28.313636
5093 31.459322
5099 28.313636
5105 28.476923
Length: 201, dtype: float64
Fill into the original df data:
df.loc[missing.index, "bmi"] = predict_bmi
After the above prediction and filling, we check the missing value situation again and find that there are no missing values:
df.isnull().sum()
id 0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 0
smoking_status 0
stroke 0
dtype: int64
Data EDA
variables = [variable for variable in df.columns if variable not in ['id','stroke']]
# 除去id号和是否中风外的全部字段
variables
['gender',
'age',
'hypertension',
'heart_disease',
'ever_married',
'work_type',
'Residence_type',
'avg_glucose_level',
'bmi',
'smoking_status']
continuous variable
A few conclusions:
- Age: The overall distribution is relatively balanced, and the differences in the number of different age groups are small
- Blood sugar level: mainly concentrated below 100
- BMI indicator: showing a certain normal distribution
Stroke and not stroke
上面我们查看了连续型变量的分布情况;可以看到bmi呈现明显的左偏态的分布。下面我们对比中风和未中风的情况:
从3个密度图中能够观察到:从上面的密度图中可以看出来:对于是否中风,年龄age是一个最主要的因素
对比不同年龄段的血糖和BMI指数
px.scatter(df,x="age",
y="avg_glucose_level",
color="stroke",
trendline='ols'
)
年龄和血糖、bmi关系
px.scatter(df,x="age",
y="bmi",
color="stroke",
trendline='ols'
)
年龄和患病几率
从散点分布图中看到:年龄可能真的是一个比较重要的因素,和BMI以及平均的血糖水平有着一定的关系。
可能随着年龄的增长,风险在增加。果真如此吗?
上面的图形说明了两点:
- 年龄越大,中风的几率的确越来越高
- 中风的几率是非常低的(y轴的值很低),这是由于中风和未中风的样本不均衡造成的
原数据5000个样本中只有249个中风样本,比例接近1:20
样本不均衡
整体属性分布
首先我们剔除gender中为Other的情况
In [34]:
str_only = df[df['stroke'] == 1] # 中风
no_str_only = df[df['stroke'] == 0] # 未中风
In [35]:
len(str_only)
Out[35]:
249
In [36]:
# 剔除other
no_str_only = no_str_only[(no_str_only['gender'] != 'Other')]
比较在不同的属性下中风和未中风的情况:
建模
模型baseline
In [38]:
len(str_only)
Out[38]:
249
In [39]:
249 / len(df)
Out[39]:
0.0487279843444227
说明总共有249个人是中风的。本数据的总人数是len(df),根据下面的表达式能够得到本次模型的baseline。
也就说,对于阳性中风患者的召回率,一个好的目标是4.8%。
字段编码
对4个字符型的字段进行编码工作:
In [40]:
df['gender'] = df['gender'].replace({'Male':0,
'Female':1,
'Other':-1}
).astype(np.uint8)
df['Residence_type'] = df['Residence_type'].map({'Rural':0,
'Urban':1}
).astype(np.uint8)
df['work_type'] = df['work_type'].map({'Private':0,
'Self-employed':1,
'Govt_job':2,
'children':-1,
'Never_worked':-2}
).astype(np.uint8)
df['ever_married'] = df['ever_married'].map({'No':0,'Yes':1}).astype(np.uint8)
df.head()
抽烟状态的独热码转换:
In [41]:
df["smoking_status"].value_counts()
Out[41]:
never smoked 1892
Unknown 1544
formerly smoked 885
smokes 789
Name: smoking_status, dtype: int64
In [42]:
df = df.join(pd.get_dummies(df["smoking_status"]))
df.drop("smoking_status",axis=1,inplace=True)
数据分割
In [43]:
# 选取特征
X = df.drop("stroke",axis=1)
# 目标变量
y = df['stroke']
from sklearn.model_selection import train_test_split
# 3-7比例
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
上采样
前文中提到,本案例中风和未中风的数据比例接近1:20,在这里我们采样基于SMOTE的上采样方法
In [44]:
oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train.ravel())
In [45]:
len(y_train_smote)
Out[45]:
2914
In [46]:
len(X_train_smote)
Out[46]:
2914
建模
采用3种不同的分类模型来建立模型:Random Forest, SVM, Logisitc Regression
In [47]:
rf_pipeline = Pipeline(steps = [('scale',StandardScaler()), # 标准化
('RF',RandomForestClassifier(random_state=42))] # 模型
)
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('SVM',SVC(random_state=42))])
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('LR',LogisticRegression(random_state=42))])
10折交叉验证
3种模型得分对比
In [49]:
print('随机森林:', rf_cv.mean())
print('支持向量机:',svm_cv.mean())
print('逻辑回归:', logreg_cv.mean())
随机森林: 0.9628909366701726
支持向量机: 0.9363667907023254
逻辑回归: 0.8859930523017683
很明显:随机森林表现的最好!
模型训练fit
In [50]:
rf_pipeline.fit(X_train_smote,y_train_smote)
svm_pipeline.fit(X_train_smote,y_train_smote)
logreg_pipeline.fit(X_train_smote,y_train_smote)
Out[50]:
Pipeline(steps=[('scale', StandardScaler()),
('LR', LogisticRegression(random_state=42))])
In [51]:
# 3种模型预测
rf_pred =rf_pipeline.predict(X_test)
svm_pred = svm_pipeline.predict(X_test)
logreg_pred = logreg_pipeline.predict(X_test)
评价指标
In [52]:
# 1、混淆矩阵
rf_cm = confusion_matrix(y_test, rf_pred )
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm = confusion_matrix(y_test, logreg_pred)
In [53]:
print(rf_cm)
print("----")
print(svm_cm)
print("----")
print(logreg_cm)
[[3338 66]
[ 164 9]]
----
[[3196 208]
[ 148 25]]
----
[[3138 266]
[ 116 57]]
print('RF mean :',rf_f1)
print('SVM mean :',svm_f1)
print('LR mean :',logreg_f1)
RF mean : 0.07258064516129033
SVM mean : 0.1231527093596059
LR mean : 0.22983870967741934
随机森林模型的分类报告:
from sklearn.metrics import plot_confusion_matrix, classification_report
print(classification_report(y_test,rf_pred))
print('Accuracy Score: ',accuracy_score(y_test,rf_pred))
precision recall f1-score support
0 0.95 0.98 0.97 3404
1 0.12 0.05 0.07 173
accuracy 0.94 3577
macro avg 0.54 0.52 0.52 3577
weighted avg 0.91 0.94 0.92 3577
Accuracy Score: 0.9357003075202683
随机森林模型调参
Grid search-based parameter tuning:
from sklearn.model_selection import GridSearchCV
n_estimators =[64,100,128,200]
max_features = [2,3,5,7]
bootstrap = [True,False]
param_grid = {'n_estimators':n_estimators,
'max_features':max_features,
'bootstrap':bootstrap}
rfc = RandomForestClassifier()
grid = GridSearchCV(rfc,param_grid)
grid.fit(X_train,y_train)
GridSearchCV(estimator=RandomForestClassifier(),
param_grid={'bootstrap': [True, False],
'max_features': [2, 3, 5, 7],
'n_estimators': [64, 100, 128, 200]})
grid.best_params_ # 找到最优的参数
{'bootstrap': False, 'max_features': 3, 'n_estimators': 200}
# 再次建立随机森林模型
rfc = RandomForestClassifier(
max_features=3,
n_estimators=200,
bootstrap=False)
rfc.fit(X_train_smote,y_train_smote)
rfc_tuned_pred = rfc.predict(X_test)
# 新的分类报告得分
print(classification_report(y_test,rfc_tuned_pred))
print('Accuracy Score: ',accuracy_score(y_test,rfc_tuned_pred))
print('F1 Score: ',f1_score(y_test,rfc_tuned_pred))
precision recall f1-score support
0 0.95 0.98 0.97 3404
1 0.05 0.02 0.03 173
accuracy 0.94 3577
macro avg 0.50 0.50 0.50 3577
weighted avg 0.91 0.94 0.92 3577
Accuracy Score: 0.9362594352809617
F1 Score: 0.025641025641025644
Logistic regression model tuning
penalty = ['l1','l2']
C = [0.001, 0.01, 0.1, 1, 10, 100]
log_param_grid = {'penalty': penalty,
'C': C}
logreg = LogisticRegression()
grid = GridSearchCV(logreg,log_param_grid)
grid.fit(X_train_smote,y_train_smote)
GridSearchCV(estimator=LogisticRegression(),
param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2']})
grid.best_params_
{'C': 1, 'penalty': 'l2'}
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('LR',LogisticRegression(C=1,penalty='l2',random_state=42))])
logreg_pipeline.fit(X_train_smote,y_train_smote)
Out[65]:
Pipeline(steps=[('scale', StandardScaler()),
('LR', LogisticRegression(C=1, random_state=42))])
In [66]:
logreg_new_pred = logreg_pipeline.predict(X_test) # 新预测
In [67]:
print(classification_report(y_test,logreg_new_pred))
print('Accuracy Score: ',accuracy_score(y_test,logreg_new_pred))
print('F1 Score: ',f1_score(y_test,logreg_new_pred))
precision recall f1-score support
0 0.96 0.92 0.94 3404
1 0.18 0.33 0.23 173
accuracy 0.89 3577
macro avg 0.57 0.63 0.59 3577
weighted avg 0.93 0.89 0.91 3577
Accuracy Score: 0.8932065977075762
F1 Score: 0.22983870967741934
Support vector machine tuning
In [68]:
svm_param_grid = {
'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']}
svm = SVC(random_state=42)
grid = GridSearchCV(svm, svm_param_grid)
In [69]:
grid.fit(X_train_smote,y_train_smote)
Out[69]:
GridSearchCV(estimator=SVC(random_state=42),
param_grid={'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']})
In [70]:
grid.best_params_
Out[70]:
{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
In [71]:
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),('SVM',SVC(C=100,gamma=0.0001,kernel='rbf',random_state=42))])
svm_pipeline.fit(X_train_smote,y_train_smote)
svm_tuned_pred = svm_pipeline.predict(X_test)
In [72]:
print(classification_report(y_test,svm_tuned_pred))
print('Accuracy Score: ',accuracy_score(y_test,svm_tuned_pred))
print('F1 Score: ',f1_score(y_test,svm_tuned_pred))
precision recall f1-score support
0 0.96 0.93 0.94 3404
1 0.16 0.27 0.20 173
accuracy 0.90 3577
macro avg 0.56 0.60 0.57 3577
weighted avg 0.92 0.90 0.91 3577
Accuracy Score: 0.8951635448700028
F1 Score: 0.19700214132762314
in conclusion
- During cross-validation, random forests performed the best.
- Comparison of 3 models: random forest has the best accuracy, but the F1-score is missing the lowest
- Possible features of the model: Better predictors of who will have a stroke than who won't