Hands-on | We found a diabetes dataset on UCI and use machine learning to predict diabetes (2)

Logistic regression:

Logistic regression is one of the most commonly used classification algorithms.

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression().fit(x_train,y_train)
print("Training set score:{:.3f}".format(logreg.score(x_train,y_train)))
print("Test set score:{:.3f}".format(logreg.score(x_test,y_test)))

 Training set score:0.781

 Test set score:0.771

The model with regularization parameter C=1 (the default value) is 78% accurate on the training set and 77% accurate on the test set.

logreg100=LogisticRegression(C=100).fit(x_train,y_train)
print("Training set score:{:.3f}".format(logreg100.score(x_train,y_train)))
print("Test set score:{:.3f}".format(logreg100.score(x_test,y_test)))

 Training set score:0.785

 Test set score:0.766

When the regularization parameter C is set to 100, the accuracy on the training set improves slightly but the accuracy on the test set drops slightly, indicating that a less regularized, more complex model does not necessarily predict better than the model with default parameters.

Therefore, we choose the default value of C=1.
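Before settling on C=1, one could also sweep over a few candidate values and compare the training and test scores directly. This is a minimal sketch; the candidate values are illustrative and not from the original article:

for C in [0.001,0.01,0.1,1,10,100]:  # illustrative candidate values
    model=LogisticRegression(C=C).fit(x_train,y_train)
    print("C={}: training={:.3f}, test={:.3f}".format(C,model.score(x_train,y_train),model.score(x_test,y_test)))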

Let's visualize the coefficients of the models obtained with three different values of the regularization parameter C.

As the plot below shows, stronger regularization (C=0.001) pushes the coefficients closer and closer to zero. Looking closely at the plot, we can also see that the feature "DiabetesPedigreeFunction" (diabetes pedigree function) has a positive coefficient for C=100, C=1, and C=0.001. This indicates that, whichever of these models we use, a higher DiabetesPedigreeFunction value is positively associated with a sample being diabetic.

# fit a model with stronger regularization (C=0.001) for comparison
logreg001=LogisticRegression(C=0.001).fit(x_train,y_train)

diabetes_features=[x for i,x in enumerate(diabetes.columns) if i!=8]  # all columns except the outcome
plt.figure(figsize=(8,6))
plt.plot(logreg.coef_.T,'o',label="C=1")
plt.plot(logreg100.coef_.T,'^',label="C=100")
plt.plot(logreg001.coef_.T,'v',label="C=0.001")
plt.xticks(range(len(diabetes_features)),diabetes_features,rotation=90)
plt.hlines(0,0,len(diabetes_features))
plt.ylim(-5,5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()

[Figure: logistic regression coefficients for C=100, C=1, and C=0.001]
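To check the DiabetesPedigreeFunction observation numerically rather than by eye, one can list each feature's coefficient under the three fitted models; a small sketch using only the objects defined above:

print("{:<26}{:>10}{:>10}{:>10}".format("feature","C=0.001","C=1","C=100"))
for name,c001,c1,c100 in zip(diabetes_features,logreg001.coef_.ravel(),logreg.coef_.ravel(),logreg100.coef_.ravel()):
    print("{:<26}{:>10.3f}{:>10.3f}{:>10.3f}".format(name,c001,c1,c100))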

Decision tree:

from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(random_state=0)
tree.fit(x_train,y_train)
print("Accuracy on training set:{:.3f}".format(tree.score(x_train,y_train)))
print("Accuracy on test set:{:.3f}".format(tree.score(x_test,y_test)))

 Accuracy on training set:1.000

 Accuracy on test set:0.714

The accuracy on the training set is 100%, while the accuracy on the test set is much worse. This indicates that the decision tree is overfitting and does not generalize well to new data. Therefore, we need to pre-prune the tree.

We set max_depth=3 to limit the depth of the tree to reduce overfitting. This reduces the training set accuracy but increases the test set accuracy.

tree=DecisionTreeClassifier(max_depth=3,random_state=0)
tree.fit(x_train,y_train)
print("Accuracy on training set:{:.3f}".format(tree.score(x_train,y_train)))
print("Accuracy on test set:{:.3f}".format(tree.score(x_test,y_test)))

 Accuracy on training set:0.773

 Accuracy on test set:0.740

 

Feature importance in decision tree:

Feature importance in a decision tree measures how important each feature is to the prediction. Each feature is given a score between 0 and 1, where 0 means "not used at all" and 1 means "perfectly predicts the target". The feature importances always sum to 1.

print("Feature importances:\n{}".format(tree.feature_importances_))

 Feature importances: [ 0.04554275 0.6830362 0. 0. 0. 0.27142106 0. 0. ]
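To make the raw array easier to read, the scores can be paired with the feature names defined earlier and sorted; a minimal sketch:

for name,importance in sorted(zip(diabetes_features,tree.feature_importances_),key=lambda p:p[1],reverse=True):
    print("{:<26}{:.3f}".format(name,importance))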

Then we can visualize the feature importance:

def plot_feature_importances_diabetes(model):
    # one horizontal bar per feature; bar length = importance score
    plt.figure(figsize=(8,6))
    n_features=8
    plt.barh(range(n_features),model.feature_importances_,align='center')
    plt.yticks(np.arange(n_features),diabetes_features)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1,n_features)

plot_feature_importances_diabetes(tree)

[Figure: decision tree feature importances]

The feature "Glucose" (blood glucose) is by far the most important feature.

Random Forest:

Let's apply a random forest of 100 trees on the diabetes dataset:

from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100,random_state=0)
rf.fit(x_train,y_train)
print("Accuracy on training set:{:.3f}".format(rf.score(x_train,y_train)))
print("Accuracy on test set:{:.3f}".format(rf.score(x_test,y_test)))

 Accuracy on training set:1.000

 Accuracy on test set:0.786

Without tuning any parameters, the random forest reaches an accuracy of 78.6%, better than both logistic regression and the single decision tree. However, we can still adjust settings such as max_depth (used below) or max_features to see whether the result improves.

rf1=RandomForestClassifier(max_depth=3,n_estimators=100,random_state=0)
rf1.fit(x_train,y_train)
print("Accuracy on training set:{:.3f}".format(rf1.score(x_train,y_train)))
print("Accuracy on test set:{:.3f}".format(rf1.score(x_test,y_test)))

 Accuracy on training set:0.800

 Accuracy on test set:0.755

The results did not improve, which suggests that random forest with default parameters works well here.
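If one also wanted to experiment with max_features, as mentioned above, a minimal sketch might look like this (the candidate values are illustrative and not from the original article):

for max_features in [2,4,6,8]:  # number of features considered at each split (illustrative values)
    rf_mf=RandomForestClassifier(n_estimators=100,max_features=max_features,random_state=0)
    rf_mf.fit(x_train,y_train)
    print("max_features={}: training={:.3f}, test={:.3f}".format(max_features,rf_mf.score(x_train,y_train),rf_mf.score(x_test,y_test)))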

Feature importance of random forest:

plot_feature_importances_diabetes(rf1)

[Figure: random forest feature importances]

Similar to the single decision tree, the random forest also ranks "Glucose" (blood glucose) as the most important feature, but it identifies "BMI" (Body Mass Index) as the second most informative feature overall. The randomness of the random forest forces the algorithm to consider many possible explanations, and as a result the random forest captures a much broader picture of the data than a single tree.
