Hands-on丨We found a diabetes dataset at UCI and used machine learning to predict diabetes (1)

Author: Susan Li 

Compilation: Yuan Xueyao, Wu Shuang, Jiang Fanbo

  According to the Centers for Disease Control and Prevention, one in seven adults in the U.S. now has diabetes, and by 2050 this proportion will rise rapidly to as much as one in three. The UCI Machine Learning Repository hosts a diabetes dataset, and we will use it to see how machine learning can help us predict diabetes. Let's get started!

       https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/diabetes.csv

Data:

  The diabetes dataset can be downloaded from the UCI Machine Learning Repository.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# raw string so the backslashes in the Windows path are not treated as escape sequences
diabetes = pd.read_csv(r'C:\Download\Machine-Learning-with-Python-master\Machine-Learning-with-Python-master\diabetes.csv')
print(diabetes.columns)

  

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
The columns are: number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI (body mass index), diabetes pedigree function, age, and outcome.
diabetes.head()

  

print(diabetes.groupby('Outcome').size())
Outcome
0    500
1    268
dtype: int64

  "Outcome" is the target we are going to predict: 0 means no diabetes, 1 means diabetes. Of the 768 data points, 500 are labeled 0 and 268 are labeled 1.
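  As a rough point of reference, the class counts above imply that a model which always predicts the majority class would already be right about 65% of the time. The short sketch below works through that arithmetic; the counts are the ones printed above, and the variable names are only illustrative.

```python
# Class counts as printed by diabetes.groupby('Outcome').size()
counts = {0: 500, 1: 268}

total = sum(counts.values())
# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}

# A trivial classifier that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print("total data points:", total)
print("majority class:", majority_label)
print("majority-class baseline accuracy: {:.3f}".format(baseline_accuracy))
```

  Any classifier trained on this data is worth comparing against that baseline rather than against 50%.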
print("dimension of diabetes data: {}".format(diabetes.shape))
dimension of diabetes data: (768, 9)

  The diabetes dataset consists of 768 data points, each with 9 columns (8 features plus the outcome).
import seaborn as sns
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Outcome', data=diabetes)

  

KNN algorithm:

  The k-NN algorithm is arguably the simplest algorithm in machine learning. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training set, its "nearest neighbors." First, let's investigate the relationship between model complexity and accuracy:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out a stratified test set so both classes keep their proportions
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.loc[:, diabetes.columns != 'Outcome'],
    diabetes['Outcome'],
    stratify=diabetes['Outcome'],
    random_state=66)

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(x_train, y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(x_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(x_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')

  

  The figure above shows prediction accuracy (y-axis) against the n_neighbors setting (x-axis) on the training and test sets. With a single nearest neighbor, predictions on the training set are perfect. As more neighbors are considered, training accuracy drops, which indicates that using a single neighbor produces an overly complex model. Judging from the figure, the best choice here is about 9 neighbors.
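  To make the nearest-neighbor lookup concrete, and to show why a single neighbor scores perfectly on the training set, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn implements KNeighborsClassifier; the helper names knn_predict and accuracy and the six toy (glucose, BMI) points are hypothetical and are not taken from diabetes.csv.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k closest training points."""
    # "Fitting" is just keeping train_X and train_y around; all work happens at predict time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def accuracy(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) that k-NN labels correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == label
               for x, label in zip(X, y))
    return hits / len(y)

# Hypothetical toy data: (glucose, BMI) -> outcome
toy_X = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (100, 31.0)]
toy_y = [0, 0, 0, 1, 1, 1]

# Score the training data against itself for two values of k
train_acc_k1 = accuracy(toy_X, toy_y, toy_X, toy_y, k=1)
train_acc_k3 = accuracy(toy_X, toy_y, toy_X, toy_y, k=3)
print("training accuracy, k=1:", train_acc_k1)
print("training accuracy, k=3:", train_acc_k3)
```

  On this toy data, k=1 scores 1.0 on the training set because every point is its own nearest neighbor at distance 0. With k=3, the class-1 point at (100, 31.0) is outvoted by two nearby class-0 points, so training accuracy falls to 5/6. This is the same pattern as the left end of the training curve in the figure.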

  Following the figure, we set n_neighbors=9:

knn=KNeighborsClassifier(n_neighbors=9)
knn.fit(x_train,y_train)

print('Accuracy of K-NN classifier on training set:{:.2f}'.format(knn.score(x_train,y_train)))
print('Accuracy of K-NN classifier on test set:{:.2f}'.format(knn.score(x_test,y_test)))
Accuracy of K-NN classifier on training set:0.79
Accuracy of K-NN classifier on test set:0.78

 
