Machine Learning Modeling the Pima Indian Diabetes Dataset - Paper_Enterprise Research

Diabetes Overview

Diabetes is a chronic disease caused by impaired insulin secretion by the pancreas or by the body's inability to use the insulin it produces effectively. It is one of the major health problems of the 21st century. Diabetes is accompanied by widespread complications, including cardiovascular disease, kidney disease, high blood pressure, stroke, eye disease, and lower limb amputation, all of which increase the risk of premature death. Diabetes prevention and treatment is therefore an urgent concern.

Diabetic retinopathy

 

In 2019, the estimated prevalence of diabetes in China ranked second in the world

 

The number of diabetic patients in China ranks first in the world, and China is the largest market for diabetes drug R&D. More and more young people are being diagnosed with diabetes, which makes them a growing source of revenue for pharmaceutical companies.

 

The figure below shows historical data on the prevalence of diabetes in China.

 

Diabetes places a huge burden on the economy

Diabetes also places a huge burden on the economy: in the United States, the annual cost of diagnosed diabetes is approximately $327 billion, and the total cost including undiagnosed diabetes and prediabetes approaches $400 billion.

 

Diabetes is preventable

While there is no cure for diabetes, strategies such as weight loss, healthy eating, physical activity, and access to medication can lessen the severity of the disease for many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, which makes diabetes risk prediction models an important tool for the public and for public health officials.

Diverse risk factors for diabetes

Although there are different types of diabetes, type 2 diabetes is the most common form, and its prevalence varies with age, education, income, location, race, and other social determinants of health. Much of the burden of the disease also falls on people of lower socioeconomic status. This experiment builds a machine learning model to predict the probability of diabetes and to identify the most important risk factors for the disease.

Diabetes Modeling Dataset Introduction

Diabetes dataset source: the Pima Indian Diabetes Dataset. The data set contains 768 records and 9 variables: number of pregnancies, blood glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age, and diabetes outcome.

The experimental data for this study come from the Pima Indian Diabetes dataset in the University of California, Irvine (UCI) Machine Learning Repository. The study subjects are Pima Indians living near Phoenix, Arizona. The data set has 768 records, with 8 medical predictor variables and 1 outcome variable: number of pregnancies (Pregnancies), blood glucose (Glucose), blood pressure (BloodPressure), triceps skinfold thickness (SkinThickness), insulin level (Insulin), body mass index (BMI), diabetes pedigree function (DiabetesPedigreeFunction), age (Age), and outcome (Outcome, where 1 represents diabetes and 0 represents no diabetes). There are 268 cases with an outcome of 1 (diabetic) and 500 cases with an outcome of 0 (non-diabetic).
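
The class counts quoted above imply a moderately imbalanced problem, which matters when judging any accuracy figure. As a quick sanity check, here is a minimal sketch that uses only the two counts stated in the text; the function name `class_balance` is illustrative and not part of the original study.

```python
def class_balance(n_positive: int, n_negative: int) -> dict:
    """Summarize class balance from the two outcome counts."""
    total = n_positive + n_negative
    return {
        "total": total,
        "positive_rate": n_positive / total,
        # Accuracy of a trivial model that always predicts the majority class.
        "majority_baseline_accuracy": max(n_positive, n_negative) / total,
    }


# Counts reported for the Pima Indian Diabetes dataset in the text above.
summary = class_balance(n_positive=268, n_negative=500)
print(summary["total"])                                 # 768
print(round(summary["positive_rate"], 3))               # 0.349
print(round(summary["majority_baseline_accuracy"], 3))  # 0.651
```

Any useful classifier on this data should therefore beat roughly 65% accuracy, which is one reason rank-based metrics such as ROC AUC are often preferred here.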

Model value and meaning

Our machine learning prediction model is intended to address the following research questions:

1. Can the model accurately predict whether an individual has diabetes?
2. Which risk factors best predict the risk of diabetes?
3. Can a subset of the risk factors accurately predict whether a person will have diabetes?
4. Can a few of the most important risk factors be combined into a short questionnaire that accurately predicts whether someone is likely to have diabetes or is at high risk of it?

 

Although traditional ensemble tree algorithms outperform a single decision tree, their performance still leaves room for improvement.

 

The model uses a newer symmetric-tree algorithm, which can effectively reduce overfitting and improve both the prediction speed and the predictive performance of the model.
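
The text does not name the algorithm. In a symmetric (oblivious) decision tree, the kind used by gradient-boosting libraries such as CatBoost, every node at a given depth applies the same feature and threshold, so a prediction reduces to building a bit index into a table of leaf values. The following is a minimal sketch of that idea only; the splits and leaf values are hypothetical and are not fitted to the Pima data.

```python
from typing import Sequence


def oblivious_tree_predict(
    x: Sequence[float],
    splits: Sequence[tuple[int, float]],
    leaf_values: Sequence[float],
) -> float:
    """Evaluate one symmetric (oblivious) tree.

    splits[d] = (feature_index, threshold) is shared by every node at depth d,
    so a tree of depth D has exactly 2**D leaves.
    """
    if len(leaf_values) != 2 ** len(splits):
        raise ValueError("a depth-D symmetric tree needs 2**D leaf values")
    leaf_index = 0
    for feature_index, threshold in splits:
        # Append one bit per level: 1 if the feature exceeds the threshold.
        leaf_index = (leaf_index << 1) | int(x[feature_index] > threshold)
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree over [glucose, bmi]; leaf values are made-up scores.
splits = [(0, 125.0), (1, 30.0)]
leaf_values = [0.05, 0.20, 0.45, 0.80]

print(oblivious_tree_predict([150.0, 35.0], splits, leaf_values))  # 0.8
print(oblivious_tree_predict([100.0, 25.0], splits, leaf_values))  # 0.05
```

Because every level shares one comparison, evaluation needs no pointer chasing through nodes, which is one reason this tree shape is fast at prediction time; the shared splits also constrain the tree, which tends to act as a regularizer.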

 

The diabetes prediction model performs well, with an area under the ROC curve (AUC) greater than 0.84.
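
For reference, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting one half. Below is a minimal pure-Python sketch of that definition; the labels and scores are made up for illustration and are not outputs of the model described here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = diabetic, 0 = non-diabetic; scores are predicted risks.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(round(roc_auc(labels, scores), 3))  # 0.917
```

The pairwise loop is quadratic, which is acceptable for a data set of this size (268 x 500 pairs); a rank-based formula would be the usual choice for larger data.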

 

Descriptive statistics and histograms of the variables in the Pima Indian diabetes data set show that BMI, blood pressure, and blood sugar are approximately normally distributed.

 

No variable has missing (null) entries, so the missing rate is 0 for all variables, which makes this a convenient data set for research modeling.
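
One caveat worth checking: in commonly distributed copies of this data set, absent measurements in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI are recorded as 0 rather than as empty cells, so a null-based missing rate can be 0 even when values are effectively missing. The sketch below counts such zeros. The column list is an assumption about where a zero is physiologically implausible, and the sample rows are illustrative only, not a statistic about the full data set.

```python
# Columns where a value of 0 is not physiologically plausible (assumption).
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def zero_rates(rows, columns=ZERO_AS_MISSING):
    """Return the fraction of rows with a zero in each listed column."""
    if not rows:
        raise ValueError("rows must not be empty")
    return {
        col: sum(1 for row in rows if float(row[col]) == 0.0) / len(rows)
        for col in columns
    }


# Illustrative rows, shaped like csv.DictReader output (string values).
rows = [
    {"Glucose": "148", "BloodPressure": "72", "SkinThickness": "35", "Insulin": "0", "BMI": "33.6"},
    {"Glucose": "85", "BloodPressure": "66", "SkinThickness": "29", "Insulin": "0", "BMI": "26.6"},
    {"Glucose": "183", "BloodPressure": "64", "SkinThickness": "0", "Insulin": "0", "BMI": "23.3"},
    {"Glucose": "89", "BloodPressure": "66", "SkinThickness": "23", "Insulin": "94", "BMI": "28.1"},
]
rates = zero_rates(rows)
print(rates["Insulin"])        # 0.75
print(rates["SkinThickness"])  # 0.25
```

If the zero rates turn out to be substantial on the real file, treating those zeros as missing (for example by imputing a median) before modeling may be worth considering.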

 

The variable correlation heat map shows that blood sugar, BMI, and age are highly correlated with diabetes.
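
Each cell of such a heat map is typically a Pearson correlation coefficient between two columns. Here is a minimal sketch of that computation for one feature against the 0/1 outcome; the numbers are toy values chosen for illustration, not taken from the data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy glucose readings and outcomes (1 = diabetic), for illustration only.
glucose = [90.0, 100.0, 110.0, 150.0, 160.0, 170.0]
outcome = [0, 0, 0, 1, 1, 1]
print(round(pearson(glucose, outcome), 3))  # 0.965
```

Correlation with a binary outcome measures linear association only, so a feature with a modest coefficient can still be useful to a tree-based model.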

 

Through data mining, we obtained the ranking of variable importance in the Pima Indian dataset.
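
The text does not say how the importance ranking was computed. One model-agnostic option is permutation importance: shuffle a single feature column and measure how much a score drops. The sketch below works under that assumption, with a hypothetical rule-based `predict` function and toy data standing in for a trained model.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_features, seed=0):
    """Drop in accuracy after shuffling each feature column once."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j permuted.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        permuted = accuracy(y, [predict(row) for row in X_perm])
        importances.append(baseline - permuted)
    return importances


# Hypothetical stand-in model: flags diabetes when glucose (feature 0) > 125.
def predict(row):
    return int(row[0] > 125)


# Toy rows of [glucose, age]; the labels follow the glucose rule exactly.
X = [[90, 50], [100, 30], [110, 45], [150, 25], [160, 60], [170, 35]]
y = [0, 0, 0, 1, 1, 1]
importances = permutation_importance(predict, X, y, n_features=2)
print(importances[1])  # 0.0, because the stand-in model ignores age
```

In practice the shuffle would be repeated several times and averaged on held-out data, since a single permutation on a small sample is noisy.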

Model Insight 1

Blood sugar - limit the intake of foods with high sugar content, such as white sugar, milk tea, candy, and snacks.

Model Insight 2

BMI - control body weight and exercise appropriately.

 

 

Artificial intelligence makes life better!

Copyright statement: The article comes from the official account (python bioinformatics), and no plagiarism is allowed without permission. Following the CC 4.0 BY-SA copyright agreement, please attach the original source link and this statement for reprinting.

Origin blog.csdn.net/toby001111/article/details/129647923