Data Visualization and Machine Learning Modeling: Heart Failure Prediction

Data Analysis and Visualization

Cardiovascular disease (CVD), of which heart failure is a common outcome, is the leading cause of death worldwide, claiming approximately 17.9 million lives each year and accounting for 31% of all deaths.

Most cardiovascular disease can be prevented by using population-wide strategies to address behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol.

People with cardiovascular disease, or who are at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidemia, or established disease), require early detection and management, where machine learning models can be of great help.

Data Visualization

Introduction

To better understand the dataset, we represent it graphically. This helps us interpret the data and identify patterns. Because of the way the human brain processes information, it is easier to grasp large amounts of complex data from charts or graphs than by poring over spreadsheets or reports.

Preprocessing:

First, we check whether there are any NaN values in the dataset. This helps us verify the integrity of the data. As you can see in the image below, there are no NaN values, which means the dataset is complete.

[Figure: per-column NaN counts]
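The check described above can be sketched with pandas as follows. This is a minimal sketch, not the author's code: the small frame is illustrative stand-in data, and the column names are assumptions. In practice `df` would be loaded from the dataset file.

```python
import pandas as pd

# Illustrative stand-in for the real data; column names are assumed.
df = pd.DataFrame({
    "age": [61, 48, 73, 55],
    "ejection_fraction": [35, 40, 25, 60],
    "DEATH_EVENT": [1, 0, 1, 0],
})

# Count missing values per column; all zeros means the data is complete.
nan_counts = df.isna().sum()
print(nan_counts)

# Single yes/no answer for the whole frame.
has_missing = bool(df.isna().any().any())
print("Any missing values:", has_missing)
```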

Next, we obtain a summary of the dataset, which shows the number of columns, their data types, and the number of entries.

[Figure: dataset summary]

As can be seen from the figure above, the dataset has 13 columns and 299 entries, and all columns are numeric.
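A summary like the one described can be obtained with `DataFrame.info()`. The sketch below uses a tiny stand-in frame with assumed column names, so the printed counts will differ from the 13 columns and 299 entries of the real dataset.

```python
import pandas as pd

# Illustrative stand-in frame; the real dataset has 13 columns and 299 entries.
df = pd.DataFrame({
    "age": [60, 72, 45],
    "serum_creatinine": [1.1, 1.9, 0.9],
    "smoking": [0, 1, 0],
})

# info() prints column names, non-null counts, and dtypes.
df.info()

# shape gives (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} entries, {n_cols} columns")

# Check that every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
print("All columns numeric:", all_numeric)
```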

Next, we rename the columns and recode their values to better convey their meaning: the 1s and 0s are replaced with descriptive strings, and the column names are changed for clarity and readability. We end up with a dataset that looks like this:

[Figure: dataset after renaming and recoding]
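The renaming and recoding step might look like the sketch below. The original column names (`sex`, `smoking`, `DEATH_EVENT`) and the direction of the 0/1 coding are assumptions, and the data is illustrative. The new names follow those used in this article.

```python
import pandas as pd

# Illustrative stand-in frame; original column names are assumed.
df = pd.DataFrame({
    "sex": [1, 0, 1],
    "smoking": [0, 0, 1],
    "DEATH_EVENT": [1, 0, 0],
})

# Rename columns to more descriptive names.
df = df.rename(columns={
    "sex": "Gender",
    "smoking": "Smoking",
    "DEATH_EVENT": "PatientStatus",
})

# Replace 0/1 codes with readable labels (the coding direction is assumed).
df["Gender"] = df["Gender"].map({0: "Female", 1: "Male"})
df["Smoking"] = df["Smoking"].map({0: "No", 1: "Yes"})
df["PatientStatus"] = df["PatientStatus"].map({0: "Survived", 1: "Died"})

print(df)
```

Keeping an untouched numeric copy of the frame is usually worthwhile, since most models expect 0/1 values rather than strings.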

Target Data: Patient Status

PatientStatus: whether the patient died during the follow-up period. (Boolean)

The goal is to predict whether a patient died during the follow-up period. We first check whether the target variable is balanced by drawing a pie chart of it.

As the pie chart shows, the positive class (patients who died) accounts for only 32.1% of the dataset, so the target variable is imbalanced. For this task that is expected rather than a problem: most patients in the cohort survived the follow-up period.
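The class shares behind such a pie chart can be computed with `value_counts(normalize=True)`. The labels below are illustrative only, so the printed percentages are not those of the real dataset.

```python
import pandas as pd

# Illustrative labels only, not the real data (which has 299 entries).
status = pd.Series(
    ["Died", "Survived", "Survived", "Died", "Survived", "Survived"],
    name="PatientStatus",
)

# Share of each class as a percentage of all entries.
class_share = status.value_counts(normalize=True).mul(100).round(1)
print(class_share)

# The same Series can feed a pie chart via class_share.plot.pie(),
# which requires matplotlib (assumed to be installed).
```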

Age and Heart Failure

Age: the age of the patient. (integer)

We start by plotting the first feature, "Age", against "Patient Status" in a histogram. Ages in the dataset range from 40 to 95.

  • 49.35% of patients over the age of 70 died during follow-up (38 of 77).

  • 26.13% of patients under the age of 70 died during follow-up (58 of 222).
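The percentages above follow directly from the counts given in parentheses. As a quick check, the sketch below recomputes them and confirms that the two age groups together cover all 299 entries, with an overall rate matching the 32.1% share of the target variable. The helper name `rate_percent` is hypothetical.

```python
# Counts quoted in the text: (positive PatientStatus, group size) per age group.
groups = {
    "over 70": (38, 77),
    "under 70": (58, 222),
}


def rate_percent(events: int, total: int) -> float:
    """Return events / total as a percentage rounded to two decimals."""
    return round(100 * events / total, 2)


rates = {name: rate_percent(e, n) for name, (e, n) in groups.items()}
print(rates)

# The two groups together should cover the whole dataset.
total_events = sum(e for e, _ in groups.values())
total_patients = sum(n for _, n in groups.values())
print(total_patients, rate_percent(total_events, total_patients))
```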

Sex and Heart Failure

Gender: the gender of the patient. (Boolean)

Next, we plot Gender against Patient Status. As the plot shows, males account for 64.9% of the dataset.

  • 31% of male patients died during follow-up.

  • 32% of female patients died during follow-up.
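Per-group rates like these can be computed with a `groupby`: the mean of a 0/1 outcome column within each group is that group's event rate. The rows below are illustrative, and the column names and 0/1 coding are assumptions.

```python
import pandas as pd

# Illustrative stand-in rows; PatientStatus is 1 for a positive outcome.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "PatientStatus": [1, 0, 0, 0, 1, 0],
})

# Share of each gender in the dataset, in percent.
gender_share = df["Gender"].value_counts(normalize=True).mul(100).round(1)

# Mean of a 0/1 column within each group is the event rate of that group.
event_rate = df.groupby("Gender")["PatientStatus"].mean().mul(100).round(1)

print(gender_share)
print(event_rate)
```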

Diabetes and Heart Failure

Diabetes: indicates whether the patient has diabetes. (Boolean)

Next, we plot Diabetes against Patient Status. As the plot shows, diabetic patients account for 41.8% of the dataset, so this feature is relatively balanced.

  • The average age of diabetic patients is 60, while the average age of non-diabetic patients who died is 68.5.

  • The plot shows a cluster of diabetic patients who died between the ages of 59 and 60.

  • Overall, non-diabetics tend to live longer than diabetics.

Smoking and Heart Failure

Smoking: indicates whether the patient smokes. (Boolean)

Smokers make up 32.1% of the dataset. This feature is therefore unbalanced, which may bias the results.

  • Among smokers, 50% of deaths occurred between the ages of 60 and 72.

  • Among non-smokers, 25% of deaths occurred between the ages of 60 and 75.
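Statements of the form "50% of a group falls between two ages" read like the box of a box plot, that is, the interquartile range. Assuming that is what is meant here, the bounds can be computed with `Series.quantile`. The ages below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative ages only; not taken from the real dataset.
ages = pd.Series([45, 52, 60, 63, 65, 70, 72, 80, 90])

# The first and third quartiles bound the middle 50% of the values.
q1, q3 = ages.quantile([0.25, 0.75])
print(f"Middle 50% of ages lie between {q1:.0f} and {q3:.0f}")
```

To get these bounds per group, the same call can be applied after grouping by the feature of interest.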

High Blood Pressure and Heart Failure

Hypertension: indicates whether the patient has high blood pressure (hypertension). (Boolean)

Similarly, we plot patients with high blood pressure. They account for 35.1% of the dataset.

  • Among patients with high blood pressure, 50% of deaths occurred between the ages of 50 and 75.

Ejection Fraction and Heart Failure

Ejection Fraction (EF): The percentage of blood that leaves the heart with each contraction. (integer)

A normal heart may have an ejection fraction between 50% and 70%. An ejection fraction measurement below 40% may be evidence of heart failure.

An EF of 41% to 49% might be considered "borderline." It does not always indicate that a person is developing heart failure. Instead, it could indicate damage, possibly from a previous heart attack.

  • 33.8% of patients with a low ejection fraction died during follow-up.

  • 19% of patients with a normal ejection fraction died during follow-up.

  • There are too few patients with a high ejection fraction to draw reliable inferences.
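The grouping used in these bullets can be sketched as a small function. The thresholds follow the ranges quoted above. The text leaves exactly 40% unassigned, so grouping it with "low" is an assumption, as are the function name `ef_category` and the label names.

```python
def ef_category(ef_percent: float) -> str:
    """Map an ejection fraction (in percent) to a coarse category.

    Thresholds follow the ranges quoted in the text. Exactly 40% is not
    covered there; placing it in "low" is an assumption of this sketch.
    """
    if ef_percent <= 40:
        return "low"
    if ef_percent < 50:
        return "borderline"
    if ef_percent <= 70:
        return "normal"
    return "high"


for value in (25, 45, 60, 80):
    print(value, ef_category(value))
```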

Platelets

Platelets: platelet count in the blood (kiloplatelets/mL). (integer)

A normal platelet count ranges from 150,000 to 350,000 platelets per microliter of blood.

  • 30% of patients with normal platelet counts died during follow-up.

  • 37% of patients with high platelet counts died during follow-up.

  • 41% of patients with low platelet counts died during follow-up.

Anemia

Anemia: indicates whether the patient suffers from anemia. (Boolean)

Anemic patients account for about 43.1% of the dataset, so this feature is fairly balanced.

  • Among anemic patients, 50% of deaths occurred between the ages of 58 and 75.

Creatinine Phosphokinase

Creatinine Phosphokinase: the level of the CPK (creatine phosphokinase) enzyme in the blood (mcg/L). (integer)

In healthy adults, serum CK levels vary according to several factors (sex, race, and activity), but the normal range is 22 to 198 U/L (units per liter).

  • 32.7% of patients with high CPK values died during follow-up.

  • 24.7% of patients with normal CPK values died during follow-up.

Serum Creatinine

Serum Creatinine: the level of creatinine in the blood (mg/dL). (float)

The normal range for creatinine in the blood is roughly 0.84 to 1.21 mg per deciliter (74.3 to 107 micromoles per liter), although this may vary by laboratory, sex, and age.

  • 25.7% of patients with normal creatinine levels died during follow-up.

  • 52.8% of patients with high creatinine levels died during follow-up.
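The two unit ranges quoted above are consistent with each other, as the sketch below checks using the commonly cited approximate factor of 88.4 µmol/L per mg/dL for creatinine. The function names are hypothetical, and treating anything above 1.21 mg/dL as "high" is an assumption based on the quoted range.

```python
# Approximate conversion factor for creatinine: 1 mg/dL is about 88.4 µmol/L.
UMOL_PER_L_PER_MG_PER_DL = 88.4


def creatinine_mg_dl_to_umol_l(mg_dl: float) -> float:
    """Convert a serum creatinine value from mg/dL to µmol/L."""
    return mg_dl * UMOL_PER_L_PER_MG_PER_DL


def is_high_creatinine(mg_dl: float, upper_limit: float = 1.21) -> bool:
    """True if the value exceeds the upper end of the quoted normal range."""
    return mg_dl > upper_limit


low, high = 0.84, 1.21
print(round(creatinine_mg_dl_to_umol_l(low), 1))
print(round(creatinine_mg_dl_to_umol_l(high), 1))
```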

Serum Sodium

Serum Na: the level of sodium in the blood (mEq/L). (integer)

Normal blood sodium levels are between 135 and 145 milliequivalents per liter (mEq/L), although this can vary by laboratory, sex, and age.

  • 32% of patients with normal sodium levels died during follow-up.

  • There are too few patients with high sodium levels to draw reliable inferences.
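One way to make the small-group caveat visible is to report the size of each group next to its rate. The sketch below bins sodium values by the range quoted above and aggregates both numbers. The data is illustrative, and the column names and the helper `sodium_level` are hypothetical.

```python
import pandas as pd


def sodium_level(meq_per_l: float) -> str:
    """Classify a serum sodium value against the 135 to 145 mEq/L range."""
    if meq_per_l < 135:
        return "low"
    if meq_per_l > 145:
        return "high"
    return "normal"


# Illustrative values only; PatientStatus is 1 for a positive outcome.
df = pd.DataFrame({
    "SerumNa": [130, 137, 140, 142, 138, 148, 133, 141],
    "PatientStatus": [1, 0, 0, 1, 0, 0, 1, 0],
})
df["SodiumLevel"] = df["SerumNa"].map(sodium_level)

# Report the group size next to the rate so that small groups stand out.
summary = df.groupby("SodiumLevel")["PatientStatus"].agg(
    n="count", event_rate="mean"
)
print(summary)
```

A rate computed from a single patient, as in the "high" group here, says very little, which is the point the last bullet makes.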


Copyright statement: The article comes from the official account (python bioinformatics), and no plagiarism is allowed without permission. Following the CC 4.0 BY-SA copyright agreement, please attach the original source link and this statement for reprinting.

Origin blog.csdn.net/toby001111/article/details/132049654