Can machine learning diagnose conditions and predict how patients will fare after they leave the hospital?

Abstract: Machine learning is gradually changing every industry, and medicine is no exception. Beyond helping to diagnose a patient's condition, machine learning can also predict how the patient will fare after discharge. Readers interested in this line of research are welcome to read on.

 

       As data volumes and computing power continue to grow, machine learning is gradually permeating every industry. Computer vision, natural language processing, robotics and other fields are now largely dominated by machine learning algorithms, and the technology is spreading to traditional industries such as education, banking, and medical care. For how machine learning is changing the traditional education model, see the blogger's article "Using AR, AI and Big Data to Reform the Education System - Creating Your Own Personalized Learning Route for Each Student". In banking, artificial intelligence is currently heavily hyped, but most banks are taking a wait-and-see attitude and will not replace most of their staff with it in the short term. AI is also a hot topic in the medical industry, with applications such as cancer detection, new drug discovery engines, and genetic testing. Sepsis is a common complication in hospital care, and this article describes how machine learning was used to predict the post-discharge outcome of sepsis patients.

       Sepsis is a systemic inflammatory response syndrome caused by infection, which in severe cases can lead to organ dysfunction or circulatory failure. It is a common complication of severe trauma, burns, shock, infection, and major surgery. Its symptoms, such as fever and low blood pressure, closely resemble those of other common conditions, so it is difficult to detect early. If not treated in time, it can progress to septic shock, for which the in-hospital mortality rate exceeds 40%.

       Knowing which patients with sepsis are at the highest risk of death helps clinicians prioritize care. The team, in collaboration with researchers at the Geisinger Health Care System, used historical electronic health record (EHR) data to build a model that predicts all-cause mortality in sepsis patients during hospitalization or within 90 days after discharge. The model can guide medical teams to monitor closely, and take effective preventive measures for, the patients predicted to be at high risk of death.

Data science environment

       The work was done in IBM Data Science Experience, which provides a programming environment for data scientists (three popular programming languages: Python, Scala and R; two notebook tools: Jupyter and Zeppelin). In addition, IBM Data Science Experience can put a model into operation through business applications for real-time or batch scoring, with integrated feedback loops for continuous model monitoring and retraining.

Collect and preprocess data

       Geisinger provided data on more than 10,000 patients diagnosed with sepsis between 2006 and 2016, including demographics, inpatient and outpatient records, surgical procedures, medical history, medications, transfers between hospital units, and laboratory results.

       For each patient, the most recent and most relevant hospitalization was selected, along with information specific to that stay, such as the type of surgery and the location of bacterial cultures. Summary information from before admission was also derived, such as the number of surgical procedures in the 30 days before admission. Post-discharge data was not used. Figure 1 illustrates these time-based data selection decisions:

 

Figure 1 Prediction based on time series data

 

       After merging the provided datasets, the resulting dataset consists of 10599 rows with 199 attributes (features) per patient.
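The selection rules described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout, the field names (`admitted`, `discharged`, `performed`) and the dates are hypothetical, not Geisinger's schema, and the 30-day window is read as the 30 days before admission.

```python
from datetime import date, timedelta

# Hypothetical records; field names are illustrative, not Geisinger's schema.
encounters = [
    {"patient_id": 1, "admitted": date(2015, 3, 1), "discharged": date(2015, 3, 9)},
    {"patient_id": 1, "admitted": date(2016, 6, 10), "discharged": date(2016, 6, 20)},
    {"patient_id": 2, "admitted": date(2014, 1, 5), "discharged": date(2014, 1, 12)},
]
procedures = [
    {"patient_id": 1, "performed": date(2016, 5, 20)},  # 21 days before admission
    {"patient_id": 1, "performed": date(2016, 6, 12)},  # during the stay
    {"patient_id": 1, "performed": date(2016, 7, 1)},   # after discharge: never used
    {"patient_id": 2, "performed": date(2013, 11, 1)},  # more than 30 days before
]


def latest_encounter(encounters):
    """Return {patient_id: most recent encounter by admission date}."""
    latest = {}
    for enc in encounters:
        current = latest.get(enc["patient_id"])
        if current is None or enc["admitted"] > current["admitted"]:
            latest[enc["patient_id"]] = enc
    return latest


def build_rows(encounters, procedures, lookback_days=30):
    """Build one feature row per patient, using only data up to discharge."""
    rows = []
    for pid, enc in sorted(latest_encounter(encounters).items()):
        window_start = enc["admitted"] - timedelta(days=lookback_days)
        own = [p["performed"] for p in procedures if p["patient_id"] == pid]
        rows.append({
            "patient_id": pid,
            "length_of_stay_days": (enc["discharged"] - enc["admitted"]).days,
            "procedures_prior_30d": sum(window_start <= d < enc["admitted"] for d in own),
            "procedures_in_stay": sum(enc["admitted"] <= d <= enc["discharged"] for d in own),
        })
    return rows


rows = build_rows(encounters, procedures)
for row in rows:
    print(row)
```

The procedure dated after discharge is counted nowhere, which mirrors the rule that post-discharge data must not leak into the features.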

 

Predictive model

 

       After data cleaning and feature selection were completed, the task objective was defined as a binary classification problem: predicting whether a sepsis patient died within 90 days of discharge.

 

       The algorithm chosen was gradient boosted trees (GBT), implemented with the XGBoost package. Thanks to its execution speed and robustness, XGBoost has long been popular in machine learning competitions. Another motivation for using XGBoost is the ability to fine-tune its hyperparameters to improve model performance. On the training data, ten-fold cross-validation and grid search (GridSearchCV) were used to select parameters iteratively so as to maximize the area under the ROC curve (AUC). An example from IBM Data Science Experience can be found here.
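A minimal sketch of this tuning step is shown below. Several things in it are assumptions rather than the team's setup: the data is synthetic, the parameter grid is invented (the article does not list the values searched), and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so that the example depends on scikit-learn alone. The 60/40 split matches the one the article uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the patient table (the real data is not public).
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           random_state=0)

# 60% training, 40% test, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Small illustrative grid; these values are assumptions, not the article's.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # pick the parameter combination with the best AUC
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)  # cross-validation sees the training data only

best_params = search.best_params_
cv_auc = search.best_score_  # mean AUC over the ten validation folds
print(best_params, round(cv_auc, 3))
```

Because `xgboost.XGBClassifier` follows the scikit-learn estimator interface, it could presumably be passed as `estimator` in place of the stand-in, with the grid adapted to its parameter names.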

 

       The data set was divided into a training set (60%) and a test set (40%). The model was trained on the training set, and the trained model was then applied to the test set. The model performance is shown in Figure 2:

 

Figure 2 Performance of the XGBoost model

 

       Figure 2 reports several performance metrics, including the AUC score. The closer this number is to 1, the better the model separates patients who died from those who survived, that is, the more often it assigns a higher risk to a patient who died than to one who survived. The AUC on the test set was 0.8561, indicating that the model distinguishes well between sepsis patients who died within 90 days and those who did not; patients predicted to die can then receive appropriately targeted care.
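To make the meaning of AUC concrete, here is a small sketch that computes it directly from its ranking interpretation: the share of (positive, negative) pairs in which the positive case receives the higher score. The labels and scores are made up for illustration, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = died within 90 days) and predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
auc = roc_auc(y_true, scores)
print(round(auc, 4))  # 11 of the 12 pairs are ranked correctly -> 0.9167
```

The pairwise loop is quadratic and only meant to show the definition; for real data a library routine such as scikit-learn's `roc_auc_score` is the practical choice.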

 

       For precision and recall, the closer the number is to 1, the better the model. Both values in Figure 2 are close to 0.80, with the model tuned in favor of high recall: the goal is to minimize the number of patients the model misses who may eventually die from sepsis.
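The trade-off behind "favoring recall" can be illustrated with a short sketch. The labels, scores and thresholds below are made up, and `precision_recall` is a hypothetical helper; the point is only that lowering the decision threshold catches more of the patients who die, at the cost of more false alarms.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels (1 = died within 90 days) and predicted risks.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.5, 0.45, 0.2, 0.1]

strict = precision_recall(y_true, scores, threshold=0.55)
lenient = precision_recall(y_true, scores, threshold=0.35)
print("strict :", strict)   # no false alarms, but one patient is missed
print("lenient:", lenient)  # every patient is caught, with two false alarms
```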

 

       For accuracy, another evaluation metric, the bootstrap was used to generate 1,000 variants of the training and test data. The XGBoost model was run on each variant and its accuracy recorded. Across the 1,000 runs, 95% of the accuracy values fell between 0.77 and 0.79, which means the model predicts the correct outcome for more than three-quarters of patients.
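The idea of a bootstrap interval can be sketched as follows. This is a simplification and not the team's procedure: the article resampled both training and test data and reran the model each time, whereas this sketch keeps one fixed set of predictions and resamples only the (label, prediction) pairs. The data and the function name are made up.

```python
import random


def bootstrap_accuracy_interval(y_true, y_pred, n_rounds=1000, level=0.95, seed=0):
    """Percentile interval for accuracy from resampled (label, prediction) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    accuracies = []
    for _ in range(n_rounds):
        # Draw len(pairs) pairs with replacement and score that resample.
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        accuracies.append(sum(t == p for t, p in sample) / len(sample))
    accuracies.sort()
    tail = (1.0 - level) / 2.0
    lower = accuracies[round(tail * n_rounds)]
    upper = accuracies[round((1.0 - tail) * n_rounds) - 1]
    return lower, upper


# Made-up predictions: 78 of 100 are correct.
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 30 + [0] * 10 + [0] * 48 + [1] * 12
low, high = bootstrap_accuracy_interval(y_true, y_pred)
print(low, high)
```

With only 100 made-up patients the interval is much wider than the 0.77 to 0.79 range in the article; a test set of several thousand patients narrows it considerably.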

 

       In addition to the metrics above, the model's confusion matrix is shown in Figure 3. On the test data, the model correctly identified 1190 patients as true positives (sepsis patients predicted to die who did die) and 2087 patients as true negatives (sepsis patients predicted to survive who did survive).

 

Figure 3 Negative-positive prediction
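The two counts quoted from Figure 3 can be cross-checked against the accuracy reported earlier. The sketch below assumes that the test set is exactly 40% of the 10599 merged rows, rounded to a whole patient; the article does not state the test-set size, so that figure is an inference.

```python
total_rows = 10599      # rows in the merged data set (from the article)
test_fraction = 0.40    # share held out for testing (from the article)
true_positives = 1190   # from Figure 3
true_negatives = 2087   # from Figure 3

# Assumption: the test set is exactly 40% of the rows, rounded to a whole patient.
test_size = round(total_rows * test_fraction)

correct = true_positives + true_negatives
misclassified = test_size - correct  # false positives and false negatives combined
accuracy = correct / test_size

print(test_size, correct, misclassified, round(accuracy, 3))
```

Under that assumption the accuracy comes to about 0.773, which falls inside the 0.77 to 0.79 range reported from the bootstrap runs.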

 

       XGBoost can also rank features by importance. The ranking does not tell whether a given feature predicts death or survival, but it is still very useful for knowing which features the model relies on when predicting death. As shown in Figure 4, the "admission age" feature has an importance of 29.5%.

 

Figure 4 The 20 most important features of the model
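A ranking like the one in Figure 4 can be produced by normalizing raw importance scores so that they sum to 1 and keeping the largest ones. The sketch below assumes the scores arrive as a `{feature: score}` dict, which is the shape returned by `Booster.get_score()` in XGBoost; the feature names and values are made up and do not come from the article.

```python
def top_features(raw_scores, k=20):
    """Normalize raw importance scores to shares of the total and keep the top k."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive number")
    shares = {name: score / total for name, score in raw_scores.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical feature names and scores, for illustration only.
raw_scores = {
    "age_at_admission": 50.0,
    "vasopressor_hours": 30.0,
    "unit_transfers": 15.0,
    "prior_procedures_30d": 5.0,
}
ranking = top_features(raw_scores, k=3)
for name, share in ranking:
    print(f"{name}: {share:.1%}")
```

A share computed this way says how much the model leans on a feature, not whether higher values push the prediction toward death or survival.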

 

       Further exploratory analysis of the features was performed to see how they correspond to death outcomes. While charts are helpful for visualizing the relationship between features and outcomes, it is important to keep in mind that XGBoost trains many decision trees that combine features. A feature that is important in the XGBoost model may therefore not show a significant relationship with the outcome variable when explored on its own.

 

       As shown in Figure 5, the "age at admission" feature suggests that older patients have a higher death rate than younger patients. Similarly, the "time of vasopressor use" feature suggests that patients who receive vasopressors for longer have higher mortality, although these deaths may also be due to the patients' poorer underlying health.

 

Figure 5 Some important characteristics associated with patient death
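An exploratory check of this kind can be as simple as grouping patients into age bands and comparing death rates. The sketch below uses made-up (age, outcome) pairs and a hypothetical helper name; it shows the shape of the analysis, not the article's results.

```python
from collections import defaultdict


def death_rate_by_band(patients, band_width=20):
    """Return {band_start: share of patients in that age band who died}."""
    counts = defaultdict(lambda: [0, 0])  # band_start -> [deaths, patients]
    for age, died in patients:
        band_start = (age // band_width) * band_width
        counts[band_start][0] += int(died)
        counts[band_start][1] += 1
    return {band: deaths / n for band, (deaths, n) in sorted(counts.items())}


# Made-up (age at admission, died within 90 days) pairs.
patients = [(34, False), (38, False), (45, False), (52, True), (58, False),
            (63, False), (67, True), (71, True), (76, False), (84, True),
            (88, True), (91, True)]
rates = death_rate_by_band(patients)
for band, rate in rates.items():
    print(f"{band}-{band + 19}: {rate:.0%}")
```

A one-feature summary like this ignores interactions between features, so it can disagree with what the tree ensemble actually uses.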

 

       The decision tree rules output by XGBoost can help doctors better understand how to plan treatment for patients. For example, because elderly patients face a higher risk of death, medical teams can pay special attention to them, monitor how long they have been on vasopressors, and minimize the number of transfers between departments to reduce the strain on vulnerable patients.

 

Conclusion

 

       Predicting all-cause mortality in patients with sepsis can guide health providers to monitor patients proactively and take preventive measures to improve survival. The model surfaced the features most strongly associated with death in sepsis patients, showing that machine learning can help identify the variables linked to sepsis mortality. In the future, as more data becomes available, additional key features will be added to improve the model. The method can also be applied to predicting other diseases, with the hope of producing more actionable models that improve the standard of care.

 
