2023 "Huawei Cup" Chinese Graduate Mathematical Modeling Competition (E Question) In-depth Analysis | Mathematical Modeling Complete Code + Full Analysis of the Modeling Process


Question 1

Exploratory modeling of factors related to the risk of hematoma expansion

Idea:

According to the question requirements, it is first necessary to determine whether a hematoma expansion event has occurred in each patient. By definition, hematoma expansion was judged to have occurred if the hematoma volume in subsequent examinations increased by ≥6 mL or ≥33% compared with the initial examination.
Specific judgment steps:
(1) Extract the serial number of each patient’s first imaging examination upon admission from Table 1;

(2) Find the time point corresponding to the first inspection in Appendix 1 based on the serial number;

(3) Calculate the time interval from onset to first examination;

(4) Find the hematoma volume at each follow-up time point in Table 2;

(5) For each follow-up examination in sequence, calculate the absolute change and the percentage change in hematoma volume relative to the first examination;

(6) If the change amount is ≥6 mL or the change percentage is ≥33%, it is recorded as hematoma expansion, and the time point when hematoma expansion occurs is recorded.
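
The judgment rule in steps (5) and (6) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: each follow-up is compared with the first examination (as in the definition above), the volumes are already ordered by examination time, and the function name, the patient IDs, and the volumes are hypothetical. The time-interval calculation from steps (1) to (3) is not included.

```python
def label_hematoma_expansion(volumes_ml, abs_threshold_ml=6.0, rel_threshold=0.33):
    """Return (expanded, index_of_first_expanding_scan) for one patient.

    volumes_ml: hematoma volumes in mL ordered by examination time;
    the first element is treated as the baseline (first) examination.
    """
    baseline = volumes_ml[0]
    for i, volume in enumerate(volumes_ml[1:], start=1):
        increase = volume - baseline
        # The relative increase is undefined for a zero baseline, so only
        # the absolute criterion is applied in that case.
        relative = increase / baseline if baseline > 0 else 0.0
        if increase >= abs_threshold_ml or relative >= rel_threshold:
            return True, i
    return False, None


# Hypothetical patients: volumes in mL at successive examinations
patients = {
    "sub001": [30.0, 32.0, 37.5],  # +7.5 mL at the third scan
    "sub002": [10.0, 13.5],        # +35% at the second scan
    "sub003": [20.0, 21.0, 19.0],  # never reaches either threshold
}
labels = {pid: label_hematoma_expansion(v) for pid, v in patients.items()}
for pid, (expanded, scan_index) in labels.items():
    print(pid, expanded, scan_index)
```

The returned index can be used to look up the examination time at which the expansion was first observed.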

3. Use logistic regression modeling, taking whether hematoma expansion occurs as the target variable, and personal history, disease history, and first imaging characteristics as independent variables to establish a prediction model.

Target variable: Y = whether hematoma expansion occurs (1 yes, 0 no)

Independent variables: X1, X2, …, Xn (personal history, disease history, etc.)

Modeling formula: P(Y=1|X) = 1 / (1+e^-(b0+b1X1+…+bnXn))

4. Use the training set to fit the logistic regression model

(1) Organize the personal history, disease history and first imaging characteristics of the training set into independent variables X

(2) Use the hematoma expansion mark (1 or 0) of the training set as the target variable Y

(3) Feed the independent variable X and target variable Y into the logistic regression model for fitting

(4) Use maximum likelihood estimation to obtain variable coefficients b0, b1, …, bn

(5) Obtain the fitted model:

P(Y=1|X) = 1 / (1+e^-(b0+b1X1+…+bnXn))

5. Use the fitted model to predict the test set

(1) Perform the same feature engineering on the test set data and extract the independent variable X

(2) Substitute the independent variable X of the test set into the model obtained above

(3) Calculate the hematoma expansion probability P(Y=1|X) of each sample

(4) If P(Y=1|X) ≥ 0.5, it is predicted that the sample has hematoma expansion.

(5) Calculate the evaluation indicators of the model on the test set, such as AUC, etc.

(6) Analyze the correlation between variables and hematoma expansion based on the size of the variable coefficients
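
To make steps 4 and 5 concrete, here is a small self-contained sketch that fits the logistic model by maximum likelihood (plain gradient ascent), applies the 0.5 threshold, and computes AUC. It uses synthetic data with three standardized features as a stand-in for the real patient features. The helper names `fit_logistic`, `predict_proba`, and `auc`, the learning rate, and the iteration count are illustrative assumptions; in practice you would normally use `LogisticRegression` from scikit-learn.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximum likelihood fit by gradient ascent; returns [b0, b1, ..., bn]."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    b = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ b))
        b += lr * Xb.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
    return b


def predict_proba(X, b):
    """P(Y=1|X) for each row of X under coefficients b."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ b))


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


# Synthetic stand-in data generated from known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_b = np.array([-0.5, 1.5, -1.0, 0.0])
y = (rng.random(400) < 1 / (1 + np.exp(-(true_b[0] + X @ true_b[1:])))).astype(int)

X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]
b_hat = fit_logistic(X_train, y_train)
p_test = predict_proba(X_test, b_hat)
y_pred = (p_test >= 0.5).astype(int)  # step 5 (4): threshold at 0.5
test_auc = auc(y_test, p_test)
print("coefficients:", np.round(b_hat, 2))
print("test AUC:", round(test_auc, 3))
```

Because the features here are on the same scale, the sign and size of each fitted coefficient can be compared directly, which is the idea behind step (6).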

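import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the data from Table 1 and Table 2
table1 = pd.read_excel('表1.xlsx')
table2 = pd.read_excel('表2.xlsx')

# Merge Table 1 and Table 2 on the patient ID
data = pd.merge(table1, table2, on='ID')

# Select the required features
features = ['age', 'gender', 'history', ...]

# Get each patient's first imaging time and the hematoma volume at that scan
first_scan = data.groupby('ID')['time'].min()
# Map each row's ID to its first scan time so the comparison is row-aligned
first_volume = data[data['time'] == data['ID'].map(first_scan)]['HM_volume']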

The main steps of the code include:

Read and merge tables
Feature engineering
Label target variables
Split training set and test set
Model training and prediction
Output results
Here we use xgboost to train the model:

The main steps are:

Import xgboost
Set the parameters of xgboost:
eta: learning rate
max_depth: maximum depth of the tree
objective: binary logistic regression
eval_metric: set the evaluation metric to AUC
Convert the training data to DMatrix format
Use xgboost to train the model
Convert the test data to DMatrix format
Use the trained model to predict
Output the results
XGBoost is a popular and efficient tree model library that can extract complex feature relationships of data.

Compared with logistic regression, XGBoost can handle various types of features and is also convenient for adjusting parameters and optimizing the model.

Question 2

Model the occurrence and progression of perihematoma edema, and explore the relationship between therapeutic intervention and edema progression.

To construct a model of edema volume changing over time, you can use curve fitting, with time as the independent variable and edema volume as the target variable:

$$V_{ED} = f(t)$$

where $V_{ED}$ is the edema volume and $t$ is time.

You can try different curve fitting methods, such as linear regression, polynomial regression, local weighted regression, etc.

Calculate the residual between each patient's true value and the fitted curve. For the i-th sample:

$$r_i = V_{ED_i} - f(t_i)$$

where $V_{ED_i}$ is the true edema volume of the i-th sample and $f(t_i)$ is the fitted value at the corresponding time point.
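
As a quick illustration of the fit and the residuals, the sketch below fits a polynomial curve with NumPy and subtracts the fitted values from the observations. The time points and volumes are made-up numbers, and the quadratic degree is an assumption; any of the other curve families mentioned above could be substituted.

```python
import numpy as np

# Hypothetical pooled observations: hours since onset and edema volume (mL)
t = np.array([2.0, 6.0, 12.0, 24.0, 48.0, 72.0, 120.0, 168.0, 240.0, 336.0])
v_ed = np.array([5.0, 9.0, 14.0, 20.0, 27.0, 31.0, 34.0, 33.0, 29.0, 22.0])

# Fit V_ED = f(t) with a degree-2 polynomial (assumed form)
coeffs = np.polyfit(t, v_ed, deg=2)
f = np.poly1d(coeffs)

# Residual for each sample: true volume minus fitted value
residuals = v_ed - f(t)
for ti, ri in zip(t, residuals):
    print(f"t = {ti:6.1f} h, residual = {ri:+.2f} mL")
```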

Divide patients into subgroups and fit the edema volume curve of each subgroup. You can use a clustering algorithm such as K-means to group patients, and then fit a separate curve for each group.
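
One possible way to do this is sketched below. It assumes every patient's edema volume is available on a shared time grid so that the trajectories themselves can be clustered. The real data has irregular examination times, so you would need interpolation or summary features first. The two synthetic patient types, the grid, and the linear per-group curve are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.array([0.0, 24.0, 72.0, 168.0, 336.0])  # hours, shared grid (assumed)

# Two hypothetical patient types: slow vs fast edema growth, 20 patients each
slow = 5 + 0.02 * t + rng.normal(0, 0.5, size=(20, t.size))
fast = 5 + 0.10 * t + rng.normal(0, 0.5, size=(20, t.size))
curves = np.vstack([slow, fast])  # one row per patient

# Group patients by the shape of their trajectory
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(curves)

# Fit a separate (here linear) curve to the pooled points of each subgroup
subgroup_fits = {}
for k in np.unique(labels):
    members = curves[labels == k]
    t_all = np.tile(t, len(members))  # repeat the grid once per patient
    subgroup_fits[int(k)] = np.polyfit(t_all, members.ravel(), deg=1)
    slope, intercept = subgroup_fits[int(k)]
    print(f"subgroup {k}: n = {len(members)}, slope = {slope:.3f}, intercept = {intercept:.2f}")
```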

To analyze the impact of different treatments on the evolution of edema, treatment methods can be used as categorical features, different curve models can be constructed, and the model performance can then be compared.

The differences in edema volume changes in different treatment groups can also be compared through statistical methods (such as t test).
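
For example, a two-sample t test on the change in edema volume might look like the sketch below. The two groups are simulated, and Welch's variant (`equal_var=False`) is an assumed choice that avoids requiring equal variances in the two groups.

```python
import numpy as np
from scipy import stats

# Simulated change in edema volume (mL) for two hypothetical treatment groups
rng = np.random.default_rng(1)
change_treated = rng.normal(loc=2.0, scale=3.0, size=40)
change_untreated = rng.normal(loc=6.0, scale=3.0, size=40)

# Welch's t test: does the mean change differ between the groups?
t_stat, p_value = stats.ttest_ind(change_treated, change_untreated, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

With more than two treatment groups, a one-way ANOVA or pairwise tests with a multiple-comparison correction would be the usual extension.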

Analyze the relationship between the three.
Statistical methods such as correlation analysis can be used to explore the relationship between hematoma volume, edema volume and treatment.

You can also build a prediction model that includes the three as features, and discover the correlation between the three by analyzing the coefficients.
Specifically, the correlation analysis method:
(1) For each sample, obtain the hematoma volume, the edema volume, and a 0/1 indicator for each treatment modality

(2) Use Pearson correlation coefficient to calculate the linear correlation between hematoma volume and edema volume

(3) Use Spearman’s rank correlation coefficient to calculate the rank correlation between hematoma volume and each treatment method.

(4) Use Spearman’s rank correlation coefficient to calculate the rank correlation between edema volume and each treatment method.

(5) Compare the sizes of different coefficients and analyze the degree of correlation between the three
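
The correlation steps can be sketched with `scipy.stats`. The data below is simulated with a single hypothetical 0/1 treatment indicator; on the real data you would repeat the Spearman step for each treatment column.

```python
import numpy as np
from scipy import stats

# Simulated sample: volumes in mL and one hypothetical 0/1 treatment indicator
rng = np.random.default_rng(2)
n = 100
hematoma = rng.gamma(shape=4.0, scale=8.0, size=n)
treatment = rng.integers(0, 2, size=n)
edema = 0.8 * hematoma - 5.0 * treatment + rng.normal(0, 5, size=n)

# (2) Pearson: linear correlation between hematoma and edema volume
r_pearson, p_pearson = stats.pearsonr(hematoma, edema)

# (3), (4) Spearman: rank correlation of each volume with the treatment indicator
rho_h, p_h = stats.spearmanr(hematoma, treatment)
rho_e, p_e = stats.spearmanr(edema, treatment)

print(f"Pearson  hematoma vs edema:     r = {r_pearson:.2f} (p = {p_pearson:.3g})")
print(f"Spearman hematoma vs treatment: rho = {rho_h:.2f} (p = {p_h:.3g})")
print(f"Spearman edema vs treatment:    rho = {rho_e:.2f} (p = {p_e:.3g})")
```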

Modeling method:
(1) Use hematoma volume and edema volume as continuous features, and treatment methods as categorical features

(2) Construct a regression model with edema volume as the target variable and hematoma volume and treatment as independent variables.

(3) Train the model and obtain the coefficients of each variable

(4) Compare the coefficients of each treatment category to see its impact on edema volume

(5) Select key influencing factors through significance testing of variables

(6) Analyze the overall performance of the model and evaluate the explanatory ability of each variable
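
A minimal sketch of this modeling route is given below, using ordinary least squares with dummy-coded treatment categories and a t test on each coefficient. The data is simulated, the three treatment categories are hypothetical, and category 0 is arbitrarily taken as the reference level.

```python
import numpy as np
from scipy import stats

# Simulated data: edema depends on hematoma volume and on treatment category 1
rng = np.random.default_rng(3)
n = 200
hematoma = rng.gamma(4.0, 8.0, size=n)
treatment = rng.integers(0, 3, size=n)  # 3 hypothetical categories
effect = np.array([0.0, -4.0, 0.0])
edema = 10 + 0.6 * hematoma + effect[treatment] + rng.normal(0, 3, size=n)

# Design matrix: intercept, hematoma volume, dummies for categories 1 and 2
X = np.column_stack([np.ones(n), hematoma, treatment == 1, treatment == 2]).astype(float)
names = ["intercept", "hematoma", "treatment_1", "treatment_2"]

# (3) Fit by least squares and obtain the coefficients
beta, *_ = np.linalg.lstsq(X, edema, rcond=None)

# (5) Significance test of each coefficient (two-sided t test)
resid = edema - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_vals = beta / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)

# (6) Overall fit
r2 = 1 - resid @ resid / ((edema - edema.mean()) ** 2).sum()

for name, b, p in zip(names, beta, p_vals):
    print(f"{name:12s} coef = {b:7.3f}  p = {p:.3g}")
print(f"R^2 = {r2:.3f}")
```

Each treatment coefficient is read as the difference in mean edema volume from the reference category at the same hematoma volume.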

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Read the data
data = pd.read_excel('table2.xlsx')

# Feature engineering: extract time and edema volume
X = data[['time']]
y = data[['ED_volume']]

# Build the linear regression model
lr = LinearRegression()

# Train the model
lr.fit(X, y)

# Get the fitted coefficients
print('Model slope:', lr.coef_)
print('Model intercept:', lr.intercept_)

# Predict the edema volume
y_pred = lr.predict(X)

Question 3

Prognosis prediction and exploration of key factors in patients with hemorrhagic stroke
1. Predict prognosis based on the first imaging results
Use a regression model, with the 90-day mRS score as the target variable and personal history, disease history, and first imaging characteristics as independent variables:

$$mRS = w_0 + w_1 x_1 + \dots + w_n x_n$$

where $mRS$ is the prognostic score, $x_i$ is the $i$-th feature, and $w_i$ is the corresponding weight coefficient.

You can try linear regression, LASSO regression and other algorithms.

2. Predict prognosis based on all imaging results
As above, but use not only the first imaging: also combine the imaging features at subsequent time points to build the regression model for prediction.

3. Analyze the key influencing factors
By analyzing the weight $w_i$ of each variable, determine the features with the greatest impact on $mRS$.
Use statistical tests to analyze which features have a significant effect on $mRS$.
Use feature selection methods (such as RFE) to select key features.
After deleting irrelevant features, observe the changes in model scores.
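
As a small illustration of the RFE idea, the sketch below runs scikit-learn's `RFE` on simulated data in which only two of six features drive the target. The continuous target is a stand-in for the mRS score, and the choice of `LinearRegression` and of keeping two features are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Simulated data: only features 0 and 3 influence the target
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.5, size=200)

# Recursively drop the feature with the smallest absolute coefficient
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
selected = np.flatnonzero(selector.support_)
print("selected feature indices:", selected)
print("elimination ranking:", selector.ranking_)
```

Because RFE compares coefficient magnitudes, the features should be standardized first when they are on different scales.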
Specifically:
1) Selection of modeling algorithms
Try linear regression, LASSO regression, GBDT, and other algorithms
Compare the errors and overfitting behavior of the different algorithms and select the better one
Tune the parameters to optimize the model and improve accuracy
2) Feature engineering
Handling missing values: delete or fill
Encoding categorical features: one-hot encoding
Standardizing continuous features: mean removal and variance normalization
Extracting time series features: trend, periodicity, etc.
Dimensionality reduction using PCA and other methods
3) Model evaluation
Divide the data into a training set, a validation set, and a test set
Run repeated cross-validation and calculate RMSE, R2, MAE, and other evaluation metrics
Draw the learning curve to check for overfitting
4) Key factor analysis
Calculate the influence of each feature, then sort and filter
Add or delete features and compare the changes in model performance
Use statistical tests (t-test, etc.) to determine the significance of features
Use regularization methods to screen features automatically
Analyze the effects of features on different subgroups
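
The feature engineering and evaluation items above can be combined into a single scikit-learn pipeline, sketched below on simulated data. The column names (`age`, `hm_volume`, `gender`), the median imputation, the Lasso `alpha`, and the 5-fold split are all assumptions for illustration, and the simulated target is only a stand-in for the mRS score.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated patient table with hypothetical column names
rng = np.random.default_rng(5)
n = 150
df = pd.DataFrame({
    "age": rng.integers(40, 90, size=n).astype(float),
    "hm_volume": rng.gamma(4.0, 8.0, size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
y = 0.04 * df["age"] + 0.03 * df["hm_volume"] + rng.normal(0, 0.5, size=n)
df.loc[rng.choice(n, size=10, replace=False), "age"] = np.nan  # simulate missing values

# Fill and standardize continuous features, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "hm_volume"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])
model = Pipeline([("prep", preprocess), ("lasso", Lasso(alpha=0.01))])

# 5-fold cross-validation with RMSE, MAE, and R2
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    model, df, y, cv=cv,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
)
rmse = -scores["test_neg_root_mean_squared_error"].mean()
mae = -scores["test_neg_mean_absolute_error"].mean()
r2 = scores["test_r2"].mean()
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, R2 = {r2:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned on each training fold only, so the validation folds do not leak into them.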

4. Make suggestions
For the features that have a significant impact, analyze their clinical significance and give intervention suggestions.
Compare the patient groups with good and poor prognosis to find the differences in influencing factors.
Code:

# Import the required libraries
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('data.csv')

# Feature engineering (columns are assumed to be numerically encoded already)
X = data[['age', 'gender', 'treatment', 'image_features']]
y = data['mRS']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2020)

# Lasso regression
model = Lasso()

# Use grid search to find the best parameter
params = {'alpha': [0.001, 0.01, 0.1, 1]}
gs = GridSearchCV(model, params, scoring='neg_mean_squared_error', cv=5)
gs.fit(X_train, y_train)
print('Best parameters:', gs.best_params_)
model = gs.best_estimator_  # continued in the full version

Check out my answer for the full version of the idea~


Origin blog.csdn.net/qq_25834913/article/details/133157018