【2023 Huashu Cup National College Students Mathematical Contest in Modeling】Python code realization of the influence of mother's physical and mental health on infant growth

[2023 Huashu Cup National College Students Mathematical Contest in Modeling] Question C: Influence of mother's physical and mental health on infant growth


1 Problem

A mother is one of the most important people in a baby's life, providing not only nutrition and physical protection but also emotional support and a sense of security. Adverse conditions in the mother's mental health, such as depression, anxiety, and stress, may have negative effects on the baby's cognition, emotion, and social behavior. A stressed mother can negatively affect a baby's physical and psychological development, such as affecting sleep.

The appendix presents data on 390 infants aged 3 to 12 months and their mothers. The data cover three groups of indicators: the mother's physical indicators (age, marital status, education level, gestational duration, and mode of delivery); the mother's psychological indicators, namely CBTS (Baby-related Post Traumatic Stress Disorder Questionnaire), EPDS (Edinburgh Postpartum Depression Scale), and HADS (Hospital Anxiety and Depression Scale); and indicators of infant sleep quality (overnight sleep duration, number of awakenings, and way of falling asleep).

Please refer to the relevant literature, understand the professional background, build a mathematical model based on the topic data, and answer the following questions.

  1. Many studies have shown that the mother's physical and psychological indicators affect the baby's behavioral characteristics and sleep quality. Does such a relationship exist? Investigate it using the data in the attachment.

  2. The Infant Behavior Questionnaire is a scale used to assess infant behavioral characteristics that includes questions about the infant's emotions and responses. We divide the behavioral characteristics of infants into three types: quiet, moderate, and ambivalent. Please establish a relationship model between the baby's behavioral characteristics and the mother's physical and psychological indicators. In the last 20 groups of infants (No. 391-410) in the data table, the behavior characteristics information has been deleted. Please judge what type they belong to.

  3. Intervention on maternal anxiety can help improve the mother's mental health, improve the quality of mother-infant interaction, and promote the infant's cognitive, emotional, and social development. For each of CBTS, EPDS, and HADS, the rate of change of the treatment cost with respect to the severity of the condition is proportional to the treatment cost. A survey gives the treatment costs at two scores for each scale, as shown in Table 1. Infant No. 238 has the ambivalent behavior type. Please build a model and analyze what treatment cost is required to change this infant's behavior type from ambivalent to moderate.

How would the treatment plan need to be adjusted in order to change the infant's behavior type to quiet?

Table 1. Disease score and treatment cost

CBTS                            EPDS                            HADS
Score   Treatment cost (yuan)   Score   Treatment cost (yuan)   Score   Treatment cost (yuan)
0       200                     0       500                     0       300
3       2812                    2       1890                    5       12500

  4. The baby's sleep quality indicators include the overnight sleep duration, the number of awakenings, and the way of falling asleep. Please make a comprehensive evaluation of the baby's sleep quality in four categories (excellent, good, medium, and poor), establish a model relating the baby's comprehensive sleep quality to the mother's physical and psychological indicators, and predict the comprehensive sleep quality of the last 20 groups of babies (No. 391-410).

  5. On the basis of Question 3, if the sleep quality of baby No. 238 must be rated as excellent, does the treatment strategy from Question 3 need to be adjusted? If so, how?

2 Problem Analysis

2.1 Question 1

This is a regression modeling problem: quantify the effect of the mother's physical and psychological indicators on the infant's behavioral characteristics and sleep quality. The infant's behavioral characteristics and overnight sleep duration are the dependent variables, and the mother's indicators (age, marital status, education level, gestational duration, mode of delivery, CBTS, EPDS, and HADS) are the independent variables. A multiple linear regression model is the usual choice. Its goal is to find a linear relationship that connects the independent variables to the dependent variable, and it can be written as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
where Y represents the dependent variable (i.e. infant behavioral characteristics and infant sleep quality), X1 to Xn represent independent variables (i.e. mother’s physical and psychological indicators), β0 to βn represent regression coefficients, and ε represents the error term.

Common measures for evaluating the fit of a regression model include adjusted R-squared, AIC, and BIC. Regression analysis also requires stating the model assumptions and interpreting the results, taking into account each coefficient estimate and the result of its significance test. Three aspects matter in particular:

  • Coefficient estimates: each coefficient gives the direction and size of an independent variable's effect on the dependent variable. For example, β1 is the expected change in y for a one-unit increase in x1 with the other variables held fixed, β2 plays the same role for x2, and so on.
  • T-value and p-value: used to test the significance of each coefficient estimate. The t value is the ratio of the coefficient estimate to its standard error, and the p value is the probability of obtaining a t value at least this extreme if the true coefficient were zero. Generally, a p value below 0.05 or 0.01 means the coefficient is statistically significant at that level, that is, the independent variable has a significant effect on the dependent variable.
  • Goodness of fit of the regression equation: measured by R-squared and adjusted R-squared. R-squared is the proportion of the variance of the dependent variable that the independent variables explain. Adjusted R-squared also penalizes the number of independent variables, which makes it a fairer basis for comparing models of different sizes.
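
The two formulas behind these bullets can be checked by hand. The sketch below is only an illustration: it plugs in the R-squared values reported in Section 3.1 (0.014 and 0.063) and assumes 390 observations and 8 independent variables, which may differ slightly from the fitted models if outliers were removed. The helper names `adjusted_r2` and `t_value` are made up for this example.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def t_value(coef: float, std_err: float) -> float:
    """t statistic of a coefficient: the estimate divided by its standard error."""
    return coef / std_err


# Assumption: 390 rows and 8 independent variables (no rows dropped).
N_OBS, N_PREDICTORS = 390, 8

adj_behavior = adjusted_r2(0.014, N_OBS, N_PREDICTORS)  # behavior-type model
adj_sleep = adjusted_r2(0.063, N_OBS, N_PREDICTORS)     # sleep-time model

print(f"adjusted R2, behavior model: {adj_behavior:.4f}")
print(f"adjusted R2, sleep model:    {adj_sleep:.4f}")
```

Under these assumptions the first value comes out slightly below zero and the second close to 0.043, which is consistent with the summaries discussed in Section 3.1.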

In addition, multiple regression analysis must check whether there is multicollinearity among the features. If there is, principal component analysis (PCA) can be applied before the regression to reduce the dimension of the independent variables and extract the main features, and a principal component regression model is then built on the components.
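
One common way to check for multicollinearity before deciding on PCA is the variance inflation factor (VIF). The following is a minimal sketch using NumPy only and synthetic data; the function name `vif` and the demo columns are hypothetical and are not taken from the contest data set.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column of X (X has no constant column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic demo: column c is almost a copy of column a, column b is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a variable is largely explained by the others; in that case dimension reduction such as PCA becomes more attractive.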

2.2 Question 2

This is a classification problem and needs to predict the class of 20 samples. The steps to build the classification model are as follows:

  1. Data preprocessing: Perform preprocessing steps such as missing value processing, outlier processing, and standardization on the collected data.

  2. Feature engineering: perform operations such as feature transformation, feature combination, or selection to extract more discriminative features.

  3. Model selection: According to the characteristics of the data and the requirements of the model, select a suitable multi-category classification algorithm, such as logistic regression, support vector machine, decision tree, random forest, neural network, etc.

  4. Model evaluation: Use cross-validation, confusion matrix, accuracy, recall, F1 score and other indicators to evaluate and optimize the model.

  5. Model application: use the trained model to predict and classify new samples, and then the baby's behavioral feature classification results can be obtained.
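
The five steps above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the model used in this article: the data come from `make_classification`, and the choice of logistic regression with 5-fold cross-validation is an assumption made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table: 390 samples, 8 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=390, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Steps 1-3: standardization and the classifier combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 4: 5-fold cross-validated accuracy
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(3))

# Step 5: fit on all labeled data, then predict new samples
model.fit(X_demo, y_demo)
pred = model.predict(X_demo[-20:])  # stand-in for the 20 unlabeled rows
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into preprocessing.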

2.3 Question 3

This is a cost-minimization problem. First, a linear regression model is used to relate the CBTS, EPDS, and HADS scores to the treatment costs: the scores are the independent variables, the treatment cost is the dependent variable, and the relationship between score and cost is estimated by fitting the model.

. . . (omitted)

Then use the optimization algorithm to solve the problem of minimizing the total treatment cost, and get the corresponding CBTS, EPDS and HADS scores and the minimum treatment cost.
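
The problem statement says that the rate of change of the treatment cost with respect to the score is proportional to the cost itself. Read as the differential equation dC/ds = k·C, this gives an exponential cost curve C(s) = C(0)·exp(k·s), which is a different functional form from the linear fit described above. The sketch below is one possible reading, not necessarily the approach used in the complete solution: it fits that curve through the two points given for each scale in Table 1. The names `TABLE_1`, `fit_exponential`, and `cost` are introduced only for this example.

```python
import math

# Table 1: scale -> ((score, cost in yuan), (score, cost in yuan))
TABLE_1 = {
    "CBTS": ((0, 200.0), (3, 2812.0)),
    "EPDS": ((0, 500.0), (2, 1890.0)),
    "HADS": ((0, 300.0), (5, 12500.0)),
}


def fit_exponential(p0, p1):
    """Return (c0, k) so that cost(s) = c0 * exp(k * s) passes through both points."""
    (s0, cost0), (s1, cost1) = p0, p1
    k = math.log(cost1 / cost0) / (s1 - s0)
    c0 = cost0 / math.exp(k * s0)
    return c0, k


def cost(scale: str, score: float) -> float:
    """Treatment cost at a given score under the exponential assumption."""
    c0, k = fit_exponential(*TABLE_1[scale])
    return c0 * math.exp(k * score)


for scale, points in TABLE_1.items():
    c0, k = fit_exponential(*points)
    print(f"{scale}: cost(s) = {c0:.0f} * exp({k:.4f} * s)")
```

How these per-scale curves are combined into a total cost for lowering a mother's scores is a separate modeling decision and is left open here.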

2.4 Question 4

This is a clustering problem: cluster analysis is used to divide infant sleep quality into four categories (excellent, good, fair, and poor). Based on the overnight sleep duration, the number of awakenings, and the way of falling asleep, the infants can be grouped with a clustering algorithm that takes a specified number of clusters K, such as K-Means or BIRCH. With K set to 4, all samples in the data set are divided into four groups, and each group represents one category of sleep quality.
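
A minimal K-Means sketch of this idea is shown below, on synthetic data. Two things in it are assumptions rather than part of the original solution: the way of falling asleep is taken to be an integer code, and clusters are ranked from excellent to poor by their mean sleep duration, since the cluster ids that K-Means returns carry no order of their own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in columns: sleep minutes, awakenings, way of falling asleep (coded)
sleep_demo = np.column_stack([
    rng.normal(600, 60, size=390),
    rng.integers(0, 6, size=390),
    rng.integers(1, 6, size=390),
])

# Standardize so that minutes do not dominate the distance computation
scaled = StandardScaler().fit_transform(sleep_demo)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# Assumption: a cluster with a longer mean sleep duration is a better category
mean_sleep = np.array(
    [sleep_demo[kmeans.labels_ == c, 0].mean() for c in range(4)]
)
order = np.argsort(-mean_sleep)  # cluster ids, best first
grade_names = ["excellent", "good", "fair", "poor"]
grade_of_cluster = {int(c): grade_names[rank] for rank, c in enumerate(order)}
grades = [grade_of_cluster[int(c)] for c in kmeans.labels_]
print({g: grades.count(g) for g in grade_names})
```

A ranking rule that also uses the number of awakenings would be equally defensible; the point is only that some explicit rule is needed to turn unordered clusters into ordered grades.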

. . . (omitted)

The regression model obtained by training can be used to predict the comprehensive sleep quality score of infants in the last 20 groups (No. 391-410).

2.5 Question 5

On the basis of Question 3, re-evaluate infant No. 238 after the treatment plan has been adjusted, that is, with the behavior type changed to quiet, and predict which sleep quality category the infant now falls into. If the category is excellent, no further adjustment is needed; otherwise the treatment plan must be adjusted again.

3 Code Implementation

3.1 Question 1

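import re
import pandas as pd
import statsmodels.api as sm

# Read the attachment and drop the last 20 rows (No. 391-410), which have no behavior label
data = pd.read_excel('./data/附件.xlsx')
data = data[0:-20]
data

# Extract the independent and dependent variables
X = data[['母亲年龄', '婚姻状况', '教育程度', '妊娠时间(周数)', '分娩方式', 'CBTS', 'EPDS', 'HADS']]
# Encode the behavior type: moderate = 1, quiet = 2, ambivalent = 3
Y1 = data['婴儿行为特征'].map({'中等型': 1, '安静型': 2, '矛盾型': 3})

# Convert the sleep time to minutes and remove outliers
def convert_to_minutes(time_str):
    ...  # (body omitted)
    return total_minutes

# Convert the sleep-time column and remove outliers
Y2 = data['整晚睡眠时间(时:分:秒)'].apply(convert_to_minutes)
# Add the intercept term
X = sm.add_constant(X)
# Build the OLS regression models
model1 = sm.OLS(Y1, X)
model2 = sm.OLS(Y2, X)

# Fit the models
result1 = model1.fit()
result2 = model2.fit()

# Print the regression summary
print(result1.summary())
# Print the coefficient estimates and the significance test results
print(result1.params)
print(result1.pvalues)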
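
The body of `convert_to_minutes` is omitted in the code above. A minimal sketch of one possible implementation follows; it is not the original function. It assumes the cells hold values whose string form is `HH:MM:SS`, and it treats anything unparseable or longer than 24 hours as an outlier by returning NaN.

```python
import math
import re


def convert_to_minutes(time_str):
    """Convert an 'HH:MM:SS' value to total minutes; return NaN if it is not valid."""
    match = re.fullmatch(r"\s*(\d{1,2}):(\d{2}):(\d{2})\s*", str(time_str))
    if match is None:
        return math.nan
    hours, minutes, seconds = (int(g) for g in match.groups())
    # Assumption: out-of-range fields or more than 24 hours are data-entry errors
    if minutes >= 60 or seconds >= 60 or hours > 24:
        return math.nan
    total_minutes = hours * 60 + minutes + seconds / 60
    return total_minutes


print(convert_to_minutes("10:30:00"))
print(convert_to_minutes("99:99:99"))
```

With this version the outliers become NaN rather than being dropped, so the rows would still have to be excluded before fitting, for example by passing `missing='drop'` to `sm.OLS`.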

According to the results of the regression analysis, the following conclusions can be drawn:

  1. The R-squared of the regression is 0.014: the independent variables explain only 1.4% of the variability of the dependent variable, so the explanatory power is weak.
  2. The adjusted R-squared is negative: once the number of independent variables is penalized, the model explains no more than the sample mean alone.
  3. The F statistic is 0.6728 and Prob (F-statistic) is 0.716. The large p-value means that the model as a whole is not significant.
  4. None of the coefficients (coef) of the independent variables passes the significance test (P>|t|), so their relationship with the dependent variable is not significant.
  5. The confidence intervals of the coefficients ([0.025 0.975]) show that the true parameter values remain uncertain.
  6. AIC and BIC are relative criteria that are meaningful only when comparing candidate models on the same data; on their own they do not show that this model fits well.
  7. Other diagnostics (such as Omnibus and Durbin-Watson) point to possible problems with some model assumptions and statistical properties that warrant further checking.

In summary, in this data set the relationship between the independent variables (mother's age, marital status, education level, gestational duration, mode of delivery, CBTS, EPDS, HADS) and the dependent variable (infant behavior type) is not significant, and the explanatory power of the model is weak. The independent variables therefore contribute little to predicting the infant's behavior type.

print(result2.summary())
print(result2.params)
print(result2.pvalues)

The regression conclusions are summarized as follows.
The regression model for overnight sleep time (converted to minutes) has limited explanatory power: the R-squared is 0.063, so the model explains 6.3% of the variability in overnight sleep time. The adjusted R-squared, which accounts for the number of independent variables, is 0.043.

  1. The coefficient of the constant term const is 534.4218, which is significantly different from 0: when all independent variables are 0, the predicted overnight sleep time is about 534 minutes (about 8.9 hours).
  2. The coefficient of mother's age is 0.5649, and the p-value is 0.575, which is not significant, indicating that there may not be a linear relationship between mother's age and sleep time throughout the night.
  3. The coefficient of marital status is -17.6649, and the p-value is 0.135, which is not significant, indicating that the impact of marital status on sleep time throughout the night may not be obvious.
  4. The coefficient of education level is -2.6517, and the p-value is 0.557, which is not significant, indicating that the impact of education level on sleep time throughout the night may not be obvious.
  5. The coefficient of gestational time (weeks) is 2.6182, and the p-value is 0.268, which is not significant, indicating that the effect of gestational time on sleep time throughout the night may not be obvious.
  6. The coefficient of the mode of delivery is 21.0835, and the p-value is 0.593, which is not significant, indicating that the mode of delivery may not have an obvious effect on the sleep time throughout the night.
  7. The coefficient of CBTS is 2.5157, and the p value is 0.078, which is not significant at the 0.05 level but is at the 0.10 level, indicating that CBTS may have some influence on overnight sleep time.
  8. The coefficient of EPDS is -4.4951, and the p value is reported as 0.000 (p < 0.001), which is significant: each additional EPDS point is associated with about 4.5 fewer minutes of overnight sleep.
  9. The coefficient of HADS is 0.8048, and the p value is 0.633, which is not significant, indicating that the effect of HADS on the sleep time of the whole night may not be obvious.

The overall F test of the regression gives an F statistic of 3.200 with a p-value of 0.00157, so the null hypothesis that all slope coefficients are zero is rejected: at least one independent variable has a significant effect on overnight sleep time. The Durbin-Watson statistic is 2.180, which is close to 2 and suggests no serious autocorrelation in the residuals.
Overall, EPDS (the postpartum depression score) is the one clearly significant factor for overnight sleep duration, while the effects of the other independent variables are not evident. However, the explanatory power of the regression model is weak and the R-squared is low, so other important factors have probably not been considered.

3.2 Question 2

import re
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Read the data and keep the 390 labeled rows
data = pd.read_excel('./data/附件.xlsx')
data = data[0:-20]

# Data preprocessing (omitted)
...

# Split into features and labels
features = data[['母亲年龄', '婚姻状况', '教育程度', '妊娠时间(周数)', '分娩方式', 'CBTS', 'EPDS', 'HADS', '婴儿性别', '婴儿年龄(月)', '整晚睡眠时间(时:分:秒)', '睡醒次数', '入睡方式']]
labels = data['婴儿行为特征']
features

# Split into training and test sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=0)

# Build the decision tree model
clf = DecisionTreeClassifier()
clf.fit(train_features, train_labels)

# Predict on the test set
pred_labels = clf.predict(test_features)

# Compute the accuracy
accuracy = sum(pred_labels == test_labels) / len(test_labels)
print("Decision tree accuracy:", accuracy)
# Output: Decision tree accuracy: 0.4230769230769231

# Build the random forest model
rf = RandomForestClassifier()
rf.fit(train_features, train_labels)

# Predict on the test set
pred_labels_rf = rf.predict(test_features)

# Compute the accuracy
accuracy_rf = sum(pred_labels_rf == test_labels) / len(test_labels)
print("Random forest accuracy:", accuracy_rf)
# Output: Random forest accuracy: 0.5512820512820513

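# Use the model to predict the unlabeled samples.
# `data` no longer contains them after the [0:-20] slice above, so reload the
# last 20 rows of the file; they need the same preprocessing as the training
# data before they are passed to the model.
unknown_samples = pd.read_excel('./data/附件.xlsx')[-20:]
unknown_features = unknown_samples[['母亲年龄', '婚姻状况', '教育程度', '妊娠时间(周数)', '分娩方式', 'CBTS', 'EPDS', 'HADS', '婴儿性别', '婴儿年龄(月)', '整晚睡眠时间(时:分:秒)', '睡醒次数', '入睡方式']]
unknown_pred_labels = clf.predict(unknown_features)
# Integer codes of the behavior types: 中等型 (moderate), 安静型 (quiet), 矛盾型 (ambivalent)
label_dict = {0: 'moderate', 1: 'quiet', 2: 'ambivalent'}
print("Prediction results:")
for i in range(len(unknown_samples)):
    print("Sample number {}, predicted type is {}".format(390 + i, label_dict[unknown_pred_labels[i]]))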

Prediction results:
Sample number 390, predicted type is moderate
Sample number 391, predicted type is quiet
Sample number 392, predicted type is quiet
Sample number 393, predicted type is quiet
Sample number 394, predicted type is moderate
Sample number 395, predicted type is quiet
Sample number 396, predicted type is quiet
Sample number 397, predicted type is quiet
Sample number 398, predicted type is quiet
Sample number 399, predicted type is moderate
Sample number 400, predicted type is quiet
Sample number 401, predicted type is moderate
Sample number 402, predicted type is moderate
Sample number 403, predicted type is moderate
Sample number 404, predicted type is moderate
Sample number 405, predicted type is moderate
Sample number 406, predicted type is moderate
Sample number 407, predicted type is quiet
Sample number 408, predicted type is quiet
Sample number 409, predicted type is moderate

3.3 Question 3

Please download the complete materials (see Section 4).

4 Complete data download

The download link is at the bottom of the Zhihu article below; it includes the complete Word document and Python code.

zhuanlan.zhihu.com/p/648093611


Origin blog.csdn.net/weixin_43935696/article/details/132092009