2023 Huazhong Cup Mathematical Modeling, Problem C: Complete Model and Code

All of the model code is complete and is available at the end of the article.

Summary

With the rapid development of industrialization and urbanization, air pollution has become a global environmental problem. Pollutants such as fine particulate matter (PM2.5) have had serious impacts on human health, the ecological environment, and social economy. This study aims to deeply explore the main factors affecting PM2.5 concentration and construct an effective multi-step prediction model. To this end, we used methods such as data mining, machine learning, and statistical analysis to conduct a comprehensive analysis of a large amount of pollutant concentration and meteorological data.

First, this paper analyzes and processes the data in Annex 1 and Annex 2 and screens out the factors related to changes in PM2.5 concentration, including PM10, CO, air temperature, wind speed and precipitation. Correlation analysis and a multiple linear regression model are then used to quantify the influence of each factor on PM2.5 concentration. The results show that PM10 and CO have the greatest impact, followed by air temperature and wind speed, while precipitation has relatively little effect.

Second, the data set is divided into a training set and a test set, and a multi-step prediction model for PM2.5 concentration is constructed based on the results of Question 1. To obtain the best prediction effect, several models were tried: ARIMA, support vector machine (SVM), artificial neural network (ANN) and long short-term memory (LSTM) neural network. The best-performing model was selected by comparing the root mean square error (RMSE) of each model at different prediction step sizes. The selected model was then used to evaluate the 3-step, 5-step, 7-step and 12-step prediction effects; the specific results are shown in Table 1. The test set and its prediction results are also visualized to show the prediction effect more intuitively.

Third, a multi-step forecasting model of AQI was constructed and its modeling effect was evaluated using the root mean square error (RMSE). The best model was again selected by comparing the prediction performance of different models, and the test set and its prediction results were visualized. The model is used to predict the AQI at the times given in Annex 3, and a daily air quality warning level is derived from the prediction results. The prediction results and the classification of warning levels are shown in Table 3 and Table 4. This study provides a theoretical basis for predicting PM2.5 concentration and AQI more accurately, analyzing pollution influencing factors and formulating effective control strategies.

In addition, this paper discusses response measures under the different warning levels as a reference for local governments and relevant departments. For the blue, yellow, orange and red warning levels, emergency measures such as travel restrictions, production restrictions, work stoppages and class suspensions are proposed to minimize the impact of polluted weather. Long-term control measures such as energy conservation and emission reduction, scientific scheduling, and publicity and education are also proposed for each warning level, in order to improve the prevention and early warning of heavily polluted weather, emergency response capabilities, and the level of refined environmental management.

Key words: air pollution; multi-step forecasting model; root mean square error; emergency plan

1. Restatement of the problem

1.1 Problem Background

With the acceleration of industrialization and the continuous increase of human activities, air pollution has become a global environmental problem. Air pollution is harmful to human health, ecological environment, and social economy, and its pollution level is affected by many factors, such as PM2.5, PM10, CO, temperature, wind speed, precipitation, etc. PM2.5 refers to particulate matter with a diameter of 2.5 microns or less in the air, which can enter the human respiratory tract and lungs and have adverse effects on human health. AQI is the Air Quality Index, which is a value calculated based on the concentration of pollutants in the air, and it directly reflects the quality of the air.

Therefore, identifying the factors that influence the concentration of PM2.5 and other pollutants, and predicting PM2.5 concentration and AQI more accurately, are issues of common concern to the scientific community and decision makers. To improve the response and disposal mechanism for heavily polluted weather, strengthen prevention, early warning and emergency response capabilities, raise the level of refined environmental management, and work toward eliminating weather at or above the heavy pollution level, a certain locality issued an emergency plan for polluted weather as an important part of its emergency response plan system for environmental emergencies. The plan divides the warning into four levels of emergency response: blue, yellow, orange and red. Predicting air quality and AQI levels is therefore of great significance for analyzing pollution influencing factors and formulating effective control strategies.

1.2 Question restatement

Question 1: According to Annex 1 and Annex 2, analyze and process the data, screen out the factors related to the change of PM2.5 concentration, and explain the degree of impact of the screened factors on PM2.5 concentration.

Question 2: Divide the data into a training set and a test set yourself. According to Annex 1 and Annex 2, build a multi-step PM2.5 concentration prediction model based on Question 1, use the root mean square error (RMSE) to evaluate the 12-step prediction effect, give the results in the main text in the format of Table 1, and visualize the test set and its prediction results.

At the same time, use this model to predict the PM2.5 concentration at the given time in Annex 3. Please use the format of Table 2 to give the results in the main text.

Question 3: Build an AQI multi-step prediction model, evaluate the modeling effect using root mean square error (RMSE), and visualize the test set and its prediction results. At the same time, use this model to predict the AQI at the given time in Annex 3, and give the early warning level of air quality every day. Please use Table 3 and Table 4 to give the results in the main text.

2. Problem Analysis

2.1 Analysis of Problem 1

Question 1 requires us to screen out the factors related to changes in PM2.5 concentration and to explain how strongly each screened factor influences it. To solve this problem, we first integrate and preprocess the data in Annex 1 and Annex 2, including missing value handling, outlier detection and data normalization. We then use correlation analysis to calculate the correlation coefficient between each factor (PM10, CO, air temperature, wind speed, precipitation and so on) and PM2.5 concentration, and identify the factors that are closely related to changes in PM2.5 concentration. Next, a multiple linear regression model is used to analyze the influence of the screened factors further: the regression coefficients and significance tests give the degree and significance level of each factor's influence on PM2.5 concentration. This study found that PM10 and CO have the greatest impact on PM2.5 concentration, followed by temperature and wind speed, while precipitation has relatively little effect.
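The correlation-then-regression idea described above can be sketched as follows. This is only a minimal illustration on synthetic data, not the fitted model from this article: the generated values, the column name `wind_speed` and the helper `standardized_coefficients` are assumptions made for the demonstration. Fitting the regression on z-scored columns makes the coefficients comparable across factors with different units.

```python
import numpy as np
import pandas as pd

def standardized_coefficients(df, target, features):
    """Fit ordinary least squares on z-scored columns; return one coefficient per feature."""
    z = (df - df.mean()) / df.std(ddof=0)
    X = z[features].to_numpy()
    y = z[target].to_numpy()
    # No intercept term is needed because every z-scored column has zero mean.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=features)

# Synthetic stand-in data (assumption): PM2.5 driven strongly by PM10, weakly by wind speed.
rng = np.random.default_rng(0)
n = 500
pm10 = rng.normal(80, 20, n)
wind = rng.normal(3, 1, n)
pm25 = 0.6 * pm10 - 2.0 * wind + rng.normal(0, 5, n)
demo = pd.DataFrame({"PM2.5": pm25, "PM10": pm10, "wind_speed": wind})

# Step 1: correlation of every candidate factor with PM2.5.
corr = demo.corr()["PM2.5"].drop("PM2.5")
# Step 2: standardized regression coefficients as a measure of relative influence.
coefs = standardized_coefficients(demo, "PM2.5", ["PM10", "wind_speed"])
print(corr)
print(coefs)
```

A larger absolute standardized coefficient suggests a stronger influence, although correlated predictors can make individual coefficients unstable, so the ranking should be read together with the significance tests mentioned above.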

2.2 Analysis of Problem 2

Question 2 requires building a multi-step prediction model for PM2.5 concentration from Annex 1 and Annex 2 and evaluating its effect at different prediction steps. To accomplish this task, we first divide the data set into a training set and a test set. Then, based on the results of Question 1, we try several prediction models: ARIMA, support vector machine (SVM), artificial neural network (ANN) and long short-term memory (LSTM) neural network. By comparing the root mean square error (RMSE) of each model at different prediction step sizes, the LSTM model, which performed best, was selected. The 3-step, 5-step, 7-step and 12-step prediction effects were evaluated using the selected model, and the specific results are given in Table 1. The test set and its prediction results are also visualized to show the prediction effect more intuitively. Finally, the model is used to predict the PM2.5 concentration at the times given in Annex 3, and the results are given in Table 2.
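The evaluation procedure (chronological split, sliding windows, RMSE per prediction step size) can be sketched as below. This is a hedged illustration only: the series is synthetic, the helpers `make_windows` and `rmse_at_steps` are hypothetical names, and a naive persistence forecast stands in for the LSTM so that the example runs without a deep learning library. It also assumes that the "k-step effect" means the RMSE over the first k forecast steps, which the problem statement does not spell out.

```python
import numpy as np

def make_windows(series, n_lags, n_steps):
    """Slice a 1-D series into (inputs, targets) pairs for multi-step forecasting."""
    X, Y = [], []
    for i in range(len(series) - n_lags - n_steps + 1):
        X.append(series[i:i + n_lags])
        Y.append(series[i + n_lags:i + n_lags + n_steps])
    return np.array(X), np.array(Y)

def rmse_at_steps(y_true, y_pred, steps=(3, 5, 7, 12)):
    """RMSE over the first k forecast steps, for each k in `steps` (assumed definition)."""
    return {k: float(np.sqrt(np.mean((y_true[:, :k] - y_pred[:, :k]) ** 2)))
            for k in steps}

# Synthetic series (assumption): a daily cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(400)
series = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)

# Chronological split: the last 20% is the test set, with no shuffling.
split = int(len(series) * 0.8)
test = series[split:]
X_test, Y_test = make_windows(test, n_lags=24, n_steps=12)

# Persistence baseline: repeat the last observed value for every future step.
Y_pred = np.repeat(X_test[:, -1:], Y_test.shape[1], axis=1)

scores = rmse_at_steps(Y_test, Y_pred)
print(scores)
```

Any candidate model (ARIMA, SVM, ANN or LSTM) can be dropped in by replacing `Y_pred` with its forecasts of the same shape, which keeps the comparison across models consistent.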

2.3 Analysis of Problem 3

Question 3 requires building a multi-step forecasting model for AQI, evaluating the modeling effect, and giving a daily air quality warning level. Building on the results of Question 1 and Question 2, we again compare the root mean square error (RMSE) of each model at different prediction step sizes and select the LSTM model, which performed best. The selected model is used for multi-step forecasting of AQI, and its modeling effect is evaluated. The test set and its prediction results are also visualized to show the prediction effect more intuitively.

After completing the AQI prediction, we need to give the daily air quality warning level based on the prediction results. According to the classification standard of the warning level, the predicted AQI value is converted into the corresponding color of the warning level. Specifically, we divide the AQI value into four warning levels of blue, yellow, orange and red, and count the days of each warning level. Finally, the predicted daily air quality warning levels are given in Table 3, and the summary results of the warning level colors are given in Table 4.
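The conversion from predicted AQI to a warning color, and the count of days per color, can be sketched as follows. The cut-off values in `PLACEHOLDER_THRESHOLDS` are purely hypothetical placeholders, since the article does not list the thresholds of the local emergency plan; they must be replaced with the official values. The extra level "none" (no warning issued) and the sample AQI list are likewise assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Placeholder cut-offs (assumption): replace with the values from the local emergency plan.
PLACEHOLDER_THRESHOLDS = (200, 300, 400, 500)
LEVELS = ("none", "blue", "yellow", "orange", "red")

def warning_level(aqi, thresholds=PLACEHOLDER_THRESHOLDS):
    """Map one predicted AQI value to a warning color using ascending lower bounds."""
    return LEVELS[bisect_right(thresholds, aqi)]

# Illustrative values only, not output of the model in this article.
predicted_aqi = [150, 210, 320, 450, 510, 205]

daily = [warning_level(v) for v in predicted_aqi]   # per-day levels (Table 3 style)
summary = Counter(daily)                            # days per level (Table 4 style)
print(daily)
print(summary)
```

Keeping the thresholds as a parameter makes it easy to switch to the official classification standard without touching the rest of the code.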

3. Model assumptions

In response to the questions raised in this paper, we made the following model assumptions:

  1. PM2.5 concentration is affected by many factors, including meteorological factors and pollutant concentration factors.
  2. Meteorological factors include temperature, wind speed, relative humidity and precipitation. These factors reflect air stability, wind direction and the cleaning effect of precipitation, and thereby affect the concentration of PM2.5.
  3. Pollutant concentration factors include PM10, SO2, NO2, etc. There is a certain correlation between these pollutants and PM2.5, which can be explored by establishing a multiple linear regression model.
  4. Based on the factors related to the change of PM2.5 concentration screened out in Question 1, a PM2.5 concentration prediction model can be established to predict future PM2.5 concentration.
  5. Based on the AQI multi-step prediction model constructed for Question 3, future AQI values can be predicted, and the predicted air quality can be graded according to the AQI value.

4. Description of symbols

The commonly used symbols in this paper are shown in the table below, and other symbols are explained in the text

 

5. Modeling and solution

5.1 Modeling and solution of problem 1

5.1.1 Data processing and index selection 

Data preprocessing is mainly carried out from the following three aspects:

(1) Abnormal data processing

A preliminary statistical analysis of the given data revealed three kinds of abnormal data. First, some data points are repeated; duplicates identified by matching the entire record are deleted so that every record is unique. Second, the program crashed during processing because the attribute values in this data set have inconsistent data types; to make them consistent, the small number of values with a different type are converted, and values that cannot be converted are deleted. Third, visualizing the data set showed that some values stay constant for a long time; records in which a value remains constant for more than one-fifth of the time span are deleted.
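The three cleaning rules can be sketched with pandas as below. This is a minimal sketch under stated assumptions: the helper name `clean_records`, the toy table and the reading of the "one-fifth" rule (drop every run of identical consecutive values longer than 20% of the remaining rows) are assumptions, not the exact procedure used for the competition data.

```python
import pandas as pd

def clean_records(df, value_cols, max_constant_frac=0.2):
    """Drop duplicate rows, coerce values to numbers, and drop overly long constant runs."""
    # Rule 1: remove records that are exact duplicates of an earlier record.
    df = df.drop_duplicates().copy()
    # Rule 2: convert to numeric; entries that cannot be converted become NaN and are dropped.
    for col in value_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=value_cols)
    # Rule 3: label runs of identical consecutive values and drop runs that are too long.
    for col in value_cols:
        run_id = df[col].ne(df[col].shift()).cumsum().rename("run")
        run_len = df[col].groupby(run_id).transform("count")
        df = df[run_len <= max_constant_frac * len(df)]
    return df.reset_index(drop=True)

# Toy data (assumption): one duplicate row, one unparseable value, one long constant run.
raw = pd.DataFrame({
    "time": [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "PM2.5": ["35", "40", "40", "bad", "50", "50", "50", "50", "42", "38", "41"],
})
cleaned = clean_records(raw, ["PM2.5"])
print(cleaned)
```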

(2) Missing value processing of data records

Counting the time points in the data shows that the series is not continuous. To preserve the continuity and smoothness of the time series, the missing data are filled in. Commonly used filling methods include simple ones such as random filling, the mean, the median and the mode, as well as model-based ones such as K-nearest neighbors (KNN), regression prediction and expectation maximization (EM). Considering the low proportion of missing values in this data set and the long period of the time series, a single missing data point is filled with the average of the values immediately before and after it; gaps of several consecutive data points are filled by multiple random imputation; and data missing for more than 7 consecutive days are discarded.
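The gap-dependent filling strategy can be sketched as follows. Several simplifications are assumed here and should be adapted to the real data: `fill_gaps` is a hypothetical helper, each row is treated as one time step so `max_gap` is counted in rows rather than days, and a single random draw from the observed values stands in for a full multiple random imputation procedure.

```python
import numpy as np
import pandas as pd

def fill_gaps(s, max_gap=7, seed=0):
    """Fill isolated gaps with the neighbor mean, short gaps randomly, and drop long gaps."""
    s = s.astype(float).copy()
    isna = s.isna()
    # Label each run of consecutive missing/non-missing values and measure gap lengths.
    run_id = isna.astype(int).diff().ne(0).cumsum()
    gap_len = isna.astype(int).groupby(run_id).transform("sum")

    # Single missing point: average of the values just before and after it.
    single = isna & (gap_len == 1)
    neighbor_mean = (s.shift(1) + s.shift(-1)) / 2
    s[single] = neighbor_mean[single]

    # Short multi-point gap: random draw from observed values (simplified stand-in).
    multi = isna & (gap_len > 1) & (gap_len <= max_gap)
    rng = np.random.default_rng(seed)
    observed = s.dropna().to_numpy()
    s[multi] = rng.choice(observed, size=int(multi.sum()))

    # Anything still missing belongs to a gap longer than max_gap and is discarded.
    return s.dropna()

values = [10, np.nan, 14, np.nan, np.nan, 20] + [np.nan] * 8 + [30]
filled = fill_gaps(pd.Series(values))
print(filled)
```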

(3) Data standardization

Data standardization scales the data to a fixed interval according to a certain ratio. On the one hand, it makes features of different dimensions dimensionless. On the other hand, it reduces the complexity of numerical calculations, accelerates model convergence and improves the accuracy of the model. For large data sets or neural network models, data standardization is essential. However, standardization is not purely beneficial in practice: it may also introduce deviations in the prediction results. The main reason is that predictions made on standardized data are themselves scaled to a fixed range and lose their actual numerical meaning. They must be restored by denormalization, and this step is where the deviation arises.

There are two commonly used data standardization methods: min-max standardization and Z-Score standardization. Based on the characteristics of the data, Z-Score standardization, also called standard deviation standardization, is selected. It standardizes the data using the mean and the standard deviation, and it works when the maximum and minimum values of the sequence are unknown.
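Z-Score standardization maps each value x to z = (x - mean) / std, and denormalization inverts it with x = z * std + mean. A minimal sketch is given below; the class name `ZScoreScaler` and the toy arrays are assumptions for illustration, and the sketch does not guard against constant columns (zero standard deviation). The statistics are fitted on the training set only and reused for new data, so that no information from the test set leaks into the scaling.

```python
import numpy as np

class ZScoreScaler:
    """Column-wise Z-Score standardization with an inverse transform."""

    def fit(self, x):
        # Store the mean and (population) standard deviation of each column.
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

    def inverse_transform(self, z):
        # Denormalization: map scaled values back to the original units.
        return z * self.std_ + self.mean_

# Toy training data (assumption): two features on very different scales.
train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
new = np.array([[25.0, 2.0]])

scaler = ZScoreScaler().fit(train)
z_train = scaler.transform(train)
z_new = scaler.transform(new)           # scaled with the training statistics
restored = scaler.inverse_transform(z_train)
print(z_train)
print(z_new)
```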

We first merge the data in Annex 1 and Annex 2 into a new data file, 合并数据.xlsx (merged data), and then preprocess it directly, as follows:

import pandas as pd
import numpy as np

# Load the merged data (Annex 1 + Annex 2) from the Excel file
data = pd.read_excel("合并数据.xlsx")

# Drop the quality grade column ("质量等级"): it is categorical, so linear interpolation does not apply
data = data.drop(columns=["质量等级"])

# Check the missing values
print("Missing values per column:")
print(data.isnull().sum())

# Fill missing values by linear interpolation (numeric columns only,
# since non-numeric columns such as timestamps cannot be interpolated)
interp_columns = data.select_dtypes(include="number").columns
data[interp_columns] = data[interp_columns].interpolate(method='linear')

# Check the missing values again
print("\nMissing values per column after filling:")
print(data.isnull().sum())

# Detect outliers with the IQR rule and replace them with NaN (modifies `data` in place)
def detect_outliers(data, columns, threshold=1.5):
    for column in columns:
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - threshold * iqr
        upper_bound = q3 + threshold * iqr

        is_outlier = (data[column] < lower_bound) | (data[column] > upper_bound)
        print(f"{column} outlier count: {is_outlier.sum()}")

        # Replace outliers with missing values
        data[column] = data[column].mask(is_outlier)

# Detect and handle outliers
numeric_columns = ['AQI', 'PM10', 'O3', 'SO2', 'PM2.5', 'NO2', 'CO', 'V13305', 'V10004_700', 'V11291_700', 'V12001_700', 'V13003_700']
detect_outliers(data, numeric_columns)

# Fill the former outliers (now missing values) by linear interpolation
data[interp_columns] = data[interp_columns].interpolate(method='linear')

# Save the preprocessed data to a new CSV file
data.to_csv("preprocessed_data.csv", index=False)

After outlier handling and interpolation, we obtain a new file, preprocessed_data.csv.

5.1.2 Data visualization and results 

To visualize the processed data, we use the Matplotlib and Seaborn libraries in Python to draw time series plots, box plots and a correlation matrix heat map. We also calculate the correlation coefficient between each variable and PM2.5 concentration to screen out the factors related to changes in PM2.5 concentration. The correlation coefficient ranges from -1 to 1: a value close to 1 indicates a positive correlation, a value close to -1 indicates a negative correlation, and a value close to 0 indicates no linear correlation.

We enter the following code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("preprocessed_data.csv")
pollutants = ["AQI", "PM2.5", "PM10", "O3", "SO2", "NO2", "CO"]

# Correlation matrix of the numeric columns
# (numeric_only requires pandas >= 1.5 and skips text columns such as timestamps)
corr_matrix = data.corr(numeric_only=True)

# Correlation coefficients between PM2.5 and the other variables
correlations = corr_matrix['PM2.5'].sort_values(ascending=False)

# Print the correlation coefficients
print(correlations)

# Time series plot of AQI, PM2.5, PM10, O3, SO2, NO2 and CO
plt.figure(figsize=(15, 6))
for name in pollutants:
    plt.plot(data[name], label=name)
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Air Quality Time Series")
plt.legend()
plt.show()

# Box plot of AQI, PM2.5, PM10, O3, SO2, NO2 and CO to inspect the distribution of each pollutant
plt.figure(figsize=(12, 6))
sns.boxplot(data=data[pollutants])
plt.title("Boxplot for Air Quality Parameters")
plt.show()

# Correlation matrix heat map
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

We get the following visualization:

Figure 5.1.1 Time series graph 

 Figure 5.1.2 Box plot

Figure 5.1.3 Correlation matrix heat map

We also obtain the following result table:


Origin blog.csdn.net/zzzzzzzxxaaa/article/details/130304476