2023 Huazhong Cup Mathematical Modeling Contest, Problem C: Air Quality Prediction and Early Warning (full solution write-up and programs)

2023 Huazhong Cup Mathematical Modeling

Question C Air Quality Prediction and Early Warning

Original problem statement

  Air pollution harms human health, the ecological environment, and the social economy. The pollution level is affected by many factors, such as PM2.5, PM10, CO, temperature, wind speed, and precipitation. Identifying the factors that drive the concentration of PM2.5 and other pollutants, and predicting PM2.5 concentration and the AQI more accurately, are common concerns of the scientific community and decision makers, and are of great significance for analyzing the factors that influence pollution and for formulating effective control strategies.
  To improve the response and disposal mechanism for heavily polluted weather, strengthen prevention, early warning, and emergency response capabilities, raise the level of refined environmental management, and eliminate weather at or above the heavy-pollution level, a certain locality has issued an emergency plan for polluted weather as an important part of its emergency response plan system for environmental emergencies. The plan strengthens monitoring and early warning as well as energy conservation and emission reduction, in order to minimize the impact of polluted weather. Its warnings are divided into four emergency response levels:
  Blue warning: forecast daily AQI > 150, or daily AQI > 100 for 48 hours or more.
  Yellow warning: forecast daily AQI > 200, or daily AQI > 150 for 48 hours or more.
  Orange warning: forecast daily AQI > 200 for 48 hours, or daily AQI > 150 for 72 hours or more.
  Red warning: forecast daily AQI > 200 for 72 hours and daily AQI > 300 for 24 hours or more.
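  The four rules above can be sketched as a small Python function. This is only an illustration under stated assumptions, not the plan's official procedure: the forecast is taken to be one AQI value per day starting at the day being assessed, "48 hours" and "72 hours" are read as 2 and 3 consecutive days, the red rule is read as "3 consecutive days above 200 with at least one of them above 300", and the names `run_length` and `warning_level` are hypothetical.

```python
def run_length(aqi, threshold):
    """Count the leading consecutive days whose AQI exceeds the threshold."""
    days = 0
    for value in aqi:
        if value <= threshold:
            break
        days += 1
    return days


def warning_level(aqi):
    """Return 'red', 'orange', 'yellow', 'blue' or 'none' for a daily AQI forecast.

    aqi: forecast daily AQI values, starting at the day being assessed.
    Assumption: 48 h = 2 consecutive days, 72 h = 3 consecutive days.
    """
    over100 = run_length(aqi, 100)
    over150 = run_length(aqi, 150)
    over200 = run_length(aqi, 200)
    # Check the most severe level first so the highest matching level wins.
    if over200 >= 3 and any(v > 300 for v in aqi[:over200]):
        return "red"
    if over200 >= 2 or over150 >= 3:
        return "orange"
    if over200 >= 1 or over150 >= 2:
        return "yellow"
    if over150 >= 1 or over100 >= 2:
        return "blue"
    return "none"
```

  For example, `warning_level([160, 170, 120])` falls under the yellow rule, because AQI stays above 150 for two consecutive days but not three.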
  The participating teams are asked to complete the following questions (tasks) according to the question requirements:
  Question 1: Using Annex 1 and Annex 2, analyze and process the data, screen out the factors related to changes in PM2.5 concentration, and explain the degree to which each screened factor influences PM2.5 concentration.
  Question 2: Divide the data into a training set and a test set yourself. Using Annex 1 and Annex 2, and building on Question 1, construct a multi-step prediction model for PM2.5 concentration. Evaluate the 12-step prediction performance, give the results in the main text in the format of Table 1, and visualize the test set together with its predictions. Also use this model to predict the PM2.5 concentration at the times given in Annex 3, and give the results in the main text in the format of Table 2.
  Question 3: Build a multi-step AQI prediction model, evaluate it using the root mean square error (RMSE), and visualize the test set together with its predictions. Also use this model to predict the AQI at the times given in Annex 3, and give the air quality warning level for each day. Give the results in the main text in the formats of Table 3 and Table 4.

  Notes on the annexes:
  1. Annexes 1 and 2 provide the basic air quality forecast data of the monitoring point in recent years, including pollutant concentration data (Annex 1) and meteorological data (Annex 2).
  2. Annex 3 gives the time points to be predicted.

Overview of the overall solution process (abstract)

  Air pollution is one of the most serious problems facing humanity and can cause serious damage to human health, the ecological environment, and more. With the continued advance of urbanization and industrialization, pollution in China is becoming increasingly serious. Accurate and effective prediction of air quality indicators and early warning of future air conditions are therefore of far-reaching significance for ecological health and social development.
  For Question 1, to identify the indicators related to changes in PM2.5 concentration and their degree of influence, the given data must first be preprocessed. The characteristics of the data in Annex 1 are analyzed, and multiple imputation is used to fill the missing values. Outliers in Annex 2 are then removed, and Lagrange interpolation is used to fill the resulting gaps. The processed data are merged and then standardized to remove the effect of the large differences in scale between indicators. A normality test is applied to the standardized data; combining the significance levels with the QQ plots shows that most indicators are not normally distributed and therefore fall outside the applicable range of the Pearson correlation coefficient, so the Spearman correlation coefficient is used to analyze the correlation between indicators. Finally, five variables strongly correlated with PM2.5 concentration are extracted, namely SO2, NO2, CO, PM10, and average temperature, and a significance test is carried out on them. The results show that PM10 is strongly correlated with PM2.5 concentration, NO2 and CO are also strongly correlated, and SO2 and average temperature are moderately correlated.
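  As a rough illustration of the Spearman step described above, the coefficient can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below uses only the Python standard library; the function names are hypothetical, the inputs are assumed to be equal-length lists with no missing values, and it does not reproduce the significance test used in the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

  Because only the ranks matter, any monotonically increasing relationship (linear or not) gives a coefficient of 1, which is why the measure does not require normally distributed data.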
  For Question 2, to predict the PM2.5 concentration over a given period, and guided by the scatter plot of its historical data, an ARIMA model is first established to obtain predictions for each air quality indicator over the next 12 days. Evaluation shows that its prediction performance is poor. After comprehensive consideration, an LSTM multi-step forecasting model is selected instead. The five indicators screened out by the correlation analysis in Question 1 are used as the input factors of the LSTM model, and PM2.5 concentration is the output variable. The selected indicator data are imported into Python and standardized. To guard against overfitting, the first 80% of the data is used as the training set and the last 20% as the test set. The LSTM neural network prediction model is established by setting the number of LSTM layers, the number of neurons, the optimizer, and so on, and the test set and its predictions are visualized. By changing the forecast step size, the root mean square error (RMSE) of the 3-step, 5-step, 7-step, and 12-step forecasts is obtained: 19.4574, 21.6210, 21.4685, and 22.0202, respectively. The two models are compared, and the LSTM multi-step model, which fits better and predicts more accurately, is selected to predict the PM2.5 concentration over the next 12 days. The predictions are shown in Table 6-3. Analysis of the MAE, R2, MAPE, residuals, and loss curves of the LSTM multi-step forecasting model indicates that the model has a certain degree of accuracy and robustness.
  For Question 3, to build a multi-step prediction model for AQI, the Spearman correlation coefficients between AQI and the other monitored indicators are first calculated, and PM10, SO2, PM2.5, NO2, and CO are screened out as the main influencing factors. The ARIMA and LSTM models are then used to predict the AQI over the next 12 days, with the training and test sets divided independently and the prediction results visualized. The RMSE of the LSTM model is 32.2989, and a comprehensive evaluation shows that its prediction performance is the better of the two. Finally, based on the predicted AQI for the next 12 days, the corresponding air quality level and warning level are determined, and the number of occurrences of each warning color is summarized. These results are shown in Table 7-2 and Table 7-3, respectively.
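  The data preparation for a multi-step model of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the `lookback` window length, and the direct multi-output framing (one sample predicts all `horizon` future values at once) are assumptions, since the source does not state the window size or the exact multi-step strategy.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split a time-ordered list without shuffling: first part train, rest test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def make_windows(features, target, lookback, horizon):
    """Build supervised samples for multi-step forecasting.

    features: list of per-time-step feature lists (e.g. SO2, NO2, CO, PM10, temperature)
    target:   list of PM2.5 values aligned with features
    Each sample pairs the previous `lookback` feature rows with the next
    `horizon` target values.
    """
    inputs, outputs = [], []
    for t in range(lookback, len(target) - horizon + 1):
        inputs.append(features[t - lookback:t])
        outputs.append(target[t:t + horizon])
    return inputs, outputs
```

  Splitting by position rather than at random keeps every test sample later in time than the training data, which is what makes the test error a fair estimate of forecasting performance.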

Model assumptions:

  1. Assume that air pollution is only related to the given detection indicators, without considering the influence of other factors.
  2. Assume that the given data are reliable and highly representative except for missing values and outliers.
  3. Assume that the data after processing missing values and outliers is smooth, which can meet the subsequent model calculation requirements.

Problem analysis:

  Analysis of Question 1
  The data given in the annexes contain missing values and outliers for some indicators. Given the characteristics of the data in each table, and considering that the missingness mechanism in Annex 1 is MAR (missing at random) and that there are runs of consecutive missing values, multiple imputation is used to fill the missing values in Annex 1 as well as the anomalous zero values in some indicators. The outliers in Annex 2, which affect only a few samples, are removed and then filled by Lagrange interpolation, which is computationally cheap. The imputed data are then standardized and tested for normality. Based on the test results for each air quality indicator, together with the QQ plots, most of the data are judged not to follow a normal distribution, which puts them outside the scope of the Pearson correlation coefficient. The Spearman correlation coefficient is therefore chosen for the correlation analysis and significance test. From the results, the factors related to changes in PM2.5 concentration are screened out and their influence on PM2.5 concentration is analyzed. The analysis flow chart is shown in Figure 2-1.
[Figure 2-1: analysis flow chart for Question 1]
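  The gap-filling step can be illustrated in Python. This is a sketch under assumptions rather than the procedure actually used in the paper: missing or removed values are assumed to be marked as `None`, each gap is interpolated from up to `k` known neighbours on each side (the source does not state how many neighbours are used), and `fill_gaps` is a hypothetical helper name.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs, ys) at x."""
    total = 0.0
    for k in range(len(xs)):
        term = ys[k]
        for j in range(len(xs)):
            if j != k:
                term *= (x - xs[j]) / (xs[k] - xs[j])
        total += term
    return total


def fill_gaps(series, k=2):
    """Replace None entries using up to k known neighbours on each side."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = [j for j in range(i - 1, -1, -1) if series[j] is not None][:k]
        right = [j for j in range(i + 1, len(series)) if series[j] is not None][:k]
        nodes = sorted(left + right)
        if nodes:  # leave the gap untouched if the series has no known values
            filled[i] = lagrange_interpolate(nodes, [series[j] for j in nodes], i)
    return filled
```

  Keeping the number of neighbours small is a deliberate choice: high-degree Lagrange polynomials tend to oscillate, so a local low-degree fit is usually safer for filling isolated gaps.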
  Analysis of Question 2
  To predict the PM2.5 concentration as well as possible, both an ARIMA model and an LSTM model are used, and the better model is chosen after comparing their results. For the ARIMA model, the PM2.5 concentration series is first tested for stationarity and, through differencing, transformed into a stationary series suitable for ARIMA. A white-noise test is then carried out on the residuals of the series to confirm that the data meet the model fitting requirements. Finally, a suitable ARIMA model is selected to fit the data that passed the tests, and the corresponding predictions are obtained. Because LSTM neural networks forecast time series well, an LSTM multi-step forecasting model is established by selecting the input factors, the output variable, and the number of neurons, and by setting reasonable hyperparameters. The root mean square error (RMSE) of the 3-step, 5-step, 7-step, and 12-step forecasts is obtained by changing the forecast step size, and the trained network together with the known historical data is used to predict the PM2.5 concentration over the next 12 days.
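  The error measures named in this write-up (RMSE, MAE, MAPE, R2) follow their standard definitions, sketched below in plain Python. The function names are illustrative, and `mape` assumes that no true value is zero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

  RMSE is expressed in the same unit as the predicted quantity, so the values reported for different forecast step sizes can be compared directly.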
  Analysis of Question 3
  Because the LSTM multi-step forecasting model established in Question 2 predicts the data well, the same model is used to predict the AQI over the next 12 days. First, Spearman correlation analysis is used to obtain the five most important indicators that affect the AQI. Combined with the AQI calculation method, a multi-input, single-output LSTM multi-step forecasting model is established, and its root mean square error (RMSE) is calculated to evaluate the modeling performance. An ARIMA forecasting model is also built for comparison. After comprehensive consideration, the AQI for the next 12 days, the daily air quality warning level, and a summary of the warning colors are obtained.
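  The "calculation method of AQI" is not spelled out in this write-up. The sketch below assumes it refers to the usual individual-AQI (IAQI) scheme: each pollutant concentration is mapped to a sub-index by piecewise-linear interpolation between breakpoints, and the AQI is the largest sub-index. The 24-hour PM2.5 breakpoints listed here are quoted from memory of the Chinese standard HJ 633-2012 and should be checked against the standard before use; the function names are hypothetical, and the rounding that the standard applies to the final value is omitted.

```python
# (concentration in ug/m3, IAQI) breakpoint pairs for 24-hour PM2.5.
# Assumption: values recalled from HJ 633-2012; verify before relying on them.
PM25_BREAKPOINTS = [
    (0, 0), (35, 50), (75, 100), (115, 150),
    (150, 200), (250, 300), (350, 400), (500, 500),
]


def iaqi(concentration, breakpoints):
    """Piecewise-linear interpolation of the sub-index between breakpoints."""
    if concentration <= breakpoints[0][0]:
        return float(breakpoints[0][1])
    for (lo_c, lo_i), (hi_c, hi_i) in zip(breakpoints, breakpoints[1:]):
        if concentration <= hi_c:
            return lo_i + (hi_i - lo_i) * (concentration - lo_c) / (hi_c - lo_c)
    # Above the last breakpoint the index is capped at its maximum.
    return float(breakpoints[-1][1])


def aqi(iaqi_values):
    """The AQI is the largest of the individual pollutant sub-indices."""
    return max(iaqi_values)
```

  Under these assumptions, a 24-hour PM2.5 concentration of 55 ug/m3 lies halfway between the 35 and 75 breakpoints and therefore maps to a sub-index of 75.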

Model establishment and solution Overall paper thumbnail

[Images: thumbnails of the full paper]


Program code: (code and documentation not free)

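% Lagrange interpolation
% x0, y0: known nodes and their values; x: points at which to interpolate.
function y = lagranges(x0, y0, x)
n = length(x0);
m = length(x);
y = zeros(size(x));   % preallocate the output
for i = 1:m
    z = x(i);
    s = 0.0;
    for k = 1:n
        % k-th Lagrange basis polynomial evaluated at z
        p = 1.0;
        for j = 1:n
            if j ~= k
                p = p*(z - x0(j))/(x0(k) - x0(j));
            end
        end
        s = p*y0(k) + s;
    end
    y(i) = s;
end
end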

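% ARIMA order selection, residual diagnostics, and 12-step forecast.
% Assumes Y (training series, column vector) and test (the 12 held-out
% values) already exist in the workspace.
% The head of the grid-search loop is missing from the original listing;
% the loop below is a reconstruction inferred from the 4x4 reshape calls.
LOGL = zeros(4,4);
PQ = zeros(4,4);
for p = 1:4
    for q = 1:4
        Mdl = arima(p,1,q);
        [~,~,logL] = estimate(Mdl,Y,'Display','off');
        LOGL(p,q) = logL;
        PQ(p,q) = p + q;
    end
end
LOGL = reshape(LOGL,16,1);
PQ = reshape(PQ,16,1);
% Third argument is the sample size (the original listing hard-coded 100).
[~,bic] = aicbic(LOGL,PQ+1,numel(Y));
a = reshape(bic,4,4)            % back to a 4x4 grid (rows: p, columns: q)
[x,y] = find(a==min(a(:)));     % best lags: the (p,q) with the smallest BIC
Mdl = arima(x(1),1,y(1));       % second argument 1 = first-order differencing
EstMdl = estimate(Mdl,Y);
[res,~,logL] = infer(EstMdl,Y); % res: residuals
stdr = res/sqrt(EstMdl.Variance);
figure('Name','Residual diagnostics')
subplot(2,3,1)
plot(stdr)
title('Standardized Residuals')
subplot(2,3,2)
histogram(stdr,10)
title('Histogram of Standardized Residuals')
subplot(2,3,3)
autocorr(stdr)
subplot(2,3,4)
parcorr(stdr)
subplot(2,3,5)
qqplot(stdr)
% The figure above shows the residual diagnostics.
% The standardized-residual plots show whether the residuals are close to normal.
% The ACF and PACF plots check the residuals for autocorrelation and partial autocorrelation.
% The final QQ plot also checks whether the residuals are close to normal.
% The Durbin-Watson statistic is the most commonly used autocorrelation measure in econometrics.
diffRes0 = diff(res);
SSE0 = res'*res;
DW0 = (diffRes0'*diffRes0)/SSE0 % Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation
step = 12; % number of forecast steps
[forData,YMSE] = forecast(EstMdl,step,'Y0',Y);
lower = forData - 1.96*sqrt(YMSE); % lower bound of the 95% confidence interval
upper = forData + 1.96*sqrt(YMSE); % upper bound of the 95% confidence interval
figure()
plot(Y,'Color',[.7,.7,.7]);
hold on
h1 = plot(length(Y):length(Y)+step,[Y(end);lower],'r:','LineWidth',2);
plot(length(Y):length(Y)+step,[Y(end);upper],'r:','LineWidth',2)
h2 = plot(length(Y):length(Y)+step,[Y(end);forData],'k','LineWidth',2);
legend([h1 h2],'95% confidence interval','Forecast',...
'Location','NorthWest')
title('Forecast')
X = 3030:1:3041; % time indices of the 12 held-out points (hard-coded in the original)
plot(X,test,'color',[0.5 0.5 0.5])
hold off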


Origin blog.csdn.net/weixin_43292788/article/details/131508673