(Source code version) 2023 National College Student Mathematical Modeling Contest, Question E: Yellow River Water and Sediment Monitoring Data Analysis, with Detailed Explanation and Python Source Code (SARIMA Model)


Preface

I don't know how everyone is doing after the competition. Personally, due to a heavy workload I only completed the solutions and modeling for questions D and E, which is a pity; it is not easy for one person to finish the modeling and analysis of multiple questions. Of course, I promise I will keep writing detailed explanations and modeling walkthroughs for the modeling competitions over the years, and as long as they are useful to you I will do my best. In this competition, I personally think question E is relatively easy to get started with: the problem statement is concise and the modeling idea is clear. However, since it involves time series data, the data processing can be troublesome, and although the modeling idea is clear, time series forecasting is still difficult and full of details. Compared with question D, it demands more modeling skill.

I have a lot of experience analyzing time series data in industry, so answering this question was relatively easy for me. Time series data comes up constantly in practice, and anyone who wants to work in data analysis or data mining should be familiar with this type of data and its modeling workflow. Students interested in this area, or who want to learn this kind of modeling, are welcome to subscribe to my column, where I have compiled more than ten articles on time series data processing and modeling, along with application cases:

I have been focusing on modeling for four years and have participated in dozens of mathematical modeling competitions, large and small, so I understand the principles of various models, the modeling process of each, and many problem analysis methods. The purpose of this column is to let you quickly use various mathematical models, machine learning, deep learning, and code from scratch; each article contains a practical project and runnable code. I follow every mathematical modeling competition, and for each one I write up the latest ideas and complete code in this column. I hope friends in need will not miss it.


1. Restatement of the problem

Competition background

Studying the laws governing changes in water and sediment flux in the Yellow River provides important theoretical guidance for environmental governance, responding to climate change, and the lives of people along the Yellow River Basin, as well as for optimizing the allocation of water resources in the basin, coordinating the relationship between people and land, regulating water and sediment, and preventing floods and disasters.

Modeling requirements

Based on the pattern of change in water and sediment flux at the hydrological station, predict and analyze the trend of water and sediment flux at the station over the next two years.

Modeling analysis

The key to this problem is the third question. I have already given detailed source-code answers to the first and second questions, which are mostly data processing and statistical work with no in-depth modeling requirements. The third question, however, states its requirement clearly: a time series forecasting model must be built to analyze and predict the trend of water and sediment flux at the hydrological station over the next two years. With questions one and two as groundwork, we already understand the data and have transformed it into the format we want. If you have questions about the first and second questions, see my previous article:

(Source code version) 2023 Higher Education Society Cup National College Student Mathematical Modeling Competition - Question E Yellow River Water and Sediment Monitoring Question 1 Detailed Data Analysis + Python Code

The data format after data processing is:

One thing to note is that the data is recorded only once per day and the recording time is not fixed. Since the prediction horizon is two years, it is best to use a coarser time granularity in the modeling process, namely one value per day; if an hourly granularity were used instead, the data would be extremely sparse and the imputation error far too large. So we simply treat each day's single record as that day's value, which is also what the question expects. Final processing result:
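As a minimal sketch of this step, assuming the cleaned records live in a DataFrame named df_with_sediment with a '日期时间' timestamp column (the names used in the code later in this article), the timestamps can be truncated to the day level so that each record represents one calendar day and lines up with a daily date index:

import pandas as pd

# Hypothetical sketch: strip the time-of-day component so each record keys on its calendar day.
# This also lets the later merge against a daily date index match on exact timestamps.
df_with_sediment['日期时间'] = pd.to_datetime(df_with_sediment['日期时间']).dt.normalize()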

Data preview

Before modeling, we'd better preview the changes in the data to help us judge:

Water level:

Water flow:

Sand content:

It is obvious that the data shows certain cyclical fluctuations, and it fluctuates much more at the end of each year than in other periods. The model that describes this type of series is the seasonal ARIMA model, abbreviated SARIMA; seasonal time series models are also called multiplicative seasonal models. So we can start modeling.
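A minimal sketch of how such a preview can be plotted, assuming the processed records are in df_with_sediment with the column names used later in this article (adjust to your own variable names):

import matplotlib.pyplot as plt

# Hypothetical preview plot of the three series against time
fig, axes = plt.subplots(3, 1, figsize=(12, 9), sharex=True)
for ax, col in zip(axes, ['水位(m)', '流量(m3/s)', '含沙量(kg/m3) ']):
    ax.plot(df_with_sediment['日期时间'], df_with_sediment[col], linewidth=1)
    ax.set_ylabel(col)
plt.tight_layout()
plt.show()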

SARIMA modeling

SARIMA (Seasonal Autoregressive Integrated Moving Average) is a seasonal time series forecasting model built on the ARIMA model. ARIMA is a classic model widely used in time series forecasting that accounts for the trend and autocorrelation structure of a series. The SARIMA model adds a seasonal component on top of ARIMA, so it copes better with time series data that varies seasonally.
SARIMA models usually include the following parameters:

  • Seasonal period: the period over which the series exhibits seasonal variation, such as a year, a week, or a day.
  • Order of differencing: the number of times the series is differenced to remove non-stationarity.
  • Autoregressive terms: model the relationship between the series and its own past values, capturing the trend of the data.
  • Moving average terms: model the relationship between the series and past errors, capturing the random fluctuation of the data.

The SARIMA model can be used to forecast time series with seasonal variation, such as sales, temperature, or stock prices. It is fitted on historical data and then used to predict a future period. When using SARIMA, appropriate parameters and model structure must be chosen, and the model diagnosed and tuned, to obtain more accurate forecasts. A small illustrative fit is sketched below.
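As a minimal illustrative sketch of what a SARIMA fit looks like in statsmodels (the order values here are placeholders rather than the parameters selected later, and merged_df['水位(m)'] refers to the daily water-level series built in the next section):

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative only: the (p, d, q)(P, D, Q, m) values below are placeholders, not tuned parameters
sarima_model = SARIMAX(merged_df['水位(m)'],
                       order=(1, 0, 1),               # non-seasonal (p, d, q)
                       seasonal_order=(1, 1, 1, 12))  # seasonal (P, D, Q, m)
sarima_results = sarima_model.fit(disp=False)
print(sarima_results.summary())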

Data filling

Note that from January 1, 2016 to December 31, 2021 there are 2,192 days in total, but only 2,159 records: some days have more than one record and some days have none at all, so we need to fill the data.

First fill in the data time range:

import numpy as np
import pandas as pd

# Create a date index covering the full date range
date_range = pd.date_range(start='2016-01-01', end='2021-12-31', freq='D')
# Create a DataFrame with the complete date range as its index
date_df = pd.DataFrame(index=date_range)
# Merge the original data onto the date index
merged_df = date_df.merge(df_with_sediment, left_index=True, right_on='日期时间', how='left')
# Leave missing values as NaN (np.nan, not the string 'NaN') so they can be filled later
merged_df['水位(m)'].fillna(np.nan, inplace=True)
merged_df = merged_df.set_index(merged_df['日期时间'])
# Drop the original datetime column
merged_df.drop(columns=['日期时间'], inplace=True)
merged_df

At this point, for dates with more than one record only the first record of the day is kept, and dates with no record appear as NaN:
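The deduplication itself is not shown in the code above; a minimal sketch of one way to do it, assuming the raw records are still in df_with_sediment with a '日期时间' datetime column, is to keep only the first record of each calendar day before merging:

# Hypothetical sketch: keep only the first record per calendar day
df_with_sediment = df_with_sediment.sort_values('日期时间')
df_with_sediment = df_with_sediment[~df_with_sediment['日期时间'].dt.normalize().duplicated(keep='first')]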

Perform exponential smoothing filling:

merged_df['水位(m)'].fillna(merged_df['水位(m)'].ewm(span=3).mean(), inplace=True)
merged_df['流量(m3/s)'].fillna(merged_df['流量(m3/s)'].ewm(span=3).mean(), inplace=True)
merged_df['含沙量(kg/m3) '].fillna(merged_df['含沙量(kg/m3) '].ewm(span=3).mean(), inplace=True)
merged_df

After the data is populated, seasonal analysis can begin.

Seasonal analysis

seasonal_decompose() can be used for the analysis; it decomposes the time series into three parts: a long-term trend component (Trend), a seasonal component (Seasonal), and a residual component (Resid).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from matplotlib.pylab import rcParams
import pmdarima as pm
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf  # ACF/PACF plots for order selection
from statsmodels.tsa.stattools import adfuller                 # ADF test
from statsmodels.stats.diagnostic import acorr_ljungbox        # white noise (Ljung-Box) test
import warnings
import itertools
warnings.filterwarnings("ignore")  # suppress warnings
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the data to inspect seasonality; period is the cycle length
ts_decomposition = seasonal_decompose(merged_df['水位(m)'], period=52)
ts_decomposition.plot()
plt.show()

 

Note: with the SimHei font the minus sign on the axes may render as an empty box; adding plt.rcParams['axes.unicode_minus'] = False fixes this.

ADF test

The ARIMA model requires the time series to be stationary. The basic idea of stationarity is that the statistical laws governing the process do not change with time. From the definition of (weak) stationarity: for all t and s, the covariance of Y_{t} and Y_{s} depends only on the time interval |t-s| and not on the actual moments t and s. A stationary process therefore admits the simplified notation \gamma_{k} = Cov(Y_{t}, Y_{t-k}) for the autocovariance function and \rho_{k} = \gamma_{k} / \gamma_{0} = Corr(Y_{t}, Y_{t-k}) for the autocorrelation function.

Regarding strict versus weak stationarity, I have already written about it very clearly in my earlier article on the autoregressive (AR) model. Judging a time series plot by eye can be somewhat subjective; the ADF test (also known as the unit root test) is a commonly used and rigorous statistical test.

The ADF test mainly determines whether the time series contains a unit root: if the series is stationary, there is no unit root; otherwise, there will be a unit root.

from statsmodels.tsa.stattools import adfuller                 # ADF test

def stableCheck(timeseries):
    # Rolling mean and standard deviation over a 60-period window
    rol_mean = timeseries.rolling(window=60).mean()
    rol_std = timeseries.rolling(window=60).std()
    # Plot the original series together with the rolling statistics
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rol_mean, color='red', label='Rolling Mean')
    std = plt.plot(rol_std, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    # Run the ADF test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    # Wrap the test results in a readable Series
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print('ADF检验结果:')
    print(dfoutput)

Results of Dickey-Fuller Test:
ADF检验结果:
Test Statistic                   -5.503970
p-value                           0.000002
#Lags Used                        7.000000
Number of Observations Used    2184.000000
Critical Value (1%)              -3.433348
Critical Value (5%)              -2.862864
Critical Value (10%)             -2.567475

Based on the ADF test results above, the following conclusions can be drawn:

Test Statistic : -5.503970

  • The test statistic is the key output of the ADF test and is compared against the critical values below. The more negative the statistic, the stronger the evidence against a unit root, i.e. the more likely the series is stationary.

p-value : 0.000002

  • The p-value is a key output of hypothesis testing and represents the probability of observing the test statistic or more extreme case if the null hypothesis is true. Here, the p-value is very close to zero and much smaller than the commonly used significance level (e.g. 0.05), which indicates that we can reject the null hypothesis.

#Lags Used : 7

  • This is the number of lagged difference terms included in the ADF regression (selected here by AIC).

Number of Observations Used (number of observation samples used) : 2184

  • This represents the number of observation samples used in the test.

Critical Values :

  • Critical values are the thresholds the test statistic is compared against. If the test statistic is below the critical value at a given significance level (1%, 5%, 10%), the null hypothesis can be rejected at that level.

Since the p-value is smaller than the usual significance level (e.g. 0.05), we can reject the null hypothesis of the ADF test, namely that the series contains a unit root. Therefore, based on this ADF test, the time series can be regarded as stationary and free of a unit root.

White noise test

If a stationary series is not white noise, it is not composed purely of random noise: there is some inherent structure or pattern in the series that can be further analyzed and modeled for prediction or other purposes. We can check this with the Ljung-Box test, a standard method for testing serial autocorrelation in time series analysis.

def whiteNoiseCheck(data):
    result = acorr_ljungbox(data, lags=1)
    temp = result.iloc[:, 1].values[0]
    # print(result.iloc[:,1].values[0])
    print('白噪声检验结果:', result)
    # If temp is less than 0.05, the null hypothesis can be rejected with 95% confidence and the
    # series is considered non-white-noise; otherwise it is white noise and not worth modeling
    print(temp)
    return result

ifwhiteNoise = whiteNoiseCheck(df_evel_timese_diff2)

 

The LB statistic evaluates whether there is autocorrelation in the series, and the p-value determines whether the series is white noise. Generally speaking, if the series is white noise, the LB statistic will be small and the p-value large, indicating no autocorrelation. If the p-value is less than the significance level (usually 0.05), the null hypothesis can be rejected, i.e. the series is not white noise. In this test result the LB statistic is 2122 and the p-value is close to 0, far below 0.05, so the null hypothesis is rejected and the series is not white noise.

Model fitting

Fitting a SARIMA model requires determining its parameters. There are three non-seasonal parameters, p, d and q, representing the autoregressive order, the differencing order and the moving average order, and three seasonal parameters, P, D and Q, representing the seasonal autoregressive order, the seasonal differencing order and the seasonal moving average order. Based on experience and statistical practice, suitable p, d, q and P, D, Q can be selected by inspecting the sample autocorrelation function (ACF) and partial autocorrelation function (PACF), so that the autocorrelation and partial autocorrelation of the residual series are essentially zero.

Time series order determination

from matplotlib.ticker import MultipleLocator

def draw_acf(data):
    # Use the ACF plot to judge the model order
    plot_acf(data)
    plt.title("序列自相关图(ACF)")
    plt.show()

def draw_pacf(data):
    # Use the PACF plot to judge the model order
    plot_pacf(data)
    plt.title("序列偏自相关图(PACF)")
    plt.show()

def draw_acf_pacf(data):
    f = plt.figure(facecolor='white')
    # First subplot: ACF
    ax1 = f.add_subplot(211)
    # Set the x-axis tick interval to 1 and keep the locator in a variable
    x_major_locator = MultipleLocator(1)
    plot_acf(data, ax=ax1)
    # Second subplot: PACF
    ax2 = f.add_subplot(212)
    plot_pacf(data, ax=ax2)
    plt.subplots_adjust(hspace=0.5)
    # Set the major x-axis ticks to multiples of 1
    ax1.xaxis.set_major_locator(x_major_locator)
    ax2.xaxis.set_major_locator(x_major_locator)
    plt.show()
    

(1) Determine the orders d and D: when the d-th difference of the original series, together with the seasonal difference at lag m, yields a stationary series, the values of d, D and m are determined.

(2) Determine the orders p, q and P, Q:

  1. First, draw the ACF and PACF plots of the stationarized time series;
  2. Determine P and Q by observing tailing off / cutting off at the seasonal lags;
  3. Determine p and q by observing tailing off / cutting off at the short, non-seasonal lags.

The values of d, D and m have already been determined in the previous steps. The remaining parameters are determined from the ACF and PACF plots below, whose horizontal axis is the lag.

Non-seasonal part:

For p: after lag = 1 the ACF tails off and the PACF cuts off. Likewise, for q: after lag = 1 the ACF cuts off and the PACF tails off.

Seasonal part:

P and Q are determined in the same way as in the non-seasonal case, but remember that the lag interval is the seasonal period (60 here).

AR model: the autocorrelation function tails off and the partial autocorrelation function cuts off;
MA model: the autocorrelation function cuts off and the partial autocorrelation function tails off;
ARMA model: both the autocorrelation function and the partial autocorrelation function tail off.
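
For example, the draw_acf_pacf helper defined above can be applied to the filled water-level series (a minimal usage sketch; apply it to the differenced series instead if differencing is needed first):

# Hypothetical usage of the plotting helper on the filled water-level series
draw_acf_pacf(merged_df['水位(m)'].dropna())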

Alternatively, the pmdarima.auto_arima() method can automatically determine the SARIMA(p,d,q)(P,D,Q)_{m} parameters: simply pass in the data and set the search options in auto_arima().

Model training

import pmdarima as pm

split_point = int(len(time_series) * 0.85)
# Split into training and test sets
data_train, data_test = time_series[0:split_point], time_series[split_point:len(time_series)]
# Fit the model on the training data
built_arimamodel = pm.auto_arima(data_train,
                                 start_p=0,   # minimum p
                                 start_q=0,   # minimum q
                                 test='adf',  # use the ADF test to choose the differencing order d
                                 max_p=5,     # maximum p
                                 max_q=5,     # maximum q
                                 m=12,        # length of the seasonal period; m=1 means no seasonality
                                 d=None,      # let the function determine d
                                 seasonal=True, start_P=0, D=1, trace=True,
                                 error_action='ignore', suppress_warnings=True,
                                 stepwise=False  # False means a full grid search instead of a stepwise search
                                 )
print(built_arimamodel.summary())

In this way we obtain the best parameters, and auto_arima builds the model directly. Of course, you can also choose the parameters yourself from the ACF/PACF plots:

from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error


# Fit an ARIMA model to the water-level series
model_water = ARIMA(merged_df['水位(m)'], order=(2, 0, 1))
results_water = model_water.fit()

# Evaluate the model against the in-sample fitted values
mse_water = mean_squared_error(merged_df['水位(m)'], results_water.fittedvalues)

print(f'Mean Squared Error (Water Level): {mse_water}')
Mean Squared Error (Water Level): 0.02088585362915174

The mean squared error is only about 0.02, which is already quite satisfactory. Next, we can predict the water level for the next two years:

# Define the forecast horizon: the next two years, 730 days starting the day after the last observation
forecast_range = pd.date_range(start='2022-01-01', periods=730, freq='D')

# Forecast with the fitted model
forecast = results_water.get_forecast(steps=730)
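
To inspect or plot the forecast, the result can be aligned with the forecast dates; a minimal sketch using the standard predicted_mean attribute and conf_int() method of the statsmodels forecast result:

# Align the point forecasts and confidence intervals with the forecast dates
pred_mean = pd.Series(forecast.predicted_mean.values, index=forecast_range)
conf_bounds = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(merged_df['水位(m)'], label='Observed water level')
plt.plot(pred_mean, label='Forecast water level')
plt.fill_between(forecast_range, conf_bounds.iloc[:, 0].values, conf_bounds.iloc[:, 1].values, alpha=0.2)
plt.legend()
plt.show()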

At this point, the modeling is complete. The above is the overall process for this competition problem, from data analysis and data processing to data modeling. It has to be said that it is quite involved; generally speaking, only someone with rich data modeling experience can complete the construction of the whole model within two days, and it demands a lot from students.

