[Python time series analysis of the number of postgraduate applicants in the country: ARIMA code]

Background

The report of the 19th National Congress of the Communist Party of China discussed education in detail. In recent years, as postgraduate enrollment has gradually expanded, the number of people registering for the postgraduate entrance examination has also risen year after year. Most articles about graduate students focus on their current situation, their education, or their employment. The growth in the number of postgraduate applicants nationwide is itself a hot topic in the news, as is the large gap between the number of applicants and the number of admissions.

Given the steady increase in the number of postgraduate applicants nationwide in recent years, this article uses Python and time series analysis on more than 20 years of application data to build an ARIMA model. A suitable ARIMA model is selected experimentally and then used to forecast the number of postgraduate applicants nationwide for the next three years.

1. Data understanding

1. Data source

The data for this study come from Kaoyanbang and the Zhongyan Admissions Information Network. They cover the number of postgraduate applicants nationwide from 1995 to 2018.

2. Data import

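import pandas as pd

# Number of applicants per year, in units of 10,000 people
data = [15.50,20.40,24.20,27.40,31.9,39.2,46,62.4,79.7,94.5,117.2,127.12,128.2,120,124.6,140.6,151.1,165.6,176,172,164,177,201,238]
# Build the annual time series; 'A' is the year-end frequency (spelled 'YE' in newer pandas)
time = pd.Series(data, index=pd.date_range(start='1995', periods=24, freq='A'))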
3. Data display

2. Model establishment and solution

(1) Stationarity test

1. Time series plot

Using the data above, draw the trend of the number of postgraduate applicants nationwide over time with the matplotlib library.

import matplotlib.pyplot as plt

time.plot(figsize=(6, 4), color='r', label='Applicants = F(year)')
plt.xlabel('Year')    # name shown on the x-axis
plt.ylabel('Applicants (10,000 people)')    # name shown on the y-axis
plt.legend()    # show the label passed to .plot() in the plot area
plt.title('Number of postgraduate applicants nationwide, 1995-2018')
plt.show()

Figure 1: 1995-2018 time series chart of the number of postgraduate applicants nationwide

Figure 1 shows that the number of applicants for master's degree programs nationwide generally trended upward over the sample period, from 155,000 in 1995 to 2.38 million in 2018. The rise has continued since then, reaching 4.57 million in 2022, with an average annual growth rate of 15.77%. A series with such a pronounced trend is very unlikely to be stationary.

2. Autocorrelation plot

Next, plot the autocorrelation function of the time series of the number of postgraduate applicants nationwide.

from statsmodels.graphics.tsaplots import plot_acf
plot_acf(time).show()  # draw the ACF of the raw series

Figure 2 Autocorrelation diagram of the number of postgraduate applicants nationwide from 1995 to 2018

Figure 2 shows that the autocorrelation coefficients only fall inside the confidence band after lag 5, and that they stay above zero over many lags, which indicates strong autocorrelation.

3. ADF unit root test

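from statsmodels.tsa.stattools import adfuller
print(adfuller(time))  # (ADF statistic, p-value, lags used, n obs, critical values, best IC)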

From the returned results, the p-value of the test is well above 0.05, so the null hypothesis of a unit root cannot be rejected and the series is judged to be non-stationary.

4. Summary

All three checks above point to the same conclusion: the series is non-stationary.

(2) Stationarity test after differencing

1. Difference operation

To convert a non-stationary time series into a stationary one, we difference the series. The first-order difference subtracts the previous observation from the current one. Writing $D$ for the difference operator, $D y_t = y_t - y_{t-1}$. Let $L$ be the lag operator, defined by $L y_t = y_{t-1}$, so that applying it $k$ times gives $L^k y_t = y_{t-k}$. The $k$-step difference can then be written as $y_t - y_{t-k} = (1 - L^k) y_t = y_t - L^k y_t$, and the first-order difference is the case $k = 1$, that is, $D = 1 - L$. Note that the second-order difference used below applies $D$ twice, $D^2 y_t = (1 - L)^2 y_t = y_t - 2y_{t-1} + y_{t-2}$, which is not the same as the 2-step difference $y_t - y_{t-2}$.

Differencing is applied from low order to high order, so we start with the first-order difference of the time series, which gives the trend plot shown below.
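As a small worked example of the difference and lag operators, the sketch below uses only the first six values of the series and pandas. It assumes that `shift(k)` is an acceptable stand-in for the lag operator $L^k$, and it contrasts the second-order difference (differencing twice) with the 2-step difference.

```python
import pandas as pd

y = pd.Series([15.50, 20.40, 24.20, 27.40, 31.9, 39.2])  # first six values of the series

first = y.diff()            # D y_t   = y_t - y_{t-1}
second = y.diff().diff()    # D^2 y_t = y_t - 2*y_{t-1} + y_{t-2}
lag2 = y.diff(2)            # (1 - L^2) y_t = y_t - y_{t-2}

# shift(k) plays the role of the lag operator L^k
manual_second = y - 2 * y.shift(1) + y.shift(2)

print(pd.DataFrame({'y': y, 'first': first, 'second': second, 'lag2': lag2}))
```

The `second` and `lag2` columns differ, which is why the order of differencing d in ARIMA(p,d,q) refers to repeated first differences rather than to a longer lag.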

# First-order differencing helper

def differ1(price):
    """Return the first difference of a series and plot it."""
    price_diff = price.diff().dropna()
    plt.figure(figsize=(6, 4))
    plt.plot(price_diff)
    return price_diff

day_diff1 = differ1(time)  # first-order difference
plt.title('First-order difference')
plt.show()

Figure 3 The first-order difference time series of the number of postgraduate applicants nationwide

Figure 3 shows that the first-differenced series fluctuates in a fairly stable way. To be thorough, however, we also difference this series once more and examine the second-order difference.

day_diff2 = differ1(day_diff1)  # second-order difference
plt.title('Second-order difference')
plt.show()

Figure 4 The second-order difference time series of the number of postgraduate applicants nationwide

In theory, differencing enough times fully extracts the non-stationary deterministic information in the original time series. However, a higher order of differencing is not always better. Differencing extracts and transforms information, and some information is lost each time it is applied, so the order should be chosen carefully to avoid over-differencing. To judge whether the differenced series is stationary, and therefore whether the order of differencing is adequate, run a unit root test on the differenced series.

# Run the ADF test again on the differenced series
print(adfuller(day_diff1))
print(adfuller(day_diff2))

From these results, the first-differenced series has a roughly constant mean, and the ADF p-value for the second-differenced series is below 0.05, so both differenced series are treated as stationary. Both d=1 and d=2 therefore need to be considered when choosing the parameter d.

2. Autocorrelation plot and partial autocorrelation plot

With the order of differencing d chosen, the parameters p and q of the ARIMA model still need to be selected. They can be determined from the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the series. We therefore draw the autocorrelation and partial autocorrelation plots for the stationary series, that is, for the series after first-order and second-order differencing.

① Time series after first order difference:

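# ACF and PACF plots after first-order differencing
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

flg1 = plt.figure()
acf1 = flg1.add_subplot(1, 2, 1)
pacf1 = flg1.add_subplot(1, 2, 2)
# PACF can only be estimated for lags below half the sample size, so use 10 lags
plot_acf(day_diff1, lags=10, ax=acf1, title='ACF after first-order difference')
plot_pacf(day_diff1, lags=10, ax=pacf1, title='PACF after first-order difference')
plt.show()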

Figure 5 Autocorrelation and partial autocorrelation plots for d=1

The autocorrelation plot after first-order differencing shows that the autocorrelations at most lags stay within the confidence bounds. The values at lags 1 and 3 do exceed the bounds, but this is likely to be chance, and no other lag exceeds them significantly. The partial autocorrelation plot shows that, apart from lag 1, the values essentially stay within the bounds. One candidate is therefore p=2, q=0, that is, the ARIMA(2,1,0) model.

② Time series after second-order difference:

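# ACF and PACF plots after second-order differencing
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

flg2 = plt.figure()
acf2 = flg2.add_subplot(1, 2, 1)
pacf2 = flg2.add_subplot(1, 2, 2)
# PACF can only be estimated for lags below half the sample size, so use 10 lags
plot_acf(day_diff2, lags=10, ax=acf2, title='ACF after second-order difference')
plot_pacf(day_diff2, lags=10, ax=pacf2, title='PACF after second-order difference')
plt.show()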

Figure 6 Autocorrelation and partial autocorrelation diagram under d=2

The autocorrelation and partial autocorrelation plots after second-order differencing show that no lag exceeds the confidence bounds.

Summary:

From the tests above, the differenced series can be regarded as stationary.
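To make "exceeding the boundary" concrete, the sketch below computes the sample autocorrelations of the first-differenced series by hand and compares them with the rough white-noise bound $1.96/\sqrt{n}$. This constant bound is an assumption made for illustration: plot_acf draws its band from Bartlett's formula by default, so the plotted band widens with the lag and the two need not agree exactly. The helper name `sample_acf` is hypothetical.

```python
import numpy as np

data = [15.50, 20.40, 24.20, 27.40, 31.9, 39.2, 46, 62.4, 79.7, 94.5, 117.2,
        127.12, 128.2, 120, 124.6, 140.6, 151.1, 165.6, 176, 172, 164, 177,
        201, 238]
d1 = np.diff(data)  # first-order difference, 23 values

def sample_acf(x, nlags):
    """Sample autocorrelations r_0 .. r_nlags of a 1-D array."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

acf = sample_acf(d1, nlags=10)
bound = 1.96 / np.sqrt(len(d1))  # rough 95% band under the white-noise assumption
outside = [k for k in range(1, 11) if abs(acf[k]) > bound]

print('bound:', round(bound, 3))
print('lags outside the band:', outside)
```

Replacing `np.diff(data)` with `np.diff(data, n=2)` gives the same check for the second-differenced series.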

(3) Model selection and parameter estimation

When choosing the ARIMA(p,d,q) model for forecasting, the parameters p and q are tried from low order to high order over the values 0, 1 and 2, and the best model is chosen according to the AIC criterion (the smaller the AIC, the better).

import statsmodels.api as sm

# Fit each candidate model and print its AIC
arma_mod120 = sm.tsa.ARIMA(time, order=(1, 2, 0)).fit()
print(arma_mod120.aic)

arma_mod121 = sm.tsa.ARIMA(time, order=(1, 2, 1)).fit()
print(arma_mod121.aic)

arma_mod220 = sm.tsa.ARIMA(time, order=(2, 2, 0)).fit()
print(arma_mod220.aic)

arma_mod221 = sm.tsa.ARIMA(time, order=(2, 2, 1)).fit()
print(arma_mod221.aic)

The results are summarized in the following table:

Model          AIC
ARIMA(1,2,0)   162.25302291755247
ARIMA(1,2,1)   162.68977473937656
ARIMA(2,2,0)   163.44292392417188
ARIMA(2,2,1)   161.5282670619417

Comparing the four models, ARIMA(2,2,1) has the smallest AIC (161.53), so it is selected as the best model.

(4) White noise test

from statsmodels.stats.diagnostic import acorr_ljungbox
print(acorr_ljungbox(arma_mod221.resid, lags=5, boxpierce=True))

Return values:

lb_stat: the Ljung-Box test statistic

lb_pvalue: the p-value of the Ljung-Box test, based on the chi-square distribution

bp_stat: the Box-Pierce test statistic

bp_pvalue: the p-value of the Box-Pierce test, based on the chi-square distribution [1]

The white noise test on the residual series gives a p-value of 0.9439 > 0.05, so the null hypothesis that the residuals are white noise cannot be rejected. The model has therefore captured the structure in the series, and the ARIMA(2,2,1) model fits the time series successfully.
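For readers who want to see what the Ljung-Box statistic measures, the sketch below computes $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$ by hand and compares it with a chi-square distribution with $h$ degrees of freedom. It is an illustration under stated assumptions: it uses synthetic data in place of the model residuals, the helper name `ljung_box` is hypothetical, and it does not subtract model degrees of freedom (the `model_df` argument of acorr_ljungbox exists for that purpose).

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(resid, lags):
    """Return (Q, p-value) of the Ljung-Box test on the first `lags` autocorrelations."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.sum(e ** 2)
    ks = np.arange(1, lags + 1)
    rho = np.array([np.sum(e[k:] * e[:n - k]) / denom for k in ks])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - ks))
    return q, chi2.sf(q, df=lags)

rng = np.random.default_rng(0)
noise = rng.normal(size=200)   # stand-in for well-behaved residuals
walk = np.cumsum(noise)        # strongly autocorrelated series, for contrast

q_noise, p_noise = ljung_box(noise, lags=5)
q_walk, p_walk = ljung_box(walk, lags=5)
print('white noise: Q =', round(q_noise, 2), 'p =', round(p_noise, 4))
print('random walk: Q =', round(q_walk, 2), 'p =', round(p_walk, 4))
```

A large p-value means the autocorrelations are jointly small enough to be consistent with white noise, which is the outcome wanted for the residuals of a fitted model.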

3. Model prediction

The ARIMA(2,2,1) model selected above is now used to forecast the number of postgraduate applicants nationwide. The forecast table and forecast chart are given below.

2019-2021 predicted number of postgraduate applicants nationwide (unit: 10,000 people)

Year   Predicted value
2019   265.796811
2020   277.979823
2021   280.468011

Figure 7 Forecast of the number of postgraduate applicants nationwide from 2019 to 2021

The forecast for 2019 is 2.66 million applicants, an increase of only 280,000 over the previous year, whereas the increase from 2017 to 2018 was 370,000. By comparison, the forecast growth is weaker, so the model predicts that the growth rate will slow.

Reference for this article: arima model_time series analysis (R)‖ARIMA model forecasting example


Origin blog.csdn.net/weixin_51315141/article/details/124795927