Use Python to test the stationarity of a time series



Today I will explain how to use Python to test the stationarity of a time series. First, a brief introduction to the concepts behind the stationarity test.

 

Figure 1. Correlation formulas for stationary series

Stationarity of a time series comes in two forms: strict stationarity and wide-sense (weak) stationarity. Let {Xt} be a time series. If, for any positive integer m, any t1, t2, ..., tm ∈ T and any integer τ, formula (1) in Figure 1 holds (that is, the joint distribution of (Xt1, ..., Xtm) is the same as that of (Xt1+τ, ..., Xtm+τ)), then {Xt} is called a strictly stationary time series. {Xt} is wide-sense stationary if it satisfies the following three conditions:

(1) For any t ∈ T, E(Xt·Xt) < ∞;
(2) For any t ∈ T, E(Xt) = μ, where μ is a constant;
(3) For any t, s, k ∈ T with k+s-t ∈ T, γ(t, s) = γ(k, k+s-t), where γ is the autocovariance function;
then {Xt} is called a wide-sense stationary time series.
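
The three conditions can be checked empirically on simulated data. The sketch below is only an illustration under stated assumptions: the two processes (white noise and a random walk), the number of simulated paths and the helper name gamma are all chosen here and do not come from the article. It estimates γ(t, s) across many realizations and compares it at two different positions with the same lag.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 5000, 50  # hypothetical sizes, chosen for illustration

# Each row is one simulated realization of the process.
white_noise = rng.normal(size=(n_paths, n_steps))
random_walk = white_noise.cumsum(axis=1)


def gamma(paths, t, s):
    """Estimate the autocovariance gamma(t, s) across realizations."""
    xt = paths[:, t] - paths[:, t].mean()
    xs = paths[:, s] - paths[:, s].mean()
    return float((xt * xs).mean())


for name, paths in [("white noise", white_noise), ("random walk", random_walk)]:
    # Condition (3): for a fixed lag, gamma should not depend on the position t.
    print(name)
    print("  lag 0:", round(gamma(paths, 5, 5), 2), "vs", round(gamma(paths, 40, 40), 2))
    print("  lag 1:", round(gamma(paths, 5, 6), 2), "vs", round(gamma(paths, 40, 41), 2))
```

For the white noise the two estimates at each lag should be close to each other, while for the random walk they grow with t, so condition (3) fails and the random walk is not wide-sense stationary.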

Because the distribution function of a random sequence is difficult to obtain in practice, strict stationarity is rarely used; wide-sense stationarity is what is mainly used.

With the basic concept in place, let's look at why stationarity matters. The analysis of a stationary time series follows the basic principle of mathematical statistics: use sample information to infer information about the population. For this to work well, the fewer random variables there are to analyze, the better (that is, the lower the dimension of the data, the better), and the more sample information there is for each variable, the better (that is, the more observations, the better). Fewer random variables make the analysis simpler, and a larger sample makes the results more reliable.

The data structure of a time series is special, however. Its value Xt at any time t is a random variable, and because time cannot be repeated, only one sample observation of that variable is available at each time point.

With so little sample information and no other auxiliary information, this kind of data structure usually cannot be analyzed, but stationarity effectively solves the problem. In a stationary series the mean is a constant, so the mean sequence {μt, t∈T}, which originally contained one unknown per time point, becomes the constant sequence {μ, t∈T}. Originally, each mean μt could only be estimated from the single observation xt. Now, since μt = μ, every observation xt is an observation of the same constant mean, as shown in equation (2) in Figure 1.

This greatly reduces the number of random variables and increases the sample size available for each parameter to be estimated, which makes time series analysis considerably easier.
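
The argument above can be illustrated with a small simulation. Equation (2) is not reproduced in the text, so this sketch assumes it is the usual sample mean; the AR(1) process and the values of mu, phi and n are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, phi, n = 37.0, 0.5, 5000  # hypothetical values, not taken from the article

# Simulate one realization of a stationary AR(1) series around the constant mean mu.
x = np.empty(n)
x[0] = mu
for t in range(1, n):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.normal()

# Every x[t] is an observation of the same mean, so the sample mean estimates mu.
mu_hat = x.mean()
print(f"true mean = {mu}, estimate from one realization = {mu_hat:.3f}")
```

Although only one observation exists at each time point, all n observations contribute to the single estimate mu_hat, which is the point the paragraph above makes.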

Now that stationarity is clear, let's look in detail at how to test for it with Python.

There are three main ways to test stationarity: inspecting the time series plot, inspecting the autocorrelation plot, and running a statistical hypothesis test.

Let's start with the time series plot. This is the ordinary plot with time on the horizontal axis and the observed value on the vertical axis. Three examples follow. Because a time series plot is so simple, I made these plots directly in Excel. Python can do it too, but I find it less convenient than Excel for this purpose.
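
For readers who prefer to stay in Python, a time series plot can be sketched with matplotlib as follows. The data here are synthetic stand-ins, not the article's data, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for an annual series with an upward trend (NOT the article's data).
years = np.arange(1964, 2000)
rng = np.random.default_rng(2)
values = 100 + 10 * (years - years[0]) + rng.normal(scale=15, size=years.size)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(years, values, marker="o")  # time on the horizontal axis, observation on the vertical
ax.set_xlabel("Year")
ax.set_ylabel("Observed value")
ax.set_title("Time series plot (synthetic data)")
fig.savefig("timeseries_plot.png")  # hypothetical output file name
```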

The first example is the time series of China's annual yarn production from 1964 to 1999 (data from the Beijing Bureau of Statistics). The data are shown in Figure 2 and the time series plot in Figure 3. Figure 3 shows a clear upward trend in annual yarn production, so the series cannot be stationary.

 

Figure 2. Screenshot of part of yarn production data

Figure 3. Timing chart of yarn production


The second example is the monthly average milk production per cow from January 1962 to December 1975 (data from the website http://census-info.us ). The data are shown in Figure 4 and the time series plot in Figure 5. Figure 5 shows that monthly milk production per cow follows a regular cycle with a period of one year, together with a clear upward trend from year to year, so this series cannot be stationary either.

Figure 4. A screenshot of part of the cow production data

Figure 5. Timing diagram of dairy cow production


The third example is the annual maximum temperature series in Beijing from 1949 to 1998 (data from the Beijing Municipal Bureau of Statistics). The data are shown in Figure 6 and the time series plot in Figure 7. Figure 7 shows that the annual maximum temperature in Beijing fluctuates randomly around 37 degrees, with no obvious trend or cycle. It can basically be regarded as a stationary series, but we still need the autocorrelation plot to verify this.

 

Figure 6. Screenshot of part of Beijing's highest temperature data

Figure 7. Time series of Beijing's maximum temperature

 

As the examples above show, a time series plot can only give a rough judgment of whether a series is stationary; the autocorrelation plot provides a further check. To draw autocorrelation plots we use Python. The code is as follows.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

temperature = r'C:\Users\北京气温.xls'
milk = r'C:\Users\奶牛产量.xlsx'
yarn = r'C:\Users\纱产量.xls'

# read the three Excel files; parse_dates=True asks pandas to parse the index as dates
data_tem = pd.read_excel(temperature, parse_dates=True)
data_milk = pd.read_excel(milk, parse_dates=True)
data_yarn = pd.read_excel(yarn, parse_dates=True)

plt.rcParams.update({'figure.figsize': (8, 6), 'figure.dpi': 100})  # set the figure size
plot_acf(data_tem.Tem)  # draw the autocorrelation plot
plot_acf(data_milk.milk_yield)
plot_acf(data_yarn.yarn_yield)
plt.show()

The plot_acf function in statsmodels draws the autocorrelation plot. It is very simple to use: just pass in the data directly, as long as the data are one-dimensional. The three resulting plots are shown in Figure 8, Figure 9 and Figure 10.

 

Figure 8. Autocorrelation graph of yarn production

Figure 9. Autocorrelation graph of cow production

Figure 10. Autocorrelation map of Beijing maximum temperature

A stationary series usually has only short-term correlation: as the lag k increases, its autocorrelation coefficient decays to zero quickly, whereas the autocorrelation coefficient of a non-stationary series decays slowly. This is the criterion we use when judging stationarity from an autocorrelation plot. Now look at the three plots. Figure 8 is the autocorrelation plot of annual yarn production; the horizontal axis is the lag and the vertical axis is the autocorrelation coefficient. The coefficient decays to zero slowly: over a long range of lags it stays positive and then turns negative, giving the plot a triangular symmetry. This is the typical autocorrelation pattern of a non-stationary series with a monotonic trend.

Figure 9 is the autocorrelation plot of monthly milk production per cow. The coefficient stays on one side of the zero axis for a long time, which is typical of a series with a monotonic trend. It also shows a clear sinusoidal fluctuation, which is typical of a non-stationary series with periodic variation.

Finally, Figure 10 is the autocorrelation plot of Beijing's annual maximum temperature. The autocorrelation coefficients are small throughout and fluctuate around the zero axis, which is the kind of autocorrelation plot that a stationary series with strong randomness usually has.
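
The decay behaviour described above can be reproduced numerically. The sketch below computes the sample autocorrelation coefficients by hand with NumPy for two synthetic series (a linear trend plus noise, and white noise). The helper name sample_acf and the data are assumptions for illustration and are not part of the article's code.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation coefficients for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[: len(d) - k], d[k:]) / denom for k in range(max_lag + 1)])


rng = np.random.default_rng(4)
t = np.arange(200)
trending = 0.5 * t + rng.normal(size=t.size)  # non-stationary: linear trend
noise = rng.normal(size=t.size)               # stationary: white noise

for name, x in [("trending", trending), ("white noise", noise)]:
    print(name, np.round(sample_acf(x, 10), 2))
```

The coefficients of the trending series should stay close to 1 across the first ten lags, while those of the white noise should drop to near zero immediately after lag 0.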

Finally, the ADF test. The first two methods are graphical: intuitive, but not precise enough. The ADF test instead checks stationarity directly through a hypothesis test. ADF (Augmented Dickey-Fuller) is a unit root test; there are many unit root tests, and ADF is one of the more commonly used. Like any hypothesis test, it states a null hypothesis (H0) and an alternative hypothesis (H1), as follows.
H0: The series has a unit root and is non-stationary.
H1: The series has no unit root and is stationary, meaning that it has no time-dependent structure.
Let's use Python code to show how the ADF test is used.

from statsmodels.tsa.stattools import adfuller
yarn_result = adfuller(data_yarn.yarn_yield)  # run the ADF test
milk_result = adfuller(data_milk.milk_yield)
tem_result = adfuller(data_tem.Tem)

print('The ADF Statistic of yarn yield: %f' % yarn_result[0])
print('The p value of yarn yield: %f' % yarn_result[1])
print('The ADF Statistic of milk yield: %f' % milk_result[0])
print('The p value of milk yield: %f' % milk_result[1])
print('The ADF Statistic of Beijing temperature: %f' % tem_result[0])
print('The p value of Beijing temperature: %f' % tem_result[1])

Here we use the adfuller function in statsmodels. It is also simple to use: just pass in the data directly. It returns quite a few values, though. The official documentation lists seven: adf, pvalue, usedlag, nobs, critical values, icbest and resstore (the last one is optional). See the official documentation for their meanings. We only use the first two: adf is the ADF test statistic, and pvalue is the familiar p-value. Our results are shown in Figure 11.

 

Figure 11. ADF test results

 

Figure 11 shows that the adf statistics for yarn production, milk production and Beijing temperature are -0.016384, -1.303812 and -8.294675. In theory, the more negative this statistic is, the stronger the evidence against the null hypothesis. Here, however, we judge by the p-value rather than the statistic. The three p-values are 0.957156, 0.627427 and 0.000000 respectively.

Using the common threshold of 0.05, the first two p-values are far greater than 0.05, so the null hypothesis cannot be rejected: yarn production and milk production are treated as non-stationary series. The p-value of the Beijing temperature series is zero to six decimal places, so the null hypothesis is rejected and the series is judged to be stationary. The results of the ADF test agree with those of the previous two methods.
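
The decision rule can be written down as a small helper. This is a sketch: the function name adf_conclusion and the wording of the conclusions are chosen here for illustration, while the p-values are the ones reported in Figure 11.

```python
def adf_conclusion(p_value, alpha=0.05):
    """Turn an ADF p-value into a verbal conclusion at significance level alpha."""
    if p_value < alpha:
        return "reject H0: treat the series as stationary"
    return "fail to reject H0: treat the series as non-stationary"


# p-values reported in Figure 11 of the article
reported = {
    "yarn production": 0.957156,
    "milk production": 0.627427,
    "Beijing temperature": 0.000000,
}

for name, p in reported.items():
    print(f"{name}: p = {p:.6f} -> {adf_conclusion(p)}")
```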

This article has introduced three methods for judging the stationarity of a time series in some detail. All three are commonly used in practice. There are of course many other methods for judging stationarity; if you need them, consult the relevant literature.


Origin blog.csdn.net/pythonxuexi123/article/details/112933000