Basics of time series analysis - descriptive statistical analysis (preparation work before modeling, code + data)

Topic 1 Basics of Time Series Analysis

This article uses stock data to introduce the basics of time series analysis in detail, with clear definitions and worked examples for each concept and method.
It covers the descriptive statistical analysis of time series, including simple trend analysis, white noise analysis, and autocorrelation analysis. This is the preparatory work done before time series modeling, and it is aimed at helping newcomers understand and apply these techniques.

Later articles in this series will cover further topics in time series data analysis, including:

Trend analysis of special dates: how to analyze and model the effect of special dates and calendar patterns (holidays, weekly, daily, and yearly cycles) on time series data, including real-time adjustment and handling of holiday effects.
Python time series modeling: more specific examples and cases to help readers get started quickly with commonly used time series models and libraries in Python, such as Prophet, ARIMA, and Exponential Smoothing, with code examples showing how to apply these models.
Financial big data application cases: real applications of financial time series data, showing how time series analysis is used to solve problems in finance such as stock price prediction, risk management, and investment strategies.
Recommended learning materials and resources: tutorials, online courses, and books on financial big data and time series analysis for further study.

1.1 Time series definition

A time series is a collection of data points arranged in time order. These data points are typically collected at regular intervals, such as every day, hour, minute, or second.

Formally, if a variable or a group of variables $x(t)$ is observed and measured at a series of moments $t_1, t_2, \cdots, t_n$, the resulting ordered set of discrete values is called a time series.

For example: The closing price of a certain stock A on each trading day from June 1, 2020 to June 1, 2022 can constitute a time series; the daily maximum temperature in a certain place can constitute a time series.

1.2 Characteristics of time series

Trend: the overall long-term upward or downward movement in time series data. It reflects how the data increase or decrease over time.
Seasonality: a pattern that recurs with a fixed period. Such patterns are usually driven by a cyclical factor, such as the seasons of the year or the days of the week.
Cyclic patterns: recurring patterns without a fixed period. Unlike seasonality, cycles can vary in length and do not necessarily follow an obvious schedule.
Noise: the random fluctuation or irregularity in time series data. Noise is usually caused by random factors and is the part of the data not explained by trend, seasonal, or cyclic patterns.
Autocorrelation: the correlation between a data point and the data points one or more steps before it. Autocorrelation analysis helps reveal lagged patterns in a time series.
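To make these components concrete, here is a minimal sketch that builds a synthetic daily series as the sum of a linear trend, a weekly seasonal pattern, and Gaussian noise. All names and parameter values (`n_days`, the slope 0.05, the amplitude 2.0) are illustrative assumptions and are unrelated to the stock data used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n_days = 365
t = np.arange(n_days)

trend = 0.05 * t                                      # long-term upward drift
seasonal = 2.0 * np.sin(2 * np.pi * t / 7)            # pattern repeating every 7 days
noise = rng.normal(loc=0.0, scale=1.0, size=n_days)   # irregular fluctuation

index = pd.date_range("2020-01-01", periods=n_days, freq="D")
series = pd.Series(trend + seasonal + noise, index=index, name="synthetic")

# Lag-1 autocorrelation: strongly positive here, mainly because of the trend
lag1_autocorr = series.autocorr(lag=1)
print(series.head())
print("lag-1 autocorrelation:", round(lag1_autocorr, 3))
```

Changing the slope, the seasonal amplitude, or the noise scale shows how each component alters the appearance of the series and its autocorrelation.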

1.3 Time series data collection and processing (taking the stock market as an example)

1.3.1 Get the stock code/index code and name list

Look up the stock's code on a financial website such as East Money (东方财富), Tonghuashun (同花顺), or Sina Finance.
Then use Python to fetch the trading data for that stock, here through the tushare library:

pip install tushare

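# Import the required packages
import pandas as pd
import tushare as ts  # library used to fetch stock data (must be installed)
import numpy as np
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # nicer-looking plots with simpler code
sns.set(color_codes=True)  # apply the seaborn default theme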

# Fetch stock data by stock code and date range
df = ts.get_k_data('600519', start='2020-01-01', end='2022-12-31')
# Take a quick look at the data:
df.head()

	date	open	close	high	low	volume	code
0	2020-01-02	1022.186	1024.186	1039.246	1010.186	148099.0	600519
1	2020-01-03	1011.186	972.746	1011.186	971.086	130318.0	600519
2	2020-01-06	965.046	972.176	987.086	961.486	63414.0	600519
3	2020-01-07	971.686	988.716	993.186	970.586	47853.0	600519
4	2020-01-08	979.236	982.326	989.686	976.766	25008.0	600519

# To save the stock data to an Excel file:
df.to_excel('股价数据.xlsx', index=False)

1.3.2 Draw stock price trend chart

Once we have the stock price data, we can visualize it. We first use the set_index() function to make the date the row index, which makes it easier to plot directly with pandas. The code is as follows:

df.set_index('date', inplace=True)  # use the date column as the row index
df.head()

	open	close	high	low	volume	code
date
2020-01-02	1022.186	1024.186	1039.246	1010.186	148099.0	600519
2020-01-03	1011.186	972.746	1011.186	971.086	130318.0	600519
2020-01-06	965.046	972.176	987.086	961.486	63414.0	600519
2020-01-07	971.686	988.716	993.186	970.586	47853.0	600519
2020-01-08	979.236	982.326	989.686	976.766	25008.0	600519

df['close'].plot(figsize=(18,4))  # plot the closing price
# plt.xticks(rotation=45)  # rotate the x-axis labels by 45 degrees
plt.legend()  # show the legend
plt.show()

1.3.3 Calculation of rate of return

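# Compute the log return
df_lgreturn=np.log(df['close'])-np.log(df['close'].shift(1))  # first difference of the log price
# df['close'].pct_change().dropna()  # simple return, for comparison
df_lgreturn.head()

df_lgreturn=df_lgreturn.dropna()  # drop the missing value created by the shift
df_lgreturn.head()

df_lgreturn.plot(figsize=(18,4))
# plt.xticks(rotation=45)  # rotate the x-axis labels by 45 degrees
plt.show()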
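The log return and the simple return (the commented-out `pct_change()` line) are linked by $r^{log}_t = \ln(1 + r^{simple}_t)$. The sketch below checks this on a short, hypothetical price series; the prices are made up for illustration and are not the Kweichow Moutai data.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, used only for illustration
close = pd.Series([100.0, 102.0, 99.0, 105.0, 104.0])

simple_ret = close.pct_change().dropna()   # P_t / P_{t-1} - 1
log_ret = np.log(close).diff().dropna()    # ln(P_t) - ln(P_{t-1})

# The two are linked by log_ret = ln(1 + simple_ret)
assert np.allclose(log_ret, np.log1p(simple_ret))

# Log returns add up over time: their sum is the log of the total price ratio
total_log_ret = log_ret.sum()
print(total_log_ret, np.log(close.iloc[-1] / close.iloc[0]))
```

This additivity over time is one common reason for preferring log returns in time series work.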

1.4 Time series graphic analysis

1.4.1 Correlation coefficient

Given two vectors, how can we decide whether they are related? A natural idea is to use the angle between them as a measure of distance: the smaller the angle, the smaller the distance; the larger the angle, the larger the distance.

In middle school mathematics we already used the cosine formula to compute angles: $\large \cos\langle \vec a , \vec b \rangle = \frac {\vec a \cdot \vec b}{\lvert \vec a \rvert \lvert \vec b \rvert}$

Here $\large \vec a \cdot \vec b$ is called the inner product, for example $\large (x_1,y_1) \cdot (x_2,y_2) = x_1x_2 + y_1y_2$

Now look at the definition of the correlation coefficient. The correlation coefficient of $X$ and $Y$ is: $\large \rho_{xy} = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$

Writing this out with the sample definitions gives: $\large \rho_{xy} = \frac{\sum_{t=1}^{T}(x_t-\bar x)(y_t-\bar y)}{\sqrt{\sum_{t=1}^{T}(x_t-\bar x)^{2}\sum_{t=1}^{T}(y_t-\bar y)^{2}}} = \frac{\overrightarrow{(X- \bar x)} \cdot \overrightarrow{(Y- \bar y)}}{\lvert \overrightarrow{(X- \bar x)} \rvert \lvert \overrightarrow{(Y- \bar y)} \rvert }$

We find that the correlation coefficient is in fact the cosine of the angle between two vectors in vector space, and the covariance is, up to a factor of $1/T$, the inner product of the two vectors after their means are removed!

If two vectors are parallel, the correlation coefficient equals 1 or -1: 1 when they point in the same direction and -1 when they point in opposite directions. If two vectors are perpendicular, the cosine of the angle is 0, indicating that they are uncorrelated. The closer the two vectors are to parallel, the closer the absolute value of the correlation coefficient is to 1 and the stronger the correlation. Note that the vectors are demeaned (centered) in this calculation, rather than using $X$ and $Y$ directly.
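As a quick numerical check of this geometric view, the sketch below centers two hypothetical vectors, computes the cosine of the angle between them, and compares it with the Pearson correlation coefficient from NumPy. The vectors `x` and `y` are made-up values for illustration.

```python
import numpy as np

# Hypothetical data vectors, chosen only for illustration
x = np.array([9.0, 8.0, 7.0, 5.0, 4.0, 2.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 7.0, 9.0])

# Center both vectors (subtract the mean)
xc = x - x.mean()
yc = y - y.mean()

# Cosine of the angle between the centered vectors
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Pearson correlation coefficient as computed by NumPy
pearson = np.corrcoef(x, y)[0, 1]

print(cos_angle, pearson)   # the two values agree
```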

Subtracting the mean does not change the shape of a series; it only shifts it vertically (a "translation"), as the following plot shows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = pd.Series([9,8,7,5,4,2])
b = a - a.mean()  # remove the mean
plt.figure(figsize=(10,4))
a.plot(label='a')
b.plot(label='mean removed a')
plt.legend()
plt.show()

1.4.2 Autocorrelation Function (ACF)

The correlation coefficient measures the linear correlation between two vectors. For a stationary time series $\{r_t\}$, we sometimes want to know the linear correlation between $r_t$ and its past values $r_{t-i}$. For this we extend the concept of the correlation coefficient to the autocorrelation coefficient.

The correlation coefficient between $r_t$ and $r_{t-l}$ is called the lag-$l$ autocorrelation coefficient of $r_t$, usually denoted $\rho_l$. Specifically:
$\large \rho_l = \frac{Cov(r_t,r_{t-l})} {\sqrt{Var(r_t)Var(r_{t-l})}} = \frac{Cov(r_t,r_{t-l})} {Var(r_t)}$
Here we use a property of weakly stationary series: $\large Var(r_t)=Var(r_{t-l})$

For a sample $\{r_t\}$, $1 \leqslant t \leqslant T$, from a stationary time series, the lag-$l$ sample autocorrelation coefficient is estimated by:
$\large \hat \rho_l = \frac{\sum_{t=l+1}^{T}(r_t- \bar r)(r_{t-l}-\bar r)}{ \sum_{t=1}^{T}(r_t- \bar r)^{2}}, 0 \leqslant l \leqslant T-1$

The function $\large \hat \rho_1,\hat \rho_2 ,\hat \rho_3, \dots$ is then called the sample autocorrelation function (ACF) of $r_t$.

When all values of the autocorrelation function are 0, we consider the series to be completely uncorrelated; therefore, we often need to test whether several autocorrelation coefficients are jointly 0.
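The sample autocorrelation formula can be written out directly in NumPy. The following is a minimal sketch; the function name `sample_acf` is hypothetical, and the alternating test series is chosen only because its autocorrelations are easy to verify by hand.

```python
import numpy as np

def sample_acf(r, max_lag):
    """Sample autocorrelation coefficients rho_hat_0 .. rho_hat_max_lag."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    dev = r - r.mean()                 # deviations from the sample mean
    denom = np.sum(dev ** 2)           # sum over t = 1..T
    acf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # sum over t = lag+1..T of (r_t - mean) * (r_{t-lag} - mean)
        acf[lag] = np.sum(dev[lag:] * dev[:T - lag]) / denom
    return acf

# Alternating series of length 10: each value is the negative of the previous one
x = np.array([1.0, -1.0] * 5)
acf_x = sample_acf(x, max_lag=2)
print(acf_x)   # values: 1.0, -0.9, 0.8
```

The lag-1 value is -9/10 rather than -1 because the numerator has only $T-1$ terms while the denominator has $T$.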

1. Test for autocorrelation

Portmanteau test (Ljung-Box test)

Null hypothesis $H_0: \rho_1 = \dots = \rho_m = 0$
Alternative hypothesis $H_a: \exists i \in \{1,\dots,m\}, \rho_i \ne 0$

Portmanteau test statistic: $\large Q(m) = T (T+2) \sum_{l=1}^{m} \frac{\hat{\rho}_l^{2}}{T-l}$
Under $H_0$, $Q(m)$ asymptotically follows a $\chi^2$ distribution with $m$ degrees of freedom.

Decision rule:
$\large Q(m) > \chi_\alpha^{2} \Rightarrow \text{reject } H_0$
That is, reject $H_0$ when $Q(m)$ exceeds the $100(1-\alpha)$th percentile of the $\chi^2$ distribution with $m$ degrees of freedom.

Most software reports the p-value of $Q(m)$; $H_0$ is rejected when the p-value is less than or equal to the significance level $\alpha$.
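To show how the statistic and its p-value fit together, here is a minimal sketch that computes $Q(m)$ by hand and obtains the p-value from SciPy's chi-square distribution. The function name `ljung_box` and the two simulated series are assumptions for illustration; in practice a library routine would normally be used.

```python
import numpy as np
from scipy import stats

def ljung_box(r, m):
    """Return (Q(m), p-value) for the first m sample autocorrelations of r."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    dev = r - r.mean()
    denom = np.sum(dev ** 2)
    lags = np.arange(1, m + 1)
    rho = np.array([np.sum(dev[l:] * dev[:T - l]) / denom for l in lags])
    q = T * (T + 2) * np.sum(rho ** 2 / (T - lags))
    p_value = stats.chi2.sf(q, df=m)   # P(chi2 with m degrees of freedom > Q)
    return q, p_value

rng = np.random.default_rng(42)
noise = rng.normal(size=500)           # simulated white noise
walk = np.cumsum(noise)                # random walk: strongly autocorrelated

q_noise, p_noise = ljung_box(noise, m=10)
q_walk, p_walk = ljung_box(walk, m=10)
print(f"white noise: Q={q_noise:.2f}, p={p_noise:.3f}")
print(f"random walk: Q={q_walk:.2f}, p={p_walk:.3g}")
```

For the random walk the p-value is essentially zero, so $H_0$ is rejected; for white noise, $Q$ is far smaller.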

Examples are given below:

import statsmodels.api as sm  # statistics library

data = df_lgreturn  # use the log return series
m = 10  # test the first 10 autocorrelation coefficients

# Compute the autocorrelation coefficients, Q statistics, and p-values
acf,q,p = sm.tsa.acf(data,nlags=m,qstat=True)
out = np.c_[range(1,m+1), acf[1:], q, p]
output=pd.DataFrame(out, columns=['lag', "AC", "Q", "P-value"])
output = output.set_index('lag')
output

	AC	Q	P-value
lag
1.0	-0.005176	0.019559	0.888775
2.0	0.011402	0.114591	0.944315
3.0	-0.020999	0.437372	0.932419
4.0	0.038581	1.528482	0.821585
5.0	-0.014422	1.681154	0.891265
6.0	-0.051001	3.593148	0.731538
7.0	0.028434	4.188252	0.757857
8.0	0.036009	5.144016	0.742078
9.0	-0.034035	5.999056	0.740013
10.0	0.017096	6.215102	0.796879

We take the significance level to be 0.05. All p-values are greater than 0.05, so we cannot reject the null hypothesis $H_0$.

Therefore, we conclude that this series, namely the Kweichow Moutai return series, has no significant serial correlation.

2. Plot the ACF

For the original time series $y(t)$, ACF(k) denotes the autocorrelation coefficient at lag $k$:

$\text{ACF}(k) = \text{Corr}(y(t), y(t-k))$

where $\text{Corr}(\cdot)$ denotes the correlation coefficient.

Plotting:

The ACF plot describes the autocorrelation structure of time series data. The horizontal axis shows the lag order and the vertical axis shows the correlation coefficient. The ACF always equals 1 at lag 0 (because each data point has a correlation coefficient of 1 with itself), and for many series it decays as the lag increases. If the ACF cuts off, the values after the cut-off point are close to zero, while the values before it may oscillate. The last lag before the cut-off whose coefficient lies outside the significance band is usually used to determine the order of the MA model.

from statsmodels.graphics.tsaplots import plot_acf as ACF  # autocorrelation plot

fig = ACF(data,lags = 20)
plt.show()

3. Plot the PACF

PACF(k) is the partial correlation coefficient between the current time point and the value at lag $k$ after removing the influence of the intermediate lags. PACF(k) is computed recursively as follows:

$PACF(1) = ACF(1)$

$PACF(k) = \text{Corr}(y(t), y(t-k)) - \sum_{i=1}^{k-1} PACF(i) \times \text{Corr}(y(t-i), y(t-k)), \text{ for } k > 1$

where $\sum_{i=1}^{k-1} PACF(i) \times \text{Corr}(y(t-i), y(t-k))$ is the sum, over all lags $i < k$, of the products of $PACF(i)$ and $\text{Corr}(y(t-i), y(t-k))$. This is a simplified statement of the idea: the exact Durbin-Levinson recursion uses the coefficients of the order-$(k-1)$ autoregression in place of $PACF(i)$ and divides by a normalizing term.

Through this recursion we obtain the PACF sequence of a time series, which helps determine the order of the AR model. PACF is an important tool in time series analysis: it helps us understand the autocorrelation structure of the data and select appropriate models for prediction and analysis.
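As a hedged illustration of the recursive calculation, the sketch below implements the standard Durbin-Levinson recursion, which is the exact form of this idea: at each step it uses the coefficients of the order-$(k-1)$ autoregression and a normalizing denominator. The function name `pacf_from_acf` is hypothetical, and the input is the theoretical ACF of an AR(1) process with coefficient 0.6, chosen because its PACF is known to be 0.6 at lag 1 and 0 afterwards.

```python
import numpy as np

def pacf_from_acf(rho):
    """Partial autocorrelations from autocorrelations rho[0..K] (rho[0] == 1)."""
    rho = np.asarray(rho, dtype=float)
    K = len(rho) - 1
    pacf = np.zeros(K + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(0)             # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, K + 1):
        if k == 1:
            phi_kk = rho[1]
        else:
            # numerator: rho_k - sum_j phi_{k-1,j} * rho_{k-j}
            num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
            # denominator: 1 - sum_j phi_{k-1,j} * rho_j
            den = 1.0 - np.dot(phi_prev, rho[1:k])
            phi_kk = num / den
        # update: phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}
        phi_curr = np.empty(k)
        phi_curr[:k - 1] = phi_prev - phi_kk * phi_prev[::-1]
        phi_curr[k - 1] = phi_kk
        phi_prev = phi_curr
        pacf[k] = phi_kk
    return pacf

# Theoretical ACF of an AR(1) process with coefficient 0.6: rho_k = 0.6 ** k
rho_ar1 = 0.6 ** np.arange(6)
pacf_ar1 = pacf_from_acf(rho_ar1)
print(np.round(pacf_ar1, 6))
```

The PACF is zero after lag 1 even though the ACF decays only gradually, which is why the PACF is used to read off the AR order.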

Plotting:

The PACF plot describes the partial autocorrelation structure of time series data. The horizontal axis again shows the lag order, and the vertical axis shows the partial correlation coefficient. If the PACF cuts off, the values after the cut-off point are close to zero. The last lag before the cut-off whose partial correlation coefficient lies outside the significance band is usually used to determine the order of the AR model.

from statsmodels.graphics.tsaplots import plot_pacf as PACF  # partial autocorrelation plot

fig = PACF(data,lags = 20)
plt.show()

1.5 Stationarity of time series

Roughly speaking, a time series is called stationary if there is no systematic change in the mean (no trend), no systematic change in the variance, and strictly periodic variation has been removed.

# Compare the closing price with its first- and second-order differences
dataset=pd.DataFrame()
dataset['close'] = df['close']
dataset['closeDiff_1'] = df['close'].diff(1)  # first-order difference
dataset['closeDiff_2'] = dataset['closeDiff_1'].diff(1)  # second-order difference
dataset.plot(subplots=True,figsize=(18,12))

1.5.1 Definition

Strictly stationary:

If, for all times $t$, any positive integer $k$, and any $k$ positive integers $\large (t_1,t_2,\dots,t_k)$, the joint distribution of $\large (r_{t_1},r_{t_2},\dots,r_{t_k})$
is the same as that of $\large (r_{t_1 + t},r_{t_2+t},\dots,r_{t_k + t})$, we call the time series $\{r_t\}$ strictly stationary.

In other words, the joint distribution of $\large (r_{t_1},r_{t_2},\dots,r_{t_k})$ is invariant under time shifts, which is a very strong condition. What we usually assume is a weaker form of stationarity.

Weakly stationary:

If the time series $\{r_t\}$ satisfies the following two conditions: $\large E(r_t) = \mu$, where $\mu$ is a constant, and $\large Cov(r_t ,r_{t-l}) = \gamma_l$, where $\gamma_l$ depends only on $l$,
then the time series $\{r_t\}$ is weakly stationary. That is, the mean of the series and the covariance between $r_t$ and $r_{t-l}$ do not change over time, for any integer $l$.

In financial data, what we usually call stationary series is weakly stationary.
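To illustrate what "does not change over time" means, the following sketch simulates many paths of white noise and of a random walk and compares the variance across paths at two time points. Both processes and all parameter values are assumptions chosen for illustration: white noise is weakly stationary, while the variance of a random walk grows in proportion to time.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps = 2000, 400

# Each row is one simulated path
white_noise = rng.normal(size=(n_paths, n_steps))
random_walk = np.cumsum(white_noise, axis=1)

# Variance across paths at an early and a late time point (0-based columns)
t_early, t_late = 99, 399
wn_var = white_noise[:, [t_early, t_late]].var(axis=0)
rw_var = random_walk[:, [t_early, t_late]].var(axis=0)

print("white noise variance at t=100, 400:", wn_var.round(2))
print("random walk variance at t=100, 400:", rw_var.round(2))
```

The white noise variance stays near 1 at both time points, whereas the random walk variance is roughly four times larger at t=400 than at t=100, so the random walk violates weak stationarity.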

Differencing

Differencing (here, against the previous value) takes the difference between the value $r_t$ of the time series $\{r_t\}$ at time $t$ and the value $r_{t-1}$ at time $t-1$, denoted $d_t = r_t - r_{t-1}$. The resulting new series $\{d_t\}$ is the first-order difference of $\{r_t\}$. Applying the same operation to $\{d_t\}$ gives the second-order difference.

Usually a non-stationary series can be turned into a weakly stationary or approximately weakly stationary series by differencing it $d$ times. Looking back at the plot above, the second-order difference appears, by eye, even more stable than the first-order difference.
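A small deterministic sketch shows why differencing removes trends: one difference turns a linear trend into a constant, and two differences do the same for a quadratic trend. The series below are hypothetical and contain no noise, so the effect is easy to see.

```python
import numpy as np
import pandas as pd

t = np.arange(10, dtype=float)

linear = pd.Series(3.0 + 2.0 * t)     # linear trend with slope 2
quadratic = pd.Series(t ** 2)         # quadratic trend

d1_linear = linear.diff().dropna()                # first difference
d1_quadratic = quadratic.diff().dropna()          # still trending: 1, 3, 5, ...
d2_quadratic = quadratic.diff().diff().dropna()   # second difference

print(d1_linear.unique())      # constant, equal to the slope
print(d1_quadratic.values)     # increasing, so one difference is not enough
print(d2_quadratic.unique())   # constant
```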

1.5.2 Stationarity test

In the operations above we judged stationarity mainly by eye, which is error-prone. For financial time series, the ADF test is commonly used to test stationarity.

ADF (Augmented Dickey-Fuller) is a commonly used unit root test. Its null hypothesis is that the series has a unit root, that is, it is non-stationary. To conclude that a series is stationary, the test must reject the null hypothesis at the chosen significance level.

data2 = dataset['close']  # Kweichow Moutai closing price
data2.plot(figsize=(15,4))

Judging from the plot, the series is clearly non-stationary. We now run the ADF unit root test.

# See statsmodels.org for documentation
from statsmodels.tsa.stattools import adfuller as ADF  # stationarity (unit root) test

temp = np.array(data2)
t = ADF(temp)  # ADF test: (statistic, p-value, lags used, nobs, critical values, icbest)

# Collect the results in a one-column table; .loc avoids chained assignment
output=pd.DataFrame(index=['Test Statistic Value', "p-value", "Lags Used", "Number of Observations Used","Critical Value(1%)","Critical Value(5%)","Critical Value(10%)"],columns=['value'])
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
output

	value
Test Statistic Value	-2.30762
p-value	0.169522
Lags Used	0
Number of Observations Used	727
Critical Value(1%)	-3.439377
Critical Value(5%)	-2.865524
Critical Value(10%)	-2.568891

(-2.3076196321727642,
 0.16952226399336345,
 0,
 727,
 {'1%': -3.439376877165393,
  '5%': -2.865523768488869,
  '10%': -2.5688914082860164},
 7104.773703721578)

The p-value is 0.169522, which is greater than the significance level of 0.05, so the null hypothesis (the series has a unit root and is non-stationary) cannot be rejected. We therefore treat the closing price series as non-stationary.

We difference the series once and test again!

data2Diff = data2.diff()  # first-order difference
data2Diff.plot(figsize=(15,4))

temp = np.array(data2Diff)[1:]  # the first value after differencing is NaN, so drop it
t = ADF(temp)  # ADF test
print("p-value:   ",t[1])

p-value:    6.887197234767805e-24

The p-value is very close to 0, so we reject the null hypothesis and conclude that the differenced series is stationary.

Since the series is stationary after one round of differencing, the value of $d$ for the original series is 1.

1.6 White noise sequence and linear time series

1.6.1 White noise sequence

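If a variable $X(t)$ ($t = 1, 2, 3, \dots$) consists of a sequence of mutually uncorrelated random variables, that is, the covariance of $X_t$ and $X_s$ is zero for all $s \ne t$, it is called a purely random process.

If the expectation and variance of a purely random process are both constant, it is called a white noise process. A sample realization of a white noise process is called a white noise sequence, or white noise for short. The name comes from its similarity to white light: the spectrum of white light has the same intensity at every frequency, and the spectral density of white noise has the same value at every frequency.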
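A minimal simulation makes the definition concrete: for white noise the sample mean is near a constant, the sample variance is near a constant, and the correlation between consecutive values is near zero. The sketch assumes Gaussian white noise with mean 0 and variance 1; the Gaussian choice is only for convenience, since white noise does not have to be normally distributed.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
a = rng.normal(loc=0.0, scale=1.0, size=n)   # simulated Gaussian white noise

sample_mean = a.mean()
sample_var = a.var()

# Sample correlation between a_t and a_{t-1}
lag1_corr = np.corrcoef(a[1:], a[:-1])[0, 1]

print(sample_mean, sample_var, lag1_corr)
```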

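1.6.2 Linear time series

A time series $\{r_t\}$ is called a linear series if it can be written as:
$\large r_t = \mu + \sum_{i=0}^{\infty}\psi_i a_{t-i}$
where $\mu$ is the mean of $r_t$, $\psi_0=1$, and $\{a_t\}$ is a white noise sequence. $a_t$ is called the innovation or shock at time $t$.

Many time series are linear, and correspondingly there are many linear time series models. The AR, MA, and ARMA models introduced next are all linear models. However, not all financial time series are linear.

For a weakly stationary series, the properties of white noise make it easy to obtain the mean and variance of $r_t$:
$\large E(r_t) = \mu , \quad Var(r_t) = \sigma_a^2 \sum_{i=0}^{\infty} \psi_i^{2}$
where $\sigma_a^2$ is the variance of $a_t$.

Because $Var(r_t)$ must be finite, $\{\psi_i^2\}$ must be a summable sequence, so $\psi_i^2 \to 0$ as $i \to \infty$.

That is, as $i$ increases, the influence of the distant shock $a_{t-i}$ on $r_t$ gradually vanishes.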
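As a worked example of the variance formula, assume geometrically decaying weights $\psi_i = \phi^i$ with $|\phi| < 1$ (an assumption for illustration; these are the weights of an AR(1) process). Then $\sum_i \psi_i^2$ is a geometric series and $Var(r_t) = \sigma_a^2 / (1 - \phi^2)$. The sketch compares a truncated sum with this closed form.

```python
import numpy as np

phi = 0.5          # assumed decay rate of the psi weights, |phi| < 1
sigma_a2 = 1.0     # assumed variance of the white noise a_t
n_terms = 200      # truncation point for the infinite sum

i = np.arange(n_terms)
psi = phi ** i     # psi_0 = 1, and psi_i -> 0 as i grows

var_truncated = sigma_a2 * np.sum(psi ** 2)
var_closed_form = sigma_a2 / (1.0 - phi ** 2)   # limit of the geometric series

print(var_truncated, var_closed_form)
```

With $|\phi| \geqslant 1$ the weights would not decay, the sum would diverge, and the variance would not be finite.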

1.6.3 White noise test: the portmanteau test

It is most commonly used to test whether model residuals are white noise.

If the residuals are white noise, the model fits well and the residual part is purely random data that cannot be captured.
If the residuals are not white noise, something is wrong with the model, for example poorly tuned parameters, and it needs further optimization. If no amount of tuning makes the residuals white noise, switch to another model or fit a second model to the residuals.

The portmanteau test (Ljung-Box test) is a statistical test for whether autocorrelation is present in time series data, and is an improvement on the Box-Pierce test. It is based on the autocorrelation function (ACF) and evaluates the autocorrelation of the series at different lag orders.

Suppose we have a time series: $x_1, x_2, x_3, \ldots, x_n$

The steps of the Ljung-Box test are as follows:

First, fit a model (for example an ARIMA model) to the time series and obtain its residual series: $e_1, e_2, e_3, \ldots, e_n$. These residuals are the differences between the model's predicted values and the actual values.
Next, compute the autocorrelation coefficients (ACF) of the residual series at different lag orders. The ACF at a given lag is the correlation between the series and a lagged copy of itself.
The Ljung-Box test statistic is computed as:
$Q = n(n+2)\sum_{k=1}^{h} \frac{\hat{\rho}_k^2}{n-k}$
where $h$ is the number of lags and $\hat{\rho}_k$ is the sample autocorrelation coefficient at lag $k$.
The resulting statistic $Q$ approximately follows a chi-square distribution with $h$ degrees of freedom. Typically, we look up the critical value at a significance level (e.g. 0.05): if $Q$ is greater than the critical value, we reject the null hypothesis (that the residual series has no autocorrelation); if $Q$ is less than or equal to the critical value, the null hypothesis cannot be rejected and the residual series is treated as a white noise sequence (no autocorrelation).

The Ljung-Box test uses the autocorrelation coefficients of the residual series to check statistically whether the model has adequately captured the autocorrelation in the data. If the test rejects the null hypothesis, autocorrelation remains in the model residuals, and the model choice should be reconsidered or improved; if the null hypothesis cannot be rejected, the residual series is essentially white noise and the model can be considered reasonably suitable.

There are two ways to code this: one is the method from point 1 of section 1.4.2, and the other is to call the Ljung-Box test function directly:

from statsmodels.stats.diagnostic import acorr_ljungbox as LjungBox #White noise test

Case 1:

Choose a financial asset, collect its data, and compute (1) the simple return and (2) the log return.
Compute summary statistics and carry out a descriptive statistical analysis.
Plot the distribution of the data.
Test whether the distribution is normal.
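One possible starting point for this case is sketched below. It is not a reference solution: the price path is simulated (a hypothetical stand-in for a self-selected asset), and the Jarque-Bera test from SciPy is just one of several normality tests that could be used.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical price path standing in for a self-selected asset
prices = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=500))))

simple_ret = prices.pct_change().dropna()   # (1) simple return
log_ret = np.log(prices).diff().dropna()    # (2) log return

# Descriptive statistics, extended with skewness and excess kurtosis
summary = pd.DataFrame({
    "simple": simple_ret.describe(),
    "log": log_ret.describe(),
})
summary.loc["skew"] = [simple_ret.skew(), log_ret.skew()]
summary.loc["excess_kurtosis"] = [simple_ret.kurt(), log_ret.kurt()]
print(summary)

# Jarque-Bera test: H0 is that the sample comes from a normal distribution
jb_stat, jb_p = stats.jarque_bera(log_ret)
print(f"Jarque-Bera statistic={jb_stat:.3f}, p-value={jb_p:.3f}")
```

For the plotting step, `log_ret.hist(bins=50)` gives a quick view of the distribution. With real stock returns, the excess kurtosis is often well above 0 and normality is frequently rejected.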