Basics of time series analysis - descriptive statistical analysis (preparation work before modeling, code + data)

Topic 1 Basics of Time Series Analysis

This article starts with stock time series data and provides a comprehensive and detailed introduction to the basic knowledge of time series data analysis to help readers better understand and analyze this key area. Clear definitions and relevant examples are provided in the article to better understand the concepts and methods of time series analysis.
The content of this article includes descriptive statistical analysis of time series such as simple trend analysis, white noise analysis, and autocorrelation analysis. It is the preparatory work before time series modeling. These contents can help novices better understand and apply descriptive statistical analysis of time series and prepare for time series modeling.

In the following time series special articles, we will continue to expand and delve into more topics related to time series data analysis, including:

  1. Trend analysis of special dates: Introduces how to analyze and model the impact of special dates and calendar effects (holidays, plus weekly, daily, and yearly patterns) on time series data. This includes real-time adjustments and handling of holiday effects.
  2. Python time series modeling: Provides more specific examples and cases to help readers quickly get started with commonly used time series modeling tools in Python, such as Prophet, ARIMA, and Exponential Smoothing. Code examples show how to apply these models for analysis.
  3. Financial big data application cases: Provide actual financial time series data application cases to show how to apply time series analysis to solve problems in the financial field, such as stock price prediction, risk management, investment strategies, etc.
  4. Recommended learning materials and resources: Provides recommendations for learning materials, tutorials, online courses and related books about financial big data and time series analysis to further learn and deepen knowledge.

Follow the WeChat official account (gzh) "finance melatonin" to obtain the case code and more learning materials on financial big data and related topics.

1.1 Time series definition

A time series is a collection of data points arranged in time order. These data points are typically collected at regular intervals, such as every day, hour, minute, or second.

When a variable or a group of variables $x(t)$ is observed and measured at a series of moments $t_1, t_2, \cdots, t_n$, the resulting ordered set of discrete values is called a time series.

For example: The closing price of a certain stock A on each trading day from June 1, 2020 to June 1, 2022 can constitute a time series; the daily maximum temperature in a certain place can constitute a time series.

1.2 Characteristics of time series

  • Trend: Trend is the overall long-term upward or downward trend in time series data. Trends reflect overall patterns of increase or decrease in data over time.

  • Seasonality: Seasonality is a periodically recurring pattern in time series data. These recurring patterns are usually due to some cyclical factor, such as annual seasonal changes or weekly cyclical changes.

  • Cyclic Patterns: Cyclic patterns are recurring patterns without a fixed period in time series data. Unlike seasonality, cycles may vary in length and do not necessarily follow an obvious calendar pattern.

  • Noise: Noise is the random fluctuation or irregularity in time series data. Noise is often caused by random factors, so the data never match the trend, seasonal, and cyclic patterns exactly.

  • Autocorrelation: Autocorrelation is the correlation between a data point in a time series and the data point at the previous moment (or at several earlier moments). Autocorrelation analysis can help discover lagged patterns in time series.
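To make these characteristics concrete, here is a minimal sketch that builds a synthetic series from a trend, a seasonal pattern, and noise. All numbers (slope 0.5, period 4, noise scale 0.1) are made-up assumptions for illustration, not taken from the stock data used later.

```python
import numpy as np

# Hypothetical series: 24 points with a period-4 seasonal pattern (all values are made up)
t = np.arange(24)
trend = 0.5 * t                                   # steady upward trend
seasonal = np.tile([1.0, 0.0, -1.0, 0.0], 6)      # pattern repeating every 4 steps
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1, size=t.size)        # small random fluctuations

signal = trend + seasonal          # deterministic part
series = signal + noise            # what we would actually observe

# Differencing at the seasonal lag removes the seasonal pattern from the
# deterministic part and leaves only the trend's contribution: 0.5 * 4 = 2.
seasonal_diff = signal[4:] - signal[:-4]

# Lag-1 autocorrelation of the observed series (high, because of the trend)
lag1_autocorr = np.corrcoef(series[1:], series[:-1])[0, 1]
print(seasonal_diff[:5], round(lag1_autocorr, 3))
```

The seasonal difference is constant because the repeating pattern cancels out, while the strong lag-1 autocorrelation reflects the trend: neighbouring points are close to each other.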

1.3 Time series data collection and processing (taking the stock market as an example)

1.3.1 Get the stock code/index code and name list

  • Find the stock code of the target stock on websites such as East Money (Eastmoney), Tonghuashun (Flush), and Sina Finance.

  • Use Python to crawl trading information of corresponding stocks

pip install tushare

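!pip install tushare
# Import the required packages
import pandas as pd
import tushare as ts  # library for fetching stock data (must be installed)
import numpy as np
import matplotlib.pyplot as plt  # plotting
import seaborn as sns     # seaborn produces nicer-looking plots with simpler code
sns.set(color_codes=True)  # set the seaborn plot style
# Fetch stock data by stock code and date range
# (get_k_data is tushare's legacy interface and may not work in newer releases)
df = ts.get_k_data('600519', start='2020-01-01', end='2022-12-31')
# Take a quick look at the stock data:
df.head()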
date open close high low volume code
0 2020-01-02 1022.186 1024.186 1039.246 1010.186 148099.0 600519
1 2020-01-03 1011.186 972.746 1011.186 971.086 130318.0 600519
2 2020-01-06 965.046 972.176 987.086 961.486 63414.0 600519
3 2020-01-07 971.686 988.716 993.186 970.586 47853.0 600519
4 2020-01-08 979.236 982.326 989.686 976.766 25008.0 600519
# To save the stock data to an Excel file, use:
df.to_excel('股价数据.xlsx', index=False)

1.3.2 Draw stock price trend chart

Once we have the stock price data, we can visualize it. We first use the set_index() function to set the date as the row index, which makes it easier to plot directly with pandas. The code is as follows:

df.set_index('date', inplace=True)  # use the date column as the row index
df.head()
open close high low volume code
date
2020-01-02 1022.186 1024.186 1039.246 1010.186 148099.0 600519
2020-01-03 1011.186 972.746 1011.186 971.086 130318.0 600519
2020-01-06 965.046 972.176 987.086 961.486 63414.0 600519
2020-01-07 971.686 988.716 993.186 970.586 47853.0 600519
2020-01-08 979.236 982.326 989.686 976.766 25008.0 600519
df['close'].plot(figsize=(18,4))  # plot the closing price
# plt.xticks(rotation=45)  # rotate the x-axis labels by 45 degrees
plt.legend()  # show the legend
plt.show()


1.3.3 Calculation of rate of return

# Compute the log returns
df_lgreturn = np.log(df['close']) - np.log(df['close'].shift(1))  # difference of log prices
# Simple returns would be: df['close'].pct_change().dropna()
df_lgreturn.head()
df_lgreturn = df_lgreturn.dropna()  # drop the missing first value
df_lgreturn.head()
df_lgreturn.plot(figsize=(18,4))
# plt.xticks(rotation=45)  # rotate the x-axis labels by 45 degrees
plt.show()
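The code above works with log returns, while the commented-out pct_change() line gives simple returns. The following sketch, using three made-up prices rather than market data, shows how the two are related and why log returns are convenient: they add up over time.

```python
import numpy as np

# Hypothetical closing prices (made-up numbers, not market data)
prices = np.array([100.0, 110.0, 99.0])

simple_ret = prices[1:] / prices[:-1] - 1      # (P_t - P_{t-1}) / P_{t-1}
log_ret = np.diff(np.log(prices))              # ln(P_t) - ln(P_{t-1})

# The two are linked by log_ret = ln(1 + simple_ret)
same = np.allclose(log_ret, np.log1p(simple_ret))

# Log returns add up over time: their sum is the log return of the whole period
total_log_ret = log_ret.sum()
print(simple_ret, log_ret, same, total_log_ret)
```

Here a 10% rise followed by a 10% fall leaves the price at 99, and the summed log returns equal ln(99/100), whereas the simple returns do not sum to the overall return.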


1.4 Time series graphic analysis

1.4.1 Correlation coefficient

We want to quantify whether two vectors are related. A natural idea is to use the angle between them as a measure of distance: the smaller the angle, the smaller the distance; the larger the angle, the larger the distance.

In middle-school mathematics we already used the cosine formula to compute angles: $$\cos\langle \vec a, \vec b\rangle = \frac{\vec a \cdot \vec b}{\lvert \vec a \rvert \lvert \vec b \rvert}$$

Here $\vec a \cdot \vec b$ is called the inner product, for example $(x_1,y_1) \cdot (x_2,y_2) = x_1x_2 + y_1y_2$.

Now look at the definition of the correlation coefficient. The correlation coefficient of $X$ and $Y$ is: $$\rho_{xy} = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$$

Written out for a sample of $T$ observations, this becomes: $$\rho_{xy} = \frac{\sum_{t=1}^{T}(x_t-\bar x)(y_t-\bar y)}{\sqrt{\sum_{t=1}^{T}(x_t-\bar x)^{2}\sum_{t=1}^{T}(y_t-\bar y)^{2}}} = \frac{\overrightarrow{(X-\bar x)} \cdot \overrightarrow{(Y-\bar y)}}{\lvert \overrightarrow{(X-\bar x)} \rvert \lvert \overrightarrow{(Y-\bar y)} \rvert}$$

So the correlation coefficient is the cosine of the angle between two vectors in vector space, and the covariance is, up to a factor of $1/T$, the inner product of the two vectors after removing the mean.

If two vectors are parallel, the correlation coefficient is equal to 1 or -1: 1 when they point in the same direction and -1 when they point in opposite directions. If two vectors are perpendicular, the cosine of the angle is 0, indicating that they are uncorrelated. The closer the two vectors are to parallel, the closer the absolute value of the correlation coefficient is to 1, and the stronger the correlation. The only difference from the plain cosine formula is that the vectors are demeaned (centered) before the calculation, rather than using $X$ and $Y$ directly.

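Subtracting the mean is a "translation": it shifts the whole series up or down without changing its shape, as shown in the following figure. Note that the correlation coefficient is the cosine of the angle between the shifted vectors, which generally differs from the cosine of the angle between the raw vectors.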

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = pd.Series([9,8,7,5,4,2])
b = a - a.mean()  # remove the mean
plt.figure(figsize=(10,4))
a.plot(label='a')
b.plot(label='mean removed a')
plt.legend()
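The geometric reading of the correlation coefficient can be checked numerically. The sketch below compares NumPy's Pearson correlation with the cosine of the angle between the demeaned vectors; the helper `cosine` and the second series `y` are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical data: x reuses the small series from the code above, y is made up
x = np.array([9.0, 8.0, 7.0, 5.0, 4.0, 2.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

corr_numpy = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
corr_cosine = cosine(x - x.mean(), y - y.mean())      # cosine after demeaning
raw_cosine = cosine(x, y)                             # cosine without demeaning

print(corr_numpy, corr_cosine, raw_cosine)
```

The first two numbers agree, while the raw cosine is different (it is positive here because all values are positive), which is why the centering step matters.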



1.4.2 Autocorrelation Function (ACF)

The correlation coefficient measures the linear correlation between two vectors. For a stationary time series $\{r_t\}$, we sometimes want to know the linear correlation between $r_t$ and its past values $r_{t-i}$. To do so, we extend the concept of the correlation coefficient to the autocorrelation coefficient.

The correlation coefficient between $r_t$ and $r_{t-l}$ is called the lag-$l$ autocorrelation coefficient of $r_t$, usually denoted $\rho_l$. Specifically:
$$\rho_l = \frac{Cov(r_t,r_{t-l})}{\sqrt{Var(r_t)Var(r_{t-l})}} = \frac{Cov(r_t,r_{t-l})}{Var(r_t)}$$
This uses the property of weakly stationary series that $Var(r_t)=Var(r_{t-l})$.

For a sample $\{r_t\}$, $1 \le t \le T$, of a stationary time series, the lag-$l$ sample autocorrelation coefficient is estimated by:
$$\hat \rho_l = \frac{\sum_{t=l+1}^{T}(r_t-\bar r)(r_{t-l}-\bar r)}{\sum_{t=1}^{T}(r_t-\bar r)^{2}}, \quad 0 \leqslant l \leqslant T-1$$

The function $\hat \rho_1, \hat \rho_2, \hat \rho_3, \ldots$ is called the sample autocorrelation function (ACF) of $r_t$.

When all the values in the autocorrelation function are 0, we consider the series to be serially uncorrelated; therefore, we often need to test whether several autocorrelation coefficients are jointly 0.
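The sample estimator can be written in a few lines of NumPy. This is a minimal sketch; the function name `sample_acf` and the four-point toy series are assumptions chosen so that the arithmetic can be checked by hand.

```python
import numpy as np

def sample_acf(r, max_lag):
    """Sample autocorrelations for lags 0..max_lag, following the estimator above."""
    r = np.asarray(r, dtype=float)
    d = r - r.mean()                      # deviations from the sample mean
    denom = np.sum(d ** 2)                # sum over t = 1..T
    return np.array([np.sum(d[l:] * d[:len(d) - l]) / denom
                     for l in range(max_lag + 1)])

# Tiny hypothetical series so the arithmetic can be checked by hand
r = [1, -1, 1, -1]
acf_values = sample_acf(r, max_lag=3)
print(acf_values)   # lag 0 is always 1
```

For this alternating series the numerator at lag 1 is (-1) + (-1) + (-1) = -3 and the denominator is 4, so the lag-1 autocorrelation is -0.75.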

1. Check whether there is autocorrelation

Portmanteau test (Ljung-Box test)

Null hypothesis $H_0: \rho_1 = \cdots = \rho_m = 0$
Alternative hypothesis $H_1: \exists\, i \in \{1, \ldots, m\},\ \rho_i \ne 0$

Portmanteau test statistic: $$Q(m) = T(T+2) \sum_{l=1}^{m} \frac{\hat{\rho}_l^{2}}{T-l}$$
Under $H_0$, $Q(m)$ asymptotically follows a $\chi^2$ distribution with $m$ degrees of freedom.

Decision rule:
$$Q(m) > \chi_\alpha^{2} \Rightarrow \text{reject } H_0$$
That is, $H_0$ is rejected when $Q(m)$ exceeds the $100(1-\alpha)$th percentile of the $\chi^2$ distribution with $m$ degrees of freedom.

Most software reports the p-value of $Q(m)$; $H_0$ is rejected when the p-value is less than or equal to the significance level $\alpha$.
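As a rough illustration of the statistic and the decision rule, the sketch below computes $Q(m)$ directly from its formula. The function name `ljung_box_q` and both example series are assumptions; the critical values are the 95th percentiles of the $\chi^2$ distribution taken from standard tables, so no statistics library is needed.

```python
import numpy as np

def ljung_box_q(r, m):
    """Q(m) = T (T + 2) * sum_{l=1..m} rho_hat_l^2 / (T - l)."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    d = r - r.mean()
    denom = np.sum(d ** 2)
    q = 0.0
    for l in range(1, m + 1):
        rho_l = np.sum(d[l:] * d[:T - l]) / denom   # lag-l sample autocorrelation
        q += rho_l ** 2 / (T - l)
    return T * (T + 2) * q

# 95th percentiles of the chi-square distribution from standard tables (alpha = 0.05)
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 5: 11.070, 10: 18.307}

# Hand-checkable toy series, and a strongly trending (hence autocorrelated) one
q_toy = ljung_box_q([1, -1, 1, -1], m=2)
q_trend = ljung_box_q(np.arange(100), m=10)
reject_trend = q_trend > CHI2_CRIT_5PCT[10]
print(q_toy, q_trend, reject_trend)
```

For the toy series, $T = 4$, $\hat\rho_1 = -0.75$ and $\hat\rho_2 = 0.5$, so $Q(2) = 4 \cdot 6 \cdot (0.5625/3 + 0.25/2) = 7.5$. The trending series produces a very large $Q(10)$, so the null hypothesis of no autocorrelation is rejected for it.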

Examples are given below:

from scipy import stats
import statsmodels.api as sm  # statistics library

data = df_lgreturn  # use the log-return series
m = 10  # test the first 10 autocorrelation coefficients

acf, q, p = sm.tsa.acf(data, nlags=m, qstat=True)  # autocorrelations, Q statistics, and p-values
out = np.c_[range(1, m + 1), acf[1:], q, p]  # acf[0] is lag 0, so skip it
output = pd.DataFrame(out, columns=['lag', "AC", "Q", "P-value"])
output = output.set_index('lag')
output
AC Q P-value
lag
1.0 -0.005176 0.019559 0.888775
2.0 0.011402 0.114591 0.944315
3.0 -0.020999 0.437372 0.932419
4.0 0.038581 1.528482 0.821585
5.0 -0.014422 1.681154 0.891265
6.0 -0.051001 3.593148 0.731538
7.0 0.028434 4.188252 0.757857
8.0 0.036009 5.144016 0.742078
9.0 -0.034035 5.999056 0.740013
10.0 0.017096 6.215102 0.796879
We take the significance level to be 0.05. All p-values are greater than 0.05, so we fail to reject the null hypothesis $H_0$ (strictly speaking, a test never "accepts" it).

Therefore, we conclude that this series, namely the Kweichow Moutai return series, has no significant serial correlation.

2. Draw ACF graphics

Compute the ACF of the original time series $y(t)$. The autocorrelation coefficient at lag $k$ is:

$$ACF(k) = \text{Corr}(y(t), y(t-k))$$

where $\text{Corr}(\cdot)$ denotes the correlation coefficient.

Order determination:

The ACF plot describes the autocorrelation structure of time series data. The horizontal axis shows the lag and the vertical axis shows the correlation coefficient. The ACF always equals 1 at lag 0 (each data point has a correlation of 1 with itself) and typically decreases as the lag increases. If the ACF cuts off, values after the cutoff lag are close to zero (inside the confidence band), while values before it may oscillate. The last lag at which the autocorrelation still lies significantly outside the band is usually taken as the order of the MA model.

from statsmodels.graphics.tsaplots import plot_acf as ACF  # autocorrelation plot

fig = ACF(data,lags = 20)
plt.show()
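To see what an ACF cutoff looks like in the ideal case, the sketch below computes the theoretical ACF of a moving-average process from its coefficients, using the standard result that the lag-$k$ autocovariance of an MA(q) process is proportional to $\sum_i \psi_i \psi_{i+k}$. The function name and the coefficient 0.5 are illustrative assumptions.

```python
import numpy as np

def ma_theoretical_acf(theta, max_lag):
    """Theoretical ACF of r_t = a_t + theta_1 a_{t-1} + ... + theta_q a_{t-q}."""
    psi = np.r_[1.0, np.asarray(theta, dtype=float)]    # weights (1, theta_1, ..., theta_q)
    q = len(psi) - 1
    gamma0 = np.sum(psi ** 2)                           # variance up to the factor sigma_a^2
    acf = [np.sum(psi[k:] * psi[:len(psi) - k]) / gamma0 if k <= q else 0.0
           for k in range(max_lag + 1)]
    return np.array(acf)

# Hypothetical MA(1) with theta = 0.5: the ACF is non-zero at lag 1 only
ma1_acf = ma_theoretical_acf([0.5], max_lag=4)
print(ma1_acf)
```

For this MA(1) example the lag-1 value is $0.5 / (1 + 0.5^2) = 0.4$ and every later lag is exactly zero. In a sample ACF plot the later lags would not be exactly zero, only inside the confidence band.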


3. Draw PACF graphics

Calculate PACF(k), the partial correlation between the current value $y(t)$ and its lag-$k$ value $y(t-k)$ after removing the influence of the intermediate lags. PACF(k) can be computed recursively:

$$PACF(1) = ACF(1)$$

$$PACF(k) = \left[ \text{Corr}(y(t), y(t-k)) - \sum_{i=1}^{k-1} PACF(i) \times \text{Corr}(y(t-i), y(t-k)) \right], \text{ for } k > 1$$

where $\sum_{i=1}^{k-1} PACF(i) \times \text{Corr}(y(t-i), y(t-k))$ is the sum, over all lags $i < k$, of the products of $PACF(i)$ and the correlation between $y(t-i)$ and $y(t-k)$. This is a simplified schematic: the exact Durbin-Levinson recursion uses the order-$(k-1)$ autoregressive coefficients in the sum and divides by a normalizing term.

Through this recursive calculation we obtain the PACF of the time series, which helps determine the order of the AR model. PACF is an important tool in time series analysis: it helps us understand the autocorrelation structure of the data and select appropriate models for prediction and analysis.

Order determination:

The PACF plot describes the partial autocorrelation structure of time series data. The horizontal axis again shows the lag and the vertical axis shows the partial correlation coefficient. If the PACF cuts off, values after the cutoff lag are close to zero (inside the confidence band). The last lag at which the partial autocorrelation still lies significantly outside the band is usually taken as the order of the AR model.

from statsmodels.graphics.tsaplots import plot_pacf as PACF  # partial autocorrelation plot

fig = PACF(data,lags = 20)
plt.show()
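For readers who want to see a recursive PACF calculation end to end, here is a minimal sketch of the Durbin-Levinson recursion, one standard way to obtain partial autocorrelations from autocorrelations. It is not necessarily the method used internally by plot_pacf; the function name and the test inputs are assumptions.

```python
import numpy as np

def pacf_durbin_levinson(rho, max_lag):
    """PACF from autocorrelations rho[0..max_lag] (rho[0] = 1) via Durbin-Levinson."""
    rho = np.asarray(rho, dtype=float)
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    phi_prev = np.array([])                     # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
        else:
            num = rho[k] - np.sum(phi_prev * rho[k - 1:0:-1])
            den = 1.0 - np.sum(phi_prev * rho[1:k])
            phi_kk = num / den
        # update: phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}
        phi_prev = np.r_[phi_prev - phi_kk * phi_prev[::-1], phi_kk]
        pacf[k] = phi_kk
    return pacf

# Hypothetical AR(1) with coefficient 0.5: its theoretical ACF is 0.5 ** k
ar1_pacf = pacf_durbin_levinson(0.5 ** np.arange(5), max_lag=4)
print(ar1_pacf)
```

For the AR(1) input the PACF is 0.5 at lag 1 and zero afterwards, which is the cutoff pattern used to pick the AR order.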


1.5 Stationarity of time series

Roughly speaking, a time series is called stationary if there is no systematic change in the mean (no trend), no systematic change in the variance, and no strictly periodic variation.

# Compare the original series with its first- and second-order differences
dataset = pd.DataFrame()
dataset['close'] = df['close']
dataset['closeDiff_1'] = df['close'].diff(1)  # first-order difference
dataset['closeDiff_2'] = dataset['closeDiff_1'].diff(1)  # second-order difference
dataset.plot(subplots=True, figsize=(18,12))


1.5.1 Definition

  1. Strictly stationary:

If, for all $t$, any positive integer $k$, and any $k$ positive integers $(t_1, t_2, \ldots, t_k)$, the joint distribution of $(r_{t_1}, r_{t_2}, \ldots, r_{t_k})$ is the same as the joint distribution of $(r_{t_1+t}, r_{t_2+t}, \ldots, r_{t_k+t})$, we call the time series $\{r_t\}$ strictly stationary.

In other words, the joint distribution of $(r_{t_1}, r_{t_2}, \ldots, r_{t_k})$ is invariant under time shifts. This is a strong condition; what we usually assume is a weaker form of stationarity.

  2. Weakly stationary:

If the time series $\{r_t\}$ satisfies the following two conditions:
$$E(r_t) = \mu, \quad \mu \text{ is a constant}$$
$$Cov(r_t, r_{t-l}) = \gamma_l, \quad \gamma_l \text{ depends only on } l$$
then the time series $\{r_t\}$ is weakly stationary. That is, the mean of the series and the covariance between $r_t$ and $r_{t-l}$ do not change with time, for any integer $l$.

In financial data, what we usually call a stationary series is weakly stationary.

  3. Differencing:

Differencing (here, each value minus the previous one) takes the value $r_t$ of the series $\{r_t\}$ at time $t$ and subtracts the value $r_{t-1}$ at time $t-1$. Denoting the result $d_t = r_t - r_{t-1}$, we get a new series $\{d_t\}$, the first-order difference of $\{r_t\}$. Applying the same operation to $\{d_t\}$ gives the second-order difference.

Usually a non-stationary series can be turned into a weakly stationary or approximately weakly stationary series by differencing it $d$ times. Looking back at the plot above, the second-order differenced series looks visually steadier than the first-order one, although a visual impression alone is not conclusive.
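The effect of repeated differencing is easiest to see on a series with a known trend. The sketch below uses a made-up quadratic sequence and NumPy's np.diff; note that pandas' .diff(), used in the code above, additionally keeps a leading NaN for each round of differencing.

```python
import numpy as np

# Hypothetical series with a quadratic trend: r_t = t^2
r = np.arange(6) ** 2            # [0, 1, 4, 9, 16, 25]

d1 = np.diff(r, n=1)             # first-order difference d_t = r_t - r_{t-1}
d2 = np.diff(r, n=2)             # second-order difference: difference of d_t

print(d1)   # still trending (linear)
print(d2)   # constant, so the trend is gone
```

One round of differencing turns the quadratic trend into a linear one, and the second round removes it entirely. For a series with only a linear trend, a single difference would already be enough.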

1.5.2 Stationarity test

In the steps above we judged stationarity mainly by eye, which is error-prone. For financial time series, we often use the ADF (Augmented Dickey-Fuller) test to test stationarity.

ADF is a commonly used unit root test. Its null hypothesis is that the series has a unit root, that is, it is non-stationary. To conclude that a series is stationary, the test must reject the null hypothesis at the chosen significance level.

data2 = dataset['close']  # Kweichow Moutai closing price
data2.plot(figsize=(15,4))


Looking at the graph, it is obviously non-stationary here. Then we perform the ADF unit root test.

# statsmodels.org
from statsmodels.tsa.stattools import adfuller as ADF  # stationarity (unit root) test

temp = np.array(data2)
t = ADF(temp)  # ADF test: (statistic, p-value, lags used, n obs, critical values, best IC)

output = pd.DataFrame(index=['Test Statistic Value', "p-value", "Lags Used", "Number of Observations Used", "Critical Value(1%)", "Critical Value(5%)", "Critical Value(10%)"], columns=['value'])
# Assign with .loc; chained indexing such as output['value'][...] may not write through
output.loc['Test Statistic Value', 'value'] = t[0]
output.loc['p-value', 'value'] = t[1]
output.loc['Lags Used', 'value'] = t[2]
output.loc['Number of Observations Used', 'value'] = t[3]
output.loc['Critical Value(1%)', 'value'] = t[4]['1%']
output.loc['Critical Value(5%)', 'value'] = t[4]['5%']
output.loc['Critical Value(10%)', 'value'] = t[4]['10%']
output
value
Test Statistic Value -2.30762
p-value 0.169522
Lags Used 0
Number of Observations Used 727
Critical Value(1%) -3.439377
Critical Value(5%) -2.865524
Critical Value(10%) -2.568891
t
(-2.3076196321727642,
 0.16952226399336345,
 0,
 727,
 {'1%': -3.439376877165393,
  '5%': -2.865523768488869,
  '10%': -2.5688914082860164},
 7104.773703721578)

The p-value is 0.169522, which is greater than the significance level of 0.05, so the null hypothesis (the series has a unit root and is non-stationary) cannot be rejected. We therefore treat the closing price series as non-stationary.

We difference the series once and test again.

data2Diff = data2.diff()  # first-order difference
data2Diff.plot(figsize=(15,4))


temp = np.array(data2Diff)[1:]  # the first value after differencing is NaN, so drop it
t = ADF(temp)  # ADF test
print("p-value:   ",t[1])
p-value:    6.887197234767805e-24

The p-value is very close to 0, so we reject the null hypothesis: the series after one round of differencing is stationary. For the original series, the value of $d$ is therefore 1.
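The reading of the two ADF results can be summarized in a small helper. This is only a sketch: `adf_decision` is a hypothetical function, not part of statsmodels, and it is fed the numbers printed above rather than calling the test again.

```python
def adf_decision(statistic, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result; H0 is 'the series has a unit root (non-stationary)'."""
    reject = p_value <= alpha
    # significance levels whose critical value the test statistic falls below
    passed = [level for level, crit in critical_values.items() if statistic < crit]
    verdict = "stationary" if reject else "non-stationary (H0 not rejected)"
    return reject, passed, verdict

# Values copied from the closing-price ADF output above
crit = {'1%': -3.439377, '5%': -2.865524, '10%': -2.568891}
result_close = adf_decision(-2.30762, 0.169522, crit)
print(result_close)

# p-value reported for the differenced series (only the p-value was printed)
reject_diff = 6.887197234767805e-24 <= 0.05
print(reject_diff)
```

The closing-price statistic of -2.30762 is above all three critical values, which matches the p-value of about 0.17: the unit root cannot be rejected at any of the usual levels, whereas the differenced series rejects it decisively.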

1.6 White noise sequence and linear time series

1.6.1 White noise sequence

If a sequence of random variables $X(t)$ ($t = 1, 2, 3, \ldots$) consists of mutually uncorrelated random variables, that is, for all $s \ne t$ the covariance of $X_t$ and $X_s$ is zero, it is called a purely random process.

For a purely random process, if its expectation and variance are both constant, it is called a white noise process. A sample realization of a white noise process is called a white noise series, or white noise for short. It is called white noise because its properties resemble those of white light: the spectrum of white light has the same intensity at every frequency, and the spectral density of white noise takes the same value at every frequency.

1.6.2 Linear time series

A time series $\{r_t\}$ is called a linear series if it can be written as:
$$r_t = \mu + \sum_{i=0}^{\infty}\psi_i a_{t-i}$$
where $\mu$ is the mean of $r_t$, $\psi_0 = 1$, and $\{a_t\}$ is a white noise series. Here $a_t$ is called the innovation or shock at time $t$.

Many time series are linear, and there are correspondingly many linear time series models. The AR, MA, and ARMA models introduced later are all linear models. However, not all financial time series are linear.

For a weakly stationary series, the properties of white noise readily give the mean and variance of $r_t$:
$$E(r_t) = \mu, \quad Var(r_t) = \sigma_a^2 \sum_{i=0}^{\infty} \psi_i^{2}$$
where $\sigma_a^2$ is the variance of $a_t$.

Because $Var(r_t)$ must be finite, the series $\{\psi_i^2\}$ must be summable, so $\psi_i^2 \to 0$ as $i \to \infty$.

That is, as $i$ increases, the influence of the distant shock $a_{t-i}$ on $r_t$ gradually vanishes.
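The variance formula can be checked numerically for one concrete choice of weights. The sketch below assumes geometrically decaying weights $\psi_i = \phi^i$ with $\phi = 0.5$ and $\sigma_a^2 = 1$; these values are illustrative assumptions, not taken from the stock data.

```python
import numpy as np

# Hypothetical weights psi_i = phi^i (psi_0 = 1), with phi = 0.5 and sigma_a^2 = 1
phi, sigma2_a = 0.5, 1.0
psi = phi ** np.arange(200)              # truncate the infinite sum at 200 terms

var_r = sigma2_a * np.sum(psi ** 2)      # Var(r_t) = sigma_a^2 * sum of psi_i^2
closed_form = sigma2_a / (1 - phi ** 2)  # limit of the geometric series

print(var_r, closed_form, psi[:5] ** 2)  # squared weights shrink toward 0
```

The truncated sum agrees with the closed form $\sigma_a^2 / (1 - \phi^2) = 4/3$, and the squared weights decay quickly, so distant shocks contribute almost nothing to $r_t$.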

1.6.3 White noise test: portmanteau test

This test is commonly used to check whether model residuals are white noise:

  1. If the residuals are white noise, the model fits well, and the residual part is purely random data that cannot be captured.
  2. If the residuals are not white noise, something is wrong with the model, for example poorly tuned parameters, and it needs further optimization. If no amount of optimization makes the residuals white noise, switch models or fit a second model to the residuals.

The portmanteau test (Ljung-Box test) is a statistical test for autocorrelation in time series data and an improvement on the Box-Pierce test. It is based on the autocorrelation function (ACF) and assesses the autocorrelation of the data at different lags.

Suppose we have time series data $x_1, x_2, x_3, \ldots, x_n$.

The steps of the Ljung-Box test are as follows:

  1. First, fit a model (for example, an ARIMA model) to the time series and obtain its residual series $e_1, e_2, e_3, \ldots, e_n$. These residuals are the differences between the model's predicted values and the actual values.

  2. Next, compute the autocorrelation coefficients (ACF) of the residual series at different lags. The ACF at a given lag is the correlation between the series and a lagged copy of itself.

  3. The Ljung-Box test statistic is computed as:
    $$Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^2}{n-k}$$
    where $n$ is the sample size, $h$ is the number of lags tested, and $\hat{\rho}_k$ is the lag-$k$ sample autocorrelation coefficient.

  4. After the calculation we obtain the statistic $Q$. Under the null hypothesis, $Q$ approximately follows a chi-square distribution with $h$ degrees of freedom. At a chosen significance level (e.g. 0.05) we look up the critical value: if $Q$ is greater than the critical value, we reject the null hypothesis (that the residual series has no autocorrelation); if $Q$ is less than or equal to the critical value, the null hypothesis cannot be rejected and the residual series is treated as white noise (no autocorrelation).

The principle of the Ljung-Box test is to use the autocorrelation coefficients of the residual series to test statistically whether the model has captured the autocorrelation in the data. If the test rejects the null hypothesis, autocorrelation remains in the model residuals, and the model choice should be reconsidered or improved; if the null hypothesis is not rejected, the residual series is essentially white noise and the model can be considered reasonably suitable.

There are two ways to code this: one is the method shown in item 1 of section 1.4.2, and the other is to call acorr_ljungbox directly:

from statsmodels.stats.diagnostic import acorr_ljungbox as LjungBox  # white noise test

# Test lags 1 to 10 of the return series; large p-values mean no evidence against white noise
LjungBox(data, lags=10, return_df=True)

Case 1 :

  1. Organize and calculate self-selected financial asset data: (1) simple rate of return; (2) logarithmic rate of return.
  2. Calculate statistical indicators and conduct descriptive statistical analysis accordingly;
  3. Plot the distribution of the data.
  4. Test whether the distribution follows a normal distribution.



Origin blog.csdn.net/celiaweiwei/article/details/133809965