Three-minute introduction to quantitative investing (5): Inferential statistics on returns

Hello, I'm Maodou. This series uses the most streamlined code and examples to get you started with quantitative investing quickly, with no filler. If you want to learn quantitative investing but don't know where to begin, read on!

Previous installments:

Three-minute introduction to quantitative investing (1): Getting market data & drawing candlestick charts

Three-minute introduction to quantitative investing (2): Introduction to the Tushare Pro data interface

Three-minute introduction to quantitative investing (3): Calculating the rate of return

Three-minute introduction to quantitative investing (4): Statistical analysis of market data

This installment shows how to do inferential statistics in Python. Inferential statistics means inferring the statistical characteristics of a population from limited sample data. It falls into two categories, parameter estimation and hypothesis testing, both covered below. Maodou updates this series every weekend, so bookmark it for easy reference.

1. Plotting a return histogram

Taking the returns of the Shanghai Composite Index as an example, we calculate the sample mean and standard deviation and draw a histogram.

First import the relevant packages and authenticate the Tushare Pro account.

import tushare as ts
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Replace 'your token' with your own Tushare Pro token
pro = ts.pro_api('your token')

Get the Shanghai Composite Index data for this year to date:

df1 = ts.get_k_data('sh', start='2023-01-01', end='2023-05-26')  # Shanghai Composite Index
df1.tail()

This returns the last five rows of the data.

Use the method from the earlier installment to calculate the daily return of the Shanghai Composite Index:

df1['lagclose'] = df1.close.shift(1)  # previous day's close
df1['SHRet'] = (df1['close'] - df1['lagclose']) / df1['lagclose']
df1.tail()

The last five rows now include the lagclose and SHRet columns.
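The shift-and-divide calculation above gives the same result as pandas' built-in pct_change(). Here is a minimal sketch on made-up closing prices (the numbers are illustrative, not real index data), assuming only pandas and numpy are installed:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only (not real index data)
close = pd.Series([100.0, 102.0, 101.0, 103.0])

# Manual calculation, as in the article: (close - previous close) / previous close
lagclose = close.shift(1)
manual_ret = (close - lagclose) / lagclose

# Built-in equivalent
builtin_ret = close.pct_change()

# The first value is NaN in both cases because there is no previous close
print(manual_ret.tolist())
print(np.allclose(manual_ret.dropna(), builtin_ret.dropna()))
```

Either form works; the explicit version keeps the lagclose column visible for inspection.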

Calculate the mean and standard deviation of the returns:

SHRet = df1.SHRet.dropna()
mu = SHRet.mean()    # mean
sigma = SHRet.std()  # standard deviation
print('Mean: {}'.format(mu))
print('Standard deviation: {}'.format(sigma))

This prints the sample mean and standard deviation.

A single line of code draws the return histogram of the Shanghai Composite Index:

SHRet.hist()

The result is a histogram of the daily returns with pandas' default 10 bins.

A normal distribution curve can also be overlaid on the histogram:

x = np.arange(-0.04, 0.04, 0.002)
plt.hist(SHRet, density=True)
plt.plot(x, stats.norm.pdf(x, mu, sigma))

The plot shows the histogram together with the normal density curve that has the sample mean and standard deviation.
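The density=True argument matters here: it rescales the bar heights so that the total area of the histogram is 1, which puts it on the same scale as the normal probability density function. A small sketch with simulated returns (the mean 0.0005 and standard deviation 0.01 are arbitrary assumptions, not estimates from the index data):

```python
import numpy as np
from scipy import stats

# Simulated daily returns; the parameters are arbitrary and for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0005, scale=0.01, size=1000)

# density=True makes the bar heights integrate to 1 over the bins
heights, edges = np.histogram(sample, bins=20, density=True)
area = np.sum(heights * np.diff(edges))
print(area)

# The fitted normal pdf evaluated at the bin centres is on the same scale
centres = (edges[:-1] + edges[1:]) / 2
pdf = stats.norm.pdf(centres, sample.mean(), sample.std())
print(np.round(heights[:3], 2), np.round(pdf[:3], 2))
```

Without density=True the bars would show raw counts, and the density curve would look flat next to them.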

2. Parameter estimation

There are two forms of parameter estimation: point estimation and interval estimation.

Point estimation, also known as fixed-value estimation, uses a sample statistic directly as the estimate of a population parameter, without taking the estimation error into account. For example, the sample mean is used to estimate the population mean, and the sample variance to estimate the population variance.

Interval estimation takes the estimation error into account. Building on the point estimate, it gives a range for the population parameter that holds with a stated probability. The lower and upper limits of the interval are the sample statistic minus and plus the margin of error. The range inferred for the population parameter is the confidence interval, and the reliability of the estimate is the confidence level.

For a fixed sample size, raising the confidence level widens the confidence interval, which lowers the precision of the estimate; improving the precision necessarily lowers the confidence level. The demands for confidence and precision therefore pull against each other. To satisfy both at once, you must increase the sample size.
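The trade-off between confidence and precision can be seen numerically. The sketch below uses simulated returns (the distribution parameters and sample sizes are arbitrary assumptions) and a small helper, ci_width, which is a hypothetical name introduced only for this illustration:

```python
import numpy as np
from scipy import stats

def ci_width(sample, confidence):
    """Width of the t-based confidence interval for the mean."""
    low, high = stats.t.interval(confidence, len(sample) - 1,
                                 np.mean(sample), stats.sem(sample))
    return high - low

# Simulated returns; the distribution parameters are arbitrary assumptions
rng = np.random.default_rng(1)
small = rng.normal(0.0, 0.01, size=100)
large = np.concatenate([small, rng.normal(0.0, 0.01, size=900)])

print(ci_width(small, 0.90))  # lower confidence, narrower interval
print(ci_width(small, 0.99))  # higher confidence, wider interval
print(ci_width(large, 0.99))  # same confidence, ten times the data
```

With the larger sample, the 99% interval shrinks again, which is the only way to gain confidence and precision together.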

1. Python functions for interval estimation

If the data follow a normal or approximately normal distribution, use the interval() function of the norm class in the stats module. For a t-distribution, use the interval() function of the t class in the stats module.

stats.norm.interval(alpha, loc, scale)
stats.t.interval(alpha, df, loc, scale)

Here alpha is the confidence level (newer SciPy releases name this parameter confidence, so passing it positionally is the safest choice), df is the degrees of freedom (n-1 for a t-based interval estimate of the mean), loc is the sample mean, and scale is the standard error, calculated with stats.sem().
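To see what these arguments do, the t-based interval can be rebuilt by hand as the sample mean plus or minus the critical t value times the standard error. A minimal sketch on simulated returns (the parameters and the sample size of 94 are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

# Simulated returns; arbitrary parameters, for illustration only
rng = np.random.default_rng(2)
sample = rng.normal(0.001, 0.01, size=94)

n = len(sample)
mean = sample.mean()
se = stats.sem(sample)  # sample standard deviation (ddof=1) divided by sqrt(n)
confidence = 0.95

# Two-sided critical value of the t-distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
manual = (mean - t_crit * se, mean + t_crit * se)

# Same interval from the library function (confidence passed positionally)
library = stats.t.interval(confidence, n - 1, mean, se)

print(manual)
print(library)
```

The two printed intervals agree, which shows that loc is the centre of the interval and scale sets its width.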

2. Interval estimate of the Shanghai Composite Index return

Now estimate an interval for the mean return of the Shanghai Composite Index. At a 95% confidence level, using the t-distribution with 93 degrees of freedom, the confidence interval for the mean is:

stats.t.interval(0.95,len(SHRet)-1,mu,stats.sem(SHRet))

This returns the lower and upper limits of the interval.

That is, we can be 95% confident that the population mean lies within this interval; intervals constructed this way cover the true mean in about 95% of repeated samples and miss it in about 5%.
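One way to read the 95% confidence level is through repeated sampling: if many samples are drawn and an interval is built from each, roughly 95% of those intervals should contain the true mean. The simulation below is a sketch under assumed values (a normal population with a true mean of 0, standard deviation 0.01, samples of 94 observations, 2000 repetitions), none of which come from the index data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, true_sd, n, trials = 0.0, 0.01, 94, 2000  # arbitrary assumptions

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    low, high = stats.t.interval(0.95, n - 1, sample.mean(), stats.sem(sample))
    # Count the intervals that contain the true mean
    if low <= true_mean <= high:
        covered += 1

coverage = covered / trials
print(coverage)  # should be close to 0.95
```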

3. Hypothesis testing

The task of parameter estimation is to estimate the value of a parameter, while the purpose of hypothesis testing is to test whether the parameter equals a given target value.

The core idea of hypothesis testing is that a small-probability event (for example, one with p<0.01 or p<0.05) is very unlikely to occur in a single experiment. If such an event does occur under our hypothesis, the hypothesis can be rejected.

The t-test uses a statistic that follows the t-distribution. It applies when the population standard deviation is unknown and the population follows a normal distribution. Common t-tests include the one-sample test, the independent two-sample test, and the paired-sample test.

1. One-sample test

For example, we use this year's Shanghai Composite Index return data to run a t-test on whether the mean return is 0. Null hypothesis H0: the mean return of the Shanghai Composite Index is 0. Alternative hypothesis H1: the mean return of the Shanghai Composite Index is not 0.

# Just pass in the variable to test and the value to compare it with
stats.ttest_1samp(SHRet,0)

This returns the t statistic and the p-value.

Here p>0.05, which means we cannot reject the null hypothesis at the 5% significance level.
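The statistic and p-value that ttest_1samp reports can be reproduced by hand: the t statistic is the distance between the sample mean and the hypothesized mean measured in standard errors, and the two-sided p-value comes from the t-distribution with n-1 degrees of freedom. A sketch on simulated returns (the parameters are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(0.0005, 0.01, size=94)  # simulated returns, arbitrary parameters
popmean = 0.0                               # value under the null hypothesis

# t statistic: distance of the sample mean from popmean, in standard errors
t_manual = (sample.mean() - popmean) / stats.sem(sample)
# Two-sided p-value from the t-distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(sample) - 1)

result = stats.ttest_1samp(sample, popmean)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```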

2. Independent two-sample test

For example, test whether the mean returns of the Shanghai Composite Index and the Shenzhen Component Index are equal.

First calculate the return of the Shenzhen Component Index:

df2 = ts.get_k_data('sz', start='2023-01-01', end='2023-05-26')  # Shenzhen Component Index
df2['lagclose'] = df2.close.shift(1)
df2['SZRet'] = (df2['close'] - df2['lagclose']) / df2['lagclose']
SZRet = df2.SZRet.dropna()

Then test whether the mean returns of the two indexes are equal:

# Just pass in the two samples
stats.ttest_ind(SHRet, SZRet)
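By default, ttest_ind assumes that the two populations have equal variances and pools them; passing equal_var=False switches to Welch's t-test, which drops that assumption. The sketch below compares the two on simulated series with different volatility (all parameters are arbitrary assumptions) and rebuilds the pooled statistic by hand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Two simulated return series with different volatility (arbitrary assumptions)
a = rng.normal(0.0005, 0.008, size=94)
b = rng.normal(0.0005, 0.012, size=94)

# Default: pooled-variance (Student) t-test, assumes equal population variances
pooled = stats.ttest_ind(a, b)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(a, b, equal_var=False)

# Pooled statistic by hand, to show what the default computes
na, nb = len(a), len(b)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

# With equal sample sizes the two statistics coincide; only the degrees of
# freedom (and so the p-value) differ
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

Which variant suits the two indexes depends on whether their return volatilities can reasonably be treated as equal.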

3. Paired sample t-test

# Just pass in the two samples
stats.ttest_rel(SHRet, SZRet)
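A paired test works on the day-by-day differences between the two series, so both samples must have the same length and be aligned on the same dates. It is equivalent to a one-sample t-test of those differences against 0. A sketch on simulated, correlated series (all parameters are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Two simulated, positively correlated return series on the same 94 "days"
common = rng.normal(0.0, 0.01, size=94)
x = common + rng.normal(0.0, 0.003, size=94)
y = common + rng.normal(0.0, 0.003, size=94)

paired = stats.ttest_rel(x, y)
# Equivalent: one-sample t-test of the daily differences against 0
on_diff = stats.ttest_1samp(x - y, 0)

print(paired.statistic, paired.pvalue)
print(on_diff.statistic, on_diff.pvalue)
```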

That's all for today. Maodou updates this series every weekend and will keep sharing the live results of the Whirlwind Charge quantitative strategy every trading day. Likes and follows are welcome.

Backtest: Whirlwind Charge strategy description

Live trading: April strategy data release & FAQ

Origin blog.csdn.net/weixin_37475278/article/details/130914532