In recent years, Bayesian statistics has become a cornerstone of empirical finance. This chapter cannot cover all the concepts in this field. Where necessary, you should therefore refer to the textbook by Geweke (2005) for general introductory knowledge; interested readers should also consult the textbook by Rachev (2008).
13.3.1 Bayes' formula
The common interpretation of Bayes' formula in finance is the diachronic interpretation. It mainly means that, over time, we learn new information about the variables or parameters of interest, such as the mean rate of return of a time series. Equation 13-5 states this theory formally.
Formula 13-5 Bayes' formula

p(H|D) = p(H) · p(D|H) / p(D)
In the formula, H stands for an event, the hypothesis, and D represents the data that an experiment or the real world might provide. [4] On the basis of these definitions, we have:
- p(H) is called the prior probability;
- p(D) is the probability of the data under any hypothesis, called the normalizing constant;
- p(D|H) is the likelihood (i.e., the probability) of the data under hypothesis H;
- p(H|D) is the posterior probability, that is, the probability after we have seen the data.
Consider a simple example. There are two boxes, B1 and B2. B1 contains 20 black balls and 70 red balls, while B2 contains 40 black balls and 50 red balls. A ball is drawn at random from one of the two boxes, and it turns out to be black. What are the probabilities of the two hypotheses "H1: the ball comes from B1" and "H2: the ball comes from B2"?
Before the ball is drawn, the two hypotheses are equally likely. Once the ball is seen to be black, however, we must update the probabilities of the two hypotheses according to Bayes' formula. Consider hypothesis H1:
- Prior probability: p(H1) = 1/2
- Normalizing constant: p(D) = 1/2 · 20/90 + 1/2 · 40/90 = 1/3
- Likelihood: p(D|H1) = 20/90 = 2/9

The updated probability of H1 is then p(H1|D) = (1/2 · 2/9) / (1/3) = 1/3.
This result also makes intuitive sense: the probability of drawing a black ball from B2 is twice the probability of drawing one from B1. Accordingly, after a black ball has been drawn, the updated probability of hypothesis H2 is p(H2|D) = 2/3, twice the updated probability of H1.
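The update is easy to verify in a few lines of Python; the following is a minimal sketch of the calculation above:

p_h1 = p_h2 = 0.5                      # prior probabilities p(H1), p(H2)
p_d_h1 = 20 / 90                       # likelihood p(D|H1): share of black balls in B1
p_d_h2 = 40 / 90                       # likelihood p(D|H2): share of black balls in B2
p_d = p_h1 * p_d_h1 + p_h2 * p_d_h2    # normalizing constant p(D) = 1/3
print(p_h1 * p_d_h1 / p_d)             # posterior p(H1|D) = 0.3333...
print(p_h2 * p_d_h2 / p_d)             # posterior p(H2|D) = 0.6666...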
13.3.2 Bayesian regression
With PyMC3, the Python ecosystem provides a comprehensive package that technically implements Bayesian statistics and probabilistic programming.
Consider the following example based on noisy data around a straight line. [5] First, a linear ordinary least squares regression (see Chapter 11) is implemented on the data set; the results are shown in Figure 13-15:
Figure 13-15 Sample data points and regression line
In [1]: import numpy as np
        import pandas as pd
        import datetime as dt
        from pylab import mpl, plt

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        np.random.seed(1000)
        %matplotlib inline

In [3]: x = np.linspace(0, 10, 500)
        y = 4 + 2 * x + np.random.standard_normal(len(x)) * 2

In [4]: reg = np.polyfit(x, y, 1)

In [5]: reg
Out[5]: array([2.03384161, 3.77649234])

In [6]: plt.figure(figsize=(10, 6))
        plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
        plt.plot(x, reg[1] + reg[0] * x, lw=2.0)
        plt.colorbar()
        plt.xlabel('x')
        plt.ylabel('y')
The result of the OLS regression is a single fixed value for each of the two parameters of the regression line (intercept and slope). Note that np.polyfit returns the coefficients in decreasing order of degree, so the highest-order coefficient (here, the slope of the regression line) is found at index position 0 and the intercept at index position 1. The original parameters 2 and 4 are not recovered exactly, which is caused by the noise contained in the data.
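As a quick sketch of this coefficient ordering (reusing reg and x from the session above; y_fit is an illustrative name, not part of the original session):

# np.polyfit returns coefficients in decreasing order of degree,
# so for a degree-1 fit reg[0] is the slope and reg[1] the intercept.
slope, intercept = reg
# np.polyval applies the coefficients in the same order, i.e.
# np.polyval(reg, x) is equivalent to intercept + slope * x.
y_fit = np.polyval(reg, x)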
Bayesian regression uses the PyMC3 package. It assumes that the parameters follow certain distributions. For example, consider the regression line equation ŷ(x) = α + β · x, and assume the following prior probability distributions:
- α is normally distributed, with a mean of 0 and a standard deviation of 20;
- β is normally distributed, with a mean of 0 and a standard deviation of 10.
For the likelihood, assume a normal distribution with mean ŷ(x) = α + β · x and a standard deviation that is uniformly distributed between 0 and 10.
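In compact notation, these assumptions read as follows (where N(μ, σ²) denotes a normal distribution with mean μ and standard deviation σ, and U(a, b) a uniform distribution on [a, b]):

α ∼ N(0, 20²)
β ∼ N(0, 10²)
σ ∼ U(0, 10)
y | α, β, σ ∼ N(α + β · x, σ²)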
One central element of Bayesian regression is (Markov chain) Monte Carlo (MCMC) sampling. [6] In principle, this is the same as drawing balls from the boxes many times over in the previous example, just in a more systematic and automated way.
There are three different functions to call for the technical sampling:

- find_MAP() finds the starting point for the sampling algorithm by deriving the local maximum a posteriori point;
- NUTS() implements the so-called "efficient No-U-Turn Sampler with dual averaging" (NUTS) algorithm for MCMC sampling, given the assumed priors;
- sample() draws a given number of samples, using the starting value from find_MAP() and the optimal step size from the NUTS algorithm.
These functions are used within a PyMC3 Model object and executed inside a with statement:
In [8]: import pymc3 as pm

In [9]: %%time
        with pm.Model() as model:
            # model
            alpha = pm.Normal('alpha', mu=0, sd=20)  ❶
            beta = pm.Normal('beta', mu=0, sd=10)  ❶
            sigma = pm.Uniform('sigma', lower=0, upper=10)  ❶
            y_est = alpha + beta * x  ❷
            likelihood = pm.Normal('y', mu=y_est, sd=sigma,
                                   observed=y)  ❸
            # inference
            start = pm.find_MAP()  ❹
            step = pm.NUTS()  ❺
            trace = pm.sample(100, tune=1000, start=start,
                              progressbar=True, verbose=False)  ❻
logp = -1,067.8, ||grad|| = 60.354: 100%| | 28/28 [00:00<00:00, 474.70it/s]
Only 100 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, beta, alpha]
Sampling 2 chains: 100%| | 2200/2200 [00:03<00:00, 690.96draws/s]
CPU times: user 6.2 s, sys: 1.72 s, total: 7.92 s
Wall time: 1min 28s

In [10]: pm.summary(trace)  ❼
Out[10]:
           mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
alpha  3.764027  0.174796  0.013177  3.431739  4.070091  152.446951  0.996281
beta   2.036318  0.030519  0.002230  1.986874  2.094008  106.505590  0.999155
sigma  2.010398  0.058663  0.004517  1.904395  2.138187  188.643293  0.998547

In [11]: trace[0]  ❽
Out[11]: {'alpha': 3.9303300798212444,
          'beta': 2.0020264758995463,
          'sigma_interval__': -1.3519315719461853,
          'sigma': 2.0555476283253156}
❶ Define the prior probability.
❷ Specify linear regression.
❸ Define likelihood.
❹ Find the starting value through optimization.
❺ Instantiate the MCMC algorithm.
❻ Use NUTS to obtain posterior samples.
❼ Show the statistical summary of the sampling.
❽ The estimates from the first sample.
The three estimates are quite close to the original values (4, 2, 2). However, the procedure yields a large number of estimates, which are best described by a trace plot (see Figure 13-16). A trace plot shows the posterior distribution of each parameter as well as the individual estimates per sample. The posterior distributions give us an intuitive sense of the uncertainty in the estimates:
In [12]: pm.traceplot(trace, lines={'alpha': 4, 'beta': 2, 'sigma': 2});
Figure 13-16 Posterior distributions and trace plots
Taking only the alpha and beta values from the trace, all resulting regression lines can be drawn (see Figure 13-17):
In [13]: plt.figure(figsize=(10, 6)) plt.scatter(x, y, c=y, marker='v', cmap='coolwarm') plt.colorbar() plt.xlabel('x') plt.ylabel('y') for i in range(len(trace)): plt.plot(x, trace['alpha'][i] + trace['beta'][i] * x) ❶
❶ Plot a single regression line for each sample.
Figure 13-17 Regression lines based on different estimates
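As a small follow-up (a sketch assuming the trace, x, and plt objects from the session above), the posterior means of alpha and beta give a single average regression line:

# Posterior means of the parameters, taken over all samples
alpha_m = trace['alpha'].mean()
beta_m = trace['beta'].mean()
plt.plot(x, alpha_m + beta_m * x, 'k--', lw=2.0)  # mean regression line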
13.3.3 Two financial instruments
Having introduced Bayesian regression with PyMC3 on dummy data, moving to real financial data is straightforward. The example uses the financial time series data of two exchange-traded funds, GLD and GDX (see Figure 13-18):
In [14]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True)

In [15]: data = raw[['GDX', 'GLD']].dropna()

In [16]: data = data / data.iloc[0]  ❶

In [17]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29
Data columns (total 2 columns):
GDX    2138 non-null float64
GLD    2138 non-null float64
dtypes: float64(2)
memory usage: 50.1 KB

In [18]: data.iloc[-1] / data.iloc[0] - 1  ❷
Out[18]: GDX   -0.532383
         GLD    0.080601
         dtype: float64

In [19]: data.corr()  ❸
Out[19]:
         GDX      GLD
GDX  1.00000  0.71539
GLD  0.71539  1.00000

In [20]: data.plot(figsize=(10, 6));
❶ Normalize the data to start value 1.
❷ Calculate relative performance.
❸ Calculate the correlation between two financial instruments.
Figure 13-18 Normalized prices of GLD and GDX over time
In the example below, the dates of the individual data points are visualized in a scatter plot. To this end, the DatetimeIndex object of the DataFrame is converted into matplotlib dates. Figure 13-19 shows a scatter plot of the time series data, plotting the GLD values against the GDX values and coloring every data pair according to its date: [7]
In [21]: data.index[:3]
Out[21]: DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06'],
                       dtype='datetime64[ns]', name='Date', freq=None)

In [22]: mpl_dates = mpl.dates.date2num(data.index.to_pydatetime())  ❶
         mpl_dates[:3]
Out[22]: array([733776., 733777., 733778.])

In [23]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));  ❷
❶ Convert DatetimeIndex objects to matplotlib dates.
❷ Add a colorbar with customized date ticks.
Figure 13-19 Scatter plot of GLD prices against GDX prices
Next, Bayesian regression is implemented on the basis of these two time series. The parameterization is essentially the same as in the previous example with dummy data. Figure 13-20 shows the results of the MCMC sampling procedure, given the assumed prior probability distributions for the three parameters:
In [24]: with pm.Model() as model:
             alpha = pm.Normal('alpha', mu=0, sd=20)
             beta = pm.Normal('beta', mu=0, sd=20)
             sigma = pm.Uniform('sigma', lower=0, upper=50)
             y_est = alpha + beta * data['GDX'].values
             likelihood = pm.Normal('GLD', mu=y_est, sd=sigma,
                                    observed=data['GLD'].values)
             start = pm.find_MAP()
             step = pm.NUTS()
             trace = pm.sample(250, tune=2000, start=start,
                               progressbar=True)
logp = 1,493.7, ||grad|| = 188.29: 100%| | 27/27 [00:00<00:00, 1609.34it/s]
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, beta, alpha]
Sampling 2 chains: 100%| | 4500/4500 [00:09<00:00, 465.07draws/s]
The estimated number of effective samples is smaller than 200 for some parameters.

In [25]: pm.summary(trace)
Out[25]:
           mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
alpha  0.913335  0.005983  0.000356  0.901586  0.924714  184.264900  1.001855
beta   0.385394  0.007746  0.000461  0.369154  0.398291  215.477738  1.001570
sigma  0.119484  0.001964  0.000098  0.115305  0.123315  312.260213  1.005246

In [26]: fig = pm.traceplot(trace)
Figure 13-20 Posterior distributions and trace plots for the GDX and GLD data
Figure 13-21 adds all the resulting regression lines to the scatter plot from before. All the regression lines are quite close to each other:
In [27]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         for i in range(len(trace)):
             plt.plot(data['GDX'],
                      trace['alpha'][i] + trace['beta'][i] * data['GDX'])
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));
Figure 13-21 Multiple Bayesian regression lines passing through GDX and GLD data
Figure 13-21 reveals a major drawback of the regression approach used: it does not take into account changes that occur over time. In other words, the most recent data is treated the same way as the oldest data.
13.3.4 Updating estimates over time
As pointed out earlier, the Bayesian approach in finance is generally most useful when applied over time; that is, when new data revealed over time allows for better regressions and estimates.
To incorporate this concept into the current example, assume that the regression parameters are not only random and distributed in some fashion, but that they also follow a random walk over time. This is the same generalization that financial theory makes when moving from random variables to stochastic processes, which are essentially ordered sequences of random variables.
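Such a Gaussian random walk is straightforward to simulate directly. The following self-contained sketch (with an assumed step standard deviation of 0.02 and 42 steps; both values are illustrative only) shows the kind of process assumed for alpha and beta below:

import numpy as np

np.random.seed(100)
sigma_step = 0.02                                 # assumed standard deviation per step
steps = np.random.normal(0, sigma_step, size=42)  # one Gaussian step per interval
walk = steps.cumsum()                             # the random walk is the cumulative sum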
To this end, we define a new PyMC3 model, this time specifying that the parameter values follow random walks. After specifying the distributions of the random walk parameters, we can proceed to specify the random walks for alpha and beta. To make the whole procedure more efficient, the same coefficients are used for 50 data points at a time:
In [28]: from pymc3.distributions.timeseries import GaussianRandomWalk

In [29]: subsample_alpha = 50
         subsample_beta = 50

In [30]: model_randomwalk = pm.Model()
         with model_randomwalk:
             sigma_alpha = pm.Exponential('sig_alpha', 1. / .02, testval=.1)  ❶
             sigma_beta = pm.Exponential('sig_beta', 1. / .02, testval=.1)  ❶
             alpha = GaussianRandomWalk('alpha', sigma_alpha ** -2,
                                        shape=int(len(data) / subsample_alpha))  ❷
             beta = GaussianRandomWalk('beta', sigma_beta ** -2,
                                       shape=int(len(data) / subsample_beta))  ❷
             alpha_r = np.repeat(alpha, subsample_alpha)  ❸
             beta_r = np.repeat(beta, subsample_beta)  ❸
             regression = alpha_r + beta_r * data['GDX'].values[:2100]  ❹
             sd = pm.Uniform('sd', 0, 20)  ❺
             likelihood = pm.Normal('GLD', mu=regression, sd=sd,
                                    observed=data['GLD'].values[:2100])  ❻
❶ Define the prior probability for the random walk parameters.
❷ Random walk model.
❸ Repeat the parameter vectors to match the interval length.
❹ Define the regression model.
❺ Define the prior for the standard deviation.
❻ Define the likelihood with mu set to the regression result.
Due to the use of random walks instead of single random variables, these definitions are more complex than before. However, the MCMC inference steps remain essentially the same. Note, though, that the computational burden increases significantly, since a parameter pair has to be estimated for every random-walk interval: here int(2138 / 50) = 42 pairs in total (covering the first 2100 data points), compared with just one before:
In [31]: %%time
         import scipy.optimize as sco
         with model_randomwalk:
             start = pm.find_MAP(vars=[alpha, beta],
                                 fmin=sco.fmin_l_bfgs_b)
             step = pm.NUTS(scaling=start)
             trace_rw = pm.sample(250, tune=1000, start=start,
                                  progressbar=True)
logp = -6,657: 2%| | 82/5000 [00:00<00:08, 550.29it/s]
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, beta, alpha, sig_beta, sig_alpha]
Sampling 2 chains: 100%| | 2500/2500 [02:48<00:00, 8.59draws/s]
CPU times: user 27.5 s, sys: 3.68 s, total: 31.2 s
Wall time: 5min 3s

In [32]: pm.summary(trace_rw).head()  ❶
Out[32]:
              mean        sd  mc_error   hpd_2.5  hpd_97.5        n_eff      Rhat
alpha__0  0.673846  0.040224  0.001376  0.592655  0.753034  1004.616544  0.998637
alpha__1  0.424819  0.041257  0.001618  0.348102  0.509757   804.760648  0.999540
alpha__2  0.456817  0.057200  0.002011  0.321125  0.553173   800.225916  0.998075
alpha__3  0.268148  0.044879  0.001725  0.182744  0.352197   724.967532  0.998995
alpha__4  0.651465  0.057472  0.002197  0.544076  0.761216   978.073246  0.998060
❶ Summary statistics per interval (only the first five alpha values are shown).
Figure 13-22 shows a subset of the estimates, illustrating the evolution of the regression parameters alpha and beta over time:
In [33]: sh = np.shape(trace_rw['alpha'])  ❶
         sh  ❶
Out[33]: (500, 42)

In [34]: part_dates = np.linspace(min(mpl_dates), max(mpl_dates), sh[1])  ❷

In [35]: index = [dt.datetime.fromordinal(int(date)) for date in part_dates]  ❷

In [36]: alpha = {'alpha_%i' % i: v for i, v in
                  enumerate(trace_rw['alpha']) if i < 20}  ❸

In [37]: beta = {'beta_%i' % i: v for i, v in
                 enumerate(trace_rw['beta']) if i < 20}  ❸

In [38]: df_alpha = pd.DataFrame(alpha, index=index)  ❸

In [39]: df_beta = pd.DataFrame(beta, index=index)  ❸

In [40]: ax = df_alpha.plot(color='b', style='-.', legend=False,
                            lw=0.7, figsize=(10, 6))
         df_beta.plot(color='r', style='-.', legend=False,
                      lw=0.7, ax=ax)
         plt.ylabel('alpha/beta');
❶ The shape of the object containing the parameter estimates.
❷ Create a list of dates to match the number of time intervals.
❸ Collect the relevant parameter time series in two DataFrame objects.
Figure 13-22 Selected parameter estimates over time
Absolute price data versus relative return data

The analysis in this section is based on normalized price data. This is for illustration purposes only, because the corresponding graphical results are easier to understand and interpret (they are visually "more telling"). For real-world financial applications, however, one should rely on return data to ensure the stationarity of the time series data.
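The conversion this note recommends is straightforward; as a sketch (reusing the data object from above; rets is an illustrative name), log returns can be computed as follows:

# Log returns from the price data (note: data as defined above)
rets = np.log(data / data.shift(1)).dropna()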
Using the mean alpha and beta values, we can illustrate how the regression is updated over time. Figure 13-23 shows how the regression evolves; it also shows the 42 regression lines derived from the mean alpha and beta values per interval. It is obvious that updating over time greatly improves the regression fit (with respect to the current/most recent data). In other words, every time period needs its own regression:
In [41]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'))
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         x = np.linspace(min(data['GDX']), max(data['GDX']))
         for i in range(sh[1]):  ❶
             alpha_rw = np.mean(trace_rw['alpha'].T[i])
             beta_rw = np.mean(trace_rw['beta'].T[i])
             plt.plot(x, alpha_rw + beta_rw * x, '--', lw=0.7,
                      color=plt.cm.coolwarm(i / sh[1]))
❶ Draw regression lines for all time intervals of length 50.
Figure 13-23 Scatter plot with time-dependent regression line (updated estimate)
This concludes the introduction to Bayesian regression. Python, with PyMC3, provides a powerful library for implementing different approaches from Bayesian statistics and probabilistic programming. Bayesian regression in particular has become a popular and important tool in financial econometrics.
This article is excerpted from Python Financial Big Data Analysis, 2nd Edition.
"Python Financial Big Data Analysis 2nd Edition" is divided into 5 parts, a total of 21 chapters. Part 1 introduces the application of Python in finance, and its content covers the reasons why Python is used in the financial industry, the basic architecture and tools of Python, and some specific examples of Python in econometric finance; Part 2 introduces Basic knowledge of Python and the well-known library NumPy and pandas toolset in Python. Object-oriented programming is also introduced; Part 3 introduces the basic techniques and methods of financial data science, including data visualization, input/output operations, and mathematical Financial-related knowledge, etc.; Part 4 introduces the application of Python in algorithmic trading, focusing on common algorithms, including machine learning, deep neural networks and other artificial intelligence related algorithms; Part 5 explains the development of options and derivatives based on Monte Carlo simulation The application of pricing includes the introduction of the valuation framework, the simulation of financial models, the valuation of derivatives, and the valuation of investment portfolios.
"Python Financial Big Data Analysis 2nd Edition" This book is suitable for financial industry developers who are interested in using Python for big data analysis and processing.