Financial Data Science: Bayesian Statistics

Bayesian statistics has become a cornerstone of empirical finance in recent years. This chapter cannot cover every concept in the field; where necessary, refer to Geweke (2005) for a general introduction and to Rachev (2008) for a financially motivated treatment.

13.3.1 Bayes' formula

The interpretation of Bayes' formula most commonly applied in finance is the diachronic interpretation. This mainly means that, over time, we learn new information about certain variables or parameters of interest, such as the mean return of a time series. Formula 13-5 describes this theorem formally.

Formula 13-5 Bayes' formula

p(H | D) = p(H) · p(D | H) / p(D)

In the formula, H represents a certain event (hypothesis), and D represents data that may be provided by experiments or the real world [4]. On the basis of these definitions, we get:

  • p(H) is called the prior probability;
  • p(D) is the probability of the data under any hypothesis, called the normalization constant;
  • p(D | H) is the likelihood (i.e., the probability) of the data under hypothesis H;
  • p(H | D) is the posterior probability, i.e., the probability obtained after we have seen the data.

Consider a simple example. There are two boxes, B1 and B2. Box B1 contains 20 black balls and 70 red balls, while box B2 contains 40 black balls and 50 red balls. A ball is drawn at random from one of the two boxes and turns out to be black. What are the probabilities of the two hypotheses "H1: the ball comes from B1" and "H2: the ball comes from B2"?

Before a ball is drawn, both hypotheses are equally likely. Once the ball turns out to be black, we must update the probabilities of the two hypotheses according to Bayes' formula. Consider hypothesis H1:

  • Prior probability: p(H1) = 1/2
  • Normalization constant: p(D) = 1/2 · 20/90 + 1/2 · 40/90 = 1/3
  • Likelihood: p(D | H1) = 20/90 = 2/9

The updated probability of H1 is therefore p(H1 | D) = (1/2 · 2/9) / (1/3) = 1/3.

This result also makes intuitive sense. The probability of drawing a black ball from B2 is twice the probability of the same event for B1. Therefore, having drawn a black ball, the updated probability of hypothesis H2, p(H2 | D) = 2/3, is twice the updated probability of hypothesis H1.
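The following short Python sketch is an illustrative addition (not part of the original example); it simply reproduces these numbers by direct application of Bayes' formula:

    # minimal sketch: the two-box example computed step by step
    p_h1, p_h2 = 0.5, 0.5                 # prior probabilities for B1 and B2
    p_d_h1 = 20 / 90                      # likelihood of a black ball given B1 (20 black, 70 red)
    p_d_h2 = 40 / 90                      # likelihood of a black ball given B2 (40 black, 50 red)
    p_d = p_h1 * p_d_h1 + p_h2 * p_d_h2   # normalization constant p(D) = 1/3
    p_h1_d = p_h1 * p_d_h1 / p_d          # posterior p(H1 | D) = 1/3
    p_h2_d = p_h2 * p_d_h2 / p_d          # posterior p(H2 | D) = 2/3
    print(round(p_h1_d, 4), round(p_h2_d, 4))  # 0.3333 0.6667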

13.3.2 Bayesian regression

With PyMC3, the Python ecosystem provides a comprehensive package that technically implements Bayesian statistics and probabilistic programming.

Consider the following example based on noisy data around a straight line. [5] First, an ordinary least squares regression is implemented on the data set (see Chapter 11); the result is shown in Figure 13-15:

Figure 13-15 Sample data points and regression line

In [1]: import numpy as np
        import pandas as pd
        import datetime as dt
        from pylab import mpl, plt

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        np.random.seed(1000)
        %matplotlib inline

In [3]: x = np.linspace(0, 10, 500)
        y = 4 + 2 * x + np.random.standard_normal(len(x)) * 2

In [4]: reg = np.polyfit(x, y, 1)

In [5]: reg
Out[5]: array([2.03384161, 3.77649234])

In [6]: plt.figure(figsize=(10, 6))
        plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
        plt.plot(x, reg[1] + reg[0] * x, lw=2.0)
        plt.colorbar()
        plt.xlabel('x')
        plt.ylabel('y')

The result of the OLS regression is a single fixed value for each of the two parameters of the regression line (intercept and slope). Note that the coefficient of the highest-order monomial (here, the slope of the regression line) is found at index 0 of the result array, while the intercept is at index 1. The original parameters 4 and 2 are not recovered exactly, which is caused by the noise contained in the data.
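As a small illustrative sketch (an addition, assuming the reg and x objects from above), the index convention of np.polyfit can be made explicit and the fitted line evaluated directly:

    # reg[0] holds the slope, reg[1] the intercept for a degree-1 fit
    slope, intercept = reg
    y_fit = np.polyval(reg, x)        # evaluates intercept + slope * x for every point in x
    print(slope, intercept)           # roughly 2.03 and 3.78, close to the true values 2 and 4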

For Bayesian regression, the PyMC3 package is used. It is assumed that the parameters are distributed in a certain way. For example, consider the regression line equation ŷ(x) = α + β · x. Assume the following priors:

  • α is normally distributed, with a mean of 0 and a standard deviation of 20;
  • β is normally distributed, with a mean of 0 and a standard deviation of 10.

For the likelihood, assume a normal distribution with mean ŷ(x) = α + β · x and a standard deviation that is uniformly distributed between 0 and 10.

A central element of Bayesian regression is (Markov chain) Monte Carlo (MCMC) sampling. [6] In principle, this is the same as drawing balls from the boxes many times over in the previous example, only in a more systematic and automated way.
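As a hedged illustration of this analogy (an addition, not code from the book), the posterior of the box example can be approximated by brute-force simulation of repeated draws:

    # simulate many box choices and ball draws, then condition on "black"
    import numpy as np
    rng = np.random.default_rng(100)
    boxes = rng.integers(1, 3, 100000)                # randomly pick box 1 or 2 for each trial
    p_black = np.where(boxes == 1, 20 / 90, 40 / 90)  # probability of black given the chosen box
    black = rng.random(100000) < p_black              # simulate the color of the drawn ball
    print((boxes[black] == 2).mean())                 # fraction of black draws from B2, approx. 2/3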

For the technical sampling, three different functions are called:

  • find_MAP() finds the starting point for the sampling algorithm by deriving the local maximum a posteriori point;
  • NUTS() implements the so-called "efficient No-U-Turn Sampler with dual averaging" (NUTS) algorithm for MCMC sampling, given the assumed priors;
  • sample() draws a given number of samples, using the starting value from find_MAP() and the optimal step size from the NUTS algorithm.

The above functions are packaged in a PyMC3 Model object and executed in the with statement:

In [8]: import pymc3 as pm

In [9]: %%time
        with pm.Model() as model:
            # model
            alpha = pm.Normal('alpha', mu=0, sd=20) ❶
            beta = pm.Normal('beta', mu=0, sd=10) ❶
            sigma = pm.Uniform('sigma', lower=0, upper=10) ❶
            y_est = alpha + beta * x ❷
            likelihood = pm.Normal('y', mu=y_est, sd=sigma,
                                    observed=y) ❸

            # inference
            start = pm.find_MAP() ❹
            step = pm.NUTS() ❺
            trace = pm.sample(100, tune=1000, start=start,
                              progressbar=True, verbose=False) ❻
        logp = -1,067.8, ||grad|| = 60.354: 100%|     | 28/28 [00:00<00:00,
         474.70it/s]
        Only 100 samples in chain.
        Auto-assigning NUTS sampler...
        Initializing NUTS using jitter+adapt_diag...
        Multiprocess sampling (2 chains in 2 jobs)
        NUTS: [sigma, beta, alpha]
        Sampling 2 chains: 100%|     | 2200/2200 [00:03<00:00,
         690.96draws/s]

        CPU times: user 6.2 s, sys: 1.72 s, total: 7.92 s
        Wall time: 1min 28s

In [10]: pm.summary(trace) ❼
Out[10]:
             mean      sd mc_error hpd_2.5 hpd_97.5     n_eff     Rhat
    alpha 3.764027 0.174796 0.013177 3.431739 4.070091 152.446951 0.996281
    beta  2.036318 0.030519 0.002230 1.986874 2.094008 106.505590 0.999155
    sigma 2.010398 0.058663 0.004517 1.904395 2.138187 188.643293 0.998547

In [11]: trace[0] ❽
Out[11]: {'alpha': 3.9303300798212444,
          'beta': 2.0020264758995463,
          'sigma_interval__': -1.3519315719461853,
          'sigma': 2.0555476283253156}

❶ Define the prior probability.

❷ Specify linear regression.

❸ Define likelihood.

❹ Find the starting value through optimization.

❺ Instantiate the MCMC algorithm.

❻ Use NUTS to obtain posterior samples.

❼ Show the statistical summary of the sampling.

❽ Estimate from the first sample.

The three estimates are quite close to the original values (4, 2, 2). However, the whole procedure yields many estimates, which are best described by a trace plot (see Figure 13-16). A trace plot shows the posterior distribution of each parameter as well as the individual estimates per sample. The posterior distribution gives an intuitive sense of the uncertainty in the estimates:

In [12]: pm.traceplot(trace, lines={'alpha': 4, 'beta': 2, 'sigma': 2});

Figure 13-16 Posterior distributions and trace plots

Taking only the alpha and beta estimates from the regression, all resulting regression lines can be drawn (see Figure 13-17):

In [13]: plt.figure(figsize=(10, 6))
         plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
         plt.colorbar()
         plt.xlabel('x')
         plt.ylabel('y')
         for i in range(len(trace)):
             plt.plot(x, trace['alpha'][i] + trace['beta'][i] * x) ❶

❶ Draw a single regression line.

Figure 13-17 Regression lines based on the different estimates

13.3.3 Two financial instruments

Having introduced Bayesian regression with PyMC3 on dummy data, moving on to real financial data is straightforward. The example uses financial time series data for two exchange-traded funds, GLD and GDX (see Figure 13-18):

In [14]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True)

In [15]: data = raw[['GDX', 'GLD']].dropna()

In [16]: data = data / data.iloc[0] ❶

In [17]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29
         Data columns (total 2 columns):
         GDX    2138 non-null float64
         GLD    2138 non-null float64
         dtypes: float64(2)
         memory usage: 50.1 KB

In [18]: data.iloc[-1] / data.iloc[0] - 1 ❷
Out[18]: GDX   -0.532383
         GLD    0.080601
         dtype: float64

In [19]: data.corr() ❸
Out[19]:          GDX      GLD
         GDX  1.00000  0.71539
         GLD  0.71539  1.00000

In [20]: data.plot(figsize=(10, 6));

❶ Normalize the data to start value 1.

❷ Calculate relative performance.

❸ Calculate the correlation between two financial instruments.

Figure 13-18 Normalized prices of GLD and GDX over time

In the example below, the dates of the individual data points are visualized in a scatter plot. To this end, the DatetimeIndex object of the DataFrame is converted to matplotlib dates. Figure 13-19 shows the time series data as a scatter plot, plotting GLD values against GDX values and coloring each data pair according to its date: [7]

In [21]: data.index[:3]
Out[21]: DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06'],
          dtype='datetime64[ns]', name='Date', freq=None)
In [22]: mpl_dates = mpl.dates.date2num(data.index.to_pydatetime()) ❶
         mpl_dates[:3]
Out[22]: array([733776., 733777., 733778.])

In [23]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y')); ❷

❶ Convert DatetimeIndex objects to matplotlib dates.

❷ Color the data points according to their dates.

Figure 13-19 Scatter plot of GLD prices against GDX prices

Next, Bayesian regression is implemented on the basis of these two time series. The parameterization is essentially the same as in the previous dummy data example. Figure 13-20 shows the results of the MCMC sampling procedure, given the assumed prior distributions for the three parameters:

In [24]: with pm.Model() as model:
             alpha = pm.Normal('alpha', mu=0, sd=20)
             beta = pm.Normal('beta', mu=0, sd=20)
             sigma = pm.Uniform('sigma', lower=0, upper=50)
             y_est = alpha + beta * data['GDX'].values

             likelihood = pm.Normal('GLD', mu=y_est, sd=sigma,
                                    observed=data['GLD'].values)

             start = pm.find_MAP()
             step = pm.NUTS()
             trace = pm.sample(250, tune=2000, start=start,
                               progressbar=True)
        logp = 1,493.7, ||grad|| = 188.29: 100%|     | 27/27 [00:00<00:00,
         1609.34it/s]
        Only 250 samples in chain.
        Auto-assigning NUTS sampler...
        Initializing NUTS using jitter+adapt_diag...
        Multiprocess sampling (2 chains in 2 jobs)
        NUTS: [sigma, beta, alpha]
        Sampling 2 chains: 100%|     | 4500/4500 [00:09<00:00,
         465.07draws/s]
        The estimated number of effective samples is smaller than 200 for some
         parameters.

In [25]: pm.summary(trace)
Out[25]:
             mean      sd mc_error hpd_2.5 hpd_97.5     n_eff     Rhat
    alpha 0.913335 0.005983 0.000356 0.901586 0.924714 184.264900 1.001855
     beta 0.385394 0.007746 0.000461 0.369154 0.398291 215.477738 1.001570
    sigma 0.119484 0.001964 0.000098 0.115305 0.123315 312.260213 1.005246

In [26]: fig = pm.traceplot(trace)

 

Figure 13-20 Posterior distributions and trace plots for the GDX and GLD data

Figure 13-21 adds all the regression lines obtained to the previous scatter plot. However, all regression lines are very close to each other:

In [27]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         for i in range(len(trace)):
             plt.plot(data['GDX'],
                      trace['alpha'][i] + trace['beta'][i] * data['GDX'])
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));

Figure 13-21 Multiple Bayesian regression lines passing through GDX and GLD data

Figure 13-21 reveals a major drawback of the regression method used: the method does not take into account changes that occur over time. In other words, the most recent data and the oldest data are treated the same.
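A simple way to see this drawback, sketched here as an illustrative addition (assuming the data DataFrame from above is still available; the 252-day window is an arbitrary choice), is a rolling OLS regression whose parameters drift noticeably from window to window:

    # rolling OLS of GLD on GDX over consecutive (roughly) one-year windows
    import numpy as np
    import pandas as pd

    window = 252
    params = {}
    for end in range(window, len(data), window):
        sub = data.iloc[end - window:end]
        slope, intercept = np.polyfit(sub['GDX'], sub['GLD'], 1)
        params[data.index[end - 1]] = (intercept, slope)
    rolling = pd.DataFrame(params, index=['intercept', 'slope']).T
    print(rolling)  # intercept and slope change noticeably from window to window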

13.3.4 Updating estimates over time

As pointed out earlier, the Bayesian approach in finance is generally most useful when interpreted diachronically, that is, when new data revealed over time allows for better regressions and estimates through updating.

To incorporate this concept into the current example, assume that the regression parameters are not only random and distributed in a certain way, but also that they "walk" randomly over time. This is the same generalization used in financial theory when moving from random variables to stochastic processes (which are essentially ordered sequences of random variables).
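To build intuition for a parameter that walks randomly over time, the following small sketch (an illustrative addition with a hypothetical starting value and step size) simulates such a Gaussian random walk:

    # a parameter modeled not as a single random variable but as an ordered sequence of them
    import numpy as np
    np.random.seed(1000)
    steps = np.random.normal(loc=0.0, scale=0.02, size=42)  # small normally distributed increments
    beta_path = 0.4 + steps.cumsum()                        # a beta that starts near 0.4 and drifts over time
    print(beta_path[:5])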

To this end, we define a new PyMC3 model, this time specifying the parameters as random walks. After specifying the distributions of the random walk parameters, we can proceed with specifying the random walks for alpha and beta. To make the whole procedure more efficient, the same coefficients are used for every 50 data points:

In [28]: from pymc3.distributions.timeseries import GaussianRandomWalk

In [29]: subsample_alpha = 50
         subsample_beta = 50

In [30]: model_randomwalk = pm.Model()
         with model_randomwalk:
             sigma_alpha = pm.Exponential('sig_alpha', 1. / .02, testval=.1) ❶
             sigma_beta = pm.Exponential('sig_beta', 1. / .02, testval=.1) ❶
             alpha = GaussianRandomWalk('alpha', sigma_alpha ** -2,
                                 shape=int(len(data) / subsample_alpha)) ❷
             beta = GaussianRandomWalk('beta', sigma_beta ** -2,
                                 shape=int(len(data) / subsample_beta)) ❷
             alpha_r = np.repeat(alpha, subsample_alpha) ❸
             beta_r = np.repeat(beta, subsample_beta) ❸
             regression = alpha_r + beta_r * data['GDX'].values[:2100] ❹
             sd = pm.Uniform('sd', 0, 20) ❺
             likelihood = pm.Normal('GLD', mu=regression, sd=sd,
                                    observed=data['GLD'].values[:2100]) ❻

❶ Define the prior probability for the random walk parameters.

❷ Random walk model.

❸ Bring the parameter vectors to the interval length (each value is repeated for 50 data points).

❹ Define the regression model.

❺ Define the prior for the standard deviation.

❻ Define the likelihood with mean mu given by the regression results.

Due to the use of random walks instead of single random variables, these definitions are somewhat more involved than before. However, the inference steps of the MCMC sampling remain essentially unchanged. Note, though, that the computational burden increases significantly, since the algorithm has to estimate one parameter pair per random walk interval, 2,100 / 50 = 42 in total, instead of a single pair as before:

In [31]: %%time
         import scipy.optimize as sco
         with model_randomwalk:
             start = pm.find_MAP(vars=[alpha, beta],
                                 fmin=sco.fmin_l_bfgs_b)
             step = pm.NUTS(scaling=start)
             trace_rw = pm.sample(250, tune=1000, start=start,
                                  progressbar=True)
         logp = -6,657: 2%|         | 82/5000 [00:00<00:08, 550.29it/s]
         Only 250 samples in chain.
         Auto-assigning NUTS sampler...
         Initializing NUTS using jitter+adapt_diag...
         Multiprocess sampling (2 chains in 2 jobs)
         NUTS: [sd, beta, alpha, sig_beta, sig_alpha]
         Sampling 2 chains: 100%|     | 2500/2500 [02:48<00:00, 8.59draws/s]

         CPU times: user 27.5 s, sys: 3.68 s, total: 31.2 s
         Wall time: 5min 3s

In [32]: pm.summary(trace_rw).head() ❶
Out[32]:

                 mean       sd  mc_error  hpd_2.5  hpd_97.5      n_eff \
    alpha__0  0.673846  0.040224  0.001376  0.592655  0.753034  1004.616544
    alpha__1  0.424819  0.041257  0.001618  0.348102  0.509757  804.760648
    alpha__2  0.456817  0.057200  0.002011  0.321125  0.553173  800.225916
    alpha__3  0.268148  0.044879  0.001725  0.182744  0.352197  724.967532
    alpha__4  0.651465  0.057472  0.002197  0.544076  0.761216  978.073246
                 Rhat
    alpha__0 0.998637 
    alpha__1 0.999540 
    alpha__2 0.998075 
    alpha__3 0.998995 
    alpha__4 0.998060

❶ Summary statistics per time interval (only the first five rows, for alpha, are shown).

Figure 13-22 shows a subset of the estimated values, illustrating the evolution of the regression parameters alpha and beta over time:

In [33]: sh = np.shape(trace_rw['alpha']) ❶
         sh ❶
Out[33]: (500, 42)

In [34]: part_dates = np.linspace(min(mpl_dates),
                                  max(mpl_dates), sh[1]) ❷

In [35]: index = [dt.datetime.fromordinal(int(date)) for
                  date in part_dates] ❷

In [36]: alpha = {'alpha_%i' % i: v for i, v in
                  enumerate(trace_rw['alpha']) if i < 20} ❸

In [37]: beta = {'beta_%i' % i: v for i, v in
                  enumerate(trace_rw['beta']) if i < 20} ❸

In [38]: df_alpha = pd.DataFrame(alpha, index=index) ❸

In [39]: df_beta = pd.DataFrame(beta, index=index) ❸

In [40]: ax = df_alpha.plot(color='b', style='-.', legend=False,
                            lw=0.7, figsize=(10, 6))
         df_beta.plot(color='r', style='-.', legend=False,
                      lw=0.7, ax=ax)
         plt.ylabel('alpha/beta');

❶ The shape of the object containing the parameter estimates.

❷ Create a list of dates to match the number of time intervals.

❸ Collect the relevant parameter time series in two DataFrame objects.

Figure 13-22 Selected parameter estimates over time

 

Absolute price data and relative return data

The analysis in this section is based on normalized price data. This is for illustrative purposes only, because the resulting graphics are easier to understand and interpret (they are visually "more telling"). For real-world financial applications, however, one should rely on return data to ensure the stationarity of the time series data.
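As a brief sketch of that alternative (an addition, assuming the same data DataFrame as above), daily log returns could be derived and analyzed instead of price levels:

    # derive daily log returns from the normalized price data
    import numpy as np
    rets = np.log(data / data.shift(1)).dropna()  # daily log returns for GDX and GLD
    print(rets.corr())                            # correlation of returns rather than of price levels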

Using the mean alpha and beta values, we can illustrate how the regression is updated over time. Figure 13-23 again shows the data points as a scatter plot, together with the 42 regression lines resulting from the mean alpha and beta values per interval. It is obvious that updating over time improves the regression fit (for the current or most recent data) tremendously; in other words, every time period requires its own regression:

In [41]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'))
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         x = np.linspace(min(data['GDX']), max(data['GDX']))
         for i in range(sh[1]): ❶
             alpha_rw = np.mean(trace_rw['alpha'].T[i])
             beta_rw = np.mean(trace_rw['beta'].T[i])
             plt.plot(x, alpha_rw + beta_rw * x, '--', lw=0.7,
                     color=plt.cm.coolwarm(i / sh[1]))

❶ Draw regression lines for all time intervals of length 50.

Figure 13-23 Scatter plot with time-dependent regression lines (updated estimates)

This concludes the introduction to Bayesian regression. Python, through the powerful PyMC3 library, provides implementations of the major methods of Bayesian statistics and probabilistic programming, in particular Bayesian regression, which has become a popular and important tool in financial econometrics.

This article is excerpted from "Python Financial Big Data Analysis 2nd Edition"

"Python Financial Big Data Analysis 2nd Edition" is divided into 5 parts, a total of 21 chapters. Part 1 introduces the application of Python in finance, and its content covers the reasons why Python is used in the financial industry, the basic architecture and tools of Python, and some specific examples of Python in econometric finance; Part 2 introduces Basic knowledge of Python and the well-known library NumPy and pandas toolset in Python. Object-oriented programming is also introduced; Part 3 introduces the basic techniques and methods of financial data science, including data visualization, input/output operations, and mathematical Financial-related knowledge, etc.; Part 4 introduces the application of Python in algorithmic trading, focusing on common algorithms, including machine learning, deep neural networks and other artificial intelligence related algorithms; Part 5 explains the development of options and derivatives based on Monte Carlo simulation The application of pricing includes the introduction of the valuation framework, the simulation of financial models, the valuation of derivatives, and the valuation of investment portfolios. 

"Python Financial Big Data Analysis 2nd Edition" This book is suitable for financial industry developers who are interested in using Python for big data analysis and processing.
