Python financial big data analysis: approximation, one of the most commonly used mathematical techniques in finance

First, the usual import work:

In [1]: import numpy as np
        from pylab import plt, mpl

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline

The main example function used in this section is the following; it consists of a trigonometric term and a linear term:

In [3]: def f(x):
            return np.sin(x) + 0.5 * x

The task is to approximate this function over a given interval by means of regression and interpolation. First, generate a plot of the function to get a better sense of what the approximation methods have to achieve. The interval of interest is [−2π, 2π]. Figure 11-1 shows the function over the fixed interval defined via the np.linspace() function. create_plot() is a helper function that creates similar plots and will be used several times in this chapter:

Figure 11-1 Sample function chart

In [4]: def create_plot(x, y, styles, labels, axlabels):
            plt.figure(figsize=(10, 6))
            for i in range(len(x)):
                plt.plot(x[i], y[i], styles[i], label=labels[i])
                plt.xlabel(axlabels[0])
                plt.ylabel(axlabels[1])
            plt.legend(loc=0)

In [5]: x = np.linspace(-2 * np.pi, 2 * np.pi, 50) ❶

In [6]: create_plot([x], [f(x)], ['b'], ['f(x)'], ['x', 'f(x)'])

❶ The x values used for plotting and calculations.

11.1.1 Regression

Regression is a rather efficient tool for function approximation. It is not only suited to approximating one-dimensional functions but also works well in higher dimensions, and the numerical techniques needed to obtain the regression results are easy to implement and fast to execute. Essentially, the task of regression is the following: given a set of so-called basis functions b_d, d ∈ {1, ..., D}, find the optimal parameters α_1*, ..., α_D* according to Equation 11-1, where y_i ≡ f(x_i) for the observation points i ∈ {1, ..., I}. The x_i can be regarded as the observations of the independent variable and the y_i as the observations of the dependent variable (in a functional or statistical sense).

Equation 11-1. The minimization problem of regression

min_{α_1, ..., α_D} (1/I) · Σ_{i=1}^{I} ( y_i − Σ_{d=1}^{D} α_d · b_d(x_i) )²

1. Monomials as basis functions

In the simplest case, monomials are used as basis functions, that is, b1 = 1, b2 = x, b3 = x², b4 = x³, ... In this case, NumPy has built-in functions both to determine the optimal parameters (np.polyfit()) and to evaluate the approximation for a given set of input values (np.polyval()).

Table 11-1 lists the parameters of the np.polyfit() function. Given the optimal regression coefficients p returned by np.polyfit(), np.polyval(p, x) then returns the regression values for the x coordinates.
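Since np.polyfit() and np.polyval() do all the work here, a minimal, self-contained sketch of the two calls may serve as a quick reference (the parameter descriptions follow the NumPy documentation; the variable names are chosen for illustration only):

import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 50)
y = np.sin(x) + 0.5 * x

# np.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
#   x, y -- coordinates of the sample points
#   deg  -- degree of the fitting polynomial
#   full -- if True, also return residuals, rank, singular values, and rcond
#   w    -- optional weights for the sample points
#   cov  -- if True, also return the covariance matrix of the coefficients
p = np.polyfit(x, y, deg=1)

# np.polyval(p, x) evaluates the polynomial with coefficients p
# (highest degree first) at the points x
ry = np.polyval(p, x)
print(ry[:3].round(4))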

 

A linear regression (deg=1) with np.polyfit() and np.polyval() is applied below in the typical vectorized fashion. Since the regression estimates are stored in the ry array, the regression result can be compared with the original function, as shown in Figure 11-2. Naturally, a linear regression cannot cope with the sin part of the example function:

In [7]: res = np.polyfit(x, f(x), deg=1, full=True) ❶

In [8]: res ❷
Out[8]: (array([ 4.28841952e-01, -1.31499950e-16]),
         array([21.03238686]),
         2,
         array([1., 1.]),
         1.1102230246251565e-14)

In [9]: ry = np.polyval(res[0], x) ❸

In [10]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ Linear regression step.

❷ The complete results: regression parameters, residuals, effective rank, singular values, and relative condition number.

❸ Evaluation using the regression parameters.

Figure 11-2 Linear regression

To account for the sin part of the example function, higher-order monomials are necessary. The next regression attempt uses monomials up to order 5 as basis functions. As expected, the regression result (shown in Figure 11-3) looks closer to the original function. However, it is still far from perfect:

In [11]: reg = np.polyfit(x, f(x), deg=5)
         ry = np.polyval(reg, x)

In [12]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

Figure 11-3 Regression with monomials up to order 5

The last attempt uses monomials up to order 7 to approximate the example function. This time the result, shown in Figure 11-4, is quite convincing:

Figure 11-4 Regression with monomials up to order 7

In [13]: reg = np.polyfit(x, f(x), 7)
         ry = np.polyval(reg, x)

In [14]: np.allclose(f(x), ry) ❶
Out[14]: False

In [15]: np.mean((f(x) - ry) ** 2) ❷
Out[15]: 0.0017769134759517689

In [16]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ Check whether the function and the regression value are the same (at least close).

❷ Calculate the mean squared error (MSE) of the regression values relative to the function values.

2. Individual basis functions

In general, better regression results can be achieved by choosing a better set of basis functions, for example by exploiting knowledge about the function to be approximated. In this case, the individual basis functions have to be defined via a matrix approach (that is, using a NumPy ndarray object). First, monomials up to order 3 are used (Figure 11-5). The core function in this example is np.linalg.lstsq():

In [17]: matrix = np.zeros((3 + 1, len(x))) ❶
         matrix[3, :] = x ** 3 ❷
         matrix[2, :] = x ** 2 ❷
         matrix[1, :] = x ❷
         matrix[0, :] = 1 ❷

In [18]: reg = np.linalg.lstsq(matrix.T, f(x), rcond=None)[0] ❸

In [19]: reg.round(4) ❹
Out[19]: array([ 0. , 0.5628, -0. , -0.0054])

In [20]: ry = np.dot(reg, matrix) ❺

In [21]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ The ndarray object (matrix) for the basis function values.

❷ The basis function values, from the constant to the cubic term.

❸ The regression step.

❹ The optimal regression parameters.

❺ The regression estimates for the function values.

Figure 11-5 Regression with individual basis functions

Given the experience with monomials above, the result in Figure 11-5 is not really as good as expected. The more general matrix approach, however, allows us to exploit knowledge about the example function: we know that it contains a sin part, so it makes sense to include a sine function in the set of basis functions. For simplicity, it replaces the highest-order monomial. The fit is now perfect, as Figure 11-6 shows:

In [22]: matrix[3, :] = np.sin(x) ❶

In [23]: reg = np.linalg.lstsq(matrix.T, f(x), rcond=None)[0]

In [24]: reg.round(4) ❷
Out[24]: array([0. , 0.5, 0. , 1. ])

In [25]: ry = np.dot(reg, matrix)

In [26]: np.allclose(f(x), ry) ❸
Out[26]: True

In [27]: np.mean((f(x) - ry) ** 2) ❸
Out[27]: 3.404735992885531e-31

In [28]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ The new basis function exploits knowledge about the example function.

❷ The optimal regression parameters restore the original parameters.

❸ Now, the regression produces a perfect fit.

Figure 11-6 Regression using sine basis functions

3. Noisy data

Regression can also cope with noisy data, such as data originating from simulations or (imperfect) measurements. To illustrate this point, both the independent variable observations and the dependent variable observations are generated with noise. Figure 11-7 shows that the regression results are closer to the original function than the noisy data points; in a sense, the regression averages out the noise to some extent:

Figure 11-7 Regression using noisy data

In [29]: xn = np.linspace(-2 * np.pi, 2 * np.pi, 50) ❶
         xn = xn + 0.15 * np.random.standard_normal(len(xn)) ❷
         yn = f(xn) + 0.25 * np.random.standard_normal(len(xn)) ❸

In [30]: reg = np.polyfit(xn, yn, 7)
         ry = np.polyval(reg, xn)
In [31]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ The new deterministic x values.

❷ Introduce noise into the x values.

❸ Introduce noise into the y values.

4. Unsorted data

Another important feature of regression is that it handles unsorted data seamlessly. The previous examples all rely on sorted x data, but that does not have to be the case. To make the point, the independent variable data points are randomized. In this case, it is hard to identify any structure just by visually inspecting the raw data:

In [32]: xu = np.random.rand(50) * 4 * np.pi - 2 * np.pi ❶
         yu = f(xu)

In [33]: print(xu[:10].round(2))
         print(yu[:10].round(2))
         [-4.17 -0.11 -1.91 2.33 3.34 -0.96 5.81 4.92 -4.56 -5.42]
         [-1.23 -0.17 -1.9 1.89 1.47 -1.29 2.45 1.48 -1.29 -1.95]

In [34]: reg = np.polyfit(xu, yu, 5)
         ry = np.polyval(reg, xu)
In [35]: create_plot([xu, xu], [yu, ry], ['b.', 'ro'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])

❶ Randomize the x values.

As with noisy data, the regression approach does not care about the order of the observation points. This becomes clear from the structure of the minimization problem in Equation 11-1, and it is also obvious from the results shown in Figure 11-8. It can even be verified directly in code, as the short check after the figure shows.

Figure 11-8 Regression using unsorted data
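The following small sketch (an addition for illustration; the random seed is arbitrary) fits the same polynomial to the sample points in sorted and in shuffled order and compares the resulting coefficients:

import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 50)
y = np.sin(x) + 0.5 * x

rng = np.random.default_rng(0)
perm = rng.permutation(len(x))          # random reordering of the sample points

reg_sorted = np.polyfit(x, y, 5)
reg_shuffled = np.polyfit(x[perm], y[perm], 5)

# the sum of squared residuals in Equation 11-1 is invariant to the order
# of the (x_i, y_i) pairs, so both fits yield (numerically) the same coefficients
print(np.allclose(reg_sorted, reg_shuffled))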

5. Multidimensional

Another advantage of least-squares regression is that it can be used in multiple dimensions without much modification. The fm() function below serves as the example:

In [36]: def fm(p):
             x, y = p
             return np.sin(x) + 0.25 * x + np.sqrt(y) + 0.05 * y ** 2

To visualize this function properly, a grid of independent variable data points (in two dimensions) is needed. Figure 11-9 shows the shape of the fm() function over the two-dimensional grid of independent variables together with the resulting dependent variable values:

In [37]: x = np.linspace(0, 10, 20)
         y = np.linspace(0, 10, 20)
         X, Y = np.meshgrid(x, y) ❶

In [38]: Z = fm((X, Y))
         x = X.flatten() ❷
         y = Y.flatten() ❷

In [39]: from mpl_toolkits.mplot3d import Axes3D ❸

In [40]: fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')
         surf = ax.plot_surface(X, Y, Z, rstride=2, cstride=2,
                                cmap='coolwarm', linewidth=0.5,
                                antialiased=True)
         ax.set_xlabel('x')
         ax.set_ylabel('y')
         ax.set_zlabel('f(x, y)')
         fig.colorbar(surf, shrink=0.5, aspect=5)

❶ Generate the two-dimensional ndarray objects (grids) from the two one-dimensional ndarray objects.

❷ Flatten the two-dimensional ndarray objects into one-dimensional ones.

❸ Import the 3D plotting capabilities from matplotlib (needed for the 3D projection).

Figure 11-9 A function with two parameters

To obtain good regression results, the set of basis functions is compiled using knowledge of the example function; in particular, it includes the np.sin() and np.sqrt() functions that appear in fm(). Figure 11-10 visualizes the perfect regression result:

In [41]: matrix = np.zeros((len(x), 6 + 1))
         matrix[:, 6] = np.sqrt(y) ❶
         matrix[:, 5] = np.sin(x) ❷
         matrix[:, 4] = y ** 2
         matrix[:, 3] = x ** 2
         matrix[:, 2] = y
         matrix[:, 1] = x
         matrix[:, 0] = 1

In [42]: reg = np.linalg.lstsq(matrix, fm((x, y)), rcond=None)[0]

In [43]: RZ = np.dot(matrix, reg).reshape((20, 20)) ❸

In [44]: fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')
         surf1 = ax.plot_surface(X, Y, Z, rstride=2, cstride=2,
                     cmap=mpl.cm.coolwarm, linewidth=0.5,
                     antialiased=True) ❹
         surf2 = ax.plot_wireframe(X, Y, RZ, rstride=2, cstride=2,
                                   label='regression') ❺
         ax.set_xlabel('x')
         ax.set_ylabel('y')
         ax.set_zlabel('f(x, y)')
         ax.legend()
         fig.colorbar(surf1, shrink=0.5, aspect=5)

❶ The np.sqrt() function for the y parameter.

❷ The np.sin() function for the x parameter.

❸ Convert the regression results into a grid structure.

❹ Draw the original function surface.

❺ Draw the regression surface.

Figure 11-10 The regression surface of a two-parameter function

 

Regression

The least-squares regression approach has many application areas, including simple function approximation and the approximation of functions based on noisy or unsorted data. It can be applied to one-dimensional as well as multi-dimensional problems, and thanks to the underlying mathematics, its application is always "almost the same" in both cases.
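The "almost the same" point can be illustrated with a minimal sketch that solves the minimization problem of Equation 11-1 for an arbitrary list of basis functions; the helper function lstsq_approx() is introduced here for illustration only and is not part of the book's code:

import numpy as np

def lstsq_approx(basis, x, y):
    # build the matrix of basis function values and solve the
    # minimization problem of Equation 11-1 via least squares
    A = np.column_stack([b(x) for b in basis])
    alpha, *_ = np.linalg.lstsq(A, y, rcond=None)
    return alpha

x = np.linspace(-2 * np.pi, 2 * np.pi, 50)
y = np.sin(x) + 0.5 * x

# the same pattern works for any choice of basis functions
basis = [np.ones_like, lambda v: v, np.sin]
alpha = lstsq_approx(basis, x, y)
print(alpha.round(4))  # roughly [0., 0.5, 1.]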

11.1.2 Interpolation

Compared with regression, interpolation (for example, cubic spline interpolation) is mathematically more involved. It is also limited to low-dimensional problems. Given a set of ordered observation points (sorted in the x dimension), the basic idea is to perform a regression between any two neighboring data points such that the resulting piecewise interpolation function not only matches the data points exactly but is also continuously differentiable at the data points. Continuous differentiability requires interpolation of at least degree 3, that is, cubic spline interpolation. However, the approach also works in general with quadratic and even linear splines.

The following code can implement linear spline interpolation, and the result is shown in Figure 11-11:

In [45]: import scipy.interpolate as spi ❶

In [46]: x = np.linspace(-2 * np.pi, 2 * np.pi, 25)

In [47]: def f(x):
             return np.sin(x) + 0.5 * x

In [48]: ipo = spi.splrep(x, f(x), k=1) ❷

In [49]: iy = spi.splev(x, ipo) ❸

In [50]: np.allclose(f(x), iy) ❹
Out[50]: True

In [51]: create_plot([x, x], [f(x), iy], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])

❶ Import the necessary sub-libraries from SciPy.

❷ Implement the linear spline interpolation.

❸ Derive the interpolated values.

❹ Check whether the interpolated value is (sufficiently) close to the function value.

Figure 11-11 Linear spline interpolation (complete data set)

Given a set of data points sorted in the x dimension, applying spline interpolation is as straightforward as using the np.polyfit() and np.polyval() functions; in this case, the corresponding functions are spi.splrep() and spi.splev(). Table 11-2 lists the major parameters of the spi.splrep() function.

 

Table 11-3 lists the major parameters of the spi.splev() function.

 
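As a quick reference, the following sketch shows how the two functions are typically called (the parameter descriptions follow the SciPy documentation; only the arguments used in this section are commented):

import numpy as np
import scipy.interpolate as spi

x = np.linspace(-2 * np.pi, 2 * np.pi, 25)
y = np.sin(x) + 0.5 * x

# spi.splrep(x, y, k=3, s=None, ...) returns the spline representation
# (knots, coefficients, degree) of the data:
#   x, y -- the (sorted) sample points
#   k    -- degree of the spline (1 = linear, 3 = cubic), with 1 <= k <= 5
#   s    -- smoothing factor; without weights it defaults to 0 (pure interpolation)
tck = spi.splrep(x, y, k=3)

# spi.splev(x, tck, der=0) evaluates the spline (or, via der, one of its
# derivatives) at the points x
iy = spi.splev(x, tck)
print(np.allclose(y, iy))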

Spline interpolation is often used in finance to estimate values of the dependent variable at points of the independent variable that are not among the original observations. To this end, the next example picks a much smaller interval and takes a closer look at the interpolated values from the linear spline. Figure 11-12 shows that the interpolation function indeed interpolates linearly between two observation points. For certain applications this may not be precise enough. In addition, it is evident that the function is not continuously differentiable at the original data points, which is another drawback:

In [52]: xd = np.linspace(1.0, 3.0, 50) ❶
         iyd = spi.splev(xd, ipo)

In [53]: create_plot([xd, xd], [f(xd), iyd], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])

❶ A smaller interval with more data points.

Figure 11-12 Linear spline interpolation (data subset)

Repeating the entire exercise, this time with cubic splines (k=3), improves the result considerably (see Figure 11-13):

In [54]: ipo = spi.splrep(x, f(x), k=3) ❶
         iyd = spi.splev(xd, ipo) ❷

In [55]: np.allclose(f(xd), iyd) ❸
Out[55]: False

In [56]: np.mean((f(xd) - iyd) ** 2) ❹
Out[56]: 1.1349319851436892e-08

In [57]: create_plot([xd, xd], [f(xd), iyd], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])

❶ 3rd order spline interpolation on the complete data set.

❷ The result applied to the smaller interval.

❸ The interpolation is still not perfect.

❹ But better than before.
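The gain in smoothness can also be inspected numerically, since spi.splev() accepts a der parameter for evaluating derivatives of the spline. The following sketch is an addition for illustration; the analytical derivative cos(x) + 0.5 of the example function is used only for comparison:

import numpy as np
import scipy.interpolate as spi

x = np.linspace(-2 * np.pi, 2 * np.pi, 25)
y = np.sin(x) + 0.5 * x
xd = np.linspace(1.0, 3.0, 50)

lin = spi.splrep(x, y, k=1)  # linear spline
cub = spi.splrep(x, y, k=3)  # cubic spline

d_lin = spi.splev(xd, lin, der=1)  # piecewise constant, jumps at the knots
d_cub = spi.splev(xd, cub, der=1)  # smooth, close to the true derivative

print(np.unique(d_lin.round(4)))                 # only a handful of distinct slopes
print(np.mean((np.cos(xd) + 0.5 - d_cub) ** 2))  # small MSE for the cubic spline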

 

Interpolation

In those cases where spline interpolation can be applied, better approximation results can be expected than with the least-squares regression approach. But remember that sorted (and noise-free) data is required and that the approach is limited to low-dimensional problems. Spline interpolation is also computationally more demanding and may therefore take much longer than regression in certain use cases.

Figure 11-13 Cubic spline interpolation (data subset)

This article is excerpted from "Python Financial Big Data Analysis" (2nd Edition)

The second edition of this book is more of an upgrade than an update. For example, this edition adds an entire part (Part 4) on algorithmic trading, a topic that has recently become quite important in the financial industry and is also very popular among retail investors. This edition also adds an introductory part (Part 2) that covers basic Python programming and data analysis and lays the foundation for the subsequent parts of the book. On the other hand, some chapters of the first edition have been removed entirely. For example, the chapter on Web technologies and the corresponding libraries (such as Flask) was dropped, because there are now books dedicated to those topics.

In the second edition, I tried to cover more finance-related topics, focusing on Python techniques that are particularly useful for financial data science, algorithmic trading, and computational finance. As in the first edition, I took a practical approach. The implementation and illustrations precede the theoretical details, and I usually focus on the whole, rather than some classes, methods, or obscure function parameterization options.

After describing the basic methods of the second edition, I must also emphasize that this book is neither an introduction to Python programming nor general financial knowledge. In both aspects, there are a large number of excellent sources of knowledge. This book is positioned at the intersection of these two exciting fields, and assumes that the reader has a certain programming (not necessarily Python) and financial background. These readers will learn how to apply Python and its ecosystem to the financial field.

 
