20200918: [Guangfa Financial Engineering] The ninth series of heavyweight topics in 2018: Research on stock selection strategy based on hidden Markov model

Reference link: https://www.sohu.com/a/252454782_465470

 

2018-09-07 11:26

Report summary

1

The legendary Medallion Fund

From its founding in 1988 until Simons formally retired on January 1, 2010, the Medallion Fund's average annual net return exceeded 35%, far above the annualized return of the S&P 500 index. Moreover, the fund performed even better in volatile markets: during the technology stock crash of 2000 and the global financial crisis of 2008, its net return exceeded 90% in each of those years.

Many core members of Renaissance Technologies are experts in HMMs and speech recognition. It is therefore widely believed that the hidden Markov model is the secret weapon behind the Medallion Fund's outstanding performance.

2

Stock selection strategy based on HMM model

This report brings speech recognition techniques to stock price forecasting. We assume that rising and falling stocks each follow a distinct pattern, and that each pattern can be described by an HMM. We choose 6 price and volume indicators, such as the turnover rate and the 1-day price change, as the model's observations, and use the rising stocks in the stock pool as samples to train the HMM that characterizes the rising pattern.

At prediction time, the higher the probability of a stock's observation sequence under the HMM that characterizes the rising pattern, the more likely the stock is to actually rise. We divide the stocks in the stock pool into 10 tiers by HMM factor value and overweight the stocks in the highest tier in each period.

3

The model's stock picking returns far exceed the benchmark

Empirical analysis shows that the stock selection strategy based on the hidden Markov model, using the CSI 500 constituents as the stock pool, earns substantial excess returns. Industry-neutral optimization improves the strategy further. Over the backtest period from February 1, 2007 to August 15, 2018, the strategy's annualized excess return was 16.19%, its maximum drawdown was -9.69%, and its information ratio reached 2.14.

1. Research background

Quantitative investing has a legend: James Simons, the world's most profitable hedge fund manager, who ranked 59th on the Forbes global rich list updated in March 2018 with a net worth of US$20 billion.

The annual returns of the Medallion Fund are shown in Figure 1. Apart from a decline in net asset value in its second year, the fund achieved very high positive returns in every other year. From its founding in 1988 until Simons formally retired on January 1, 2010, the fund's average annual net return exceeded 35%, far above the annualized return of the S&P 500 index. Moreover, the fund performed even better in volatile markets. During the technology stock crash of 2000, the S&P 500 fell 10.1% while the Medallion Fund's net return was 98.5%, nearly doubling investors' money. During the global financial crisis of 2008, asset prices fell across the board and most hedge funds lost money, yet the Medallion Fund's net return was 98.2%. These figures are net of a fixed 5% annual management fee and a performance fee ranging from 22% to 44%. Such astonishing returns have made many investors very curious about Renaissance's investment methods.

For many years, Simons has kept the Medallion Fund's investment model strictly secret. Apart from the core members of the team, no one knows how the fund earns such returns, and no other organization or individual has been able to replicate its success. People therefore try to infer the "mysterious model" behind the Medallion Fund's performance from the backgrounds of Renaissance's core team.

In the early days of Renaissance Technologies, the famous statistician Leonard Baum undoubtedly played a key role. Baum was one of the first to propose the hidden Markov model (HMM), which describes a Markov process with unknown parameters and has been applied very successfully in fields such as speech recognition and bioinformatics. The well-known Baum-Welch algorithm in statistics, an effective method for learning the parameters of a hidden Markov model, is also named after him. Simons himself believes that investment and speech recognition are very similar, and he once recruited the elite of IBM's entire speech laboratory to Renaissance. Given this background, people have reason to believe that the hidden Markov model is the secret weapon behind the Medallion Fund's outstanding performance.

In fact, stock price prediction and speech recognition do have much in common. In word recognition, for example, the pronunciation of a word is divided into a series of consecutive phonemes (phones), and which word the whole utterance is recognized as is determined by this phoneme sequence. Forecasting the rise and fall of a stock price is similar: the rise or fall is likewise driven by a time series of price and volume factors. For example, in a one-sided (trending) market, the stock price is relatively likely to keep rising for some time. A hidden Markov model can describe this dynamic process and extract more information from the time series, which is an advantage over statistical learning models such as linear regression, SVM or random forest. In the previously released report "Exploring Simons' Investment Approach: Research on Weekly Timing Strategies Based on HMM Models", we showed empirically that applying the hidden Markov model to index timing achieves good results in both forecast accuracy and timing strategy returns. This report further explores how to apply the hidden Markov model to stock selection and demonstrates its effect.

2. Introduction to Hidden Markov Model

2.1

Definition of Hidden Markov Model

A hidden Markov model (HMM) is a probabilistic model of time series. It describes a process in which a hidden Markov chain generates a random sequence of unobservable states, and each state then generates an observation, producing a random sequence of observations. A hidden Markov model therefore contains two random sequences: the unobservable state sequence, and the sequence of observations generated by the states, called the observation sequence. Each position in a sequence can be regarded as a point in time, and the order of the positions corresponds to the order in time.

The hidden Markov model has three main elements, namely the initial probability distribution, the state transition probability distribution and the observation probability distribution. Its specific form is as follows:

Let the set of all possible states be Q={q_1,q_2,…,q_N} and the set of all possible observations be V={v_1,v_2,…,v_M}, where N is the number of possible states and M is the number of possible observations. For a state sequence of length T, the state sequence is written I=(i_1,i_2,…,i_T) and the corresponding observation sequence is O=(o_1,o_2,…,o_T).

The hidden Markov model is determined by the initial state probability distribution π, the state transition probability matrix A and the observation probability matrix B. Among them, π and A determine the state sequence, and B determines the observation sequence, so the hidden Markov model can be expressed as λ=(A,B,π).
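To make the roles of π, A and B concrete, the following minimal sketch draws a state sequence I and an observation sequence O from a discrete model λ=(A,B,π). The function name `sample_hmm` and the numeric values (N=2 states, M=3 observation symbols) are illustrative assumptions, not values from the report.

```python
import random

def sample_hmm(pi, A, B, T, seed=0):
    """Draw (states, observations) of length T from lambda = (A, B, pi)."""
    rng = random.Random(seed)

    def draw(probs):
        # Pick an index according to a discrete probability distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    states, observations = [], []
    state = draw(pi)                       # pi decides the first hidden state
    for _ in range(T):
        states.append(state)
        observations.append(draw(B[state]))  # B decides what is observed
        state = draw(A[state])             # A decides the next hidden state
    return states, observations

# Toy parameters (assumed for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]

states, observations = sample_hmm(pi, A, B, T=8)
print("hidden states:", states)
print("observations: ", observations)
```

Only the second list would be visible in practice; the first is the hidden sequence that the algorithms in the next section try to reason about.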

2.2

Three basic problems of HMM

To apply the hidden Markov model to practice, there are three basic problems that need to be solved:

Question 1: Probability calculation problem. Given model parameters λ=(A,B,π) and observation sequence O=(o_1,o_2,...,o_T), calculate the probability P(O|λ) that O appears under the model.

Question 2: Learning problem. Given the observation sequence O=(o_1,o_2,…,o_T), estimate the model parameters λ=(A,B,π) that maximize the probability P(O|λ) of O under the model. That is, the parameters are estimated by maximum likelihood.

Question 3: Prediction problem. Given the model parameters λ=(A,B,π) and the observation sequence O=(o_1,o_2,…,o_T), find the state sequence I=(i_1,i_2,…,i_T) that maximizes the conditional probability P(I|O). That is, given the observation sequence, find the most likely corresponding state sequence.

Each of the three basic problems has a corresponding algorithm: the forward-backward algorithm for the probability calculation problem, the Baum-Welch algorithm for the learning problem, and the Viterbi algorithm for the prediction problem.
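As a small illustration of Questions 1 and 3, the sketch below implements the forward pass (which is sufficient on its own to obtain P(O|λ)) and the Viterbi algorithm for a discrete HMM. The three-state, two-symbol parameters are a toy example chosen for illustration, not taken from the report, and the Baum-Welch algorithm for Question 2 is not shown.

```python
def forward_prob(pi, A, B, obs):
    """Question 1: P(O | lambda) via the forward recursion."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(pi, A, B, obs):
    """Question 3: most likely state path and its joint probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # For each target state j, remember the best predecessor state.
        ptr = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        backpointers.append(ptr)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy model (assumed): 3 hidden states, 2 observation symbols.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],
     [0.4, 0.6],
     [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
obs = [0, 1, 0]

print(round(forward_prob(pi, A, B, obs), 6))   # 0.130218
path, prob = viterbi(pi, A, B, obs)
print(path, round(prob, 4))                    # [2, 2, 2] 0.0147
```

For long sequences the raw products underflow, so practical implementations usually work with scaled or log probabilities.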

2.3

Application of HMM in the field of speech recognition

Since the 1980s, speech recognition research has gradually shifted from traditional approaches to statistical models, and the hidden Markov model brought a major breakthrough to the field. Dr. Kai-Fu Lee, now a well-known investor, built Sphinx, the first large-vocabulary speech recognition system based on hidden Markov models, at Carnegie Mellon University.

James Simons believes that investment and speech recognition have much in common, and this report attempts to bring the classic speech recognition model to the stock price prediction problem. The classic HMM training process for speech recognition is shown in Figure 2. The observation in speech recognition is a waveform signal. The waveform is first divided into short frames of equal length, and features (such as MFCC) are extracted from each frame, so that each observation instance is represented by a feature sequence X=(x_1,x_2,…,x_T), x_i∈R^D, where T is the length of the sequence, x_i is the feature vector extracted from frame i, and D is its dimension. In an isolated word recognition system, one HMM is built for each word, and its parameters are iterated over the training samples until they converge to a locally optimal estimate.

After training, a word is recognized by extracting features from the speech in the same way, feeding them into each HMM, and using the forward-backward algorithm to compute the probability that each HMM generated the sequence. The model with the highest probability is selected, and the word it represents is the recognition result.
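The recognition step can be sketched as "score the sequence under every word model and take the argmax". The example below is a simplified illustration under stated assumptions: observations are discrete symbols rather than continuous MFCC vectors, the two word models (`word_a`, `word_b`) are hand-written instead of trained, and the scaled forward recursion returns a log-likelihood to avoid underflow.

```python
import math

def log_likelihood(model, obs):
    """log P(O | lambda) using the forward recursion with per-step scaling."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        loglik += math.log(scale)          # accumulate log of scaling factors
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return loglik

def recognize(models, obs):
    """Return the name of the model under which obs is most likely."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))

# Two hypothetical left-to-right word models over symbols {0, 1, 2}.
transitions = [[0.6, 0.4],
               [0.0, 1.0]]
models = {
    "word_a": ([1.0, 0.0], transitions, [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    "word_b": ([1.0, 0.0], transitions, [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]),
}

print(recognize(models, [0, 0, 2, 2]))   # word_a
print(recognize(models, [1, 1, 0, 0]))   # word_b
```

The stock model in the next section reuses exactly this decision rule, with an UP model and a DOWN model in place of the word models.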

3. HMM-based stock rise and fall model

3.1

How the HMM predicts the rise and fall of stock prices

Inspired by the application of the HMM model in speech recognition, we try to introduce its ideas into the stock price prediction. The strategy proposed in this report is based on the following two core assumptions:

1) Judged by their price behavior over the preceding period (a week or a month), rising stocks and falling stocks each follow a distinct pattern, and each pattern can be described by its own HMM (the UP model and the DOWN model);

2) A stock's observable data (such as its price change and turnover rate) are determined by a finite number of hidden states that follow a Markov process and satisfy the homogeneous Markov assumption and the observation independence assumption.

Based on these two assumptions, this report uses the hidden Markov model to predict the future rise and fall of individual stocks. Suppose the holding period is L days; on trading day T, the goal is to predict whether each stock's price on day T+L will be higher or lower than on day T. Let the length of the observation sequence be Q. Each stock's price and volume information is cut into daily observations: the price and volume vector of one day (for example the 1-day price change and the turnover rate) corresponds to a frame in the speech recognition model, and the vectors for days (T-Q+1)~T form an observation sequence of length Q, corresponding to a waveform.

For training, stocks whose price on day T+L rose relative to day T are the training samples of the UP model, and stocks whose price on day T+L fell relative to day T are the training samples of the DOWN model.

For prediction on day T, the price and volume vectors of each stock for days (T-Q+1)~T are fed as features into the trained UP and DOWN models, giving the probabilities P_UP and P_DOWN of the observation sequence under the two models. If P_UP>P_DOWN, the observation sequence is more likely to have been generated by the UP model, and the stock is considered more likely to rise than to fall on day T+L relative to day T; otherwise it is considered more likely to fall. The training and prediction processes of the HMM rise and fall model are shown in Figure 4 and Figure 5, respectively.
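The sample construction described above can be sketched as follows. The data layout (per-stock lists indexed by trading day), the function names and the decision to drop stocks whose price is unchanged are assumptions made for this illustration; the report does not specify them.

```python
def build_samples(features, closes, t, q, l):
    """Split stocks into UP / DOWN training windows for label day t.

    features: dict stock -> list of daily feature vectors
    closes:   dict stock -> list of daily closing prices
    t: index of trading day T, q: window length Q, l: holding period L
    """
    up, down = [], []
    for stock, feats in features.items():
        px = closes[stock]
        if t - q + 1 < 0 or t + l >= len(px):
            continue                          # not enough history or future data
        window = feats[t - q + 1 : t + 1]     # days (T-Q+1)..T, length Q
        if px[t + l] > px[t]:
            up.append(window)                 # sample for the UP model
        elif px[t + l] < px[t]:
            down.append(window)               # sample for the DOWN model
        # unchanged prices are dropped (an assumption of this sketch)
    return up, down

def predict_direction(log_p_up, log_p_down):
    """Decision rule: compare the sequence likelihood under the two models."""
    return "UP" if log_p_up > log_p_down else "DOWN"

# Toy data: one feature per day, 8 trading days, two hypothetical stocks.
features = {"A": [[i] for i in range(8)], "B": [[-i] for i in range(8)]}
closes = {"A": [10, 11, 12, 13, 14, 15, 16, 17],
          "B": [10, 9, 8, 7, 6, 5, 4, 3]}

up, down = build_samples(features, closes, t=4, q=3, l=2)
print(up)     # [[[2], [3], [4]]]
print(down)   # [[[-2], [-3], [-4]]]
print(predict_direction(-12.3, -15.8))   # UP
```

Because the likelihoods of length-Q sequences are tiny, the comparison is written in terms of log probabilities, which preserves the ordering of P_UP and P_DOWN.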

3.2

Selection and standardization of model input features

This report mainly selects features from the price and volume information of individual stocks as the observation value of the HMM model. The dimension of the observation vector is 6, which can be expressed as

Here close, open, high and low are the stock's daily closing, opening, highest and lowest prices; turnover is the daily turnover rate; returnPast1d is the change in the daily closing price relative to the previous day's close; and cap is the free-float market capitalization at each day's close. These 7 quantities are called the original stock factors, and each component of the observation vector is called a feature.

The original stock factor data contain missing values and outliers, and the value ranges of different factors differ greatly, so the features need to be standardized. Feature standardization in this report consists of the following steps:

3.3

Model training method and parameter selection

In this report, the position adjustment cycle L is 5 trading days, and the HMM uses the same 5 trading days as its forecast horizon: at the close of trading day T, it predicts whether the stock price 5 trading days later will be higher or lower than the closing price on day T. The stock pool is the CSI 500 constituents, and all training and test samples are drawn from them. To let the model track changes in market style, the model is trained on a fully rolling basis, that is, it is retrained at every position adjustment. As a compromise between sample size and computational cost, the samples from the previous 10 periods are used as the training set. The length Q of the observation sequence is a hyperparameter to be determined. As shown in Figure 6, on trading day T the goal is to predict whether each constituent stock's price on trading day T+5 will be higher or lower than its current price. Assuming Q=10, the observation sequence of each stock in the test set covers the 10 trading days (T-9)~T, with 6 observed values per day.

We take the CSI 500 constituents whose price on February 1, 2007 was higher than 5 days earlier as the test set, and build the training set as described above, to determine the hyperparameters of the hidden Markov model (the UP model). We first restrict each hyperparameter to a reasonable range of candidate values, then train the model on the training set by grid search, and finally choose the hyperparameter combination that maximizes the average observation probability of the test-set samples. The result is:
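One possible reading of the rolling scheme is sketched below: at each position adjustment day, the training set is built from the 10 preceding periods, each contributing a window that ends on its own day and a label that is already known by the current day. Day indices, the function name `rolling_schedule` and this particular alignment of windows are assumptions; the report's Figure 6 is not reproduced here.

```python
def rolling_schedule(n_days, first_day, step=5, n_periods=10, q=10):
    """List (adjustment_day, training_window_end_days) pairs.

    A training window ending on day d covers days (d-q+1)..d and is labeled
    with the price move from d to d+step, so d+step must not exceed the
    adjustment day (no look-ahead).
    """
    schedule = []
    for t in range(first_day, n_days, step):
        # End days of the previous n_periods windows, oldest first.
        train_days = [t - step * k for k in range(n_periods, 0, -1)]
        # Keep only windows with a full q-day history.
        train_days = [d for d in train_days if d - q + 1 >= 0]
        schedule.append((t, train_days))
    return schedule

schedule = rolling_schedule(n_days=100, first_day=60)
day, train_days = schedule[0]
print(day, train_days)
# 60 [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```

The model would be refit from scratch on each `train_days` list, which is what makes the scheme fully rolling.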

1) The number of hidden states, N=3

2) Length of observation sequence, Q=10

3) The number of components of the Gaussian mixture model, M=2
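The grid search over these three hyperparameters can be sketched generically as below. The candidate values in `grid` are assumptions (the report does not list its search ranges), and `toy_score` is a stand-in constructed to peak at the reported combination purely so that the example runs; a real `score_fn` would train the UP model on the training set and return the average observation log-probability of the test-set samples.

```python
from itertools import product

def grid_search(grid, score_fn):
    """Evaluate every hyperparameter combination and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate ranges.
grid = {
    "n_states": [2, 3, 4, 5],     # number of hidden states N
    "seq_len": [5, 10, 20],       # observation sequence length Q
    "n_mix": [1, 2, 3],           # Gaussian mixture components M
}

def toy_score(params):
    # Placeholder for "train on the training set, score on the test set".
    return 0.0 - (abs(params["n_states"] - 3)
                  + abs(params["seq_len"] - 10) / 5
                  + abs(params["n_mix"] - 2))

best_params, best_score = grid_search(grid, toy_score)
print(best_params)   # {'n_states': 3, 'seq_len': 10, 'n_mix': 2}
```

With 4 x 3 x 3 = 36 combinations, each requiring a full model fit, the cost of the search grows quickly, which is one reason to narrow the candidate ranges first.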

4. Strategy and empirical analysis

4.1

Strategy principle and parameter setting

We assumed above that rising and falling stocks each follow a distinct pattern, and that the UP model trained on rising-stock data represents the "rising" pattern. The main logic of the stock selection strategy is therefore: the higher the probability P(O|λ_UP) of a constituent stock's observation sequence under the UP model, the more likely the sequence was generated by the UP model, and the more likely the stock is to rise in the future. Under this logic, the observation sequence probability computed by the UP model plays the same role as traditional stock selection factors such as EP and ROE. For convenience, we call this observation sequence probability the HMM factor below.

This report uses the constituent stocks of the CSI 500 Index as the stock pool to backtest the stock selection strategy. The backtest parameters are set as follows:

Position adjustment cycle: 5 trading days

Stock pool: CSI 500 constituent stocks, excluding ST stocks and stocks suspended on the trading day

Overweight portfolio: at each position adjustment, the stock pool is divided into 10 tiers by HMM factor value, and the stocks in the highest tier are bought with equal weights

Hedging plan: long and short hedging; CSI 500 index hedging

Backtest period: February 1, 2007-August 15, 2018

Transaction cost: not considered for long-short hedging; 0.2% (two per thousand) for index hedging

Position adjustment rules: to control turnover and avoid excessive transaction costs, stocks already held are given priority at each position adjustment. Specifically, the CSI 500 constituents are divided into ten tiers by HMM factor value; the highest tier scores 10 points, the second tier 9 points, and so on down to 1 point for the lowest tier. The score of each stock already held is increased by 15%, and then one-tenth of the stocks in the pool are bought in order of final score, with ties broken in favor of the higher HMM factor value. As a result, stocks held in the previous period that score 9 or more in the current period are retained first (9*1.15=10.35>10), which reduces the transaction cost of switching positions.
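The position adjustment rule can be sketched as follows. The function name `select_portfolio`, the data layout and the way tier boundaries are cut when the pool size is not a multiple of ten are assumptions of this illustration.

```python
def select_portfolio(factor, held, n_tiers=10, bonus=0.15):
    """Pick one-tenth of the pool by tier score, favoring stocks already held.

    factor: dict stock -> HMM factor value
    held:   set of stocks held in the previous period
    """
    ranked = sorted(factor, key=factor.get, reverse=True)
    n = len(ranked)
    scores = {}
    for pos, stock in enumerate(ranked):
        tier = pos * n_tiers // n      # 0 = tier with the highest factor values
        score = n_tiers - tier         # 10 points for the top tier, 1 for the last
        if stock in held:
            score *= 1 + bonus         # 15% bonus for existing holdings
        scores[stock] = score
    n_select = n // n_tiers
    # Sort by final score, breaking ties by the higher HMM factor value.
    ordered = sorted(ranked, key=lambda s: (scores[s], factor[s]), reverse=True)
    return ordered[:n_select]

# Toy pool of 20 hypothetical stocks: s0 has the highest factor value.
factor = {f"s{i}": 20 - i for i in range(20)}

print(select_portfolio(factor, held={"s3"}))   # ['s3', 's0']
print(select_portfolio(factor, held={"s5"}))   # ['s0', 's1']
```

In the first call s3 sits in the second tier (9 points), and the bonus lifts it to 10.35, ahead of the top-tier stocks; in the second call s5 sits in the third tier (8 * 1.15 = 9.2), so the bonus is not enough to keep it.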

4.2

Strategic empirical analysis

4.2.1 Empirical analysis of historical IC and tiered returns of HMM factors

Factor IC is the correlation coefficient between the cross-sectional factor values and the stocks' returns in the next period; it reflects the factor's ability to deliver excess returns. Starting from February 1, 2007, we compute the cross-sectional rank correlation between the HMM factor and the return over the next 5 trading days; the resulting IC series is shown in Figure 7. The mean IC is 0.082 and its standard deviation is 0.135.
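For one cross-section, the rank IC described here is the Spearman correlation between factor values and next-period returns. A minimal standard-library sketch, with made-up input numbers, is:

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def rank_ic(factor_values, next_returns):
    """Spearman rank correlation for one cross-section."""
    return pearson(ranks(factor_values), ranks(next_returns))

# Five hypothetical stocks: factor value today, return over the next 5 days.
factor_values = [1.0, 2.0, 3.0, 4.0, 5.0]
next_returns = [0.02, 0.01, 0.04, 0.03, 0.05]
print(round(rank_ic(factor_values, next_returns), 4))   # 0.8
```

Repeating this on every position adjustment day gives the IC series, whose mean and standard deviation are the summary statistics quoted above.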

With 5 trading days as the adjustment cycle, on each adjustment day the CSI 500 constituents are divided into 10 tiers by HMM factor value. The performance of the tiers since February 2007 is shown in Figure 8, and the cumulative return of each tier is shown in Figure 9.

The two figures show that stocks with large HMM factor values outperform stocks with small factor values overall, and that the factor is fairly monotonic.

4.2.2 Empirical analysis of long and short hedging strategies

Assuming the lowest (tenth) tier can be sold short and the first tier bought, the net value of the long-short portfolio since February 2007 is shown in Figure 10 and its return statistics in Table 1. Ignoring transaction costs, the long-short portfolio has an annualized return of 48.68%, a maximum drawdown of -15.74% and a win rate of 66.96%. Its performance by year is shown in Table 2. Except for 2017, when market style changed sharply, the strategy achieved good returns in every year.
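The three statistics quoted for the long-short portfolio can be computed from a series of per-period returns as sketched below. The definitions used here (geometric annualization, drawdown measured from the running peak of net value, win rate as the share of periods with positive return) and the default of 50 five-day periods per year are assumptions; the report does not state its exact conventions, and the input numbers are made up.

```python
def long_short_returns(top_returns, bottom_returns):
    """Per-period return of buying the top tier and shorting the bottom tier."""
    return [t - b for t, b in zip(top_returns, bottom_returns)]

def performance(period_returns, periods_per_year=50):
    """Return (annualized return, maximum drawdown, win rate)."""
    nav, peak, max_drawdown = 1.0, 1.0, 0.0
    for r in period_returns:
        nav *= 1 + r
        peak = max(peak, nav)
        max_drawdown = min(max_drawdown, nav / peak - 1)
    years = len(period_returns) / periods_per_year
    annualized = nav ** (1 / years) - 1
    win_rate = sum(r > 0 for r in period_returns) / len(period_returns)
    return annualized, max_drawdown, win_rate

# Hypothetical tier returns over four periods.
top = [0.12, -0.15, 0.06, 0.11]
bottom = [0.02, 0.05, 0.01, 0.01]
spread = long_short_returns(top, bottom)
annualized, max_drawdown, win_rate = performance(spread, periods_per_year=4)
print(round(annualized, 4), round(max_drawdown, 4), win_rate)
# 0.0164 -0.2 0.75
```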

4.2.3 Empirical analysis of CSI 500 index hedging (non-industry neutral)

The performance of the HMM-based stock selection strategy hedged with the CSI 500 index is shown in Figure 11. Within the CSI 500 constituents, the annualized excess return is 14.82%, the maximum drawdown is -19.94%, the excess win rate is 59.11% and the information ratio is 1.78. The strategy's performance by year is shown in Table 4. From 2007 to 2016 the strategy achieved positive returns every year, but it has performed poorly since 2017. (Note: The data for 2007 starts from February 1, 2007; the data for 2018 ends on August 15, 2018.)
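For the index-hedged case, the excess return series and the information ratio can be sketched as below. How the 0.2% cost is charged (here, proportionally to the fraction of the portfolio replaced in each period), the arithmetic annualization and the 50 periods per year are assumptions of this illustration, and the inputs are made-up numbers.

```python
import statistics

def hedged_stats(portfolio_returns, index_returns, turnover,
                 cost_rate=0.002, periods_per_year=50):
    """Return (excess series, annualized excess return, information ratio)."""
    excess = [p - b - cost_rate * to
              for p, b, to in zip(portfolio_returns, index_returns, turnover)]
    mean_excess = statistics.mean(excess)
    tracking_error = statistics.stdev(excess)   # per-period standard deviation
    annualized_excess = mean_excess * periods_per_year
    information_ratio = mean_excess / tracking_error * periods_per_year ** 0.5
    return excess, annualized_excess, information_ratio

# Hypothetical 5-day returns of the portfolio and of the index, plus turnover.
portfolio_returns = [0.020, 0.010, 0.030]
index_returns = [0.010, 0.010, 0.010]
turnover = [1.0, 0.5, 0.5]

excess, annualized_excess, information_ratio = hedged_stats(
    portfolio_returns, index_returns, turnover)
print([round(e, 4) for e in excess])   # [0.008, -0.001, 0.019]
```

The information ratio divides the average excess return by its volatility, so a strategy with a smaller but steadier excess return can rank above one with a larger but more erratic excess return.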

4.2.4 Empirical Analysis of CSI 500 Index Hedging (Industry Neutral)

The statistics in Section 4.2.3 show that the non-industry-neutral strategy suffers a large drawdown and has performed poorly since 2017. We therefore consider an industry-neutral optimization. The specific approach is as follows:

The performance of the industry-neutral stock selection strategy based on the HMM model is shown in Figure 12. Within the CSI 500 constituents, the annualized excess return is 16.19%, the maximum drawdown is -9.69%, the excess win rate is 63.39% and the information ratio is 2.14. The strategy's performance by year is shown in Table 6. Except for 2017, the strategy achieved positive returns every year, and its overall performance is significantly better than that of the non-industry-neutral strategy. (Note: The data for 2007 starts from February 1, 2007; the data for 2018 ends on August 15, 2018.)

Overall, the combined performance of the HMM-based stock selection strategies is shown in Table 7. The strategy achieved significant excess returns over a backtest period of more than 11 years. In particular, the industry-neutral HMM strategy hedges market risk better, reducing the maximum drawdown while raising the excess return.

5. Summary

The hidden Markov model is widely believed to be one of the secret weapons behind the Medallion Fund's outstanding performance. Inspired by this, the report brings the hidden Markov model to investing and applies speech recognition techniques to predicting the rise and fall of stocks. We assume that rising and falling stocks each follow a distinct pattern that can be described by its own HMM; then the higher the probability of a stock's observation sequence under the HMM that characterizes the rising pattern, the more likely the stock is to actually rise.

The HMM factor has a good ability to deliver excess returns: its mean IC is 0.082, and the IC is positive most of the time. During the backtest period, stocks with large HMM factor values outperform stocks with small factor values overall, and cumulative returns across factor tiers are strongly monotonic. The long-short portfolio based on the HMM factor has an annualized return of 48.68% and a maximum drawdown of -15.74%.

This report shows empirically that the stock selection strategy based on the hidden Markov model, using the CSI 500 constituents as the stock pool, earns substantial excess returns. Industry-neutral optimization improves the strategy further: the annualized excess return is 16.19%, the maximum drawdown is -9.69%, and the information ratio reaches 2.14.

The stock selection strategy proposed in this report is backtested on historical data. The strategy is not guaranteed to be effective: changes in market structure and trading behavior, and growth in the number of participants trading similar strategies, may cause it to fail.

For details, please refer to the GF Financial Engineering special reports

"Revisiting Simons' Investment Approach: Research on Stock Selection Strategy Based on Hidden Markov Model"

"Indices and industry timing strategies based on the recognition of rising and falling patterns"

"Multi-factor ALPHA Series Report (36)-Machine Learning Multi-factor Dynamic Position Adjustment Strategy"

"New Advances in Deep Learning: Alpha Factor Re-mining"

"Exploring Simons' Investment Approach: Research on Weekly Timing Strategies Based on HMM Models"

Origin blog.csdn.net/weixin_38192254/article/details/111628318