This article is an original article of Quantum Financial Services, and reprinting requires authorization
Machine learning (ML) is one of the hottest topics today, with a wide range of applications, and classification is among its most common tasks. Stock selection is, in essence, a classification problem: stocks are divided into two categories, those to hold and those not to hold. Applying machine learning to stock selection is therefore a direction worth trying. In this article, following the ideas in the research report "Logistic stock selection model and its empirical evidence in the CSI 300" published by Guosen Securities on September 7, 2010, we convert the stock selection problem into a binary classification problem and use the logistic regression algorithm to predict, classify, and select stocks.
Concept introduction
Logistic regression (hereinafter LR) is one of the most commonly used machine learning algorithms. It can be used for regression or classification and is mainly used for binary classification. Unlike binary classifiers such as SVM, which assign each sample outright to one of two categories, LR returns a probability: the probability that a predicted sample belongs to the positive class and the probability that it belongs to the negative class. In this sense it is closer to a "fuzzy" classification. In practice, LR is often used to estimate how likely something is. In ad prediction, for example, the likelihood that a user will click each ad is estimated, and the ad most likely to be clicked is placed where the user can see it. Similar uses include estimating the likelihood that a user will purchase a certain product or that a patient has a certain disease.
LR is essentially a linear classification model. It differs from linear regression in that it compresses the output of linear regression, which can range from negative infinity to positive infinity, into the interval between 0 and 1 so that it can express a "possibility". This requires a mapping: a logistic function (also called the sigmoid function) is applied to the output z of the linear model:
$$ g(z) = \frac{1}{1 + e^{-z}} $$
The Sigmoid function has a nice "S" shape, as shown in the following figure (quoted from Wikipedia):
In other words, LR is a linear regression whose output is normalized by the logistic function.
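The logistic mapping described above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `predict_proba`, and the weights and bias passed to them, are our own assumptions and do not come from the research report.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number into the interval from 0 to 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Positive-class probability: the sigmoid of a linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)
```

A linear score of 0 maps to a probability of exactly 0.5, and large positive or negative scores approach 1 and 0 respectively.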
To use the LR model, we first supply a set of training samples, each described by several features and labeled in advance with its category (target). The model then "trains" itself, fitting the relationship between the features and the target in the data. Once training is complete, we can supply a new sample; the model uses the fitted relationship to decide which target the sample's features correspond to and returns the probability of each target. This completes one LR classification.
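The train-then-predict workflow can be illustrated with a minimal sketch. It fits the model by plain batch gradient descent on the log-loss, which is one common way to train LR and is not necessarily the solver used in this article. The toy data, learning rate, and epoch count are made up for illustration.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Probability that sample x belongs to the positive class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set (made up): two features per sample, target 0 or 1.
X = [[-2.0, 0.5], [-1.0, -0.3], [-0.5, 0.2], [0.5, -0.1], [1.0, 0.4], [2.0, -0.2]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X, y)
p_new = predict_proba(w, b, [1.5, 0.0])  # probability for an unseen sample
```

In the toy data the first feature separates the two classes, so the fitted model assigns the new sample a positive-class probability above 0.5.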
Strategy idea
The stock selection problem is transformed into a classification problem. Stocks expected to outperform the market are classified as positive, and those expected to underperform are classified as negative. The LR model is used to calculate the probability that each stock belongs to the positive class. The 30 stocks with the highest probability of belonging to the positive class are bought, with funds distributed equally among them, and positions are rebalanced monthly.
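The selection step can be sketched as follows, assuming the model's output is available as a mapping from stock code to positive-class probability. The function name `select_top` and the sample probabilities are hypothetical.

```python
def select_top(probabilities, n=30):
    """Equal weights for the n stocks with the highest positive-class probability."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [code for code, _ in ranked[:n]]
    if not chosen:
        return {}
    weight = 1.0 / len(chosen)  # funds are split evenly
    return {code: weight for code in chosen}

# Made-up probabilities for four stocks; keep the best two.
probs = {"A": 0.71, "B": 0.42, "C": 0.65, "D": 0.58}
portfolio = select_top(probs, n=2)
```

In the strategy itself, n would be 30 and the function would be called once per monthly rebalance.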
Strategy details
1) Classification (target)
We use the CSI 300 Index as the benchmark and study the probability that an individual stock's return exceeds that of the index. Taking each month as the observation period, when a stock's return over the whole period exceeds that of the CSI 300 Index, we record its target as 1 (positive class); otherwise we record 0 (negative class).
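The labeling rule translates directly into code. In this sketch the function name `label_targets` and the sample returns are made up; a stock whose return only equals the index return is labeled 0, since the rule requires the return to exceed the index.

```python
def label_targets(stock_returns, index_return):
    """Target is 1 if the stock's monthly return exceeds the index return, else 0."""
    return {code: 1 if r > index_return else 0 for code, r in stock_returns.items()}

# Made-up monthly returns for three stocks against an index return of 3%.
targets = label_targets({"A": 0.05, "B": -0.02, "C": 0.03}, index_return=0.03)
```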
2) Features
A feature is a quantity that describes the data along one dimension. In the stock market, factors are a natural choice of features. Following the research report "Single-Factor Validity Test" published by Orient Securities on June 26, 2015, we select the ten factors most closely related to the rate of return in that report to describe each sample (each stock) along different dimensions. The factors are listed below:
Factor name | Short name
Total market capitalization | TMV
Circulating market value | FMV
22-day price change | STOCKZF22
Average deviation | STOCKPJCJ5/60
22-day turnover rate | HSL22
Year-on-year growth rate of operating income (single quarter) | SALESGROWRATE1
Net assets / total market cap | BP
Year-on-year growth rate of net profit (single quarter) | PROFITGROWRATE1
Operating income / total market cap | SP
Year-on-year growth rate of operating income (TTM) | SALESGROWRATE
3) Extreme values and standardization
The LR algorithm places certain requirements on data quality, so the first step is to preprocess the data.
a) Extreme values. An extreme value is one that deviates far from the bulk of the sample. A few extreme values can greatly distort the results of data analysis, so they need to be removed. Here we use the median de-extremum method, and the formula is as follows:
where Di is the i-th observation, Dm is the median of the feature variable, and Dmad is the median of the absolute deviations of the observations from Dm.
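A minimal sketch of median-based de-extremum, using the quantities Dm and Dmad defined above. Two points are assumptions on our part and are not stated in this article: values are clipped to the band rather than dropped, and the band is Dm plus or minus n times Dmad with a hypothetical multiplier n = 5.

```python
from statistics import median

def mad_clip(values, n=5.0):
    """Clip each observation to [Dm - n*Dmad, Dm + n*Dmad]."""
    d_m = median(values)                              # Dm: median of the feature
    d_mad = median([abs(v - d_m) for v in values])    # Dmad: median absolute deviation
    lower, upper = d_m - n * d_mad, d_m + n * d_mad
    return [min(max(v, lower), upper) for v in values]

# Made-up factor values with one obvious outlier.
clipped = mad_clip([1.0, 2.0, 3.0, 4.0, 100.0])
```

Here Dm is 3 and Dmad is 1, so the band is [-2, 8] and the outlier 100 is pulled back to 8 while the other values are unchanged.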
b) Standardization
Factor data come in different units, so they must be standardized before they are combined; otherwise any combination of them is meaningless. The standardization used here is:
$$ z_i = \frac{D_i - \mu}{\sigma} $$
That is, the usual standardization to mean 0 and standard deviation 1, where μ and σ are the mean and standard deviation of the feature.
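Standardization to mean 0 and standard deviation 1 can be sketched with the standard library. Using the population standard deviation (`pstdev`) rather than the sample version is our assumption, and the input values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up factor values with mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Each factor would be standardized separately, across all stocks in the same period, so that factors in different units become comparable.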
Strategy implementation
1) Logistic stock selection
Trading subject: stocks
Adjustment cycle: every month
Backtest time: 2012.01.01~2017.01.01
Backtest duration: 5 years
Return curve
Return attribution
Performance analysis
The backtest shows that over the five-year period our strategy performed well, achieving an annualized rate of return of 31.8% and outperforming the CSI 300 and CSI 500 indices. However, the drawdown is excessive and the volatility is high. Next we add drawdown control and test again.
2) Logistic stock selection, full-period drawdown control
Trading subject: stocks
Adjustment cycle: every month
Backtest time: 2012.01.01~2017.01.01
Backtest duration: 5 years
Drawdown control: 20%
Return curve
Return attribution
Performance analysis
After adding full-period drawdown control, compared with strategy (1) the rate of return fell along with the drawdown, volatility dropped sharply, the Sharpe ratio increased, and performance became more stable. Note, however, that in the bear market of the second half of 2015 the portfolio's holdings dropped to zero; that is, the strategy withdrew from the stock market entirely. Although this avoided, to some extent, the losses the crash might have caused, it also gave up potential gains afterwards. To avoid this, we change the drawdown control to a monthly basis, i.e. the drawdown within each month may not exceed a certain threshold.
3) Logistic stock selection, monthly drawdown control
Trading subject: stocks
Adjustment cycle: every month
Drawdown control: up to 20% per month
Backtest time: 2012.01.01~2017.01.01
Backtest duration: 5 years
Return curve
Return attribution
Performance analysis
Compared with strategy (2), the rate of return, drawdown, and volatility of strategy (3) all increased, and its Sharpe ratio is slightly higher, though the difference is small. The main difference between the two is that strategy (3) controls drawdown monthly, so every month is a fresh start: after positions are closed during a crash, the strategy invests again the following month and remains a participant in the market. Strategy (2), by contrast, opens no new positions once it has closed out. Which control method to use depends on one's confidence in the strategy and one's judgment of the overall market trend.
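The difference between the two control modes can be shown with a small sketch. It rests on several assumptions of ours, not details from the backtest: drawdown is measured from the running peak of net asset value (NAV), the input is the NAV path of the uncontrolled portfolio, and the function name `drawdown_flags` and the sample path are made up.

```python
def drawdown_flags(dated_navs, limit=0.20, reset_monthly=False):
    """dated_navs: (month, nav) pairs in time order.
    Returns one boolean per observation: True while the strategy stays invested."""
    flags = []
    peak = None
    invested = True
    current_month = None
    for month, nav in dated_navs:
        if reset_monthly and month != current_month:
            # Monthly control: each new month starts with a fresh peak and position.
            peak = nav
            invested = True
        current_month = month
        if peak is None:
            peak = nav
        if invested:
            peak = max(peak, nav)
            if (peak - nav) / peak >= limit:
                invested = False  # drawdown limit breached: close all positions
        flags.append(invested)
    return flags

# Made-up NAV path: a crash in one month, a recovery in the next.
path = [("2015-06", 1.00), ("2015-06", 0.90), ("2015-06", 0.78),
        ("2015-07", 0.78), ("2015-07", 0.80), ("2015-07", 0.85)]
full = drawdown_flags(path)                          # full-period control
monthly = drawdown_flags(path, reset_monthly=True)   # monthly control
```

With full-period control the strategy exits in June and never returns, while with monthly control it exits in June and takes part in the July recovery.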
Summary
LR is a classic machine learning method for classification problems, and applying it to stock selection is an interesting experiment. The backtest results suggest that the LR stock selection strategy was effective over the test period. Although LR is a relatively simple and mature model, it still has many parameters that can be set and tuned; this article does not explore them and uses the default settings directly. The choice of factors also matters a great deal for the model's predictive power. When selecting factors, in addition to the correlation between each factor and returns, the correlation among the factors themselves should be considered: choosing a combination of factors with low mutual correlation helps improve the stability of the model. Finally, the LR model only provides a quantitative relationship between factor combinations and outperforming the market. The logic behind that relationship is difficult to explain, so the reliability and stability of future stock selection results cannot be guaranteed.
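One simple way to check the correlation among factors is to compute pairwise Pearson correlations and flag the pairs that are highly correlated. This is a sketch only: the 0.8 threshold, the function names, and the factor values are all made up, and the article itself does not perform this step.

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation; treat as 0
    return cov / (vx * vy) ** 0.5

def highly_correlated(factors, threshold=0.8):
    """Return factor-name pairs whose absolute correlation exceeds the threshold."""
    return [(a, b) for a, b in combinations(factors, 2)
            if abs(pearson(factors[a], factors[b])) > threshold]

# Made-up values of three factors across four stocks.
factors = {
    "TMV": [10.0, 20.0, 30.0, 40.0],
    "FMV": [8.0, 17.0, 25.0, 33.0],
    "BP": [0.9, 0.3, 0.8, 0.4],
}
pairs = highly_correlated(factors)
```

In this toy data only TMV and FMV are flagged, which would suggest keeping just one of the two in the factor set.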