A Preliminary Exploration of Machine Learning: Empirical Case of Logistic Stock Selection Model

This article is an original article of Quantum Financial Services, and reprinting requires authorization



Machine learning (ML) is one of the hottest topics today and has a wide range of applications, and classification is one of its most common tasks. Stock selection is, in essence, a classification problem: stocks are divided into two categories, those to hold and those not to hold. Applying machine learning to stock selection is therefore a direction worth trying. In this article, following the research report "Logistic stock selection model and its empirical evidence in the CSI 300" published by Guosen Securities on September 7, 2010, we recast stock selection as a binary classification problem and use the logistic regression algorithm to predict, classify, and select stocks.


Concept introduction

Logistic regression (hereinafter LR) is one of the most commonly used machine learning algorithms. Despite its name, it is mainly used for classification, in particular binary classification. Unlike hard classifiers (such as SVM), LR does not assign a sample outright to one of the two categories. Instead it returns a probability, that is, how likely the predicted sample is to belong to the positive class and how likely to the negative class, which makes it closer to a "fuzzy" classification. In practice, LR is often used to estimate the likelihood of an event. In ad prediction, for example, the likelihood that a user will click each ad is estimated, and the ads most likely to be clicked are placed where the user can see them. Similar uses include estimating the likelihood that a user buys a certain product or that a patient suffers from a certain disease.

LR is essentially a linear classification model. It differs from linear regression in that it compresses the output of the linear model, which can range from negative infinity to positive infinity, into the interval between 0 and 1 so that it can express a "possibility". This requires a mapping, namely the logistic function (or sigmoid function) applied to the output:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The Sigmoid function has a nice "S" shape, as shown in the following figure (quoted from Wikipedia):

[Figure: the S-shaped sigmoid curve, from Wikipedia]

That is to say, LR is in effect a linear regression whose output is normalized by the logistic function.
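As a minimal sketch of this mapping, the snippet below computes a linear score and passes it through the sigmoid. The weights, bias, and feature values are hypothetical numbers chosen only for illustration; they do not come from the strategy in this article.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    # Numerically stable form: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(weights, bias, features):
    """Linear score followed by the sigmoid, i.e. the LR hypothesis."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights and feature values, for illustration only.
p = predict_probability(weights=[0.8, -0.5], bias=0.1, features=[1.0, 2.0])
print(round(p, 4))
```

A score of 0 maps to a probability of exactly 0.5, which is the usual decision boundary between the two classes.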


When using the LR model, we first feed it a set of training samples, each described by several features and assigned a category (target) in advance. The model then "trains" itself, fitting the relationship between the features and the target in the data. Once training is complete, we can input a new sample; the model uses the fitted relationship to decide which target the sample's features correspond to and returns the probability of each target. This completes one full LR classification.
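The train-then-predict workflow described above can be sketched with a small batch gradient descent on the log-loss. This is an illustrative implementation under assumptions (toy data, a hand-picked learning rate and epoch count), not the library or settings used for the backtests in this article.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: predicted probability minus target.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Probability that sample x belongs to the positive class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy training set (hypothetical): two features per sample, target 1 or 0.
X_train = [[1.2, 0.7], [0.9, 1.1], [1.5, 0.2],
           [-1.0, -0.8], [-0.6, -1.3], [-1.4, 0.1]]
y_train = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X_train, y_train)
p_new = predict_proba(w, b, [1.0, 0.5])  # probability of the positive class
```

In practice a library implementation would normally be used instead; the loop is spelled out here only to show what "training" means.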


Strategy idea

The stock selection problem is transformed into a classification problem. Stocks expected to outperform the market are classified as positive, and those expected to underperform as negative. The LR model is used to calculate the probability that each stock belongs to the positive class. The 30 stocks with the highest probability are bought with funds allocated equally among them, and positions are rebalanced monthly.
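The selection rule above (rank by positive-class probability, keep the top N, weight equally) might look like the following sketch. The stock codes and probabilities are hypothetical, and N is set to 3 in the example only to keep it short.

```python
def select_top_n(probabilities, n=30):
    """Return equal weights for the n stocks with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [code for code, _ in ranked[:n]]
    weight = 1.0 / len(chosen) if chosen else 0.0
    return {code: weight for code in chosen}

# Hypothetical stock codes and model probabilities, for illustration only.
probs = {"A": 0.71, "B": 0.43, "C": 0.65, "D": 0.58, "E": 0.39}
portfolio = select_top_n(probs, n=3)
```

Rerunning this at each monthly rebalance with fresh probabilities gives the new target holdings.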

Strategy details

1) Classification (target)

We use the CSI 300 Index as the benchmark and study the probability that an individual stock's return exceeds that of the index. Taking each month as the observation period, if a stock's return over the whole period exceeds that of the CSI 300 Index, we record its target as 1 (positive class); otherwise it is 0 (negative class).
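A minimal sketch of this labeling rule, using hypothetical monthly returns. Note that a stock that merely matches the index is labeled 0, since the rule requires the return to exceed the benchmark.

```python
def label_targets(stock_returns, benchmark_return):
    """Target is 1 if the stock beat the benchmark over the period, else 0."""
    return {code: 1 if r > benchmark_return else 0
            for code, r in stock_returns.items()}

# Hypothetical one-month returns, for illustration only.
monthly_returns = {"A": 0.052, "B": -0.013, "C": 0.020}
index_return = 0.020  # assumed CSI 300 return for the same month
targets = label_targets(monthly_returns, index_return)
```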

2) Features

A feature is a quantity that describes the data along one dimension. In the stock market, factors are a natural choice of features. Here we refer to the research report "Single-Factor Validity Test" published by Orient Securities on June 26, 2015, and select the ten factors that the report finds most closely related to returns to describe each sample (each stock) along different dimensions. The factors are listed below:

Factor name | Short name
Total market capitalization | TMV
Free-float market capitalization | FMV
22-day price change | STOCKZF22
Average deviation | STOCKPJCJ5/60
22-day turnover rate | HSL22
Year-on-year growth rate of operating income (single quarter) | SALESGROWRATE1
Net assets / total market capitalization | BP
Year-on-year growth rate of net profit (single quarter) | PROFITGROWRATE1
Operating income / total market capitalization | SP
Year-on-year growth rate of operating income (TTM) | SALESGROWRATE

3) Extreme value treatment and standardization

The LR algorithm places certain requirements on data quality, so the first step is data preprocessing.

a) Extreme values. An extreme value is one that deviates markedly from the bulk of the sample. Even a few extreme values can greatly distort the results of data analysis, so they need to be dealt with. Here we use the median de-extremum method, with the following formula:

[Formula image: median de-extremum rule]

where Di is the i-th observation, Dm is the median of the feature variable, and Dmad is the median of the absolute deviations of the observations from the median Dm.
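One common form of the median de-extremum method clips each observation to a band of n times Dmad around the median Dm. The sketch below follows that form. Both the clipping behavior and the multiplier n = 5 are assumptions made for illustration; the exact rule used in this article is in the formula image and may differ.

```python
import statistics

def clip_by_mad(values, n_mad=5.0):
    """Clip values to [Dm - n*Dmad, Dm + n*Dmad] (n_mad is an assumed setting)."""
    med = statistics.median(values)                            # Dm
    mad = statistics.median([abs(v - med) for v in values])    # Dmad
    lower, upper = med - n_mad * mad, med + n_mad * mad
    return [min(max(v, lower), upper) for v in values]

# Hypothetical factor values with one obvious outlier.
raw = [1.0, 1.2, 0.9, 1.1, 1.0, 15.0]
clipped = clip_by_mad(raw)
```

Because the median and Dmad are barely affected by the outlier itself, the band stays tight around the bulk of the data, which is the reason for preferring them over the mean and standard deviation at this step.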

b) Standardization

When combining different factors, the data must be standardized (put on a common scale); otherwise any combination is meaningless, because each factor has its own unit. The standardization used here is:

$$D_i' = \frac{D_i - \mu}{\sigma}$$

where μ and σ are the mean and standard deviation of the feature variable. That is, each feature is standardized to zero mean and unit standard deviation.
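A short sketch of zero-mean, unit-standard-deviation scaling for one factor. Using the population standard deviation (rather than the sample one) is an assumption here; the article does not say which is used.

```python
import statistics

def zscore(values):
    """Standardize values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        # A constant factor carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical raw factor values, for illustration only.
z = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
```

After this step a market capitalization measured in billions and a turnover rate measured in percent are on the same scale and can be combined in one model.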


Strategy implementation

1) Logistic stock selection

Trading subject: stocks

Adjustment cycle: every month

Backtest period: 2012.01.01~2017.01.01

Backtest length: 5 years

Return curve

[Figure: return curve, strategy (1)]

Earnings attribution

[Figure: earnings attribution, strategy (1)]

Performance analysis

[Figure: performance statistics, strategy (1)]

The backtest shows that over the five-year period the strategy performed well, achieving an annualized return of 31.8% and outperforming the CSI 300 and CSI 500 indices. However, it also suffers from excessive drawdown and high volatility. Next we add drawdown control and test again.

2) Logistic stock selection, full-period drawdown control

Trading subject: stocks

Adjustment cycle: every month

Backtest period: 2012.01.01~2017.01.01

Backtest length: 5 years

Drawdown control: 20%

Return curve

[Figure: return curve, strategy (2)]

Earnings attribution

[Figure: earnings attribution, strategy (2)]

Performance analysis

[Figure: performance statistics, strategy (2)]

After adding the full-period drawdown control, compared with strategy (1), the return comes with a smaller drawdown, volatility has dropped sharply, the Sharpe ratio has risen, and the strategy's performance is more stable. However, it is worth noting that in the bear market of the second half of 2015 the portfolio's holdings dropped to 0, that is, the strategy exited the stock market entirely. Although this avoided, to some extent, the losses the crash might have caused, it also gave up the potential gains that followed. To avoid this, we change the drawdown control to a monthly one, that is, the drawdown within each month is kept below a given threshold.
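The behavior described above, leaving the market for good once the drawdown limit is hit, can be sketched as follows. The sketch assumes that drawdown is measured from the running peak of the equity curve and checked once per month; the article does not state either detail, and the monthly returns are hypothetical.

```python
def apply_full_period_stop(monthly_returns, max_drawdown=0.20):
    """Hold the strategy until drawdown from the peak exceeds the limit, then stay in cash."""
    equity, peak, stopped = 1.0, 1.0, False
    curve = []
    for r in monthly_returns:
        if not stopped:
            equity *= 1.0 + r
            peak = max(peak, equity)
            if 1.0 - equity / peak > max_drawdown:
                stopped = True  # positions closed and never reopened
        curve.append(equity)
    return curve, stopped

# Hypothetical monthly strategy returns, for illustration only.
returns = [0.10, 0.05, -0.15, -0.12, 0.08, 0.06]
curve, stopped = apply_full_period_stop(returns)
```

In this example the limit is breached in the fourth month, so the gains of the last two months are missed, which mirrors the trade-off noted in the text.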

 

3) Logistic stock selection, monthly drawdown control

Trading subject: stocks

Adjustment cycle: every month

Drawdown control: at most 20% per month

Backtest period: 2012.01.01~2017.01.01

Backtest length: 5 years

Return curve

[Figure: return curve, strategy (3)]

Earnings attribution

[Figure: earnings attribution, strategy (3)]

Performance analysis

[Figure: performance statistics, strategy (3)]

Compared with strategy (2), the return, drawdown, and volatility of strategy (3) are all higher, and its Sharpe ratio is slightly better, though the difference is small. The biggest difference between the two is that strategy (3) controls drawdown month by month, so every month is a fresh start: after closing its positions in a market crash, it invests again the following month and remains a participant in the market. Strategy (2), by contrast, never reopens positions once they have been closed. Which control method to use should depend on one's confidence in the strategy and one's judgment of the overall market trend.
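For comparison, here is a sketch of the monthly variant, in which the drawdown reference is reset at the start of every month and the strategy re-enters after a stop. It assumes drawdown is tracked on daily returns within the month and measured from the month's running peak; these details and the sample returns are hypothetical.

```python
def apply_monthly_stop(daily_returns_by_month, max_drawdown=0.20):
    """Stop trading for the rest of a month once that month's drawdown exceeds the limit."""
    equity = 1.0
    month_end_equity = []
    for daily_returns in daily_returns_by_month:
        month_peak = equity   # every month starts from a fresh reference
        in_market = True
        for r in daily_returns:
            if not in_market:
                break         # stay in cash until the next month begins
            equity *= 1.0 + r
            month_peak = max(month_peak, equity)
            if 1.0 - equity / month_peak > max_drawdown:
                in_market = False
        month_end_equity.append(equity)
    return month_end_equity

# Hypothetical daily returns grouped by month, for illustration only.
months = [[0.02, 0.01], [-0.10, -0.09, -0.05, 0.03], [0.04, 0.02]]
month_ends = apply_monthly_stop(months)
```

Here the second month is cut short after its third day, but the strategy trades again in the third month, unlike the full-period control.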


Summary

LR is a classic machine learning method for classification problems, and applying it to stock selection is an interesting experiment. Judging from the backtest results, the LR stock selection strategy is effective. Note that although LR is a relatively simple and mature model, it still has many parameters that can be set and tuned; this article does not explore them and uses the default settings directly. In addition, the choice of factors matters greatly for the model's predictive power. When selecting factors, besides the correlation between each factor and returns, the correlation among the factors themselves should also be considered: combining factors with low mutual correlation helps improve the stability of the model. Finally, the LR model only provides a quantitative relationship between factor combinations and outperforming the market. The logic behind that relationship is hard to explain, so the reliability and stability of future stock selection results cannot be guaranteed.
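As a rough illustration of checking the correlation among factors, the sketch below computes pairwise Pearson correlations and flags pairs above a threshold. The factor values and the 0.8 threshold are hypothetical; the article does not prescribe a specific cutoff.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def highly_correlated_pairs(factors, threshold=0.8):
    """Return (name_a, name_b, corr) for factor pairs with |corr| above the threshold."""
    pairs = []
    for (na, va), (nb, vb) in combinations(factors.items(), 2):
        c = pearson(va, vb)
        if abs(c) > threshold:
            pairs.append((na, nb, c))
    return pairs

# Hypothetical factor values across five stocks, for illustration only.
factors = {
    "TMV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "FMV": [8.0, 17.0, 24.0, 33.0, 41.0],
    "BP": [0.9, 0.3, 0.7, 0.2, 0.6],
}
flagged = highly_correlated_pairs(factors)
```

A flagged pair suggests the two factors carry largely the same information, so keeping only one of them may be worth testing.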
