Random Forest Model and Case (Python)

Table of contents

1 Introduction to Ensemble Models

1.1 Introduction to Bagging Algorithm

1.2 Introduction to Boosting algorithm

2 Basic principles of random forest model

3 Using sklearn to implement a random forest model

4 Case: stock rise and fall forecast model

4.1 Generation of stock derivative variables

4.1.1 Obtain stock basic data

4.1.2 Generating Simple Derived Variables

4.1.3 Generate moving average indicator MA value

4.1.4 Generate relative strength index RSI value with TA-Lib library

4.1.5 Use TA-Lib library to generate momentum indicator MOM value

4.1.6 Generate exponential moving average EMA with TA-Lib library

4.1.7 Use the TA-Lib library to generate the moving average convergence divergence MACD value

4.2 Model building

4.2.1 Import the required libraries

4.2.2 Get data

4.2.3 Extract feature variables and target variables

4.2.4 Divide training set and test set

4.2.5 Model building

4.3 Model evaluation and use

4.3.1 Predict the rise and fall of the stock price for the next day

4.3.2 Model Accuracy Evaluation

4.3.3 Analyzing feature importance of feature variables

4.4 Parameter tuning

4.5 Plotting the return backtest curve

reference books


1 Introduction to Ensemble Models

Ensemble learning trains a series of weak learners (also called base models) and combines their results, aiming for better performance than any single learner.

There are two common families of ensemble learning algorithms: Bagging and Boosting.

The typical Bagging model is the random forest, while typical Boosting models include AdaBoost, GBDT, XGBoost and LightGBM.

1.1 Introduction to Bagging Algorithm

The Bagging algorithm works like voting. Each weak learner has one vote, and the final prediction follows the majority of the votes cast by all weak learners, as shown in the figure below.

Suppose the original data set has 10,000 records. Drawing 10,000 records at random with replacement gives a new training set (because sampling is with replacement, a given record may be drawn several times or not at all), and each such training set is used to train one weak learner. Repeating this n times yields n weak learners, each trained on a different training set. Their predictions are then combined by majority rule to obtain a more accurate and robust final prediction.

Specifically, for a classification problem the n weak learners vote on the final result, and for a regression problem the final result is the average of their n predictions.
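
As a rough illustration of the procedure described above, the following sketch builds a bagging ensemble by hand. The toy data, the use of depth-1 decision trees as weak learners, and names such as n_learners and bagged_pred are assumptions made for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data (hypothetical): one feature, label is 1 when x > 0
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

n_learners = 11  # odd, so a two-class vote cannot tie
learners = []
for _ in range(n_learners):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=1, random_state=0)
    learners.append(tree.fit(X[idx], y[idx]))

# Each weak learner casts one vote per sample; the majority wins
votes = np.array([t.predict(X) for t in learners])  # shape (n_learners, n_samples)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagging_accuracy = (bagged_pred == y).mean()
```

For a regression problem the same loop would apply, with the vote replaced by votes.mean(axis=0) over the learners' numeric predictions.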

1.2 Introduction to Boosting algorithm

The essence of the Boosting algorithm is to promote weak learners into a strong learner. Whereas Bagging treats all weak learners equally, Boosting treats them differently. Put simply, it focuses on "cultivating elites" and "paying attention to mistakes".

"Cultivating elites" means that after each round of training, weak learners with more accurate predictions are given larger weights and poorly performing ones smaller weights. In the final prediction, a strong model's larger weight works like casting several votes, while a mediocre model casts only one vote or none.

"Paying attention to mistakes" means that after each round of training, the weights (or probability distribution) of the training samples are changed: samples the previous weak learner predicted incorrectly get larger weights, and samples it predicted correctly get smaller weights. The next weak learner therefore concentrates on the data that was predicted wrongly, which improves the overall performance of the model.

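The two ideas can be made concrete with one round of the AdaBoost update rule, which is only one of the Boosting algorithms mentioned earlier. This is a minimal sketch: the labels, the predictions and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical labels and one weak learner's predictions, coded as +1 / -1
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the second sample is misclassified

# Start with equal sample weights
w = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of this learner (0.2 here)
err = w[y_pred != y_true].sum()

# "Cultivating elites": a more accurate learner gets a larger weight alpha
alpha = 0.5 * np.log((1 - err) / err)

# "Paying attention to mistakes": raise the weight of misclassified samples,
# lower the weight of correctly classified ones, then renormalise
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()
```

After this round the misclassified sample carries a weight of 0.5 and each of the four correctly classified samples 0.125, so the next weak learner is pushed towards the sample that was predicted wrongly.
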
2 Basic principles of random forest model

Random forest is a classic Bagging model whose weak learner is the decision tree. As shown in the figure below, it randomly samples the original data set to form n different sample sets, builds a decision tree on each, and obtains the final result from the trees' average (for regression) or majority vote (for classification).

To preserve the model's generalization ability, each tree is built following two basic principles: "data randomness" and "feature randomness".

Data randomness: rows are drawn at random with replacement from the full data set to form the training data for one decision tree. For example, with 1000 original records, 1000 draws with replacement form the new data set used to train one tree.

Feature randomness: if each sample has M features, choose a constant k < M and randomly select k of the M features.
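
The two kinds of randomness can be sketched with NumPy as follows. The sizes (1000 rows, M = 8, k = 3) and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, M = 1000, 8  # 1000 rows and M = 8 features (hypothetical sizes)
k = 3                   # k < M features are offered to the tree

# Data randomness: draw 1000 row indices WITH replacement
row_idx = rng.integers(0, n_samples, size=n_samples)

# Feature randomness: pick k distinct feature indices out of M
feat_idx = rng.choice(M, size=k, replace=False)

# Share of distinct original rows that ended up in the bootstrap sample
unique_share = len(np.unique(row_idx)) / n_samples
```

Because sampling is with replacement, only about 63% of the original rows typically appear in any one bootstrap sample. Note also that scikit-learn applies feature randomness at every split (through the max_features parameter) rather than once per tree.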

Because a random forest combines many decision trees, its predictions are usually more accurate than those of a single decision tree, it is less prone to overfitting, and it generalizes better.

3 Using sklearn to implement a random forest model

The random forest model can perform both classification analysis and regression analysis. The corresponding models are:

        · Random Forest Classifier (RandomForestClassifier)

        · Random Forest Regression Model (RandomForestRegressor)

The weak learner of the random forest classification model is a classification decision tree model, and the weak learner of the random forest regression model is a regression decision tree model.

The code is as follows.

from sklearn.ensemble import RandomForestClassifier
X = [[1,2],[3,4],[5,6],[7,8],[9,10]]
y = [0,0,0,1,1]

# Set the number of weak learners to 10
model = RandomForestClassifier(n_estimators=10,random_state=123)
model.fit(X,y)

model.predict([[5,5]])

# Output: array([0])
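
The regression counterpart is used in the same way. The sketch below reuses the feature matrix from the classifier example; the continuous target values are hypothetical.

```python
from sklearn.ensemble import RandomForestRegressor

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]  # hypothetical continuous targets

reg = RandomForestRegressor(n_estimators=10, random_state=123)
reg.fit(X, y)

# The prediction is the mean of the 10 trees' individual predictions
pred = reg.predict([[5, 5]])
```

Since every tree predicts an average of training targets, the forest's prediction always lies within the range of the training values.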

4 Case: stock rise and fall forecast model

4.1 Generation of stock derivative variables

This section explains how to derive variables from basic stock data, such as indicators commonly used in technical analysis: the 5-day and 10-day moving average prices MA5 and MA10, the relative strength index RSI, the momentum indicator MOM, the exponential moving average EMA, and the moving average convergence divergence MACD.

4.1.1 Obtain stock basic data

First, use the get_k_data() function to obtain basic stock data from 2015-01-01 to 2019-12-31 (see step 1 of the code in section 4.2.2).

The first 5 rows of data are shown in the figure below; the dates missing from the table are holidays (non-trading days).

Use the set_index() function to set the date column as the row index (also in step 1 of the code in section 4.2.2).
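
Since get_k_data() needs network access, the effect of set_index() is easier to try on a small hand-made table. The rows and column values below are hypothetical and only mimic the shape of daily price data.

```python
import pandas as pd

# Hypothetical rows shaped like daily stock data
df = pd.DataFrame({
    'date':   ['2015-01-05', '2015-01-06', '2015-01-07'],
    'open':   [10.0, 10.4, 10.3],
    'close':  [10.5, 10.2, 10.6],
    'high':   [10.7, 10.6, 10.8],
    'low':    [9.9, 10.1, 10.2],
    'volume': [1000.0, 1200.0, 900.0],
})

# The date column becomes the row index and is no longer a regular column
df = df.set_index('date')
```

After this step a row can be looked up by its date, for example df.loc['2015-01-06'].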

4.1.2 Generating Simple Derived Variables

Some simple derived variables can be generated with step 2 of the code in section 4.2.2, where:

close-open is (closing price - opening price) / opening price;

high-low is (highest price - lowest price) / lowest price;

pre_close is yesterday's closing price: shift(1) moves all the data in the close column down by 1 row to form a new column, whereas shift(-1) would move it up by 1 row;

price_change is today's closing price - yesterday's closing price, that is, the day's change in the stock price;

p_change is the day's percentage change in the stock price, also known as the day's rise or fall.
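
The definitions above can be checked on a few rows of made-up prices. The three days of data in this sketch are hypothetical.

```python
import pandas as pd

# Hypothetical prices for three consecutive trading days
df = pd.DataFrame({
    'open':  [10.0, 10.5, 11.0],
    'close': [10.5, 10.5, 11.55],
    'high':  [11.0, 10.8, 12.0],
    'low':   [10.0, 10.0, 10.8],
})

df['close-open'] = (df['close'] - df['open']) / df['open']
df['high-low'] = (df['high'] - df['low']) / df['low']
df['pre_close'] = df['close'].shift(1)  # yesterday's close, NaN on the first row
df['price_change'] = df['close'] - df['pre_close']
df['p_change'] = df['price_change'] / df['pre_close'] * 100
```

The first row has no previous day, so its pre_close, price_change and p_change are NaN. On the third day the close moves from 10.5 to 11.55, which gives a p_change of 10 (percent).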

4.1.3 Generate moving average indicator MA value

The 5-day and 10-day moving averages of the stock price can be generated with step 3 of the code in section 4.2.2.

Note the use of the rolling() function.

MA stands for moving average: "average" refers to the arithmetic mean of the closing prices of the last n days, and "moving" means the calculation always uses the most recent n days of price data.

Example: Calculation of MA5

Suppose the closing prices of No. 1 to No. 6 are 1.2, 1.4, 1.6, 1.8, 2.0 and 2.2. The MA5 value of No. 5 is then (1.2+1.4+1.6+1.8+2.0)/5 = 1.6, the MA5 value of No. 6 is (1.4+1.6+1.8+2.0+2.2)/5 = 1.8, and so on. Connecting the moving averages over a period of time gives the moving average line. Similarly, MA10 is the average closing price of the 10 days up to the day of calculation.

When calculating MA5, the first 4 days do not have enough data, so their moving averages cannot be computed and the null value NaN is produced. The dropna() function is usually used to delete these null values so that they do not cause problems in later calculations, as the call df.dropna(inplace=True) in section 4.2.2 does.

In the resulting table the rows before the 16th are gone, because MA10 needs 10 trading days of data.
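
The MA5 example can be reproduced with rolling(). The closing prices are the ones used in the worked example (No. 1 to No. 7); indexing the series by day number is an assumption made for readability.

```python
import pandas as pd

# Closing prices of No. 1 to No. 7 from the worked example
close = pd.Series([1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
                  index=range(1, 8), name='close')

# 5-day moving average: NaN until 5 days of data are available
ma5 = close.rolling(5).mean()

# Drop the leading NaN rows
ma5_clean = ma5.dropna()
```

Here ma5 is NaN for No. 1 to No. 4, 1.6 for No. 5 and 1.8 for No. 6, and ma5_clean keeps only No. 5 to No. 7.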

4.1.4 Generate relative strength index RSI value with TA-Lib library

The relative strength index RSI can be generated with talib.RSI() (step 4 of the code in section 4.2.2).

The RSI reflects the strength of recent price rises relative to price falls, which helps in judging the trend of the stock price.

The larger the RSI, the stronger the rising trend relative to the falling trend; the smaller the RSI, the weaker the rising trend relative to the falling trend.

The formula for calculating the RSI value is as follows.

RSI = (average rise over N days) / (average rise over N days + average fall over N days) × 100

Example:

Suppose that over N=6 days the price rose by 2 on three days and fell by 1 on three days. The average rise on the 6th day is (2+2+2)/6 = 1 and the average fall is (1+1+1)/6 = 0.5, so the RSI value is (1/(1+0.5)) × 100 = 66.7.

Under normal circumstances the RSI lies between 20 and 80. Above 80 the stock is considered overbought, below 20 oversold, and a value of 50 means buyers and sellers are evenly matched. For example, if the stock price has risen for 6 consecutive days, the average fall on the 6th day is 0 and the RSI is 100. This shows that buyers are in a very strong position, but it also warns investors that the stock may be overbought and that they should guard against the risk of a price drop.
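
The worked example can be reproduced with a few lines of NumPy. This sketch uses plain averages, exactly as in the example; the function name simple_rsi and the order of the daily changes are assumptions. TA-Lib's RSI applies Wilder's smoothing to the averages, so talib.RSI() will generally return different numbers.

```python
import numpy as np

def simple_rsi(changes):
    """RSI over the given daily price changes, using plain averages."""
    changes = np.asarray(changes, dtype=float)
    n = len(changes)
    avg_gain = changes[changes > 0].sum() / n   # average rise
    avg_loss = -changes[changes < 0].sum() / n  # average fall, as a positive number
    return avg_gain / (avg_gain + avg_loss) * 100

# Six daily changes: three rises of 2 and three falls of 1 (order is hypothetical)
rsi_6 = simple_rsi([2, -1, 2, -1, 2, -1])

# Six straight rises: the average fall is 0, so the RSI is 100
rsi_all_up = simple_rsi([1, 1, 1, 1, 1, 1])
```

rsi_6 evaluates to about 66.7 and rsi_all_up to 100, matching the two cases discussed above.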

4.1.5 Use TA-Lib library to generate momentum indicator MOM value

The momentum indicator MOM can be generated with talib.MOM() (step 4 of the code in section 4.2.2).

MOM is short for momentum. It reflects how fast the stock price rises or falls over a period of time, and is calculated as follows.

MOM = closing price of the day - closing price timeperiod trading days earlier

Example:

Suppose you want the MOM value of No. 6 and the parameter timeperiod is set to 5, as in the code. You then subtract the closing price of No. 1 from the closing price of No. 6, so the MOM value of No. 6 is 2.2-1.2 = 1. Similarly, the MOM value of No. 7 is 2.4-1.4 = 1. Connecting the MOM values of consecutive days forms a curve that reflects the rise and fall of the stock price.
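
The same numbers can be obtained with pandas alone. This is a sketch that mirrors the example; it uses shift() rather than TA-Lib, and the day-number index is an assumption.

```python
import pandas as pd

# Closing prices of No. 1 to No. 7 from the example
close = pd.Series([1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4], index=range(1, 8))

# MOM with period 5: the day's close minus the close 5 trading days earlier
mom5 = close - close.shift(5)  # equivalent to close.diff(5)
```

The first five values are NaN because there is no price 5 days earlier, and the values for No. 6 and No. 7 are both 1.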

4.1.6 Generate exponential moving average EMA with TA-Lib library

The exponential moving average EMA can be generated with talib.EMA() (step 4 of the code in section 4.2.2).

EMA is a weighted moving average whose weights decrease exponentially with the age of the data; its values are analyzed to judge the future trend of the stock price.

The calculation formula of EMA is as follows.

EMAtoday = α × Pricetoday + (1 - α) × EMAyesterday

Here EMAtoday is the EMA value of the day, Pricetoday is the closing price of the day, and EMAyesterday is yesterday's EMA value. α is the smoothing coefficient, generally taken as 2/(N+1); when N=6, α = 2/7 and the corresponding EMA is called EMA6, the 6-day exponential moving average. The formula keeps recursing until the first EMA value is reached (the first EMA value is usually the average of the first 5 numbers).

Example: EMA6

Take the first EMA value as the average of the first 5 numbers, so there is no EMA value for the first 5 days. The EMA value of the 6th is the first EMA value, the average of the previous 5 days, which is 1. The EMA value of the 7th is the second EMA value, obtained by applying the formula to the closing price of the 7th and the EMA value of the 6th.
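
One step of the recursion can be written out as a small function. The function name ema_step and the closing price 1.7 are hypothetical; only the first EMA value of 1 comes from the example. The recursion used is EMA_today = α × Price_today + (1 - α) × EMA_yesterday with α = 2/(N+1).

```python
def ema_step(price_today, ema_yesterday, n=6):
    """One EMA update with smoothing coefficient alpha = 2 / (n + 1)."""
    alpha = 2 / (n + 1)
    return alpha * price_today + (1 - alpha) * ema_yesterday

# First EMA value from the example; the next closing price (1.7) is assumed
ema_first = 1.0
ema_second = ema_step(1.7, ema_first)  # (2/7) * 1.7 + (5/7) * 1.0
```

With these numbers ema_second is 1.2. Each further day repeats the same step with the previous result as ema_yesterday.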

4.1.7 Use the TA-Lib library to generate the moving average convergence divergence MACD value

The moving average convergence divergence (MACD) values can be generated with talib.MACD() (step 4 of the code in section 4.2.2).

MACD is a commonly used indicator in the stock market. It is a derived variable based on EMA values; its calculation is relatively complicated, and interested readers can look it up themselves. Here it is enough to know that MACD is a trend indicator: its changes represent changes in the market trend, and the MACD at different K-line (candlestick) levels represents the buying and selling trend in the corresponding period.
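
For readers who want to see how MACD builds on EMA, here is a minimal pandas sketch. The price series is hypothetical, the periods 12/26/9 are the commonly used defaults (the model code in section 4.2.2 passes 6/12/9), and pandas' ewm() initialises the average differently from TA-Lib, so the early values will not match talib.MACD().

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices: a steady upward drift
close = pd.Series(np.linspace(10, 20, 60))

fast, slow, signal = 12, 26, 9  # assumed periods
ema_fast = close.ewm(span=fast, adjust=False).mean()
ema_slow = close.ewm(span=slow, adjust=False).mean()

macd_line = ema_fast - ema_slow                                # fast EMA minus slow EMA
signal_line = macd_line.ewm(span=signal, adjust=False).mean()  # EMA of the MACD line
macd_hist = macd_line - signal_line                            # gap between the two lines
```

In a steady uptrend the fast EMA stays above the slow EMA, so macd_line is positive. The three series play the roles of the three values returned by talib.MACD().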

4.2 Model building

4.2.1 Import the required libraries

# Import the required libraries
import tushare as ts
import numpy as np
import pandas as pd
import talib
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

4.2.2 Get data

# 1. Get the basic stock data
# Note: get_k_data() belongs to tushare's legacy interface and may not work in recent versions
import tushare as ts
df = ts.get_k_data('000002',start='2015-01-01',end='2019-12-31')
df = df.set_index('date')

# 2. Construct simple derived variables
df['close-open'] = (df['close'] - df['open']) / df['open']
df['high-low'] = (df['high'] - df['low']) / df['low']
df['pre_close'] = df['close'].shift(1)
df['price_change'] = df['close'] - df['pre_close']
df['p_change'] = (df['close'] - df['pre_close']) / df['pre_close'] * 100

# 3. Construct the moving average variables
df['MA5'] = df['close'].rolling(5).mean()
df['MA10'] = df['close'].rolling(10).mean()
df.dropna(inplace=True)

# 4. Construct derived variables with the TA-Lib library
df['RSI'] = talib.RSI(df['close'],timeperiod=12)
df['MOM'] = talib.MOM(df['close'],timeperiod=5)
df['EMA12'] = talib.EMA(df['close'],timeperiod=12) # 12-day exponential moving average
df['EMA26'] = talib.EMA(df['close'],timeperiod=26) # 26-day exponential moving average
df['MACD'],df['MACDsignal'],df['MACDhist'] = talib.MACD(df['close'],fastperiod=6,slowperiod=12,signalperiod=9)
df.dropna(inplace=True)

4.2.3 Extract feature variables and target variables

X = df[['close','volume','close-open','MA5','MA10','high-low','RSI','MOM','EMA12','MACD','MACDsignal','MACDhist']]
y = np.where(df['price_change'].shift(-1) > 0,1,-1)

The core point to emphasize is that the model uses the current day's stock data to predict whether the price rises or falls on the next day, so the target variable y must be the next day's rise or fall. Why? Many of the feature variables are only known after the day's trading ends (the closing price close, for example, is only available at the close), so the day's own rise or fall cannot be predicted during trading. Once the market has closed the required data is available, but by then the day's rise or fall is already settled and there is nothing left to predict. The current day's data is therefore used to predict the next day's rise or fall.

The second line of code uses the where() function from the NumPy library. Its three arguments are the condition, the value assigned where the condition holds, and the value assigned where it does not. df['price_change'].shift(-1) uses the shift() function to move all the data in the price_change (stock price change) column up by 1 row, so that each row holds the next day's price change. The condition therefore asks whether the next day's price change is greater than 0: if it is, the stock price rose the next day and y is assigned 1; otherwise y is assigned -1. The prediction result thus has only two classes, 1 and -1.
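
The combination of shift(-1) and where() is easiest to follow on a handful of values. The five price changes below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price changes for five consecutive days
price_change = pd.Series([0.3, -0.2, 0.1, 0.0, 0.4])

# Row i now holds the price change of day i+1; the last row becomes NaN
next_day_change = price_change.shift(-1)

# 1 if the next day's change is positive, -1 otherwise
y = np.where(next_day_change > 0, 1, -1)
```

Here y is [-1, 1, -1, 1, -1]. The last row has no next day, and because the comparison NaN > 0 is False it is labelled -1; the same applies to the last row of the real data set.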

4.2.4 Divide training set and test set

Note that the data must be divided in time order; the train_test_split() function must not be used to divide it randomly. Stock price movements have a time structure that random division would destroy: the model has to predict the next day's rise or fall from the current day's data, not from the data of an arbitrary day.

The first 90% of the data is used as the training set, and the last 10% of the data is used as the test set. The code is as follows.

X_length = X.shape[0]
split = int(X_length * 0.9)
X_train,X_test = X[:split],X[split:]
y_train,y_test = y[:split],y[split:]

4.2.5 Model building

model = RandomForestClassifier(max_depth=3,n_estimators=10,min_samples_leaf=10,random_state=123)
model.fit(X_train,y_train)

The model parameters are set as follows. max_depth, the maximum depth of each decision tree, is 3, so each tree has at most 3 levels of splits. n_estimators, the number of weak learners (decision trees), is 10, so the random forest contains 10 decision trees. min_samples_leaf, the minimum number of samples in a leaf node, is 10, so a split is only made if it leaves at least 10 samples in each resulting leaf. random_state keeps the results the same from run to run; the number 123 has no special meaning and can be replaced by another number.

4.3 Model evaluation and use

4.3.1 Predict the rise and fall of the stock price for the next day

Use the predict_proba() function to predict the probability of each sample belonging to each category.
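
The calls look roughly like the sketch below. Because the real stock data requires tushare and TA-Lib, it uses randomly generated stand-in features; X_demo, y_demo and demo_model are hypothetical names, while the model parameters are the ones from section 4.2.5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))  # hypothetical features
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0, 1, -1)

split = int(len(X_demo) * 0.9)  # first 90% for training, last 10% for testing
demo_model = RandomForestClassifier(max_depth=3, n_estimators=10,
                                    min_samples_leaf=10, random_state=123)
demo_model.fit(X_demo[:split], y_demo[:split])

y_pred = demo_model.predict(X_demo[split:])         # class labels, -1 or 1
y_proba = demo_model.predict_proba(X_demo[split:])  # one column per class
```

The columns of y_proba follow the order of demo_model.classes_, which is [-1, 1] here, so y_proba[:, 1] is the predicted probability of a rise.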

4.3.2 Model Accuracy Evaluation

The overall prediction accuracy can then be checked, for example with the accuracy_score() function imported in section 4.2.1.

The printed score is 0.40, meaning the model correctly predicts about 40% of the samples in the test set. This accuracy is not high, which is in line with the ever-changing nature of the stock market.
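
As a small illustration of what accuracy_score() computes, consider ten hypothetical test days. The labels below are made up and unrelated to the 0.40 result above.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for ten test days
y_true = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1]
y_hat = [1, 1, -1, 1, -1, 1, 1, -1, -1, 1]

# Share of days on which the prediction matches the true label
score = accuracy_score(y_true, y_hat)
```

Six of the ten predictions match, so score is 0.6. For a fitted classifier, model.score(X_test, y_test) returns the same accuracy.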

4.3.3 Analyzing feature importance of feature variables

The feature importance of each feature variable can then be analyzed.

The figure shows that feature variables such as the closing price close, MA5, and MACDhist have a large influence on the prediction of the next day's rise or fall.
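
In scikit-learn, a fitted random forest exposes its importances through the feature_importances_ attribute. The sketch below shows one way to tabulate them; the data and the feature names f_signal, f_noise1 and f_noise2 are hypothetical, and it is not necessarily the exact code behind the figure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
features = ['f_signal', 'f_noise1', 'f_noise2']  # hypothetical feature names
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=features)
y_demo = np.where(X_demo['f_signal'] > 0, 1, -1)  # only f_signal carries information

demo_model = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=123)
demo_model.fit(X_demo, y_demo)

# One importance per feature, sorted from most to least important
importance = (pd.DataFrame({'feature': features,
                            'importance': demo_model.feature_importances_})
              .sort_values('importance', ascending=False))
```

The importances sum to 1, and the informative feature comes out on top. Calling importance.plot.bar(x='feature', y='importance') would draw a chart of the result.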

4.4 Parameter tuning

from sklearn.model_selection import GridSearchCV
parameters={'n_estimators':[5,10,20],'max_depth':[2,3,4,5,6],'min_samples_leaf':[5,10,20,30]}
new_model = RandomForestClassifier(random_state=123)
grid_search = GridSearchCV(new_model,parameters,cv=6,scoring='accuracy')
grid_search.fit(X_train,y_train)
grid_search.best_params_

# Output
# {'max_depth': 5, 'min_samples_leaf': 20, 'n_estimators': 5}
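
With cv=6, GridSearchCV splits the training data into ordinary folds, so some folds validate on data that is older than part of their training data. One possible variation, in keeping with the time-ordering principle of section 4.2.4, is to pass a TimeSeriesSplit object as cv. This is a suggestion rather than the method used above; the data and the reduced parameter grid are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(240, 3))  # hypothetical rows, assumed to be in time order
y_demo = np.where(X_demo[:, 0] > 0, 1, -1)

params = {'n_estimators': [5, 10], 'max_depth': [2, 3]}
tscv = TimeSeriesSplit(n_splits=4)  # every fold trains on earlier rows only

search = GridSearchCV(RandomForestClassifier(random_state=123),
                      params, cv=tscv, scoring='accuracy')
search.fit(X_demo, y_demo)
best = search.best_params_
```

Each split produced by tscv validates on rows that come strictly after its training rows, which mimics predicting the future from the past.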

4.5 Plotting the return backtest curve

The prediction accuracy of the model was evaluated above, but in commercial practice the return backtest curve (also known as the net value curve) matters more: it shows whether the returns obtained by following the model are better than those obtained without it.

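# Work on a copy so that adding columns does not write to a slice of X
X_test = X_test.copy()

# Add a column of model predictions to the test data
X_test['prediction'] = model.predict(X_test)

# Compute the daily rate of change of the stock price
X_test['p_change'] = (X_test['close'] - X_test['close'].shift(1)) / X_test['close'].shift(1)

# Compute the cumulative return of the stock itself
# For example, if the initial price is 1 and the price rises by 10% on each of 2 days,
# cumprod() gives a price of 1×(1+10%)×(1+10%)=1.21 after 2 days,
# which also means a 2-day return of 21%.
X_test['origin'] = (X_test['p_change'] + 1).cumprod()

# Compute the cumulative return obtained by following the model's predictions
X_test['strategy'] = (X_test['prediction'].shift(1) * X_test['p_change'] + 1).cumprod()

X_test[['strategy','origin']].dropna().plot()
# Tilt the date labels automatically
plt.gcf().autofmt_xdate()
plt.show()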

The visualization result is shown in the figure below. The upper curve is the return curve obtained by following the model, and the lower curve is the return curve of the stock itself. In this backtest, the strategy based on the model earns more than simply holding the stock.
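
How the two cumprod() lines compound returns can be traced on three days of made-up numbers. The returns and signals below are hypothetical.

```python
import pandas as pd

# Hypothetical daily returns and model signals (1 = predicted rise, -1 = predicted fall)
p_change = pd.Series([0.10, 0.10, -0.05])
prediction = pd.Series([1, -1, 1])  # signal produced at the close of each day

# Holding the stock: compound the daily returns
origin = (p_change + 1).cumprod()

# Strategy: yesterday's signal decides today's position
strategy = (prediction.shift(1) * p_change + 1).cumprod()
```

origin is 1.1, 1.21 and 1.1495. strategy is NaN on the first day (there is no earlier signal), then 1.1 and 1.155: on the third day the previous signal was -1, so the 5% fall counts as a 5% gain. This shows that the backtest implicitly assumes short selling is possible.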

Note that the quantitative finance content explained here is fairly simple and the model is highly idealized. The real stock market is far more complicated, and stock trading is subject to many restrictions, such as no short selling, no T+0 trading, and transaction fees.

The random forest model is a very important ensemble model. It retains many advantages of the decision tree model while avoiding shortcomings such as its tendency to overfit, and it is widely used in practice.

reference books

"Python Big Data Analysis and Machine Learning Business Case Practice"

Origin blog.csdn.net/qq_42433311/article/details/124319618