Basic Principles and Code Implementation of the Random Forest Model
Introduction to Ensemble Models
Ensemble learning is an important branch of machine learning.
It trains a series of weak learners (also called base models) and combines their results to obtain better performance than a single learner achieves on its own.
Two families of ensemble learning algorithms are in common use:
- Bagging: its typical representative is the random forest model.
- Boosting: its typical representatives are the AdaBoost, GBDT, XGBoost and LightGBM models.
Fundamentals of Random Forest Models
A random forest randomly samples the original data set to form n different sample data sets, builds one decision tree model on each of them, and then combines the n trees into the final result: the average of their predictions for regression models, or a vote among them for classification models.
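The combination step can be sketched in a few lines. This is only an illustration of averaging and majority voting; the function names and the per-tree predictions are hypothetical, and a real library may combine its trees differently in detail.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_outputs):
    """Average the numeric predictions of the individual trees."""
    return mean(tree_outputs)


def aggregate_classification(tree_votes):
    """Return the class label predicted by the most trees (majority vote)."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single sample
regression_outputs = [2.0, 3.0, 2.5, 3.5, 4.0]
classification_votes = [1, 0, 1, 1, 0]

print(aggregate_regression(regression_outputs))        # 3.0
print(aggregate_classification(classification_votes))  # 1
```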
To preserve the generalization ability of the model, a random forest follows two basic principles when building each tree:
- Random data: each decision tree is trained on data drawn at random, with replacement, from the full data set. For example, with 1000 original records, drawing 1000 times with replacement forms a new data set (because the draws are with replacement, some records may be selected several times and others not at all), which is then used to train one decision tree.
- Random features: if each sample has M features, specify a constant k < M and randomly select k of the M features. In Python, scikit-learn's RandomForestClassifier, for example, uses $k = \sqrt{M}$ by default.
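The two sources of randomness can be illustrated with the standard library alone. This is a sketch under assumptions: the feature count of 16 and the seed are made up, and library implementations may differ in detail, for example by drawing a fresh feature subset at each split rather than once per tree.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_samples = 1000   # size of the original data set, as in the example above
n_features = 16    # hypothetical feature dimension M

# Random data: draw n_samples row indices with replacement (a bootstrap sample)
bootstrap_indices = [rng.randrange(n_samples) for _ in range(n_samples)]
unique_rows = set(bootstrap_indices)

# Random features: choose k = sqrt(M) distinct feature indices
k = int(math.sqrt(n_features))
feature_subset = sorted(rng.sample(range(n_features), k))

print(len(bootstrap_indices))  # 1000 draws in total
print(len(unique_rows))        # fewer than 1000: some rows repeat, others are never drawn
print(feature_subset)          # the k selected feature indices
```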
Like the decision tree model, the random forest can be used for classification analysis and regression analysis.
Code
Random Forest Classification Model:
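A minimal sketch of a random forest classifier, assuming scikit-learn is the library in use. The toy data and the parameter values are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data invented for illustration: five samples with two features each
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]

# n_estimators is the number of trees; random_state makes the sampling repeatable
model = RandomForestClassifier(n_estimators=10, random_state=123)
model.fit(X, y)

prediction = model.predict([[5, 5]])  # predicted class label for one new sample
print(prediction)
```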
Random Forest Regression Model:
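The regression counterpart looks almost identical. Again this is a sketch that assumes scikit-learn, with toy data invented for illustration.

```python
from sklearn.ensemble import RandomForestRegressor

# Toy data invented for illustration: the target is now a continuous value
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]

model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)

prediction = model.predict([[5, 5]])  # average of the individual trees' predictions
print(prediction)
```

Because each tree predicts an average of training targets, the forest's prediction stays within the range of the training targets, here between 1 and 5.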
Quantitative Finance - Stock Data Acquisition
Basic introduction to tushare library
Get the data of all stocks for a given day by date
Single stock for a certain day:
# Multiple stocks
df = pro.daily(ts_code='000001.SZ,600000.SH', start_date='20180701', end_date='20180718')
ts_code trade_date open high low close pre_close change pct_chg \
0 600000.SH 20180718 9.51 9.64 9.48 9.51 9.44 0.07 0.74
1 000001.SZ 20180718 8.75 8.85 8.69 8.70 8.72 -0.02 -0.23
2 000001.SZ 20180717 8.74 8.75 8.66 8.72 8.73 -0.01 -0.11
3 600000.SH 20180717 9.41 9.48 9.38 9.44 9.41 0.03 0.32
4 000001.SZ 20180716 8.85 8.90 8.69 8.73 8.88 -0.15 -1.69
5 600000.SH 20180716 9.50 9.54 9.34 9.41 9.49 -0.08 -0.84
6 600000.SH 20180713 9.57 9.58 9.46 9.49 9.47 0.02 0.21
7 000001.SZ 20180713 8.92 8.94 8.82 8.88 8.88 0.00 0.00
8 000001.SZ 20180712 8.60 8.97 8.58 8.88 8.64 0.24 2.78
9 600000.SH 20180712 9.41 9.61 9.39 9.57 9.38 0.19 2.03
10 000001.SZ 20180711 8.76 8.83 8.68 8.78 8.98 -0.20 -2.23
11 600000.SH 20180711 9.37 9.44 9.32 9.38 9.57 -0.19 -1.99
12 000001.SZ 20180710 9.02 9.02 8.89 8.98 9.03 -0.05 -0.55
13 600000.SH 20180710 9.61 9.65 9.50 9.57 9.60 -0.03 -0.31
14 000001.SZ 20180709 8.69 9.03 8.68 9.03 8.66 0.37 4.27
15 600000.SH 20180709 9.37 9.63 9.37 9.60 9.37 0.23 2.45
16 600000.SH 20180706 9.31 9.43 9.17 9.37 9.26 0.11 1.19
17 000001.SZ 20180706 8.61 8.78 8.45 8.66 8.60 0.06 0.70
18 600000.SH 20180705 9.26 9.35 9.22 9.26 9.31 -0.05 -0.54
19 000001.SZ 20180705 8.62 8.73 8.55 8.60 8.61 -0.01 -0.12
20 600000.SH 20180704 9.34 9.42 9.28 9.31 9.35 -0.04 -0.43
21 000001.SZ 20180704 8.63 8.75 8.61 8.61 8.67 -0.06 -0.69
22 000001.SZ 20180703 8.69 8.70 8.45 8.67 8.61 0.06 0.70
23 600000.SH 20180703 9.29 9.38 9.20 9.35 9.29 0.06 0.65
24 600000.SH 20180702 9.55 9.55 9.23 9.29 9.56 -0.27 -2.82
25 000001.SZ 20180702 9.05 9.05 8.55 8.61 9.09 -0.48 -5.28
vol amount
0 189227.00 180858.003
1 525152.77 460697.377
2 375356.33 326396.994
3 137134.95 129512.091
4 689845.58 603427.713
5 144141.19 135697.106
6 150263.39 142708.347
7 603378.21 535401.175
8 1140492.31 1008658.828
9 197048.37 188206.858
10 851296.70 744765.824
11 152039.33 142450.919
12 896862.02 803038.965
13 124028.37 118668.133
14 1409954.60 1255007.609
15 221725.65 212109.327
16 225944.43 210564.106
17 988282.69 852071.526
18 164954.38 152978.661
19 835768.77 722169.579
20 144647.77 135000.876
21 711153.37 617278.559
22 1274838.57 1096657.033
23 241235.51 224816.757
24 226690.89 212743.905
25 1315520.13 1158545.868
Generating Derived Variables for Stocks
import tushare as ts
pro = ts.pro_api()  # assumes a token has already been registered with ts.set_token(...)
df = pro.query('daily', ts_code='000002.SZ', start_date='20180701', end_date='20180718')
Calculation of simple derived variables:
The following code first constructs some simple derived variables:
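# The daily output above lists the newest rows first; sort oldest first so shift(1) gives the previous day
df = df.sort_values('trade_date').reset_index(drop=True)
df['close-open'] = (df['close'] - df['open']) / df['open']  # close relative to the open
df['high-low'] = (df['high'] - df['low']) / df['low']  # daily range relative to the low
df['pre_close'] = df['close'].shift(1)  # shift the column down one row to get the previous day's close
df['price-change'] = df['close'] - df['pre_close']  # change from the previous close
df['p_change'] = (df['close'] - df['pre_close']) / df['pre_close'] * 100  # percentage change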
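To try these formulas without calling the API, the sketch below rebuilds five 000001.SZ rows from the daily output shown earlier, in the same newest-first order. The explicit sort is an assumption added here so that shift(1) refers to the previous trading day.

```python
import pandas as pd

# Five rows for 000001.SZ copied from the daily output shown earlier, newest first
df = pd.DataFrame({
    'trade_date': ['20180706', '20180705', '20180704', '20180703', '20180702'],
    'open':  [8.61, 8.62, 8.63, 8.69, 9.05],
    'high':  [8.78, 8.73, 8.75, 8.70, 9.05],
    'low':   [8.45, 8.55, 8.61, 8.45, 8.55],
    'close': [8.66, 8.60, 8.61, 8.67, 8.61],
})

# Sort oldest first so that shift(1) really points at the previous trading day
df = df.sort_values('trade_date').reset_index(drop=True)

df['close-open'] = (df['close'] - df['open']) / df['open']
df['high-low'] = (df['high'] - df['low']) / df['low']
df['pre_close'] = df['close'].shift(1)
df['price-change'] = df['close'] - df['pre_close']
df['p_change'] = df['price-change'] / df['pre_close'] * 100

print(df[['trade_date', 'close', 'pre_close', 'p_change']].round(2))
```

The first row has no previous close, so its pre_close, price-change and p_change are null. For 20180703 the computed p_change rounds to 0.70, which matches the pct_chg value in the original output.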
Moving average (MA) indicator
The 5-day and 10-day moving averages of the closing price can be obtained with the following code:
df['MA5']=df['close'].rolling(5).mean()
df['MA10']=df['close'].rolling(10).mean()
When MA5 is calculated, no average can be produced for the first four rows, because fewer than five days of data are available up to that point, so null values are generated there (for MA10, the same holds for the first nine rows). The dropna() function is usually used to remove these null values and avoid problems in subsequent calculations.
The code is as follows:
df.dropna(inplace=True)  # drop rows containing null values; can also be written as df = df.dropna()
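The effect of rolling() followed by dropna() is easy to check on a short series. The sketch below uses the thirteen 000001.SZ closing prices from the daily output shown earlier, arranged oldest first; the variable names are illustrative.

```python
import pandas as pd

# 000001.SZ closing prices from 20180702 to 20180718, oldest first
closes = [8.61, 8.67, 8.61, 8.60, 8.66, 9.03, 8.98,
          8.78, 8.88, 8.88, 8.73, 8.72, 8.70]
df = pd.DataFrame({'close': closes})

df['MA5'] = df['close'].rolling(5).mean()
df['MA10'] = df['close'].rolling(10).mean()

print(df['MA5'].isna().sum())   # 4 rows without a 5-day average
print(df['MA10'].isna().sum())  # 9 rows without a 10-day average

clean = df.dropna()  # keep only rows where every column has a value
print(len(df), len(clean))  # 13 4
```

With only thirteen trading days in the window, four rows survive dropna(), so a longer date range is needed if more MA10 values are required.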