【Random forest model】

Basic principles and code implementation of the random forest model

Introduction to Ensemble Models

Ensemble learning is an important part of machine learning.
It trains a series of weak learners (also called base models) and combines their results, aiming for better performance than any single learner achieves.

Two families of ensemble algorithms are common:

  • Bagging: the typical model is the random forest.
  • Boosting: typical models include AdaBoost, GBDT, XGBoost and LightGBM.

Fundamentals of Random Forest Models

A random forest draws random samples from the original data set to form n different sample data sets, builds one decision tree on each of them, and then combines the n trees: it averages their outputs for regression and takes a vote for classification.
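The combining step can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the tree predictions are made up for the example. Note that scikit-learn's classifier averages the trees' class probabilities rather than counting hard votes, so this shows the idea only, not that library's exact procedure.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class label predicted by the most trees (majority vote)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the individual tree outputs."""
    return mean(tree_outputs)


# Hypothetical predictions from five trees for one sample.
votes = ["up", "down", "up", "up", "down"]
outputs = [1.0, 2.0, 3.0, 2.0, 2.0]

forest_label = aggregate_classification(votes)   # "up" wins 3 votes to 2
forest_value = aggregate_regression(outputs)     # (1 + 2 + 3 + 2 + 2) / 5
```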
To help the model generalize, a random forest typically follows two principles when building each tree:

  • Random data: rows are drawn at random from the full data set with replacement to form the training data for each decision tree. For example, with 1000 original rows, drawing 1000 times with replacement yields a new data set in which some rows appear several times and others not at all; this set is used to train one tree.
  • Random features: if each sample has M features, fix a constant k < M and randomly select k of the M features. In Python, scikit-learn's RandomForestClassifier defaults to $k = \sqrt{M}$ (max_features='sqrt').
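Both kinds of randomness can be illustrated with the standard library alone. The sketch below is not how any particular library implements them; the seed, the value M = 16 and the variable names are assumptions chosen for the example. For large data sets a bootstrap sample is expected to contain roughly 63% of the distinct original rows, since 1 - (1 - 1/n)^n approaches 1 - 1/e.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_samples = 1000

# Random data: draw 1000 row indices with replacement (a bootstrap sample).
bootstrap_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
n_unique = len(set(bootstrap_idx))   # rows picked at least once
n_left_out = n_samples - n_unique    # rows never picked for this tree

# Random features: pick k = floor(sqrt(M)) of the M features without replacement.
M = 16
k = int(math.sqrt(M))
feature_subset = rng.sample(range(M), k)

print(f"{n_unique} distinct rows sampled, {n_left_out} left out")
print(f"features used: {sorted(feature_subset)}")
```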

Like the decision tree model, the random forest can be used for classification analysis and regression analysis.

Code

Random Forest Classification Model:
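A minimal sketch of a classifier, assuming scikit-learn is installed. The toy data and the parameter values are made up for illustration and are not necessarily those of the original example.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data (made up): two features per sample and a binary label.
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]

# n_estimators is the number of trees; random_state fixes the random sampling.
model = RandomForestClassifier(n_estimators=10, random_state=123)
model.fit(X, y)

pred = model.predict([[5, 5]])  # array holding one predicted class label
print(pred)
```

The fitted trees are available in model.estimators_, which can be handy for inspecting how individual trees differ.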
Random Forest Regression Model:
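A minimal sketch of a regressor, again assuming scikit-learn is installed and using made-up toy data. The prediction is the average of the trees' outputs, so it stays within the range of the training targets.

```python
from sklearn.ensemble import RandomForestRegressor

# Toy data (made up): two features per sample and a continuous target.
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]

# n_estimators is the number of trees; random_state fixes the random sampling.
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)

pred = model.predict([[5, 5]])  # array holding one predicted value
print(pred)
```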

Quantitative Finance - Stock Data Acquisition

Basic introduction to tushare library

The daily interface can return the quotes of every stock for one trading date, the quotes of a single stock on a given day, or several stocks over a date range.
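The first two kinds of query might look like the sketch below. The helper names are hypothetical, and the calls assume that pro.daily accepts trade_date, ts_code, start_date and end_date as in the tushare pro documentation; running them for real requires the tushare package, network access and a valid token.

```python
def make_client(token):
    """Create a tushare pro client (needs `pip install tushare` and a valid token)."""
    import tushare as ts  # imported lazily so the helpers below work without the package
    return ts.pro_api(token)


def daily_all_stocks(pro, trade_date):
    """Daily quotes of every stock for one trading date (format YYYYMMDD)."""
    return pro.daily(trade_date=trade_date)


def daily_one_stock(pro, ts_code, trade_date):
    """Daily quotes of one stock on a single day."""
    return pro.daily(ts_code=ts_code, start_date=trade_date, end_date=trade_date)


# Example usage (the token is a placeholder):
# pro = make_client("YOUR_TOKEN")
# df_day = daily_all_stocks(pro, "20180718")
# df_one = daily_one_stock(pro, "000001.SZ", "20180718")
```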

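import tushare as ts
pro = ts.pro_api()  # assumes a token was already registered with ts.set_token(...)
# Multiple stocks: pass a comma-separated list of codes
df = pro.daily(ts_code='000001.SZ,600000.SH', start_date='20180701', end_date='20180718')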
    ts_code trade_date  open  high   low  close  pre_close  change  pct_chg  \
0   600000.SH   20180718  9.51  9.64  9.48   9.51       9.44    0.07     0.74   
1   000001.SZ   20180718  8.75  8.85  8.69   8.70       8.72   -0.02    -0.23   
2   000001.SZ   20180717  8.74  8.75  8.66   8.72       8.73   -0.01    -0.11   
3   600000.SH   20180717  9.41  9.48  9.38   9.44       9.41    0.03     0.32   
4   000001.SZ   20180716  8.85  8.90  8.69   8.73       8.88   -0.15    -1.69   
5   600000.SH   20180716  9.50  9.54  9.34   9.41       9.49   -0.08    -0.84   
6   600000.SH   20180713  9.57  9.58  9.46   9.49       9.47    0.02     0.21   
7   000001.SZ   20180713  8.92  8.94  8.82   8.88       8.88    0.00     0.00   
8   000001.SZ   20180712  8.60  8.97  8.58   8.88       8.64    0.24     2.78   
9   600000.SH   20180712  9.41  9.61  9.39   9.57       9.38    0.19     2.03   
10  000001.SZ   20180711  8.76  8.83  8.68   8.78       8.98   -0.20    -2.23   
11  600000.SH   20180711  9.37  9.44  9.32   9.38       9.57   -0.19    -1.99   
12  000001.SZ   20180710  9.02  9.02  8.89   8.98       9.03   -0.05    -0.55   
13  600000.SH   20180710  9.61  9.65  9.50   9.57       9.60   -0.03    -0.31   
14  000001.SZ   20180709  8.69  9.03  8.68   9.03       8.66    0.37     4.27   
15  600000.SH   20180709  9.37  9.63  9.37   9.60       9.37    0.23     2.45   
16  600000.SH   20180706  9.31  9.43  9.17   9.37       9.26    0.11     1.19   
17  000001.SZ   20180706  8.61  8.78  8.45   8.66       8.60    0.06     0.70   
18  600000.SH   20180705  9.26  9.35  9.22   9.26       9.31   -0.05    -0.54   
19  000001.SZ   20180705  8.62  8.73  8.55   8.60       8.61   -0.01    -0.12   
20  600000.SH   20180704  9.34  9.42  9.28   9.31       9.35   -0.04    -0.43   
21  000001.SZ   20180704  8.63  8.75  8.61   8.61       8.67   -0.06    -0.69   
22  000001.SZ   20180703  8.69  8.70  8.45   8.67       8.61    0.06     0.70   
23  600000.SH   20180703  9.29  9.38  9.20   9.35       9.29    0.06     0.65   
24  600000.SH   20180702  9.55  9.55  9.23   9.29       9.56   -0.27    -2.82   
25  000001.SZ   20180702  9.05  9.05  8.55   8.61       9.09   -0.48    -5.28   

           vol       amount  
0    189227.00   180858.003  
1    525152.77   460697.377  
2    375356.33   326396.994  
3    137134.95   129512.091  
4    689845.58   603427.713  
5    144141.19   135697.106  
6    150263.39   142708.347  
7    603378.21   535401.175  
8   1140492.31  1008658.828  
9    197048.37   188206.858  
10   851296.70   744765.824  
11   152039.33   142450.919  
12   896862.02   803038.965  
13   124028.37   118668.133  
14  1409954.60  1255007.609  
15   221725.65   212109.327  
16   225944.43   210564.106  
17   988282.69   852071.526  
18   164954.38   152978.661  
19   835768.77   722169.579  
20   144647.77   135000.876  
21   711153.37   617278.559  
22  1274838.57  1096657.033  
23   241235.51   224816.757  
24   226690.89   212743.905  
25  1315520.13  1158545.868  

Generating Derived Variables for Stocks

# Daily quotes of a single stock (000002.SZ) through the generic query interface
pro = ts.pro_api()
df = pro.query('daily', ts_code='000002.SZ', start_date='20180701', end_date='20180718')

Calculation of simple derived variables:
The following code constructs some simple derived variables first:

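# tushare returns the newest rows first; sort by date ascending so that shift() and rolling() look back in time
df = df.sort_values('trade_date').reset_index(drop=True)

df['close-open'] = (df['close'] - df['open']) / df['open']  # intraday move relative to the open
df['high-low'] = (df['high'] - df['low']) / df['low']  # intraday range relative to the low

df['pre_close'] = df['close'].shift(1)  # shift the column down one row to get the previous day's close
df['price-change'] = df['close'] - df['pre_close']  # absolute change from the previous close
df['p_change'] = (df['close'] - df['pre_close']) / df['pre_close'] * 100  # percentage change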
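As a small worked check, the same formulas can be applied to the five 000001.SZ rows for 20180702 to 20180706 from the output shown earlier. This sketch assumes pandas is installed; the name demo is chosen here to avoid overwriting df. Because the rows arrive newest first, it sorts by trade_date before shifting, so that the shifted value really is the previous day's close.

```python
import pandas as pd

# Rows for 000001.SZ copied from the earlier output, newest first.
demo = pd.DataFrame({
    "trade_date": ["20180706", "20180705", "20180704", "20180703", "20180702"],
    "open":  [8.61, 8.62, 8.63, 8.69, 9.05],
    "high":  [8.78, 8.73, 8.75, 8.70, 9.05],
    "low":   [8.45, 8.55, 8.61, 8.45, 8.55],
    "close": [8.66, 8.60, 8.61, 8.67, 8.61],
})

# Oldest row first, so shift(1) pulls the previous trading day's value.
demo = demo.sort_values("trade_date").reset_index(drop=True)

demo["close-open"] = (demo["close"] - demo["open"]) / demo["open"]
demo["high-low"] = (demo["high"] - demo["low"]) / demo["low"]
demo["pre_close"] = demo["close"].shift(1)
demo["price-change"] = demo["close"] - demo["pre_close"]
demo["p_change"] = demo["price-change"] / demo["pre_close"] * 100

print(demo[["trade_date", "close", "pre_close", "p_change"]])
```

The first row has no previous close, so its pre_close, price-change and p_change are NaN. For 20180703 the computed p_change is about 0.70, which agrees with the pct_chg value in the tushare output.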


Moving average (MA) indicator
The following code computes the 5-day and 10-day moving averages of the closing price:

df['MA5']=df['close'].rolling(5).mean()
df['MA10']=df['close'].rolling(10).mean()

When computing an indicator such as MA5, the first four rows have no value, because fewer than five days of data are available at that point, so null values (NaN) are produced. The dropna() function is usually used to remove these rows and avoid problems in later calculations.
The code is as follows:

df.dropna(inplace=True)  # drop rows containing null values; equivalent to df = df.dropna()
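The effect is easy to see on a tiny made-up series. The sketch below assumes pandas is installed and uses seven invented closing prices, so the numbers are illustrative only.

```python
import pandas as pd

# Seven made-up closing prices, oldest first.
demo = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})

# A 5-row window has no result until five values are available.
demo["MA5"] = demo["close"].rolling(5).mean()
n_missing = int(demo["MA5"].isna().sum())   # the first four rows are NaN

# Drop the incomplete rows; the remaining rows all have a valid MA5.
demo = demo.dropna()
print(n_missing, len(demo))
print(demo)
```

With a 10-day window the first nine rows would be dropped, so a short date range leaves very few rows after dropna().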


Building a Stock Rise/Fall Prediction Model


Origin blog.csdn.net/Algernon98/article/details/128659251