9.1.3 Code Implementation of Random Forest Model
Like the decision tree model, the random forest model can be used for both classification analysis and regression analysis.
The corresponding models are random forest classification model (RandomForestClassifier) and random forest regression model (RandomForestRegressor). The base model of the random forest classification model is the classification decision tree model (see section 5.1.2 for details), and the base model of the random forest regression model is the regression decision tree model (see section 5.1.3 for details).
# A simple demonstration of the random forest classification model:
from sklearn.ensemble import RandomForestClassifier
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]
model = RandomForestClassifier(n_estimators=10, random_state=123)
model.fit(X, y)
print(model.predict([[5, 5]]))
[0]
# A simple demonstration of the random forest regression model:
from sklearn.ensemble import RandomForestRegressor
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)
print(model.predict([[5, 5]]))
[2.8]
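To make the ensemble idea concrete, the following sketch (an illustration added here, not the author's code) refits the regression example and checks that the forest's prediction is the average of the predictions of its individual trees. It relies on the estimators_ attribute of a fitted scikit-learn forest; the names sample and tree_preds are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]

model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)

sample = np.array([[5, 5]])
# Each fitted decision tree is stored in model.estimators_
tree_preds = np.array([tree.predict(sample)[0] for tree in model.estimators_])

forest_pred = model.predict(sample)[0]
print(tree_preds)                       # one prediction per tree
print(tree_preds.mean(), forest_pred)   # the two values should agree
```

For the classification model the analogous statement holds for predict_proba(), which averages the class probabilities of the trees.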
9.2 Quantitative Finance - Stock Data Acquisition
9.2.1 Acquisition of stock basic data
Tushare is a free Python interface package for financial data, through which we can retrieve historical market data for analysis at no cost. Its official address is: http://tushare.org/
If you want to view stock market data, you can visit the corresponding website: http://tushare.org/trading.html
1. Basic introduction of Tushare library
It is recommended to install the Tushare library with pip. Taking Windows as an example: press Win + R to open the Run box, type cmd and press Enter, then type pip install tushare in the window that opens and press Enter to install. If you are working in the Jupyter Notebook editor mentioned in Section 1.2.3, you only need to type !pip install tushare in a code cell (note that the exclamation mark must be the English, half-width character) and then run that cell.
(1) Obtain daily market data
import tushare as ts
df = ts.get_hist_data('000002', start='2018-01-01', end='2019-01-31')
df.head()
This interface will soon stop being updated; please switch to the Pro interface as soon as possible: https://tushare.pro/document/2
Note that if you omit the start and end dates and simply write ts.get_hist_data('000002'), the function by default returns the data for the 3 years up to the current day. The code above can also be abbreviated by passing the dates positionally:
df = ts.get_hist_data('000002','2018-01-01', '2019-01-31')
df.head()
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||
2019-01-31 | 27.39 | 28.15 | 27.75 | 27.00 | 411857.59 | 0.54 | 1.99 | 26.800 | 26.153 | 25.641 | 426579.02 | 351523.31 | 320269.20 |
2019-01-30 | 26.70 | 27.82 | 27.21 | 26.63 | 592303.19 | 0.33 | 1.23 | 26.332 | 25.875 | 25.457 | 391193.72 | 334927.14 | 310794.00 |
2019-01-29 | 25.91 | 26.88 | 26.88 | 25.87 | 368071.62 | 0.82 | 3.15 | 25.952 | 25.696 | 25.292 | 302102.48 | 302443.43 | 293529.36 |
2019-01-28 | 26.20 | 26.62 | 26.06 | 25.86 | 308906.56 | -0.04 | -0.15 | 25.656 | 25.524 | 25.139 | 304355.52 | 302512.15 | 291266.32 |
2019-01-25 | 25.51 | 26.35 | 26.10 | 25.49 | 451756.16 | 0.69 | 2.71 | 25.574 | 25.420 | 25.008 | 293674.18 | 289949.63 | 293446.08 |
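The output above is ordered newest-first, whereas time-series operations such as shift() and rolling() are easier to reason about when the rows run from oldest to newest. As a minimal sketch, the snippet below builds a small stand-in DataFrame by hand (the closing prices are copied from the table above; df_demo is an illustrative name, not a Tushare object) and sorts it by date.

```python
import pandas as pd

# Hand-built stand-in for the newest-first output of get_hist_data()
df_demo = pd.DataFrame(
    {"close": [27.75, 27.21, 26.88, 26.06, 26.10]},
    index=["2019-01-31", "2019-01-30", "2019-01-29", "2019-01-28", "2019-01-25"],
)
df_demo.index.name = "date"

# ISO-formatted date strings sort chronologically, so sort_index() is enough
df_sorted = df_demo.sort_index()
print(df_sorted)
```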
Supplementary knowledge point: the get_k_data() function
Because get_hist_data() returns not only the basic price information of a stock but also derived variables such as price changes and moving averages, it can retrieve at most the previous 3 years of data. To retrieve more than 3 years of daily data, use ts.get_k_data(), which returns only the basic price data. The code is as follows:
df = ts.get_k_data('000002', start='2000-01-01', end='2019-01-31')
df.head()
date | open | close | high | low | volume | code | |
---|---|---|---|---|---|---|---|
0 | 2000-01-04 | 0.584 | 0.614 | 0.620 | 0.572 | 45747.08 | 000002 |
1 | 2000-01-05 | 0.617 | 0.599 | 0.623 | 0.596 | 46136.73 | 000002 |
2 | 2000-01-06 | 0.596 | 0.627 | 0.632 | 0.587 | 71920.31 | 000002 |
3 | 2000-01-07 | 0.631 | 0.655 | 0.656 | 0.624 | 136349.36 | 000002 |
4 | 2000-01-10 | 0.673 | 0.721 | 0.721 | 0.665 | 142424.86 | 000002 |
Unlike get_hist_data(), get_k_data() does not set the date as the row index by default; the date remains an ordinary column (the date column). To turn the date column into the row index, use the set_index() function:
df = df.set_index('date')  # or equivalently: df.set_index('date', inplace=True)
df.head()
open | close | high | low | volume | code | |
---|---|---|---|---|---|---|
date | ||||||
2000-01-04 | 0.584 | 0.614 | 0.620 | 0.572 | 45747.08 | 000002 |
2000-01-05 | 0.617 | 0.599 | 0.623 | 0.596 | 46136.73 | 000002 |
2000-01-06 | 0.596 | 0.627 | 0.632 | 0.587 | 71920.31 | 000002 |
2000-01-07 | 0.631 | 0.655 | 0.656 | 0.624 | 136349.36 | 000002 |
2000-01-10 | 0.673 | 0.721 | 0.721 | 0.665 | 142424.86 | 000002 |
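If date-based slicing is needed later, it can help to convert the date column to real timestamps before making it the index. The sketch below assumes the date column holds strings such as '2000-01-04' (as the output above suggests) and uses a small hand-built DataFrame in place of the Tushare result.

```python
import pandas as pd

# Stand-in for the get_k_data() output, with values copied from the table above
df_demo = pd.DataFrame({
    "date": ["2000-01-04", "2000-01-05", "2000-01-06", "2000-01-07", "2000-01-10"],
    "close": [0.614, 0.599, 0.627, 0.655, 0.721],
})

# Convert the strings to timestamps, then use them as the row index
df_demo["date"] = pd.to_datetime(df_demo["date"])
df_demo = df_demo.set_index("date")

# A DatetimeIndex supports label-based slicing by date (both ends inclusive)
subset = df_demo.loc["2000-01-05":"2000-01-07"]
print(subset)
```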
(2) Obtain minute-level data
Minute-level data can be obtained by setting the ktype parameter; for example, ktype='5' requests 5-minute bars:
df = ts.get_hist_data('000002', ktype='5')
df.head()
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||
2020-01-03 15:00:00 | 32.06 | 32.07 | 32.06 | 32.05 | 3920.32 | 0.00 | 0.00 | 32.122 | 32.113 | 32.0350 | 15322.7 | 17669.5 | 13041.0 | 0.00 |
2020-01-03 14:55:00 | 32.11 | 32.11 | 32.07 | 32.03 | 8377.52 | -0.04 | -0.12 | 32.136 | 32.103 | 32.0290 | 19359.3 | 17817.5 | 13428.9 | 0.01 |
2020-01-03 14:50:00 | 32.20 | 32.21 | 32.12 | 32.11 | 13402.00 | -0.08 | -0.25 | 32.154 | 32.093 | 32.0175 | 23136.3 | 17962.0 | 13959.7 | 0.01 |
2020-01-03 14:45:00 | 32.16 | 32.21 | 32.20 | 32.12 | 24470.90 | 0.04 | 0.12 | 32.160 | 32.078 | 32.0050 | 24442.3 | 17137.9 | 13903.3 | 0.03 |
2020-01-03 14:40:00 | 32.13 | 32.18 | 32.16 | 32.13 | 26443.00 | 0.03 | 0.09 | 32.132 | 32.056 | 31.9880 | 23976.3 | 15128.1 | 13491.1 | 0.03 |
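Minute bars like these can be aggregated into coarser bars with pandas. The following sketch rebuilds the five rows above by hand and resamples them to 15-minute bars. It assumes that each bar is stamped with the end of its interval, which is why the bins are closed and labelled on the right; this is an assumption about the data, not something the Tushare documentation is quoted on here.

```python
import pandas as pd

# The five 5-minute bars shown above, rebuilt by hand and sorted oldest-first
bars = pd.DataFrame(
    {
        "open":   [32.06, 32.11, 32.20, 32.16, 32.13],
        "high":   [32.07, 32.11, 32.21, 32.21, 32.18],
        "close":  [32.06, 32.07, 32.12, 32.20, 32.16],
        "low":    [32.05, 32.03, 32.11, 32.12, 32.13],
        "volume": [3920.32, 8377.52, 13402.00, 24470.90, 26443.00],
    },
    index=pd.to_datetime([
        "2020-01-03 15:00:00", "2020-01-03 14:55:00", "2020-01-03 14:50:00",
        "2020-01-03 14:45:00", "2020-01-03 14:40:00",
    ]),
).sort_index()

# Aggregate into 15-minute bars: first open, highest high, lowest low,
# last close, and total volume within each interval
bars_15 = bars.resample("15min", closed="right", label="right").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)
print(bars_15)
```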
(3) Obtain real-time market data
The following code retrieves a stock's current quote and trading information in real time:
df = ts.get_realtime_quotes('000002')
df
name | open | pre_close | price | high | low | bid | ask | volume | amount | ... | a2_p | a3_v | a3_p | a4_v | a4_p | a5_v | a5_p | date | time | code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 万 科A | 32.710 | 32.560 | 32.050 | 32.810 | 31.780 | 32.040 | 32.050 | 80553629 | 2584309903.290 | ... | 32.060 | 3005 | 32.070 | 119 | 32.080 | 344 | 32.090 | 2020-01-03 | 15:00:03 | 000002 |
1 rows × 33 columns
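The result is the price information at the moment the code is run; if it is run after the market closes, it returns the information for that day's close. If there are too many columns, you can select just the ones you need with the usual DataFrame column selection, as follows: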
df = df[['code','name','price','bid','ask','volume','amount','time']]
df
code | name | price | bid | ask | volume | amount | time | |
---|---|---|---|---|---|---|---|---|
0 | 000002 | 万 科A | 32.050 | 32.040 | 32.050 | 80553629 | 2584309903.290 | 15:00:03 |
To get real-time data for several stock codes at once, use the following code:
df = ts.get_realtime_quotes(['000002','000980','000981'])
df
name | open | pre_close | price | high | low | bid | ask | volume | amount | ... | a2_p | a3_v | a3_p | a4_v | a4_p | a5_v | a5_p | date | time | code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 万 科A | 32.710 | 32.560 | 32.050 | 32.810 | 31.780 | 32.040 | 32.050 | 80553629 | 2584309903.290 | ... | 32.060 | 3005 | 32.070 | 119 | 32.080 | 344 | 32.090 | 2020-01-03 | 15:00:03 | 000002 |
1 | 众泰汽车 | 3.010 | 3.000 | 3.020 | 3.040 | 2.970 | 3.010 | 3.020 | 32495074 | 97566972.190 | ... | 3.030 | 4849 | 3.040 | 3840 | 3.050 | 2811 | 3.060 | 2020-01-03 | 15:00:03 | 000980 |
2 | ST银亿 | 1.870 | 1.890 | 1.810 | 1.920 | 1.800 | 1.810 | 1.820 | 40518670 | 74744476.400 | ... | 1.830 | 2939 | 1.840 | 4163 | 1.850 | 1449 | 1.860 | 2020-01-03 | 15:00:03 | 000981 |
3 rows × 33 columns
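The trailing zeros in the output above (for example 32.050) suggest that the quote fields may be returned as strings rather than numbers. Under that assumption, the sketch below shows one way to convert selected columns before doing arithmetic on them; the quotes DataFrame is built by hand and the pct_change column is an illustrative addition.

```python
import pandas as pd

# Hand-built stand-in for a real-time quote table whose fields are strings
quotes = pd.DataFrame({
    "code": ["000002", "000980"],
    "price": ["32.050", "3.020"],
    "pre_close": ["32.560", "3.000"],
})

# Convert only the numeric fields; the stock code must stay a string
num_cols = ["price", "pre_close"]
quotes[num_cols] = quotes[num_cols].apply(pd.to_numeric)

# Percentage change relative to the previous close
quotes["pct_change"] = (quotes["price"] / quotes["pre_close"] - 1) * 100
print(quotes)
```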
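(4) Obtain tick data
The following code retrieves historical tick data, that is, the details of each individual trade (in the type column of the output, 买盘 denotes a buy order and 卖盘 a sell order):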
df = ts.get_tick_data('000002', date='2018-12-12', src='tt')
df.head()
D:\Anaconda\Anaconda\lib\site-packages\tushare\stock\trading.py:182: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
skiprows=[0])
time | price | change | volume | amount | type | |
---|---|---|---|---|---|---|
0 | 09:25:04 | 26.31 | 0.34 | 6077 | 15988903 | 卖盘 |
1 | 09:30:00 | 26.33 | 0.02 | 197 | 518651 | 买盘 |
2 | 09:30:04 | 26.33 | 0.00 | 4623 | 12173863 | 卖盘 |
3 | 09:30:06 | 26.34 | 0.01 | 391 | 1030134 | 买盘 |
4 | 09:30:09 | 26.35 | 0.01 | 3289 | 8664911 | 买盘 |
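Tick data is often summarised by trade direction. As a small sketch, the snippet below rebuilds the five ticks shown above by hand and totals the traded volume per value of the type column (买盘 for buy orders, 卖盘 for sell orders) with groupby(); the names ticks and volume_by_side are illustrative.

```python
import pandas as pd

# The five ticks shown above, rebuilt by hand
ticks = pd.DataFrame({
    "time": ["09:25:04", "09:30:00", "09:30:04", "09:30:06", "09:30:09"],
    "price": [26.31, 26.33, 26.33, 26.34, 26.35],
    "volume": [6077, 197, 4623, 391, 3289],
    "type": ["卖盘", "买盘", "卖盘", "买盘", "买盘"],
})

# Total volume for buy-side and sell-side trades
volume_by_side = ticks.groupby("type")["volume"].sum()
print(volume_by_side)
```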
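(5) Obtain index information
The following code retrieves information on indices such as the Shanghai Composite Index:
df = ts.get_index()
df.head()  # Note (2020-01-04): the column names tushare currently returns for indices are somewhat misaligned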
code | name | change | open | preclose | close | high | low | volume | amount | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 00上证指数 | 3089.0220 | 0.33 | 3085.1976 | 3083.7858 | 3093.8192 | 3074.5178 | 0.0 | 2.899917e+11 | 0.0 |
2 | 00A股指数 | 3236.7077 | 0.33 | 3232.6892 | 3231.1885 | 3241.7436 | 3221.4906 | 0.0 | 2.899041e+11 | 0.0 |
3 | 00B股指数 | 261.0510 | 0.00 | 261.1236 | 261.7619 | 261.7619 | 260.2429 | 0.0 | 8.764934e+07 | 0.0 |
8 | 00综合指数 | 3006.0295 | 0.39 | 2999.1744 | 3006.5318 | 3018.1699 | 2998.4266 | 0.0 | 6.499701e+10 | 0.0 |
9 | 0上证380 | 4885.0267 | 0.23 | 4881.7235 | 4879.5471 | 4890.8838 | 4858.4325 | 0.0 | 5.888844e+10 | 0.0 |
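9.2.2 Generating derived stock variables
1. Generating basic stock data
First, use the get_k_data() function from the previous section to fetch the basic stock data from 2015-01-01 to 2019-12-31: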
df = ts.get_k_data('000002',start='2015-01-01',end='2019-12-31')
df.head()
date | open | close | high | low | volume | code | |
---|---|---|---|---|---|---|---|
0 | 2015-01-05 | 12.436 | 12.885 | 13.214 | 12.289 | 6560835.0 | 000002 |
1 | 2015-01-06 | 12.617 | 12.410 | 12.954 | 12.142 | 3346346.0 | 000002 |
2 | 2015-01-07 | 12.324 | 12.298 | 12.531 | 12.099 | 2642051.0 | 000002 |
3 | 2015-01-08 | 12.375 | 11.745 | 12.419 | 11.632 | 2639394.0 | 000002 |
4 | 2015-01-09 | 11.701 | 11.624 | 12.289 | 11.485 | 3294584.0 | 000002 |
# The set_index() function sets the date column as the row index:
df = df.set_index('date')
df.head()
open | close | high | low | volume | code | |
---|---|---|---|---|---|---|
date | ||||||
2015-01-05 | 12.436 | 12.885 | 13.214 | 12.289 | 6560835.0 | 000002 |
2015-01-06 | 12.617 | 12.410 | 12.954 | 12.142 | 3346346.0 | 000002 |
2015-01-07 | 12.324 | 12.298 | 12.531 | 12.099 | 2642051.0 | 000002 |
2015-01-08 | 12.375 | 11.745 | 12.419 | 11.632 | 2639394.0 | 000002 |
2015-01-09 | 11.701 | 11.624 | 12.289 | 11.485 | 3294584.0 | 000002 |
2. Computing simple derived variables
The following code first constructs some simple derived variables:
df['close-open'] = (df['close'] - df['open'])/df['open']
df['high-low'] = (df['high'] - df['low'])/df['low']
df['pre_close'] = df['close'].shift(1)  # shift the whole column down one row to get the previous day's close
df['price_change'] = df['close']-df['pre_close']
df['p_change'] = (df['close']-df['pre_close'])/df['pre_close']*100
df.head()
open | close | high | low | volume | code | close-open | high-low | pre_close | price_change | p_change | |
---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||
2015-01-05 | 12.436 | 12.885 | 13.214 | 12.289 | 6560835.0 | 000002 | 0.036105 | 0.075271 | NaN | NaN | NaN |
2015-01-06 | 12.617 | 12.410 | 12.954 | 12.142 | 3346346.0 | 000002 | -0.016406 | 0.066875 | 12.885 | -0.475 | -3.686457 |
2015-01-07 | 12.324 | 12.298 | 12.531 | 12.099 | 2642051.0 | 000002 | -0.002110 | 0.035705 | 12.410 | -0.112 | -0.902498 |
2015-01-08 | 12.375 | 11.745 | 12.419 | 11.632 | 2639394.0 | 000002 | -0.050909 | 0.067658 | 12.298 | -0.553 | -4.496666 |
2015-01-09 | 11.701 | 11.624 | 12.289 | 11.485 | 3294584.0 | 000002 | -0.006581 | 0.070004 | 11.745 | -0.121 | -1.030226 |
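The p_change column is the day-over-day percentage change of the close, which pandas can also compute directly with pct_change(). The sketch below uses the five closing prices from the table above to check that the manual shift(1) formula and pct_change() agree.

```python
import numpy as np
import pandas as pd

# Closing prices copied from the first five rows above
close = pd.Series([12.885, 12.410, 12.298, 11.745, 11.624])

# Manual formula used in the text, in percent
manual = (close - close.shift(1)) / close.shift(1) * 100

# Built-in equivalent
builtin = close.pct_change() * 100
print(builtin)
```

The first value is NaN in both cases, because the first row has no previous close.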
3. Moving average (MA) indicator
The following code computes the 5-day and 10-day moving averages of the closing price:
df['MA5'] = df['close'].rolling(5).mean()
df['MA10'] = df['close'].rolling(10).mean()
df.head(15)  # head(15) shows the first 15 rows; at least 10 rows are needed before MA10 has a value
open | close | high | low | volume | code | close-open | high-low | pre_close | price_change | p_change | MA5 | MA10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||
2015-01-05 | 12.436 | 12.885 | 13.214 | 12.289 | 6560835.0 | 000002 | 0.036105 | 0.075271 | NaN | NaN | NaN | NaN | NaN |
2015-01-06 | 12.617 | 12.410 | 12.954 | 12.142 | 3346346.0 | 000002 | -0.016406 | 0.066875 | 12.885 | -0.475 | -3.686457 | NaN | NaN |
2015-01-07 | 12.324 | 12.298 | 12.531 | 12.099 | 2642051.0 | 000002 | -0.002110 | 0.035705 | 12.410 | -0.112 | -0.902498 | NaN | NaN |
2015-01-08 | 12.375 | 11.745 | 12.419 | 11.632 | 2639394.0 | 000002 | -0.050909 | 0.067658 | 12.298 | -0.553 | -4.496666 | NaN | NaN |
2015-01-09 | 11.701 | 11.624 | 12.289 | 11.485 | 3294584.0 | 000002 | -0.006581 | 0.070004 | 11.745 | -0.121 | -1.030226 | 12.1924 | NaN |
2015-01-12 | 11.511 | 11.338 | 11.511 | 11.019 | 2436341.0 | 000002 | -0.015029 | 0.044650 | 11.624 | -0.286 | -2.460427 | 11.8830 | NaN |
2015-01-13 | 11.278 | 11.295 | 11.563 | 11.209 | 1664610.0 | 000002 | 0.001507 | 0.031582 | 11.338 | -0.043 | -0.379256 | 11.6600 | NaN |
2015-01-14 | 11.295 | 11.321 | 11.494 | 11.122 | 1646818.0 | 000002 | 0.002302 | 0.033447 | 11.295 | 0.026 | 0.230190 | 11.4646 | NaN |
2015-01-15 | 11.347 | 11.900 | 11.952 | 11.235 | 2429686.0 | 000002 | 0.048735 | 0.063818 | 11.321 | 0.579 | 5.114389 | 11.4956 | NaN |
2015-01-16 | 11.900 | 11.684 | 11.900 | 11.572 | 2129475.0 | 000002 | -0.018151 | 0.028344 | 11.900 | -0.216 | -1.815126 | 11.5076 | 11.8500 |
2015-01-19 | 10.803 | 10.517 | 11.148 | 10.517 | 3603625.0 | 000002 | -0.026474 | 0.059998 | 11.684 | -1.167 | -9.988018 | 11.3434 | 11.6132 |
2015-01-20 | 10.543 | 10.673 | 10.889 | 10.422 | 2914688.0 | 000002 | 0.012330 | 0.044809 | 10.517 | 0.156 | 1.483313 | 11.2190 | 11.4395 |
2015-01-21 | 10.656 | 11.278 | 11.407 | 10.457 | 3555294.0 | 000002 | 0.058371 | 0.090848 | 10.673 | 0.605 | 5.668509 | 11.2104 | 11.3375 |
2015-01-22 | 11.252 | 11.736 | 11.796 | 11.166 | 3224727.0 | 000002 | 0.043015 | 0.056421 | 11.278 | 0.458 | 4.061004 | 11.1776 | 11.3366 |
2015-01-23 | 11.727 | 12.030 | 12.177 | 11.494 | 3310408.0 | 000002 | 0.025838 | 0.059422 | 11.736 | 0.294 | 2.505112 | 11.2468 | 11.3772 |
# Drop missing values
df.dropna(inplace=True)  # drop rows containing NaN; equivalently df = df.dropna()
df.head()
open | close | high | low | volume | code | close-open | high-low | pre_close | price_change | p_change | MA5 | MA10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||
2015-01-16 | 11.900 | 11.684 | 11.900 | 11.572 | 2129475.0 | 000002 | -0.018151 | 0.028344 | 11.900 | -0.216 | -1.815126 | 11.5076 | 11.8500 |
2015-01-19 | 10.803 | 10.517 | 11.148 | 10.517 | 3603625.0 | 000002 | -0.026474 | 0.059998 | 11.684 | -1.167 | -9.988018 | 11.3434 | 11.6132 |
2015-01-20 | 10.543 | 10.673 | 10.889 | 10.422 | 2914688.0 | 000002 | 0.012330 | 0.044809 | 10.517 | 0.156 | 1.483313 | 11.2190 | 11.4395 |
2015-01-21 | 10.656 | 11.278 | 11.407 | 10.457 | 3555294.0 | 000002 | 0.058371 | 0.090848 | 10.673 | 0.605 | 5.668509 | 11.2104 | 11.3375 |
2015-01-22 | 11.252 | 11.736 | 11.796 | 11.166 | 3224727.0 | 000002 | 0.043015 | 0.056421 | 11.278 | 0.458 | 4.061004 | 11.1776 | 11.3366 |
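To see why the first rows of MA5 are NaN, the following sketch recomputes the 5-day moving average on the first six closing prices from the tables above and compares it with a hand calculation; the variable names are illustrative.

```python
import pandas as pd

# First six closing prices from the output above
close = pd.Series([12.885, 12.410, 12.298, 11.745, 11.624, 11.338])

# rolling(5) needs five observations, so the first four results are NaN
ma5 = close.rolling(5).mean()

# Hand calculation for the first complete window
manual_first = close.iloc[:5].sum() / 5
print(ma5)
print(manual_first)
```

The fifth and sixth values reproduce the 12.1924 and 11.8830 shown in the MA5 column.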
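4. Installing TA-Lib, the library for generating derived stock variables
The derived indicators discussed below are all generated with the TA-Lib library, so we first explain how to install it.
Taking Windows as an example: on a 64-bit Windows system, running pip install talib directly raises an error, because the TA-Lib package in the Python pip index is 32-bit and cannot be installed on a 64-bit platform.
The correct approach is to download the 64-bit package and install it locally. The recommended download source is the collection of Python extension packages hosted at the University of California: https://www.lfd.uci.edu/~gohlke/pythonlibs/
After opening the page, press Ctrl + F and search for "ta_lib".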
Choose the matching file, TA_Lib-0.4.17-cp37-cp37m-win_amd64.whl (the 37 after cp stands for Python 3.7), and download it to a folder of your choice. Readers should pick the file that matches their own Python version.
To check your Python version, press Win + R to open the Run box, type cmd, then type python in the window that opens and press Enter; the version information is displayed.
Once the download is complete, open the folder where you saved the file (the author, for example, saved it in "E:\机器学习与大数据分析\随机森林"), type cmd in the folder's address bar and press Enter.
In the window that opens, type the following and press Enter to install.
pip install TA_Lib-0.4.17-cp37-cp37m-win_amd64.whl
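The version tag and bitness needed to pick the right .whl file can also be read from Python itself. This is a small convenience sketch using only the standard library; py_tag and bits are illustrative names.

```python
import struct
import sys

# Build the "cpXY" tag used in wheel file names, e.g. cp37 for Python 3.7
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# Pointer size in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one
bits = struct.calcsize("P") * 8

print(py_tag, f"{bits}-bit")
```

A 64-bit interpreter corresponds to the win_amd64 suffix in the file name above.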
5. Generating the relative strength index (RSI) with TA-Lib
import talib
df['RSI'] = talib.RSI(df['close'], timeperiod=12)
6. Generating the momentum indicator (MOM) with TA-Lib
df['MOM'] = talib.MOM(df['close'], timeperiod=5)
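Under the usual definition, momentum over n days is simply today's close minus the close n days earlier. The pandas sketch below illustrates this with the last six closing prices of the 000002 data used in this section (the first value, 30.38, is the close of the trading day before 2019-12-25); it is an illustration of the definition, not a replacement for talib.MOM.

```python
import pandas as pd

# Last six closing prices of the data set, oldest first
close = pd.Series([30.38, 30.29, 31.12, 31.00, 31.57, 32.18])

# 5-day momentum: close minus the close five rows earlier
mom5 = close - close.shift(5)

# diff(5) is a shorthand for the same calculation
mom5_alt = close.diff(5)
print(mom5)
```

The last value, 1.80, matches the MOM shown for 2019-12-31 in the output later in this section.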
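7. Generating exponential moving averages (EMA) with TA-Lib
df['EMA12'] = talib.EMA(df['close'], timeperiod=12)  # 12-day exponential moving average
df['EMA26'] = talib.EMA(df['close'], timeperiod=26)  # 26-day exponential moving average
8. Generating the moving average convergence divergence (MACD) values with TA-Lib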
df['MACD'], df['MACDsignal'], df['MACDhist'] = talib.MACD(df['close'], fastperiod=12, slowperiod=26, signalperiod=9)
df.dropna(inplace=True)  # drop rows containing NaN
df.tail()  # the counterpart of head(): tail() shows the last five rows
open | close | high | low | volume | code | close-open | high-low | pre_close | price_change | p_change | MA5 | MA10 | RSI | MOM | EMA12 | EMA26 | MACD | MACDsignal | MACDhist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||
2019-12-25 | 30.40 | 30.29 | 30.63 | 30.18 | 685037.0 | 000002 | -0.003618 | 0.014911 | 30.38 | -0.09 | -0.296248 | 30.878 | 30.075 | 63.075563 | -0.02 | 29.908556 | 28.973211 | 0.935345 | 0.772958 | 0.162387 |
2019-12-26 | 30.50 | 31.12 | 31.30 | 30.50 | 888790.0 | 000002 | 0.020328 | 0.026230 | 30.29 | 0.83 | 2.740178 | 30.896 | 30.387 | 68.890164 | 0.09 | 30.094932 | 29.132233 | 0.962699 | 0.810906 | 0.151793 |
2019-12-27 | 31.23 | 31.00 | 31.32 | 30.81 | 703096.0 | 000002 | -0.007365 | 0.016553 | 31.12 | -0.12 | -0.385604 | 30.760 | 30.672 | 67.220611 | -0.68 | 30.234173 | 29.270586 | 0.963587 | 0.841442 | 0.122145 |
2019-12-30 | 31.35 | 31.57 | 31.79 | 31.02 | 915751.0 | 000002 | 0.007018 | 0.024823 | 31.00 | 0.57 | 1.838710 | 30.872 | 30.884 | 70.877814 | 0.56 | 30.439685 | 29.440913 | 0.998772 | 0.872908 | 0.125864 |
2019-12-31 | 31.35 | 32.18 | 32.45 | 31.32 | 663497.0 | 000002 | 0.026475 | 0.036079 | 31.57 | 0.61 | 1.932214 | 31.232 | 31.057 | 74.233951 | 1.80 | 30.707426 | 29.643808 | 1.063618 | 0.911050 | 0.152567 |
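The EMA and MACD columns are related in a simple way: MACD is the fast EMA minus the slow EMA, MACDsignal is an EMA of MACD, and MACDhist is MACD minus MACDsignal. The sketch below first shows a plain EMA recursion with smoothing factor 2/(period+1), seeded with a simple average of the first period values (a common convention, assumed here; TA-Lib's internal handling may differ in detail), and then checks the two subtraction relations on the 2019-12-31 row above. The function name ema_sma_seeded is hypothetical.

```python
import numpy as np

def ema_sma_seeded(values, period):
    """EMA whose first value is the simple mean of the first `period` points."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    if len(values) < period:
        return out
    alpha = 2.0 / (period + 1)              # smoothing factor
    out[period - 1] = values[:period].mean()
    for i in range(period, len(values)):
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out

# Tiny example: with period=3 the seed is mean(1, 2, 3) = 2
demo = ema_sma_seeded([1, 2, 3, 4, 5], period=3)
print(demo)

# Values copied from the 2019-12-31 row of the table above
ema12, ema26 = 30.707426, 29.643808
macd, signal, hist = 1.063618, 0.911050, 0.152567

print(round(ema12 - ema26, 6))   # compare with the MACD column
print(round(macd - signal, 6))   # compare with the MACDhist column
```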
Supplementary content: some checks of the TA-Lib library
Checking the RSI indicator
import pandas as pd
import talib
data = pd.DataFrame()
data['close'] = [10, 12, 11, 13, 12, 14, 13]
data['RSI'] = talib.RSI(data['close'], timeperiod=6)
data
close | RSI | |
---|---|---|
0 | 10 | NaN |
1 | 12 | NaN |
2 | 11 | NaN |
3 | 13 | NaN |
4 | 12 | NaN |
5 | 14 | NaN |
6 | 13 | 66.666667 |
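The value 66.666667 can be reproduced by hand. With a 6-day period, the first RSI value uses the average gain and average loss over the first six price changes: RSI = 100 - 100 / (1 + average gain / average loss). The following sketch carries out that calculation with NumPy only; it covers just this first value and does not implement the smoothing used for later values.

```python
import numpy as np

close = np.array([10, 12, 11, 13, 12, 14, 13], dtype=float)
period = 6

# Day-over-day changes: +2, -1, +2, -1, +2, -1
diff = np.diff(close)
gains = np.where(diff > 0, diff, 0.0)
losses = np.where(diff < 0, -diff, 0.0)

# Average gain and loss over the first `period` changes
avg_gain = gains[:period].mean()    # (2 + 2 + 2) / 6 = 1.0
avg_loss = losses[:period].mean()   # (1 + 1 + 1) / 6 = 0.5

rs = avg_gain / avg_loss
rsi = 100 - 100 / (1 + rs)
print(rsi)
```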
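9.3 Quantitative Finance - Building a Stock Price Movement Prediction Model
9.3.1 Building the multi-factor model
1. Importing the libraries needed later
import tushare as ts  # basic stock data
import numpy as np  # scientific computing
import pandas as pd  # data analysis
import talib  # derived stock variables
import matplotlib.pyplot as plt  # plotting
from sklearn.ensemble import RandomForestClassifier  # random forest classification model
from sklearn.metrics import accuracy_score  # accuracy scoring function
import warnings
warnings.filterwarnings("ignore")  # suppress warnings; warnings are not errors and do not affect execution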
2. Stock data processing and derived variable generation
Here we gather the code from Section 9.2 for the basic stock data and the derived variables, to make building the prediction model easier:
# 1. Fetch the basic stock data
df = ts.get_k_data('000002',start='2015-01-01',end='2019-12-31')
df = df.set_index('date')  # use the date as the index
# 2. Construct simple derived variables
df['close-open'] = (df['close'] - df['open'])/df['open']
df['high-low'] = (df['high'] - df['low'])/df['low']
df['pre_close'] = df['close'].shift(1)  # shift the whole column down one row to get the previous day's close
df['price_change'] = df['close']-df['pre_close']
df['p_change'] = (df['close']-df['pre_close'])/df['pre_close']*100
# 3. Construct the moving average data
df['MA5'] = df['close'].rolling(5).mean()
df['MA10'] = df['close'].rolling(10).mean()
df.dropna(inplace=True)  # drop rows containing NaN
# 4. Construct derived variables with the TA-Lib library
df['RSI'] = talib.RSI(df['close'], timeperiod=12)  # relative strength index
df['MOM'] = talib.MOM(df['close'], timeperiod=5)  # momentum indicator
df['EMA12'] = talib.EMA(df['close'], timeperiod=12)  # 12-day exponential moving average
df['EMA26'] = talib.EMA(df['close'], timeperiod=26)  # 26-day exponential moving average
df['MACD'], df['MACDsignal'], df['MACDhist'] = talib.MACD(df['close'], fastperiod=12, slowperiod=26, signalperiod=9)  # MACD values
df.dropna(inplace=True)  # drop rows containing NaN
This interface will soon stop being updated; please switch to the Pro interface as soon as possible: https://tushare.pro/document/2
# Look at the last five rows of df at this point
df.tail()
open | close | high | low | volume | code | close-open | high-low | pre_close | price_change | p_change | MA5 | MA10 | RSI | MOM | EMA12 | EMA26 | MACD | MACDsignal | MACDhist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||
2019-12-25 | 27.165 | 27.055 | 27.395 | 26.945 | 685037.0 | 000002 | -0.004049 | 0.016701 | 27.145 | -0.09 | -0.331553 | 27.643 | 26.840 | 63.081344 | -0.02 | 26.673555 | 25.737103 | 0.936452 | 0.774585 | 0.161867 |
2019-12-26 | 27.265 | 27.885 | 28.065 | 27.265 | 888790.0 | 000002 | 0.022740 | 0.029342 | 27.055 | 0.83 | 3.067825 | 27.661 | 27.152 | 68.895291 | 0.09 | 26.859932 | 25.896207 | 0.963725 | 0.812413 | 0.151311 |
2019-12-27 | 27.995 | 27.765 | 28.085 | 27.575 | 703096.0 | 000002 | -0.008216 | 0.018495 | 27.885 | -0.12 | -0.430339 | 27.525 | 27.437 | 67.225542 | -0.68 | 26.999173 | 26.034636 | 0.964537 | 0.842838 | 0.121699 |
2019-12-30 | 28.115 | 28.335 | 28.555 | 27.785 | 915751.0 | 000002 | 0.007825 | 0.027713 | 27.765 | 0.57 | 2.052944 | 27.637 | 27.649 | 70.882335 | 0.56 | 27.204685 | 26.205033 | 0.999651 | 0.874201 | 0.125451 |
2019-12-31 | 28.115 | 28.945 | 29.215 | 28.085 | 663497.0 | 000002 | 0.029522 | 0.040235 | 28.335 | 0.61 | 2.152815 | 27.997 | 27.822 | 74.238064 | 1.80 | 27.472426 | 26.407994 | 1.064432 | 0.912247 | 0.152185 |
3. Extracting the feature variables and the target variable
X = df[['close', 'volume', 'close-open', 'MA5', 'MA10', 'high-low', 'RSI', 'MOM', 'EMA12', 'MACD', 'MACDsignal', 'MACDhist']]
y = np.where(df['price_change'].shift(-1)> 0, 1, -1)
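The most important point first: today's stock information is used to predict whether the price rises or falls on the next day, so y must be the next day's price change.
The where() function in the NumPy library is used as follows:
np.where(condition, value if the condition holds, value if it does not)
Here df['price_change'].shift(-1) uses the shift() function to move the price_change column up by one row, so that each row holds the next day's price change.
The condition is therefore whether the next day's price change is greater than 0: if the price rises on the next day, y is set to 1; otherwise y is set to -1. This next-day rise or fall is what we predict from the current day's basic data and derived variables.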
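The labelling step can be tried on a toy column to see exactly what shift(-1) and np.where() produce. The numbers below are made up for illustration and are not stock data.

```python
import numpy as np
import pandas as pd

# Made-up daily price changes
demo = pd.DataFrame({"price_change": [0.5, -0.2, 0.3, -0.1]})

# Move the column up one row: each row now holds the NEXT day's change
next_change = demo["price_change"].shift(-1)

# 1 if the next day's change is positive, otherwise -1
y_demo = np.where(next_change > 0, 1, -1)

print(next_change.tolist())   # the last entry is NaN: there is no next day
print(y_demo)
```

Note that the last row has no next day, so its shifted value is NaN; because NaN > 0 evaluates to False, that row is labelled -1 even though the true outcome is unknown. With a long history this affects a single sample, but it is worth being aware of.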
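4. Splitting the data into training and test sets
Next we split the original data set. Note that the split must follow the time order rather than use the train_test_split() function as before. Stock prices evolve over time, and a random split would destroy that ordering: we use each day's data to predict the following day's movement, not an arbitrary day's data.
We therefore use the first 90% of the data as the training set and the last 10% as the test set:
X_length = X.shape[0]  # shape holds the number of rows and columns of X; shape[0] is the number of rows
split = int(X_length * 0.9)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
5. Building the model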
model = RandomForestClassifier(max_depth=3, n_estimators=10, min_samples_leaf=10, random_state=1)
model.fit(X_train, y_train)
RandomForestClassifier(max_depth=3, min_samples_leaf=10, n_estimators=10, random_state=1)
9.3.2 Using and evaluating the model
1. Predicting the next day's movement
y_pred = model.predict(X_test)
print(y_pred)
[-1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1
1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 -1 -1 -1 1 1 1 1 1 1 1 1 1 1 1 1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1]
a = pd.DataFrame()  # create an empty DataFrame
a['predicted'] = list(y_pred)
a['actual'] = list(y_test)
a.head()
predicted | actual | |
---|---|---|
0 | -1 | -1 |
1 | 1 | -1 |
2 | -1 | -1 |
3 | 1 | -1 |
4 | 1 | 1 |
# Inspect the predicted probabilities
y_pred_proba = model.predict_proba(X_test)
y_pred_proba[0:5]
array([[0.53462409, 0.46537591],
[0.49852513, 0.50147487],
[0.53687766, 0.46312234],
[0.49733765, 0.50266235],
[0.49733765, 0.50266235]])
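The two columns of predict_proba() follow the order of the model's classes_ attribute, so with labels -1 and 1 the first column is the probability of a fall and the second the probability of a rise. The sketch below illustrates this on randomly generated data (not stock data; X_demo, y_demo and clf are illustrative names) and checks that predict() returns the class with the larger probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

clf = RandomForestClassifier(max_depth=3, n_estimators=10,
                             min_samples_leaf=10, random_state=1)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)   # column order of predict_proba
print(proba)

# Pick the class with the highest probability in each row
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(labels_from_proba, clf.predict(X_demo[:5]))
```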
2. Evaluating model accuracy
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print(score)
0.5428571428571428
# The model's built-in score() function can also be used for scoring:
model.score(X_test, y_test)
0.5428571428571428
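An accuracy figure is easier to interpret next to a confusion matrix and a naive baseline that always predicts the more frequent class. The sketch below uses small made-up label arrays purely for illustration; they are not the model's actual results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (-1 = fall, 1 = rise)
y_true = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1, 1])
y_hat = np.array([-1, 1, -1, 1, 1, 1, 1, 1, -1, 1])

# Rows are true classes, columns are predicted classes, in the order [-1, 1]
cm = confusion_matrix(y_true, y_hat, labels=[-1, 1])
acc = accuracy_score(y_true, y_hat)

# Accuracy of always predicting the majority class
baseline = max(np.mean(y_true == 1), np.mean(y_true == -1))

print(cm)
print(acc, baseline)
```

If the model's accuracy is not clearly above the baseline, the model has learned little beyond the class frequencies.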
3. Analysing feature importance
model.feature_importances_
array([0.15132672, 0.09957677, 0.05021545, 0.06514831, 0.079073 ,
0.11447561, 0.04576496, 0.17559964, 0.04713332, 0.07061667,
0.08866083, 0.01240873])
# The following code presents the features and their importances more clearly:
features = X.columns
importances = model.feature_importances_
a = pd.DataFrame()
a['feature'] = features
a['importance'] = importances
a = a.sort_values('importance', ascending=False)
a
feature | importance | |
---|---|---|
7 | MOM | 0.175600 |
0 | close | 0.151327 |
5 | high-low | 0.114476 |
1 | volume | 0.099577 |
10 | MACDsignal | 0.088661 |
4 | MA10 | 0.079073 |
9 | MACD | 0.070617 |
3 | MA5 | 0.065148 |
2 | close-open | 0.050215 |
8 | EMA12 | 0.047133 |
6 | RSI | 0.045765 |
11 | MACDhist | 0.012409 |
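Feature importances from a random forest are normalised, so they sum to 1 and can be read as shares. As a quick sketch, the snippet below takes the importance values printed above, stores them in a pandas Series (an alternative to the two-column DataFrame, chosen here for brevity) and shows the cumulative share of the leading features.

```python
import pandas as pd

features = ['close', 'volume', 'close-open', 'MA5', 'MA10', 'high-low',
            'RSI', 'MOM', 'EMA12', 'MACD', 'MACDsignal', 'MACDhist']
# Values copied from the feature_importances_ output above
importances = [0.15132672, 0.09957677, 0.05021545, 0.06514831, 0.079073,
               0.11447561, 0.04576496, 0.17559964, 0.04713332, 0.07061667,
               0.08866083, 0.01240873]

imp = pd.Series(importances, index=features).sort_values(ascending=False)

print(round(imp.sum(), 6))      # the importances sum to 1
print(imp.cumsum().head(5))     # cumulative share of the top five features
```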
9.3.3 Parameter tuning
from sklearn.model_selection import GridSearchCV  # grid search for suitable hyperparameters
# Candidate values for each hyperparameter of the classifier
parameters = {'n_estimators': [5, 10, 20], 'max_depth': [2, 3, 4, 5], 'min_samples_leaf': [5, 10, 20, 30]}
new_model = RandomForestClassifier(random_state=1)  # build the classifier
# cv=6 means 6-fold cross-validation; scoring='accuracy' ranks candidates by accuracy (scoring='roc_auc' would rank them by the AUC of the ROC curve instead)
grid_search = GridSearchCV(new_model, parameters, cv=6, scoring='accuracy')
grid_search.fit(X_train, y_train)  # fit on the training data
grid_search.best_params_  # show the best parameter values
{'max_depth': 2, 'min_samples_leaf': 20, 'n_estimators': 10}
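Because the data is a time series, one possible refinement (not used in the original code) is to make the cross-validation folds respect time order as well, so that each validation fold lies after its training fold. scikit-learn provides TimeSeriesSplit for this, and it can be passed to GridSearchCV through the cv parameter. The sketch below runs on randomly generated data with a reduced parameter grid to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the time-ordered training data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = np.where(X_demo[:, 0] + rng.normal(size=300) > 0, 1, -1)

# Each split trains on an initial segment and validates on the segment after it
tscv = TimeSeriesSplit(n_splits=5)

params = {'n_estimators': [5, 10], 'max_depth': [2, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=1), params,
                      cv=tscv, scoring='accuracy')
search.fit(X_demo, y_demo)
print(search.best_params_)
```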
9.3.4 Plotting the backtested return curve
X_test['prediction'] = model.predict(X_test)
X_test['p_change'] = (X_test['close'] - X_test['close'].shift(1)) / X_test['close'].shift(1)
X_test['origin'] = (X_test['p_change'] + 1).cumprod()
X_test['strategy'] = (X_test['prediction'].shift(1) * X_test['p_change'] + 1).cumprod()
X_test[['strategy', 'origin']].tail()
strategy | origin | |
---|---|---|
date | ||
2019-12-25 | 1.248484 | 1.059319 |
2019-12-26 | 1.210183 | 1.091817 |
2019-12-27 | 1.215391 | 1.087118 |
2019-12-30 | 1.190439 | 1.109436 |
2019-12-31 | 1.164811 | 1.133320 |
# The following code plots the returns after dropping NaN values, with the x-axis tick labels tilted automatically:
X_test[['strategy', 'origin']].dropna().plot()
plt.gcf().autofmt_xdate()
plt.show()
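The role of shift(1) in the strategy column can be checked on a toy example. A prediction made on day t concerns day t+1, so it has to be multiplied by the following day's return; a prediction of -1 flips the sign of that return. The prices and predictions below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up closing prices and next-day predictions (1 = rise, -1 = fall)
bt = pd.DataFrame({"close": [10.0, 11.0, 10.0, 12.0],
                   "prediction": [1, -1, 1, 1]})

# Daily return as a fraction
bt["p_change"] = bt["close"].pct_change()

# Buy-and-hold cumulative return
bt["origin"] = (bt["p_change"] + 1).cumprod()

# Strategy: yesterday's prediction decides the sign of today's return
bt["strategy"] = (bt["prediction"].shift(1) * bt["p_change"] + 1).cumprod()
print(bt)
```

On day 3 the price falls from 11 to 10, but the previous day's prediction was -1, so the strategy gains on that day while buy-and-hold loses. Note that this simple backtest ignores transaction costs and treats a -1 prediction as a short position.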