This is the 117th original article on "personal growth and financial freedom, how the world works, and AI quantitative investing".
Beijing's new COVID cases exceeded 4,000 yesterday, with 800+ found outside quarantine. The three-day expectation now looks too optimistic, and no one knows how things will develop. It is just like the short-term trend of the capital market: no one can predict it. But looking three years ahead, I believe everything that seems momentous now will no longer be a problem.
Today we continue optimizing our AI quantitative platform.
01 Dataloader with cache
The dataloader performs feature engineering and automatic labeling. With many factors the computation is heavy, and recomputing everything on every startup hurts efficiency. Since HDF5 can store dataframes, we can persist the computed factors and labeling results and load them directly from the cache next time.
The dataloader reads each symbol's time-series data from a database, csv file, or HDF5 store and loads it into memory as a pandas dataframe. The expression manager then computes the features and saves them into the corresponding dataframe.
```python
# encoding:utf8
import pandas as pd
from loguru import logger

from engine.datafeed.expr.expr_mgr import ExprMgr
from engine.datafeed.datafeed_hdf5 import Hdf5DataFeed
from engine.config import DATA_DIR_HDF5_CACHE


class Dataloader:
    def __init__(self, symbols, names, fields, load_from_cache=False):
        self.expr = ExprMgr()
        self.feed = Hdf5DataFeed()
        self.symbols = symbols
        self.names = names
        self.fields = fields

        with pd.HDFStore(DATA_DIR_HDF5_CACHE.resolve()) as store:
            key = 'features'
            # note: keys returned by the store are prefixed with '/'
            if load_from_cache and '/' + key in store.keys():
                logger.info('Load from cache...')
                self.data = store[key]
            else:
                self.data = self.load_one_df()
                store[key] = self.data

    def load_one_df(self):
        dfs = self.load_dfs()
        all = pd.concat(dfs)
        all.sort_index(ascending=True, inplace=True)
        all.dropna(inplace=True)
        self.data = all
        return all

    def load_dfs(self):
        dfs = []
        for code in self.symbols:
            # add feature columns directly in memory for easy reuse
            df = self.feed.get_df(code)
            for name, field in zip(self.names, self.fields):
                exp = self.expr.get_expression(field)
                # an expression may return multiple series
                se = exp.load(code)
                if type(se) is pd.Series:
                    df[name] = se
                if type(se) is tuple:
                    for i in range(len(se)):
                        df[name + '_' + se[i].name] = se[i]
            df['code'] = code
            dfs.append(df)
        return dfs
```
dataloader accepts 4 parameters: symbols, names, fields and load_from_cache.
symbols: List of securities to be loaded.
names: Feature names.
fields: List of factor expressions.
load_from_cache: Whether to load from the cache.
load_dfs iterates over the symbols. After each symbol's raw dataframe is read into memory, it computes the factor values for the feature columns defined by names and fields and stores them in the dataframe.
load_one_df merges the dataframes returned by load_dfs into one dataframe, returns it, and saves it in the cache for later use.
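As a minimal sketch of the merge step, with two hypothetical per-symbol frames standing in for what load_dfs returns:

```python
import pandas as pd

# two hypothetical per-symbol frames, indexed by date,
# standing in for the output of load_dfs()
idx = pd.to_datetime(['2022-01-03', '2022-01-04', '2022-01-05'])
df_a = pd.DataFrame({'close': [10.0, 10.2, None], 'code': 'A'}, index=idx)
df_b = pd.DataFrame({'close': [20.0, 19.8, 20.1], 'code': 'B'}, index=idx)

# same steps as load_one_df: concat, sort by the date index, drop NaN rows
all_df = pd.concat([df_a, df_b])
all_df.sort_index(ascending=True, inplace=True)
all_df.dropna(inplace=True)

print(len(all_df))  # 5 rows: the NaN close for 'A' is dropped
```

The resulting long-format frame keeps all symbols in one table, distinguished by the code column, which is convenient for cross-sectional factor analysis later.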
02 Upgrading random forest to boosting (GBDT)
```python
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, HistGradientBoostingRegressor

e.add_model(SklearnModel(AdaBoostRegressor()),
            split_date='2020-01-01',
            feature_names=feature_names)
```
The major GBDT algorithms expose the same sklearn estimator interface, so swapping one model for another is a one-line change.
03 Keras deep learning framework
Deep learning is one of the most cutting-edge artificial intelligence technologies, but compared with traditional machine learning frameworks such as sklearn, its learning threshold is much higher. The user must design the network structure, set the learning rate, choose the optimization objective, and so on. It is flexible and powerful, but the learning curve is steep.
The two most popular deep learning frameworks are Facebook's PyTorch and Google's TensorFlow. Comparing the two directly, PyTorch has a much gentler learning curve than TensorFlow, but it still requires beginners to know matrix operations, calculus, and so on.
The emergence of Keras greatly lowered the threshold for using TensorFlow.
Keras is a high-level neural network API written in Python that can run with TensorFlow, CNTK, or Theano as a backend. Keras has been developed with a focus on enabling rapid experimentation. Being able to convert your ideas into experimental results with minimal cost is the key to doing good research.
Official usage scenarios:
· Allows for easy and fast prototyping (due to user-friendliness, high modularity, scalability).
· Supports both convolutional and recurrent neural networks, as well as combinations of the two.
· Runs seamlessly on both CPU and GPU.
The biggest advantage of Keras is simple and fast prototyping, which matters most for beginners. Our goal is quantitative investing: to apply deep learning to quantification, not to study the details of deep learning itself. Meeting our needs at the lowest cost is therefore the key, and the same principle will guide our later choice of a deep reinforcement learning framework.
Keras built-in dataset:
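Keras ships with several toy datasets; the MNIST handwritten digits, for example, are one load_data call away (downloaded and cached locally on first use):

```python
from tensorflow import keras

# MNIST: 60,000 training and 10,000 test images,
# each a 28x28 uint8 array, with integer labels 0-9
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```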
Data conversion: convert N 28*28 data into N*784 two-dimensional data:
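The reshape itself is plain numpy; a sketch with random data standing in for the images:

```python
import numpy as np

# fake batch of 100 "images", same shape as a slice of MNIST
x = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)

# flatten each 28x28 image into a 784-vector and scale to [0, 1],
# the usual preprocessing for a dense network
x_flat = x.reshape(len(x), 28 * 28).astype('float32') / 255.0

print(x_flat.shape)  # (100, 784)
```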
The label is converted to the format of one hot:
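Keras provides keras.utils.to_categorical for this; the same conversion written in plain numpy shows what one-hot means:

```python
import numpy as np

labels = np.array([3, 0, 9, 1])   # integer class labels
num_classes = 10

# one-hot: row i has a 1 in column labels[i], zeros elsewhere
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot.shape)  # (4, 10)
```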
The actual modeling code is short:
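The original listing is not reproduced here; below is a minimal sketch of a dense classifier of the kind described, taking the 784-dimensional flattened images and emitting 10 class probabilities (the hidden-layer size is an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

# a small fully connected classifier: 784 inputs -> 10 softmax outputs
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',   # labels are one-hot
              metrics=['accuracy'])

# training would then be a single call, e.g.:
# model.fit(x_train, y_train_onehot, epochs=10, batch_size=128)
model.summary()
```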
After 10 epochs of training, the accuracy reaches about 92%.
Recent articles:
ETF rotation + RSRS timing, plus Kalman filter: annualized 48.41%, Sharpe ratio 1.89