Dataloader refactoring and getting started with Keras

This is the 117th original article, focusing on "personal growth and financial freedom, the logic of how the world works, and AI quantitative investment".

Beijing's epidemic count exceeded 4,000 yesterday, with 800+ cases found in the community. The three-day expectation now looks too optimistic, and no one knows how things will develop, just like the short-term trend of the capital market, which nobody can predict. But looking three years ahead, I believe everything that seems like a big deal now will no longer be a problem.

Continue to optimize our AI quantification platform.

01 Adding a cache to the dataloader

The dataloader handles feature engineering and automatic labeling of the data. With many factors the computation is heavy, and recomputing everything on every startup hurts efficiency. Since hdf5 can store dataframes, we can persist the computed factors and labeling results there and load them straight from the cache on the next run.

The dataloader reads data from a database, a csv file, or hdf5 storage, usually one time series per symbol, and loads it into memory as a pandas dataframe. The expression manager then computes the features and writes them into the corresponding dataframe.

# encoding:utf8
import pandas as pd
from loguru import logger

from engine.datafeed.expr.expr_mgr import ExprMgr
from engine.datafeed.datafeed_hdf5 import Hdf5DataFeed
from engine.config import DATA_DIR_HDF5_CACHE


class Dataloader:
    def __init__(self, symbols, names, fields, load_from_cache=False):
        self.expr = ExprMgr()
        self.feed = Hdf5DataFeed()
        self.symbols = symbols
        self.names = names
        self.fields = fields

        with pd.HDFStore(DATA_DIR_HDF5_CACHE.resolve()) as store:
            key = 'features'
            if load_from_cache and '/' + key in store.keys():  # note: keys in the store are prefixed with '/'

                logger.info('Load from cache...') 
                self.data = store[key] 
            else: 
                self.data = self.load_one_df() 
                store[key] = self.data 

    def load_one_df(self):
        dfs = self.load_dfs()
        all_df = pd.concat(dfs)
        all_df.sort_index(ascending=True, inplace=True)
        all_df.dropna(inplace=True)
        self.data = all_df
        return all_df

    def load_dfs(self): 
        dfs = [] 
        for code in self.symbols: 
            # Add fields directly in memory for easy reuse 
            df = self.feed.get_df(code) 
            for name, field in zip(self.names, self.fields):
                exp = self.expr.get_expression(field)
                # Multiple sequences may be returned here 
                se = exp.load(code) 
                if isinstance(se, pd.Series):
                    df[name] = se
                elif isinstance(se, tuple):
                    for i in range(len(se)):
                        df[name + '_' + se[i].name] = se[i]
            df['code'] = code 
            dfs.append(df) 

        return dfs

The dataloader accepts 4 parameters: symbols, names, fields, and load_from_cache.

symbols: the list of securities to load.

names: the list of feature names.

fields: the list of factor expressions.

load_from_cache: whether to load the features from the cache instead of recomputing them.

load_dfs iterates over the symbols. After each symbol's raw dataframe is read into memory, it computes the factor values for the feature columns defined by names and fields and writes them into the dataframe.

load_one_df merges the dataframes returned by load_dfs into a single dataframe and returns it; the constructor then stores this result in the cache for later runs.
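A minimal usage sketch of the class above; the symbols, expression strings, and feature names here are illustrative placeholders, not the platform's actual configuration:

# Hypothetical configuration, for illustration only.
names = ['roc_20', 'ma_20']
fields = ['Roc($close, 20)', 'Mean($close, 20)']

loader = Dataloader(
    symbols=['000300.SH', '399006.SZ'],  # hypothetical symbol list
    names=names,
    fields=fields,
    load_from_cache=True,  # second run reads the features straight from hdf5
)
df_all = loader.data  # one concatenated dataframe with a 'code' column per symbol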

02 Upgrading from random forest to boosting (GBDT)

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, HistGradientBoostingRegressor

e.add_model(SklearnModel(AdaBoostRegressor()), split_date='2020-01-01', feature_names=feature_names)

The major GBDT-style algorithms all expose the same sklearn estimator interface, so swapping models is a one-line change.
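A minimal sketch of that interchangeability, fitting each estimator on the same data (the data here is random, purely to demonstrate the shared fit/predict interface):

import numpy as np
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              HistGradientBoostingRegressor)

# Random features/labels, only to show the shared interface.
X = np.random.rand(500, 4)
y = np.random.rand(500)

for model in (RandomForestRegressor(), AdaBoostRegressor(),
              HistGradientBoostingRegressor()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))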

03 Keras deep learning framework

Deep learning is one of the most cutting-edge artificial intelligence technologies, but compared with traditional machine learning frameworks such as sklearn, its entry barrier is much higher. The user has to design the network structure, set the learning rate, choose the optimization objective, and so on. The tools are flexible and powerful, but the learning curve is correspondingly steep.

The two most popular deep learning frameworks are Facebook's PyTorch and Google's TensorFlow. Comparing the two directly, PyTorch has a much gentler learning curve than TensorFlow, but it still requires beginners to know matrix operations, calculus, and other mathematics.

The emergence of Keras has greatly lowered the threshold for using TensorFlow.

Keras is a high-level neural network API written in Python that can run with TensorFlow, CNTK, or Theano as a backend. Keras has been developed with a focus on enabling rapid experimentation. Being able to convert your ideas into experimental results with minimal cost is the key to doing good research.

Official usage scenarios:

· Allows easy and fast prototyping (thanks to user-friendliness, high modularity, and extensibility).

· Supports both convolutional neural networks and recurrent neural networks, as well as combinations of the two.

· Runs seamlessly on both CPU and GPU.

The biggest advantage of Keras is simple and fast prototyping, which matters a great deal for beginners. Our goal is quantitative investing: we want to apply deep learning to quantification, not study the internals of deep learning itself, so meeting our needs at the lowest cost is the key. The later choice of a deep reinforcement learning framework will follow the same principle.

Loading a Keras built-in dataset:
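The original code is not reproduced here; given the 28*28 shape mentioned below, this is presumably the MNIST handwritten-digit dataset. A sketch of loading it:

from tensorflow import keras

# MNIST: 60,000 training and 10,000 test images of handwritten digits,
# each a 28x28 grayscale array, with integer labels 0-9.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)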

Data conversion: reshape the N 28*28 images into an N*784 two-dimensional array:
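A sketch of that reshape; scaling the pixels to [0, 1] is a common extra step that I have assumed here:

# Flatten each 28x28 image into a 784-dimensional row vector.
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
print(x_train.shape)  # (60000, 784)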

The labels are converted to one-hot format:
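A sketch using keras.utils.to_categorical:

# One-hot encode the integer labels, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0].
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)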

The actual modeling code is short:
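The article's code is not shown; below is a minimal single-layer softmax classifier, which is consistent with the ~92% figure that follows. The exact architecture and optimizer are my assumptions, not the author's settings:

# A minimal softmax classifier over the flattened 784-dim inputs (assumed architecture).
model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),  # one softmax layer over 10 classes
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # matches the one-hot labels
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)
model.evaluate(x_test, y_test)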

After 10 epochs, the accuracy reaches 92%.

Recent articles:

ETF rotation + RSRS timing, with a Kalman filter: annualized 48.41%, Sharpe ratio 1.89
