AI Quant and the Machine Learning Process: From Data to Model

This is the 99th original article, continuing the theme of "personal growth and financial freedom, the logic of how the world works, and AI quantitative investing".

With article 99, the first small goal of 100 original articles is almost complete, which feels good.

What happens if you keep at one thing for 1,000 days? Let's wait and see.

The articles of the past few days covered data (HDF5: a storage format for quant data, compatible with pandas DataFrames), feature engineering (factor feature engineering based on pandas and TA-Lib, with code), and single-factor evaluation ([Weekly research report] Alphalens for AI quantitative feature engineering: a general toolkit for analyzing alpha factors). Today I will talk about the model.

First, let's map financial modeling onto the machine learning workflow:

01 Data preparation, feature engineering and labeling

import numpy as np

def feature_engineer(df):
    features = []
    for p in [1, 5, 20, 60]:
        # p-day momentum: percentage change of the close over p periods
        features.append('mom_{}'.format(p))
        df['mom_{}'.format(p)] = df['close'].pct_change(p)

    # Label: sign of the next day's return (+1 up, -1 down, 0 flat)
    df['f_return_1'] = np.sign(df['close'].shift(-1) / df['close'] - 1)

    # Keep the symbol column and the label alongside the factors
    features.append('code')
    features.append('f_return_1')
    print(features)
    df = df[features]
    return df

import pandas as pd

symbols = ['SPX', '000300.SH']
dfs = []
with pd.HDFStore('data/index.h5') as store:
    for symbol in symbols:
        df = store[symbol]
        # Rebase each index so its close series starts at 1
        df['close'] = df['close'] / df['close'].iloc[0]
        df = feature_engineer(df)
        dfs.append(df)

# Stack both symbols into one DataFrame ('all' would shadow a builtin, so use all_df)
all_df = pd.concat(dfs)
# all_df.set_index([all_df.index, 'code'], inplace=True)
all_df.sort_index(ascending=True, level=0, inplace=True)
all_df.dropna(inplace=True)
all_df

This gives us a feature-engineered, labeled dataset. The process is generic and works for any list of symbols.
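Before modeling, it is worth a quick sanity check on the label balance. A minimal sketch, assuming the combined all_df from the loading step above:

# Class balance of the next-day direction label
print(all_df['f_return_1'].value_counts(normalize=True))

# Peek at the factors and label together
print(all_df.head())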

02 Divide the dataset

import numpy as np
import datetime

def get_date_by_percent(start_date, end_date, percent):
    # Find the date that lies `percent` of the way between start_date and end_date
    days = (end_date - start_date).days
    target_days = np.trunc(days * percent)
    target_date = start_date + datetime.timedelta(days=target_days)
    return target_date

def split_dataset(df, input_column_array, label, split_ratio):
    # Chronological split: everything before split_date is the training set
    split_date = get_date_by_percent(df.index[0], df.index[-1], split_ratio)

    input_data = df[input_column_array]
    output_data = df[label]

    # Create training and test sets
    X_train = input_data[input_data.index < split_date]
    X_test = input_data[input_data.index >= split_date]
    Y_train = output_data[output_data.index < split_date]
    Y_test = output_data[output_data.index >= split_date]

    return X_train, X_test, Y_train, Y_test
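A usage sketch, assuming the momentum columns from step 01 and an 80/20 split (the variable names here are illustrative, not from the original code):

feature_cols = ['mom_1', 'mom_5', 'mom_20', 'mom_60']
X_train, X_test, Y_train, Y_test = split_dataset(
    all_df, feature_cols, 'f_return_1', 0.8)
print(X_train.shape, X_test.shape)

The split is chronological rather than random: shuffling time-series data would leak future information into the training set.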

03 Baseline model

Implement three baseline models: logistic regression, random forest, and SVM.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def do_logistic_regression(x_train,y_train):
    classifier = LogisticRegression()
    classifier.fit(x_train, y_train)
    return classifier


def do_random_forest(x_train,y_train):
    classifier = RandomForestClassifier()
    classifier.fit(x_train, y_train)
    return classifier


def do_svm(x_train,y_train):
    classifier = SVC()
    classifier.fit(x_train, y_train)
    return classifier
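Training the three baselines on the split from step 02 is then one call each. A minimal sketch (all classifiers use scikit-learn defaults; the models dict is illustrative):

models = {
    'logistic_regression': do_logistic_regression(X_train, Y_train),
    'random_forest': do_random_forest(X_train, Y_train),
    'svm': do_svm(X_train, Y_train),
}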

04 Test

def test_predictor(classifier, x_test, y_test):
    pred = classifier.predict(x_test)

    # Count how often the predicted direction matches the actual label
    # (use .iloc for positional access, since y_test is indexed by date)
    hit_count = 0
    total_count = len(y_test)
    for index in range(total_count):
        if pred[index] == y_test.iloc[index]:
            hit_count = hit_count + 1

    hit_ratio = hit_count / total_count
    # score is the classifier's accuracy, which should equal hit_ratio
    score = classifier.score(x_test, y_test)
    print("hit_count=%s, total=%s, hit_ratio = %s" % (hit_count, total_count, hit_ratio))

    return hit_ratio, score
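Evaluating each baseline on the held-out period, a sketch assuming the models dict from step 03:

for name, model in models.items():
    print(name)
    test_predictor(model, X_test, Y_test)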

Quantitative finance plugs into the full machine learning pipeline: data preprocessing, factor computation, automatic labeling, model training, and model prediction.

The pipeline itself is fairly standard. Later improvements mainly come from adding more and better factors, trying different labeling schemes (categorical or continuous), and preprocessing or filtering the factors, as sketched below.

Then come better, more capable models.
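For example, two such variations in one minimal sketch: a continuous next-day-return label instead of the sign label, and rolling z-score preprocessing of the momentum factors. The 60-day window and column names are assumptions for illustration, not the original code:

# Continuous label: raw next-day return instead of its sign
df['f_return_1'] = df['close'].shift(-1) / df['close'] - 1

# Factor preprocessing: z-score each momentum factor over a rolling 60-day window
for p in [1, 5, 20, 60]:
    col = 'mom_{}'.format(p)
    rolling = df[col].rolling(60)
    df[col] = (df[col] - rolling.mean()) / rolling.std()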

 
