Original article No. 99. This series focuses on "personal growth and financial freedom, the logic of how the world works, and AI quantitative investing".
This is the 99th article, so the first small goal of 100 original articles is nearly complete, which is very good.
What happens if you keep doing one thing for 1000 days? Let's wait and see.
The articles of the past few days covered data (hdf5: a storage format suited to quantitative work, compatible with pandas DataFrames), feature engineering (factor feature engineering: based on pandas and talib (code)), and single-factor evaluation ([Weekly research report] Alphalens for AI quantitative feature engineering: a set of general tools for analyzing alpha factors). Today I will talk about the model.
First combine financial modeling with the machine learning process:
01 Data preparation, feature engineering and labeling
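    import numpy as np

    def feature_engineer(df):
        features = []
        # Momentum factors: percentage change of the close over several look-back windows
        for p in [1, 5, 20, 60]:
            name = 'mom_{}'.format(p)
            features.append(name)
            df[name] = df['close'].pct_change(p)
        # Label: direction of the next day's return (+1 up, -1 down, 0 flat)
        df['f_return_1'] = np.sign(df['close'].shift(-1) / df['close'] - 1)
        # Assumes the stored frame already carries a 'code' column
        features.append('code')
        features.append('f_return_1')
        print(features)
        df = df[features]
        return df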
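    import pandas as pd

    symbols = ['SPX', '000300.SH']
    dfs = []
    with pd.HDFStore('data/index.h5') as store:
        for symbol in symbols:
            df = store[symbol]
            # Normalize each index to start at 1 so the two series are comparable
            df['close'] = df['close'] / df['close'].iloc[0]
            df = feature_engineer(df)
            dfs.append(df)
    # Named all_data rather than all, to avoid shadowing the built-in all()
    all_data = pd.concat(dfs)
    # all_data.set_index([all_data.index, 'code'], inplace=True)
    all_data.sort_index(ascending=True, level=0, inplace=True)
    # Drops the warm-up rows of each momentum window and the last row, which has no next-day label
    all_data.dropna(inplace=True)
    all_data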
This gives us a feature-engineered, labeled data set.
The process is generic.
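To check the features and the label without the HDF5 file, here is a minimal sketch on synthetic prices. The helper name `make_features` and the random-walk series are assumptions for illustration only; the momentum windows and the sign label follow the code above.

```python
import numpy as np
import pandas as pd

def make_features(close: pd.Series, periods=(1, 5, 20, 60)) -> pd.DataFrame:
    """Momentum features plus a next-day direction label (hypothetical helper)."""
    out = pd.DataFrame(index=close.index)
    for p in periods:
        out[f"mom_{p}"] = close.pct_change(p)
    # Label: sign of the next day's return; the last row has no future price.
    out["f_return_1"] = np.sign(close.shift(-1) / close - 1)
    # Drop the warm-up rows of the longest window and the unlabeled last row.
    return out.dropna()

# Synthetic random-walk prices standing in for a real index series.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=300)
log_steps = rng.normal(0, 0.01, len(dates))
close = pd.Series(100 * np.exp(np.cumsum(log_steps)), index=dates)

sample = make_features(close)
print(sample.shape)
print(sample.tail(3))
```

With a 60-day window, the first 60 rows and the final row drop out, which is worth remembering when the history is short.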
02 Divide the dataset
    import datetime

    def get_date_by_percent(start_date, end_date, percent):
        # Date lying `percent` of the way from start_date to end_date
        days = (end_date - start_date).days
        target_days = int(days * percent)
        target_date = start_date + datetime.timedelta(days=target_days)
        return target_date

    def split_dataset(df, input_column_array, label, split_ratio):
        # Assumes df has a date index sorted in ascending order
        split_date = get_date_by_percent(df.index[0], df.index[-1], split_ratio)
        input_data = df[input_column_array]
        output_data = df[label]
        # Create training and test sets: everything before split_date trains, the rest tests
        X_train = input_data[input_data.index < split_date]
        X_test = input_data[input_data.index >= split_date]
        Y_train = output_data[output_data.index < split_date]
        Y_test = output_data[output_data.index >= split_date]
        return X_train, X_test, Y_train, Y_test
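The point of splitting by date rather than at random is that every training row comes before every test row, so the model never sees the future. A minimal sketch of the same idea on a toy frame; the function name `chronological_split` and the toy values are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

def chronological_split(df, feature_cols, label_col, train_ratio=0.7):
    """Split by calendar date so every training row precedes every test row."""
    start, end = df.index.min(), df.index.max()
    split_date = start + pd.Timedelta(days=int((end - start).days * train_ratio))
    train = df[df.index < split_date]
    test = df[df.index >= split_date]
    return train[feature_cols], test[feature_cols], train[label_col], test[label_col]

# Toy frame with a daily DatetimeIndex (synthetic values, for illustration only).
idx = pd.date_range("2021-01-01", periods=101, freq="D")
toy = pd.DataFrame(
    {
        "mom_1": np.linspace(-1, 1, 101),
        "f_return_1": np.tile([1.0, -1.0], 51)[:101],
    },
    index=idx,
)

X_train, X_test, y_train, y_test = chronological_split(toy, ["mom_1"], "f_return_1", 0.7)
print(len(X_train), len(X_test))
print(X_train.index.max(), "<", X_test.index.min())
```

Note that the ratio applies to calendar days, not to row counts, so the two sets can be uneven when the data has gaps.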
03 Baseline model
Implement three baseline models: logistic regression, random forest, and SVM.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    def do_logistic_regression(x_train, y_train):
        classifier = LogisticRegression()
        classifier.fit(x_train, y_train)
        return classifier

    def do_random_forest(x_train, y_train):
        classifier = RandomForestClassifier()
        classifier.fit(x_train, y_train)
        return classifier

    def do_svm(x_train, y_train):
        classifier = SVC()
        classifier.fit(x_train, y_train)
        return classifier
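Because all three scikit-learn classifiers share the same `fit` / `predict` interface, they can also be trained in one loop. A minimal sketch on synthetic data; the random features and labels here are assumptions standing in for the momentum factors and the up/down label, and `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for four momentum features and an up/down label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1.0, -1.0)

baselines = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(),
}

# fit() returns the estimator itself, so the fitted models can be collected directly.
fitted = {name: model.fit(X, y) for name, model in baselines.items()}

for name, model in fitted.items():
    print(name, model.predict(X[:3]))
```

Keeping the models in a dict makes it easy to add or swap a baseline later without touching the training loop.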
04 Test
    import numpy as np

    def test_predictor(classifier, x_test, y_test):
        pred = classifier.predict(x_test)
        # Compare as arrays: y_test is indexed by date, so y_test[i] is not a safe positional lookup
        y_true = np.asarray(y_test)
        hit_count = int((pred == y_true).sum())
        total_count = len(y_true)
        hit_ratio = hit_count / total_count
        # For a classifier, score() is the mean accuracy, so it should match hit_ratio
        score = classifier.score(x_test, y_test)
        print("hit_count=%s, total=%s, hit_ratio = %s" % (hit_count, total_count, hit_ratio))
        return hit_ratio, score
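A hit ratio is easier to read next to a naive reference. The sketch below computes the hit ratio on synthetic data, confirms it is the same number as scikit-learn's accuracy, and compares it with a classifier that always predicts the most frequent training label. The `DummyClassifier` comparison is an addition for illustration, not part of the original workflow, and the data is made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic features and up/down labels; the first 300 rows train, the last 100 test.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = np.where(X[:, 0] + rng.normal(size=400) > 0, 1.0, -1.0)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Hit ratio: share of test rows where the predicted direction matches the label.
hit_ratio = float(np.mean(pred == y_test))
accuracy = accuracy_score(y_test, pred)

# Naive reference: always predict the most frequent label seen in training.
naive = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
naive_ratio = naive.score(X_test, y_test)

print("hit_ratio=%.3f accuracy=%.3f naive=%.3f" % (hit_ratio, accuracy, naive_ratio))
```

On real index data the up/down labels are close to balanced, so a model only earns attention if it clearly beats this naive ratio out of sample.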
Quantitative finance fits into every stage of the machine learning process: data preprocessing, factor calculation, automatic labeling, model training, model prediction, and so on.
This process is almost standard. Later work is mainly about adding more and better factors and trying different labeling methods, which can be categorical or continuous. The factors themselves can also be preprocessed, filtered, and so on.
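As one possible reading of "different labeling methods" and "preprocessing", the sketch below builds both a continuous label (the next-day return) and a categorical one (its sign), then clips and standardizes a factor. The helper `winsorize_zscore`, the 1% / 99% quantile cut-offs, and the synthetic prices are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def winsorize_zscore(factor: pd.Series, lower=0.01, upper=0.99) -> pd.Series:
    """Clip extreme values to the given quantiles, then standardize (hypothetical helper)."""
    clipped = factor.clip(factor.quantile(lower), factor.quantile(upper))
    return (clipped - clipped.mean()) / clipped.std()

# Synthetic random-walk prices standing in for a real index series.
rng = np.random.default_rng(7)
dates = pd.bdate_range("2022-01-03", periods=250)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))), index=dates)

df = pd.DataFrame({"mom_5": close.pct_change(5)})
# Continuous label: next-day return, usable as a regression target.
df["label_continuous"] = close.shift(-1) / close - 1
# Categorical label: direction of the next-day return, as in the classification setup.
df["label_categorical"] = np.sign(df["label_continuous"])
df = df.dropna()

# Preprocessed factor: outliers clipped, then rescaled to zero mean and unit variance.
df["mom_5_clean"] = winsorize_zscore(df["mom_5"])
print(df.describe().loc[["mean", "std", "min", "max"]])
```

Here the quantiles, mean, and standard deviation are taken over the whole sample for brevity; in a real backtest they would be computed on the training period only, to avoid leaking test information.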
After that come better, more capable models.