AI Quantitative Model Prediction Challenge baseline (study notes) (1)

1 CatBoost method personal interpretation
2 personal summary

Contest address: AI Quantitative Model Prediction Challenge
The data set structure is as follows:
[Image: data set structure]

1 CatBoost method personal interpretation

1.1 Import module

Import the required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plots in section 1.2.2
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss, mean_squared_log_error
import tqdm, sys, os, gc, argparse, warnings
warnings.filterwarnings('ignore')  # suppress all warning messages

CatBoost is a GBDT framework built on symmetric (oblivious) decision trees. It has relatively few parameters to tune, supports categorical variables natively, and often achieves high accuracy. Reference: In-depth understanding of CatBoost - Zhihu (zhihu.com)
warnings.filterwarnings('ignore'): A warning filter controls what happens to a warning message: it can be ignored, displayed, or turned into an error (an exception is raised). The warnings module maintains an ordered list of filter rules, and each warning is matched against the rules in turn until a match is found; that rule decides how the warning is handled. Calling filterwarnings('ignore') adds a rule that suppresses all warnings. Reference: Python's warnings module (warnings.filterwarnings('ignore') code analysis) - Xijiuxingcheng - Blog Garden (cnblogs.com)
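
To make the rule matching concrete, here is a minimal sketch that uses only the standard library. The helper noisy() is made up for illustration. The same call is recorded under an "always" rule and then suppressed once an "ignore" rule is placed in front of it.

```python
import warnings


def noisy():
    """Hypothetical helper that emits a warning and returns a value."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Record every warning: the "always" rule is the first match.
with warnings.catch_warnings(record=True) as shown:
    warnings.simplefilter("always")
    noisy()

# Same call, but an "ignore" rule is inserted at the front of the rule list,
# so it matches before the "always" rule does.
with warnings.catch_warnings(record=True) as hidden:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy()

print(len(shown), len(hidden))
```

Using catch_warnings() keeps the filter changes local to the with block, which can be preferable to a global 'ignore' when only one noisy call needs silencing.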

1.2 Data exploration

Exploratory data analysis helps us understand the data set, the relationships between variables, and the relationship between each variable and the prediction target. This understanding makes later feature engineering and model building much easier, so it is a very important step in machine learning.

1.2.1 Reading data

# Read the data
path = 'AI量化模型预测挑战赛公开数据/'

train_files = os.listdir(path+'train')
train_df = pd.DataFrame()
for filename in tqdm.tqdm(train_files):
    tmp = pd.read_csv(path+'train/'+filename)
    tmp['file'] = filename
    train_df = pd.concat([train_df, tmp], axis=0, ignore_index=True)

test_files = os.listdir(path+'test')
test_df = pd.DataFrame()
for filename in tqdm.tqdm(test_files):
    tmp = pd.read_csv(path+'test/'+filename)
    tmp['file'] = filename
    test_df = pd.concat([test_df, tmp], axis=0, ignore_index=True)

os.listdir(path+'train'): os.listdir() returns a list of the names of the files and folders contained in the given folder. Unlike the glob function seen earlier, which returns paths, it returns bare names, which is why the code prepends the folder path when reading each file. Reference: The os.listdir() function in python reads the folder - Rogn - Blog Park (cnblogs.com)
train_df = pd.concat([train_df, tmp], axis=0, ignore_index=True): Concatenates the two data frames train_df and tmp along the row direction (axis=0); ignore_index=True discards the original indexes and renumbers the rows. References: pandas data merging pd.concat() usage - xue_11's blog-CSDN blog; Understanding pd.concat() - Zhihu (zhihu.com)
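
The difference between os.listdir() and glob, and the effect of ignore_index=True, can be seen on a throwaway folder. This is a minimal sketch: the file names and values are made up, and collecting the frames in a list before a single pd.concat() is a common alternative to concatenating inside the loop, not the approach used in these notes.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Two tiny stand-in files (made-up names and values).
    for name, price in [("a.csv", 1.0), ("b.csv", 2.0)]:
        pd.DataFrame({"price": [price, price + 0.5]}).to_csv(
            os.path.join(folder, name), index=False)

    names = sorted(os.listdir(folder))                        # bare file names
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))  # full paths

    frames = []
    for name in names:
        tmp = pd.read_csv(os.path.join(folder, name))
        tmp["file"] = name  # remember which file each row came from
        frames.append(tmp)

# One concat at the end; ignore_index=True renumbers the rows 0..n-1.
df = pd.concat(frames, axis=0, ignore_index=True)
print(names)
print(df)
```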

1.2.2 Visual analysis

Visual analysis of buying and selling prices

Pick the data of any one stock and plot it to observe the relationship between the bid (buying) prices and the ask (selling) prices. A brief introduction to the two prices:

The bid price refers to the highest price a buyer is willing to pay for a stock/asset.
The ask price refers to the lowest price that the seller is willing to accept for a stock/asset.
The difference between these two prices is called the spread; the smaller the spread, the higher the liquidity of the instrument.
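
As a small illustration of the spread, the sketch below computes it for three made-up quotes. The column names n_bid1 and n_ask1 follow the data set; the numbers are invented for the example.

```python
import pandas as pd

# Made-up best bid / best ask quotes for three snapshots.
quotes = pd.DataFrame({
    "n_bid1": [10.00, 10.01, 10.02],
    "n_ask1": [10.02, 10.02, 10.05],
})

# Spread = lowest ask minus highest bid; it should never be negative.
quotes["spread"] = quotes["n_ask1"] - quotes["n_bid1"]

# Row with the tightest spread, i.e. the most liquid moment in this toy data.
tightest = quotes["spread"].idxmin()
print(quotes.round(4))
print("tightest spread at row", tightest)
```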

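cols = ['n_bid1','n_bid2','n_ask1','n_ask2']
# Last 500 snapshots of one file, with a fresh 'index' column to use as the x-axis
tmp_df = train_df[train_df['file']=='snapshot_sym7_date22_pm.csv'].reset_index(drop=True)[-500:]
tmp_df = tmp_df.reset_index(drop=True).reset_index()

# One figure with four stacked subplots, one per price column
# (the figure is created once, outside the loop, so the subplots share it)
plt.figure(figsize=(20,5))
for num, col in enumerate(cols):
    plt.subplot(4,1,num+1)
    plt.plot(tmp_df['index'],tmp_df[col])
    plt.title(col)
plt.tight_layout()
plt.show()

# All four price columns overlaid in a single figure
plt.figure(figsize=(20,5))
for num, col in enumerate(cols):
    plt.plot(tmp_df['index'],tmp_df[col],label=col)
plt.legend(fontsize=12)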

train_df[train_df['file']=='snapshot_sym7_date22_pm.csv']: Selects the rows of train_df whose file column equals snapshot_sym7_date22_pm.csv.
reset_index(drop=True): Resets the index of the data frame to the default integer index. drop=True discards the old index instead of adding it as a new column. Calling reset_index() without drop=True keeps the old index as a column named index, which is where tmp_df['index'] comes from.
enumerate(iterable, start=0): iterable is the object to traverse and start is the first index value. It returns an enumerate object that can be iterated with a for loop; each iteration yields a tuple of two elements, the index of the current element and the element itself.
plt.figure(figsize=(20,5)): Creates a new figure and sets its size to 20x5 inches.
plt.subplot(4,1,num+1): Creates a subplot in the current figure. The arguments 4,1 divide the figure into 4 rows and 1 column, and num+1 selects which of those subplots to draw in.
plt.plot(tmp_df['index'],tmp_df[col],label=col): Draws a line on the current axes; the label parameter attaches a label to the line.
plt.legend(fontsize=12): Adds a legend to the current axes. The legend entries are the labels of the plotted lines, and fontsize sets the legend font size (12 here).
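
The slicing and the double reset_index() are easier to follow on a toy frame. The sketch below uses invented data and takes the last 3 rows instead of the last 500; otherwise it mirrors the steps described above.

```python
import pandas as pd

df = pd.DataFrame({"file": ["x", "y", "x", "x", "x"],
                   "n_bid1": [1.0, 9.0, 2.0, 3.0, 4.0]})

# Keep the rows of file "x", renumber them, then take the last 3 rows.
tmp = df[df["file"] == "x"].reset_index(drop=True)[-3:]

# drop=True renumbers from 0; the second reset_index() keeps that numbering
# as a regular column named "index", which is convenient as an x-axis.
tmp = tmp.reset_index(drop=True).reset_index()

# enumerate yields (position, value) pairs, here (0, "n_bid1").
for num, col in enumerate(["n_bid1"]):
    print(num, col, tmp[col].tolist())
print(tmp.columns.tolist())
```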


Mid Price Visualization

The mid price is the average of the best bid price and the best ask price (it is a midpoint, not a statistical median). It is given directly in the data as n_midprice, and we can also calculate it ourselves.
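
Calculating it ourselves could look like the sketch below. The quotes are made up, and whether the result matches the provided n_midprice column exactly depends on how the organizers preprocessed the data, which these notes do not state.

```python
import pandas as pd

# Made-up best bid / best ask quotes.
book = pd.DataFrame({"n_bid1": [10.00, 10.02], "n_ask1": [10.02, 10.06]})

# Mid price = average of the best bid and the best ask.
book["mid_calc"] = (book["n_bid1"] + book["n_ask1"]) / 2
print(book.round(4))
```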

plt.figure(figsize=(20,5))

for num, col in enumerate(cols):
    plt.plot(tmp_df['index'],tmp_df[col],label=col)

plt.plot(tmp_df['index'],tmp_df['n_midprice'],label="n_midprice",lw=10)
plt.legend(fontsize=12)

lw=10: lw (short for linewidth) is a parameter of matplotlib's plot function that sets the width of the line.


Weighted Average Price (WAP) Visualization

Volatility is an important statistical measure of how much a stock's price changes. To compute price changes we first need a valuation of the stock at regular intervals. Here the weighted average price (WAP) is computed from the provided data and plotted; changes in WAP reflect the stock's volatility.

train_df['wap1'] = (train_df['n_bid1']*train_df['n_bsize1'] + train_df['n_ask1']*train_df['n_asize1'])/(train_df['n_bsize1'] + train_df['n_asize1'])
test_df['wap1'] = (test_df['n_bid1']*test_df['n_bsize1'] + test_df['n_ask1']*test_df['n_asize1'])/(test_df['n_bsize1'] + test_df['n_asize1'])

tmp_df = train_df[train_df['file']=='snapshot_sym7_date22_pm.csv'].reset_index(drop=True)[-500:]
tmp_df = tmp_df.reset_index(drop=True).reset_index()
plt.figure(figsize=(20,5))
plt.plot(tmp_df['index'], tmp_df['wap1'])

The code above consists mainly of DataFrame operations. Reference: Pandas for data analysis (4) DataFrame operation - Timojun's blog-CSDN blog
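
A worked example with made-up numbers shows how the sizes pull the WAP. The function wap() is a hypothetical helper that applies the same formula as the wap1 column above to single values.

```python
def wap(bid, bid_size, ask, ask_size):
    """Same weighting as the wap1 column: each price is weighted by its own size."""
    return (bid * bid_size + ask * ask_size) / (bid_size + ask_size)


# Equal sizes: the WAP sits halfway between bid and ask.
balanced = wap(10.0, 100, 10.2, 100)

# Three times more size on the bid: the WAP moves toward the bid price.
heavy_bid = wap(10.0, 300, 10.2, 100)

print(round(balanced, 4), round(heavy_bid, 4))
```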


1.3 Feature Engineering

In the feature engineering stage, basic time features such as the hour and minute are extracted, mainly to capture differences that may exist between different times of day. Note that the data is stored in multiple files, so the files must be merged before any of this work.

# Time-related features
train_df['hour'] = train_df['time'].apply(lambda x:int(x.split(':')[0]))
test_df['hour'] = test_df['time'].apply(lambda x:int(x.split(':')[0]))

train_df['minute'] = train_df['time'].apply(lambda x:int(x.split(':')[1]))
test_df['minute'] = test_df['time'].apply(lambda x:int(x.split(':')[1]))

# Features fed to the model (identifier columns are excluded)
cols = [f for f in test_df.columns if f not in ['uuid','time','file']]

train_df['hour'] = train_df['time'].apply(lambda x:int(x.split(':')[0])): Splits each element of the time column of train_df on ":". For example, 12:30 becomes ["12", "30"]; the first element is the hour and the second is the minute, and each is then converted to int.
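
The split can be checked on a few sample strings. The sketch below assumes times formatted as HH:MM:SS (the exact format in the data is an assumption here). The second half shows a vectorized alternative with str.split, which is not the code used in these notes but gives the same result.

```python
import pandas as pd

# Made-up time strings; the HH:MM:SS format is an assumption.
times = pd.Series(["09:40:03", "13:05:00", "14:59:57"], name="time")

# Same approach as above: split each string and convert the pieces to int.
hour = times.apply(lambda x: int(x.split(":")[0]))
minute = times.apply(lambda x: int(x.split(":")[1]))

# Vectorized alternative: one split, one column per piece.
parts = times.str.split(":", expand=True).astype(int)

print(hour.tolist(), minute.tolist())
print(parts)
```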

1.4 Model training and verification

CatBoost is commonly used as a baseline model in machine learning competitions because it gives fairly stable scores without much parameter tuning. Here, five-fold cross-validation is used to split the data for validation, and the predictions of the five models are averaged to produce the final submission.
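
The bookkeeping behind "five folds, then average" can be sketched in isolation with NumPy. Here fake_predict_proba() is a made-up stand-in for a trained model's predict_proba, so no training happens. Each training row receives exactly one out-of-fold prediction, and the test prediction is the mean of the five per-fold predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_classes, folds = 10, 4, 3, 5

oof = np.zeros((n_train, n_classes))           # out-of-fold predictions
test_predict = np.zeros((n_test, n_classes))   # running average over folds


def fake_predict_proba(n_rows):
    """Stand-in for model.predict_proba: random rows that each sum to 1."""
    p = rng.random((n_rows, n_classes))
    return p / p.sum(axis=1, keepdims=True)


# Split a shuffled list of row positions into 5 disjoint validation sets.
for valid_index in np.array_split(rng.permutation(n_train), folds):
    oof[valid_index] = fake_predict_proba(len(valid_index))
    test_predict += fake_predict_proba(n_test) / folds

# The submitted label is the class with the highest averaged probability.
labels = np.argmax(test_predict, axis=1)
print(labels)
```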

def cv_model(clf, train_x, train_y, test_x, clf_name, seed = 2023):
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros([train_x.shape[0], 3])
    test_predict = np.zeros([test_x.shape[0], 3])
    cv_scores = []
    
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Select rows by integer position for both features and labels
        trn_x, val_x = train_x.iloc[train_index], train_x.iloc[valid_index]
        trn_y, val_y = train_y.iloc[train_index], train_y.iloc[valid_index]
       
        if clf_name == "cat":
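            # 'random_seed' appeared twice (2023, then 11); Python keeps the last
            # value of a repeated key, so 11 is the value that was in effect.
            params = {
                'learning_rate': 0.2, 'depth': 6, 'bootstrap_type': 'Bernoulli',
                'od_type': 'Iter', 'od_wait': 100, 'random_seed': 11,
                'allow_writing_files': False, 'loss_function': 'MultiClass',
            }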
            
            model = clf(iterations=5000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      metric_period=1000,
                      use_best_model=True, 
                      cat_features=[],
                      verbose=1)
            
            val_pred  = model.predict_proba(val_x)
            test_pred = model.predict_proba(test_x)
        
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        
        F1_score = f1_score(val_y, np.argmax(val_pred, axis=1), average='macro')
        cv_scores.append(F1_score)
        print(cv_scores)
        
    return oof, test_predict
    
for label in ['label_5','label_10','label_20','label_40','label_60']:
    print(f'=================== {label} ===================')
    cat_oof, cat_test = cv_model(CatBoostClassifier, train_df[cols], train_df[label], test_df[cols], 'cat')
    train_df[label] = np.argmax(cat_oof, axis=1)
    test_df[label] = np.argmax(cat_test, axis=1)

kf = KFold(n_splits=folds, shuffle=True, random_state=seed):
KFold is a class in sklearn that splits the data set into k folds. Each fold is used once as the validation set while the remaining k-1 folds form the training set. It is the standard tool for cross-validation.
n_splits=folds: the number of folds to split the data set into.
shuffle=True: the samples are randomly shuffled before splitting (without it, the folds are consecutive blocks of rows).
random_state=seed: fixes the random seed so that the split is identical every time the code is run.
kf.split(train_x, train_y): Returns a generator that yields one (train_index, valid_index) pair per fold. Each element is an array of integer row positions, not the feature or label data itself.
iloc[list]: iloc accepts an array of integer positions and returns the corresponding rows. Reference: How to use loc and iloc functions for data indexing in pandas (Introduction) - Zhihu (zhihu.com)
clf(iterations=5000, **params): At a call site, the ** in front of params unpacks the dict into keyword arguments, so each key becomes a parameter name and each value its argument. (In a function definition, **params does the reverse and collects any extra keyword arguments into a dict.) Reference: Python function parameter transfer (params, *params, **params) - Chercheer's blog-CSDN blog
f1_score(val_y, np.argmax(val_pred, axis=1), average='macro'): val_y holds the true labels of the validation set, and np.argmax(val_pred, axis=1) picks the most probable class index for each sample from the predicted probabilities. average='macro' computes the F1 score separately for each class and takes their unweighted mean, so every class counts equally regardless of how many samples it has. Reference: sklearn.metrics.f1_score How to use - Zhuangzhuang is not too fat^QwQ's blog-CSDN blog
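
A minimal sketch on ten made-up rows shows what kf.split() yields and how iloc uses it. The dict kf_params is a hypothetical name, introduced only to show ** unpacking at a call site.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Ten made-up rows with one feature and a 3-class label.
X = pd.DataFrame({"f": np.arange(10)})
y = pd.Series(np.arange(10) % 3)

# ** unpacks the dict into n_splits=5, shuffle=True, random_state=2023.
kf_params = {"n_splits": 5, "shuffle": True, "random_state": 2023}
kf = KFold(**kf_params)

seen = []
for i, (train_index, valid_index) in enumerate(kf.split(X, y)):
    # Both arrays hold integer row positions, so iloc is the matching indexer.
    trn_x, val_x = X.iloc[train_index], X.iloc[valid_index]
    trn_y, val_y = y.iloc[train_index], y.iloc[valid_index]
    seen.extend(valid_index.tolist())
    print(i + 1, valid_index)
```

Every row position appears in exactly one validation fold, which is what allows the out-of-fold array to be filled completely.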

This competition is evaluated with the macro-F1 score. The highest score among the five targets label_5, label_10, label_20, label_40, and label_60 is taken as the final score. The initial pass therefore models all five targets to find out which one scores highest; later optimization only needs to model that best target, which saves a lot of time and keeps the focus on single-target optimization.
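
To see what macro-F1 measures, the sketch below computes the per-class F1 scores by hand on eight made-up labels and compares their plain average with sklearn's f1_score. The helper f1_for_class() is written for this example only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])


def f1_for_class(c):
    """F1 of class c treated as one-vs-rest: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)


per_class = [f1_for_class(c) for c in (0, 1, 2)]
manual_macro = float(np.mean(per_class))  # unweighted mean over classes
sk_macro = f1_score(y_true, y_pred, average="macro")

print(np.round(per_class, 4), round(manual_macro, 4), round(sk_macro, 4))
```

Because the mean is unweighted, a rare class that is predicted poorly lowers the score as much as a frequent one would.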

1.5 Result output

The submitted results must follow the format of the sample submission: one CSV file per test file, with the output folder then compressed into zip format for submission.

import pandas as pd
import os

# Output folder for the submission files
output_dir = './submit'

# Create the folder if it does not exist
os.makedirs(output_dir, exist_ok=True)

# Group the dataframe by the 'file' column
grouped = test_df.groupby('file')

# Process each group
for file_name, group in grouped:
    # Select the columns required for submission
    selected_cols = group[['uuid', 'label_5', 'label_10', 'label_20', 'label_40', 'label_60']]

    # Save as a CSV file, using file_name as the file name
    selected_cols.to_csv(os.path.join(output_dir, file_name), index=False)

grouped = test_df.groupby('file'): Groups the test_df data frame by the values of the file column. grouped is a DataFrameGroupBy object rather than a dictionary, but iterating over it yields (key, sub-DataFrame) pairs, where the key is one value of the file column and the sub-DataFrame holds the matching rows. grouped.groups is a dict that maps each group name to the row labels in that group, and test_df['file'].unique() lists all unique file names.
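
The iteration behaviour can be checked on a three-row frame. The data below is made up, and the frame is named demo to avoid confusion with the real test_df.

```python
import pandas as pd

# Made-up stand-in for test_df with two source files.
demo = pd.DataFrame({
    "file": ["a.csv", "b.csv", "a.csv"],
    "uuid": [0, 1, 2],
    "label_5": [1, 0, 2],
})

grouped = demo.groupby("file")

# Iterating yields (group key, sub-DataFrame) pairs.
sizes = {}
for file_name, group in grouped:
    sizes[file_name] = len(group)
    print(file_name, group["uuid"].tolist())

# grouped.groups maps each key to the row labels that belong to it.
print(dict(grouped.groups))
```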

2 personal summary

First, I got to know some syntax and the usage of several functions, such as groupby, f1_score, subplot, KFold, and listdir. I hope to become familiar with them step by step in future study, until I can apply what I have learned and work much more efficiently.
I was introduced to CatBoost, but its underlying principles are still unclear to me. I plan to learn how to use it first and explore the principles afterwards.
I learned some feature construction methods.