Convert time series to classification problem

This article uses stock trading as an example. We use a machine learning model to predict whether a stock will rise or fall the next day, and we compare three classification algorithms for this task: XGBoost, random forest, and logistic regression. The other focus of the article is data preparation, that is, how we have to transform the data so that a model can process it.

We follow the CRISP-DM process model so that we take a structured approach to solving the business case. CRISP-DM (Cross-Industry Standard Process for Data Mining) is particularly useful for analytics projects and is often used in industry to structure data science work.

We also use the Python package openbb. It bundles several data sources from the financial sector that we can access conveniently.

First, install the necessary libraries:

  pip install pandas numpy "openbb[all]" swifter scikit-learn xgboost

Business Understanding

First of all, we should understand the problem we want to solve. In our example, the problem can be defined as follows:

  Predict whether the share price of the ticker symbol AAPL will rise or fall the next day.

Next comes the question of which kind of machine learning model to consider. We want to predict whether the stock will go up or down the next day, so this is a classification problem: class 1 means the stock went up the next day, and class 0 means it went down. In a classification problem we predict a class; in our case it is a binary classification with the classes 0 and 1.

Data Understanding and Preparation

The data understanding phase focuses on identifying, collecting, and analyzing data sets. In the first step, we download the Apple stock data. Here's how to do it with openbb:

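 # Assumption: this import path matches the OpenBB SDK version that provides
 # openbb.stocks.load(); adjust it if your installed version differs
 from openbb_terminal.sdk import openbb

 data = openbb.stocks.load(
     symbol = 'AAPL',
     start_date = '2023-01-01',
     end_date = '2023-04-01',
     monthly = False)
 data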

The code downloads data between 2023-01-01 and 2023-04-01. The downloaded data contains the following information:

  • Open: the daily opening price (USD)
  • High: the highest price of the day (USD)
  • Low: the lowest price of the day (USD)
  • Close: the daily closing price (USD)
  • Adj Close: the closing price adjusted for dividends and stock splits
  • Volume: the number of shares traded
  • Dividends: dividends paid
  • Stock Splits: stock splits carried out

The downloaded data is not yet in a form that a classification model can work with, so we still need to prepare it. For this, I wrote a function that downloads the data and then transforms it for modeling:

 def get_training_data(symbol, start_date, end_date, monthly_bool=True, lookback=10):
     # Download the price data, label each day, and cut the labels into windows
     data = openbb.stocks.load(
         symbol=symbol,
         start_date=start_date,
         end_date=end_date,
         monthly=monthly_bool)
     data = get_label(data)
     data_up_down = data['up_down'].to_numpy()
     training_data = get_sequence_data(data_up_down, lookback)
     return training_data

The first function included here is get_label():

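 import swifter  # registers the .swifter accessor on pandas objects

 def encoding(n):
     # 1 for a positive price change, 0 otherwise (including no change)
     if n > 0:
         return 1
     else:
         return 0

 def get_label(data):
     # Price change within each trading day: close minus open
     data['Delta'] = data['Close'] - data['Open']
     data['up_down'] = data['Delta'].swifter.apply(lambda d: encoding(d))
     return data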

Its main job is to calculate the difference between the closing price and the opening price of each day. We then mark every day on which the price went up with 1 and every other day with 0. The result is stored in an additional up_down column, which records whether the stock price was up or down on a particular date. We use swifter.apply() instead of the pandas apply() because swifter can parallelize the operation across multiple CPU cores.
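For a rule this simple, the same labels can also be produced with a plain vectorized comparison, without apply() at all. The following is only a sketch of that alternative, not the method used in this article; the prices are made up for illustration.

```python
import pandas as pd

# Made-up prices for illustration only (not real AAPL data).
data = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0, 103.0],
    "Close": [101.5, 101.0, 101.0, 104.2],
})

# 1 if the close is above the open, otherwise 0 (an unchanged price counts as 0),
# which mirrors the encoding() rule above.
data["up_down"] = (data["Close"] > data["Open"]).astype(int)
print(data["up_down"].tolist())  # [1, 0, 0, 1]
```

The comparison runs on the whole column at once, so it needs no per-row Python function call.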

The second function is get_sequence_data(). The parameter lookback specifies how many consecutive days each window covers. The code of get_sequence_data() is as follows:

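 import numpy as np

 def get_sequence_data(data_up_down, lookback):
     # One row per window: n - lookback + 1 rows of length lookback
     shape = (data_up_down.shape[0] - lookback + 1, lookback)
     # Moving one row down or one column right both advance by one element
     strides = data_up_down.strides + (data_up_down.strides[-1],)
     return np.lib.stride_tricks.as_strided(data_up_down, shape=shape, strides=strides)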

This function takes two parameters, data_up_down and lookback. It returns a NumPy array that is a sliding-window view of data_up_down, with the window size given by lookback. To illustrate how this function works, let's look at a small example.

 get_sequence_data(np.array([1, 2, 3, 4, 5, 6]), 3)

The result is as follows:

 array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])
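NumPy also ships a ready-made helper for this pattern, sliding_window_view (available from NumPy 1.20). Unlike as_strided, it checks the window size against the array and returns a read-only view by default. The sketch below is an alternative to the function above, and the name get_sequence_data_safe is made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def get_sequence_data_safe(data_up_down, lookback):
    # Read-only view of all consecutive windows of length `lookback`.
    return sliding_window_view(data_up_down, window_shape=lookback)

windows = get_sequence_data_safe(np.array([1, 2, 3, 4, 5, 6]), 3)
print(windows.tolist())  # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Both variants return views that share memory with the input array, so copy the result with np.array(...) if you intend to modify it.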

In the following, we download the data for the Apple stock and transform it for modeling. We use a lookback window of 10 days.

 data = get_training_data(symbol='AAPL', start_date='2023-01-01', end_date='2023-04-01', monthly_bool=False, lookback=10)
 # index=False keeps the row index out of the CSV so it is not read back as a feature
 pd.DataFrame(data).to_csv("data/data_aapl.csv", index=False)

Now that the data is ready, we start modeling and evaluating the model.

Modeling

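We read the prepared data and split it into training and test sets. In each row, the last column (the most recent day of the window) is the label, and the preceding columns are the features.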

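 import pandas as pd
 from sklearn.model_selection import train_test_split

 data = pd.read_csv("./data/data_aapl.csv")
 X = data.iloc[:, :-1]  # all columns except the last: features
 Y = data.iloc[:, -1]   # last column: label
 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=4284, stratify=Y)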
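To make the split into features and label concrete, here is a small sketch with a made-up up/down sequence and a window size of 4 instead of 10. It only illustrates what the two iloc slices select; the values are not real stock data.

```python
import numpy as np

# Made-up up/down sequence (1 = up, 0 = down), for illustration only.
up_down = np.array([1, 0, 1, 1, 0, 1])
lookback = 4

# Build the windows row by row (same result as get_sequence_data).
windows = np.array([up_down[i:i + lookback]
                    for i in range(len(up_down) - lookback + 1)])

X = windows[:, :-1]  # first lookback - 1 days of each window: features
y = windows[:, -1]   # last day of each window: label to predict
print(X.tolist())  # [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
print(y.tolist())  # [1, 0, 1]
```

With a window size of 10, each sample therefore consists of nine past up/down values as features and the tenth value as the label.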

Logistic regression:

This classifier is a linear model and is often used as a baseline. We use scikit-learn's implementation:

 from sklearn.linear_model import LogisticRegression

 model_lr = LogisticRegression(random_state = 42)
 model_lr.fit(X_train,y_train)
 y_pred = model_lr.predict(X_test)

XGBoost:

XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance. It is a tree boosting algorithm, which combines many weak tree classifiers sequentially.

 from xgboost import XGBClassifier

 model_xgb = XGBClassifier(random_state = 42)
 model_xgb.fit(X_train, y_train)
 y_pred = model_xgb.predict(X_test)

Random Forest:

Random Forest builds multiple decision trees and combines their predictions. This approach is called ensemble learning because multiple learners are combined, and the algorithm belongs to the bagging methods. The term "bagging" is short for bootstrap aggregating. Here, too, we use the scikit-learn implementation:

 from sklearn.ensemble import RandomForestClassifier

 model_rf = RandomForestClassifier(random_state = 42)
 model_rf.fit(X_train, y_train)
 y_pred = model_rf.predict(X_test)
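The bagging idea itself can be sketched in a few lines: train each tree on a bootstrap sample (drawn with replacement) and aggregate the predictions by majority vote. This is a simplified illustration on made-up toy data, not what RandomForestClassifier does internally; a random forest additionally considers only a random subset of features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data (made up): 40 samples with 3 binary features; the label equals feature 0.
X = rng.integers(0, 2, size=(40, 3))
y = X[:, 0]

# Bootstrap: each tree is trained on a sample drawn with replacement.
trees = []
for seed in range(5):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: majority vote over the individual tree predictions.
votes = np.stack([t.predict(X) for t in trees])  # shape (5, 40)
y_bagged = (votes.mean(axis=0) > 0.5).astype(int)
```

An odd number of trees is used so that the majority vote cannot end in a tie.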

Evaluation

After training the models, we need to check their performance on the test data. The metrics are recall, precision, and F1 score. The table below shows the results.
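These three metrics can be computed with scikit-learn's metric functions. The sketch below uses made-up labels to show the calls; for the models above, you would pass y_test and the corresponding y_pred instead. The names y_true and y_hat are placeholders for this example.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels for illustration; replace with y_test and a model's y_pred.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 1, 1, 0, 0]

# Here: 3 true positives, 1 false positive, 1 false negative.
precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

By default, these functions score the positive class (1), which in our setting is "the stock went up".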

The logistic regression classifier and the random forest achieved significantly better results than the XGBoost model. What is the reason for this? A likely explanation is that the data is relatively simple: it has only a few feature dimensions, the time series is short, and none of our models has been tuned.

Summary

The main purpose of this article is to show how to convert a stock price time series into a classification problem, and how to use a sliding window during data preparation to turn the time series into sequences. We did little tuning of the models, which is likely why the simpler ones performed better in the evaluation.

https://avoid.overfit.cn/post/57a12ff0cf964fbebf1b27bc72fb2bbb

By Tinz Twins

Origin blog.csdn.net/m0_46510245/article/details/130438960