Stock trend prediction using K-nearest neighbor (KNN) machine learning algorithm - Python

What is K-nearest neighbor (KNN)

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms and can be used for both regression and classification. KNN is a "lazy" learning algorithm: technically, it does not train a model before making predictions. The logic of K nearest neighbors is that a new observation is predicted to belong to the class that is most common among the k observations closest to it. The KNN method is a direct attempt to approximate the conditional expectation using the observed data.

For regression, the predicted value is the mean of the K neighbors, and the estimator is

$\hat{f}(x) = \mathrm{Average}[\, y_i \mid x_i \in N_k(x) \,]$

where $N_k(x)$ is the neighborhood of $x$ containing the $k$ closest observations.

For classification, the predicted label is decided by "majority vote": the class that is most common among the neighbors wins. This is equivalent to taking a majority vote among the k nearest neighbors. For each class $j = 1, \ldots, K$, we calculate the conditional probability

$\Pr(G = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_k(x_0)} I(y_i = j)$

and classify the observation into the class with the highest conditional probability. Here $I(y_i = j)$ is the indicator function, equal to $1$ if $y_i = j$ and $0$ otherwise.

KNN’s “three-step” process

KNN is popular because it is easy to understand and explain. Its accuracy is often comparable to, and sometimes better than, that of more complex algorithms. Once k is specified, finding the nearest neighbors is a three-step process.

Step      Remark
Step 1    Calculate distances, usually Euclidean distance
Step 2    Sort by distance in ascending order and find the k nearest neighbors
Step 3    Calculate the mean (regression) or class probabilities (classification) of the k neighbors
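
To make the three steps concrete, here is a minimal from-scratch sketch of a KNN classifier using only NumPy (the function name and toy data are made up for illustration):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the k neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example with two features
X_toy = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_toy = np.array([-1, -1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([5, 6]), k=3))   # -> 1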

Find neighbors

For example, suppose we have a two-dimensional Cartesian coordinate system whose horizontal and vertical axes represent two different features, and all the data points reflecting these two features have been plotted in it.
Intuitively, distance can be thought of as a measure of similarity: the closer two points are in this coordinate system, the more similar the features they represent.
Euclidean distance is most commonly used, but other distance measures such as Manhattan distance are also possible. The generalized distance measure is called Minkowski distance, defined as
$d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$

where $x_i$ and $y_i$ are the $i$-th components of the two observations whose distance $d$ is being calculated, and the integer $p$ is a hyperparameter.

When $p = 1$, the Minkowski distance is the Manhattan distance; when $p = 2$, it is the standard Euclidean distance. After identifying the K neighbors using a distance metric, the algorithm can use the neighbors' label values for classification or prediction.
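
A minimal NumPy sketch of this distance and its two special cases (the vectors here are made-up illustration data):

import numpy as np

def minkowski(x, y, p):
    # Generalized Minkowski distance between two observations x and y
    return (np.abs(x - y) ** p).sum() ** (1 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1))   # Manhattan distance: 5.0
print(minkowski(x, y, p=2))   # Euclidean distance: ~3.606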

KNN model

KNN is a non-parametric method: it does not assume any functional form, because there are no parameters to estimate. The number of neighbors, k, is selected using the training set. Choosing k close to 1 gives the most flexibility (low bias) but also the highest variability (high variance), while choosing a larger k gives less flexibility (high bias) but also lower variability (low variance). Note that $k = N$ will assign all new test observations to a single class. The best way to choose k is through cross-validation, which we will expand on later. Another alternative is to use the elbow method to choose k.
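
As an illustration of choosing k by cross-validation, here is a hedged sketch that scans candidate values and plots the mean cross-validated accuracy (it assumes a feature matrix X_train and labels y_train like the ones built later in this post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

k_values = np.arange(1, 31)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
             for k in k_values]

# Pick the k where the curve peaks or levels off
plt.plot(k_values, cv_scores)
plt.xlabel('k (number of neighbors)')
plt.ylabel('Mean CV accuracy')
plt.show()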

Case overview

Next we will put the KNN algorithm into practice: we will use KNN to predict market trends and formulate a trading strategy based on the predictions. Here we simply use SPY data.

Import library

As usual, we first import the libraries needed for this project.

# Base libraries
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12,8)
plt.style.use('fivethirtyeight')

# Preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit

# Classifier
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Note: plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# on newer versions use ConfusionMatrixDisplay.from_estimator / RocCurveDisplay.from_estimator instead
from sklearn.metrics import plot_confusion_matrix, auc, roc_curve, plot_roc_curve

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Retrieve data

Data source: yfinance, SPY, 2000.1.1-2023.4.5
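
If you do not have the CSV locally, here is a hedged sketch of downloading the same data with yfinance and saving it (the dates follow the description above; depending on the yfinance version, the column layout may differ slightly):

import yfinance as yf

# Download daily SPY data and store it as SPY.csv for the next step
spy = yf.download('SPY', start='2000-01-01', end='2023-04-05', auto_adjust=False)
spy.to_csv('./SPY.csv')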

# Load locally stored data
df = pd.read_csv('./SPY.csv', index_col=0, parse_dates=True)
# Next day's log return: difference of log prices, shifted back by one day
df['Forward Returns'] = np.log(df['Adj Close']).diff().shift(-1)

# Drop missing values and inspect the data
df = df.dropna()
df


# Check for missing values
df.isnull().sum()


Feature engineering

Features here are independent variables used to determine the value of the target variable. We will create features and targets (labels) from the original dataset.

Target or label definition

The label or target variable is the dependent variable. Here, the target variable is to identify whether SPY will rise or fall at the close of the next trading day. If tomorrow's return is greater than the median forward return, we will buy SPY, otherwise we will sell SPY.

We assign +1 to the buy signal and -1 to the sell signal of the target variable. The goal can be described as:
$y_t = \begin{cases} +1, & r_{t+1} \ge Q_2(r_{t+1}) \\ -1, & r_{t+1} < Q_2(r_{t+1}) \end{cases}$

where $Q_2(r_{t+1})$ is the median (second quartile) of SPY's next-day return.

The reason for using the median rather than 0 as the cutoff for the target variable is that returns very close to 0 are hard to read as a signal that the stock is about to rise or fall. If 0 were used as the cutoff, the distribution of +1 and -1 labels would be uneven, which would hurt the classifier's performance.

Looking back at the data, we already have SPY's open, close, and other prices. These predictors could be used directly, but here we first build two technical indicators as features. The code is as follows.

# Predictors 
df['O-C'] = df.Open - df.Close
df['H-L'] = df.High - df.Low

# Stack the two features into a NumPy array
X = df[['O-C','H-L']].values
X[:5]

Using the median as the threshold keeps the numbers of +1 and -1 labels as equal as possible; the following code assigns +1 to half of the observations and -1 to the other half.

# Target 
y = np.where(df['Forward Returns'] >= np.quantile(df['Forward Returns'], q=0.5), +1, -1)
y
y.shape  # Target label should be 1-D
# Value counts for class +1 and -1
pd.Series(y).value_counts()

The output shows that the +1 and -1 classes are roughly balanced, as intended.

Split the dataset into training and test data

# Splitting the dataset into training and testing data
# Always keep shuffle = False for financial time series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Output of the train and test data size
print(f"Train and Test size {len(X_train), len(X_test)}")


Fit model

Since the KNN model calculates distances, the dataset needs to be scaled for the model to work properly: all features should have similar scales. For details and examples about data scaling, please see my first blog post (Preprocessing of financial data). Scaling can be done with the MinMaxScaler transformer.
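
As a quick sketch of what MinMaxScaler does (made-up numbers), each column is independently rescaled to the [0, 1] range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])
print(MinMaxScaler().fit_transform(data))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]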

# Scale and fit the model
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier())
])

pipe.fit(X_train, y_train)


# Target Classes
class_names = pipe.classes_
class_names


Predictions and Probabilities

# Predicting the test dataset
y_pred = pipe.predict(X_test)

acc_train = accuracy_score(y_train, pipe.predict(X_train))
acc_test = accuracy_score(y_test, y_pred)

print(f"Train accuracy: {
      
      acc_train:0.4}, Test accuracy: {
      
      acc_test:0.4}")


# predict probabilities
probs = pipe.predict_proba(X_test)
probs[:10]


# predict y
y_pred[:10]


Prediction quality

Confusion matrix

A confusion matrix is a tool for evaluating the prediction quality of a classification model or algorithm. It shows the correspondence between the model's predictions for each class and the true labels. From the confusion matrix we can calculate various evaluation metrics, such as accuracy, recall, precision and F1 score, to assess the model's performance.

The following is an example of a confusion matrix for a binary classification problem:

                        Predicted
                     Positive  Negative
Actual   Positive       TP        FN
         Negative       FP        TN

In a confusion matrix, there are four key terms:

  • True Positive (TP): The number of samples that the model correctly predicts as positive examples.
  • False Negative (FN): The number of samples that the model incorrectly predicts as negative examples.
  • False Positive (FP): The number of samples that the model incorrectly predicts as positive.
  • True Negative (TN): The number of samples that the model correctly predicts as negative examples.

From these values we can calculate the following evaluation metrics:

  1. Accuracy: Indicates the ratio of the number of samples predicted correctly by the model to the total number of samples. The calculation formula is (TP + TN) / (TP + TN + FP + FN).

  2. Recall: Indicates the ratio of the number of positive samples that the model can correctly predict to the number of real positive samples. The calculation formula is TP / (TP + FN).

  3. Precision: Indicates the proportion of samples predicted as positive by the model that are actually positive. The calculation formula is TP / (TP + FP).

  4. F1 Score: It takes precision and recall into consideration and is the harmonic average of precision and recall. The calculation formula is 2 * (precision * recall) / (precision + recall).

These indicators can help us judge the prediction performance of the model on different categories and select appropriate evaluation indicators to evaluate the model according to the requirements of specific problems.
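
As a small sketch (not part of the original workflow), these formulas can be checked directly against the four counts reported for the baseline model further below:

# Confusion-matrix counts (tn, fp, fn, tp) reported later in this post
tn, fp, fn, tp = 246, 338, 258, 329

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, f1)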

In the confusion matrix, "positive examples" and "negative examples" refer to the two categories to be classified. The exact meaning of these categories depends on the specific application scenario and problem.

Usually, "positive examples" represent events we are concerned about or targets of interest, while "negative examples" represent other events or non-targets. For example, in a spam classification problem, "positive examples" can represent spam emails, and "negative examples" can represent normal emails.

"True" and "false" refer to whether the prediction corresponds to external judgments (sometimes called "observations", "truth").

# Confusion matrix for binary classification
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)

The output result is: 246 338 258 329

# Plot confusion matrix
plot_confusion_matrix(pipe, X_test, y_test, cmap='Blues', values_format='.4g')
plt.title('Confusion Matrix')
plt.grid(False)

Classification reports measure the quality of classification algorithm predictions. Next, we output a classification report, in which evaluation indicators such as accuracy are calculated.

# Classification report
print(classification_report(y_test, y_pred))

In the classification report, in addition to the per-class metrics, two overall metrics are provided: the macro average and the weighted average. These help us understand the overall performance of the model.

The macro average is to calculate the average value of each category indicator, regardless of the difference in the number of category samples. Specific steps are as follows:

  1. For each category, the evaluation metrics for that category (such as accuracy, recall, precision, F1 score, etc.) are calculated.
  2. The evaluation metrics of all categories are averaged to obtain the macro average.

The calculation method of macro average is simple and clear, giving equal weight to the performance of each category, and is not affected by the number of samples in the category.

The weighted average is to calculate the weighted average of indicators in each category, taking into account the difference in the number of samples in each category. Specific steps are as follows:

  1. For each category, the evaluation metric for that category is calculated.
  2. Based on the number of samples in each category, the corresponding weight is calculated.
  3. The evaluation indicators of each category are multiplied by the corresponding weights and a weighted sum is performed.
  4. Divide the result of the weighted sum by the total number of samples in all categories to get the weighted average.

The weighted average takes into account the differences in the number of samples in different categories, and categories with larger sample sizes have greater weight when calculating the average.

Both macro averaging and weighted averaging provide a perspective on overall performance evaluation, but they handle imbalanced data sets differently. In cases where the number of samples is relatively balanced, both may give similar results, whereas in cases where the number of samples is unbalanced, the weighted average favors the performance of the larger category.

The choice between macro averaging or weighted averaging depends on the specific problem and focus. If all categories are equally important to you, use macro averaging. If you are more interested in categories with a larger number of samples, you can use weighted average.
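
A tiny illustrative sketch of the difference, using scikit-learn's f1_score on made-up, imbalanced labels:

from sklearn.metrics import f1_score

# Made-up imbalanced labels: class 1 has many more samples than class 0
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_hat  = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]

print(f1_score(y_true, y_hat, average='macro'))     # unweighted mean of per-class F1
print(f1_score(y_true, y_hat, average='weighted'))  # mean weighted by class support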

The metrics are almost all around 0.5, and one F1 score is as low as 0.45. The model's predictions show no clear edge, which suggests that its predictive ability is mediocre and that it cannot effectively distinguish positive from negative examples.

Area under the ROC curve

The area under the ROC curve (AUC) is a common indicator for measuring the performance of a binary classification model. The ROC curve plots the true positive rate (TPR, also known as recall) against the false positive rate (FPR) across different classification thresholds.

AUC represents the area under the ROC curve, and its value ranges from 0 to 1. The meaning of AUC is the ability of the model to correctly classify: the closer the AUC is to 1, it means that the model has a higher true positive rate under different thresholds and a lower false positive rate, that is, the model has better performance. The closer the AUC is to 0.5, the closer the model's performance is to random classification.

AUC has the following characteristics:

  1. Not affected by the classification threshold: AUC takes into account the model performance under different thresholds and is therefore not affected by the selection of the classification threshold.

  2. Friendly to class-imbalanced data sets: AUC is more stable for the problem of sample class imbalance than indicators such as accuracy, and can better reflect the model's discrimination between different classes.

  3. Provides the ability of the model to sort: AUC can convert the performance of the model into the sorting ability, that is, for any pair of samples, the probability of the model being correctly sorted.

Generally speaking, AUC is a commonly used evaluation metric used to measure the performance of two-classification models. A higher AUC value indicates that the model has better classification ability, while a lower AUC value may require further improvement of the model or consideration of other methods to improve prediction performance.
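
As a supplement to the ROC plot below, the AUC can also be computed directly from the predicted probabilities (this assumes the probs array from predict_proba above, whose second column corresponds to the +1 class):

from sklearn.metrics import roc_auc_score

# probs[:, 1] is the predicted probability of the +1 class
print(roc_auc_score(y_test, probs[:, 1]))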

# Random Prediction
r_prob = [0 for _ in range(len(y_test))]
r_fpr, r_tpr, _ = roc_curve(y_test, r_prob, pos_label=1)

# Plot ROC curve
plot_roc_curve(pipe, X_test, y_test)
plt.plot(r_fpr, r_tpr, linestyle='dashed', label='Random Prediction')
plt.title("Receiver Operating Characteristc for Up Moves ")
plt.legend(loc=9)
plt.show()

Let’s take a look at the classification report again.

# Classification Report
print(classification_report(y_test, y_pred))

The results don't look very good. Do we need to adjust the parameters? In practice we usually do think about tuning, but often, even after tuning, the results remain mediocre. That's fine; let's take a look first.

Hyperparameter tuning

Hyperparameter Tuning refers to systematically trying different combinations of hyperparameters in machine learning or deep learning models to find the best hyperparameter configuration to improve the performance and generalization capabilities of the model.

Hyperparameters refer to parameters that need to be set before training the model. Their values cannot be automatically learned through the training process, but are set by data scientists or machine learning engineers based on experience or experiments. Common hyperparameters include learning rate, regularization parameters, batch size, number of iterations, etc.

The goal of hyperparameter tuning is to find the best combination of hyperparameters so that the model can show the best performance on unseen data. Typically, a model is trained and evaluated by trying different hyperparameter combinations, and then adjusting the hyperparameter values based on the evaluation results, iterating until the optimal hyperparameter configuration is found.

Commonly used hyperparameter tuning methods include:

  1. Grid Search: Within the predefined hyperparameter range, exhaust all possible hyperparameter combinations and select the best combination based on cross-validation or performance evaluation on the validation set.

  2. Random Search: Within the predefined hyperparameter range, randomly select some hyperparameter combinations for training and evaluation, and adjust the hyperparameter search range based on the performance evaluation results to gradually approach the best hyperparameter combination.

  3. Bayesian Optimization: Use the Bayesian method to build a surrogate model. Based on the existing hyperparameter combinations and their performance evaluations, use probabilistic inference to select the next hyperparameter combination to try, gradually optimizing model performance.

  4. Reinforcement Learning: Model the hyperparameter tuning problem as a reinforcement learning problem, and learn to find the best hyperparameter combination through experiments and feedback in a simulation environment.

Hyperparameter tuning is an important step in optimizing model performance and can improve the accuracy, generalization ability and robustness of the model. By selecting the optimal hyperparameter configuration, the predictive ability of the model can be improved and better results achieved in practical applications.

The hyperparameter of the KNN method in this example is the number of neighbors k used in the model.
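
For reference, here is a hedged sketch of option 2 (random search) applied to the same pipeline; the post itself uses grid search below, so this is only an illustrative alternative:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Randomly sample 10 candidate values of k instead of trying all of them
rand_search = RandomizedSearchCV(
    pipe,
    param_distributions={"classifier__n_neighbors": np.arange(1, 51)},
    n_iter=10, scoring='roc_auc', cv=5, random_state=42)  # cv could also be the TimeSeriesSplit defined in the next section
# rand_search.fit(X_train, y_train)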

Time series cross-validation

Time Series Cross-Validation is a validation method used to evaluate and select time series models. Unlike traditional cross-validation methods (such as K-fold cross-validation), time series cross-validation takes into account the temporal order of time series data to ensure that no future information leakage occurs during the model evaluation process.

In time series cross-validation, the data set is divided into multiple consecutive training and test sets in chronological order. Specific division methods include the following common methods:

  1. Simple Cross-Validation: The data set is divided into two parts: a training set and a test set, usually in chronological order, the last part is used as the test set, and the remaining parts are used as the training set.

  2. Rolling Window Cross-Validation: Starting from the starting point of the data set, slide a fixed-size window in sequence, use the data in the window as the training set, and the data at the next time step as the test set, and repeat the model Evaluate.

  3. Expanding Window Cross-Validation: Starting from the starting point of the data set, gradually expand the size of the window, use the data within the window as the training set, and the data at the next time step as the test set, and repeat the model evaluation.

The key to the time series cross-validation method is to ensure that the samples in the test set are after the time period in the training set to avoid the model using future information for training and prediction. This can more accurately simulate model performance in real application scenarios and provide an assessment of the model's stability and generalization ability.

During the time series cross-validation process, various evaluation metrics (such as mean squared error, mean absolute error, prediction accuracy, etc.) can be used to measure the performance of the model, and the best model configuration, hyperparameters or features can be selected based on the evaluation results.

In summary, time series cross-validation is a validation method for time series data that can more accurately evaluate the performance of a time series model and provide an estimate of the predictive ability of future data.

In time series analysis, forward chaining is a method used to generate and evaluate predictions of future observations: a model is applied recursively, always based only on the sequence of observations known so far. To maintain temporal order, so that the training set always precedes the test set, we use a forward-chaining approach in which the model is initially trained and tested with windows of the same size. For each subsequent fold, the training window grows to include the previous training and test data, while the new test window again follows the training window and keeps the same length.

Let us use forward chaining to build 5 time-series cross-validation folds, each with its own training and test set. Recall that the target is created by looking into the future: we have to look at the stock's return one day ahead to label whether it will rise or fall. To avoid data leakage from this forward-looking variable, we set gap=1.

# Cross-validation
tscv = TimeSeriesSplit(n_splits=5, gap=1) 
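
To see what these folds look like, here is a small sketch that prints the index range of each split (it only inspects tscv and does not change the workflow):

# Inspect how TimeSeriesSplit grows the training window fold by fold
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_train), start=1):
    print(f"Fold {fold}: train {train_idx[0]}-{train_idx[-1]}, "
          f"test {test_idx[0]}-{test_idx[-1]}")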

Grid search

The traditional way of optimizing hyperparameters is usually to use grid search (also called parameter sweep). It is a manually specified exhaustive search of the hyperparameter space of a learning algorithm. Grid search algorithms must rely on some kind of performance metric for guidance, usually measured by cross-validation on a training set or evaluation on a validation set.

GridSearchCV performs an exhaustive search over specified parameter values for a given estimator. It implements methods such as "fit" and "score", and the parameters of the estimator are optimized by cross-validated grid search over the parameter grid.

# Get parameters list
pipe.get_params()


# Perform GridSearch and fit
param_grid = {"classifier__n_neighbors": np.arange(1, 51, 1)}    # try k from 1 to 50 in steps of 1 (an arbitrary choice)

grid_search = GridSearchCV(pipe, param_grid, scoring='roc_auc', n_jobs=-1, cv=tscv, verbose=1) 
# The default scoring would be the estimator's own score; here we specify ROC AUC as the objective,
# so the search looks for the model with the largest area under the ROC curve
grid_search.fit(X_train, y_train)


# Best Params
grid_search.best_params_


# Best Score
grid_search.best_score_

It's not very good either, because the classes are split so evenly. Having obtained k = 35 as the best parameter, let's train the model again.

# Instantiate a KNN model with the best k found (no further tuning, scaling, or pipeline here)
clf = KNeighborsClassifier(n_neighbors=grid_search.best_params_["classifier__n_neighbors"])
# Fit the model
clf.fit(X_train, y_train)


# Predicting the test dataset
y_pred = clf.predict(X_test)
# Measure Accuracy
acc_train = accuracy_score(y_train, clf.predict(X_train))
acc_test = accuracy_score(y_test, y_pred)
# Print Accuracy
print(f"\n Training Accuracy \t: {
      
      acc_train :0.4} \n Test Accuracy \t\t: {
      
      acc_test :0.4}" )


# Confusion Matrix for binary classification
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn,fp, fn, tp)


# Plot Confusion Matrix
plot_confusion_matrix(clf, X_test, y_test, cmap='Blues', values_format='.4g')
plt.title("Confusion Matrix")
plt.grid(False)


# Plot ROC curve 
# The AUC comes out around 0.51, not much of a change; generally an AUC of about 0.6 would already be quite good
plot_roc_curve(clf, X_test, y_test)


# Classification report
print(classification_report(y_test, y_pred))


Typically, a model is considered reasonably good when its AUC reaches 0.6 or higher. The test accuracy and AUC show that this model performs only slightly better than random guessing and has limited predictive power. An AUC of 0.51 means the model is barely better than random at distinguishing the two classes; since an AUC of 0.5 corresponds to random guessing, 0.51 is only a marginal improvement.

The test accuracy is 0.51, which means the model correctly predicts the binary outcome for just over half of the test samples. That is not a high accuracy, and there is still substantial room for improvement.

Since this is only an example, you can try the same approach with data from other stocks or by constructing a portfolio.

Trading strategy

Based on the target (or classifier) we defined, we formulate a trading strategy. Assuming no transaction costs, if a buy signal occurs, we buy one share of SPY today; if a sell signal occurs, we short one share of SPY today.

The reason we only operate one stock at a time is to make comparisons between strategy returns and future returns more intuitive.

Having tuned the model, we apply the KNN predictions and generate a new variable called "Signal", which represents the predicted rise or fall for the next day: Signal = +1 for a predicted rise and Signal = -1 for a predicted fall.

The strategy return is calculated as:
Strategy return = Forward return × Signal

# Subsume into a new dataframe
df1 = df.copy()                   # alternatively, df[-len(X_test):] would keep only the test period
df1['Signal'] = clf.predict(X)    # or clf.predict(X_test); the predicted +1/-1 is today's prediction of tomorrow's move
# Strategy Returns
df1['Strategy'] = df1['Forward Returns'] * df1['Signal'].fillna(0)
# Check the output
df1.tail(10)


# plot
plt.plot(np.cumsum(df1['Strategy']))          # cumulative strategy returns
plt.plot(np.cumsum(df1['Forward Returns']))   # cumulative market returns

The resulting chart shows the cumulative sum of market returns (red line) and strategy returns (blue line).

Return analysis

Here is some code to analyze and plot the strategy returns with pyfolio.

# !pip install pyfolio
import pyfolio as pf
# Create a tear sheet using pyfolio for the out-of-sample period (X_test); the strategy returns must be indexed by date, otherwise the statistics will not line up
pf.create_simple_tear_sheet(df1['Strategy'])

You can also set a live start date. For example:

# Live start date 2016.4.7
pf.create_simple_tear_sheet(df1['Strategy'], live_start_date='2016-04-07')

Draw heatmap.

pf.plot_monthly_returns_heatmap(df1['Strategy'])

Draw a bar chart of annual returns.

pf.plot_annual_returns(df1['Strategy'])


pf.plot_monthly_returns_dist(df1['Strategy'])


Origin blog.csdn.net/Pet_a_Cat/article/details/131516273