[Data analysis] Predictive analysis using machine learning algorithms (3): K-Nearest Neighbours (2021-01-17)

Machine learning methods in time series forecasting (3): K-Nearest Neighbours

This is the third article in the series "Machine Learning Methods in Time Series Forecasting". If you are interested, you can read the previous articles first:
[Data Analysis] Using Machine Learning Algorithms for Predictive Analysis (1): Moving Average
[Data Analysis] Predictive analysis using machine learning algorithms (2): Linear Regression

1. Introduction

Let's start with a simple example: given the age, height and weight of ten people, infer the weight of an eleventh person whose age and height are known.
The idea of KNN here is to average the weights of the people closest to No. 11. In the two dimensions of height and age, Nos. 4, 6, 5 and 1 are relatively close to No. 11. If the number of neighbours k is set to 3, we take the three nearest points, for example Nos. 1, 5 and 6, and infer that the weight of No. 11 is (77+72+60)/3 ≈ 69.67.
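
The averaging step can be sketched in a few lines of NumPy. This is only an illustration: the ages and heights below are made-up values and `knn_predict` is a hypothetical helper name. Only the three weights 77, 72 and 60 come from the example above.

```python
import numpy as np

def knn_predict(features, targets, query, k=3):
    """Average the targets of the k rows of `features` closest to `query`."""
    # Euclidean distance from the query point to every known point
    distances = np.linalg.norm(features - query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return targets[nearest].mean()

# hypothetical (age, height in cm) pairs; NOT the data from the original table
features = np.array([[25, 170], [27, 168], [24, 166], [45, 150], [50, 185]], dtype=float)
targets = np.array([77, 72, 60, 55, 90], dtype=float)  # weights
query = np.array([26, 168], dtype=float)               # the person to predict

prediction = knn_predict(features, targets, query, k=3)
print(round(prediction, 2))  # 69.67
```

In practice the features would usually be rescaled first, since age and height are measured in different units and the larger-valued one would otherwise dominate the distance.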

2. Stock price prediction based on nearest neighbor algorithm

The data set is the same as in the previous two articles, so that the predictions of the different algorithms can be compared on the same data. The data set and code are on my GitHub, and anyone who needs them can download them there.

Import the package and read in the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the data
df = pd.read_csv('NSE-TATAGLOBAL11.csv')

The following data processing is similar to that in the linear regression article, so it is not explained again here.

# setting the index as date
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']

#creating dataframe with date and the target variable
data = df.sort_index(ascending=True, axis=0)
# take the two columns directly; filling an empty frame row by row relies on
# chained assignment, which pandas warns against and which may silently fail
new_data = data[['Date', 'Close']].reset_index(drop=True)

#create features
# import path as in fastai v1; newer fastai releases expose add_datepart in fastai.tabular.core
from fastai.tabular import add_datepart
add_datepart(new_data, 'Date')
new_data.drop('Elapsed', axis=1, inplace=True)  #elapsed will be the time stamp

# flag Mondays (Dayofweek == 0) and Fridays (Dayofweek == 4)
new_data['mon_fri'] = new_data['Dayofweek'].isin([0, 4]).astype(int)
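
For readers who want to see what these calendar features look like without installing fastai, here is a rough pandas-only sketch. It reproduces only a few of the columns that add_datepart creates, and the three dates are made-up sample values, not rows from the data set.

```python
import pandas as pd

# three made-up trading dates: a Friday, a Monday and a Tuesday
dates = pd.to_datetime(pd.Series(['2018-10-05', '2018-10-08', '2018-10-09']))

features = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day': dates.dt.day,
    'Dayofweek': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
# 1 on Mondays and Fridays, 0 otherwise
features['mon_fri'] = features['Dayofweek'].isin([0, 4]).astype(int)
print(features)
```
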
#split into train and validation
train = new_data[:987]
valid = new_data[987:]

x_train = train.drop('Close', axis=1)
y_train = train['Close']
x_valid = valid.drop('Close', axis=1)
y_valid = valid['Close']

Import the packages related to the nearest neighbor model.

#importing libraries
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler 

Normalize the features. KNN is distance-based, so features on larger scales would otherwise dominate the distance.

#scaling data
scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)  # fit the scaler on the training data and normalize it
x_train = pd.DataFrame(x_train_scaled)
x_valid_scaled = scaler.transform(x_valid)  # reuse the training min/max; do not refit on the validation data
x_valid = pd.DataFrame(x_valid_scaled)
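
A common convention is to fit the scaler on the training data only and then apply the same transformation to the validation data, so that both sets share one scale. A minimal sketch with made-up (year, month) values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up (year, month) rows, not taken from the stock data
train_features = np.array([[2013.0, 1.0], [2014.0, 6.0], [2015.0, 12.0]])
valid_features = np.array([[2016.0, 3.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_features)  # learns min/max from the training rows
valid_scaled = scaler.transform(valid_features)      # applies the training min/max unchanged

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(valid_scaled)
```

Here the validation year 2016 lies beyond the training maximum of 2015, so it maps to 1.5 rather than into [0, 1]. That is expected when the validation period follows the training period.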

Take a look at what the normalized data looks like.

x_train

Use GridSearchCV to find the best parameters.

#using gridsearch to find the best parameter
params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)

Fit the model and make predictions.

#fit the model and make predictions
model.fit(x_train,y_train)
preds = model.predict(x_valid)
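
After fitting, GridSearchCV exposes the winning setting through `best_params_` and its mean cross-validated score through `best_score_` (R² by default for a regressor). The sketch below shows this on synthetic data. The data is randomly generated for illustration and has nothing to do with the stock prices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=200)

param_grid = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the n_neighbors value with the best mean CV score
print(search.best_score_)   # mean cross-validated R^2 of that setting

# predict() uses the best estimator, refitted on all of X and y
sample_preds = search.predict(X[:5])
```
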

The RMSE (root mean squared error) measures the typical size of the prediction error, in the same units as the closing price. The lower it is, the better.

#rmse
rmse = np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
rmse 
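
To make the formula concrete, here is a small worked example. The four price pairs are made-up numbers and the function name `rmse_score` is only illustrative.

```python
import numpy as np

def rmse_score(actual, predicted):
    """Square root of the mean of the squared differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

actual = [100.0, 102.0, 105.0, 103.0]
predicted = [98.0, 102.0, 109.0, 103.0]

# errors: 2, 0, -4, 0 -> squares: 4, 0, 16, 0 -> mean: 5 -> root: about 2.236
score = rmse_score(actual, predicted)
print(round(score, 3))  # 2.236
```

Because the errors are squared before averaging, the single error of 4 contributes far more to the result than the error of 2.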

Plot the predictions against the actual values to inspect them visually.

#plot
valid = valid.copy()  # avoid pandas' SettingWithCopyWarning when adding a column to a slice
valid['Predictions'] = preds
plt.figure(figsize=(16,8))
plt.plot(valid[['Close', 'Predictions']])
plt.plot(train['Close'])
plt.show()

As the plot shows, KNN does not predict this data set well.

Origin blog.csdn.net/be_racle/article/details/112747349