Time series forecasting is an important technique in data science and business analytics for predicting future values from historical data. It has a wide range of applications, from demand planning and sales forecasting to econometric analysis. Python has become a popular language for predictive modeling thanks to its versatility and its ecosystem of specialized libraries. One such library, tailored for time series forecasting tasks, is skforecast.
In this article, we introduce skforecast and demonstrate how to use it to generate forecasts on time series data. A valuable feature of the skforecast library is its ability to train and forecast on data without a datetime index.
Dataset
The dataset used in this article comes from Kaggle and provides a comprehensive window into various physical activities through accelerometer data. Here we extract only the acceleration signal for the walking activity of one participant.
The dataset is available here: https://www.kaggle.com/datasets/gaurav2022/mobile-health
Hyperparameter tuning and lag selection
Step 1: Divide the time series signal into a training set, a validation set and a test set.
end_train = 2500
end_val = 2750
data_train = acc_x_walking[:end_train]
data_val = acc_x_walking[end_train:end_val]
data_test = acc_x_walking[end_val:]
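The split above can be sanity-checked on a synthetic stand-in of the same shape. The sketch below uses a plain Python list of 3,000 samples (the length is an assumption for illustration; the real length comes from the Kaggle signal):

```python
# Synthetic stand-in for acc_x_walking (length of 3000 is assumed for illustration)
acc_x_walking = list(range(3000))

end_train = 2500
end_val = 2750

data_train = acc_x_walking[:end_train]        # first 2500 samples
data_val = acc_x_walking[end_train:end_val]   # next 250 samples
data_test = acc_x_walking[end_val:]           # remaining samples

print(len(data_train), len(data_val), len(data_test))  # 2500 250 250
```

The same slicing works unchanged on a pandas Series, which is what the article's code operates on.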
Skforecast follows an API similar to scikit-learn, a framework many people are already familiar with, so hyperparameter tuning and lag selection for the five models is a straightforward process.
RandomForestRegressor, GradientBoostingRegressor, Ridge, LGBMRegressor, and XGBRegressor are all regressors that predict continuous values. We can therefore search for the parameters of each model that minimize the mean squared error; these tuned parameters are then used in the models that forecast the walking activity.
Lags determine the maximum number of past values (time steps) used as features to predict the future, i.e., how many previous observations the model considers as input when predicting the next one.
Steps specify how far into the future to predict, i.e., the forecast horizon: the number of time steps the model should forecast.
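To make the lags idea concrete, here is a tiny sketch in pure Python (no skforecast) of how a setting like lags=2 turns a series into a feature matrix: each row holds the two most recent observations used to predict the current one. The helper name is ours, not part of skforecast:

```python
def make_lag_features(series, lags):
    """Build (features, target) pairs: each row of X holds the
    `lags` most recent past values used to predict y at that step."""
    X, y = [], []
    for t in range(lags, len(series)):
        X.append(series[t - lags:t])  # the `lags` values just before t
        y.append(series[t])           # the value to predict
    return X, y

series = [10, 20, 30, 40, 50]
X, y = make_lag_features(series, lags=2)
print(X)  # [[10, 20], [20, 30], [30, 40]]
print(y)  # [30, 40, 50]
```

Skforecast builds an equivalent matrix internally when a forecaster is fitted; at prediction time it feeds its own forecasts back in as lagged inputs to reach the requested number of steps.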
# Libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster

# Models to compare
models = [RandomForestRegressor(random_state=42),
          GradientBoostingRegressor(random_state=42),
          Ridge(random_state=42),
          LGBMRegressor(random_state=42),
          XGBRegressor(random_state=42)]

# Hyperparameters to search for each model
param_grids = {'RandomForestRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'GradientBoostingRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'Ridge': {'alpha': [0.01, 0.1, 1]},
               'LGBMRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'XGBRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]}}

# Lags used as predictors
lags_grid = [2, 5, 7]

df_results = pd.DataFrame()
for i, model in enumerate(models):
    print(f"Grid search for regressor: {model}")
    print("-------------------------")
    forecaster = ForecasterAutoreg(
                     regressor = model,
                     lags      = 2
                 )
    # Regressor hyperparameters
    param_grid = param_grids[list(param_grids)[i]]
    results = grid_search_forecaster(
                  forecaster         = forecaster,
                  y                  = data_train,
                  param_grid         = param_grid,
                  lags_grid          = lags_grid,
                  steps              = 250,
                  refit              = False,
                  metric             = 'mean_squared_error',
                  initial_train_size = 50,
                  fixed_train_size   = True,
                  return_best        = False,
                  n_jobs             = 'auto',
                  verbose            = False,
                  show_progress      = True
              )
    # Create a column with the model name
    results['model'] = list(param_grids)[i]
    df_results = pd.concat([df_results, results])

df_results = df_results.sort_values(by='mean_squared_error')
df_results.head(10)
The result of the hyperparameter tuning process is a DataFrame listing, for each model, its mean squared error and the corresponding parameters, as shown below.
Hyperparameter tuning yields the following optimal parameters for each model:
GradientBoostingRegressor
max_depth=30
n_estimators=10
lags = 2
Ridge
alpha=1
lags = 2
RandomForestRegressor
max_depth=5
n_estimators=100
lags = 7
LGBMRegressor
max_depth=15
n_estimators=10
lags = 5
XGBRegressor
max_depth=5
n_estimators=10
lags = 2
Prediction
We now know the best parameters for each model and can start training. Split the data into training and test sets. The reason we separated a validation set from the test set above is that the test set does not participate in hyperparameter tuning and so remains completely unseen by the models.
# Split train-test
step_size = 250
data_train = acc_x_walking[:-step_size]
data_test = acc_x_walking[-step_size:]
The next step is to create and fit a predictive model.
# Create and fit forecasters
# GradientBoostingRegressor
gb_forecaster = ForecasterAutoreg(
                    regressor = GradientBoostingRegressor(random_state=42, max_depth=30, n_estimators=10),
                    lags      = 2
                )
# Ridge
r_forecaster = ForecasterAutoreg(
                   regressor = Ridge(random_state=42, alpha=1),
                   lags      = 2
               )
# RandomForestRegressor
rf_forecaster = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=42, max_depth=5, n_estimators=100),
                    lags      = 7
                )
# LGBMRegressor
lgbm_forecaster = ForecasterAutoreg(
                      regressor = LGBMRegressor(random_state=42, max_depth=15, n_estimators=10),
                      lags      = 5
                  )
# XGBRegressor
xgb_forecaster = ForecasterAutoreg(
                     regressor = XGBRegressor(random_state=42, max_depth=5, n_estimators=10),
                     lags      = 2
                 )

# Fit
gb_forecaster.fit(y=data_train)
r_forecaster.fit(y=data_train)
rf_forecaster.fit(y=data_train)
lgbm_forecaster.fit(y=data_train)
xgb_forecaster.fit(y=data_train)

# Predict
gb_predictions = gb_forecaster.predict(steps=step_size)
r_predictions = r_forecaster.predict(steps=step_size)
rf_predictions = rf_forecaster.predict(steps=step_size)
lgbm_predictions = lgbm_forecaster.predict(steps=step_size)
xgb_predictions = xgb_forecaster.predict(steps=step_size)
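Beyond eyeballing the plots, each model's forecast can be scored against data_test with the same metric used during tuning. The sketch below computes the mean squared error with plain NumPy on synthetic arrays; the variable names mirror the article, but the data here is made up:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-ins for data_test and one model's predictions
data_test = np.array([0.0, 1.0, 2.0, 3.0])
gb_predictions = np.array([0.0, 1.0, 2.0, 5.0])  # off by 2 on the last step

print(mse(data_test, gb_predictions))  # (0 + 0 + 0 + 4) / 4 = 1.0
```

In the real pipeline you would call mse(data_test, gb_predictions) and so on for each of the five forecasters, which makes the flat-line predictions discussed below immediately visible as large errors.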
The figure below shows the prediction results of the five models. It is obvious that all models except gradient boosting produce flat-line predictions. There could be several reasons for this; for example, since the focus of this article is introducing skforecast, only a few hyperparameters were tuned, so the other models may be underfit. This could be improved with further tuning.
Conclusion
skforecast is a very good choice for time series forecasting in Python. It is easy to use and a great tool for predicting future values from historical data.
In this article, skforecast was used to tune hyperparameters and select lags for standard regression models such as RandomForestRegressor, GradientBoostingRegressor, Ridge, LGBMRegressor, and XGBRegressor.
A significant advantage of skforecast is the user-friendly documentation, which clearly explains the model's functions and parameters. If you are looking for an easy and efficient way to explore time series forecasting, skforecast is a very good choice.