Time series forecasting using skforecast

Time series forecasting is an important technique in data science and business analysis for predicting future values based on historical data. It has a wide range of applications, from demand planning and sales forecasting to econometric analysis. Python has become a popular language for predictive modeling due to its versatility and the availability of specialized libraries. One such library, tailored for time series forecasting tasks, is skforecast.

In this article, we introduce skforecast and demonstrate how to use it to generate forecasts on time series data. A valuable feature of skforecast is its ability to train and forecast on data without a datetime index.
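For instance, a forecaster can be fitted directly on a pandas Series with a plain RangeIndex. Below is a minimal sketch of that workflow; the synthetic sine series and the Ridge regressor are illustrative choices only, not part of the original example:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# A synthetic series with a plain RangeIndex -- no datetime index required
y = pd.Series(np.sin(np.linspace(0, 20, 200)))

forecaster = ForecasterAutoreg(regressor=Ridge(), lags=2)
forecaster.fit(y=y)

# Forecast the next 5 steps; predictions continue the integer index (200..204)
print(forecaster.predict(steps=5))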

Dataset

The dataset used in this article comes from Kaggle and provides a comprehensive window into various physical activities through accelerometer data. Here we extract only the acceleration signal for the walking activity of one participant.

The data set is available here: https://www.kaggle.com/datasets/gaurav2022/mobile-health
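For reference, a loading step along the following lines produces the acc_x_walking series used below. The file name, column names, and the activity code for walking are assumptions about the Kaggle file, so verify them against the dataset's data card:

import pandas as pd

# NOTE: file name, column names, and activity code below are assumptions --
# check them against the Kaggle dataset before running
df = pd.read_csv('mhealth_raw_data.csv')

# Keep a single participant and the walking activity
walking = df[(df['subject'] == 'subject1') & (df['Activity'] == 4)]

# x-axis acceleration signal with a plain 0..n-1 index (no datetime index)
acc_x_walking = walking['acc_x'].reset_index(drop=True)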

Hyperparameter tuning and lag selection

Step 1: Divide the time series signal into a training set, a validation set and a test set.

end_train = 2500
end_val = 2750

data_train = acc_x_walking[:end_train]
data_val = acc_x_walking[end_train:end_val]
data_test = acc_x_walking[end_val:]
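The split can be visualized with a few lines of matplotlib (the plotting code is an illustrative sketch, not from the original article):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(11, 3))
data_train.plot(ax=ax, label='train')
data_val.plot(ax=ax, label='validation')
data_test.plot(ax=ax, label='test')
ax.legend()
plt.show()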

[Figure: training, validation, and test split of the walking acceleration signal]

Skforecast uses an API similar to scikit-learn, a framework many people are already familiar with, so hyperparameter tuning and lag selection for the five models is a straightforward process.

RandomForestRegressor, GradientBoostingRegressor, Ridge, LGBMRegressor, and XGBRegressor are all regressors that predict continuous values, so for each model we can search for the parameters that minimize the mean squared error. These tuned parameters are then used in the models that predict the walking activity.

The lags parameter determines the maximum number of past values (time steps) that will be used as features to predict the future, i.e., how many past observations are considered as input features for predicting the next observation.

The steps parameter specifies the number of steps into the future to predict. It represents the forecast horizon, i.e., the number of time steps the model should predict.
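To make the role of the lags concrete, ForecasterAutoreg exposes the training matrix it builds internally; with lags=2 every row holds the two previous observations as features. A small sketch on a toy series:

import pandas as pd
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoreg import ForecasterAutoreg

y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

forecaster = ForecasterAutoreg(regressor=Ridge(), lags=2)
X_train, y_train = forecaster.create_train_X_y(y=y)
print(X_train)  # columns lag_1 and lag_2: the previous two values of y
print(y_train)  # the target value at each step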

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster

# Models to compare
models = [RandomForestRegressor(random_state=42),
          GradientBoostingRegressor(random_state=42),
          Ridge(random_state=42),
          LGBMRegressor(random_state=42),
          XGBRegressor(random_state=42)]

# Hyperparameters to search for each model
param_grids = {'RandomForestRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'GradientBoostingRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'Ridge': {'alpha': [0.01, 0.1, 1]},
               'LGBMRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]},
               'XGBRegressor': {'n_estimators': [10, 50, 100], 'max_depth': [5, 15, 30, 45, 60]}}

# Lags used as predictors
lags_grid = [2, 5, 7]

df_results = pd.DataFrame()
for i, model in enumerate(models):

    print(f"Grid search for regressor: {model}")
    print("-------------------------")

    forecaster = ForecasterAutoreg(
                     regressor = model,
                     lags      = 2
                 )

    # Hyperparameter grid for this regressor
    param_grid = param_grids[list(param_grids)[i]]

    results = grid_search_forecaster(
                  forecaster         = forecaster,
                  y                  = data_train,
                  param_grid         = param_grid,
                  lags_grid          = lags_grid,
                  steps              = 250,
                  refit              = False,
                  metric             = 'mean_squared_error',
                  initial_train_size = 50,
                  fixed_train_size   = True,
                  return_best        = False,
                  n_jobs             = 'auto',
                  verbose            = False,
                  show_progress      = True
              )

    # Add a column with the model name
    results['model'] = list(param_grids)[i]

    df_results = pd.concat([df_results, results])

df_results = df_results.sort_values(by='mean_squared_error')
df_results.head(10)

The result of the hyperparameter tuning process is a DataFrame listing each model along with its mean squared error and parameter combination, as shown below.

[Table: grid search results sorted by mean squared error]
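Since df_results is already sorted by error, the best configuration per model can be pulled out directly (a small pandas snippet; the column names follow grid_search_forecaster's output):

# Lowest-MSE row for each model (df_results is already sorted)
best_per_model = df_results.drop_duplicates(subset='model', keep='first')
print(best_per_model[['model', 'lags', 'params', 'mean_squared_error']])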

Through hyperparameter tuning, the optimal parameters obtained for each model are:

GradientBoostingRegressor

  • max_depth=30

  • n_estimators=10

  • lags = 2

Ridge

  • alpha=1

  • lags = 2

RandomForestRegressor

  • max_depth=5

  • n_estimators=100

  • lags = 7

LGBMRegressor

  • max_depth=15

  • n_estimators=10

  • lags = 5

XGBRegressor

  • max_depth=5

  • n_estimators=10

  • lags = 2

Prediction

We now know the best parameters to apply to each model and can start training. Split the data into training and test sets. The reason we separated a validation set from the test set above is that the test set does not take part in the hyperparameter tuning process and remains completely unseen by the models.

# Split train-test
step_size = 250
data_train = acc_x_walking[:-step_size]
data_test = acc_x_walking[-step_size:]

[Figure: training and test split of the walking acceleration signal]

The next step is to create and fit the forecasters.

# Create and fit forecasters
# GradientBoostingRegressor
gb_forecaster = ForecasterAutoreg(
                    regressor = GradientBoostingRegressor(random_state=42, max_depth=30, n_estimators=10),
                    lags      = 2
                )

# Ridge
r_forecaster = ForecasterAutoreg(
                   regressor = Ridge(random_state=42, alpha=1),
                   lags      = 2
               )

# RandomForestRegressor
rf_forecaster = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=42, max_depth=5, n_estimators=100),
                    lags      = 7
                )

# LGBMRegressor
lgbm_forecaster = ForecasterAutoreg(
                      regressor = LGBMRegressor(random_state=42, max_depth=15, n_estimators=10),
                      lags      = 5
                  )

# XGBRegressor
xgb_forecaster = ForecasterAutoreg(
                     regressor = XGBRegressor(random_state=42, max_depth=5, n_estimators=10),
                     lags      = 2
                 )

# Fit
gb_forecaster.fit(y=data_train)
r_forecaster.fit(y=data_train)
rf_forecaster.fit(y=data_train)
lgbm_forecaster.fit(y=data_train)
xgb_forecaster.fit(y=data_train)

# Predict
gb_predictions = gb_forecaster.predict(steps=step_size)
r_predictions = r_forecaster.predict(steps=step_size)
rf_predictions = rf_forecaster.predict(steps=step_size)
lgbm_predictions = lgbm_forecaster.predict(steps=step_size)
xgb_predictions = xgb_forecaster.predict(steps=step_size)
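Before looking at the plots, each model's test-set error can be checked with the same metric used during the grid search (this evaluation loop is a sketch added here, not part of the original code):

from sklearn.metrics import mean_squared_error

for name, preds in [('GradientBoosting', gb_predictions),
                    ('Ridge', r_predictions),
                    ('RandomForest', rf_predictions),
                    ('LGBM', lgbm_predictions),
                    ('XGB', xgb_predictions)]:
    print(f"{name}: test MSE = {mean_squared_error(data_test, preds):.4f}")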

The figure below shows the prediction results of the five models. It is obvious that all of the models except gradient boosting produce flat-line predictions. There can be several reasons for this; for example, since the focus here is on introducing skforecast, we searched only a small hyperparameter grid, so the other models may be underfitting. Their results could be improved with further tuning.
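A comparison plot like the one below can be produced with a simple overlay (an illustrative matplotlib sketch):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(11, 3))
data_test.plot(ax=ax, label='test', color='black')
gb_predictions.plot(ax=ax, label='GradientBoosting')
r_predictions.plot(ax=ax, label='Ridge')
rf_predictions.plot(ax=ax, label='RandomForest')
lgbm_predictions.plot(ax=ax, label='LGBM')
xgb_predictions.plot(ax=ax, label='XGB')
ax.legend()
plt.show()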

[Figure: predictions of the five models against the test set]

Conclusion

skforecast is a very good choice for getting started with time series forecasting in Python. It is easy to use and a great tool for predicting future values based on historical data.

Throughout this article, skforecast's features were used to tune hyperparameters and select lags for standard regression models: RandomForestRegressor, GradientBoostingRegressor, Ridge, LGBMRegressor, and XGBRegressor.

A significant advantage of skforecast is its user-friendly documentation, which clearly explains the functions and parameters of each model. If you are looking for an easy and efficient way to explore time series forecasting, skforecast is a very good choice.
