Practical insights | Exploration and application of deep multivariate time series models in Ctrip key-indicator prediction scenarios

About the Author

Doublering is a senior algorithm engineer at Ctrip, focusing on natural language processing, large language models, time series forecasting, and related fields.

1. Background

In the Internet industry, there are many key indicators that directly affect the company's future planning and decisions, such as traffic, order volume, sales, etc. Effectively predicting these key indicators can help companies make corresponding budgets, plans, and decisions in advance to maximize profits.

Forecasting key indicators is a typical time series forecasting problem: predicting values over a future period based on the indicator's historical data. Ctrip has a number of business scenarios of this kind. This article takes forecasting traffic, order volume, and GMV as examples to introduce some of the methods and thinking we apply to time series forecasting.

2. Problem Definition and Difficulties

2.1 Scope definition

Forecast target values: key indicators such as traffic, order volume and GMV.

Forecast duration: next 30 days.

Focus: forecasting around holidays, including statutory holidays such as Tomb-Sweeping Day and Labor Day, as well as the periods immediately before and after them. Forecasts must be produced some time ahead of each statutory holiday to inform business planning at key points in time.

2.2 Difficulties

In real-world scenarios, time series forecasting is affected by factors that are hard to quantify, such as macro policy, natural disasters, and social events; these show up in the data as change points and aperiodic behavior. Moreover, because the data granularity is daily, domains with short histories offer few training samples. The model also needs to support multiple auxiliary features, such as holiday features, time features, and various covariates. Finally, to plan further ahead, long-horizon forecasts are needed: predicting values for the next month, half year, or even a full year.

3. Scheme design

3.1 Data selection and feature construction

We select the time series of key indicators over recent years at a daily granularity. Plotting the historical data of the various indicators reveals a clear holiday effect: each peak corresponds to a statutory holiday or to the winter/summer vacation periods, and the regular rises and falls correspond to working days and non-working days.

From this, we construct 7 holiday/time features from the series: whether the prediction day is a holiday; whether it is a working day; which day of the holiday it is; the number of days until the next holiday; the day of the week (Sunday = 1); the week number within the year; and the season.
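As a concrete illustration, the sketch below derives these features with pandas; the holiday calendar here is a hypothetical stand-in for the real one maintained internally.

```python
import numpy as np
import pandas as pd

# Hypothetical statutory-holiday calendar: every date inside each holiday window.
HOLIDAYS = pd.to_datetime([
    "2024-04-04", "2024-04-05", "2024-04-06",                              # Tomb-Sweeping Day
    "2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04", "2024-05-05",  # Labor Day
])

def build_time_features(dates: pd.DatetimeIndex) -> pd.DataFrame:
    df = pd.DataFrame(index=dates)
    df["is_holiday"] = dates.isin(HOLIDAYS).astype(int)
    # Working day: Mon-Fri and not a statutory holiday (make-up workdays ignored here).
    df["is_workday"] = ((dates.weekday < 5) & (df["is_holiday"] == 0)).astype(int)
    # Day index within a holiday window (1st, 2nd, ... day of the holiday; 0 otherwise).
    run = (df["is_holiday"] != df["is_holiday"].shift()).cumsum()
    df["holiday_day_index"] = df["is_holiday"].groupby(run).cumsum()
    # Days from the prediction day to the next holiday (-1 if none left in the calendar).
    hsorted = np.sort(HOLIDAYS.values)
    idx = np.searchsorted(hsorted, dates.values)
    df["days_to_next_holiday"] = [
        int((hsorted[i] - d) / np.timedelta64(1, "D")) if i < len(hsorted) else -1
        for i, d in zip(idx, dates.values)
    ]
    df["day_of_week"] = (dates.weekday + 1) % 7 + 1   # Sunday = 1, Monday = 2, ...
    df["week_of_year"] = dates.isocalendar().week.values
    df["season"] = dates.month % 12 // 3 + 1          # 1 = Dec-Feb, ..., 4 = Sep-Nov
    return df

features = build_time_features(pd.date_range("2024-04-01", "2024-05-10"))
```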


We also noticed that the overall trends of the various indicators are correlated: when predicting one indicator, the others can be fed into the model as features to improve accuracy. In total, about 20 features were constructed.

3.2 Model introduction

Broadly, time series prediction models fall into three categories: traditional statistical models, such as moving averages, ARIMA, and exponential smoothing; machine learning models, such as linear regression, tree models, and Prophet; and deep learning models, such as temporal convolutional networks (TCN), LSTM, and Transformer.

In current industrial practice, traditional time series models are the common choice. They are interpretable, simple, intuitive, and theoretically mature, but they can usually only forecast a single variable. Our task involves estimating multiple key indicators, with multiple features influencing the prediction and the indicators influencing one another. Moreover, for multi-step forecasting, traditional models typically adopt a rolling strategy: the previous step's prediction is fed back into the model as if it were an observed value to produce the next step. This causes errors to accumulate, so multi-step predictions become progressively less accurate.

Machine learning methods can use multiple variables and learn complex patterns and trends in the data, but they require training a separate model per indicator and suffer the same error accumulation in multi-step prediction.

In recent years, deep learning methods such as temporal convolutional networks (TCN), LSTM, and Transformer have also been widely applied to time series prediction. These methods overcome the shortcomings above: they support multivariate input, extract features adaptively, perform multi-step prediction directly, and output the predicted values of multiple indicators at once. The practical part of this article therefore uses deep learning methods. The following briefly introduces the models involved.

3.2.1 Prophet

Prophet is a time series forecasting model developed by Facebook. It is simple, easy to use, efficient, and highly interpretable. Prophet treats the series as a function of t and decomposes it into a trend term, a seasonal term, a holiday term, and so on, with both additive and multiplicative modes. The core formula of the additive model is:

y(t)=g(t)+s(t)+h(t)+ϵt

Among them, g(t) represents the trend term, s(t) represents the seasonal term, h(t) represents the holiday term (or generally refers to external variables), and ϵt represents the noise term.

The trend term fits non-periodic changes in the series, such as upward and downward movement, and comes in linear and nonlinear variants. The linear trend is:

g(t)=kt+m

The formula for a nonlinear trend is:

g(t) = C / (1 + exp(-k(t - m)))

Here C is the carrying capacity, the upper bound that g(t) can approach; k is the growth rate; and m is an offset parameter marking where the slope of the trend changes, so adjusting m shifts the curve left or right. In the actual implementation, C and k are both functions of t and k changes discontinuously; to keep g(t) continuous, a series of more involved transformations is introduced that we will not detail here.

The seasonal term is used to fit the periodic trends of weeks, months, quarters, etc., and is approximated using Fourier series:

s(t) = Σ_{n=1..N} [ a_n·cos(2πnt/P) + b_n·sin(2πnt/P) ]

Denote

X(t) = [cos(2πt/P), sin(2πt/P), …, cos(2πNt/P), sin(2πNt/P)]

Then s(t) = X(t)β, where β = [a_1, b_1, …, a_N, b_N]ᵀ is learned during fitting and follows the normal distribution N(0, σ²).

Here P is the period (365.25 for yearly seasonality, 7 for weekly), and N is the number of Fourier terms used in the approximation (10 for yearly, 3 for weekly).

The holiday term captures the impact of potential jump points, such as holidays and special events. Since each holiday affects the series differently, different holidays are treated as independent effects, and each can be given its own forward and backward window, meaning a holiday influences the series for some period before and after it. Suppose there are L kinds of holidays and D_i is the set of dates in the window of the i-th holiday; the holiday term is:

Z(t) = [1(t∈D_1), …, 1(t∈D_L)]

h(t)=Z(t)κ

Here κ also follows a normal distribution N(0, ν²), where ν is the holidays_prior_scale parameter with a default value of 10: the larger the value, the larger the holiday effect the model allows; the smaller the value, the smaller the effect.

Finally, the noise term represents unmodeled random fluctuations.

In summary, what Prophet actually fits are k and m in the trend term, β in the seasonal term, κ in the holiday term, and the error term ϵt. Prophet is suited to time series that meet the following conditions:

  • Training data: at least one complete cycle of data, so the model can fully learn the patterns;

  • Data trend: the data exhibits regular periodic effects, such as weekend or seasonal effects;

  • Jumps: the time points and window periods where jumps may occur are known, such as Double Eleven or Spring Festival;

  • Missing values: missing values and outliers in the history are within a reasonable range.
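In code, fitting such a series with the open-source prophet package takes only a few lines; a minimal sketch, assuming a prepared daily DataFrame df with columns ds and y (the holiday calendar below is illustrative):

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Holiday windows: lower/upper_window extend the effect before/after the day itself.
holidays = pd.DataFrame({
    "holiday": ["labor_day"] * 5,
    "ds": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03",
                          "2024-05-04", "2024-05-05"]),
    "lower_window": -3,   # also affects the 3 days before
    "upper_window": 1,    # and 1 day after
})

# df: DataFrame with columns "ds" (date) and "y" (indicator), assumed prepared upstream.
m = Prophet(
    growth="linear",            # or "logistic" with a "cap" column for the capacity C
    yearly_seasonality=10,      # N = 10 Fourier terms for the yearly period
    weekly_seasonality=3,       # N = 3 terms for the weekly period
    holidays=holidays,
    holidays_prior_scale=10.0,  # the ν discussed above
)
m.fit(df)
future = m.make_future_dataframe(periods=30)   # the next 30 days
forecast = m.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```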

3.2.2 Informer

Informer is a time series prediction model based on the Transformer architecture, which consists of an encoder and a decoder built on self-attention. Informer is designed to address the challenges traditional models face in long-sequence and multi-scale prediction, enabling the model to better capture long-term dependencies and global context in the sequence.

Key features of Informer include:

  • A ProbSparse self-attention mechanism that achieves O(L log L) time complexity and memory usage.

  • A self-attention distilling mechanism that applies a one-dimensional convolution to each self-attention layer's output and then a max-pooling layer that halves its length, highlighting the dominant attention and handling overly long input sequences efficiently (see the sketch after this list).

  • A generative-style decoder that produces all predictions in one forward pass over the long sequence instead of predicting step by step, which greatly speeds up inference for long-sequence prediction.
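As an illustration of the distilling step, here is a minimal PyTorch sketch of the Conv1d + ELU + max-pooling unit that halves the sequence length between encoder layers; shapes follow the common (batch, length, d_model) convention, and this is a sketch rather than the official implementation:

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Halves the temporal length between Informer encoder layers."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> conv over time needs (batch, d_model, length)
        x = x.transpose(1, 2)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)   # (batch, ceil(length / 2), d_model)

x = torch.randn(8, 96, 512)
print(DistillingLayer(512)(x).shape)   # torch.Size([8, 48, 512])
```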

[Figure: Informer architecture]

Informer performs well in time series prediction tasks, especially for long sequences, and has been widely used in weather, traffic flow, and financial market forecasting. The paper reports SOTA results on standard datasets such as ETT (electricity transformer temperature), ECL (electricity consumption), and Weather.

3.2.3 Autoformer

Transformer-based predecessors such as Informer use self-attention to capture dependencies between time steps, and made progress in time series prediction. However, shortcomings remain for long-horizon forecasting:

  • Complex temporal patterns in long sequences make it difficult for attention mechanisms to discover reliable temporal dependencies.

  • To cope with the quadratic complexity, Informer must sparsify the attention mechanism, which creates a bottleneck in information utilization.

In order to break through the above problems, the author proposed a model called Autoformer, which mainly includes the following innovations:

  • Moving beyond the traditional use of series decomposition as a preprocessing step, a deep Decomposition Architecture is proposed that can progressively separate more predictable components from complex temporal patterns.

  • Based on stochastic process theory, an Auto-Correlation Mechanism is proposed to replace point-to-point attention, achieving series-level connections with O(L log L) complexity and breaking the information-utilization bottleneck.

[Figure: Autoformer architecture]

Time series decomposition splits a series into components, each representing a class of underlying temporal pattern, such as the periodic and trend components. Because the future is unknown at prediction time, the usual approach decomposes the past series first and then forecasts each component separately; but this caps the forecast quality at the quality of the decomposition and ignores future interactions among components. The authors instead embed series decomposition inside the encoder-decoder as an internal unit of Autoformer. During prediction, the model alternates between optimizing the forecast and decomposing the series, gradually separating the trend and periodic terms from the hidden variables to achieve progressive decomposition.

The series decomposition unit is based on the idea of a moving average, smoothing out the periodic component and highlighting the trend:

X_t = AvgPool(Padding(χ))

X_s = χ − X_t

where χ is the hidden variable to be decomposed, and X_t and X_s are the trend and periodic components respectively. The two formulas together are written as X_t, X_s = SeriesDecomp(χ).
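This unit can be written down directly from the two formulas; a minimal PyTorch sketch (the moving-average kernel size is a hyperparameter):

```python
import torch
import torch.nn as nn

class SeriesDecomp(nn.Module):
    """X_t = AvgPool(Padding(X)); X_s = X - X_t, as in the formulas above."""
    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=0)

    def forward(self, x: torch.Tensor):
        # x: (batch, length, channels); pad both ends by repeating the edge values
        # so the moving average preserves the original length.
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend
        return trend, seasonal
```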

In addition, Autoformer expands information utilization through efficient series-level connections. Observing that similar phases in different periods usually exhibit similar sub-processes, the authors exploit this inherent periodicity to design the autocorrelation mechanism, consisting of period-based dependency discovery and time-delay information aggregation.

The authors report that for long-horizon forecasting, Autoformer significantly surpassed the previous SOTA across five major time series domains (energy, transportation, economics, meteorology, and disease), with a 38% relative improvement.

3.2.4 DLinear

The authors of DLinear question the effectiveness of Transformer-based models for time series forecasting. They argue that Transformer-based models neglect the temporal relationship among ordered, consecutive points, while positional information is crucial in time series prediction, and they propose the DLinear model.

The structure of DLinear is very simple: essentially Autoformer's decomposition layer followed by fully connected layers. The model uses the decomposition layer to split the input series into a remainder (seasonal) part and a trend part; each part is fed to its own linear layer, and the final output is the sum of the two. The authors report that DLinear surpasses other deep models on datasets spanning energy, transportation, economics, meteorology, disease, and other fields.
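To make the structure concrete, here is a sketch that reuses the SeriesDecomp unit from the Autoformer section above; the official implementation additionally offers a per-channel ("individual") variant.

```python
import torch
import torch.nn as nn

class DLinear(nn.Module):
    """Decompose the input, then apply one linear layer per component along time."""
    def __init__(self, seq_len: int, pred_len: int, kernel_size: int = 25):
        super().__init__()
        self.decomp = SeriesDecomp(kernel_size)     # from the sketch in 3.2.3 above
        self.linear_trend = nn.Linear(seq_len, pred_len)
        self.linear_seasonal = nn.Linear(seq_len, pred_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels)
        trend, seasonal = self.decomp(x)
        out = self.linear_trend(trend.transpose(1, 2)) \
            + self.linear_seasonal(seasonal.transpose(1, 2))
        return out.transpose(1, 2)                  # (batch, pred_len, channels)
```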

[Figure: DLinear structure]

3.2.5 TimesNet

Real-world time series are often a superposition of multiple processes, such as the daily and weekly variation of traffic data, or the daily and annual variation of weather data. This inherently multi-periodic nature makes temporal variation extremely complex.

For a given periodic process, the variation at each time point depends not only on its immediate neighborhood but also on adjacent cycles; that is, there are two kinds of temporal variation, intra-period and inter-period. Intra-period variation corresponds to the short-term process within one cycle, while inter-period variation reflects the long-term trend across consecutive cycles.

TimesNet analyzes temporal variation from this new multi-period perspective. It selects multiple periods of the one-dimensional series via the fast Fourier transform and folds the 1D data along each period to obtain a set of two-dimensional tensors; the columns and rows of each 2D tensor capture the intra-period and inter-period variations respectively. In this way, 1D time series data is lifted into 2D space for analysis.

TimesNet consists of several stacked TimesBlocks, each performing four steps: lifting the series to 2D, extracting representations with 2D convolutions, projecting the intermediate result back to 1D, and adaptive fusion. The input sequence first passes through an embedding layer to obtain deep features X^0. The l-th TimesBlock takes X^(l−1) as input and produces its output through internal processing, with a residual connection around each block: X^l = TimesBlock(X^(l−1)) + X^(l−1).

The internal processing of a TimesBlock is as follows:

1) Lifting the series into 2D

First, the fast Fourier transform is applied to the input 1D features X^(l−1); the k frequencies {f_1, …, f_k} with the highest amplitudes are selected, corresponding to the k most significant periods {p_1, …, p_k}, and the series is folded into a 2D tensor per period. The relevant formulas are:

A, {f_1, …, f_k}, {p_1, …, p_k} = Period(X^(l−1))

X_2D^(l,i) = Reshape_{p_i, f_i}(Padding(X^(l−1))), i ∈ {1, …, k}

where A is the amplitude of each frequency component, Period denotes the fast Fourier transform together with the top-k frequency/period selection, Reshape converts the 1D result into a 2D tensor, and Padding zero-pads the series along time so that its length is divisible by the period.
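The period-selection step can be sketched in a few lines of PyTorch (the official TimesNet code organizes this differently, but the logic is the same):

```python
import torch
import torch.nn.functional as F

def fft_periods(x: torch.Tensor, k: int = 3):
    """Pick the top-k periods of x (batch, length, channels) by FFT amplitude,
    mirroring the Period(.) step above."""
    xf = torch.fft.rfft(x, dim=1)
    amp = xf.abs().mean(dim=0).mean(dim=-1)   # average amplitude per frequency
    amp[0] = 0                                # ignore the DC component
    topk = torch.topk(amp, k).indices         # the k strongest frequencies f_i
    periods = x.shape[1] // topk              # p_i = T / f_i
    return periods, amp[topk]

x = torch.randn(8, 180, 20)
periods, weights = fft_periods(x)

# Fold the 1D series into a 2D (cycles x period) tensor for one selected period:
p = int(periods[0])
pad = (p - x.shape[1] % p) % p                # the Padding step
x2d = F.pad(x, (0, 0, 0, pad)).reshape(x.shape[0], -1, p, x.shape[-1])
```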

2) Representation extraction with 2D convolution

For the 2D tensor of each frequency component, 2D convolution is used to extract information; an Inception block is used here:

X̂_2D^(l,i) = Inception(X_2D^(l,i)), i ∈ {1, …, k}

3) Projecting the intermediate result back to 1D

The 2D features are converted back to 1D space for further information aggregation:

X̂_1D^(l,i) = Trunc(Reshape_{1, (p_i × f_i)}(X̂_2D^(l,i))), i ∈ {1, …, k}

Among them, Trunc means removing the zeros added by the Padding operation in step 1).

4) Adaptive fusion

The 1D representations X̂_1D^(l,i) from the previous step are weighted by their corresponding frequency amplitudes and summed to give the TimesBlock's final output:

Â_{f_1}, …, Â_{f_k} = Softmax(A_{f_1}, …, A_{f_k})

X^l = Σ_{i=1..k} Â_{f_i} · X̂_1D^(l,i)

Through this block design, the model extracts 2D temporal variations over multiple periods and fuses them adaptively. The residual connections between TimesBlocks let gradients flow back directly, which accelerates convergence and avoids vanishing gradients.

In the end, TimesNet achieved the best results in data sets in the fields of energy, transportation, economics, meteorology, diseases, etc.

[Figures: TimesNet architecture and the internals of a TimesBlock]

These are the models we compare in our experiments. Since this article focuses on practice, readers who want a deeper understanding of each model's details are referred to the original papers listed in the references.

4. Practice

4.1 Data preprocessing

Analysis shows that, in their natural state, traffic, order volume, and other indicators for a given channel have clear periodicity and trend. During the epidemic, however, this structure breaks down: as highlighted by the red box in the figure, the indicator series during the epidemic show no obvious periodicity or trend compared with the periods before and after, being driven largely by epidemic policy. After infections rapidly peaked in China from the end of 2022 to the beginning of 2023, the 2023 series largely recovered their pre-epidemic periodicity and trend. Since our goal is to predict the indicators after the return to the natural state in 2023, we exclude the data from the epidemic period (2020/01/20~2023/01/19).
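In code this is a single date-range filter, assuming a daily DataFrame df with a date column:

```python
import pandas as pd

# Exclude the epidemic window (2020/01/20 - 2023/01/19) from the daily series.
mask = (df["date"] < pd.Timestamp("2020-01-20")) | (df["date"] > pd.Timestamp("2023-01-19"))
df = df[mask].reset_index(drop=True)
```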

[Figure: indicator time series, with the epidemic period marked by the red box]

Taking the traffic of one channel as an example, we plot the trend after removing the epidemic data, with red lines marking year boundaries. The data shows clear annual periodicity.

[Figure: channel traffic trend after removing the epidemic data, divided by year]

4.2 Model training and evaluation

We build the dataset from each indicator's history plus the other features, and run the models introduced in 3.2 (Prophet accepts only univariate input, so it is not included in the unified comparison). The following takes the traffic of one channel as an example; the data has been desensitized.

Testing showed that Informer performs worse than Autoformer in most cases, so the subsequent comparisons are among Autoformer, DLinear, and TimesNet.

First define several parameters of the input data:

  • model_in: the input dimension of the model, set to 20;

  • model_out: The output dimension of the model, set to 20;

  • seq_len: The size of the encoder time window when training the model, which can be understood as the time period for the model to look back;

  • label_len: The size of the decoder time window when training the model. label_len should not exceed seq_len, and is generally set to half of seq_len;

  • pred_len: The size of the prediction window, that is, how many time steps to predict in the future. In this article, pred_len is equal to 30.
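For concreteness, the settings used below can be written as a configuration in the argument style of the open-source Informer/Autoformer/TimesNet codebases (names such as enc_in and c_out follow those repositories):

```python
# Window and dimension settings for the experiments below.
config = dict(
    enc_in=20,     # model_in: number of input features
    c_out=20,      # model_out: number of output series
    seq_len=180,   # encoder look-back window (the best setting found below)
    label_len=90,  # decoder warm-up window, half of seq_len
    pred_len=30,   # forecast the next 30 days
)
```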

Evaluation indicators: MSE (mean squared error) and MAE (mean absolute error).
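For reference, the two metrics, assuming numpy arrays of predictions and ground truth:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))
```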

[Figures: test-set prediction results for Autoformer, DLinear, and TimesNet]

We ran comparative experiments over different seq_len and label_len combinations. On the test set, seq_len=180 with label_len=90 gives the best predictions; comparing across models, TimesNet performs best.

In addition, we ran a T=120 prediction covering possible change points, with the following results:

[Figure: T=120 prediction around the change points]

The plot shows that even TimesNet fails to capture the changes at these abrupt points, possibly because the training data is too limited. To capture them, we ensembled in traditional models such as Prophet that are sensitive to change points. The results show that the ensemble captures the changes at abrupt points better while preserving prediction accuracy, with both MSE and MAE lower than the original model, as shown below.
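The article does not spell out the fusion rule; one simple choice is a fixed-weight blend of the deep model's forecast with Prophet's, sketched below (the weight w is purely illustrative):

```python
import numpy as np

def blend(deep_pred: np.ndarray, prophet_pred: np.ndarray, w: float = 0.7) -> np.ndarray:
    """Fixed-weight ensemble of a deep-model forecast and a Prophet forecast,
    both of shape (pred_len,). The production fusion rule is not disclosed here."""
    return w * deep_pred + (1 - w) * prophet_pred
```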

It is worth noting that the input and output dimensions of the approach above are configurable: single-input single-output, multi-input single-output, multi-input multi-output, and so on are all possible.

[Figure: ensemble model prediction compared with the original model]

4.3 Model deployment and backtesting

After completing offline training, we deployed the model online, updating the T+30D results daily. The process is shown in the figure:

[Figure: online deployment and daily-update pipeline]

After the model goes online, a monitoring mechanism is needed so the model can be corrected promptly when prediction quality degrades. We continuously monitor the deviations between predicted and actual values at T+3D, T+7D, T+14D, T+21D, and T+30D, and surface them as reports.
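A sketch of the report computation, assuming pred and actual are aligned daily Series and the T+kD figure is the mean deviation rate over the first k forecast days:

```python
import pandas as pd

def horizon_deviation(pred: pd.Series, actual: pd.Series,
                      horizons=(3, 7, 14, 21, 30)) -> pd.Series:
    """Mean deviation rate (pred - actual) / actual over the first k forecast days."""
    dev = (pred - actual) / actual
    return pd.Series({f"T+{k}D": float(dev.iloc[:k].mean()) for k in horizons})
```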

[Figure: deviation monitoring report]

Taking one channel as an example, we applied the model to predict its traffic over the May Day holiday. The average prediction deviation rate was +1.03% and the peak-day deviation rate +0.43%, an excellent result. In addition, taking the end of the May Day holiday as the cut-off point, the average deviation rates over the following 3, 7, 14, 21, and 30 days were +3.7%, +1.8%, +1.4%, +5.5%, and -9.6% respectively.

[Figures: May Day holiday prediction results]

5. Summary and Outlook

Starting from the task of predicting key indicators, this article has covered time series forecasting methods, model training and evaluation, and online deployment and backtesting; some of the content has been simplified.

Overall, deep learning approaches to time series prediction have flourished in recent years. They depend heavily on data volume, however: the more training data, the easier it is for a model to discover the latent patterns in a series. In real scenarios the data is not always sufficient, which hurts deep models and sometimes leaves them inferior to traditional methods. On the other hand, given the inherently mathematical, formula-like structure of time series data, it remains open whether deep learning methods can provide formal representations of it.

Data will keep accumulating over time, and a growing body of research is integrating traditional time series techniques into deep learning models, an attempt to bring those mathematical characteristics into deep methods.

In the future, we will continue to optimize the deep learning time series models, for example by constructing more features and covariates, adding prediction confidence intervals, and improving the model's evaluation criteria.

6. References

[1] Taylor S J, Letham B. Forecasting at scale[J]. The American Statistician, 2018, 72(1): 37-45.

[2] Zhou H, Zhang S, Peng J, et al. Informer: Beyond efficient transformer for long sequence time-series forecasting[C]//Proceedings of the AAAI conference on artificial intelligence. 2021, 35(12): 11106-11115.

[3] Wu H, Xu J, Wang J, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting[J]. Advances in Neural Information Processing Systems, 2021, 34: 22419-22430.

[4] Zeng A, Chen M, Zhang L, et al. Are transformers effective for time series forecasting?[C]//Proceedings of the AAAI conference on artificial intelligence. 2023, 37(9): 11121-11128.

[5] Wu H, Hu T, Liu Y, et al. TimesNet: Temporal 2D-variation modeling for general time series analysis[J]. arXiv preprint arXiv:2210.02186, 2022.

[6] https://zhuanlan.zhihu.com/p/421710621
