[Time series] Interpretable multilevel wavelet decomposition network for time series analysis

 1. Article information

The paper I read this week, "Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis", was published in the Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018) and deals with time series forecasting and classification.

 2. Summary

In recent years, time series applications have seen an unprecedented rise in almost all academic and industrial fields. Various types of deep neural network models have been introduced into time series analysis, but effective modeling of the important frequency information is still lacking. To address this, the article proposes a wavelet-based neural network structure called the multilevel Wavelet Decomposition Network (mWDN), which is used to build frequency-aware deep learning models for time series analysis. The mWDN model retains the advantages of multilevel discrete wavelet decomposition, while all of its parameters remain tunable within the deep neural network framework. On the basis of mWDN, two deep learning models are further proposed, one for time series classification and one for forecasting: the Residual Classification Flow (RCF) and the multi-frequency Long Short-Term Memory (mLSTM). These two models take all or part of the subsequences decomposed by mWDN at different frequencies as input, and learn all parameters globally through backpropagation, so that wavelet-based frequency analysis is seamlessly embedded into the deep learning framework. Extensive experiments on 40 UCR datasets and a real-world user-volume dataset show that mWDN-based time series models perform well. In particular, the article proposes an importance analysis method based on the mWDN model, which successfully identifies the time series elements and mWDN layers that are critical to time series analysis. This illustrates the interpretability advantage of mWDN and can be seen as an in-depth exploration of interpretable deep learning.

 3. Introduction

In recent years, with the rapid development of deep learning, different types of deep neural network models have been applied to time series processing and analysis and have achieved satisfactory results in practice. Recurrent neural networks (RNNs), for example, use memory nodes to capture correlations between sequence elements. Most of these models, however, do not exploit the frequency information of the time series.

Wavelet decomposition is a common method for characterizing time series in both the time domain and the frequency domain. In short, it can serve as a feature-extraction tool for data preprocessing before deep model training. While this loosely coupled approach may improve the predictive performance of the original neural network model, the wavelet parameters are inferred independently of the model and therefore cannot be optimized globally. How to integrate wavelet decomposition into deep learning models remains challenging.

This article proposes a neural network model based on wavelet decomposition, called the multilevel wavelet decomposition network (mWDN), which builds a frequency-aware deep learning model for time series analysis. Like standard multilevel discrete wavelet decomposition (MDWD), the mWDN model can decompose a time series into a set of subsequences with frequencies ranging from high to low, which is the key to capturing frequency factors. But unlike MDWD with its fixed parameters, all parameters in mWDN are learnable and adapt to the training data of different learning tasks. In other words, the mWDN model not only analyzes time series via wavelet decomposition, but also exploits the learning ability of deep neural networks to fit its parameters.

Based on mWDN, the article designs two deep learning models for time series classification (TSC) and time series forecasting (TSF): the Residual Classification Flow (RCF) and the multi-frequency Long Short-Term Memory (mLSTM). The key issue in TSC is to extract features from time series data that are as representative as possible. The RCF model therefore uses the decomposition results of different mWDN levels as input, and adopts residual learning and stacked classifiers to mine the features hidden in the subsequences. For the TSF problem, the key lies in inferring the future state of time series data from the hidden trends at different frequencies. The mLSTM model therefore feeds all the high-frequency subsequences decomposed by mWDN into independent LSTM models, and integrates the outputs of all LSTM models for the final prediction. It is worth noting that all parameters of the RCF and mLSTM models, including those of mWDN, are trained end to end with backpropagation. In this way, wavelet-based frequency analysis is seamlessly embedded into deep learning models.

 4. Model

1. Multi-level Discrete Wavelet Decomposition

Multilevel discrete wavelet decomposition (MDWD, shown in Figure 1) is a discrete signal analysis method based on the wavelet transform. By decomposing a time series step by step into low-frequency and high-frequency subsequences, it extracts multilevel time-frequency features.

Taking the time series X = {x_1, x_2, …, x_T} as an example, denote the low-frequency and high-frequency subsequences obtained at the i-th decomposition layer by x_l^i and x_h^i respectively. At layer i + 1, MDWD convolves the low-frequency subsequence of the previous layer with a low-pass filter l = {l_1, …, l_K} and a high-pass filter h = {h_1, …, h_K}, as follows:

a_l^{i+1}(n) = Σ_{k=1}^{K} x_l^i(n + k − 1) · l_k
a_h^{i+1}(n) = Σ_{k=1}^{K} x_l^i(n + k − 1) · h_k    (1)

where x_l^i(n) is the n-th element of the low-frequency subsequence in the i-th layer, and x_l^0 is set as the input sequence X. The low-frequency and high-frequency subsequences x_l^{i+1} and x_h^{i+1} of layer i + 1 are obtained by downsampling the intermediate variable sequences a_l^{i+1} and a_h^{i+1} by one half.

The set of subsequences X(i) = {x_h^1, …, x_h^i, x_l^i} is called the i-th level decomposition result of the time series X. In particular, the decomposition satisfies: 1) the original sequence X can be completely reconstructed from the subsequences; 2) subsequences at different levels have different time and frequency resolutions. As the number of layers increases, the frequency resolution continues to increase, while the time resolution, especially for the low-frequency subsequences, continues to decrease.
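As an illustration of the procedure just described, here is a minimal NumPy sketch of multilevel discrete wavelet decomposition. It uses the two-tap Haar filters to keep the code short (the paper itself uses Daubechies-4), and all names are illustrative:

```python
import numpy as np

# Haar decomposition filters; the paper uses Daubechies-4 instead.
LOW = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass filter l
HIGH = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass filter h

def dwd_level(x_low):
    """One decomposition level: filter the previous low-frequency
    subsequence with both filters, then downsample by one half."""
    a_low = np.convolve(x_low, LOW, mode="valid")    # intermediate a_l
    a_high = np.convolve(x_low, HIGH, mode="valid")  # intermediate a_h
    return a_low[::2], a_high[::2]                   # 1/2 downsampling

def mdwd(x, levels):
    """Multilevel decomposition: returns [x_h^1, ..., x_h^N, x_l^N]."""
    subseqs, x_low = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        x_low, x_high = dwd_level(x_low)
        subseqs.append(x_high)
    subseqs.append(x_low)
    return subseqs

# A low-frequency sine plus a small high-frequency ripple
x = np.sin(np.arange(64) * 0.3) + 0.1 * np.sin(np.arange(64) * 2.9)
parts = mdwd(x, levels=3)  # [x_h^1, x_h^2, x_h^3, x_l^3]
```

Each level halves the length of the low-frequency branch, mirroring the loss of time resolution noted above; a constant series yields all-zero high-frequency subsequences.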


Figure 1 mWDN model framework

2. Multi-level Wavelet Decomposition Network

Figure 1 is the framework diagram of the mWDN model. As shown in the figure, the mWDN model decomposes time series data hierarchically according to the following two formulas:

a_l^{i+1} = σ( W_l^{i+1} · x_l^i + b_l^{i+1} )
a_h^{i+1} = σ( W_h^{i+1} · x_l^i + b_h^{i+1} )    (2)

Here σ(·) denotes the sigmoid activation function, and b_l^i and b_h^i are trainable bias vectors, initialized as random values close to zero. The equations in formula (2) are clearly very similar to those in formula (1). x_l^i and x_h^i denote the low-frequency and high-frequency subsequences generated by decomposing the time series X at level i; they are obtained from the intermediate variables a_l^i and a_h^i by average-pooling downsampling. To realize the convolution operation of formula (1), the initialized weight matrices W_l^i and W_h^i are set as follows:

W_l^i =
[ l_1  l_2  l_3  …  l_K   ε    …   ε ]
[ ε    l_1  l_2  …  l_{K−1}  l_K  …  ε ]
[ …    …    …    …    …    …    …   … ]
[ ε    …    ε    l_1  l_2  …          ]    (3)

and W_h^i has the same structure, with the high-pass coefficients h_k in place of l_k.

Obviously, W_l^i ∈ R^{P×P} and W_h^i ∈ R^{P×P}, where P is the size of x_l^{i−1}. The ε entries in the weight matrices are random values close to zero, much smaller in magnitude than the filter coefficients. The article uses the Daubechies 4 wavelet in the model, whose filter coefficients are as follows:

l = (−0.0106, 0.0329, 0.0308, −0.1870, −0.0280, 0.6309, 0.7148, 0.2304)
h = (−0.2304, 0.7148, −0.6309, −0.0280, 0.1870, 0.0308, −0.0329, −0.0106)    (4)

Through formulas (2) and (3), the article uses the deep neural network framework to realize an approximate MDWD model. It is worth noting that although the weight matrices are initialized with the filter coefficients of the MDWD model, they remain trainable and can deviate from the exact wavelet coefficients to fit the real data.
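A sketch of this initialization in NumPy may clarify the idea. The exact matrix layout in the paper may differ in detail; here each row carries the filter shifted by one column, with small random ε elsewhere, and the function name and ε scale are assumptions:

```python
import numpy as np

def init_wavelet_weight(filt, P, eps_scale=1e-3, seed=0):
    """Build a P x P weight matrix whose n-th row holds the wavelet
    filter coefficients starting at column n; all remaining entries
    are random values close to zero (the epsilon in the formula)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, eps_scale, size=(P, P))  # epsilon background
    for n in range(P):
        for k in range(len(filt)):
            if n + k < P:
                W[n, n + k] = filt[k]
    return W

haar_low = np.array([1.0, 1.0]) / np.sqrt(2)  # stand-in for the db4 filter
W_l = init_wavelet_weight(haar_low, P=8)
# W_l @ x now approximates the valid convolution of x with the filter,
# up to the tiny epsilon perturbations that make the matrix trainable.
```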

3. Residual Classification Flow

The TSC task is to classify time series whose class labels are unknown; the key is to extract distinguishing features from the time series data. The natural time-frequency features X(i) obtained from the mWDN decomposition can be applied to TSC. In this part, the article proposes the Residual Classification Flow (RCF) network to exploit the potential of mWDN for the TSC task.


Figure 2 RCF model framework

The framework of the RCF model is shown in Figure 2; it contains multiple independent classifiers. The RCF model connects the subsequences x(i) generated by the i-th mWDN layer through a forward neural network ℱ:

u^{(i)} = ℱ( x^{(i)}, θ^{(i)} )    (5)

ℱ can be a multi-layer perceptron, a convolutional neural network, or another type of neural network, and θ^{(i)} denotes its trainable parameters. In addition, the RCF model uses a residual structure that connects all the classifiers:

ĉ^{(i)} = 𝒮( u^{(i)} + u^{(i−1)} )    (6)

Here 𝒮 denotes a softmax classifier, and ĉ^{(i)} is the one-hot encoded prediction of the class label of the time series. The RCF model attaches a classifier to the decomposition result of mWDN at every level. Because the decomposition results at different mWDN levels have different time and frequency resolutions, the RCF model can fully capture the patterns of the input time series at multiple time and frequency resolutions. In other words, RCF adopts a multi-view learning method to achieve high-performance time series classification. Moreover, deep residual networks were proposed to address the training difficulties that can arise with deeper network structures, and RCF inherits this advantage. In formula (6), the classifier at level i makes its decision based on the decision of the classifier at level i − 1; users can therefore keep appending residual classifiers until the classification performance of the model no longer improves.
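The residual chaining of classifiers can be sketched as follows. A single random linear map stands in for the forward network ℱ and tanh for its nonlinearity; the dimensions and names are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rcf_predict(subseqs, n_classes=3):
    """Residual Classification Flow sketch: at each mWDN level i, a
    feature vector u_i is computed from that level's subsequence and
    added to u_{i-1} (the residual connection) before a softmax
    classifier makes the level-i decision."""
    u_prev = np.zeros(n_classes)
    preds = []
    for x in subseqs:                                     # one entry per level
        W = rng.normal(0, 0.1, size=(n_classes, len(x)))  # stand-in for F
        u = np.tanh(W @ x) + u_prev                       # residual on level i-1
        preds.append(softmax(u))                          # classifier at level i
        u_prev = u
    return preds

subseqs = [rng.normal(size=n) for n in (32, 16, 8)]  # toy mWDN outputs
preds = rcf_predict(subseqs)  # one class distribution per level
```

In the real model one would stop appending levels once the last classifier's accuracy plateaus, as the text describes.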

4. Multi-frequency Long Short-Term Memory

The article proposes a multi-frequency long short-term memory network based on mWDN to solve the TSF problem. The design of the mLSTM model is based on the observation that the temporal correlations hidden in a time series are closely related to frequency. For example, large-scale temporal dependencies such as long-term trends usually lie at low frequencies, while small-scale dependencies such as short-term disturbances and events generally lie at high frequencies. The article therefore divides the complex TSF problem into many sub-problems, each predicting one subsequence decomposed by mWDN; these sub-problems are relatively simpler because the frequency composition of each subsequence is simpler. Given a time series of unbounded length, a sliding window of size T ending at time t is taken on the series as follows:

X^t = ( x_{t−T+1}, x_{t−T+2}, …, x_t )

Using mWDN to decompose X^t yields the high-frequency subsequences at every level together with the low-frequency subsequence of the last level N:

X(N) = { x_h^1(t), x_h^2(t), …, x_h^N(t), x_l^N(t) }

As shown in Figure 3, the mLSTM model uses the decomposition results of the last layer as input to N + 1 independent LSTM sub-networks. Each sub-LSTM predicts the future state of its own subsequence. Finally, the predictions of all sub-LSTMs are fused by a fully connected neural network to produce the final forecast.
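Structurally, the fusion step can be sketched like this, with a random linear map standing in for each independent sub-LSTM (the real model trains N + 1 LSTM networks end to end; every name and dimension here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def mlstm_forecast(subseqs):
    """mLSTM sketch: one independent predictor per decomposed
    subsequence (a linear map instead of an LSTM here), whose scalar
    outputs are fused by a final fully connected layer."""
    per_branch = []
    for x in subseqs:                        # N high-freq parts + 1 low-freq part
        w = rng.normal(0, 0.1, size=len(x))  # stand-in for one sub-LSTM
        per_branch.append(w @ x)             # that branch's prediction
    fuse = rng.normal(0, 0.5, size=len(per_branch))
    return float(fuse @ np.array(per_branch))  # fully connected fusion

subseqs = [rng.normal(size=n) for n in (16, 8, 8)]  # e.g. x_h^1, x_h^2, x_l^2
y_hat = mlstm_forecast(subseqs)  # a single scalar forecast
```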


Figure 3 mLSTM framework

 5. Case studies

In this part, the paper evaluates the performance of the mWDN-based model in solving TSC and TSF problems.

1. Task 1: Time Series Classification

Experimental settings : The classification performance of the different models was tested on 40 datasets from the UCR time series archive. The main models are as follows:

  • RNN and LSTM: Recurrent neural network and long short-term memory neural network are two classic deep neural network models, which are widely used in time series analysis.

  • MLP, FCN, and ResNet: These three models were proposed as strong baselines for the UCR time series archive. They share the same skeleton: an input layer, followed by three hidden basic blocks, and finally a softmax output layer. MLP uses fully connected layers as its basic block, while FCN and ResNet use convolutional layers and residual convolutional blocks as their basic blocks, respectively.

  • MLP-RCF, FCN-RCF and ResNet-RCF: These three models use the basic blocks of MLP/FCN/ResNet as the forward network in formula (5) of the RCF model. Comparing the classification performance of each RCF model with its MLP/FCN/ResNet counterpart verifies the effectiveness of RCF.

  • Wavelet-RCF: This model has the same structure as ResNet-RCF, but replaces the mWDN part with standard MDWD with fixed parameters. Comparing it with ResNet-RCF verifies the effectiveness of the trainable parameters in mWDN.

For each dataset, each model was run ten times and the average classification error is reported as the evaluation metric. To compare performance across all datasets, the article further proposes the Mean Per-Class Error (MPCE) as an overall evaluation index for each model. Let c_k be the number of categories in the k-th dataset, and e_k the error rate of a model on that dataset; the MPCE of the model is then computed as:

MPCE = (1/K) · Σ_{k=1}^{K} e_k / c_k

Note that MPCE normalizes out the number of categories per dataset. The smaller the MPCE value, the better the overall performance.
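Computing MPCE is a one-liner; the helper below assumes per-dataset error rates e_k and class counts c_k are given:

```python
import numpy as np

def mpce(error_rates, class_counts):
    """Mean Per-Class Error: the mean of e_k / c_k over all K datasets,
    which normalizes away each dataset's number of categories."""
    e = np.asarray(error_rates, dtype=float)
    c = np.asarray(class_counts, dtype=float)
    return float(np.mean(e / c))

# e.g. three datasets with 2, 5 and 10 classes:
# (0.10/2 + 0.25/5 + 0.30/10) / 3 = 0.13 / 3
score = mpce([0.10, 0.25, 0.30], [2, 5, 10])
```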

Results & Analysis : Table 1 presents the experimental results, with summary statistics listed in the last two rows. Note that the best performance on each dataset is in bold, and the second best in italics. Among all models, FCN-RCF achieves the best overall performance: it has the smallest MPCE value and performs best on 19 of the 40 datasets. FCN also achieves relatively satisfactory performance, performing best on 9 datasets with a relatively small MPCE of 0.023, but the gap to FCN-RCF is still considerable. Table 1 also shows that MLP-RCF outperforms MLP on 37 datasets, and ResNet-RCF outperforms ResNet on 27 datasets. This indicates that RCF is indeed a general framework compatible with different types of deep learning classifiers, and that it can significantly improve classification performance on TSC tasks.

In addition, Table 1 shows that Wavelet-RCF achieves the second-best MPCE and average ranking, which indicates that the frequency information obtained by wavelet decomposition is very helpful for time series problems. Moreover, ResNet-RCF outperforms Wavelet-RCF on most datasets, which strongly demonstrates the advantage of adopting the trainable-parameter mWDN inside the deep learning framework over directly using traditional wavelet decomposition as a feature-extraction tool. More technically, compared with Wavelet-RCF, the mWDN-based ResNet-RCF model achieves a better trade-off between the frequency-domain prior and the likelihood of the training data. This also explains why the RCF-based models achieve better results in the experiments above.

Table 1 Comparison of Classification Performance on 40 UCR Time Series Datasets


2. Task Ⅱ: Time Series Forecasting

Experimental settings : The article tests the predictive ability of the mLSTM model in a traffic forecasting scenario. The experiment uses a real dataset named WuxiCellPhone, which contains two weeks of user-volume time series from 20 cell phone base stations located in downtown Wuxi, recorded at a 5-minute granularity. The following models were selected as baselines:

  • SAE (Stacked Auto-Encoders), is widely used in various TSF tasks;

  • RNN (Recurrent Neural Networks) and LSTM (Long Short-Term Memory), models specially proposed for time series analysis;

  • wLSTM, which has the same structure as mLSTM but replaces the mWDN part with standard MDWD.

This part uses two commonly used indicators to evaluate the performance of the model, including Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE), which are defined as follows:

MAPE = (100% / n) · Σ_{i=1}^{n} |y_i − ŷ_i| / y_i
RMSE = √( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² )
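Both metrics are standard and straightforward to implement, for example:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(mape([100, 200], [110, 180]))  # (10% + 10%) / 2 = 10.0
print(rmse([100, 200], [110, 180]))  # sqrt((100 + 400) / 2) ≈ 15.81
```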

Results & Analysis : The performance of all models was compared in two TSF scenarios (see the original paper for details). In the first scenario, the article predicts the average number of users at a base station over an upcoming period whose length varies from 5 to 30 minutes. Figure 4 compares the average performance over 20 base stations during one week. Although the prediction errors of all models gradually decrease as the period length increases, mLSTM consistently achieves the best performance. In particular, mLSTM is consistently better than wLSTM, again verifying the benefit of mWDN for time series forecasting.

In the second scenario, the article predicts the average number of users in the 5 minutes that follow a given interval of 0 to 30 minutes. Figure 5 compares the predictive performance of mLSTM and the baselines. Unlike the trend observed in Scenario Ⅰ, the forecast error gradually increases as the interval grows. Figure 5 also shows that mLSTM once again outperforms the other baselines, corroborating the observations from Scenario Ⅰ.


Figure 4 Comparison of prediction performance with varying period lengths (Scenario Ⅰ)


Figure 5 Comparison of prediction performance with varying interval lengths (Scenario Ⅱ)

3. Interpretability research

In this section, the article focuses on a unique advantage of the mWDN model: interpretability. Since mWDN embeds discrete wavelet decomposition, the intermediate-layer outputs x_l^i and x_h^i of mWDN inherit the physical meaning of wavelet decomposition. The article illustrates this on two datasets: WuxiCellPhone and ECGFiveDays. Figure 6(a) shows one day of user-volume data from a cell phone base station, and Figure 6(b) shows an electrocardiogram (ECG) sample.


Figure 6 Time series data sample

1. Motivation for the experiment

Figure 7 shows the outputs of the mWDN layers when the time series samples in Figure 6 are fed into the mLSTM and RCF models, respectively. Figure 7(a) depicts the subsequences after three levels of wavelet decomposition in mLSTM. As shown, the intermediate-layer outputs x_h^1, x_h^2, x_h^3 and x_l^3 correspond to frequency components of the input sequence from high to low. The same can be seen in Fig. 7(b), which shows the outputs of the first three layers of the RCF model. This indicates that the intermediate layers of mWDN inherit the frequency decomposition of wavelet analysis.


Figure 7 Subsequence generated by mWDN model

2. Importance analysis

The article introduces an importance analysis method for the mWDN model, which aims to quantify how much each input element and each hidden layer contributes to the final output of the model. The problem of time series classification or forecasting with a neural network is first written as:

p = M(x)

where M denotes the neural network, x the input sequence and p the predicted value. Given a trained model M, if a small perturbation of the i-th element x_i causes a large change in the output p, then M is very sensitive to x_i. The sensitivity of the neural network M to the i-th element of the input sequence is therefore defined as the partial derivative of p with respect to x_i:

S_M(x_i) = ∂p / ∂x_i = ∂M(x) / ∂x_i

Obviously, S_M(x_i) is a function of x for a given model M. Given a training dataset {x^(j)}_{j=1}^{J} of J training samples, the importance of the i-th element of the input sequence x to the model M can be defined as:

I_M(x_i) = (1/J) · Σ_{j=1}^{J} | S_M(x_i^{(j)}) |

where x_i^{(j)} is the value of the i-th element in the j-th training sample.

The above importance definition can be extended to the hidden layers of the mWDN model. Assuming a is the output of a hidden layer of the mWDN model, the neural network M can be rewritten as:

p = M̃( a(x) )

and the sensitivity of the model M to a is defined as:

S_M(a) = ∂p / ∂a

Given a training dataset {x^(j)}_{j=1}^{J}, the importance of a to the model M is calculated as:

I_M(a) = (1/J) · Σ_{j=1}^{J} | S_M(a^{(j)}) |

I_M(x_i) and I_M(a) thus denote the importance of a time series element and of an mWDN layer to the model, respectively.
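For a black-box model, the sensitivities above can be approximated by finite differences rather than analytic gradients. A sketch, where the toy model is a hypothetical stand-in for a trained network:

```python
import numpy as np

def sensitivity(model, x, i, eps=1e-5):
    """Approximate S_M(x_i) = dp/dx_i with a central finite difference."""
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[i] += eps
    x_minus[i] -= eps
    return (model(x_plus) - model(x_minus)) / (2 * eps)

def importance(model, samples, i):
    """I_M(x_i): mean absolute sensitivity over the J training samples."""
    return float(np.mean([abs(sensitivity(model, x, i)) for x in samples]))

# Toy model in which the most recent element dominates the output,
# echoing the "time value of information" observed in Figure 8(b).
model = lambda x: 3.0 * x[-1] + 0.1 * x[:-1].sum()
samples = [np.random.default_rng(s).normal(size=8) for s in range(5)]
# importance(model, samples, 7) recovers ~3.0; older elements get ~0.1
```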

3. Experimental results

Figures 8 and 9 show the results of the importance analysis. In Figure 8, the mLSTM model is tested on WuxiCellPhone. Figure 8(b) shows the importance map of all input elements, where the x-axis is the timestamp and the color encodes importance: the redder, the more important. The map shows that the most recent elements are more important than older ones, which is very reasonable in a time series forecasting scenario and reflects the time value of information.

Figure 8(a) shows the importance maps of the hidden layers, arranged from top to bottom in order of increasing frequency. To ease comparison, the output lengths are unified in the article. It can be observed that the low-frequency layers at the top have higher importance, and only the hidden layers with higher importance exhibit the recency pattern consistent with Fig. 8(b). This indicates that the low-frequency layers in mWDN matter more for successful time series forecasting. This is not hard to understand: the information captured by the low-frequency layers usually reflects the basic trends of human activity and can therefore be used to predict the future well.

Figure 9 depicts the importance maps of the RCF model trained on the ECGFiveDays dataset. As shown in Figure 9(b), the most important elements lie roughly between timestamps 100 and 110 on the time axis, which is quite different from Figure 8(b). To see why, recall that this range corresponds to the T wave of the ECG, which covers the period when the heart relaxes and prepares for its next contraction. It is generally accepted that T-wave abnormalities indicate severe impairment of physiological function. The elements describing the T wave are therefore the most important ones for the classification task.

Figure 9(a) shows the importance maps of the hidden layers, again arranged from top to bottom in increasing order of frequency. An interesting phenomenon, opposite to Fig. 8(a), is that the high-frequency layers are more important for the classification task on ECGFiveDays. To understand this, note that the general trend of the ECG curve, captured by the low-frequency layers, is very similar across individuals, while the abnormal fluctuations captured by the high-frequency layers carry the truly discriminative information for identifying heart disease. This also reveals a difference between time series classification and time series forecasting.

The experiments in this section demonstrate the interpretability advantage that arises from combining wavelet decomposition with the importance analysis method proposed in the article, and can also be regarded as an exploration of the black-box problem of deep learning.

 6 Conclusion

The main purpose of this work is to build frequency-aware deep learning models for time series analysis. To achieve this goal, the article first designs a new wavelet-decomposition-based neural network structure, mWDN, for frequency learning of time series; by making all parameters trainable, it can be seamlessly embedded into deep learning frameworks. Based on the mWDN structure, the article further designs two deep learning models for time series classification and forecasting, and experiments on a large number of real datasets show that they outperform state-of-the-art models. As a new attempt at interpretable deep learning, the article also proposes an importance analysis method to identify the factors that matter for time series analysis, thereby verifying the interpretability advantage of mWDN.



Origin blog.csdn.net/qq_33431368/article/details/130716694