Euclidean distance is equivalent to a special case of DTW in which the warping path is constrained to the diagonal of the cost matrix, running from the lower-left corner to the upper-right corner.
Removing the square roots: the square root is a monotonic function, so comparing squared distances preserves the ranking of candidates, and the square-root step can be skipped entirely during search.
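As a trivial sketch (illustrative names, not the paper's code), the square root is applied only once at the end, if the true distance is needed at all:

```python
import math

def sq_dist(a, b):
    # Squared Euclidean distance: skipping sqrt preserves candidate ordering,
    # since sqrt(x) < sqrt(y) exactly when x < y.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dist(a, b):
    # True distance, computed only when actually required.
    return math.sqrt(sq_dist(a, b))
```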
Lower bounds can be used to prune: because each bound costs less to compute than the full DTW, a candidate whose lower bound already exceeds the best-so-far distance can be discarded without ever running DTW. Examples include:
- **LB_KimFL**, with time complexity O(1). Because the time series in this paper are z-normalized, the maximum and minimum values contribute little to the overall lower-bound distance. The authors therefore drop the maximum and minimum from the four feature points (first, last, maximum, minimum) used by the original LB_Kim, whose time complexity is O(n), keeping only the first and last points and reducing the complexity to O(1). To get the most out of this strategy, they additionally use the second and third points from each end of the series, pruning in a cascade of tiers. (For details, see the `lb_kim_hierarchy` function in the source code.)
- **LB_Keogh**, with time complexity O(n), which bounds the DTW distance by the distance from the candidate to an envelope built around the query.
- **Early abandoning of ED and LB_Keogh**: the running sum of squared differences is compared against the best-so-far distance, and the computation is abandoned as soon as the sum exceeds it. Building on this strategy, the authors further propose reordered early abandoning to reduce the computational cost.
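The two bounds can be sketched as follows, assuming z-normalized inputs of equal length (a simplified Python sketch, not the authors' C implementation):

```python
def lb_kim_hierarchy(q, c, best_so_far):
    """O(1) cascade of LB_KimFL tiers.

    Any warping path must align q[0] with c[0] and q[-1] with c[-1], so the
    endpoint distances are a valid lower bound; deeper tiers add the cheapest
    possible alignment of the second points from each end.
    """
    lb = (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2
    if lb >= best_so_far:
        return lb
    lb += min((q[1] - c[0]) ** 2, (q[0] - c[1]) ** 2, (q[1] - c[1]) ** 2)
    if lb >= best_so_far:
        return lb
    lb += min((q[-2] - c[-1]) ** 2, (q[-1] - c[-2]) ** 2, (q[-2] - c[-2]) ** 2)
    return lb

def envelope(q, r):
    """Upper/lower envelope of q under a Sakoe-Chiba warping band of width r."""
    n = len(q)
    U = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    L = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return U, L

def lb_keogh(c, U, L, best_so_far=float("inf")):
    """LB_Keogh with early abandoning: squared distance from c to [L, U]."""
    lb = 0.0
    for ci, ui, li in zip(c, U, L):
        if ci > ui:
            lb += (ci - ui) ** 2
        elif ci < li:
            lb += (li - ci) ** 2
        if lb >= best_so_far:
            break
    return lb
```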
UCR Suite optimization strategies
Concepts and definitions
Optimization strategies:
1. Early abandoning Z-normalization
Q and C must be z-normalized before the DTW distance is computed, but normalizing the entire dataset up front is too expensive. The authors therefore use online Z-normalization, so that the early-abandoning strategy can also terminate the normalization computation ahead of time.
First, the mean and variance of a sequence C of length m are computed as follows (Equation 3):

$$\mu = \frac{1}{m}\sum_{i=1}^{m} c_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} c_i^2 - \mu^2$$
When the online Z-normalization has iterated to the k-th point of the source sequence T, a running sum of the elements and a running sum of their squares are maintained (Equation 4):

$$ex_k = \sum_{i=1}^{k} t_i, \qquad ex2_k = \sum_{i=1}^{k} t_i^2$$
The mean and variance of the m points from time k-m+1 to k are then (Equation 5):

$$\mu = \frac{ex_k - ex_{k-m}}{m}, \qquad \sigma^2 = \frac{ex2_k - ex2_{k-m}}{m} - \mu^2$$
Pseudocode for the early-abandoning normalization strategy based on online Z-normalization is shown in Figure 1-7.
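A minimal sketch of the running-sum bookkeeping (Equations 4 and 5) together with an early-abandoning z-normalized distance; the function and variable names here are illustrative, not the paper's:

```python
import math

def subsequence_stats(T, m):
    """Mean and std of every length-m window of T using running sums
    (one pass over T, no re-summation per window)."""
    ex = ex2 = 0.0
    stats = []
    for k, t in enumerate(T):
        ex += t
        ex2 += t * t
        if k >= m - 1:
            mean = ex / m
            var = ex2 / m - mean * mean
            stats.append((mean, math.sqrt(max(var, 0.0))))
            old = T[k - m + 1]      # slide the window: drop the oldest point
            ex -= old
            ex2 -= old * old
    return stats

def ea_distance(q, C, mean, std, best_so_far):
    """Squared ED between q and the z-normalized window C, abandoning both
    the normalization and the distance once best_so_far is exceeded."""
    total = 0.0
    for qi, ci in zip(q, C):
        total += (qi - (ci - mean) / std) ** 2  # normalize only visited points
        if total >= best_so_far:
            break
    return total
```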
2. Reordered early abandoning
The early-abandoning calculation above starts at the first time point of the sequence and proceeds left to right. The paper instead proposes quickly finding an ordering of the time points in which the differences between Q and C are expected to be largest, and accumulating in that order when testing the running sum against the best-so-far value, so that the computation can be cut off sooner. The computational cost of the two orderings is compared in Figure 1-8:
On the left, differences are accumulated in left-to-right order, and nine time steps must be computed before the calculation can be abandoned; on the right, with the new ordering, only five time steps suffice to decide.
The question now becomes: how do we find the ordering with the largest sum of differences, and must the chosen time points be contiguous? Reading the source code shows that the subsequence need not be contiguous. The approach in the paper is to sort the time points of the z-normalized query Q by absolute value. The rationale is that when the DTW distance is computed, each q_i may be aligned with several time points of sequence C, and the z-normalized subsequences of C follow a roughly Gaussian distribution with mean 0; the points of Q furthest from 0 are therefore expected to contribute the largest differences. The authors' experiments show that the ordering obtained this way has a correlation of 0.999 with the true optimal ordering.
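The reordering heuristic can be sketched as follows (illustrative names):

```python
def abandon_order(q):
    """Visit the z-normalized query's points in decreasing |q_i|.

    Candidate windows are z-normalized (roughly zero-mean), so the query
    points furthest from 0 contribute the largest expected squared
    differences; visiting them first triggers early abandoning sooner.
    """
    return sorted(range(len(q)), key=lambda i: abs(q[i]), reverse=True)

def ea_distance_ordered(q, c, order, best_so_far):
    """Early-abandoning squared ED, accumulated in the given index order."""
    total = 0.0
    for i in order:
        total += (q[i] - c[i]) ** 2
        if total >= best_so_far:
            break
    return total
```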
3. Reversing the query/data role in LB_Keogh

LB_KeoghEQ builds the envelope around the query Q, so U and L need to be computed only once per query, saving a great deal of time and space. If LB_KeoghEC were used for every candidate instead, U and L would have to be computed for each C, and the cost would rise sharply.

Therefore LB_KeoghEC is used selectively: only when LB_KeoghEQ fails to prune is the envelope of C built "just in time" and LB_KeoghEC applied to assist, which keeps the space overhead low. As for the time cost of LB_KeoghEC, it is repaid whenever the bound prunes a full DTW computation. The two calculations are illustrated in Figure 1-9:
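A self-contained sketch of the EQ-then-EC cascade (helper functions restated here for completeness; names are illustrative):

```python
def envelope(x, r):
    """Upper/lower envelope of x under a Sakoe-Chiba band of width r."""
    n = len(x)
    return ([max(x[max(0, i - r):i + r + 1]) for i in range(n)],
            [min(x[max(0, i - r):i + r + 1]) for i in range(n)])

def lb_keogh(a, U, L):
    """Squared distance from sequence a to the envelope [L, U]."""
    return sum((v - u) ** 2 if v > u else (l - v) ** 2 if v < l else 0.0
               for v, u, l in zip(a, U, L))

def cascade_keogh(q, c, r, best_so_far):
    """LB_KeoghEQ first (the envelope of q is reusable across all candidates);
    only if it fails to prune do we pay for LB_KeoghEC (envelope of c).
    Returns (lower bound, pruned?)."""
    Uq, Lq = envelope(q, r)          # in practice computed once per query
    lb = lb_keogh(c, Uq, Lq)         # EQ: walk c against q's envelope
    if lb >= best_so_far:
        return lb, True
    Uc, Lc = envelope(c, r)          # EC: built just in time, per candidate
    lb2 = lb_keogh(q, Uc, Lc)        # roles reversed
    best_lb = max(lb, lb2)
    return best_lb, best_lb >= best_so_far
```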
4. Using cascading lower bounds

There are many ways to compute a lower bound; each can be used to prune DTW computations, and each has its own time complexity and tightness. At least 18 lower-bound mechanisms exist; the authors reimplemented all of them and compared them on 50 different datasets, with the results shown in Figure 1-10:
Based on these results, the authors cascade the early-abandoning ED/DTW computations and the lower bounds from cheapest to tightest. First comes the O(1) LB_KimFL, which filters out many candidate subsequences; then LB_KeoghEQ, based on the query Q, is used to prune; when its pruning is not effective enough, LB_KeoghEC is brought in to assist; finally, if all of the pruning strategies fail, the full DTW is still computed with early abandoning. Experiments show that every lower bound in the cascade helps: removing any one of them roughly doubles the search time. In large-scale search, these pruning strategies save more than 99.9999% of the time the DTW algorithm would otherwise cost.
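The cheapest-first cascade can be sketched generically (illustrative names; the full distance here is a squared-ED stand-in for early-abandoning DTW):

```python
import math

def sq_ed(q, c):
    # Stand-in for the early-abandoning DTW of the real UCR Suite;
    # any true distance function works in its place.
    return sum((a - b) ** 2 for a, b in zip(q, c))

def cascade_search(query, candidates, bounds, dist=sq_ed):
    """Apply increasingly expensive lower bounds in order; fall through to
    the full distance only when no bound prunes the candidate."""
    best_so_far = math.inf
    best = None
    for c in candidates:
        if any(lb(query, c) >= best_so_far for lb in bounds):
            continue            # some cheap bound already ruled c out
        d = dist(query, c)
        if d < best_so_far:
            best_so_far, best = d, c
    return best, best_so_far
```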
Analysis of results
- **Naive**: every subsequence is z-normalized from scratch and the full Euclidean distance or DTW is computed at each step. (Roughly two-thirds of the literature computes similarity this way.)
- **State-of-the-art**: the best current approach, combining Z-normalization, early abandoning, and lower bounds to assist the full DTW computation. (Roughly one-third of the literature computes similarity this way.)
- **UCR Suite**: all of the optimization strategies described above.
- **GOd's ALgorithm (GOAL)**: compares similarity directly from the mean and variance, with O(1) time complexity. GOAL serves as a baseline: among all models that solve search over sequences of unknown, unrestricted length, it is the fastest possible.
Experimental comparison on randomly generated datasets
Experimental comparison with queries of different lengths
UCR-DTW Python implementation
UCR-DTW applies all of the optimization strategies described above.
GitHub: ucr-suite-python
UCR-ED Python implementation
UCR-ED applies the following optimization strategies:
- Early abandoning of ED
- Reordered early abandoning
Reference material
- Searching and mining trillions — blog
- Search Time Series
- DTW (Dynamic Time Warping)
- Time Series Classification and Clustering