"UCR-DTW" and "UCR-ED" Detailed Model


DTW (Dynamic Time Warping) and Known Optimization Strategies


A common way to measure the similarity between two time series Q and C is the Euclidean distance (ED), computed as shown in Figure 1-1:

The figure also shows the limitation of the Euclidean distance: it establishes a one-to-one correspondence between the two sequences, so the peaks of Q and C are not aligned and the computed similarity can deviate considerably from what we expect. The DTW algorithm solves this problem well.


In most cases, two sequences have very similar overall shapes, but those shapes are not aligned on the time axis. Before computing the similarity of the two sequences, we therefore need to warp one (or both) of them along the time axis so that their peaks line up better.


DTW is an effective way to achieve this warping. In other words, DTW finds a non-linear (possibly many-to-one) correspondence between the two sequences; this mapping is also called the warping path. For the Q and C above, the correspondence obtained is shown by the gray lines in Figure 1-2:


An intuitive way to understand DTW is to build an n×n matrix (assuming Q and C both have length n), where element (i, j) holds the Euclidean distance between point q_i of sequence Q and point c_j of sequence C. The goal is to find a path from one corner of the matrix, (1, 1), to the opposite corner, (n, n), that minimizes the sum of the elements along the path.


The Euclidean distance is a special case of DTW: its warping path is the diagonal of the matrix from the bottom-left corner to the top-right corner.
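
To make this picture concrete, here is a minimal dynamic-programming sketch of the classic DTW computation. It assumes squared point-wise distances and no warping window; the UCR Suite implementation additionally uses a Sakoe-Chiba band and the optimizations described below.

```python
import math

def dtw_distance(Q, C):
    """Classic O(n*m) DTW; gamma[i][j] is the cost of the best warping path
    that aligns the first i points of Q with the first j points of C."""
    n, m = len(Q), len(C)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (Q[i - 1] - C[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1],   # diagonal (match)
                                  gamma[i - 1][j],       # advance in Q only
                                  gamma[i][j - 1])       # advance in C only
    return gamma[n][m]
```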

Four optimization methods for DTW search are currently known, as follows:
  1. Removing the square root: since the square root is monotonic, candidates can be compared on squared distances without changing the result.


  2. Using lower bounds to prune, because computing a lower bound is cheaper than computing the full DTW. For example:


    1. LB_KimFL, whose time complexity is O(1).


      In this paper, because the time series are z-normalized, the maximum and minimum values of a series contribute little to the overall lower-bound distance. The authors therefore drop the maximum and minimum from the four feature points used by the original LB_Kim algorithm (time complexity O(n)), keeping only the first and last points and reducing the time complexity to O(1). To get the most out of this strategy, the authors additionally check the 2nd and 3rd points and the 2nd- and 3rd-to-last points in a cascade of pruning steps (see the lb_kim_hierarchy function for details); a sketch of this cascade follows.
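
Below is a minimal sketch of the cascading idea, a simplified illustration of the lb_kim_hierarchy approach rather than the original code. It assumes q is the z-normalized query, t is the raw data buffer, and the candidate starts at offset j and is normalized on the fly with the supplied mean and std.

```python
def lb_kim_fl(t, j, mean, std, q, bsf=float('inf')):
    """Cascaded O(1) lower bound using the first/last points of the candidate."""
    m = len(q)
    x0 = (t[j] - mean) / std              # first point of the candidate
    y0 = (t[j + m - 1] - mean) / std      # last point of the candidate
    lb = (x0 - q[0]) ** 2 + (y0 - q[m - 1]) ** 2
    if lb >= bsf:
        return lb                         # pruned after the cheapest check
    x1 = (t[j + 1] - mean) / std          # second point from the front
    lb += min((x1 - q[0]) ** 2, (x0 - q[1]) ** 2, (x1 - q[1]) ** 2)
    if lb >= bsf:
        return lb
    y1 = (t[j + m - 2] - mean) / std      # second point from the back
    lb += min((y1 - q[m - 1]) ** 2, (y0 - q[m - 2]) ** 2, (y1 - q[m - 2]) ** 2)
    return lb
```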


    2. LB_Keogh, whose time complexity is O(n); a minimal sketch follows.
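
LB_Keogh builds an upper envelope U and a lower envelope L around one sequence (within the warping window r) and accumulates, for each point of the other sequence that falls outside the envelope, its squared distance to the envelope. A minimal sketch, assuming both sequences are already z-normalized and computing the envelope naively (the UCR Suite precomputes the envelope with a streaming min/max algorithm):

```python
def lb_keogh(q, c, r):
    """Envelope-based lower bound on DTW(q, c) with warping window r.

    The envelope of q is recomputed per point here (O(n*r)); a streaming
    min/max makes it O(n) overall.
    """
    n = len(q)
    lb = 0.0
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        u = max(q[lo:hi])        # upper envelope of q at position i
        l = min(q[lo:hi])        # lower envelope of q at position i
        if c[i] > u:
            lb += (c[i] - u) ** 2
        elif c[i] < l:
            lb += (c[i] - l) ** 2
    return lb
```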


  3. Early abandoning of ED and LB_Keogh

    Building on this optimization strategy, the authors further propose reordered early abandoning to reduce the computational cost (described under the UCR Suite strategies below).


That is, while computing ED or LB_Keogh, if the sum of squared differences accumulated over the first k time points of the two sequences (with k ≤ |Q|) already exceeds the distance of the current best-so-far match, the similarity computation between Q and C can be terminated early. The computation is illustrated in Figure 1-4, and a small sketch of the check follows.
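
A minimal sketch of the check (the distance method of the UCR_ED class at the end of this post implements the same idea, together with the reordering strategy described later):

```python
def squared_ed_early_abandon(q, c, bsf):
    """Accumulate squared differences and stop once best-so-far is exceeded."""
    s = 0.0
    for qi, ci in zip(q, c):
        s += (qi - ci) ** 2
        if s >= bsf:
            break          # this candidate can no longer beat the best match
    return s
```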



  4. Early Abandoning of DTW


Even after LB_Keogh has been fully computed, the full DTW value may still have to be calculated. We can reuse part of the LB_Keogh value to reduce the amount of DTW computation.


For example, compute the DTW value only over the first time points [1, k] (from left to right), and for the remaining time points [k+1, n] reuse the previously computed LB_Keogh contribution. The resulting value is still a lower bound on the full DTW distance. We can therefore apply an early-stopping strategy: each time the DTW value up to the current time point has been computed, combine it with the previously computed LB_Keogh part to obtain a lower bound for the whole sequence.


If this lower bound exceeds the current best-so-far distance, the DTW computation can be terminated early.


This idea is illustrated in Figure 1-5, and a minimal sketch follows.
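
A minimal sketch of this combination, under the assumption that cb[i] holds the LB_Keogh contribution of positions i..n−1 (a suffix sum of the per-point envelope distances; the UCR Suite maintains a similar cumulative-bound array) and that a Sakoe-Chiba window of width r is used. This is a sketch of the idea, not the original implementation.

```python
def dtw_early_abandon(q, c, r, bsf, cb):
    """DTW with Sakoe-Chiba window r and early abandoning.

    After finishing row i, the best partial cost plus cb[i + r + 1] is a
    lower bound on the final DTW value, so the computation is abandoned as
    soon as that bound reaches bsf.
    """
    INF = float('inf')
    n = len(q)
    prev = [INF] * n
    for i in range(n):
        cur = [INF] * n
        row_min = INF
        for j in range(max(0, i - r), min(n, i + r + 1)):
            d = (q[i] - c[j]) ** 2
            if i == 0 and j == 0:
                cur[j] = d
            else:
                cur[j] = d + min(prev[j],                        # q advances
                                 prev[j - 1] if j > 0 else INF,  # both advance
                                 cur[j - 1] if j > 0 else INF)   # c advances
            row_min = min(row_min, cur[j])
        if i + r + 1 < n and row_min + cb[i + r + 1] >= bsf:
            return row_min + cb[i + r + 1]   # abandon: bound already too large
        prev = cur
    return prev[n - 1]
```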



The above covers the four previously known optimizations of DTW.


One more known strategy for speeding up DTW is to exploit multi-core computing resources.


UCR Suite Optimization Strategies

Concepts and definitions


Definition 1:


A time series T is an ordered list: T = t₁, t₂, ⋯, t_m. The source data is one very long time series, which we ultimately compare against much shorter query sequences.


Definition 2:


A subsequence T_{i,k} of the time series T is the subsequence that starts at t_i and has length k, namely:

T_{i,k} = t_i, t_{i+1}, ⋯, t_{i+k−1},  where 1 ≤ i ≤ m − k + 1.


Below, we denote T_{i,k} by C: a candidate subsequence to be compared against the query Q. Let the length of Q be |Q| = n.


Definition 3:


The Euclidean distance between Q and C (with |Q| = |C|) is defined as (Equation 1):

ED(Q, C) = √( Σ_{i=1}^{n} (q_i − c_i)² )


The t-th element of the path P is defined as p_t = (i, j)_t. The warping path P can then be expressed as (Equation 2):

P = p₁, p₂, ⋯, p_t, ⋯, p_T,  where n ≤ T ≤ 2n − 1


Optimization Strategies:


1. Early Abandoning Z-normalization


Q and C need to be z-normalized before their DTW distance is computed, but normalizing the entire dataset up front is too expensive. The paper therefore uses online Z-normalization, so that the early-stopping strategy can also be applied to terminate the normalization computation early.


First, the mean and variance of a sequence C are computed as follows (Equation 3):

μ = (1/m) Σ_{i=1}^{m} c_i,  σ² = (1/m) Σ_{i=1}^{m} c_i² − μ²



When online Z-normalization has iterated to the k-th point of the source time series T, the running sum of the elements and of their squares up to that point can be expressed as (Equation 4):

ex_k = Σ_{i=1}^{k} t_i,  ex2_k = Σ_{i=1}^{k} t_i²

Then the mean and variance of the m points between positions k − m + 1 and k are given by (Equation 5):

μ = (ex_k − ex_{k−m}) / m,  σ² = (ex2_k − ex2_{k−m}) / m − μ²

Pseudo-code for the early-abandoning normalization strategy based on online Z-normalization is shown in Figure 1-7:


In the paper, the authors note that floating-point errors accumulate with this approach; therefore, after every one million subsequences compared, one complete Z-normalization is performed to eliminate the accumulated error.
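
A minimal sketch of the running-sum bookkeeping (the UCR_ED class at the end of this post does the same thing with a circular buffer; the periodic full re-normalization mentioned above is noted but not implemented here):

```python
def sliding_mean_std(stream, m):
    """Yield (index, mean, std) of the last m values using running sums.

    Only O(1) work per incoming value. A production version would also
    re-normalize periodically (e.g. every one million subsequences) to
    flush accumulated floating-point error.
    """
    window, ex, ex2 = [], 0.0, 0.0
    for i, x in enumerate(stream):
        window.append(x)
        ex += x
        ex2 += x * x
        if len(window) > m:
            old = window.pop(0)
            ex -= old
            ex2 -= old * old
        if len(window) == m:
            mean = ex / m
            var = max(ex2 / m - mean * mean, 0.0)   # guard against FP error
            yield i, mean, var ** 0.5
```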


2. Reorder early abandoning


The early-abandoning strategies described earlier start from the first time point and proceed from left to right. This paper proposes instead to quickly find the ordering of points with the largest sum of differences between Q and C, and to check in that order whether the accumulated value already exceeds best-so-far, thereby reducing the computational cost. A comparison of the cost of the two orderings is shown in Figure 1-8:



  1. The left side shows the differences computed in left-to-right order: nine time steps must be computed before the search can be abandoned. The right side shows the new ordering: only five time steps are needed to decide whether to abandon early.

  2. The question then becomes: how do we find the ordering with the largest sum of differences, and must the chosen time points be contiguous? Reading the source code shows that the selected points need not be contiguous.
    The approach in the paper is to sort the time points of the z-normalized query Q by their absolute values. The theoretical basis is that, when the DTW distance is computed, q_i may be matched against several time points of C, and the z-normalized C follows a Gaussian distribution with mean 0; therefore the q_i farthest from 0 contribute the most to the distance value. Sorting the z-normalized Q by absolute value therefore quickly finds the point ordering with the largest sum of differences.
    The authors' experiments show that the correlation between this heuristic ordering and the true optimal ordering is 0.999.
3. Reversing the query/data role in LB_Keogh

    LB_Keogh^EQ builds the envelope U, L around the query Q, so U and L need to be computed only once, which saves a lot of time and space; if LB_Keogh^EC (the envelope around the candidate C) were used everywhere, U and L would have to be computed for every C, which increases the computational cost considerably.

    Therefore LB_Keogh^EC is used selectively: only when LB_Keogh^EQ prunes poorly is LB_Keogh^EC computed "just in time" to assist LB_Keogh^EQ, which greatly reduces the space overhead. The time spent on LB_Keogh^EC is offset by the full DTW computations it prunes away. The two calculation directions are illustrated in Figure 1-9, and a minimal sketch follows.
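
A minimal sketch of the role reversal, assuming the lb_keogh helper sketched earlier is in scope and both sequences are z-normalized; the just-in-time policy below illustrates the described strategy rather than reproducing the original implementation.

```python
def lb_keogh_eq(q, c, r):
    """Envelope around the query Q, walk the candidate C (U/L reusable for every C)."""
    return lb_keogh(q, c, r)

def lb_keogh_ec(q, c, r):
    """Envelope around the candidate C, walk the query Q (roles reversed)."""
    return lb_keogh(c, q, r)

def can_prune(q, c, r, bsf):
    """True if the candidate can be discarded without computing DTW."""
    if lb_keogh_eq(q, c, r) >= bsf:
        return True
    # "just in time": only pay for the candidate-side envelope when EQ was not enough
    return lb_keogh_ec(q, c, r) >= bsf
```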


4. Use cascading lower bounds


    There are currently many ways to compute a lower bound; each can be used to prune DTW, and each has an estimable time complexity. To date at least 18 lower-bounding mechanisms exist. The authors implemented all of them and compared them on 50 different datasets; the results are shown in Figure 1-10:





  1. Based on these results, the authors prune ED and DTW with a cascade of lower bounds:

    First, LB_KimFL, whose time complexity is O(1), is applied; it filters out many candidate subsequences. Then LB_Keogh^EQ (the envelope built around Q) is used to prune further.


  2. When the pruning effect of LB_Keogh^EQ is not good enough, LB_Keogh^EC is used to assist it.
    Finally, if all of the above pruning strategies fail, the full DTW is still computed, but with early abandoning. A sketch of the whole cascade follows.
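
A minimal sketch of the full cascade for a single candidate, assuming the lb_kim_fl, lb_keogh, and dtw_early_abandon helpers sketched earlier; for simplicity an all-zero cumulative-bound array is passed to dtw_early_abandon, whereas the UCR Suite passes the real LB_Keogh suffix sums.

```python
def match_candidate(q, c_raw, r, mean, std, bsf):
    """Return the DTW distance if the candidate beats best-so-far, else None."""
    n = len(q)
    # stage 1: O(1) bound on the raw candidate values
    if lb_kim_fl(c_raw, 0, mean, std, q, bsf) >= bsf:
        return None
    c = [(x - mean) / std for x in c_raw]     # z-normalize the candidate
    # stage 2: envelope around the query (U/L computed once for all candidates)
    if lb_keogh(q, c, r) >= bsf:
        return None
    # stage 3: envelope around the candidate, computed "just in time"
    if lb_keogh(c, q, r) >= bsf:
        return None
    # stage 4: full DTW with early abandoning
    d = dtw_early_abandon(q, c, r, bsf, [0.0] * n)
    return d if d < bsf else None
```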


    Experiments show that every lower-bound strategy used above helps improve the DTW search speed; removing any one of them roughly doubles the search time. In large-scale search, these pruning strategies can save more than 99.9999% of the DTW computation time.


Analysis of results


The paper compares the performance of the following approaches:


  1. Naive: each subsequence is z-normalized from scratch, and the full Euclidean distance or full DTW is computed at every step. (About two-thirds of the published work computes similarity this way.)


  2. State-of-the-art (SOTA): the current best approach, implemented with Z-normalization, lower-bound-assisted early abandoning, and full DTW computation. (About one-third of the published work computes similarity this way.)


  3. UCR Suite


  4. GOd's ALgorithm (GOAL): compares similarity directly from the mean and variance only, with time complexity O(1).


    The GOAL model serves as a baseline: it is the fastest possible model for the unrestricted search problem over sequences of unknown length.


All four models in the comparison share the same UCR Suite code base; the only difference between them is which acceleration code is commented out.

Experimental comparison on a randomly generated dataset

As the figure shows, for queries of length 128 there is already a large performance gap between the SOTA algorithms and the UCR Suite.

Experimental comparison for queries of different lengths

Next, compare the performance of these models for queries of different lengths:

UCR-DTW Python implementation

UCR-DTW applies all of the optimization strategies described above.

GitHub:ucr-suite-python

UCR-ED Python implementation

UCR-ED applies the following optimization strategies:

  1. Early Abandoning of ED

  2. Reorder early abandoning

```python
import time
import math

class UCR_ED(object):
    def __init__(self,input_file, query_file,m=128):
        self.fp = open(input_file,'r')
        self.qp = open(query_file,'r')
        self.m = m #length of query
        self.Q = [None]*self.m#query array
        self.T = [0.0]*(self.m*2)#array of current data
        self.order = [] #ordering of query by |z(q_i)|
        self.bsf = float('inf')
        self.loc = 0 #answer:location of the best-so-far match
        
        self.ex,self.ex2,self.mean,self.std=0.0,0.0,0.0,0.0
        #used to record the running time
        self.t1 = time.time()
        self.t2 = 0.0
        
        self.Q_normalize()
            
        #Sort the query data
        self.sort_query_order()
        
        #read data file, one value at a time
        ex = 0.0
        ex2 = 0.0
        i = 0
        while True:
            try:
                line = self.line_to_float(next(self.fp))
                ex += line
                ex2 += line*line
                self.T[i%m] = line
                self.T[(i%m)+m] = line
            except (StopIteration, ValueError):
                # end of the data file
                break
            # if there is enough data in T, the ED distance can be calculated
            if i>=m-1:
                #the current starting location of T
                j = (i+1)%m
                #Z-norm(T[i]) will be calculated on the fly
                mean = ex/self.m
                std = ex2/self.m
                std = math.sqrt(std-mean*mean)

                #Calculate ED distance
                dist = self.distance(self.Q, self.T, j, self.m, mean, std, self.order, self.bsf)

                if dist<self.bsf:
                    self.bsf = dist
                    self.loc = i-m+1
                ex -= self.T[j]
                ex2 -= self.T[j]*self.T[j]
            i+=1
            
        self.fp.close()

        self.t2 = time.time()
        
        print("Location: ", self.loc)
        print("Distance: ",math.sqrt(self.bsf))
        print("Data Scanned: ", i)
        print("Total Execution Time: ",(self.t2-self.t1),' sec')
        
        
    def line_to_float(self, s):
        # parse one value per line (the original code used a helper for
        # exponent-format strings; plain float() covers the common case)
        return float(s.strip())
    
    
    def sort_query_order(self):
        # sort query indices by |z(q_i)| in descending order, so the points
        # that contribute most to the distance are compared first
        self.Q_tmp = {}
        for i in range(self.m):
            self.Q_tmp[i] = self.Q[i]
        self.Q_tmp = dict(sorted(self.Q_tmp.items(), key=lambda x: abs(x[1]), reverse=True))
        # keep only the sorted index order
        self.order = list(self.Q_tmp.keys())
        
        
    def Q_normalize(self):
        i = 0
        ex = 0.0
        ex2 = 0.0
        while i<self.m:
            try:
                line = self.line_to_float(next(self.qp))
                ex += line
                ex2 += line*line
                self.Q[i] = line
                i+=1
            except (StopIteration, ValueError):
                # end of the query file
                break
        self.qp.close()
        
        mean = ex/self.m
        std = ex2/self.m
        std = math.sqrt(std-mean*mean)
        
        #Do z-normalization on query data
        for i in range(self.m):
            self.Q[i] = (self.Q[i]-mean)/std
        
        
    def distance(self,Q,T,j,m,mean,std,order,bsf):
        '''
        Main function for calculating the ED distance between the query Q and the current data T.
        The query points are visited in the order given by `order` (indices sorted by |z(q_i)|),
        and the computation is abandoned early once the partial sum reaches bsf.
        '''
        distance_sum = 0.0
        for i in range(m):
            if distance_sum>=bsf:
                break
            # z-normalize the data point on the fly and compare it with the
            # corresponding (reordered) query point
            x = (T[order[i]+j]-mean)/std
            distance_sum += (x-Q[order[i]])*(x-Q[order[i]])

        return distance_sum
```
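
A minimal usage sketch; data.txt and query.txt are hypothetical file names, each containing one value per line:

```python
if __name__ == '__main__':
    # hypothetical input files: a long series (data.txt) and a length-128 query (query.txt)
    UCR_ED('data.txt', 'query.txt', m=128)
```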

Reference material

  • Searching and mining trillions -blog

  • Search Time Series

  • DTW (Dynamic Time Warping)

  • Time Series Classification and Clustering



Origin: juejin.im/post/5d639f976fb9a06afb61d712