ML-7 [Application] Clustering Algorithms: Time Series Clustering (DTW Distance and LB_Keogh)

Table of Contents

  1. Problem analysis
  2. Data processing
  3. Code implementation
  4. Results

Today a friend asked me for help: he wanted to cluster some new white-blood-cell (WBC) data from pneumonia patients and draw the overall trend curves, that is, group together the series whose curves fit and change in the same way. This is an unsupervised classification problem, which points to a clustering method.

1. Problem Analysis

1. First attempt: extract statistical features from each time series, such as the maximum and minimum values, then classify on those features with commonly used algorithms such as Naive Bayes, SVM, or KNN. The results were not very good.

2. Second attempt: unsupervised classification with K-means. K-means groups samples by the distance between each pair, so a notion of distance between two time series has to be defined first. After consulting the literature, dynamic time warping (Dynamic Time Warping, DTW) was the natural choice. The rest of this post expands on that.

2. Data Processing

The data provided were fairly complete, in an Excel spreadsheet; only some simple tidying was needed. The raw data are available at the GitHub address at the end of this article.
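For reference, here is a condensed sketch of the loading step (the full version is in the appendix; the file path and column layout — patient number, day, WBC value, 30 consecutive rows per patient — are the ones used there):

    import xlrd
    import numpy as np

    # Columns assumed from the appendix code: 0 = patient number, 1 = day, 2 = WBC
    workbook = xlrd.open_workbook(r"D:\datatest\xinguanfeiyan\20200315L.xlsx")
    worksheet = workbook.sheet_by_index(0)
    n = worksheet.nrows
    Numb = [worksheet.cell_value(i, 0) for i in range(1, n)]  # skip the header row
    days = [worksheet.cell_value(i, 1) for i in range(1, n)]
    WBC  = [worksheet.cell_value(i, 2) for i in range(1, n)]
    WBCData = np.array(WBC).reshape(-1, 30)  # one 30-day series per patient row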

3. Code Implementation

3.1 Dynamic Time Warping (DTW)

Measured by Euclidean distance, ts3 is closer to ts1 than ts2 is; to the naked eye, though, that is clearly not the case. This is what motivates the DTW distance.
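The code that produces ts1, ts2, and ts3 is kept (commented out) in the appendix; a minimal version to reproduce the Euclidean comparison:

    import numpy as np
    import pandas as pd
    import math

    x = np.linspace(0, 50, 100)
    ts1 = pd.Series(3.1 * np.sin(x / 1.5) + 3.5)        # sinusoid
    ts2 = pd.Series(2.2 * np.sin(x / 3.5 + 2.4) + 3.2)  # another sinusoid
    ts3 = pd.Series(0.04 * x + 3.0)                     # a near-flat line

    def euclid_dist(t1, t2):
        return math.sqrt(sum((t1 - t2) ** 2))

    print(euclid_dist(ts1, ts2))   # 26.959216038 -> ts2 looks "farther" from ts1
    print(euclid_dist(ts1, ts3))   # 23.1892491903 -> than the straight line ts3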

  Dynamic time warping, as the name suggests, "warps" time so that two sequences of different lengths that represent the same kind of thing can be aligned. The classic application is speech recognition: the same word spoken by different speakers will certainly differ in duration, so the recorded signals are very similar overall but not aligned in time. We therefore stretch or compress one signal with a warping function so that the error between the two is minimized. The following blog post gives a good explanation: https://blog.csdn.net/lin_limin/article/details/81241058. In short: when computing the point-wise differences, DTW allows a time offset between the points being compared and takes the minimum accumulated difference as the distance.

In code, the DTW distance is defined as follows (this and the later snippets assume `import math`):

    def DTWDistance(s1, s2):
        # full DTW: DTW[(i, j)] is the minimum accumulated squared distance
        # over all alignments of s1[:i+1] and s2[:j+1]
        DTW = {}
        for i in range(len(s1)):
            DTW[(i, -1)] = float('inf')
        for i in range(len(s2)):
            DTW[(-1, i)] = float('inf')
        DTW[(-1, -1)] = 0
        for i in range(len(s1)):
            for j in range(len(s2)):
                dist = (s1[i] - s2[j]) ** 2
                DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
        return math.sqrt(DTW[(len(s1) - 1, len(s2) - 1)])
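As a quick sanity check on the demo series above, we would expect DTW to agree with the eye and rank the two sinusoids closer together than ts1 and the line ts3 (an expectation, not a measured result):

    # requires the demo series ts1, ts2, ts3 defined earlier
    print(DTWDistance(list(ts1), list(ts2)))  # expected: the smaller of the two
    print(DTWDistance(list(ts1), list(ts3)))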

Computed this way the distance is relatively expensive; the time complexity is the product of the two sequence lengths. A small speedup: only evaluate cells within a window of width w around the diagonal, which bounds the allowed offset and reduces the amount of work:

    def DTWDistance(s1, s2, w):
        # windowed DTW: only alignments with |i - j| < w are considered
        DTW = {}
        w = max(w, abs(len(s1) - len(s2)))  # the window must at least cover the length difference
        for i in range(-1, len(s1)):
            for j in range(-1, len(s2)):
                DTW[(i, j)] = float('inf')
        DTW[(-1, -1)] = 0
        for i in range(len(s1)):
            for j in range(max(0, i - w), min(len(s2), i + w)):
                dist = (s1[i] - s2[j]) ** 2
                DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
        return math.sqrt(DTW[(len(s1) - 1, len(s2) - 1)])
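A quick way to see the effect of the window on the demo series: widening w only enlarges the set of alignments searched, so the distance can only stay the same or decrease (the specific values are not claimed here):

    d3  = DTWDistance(list(ts1), list(ts2), 3)   # tight band: |i - j| < 3
    d10 = DTWDistance(list(ts1), list(ts2), 10)  # wider band
    print(d3, d10)  # d10 <= d3, since the wider band searches a superset of alignments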

3.2 LB_Keogh distance

The main idea: when searching a large collection of time series, comparing every candidate one by one with the DTW algorithm is very time-consuming. Instead, we can first compute a much cheaper approximate lower bound, LB; the bound lets us discard the majority of sequences that cannot possibly be the best match, and only the remaining few are compared one by one with DTW.

The explanation below mainly follows another blog post. The LB_Keogh definition is relatively complex and consists of two parts.

The first part is the {U, L} envelope of the query sequence Q (see the figure below), which defines an upper and a lower bound for Q at every time step:

$$U_i = \max(q_{i-r}, \dots, q_{i+r}), \qquad L_i = \min(q_{i-r}, \dots, q_{i+r})$$

where r is the radius of a sliding window and can be chosen freely.

(Figure: the sequence Q with its upper envelope U and lower envelope L.)

U is the upper envelope: at each time step it takes the maximum of Q over the r steps before and after the current step; likewise L is the lower envelope built from the minimum. LB_Keogh of a candidate sequence C against Q is then defined as:

$$LB\_Keogh(Q, C) = \sqrt{\sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & c_i > U_i \\ (L_i - c_i)^2 & c_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$

That is, only the points of C that escape the envelope contribute. In code:

    def LB_Keogh(s1, s2, r):
        # lower bound on the banded DTW distance between s1 and s2
        LB_sum = 0
        for ind, i in enumerate(s1):
            # envelope of s2 around position ind, built from the window [ind-r, ind+r)
            lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
            upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
            if i >= upper_bound:
                LB_sum = LB_sum + (i - upper_bound) ** 2  # point escapes above the envelope
            elif i < lower_bound:
                LB_sum = LB_sum + (i - lower_bound) ** 2  # point escapes below the envelope
        return math.sqrt(LB_sum)
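By construction, LB_Keogh with radius r lower-bounds the band-constrained DTW distance computed with the same window, which is exactly why it can safely prune candidates before the expensive DTW call. A small illustration, assuming the demo series are still in scope:

    r = 10
    lb  = LB_Keogh(list(ts1), list(ts2), r)      # O(n) bound
    dtw = DTWDistance(list(ts1), list(ts2), r)   # O(n*w) exact (banded) distance
    print(lb, dtw)  # lb <= dtw, so any candidate whose lb exceeds the best-so-far can be skipped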

3.3 Clustering with the k-means algorithm

The clustering routine ties the two distances together: LB_Keogh quickly rules out centroids that cannot be the closest, and DTW is only computed for the survivors.

    # k-means clustering
    # num_clust: the number of clusters
    def k_means_clust(data, num_clust, num_iter, w=3):
        ## Step 1: initialize the centroids by randomly sampling num_clust series
        centroids = random.sample(list(data), num_clust)
        counter = 0
        for n in range(num_iter):
            counter += 1
            # print(counter)
            assignments = {}  # cluster number (0, 1, 2, ...) -> row indices of its members
            # iterate over every sample point; ind is its row index in data
            for ind, i in enumerate(data):
                min_dist = float('inf')  # nearest distance so far, initialized to a large value
                closest_clust = None     # index of the nearest centroid
                ## Step 2: find the nearest centroid
                for c_ind, j in enumerate(centroids):  # distance from point i to each of the num_clust centroids
                    if LB_Keogh(i, j, 3) < min_dist:   # cheap lower bound prunes most candidates
                        cur_dist = DTWDistance(i, j, w)
                        if cur_dist < min_dist:        # centroid c_ind is the nearest found so far
                            min_dist = cur_dist
                            closest_clust = c_ind
                ## Step 3: record which cluster ind belongs to
                # print(closest_clust)
                if closest_clust in assignments:
                    assignments[closest_clust].append(ind)
                else:
                    assignments[closest_clust] = []
                    assignments[closest_clust].append(ind)
            ## Step 4: recalculate the centroid of each cluster
            for key in assignments:
                clust_sum = 0
                for k in assignments[key]:
                    clust_sum = clust_sum + data[k]
                centroids[key] = [m / len(assignments[key]) for m in clust_sum]
        return centroids, assignments  # centroid curves and, per cluster, the indices of its members
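A hypothetical smoke test on synthetic data (not the patient data) is a useful way to verify the function before pointing it at the spreadsheet; two well-separated groups of noisy curves should land in different clusters:

    import numpy as np
    import random, math  # needed by the distance functions above

    rng = np.random.RandomState(0)
    t = np.linspace(0, 2 * np.pi, 30)
    group_a = [np.sin(t) + rng.normal(0, 0.1, 30) for _ in range(5)]
    group_b = [np.cos(t) + 2 + rng.normal(0, 0.1, 30) for _ in range(5)]
    toy = np.array(group_a + group_b)

    cents, assigns = k_means_clust(toy, num_clust=2, num_iter=10, w=3)
    print(assigns)  # expect indices 0-4 and 5-9 to end up in separate clusters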

3.4 Print and plot the detailed classification from the clustering:

    num_clust = 2  # number of clusters
    centroids, assignments = k_means_clust(WBCData, num_clust, 800, 3)
    for i in range(num_clust):
        s = []
        WBC01 = []
        days01 = []
        for j, indj in enumerate(assignments[i]):  # collect the coordinates of each member
            s.append(int(Numb[indj * 30]))
            WBC01 = np.hstack((WBC01, WBC[30 * indj:30 * indj + 30]))
            days01 = np.hstack((days01, days[0:30]))
        print(s)
        plt.title('%s' % s)
        plt.plot(centroids[i], c="r", lw=4)
        plt.scatter(days01, WBC01)
        plt.show()

4. Results

Here the data are divided into two classes; the number can be adjusted flexibly through num_clust. With num_clust equal to 2, the classification comes out as follows:

WBC01: [6774, 7193, 8070, 8108, 8195, 2020006799, 2020007003, 2020007251, 2020007420, 2020007636, 2020007718, 2020007928, 2020007934, 2020008022, 2020008196, 2020008239, ......] (not all members listed)

WBC02: [2020007250, 2020007388, 2020007389, 2020007422, 2020007625, 2020007703, 2020007927, ……]

Note:

During training, pay close attention to the data types, in particular numpy matrix versus ndarray: both print the same shape, e.g. (45, 30), but a moment of carelessness during training leads to a mess of problems that take long print-based debugging sessions to track down.
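The matrix-vs-ndarray trap is easy to demonstrate: both types print the same shape, but their rows behave differently when iterated, which silently breaks loops like `for ind, i in enumerate(data)`:

    import numpy as np

    a = np.mat(np.arange(6)).reshape(2, 3)  # numpy matrix
    b = np.array(a)                         # plain ndarray
    print(a.shape, b.shape)                 # both print (2, 3)
    print(a[0].shape)  # (1, 3) -- a matrix row is still 2-D
    print(b[0].shape)  # (3,)   -- an ndarray row is 1-D
    # so iterating a matrix yields (1, 30)-shaped rows, not 30-element series,
    # which is why the code converts WBCData with np.array() before clustering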

For the data and code in this article, please visit my GitHub to download them. If they are useful to you, feel free to leave a star.

Appendix I: Full Code

    # Author: yifan

    import pandas as pd
    import numpy as np
    import matplotlib.pylab as plt
    import math
    import random
    import xlrd
    import sys


    # Demo: Euclidean distance ranks the series against intuition (see Section 3.1)
    # x = np.linspace(0, 50, 100)
    # ts1 = pd.Series(3.1 * np.sin(x / 1.5) + 3.5)
    # ts2 = pd.Series(2.2 * np.sin(x / 3.5 + 2.4) + 3.2)
    # ts3 = pd.Series(0.04 * x + 3.0)
    # ts1.plot()
    # ts2.plot()
    # ts3.plot()
    # plt.ylim(-2, 10)
    # plt.legend(['ts1', 'ts2', 'ts3'])
    # plt.show()
    # def euclid_dist(t1, t2):
    #     return math.sqrt(sum((t1 - t2) ** 2))
    # print(euclid_dist(ts1, ts2))   # 26.959216038
    # print(euclid_dist(ts1, ts3))   # 23.1892491903


    # 1  Data extraction
    # workbook = xlrd.open_workbook(r"D:\datatest\xinguanfeiyan\20200229.xlsx")
    workbook = xlrd.open_workbook(r"D:\datatest\xinguanfeiyan\20200315L.xlsx")
    worksheet = workbook.sheet_by_index(0)
    n = worksheet.nrows
    days = [0] * (n - 1)
    WBC = [0] * (n - 1)
    Numb = [0] * (n - 1)
    for i in range(1, n):
        Numb[i - 1] = worksheet.cell_value(i, 0)
        days[i - 1] = worksheet.cell_value(i, 1)
        WBC[i - 1] = worksheet.cell_value(i, 2)
    S = int(len(WBC) / 30)                # number of patients, 30 records each
    WBCData = np.mat(WBC).reshape(S, 30)
    WBCData = np.array(WBCData)           # convert matrix to ndarray (see the note above)
    # print(WBCData)

    # 2  Similarity distances
    # Full DTW distance; time complexity is the product of the two sequence lengths
    def DTWDistance(s1, s2):
        DTW = {}
        for i in range(len(s1)):
            DTW[(i, -1)] = float('inf')
        for i in range(len(s2)):
            DTW[(-1, i)] = float('inf')
        DTW[(-1, -1)] = 0
        for i in range(len(s1)):
            for j in range(len(s2)):
                dist = (s1[i] - s2[j]) ** 2
                DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
        return math.sqrt(DTW[(len(s1) - 1, len(s2) - 1)])

    # Windowed DTW distance: only cells within w of the diagonal are evaluated,
    # which reduces the time complexity
    def DTWDistance(s1, s2, w):
        DTW = {}
        w = max(w, abs(len(s1) - len(s2)))
        for i in range(-1, len(s1)):
            for j in range(-1, len(s2)):
                DTW[(i, j)] = float('inf')
        DTW[(-1, -1)] = 0
        for i in range(len(s1)):
            for j in range(max(0, i - w), min(len(s2), i + w)):
                dist = (s1[i] - s2[j]) ** 2
                DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
        return math.sqrt(DTW[(len(s1) - 1, len(s2) - 1)])

    # Another way to speed things up is to use the LB Keogh lower bound of dynamic time warping
    def LB_Keogh(s1, s2, r):
        LB_sum = 0
        for ind, i in enumerate(s1):
            lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
            upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
            if i >= upper_bound:
                LB_sum = LB_sum + (i - upper_bound) ** 2
            elif i < lower_bound:
                LB_sum = LB_sum + (i - lower_bound) ** 2
        return math.sqrt(LB_sum)


    # 3  k-means clustering
    # num_clust: the number of clusters
    def k_means_clust(data, num_clust, num_iter, w=3):
        ## Step 1: initialize the centroids by random sampling
        centroids = random.sample(list(data), num_clust)
        counter = 0
        for n in range(num_iter):
            counter += 1
            # print(counter)
            assignments = {}  # cluster number -> row indices of its members
            for ind, i in enumerate(data):
                min_dist = float('inf')
                closest_clust = None
                ## Step 2: find the nearest centroid (LB_Keogh prunes before DTW)
                for c_ind, j in enumerate(centroids):
                    if LB_Keogh(i, j, 3) < min_dist:
                        cur_dist = DTWDistance(i, j, w)
                        if cur_dist < min_dist:
                            min_dist = cur_dist
                            closest_clust = c_ind
                ## Step 3: record the cluster that ind belongs to
                if closest_clust in assignments:
                    assignments[closest_clust].append(ind)
                else:
                    assignments[closest_clust] = []
                    assignments[closest_clust].append(ind)
            ## Step 4: recalculate the centroid of each cluster
            for key in assignments:
                clust_sum = 0
                for k in assignments[key]:
                    clust_sum = clust_sum + data[k]
                centroids[key] = [m / len(assignments[key]) for m in clust_sum]
        return centroids, assignments  # centroid curves and the member indices of each cluster


    # 4  Run the clustering and plot each cluster
    num_clust = 2  # number of clusters
    centroids, assignments = k_means_clust(WBCData, num_clust, 800, 3)
    for i in range(num_clust):
        s = []
        WBC01 = []
        days01 = []
        for j, indj in enumerate(assignments[i]):  # collect the coordinates of each member
            s.append(int(Numb[indj * 30]))
            WBC01 = np.hstack((WBC01, WBC[30 * indj:30 * indj + 30]))
            days01 = np.hstack((days01, days[0:30]))
        print(s)
        plt.title('%s' % s)
        plt.plot(centroids[i], c="r", lw=4)
        plt.scatter(days01, WBC01)
        plt.show()

  

 


Source: www.cnblogs.com/yifanrensheng/p/12501238.html