Table of Contents
- Problem analysis
- Data processing
- Code implementation
- Results
A friend recently asked me for help: given time series of white blood cell (WBC) counts from pneumonia patients, cluster the data so that curves with similar shapes and trends end up in the same group. Since there are no labels, this is an unsupervised classification problem, which naturally suggests a clustering method.
1. Problem analysis
1. My first attempt was to extract statistical features from each time series (for example the maximum and minimum values) and then classify on the extracted features with standard algorithms such as Naive Bayes, SVM, or KNN. The results were not very good.
2. Next I tried unsupervised classification based on K-means. K-means groups data according to pairwise distances, so a notion of distance between two time series has to be defined first. After looking into it, I decided to use Dynamic Time Warping (DTW). The rest of this post builds on that idea.
2. Data processing
The data was provided fairly complete in an Excel spreadsheet, so only some simple tidying was needed. The raw data is available at the GitHub address at the end of this article.
3. Code implementation
3.1 Dynamic Time Warping (DTW)
Under the Euclidean distance, ts3 comes out closer to ts1 than ts2 does, yet to the naked eye ts2 is clearly the more similar curve. This is what motivates the DTW distance.
Dynamic time warping, as the name suggests, "warps" the time axis to align two sequences of different lengths that represent the same kind of thing. The classic application is speech recognition: the same word spoken by different speakers produces recordings of different lengths whose signals are very similar in shape but not aligned in time. We therefore stretch or compress one signal with a warping function so that the error between the two is minimized. The following blog post gives a good explanation: https://blog.csdn.net/lin_limin/article/details/81241058. In short: DTW allows points to be matched with a time offset, and takes the minimal total matching cost as the distance.
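The ts1/ts2/ts3 comparison can be reproduced with a short sketch (the series definitions mirror the demo that is commented out in the appendix code; the distances match the values printed there):

```python
import numpy as np

x = np.linspace(0, 50, 100)
ts1 = 3.1 * np.sin(x / 1.5) + 3.5
ts2 = 2.2 * np.sin(x / 3.5 + 2.4) + 3.2   # same shape as ts1, shifted in time
ts3 = 0.04 * x + 3.0                       # a straight line

def euclid_dist(t1, t2):
    # point-wise Euclidean distance; it ignores time shifts entirely
    return np.sqrt(np.sum((t1 - t2) ** 2))

# The straight line ts3 comes out "closer" to ts1 than the similar sine ts2
print(euclid_dist(ts1, ts2))  # ~26.96
print(euclid_dist(ts1, ts3))  # ~23.19
```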
The DTW distance is defined in code as follows:
```python
import math

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])
```
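A quick sanity check of the function (repeated here so the snippet runs on its own): identical sequences are at distance 0, and DTW also absorbs a stretched step that Euclidean distance could not even handle, since the lengths differ.

```python
import math

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

a = [1, 2, 3, 4]
b = [1, 1, 2, 3, 4]   # same shape, one step stretched
print(DTWDistance(a, a))  # 0.0 -- identical sequences align perfectly
print(DTWDistance(a, b))  # 0.0 -- the repeated point is absorbed by warping
```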
This is fairly expensive: the time complexity is the product of the two sequence lengths. So I added a small speed-up:
```python
# Windowed DTW: only cells within a window of width w around the diagonal
# are examined, bounding the allowed offset and reducing the recursion.
def DTWDistance(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])
```
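A consistency check on the windowed version (a sketch I added; the two variants are renamed `dtw_full` and `dtw_window` here so both can coexist, since the post reuses the name `DTWDistance`): with a window as wide as the sequences, exactly the same cells are visited, so the two distances agree, and a narrow window can only increase the distance, never decrease it.

```python
import math
import random

def dtw_full(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

def dtw_window(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

random.seed(0)
s1 = [random.random() for _ in range(20)]
s2 = [random.random() for _ in range(20)]
print(dtw_window(s1, s2, len(s1)) == dtw_full(s1, s2))  # True
print(dtw_window(s1, s2, 3) >= dtw_full(s1, s2))        # True: constraining paths cannot shrink the distance
```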
3.2 LB_Keogh distance
The main idea: when searching a large dataset, comparing every candidate with full DTW is very time-consuming. Instead, we can compute a cheap approximation, the LB_Keogh lower bound, use it to discard the majority of sequences that cannot possibly be the best match, and run full DTW only on the remaining candidates.
The definition of LB_Keogh is somewhat involved and consists of two parts (the explanation below mainly follows other blog posts).
The first part is the {U, L} envelope of the query sequence Q, which defines an upper and a lower bound for each time step:

U_i = max(q_{i-r}, ..., q_{i+r})
L_i = min(q_{i-r}, ..., q_{i+r})

where r is the reach of a sliding window and can be chosen freely. (The original post shows a figure of the envelope here.) U is the upper envelope: at each time step it takes the maximum of Q within r steps before and after; likewise L is the lower envelope, taking the minimum. LB_Keogh is then defined as the root of the summed squared distances from each point of the candidate sequence C to the envelope: (c_i - U_i)^2 when c_i lies above U_i, (c_i - L_i)^2 when it lies below L_i, and 0 when it lies inside the envelope.
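The envelope is easy to compute directly (a sketch; the helper name `envelope` is mine, and this window includes both endpoints i-r and i+r, whereas the LB_Keogh code below stops one short of ind+r because of Python slicing):

```python
import numpy as np

def envelope(q, r):
    # U[i] / L[i]: max / min of q over the window [i-r, i+r], clipped at the ends
    n = len(q)
    U = np.array([max(q[max(0, i - r): i + r + 1]) for i in range(n)])
    L = np.array([min(q[max(0, i - r): i + r + 1]) for i in range(n)])
    return U, L

q = [1.0, 3.0, 2.0, 5.0, 4.0]
U, L = envelope(q, 1)
print(U)  # [3. 3. 5. 5. 5.]
print(L)  # [1. 1. 2. 2. 4.]
```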
In code:
```python
def LB_Keogh(s1, s2, r):
    LB_sum = 0
    for ind, i in enumerate(s1):
        # envelope of s2 within r steps around the current index
        lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        if i >= upper_bound:
            LB_sum = LB_sum + (i - upper_bound) ** 2
        elif i < lower_bound:
            LB_sum = LB_sum + (i - lower_bound) ** 2
    return math.sqrt(LB_sum)
```
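The point of LB_Keogh is that it never exceeds the windowed DTW distance with a matching window, so discarding candidates by the bound is safe. A quick numerical check (both functions repeated so the sketch runs on its own):

```python
import math
import random

def DTWDistance(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

def LB_Keogh(s1, s2, r):
    LB_sum = 0
    for ind, i in enumerate(s1):
        lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        if i >= upper_bound:
            LB_sum = LB_sum + (i - upper_bound) ** 2
        elif i < lower_bound:
            LB_sum = LB_sum + (i - lower_bound) ** 2
    return math.sqrt(LB_sum)

random.seed(1)
s1 = [random.random() for _ in range(30)]
s2 = [random.random() for _ in range(30)]
bound = LB_Keogh(s1, s2, 3)
dtw_dist = DTWDistance(s1, s2, 3)
print(bound <= dtw_dist)    # True: the cheap bound never overestimates
print(LB_Keogh(s1, s1, 3))  # 0.0: a series always stays inside its own envelope
```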
3.3 Clustering with the K-means algorithm
```python
import random

# K-means clustering; num_clust is the number of clusters
def k_means_clust(data, num_clust, num_iter, w=3):
    # Step 1: initialize the centroids with randomly sampled series
    centroids = random.sample(list(data), num_clust)
    for n in range(num_iter):
        # assignments maps a cluster index (0, 1, 2, ...) to the list of
        # sample indices currently assigned to that cluster
        assignments = {}
        # walk over every sample series i; ind is its index in data
        for ind, i in enumerate(data):
            min_dist = float('inf')   # distance to the nearest centroid so far
            closest_clust = None      # index of the nearest centroid
            # Step 2: find the nearest centroid, using LB_Keogh to skip the
            # full DTW whenever the cheap bound already rules a centroid out
            for c_ind, j in enumerate(centroids):
                if LB_Keogh(i, j, 3) < min_dist:
                    cur_dist = DTWDistance(i, j, w)
                    if cur_dist < min_dist:
                        min_dist = cur_dist
                        closest_clust = c_ind
            # Step 3: record which cluster sample ind belongs to
            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = [ind]
        # Step 4: recalculate each cluster's centroid as the mean series
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / len(assignments[key]) for m in clust_sum]
    return centroids, assignments  # centroid series and index lists per cluster
```
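To see the pieces work end to end, here is a toy run on synthetic data (my own made-up series, not the patient data; the three functions are repeated so the sketch is self-contained). Two groups of noisy lines, one rising and one falling, should separate:

```python
import math
import random
import numpy as np

def DTWDistance(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

def LB_Keogh(s1, s2, r):
    LB_sum = 0
    for ind, i in enumerate(s1):
        lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        if i >= upper_bound:
            LB_sum = LB_sum + (i - upper_bound) ** 2
        elif i < lower_bound:
            LB_sum = LB_sum + (i - lower_bound) ** 2
    return math.sqrt(LB_sum)

def k_means_clust(data, num_clust, num_iter, w=3):
    centroids = random.sample(list(data), num_clust)
    for n in range(num_iter):
        assignments = {}
        for ind, i in enumerate(data):
            min_dist = float('inf')
            closest_clust = None
            for c_ind, j in enumerate(centroids):
                if LB_Keogh(i, j, 3) < min_dist:
                    cur_dist = DTWDistance(i, j, w)
                    if cur_dist < min_dist:
                        min_dist = cur_dist
                        closest_clust = c_ind
            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = [ind]
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / len(assignments[key]) for m in clust_sum]
    return centroids, assignments

# two obvious groups: three rising and three falling noisy lines, length 30
rng = np.random.RandomState(0)
up = [np.linspace(0, 10, 30) + rng.normal(0, 0.3, 30) for _ in range(3)]
down = [np.linspace(10, 0, 30) + rng.normal(0, 0.3, 30) for _ in range(3)]
data = np.array(up + down)

random.seed(3)
centroids, assignments = k_means_clust(data, num_clust=2, num_iter=10, w=3)
print(assignments)  # indices 0-2 (rising) and 3-5 (falling) should split apart
```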
3.4 Printing the detailed classification from the clustering result:
```python
num_clust = 2   # number of clusters to produce
centroids, assignments = k_means_clust(WBCData, num_clust, 800, 3)
for i in range(num_clust):
    s = []
    WBC01 = []
    days01 = []
    for j, indj in enumerate(assignments[i]):   # collect the points of each cluster
        s.append(int(Numb[indj * 30]))
        WBC01 = np.hstack((WBC01, WBC[30 * indj:30 * indj + 30]))
        days01 = np.hstack((days01, days[0:30]))
    print(s)
    plt.title('%s' % s)
    plt.plot(centroids[i], c="r", lw=4)
    plt.scatter(days01, WBC01)
    plt.show()
```
4. Results
Here the number of classes is set to two (num_clust = 2); it can be adjusted flexibly via that parameter. The two resulting classes are:
WBC01: [6774, 7193, 8070, 8108, 8195, 2020006799, 2020007003, 2020007251, 2020007420, 2020007636, 2020007718, 2020007928, 2020007934, 2020008022, 2020008196, 2020008239, ......] (not all listed)
WBC02: [2020007250, 2020007388, 2020007389, 2020007422, 2020007625, 2020007703, 2020007927, ......]
Note:
During training, pay close attention to data types such as np.matrix versus np.ndarray. Both may print an identical shape like (45, 30), yet they behave differently during training, and a moment of carelessness leads to confusing problems that take a long round of print-debugging to track down.
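That pitfall can be demonstrated in isolation (a minimal sketch I added): `np.mat` and `np.array` print the same shape, but the `*` operator means matrix product for one and element-wise product for the other, exactly the kind of silent difference that wrecks a run.

```python
import numpy as np

m = np.mat([[1, 2], [3, 4]])    # matrix type
a = np.array([[1, 2], [3, 4]])  # plain ndarray
print(m.shape, a.shape)  # (2, 2) (2, 2) -- identical shapes, different semantics

print(m * m)  # matrix product:  [[ 7 10] [15 22]]
print(a * a)  # element-wise:    [[ 1  4] [ 9 16]]

# converting explicitly, as the main code does after reshape, removes the ambiguity
print(np.array(m) * np.array(m))  # element-wise again
```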
For the data and code in this article, please visit my GitHub and download them. If they are useful to you, feel free to star the repository.
Appendix: full code
```python
# Author: yifan
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import math
import random
import xlrd
import sys


# Demo: Euclidean distance misleads on shifted sine waves
# x = np.linspace(0, 50, 100)
# ts1 = pd.Series(3.1 * np.sin(x / 1.5) + 3.5)
# ts2 = pd.Series(2.2 * np.sin(x / 3.5 + 2.4) + 3.2)
# ts3 = pd.Series(0.04 * x + 3.0)
# ts1.plot()
# ts2.plot()
# ts3.plot()
# plt.ylim(-2, 10)
# plt.legend(['ts1', 'ts2', 'ts3'])
# plt.show()
# def euclid_dist(t1, t2):
#     return math.sqrt(sum((t1 - t2) ** 2))
# print(euclid_dist(ts1, ts2))  # 26.959216038
# print(euclid_dist(ts1, ts3))  # 23.1892491903


# 1 Data extraction
# workbook = xlrd.open_workbook(r"D:\datatest\xinguanfeiyan\20200229.xlsx")
workbook = xlrd.open_workbook(r"D:\datatest\xinguanfeiyan\20200315L.xlsx")
worksheet = workbook.sheet_by_index(0)
n = worksheet.nrows
days = [0] * (n - 1)
WBC = [0] * (n - 1)
Numb = [0] * (n - 1)
for i in range(1, n):
    Numb[i - 1] = worksheet.cell_value(i, 0)
    days[i - 1] = worksheet.cell_value(i, 1)
    WBC[i - 1] = worksheet.cell_value(i, 2)
S = int(len(WBC) / 30)
WBCData = np.mat(WBC).reshape(S, 30)
WBCData = np.array(WBCData)
# print(WBCData)


# 2 Similarity distances
# DTW distance; time complexity is the product of the two sequence lengths
def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

# Windowed DTW: only check values within a window of width w to reduce cost
def DTWDistance(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])

# Another way to speed things up is to use the LB_Keogh lower bound of DTW
def LB_Keogh(s1, s2, r):
    LB_sum = 0
    for ind, i in enumerate(s1):
        lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        if i >= upper_bound:
            LB_sum = LB_sum + (i - upper_bound) ** 2
        elif i < lower_bound:
            LB_sum = LB_sum + (i - lower_bound) ** 2
    return math.sqrt(LB_sum)


# 3 K-means clustering; num_clust is the number of clusters
def k_means_clust(data, num_clust, num_iter, w=3):
    # Step 1: initialize centroids with randomly sampled series
    centroids = random.sample(list(data), num_clust)
    for n in range(num_iter):
        assignments = {}  # cluster index -> list of sample indices
        for ind, i in enumerate(data):
            min_dist = float('inf')
            closest_clust = None
            # Step 2: find the nearest centroid, pruning with LB_Keogh
            for c_ind, j in enumerate(centroids):
                if LB_Keogh(i, j, 3) < min_dist:
                    cur_dist = DTWDistance(i, j, w)
                    if cur_dist < min_dist:
                        min_dist = cur_dist
                        closest_clust = c_ind
            # Step 3: record the cluster sample ind belongs to
            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = [ind]
        # Step 4: recalculate each cluster's centroid as the mean series
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / len(assignments[key]) for m in clust_sum]
    return centroids, assignments


# 4 Cluster and plot
num_clust = 2   # number of clusters to produce
centroids, assignments = k_means_clust(WBCData, num_clust, 800, 3)
for i in range(num_clust):
    s = []
    WBC01 = []
    days01 = []
    for j, indj in enumerate(assignments[i]):   # collect the points of each cluster
        s.append(int(Numb[indj * 30]))
        WBC01 = np.hstack((WBC01, WBC[30 * indj:30 * indj + 30]))
        days01 = np.hstack((days01, days[0:30]))
    print(s)
    plt.title('%s' % s)
    plt.plot(centroids[i], c="r", lw=4)
    plt.scatter(days01, WBC01)
    plt.show()
```