[Deep Learning] Clustering of one-dimensional arrays

Most of the clustering algorithms one encounters when learning about clustering are designed for n-dimensional data; methods aimed specifically at one-dimensional data receive far less attention. This post looks at several ways to cluster one-dimensional data.

Option 1: Use K-Means to cluster one-dimensional data

The Python code is as follows:

  from sklearn.cluster import KMeans
  import numpy as np

  x = np.random.random(10000)
  y = x.reshape(-1, 1)   # K-Means expects a 2-D array of shape (n_samples, n_features)
  km = KMeans()          # n_clusters defaults to 8; set it explicitly for your data
  km.fit(y)

The core operation is y = x.reshape(-1, 1), which turns the one-dimensional array into a single column with an automatically determined number of rows (-1 tells NumPy to infer that dimension from the length of the array), so that each value becomes a sample with one feature.
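
After fitting, the cluster assignment of each value and the one-dimensional cluster centers can be read back from the estimator. A minimal sketch using scikit-learn's standard labels_ and cluster_centers_ attributes (with n_clusters set explicitly rather than left at its default of 8):

  from sklearn.cluster import KMeans
  import numpy as np

  x = np.random.random(10000)
  km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x.reshape(-1, 1))

  labels = km.labels_                    # cluster index assigned to each value
  centers = km.cluster_centers_.ravel()  # the one-dimensional cluster centers
  print(sorted(centers))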

Option 2: Use the one-dimensional clustering method Jenks Natural Breaks

Jenks Natural Breaks is a classification method. The general principle is to group similar values together and split the data into several classes. Classification quality can be measured statistically with variance: compute the variance within each class, sum these variances, and use that sum to compare different classifications; the classification with the smallest sum of within-class variances is the optimal one (although it is not necessarily unique). This is the principle behind natural-breaks classification. Moreover, when you look at the distribution of the data you can often see obvious gaps, and these gaps coincide with the break points computed by the Jenks Natural Breaks method, which is why the classification is called "natural".

For one-dimensional data, Jenks Natural Breaks and K-Means are completely equivalent: their objective functions are the same, but the steps of the algorithms differ. K-Means starts from K random initial points, whereas Jenks Breaks moves the break points one by one in a traversal until the objective reaches its minimum. The sketch below compares the two on the same array.
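
A minimal sketch illustrating this equivalence on synthetic data (assuming the jenkspy package is installed; for one-dimensional K-Means the decision boundaries are the midpoints between adjacent sorted centers, so they can be compared directly with the interior Jenks break points):

  import numpy as np
  from sklearn.cluster import KMeans
  from jenkspy import jenks_breaks

  # three well-separated groups of one-dimensional values
  data = np.random.normal(loc=[0, 5, 10], scale=0.5, size=(1000, 3)).ravel()
  k = 3

  # Jenks returns k + 1 break values: the minimum, k - 1 interior breaks, the maximum
  breaks = jenks_breaks(data, k)

  # K-Means on the same data; its boundaries are midpoints between sorted centers
  centers = np.sort(KMeans(n_clusters=k, n_init=10, random_state=0)
                    .fit(data.reshape(-1, 1)).cluster_centers_.ravel())
  boundaries = (centers[:-1] + centers[1:]) / 2

  print("Jenks interior breaks:", breaks[1:-1])
  print("K-Means boundaries:   ", boundaries)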

There are two types of Natural Breaks algorithms:

  • Jenks-Caspall algorithm (1971), invented by Jenks and Caspall. The principle is as described above: every possible classification is evaluated and the one with the smallest sum of within-class variances is kept, which requires an enormous amount of computation. Splitting n numbers into k classes means choosing k-1 break positions from n-1 candidates, and this number of combinations explodes quickly; when the data set is large and the number of classes is high, the computers of the time simply could not enumerate every possibility.
  • Fisher-Jenks algorithm (1977). Fisher (1958) devised an algorithm that avoids exhaustive enumeration and greatly improves computational efficiency, and Jenks later introduced it to data classification. Later users, however, tend to know only of Jenks and not of Fisher.

Specific algorithm implementation:

Like K-Means, Jenks Natural Breaks requires the number of classes K to be chosen in advance. A common way to choose it is the GVF (Goodness of Variance Fit), which can be translated as "variance goodness of fit". The formula is as follows:

GVF = (SDAM - SDCM) / SDAM

Here SDAM is the Sum of squared Deviations from the Array Mean, i.e. the total sum of squared deviations of the original data, and SDCM is the Sum of squared Deviations about Class Mean, i.e. the sum of the within-class sums of squared deviations. SDAM is a constant for a given data set, while SDCM depends on the number of classes k. Within a certain range, the larger the GVF, the better the classification: the smaller SDCM is, the larger GVF is and the closer it is to 1. SDCM decreases as k increases, and when k equals n, SDCM = 0 and GVF = 1.
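
To make the definitions concrete, here is a small hand-worked example in plain NumPy (the two-class split is chosen by eye, purely for illustration):

  import numpy as np

  data = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

  # SDAM: sum of squared deviations from the overall mean
  sdam = np.sum((data - data.mean()) ** 2)

  # SDCM for the split {1, 2, 3} / {10, 11, 12}
  classes = [data[:3], data[3:]]
  sdcm = sum(np.sum((c - c.mean()) ** 2) for c in classes)

  gvf = (sdam - sdcm) / sdam
  print(sdam, sdcm, gvf)   # 125.5, 4.0, ~0.97 -- a very good two-class fit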

GVF can be used to compare the classification quality obtained with different numbers of classes. Plotting GVF against k gives:

Figure: GVF plotted against the number of classes k (red line at k = 5).

As k increases, the GVF curve flattens out. At the red line (k = 5) the curve becomes essentially flat (larger changes before it, smaller changes after it), and k = 5 is not too large, so the data can be divided into 5 classes. As a rule of thumb, GVF > 0.7 is acceptable; higher is better, but k should not be allowed to grow too large. This is clearly an empirical criterion, but it is better than nothing.

Code example:

  from jenkspy import jenks_breaks
  import numpy as np


  def goodness_of_variance_fit(array, classes):
      # get the break points
      classes = jenks_breaks(array, classes)

      # do the actual classification
      classified = np.array([classify(i, classes) for i in array])

      # max value of zones
      maxz = max(classified)

      # nested list of zone indices
      zone_indices = [[idx for idx, val in enumerate(classified) if zone + 1 == val] for zone in range(maxz)]

      # sum of squared deviations from array mean (SDAM)
      sdam = np.sum((array - array.mean()) ** 2)

      # values grouped by zone
      array_sort = [np.array([array[index] for index in zone]) for zone in zone_indices]

      # sum of squared deviations about class means (SDCM)
      sdcm = sum([np.sum((zone - zone.mean()) ** 2) for zone in array_sort])

      # goodness of variance fit
      gvf = (sdam - sdcm) / sdam

      return gvf


  def classify(value, breaks):
      # return the 1-based class index of value according to the break points
      for i in range(1, len(breaks)):
          if value < breaks[i]:
              return i
      return len(breaks) - 1


  if __name__ == '__main__':
      gvf = 0.0
      nclasses = 2
      array = np.random.random(10000)
      while gvf < .8:
          gvf = goodness_of_variance_fit(array, nclasses)
          print(nclasses, gvf)
          nclasses += 1
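
Once the loop finds a satisfactory GVF, the final break points and class labels can be produced with the same helpers. A usage sketch that would follow the while loop inside the main block (note that nclasses has already been incremented one step past the accepted value):

      # breaks for the accepted number of classes
      breaks = jenks_breaks(array, nclasses - 1)
      # class index (1..k) for every value
      labels = np.array([classify(v, breaks) for v in array])
      print(breaks)
      print(np.bincount(labels)[1:])   # how many values fall into each class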


Option 3: Kernel Density Estimation

Kernel density estimation fits a smooth peaked function (the "kernel") to the observed data points in order to approximate the true probability density curve. For more details on kernel density estimation, see the notes in the earlier post on Mean Shift clustering.

Usage example:

  import numpy as np
  from scipy.signal import argrelextrema
  import matplotlib.pyplot as plt
  from sklearn.neighbors import KernelDensity

  a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
  kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
  s = np.linspace(0, 50)
  e = kde.score_samples(s.reshape(-1, 1))   # log-density evaluated on the grid s
  plt.plot(s, e)
  plt.show()

  # indices of the local minima / maxima of the estimated density
  mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
  print("Minima:", s[mi])
  print("Maxima:", s[ma])
  # split the original values at the density minima
  print(a[a < s[mi[0]]], a[(a >= s[mi[0]]) * (a <= s[mi[1]])], a[a >= s[mi[1]]])

  plt.plot(s[:mi[0] + 1], e[:mi[0] + 1], 'r',
           s[mi[0]:mi[1] + 1], e[mi[0]:mi[1] + 1], 'g',
           s[mi[1]:], e[mi[1]:], 'b',
           s[ma], e[ma], 'go',
           s[mi], e[mi], 'ro')
  plt.show()

Output content:

  Minima: [17.34693878 33.67346939]
  Maxima: [10.20408163 21.42857143 44.89795918]
  [10 11  9 11 11 12] [23 21 20] [45]

Figure: the estimated log-density curve returned by score_samples.

Figure: the same curve split into three colored segments at the density minima, with the maxima (green dots) and minima (red dots) marked.
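
Once the minima are known, the same grouping can be written more compactly by treating them as bin edges. A small follow-up sketch (using NumPy's digitize on the variables a, s and mi from the example above; not part of the original code):

  labels = np.digitize(a.ravel(), bins=s[mi])   # 0, 1, 2 = cluster index per value
  print(labels)                                 # [0 0 0 1 1 0 2 1 0 0]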



Source: blog.csdn.net/qq_15719613/article/details/134868257