Most of the clustering algorithms we learn are designed for n-dimensional data; methods aimed specifically at one-dimensional data are discussed far less often. Today, let's look at how to cluster one-dimensional data.
Option 1: Use K-Means to cluster one-dimensional data
The Python code is as follows:
```python
from sklearn.cluster import KMeans
import numpy as np

x = np.random.random(10000)
y = x.reshape(-1, 1)
km = KMeans()
km.fit(y)
```
The core operation is `y = x.reshape(-1, 1)`, which reshapes the one-dimensional array into a two-dimensional array with exactly one column; the -1 tells NumPy to infer the number of rows from the array's length.
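As a quick check of what KMeans actually returns on such reshaped input, here is a minimal sketch (variable names and the choice of k = 3 are illustrative, not from the original example):

```python
from sklearn.cluster import KMeans
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)

# KMeans expects 2-D input, so reshape the data to a single column
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x.reshape(-1, 1))

# One label per original value; cluster_centers_ comes back as a (k, 1) array
labels = km.labels_
centers = np.sort(km.cluster_centers_.ravel())
print(centers)
```

For one-dimensional data, sorting the centers makes the clusters read naturally as contiguous intervals on the number line.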
Option 2: Use the one-dimensional clustering method Jenks Natural Breaks
Jenks Natural Breaks groups similar values together and separates classes where the differences are largest. Statistically, the quality of a grouping can be measured with variance: compute the variance within each class, sum these variances, and use that sum to compare candidate classifications. The classification with the smallest sum of within-class variances is the optimal one (though it is not necessarily unique). This is the principle behind natural breaks classification. Moreover, when you look at the distribution of the data you can often see the breaks directly, and these visible breaks coincide with the ones the Jenks Natural Breaks method computes; that is why the classification is called "natural".
On one-dimensional data, Jenks Natural Breaks and K-Means are completely equivalent: their objective functions are the same, although the algorithms proceed differently. K-Means starts from k random initial points, while Jenks Breaks moves the class boundaries point by point until the sum of within-class variances reaches its minimum.
There are two types of Natural Breaks algorithms:
- The Jenks-Caspall algorithm (1971), invented by Jenks and Caspall, applies the principle above directly: evaluate every possible classification and pick the one with the smallest sum of variances. The amount of computation is enormous: splitting n numbers into k classes means choosing k-1 break positions out of n-1 candidates, and this number of combinations explodes as n grows. With large data sets and many classes, the computers of the time simply could not exhaust all possibilities.
- The Fisher-Jenks algorithm (1977). Fisher (1958) devised an algorithm that avoids exhaustive enumeration and greatly improves computational efficiency; Jenks later introduced it to data classification. Ironically, later users almost only know of Jenks, not Fisher.
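For small inputs, the exhaustive search that Jenks-Caspall describes can be sketched directly. This is a hypothetical pure-Python helper for illustration only (it is not how jenkspy implements the Fisher-Jenks algorithm): it tries every combination of break positions and keeps the partition with the smallest sum of squared deviations about the class means.

```python
from itertools import combinations

def natural_breaks_exhaustive(values, k):
    """Brute-force natural breaks: try every way of splitting the sorted
    values into k contiguous classes and keep the split with the smallest
    sum of squared deviations about the class means (SDCM)."""
    data = sorted(values)
    n = len(data)

    def sdcm(classes):
        total = 0.0
        for c in classes:
            mean = sum(c) / len(c)
            total += sum((v - mean) ** 2 for v in c)
        return total

    best, best_score = None, float("inf")
    # Choose k-1 cut positions out of the n-1 gaps between sorted values
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        classes = [data[bounds[i]:bounds[i + 1]] for i in range(k)]
        score = sdcm(classes)
        if score < best_score:
            best_score, best = score, classes
    return best

print(natural_breaks_exhaustive([1, 2, 3, 10, 11, 12, 50], 3))
# → [[1, 2, 3], [10, 11, 12], [50]]
```

The combinatorial loop over cut positions is exactly why this approach becomes infeasible for large n, which motivated Fisher's dynamic-programming improvement.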
Specific algorithm implementation:
- Jenks-Caspall algorithm: https://github.com/domlysz/Jenks-Caspall.py
- Fisher-Jenks algorithm: https://github.com/mthh/jenkspy
Like K-Means, using Jenks Natural Breaks requires first determining the number of classes k. A common criterion is GVF (The Goodness of Variance Fit), or "variance goodness of fit", defined as follows:

GVF = (SDAM - SDCM) / SDAM
Here, SDAM is the Sum of squared Deviations from the Array Mean, i.e. the total variance of the original data, and SDCM is the Sum of squared Deviations about the Class Means, i.e. the sum of the within-class variances. SDAM is a constant for a given data set, while SDCM depends on the number of classes k. Within a certain range, the larger the GVF, the better the classification: the smaller SDCM, the larger GVF and the closer it is to 1. SDCM decreases as k increases, and when k equals n, SDCM = 0 and GVF = 1.
GVF can be used to compare the classification quality for different numbers of classes. Plotting GVF against k gives an elbow-shaped curve: as k increases, the curve becomes flatter and flatter. At the elbow (here k = 5) the curve basically levels off (larger gains before it, smaller gains after), and k = 5 is not excessively large, so the data can be divided into 5 classes. As a rule of thumb, GVF > 0.7 is acceptable; higher is better, but k must not be allowed to grow too large. This is clearly an empirical rule, but it is better than nothing.
Code example:

```python
from jenkspy import jenks_breaks
import numpy as np


def goodness_of_variance_fit(array, classes):
    # get the break points
    classes = jenks_breaks(array, classes)
    # do the actual classification
    classified = np.array([classify(i, classes) for i in array])
    # max value of zones
    maxz = max(classified)
    # nested list of zone indices
    zone_indices = [[idx for idx, val in enumerate(classified) if zone + 1 == val]
                    for zone in range(maxz)]
    # sum of squared deviations from array mean (SDAM)
    sdam = np.sum((array - array.mean()) ** 2)
    # sorted polygon stats
    array_sort = [np.array([array[index] for index in zone]) for zone in zone_indices]
    # sum of squared deviations about class means (SDCM)
    sdcm = sum([np.sum((cls - cls.mean()) ** 2) for cls in array_sort])
    # goodness of variance fit
    gvf = (sdam - sdcm) / sdam
    return gvf


def classify(value, breaks):
    for i in range(1, len(breaks)):
        if value < breaks[i]:
            return i
    return len(breaks) - 1


if __name__ == '__main__':
    gvf = 0.0
    nclasses = 2
    array = np.random.random(10000)
    while gvf < .8:
        gvf = goodness_of_variance_fit(array, nclasses)
        print(nclasses, gvf)
        nclasses += 1
```
Reference links:
- https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization
- https://macwright.org/2013/02/18/literate-jenks.html
Option 3: Kernel Density Estimation
Kernel density estimation fits a smooth peaked function (the "kernel") to the observed data points in order to approximate the true probability density curve. For more details on kernel density estimation, please refer to the explanation in the earlier article on Mean Shift clustering.
Usage example:

```python
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = np.linspace(0, 50)
e = kde.score_samples(s.reshape(-1, 1))
plt.plot(s, e)
plt.show()

# indices of the local minima and maxima of the estimated density
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
# split the data at the density minima
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

plt.plot(s[:mi[0] + 1], e[:mi[0] + 1], 'r',
         s[mi[0]:mi[1] + 1], e[mi[0]:mi[1] + 1], 'g',
         s[mi[1]:], e[mi[1]:], 'b',
         s[ma], e[ma], 'go',
         s[mi], e[mi], 'ro')
plt.show()
```
Output content:

```
Minima: [17.34693878 33.67346939]
Maxima: [10.20408163 21.42857143 44.89795918]
[10 11 9 11 11 12] [23 21 20] [45]
```
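Once the density minima are known, they can serve as split points for assigning every value to a cluster in one step. A small sketch using `np.digitize` (the split values below are the minima printed above, hard-coded for illustration):

```python
import numpy as np

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12])

# Density minima found by the KDE step act as the boundaries between clusters
splits = [17.34693878, 33.67346939]

# np.digitize returns, for each value, the index of the interval it falls into,
# so the values below the first minimum get label 0, and so on
labels = np.digitize(a, splits)
for label in range(len(splits) + 1):
    print(label, a[labels == label])
```

This scales naturally to any number of minima, whereas the manual boolean masks above have to be rewritten for each new boundary.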
Reference links: