Clustering:
Attempts to divide the samples in the data set into several disjoint subsets, each subset is called a "cluster".
Performance metrics:
Clustering performance measures are also known as clustering "effectiveness indicators". Generally, it is hoped that the "intra-cluster similarity" of the clustering result is high and the "inter-cluster similarity" is low.
There are roughly two types of clustering performance metrics, external metrics and internal metrics .
External metrics : Compare the clustering results to some "reference model".
x is the dataset data, C is the result of clustering division, C* is the cluster division result given by the reference model, and * represent the cluster label vectors corresponding to C and C* respectively. Therefore, the commonly used clustering performance measurement external indicators are:
The results of the above indicators are all in the [0, 1] interval, and the larger the value, the better.
Internal metrics : directly examine the clustering results without utilizing any reference model.
Consider the cluster division of clustering results , used to calculate the distance between two samples, representing the center point of cluster C.
From the above four formulas, we can deduce the internal indicators of commonly used clustering performance metrics :
The smaller the value of DBI, the better, and the larger the value of DI, the better.
Distance calculation :
For a function , if it is a "distance measure", it satisfies the following basic properties:
Commonly used " Minkowski distance ":
When P=2, the Minkowski distance is the Euclidean distance :
When P=1, the Minkowski distance is the Harmanton distance :
Prototype clustering :
Prototype clustering is also known as "prototype-based clustering".
K-means algorithm
According to the cluster division C obtained by clustering, the square error is minimized , where
is the mean vector of cluster C, which to a certain extent describes the closeness of the samples in the cluster around the cluster mean vector, and the smaller the E value
The higher the similarity of the samples in the cluster is.
The K-means algorithm adopts a greedy strategy, and the pseudo code of the algorithm is as follows:
Learning Vector Quantization
Learning vector quantization is also trying to find a set of prototypes to describe the clustering structure, but in the learning process, it will use its own category labels to assist clustering.