Machine Learning Notes 1 (Watermelon Book): clustering tasks, performance measures, distance calculation, and prototype clustering

Clustering:

        Clustering attempts to partition the samples in a data set into several disjoint subsets; each such subset is called a "cluster".

 Performance metrics:

        Clustering performance measures are also known as clustering "validity indices". In general, we want the clustering result to have high "intra-cluster similarity" and low "inter-cluster similarity".

        Clustering performance metrics fall roughly into two types: external metrics and internal metrics.

        

External metrics: compare the clustering result against some "reference model".

For a data set $D = \{x_1, x_2, \ldots, x_m\}$, let $C = \{C_1, \ldots, C_k\}$ be the cluster division produced by the clustering algorithm, let $C^* = \{C_1^*, \ldots, C_s^*\}$ be the division given by the reference model, and let $\lambda$ and $\lambda^*$ denote the cluster label vectors corresponding to $C$ and $C^*$ respectively. Considering every sample pair, define

$a = |SS|, \quad SS = \{(x_i, x_j) \mid \lambda_i = \lambda_j,\ \lambda_i^* = \lambda_j^*,\ i < j\}$

$b = |SD|, \quad SD = \{(x_i, x_j) \mid \lambda_i = \lambda_j,\ \lambda_i^* \neq \lambda_j^*,\ i < j\}$

$c = |DS|, \quad DS = \{(x_i, x_j) \mid \lambda_i \neq \lambda_j,\ \lambda_i^* = \lambda_j^*,\ i < j\}$

$d = |DD|, \quad DD = \{(x_i, x_j) \mid \lambda_i \neq \lambda_j,\ \lambda_i^* \neq \lambda_j^*,\ i < j\}$

The commonly used external metrics are:

Jaccard coefficient: $JC = \dfrac{a}{a + b + c}$

Fowlkes-Mallows index: $FMI = \sqrt{\dfrac{a}{a + b} \cdot \dfrac{a}{a + c}}$

Rand index: $RI = \dfrac{2(a + d)}{m(m - 1)}$

     All three metrics take values in the interval [0, 1], and larger values are better.
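To make the pairwise definitions concrete, here is a minimal numpy sketch that counts $a$, $b$, $c$, $d$ and computes the three metrics from two label vectors; the function and argument names are illustrative, and degenerate cases (e.g. $a + b + c = 0$) are not handled:

```python
import numpy as np

def external_metrics(labels, labels_ref):
    """JC, FMI and RI from the label vectors lambda and lambda*."""
    labels, labels_ref = np.asarray(labels), np.asarray(labels_ref)
    m = len(labels)
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):  # each sample pair (i < j) exactly once
            same = labels[i] == labels[j]              # same cluster in C
            same_ref = labels_ref[i] == labels_ref[j]  # same cluster in C*
            if same and same_ref:
                a += 1  # SS
            elif same:
                b += 1  # SD
            elif same_ref:
                c += 1  # DS
            else:
                d += 1  # DD
    jc = a / (a + b + c)
    fmi = np.sqrt((a / (a + b)) * (a / (a + c)))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri
```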

Internal metrics: examine the clustering result directly, without using any reference model.

Consider a cluster division $C = \{C_1, \ldots, C_k\}$ of the clustering result, let $dist(\cdot, \cdot)$ denote the distance between two samples, and let $\mu = \frac{1}{|C|} \sum_{1 \le i \le |C|} x_i$ denote the centroid of cluster $C$. Define:

$avg(C) = \dfrac{2}{|C|(|C| - 1)} \sum_{1 \le i < j \le |C|} dist(x_i, x_j)$ (average intra-cluster distance)

$diam(C) = \max_{1 \le i < j \le |C|} dist(x_i, x_j)$ (maximum intra-cluster distance)

$d_{min}(C_i, C_j) = \min_{x_i \in C_i,\ x_j \in C_j} dist(x_i, x_j)$ (minimum inter-cluster distance)

$d_{cen}(C_i, C_j) = dist(\mu_i, \mu_j)$ (distance between the centroids of $C_i$ and $C_j$)

 From the above four formulas we obtain the commonly used internal metrics:

Davies-Bouldin index: $DBI = \dfrac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \dfrac{avg(C_i) + avg(C_j)}{d_{cen}(\mu_i, \mu_j)} \right)$

Dunn index: $DI = \min_{1 \le i \le k} \left\{ \min_{j \neq i} \left( \dfrac{d_{min}(C_i, C_j)}{\max_{1 \le l \le k} diam(C_l)} \right) \right\}$

 The smaller the DBI value, the better; the larger the DI value, the better.
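Following the four definitions above, here is a minimal numpy sketch of both indices under the Euclidean distance; the cluster representation (a list of index arrays) and the requirement that every cluster hold at least two samples are assumptions of this sketch:

```python
import numpy as np

def dbi_and_di(X, clusters):
    """Davies-Bouldin and Dunn indices; `clusters` is a list of index
    arrays, one per cluster, each with at least two samples."""
    def dist(x, y):                    # Euclidean dist(., .)
        return np.linalg.norm(x - y)

    def avg(C):                        # average intra-cluster distance
        pts, n = X[C], len(C)
        return (2 / (n * (n - 1))) * sum(
            dist(pts[i], pts[j]) for i in range(n) for j in range(i + 1, n))

    def diam(C):                       # maximum intra-cluster distance
        pts = X[C]
        return max(dist(pts[i], pts[j])
                   for i in range(len(C)) for j in range(i + 1, len(C)))

    def d_min(Ci, Cj):                 # minimum inter-cluster distance
        return min(dist(p, q) for p in X[Ci] for q in X[Cj])

    k = len(clusters)
    mu = [X[C].mean(axis=0) for C in clusters]  # cluster centroids

    dbi = (1 / k) * sum(
        max((avg(clusters[i]) + avg(clusters[j])) / dist(mu[i], mu[j])
            for j in range(k) if j != i)
        for i in range(k))

    max_diam = max(diam(C) for C in clusters)
    di = min(d_min(clusters[i], clusters[j]) / max_diam
             for i in range(k) for j in range(k) if j != i)
    return dbi, di
```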

Distance calculation:

        For a function $dist(\cdot, \cdot)$ to be a "distance measure", it must satisfy the following basic properties:

                Non-negativity: $dist(x_i, x_j) \ge 0$;

                Identity: $dist(x_i, x_j) = 0$ if and only if $x_i = x_j$;

                Symmetry: $dist(x_i, x_j) = dist(x_j, x_i)$;

                Triangle inequality: $dist(x_i, x_j) \le dist(x_i, x_k) + dist(x_k, x_j)$.

        The most commonly used is the "Minkowski distance":

$dist_{mk}(x_i, x_j) = \left( \sum_{u=1}^{n} |x_{iu} - x_{ju}|^p \right)^{1/p}$

        When $p = 2$, the Minkowski distance is the Euclidean distance:

$dist_{ed}(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{u=1}^{n} |x_{iu} - x_{ju}|^2}$

        When $p = 1$, the Minkowski distance is the Manhattan distance:

$dist_{man}(x_i, x_j) = \|x_i - x_j\|_1 = \sum_{u=1}^{n} |x_{iu} - x_{ju}|$
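All three cases reduce to a single numpy expression parameterized by $p$; a minimal sketch with an illustrative function name:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```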

 Prototype clustering:

        Prototype clustering, also known as "prototype-based clustering", assumes that the clustering structure can be characterized by a set of prototypes.

        K-means algorithm

            Given the cluster division $C = \{C_1, \ldots, C_k\}$ obtained by clustering, K-means minimizes the squared error

$E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|_2^2$, where $\mu_i = \dfrac{1}{|C_i|} \sum_{x \in C_i} x$

        is the mean vector of cluster $C_i$. To some extent, $E$ characterizes how closely the samples in a cluster gather around the cluster mean vector: the smaller the value of $E$, the higher the intra-cluster similarity.
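As a quick check of this objective, here is a small numpy helper that evaluates $E$ for a given assignment; the helper name and argument layout are assumptions for illustration:

```python
import numpy as np

def squared_error(X, assign, mu):
    """E = sum over clusters of squared distances to the cluster mean."""
    return sum(np.sum((X[assign == i] - mu[i]) ** 2) for i in range(len(mu)))
```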

        Minimizing $E$ exactly is NP-hard, so the K-means algorithm adopts a greedy strategy, alternating an assignment step and a mean-update step, as sketched below:
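A minimal Python sketch of this iteration, assuming numpy arrays; the initialization (k distinct random samples) and the convergence test are illustrative choices rather than the book's exact pseudocode:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means: random init, assign, update (names illustrative)."""
    rng = np.random.default_rng(seed)
    # Initialize the k mean vectors with k distinct random samples.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster of its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each mean vector from its cluster members.
        new_mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                           else mu[i] for i in range(k)])
        if np.allclose(new_mu, mu):  # stop when the means no longer move
            break
        mu = new_mu
    return mu, assign
```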

 Learning Vector Quantization

        Learning Vector Quantization (LVQ) also tries to find a set of prototype vectors to characterize the clustering structure, but it assumes the samples carry class labels and uses these labels to assist learning: for a randomly drawn sample $x$, the nearest prototype $p$ is updated to $p' = p + \eta (x - p)$ if the labels of $p$ and $x$ match, and to $p' = p - \eta (x - p)$ otherwise, where $\eta \in (0, 1)$ is the learning rate.
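A minimal sketch of this label-assisted update, assuming numpy arrays for the data X and labels y; the initialization scheme, eta, and n_iter are illustrative choices:

```python
import numpy as np

def lvq(X, y, prototype_labels, eta=0.1, n_iter=400, seed=0):
    """Minimal LVQ: pull the nearest prototype toward a same-label sample,
    push it away from a different-label sample."""
    rng = np.random.default_rng(seed)
    # Initialize each prototype with a random sample of its own class.
    P = np.array([X[rng.choice(np.flatnonzero(y == t))]
                  for t in prototype_labels], dtype=float)
    for _ in range(n_iter):
        i = rng.integers(len(X))                       # draw a random sample
        j = np.linalg.norm(P - X[i], axis=1).argmin()  # nearest prototype
        if prototype_labels[j] == y[i]:
            P[j] += eta * (X[i] - P[j])  # same label: move closer
        else:
            P[j] -= eta * (X[i] - P[j])  # different label: move away
    return P
```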
