Introductory Research on Machine Learning (18): Model Evaluation for Clustering

Table of Contents

Silhouette coefficient

API in sklearn


In Introductory Research on Machine Learning (17): Instacart Market User Classification, we used KMeans from sklearn to divide users into three categories. How do we evaluate whether such a model is good or bad?

Silhouette coefficient

In the figure above, the clustering divides the points into 2 categories. The goal of clustering is to make the distances inside each cluster as small as possible and the distances between clusters as large as possible. The silhouette coefficient measures how well this is achieved. For a single sample $i$ it is defined as:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

Here $b_i$ is the smallest, taken over all other clusters, of the average distance from sample $i$ to the samples of that cluster. It can be regarded as the outer distance.

$a_i$ is the average distance from sample $i$ to the other samples in its own cluster. It can be regarded as the inner distance.

When $b_i \gg a_i$, the outer distance far exceeds the inner distance and $s_i$ is close to 1. When $a_i \gg b_i$, the inner distance far exceeds the outer distance and $s_i$ is close to -1. The silhouette coefficient therefore lies in the range [-1, 1]: the closer to 1, the better the clustering; the closer to -1, the worse. The silhouette coefficient of the whole clustering is the mean of $s_i$ over all samples.
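To make the definitions of the inner and outer distance concrete, here is a minimal sketch that computes the per-sample coefficient by hand with NumPy and compares it with sklearn's `silhouette_samples`. The helper name `silhouette_by_hand` and the toy data are made up for illustration, and Euclidean distance is assumed.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_by_hand(X, labels):
    """Compute (b_i - a_i) / max(a_i, b_i) for every sample (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # A cluster with a single sample gets a coefficient of 0.
            continue
        # a_i: mean distance to the other members of the sample's own cluster
        # (the distance of the sample to itself is 0, so only the divisor changes).
        a_i = dist[i, same].sum() / (n_same - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(dist[i, labels == c].mean()
                  for c in np.unique(labels) if c != labels[i])
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores


# Toy data (made up for illustration): two tight groups on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

manual = silhouette_by_hand(X_toy, labels_toy)
print(manual)
print(np.allclose(manual, silhouette_samples(X_toy, labels_toy)))
```

For the first sample, the inner distance is 1 and the outer distance is (10 + 11) / 2 = 10.5, so its coefficient is 9.5 / 10.5, roughly 0.905, which is close to 1 as expected for well-separated groups.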

API in sklearn

sklearn provides a ready-made API for computing the silhouette coefficient. Its signature is:

 sklearn.metrics.silhouette_score(X, labels, *, metric='euclidean', sample_size=None,
                     random_state=None, **kwds)

The parameters are described as follows:

Parameter      Meaning
X              The feature values of the samples
labels         The cluster label predicted for each sample
metric         The method used to calculate the distance between two samples; the default is 'euclidean'. If it is a string, it must be one of the metrics accepted by metrics.pairwise.pairwise_distances ('cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', or 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'). It may also be a callable that computes the distance between two samples.
sample_size    The number of samples to draw at random when computing the coefficient; if None, all samples are used
random_state   Controls the random sampling; used only when sample_size is not None
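The following sketch shows how the parameters above might be used. The data is a synthetic stand-in (two Gaussian groups generated with NumPy), and the names `X_demo`, `labels_demo`, and `manhattan_callable` are assumptions made for this example only.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: two well-separated Gaussian groups of 100 samples each.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(8.0, 1.0, size=(100, 2))])
labels_demo = np.repeat([0, 1], 100)

# Default settings: Euclidean distance, all samples used.
score_euclidean = silhouette_score(X_demo, labels_demo)

# metric as a string: any name accepted by pairwise_distances.
score_manhattan = silhouette_score(X_demo, labels_demo, metric='manhattan')


# metric as a callable: receives two samples and returns their distance.
def manhattan_callable(u, v):
    return np.abs(u - v).sum()


score_callable = silhouette_score(X_demo, labels_demo, metric=manhattan_callable)

# sample_size: evaluate on a random subset of 50 samples;
# random_state makes the subset, and therefore the score, reproducible.
score_sampled_1 = silhouette_score(X_demo, labels_demo,
                                   sample_size=50, random_state=42)
score_sampled_2 = silhouette_score(X_demo, labels_demo,
                                   sample_size=50, random_state=42)

print(score_euclidean, score_manhattan, score_callable)
print(score_sampled_1 == score_sampled_2)
```

A callable metric is evaluated pair by pair in Python, so it is noticeably slower than a built-in metric name. Sampling with sample_size is mainly useful on large data sets, where computing all pairwise distances is expensive.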

We use this API to evaluate the model from Introductory Research on Machine Learning (17): Instacart Market User Classification:

from sklearn.metrics import silhouette_score

# data_new: the feature data; y_predict: the KMeans labels (both from article 17)
score = silhouette_score(data_new, y_predict)

The output score value is:

0.5368333366182597

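This value is positive and above 0.5, in the upper part of the [-1, 1] range, so the clustering is acceptable, although the clusters are not perfectly separated.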
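Because data_new and y_predict come from the previous article and are not reproduced here, the following self-contained sketch repeats the same evaluation on synthetic data. The blob centers, sample count, and variable names are assumptions chosen for illustration. It also scores a random labeling of the same points, to show what a poor clustering looks like on this metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data: three well-separated groups.
X_blobs, _ = make_blobs(n_samples=300,
                        centers=[[0, 0], [6, 6], [-6, 6]],
                        cluster_std=1.0,
                        random_state=0)

# Cluster into three categories, as in the previous article.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X_blobs)
score_kmeans = silhouette_score(X_blobs, y_kmeans)

# Baseline: labels assigned at random, ignoring the geometry of the data.
rng = np.random.default_rng(0)
y_random = rng.integers(0, 3, size=len(X_blobs))
score_random = silhouette_score(X_blobs, y_random)

print("KMeans labels:", score_kmeans)
print("Random labels:", score_random)
```

The KMeans labeling should score much higher than the random labeling, whose score stays near 0 because its inner and outer distances are roughly equal.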

Origin blog.csdn.net/nihaomabmt/article/details/104477492