Machine Learning Day 18: Clustering Algorithm Evaluation

Clustering algorithm evaluation

Assuming that there is no external label data, how do we evaluate the pros and cons of different clustering algorithms?

Unsupervised learning usually has no annotated data, so the design of the model and the algorithm directly affects the final output and the performance of the model. To evaluate different clustering algorithms, we can start from what a cluster is. Common definitions include:

  • A data cluster defined by a center. This type of cluster tends to be distributed in a spherical shape. The center is often defined as the centroid, which is the average of all points in the cluster. Each point in the cluster is closer to its own center than to the centers of other clusters.

  • A data cluster defined by density. This type of cluster has a density that is significantly different from the surrounding data, and may be either denser or sparser. When clusters are irregular or coiled around each other, and noise and outliers are present, the density-based definition is generally used.

  • A data cluster defined by connectivity. The data points in this type of cluster are connected to one another, so the whole cluster appears as a graph structure. This definition is effective for irregularly shaped or intertwined clusters.

  • A data cluster defined by a concept. All data points in this type of data set have a certain common property.

Each situation requires different evaluation methods. For example, K-means clustering can be evaluated using the sum of squared errors.
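
As a minimal sketch of the sum of squared errors (SSE) mentioned above, the following pure-Python code runs a small K-means and reports the SSE for several values of K. The helper names, the toy data, and the deterministic farthest-point seeding are assumptions made for illustration; they are not taken from the original post.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_sse(points, k, iterations=20):
    """Run plain Lloyd iterations and return the final sum of squared errors."""
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)), key=lambda i: squared_distance(p, centers[i])
            )
            clusters[nearest].append(p)
        # Update step: move each center to the centroid of its cluster.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    # SSE: squared distance of every point to its nearest final center.
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (hypothetical): three tight, well-separated groups of five points.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

On data with real cluster structure like this, the SSE should drop sharply once K reaches the true number of groups and change little afterwards. A smaller SSE at the same K indicates more compact clusters.
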
The purpose of cluster evaluation is to estimate the feasibility of clustering on the data set and the quality of the results produced by the clustering method. This process is divided into three sub-tasks.


  1. Estimate the clustering tendency
    This step detects whether there is a non-random cluster structure in the data distribution. If the data is essentially random, the clustering result is meaningless. One way to check is to increase the number of clusters K and observe the clustering error. If the data is basically random, that is, there is no suitable cluster structure, the clustering error will not change sharply as K increases, and no suitable K corresponding to the actual number of clusters can be found.

  2. Determine the number of data clusters
    After determining the clustering tendency, we need to find the number of clusters that best matches the real data distribution, and then judge the quality of the clustering results on that basis.

  3. Determine the clustering quality
    Given a preset number of clusters, different clustering algorithms will output different results, so we need to determine the quality of each clustering result. The following indicators are generally used.

  • Silhouette coefficient (sometimes translated as "contour coefficient"). Given a point p, the silhouette coefficient of this point is defined as

    $$s(p) = \frac{b(p) - a(p)}{\max\{a(p),\, b(p)\}}$$

    where a(p) is the average distance between point p and the other points in the same cluster, and b(p) is the minimum, taken over all other clusters, of the average distance between point p and the points in that cluster. a(p) reflects how compact the cluster of p is, and b(p) reflects how well separated it is from the adjacent clusters. The larger b(p) and the smaller a(p), the better the corresponding clustering quality. Therefore, we average the silhouette coefficients s(p) over all points to measure the quality of the clustering result.

  • Root-mean-square standard deviation (RMSSTD), used to measure the compactness of the clustering results, is defined as follows

    $$\mathrm{RMSSTD} = \sqrt{\frac{\sum_{i}\sum_{x \in C_i} \lVert x - c_i \rVert^2}{P \sum_{i} (n_i - 1)}}$$

    where $C_i$ represents the i-th cluster, $c_i$ is the center of that cluster, $x \in C_i$ represents a sample point belonging to the i-th cluster, $n_i$ is the number of samples in the i-th cluster, and P is the vector dimension of the sample points. RMSSTD can be regarded as a normalized standard deviation. Since $\sum_{i} (n_i - 1) = n - NC$, where n is the total number of points and NC is the number of clusters, and usually $NC \ll n$, the sum $\sum_{i} (n_i - 1)$ is a number close to the total number of points and can be regarded as a constant.

  • R-squared (details omitted here).

  • Modified Hubert Γ statistic (details omitted here).
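
To make the first two indicators concrete, here is a small pure-Python sketch that computes the mean silhouette (contour) coefficient and RMSSTD for a given partition. It assumes Euclidean distance and the usual RMSSTD definition (within-cluster squared error divided by P times the sum of n_i − 1, then square-rooted). The function names and the four-point toy data are hypothetical and chosen only for illustration.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(clusters):
    """Average of s(p) = (b(p) - a(p)) / max(a(p), b(p)) over all points.

    `clusters` is a list of clusters, each a list of points. This sketch
    assumes at least two clusters and at least two points per cluster.
    """
    scores = []
    for i, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            # a(p): mean distance to the other points of the same cluster.
            same = cluster[:j] + cluster[j + 1:]
            a = sum(math.sqrt(squared_distance(p, q)) for q in same) / len(same)
            # b(p): smallest mean distance to the points of any other cluster.
            b = min(
                sum(math.sqrt(squared_distance(p, q)) for q in other) / len(other)
                for k, other in enumerate(clusters)
                if k != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def rmsstd(clusters):
    """Root-mean-square standard deviation of a partition (lower is tighter)."""
    dim = len(clusters[0][0])  # P: dimension of each sample point
    squared_error = 0.0
    degrees = 0  # sum over clusters of (n_i - 1)
    for cluster in clusters:
        center = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
        squared_error += sum(squared_distance(x, center) for x in cluster)
        degrees += len(cluster) - 1
    return math.sqrt(squared_error / (dim * degrees))


# Two partitions of the same four points: one natural, one deliberately poor.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

print(mean_silhouette(good), rmsstd(good))
print(mean_silhouette(bad), rmsstd(bad))
```

The natural partition should score a mean silhouette close to 1 with a small RMSSTD, while the poor partition gets a negative silhouette and a much larger RMSSTD. This matches the reading that larger b(p) and smaller a(p) indicate better clustering.
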


Origin blog.51cto.com/15069488/2578573