Chapter 6. Clustering algorithm K-means

1. k-means: unsupervised classification

Clustering and classification:

Clustering is an unsupervised learning method whose goal is to divide data samples into different groups so that samples within the same group are similar to each other, while samples between different groups are very different. Clustering algorithms assign data points into different clusters based on the similarity or distance between them. Clustering algorithms do not require a priori label or category information, but group them based on the characteristics of the data itself. Clustering can help discover the intrinsic structure, similarities, and patterns of data and is very useful for tasks such as data exploration, segmentation, and preprocessing.

Classification is a supervised learning method that uses existing label or category information to train a model and predict the category to which new unknown data points belong. The goal of a classification task is to learn a model that can map input data into predefined categories. Classification algorithms require known labels or category information as training data, and learn classification rules or decision boundaries based on these labels. Common classification algorithms include decision trees, support vector machines, and neural networks.

Clustering is often used for tasks such as data exploration, segmentation and preprocessing, which helps to discover the similarity and group structure of data; classification is often used for prediction and discrimination tasks to classify new unknown data.
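The practical difference shows up directly in the sklearn API: a clustering estimator is fit on the features alone, while a classifier must also be given the labels. A minimal sketch (the toy dataset and the choice of DecisionTreeClassifier here are only illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

x_demo, y_demo = make_blobs(n_samples=100, centers=3, random_state=0)  # toy data; labels exist but clustering never sees them
KMeans(n_clusters=3, random_state=0).fit(x_demo)                       # clustering: fit(x), no labels
DecisionTreeClassifier().fit(x_demo, y_demo)                           # classification: fit(x, y), labels required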

1. Clustering algorithm in sklearn

Clustering in sklearn comes in two forms: classes (such as sklearn.cluster.KMeans), which are instantiated, fitted, and then queried for their results, and functions (such as sklearn.cluster.k_means), which take the data and hyperparameters and return the results directly.

  • For k-means:
    Cluster: the samples grouped into the same cluster are treated as one class.
    Centroid: the mean of the coordinates of all samples in a cluster (for two-dimensional data, the mean of the horizontal and vertical coordinates).
    Process: randomly select k samples as the initial centroids; then loop: assign each sample to its nearest centroid, producing k clusters; recompute the centroid of each cluster; clustering is finished when the centroid positions no longer change (a minimal sketch of this loop is given after the inertia discussion below).

  • Measuring the distance from a sample to a centroid (here, Euclidean distance):
    $d(x, \mu) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_i)^2}$
    where n is the number of features (for example, n = 2 for two-dimensional data with features x and y), x is the sample point, and $\mu$ is the centroid.

If k-means uses Euclidean distance, the within-cluster sum of squares of a single cluster with m samples is $CSS = \sum_{j=1}^{m}\sum_{i=1}^{n}(x_{ji} - \mu_i)^2$, and summing this over all k clusters gives the total sum of squares $\sum_{l=1}^{k} CSS_l$, also called inertia. The former is the sum of squares within one cluster; the latter is the total over the whole dataset. K-means seeks the centroids that make the within-cluster sum of squares as small as possible. Note that this is the model evaluation metric of k-means, not a loss function.
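The loop described above, and the inertia it minimizes, can be written out directly in NumPy. This is only an illustrative sketch (the function name kmeans_sketch is made up here, and empty clusters are not handled); sklearn's KMeans, used in the next section, adds smarter k-means++ initialization and handles the edge cases:

import numpy as np

def kmeans_sketch(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]  # randomly pick k samples as initial centroids
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # squared Euclidean distance from every sample to every centroid
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                          # assign each sample to its nearest centroid
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])  # recompute centroids
        if np.allclose(new_centroids, centroids):             # stop when the centroids no longer move
            break
        centroids = new_centroids
    inertia = ((x - centroids[labels]) ** 2).sum()            # total within-cluster sum of squares
    return labels, centroids, inertia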

1.1 sklearn.cluster.KMeans

Code:

  1. Create a dataset:
from sklearn.datasets import make_blobs as mb
import matplotlib.pyplot as pt
x, y = mb(n_samples=500, n_features=2, centers=4, random_state=1)
print(x.shape)

fig, ax1 = pt.subplots(1)  # fig is the figure window, ax1 is the subplot
ax1.scatter(x[:, 0], x[:, 1], marker='o', s=8)  # marker shape and point size
pt.show()

color = ["red", "pink", "orange", "gray"]
fig, ax1 = pt.subplots(1)

# what the data should look like after clustering (plotted with the true labels y)
for i in range(4):
    ax1.scatter(x[y == i, 0], x[y == i, 1], marker='o', s=8, c=color[i])
pt.show()
  2. Perform clustering:
from sklearn.cluster import KMeans as km
import matplotlib.pyplot as pt
n_clusters = 3
cluster = km(n_clusters=n_clusters, random_state=0).fit(x)
y_pred = cluster.labels_
print(y_pred)  # the predicted cluster label of every sample

center = cluster.cluster_centers_  # centroids
print(center)
print(center.shape)

d = cluster.inertia_  # within-cluster sum of squared distances; smaller is better
print(d)  # 1903 when n_clusters=3

color = ["red", "pink", "orange", "gray"]
fig, ax1 = pt.subplots(1)
for i in range(n_clusters):
    ax1.scatter(x[y_pred == i, 0], x[y_pred == i, 1], marker='o', s=8, c=color[i])
ax1.scatter(center[:, 0], center[:, 1], marker='x', s=18, c="black")
pt.show()

The result is a scatter plot of the three predicted clusters with the centroids marked as black crosses.
If n_clusters=3 is changed to 4, d=908; if it is changed to 5, d=733.
But a smaller inertia is not automatically better: inertia keeps shrinking as the number of clusters grows (see the sketch below). So how should the hyperparameter n_clusters be chosen?
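The numbers above come from simply refitting with a different n_clusters and reading off inertia_. A small loop (reusing x and the km alias from the code above) makes the pattern explicit:

for k in range(2, 8):
    d_k = km(n_clusters=k, random_state=0).fit(x).inertia_
    print(k, d_k)  # inertia keeps decreasing as k increases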

1.2 How to choose the hyperparameter n_clusters? Is inertia a good criterion?

  • Is a smaller inertia always better?
    No.
    First, it has no bound. We only know that a smaller inertia is better and that 0 is best, but we cannot tell whether a given inertia is close to the model's limit or whether it can still be improved.
    Second, its computation is too strongly affected by the number of features. When the data is high-dimensional, computing inertia runs into the curse of dimensionality and the computational cost explodes, so it is unsuitable for evaluating the model over and over again.
    Third, inertia makes assumptions about the distribution of the data. It assumes the data follow a convex distribution (that is, the clusters look like convex blobs when plotted in two dimensions) and that the data are isotropic, meaning the features represent the same thing in every direction. Real-world data are often not like this, so using inertia as an evaluation metric makes the clustering algorithm perform poorly on elongated clusters, ring-shaped clusters, or irregularly shaped manifolds.
    (Figure: k-means run on datasets of different shapes. Only blob-like, convex clusters are recovered correctly; elongated, ring-shaped, or irregular clusters are easily assigned wrongly. The sketch below reproduces the ring-shaped case.)
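This failure is easy to reproduce. The sketch below, which is not part of the original example, runs k-means on a ring-shaped dataset generated with sklearn.datasets.make_circles; the two rings get cut in half rather than separated:

from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
import matplotlib.pyplot as pt

xc, yc = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=1)  # two concentric rings
ring_pred = KMeans(n_clusters=2, random_state=0).fit_predict(xc)              # k-means with 2 clusters
fig, ax = pt.subplots(1)
ax.scatter(xc[:, 0], xc[:, 1], c=ring_pred, marker='o', s=8)  # the plane is split in half, not ring vs ring
pt.show()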

Therefore, when the true labels are unknown, the silhouette coefficient is generally used instead.

1.3 Silhouette coefficient silhouette_score

  • Silhouette coefficient:
    For a single sample, the silhouette coefficient is $s = \dfrac{b - a}{\max(a, b)}$,
    that is: $s = 1 - a/b$ if $a < b$; $s = 0$ if $a = b$; $s = b/a - 1$ if $a > b$,
    where a and b are the intra-cluster and nearest-other-cluster average distances defined below, so s always lies between −1 and 1.
    If most samples in a cluster have relatively high silhouette coefficients, the cluster has a high overall silhouette coefficient, and the higher the mean silhouette coefficient over the whole dataset, the more appropriate the clustering. If many sample points have low or even negative silhouette coefficients, the clustering is inappropriate, and the clustering hyperparameter K may be set too large or too small.

The silhouette coefficient simultaneously measures:

  1. a: the similarity of a sample to the other samples in its own cluster, equal to the average distance between the sample and all other points in the same cluster.
  2. b: the similarity of a sample to the samples in other clusters, equal to the average distance between the sample and all points in the nearest other cluster. Since clustering requires "small differences within clusters and large differences between clusters", we want b to always be larger than a, and the larger the better.

Silhouette coefficient code:

from sklearn.metrics import silhouette_score, silhouette_samples
print(silhouette_samples(x, y_pred))  # silhouette coefficient of every sample point
print(silhouette_score(x, y_pred))    # mean silhouette coefficient of the whole dataset
# y_pred is cluster.labels_ from the clustering step above

Result: the mean silhouette coefficient is highest when n_clusters=4, so based on the silhouette coefficient we choose n_clusters=4 (a small selection loop is sketched below).
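In practice this choice is usually made by scoring several candidate values of n_clusters and keeping the one with the highest mean silhouette. A small loop (reusing x and the km alias from the code above) might look like this:

from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = km(n_clusters=k, random_state=0).fit_predict(x)
    print(k, silhouette_score(x, labels))  # keep the k with the highest mean silhouette coefficient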

1.4 Other evaluation indicators

In addition to the silhouette coefficient, which is the most commonly used, the Calinski-Harabasz index (CH, also known as the variance ratio criterion), the Davies-Bouldin index, and the contingency matrix can also be used.
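The first two indices are available in sklearn.metrics and are called the same way as the silhouette score. For example, a minimal sketch reusing x and y_pred from above (the contingency matrix additionally requires the true labels, so it is omitted here):

from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

print(calinski_harabasz_score(x, y_pred))  # variance ratio criterion: higher is better
print(davies_bouldin_score(x, y_pred))     # Davies-Bouldin index: lower is better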


Origin blog.csdn.net/qq_53982314/article/details/131261650