Dimensionality Reduction with KMeans Clustering: A Vector Quantization Application

Table of contents

KMeans clustering overview

Model Evaluation Metrics

Labeled Evaluation Metrics

Unlabeled Evaluation Metrics

Case: Vector Quantization Application

KMeans clustering overview

KMeans is the typical representative of clustering algorithms, and also the simplest one. In KMeans, the number of clusters K is a hyperparameter that must be specified by the user. The core task of KMeans is to find the K optimal centroids for the given K and to assign each sample to the cluster represented by its nearest centroid.

Time complexity of the KMeans algorithm: the average complexity of KMeans is O(k*n*T), where k is the number of clusters (our hyperparameter), n is the number of samples in the entire data set, and T is the number of iterations required (for comparison, the average complexity of KNN is O(n)).
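As a quick illustration of the workflow described above, here is a minimal sketch (make_blobs is used purely as an assumed toy dataset; any (n_samples, n_features) array works):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data drawn from 4 "true" blobs
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
# K = 4 is the hyperparameter we must choose by hand
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
kmeans.cluster_centers_   # the 4 optimal centroids found
kmeans.labels_[:10]       # cluster assignments of the first 10 samples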

Model Evaluation Metrics

A clustering model does not output predefined labels, and the clustering result is not uniquely determined; its quality must therefore be evaluated against business requirements or algorithmic criteria.

Labeled Evaluation Metrics

Although no ground-truth labels are fed into the clustering, the data you have may still come with true labels (the probability of this happening is very small). In that case, use the following methods as model evaluation metrics:

Mutual information scores: the value lies in (0,1), and the closer to 1, the better the clustering effect; random uniform labeling produces a score of 0 (for the chance-adjusted variant);

# Mutual information score
metrics.mutual_info_score(y_true, y_pred)
# Adjusted mutual information score
metrics.adjusted_mutual_info_score(y_true, y_pred)
# Normalized mutual information score
metrics.normalized_mutual_info_score(y_true, y_pred)

V-measure, a family of intuitive metrics based on conditional entropy analysis: the value lies in (0,1), and the closer to 1, the better the clustering effect. Since it decomposes into the two metrics of homogeneity and completeness, it allows a more careful look at which task the model is not handling well.
It makes no assumption about the sample distribution and can perform well on any distribution; however, it does not produce a score of 0 under random uniform labeling (it is not adjusted for chance);

# Homogeneity: does each cluster contain only samples of a single class?
metrics.homogeneity_score(y_true, y_pred)

# Completeness: are all samples of a given class assigned to the same cluster?
metrics.completeness_score(y_true, y_pred)

# The harmonic mean of homogeneity and completeness, called V-measure
metrics.v_measure_score(y_true, y_pred)

# All three can be computed at once:
metrics.homogeneity_completeness_v_measure(y_true, y_pred)

Adjusted Rand index: the value lies in (-1,1). Negative values mean that the points within a cluster differ greatly, or are even independent of one another; positive values are better, and the closer to 1, the better the clustering matches the true labels. It makes no assumption about the sample distribution and can perform well on any distribution, especially on data with "folded" shapes; random uniform labeling yields a score of 0;

# Adjusted Rand index
metrics.adjusted_rand_score(y_true, y_pred)
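
To make the snippets above concrete, here is a runnable sketch that puts the labeled metrics together (make_blobs is again an assumed toy dataset; in practice y_true would come from your labeled data):

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=500, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X)

metrics.adjusted_mutual_info_score(y_true, y_pred)
metrics.v_measure_score(y_true, y_pred)
metrics.adjusted_rand_score(y_true, y_pred)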

Unlabeled Evaluation Metrics

In 99% of cases there is no ground-truth label. Such clustering is evaluated entirely by the degree of density within each cluster (small within-cluster differences) and the degree of dispersion between clusters (large between-cluster differences). Among these metrics, the silhouette coefficient is the most commonly used evaluation index for clustering algorithms. It is defined per sample, and it simultaneously measures:

1) The similarity a between a sample and the other samples in its own cluster, equal to the average distance between the sample and all other points in the same cluster;

2) The similarity b between the sample and the samples in other clusters, equal to the average distance between the sample and all points in the next nearest cluster. We want b to be larger than a, and the larger the better. The silhouette coefficient of a single sample is calculated as:

s = (b - a) / max(a, b)

The silhouette coefficient ranges over (-1,1). The closer the value is to 1, the more similar the sample is to the samples in its own cluster and the less similar to samples in other clusters. When a sample is more similar to samples outside its cluster, the silhouette coefficient is negative. When the silhouette coefficient is 0, the samples of the two clusters are equally similar, and the two clusters should really be one cluster. In short, the closer the silhouette coefficient is to 1, the better; negative values indicate a very poor clustering effect.
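sklearn exposes the silhouette coefficient directly: silhouette_score returns the mean over all samples, and silhouette_samples the per-sample values (make_blobs below is an assumed toy dataset):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
silhouette_score(X, labels)         # mean silhouette coefficient
silhouette_samples(X, labels)[:5]   # per-sample coefficients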

Case: Vector Quantization Application

One of the most important applications of K-Means clustering is vector quantization (VQ) on unstructured data (images, sound). Unstructured data often occupies a lot of storage space; the files themselves are large, and operating on them is slow. We would like to reduce the size of unstructured data, or simplify its structure, as much as possible while preserving data quality, and vector quantization helps us achieve exactly this. Vector quantization with K-Means is essentially a dimensionality reduction application, but each dimensionality reduction technique follows a different idea: feature selection reduces dimensions by directly selecting the features that contribute most to the model; PCA reduces dimensions by aggregating information; vector quantization, by contrast, compresses the amount of information carried by the same number of samples. That is, it changes neither the number of features nor the number of samples, only the amount of information stored in the samples under those features.

Replacing the original data with the centroids obtained from K-Means clustering compresses the amount of information in the data to a very small size without losing too much of it. Next, let's see how K-Means can shrink the data size, without losing too much information, through the vector quantization of an image.

1. Import the required library

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle

2. Import data, explore data

china = load_sample_image("china.jpg")    # a (427, 640, 3) RGB image
newimage = china.reshape((427 * 640, 3))  # one RGB row per pixel

import pandas as pd
pd.DataFrame(newimage).drop_duplicates().shape  # number of distinct colors
plt.figure(figsize=(15,15))
plt.imshow(china)

The image contains more than 96,000 distinct colors. We will use K-Means to compress them to 64 without seriously degrading image quality: cluster the 96,000+ colors into 64 classes, then replace all of them with the centroids of the 64 clusters. Remember that centroids have this property: every point in a cluster is closer to its own centroid than to any other centroid.

For comparison, we also plot an image randomly quantized to 64 colors: randomly select 64 sample points as random centroids, compute the distance from every sample in the original data to find the random centroid nearest to it, and then replace each sample with its corresponding random centroid. In both cases, we visualize the resulting images to inspect the loss of image information.

Before that, we need to convert the data into a form that the KMeans class in sklearn can accept:

china = np.array(china, dtype=np.float64) / china.max()  # scale pixel values to [0, 1]
w, h, d = original_shape = tuple(china.shape)            # width, height, depth (3 channels)
image_array = np.reshape(china, (w * h, d))              # one row per pixel
image_array

3. Perform K-Means vector quantization on the data 

n_clusters = 64
# fit KMeans on a random subsample of 1,000 pixels to speed up training
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(image_array_sample)
kmeans.cluster_centers_
# assign every pixel of the full image to its nearest centroid
labels = kmeans.predict(image_array)
labels.shape
# replace each pixel with the centroid of its cluster
image_kmeans = image_array.copy()
for i in range(w*h):
    image_kmeans[i] = kmeans.cluster_centers_[labels[i]]
image_kmeans
pd.DataFrame(image_kmeans).drop_duplicates().shape  # now only 64 distinct colors
image_kmeans = image_kmeans.reshape(w,h,d)          # restore the image shape for plotting
image_kmeans.shape

4. Perform random vector quantization on the data
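
The random quantization itself is a sketch that follows the procedure described above, using the pairwise_distances_argmin imported in step 1 (the names centroid_random and labels_random are illustrative):

# pick 64 random pixels as "centroids"
centroid_random = shuffle(image_array, random_state=0)[:n_clusters]
# for each pixel, find the index of its nearest random centroid
# (axis=0: argmin over the rows of centroid_random)
labels_random = pairwise_distances_argmin(centroid_random, image_array, axis=0)
# replace each pixel with its nearest random centroid
image_random = image_array.copy()
for i in range(w*h):
    image_random[i] = centroid_random[labels_random[i]]
image_random = image_random.reshape(w,h,d)

With image_kmeans and image_random both computed, plot the original image and the two quantized versions to compare the information loss: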

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)
plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(image_kmeans)
plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(image_random)
plt.show()
