[Machine Learning] Experiment 3: K-Means Clustering

Introduction

In this lab, you will implement the K-means clustering algorithm, see how it clusters data, and apply it to image compression.

The data sets needed for this experiment include:

  • ex3data1.mat - 2D dataset
  • hzau.jpeg - image used to test the image compression performance of the k-means clustering algorithm

The scoring criteria are as follows:

  • Point 1: Find the nearest class center point (20 points)
  • Point 2: Calculate the mean class center (20 points)
  • Point 3: Randomly initialize the class center (10 points)
  • Point 4: K-means clustering algorithm (20 points)
  • Point 5: Image compression (30 points)

 

# Import the required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sb
from scipy.io import loadmat

%matplotlib inline

1 K-means Clustering

In this part of the experiment, the K-means clustering algorithm will be implemented.

In each iteration, the algorithm performs two steps: assigning each sample to the nearest class center, and recomputing each class center as the mean of the samples assigned to it.

You will also write an initialization function that randomly selects k samples and uses them as the initial class centers.

1.1 Find the nearest class center point

In this part of the experiment, we will find the nearest class center for each sample point and assign it to the corresponding class.

The specific update formula is as follows:

$$c_i := \arg\min_{j} \lVert x_i - \mu_j \rVert^2$$

where $x_i$ is the $i$-th sample point and $\mu_j$ is the $j$-th mean class center.

**Point 1:** In the cell below, please **implement the code** for "finding the nearest class center point".

 

# ====================== Fill in your code here =======================
def find_closest_centroids(X, centroids):
    """
    Parameters
    ----------
    X : array of shape (m, n); row i is the i-th sample, n is the dimension.
    centroids : array of shape (k, n), where k is the number of classes.

    Returns
    -------
    idx : array of shape (m,); entry i is the class index of the i-th sample.
    """
    m = X.shape[0]
    k = centroids.shape[0]
    idx = np.zeros(m, dtype=int)

    for i in range(m):
        # Keep the centroid with the smallest squared Euclidean distance
        min_dist = np.inf
        for j in range(k):
            dist = np.sum((X[i, :] - centroids[j, :]) ** 2)
            if dist < min_dist:
                min_dist = dist
                idx[i] = j

    return idx
# ============================================================= 
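For comparison, the same assignment step can be written without explicit loops; here is a vectorized sketch (an optional alternative, not the required implementation):

# Vectorized alternative (sketch): broadcasting yields the (m, k) matrix of
# squared sample-to-centroid distances; argmin along axis 1 picks the class.
def find_closest_centroids_vec(X, centroids):
    dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)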

Once the function find_closest_centroids above is complete, the following code can be used for testing. If the result is [0 2 1], the implementation passes.

# Load the data
data = loadmat('ex3data1.mat')
X = data['X']
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])

idx = find_closest_centroids(X, initial_centroids)
idx[0:3]

Output result:

 

# Display the first few rows of the data
data2 = pd.DataFrame(data.get('X'), columns=['X1', 'X2'])
data2.head()

Output result:

 

# Visualize the 2D data
fig, ax = plt.subplots(figsize=(9,6))
ax.scatter(X[:,0], X[:,1], s=30, color='k', label='Original')
ax.legend()
plt.show()

Output result:

 

1.2 Calculate the mean class center

In this part of the experiment, we use the mean of the samples in each class as the new class center.

The specific update formula is as follows:

$$\mu_j := \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i$$

where $C_j$ is the set of indices of the samples assigned to class $j$, and $\lvert C_j \rvert$ is the number of elements in the set $C_j$.

**Point 2:** In the cell below, please **implement the code** for "calculating the mean class center".

 

# ====================== Fill in your code here ======================= 
def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        # New center i is the mean of all samples currently assigned to class i
        indices = np.where(idx == i)[0]
        centroids[i, :] = np.sum(X[indices, :], axis=0) / len(indices)
    return centroids
# ============================================================= 
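The mean update can also be vectorized; a sketch assuming every class has at least one assigned sample (this hypothetical helper is not part of the lab):

# Vectorized alternative (sketch): row j of the (k, m) mask marks the samples
# assigned to class j; the matrix product sums them, and dividing by the
# per-class counts gives the means. Assumes no class is empty.
def compute_centroids_vec(X, idx, k):
    mask = (idx[None, :] == np.arange(k)[:, None]).astype(float)
    counts = mask.sum(axis=1, keepdims=True)
    return (mask @ X) / counts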
# Test the code above for calculating the mean class centers
compute_centroids(X, idx, 3)

Output result:

 

1.3 Randomly initialize the class center

Randomly select k samples as initial class centers.

**Point 3:** In the cell below, please **implement the code** for "randomly initializing the class centers". Specifically, k samples are randomly selected as the initial class centers.

# ====================== Fill in your code here ======================= 
def init_centroids(X, k):
    m, n = X.shape
    # Draw k distinct sample indices so the initial centers are all different
    idx = np.random.choice(m, k, replace=False)
    centroids = X[idx, :]
    return centroids
# ============================================================= 
# Test the random-initialization code above
init_centroids(X, 3)

Output result:
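Since the initialization is random, each run returns a different set of centers. For reproducible experiments you can seed NumPy's global random generator first (a usage note, not part of the scoring points):

# Seeding the RNG makes the random initialization reproducible across runs
np.random.seed(42)
init_centroids(X, 3)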

1.4 Implement the K-means clustering algorithm

**Point 4:** In the cell below, please **implement the code** for the "K-means clustering algorithm" by combining the steps above.

# ====================== Fill in your code here =======================
def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids

    for i in range(max_iters):
        # Step 1: assign every sample to its nearest class center
        idx = find_closest_centroids(X, centroids)
        # Step 2: move each class center to the mean of its assigned samples
        centroids = compute_centroids(X, idx, k)

    return idx, centroids
# ============================================================= 
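The loop above always runs for the full max_iters iterations. K-means has converged once the assignments stop changing, so a variant (a sketch built from the same two helper functions) can stop early:

# Variant (sketch): stop as soon as the cluster assignments no longer change
def run_k_means_converged(X, initial_centroids, max_iters):
    k = initial_centroids.shape[0]
    centroids = initial_centroids
    idx = find_closest_centroids(X, centroids)
    for _ in range(max_iters):
        centroids = compute_centroids(X, idx, k)
        new_idx = find_closest_centroids(X, centroids)
        if np.array_equal(new_idx, idx):
            break  # assignments are stable, so the centroids are final
        idx = new_idx
    return idx, centroids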

 

2 Applying the K-means clustering algorithm to dataset 1

In this part of the experiment, the K-means clustering algorithm implemented above is applied to dataset 1. The samples in this dataset are two-dimensional, so the clustering result can be inspected visually once clustering is complete.

idx, centroids = run_k_means(X, initial_centroids, 10)
cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]

fig, ax = plt.subplots(figsize=(9,6))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()

Output result:
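Beyond the visual check, the clustering can be scored by its distortion, the mean squared distance from each sample to its assigned center, a quantity that each K-means iteration never increases. A short sketch using the idx and centroids returned above:

# Distortion: mean squared distance from each sample to its assigned center
J = np.mean(np.sum((X - centroids[idx.astype(int), :]) ** 2, axis=1))
print(J)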

3 Applying the K-means clustering algorithm to image compression

# Read the image
A = mpl.image.imread('hzau.jpeg')
A.shape

Output result:

The loaded array A has shape (height, width, 3), one 8-bit value per RGB channel. We now preprocess this data and feed it to the K-means algorithm.

# Normalize the pixel values to the range [0, 1]
A = A / 255.

# Reshape the image into an (m, 3) matrix, one RGB sample per pixel
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
X.shape

Output result:

**Point 5:** In the cell below, **please use the K-means clustering algorithm to compress the image**. Specifically, replace each original pixel with the mean class center of the cluster that pixel is assigned to.

 

# ====================== Fill in your code here =======================
# Choose the number of colors for the compressed image (16 is a common choice)
k = 16
initial_centroids = init_centroids(X, k)
idx, centroids = run_k_means(X, initial_centroids, 10)

# Replace each pixel with its mean class center, then restore the image shape
A_compressed = centroids[idx.astype(int), :]
A_compressed = np.reshape(A_compressed, (A.shape[0], A.shape[1], A.shape[2]))

print(A_compressed.shape)
# ============================================================= 

Output result:

 

# Show the images before and after compression
fig, ax = plt.subplots(1, 2, figsize=(9,6))
ax[0].imshow(A)
ax[0].set_axis_off()
ax[0].set_title('Original image')
ax[1].imshow(A_compressed)
ax[1].set_axis_off()
ax[1].set_title('Compressed image')
plt.show()

Output result:
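A brief note on why this counts as compression: the original image stores 24 bits per pixel (three 8-bit channels), while the clustered image only needs the table of k centroid colors plus, for each pixel, the index of its cluster. With k = 16 as chosen above, each index fits in 4 bits, giving roughly a six-fold reduction for large images:

# Approximate storage cost in bits, ignoring file headers (k = 16 as above)
h, w = A.shape[0], A.shape[1]
original_bits = h * w * 24                # 8 bits per RGB channel
compressed_bits = 16 * 24 + h * w * 4     # centroid table + 4-bit indices
print(original_bits / compressed_bits)    # approaches 6 for large images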

 
