Machine Learning: K-means and K-medoid Clustering Algorithms

1. The difference between the two algorithms

The two algorithms differ mainly in how the cluster center is chosen: in K-means the center is the mean of the sample points in the cluster, while in K-medoids the center is the sample point whose total distance to the other points in the cluster is smallest. A short sketch of the two update rules follows.
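
As a minimal sketch of the two update rules on one cluster (the three-point toy array below is invented for illustration, and only NumPy is assumed):

import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 8.0]])   # one small cluster containing an outlier

# K-means center: the coordinate-wise mean (need not be an actual sample)
mean_center = cluster.mean(axis=0)                          # about [1.5, 3.33], pulled toward the outlier

# K-medoids center: the sample whose total distance to the other samples is smallest
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid_center = cluster[np.argmin(pairwise.sum(axis=1))]    # [1.0, 1.0], an actual sample

print(mean_center, medoid_center)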

2. Algorithm steps

1. K-means

1. Randomly select K centroids
2. Calculate the distance from each point to every centroid
3. Assign each point to its nearest centroid, forming K clusters
4. Recalculate the centroid of each cluster from the new assignment: the new centroid is the mean of the points in the cluster
5. Repeat steps 2-4 until the change between two consecutive assignments (the classification error) is smaller than a specified threshold

2. K-medoids

1. Randomly select K centroids; each centroid must be one of the sample points
2. Calculate the distance from each point to every centroid
3. Assign each point to its nearest centroid, forming K clusters
4. Recalculate the centroid of each cluster from the new assignment: for every sample in the cluster, compute the sum of its distances to all samples in the cluster, and choose the sample with the smallest sum as the new centroid
5. Repeat steps 2-4 until the change between two consecutive assignments (the classification error) is smaller than a specified threshold

3. Analysis of advantages and disadvantages

1. K-medoids runs more slowly because updating a center has time complexity O(n^2): it must compute the distance between every pair of points in the cluster, whereas K-means only computes the mean of the points, which is O(n).
2. K-medoids is more robust to noise and can separate abnormal samples into their own cluster instead of letting them distort the centers.
3. K-medoids is therefore only suitable for small data sets, while K-means is commonly used on large data sets.

4. K value selection

1. Elbow method
The core metric of the elbow method is the sum of squared errors (SSE). The idea is: as the number of clusters K increases, the samples are partitioned more finely, each cluster becomes more cohesive, and SSE gradually decreases. When K is below the true number of clusters, increasing K greatly improves the cohesion of each cluster, so SSE drops sharply; once K reaches the true number of clusters, the gain in cohesion from increasing K further diminishes quickly, so the drop in SSE levels off as K keeps growing. The SSE-K curve therefore has the shape of an elbow, and the K value at the elbow is taken as the true number of clusters. Plotting SSE against K as a line chart makes the elbow visible, as in the sketch below.
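
The sketch below is one minimal way to draw the SSE-K line chart, assuming scikit-learn is available; the make_blobs data and the range of K values are placeholders:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data with 4 true clusters

ks = range(1, 10)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ is the within-cluster sum of squared errors (SSE)

plt.plot(ks, sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('SSE-K line chart')
plt.show()
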
2. Silhouette coefficient method
The core metric of this method is the silhouette coefficient. The silhouette coefficient of a sample point Xi is defined as:
S = (b - a) / max(a, b)
where a is the average distance between Xi and the other samples in the same cluster, called the cohesion, and b is the average distance between Xi and all samples in the nearest other cluster, called the separation. The nearest cluster is defined as:
C_nearest = argmin_Ck ( (1 / |Ck|) * Σ_{p ∈ Ck} distance(p, Xi) ), taken over the clusters Ck that Xi does not belong to
where p is a sample in cluster Ck. In other words, after taking the average distance from Xi to all samples of a cluster as the measure of the distance from the point to that cluster, the cluster closest to Xi is selected as its nearest cluster.
Compute the silhouette coefficient of every sample and average them to obtain the average silhouette coefficient. Its value lies in [-1, 1]; the closer together the samples within a cluster and the farther apart the samples of different clusters, the larger the average silhouette coefficient and the better the clustering. The K value with the largest average silhouette coefficient is therefore chosen as the optimal number of clusters; a sketch follows.
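
A matching sketch for the silhouette coefficient method, again assuming scikit-learn and the same placeholder data; note that silhouette_score requires K >= 2:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # average silhouette coefficient over all samples
    if score > best_score:
        best_k, best_score = k, score

print('best K:', best_k, 'average silhouette coefficient:', best_score)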

5. Selecting the initial cluster centers

1. Randomly select K points as the initial cluster centers.
2. First randomly select one point as the center of the first initial cluster, then select the point farthest from it as the center of the second initial cluster, then select the point whose distance to the first two centers is largest as the center of the third initial cluster, and so on, until K initial cluster centers have been selected.
3. Use hierarchical clustering or the Canopy algorithm to perform an initial clustering, and then use the centers of the resulting clusters as the initial cluster centers for the KMeans algorithm.
Commonly used hierarchical clustering algorithms are BIRCH and ROCK, which are not introduced here. The Canopy algorithm is briefly introduced below, mainly following Mahout's wiki:
First define two distance thresholds T1 and T2 with T1 > T2. Randomly remove a point P from the initial point set S; then, for each point I still in S, compute the distance between I and P. If the distance is less than T1, add I to the Canopy represented by P; if the distance is also less than T2, remove I from S as well. After this pass, randomly pick another point from S as the new P and repeat the steps above until S is empty.
When the Canopy algorithm finishes, many Canopies are obtained, and each Canopy can be regarded as a cluster. Unlike hard partitioning algorithms such as KMeans, a point in the Canopy result may belong to several Canopies. We can pick the data point closest to the center of each Canopy, or directly use the center of each Canopy, as the K initial cluster centers for KMeans. A rough sketch of the Canopy pass is shown below.
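
The function below is a rough sketch of the Canopy pass described above, not Mahout's implementation; the thresholds t1 > t2 are hypothetical and data dependent, and only NumPy is assumed:

import numpy as np

def canopy(points, t1, t2, seed=0):
    '''points: array of shape (n, p); t1 > t2 are the two distance thresholds.'''
    rng = np.random.default_rng(seed)
    remaining = list(range(len(points)))
    canopies = []  # each canopy is (index of the point P, indices of its members)
    while remaining:
        p = remaining.pop(rng.integers(len(remaining)))  # randomly remove a point P from S
        members = [p]
        still_in_s = []
        for i in remaining:
            d = np.linalg.norm(points[i] - points[p])
            if d < t1:
                members.append(i)     # within T1: join the Canopy represented by P
            if d >= t2:
                still_in_s.append(i)  # only points within T2 are removed from S
        remaining = still_in_s
        canopies.append((p, members))
    return canopies

The centers of the resulting Canopies (or the samples closest to them) can then seed the K initial cluster centers of KMeans.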

6. Code

1. K-means

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

class KMeans():

    def __init__(self, train_data, k_num_center):
        '''
        train_data: training data of shape (m, n)
        k_num_center: the value of K
        '''
        self.train_data = train_data
        self.K = k_num_center

    # Euclidean distance between two points, each of shape (p,)
    def get_euclidean_distance(self, point1, point2):
        return np.sqrt(np.sum((point1 - point2) ** 2))

    # Return the distances from every sample point to every cluster center
    def get_distances(self, train_data, crowds):
        all_distances = []  # distances from all samples to all centers, shape (k, n)
        for i in range(len(crowds)):
            distances = []  # distances from all samples to one center, shape (n,)
            for j in range(len(train_data)):
                distance = self.get_euclidean_distance(train_data[j], crowds[i])
                distances.append(distance)
            all_distances.append(distances)
        return all_distances

    # Sum of the distances from every sample to the center of its assigned cluster (the SSE)
    def get_distances_sse(self, crowds, clsys):
        sse = 0.0
        for i in range(len(self.train_data)):
            sse += float(self.get_euclidean_distance(self.train_data[i], crowds[clsys[i]]))
        return sse

    # Assign every sample to its nearest cluster center; returns labels of shape (n,)
    def classify(self, crowds):
        all_distances = self.get_distances(self.train_data, crowds)
        clsy = np.argmin(all_distances, axis=0)
        return clsy

    # Compare two consecutive label assignments
    def clsy_change(self, new_clsy, clsy):
        changed = False
        for i in range(len(clsy)):
            if clsy[i] != new_clsy[i]:
                changed = True
                break
        return changed

    # Randomly initialise K cluster centers
    def init_crowds(self):
        numOfAttr = self.train_data.shape[1]
        centroids = np.zeros((self.K, numOfAttr))  # the centers that will be iterated towards each cluster
        maxAttr = np.zeros(numOfAttr)              # maximum of each dimension
        minAttr = np.zeros(numOfAttr)              # minimum of each dimension
        for i in range(numOfAttr):
            maxAttr[i] = np.max(self.train_data[:, i])
            minAttr[i] = np.min(self.train_data[:, i])
            for j in range(self.K):
                # random initialisation inside [min, max] of each dimension
                centroids[j, i] = maxAttr[i] + (minAttr[i] - maxAttr[i]) * np.random.rand()
        return centroids

    # Run the clustering until the labels no longer change
    def final_classify(self):

        crowds = self.init_crowds()
        p = self.train_data.shape[1]  # number of features
        n = len(self.train_data)      # number of samples
        k = len(crowds)               # number of cluster centers

        new_crowds = crowds                   # current cluster centers
        clsy = np.full((n,), -1, dtype=int)   # previous labels, initialised so the loop runs at least once
        new_clsy = np.zeros((n,), dtype=int)  # current labels

        while (clsy != new_clsy).any():
            clsy = new_clsy
            new_clsy = self.classify(new_crowds)

            new_crowds = []  # new cluster centers
            clusters = []    # indices of the samples in each cluster
            for i in range(k):
                clusters.append([])
            for j in range(n):
                clusters[new_clsy[j]].append(j)
            for j in range(k):
                if len(clusters[j]) == 0:
                    # an empty cluster keeps its previous center
                    new_crowds.append(crowds[j])
                else:
                    # K-means update: the new center is the mean of the points in the cluster
                    sums = np.zeros((p,))
                    for m in clusters[j]:
                        sums += self.train_data[m]
                    means = sums / len(clusters[j])
                    new_crowds.append(means)
        sse = self.get_distances_sse(new_crowds, new_clsy)
        return (new_crowds, new_clsy), sse
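
A hypothetical usage sketch for the KMeans class above; the two Gaussian blobs and the choice K = 2 are made up for illustration:

if __name__ == '__main__':
    np.random.seed(0)
    data = np.vstack([np.random.randn(100, 2),
                      np.random.randn(100, 2) + [5.0, 5.0]])
    (centers, labels), sse = KMeans(data, 2).final_classify()
    print('SSE:', sse)
    plt.scatter(data[:, 0], data[:, 1], c=labels, s=10)
    centers = np.array(centers)
    plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x')
    plt.show()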

2. K-medoids

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

class KMediod():

    def __init__(self, train_data, k_num_center):
        self.train_data = train_data
        self.K = k_num_center

    # Euclidean distance between two points, each of shape (p,)
    def get_euclidean_distance(self, point1, point2):
        return np.sqrt(np.sum((point1 - point2) ** 2))

    # Return the distances from every sample point to every cluster center
    def get_distances(self, train_data, crowds):
        all_distances = []  # distances from all samples to all centers, shape (k, n)
        for i in range(len(crowds)):
            distances = []  # distances from all samples to one center, shape (n,)
            for j in range(len(train_data)):
                distance = self.get_euclidean_distance(train_data[j], crowds[i])
                distances.append(distance)
            all_distances.append(distances)
        return all_distances

    # Sum of the distances from every sample to the center of its assigned cluster (the SSE)
    def get_distances_sse(self, crowds, clsys):
        sse = 0.0
        for i in range(len(self.train_data)):
            sse += float(self.get_euclidean_distance(self.train_data[i], crowds[clsys[i]]))
        return sse

    # Assign every sample to its nearest cluster center; returns labels of shape (n,)
    def classify(self, crowds):
        all_distances = self.get_distances(self.train_data, crowds)
        clsy = np.argmin(all_distances, axis=0)
        return clsy

    # Compare two consecutive label assignments
    def clsy_change(self, new_clsy, clsy):
        changed = False
        for i in range(len(clsy)):
            if clsy[i] != new_clsy[i]:
                changed = True
                break
        return changed

    # Randomly initialise K cluster centers
    def init_crowds(self):
        numOfAttr = self.train_data.shape[1]
        centroids = np.zeros((self.K, numOfAttr))  # the centers that will be iterated towards each cluster
        maxAttr = np.zeros(numOfAttr)              # maximum of each dimension
        minAttr = np.zeros(numOfAttr)              # minimum of each dimension
        for i in range(numOfAttr):
            maxAttr[i] = np.max(self.train_data[:, i])
            minAttr[i] = np.min(self.train_data[:, i])
            for j in range(self.K):
                # random initialisation inside [min, max] of each dimension
                centroids[j, i] = maxAttr[i] + (minAttr[i] - maxAttr[i]) * np.random.rand()
        return centroids

    # Run the clustering until the labels no longer change
    def final_classify(self):

        crowds = self.init_crowds()
        p = self.train_data.shape[1]  # number of features
        n = len(self.train_data)      # number of samples
        k = len(crowds)               # number of cluster centers

        new_crowds = crowds                   # current cluster centers
        clsy = np.full((n,), -1, dtype=int)   # previous labels, initialised so the loop runs at least once
        new_clsy = np.zeros((n,), dtype=int)  # current labels

        while (clsy != new_clsy).any():
            clsy = new_clsy
            new_clsy = self.classify(new_crowds)

            new_crowds = []  # new cluster centers
            clusters = []    # indices of the samples in each cluster
            for i in range(k):
                clusters.append([])
            for j in range(n):
                clusters[new_clsy[j]].append(j)
            for j in range(k):
                if len(clusters[j]) == 0:
                    # an empty cluster keeps its previous center
                    new_crowds.append(crowds[j])
                else:
                    # K-medoids update: the new center is the sample whose total distance
                    # to all samples in the cluster is smallest
                    now_distance = float('inf')
                    new_crowd = crowds[j]
                    for num in range(len(clusters[j])):
                        candidate = self.train_data[clusters[j][num]]
                        distances = [self.get_euclidean_distance(candidate, self.train_data[idx])
                                     for idx in clusters[j]]
                        new_distance = np.sum(distances)
                        if new_distance < now_distance:
                            now_distance = new_distance
                            new_crowd = candidate
                    new_crowds.append(new_crowd)
        sse = self.get_distances_sse(new_crowds, new_clsy)
        return (new_crowds, new_clsy), sse
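
An analogous hypothetical usage sketch for the KMediod class, with the same invented data and K = 2:

if __name__ == '__main__':
    np.random.seed(0)
    data = np.vstack([np.random.randn(100, 2),
                      np.random.randn(100, 2) + [5.0, 5.0]])
    (centers, labels), sse = KMediod(data, 2).final_classify()
    print('medoid centers:', centers)
    print('SSE:', sse)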
