K-means clustering: a simple code implementation and its connection to the EM algorithm

Algorithm implementation steps

K-means idea:

(1) Randomly select k cluster centers.
(2) For each sample point, compute its distance to every cluster center and assign the point to the cluster with the smallest distance.
(3) Recompute the center of each cluster, and repeat until some stopping condition is met (a fixed number of iterations / the cluster centers converge / a minimum squared error).


EM algorithm idea:

(1) Suppose we want to estimate two parameters A and B, both unknown in the initial state.
(2) However, if we knew A we could derive B from it, and in turn knowing B would give us A.
(3) So first give A some initial value in order to obtain an estimate of B; then, starting from the current value of B, re-estimate A. This process continues until convergence.
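
A toy sketch of this alternation (an illustration of the idea only, not a full EM derivation): estimate the mean of a dataset that has missing entries by repeatedly imputing the missing values with the current mean estimate, then re-estimating the mean from the completed data.

import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0, np.nan])  # toy data with missing entries
mean = 0.0                                        # initial guess for the unknown mean

for _ in range(100):
    # "knowing A gives B": fill the missing entries with the current mean estimate
    filled = np.where(np.isnan(data), mean, data)
    # "knowing B gives A": re-estimate the mean from the completed data
    new_mean = filled.mean()
    if abs(new_mean - mean) < 1e-9:   # stop once the estimate has converged
        break
    mean = new_mean

print(mean)   # converges to the mean of the observed values, 7/3

Here the missing entries play the role of B and the mean plays the role of A; each is re-estimated from the current value of the other until neither changes.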


The connection between the two:

As the two sets of steps above show, k-means is really an instance of the EM idea: neither the cluster assignments nor the cluster centers are known at the start. We pick random initial centers, compute the resulting assignment of points to clusters, then recompute the centers from those assignments, and iterate these two steps until the best result is reached.

The general pattern is an iterative optimization process: there is an objective function, and there are both hidden variables and other parameters. K-means alternates between fixing the parameters to estimate the hidden variables (the cluster assignments) and fixing the hidden variables to estimate the parameters (the centroids), until the objective function is optimized, as the sketch below makes explicit.
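
In code, the alternation looks like this (a minimal sketch on toy data; the E-step/M-step labels are the usual informal correspondence, and the variable names are illustrative):

import numpy as np

pts = np.random.rand(100, 2)                                     # toy sample points
ctrs = pts[np.random.choice(len(pts), 3, replace=False)].copy()  # random initial centers

for _ in range(20):   # fixed iteration budget for the sketch
    # E-step: with the centers fixed, estimate the hidden variables (cluster labels)
    labels = np.argmin(((pts[:, None] - ctrs[None, :]) ** 2).sum(axis=2), axis=1)
    # M-step: with the labels fixed, re-estimate the parameters (center positions)
    for c in range(len(ctrs)):
        if np.any(labels == c):        # sketch-level guard against empty clusters
            ctrs[c] = pts[labels == c].mean(axis=0)

The full, loop-based implementation of the same two steps follows in the Code section below.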


Code:

Calculation steps:
1. Iterate as follows: traverse the sample points; for each sample point i, compute its distance to every centroid, find the minimum distance, and assign that centroid's cluster index to the sample point.
2. After traversing all sample points, recompute each cluster's centroid. Iteration stops once no sample point's cluster assignment changes.
3. Finally, return the centroids and the cluster-assignment matrix of the sample points.

import numpy as np
import matplotlib.pyplot as plt

# 1. Load the data
def loadDataSet(fileName, splitChar='\t'):
    dataSet = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curline = line.strip().split(splitChar)   # split each line on the delimiter
            fltline = list(map(float, curline))       # convert every field to float
            dataSet.append(fltline)
    return np.array(dataSet)
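
The expected input is a plain-text file with one sample per line and the fields separated by the given delimiter. The file img/1.txt used later is not included with the post; an assumed example of the comma-separated format (illustrative values only):

1.0,2.0
1.5,1.8
5.0,8.0

Each line becomes one row of the returned array, so a file of m lines with n fields yields an array of shape (m, n).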


# 2. Euclidean distance: square each element-wise difference of the two vectors,
#    sum, then take the square root
def dist_eclud(vecA, vecB):
    vec_square = []
    for element in vecA - vecB:      # element-wise difference
        element = element ** 2       # square it
        vec_square.append(element)   # collect into the list
    return sum(vec_square) ** 0.5    # sum the list, then take the square root
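
As a side note, the same distance can be computed in a single vectorized call, which would be the more idiomatic NumPy form (an equivalent sketch, not the original author's code):

def dist_eclud_vec(vecA, vecB):
    return np.linalg.norm(vecA - vecB)   # sqrt of the sum of squared differences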


# 3. Build k random centroids
def rand_cent(data_set, k):
    n = data_set.shape[1]
    centroids = np.zeros((k, n))   # shape of the centroid matrix
    # Constrain the centroids to lie within the range of the data set
    for j in range(n):
        min_j = float(min(data_set[:,j]))               # minimum of column j
        range_j = float(max(data_set[:,j])) - min_j     # range of column j
        centroids[:,j] = (min_j + range_j * np.random.rand(k, 1))[:,0]
    return centroids
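
A common alternative (not used in the original post) is to initialize the centroids at k randomly chosen sample points, which guarantees each starting centroid lies on the data:

def rand_cent_from_samples(data_set, k):   # alternative initializer, shown for comparison
    idx = np.random.choice(data_set.shape[0], k, replace=False)
    return data_set[idx].copy()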


# 4. Main k-means routine
def Kmeans(data_set, k):
    m = data_set.shape[0]
    cluster_assment = np.zeros((m, 2))     # per sample: [cluster index, squared distance]
    centroids = rand_cent(data_set, k)     # get k random centroids
    cluster_changed = True

    while cluster_changed:
        cluster_changed = False
        # Traverse the m original data points
        for i in range(m):
            min_dist = np.inf; min_index = -1
            # Traverse the k centroids
            for j in range(k):
                # Distance from data point i to centroid j
                dist_ji = dist_eclud(centroids[j,:], data_set[i,:])
                # Keep the smallest distance and the index of its centroid
                if dist_ji < min_dist:
                    min_dist = dist_ji
                    min_index = j

            # Assign the nearest centroid's cluster index to this sample point;
            # iteration stops once no sample point's assignment changes
            if cluster_assment[i,0] != min_index:
                cluster_changed = True
            cluster_assment[i,:] = min_index, min_dist**2

        # Recompute the centroid of each cluster
        for cent in range(k):
            pts_inclust = data_set[cluster_assment[:,0] == cent]
            centroids[cent,:] = np.mean(pts_inclust, axis=0)

    # Return the centroids and the cluster-assignment matrix of the sample points
    return centroids, cluster_assment
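
One caveat about this implementation (an observation, not a fix from the original post): if a cluster ends up with no assigned points, np.mean over the empty selection returns NaN. A possible guard, sketched with a hypothetical helper:

def safe_cluster_mean(pts_inclust, data_set):
    # Hypothetical helper: fall back to a random data point when a cluster is
    # empty; otherwise return the ordinary mean of the cluster's points.
    if len(pts_inclust) == 0:
        return data_set[np.random.randint(data_set.shape[0])]
    return np.mean(pts_inclust, axis=0)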


# Load the data
data_set = loadDataSet('img/1.txt', splitChar=',')
# Compute the final centroids
my_centroids, my_cluster_assment = Kmeans(data_set, 4)
print("The final 4 centroids are:")
print(my_centroids)

# Coordinates of the original data points
point_x = data_set[:,0]
point_y = data_set[:,1]
# Coordinates of the final centroids
cent_x = my_centroids[:,0]
cent_y = my_centroids[:,1]
# Plot
fig, ax = plt.subplots(figsize=(10,5))
# Original data points, drawn as circles
ax.scatter(point_x, point_y, s=30, c="r", marker="o", label="sample point")
# Final centroids, drawn as inverted triangles
ax.scatter(cent_x, cent_y, s=100, c="black", marker="v", label="centroids")
# Show the legend in the corner
ax.legend()
ax.set_xlabel("factor1")   # x-axis label
ax.set_ylabel("factor2")   # y-axis label
plt.show()
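
The data file img/1.txt from the original post is not provided. To try the script without it, one possibility (an assumption for reproducibility, not part of the original) is to substitute synthetic data in place of the loadDataSet call:

data_set = np.vstack([np.random.randn(50, 2) + offset          # four blobs of 50 points
                      for offset in ([0, 0], [5, 0], [0, 5], [5, 5])])

With four well-separated blobs like these, Kmeans(data_set, 4) should recover one centroid near each offset.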


The results are as follows: four new centroids are obtained.

[Figure: printed coordinates of the four final centroids]


The raw data and the final centroids drawn on the same plot:

[Figure: scatter plot of the sample points with the four centroids marked]

As can be seen, each centroid lies roughly at the center of its group of data points, so the clustering has been achieved.

Reference link: https://blog.csdn.net/weixin_41090915/article/details/79389636

Origin: blog.csdn.net/gm_Ergou/article/details/92074152