Brief description of k-means clustering algorithm

First, the general idea:

1. Randomly generate k initial points as centroids within the range of the data, one for each of the k clusters;

2. Assign each point in the data set to the cluster whose centroid is nearest;

3. For each cluster, compute the mean of all the points in the cluster to get the cluster's center, use that point as the new centroid, and repeat from step 2 until no cluster changes (see the sketch after this list).
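
Before the step-by-step code below, here is a minimal vectorized sketch of the same loop. It is an illustration only: it initializes the centroids by sampling k data points rather than sampling the bounding box (which is what randCent in step 3 of the code does), and it ignores the empty-cluster edge case.

import numpy as np

def kmeans_sketch(data, k, n_iter=100):
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = data[np.random.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels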

 

Second, the code

The code below assumes the usual NumPy-style imports: from numpy import * throughout, plus import matplotlib.pyplot as plt for the visualization step.

1. Load the dataset

def loadDataSet(fileName):
    # Initialize the return variable.
    dataMat = []
    # a. The function returns the data in matrix-like form: initialize with
    #    data = [] and fill it with data.append(...).
    # b. Each line of the data file holds one point's coordinates, separated
    #    by tab characters ("\t"), so split on tabs when reading.
    # c. Convert the text fields to numeric float values, e.g. with
    #    map(float, ...).
    f = open(fileName)
    for line in f.readlines():
        curLine = line.strip().split('\t')
        # Python 2's map() returns a list; Python 3's returns an iterator,
        # so wrap it in list() to force the conversion.
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat
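
For example, with a hypothetical tab-separated file named testSet.txt (substitute your own file), loading gives a list of float coordinate pairs:

# testSet.txt holds one point per line, coordinates separated by a tab, e.g.:
# 1.658985    4.285136
# -3.453687   3.424321
dataMat = loadDataSet('testSet.txt')
print(dataMat[0])  # e.g. [1.658985, 4.285136]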

2. Calculate the Euclidean distance

def distEclud(vecA, vecB):
    # Euclidean distance: square root of the sum of squared
    # coordinate differences.
    return sqrt(sum(power(vecA - vecB, 2)))
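
A quick check (with from numpy import *, as noted above):

print(distEclud(mat([0.0, 0.0]), mat([3.0, 4.0])))  # 5.0 = sqrt(3**2 + 4**2)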

3. Randomly generate initial centroids

def randCent(dataSet, k):
    # Number of coordinate dimensions in the data set.
    n = shape(dataSet)[1]
    # Create a matrix to hold the k randomly generated centroids.
    centroids = mat(zeros((k, n)))
    for j in range(n):
        # Find the column's minimum and maximum to get its value range.
        minJ = min(dataSet[:, j])
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        # Generate random coordinate values within that range; numpy's
        # random.rand() returns values in [0, 1).
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids
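
A quick sanity check, assuming dataMat was loaded above:

dataSet = mat(dataMat)
print(randCent(dataSet, 4))
# Prints a 4 x n matrix; every row lies inside the data's bounding box.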

4. Perform k-means clustering

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    # Total number of data points.
    m = shape(dataSet)[0]
    # Helper matrix: column 0 stores the index of the assigned centroid,
    # column 1 stores the squared distance to that centroid.
    clusterAssment = mat(zeros((m, 2)))
    # Call the function from the previous step (passed in as createCent)
    # to generate k random centroids.
    centroids = createCent(dataSet, k)
    # Flag recording whether any centroid assignment changed;
    # if nothing changed, the algorithm has converged.
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            # Track the minimum distance from this point to any centroid,
            # together with that centroid's index.
            minDist = inf
            minIndex = -1
            for j in range(k):
                # Compute the distance from the point to centroid j,
                # save it in distJI, and update the running minimum.
                distJI = distMeas(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j

            # If the index recorded in column 0 of clusterAssment differs
            # from the nearest-centroid index just found, the assignments
            # have not yet converged, so set the flag; then record the new
            # assignment and squared distance.
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        # Print the centroids (Python 3 uses the print() function).
        print(centroids)
        # Recompute each centroid from the new cluster assignments;
        # nonzero() needs an array argument, so .A converts the matrix.
        for cent in range(k):
            ptsInClust = \
                dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
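
Putting steps 1-4 together (testSet.txt is the hypothetical file name from step 1; k = 4 is just an example):

dataSet = mat(loadDataSet('testSet.txt'))
centroids, clusterAssment = kMeans(dataSet, 4)
# centroids: final k x n centroid matrix
# clusterAssment: per-point centroid index and squared distance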

5. Visualize the results

def show(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    # Marker styles for the data points, one per cluster.
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        # Color each point according to its assigned centroid index.
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # Diamond markers for the centroids themselves.
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()
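
With the results from step 4 in hand, the clusters can be plotted (this assumes import matplotlib.pyplot as plt and two-dimensional data, since only the first two coordinates are drawn):

show(dataSet, 4, centroids, clusterAssment)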


(Note: the visualization code is adapted from https://www.cnblogs.com/MrLJC/p/4127553.html)

The bisecting k-means algorithm is to be continued...
