100 Days of Machine Learning | Day 44: k-means clustering, mathematical derivation and Python implementation

1, k-means clustering model
Given a set of n samples, each described by an m-dimensional feature vector, the goal is to partition the n samples into k classes (clusters) so that every sample is closer to the center of its own class than to any other center, and every sample belongs to exactly one class. The partition is denoted C. It is a many-to-one function, so k-means clustering can be viewed as a function from samples to classes.
2, k-means clustering strategy
The k-means strategy is to select the optimal partition by minimizing a loss function.
First, define the distance between samples. Here the squared Euclidean distance is used:
$$d(x_i, x_j) = \sum_{t=1}^{m} (x_{ti} - x_{tj})^2 = \|x_i - x_j\|^2$$
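As a quick numeric check of the squared Euclidean distance, here is a minimal sketch. The function name `squared_euclidean` and the sample vectors are illustrative assumptions, not part of the original post.

```python
import numpy as np

def squared_euclidean(x_i, x_j):
    """Squared Euclidean distance between two feature vectors (hypothetical helper)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.dot(diff, diff))

a = [1.0, 2.0]
b = [4.0, 6.0]
print(squared_euclidean(a, b))  # (1-4)^2 + (2-6)^2 = 9 + 16 = 25.0
```

No square root is taken, so the value is the square of the ordinary Euclidean distance (5 in this example).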
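Then define the loss function as the sum of the distances between each sample and the center of the class it belongs to:
$$W(C) = \sum_{l=1}^{k} \sum_{C(i)=l} \|x_i - \bar{x}_l\|^2$$
where $C(i) = l$ means that sample i is assigned to class l, $\bar{x}_l = (\bar{x}_{1l}, \bar{x}_{2l}, \ldots, \bar{x}_{ml})^T$ is the mean, or center, of the l-th class, $n_l = \sum_{i=1}^{n} I(C(i) = l)$ is the number of samples in that class, and $I(C(i) = l)$ is the indicator function, which takes the value 1 or 0.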
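k-means clustering then amounts to solving the optimization problem:
$$C^* = \arg\min_{C} W(C) = \arg\min_{C} \sum_{l=1}^{k} \sum_{C(i)=l} \|x_i - \bar{x}_l\|^2$$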
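To make the loss concrete, the sketch below evaluates it (the sum of squared distances from each sample to the mean of its class) for two candidate partitions of a toy data set. The function name `kmeans_loss`, the label encoding 0..k-1, and the data are assumptions made for illustration.

```python
import numpy as np

def kmeans_loss(X, labels, k):
    """Sum of squared distances from each sample to the mean of its class."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for l in range(k):
        members = X[labels == l]
        if len(members) == 0:
            continue  # an empty class contributes nothing to the sum
        center = members.mean(axis=0)
        total += ((members - center) ** 2).sum()
    return float(total)

X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = [0, 0, 1, 1]  # groups the two left points and the two right points
bad = [0, 1, 0, 1]   # pairs each left point with a right point
print(kmeans_loss(X, good, 2))  # 4.0
print(kmeans_loss(X, bad, 2))   # 100.0
```

The partition that groups nearby points has the smaller loss, which is what the minimization selects.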
3, k-means clustering algorithm
The k-means algorithm is an iterative process that alternates two steps.
First:
For given centers $(m_1, m_2, \ldots, m_k)$, find the partition C that minimizes the objective function:
$$\min_{C} \sum_{l=1}^{k} \sum_{C(i)=l} \|x_i - m_l\|^2$$
That is, with the class centers fixed, assign each sample to a class so that the sum of the distances between the samples and the centers of their classes is minimal. This is achieved by assigning each sample to the class of its nearest center.
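The assignment step can be sketched in a few lines of NumPy. This is an illustrative version under the assumption that samples and centers are plain 2-D arrays; the name `assign_to_nearest` is hypothetical and is not the author's implementation.

```python
import numpy as np

def assign_to_nearest(X, centers):
    """Return, for each sample, the index of the nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_samples, k), computed by broadcasting
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[1.0, 1.0], [9.0, 1.0]]
print(assign_to_nearest(X, centers))  # [0 0 1 1]
```

Each sample is handled independently, so picking the nearest center per sample minimizes the total sum for fixed centers.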
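Then:
For a given partition C, find the center of each class so that the objective function is minimized:
$$\min_{m_1, \ldots, m_k} \sum_{l=1}^{k} \sum_{C(i)=l} \|x_i - m_l\|^2$$
That is, with the partition C fixed, choose the centers so that the sum of the distances between the samples and the centers of their classes is minimal. The solution is that, for each class $G_l$ containing $n_l$ samples, the mean $m_l$ is updated as:
$$m_l = \frac{1}{n_l} \sum_{C(i)=l} x_i, \quad l = 1, \ldots, k$$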
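Why the mean is the minimizer can be seen from a short derivation. It is a standard argument written out here for a single class l with the partition held fixed, not a quotation from the reference:

```latex
\begin{aligned}
f(m_l) &= \sum_{C(i)=l} \|x_i - m_l\|^2 \\
\nabla_{m_l} f(m_l) &= -2 \sum_{C(i)=l} (x_i - m_l) = 0 \\
\Rightarrow\; n_l \, m_l &= \sum_{C(i)=l} x_i \\
\Rightarrow\; m_l &= \frac{1}{n_l} \sum_{C(i)=l} x_i
\end{aligned}
```

Because f is a convex quadratic in $m_l$, this stationary point is its minimum, and the classes can be treated one at a time since each center appears only in its own inner sum.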
Repeat the two steps above until the partition no longer changes.

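from myUtil import *  # the author's helper module; expected to provide randCenters and distEclud
from numpy import shape, asmatrix, zeros, nonzero, mean

def kMeans(dataSet, k):
    m = shape(dataSet)[0]  # number of rows (samples) in the data matrix

    # Core data structure of the algorithm: one row per sample
    # Column 1: index of the cluster center assigned to the sample; column 2: distance from the sample to that center
    ClustDist = asmatrix(zeros((m, 2)))
    ClustDist[:, 0] = -1  # no sample is assigned yet, so the first pass always registers a change

    # Randomly generate the cluster centers: a 4*2 matrix in this example
    # Each center is guaranteed to lie between min(dataSet[:,j]) and max(dataSet[:,j])
    clustercents = randCenters(dataSet, k)  # randomly generated cluster centers

    flag = True  # initialise the flag; iteration starts
    counter = []  # counter (not used below)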

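    # Iterate until the flag stays False
    # Stopping condition: every row of dataSet keeps the same nearest cluster center, i.e. its distance to that center is smaller than to the other k-1 centers
    while flag:
        flag = False  # preset the flag to False

        # ---- 1. Build ClustDist: for each row of dataSet, find the nearest cluster center and its distance ----#
        # and store the result as ClustDist[i] = [minIndex, minDist]
        for i in range(m):

            # Go through the k cluster centers and take the shortest distance
            distlist = [distEclud(clustercents[j, :], dataSet[i, :]) for j in range(k)]
            minDist = min(distlist)
            minIndex = distlist.index(minDist)

            if ClustDist[i, 0] != minIndex:  # the sample moved to a different cluster center
                flag = True  # set the flag back to True and keep iterating

            # Store minIndex and minDist in row i of ClustDist:
            # row i of the data set is assigned to center minIndex, at distance minDist
            ClustDist[i, :] = minIndex, minDist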

        # ---- 2. Update clustercents: loop over cent (0 ~ k-1) ----#
        # Use ClustDist to select the row indices of dataSet assigned to center cent,
        # extract those rows from dataSet to form ptsInClust,
        # and update clustercents[cent] with the column-wise mean of ptsInClust
        for cent in range(k):
            # Row indices whose first column in ClustDist equals cent
            dInx = nonzero(ClustDist[:, 0].A == cent)[0]
            if len(dInx) == 0:
                continue  # empty cluster: keep the old center rather than average nothing
            # Rows of dataSet with those indices form a new data set
            ptsInClust = dataSet[dInx]
            # Column-wise mean of ptsInClust (axis=0 averages over rows, per column)
            clustercents[cent, :] = mean(ptsInClust, axis=0)
    return clustercents, ClustDist
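The post does not show the `myUtil` module, so `randCenters` and `distEclud` are undefined here. The sketch below is one plausible stand-in, written from the comments in the code above (centers drawn between each column's min and max; a Euclidean distance between row vectors). The bodies, and the optional `seed` parameter, are assumptions rather than the author's actual helpers.

```python
import numpy as np

def distEclud(vecA, vecB):
    """Euclidean distance between two row vectors (assumed behaviour)."""
    diff = np.asarray(vecA, dtype=float) - np.asarray(vecB, dtype=float)
    return float(np.sqrt((diff ** 2).sum()))

def randCenters(dataSet, k, seed=None):
    """k random centers, each column drawn uniformly within that
    column's [min, max] range in dataSet (assumed behaviour)."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra argument
    data = np.asarray(dataSet, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    centers = lo + (hi - lo) * rng.random((k, data.shape[1]))
    return np.asmatrix(centers)  # matrix type, to match the kMeans code

data = np.array([[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]])
print(randCenters(data, 4, seed=0).shape)  # (4, 2)
print(distEclud([0, 0], [3, 4]))           # 5.0
```

With helpers like these saved as `myUtil.py`, a call such as `kMeans(np.asmatrix(data), 4)` should return the final centers and the per-sample assignment matrix. Using the plain rather than the squared distance does not change which center is nearest, so the resulting assignments are the same.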

Reference:
https://jakevdp.github.io/PythonDataScienceHandbook
https://www.cnblogs.com/eczhou/p/7860424.html
Statistical Learning Methods, Section 14.3

Origin www.cnblogs.com/jpld/p/11409800.html