Implementing the Standard K-means Algorithm in Python

K-means is widely used, especially for unsupervised learning. It is simple and easy to understand, yet still quite powerful.

1 How the K-means Algorithm Works

The basic idea is to group data points into clusters according to their distances to a set of centers.

See also: http://www.csdn.net/article/2012-07-03/2807073-k-means

That post explains the algorithm well; below is my own summary.

Algorithm overview

1. Randomly pick K seed points from the data; K is chosen by the user.

2. For every point, compute its distance to each of the K seed points (centroids); the point joins the cluster of the nearest seed. Euclidean distance is the usual choice.

3. Move each seed point to the center of its cluster. The center is the arithmetic mean of the cluster's points, computed per dimension.

4. If the new centers differ from the previous ones, i.e. the assignment has not yet converged, repeat steps (2) and (3).
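
The steps above can be sketched as a single assign-then-update iteration with NumPy broadcasting (the function name `kmeans_step` and the toy data are my own, not from the original post):

```python
import numpy as np

def kmeans_step(points, centroids):
    """One k-means iteration: assign each point to its nearest
    centroid (step 2), then move each centroid to the mean of
    its assigned points (step 3)."""
    # Broadcasting (n, 1, d) - (1, k, d) gives an (n, k, d) array,
    # so summing over the last axis yields all squared distances.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)  # index of the nearest centroid
    new_centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels, centroids = kmeans_step(points, centroids)
print(labels)  # -> [0 0 1 1]
```

Repeating `kmeans_step` until `labels` stops changing is exactly step 4.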

(The original post illustrates these steps with a figure, not reproduced here.)

2 Implementing K-means in Python

First, create a simple data set (dataset.txt):

1.658985    4.285136  
-3.453687   3.424321  
4.838138    -1.151539  
-5.379713   -3.362104  
0.972564    2.924086  
-3.567919   1.531611  
0.450614    -3.302219  
-3.487105   -1.724432  
2.668759    1.594842  
-3.156485   3.191137  
3.165506    -3.999838  
-2.786837   -3.099354  
4.208187    2.984927  
-2.123337   2.943366  
0.704199    -0.479481  
-0.392370   -3.963704  
2.831667    1.574018  
-0.790153   3.343144  
2.943496    -3.357075  
-3.195883   -2.283926  
2.336445    2.875106  
-1.786345   2.554248  
2.190101    -1.906020  
-3.403367   -2.778288  
1.778124    3.880832  
-1.688346   2.230267  
2.592976    -2.054368  
-4.007257   -3.207066  
2.257734    3.387564  
-2.679011   0.785119  
0.939512    -4.023563  
-3.674424   -2.261084  
2.046259    2.735279  
-3.189470   1.780269  
4.372646    -0.822248  
-2.579316   -3.497576  
1.889034    5.190400  
-0.798747   2.185588  
2.836520    -2.658556  
-3.837877   -3.253815  
2.096701    3.886007  
-2.709034   2.923887  
3.367037    -3.184789  
-2.121479   -4.232586  
2.329546    3.179764  
-3.284816   3.273099  
3.091414    -3.815232  
-3.762093   -2.432191  
3.542056    2.778832  
-1.736822   4.241041  
2.127073    -2.983680  
-4.323818   -3.938116  
3.792121    5.135768  
-4.786473   3.358547  
2.624081    -3.260715  
-4.009299   -2.978115  
2.493525    1.963710  
-2.513661   2.642162  
1.864375    -3.176309  
-3.171184   -3.572452  
2.894220    2.489128  
-2.562539   2.884438  
3.491078    -3.947487  
-2.565729   -2.012114  
3.332948    3.983102  
-1.616805   3.573188  
2.280615    -2.559444  
-2.651229   -3.103198  
2.321395    3.154987  
-1.685703   2.939697  
3.031012    -3.620252  
-4.599622   -2.185829  
4.196223    1.126677  
-2.133863   3.093686  
4.668892    -2.562705  
-2.793241   -2.149706  
2.884105    3.043438  
-2.967647   2.848696  
4.479332    -1.764772  
-4.905566   -2.911070 

The K-means routines (kmeans2.py):

# -*- coding:utf-8 -*-
# kmeans : k-means cluster

from numpy import *
import matplotlib.pyplot as plt

# Euclidean distance between two vectors
def euclDistance(vector1, vector2):
    return sqrt(sum(power(vector2 - vector1, 2)))

# initialize centroids with random samples from the data set
def initCentroids(dataSet, k):
    # k is the user-chosen number of seed points
    # dataSet holds one sample per row
    numSamples, dim = dataSet.shape
    # numSamples - number of samples; dim - dimensionality (e.g. dim=2 for x/y data)
    centroids = zeros((k, dim))  # k-by-dim matrix of zeros
    for i in range(k):
        index = int(random.uniform(0, numSamples))  # uniform random integer in [0, numSamples)
        centroids[i, :] = dataSet[index, :]  # use row `index` as a seed point (centroid)
    return centroids

# k-means clustering
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster each sample belongs to,
    # second column stores the squared error between the sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in range(numSamples):
            minDist = inf  # smallest distance seen so far
            minIndex = 0   # cluster of the closest centroid
            ## for each centroid
            ## step 2: find the closest centroid
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:  # closer centroid found
                    minDist = distance  # update the smallest distance
                    minIndex = j        # and the corresponding cluster

            ## step 3: update the sample's cluster
            if clusterAssment[i, 0] != minIndex:  # the sample switched clusters
                clusterChanged = True  # so clustering has to continue
            clusterAssment[i, :] = minIndex, minDist ** 2

        ## step 4: update centroids
        for j in range(k):
            # nonzero returns the [row] and [column] indices of the non-zero entries;
            # .A converts the mat to an array so the boolean comparison can be indexed
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            # all samples currently assigned to cluster j; their per-dimension
            # arithmetic mean becomes the new centroid
            if len(pointsInCluster) > 0:  # guard against an empty cluster
                centroids[j, :] = mean(pointsInCluster, axis=0)

    print("Congratulations, cluster complete!")
    return centroids, clusterAssment

# show your cluster, only available for 2-D data
def showCluster(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape  # numSamples - sample count, dim - dimensionality
    if dim != 2:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry! Your k is too large!")
        return 1

    # draw all samples
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']

    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], ms=12.0)
    plt.show()
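
The inner double loop in kmeans() computes one distance at a time. As a side note, the nearest-centroid search can also be vectorized with NumPy broadcasting (the helper name `nearest_centroids` and the toy arrays below are my own, not part of the listing above):

```python
import numpy as np

def nearest_centroids(dataSet, centroids):
    """Vectorized replacement for the per-sample j-loop: returns,
    for every sample, the index of the closest centroid and the
    squared distance to it."""
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d); the norm over the
    # last axis gives an (n, k) matrix of Euclidean distances
    dists = np.linalg.norm(dataSet[:, None, :] - centroids[None, :, :], axis=2)
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(dataSet)), idx] ** 2

data = np.array([[1.0, 1.0], [9.0, 9.0]])
cents = np.array([[0.0, 0.0], [10.0, 10.0]])
idx, sq = nearest_centroids(data, cents)
print(idx)  # -> [0 1]
```

This trades memory (an n-by-k distance matrix) for speed, which is usually a good deal for data sets of this size.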

The driver program (test_kmeans.py):

# -*- coding:utf-8 -*-
from numpy import *
from kmeans2 import *
import matplotlib.pyplot as plt

## step 1: load data
print("step 1: load data ...")
dataSet = []
fileIn = open("F:\\py2projects\\dataset.txt")  # adjust this path to wherever dataset.txt lives
for line in fileIn.readlines():
    lineArr = line.strip().split()
    dataSet.append([float(lineArr[0]), float(lineArr[1])])
fileIn.close()

## step 2: clustering
print("step 2: clustering ...")
dataSet = array(dataSet)

k = 4
centroids, clusterAssment = kmeans(dataSet, k)

## step 3: show the result
print("step 3: show the result ...")
showCluster(dataSet, k, centroids, clusterAssment)
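
The per-line parsing in step 1 could equally be done in one call with numpy.loadtxt, assuming the file is whitespace-separated pairs as above (the StringIO stand-in below just avoids depending on a file path):

```python
import numpy as np
from io import StringIO

# stand-in for dataset.txt: whitespace-separated x, y pairs
sample = StringIO("1.658985 4.285136\n-3.453687 3.424321\n")
data = np.loadtxt(sample)
print(data.shape)  # -> (2, 2)
```

In the driver above you would write `dataSet = np.loadtxt("dataset.txt")` instead of the read loop.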

3 Results

Figure 1

Figure 2

Figure 3

(The three result plots from the original post are not reproduced here.)

The experiments expose one of K-means' weaknesses: the algorithm is initialized with random seed points, so the choice of seeds matters a great deal, and different seeds can produce completely different clusterings (compare Figure 1 and Figure 2).
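
A common mitigation for this sensitivity is k-means++ style seeding: pick the first centroid uniformly at random, then draw each later centroid with probability proportional to its squared distance from the centroids chosen so far, which spreads the seeds out. A sketch (the function name and toy data are my own; this is not part of the code above):

```python
import numpy as np

def init_centroids_pp(dataSet, k, rng=np.random.default_rng(0)):
    """k-means++ style seeding: later seeds are drawn with
    probability proportional to the squared distance from the
    seeds already chosen."""
    centroids = [dataSet[rng.integers(len(dataSet))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen seed
        d2 = np.min(((dataSet[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        centroids.append(dataSet[rng.choice(len(dataSet), p=d2 / d2.sum())])
    return np.array(centroids)

data = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
cents = init_centroids_pp(data, 2)
print(cents)
```

Swapping this in for initCentroids makes the runs far more consistent; it is also, in spirit, what scikit-learn's KMeans does by default.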


Reprinted from blog.csdn.net/louishao/article/details/76619159