Machine Learning: K-means Clustering Algorithm (K-Means)

1. Introduction

Clustering is a typical unsupervised learning technique: it automatically groups similar samples into the same category.

2. Algorithm

  1. Fix the number of clusters K (a constant) and randomly select K samples as the initial cluster centers
  2. For each sample, compute its distance to each of the K cluster centers and assign it to the nearest one, forming K clusters
  3. Compute the mean of the samples in each cluster and use it as that cluster's new center
  4. Repeat steps 2 and 3 (a minimal one-iteration sketch follows this list)
  5. When the cluster centers stop moving (or after a specified number of iterations), clustering is complete
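To make steps 2 and 3 concrete, here is a compact, vectorized sketch of one iteration on a tiny hand-made dataset. It is illustrative only (the points, K=2, and initial centers are made up); a full loop-based implementation follows in section 4.

import numpy as np

# Tiny illustrative dataset: 4 points in 2D, K=2, initial centers = first 2 points
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = points[:2].copy()

# Step 2: assign each point to its nearest center (Euclidean distance)
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # [0, 1, 1, 1] on this first pass

# Step 3: move each center to the mean of the points assigned to it
for k in range(len(centers)):
    centers[k] = points[labels == k].mean(axis=0)
print(labels, centers)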

3. Sample data generation

Use sklearn's make_blobs to generate the sample data:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

K=4
n_samples=100 # 100 samples
X, y = make_blobs(n_samples=n_samples, # 100 samples
                 n_features=2, # 2 features per sample
                 centers=K # K cluster centers
                 )
plt.scatter(
    X[:,0], # first feature as the x coordinate
    X[:,1], # second feature as the y coordinate
    c=y # class labels as colors; samples with the same label share a color
)
plt.show()

[Figure: scatter plot of the generated samples, colored by blob label]
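make_blobs returns the feature matrix and a label vector; a quick check of their shapes (an illustrative snippet, assuming X and y from above; make_blobs also accepts a random_state parameter if you want reproducible draws):

print(X.shape) # (100, 2): n_samples rows, n_features columns
print(y.shape) # (100,): one integer blob label per sample
print(set(y))  # {0, 1, 2, 3}: K distinct labels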

4. Native code implementation

import numpy as np

# Step 1: choose the cluster centers; here we take the first K samples of X
cluster_centers=np.copy(X[0:K])
print('cluster_centers=',cluster_centers)
print('cluster_centers_type=',y[0:K])

# Array holding the cluster index of each sample in X (n_samples entries, filled with -1)
X_type = np.full(n_samples, -1)

# Iteration counter
count=1
while count:
    # Step 2: compute each sample's distance to every cluster center and assign it to the nearest
    for n in range(0,n_samples):
        Min_L=-1
        for i in range(0,K):
            # Euclidean distance from the sample to the cluster center
            L=((X[n][0]-cluster_centers[i][0])**2+(X[n][1]-cluster_centers[i][1])**2)**0.5
            if(Min_L==-1): # no distance recorded yet: adopt this cluster and record the distance
                X_type[n]=i
                Min_L=L
            elif(Min_L>L): # a smaller distance found: update the cluster and the distance
                X_type[n]=i
                Min_L=L
    
    # Step 3: compute the mean of each cluster's samples as the new cluster center
    # Accumulate each cluster's point coordinates and its sample count
    sums = np.zeros([K,3])
    for n in range(0,n_samples):
        sums[X_type[n]][0:2]+=X[n]
        sums[X_type[n]][2]+=1

    # Remember the previous cluster centers
    last_cluster_centers=np.copy(cluster_centers)

    # The mean of each cluster becomes its new center
    for n in range(0,K):
        cluster_centers[n]=sums[n][0:2]/sums[n][2]

    # Exit the loop once the cluster centers stop moving
    if(last_cluster_centers==cluster_centers).all():
        print('count:',count)
        break
    count+=1

plt.scatter(
    X[:,0], # first feature as the x coordinate
    X[:,1], # second feature as the y coordinate
    c=X_type # cluster assignments as colors; samples in the same cluster share a color
)
plt.scatter(cluster_centers[:,0],cluster_centers[:,1],s=200,marker='x',c='red')
plt.show()
cluster_centers= [[ -9.55338872  -1.07204676]
 [ -7.72118512   4.60945541]
 [ -8.87714205  -4.04370297]
 [-10.20593433  -2.53019515]]
cluster_centers_type= [3 1 3 3]
count: 5

[Figure: samples colored by their assigned cluster, with red x markers at the cluster centers]
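One edge case the loop above does not handle: if a cluster ends up with no assigned samples, sums[n][2] is 0 and the mean update divides by zero. A common fix (an illustrative variant, not part of the original post) is to keep the old center for an empty cluster:

# Illustrative guard: only update the centers of non-empty clusters
for n in range(0,K):
    if sums[n][2] > 0:
        cluster_centers[n] = sums[n][0:2] / sums[n][2]
    # else: an empty cluster keeps its previous center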

5. sklearn code implementation

from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=K).fit(X)

print('cluster.cluster_centers_=',cluster.cluster_centers_) # coordinates of each cluster center

print('cluster.inertia_=',cluster.inertia_) # sum of squared distances from each sample to its closest cluster center

plt.scatter(
    X[:,0], # first feature as the x coordinate
    X[:,1], # second feature as the y coordinate
    c=y # ground-truth blob labels as colors; samples with the same label share a color
)
plt.scatter(cluster.cluster_centers_[:,0],cluster.cluster_centers_[:,1],s=200,marker='x',c='red')
plt.show()
cluster.cluster_centers_= [[-8.41314315 -1.85636774]
 [-0.95072532  7.87723591]
 [ 0.37558634 -6.38892274]
 [-7.98962683  3.91523476]]
cluster.inertia_= 217.84359295947405

[Figure: samples colored by blob label, with red x markers at the learned cluster centers]
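The fitted estimator also exposes labels_ (the cluster index assigned to each training sample, sklearn's counterpart of X_type above) and a predict method for new points. A quick illustration, with made-up query points:

import numpy as np

print(cluster.labels_[:10]) # cluster index of the first 10 training samples

new_points = np.array([[-8.0, -2.0], [0.0, 8.0]]) # made-up query points
print(cluster.predict(new_points)) # index of the nearest learned center for each point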

6. Distance formulas

Euclidean distance (straight-line distance between two points):
$$d(x,\mu)=\sqrt{\sum_{i=1}^n (x_i-\mu_i)^2}$$

Manhattan distance (absolute distance):
$$d(x,\mu)=\sum_{i=1}^n |x_i-\mu_i|$$

Cosine distance:
$$\cos\theta=\frac{\sum_{i=1}^n x_i\mu_i}{\sqrt{\sum_{i=1}^n x_i^2}\,\sqrt{\sum_{i=1}^n \mu_i^2}}$$
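For reference, all three can be computed directly with NumPy (a small illustrative snippet with made-up vectors; scipy.spatial.distance provides equivalents):

import numpy as np

x  = np.array([1.0, 2.0, 3.0]) # made-up sample vector
mu = np.array([2.0, 0.0, 4.0]) # made-up center vector

euclidean = np.sqrt(np.sum((x - mu)**2)) # same as np.linalg.norm(x - mu)
manhattan = np.sum(np.abs(x - mu))
cos_theta = np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

print(euclidean, manhattan, cos_theta)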




Origin blog.csdn.net/Leytton/article/details/103892005