Data Mining in Practice (11): The K-Means Algorithm

1. K-Means Algorithm

  1. Algorithm Introduction

K-Means is an unsupervised clustering algorithm. What does "unsupervised" mean? It means that during training, the algorithm is never told which category a data point belongs to. K-means iteratively finds the cluster centroids that best represent the data. It starts with a few data points randomly picked from the training set as initial centroids. The k in k-means is the number of centroids to look for, which is also the number of clusters the algorithm will produce.

  2. Algorithm Process

(1) Select the initial centroids. Randomly pick k points from the data set as the initial centroids.

(2) For every other point in the data set, compute its distance to each centroid and assign it to the category of the nearest centroid.

(3) Recompute each centroid as the mean of the points assigned to it.

(4) Re-partition. Reassign the samples using the rule from step (2).

(5) Repeat steps (3) and (4) until a stopping criterion is met. The criterion can be a maximum number of iterations, or iteration can stop once the centroids no longer change (or move less than some threshold).
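The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library implementation; the function name and the empty-cluster handling are my own choices:

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # (1) pick k random data points as the initial centroids
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(max_iter):
        # (2) assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # (3) recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # (5) stop once the centroids move less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```

Step (4) happens implicitly: each pass through the loop reassigns every sample before updating the centroids again.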

2. K-Means++ algorithm

The K-Means++ algorithm differs from the classic K-means above only in the first step: it initializes the centroids in a more principled way instead of choosing them all uniformly at random. The first centroid is picked at random; each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which spreads the initial centroids out and makes the result much less sensitive to an unlucky random draw.
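The seeding step can be sketched as follows. This is a simplified illustration (the function name `kmeanspp_init` is my own; sklearn's real implementation additionally tries several candidates at each step):

```python
import numpy as np

def kmeanspp_init(data, k, seed=0):
    """k-means++ seeding: spread the initial centroids out."""
    rng = np.random.default_rng(seed)
    # first centroid: uniformly at random from the data
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min(((data[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2), axis=1)
        # pick the next centroid with probability proportional to that distance
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)
```

Points far from every chosen centroid get a large sampling weight, so the next centroid tends to land in a region not yet covered.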

3. Use of sklearn-based K-Means algorithm

  1. Create the Dataset

First, use make_blobs in sklearn to generate isotropic Gaussian clusters. Then plot the data:

import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
# number of samples in the data set
data_num = 1000
# the k value, which is also the number of centers used to generate the data
k_num = 4

# generate clustered data; n_features defaults to 2 (two-dimensional data),
# centers is the number of cluster centers, random_state is the random seed
data, y = ds.make_blobs(n_samples=data_num, centers=k_num, random_state=0)

data is an array of two-dimensional coordinates, and y holds the labels, ranging from 0 to 3. The plotting code is as follows:

# each data cluster is drawn in a different color
data_colors = matplotlib.colors.ListedColormap(['red', 'blue', 'yellow', 'cyan'])
# data is two-dimensional
plt.scatter(data[:, 0], data[:, 1], c=y, cmap=data_colors)
plt.title("original data")
plt.grid()
plt.show()

The drawing looks like this:

2. Using K-Means

Use the KMeans class from sklearn.cluster to run the k-means algorithm; the full parameter list is in the official documentation. (The signature below is from an older sklearn release: precompute_distances and n_jobs have since been removed.)

'''
sklearn.cluster.KMeans(
    n_clusters=8, 
    init='k-means++', 
    n_init=10, 
    max_iter=300, 
    tol=0.0001, 
    precompute_distances='auto', 
    verbose=0, 
    random_state=None, 
    copy_x=True, 
    n_jobs=1, 
    algorithm='auto' )
Parameter notes:
(1) n_clusters: the number of clusters, i.e. the k value
(2) init: how the initial centroids are chosen, either 'k-means++' or 'random'
(3) n_init: how many times k-means is run with different random centroids; the best result is kept
(4) max_iter: the maximum number of iterations per run
(5) tol: the relative tolerance used to declare convergence
(6) random_state: the random seed for centroid initialization
(7) n_jobs: parallel workers; -1 means use all processors
'''
from sklearn.cluster import KMeans
model=KMeans(n_clusters=k_num,init='k-means++')
# fit the model
model.fit(data)
# predict the cluster of each sample
y_pre = model.predict(data)
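After fitting, the model also exposes the learned centroids and the final objective value. These are standard sklearn KMeans attributes; the snippet below rebuilds the same setup so it runs on its own:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same setup as above, rebuilt so this snippet is self-contained
data, y = make_blobs(n_samples=1000, centers=4, random_state=0)
model = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(data)

print(model.cluster_centers_)  # (4, 2) array: one centroid per cluster
print(model.inertia_)          # sum of squared distances to the nearest centroid
# labels_ from fit equals predict() on the same training data
assert (model.labels_ == model.predict(data)).all()
```

inertia_ is the quantity k-means minimizes, so it is a quick way to compare runs with different k.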

3. Prediction

plt.scatter(data[:,0],data[:,1],c=y_pre,cmap=data_colors)
plt.title("k-means' result")
plt.grid()
plt.show()

The results after classification are as follows:
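Since make_blobs also returns the ground-truth labels, the clustering can be checked numerically as well as visually. Two standard options from sklearn.metrics are the adjusted Rand index (compares against known labels, ignoring label permutation) and the silhouette score (needs no ground truth):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

data, y = make_blobs(n_samples=1000, centers=4, random_state=0)
y_pre = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit_predict(data)

# ARI: 1.0 means a perfect match with the true labels, ~0 means random
print(adjusted_rand_score(y, y_pre))
# silhouette: closer to 1 means tighter, better-separated clusters
print(silhouette_score(data, y_pre))
```

Cluster labels are arbitrary (cluster 0 here may be cluster 2 in another run), which is why a permutation-invariant metric like ARI is used instead of plain accuracy.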

Origin blog.csdn.net/bb8886/article/details/129731355