A simple, no-frills guide to machine learning clustering algorithms (Part B): a first look at the clustering algorithm API

Clustering Algorithm

Learning targets

  • Master the clustering algorithm workflow
  • Know the theory behind the K-means algorithm
  • Know how to evaluate a clustering model
  • Know the advantages and disadvantages of K-means
  • Understand ways to optimize clustering algorithms
  • Apply K-means to a clustering task

6.2 First use of the clustering algorithm API

1 API introduction

  • sklearn.cluster.KMeans(n_clusters=8)
    • Parameters:
      • n_clusters: the initial number of cluster centers
        • Integer, default = 8: the number of clusters to form, i.e. the number of centroids to generate.
    • Methods:
      • estimator.fit(x)
      • estimator.predict(x)
      • estimator.fit_predict(x)
        • Computes the cluster centers and predicts which cluster each sample belongs to; equivalent to calling fit(x) followed by predict(x).
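As a minimal sketch of these three methods on a hypothetical toy dataset (assuming scikit-learn is installed; the data and variable names below are illustrative, not from the original post):

```python
import numpy as np
from sklearn.cluster import KMeans

# A tiny toy dataset: two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

estimator = KMeans(n_clusters=2, random_state=0, n_init=10)

# fit(x) computes the cluster centers; predict(x) then assigns each sample a label
estimator.fit(X)
labels_two_steps = estimator.predict(X)

# fit_predict(x) is equivalent to fit(x) followed by predict(x)
labels_one_step = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

print(labels_two_steps)
print(labels_one_step)
```

With a fixed random_state both routes give identical labels; fit_predict is simply the more convenient form when you only need the training-set assignments.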

2 Case

Create a random two-dimensional dataset as a training set and run the k-means clustering algorithm on it; try different numbers of clusters and observe the clustering effect:

[Figure: k-means clustering demo (images/cluser_demo1.png); original image link broken]

Different values of the n_clusters parameter yield different clustering results:

[Figure: clustering results for different n_clusters values (images/cluster_demo2.png); original image link broken]

2.1 Process Analysis

[Figure: process analysis (images/cluster_demo3.png); original image link broken]

2.2 Code implementation

1. Create a data set

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # the samples_generator module was removed in newer scikit-learn
from sklearn.metrics import calinski_harabasz_score  # renamed from calinski_harabaz_score in scikit-learn 0.23

# Create the dataset
# X holds the sample features, y the true cluster labels: 1000 samples,
# 2 features each, drawn from 4 clusters with centers at
# [-1, -1], [0, 0], [1, 1], [2, 2] and standard deviations 0.4, 0.2, 0.2, 0.2
X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)

# Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

2. Run k-means clustering and evaluate the result with the Calinski-Harabasz (CH) score

y_pred = KMeans(n_clusters=2, random_state=9).fit_predict(X)
# Try n_clusters = 2, 3 and 4 in turn and compare the clustering results
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

# Clustering score evaluated with the Calinski-Harabasz Index
print(calinski_harabasz_score(X, y_pred))
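Rather than rerunning the script by hand for each cluster count, the comparison of n_clusters = 2, 3 and 4 can be done in a loop (a sketch assuming a recent scikit-learn, where the scorer is named calinski_harabasz_score; a higher CH score indicates better-separated, more compact clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same dataset as above: 4 true clusters in 2-D
X, y = make_blobs(n_samples=1000, n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)

# Fit k-means for each candidate cluster count and record the CH score
scores = {}
for k in (2, 3, 4):
    y_pred = KMeans(n_clusters=k, random_state=9, n_init=10).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, y_pred)
    print(f"n_clusters={k}: CH score = {scores[k]:.1f}")

best_k = max(scores, key=scores.get)
print("best n_clusters:", best_k)
```

Since the data was generated with four cluster centers, the CH score should peak at k = 4 for this example.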
Origin blog.csdn.net/qq_35456045/article/details/104644991