Cluster Analysis | Overview and the k-means Clustering Algorithm (KMeans) for Data Processing and Visualization

I. Overview

  • Purpose of cluster analysis
    • Divide the data points or samples of a large data set into categories by "similarity"
  • Common scenarios
    • Exploratory analysis when there is no prior experience or background knowledge
    • Data preprocessing when the sample size is large
    • Binning a feature's values into several categories
  • Questions cluster analysis can answer
    • How many categories can the data set be divided into
    • How many samples does each category contain
    • How strong are the relationships between variables across different categories
    • What are the typical characteristics of each category
  • k-means clustering algorithm KMeans
    • Precautions
      • Handle outliers first
      • If the modeling features differ greatly in scale, normalize / standardize them
    • Creating a KMeans model object
      • n_clusters: number of clusters
      • init='k-means++': picks initial centers that are relatively far from each other
      • random_state: random number seed
      • kmeans.inertia_: within-cluster sum of squared errors
      • metrics.silhouette_score(): silhouette coefficient
      • kmeans_model.cluster_centers_: cluster centers
      • kmeans_model.labels_: cluster label of each sample after fitting
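Taken together, the precautions and attributes above can be sketched in a few lines. The synthetic data from make_blobs and the choice of 3 clusters are assumptions for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Synthetic 2-feature data (assumption: 3 well-separated blobs, for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=11)

# Standardize first: features on very different scales distort the distance metric
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=11, n_init=10)
labels = kmeans.fit_predict(X_scaled)

print(kmeans.inertia_)                             # within-cluster sum of squared errors
print(kmeans.cluster_centers_.shape)               # one center per cluster: (3, 2)
print(metrics.silhouette_score(X_scaled, labels))  # closer to 1 = better separation
```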

II. Case Study

1 Data Preparation

import pandas as pd
df = pd.read_csv('data.csv')

# use the last two columns as the basis for clustering
x = df.iloc[:, 3:].values

 

2 Create the KMeans Model and Cluster (core code)

# import the package
from sklearn.cluster import KMeans

# create the model
kmeans_model = KMeans(n_clusters=5, init='k-means++', random_state=11)

# run the clustering
y_kmeans = kmeans_model.fit_predict(x)

At this point the data has been divided into five clusters; y_kmeans holds the cluster label of each sample, which can be added back to the data.
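One way to add the labels back to the original table is to attach y_kmeans as a new DataFrame column. The toy DataFrame and the column name 'cluster' below are assumptions for illustration, standing in for the real data.csv:

```python
import pandas as pd
from sklearn.cluster import KMeans

# stand-in for the real data.csv (assumption: two numeric features)
df = pd.DataFrame({'income': [15, 16, 78, 80, 45, 47],
                   'score':  [39, 81, 17, 93, 50, 55]})
x = df.values

kmeans_model = KMeans(n_clusters=3, init='k-means++', random_state=11, n_init=10)
y_kmeans = kmeans_model.fit_predict(x)

# attach the cluster label of each row (column name 'cluster' is an assumption)
df['cluster'] = y_kmeans
print(df)
```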

 

3 Visualize the Clustering Result

# import the visualization package
import matplotlib.pyplot as plt
%matplotlib inline

# color and label lists
colors_list = ['red', 'blue', 'green', 'yellow', 'pink']
labels_list = ['Traditional', 'Normal', 'TA', 'Standard', 'Youth']

# x is already an ndarray (df.iloc[:, 3:].values), so x[y_kmeans==i, 0] works directly

for i in range(5):
    plt.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1], s=100, c=colors_list[i], label=labels_list[i])

# cluster centers
plt.scatter(kmeans_model.cluster_centers_[:, 0], kmeans_model.cluster_centers_[:, 1], s=300, c='black', label='Centroids')

plt.legend()
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()

 

4 Evaluate the Number of Clusters

# list to hold the within-cluster sum of squared errors
distortion = []

for i in range(1,11):
    kmeans = KMeans(n_clusters=i,init='k-means++', random_state=11)
    kmeans.fit(x)
    distortion.append(kmeans.inertia_)
    
plt.plot(range(1,11), distortion)

plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
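The elbow plot is read by eye; the silhouette coefficient from the overview gives a numeric complement, where a higher score means better-separated clusters. A sketch on synthetic blobs (the make_blobs data and the k range are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data (assumption: 5 blobs, standing in for the real feature matrix x)
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=11)

scores = {}
for k in range(2, 11):   # silhouette is only defined for at least 2 clusters
    km = KMeans(n_clusters=k, init='k-means++', random_state=11, n_init=10)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# candidate k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```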

 

Done!

A few variable descriptions were attached here (as images) for review.
This article is for learning purposes only.


Origin www.cnblogs.com/ykit/p/12383257.html