I. Overview
- Purpose of cluster analysis
    - Divide the data points (samples) of a large data set into categories by "similarity"
- Common scenarios
    - Exploratory analysis when there is no prior background knowledge
    - Data preprocessing when the sample size is large
    - Binning feature values into several categories
- Questions cluster analysis can answer
    - How many categories the data set can be divided into
    - How many samples each category contains
    - How strong the relationships between variables are across different categories
    - What the typical characteristics of each category are
- K-means clustering algorithm: KMeans
    - Precautions
        - Deal with outliers first
        - If the modeling features differ greatly in scale, normalize / standardize them
    - Creating a KMeans model object
        - n_clusters: number of clusters
        - init='k-means++': when picking initial points, choose points far from the ones already selected
        - random_state: random number seed
    - kmeans.inertia_: within-cluster sum of squared errors
    - metrics.silhouette_score(): silhouette coefficient
    - kmeans_model.cluster_centers_: cluster centers
    - kmeans_model.labels_: cluster label of each sample after fitting
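The API points listed above can be exercised end to end. This is a minimal sketch on synthetic data (the data set and parameter values here are illustrative assumptions, not the case study's real data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# synthetic data with 3 well-separated clusters, as a stand-in for real features
X, _ = make_blobs(n_samples=300, centers=3, random_state=11)

# standardize first: features on very different scales distort distances
X_scaled = StandardScaler().fit_transform(X)

# n_clusters: number of clusters; init='k-means++' spreads initial centers
# far apart; random_state fixes the seed for reproducibility
model = KMeans(n_clusters=3, init='k-means++', random_state=11, n_init=10)
labels = model.fit_predict(X_scaled)

print(model.inertia_)                              # within-cluster sum of squared errors
print(metrics.silhouette_score(X_scaled, labels))  # silhouette coefficient, in [-1, 1]
print(model.cluster_centers_.shape)                # one center per cluster
print(labels[:10])                                 # cluster label of each sample
```

Lower `inertia_` and higher silhouette scores both indicate tighter, better-separated clusters, which is why they appear again in the evaluation step below.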
II. Case Study
1 Data Preparation
```python
import pandas as pd

df = pd.read_csv('data.csv')
# use the last two columns as the basis for clustering
x = df.iloc[:, 3:].values
```
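If the original `data.csv` is not at hand, a two-feature stand-in with the same shape can be generated. The column names below are assumptions chosen to match the plot labels used later:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# five blobs in two dimensions, standing in for income and spending score
features, _ = make_blobs(n_samples=200, centers=5, random_state=11)

# hypothetical columns mirroring a customer table; only the last two matter
df = pd.DataFrame({
    'CustomerID': range(1, 201),
    'Gender': ['Female'] * 200,
    'Age': 30,
    'Annual Income (k$)': features[:, 0],
    'Spending Score (1-100)': features[:, 1],
})

# same slice as in the real script: everything from column 3 onward
x = df.iloc[:, 3:].values
print(x.shape)
```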
2 Create the KMeans Model and Cluster (core code)
```python
# import the package
from sklearn.cluster import KMeans

# create the model
kmeans_model = KMeans(n_clusters=5, init='k-means++', random_state=11)

# run the clustering
y_kmeans = kmeans_model.fit_predict(x)
```
At this point the data has been divided into five categories, and each sample has been assigned a cluster label.
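One of the questions from the overview — how many samples each category contains — can be answered directly from the labels. A self-contained sketch on stand-in data (replace `x_demo`/`y_demo` with the real `x` and `y_kmeans`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# stand-in features and labels; in the case study these are x and y_kmeans
x_demo, _ = make_blobs(n_samples=200, centers=5, random_state=11)
y_demo = KMeans(n_clusters=5, init='k-means++',
                random_state=11, n_init=10).fit_predict(x_demo)

# sample count per cluster
cluster_ids, counts = np.unique(y_demo, return_counts=True)
for cid, n in zip(cluster_ids, counts):
    print(f'cluster {cid}: {n} samples')
```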
3 Visualize the Clustering Result
```python
# import the visualization library
import matplotlib.pyplot as plt
%matplotlib inline

# color and label lists
colors_list = ['red', 'blue', 'green', 'yellow', 'pink']
labels_list = ['Traditional', 'Normal', 'TA', 'Standard', 'Youth']

# x is already an ndarray (from .values in step 1), so boolean-mask
# indexing such as x[y_kmeans == i, 0] works directly
for i in range(5):
    plt.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1],
                s=100, c=colors_list[i], label=labels_list[i])

# cluster centers
plt.scatter(kmeans_model.cluster_centers_[:, 0],
            kmeans_model.cluster_centers_[:, 1],
            s=300, c='black', label='Centroids')

plt.legend()
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
```
4 Evaluate the Number of Clusters
```python
# list to hold the within-cluster sum of squared errors for each k
distortion = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=11)
    kmeans.fit(x)
    distortion.append(kmeans.inertia_)

plt.plot(range(1, 11), distortion)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
```
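The silhouette coefficient mentioned in the overview offers a complementary check on the elbow plot: instead of eyeballing a bend, pick the k with the highest score. A sketch on stand-in data (replace `x_demo` with the real `x`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# stand-in features; in the case study this is x
x_demo, _ = make_blobs(n_samples=200, centers=5, random_state=11)

# silhouette_score needs at least 2 clusters, so the range starts at 2
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, init='k-means++',
                    random_state=11, n_init=10).fit_predict(x_demo)
    scores[k] = silhouette_score(x_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Note that on real data the elbow and silhouette criteria will not always agree; they are heuristics, not ground truth.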
Done!
A few variable notes are attached below for easy review.
This article is for learning purposes only.