Handling missing cluster labels in the k-means clustering algorithm

The k-means clustering algorithm is a very useful tool for crowd segmentation. (For the principle of the algorithm, see the Python K-Means algorithm implementation.)

Typical usage

The algorithm is usually called like this:

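# Import the clustering module from sklearn
from sklearn import cluster
# Initialize the model and set the number of clusters
k_means = cluster.KMeans(n_clusters=9)
# Select the clustering features (stat_score is a pandas DataFrame prepared earlier)
df_pct = stat_score[['feature_1', 'feature_2', 'feature_3']]
# Fit on the selected features, replacing missing values with 0
k_means.fit(df_pct.fillna(0))

# Get the cluster label assigned to each row
labels = k_means.labels_
# Get the cluster centroids
C = k_means.cluster_centers_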

Unexpected behavior

Normally, after the processing above, the code builds the index of a dataframe from the sequence of labels (numbered 0 to 8).
However, while running the code we hit the error raise ValueError('Length of values does not match length of index'). The error occurred because, when the label values were mapped, there were fewer distinct labels than the nine expected from n_clusters=9. In other words, the k-means algorithm returned fewer cluster labels than the number of clusters requested.

Intermediate print output confirmed that only three distinct cluster labels were present (the lines below show the label array, its length, and the set of distinct labels):

>>>>>>>>>>>>>>>>>>>>labels<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[4 4 4 4 4 2 4 4 4 0 4 4 4 0]
14
{0, 2, 4}
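The original post does not show the line that raised the error, so the following is only a sketch of one defensive pattern: any per-cluster statistic is indexed by cluster id (0 to n_clusters - 1) instead of by the labels that happen to occur. The helper name rows_per_cluster is hypothetical; the label values are the ones printed above.

```python
from collections import Counter


def rows_per_cluster(labels, n_clusters):
    """Return a list of length n_clusters with the row count of each cluster id."""
    counts = Counter(int(label) for label in labels)
    # A Counter returns 0 for ids that never occur, so empty clusters are kept.
    return [counts[k] for k in range(n_clusters)]


labels = [4, 4, 4, 4, 4, 2, 4, 4, 4, 0, 4, 4, 4, 0]  # labels printed above
per_cluster = rows_per_cluster(labels, n_clusters=9)
print(per_cluster)       # [2, 0, 1, 0, 11, 0, 0, 0, 0]
print(len(per_cluster))  # 9
```

Because the result always has n_clusters entries, it can be assigned to an index of length 9 even when some clusters are empty.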

Cause

When clustering is drilled down along a dimension of the original data, for example a single store within company-wide orders, some time periods contain too few orders. In the case above, the abnormal drill-down dimension had only 28 feature records, so k-means produced only 3 labels, which in turn caused the number of rows not to match the dataframe.
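The effect can be reproduced with a small sketch. It assumes (this is not stated in the original) that the sparse slice also contains many identical rows, for example rows that became all zeros after fillna(0); the data below is made up for illustration. With fewer distinct points than n_clusters, scikit-learn cannot populate every cluster and is expected to emit a ConvergenceWarning about the number of distinct clusters found.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 14 rows but only 3 distinct feature vectors,
# e.g. many rows that became all zeros after fillna(0).
X = np.array([[0.0, 0.0, 0.0]] * 11 + [[1.0, 1.0, 1.0]] + [[5.0, 5.0, 5.0]] * 2)

# Same number of clusters as in the post; n_init and random_state are
# set explicitly only to make the sketch repeatable.
k_means = KMeans(n_clusters=9, n_init=10, random_state=0)
k_means.fit(X)

distinct = set(k_means.labels_.tolist())
# Number of rows, then number of distinct labels actually assigned.
print(len(k_means.labels_), len(distinct))
```

Identical rows always receive the same label, so the number of distinct labels cannot exceed the number of distinct rows, whatever value n_clusters has.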

Fix

Before clustering on a drill-down dimension, count the feature records in that dimension. If the count is below a certain threshold, consider one of the following:

  1. Clustering in groups defined by a feature tag, then merging the groups;
  2. Using another clustering algorithm, such as spectral clustering.
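The record-count check described above could look roughly like the sketch below. The helper name split_by_record_count, the "store" key, and the threshold of 30 are all assumptions made for illustration; the original post gives no concrete threshold. Pooling the small groups is one possible reading of option 1.

```python
from collections import defaultdict


def split_by_record_count(records, key, min_records):
    """Group records by `key`; keep large groups and pool the small ones."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    # Groups with enough records can be clustered on their own.
    large = {name: rows for name, rows in groups.items() if len(rows) >= min_records}
    # Small groups are pooled so they can be clustered as one batch
    # (or handed to a different algorithm) instead of one by one.
    merged_small = [
        row for rows in groups.values() if len(rows) < min_records for row in rows
    ]
    return large, merged_small


# Hypothetical records: "store" plays the role of the drill-down dimension.
records = (
    [{"store": "A", "feature_1": i} for i in range(40)]
    + [{"store": "B", "feature_1": i} for i in range(5)]
    + [{"store": "C", "feature_1": i} for i in range(3)]
)
large, merged_small = split_by_record_count(records, key="store", min_records=30)
print(sorted(large))      # ['A']
print(len(merged_small))  # 8
```

Only the groups in large would then be clustered individually with the requested number of clusters.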

Origin www.cnblogs.com/shenfeng/p/kmean_label_lacking.html