Sklearn K-means clustering



So far, we have thoroughly explored the dataset and split it into a training subset and a test subset.

Next, we will train a model with a clustering method, use the model to predict labels for the test subset, and finally evaluate the model's performance.

Clustering groups similar data points in an unlabeled dataset into the same category. The biggest difference between clustering and classification is that in classification the target classes are known in advance, whereas in clustering they are not. K-means is a commonly used clustering method. It assigns each point to a category based on distance: points that lie relatively close together are grouped into one cluster, and each cluster has a center point.
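To make the distance-based grouping concrete, here is a minimal NumPy sketch of the two alternating steps behind K-means (assign each point to its nearest center, then move each center to the mean of its points). The function name `kmeans_sketch`, the toy points, and the plain random initialization are assumptions for illustration; this is not how `sklearn` implements it.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy K-means loop: alternate an assignment step and an update step."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random (plain random init, not k-means++).
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two well-separated groups of 2-D points (made-up data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start, so only the grouping itself is meaningful, not the cluster numbers.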

We first create a clustering model and fit it to the training subset to obtain the cluster centers. The model is then used to predict labels for the test subset: each test point is assigned to the cluster whose center is nearest.

Create a model

Examples

Create a cluster model.

import numpy as np
from sklearn import datasets

# Load the `digits` dataset
digits = datasets.load_digits()

from sklearn.preprocessing import scale

# Standardize `digits.data`
data = scale(digits.data)

# print(data)

# Import `train_test_split`
from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set.
# `test_size`: a float between 0 and 1 is the fraction of samples in the test subset; an integer is the number of test samples. `random_state`: the random seed.
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.33, random_state=42)

# Import the `cluster` module
from sklearn import cluster

# Create the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the model to the training data `X_train`. The labels `y_train` are not used here, because K-means clustering is unsupervised learning.
clf.fit(X_train)

cluster.KMeans parameter description:

init='k-means++' - the method used to initialize the cluster centers
n_clusters=10 - the number of clusters; the data is grouped into 10 categories
random_state=42 - the random seed

Having created a model with K-means clustering, we now have the center point of each cluster. Points in the test subset can then be classified according to their distance from these centers.
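As a rough check of this nearest-center idea, the sketch below recomputes the assignment by hand from `clf.cluster_centers_` and compares it with `clf.predict`. It rebuilds the model so that it runs on its own; `n_init=10` is set explicitly as an assumption to keep behavior similar across scikit-learn versions, so the fitted centers may differ from those of the model above.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

data = scale(datasets.load_digits().data)
X_train, X_test = train_test_split(data, test_size=0.33, random_state=42)

# n_init=10 is an explicit choice for this sketch, not part of the original code.
clf = cluster.KMeans(init='k-means++', n_clusters=10, n_init=10, random_state=42)
clf.fit(X_train)

# Euclidean distance from every test point to every center: shape (n_test, 10).
dists = np.linalg.norm(X_test[:, None, :] - clf.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

# Fraction of test points where the manual assignment matches `predict`.
agreement = np.mean(nearest == clf.predict(X_test))
print(f"agreement with clf.predict: {agreement:.3f}")
```

Each center has 64 components, one per pixel of the 8x8 digit images, which is why a center can be reshaped and drawn as an image.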

The cluster centers can be displayed as images as follows:


# Import matplotlib
import matplotlib.pyplot as plt

# Figure size (in inches)
fig = plt.figure(figsize=(8, 3))

# Add a title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For each of the 10 clusters
for i in range(10):
    # Create a subplot at position i+1 of a 2x5 grid
    ax = fig.add_subplot(2, 5, 1 + i)
    # Show the cluster center as an 8x8 image
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Hide the axes of this subplot
    ax.axis('off')

# Show the figure
plt.show()

Display:

[Figure: Cluster Center Images, the 10 cluster centers shown as 8x8 images in a 2x5 grid]

Test Model

Next, predict the labels of the test subset:

# Predict the labels of `X_test`
y_pred = clf.predict(X_test)

# Print the first 100 entries of `y_pred`
print(y_pred[:100])

# Print the first 100 entries of `y_test`
print(y_test[:100])

Output:

[0 3 3 6 8 9 8 9 8 8 4 2 7 1 2 4 3 7 3 8 2 8 3 7 4 0 3 8 0 3 2 3 9 2 2 0 3
 2 7 0 0 3 4 3 0 4 3 1 0 3 7 4 3 8 0 1 3 1 1 2 1 2 3 8 2 3 7 1 7 3 3 3 3 7
 7 1 2 8 3 3 3 1 8 3 3 1 0 2 2 3 4 9 4 3 3 9 3 2 2 7]
[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
 4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 9 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 5 5 4
 7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4]

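The code above predicts labels for the test set and stores the result in y_pred, then prints the first 100 entries of y_pred and y_test. Compared position by position, the two arrays rarely agree, so the prediction accuracy looks low. Keep in mind that the cluster numbers K-means assigns are arbitrary and need not match the digit values: in the output above, for example, the samples whose true label is 6 are placed in cluster 0.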

Evaluate the Model

Next, we will further evaluate the performance of the model to analyze the correctness of the model predictions.

Let's print a confusion matrix:

# Import `metrics` from `sklearn`
from sklearn import metrics

# Print the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))

Output:

[[ 0 54  1  0  0  0  0  0  0  0]
 [ 0  0 15  0 29  0  0  0  0 11]
 [ 1  0  2  0  7  0  0  0 27 15]
 [ 0  0  0 49  1  0  0  4  1  1]
 [ 0  0 57  0  0  0  3  4  0  0]
 [ 1  0  2 34  6  0  0  5 25  0]
 [56  1  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  5 55  2  0]
 [ 0  0  0 18 28  0  0  2  4  0]
 [ 1  0  5 55  2  0  1  4  0  0]]

Confusion matrix

A confusion matrix, also called an error matrix, is a standard format for evaluating accuracy. It is an n-by-n matrix in which each row represents the actual category and each column represents the predicted value. Values on the diagonal count the predictions that match the actual category; all other positions count mismatches. For example, the entry in the first row, second column is 54: 54 samples whose actual digit is 0 were assigned to cluster 1.
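The row and column convention can be illustrated with a few made-up labels. The helper `confusion_counts` below is a hypothetical stand-in written for this sketch; it follows the same layout (rows are actual classes, columns are predicted classes) described above.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count (actual, predicted) pairs: rows are actual classes, columns are predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [1, 1, 0, 1, 2, 2]

cm = confusion_counts(y_true, y_pred, n_classes=3)
print(cm)
print("matches on the diagonal:", np.trace(cm))
```

Here the entry in the first row, second column counts the samples whose actual class is 0 but whose predicted class is 1.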

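Read this way, the prediction accuracy looks low: digit 3 is matched 49 times and digit 7 is matched 55 times, while the other diagonal entries are close to zero. Because cluster numbers are arbitrary, however, the diagonal understates how well the clusters separate the digits: several rows have one dominant column away from the diagonal, such as digit 6 falling into cluster 0 56 times.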
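Since K-means numbers its clusters arbitrarily, one common way to read such a matrix is to relabel each cluster with the digit that appears in it most often. The sketch below does this for the confusion matrix printed above, using only NumPy. This majority-vote relabeling is an illustrative assumption, not part of the original workflow; it lets several clusters map to the same digit and it uses the test labels, so the resulting figure is optimistic.

```python
import numpy as np

# Confusion matrix from the output above: rows are true digits, columns are cluster numbers.
cm = np.array([
    [ 0, 54,  1,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0, 15,  0, 29,  0,  0,  0,  0, 11],
    [ 1,  0,  2,  0,  7,  0,  0,  0, 27, 15],
    [ 0,  0,  0, 49,  1,  0,  0,  4,  1,  1],
    [ 0,  0, 57,  0,  0,  0,  3,  4,  0,  0],
    [ 1,  0,  2, 34,  6,  0,  0,  5, 25,  0],
    [56,  1,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  5, 55,  2,  0],
    [ 0,  0,  0, 18, 28,  0,  0,  2,  4,  0],
    [ 1,  0,  5, 55,  2,  0,  1,  4,  0,  0],
])

total = int(cm.sum())
diagonal_hits = int(np.trace(cm))

# For each cluster (column), pick the true digit (row) that occurs most often in it.
majority_digit = cm.argmax(axis=0)
majority_hits = int(cm.max(axis=0).sum())

print("cluster -> majority digit:", dict(enumerate(majority_digit.tolist())))
print(f"diagonal agreement: {diagonal_hits / total:.3f}")
print(f"majority-vote agreement: {majority_hits / total:.3f}")
```

The gap between the two printed ratios shows how much of the apparent error comes from the cluster numbering rather than from the grouping itself.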

Next, let's print some common evaluation metrics:

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, silhouette_score

# Print a header row, then one row of scores.
# Note: `clf.inertia_` was computed on the training data during `fit`; the other scores use the test subset.
print('% 9s' % 'inertia    homo   compl  v-meas     ARI AMI  silhouette')
print('%i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))

Output:

inertia    homo   compl  v-meas     ARI AMI  silhouette
48486   0.584   0.662   0.621   0.449   0.572    0.131

homogeneity_score - homogeneity: each cluster contains only members of a single class.
completeness_score - completeness: all members of a given class are assigned to the same cluster.
v_measure_score - V-measure: the harmonic mean of homogeneity and completeness.
adjusted_rand_score - adjusted Rand index (ARI)
adjusted_mutual_info_score - adjusted mutual information (AMI)
silhouette_score - silhouette coefficient
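A small made-up example may help relate the first three metrics. In the sketch below, every cluster contains a single class (perfect homogeneity), but class 1 is split across two clusters (imperfect completeness). It also checks that the V-measure equals the harmonic mean of the two, and that renaming the clusters does not change the score. The label lists are assumptions chosen for illustration.

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]  # class 1 is split across clusters 1 and 2

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)
v = v_measure_score(y_true, y_pred)
print(f"homogeneity={h:.3f}  completeness={c:.3f}  v-measure={v:.3f}")

# The V-measure is the harmonic mean of homogeneity and completeness.
harmonic_mean = 2 * h * c / (h + c)
print(f"harmonic mean={harmonic_mean:.3f}")

# Renaming the clusters (0->2, 1->0, 2->1) leaves the score unchanged.
renamed = [2, 2, 0, 1]
v_renamed = v_measure_score(y_true, renamed)
print(f"v-measure after renaming={v_renamed:.3f}")
```

Because these scores ignore the cluster numbering, they are better suited to comparing y_test with y_pred than a position-by-position comparison.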

Space is limited, so these metrics are not described in detail here; refer to the relevant documentation for more information.