KMeans clustering implementation in Python with scikit-learn (+ MiniBatchKMeans)

I used R before; now that I have learned Python, I will implement k-means in Python as well.
The earlier post that implemented k-means in R: Notes | Common clustering models and clustering quality assessment (clustering precautions and usage tips)

Cluster analysis is extremely important in customer segmentation. Three clustering models are common: k-means clustering, hierarchical (agglomerative) clustering, and the expectation-maximization (EM) algorithm. A key issue when building a clustering model is how to evaluate the clustering results; several indicators are used for this.


1. Introduction to KMeans in scikit-learn

scikit-learn is a Python machine learning module that provides implementations of many
machine-learning algorithms, including the k-means algorithm.

Official scikit-learn documentation for this case: http://scikit-learn.org/stable/modules/clustering.html#k-means
Partly from: scikit-learn source code interpretation of Kmeans - a simple algorithm with a complex implementation

Performance comparison of the different clustering algorithms:
[figure: comparison chart of scikit-learn clustering algorithms]

Advantages:

  • Simple principle
  • Fast
  • Scales fairly well to large data sets (see the MiniBatchKMeans sketch below)

Disadvantages:

  • The number of clusters K must be specified in advance
  • Sensitive to outliers
  • Sensitive to initial values
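On the scalability point, scikit-learn also provides MiniBatchKMeans, which updates the centers from small random batches instead of the full data set and is therefore much faster on large data. A minimal sketch on random data, with illustrative parameter values:

from sklearn.cluster import MiniBatchKMeans
import numpy as np

X = np.random.rand(10000, 3)  # 10,000 samples, 3 features
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=10)
mbk.fit(X)  # exposes the same attributes as KMeans: labels_, cluster_centers_, inertia_
print(mbk.cluster_centers_)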

2. Relevant theory

Reference: K-means algorithm and text clustering practice

  • (1) Selection of the center point

The k-means algorithm is guaranteed to converge, but not to the global optimum. When the initial center points are chosen badly, it only reaches a local optimum and the overall clustering quality is poor. The following strategies can be used to choose the center points:

  • Select points that are as far apart from each other as possible as the center points;
  • first run hierarchical clustering to produce k preliminary clusters, and use their centers as the initial center points for k-means;
  • randomly choose the center points several times, train k-means each time, and keep the run with the best clustering result.
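These strategies map directly onto KMeans parameters. A minimal sketch on random data; the explicit seed centers below are made-up values standing in for the output of a hierarchical pre-clustering:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)

# 'k-means++' (the default) spreads the initial centers far apart
km_pp = KMeans(n_clusters=3, init='k-means++').fit(X)

# seed with centers from a preliminary clustering by passing them as an array
seed_centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.9, 0.3]])  # hypothetical centers
km_seeded = KMeans(n_clusters=3, init=seed_centers, n_init=1).fit(X)

# random restarts: n_init random initializations, the best run is kept
km_rand = KMeans(n_clusters=3, init='random', n_init=10).fit(X)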

  • (2) Selection of k value

The k-means error function has a major flaw: as the number of clusters increases, the error tends to 0. In the most extreme case, every record is its own cluster, and the error is then 0, but such a clustering is not what we want. A structural-risk term can be introduced to penalize model complexity:

    J = Σ_i ‖x_i − μ_{c(i)}‖² + λk

where the first term is the training error (the squared distance of each sample x_i to its assigned center μ_{c(i)}) and the term λk penalizes the number of clusters k.

λ is a parameter that balances the training error against the number of clusters, but the problem now becomes how to choose λ. Some studies [Reference 1] point out that when the data set follows a Gaussian distribution, λ = 2m, where m is the dimension of the vectors.

Another method is to try different values of k in increasing order, plot the corresponding error values, and find a good k by looking for the inflection point ("elbow") of the curve, as in the sketch below; for details, see the text clustering example later.
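A minimal sketch of this elbow approach on random data (assumes matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.rand(200, 3)
ks = range(1, 11)
errors = [KMeans(n_clusters=k).fit(data).inertia_ for k in ks]  # error value for each k

plt.plot(list(ks), errors, 'o-')
plt.xlabel('k')
plt.ylabel('within-cluster sum of squares')
plt.show()  # pick the k near the inflection point ("elbow") of the curve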

3. The main function KMeans

Reference blog: python's sklearn study notes.
Let's look at the main class, KMeans:

sklearn.cluster.KMeans(n_clusters=8,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    precompute_distances='auto',
    verbose=0,
    random_state=None,
    copy_x=True,
    n_jobs=1,
    algorithm='auto')

The meaning of the parameters:

  • n_clusters: The number of clusters, that is, how many categories do you want to cluster into?
  • init: how to get the initial cluster center
  • n_init: the number of times the initial cluster centers are re-chosen. To compensate for the influence of the initial centroids, the algorithm is run 10 times by default with different initializations, and the best result is returned.
  • max_iter: the maximum number of iterations (the k-means algorithm is iterative)
  • tol: tolerance, i.e. the convergence criterion of the k-means objective
  • precompute_distances: whether to precompute the distances; this trades memory for speed. If True, the whole distance matrix is kept in memory; 'auto' falls back to False when n_samples * n_clusters exceeds 12e6. The core routines are implemented in Cython.
  • verbose: verbosity mode (usually left at its default)
  • random_state: the random seed used to generate the initial cluster centers.
  • copy_x: whether the input data may be modified. If True, the algorithm works on a copy and leaves the user's data untouched. Many scikit-learn interfaces have this bool parameter; it controls whether the input is copied so that it is not modified in place. This is clearer if you understand Python's memory model.
  • n_jobs: parallelism setting
  • algorithm: which k-means implementation to use: 'auto', 'full', or 'elkan', where 'full' is the classical EM-style implementation

Although there are many parameters, they all have default values, so we generally don't need to pass them in explicitly; set them according to actual needs.
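For example, a call that overrides a few of the common parameters might look like this (the values are purely illustrative):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=5,       # cluster into 5 categories
            init='k-means++',   # spread out the initial centers
            n_init=10,          # 10 initializations, keep the best run
            max_iter=300,       # iteration cap per initialization
            tol=1e-4,           # convergence tolerance
            random_state=42)    # fixed seed for reproducible centers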

4. Simple case one

Reference blog: The sklearn learning notebook.
This case illustrates how the attributes of a fitted KMeans object are accessed and what they mean.

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
# build a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)
estimator.fit(data)  # run the clustering
label_pred = estimator.labels_  # cluster label of each sample
centroids = estimator.cluster_centers_  # cluster centers
inertia = estimator.inertia_  # sum of the clustering criterion

estimator = KMeans(...) initializes the k-means clusterer; estimator.fit fits the clustering to the data;
estimator.labels_ returns the cluster labels (this is one way; the other is predict);
estimator.cluster_centers_ is the matrix of cluster-center mean vectors;
estimator.inertia_ is the sum of squared distances of the samples to their nearest cluster center.
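Continuing the example above, the two routes to the labels should agree on the training data; a quick check (assuming estimator has been fit as shown):

import numpy as np

labels_via_attribute = estimator.labels_      # stored during fit
labels_via_predict = estimator.predict(data)  # re-assigns each sample to its nearest center
print(np.array_equal(labels_via_attribute, labels_via_predict))  # expected: True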

5. Case two

The case comes from: KMeans text clustering using scikit-learn

from sklearn.cluster import KMeans

num_clusters = 3
km_cluster = KMeans(n_clusters=num_clusters, max_iter=300, n_init=40,
                    init='k-means++', n_jobs=-1)
# return the cluster index assigned to each text document
result = km_cluster.fit_predict(tfidf_matrix)
print("Predicting result: ", result)

km_cluster is the KMeans initialization, using 'k-means++' as the algorithm for choosing the initial values;
km_cluster.fit_predict is equivalent to merging two actions, km_cluster.fit(data) followed by km_cluster.predict(data): you obtain the predicted cluster labels in one call, skipping the intermediate step.

  • n_clusters: specifies the value of K
  • max_iter: maximum number of iterations for a single initialization
  • n_init: the number of times the initial values are re-chosen
  • init: specifies the algorithm for choosing the initial values
  • n_jobs: the number of processes; -1 means use all CPUs

Note that the computation for a single initialization always runs in a single process; the parallelism is only across different initializations. For example, with n_init=10 and n_jobs=40 on a server whose 20 CPUs could host 40 processes, only 10 processes will actually be started.

Here:

km_cluster.labels_
km_cluster.predict(data)

These are two ways to output the labels of the clustering results, and they give the same result. Both require km_cluster.fit(data) to have been called first.

6. Case three: follow-up analysis of k-means

Some follow-up analysis after running the k-means algorithm; reference source: Implementing document clustering in Python

from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

This clusters the data into five categories; %time (an IPython magic) measures the running time, and the cluster labels are converted to a Python list.
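Note that %time only works inside IPython/Jupyter; in a plain Python script you can time the fit with the standard library instead, as in this small sketch:

import time

t0 = time.time()
km.fit(tfidf_matrix)  # tfidf_matrix as in the case above
print("fit took %.2f seconds" % (time.time() - t0))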

  • (1) Model saving and loading
from sklearn.externals import joblib

# dump() stores your model to disk; load() restores it
# (note: in scikit-learn >= 0.23, use "import joblib" directly instead)
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
  • (2) Clustering category statistics
import pandas as pd

# 'films' is the document data from the referenced tutorial
frame = pd.DataFrame(films, index=[clusters], columns=['rank', 'title', 'cluster', 'genre'])
frame['cluster'].value_counts()
  • (3) Centroid mean vectors and the within-group sum of squares

To select points that are close to a centroid: km.cluster_centers_ is a (number of clusters × number of features) matrix, i.e. the mean of each cluster in every dimension.
From it you can determine:
which points within a category lie closer to the centroid;
the within-group sum of squares of an entire category.

The within-group sum of squares of a category refers to the standard within-cluster sum-of-squares formula:

    WSS = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ‖x_i − μ_j‖²

That is, for each sample x_i, subtract the centroid mean vector μ_j of its cluster C_j and square the difference, then sum: the inner sum runs over the samples of each cluster (n samples in total) and the outer sum over the k clusters (here k = 5).
The total within-cluster sum of squares over the whole data set can be obtained from:

km.inertia_
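As a sketch of how inertia relates to the formula above (assuming km was fit on a dense feature matrix X): subtract each sample's own centroid, square and sum; sorting those squared distances also shows which points lie closest to their centroid.

import numpy as np

diffs = X - km.cluster_centers_[km.labels_]  # each sample minus its own centroid
sq_dist = (diffs ** 2).sum(axis=1)           # squared distance of each sample to its centroid
wss = sq_dist.sum()                          # total within-cluster sum of squares
print(np.isclose(wss, km.inertia_))          # matches km.inertia_ up to floating point
closest = np.argsort(sq_dist)[:10]           # indices of the 10 samples nearest their centroids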
