K_Means clustering algorithm using sklearn

First, the official documentation:
[ http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#examples-using-sklearn-cluster-kmeans ]

A translated reference:
http://blog.csdn.net/xiaoyi_zhang/article/details/52269242

Another example found on Baidu (will be removed if it infringes):

```python
# -*- coding: utf-8 -*-
from sklearn.cluster import KMeans
# Note: sklearn.externals.joblib has been removed from recent
# scikit-learn versions; import joblib directly instead.
import joblib

# Read tab-separated records; columns from index 3 onward are the features.
final = open('c:/test/final.dat', 'r')
data = [line.strip().split('\t') for line in final]
feature = [[float(x) for x in row[3:]] for row in data]

# Call the KMeans class
clf = KMeans(n_clusters=9)
s = clf.fit(feature)
print(s)

# The 9 cluster centers
print(clf.cluster_centers_)

# The cluster each sample belongs to
print(clf.labels_)

# Used to judge whether the number of clusters is appropriate: the smaller
# the inertia, the tighter the clusters; pick the count at the elbow.
print(clf.inertia_)

# Predict
print(clf.predict(feature))

# Save the model
joblib.dump(clf, 'c:/km.pkl')

# Load the saved model
clf = joblib.load('c:/km.pkl')

'''
# Evaluate different cluster counts: smaller inertia means tighter clusters;
# pick the count at the elbow of the curve.
for i in range(5, 30, 1):
    clf = KMeans(n_clusters=i)
    s = clf.fit(feature)
    print(i, clf.inertia_)
'''
```
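The commented-out loop at the end of the example sketches the elbow method. A self-contained version (using synthetic data from `make_blobs` in place of the `final.dat` file, which is not available here) might look like:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters, standing in for the
# feature matrix loaded from final.dat in the example above.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Inertia (within-cluster sum of squared distances) for a range of k;
# the "elbow" where the curve flattens suggests a reasonable cluster count.
inertias = {}
for k in range(1, 9):
    clf = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = clf.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

With 4 true clusters, the inertia drops steeply up to k=4 and flattens afterward.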

For beginners, the workflow is as follows
(see http://www.cnblogs.com/meelo/p/4272677.html):
sklearn exposes a consistent interface across all of its machine learning algorithms. Learning a model generally involves the following steps:
1. Initialize the estimator. Different algorithms take different parameters, but all parameters generally have sensible default values.
(1) n_clusters: for K-means clustering we must give the number of clusters to form; the default is 8.
(2) max_iter: the maximum number of iterations per run; here it is set to 300.
(3) n_init=10 means the algorithm runs 10 times with different random initializations and keeps the best result as the model.
(4) init='k-means++' selects the initial cluster centers with the k-means++ seeding scheme, which spreads the initial centers apart to speed up convergence (it does not choose n_clusters automatically).
(5) tol: float, default 1e-4; used together with inertia to determine the convergence condition.
(6) n_jobs: the number of processes used for the computation (deprecated and removed in recent scikit-learn versions).
(7) verbose: controls how much of the solution process is printed; the larger the value, the more detail is printed.
(8) copy_x: boolean, default True. When distances are precomputed, centering the data first gives more accurate results. If True, the original data is not modified; if False, the computation is done in place on the original data and the data is restored before the function returns. However, because the mean is added to and subtracted from the data during the computation, the returned data may differ very slightly from the original.
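Putting the parameters above together, a typical initialization looks like the following (the values shown are scikit-learn's defaults, spelled out for illustration; `n_jobs` is omitted because it has been removed from recent versions):

```python
from sklearn.cluster import KMeans

# All values shown are sklearn's defaults, written out for illustration.
clf = KMeans(
    n_clusters=8,      # number of clusters to form
    init='k-means++',  # k-means++ seeding of the initial centers
    n_init=10,         # run 10 random initializations, keep the best
    max_iter=300,      # cap on iterations per run
    tol=1e-4,          # convergence tolerance, used together with inertia
    verbose=0,         # larger values print more of the solve
    copy_x=True,       # leave the input data unmodified
)
print(clf.get_params()['n_clusters'])
```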
Attributes: 
(1) cluster_centers_: array of shape [n_clusters, n_features]; the coordinates of each cluster center.
(2) labels_: the cluster assigned to each point.
(3) inertia_: float; the sum of squared distances from each point to the center of its cluster.
2. For unsupervised machine learning, the input data is just the sample features; clf.fit(X) feeds the data into the estimator.
3. To assign cluster labels to unknown data, use the estimator's predict method.
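The steps above can be sketched end to end on a small hypothetical 2-D feature matrix (the data here is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D feature matrix: two visually obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Step 1-2: initialize and fit on the sample features.
clf = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Attributes discussed above:
print(clf.cluster_centers_.shape)  # (2, 2): [n_clusters, n_features]
print(clf.labels_)                 # cluster index of each training point
print(round(clf.inertia_, 3))      # within-cluster sum of squared distances

# Step 3: classify unseen points with predict().
new_points = np.array([[1.1, 0.9], [8.1, 8.0]])
labels = clf.predict(new_points)
print(labels)
```

The two new points land in different clusters, matching the two groups in the training data (the numeric label of each cluster is arbitrary and may vary between runs without a fixed random_state).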
