Introduction to the k-means clustering algorithm

The k-means algorithm is a partitioning-based clustering algorithm. Given a parameter k, it divides n data objects into k clusters so that objects within a cluster have high similarity, while the similarity between clusters is low.

1. The basic idea

Given a data set of n data objects, the k-means algorithm constructs k partitions, each of which is a cluster. The method divides the data into k clusters; each cluster contains at least one data object, and each data object belongs to exactly one cluster. Data objects in the same cluster have high similarity, while data objects in different clusters have low similarity. The similarity within each cluster is computed using the mean of the objects in that cluster.

The flow of the k-means algorithm is as follows. First, k data objects are randomly selected, each representing a cluster center; these are the k initial centers. Each remaining object is then assigned, according to its similarity to each cluster center (its distance), to the cluster whose center it is most similar to. Finally, the mean of all objects in each cluster is recomputed to obtain a new cluster center.

This process is repeated until the criterion function converges, that is, until the cluster centers no longer change significantly. The mean squared error is usually used as the criterion function, i.e., the sum of squared distances from each point to its nearest cluster center is minimized.
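The loop described above (pick initial centers, assign each point to its nearest center, recompute each center as the cluster mean, repeat until the centers stabilize) can be sketched in plain Scala. This is an illustrative sketch with invented names (`MiniKMeans`, `sqDist`), not the Spark MLlib implementation; Euclidean distance is assumed, and the initial centers are passed in explicitly to keep the example deterministic.

```scala
// A minimal, self-contained k-means loop in plain Scala, for illustration only.
object MiniKMeans {
  type Point = Array[Double]

  // squared Euclidean distance between two points
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // index of the center nearest to p
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // per-dimension mean of a set of points
  def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(d => points.map(_(d)).sum / points.length)

  // repeat the assign/update steps until the centers stop moving (or maxIter is hit)
  def kmeans(points: Seq[Point], init: Seq[Point], maxIter: Int, epsilon: Double): Seq[Point] = {
    var centers = init
    var moved = Double.MaxValue
    var iter = 0
    while (iter < maxIter && moved > epsilon) {
      // assignment step: group each point under its nearest center
      val clusters = points.groupBy(p => closest(centers, p))
      // update step: each new center is the mean of its cluster (unchanged if empty)
      val next = centers.indices.map(i => clusters.get(i).map(mean).getOrElse(centers(i)))
      moved = centers.zip(next).map { case (c, n) => sqDist(c, n) }.max
      centers = next
      iter += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(Array(0.0, 0.0), Array(0.2, 0.2), Array(9.0, 9.0), Array(9.2, 9.2))
    val centers = kmeans(pts, Seq(pts.head, pts.last), 20, 1e-4)
    centers.foreach(c => println(c.mkString("(", ", ", ")")))
  }
}
```

On the four sample points, two iterations suffice: the centers converge to the means of the two obvious groups, (0.1, 0.1) and (9.1, 9.1).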

The new cluster center is calculated by averaging all objects in the cluster: each dimension is averaged over all objects, and the result is the center point of the cluster. For example, for a cluster containing the three data objects {(6,4,8), (8,2,2), (4,6,2)}, the center point of the cluster is ((6 + 8 + 4) / 3, (4 + 2 + 6) / 3, (8 + 2 + 2) / 3) = (6, 4, 4).
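The averaging step can be written directly in plain Scala; the object name `CentroidExample` is invented for this sketch.

```scala
// Compute a cluster center as the per-dimension mean of the cluster's points.
object CentroidExample {
  def centroid(points: Seq[Array[Double]]): Array[Double] =
    Array.tabulate(points.head.length)(d => points.map(_(d)).sum / points.length)

  def main(args: Array[String]): Unit = {
    // the three-object cluster from the text
    val cluster = Seq(Array(6.0, 4.0, 8.0), Array(8.0, 2.0, 2.0), Array(4.0, 6.0, 2.0))
    println(centroid(cluster).mkString(", ")) // prints 6.0, 4.0, 4.0
  }
}
```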

The k-means algorithm uses distance to describe the similarity between two data objects. Common distance functions include the Minkowski distance, the Euclidean distance, and the Mahalanobis distance; the Euclidean distance is the most commonly used.
The k-means algorithm terminates when the criterion function reaches its optimum or when the maximum number of iterations is reached. When the Euclidean distance is used, the criterion function to be minimized is generally the sum of squared distances from each data object to its cluster center:

E = Σ_{i=1}^{k} Σ_{X ∈ C_i} dist(c_i, X)^2

where k is the number of clusters, c_i is the center point of the i-th cluster, and dist(c_i, X) is the distance from X to c_i.
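The criterion function E can be computed directly in plain Scala; `KMeansCost` is an invented name for this sketch (Spark MLlib exposes the same quantity through `KMeansModel.computeCost`, used later in this article).

```scala
// Sum of squared distances from each point to its nearest center (the WSSSE criterion).
object KMeansCost {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def cost(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => sqDist(c, p)).min).sum

  def main(args: Array[String]): Unit = {
    val pts = Seq(
      Array(0.0, 0.0, 0.0), Array(0.1, 0.1, 0.1), Array(0.2, 0.2, 0.2),
      Array(9.0, 9.0, 9.0), Array(9.1, 9.1, 9.1), Array(9.2, 9.2, 9.2))
    val centers = Seq(Array(0.1, 0.1, 0.1), Array(9.1, 9.1, 9.1))
    // four points sit 0.1 away from their center in each of 3 dimensions,
    // so E = 4 * 3 * 0.1^2 = 0.12 (up to floating-point rounding)
    println(cost(pts, centers))
  }
}
```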

2. The k-means algorithm in Spark MLlib

The k-means implementation class KMeans in Spark MLlib has the following parameters.

class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double,
    private var seed: Long) extends Serializable with Logging

1) The k-means constructor in MLlib

As an example of the interface, MLlib constructs a KMeans instance with the following default parameter values.

{k: 2, maxIterations: 20, runs: 1, initializationMode: KMeans.K_MEANS_PARALLEL, initializationSteps: 5, epsilon: 1e-4, seed: random}

The meaning of each parameter is explained below.

k: the desired number of clusters.

maxIterations: the maximum number of iterations for a single run.

runs: the number of times the algorithm is run. The k-means algorithm is not guaranteed to return the globally optimal clustering result, so running it multiple times on the target data set helps return the best result.

initializationMode: the method for selecting the initial cluster centers. Random selection and K_MEANS_PARALLEL are currently supported; the default is K_MEANS_PARALLEL.

initializationSteps: the number of steps of the K_MEANS_PARALLEL initialization method.

epsilon: the convergence threshold for the k-means iterations.

seed: the random seed used when the clusters are initialized.

Normally, an application first calls the KMeans.train method to train on the data set; this method returns a KMeansModel instance. The cluster that a new data object belongs to can then be predicted with the KMeansModel.predict method.

2) The k-means training function in MLlib

The k-means training function KMeans.train has many overloads in MLlib; the overload with the most parameters is described here. KMeans.train is as follows.

def train(
    data: RDD[Vector],
    k: Int,
    maxIterations: Int,
    runs: Int,
    initializationMode: String,
    seed: Long): KMeansModel = {
  new KMeans().setK(k)
    .setMaxIterations(maxIterations)
    .setRuns(runs)
    .setInitializationMode(initializationMode)
    .setSeed(seed)
    .run(data)
}

The parameters of this method have the same meaning as in the constructor and are not repeated here.

3) The k-means prediction function in MLlib

The prediction function KMeansModel.predict in MLlib accepts input in different formats: the parameter may be a vector or an RDD, and the return value is the index of the cluster the input belongs to. The API of the KMeansModel.predict method is as follows.

def predict(point: Vector): Int
def predict(points: RDD[Vector]): RDD[Int]

The first prediction method accepts only a single point and returns the index of the cluster it belongs to; the second accepts a set of points and returns the cluster of each point as an RDD.

3. Example of the k-means algorithm in MLlib

Example: load the training data set and use the k-means algorithm to cluster the data into two clusters. The desired number of clusters is passed to the algorithm as a parameter. Then compute the within-set sum of squared errors (WSSSE); the error can be reduced by increasing the number of clusters k.

The steps of clustering with the k-means algorithm in this example are as follows.

① Load the data, which is stored in a text file.

② Cluster the data: set the number of classes to 2 and the number of iterations to 20, and train the model on the data.

③ Print the center points of the data model.

④ Evaluate the data model using the sum of squared errors.

⑤ Use the model to test single-point data.

⑥ Cross evaluation 1: return the results; cross evaluation 2: return the data set and the results.

The data used in this example is stored in the file kmeans_data.txt and consists of the spatial coordinates of six points, as shown below.

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Each row describes one point with three coordinate values in three-dimensional space. Each column is treated as a feature for the clustering analysis. The code is shown below.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object Kmeans {
  def main(args: Array[String]) {
    // set the operating environment
    val conf = new SparkConf().setAppName("Kmeans").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // load the data set
    val data = sc.textFile("/home/hadoop/exercise/kmeans_data.txt", 1)
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
    // cluster the data: set the number of classes to 2 and 20 iterations, then train the model
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)
    // print the center points of the data model
    println("Cluster centers:")
    for (c <- model.clusterCenters) {
      println("  " + c.toString)
    }
    // evaluate the data model with the sum of squared errors
    val cost = model.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + cost)
    // use the model to test single-point data
    println("Vector 0.2 0.2 0.2 belongs to cluster: " + model.predict(Vectors.dense("0.2 0.2 0.2".split(' ').map(_.toDouble))))
    println("Vector 0.25 0.25 0.25 belongs to cluster: " + model.predict(Vectors.dense("0.25 0.25 0.25".split(' ').map(_.toDouble))))
    println("Vector 8 8 8 belongs to cluster: " + model.predict(Vectors.dense("8 8 8".split(' ').map(_.toDouble))))
    // cross evaluation 1: return only the results
    val testdata = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
    val result1 = model.predict(testdata)
    result1.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans1")
    // cross evaluation 2: return the data set and the results
    val result2 = data.map { line =>
      val lineVector = Vectors.dense(line.split(' ').map(_.toDouble))
      val prediction = model.predict(lineVector)
      line + " " + prediction
    }
    result2.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans2")
    sc.stop()
  }
}

After the code runs, the centers of the two clusters computed by the model can be seen in the output window: (9.1, 9.1, 9.1) and (0.1, 0.1, 0.1). Using the model to classify the test points, we find that they belong to clusters 1, 1, and 0 respectively.

Meanwhile, two output results are produced under the /home/hadoop/upload/class8 directory: result_kmeans1 and result_kmeans2. Cross evaluation 1 outputs the clusters the six points belong to: 0, 0, 0, 1, 1, 1; cross evaluation 2 outputs each data point together with the cluster it belongs to.

4. Advantages and disadvantages of the algorithm

The k-means clustering algorithm is a classical algorithm: it is simple and efficient, and easy to understand and implement. Its time complexity is low, O(tkm), where t is the number of iterations, k is the number of clusters, and m is the number of records; for n-dimensional data, typically t ≪ m and k ≪ n.

The k-means algorithm also has many shortcomings.


Origin blog.csdn.net/yuyuy0145/article/details/92430077