Scala MLlib official documentation --- spark.mllib package - clustering

5. Clustering

Clustering is an unsupervised learning problem in which we aim to group subsets of entities with one another based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster).

k-means

K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||. The implementation in spark.mllib has the following parameters:

  • k is the number of desired clusters. Note that it is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
  • maxIterations is the maximum number of iterations to run.
  • initializationMode specifies either random initialization or initialization via k-means||.
  • runs: this parameter has no effect since Spark 2.0.0.
  • initializationSteps determines the number of steps in the k-means|| algorithm.
  • epsilon determines the distance threshold within which we consider k-means to have converged.
  • initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
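
As a rough illustration of how these parameters map onto KMeans' builder-style setters, here is a sketch only; the concrete values below are arbitrary examples, not recommendations:

import org.apache.spark.mllib.clustering.KMeans

// A minimal configuration sketch; the parameter values are arbitrary examples.
val kmeansCfg = new KMeans()
  .setK(3)                              // desired number of clusters
  .setMaxIterations(50)                 // maxIterations
  .setInitializationMode("k-means||")   // or "random"
  .setInitializationSteps(2)            // initializationSteps
  .setEpsilon(1e-4)                     // convergence distance threshold
  .setSeed(42L)                         // for reproducibility

// kmeansCfg.run(someRdd) would then return a KMeansModel, where someRdd is an RDD[Vector].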

Sample Code
The following code snippets can be executed in the spark-shell.
In the following example, after loading and parsing data, we use the KMeans object to cluster the data into two clusters. The desired number of clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k. In fact, the optimal k is usually the one at the "elbow" of the WSSSE plot.
For more information about the API, refer to the KMeans Scala docs and the KMeansModel Scala docs.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Save and load model
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
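
To explore the "elbow" mentioned above, one possible sketch is to recompute the WSSSE for a range of k values, reusing parsedData and numIterations from the example; the range 2 to 8 is an arbitrary choice:

// Sketch: compute WSSSE for several values of k to look for an "elbow"
val costs = (2 to 8).map { k =>
  val model = KMeans.train(parsedData, k, numIterations)
  (k, model.computeCost(parsedData))
}
costs.foreach { case (k, wssse) => println(s"k = $k, WSSSE = $wssse") }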

Gaussian mixture

A Gaussian mixture model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.mllib implementation uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model given a set of samples. The implementation has the following parameters:

  • k is the number of desired clusters.
  • convergenceTol is the maximum change in log-likelihood at which we consider convergence to have been achieved.
  • maxIterations is the maximum number of iterations to perform without reaching convergence.
  • initialModel is an optional starting point from which to start the EM algorithm. If omitted, a random starting point is constructed from the data.
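
A minimal sketch of setting these parameters through GaussianMixture's builder-style setters (the values are arbitrary examples):

import org.apache.spark.mllib.clustering.GaussianMixture

// A minimal configuration sketch; the parameter values are arbitrary examples.
val gmCfg = new GaussianMixture()
  .setK(2)                    // number of clusters
  .setConvergenceTol(0.01)    // convergenceTol
  .setMaxIterations(100)      // maxIterations
  .setSeed(42L)               // for reproducibility

// gmCfg.run(someRdd) would then return a GaussianMixtureModel, where someRdd is an RDD[Vector].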

Sample Code

import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/gmm_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using GaussianMixture
val gmm = new GaussianMixture().setK(2).run(parsedData)

// Save and load model
gmm.save(sc, "target/org/apache/spark/GaussianMixtureExample/GaussianMixtureModel")
val sameModel = GaussianMixtureModel.load(sc,
  "target/org/apache/spark/GaussianMixtureExample/GaussianMixtureModel")

// output parameters of max-likelihood model
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
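
A trained GaussianMixtureModel can also assign points to components. The brief sketch below reuses gmm and parsedData from the example above; predict gives hard cluster assignments and predictSoft gives per-component membership probabilities:

// Hard cluster assignment for each point (an Int per point)
val hardAssignments = gmm.predict(parsedData)
// Soft assignments: an Array[Double] of membership probabilities per point
val softAssignments = gmm.predictSoft(parsedData)
hardAssignments.take(5).foreach(println)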

Power iteration clustering (PIC)

Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering the vertices of a graph given pairwise similarities as edge properties, as described in Lin and Cohen, Power Iteration Clustering. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster the vertices. spark.mllib includes an implementation of PIC that uses GraphX as its backend. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be non-negative. PIC assumes that the similarity measure is symmetric. A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data. If a pair is missing from the input, its similarity is treated as zero. spark.mllib's PIC implementation takes the following (hyper-)parameters:

  • k: number of clusters
  • maxIterations: maximum number of power iterations
  • initializationMode: initialization model. This can be either "random" (the default), which uses a random vector as vertex properties, or "degree", which uses the normalized sum of similarities.

PowerIterationClustering implements the PIC algorithm. It takes an RDD of (srcId: Long, dstId: Long, similarity: Double) tuples representing the affinity matrix. Calling PowerIterationClustering.run returns a PowerIterationClusteringModel, which contains the computed clustering assignments.
For more information about the API, see the PowerIterationClustering Scala docs and the PowerIterationClusteringModel Scala docs.

import org.apache.spark.mllib.clustering.PowerIterationClustering

// generateCirclesRdd and params are helpers defined in the full PowerIterationClusteringExample
// that ships with Spark; circlesRdd is an RDD[(Long, Long, Double)] of (srcId, dstId, similarity) tuples.
val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints)
val model = new PowerIterationClustering()
  .setK(params.k)
  .setMaxIterations(params.maxIterations)
  .setInitializationMode("degree")
  .run(circlesRdd)

val clusters = model.assignments.collect().groupBy(_.cluster).mapValues(_.map(_.id))
val assignments = clusters.toList.sortBy { case (k, v) => v.length }
val assignmentsStr = assignments
  .map { case (k, v) =>
    s"$k -> ${v.sorted.mkString("[", ",", "]")}"
  }.mkString(", ")
val sizesStr = assignments.map {
  _._2.length
}.sorted.mkString("(", ",", ")")
println(s"Cluster assignments: $assignmentsStr\ncluster sizes: $sizesStr")

Latent Dirichlet allocation (LDA)

Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:

  • Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
  • Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words).
  • Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.

LDA supports different inference algorithms via the setOptimizer function. EMLDAOptimizer learns clustering using expectation-maximization on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for online variational inference and is generally memory friendly.
LDA takes in a collection of documents as vectors of word counts and the following parameters (set using the builder pattern):

  • k: number of topics (i.e., cluster centers)
  • optimizer: optimizer used to learn the LDA model, either EMLDAOptimizer or OnlineLDAOptimizer
  • docConcentration: Dirichlet parameter for the prior over documents' distributions over topics. Larger values encourage smoother inferred distributions.
  • topicConcentration: Dirichlet parameter for the prior over topics' distributions over terms (words). Larger values encourage smoother inferred distributions.
  • maxIterations: limit on the number of iterations.
  • checkpointInterval: if checkpointing is used (set in the Spark configuration), this parameter specifies the frequency with which checkpoints are created. If maxIterations is large, using checkpointing can help reduce the size of shuffle files on disk and help with failure recovery.
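
A rough sketch of setting these parameters with LDA's builder-style setters (the values are arbitrary; passing "online" or "em" to setOptimizer selects OnlineLDAOptimizer or EMLDAOptimizer respectively):

import org.apache.spark.mllib.clustering.LDA

// A minimal configuration sketch; the parameter values are arbitrary examples.
val ldaCfg = new LDA()
  .setK(10)                       // number of topics
  .setOptimizer("online")         // or "em"
  .setDocConcentration(-1.0)      // -1 selects the optimizer-specific default
  .setTopicConcentration(-1.0)    // -1 selects the optimizer-specific default
  .setMaxIterations(50)
  .setCheckpointInterval(10)

// ldaCfg.run(corpus) would then return an LDAModel, where corpus is an RDD[(Long, Vector)].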

All of spark.mllib's LDA models support:

  • describeTopics: returns topics as arrays of the most important terms and their weights
  • topicsMatrix: returns a vocabSize by k matrix where each column is a topic

Note: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two models generated by the two optimizers. Currently, a distributed model can be converted into a local model, but not vice versa.
The following discussion describes each optimizer/model pair in turn.
Expectation maximization
Implemented in EMLDAOptimizer and DistributedLDAModel.
For the parameters provided to LDA:

  • docConcentration: only symmetric priors are supported, so all values in the provided k-dimensional vector must be identical. All values must also be > 1.0. Providing Vector(-1) results in the default behavior (a uniform k-dimensional vector with value (50 / k) + 1).
  • topicConcentration: only symmetric priors are supported. Values must be > 1.0. Providing -1 results in the default value of 0.1 + 1.
  • maxIterations: the maximum number of EM iterations.

Note: it is important to do enough iterations. In early iterations, EM often has useless topics, but those topics improve dramatically after more iterations. Depending on your dataset, using at least 20 and possibly 50-100 iterations is often reasonable. EMLDAOptimizer produces a DistributedLDAModel, which stores not only the inferred topics but also the full training corpus and the topic distribution for each document in the training corpus. A DistributedLDAModel supports:

  • topTopicsPerDocument: the top topics and their weights for each document in the training corpus
  • topDocumentsPerTopic: the top documents for each topic and the corresponding weight of the topic in those documents
  • logPrior: log probability of the estimated topics and document-topic distributions given the hyperparameters docConcentration and topicConcentration
  • logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions
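
A brief sketch of querying these quantities on a DistributedLDAModel (the default optimizer is EM); the argument values 3 and 5 are arbitrary, and corpus is assumed to be an RDD[(Long, Vector)] of (docId, wordCountVector) as constructed in the sample code further below:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

// Sketch: with the default (EM) optimizer, run returns a DistributedLDAModel.
val distLdaModel = new LDA().setK(3).run(corpus).asInstanceOf[DistributedLDAModel]

val topTopics = distLdaModel.topTopicsPerDocument(3)   // (docId, topicIndices, topicWeights) per document
val topDocs = distLdaModel.topDocumentsPerTopic(5)     // (docIds, docTopicWeights), one entry per topic
println(s"logPrior = ${distLdaModel.logPrior}")
println(s"logLikelihood = ${distLdaModel.logLikelihood}")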

Online variational Bayes
Implemented in OnlineLDAOptimizer and LocalLDAModel.
For the parameters provided to LDA:

  • docConcentration: asymmetric priors can be used by passing a vector whose value in each of the k dimensions is the corresponding Dirichlet parameter. Values should be >= 0. Providing Vector(-1) results in the default behavior (a uniform k-dimensional vector with value (1.0 / k)).
  • topicConcentration: only symmetric priors are supported. Values must be >= 0. Providing -1 results in the default value of (1.0 / k).
  • maxIterations: the maximum number of mini-batches to submit.

In addition, OnlineLDAOptimizer accepts the following parameters:

  • miniBatchFraction: fraction of the corpus sampled and used at each iteration
  • optimizeDocConcentration: if set to true, performs maximum-likelihood estimation of the hyperparameter docConcentration (aka alpha) after each mini-batch and sets the optimized docConcentration in the returned LocalLDAModel
  • tau0 and kappa: used for learning-rate decay, which is computed as (τ0 + iter)^(-κ), where iter is the current number of iterations.

OnlineLDAOptimizer produces a LocalLDAModel, which stores only the inferred topics. A LocalLDAModel supports:

  • logLikelihood(documents): calculates a lower bound on the log likelihood of the provided documents, given the inferred topics
  • logPerplexity(documents): calculates an upper bound on the perplexity of the provided documents, given the inferred topics
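
A sketch of obtaining a LocalLDAModel with the online optimizer and evaluating documents with it; corpus is assumed to be an RDD[(Long, Vector)] of document IDs and word-count vectors, as constructed in the sample code below:

import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel}

// The "online" optimizer yields a LocalLDAModel.
val localModel = new LDA()
  .setK(3)
  .setOptimizer("online")
  .setMaxIterations(50)
  .run(corpus)
  .asInstanceOf[LocalLDAModel]

println(s"log-likelihood lower bound: ${localModel.logLikelihood(corpus)}")
println(s"perplexity upper bound: ${localModel.logPerplexity(corpus)}")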

Sample Code

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print(s"Topic $topic :")
  for (word <- Range(0, ldaModel.vocabSize)) {
    print(s"${topics(word, topic)}")
  }
  println()
}

// Save and load model.
ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
val sameModel = DistributedLDAModel.load(sc,
  "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")

Bisecting k-means

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
Bisecting k-means is a kind of hierarchical clustering. Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

  • Agglomerative: this is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: this is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Bisecting k-means is a kind of divisive algorithm. The implementation in MLlib has the following parameters:

  • k: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
  • maxIterations: the maximum number of k-means iterations used to split clusters (default: 20)
  • minDivisibleClusterSize: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
  • seed: a random seed (default: hash value of the class name)

Sample Code

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Loads and parses data
def parse(line: String): Vector = Vectors.dense(line.split(" ").map(_.toDouble))
val data = sc.textFile("data/mllib/kmeans_data.txt").map(parse).cache()

// Clustering the data into 6 clusters by BisectingKMeans.
val bkm = new BisectingKMeans().setK(6)
val model = bkm.run(data)

// Show the compute cost and the cluster centers
println(s"Compute Cost: ${model.computeCost(data)}")
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster Center ${idx}: ${center}")
}
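
The remaining parameters are set in the same builder style; a brief configuration sketch with arbitrary values:

import org.apache.spark.mllib.clustering.BisectingKMeans

// A minimal configuration sketch; the parameter values are arbitrary examples.
val bkmCfg = new BisectingKMeans()
  .setK(4)                           // desired number of leaf clusters
  .setMaxIterations(20)              // k-means iterations per split
  .setMinDivisibleClusterSize(1.0)   // minimum size (or proportion) of a divisible cluster
  .setSeed(42L)

// bkmCfg.run(data) would return a BisectingKMeansModel, as in the example above.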

Streaming k-means

When data arrive in a stream, we may want to estimate clusters dynamically, updating them as new data arrive. spark.mllib provides support for streaming k-means clustering, with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign all points to their nearest cluster, compute new cluster centers, and then update each cluster using:
c_{t+1} = (c_t * n_t * α + x_t * m_t) / (n_t * α + m_t)
n_{t+1} = n_t + m_t
where c_t is the previous center for the cluster, n_t is the number of points assigned to the cluster so far, x_t is the new cluster center from the current batch, and m_t is the number of points added to the cluster in the current batch. The decay factor α can be used to ignore the past: with α = 1 all data are used from the beginning; with α = 0 only the most recent data are used. This is analogous to an exponentially-weighted moving average.
The decay can be specified using a halfLife parameter, which determines the correct decay factor α such that, for data acquired at time t, its contribution by time t + halfLife will have dropped to 0.5. The unit of time can be specified either as batches or points, and the update rule will be adjusted accordingly.
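
A short sketch of the two ways to control this decay on StreamingKMeans: setting the decay factor α directly, or specifying a half-life in units of batches or points (the values below are arbitrary):

import org.apache.spark.mllib.clustering.StreamingKMeans

// Option 1: set the decay factor α directly (1.0 keeps all history, 0.0 keeps only the latest batch).
val withDecay = new StreamingKMeans().setK(3).setDecayFactor(0.5)

// Option 2: specify a half-life instead; here a point's contribution halves after 5 batches.
val withHalfLife = new StreamingKMeans().setK(3).setHalfLife(5.0, "batches")  // or "points"
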
Sample Code

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansExample")
val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(args(3).toInt)
  .setDecayFactor(1.0)
  .setRandomCenters(args(4).toInt, 0.0)

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()

As you add new text files with data, the cluster centers will update. Each training point should be formatted as [x1, x2, x3], and each test data point should be formatted as (y, [x1, x2, x3]), where y is some useful label or identifier (e.g. a true category assignment). Anytime a text file is placed in /training/data/dir the model will update, and anytime a text file is placed in /testing/data/dir you will see predictions. With new data, the cluster centers will change!
