Spark ML - Clustering Algorithms

http://ihoge.cn/2018/ML2.html

1. KMeans fast clustering

First, import the required packages:

import org.apache.spark.ml.clustering.{KMeans,KMeansModel}
import org.apache.spark.ml.linalg.Vectors

Enable the implicit conversions needed to turn an RDD into a DataFrame:

import spark.implicits._

To make it easy to build the corresponding DataFrame, we define a case class named model_instance that serves as the data type of each DataFrame row (one data sample).

case class model_instance (features: org.apache.spark.ml.linalg.Vector)

After defining the data type, you can read the data into an RDD[model_instance] and convert that RDD to a DataFrame with the toDF() method provided by the implicit conversions:

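val rawData = sc.textFile("file:///home/hduser/iris.data")
// Skip blank lines: their single empty field would match the regex and break toDouble
val df = rawData.filter(_.trim.nonEmpty).map { line =>
  // Keep only the numeric fields (this drops the class label) and parse them
  val values = line.split(",").filter(p => p.matches("\\d*(\\.?)\\d*")).map(_.toDouble)
  model_instance(Vectors.dense(values))
}.toDF()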

As in the MLlib version of this tutorial, we use the filter operator to drop the class label. The regular expression \\d*(\\.?)\\d* matches real numbers: \\d* uses the * quantifier to match zero or more digits, and \\.? uses the ? quantifier to match zero or one decimal point.
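The filtering step can be tried without Spark. The following is a minimal sketch in plain Scala; the object name FieldFilter and the helper numericFields are made up for this illustration and are not part of the tutorial's code. It also shows one caveat of the pattern: since every part of it is optional, it matches the empty string too, so the sketch drops empty fields explicitly.

```scala
object FieldFilter {
  // Same pattern as in the tutorial: optional digits, optional dot, optional digits.
  val numberPattern = "\\d*(\\.?)\\d*"

  // Keep only the comma-separated fields that match the pattern and parse them.
  // Empty fields are dropped explicitly because the pattern also matches "",
  // which toDouble cannot parse.
  def numericFields(line: String): Array[Double] =
    line.split(",")
      .filter(p => p.nonEmpty && p.matches(numberPattern))
      .map(_.toDouble)
}
```

For a line such as 5.1,3.5,1.4,0.2,Iris-setosa, FieldFilter.numericFields keeps the four measurements and discards the label, because Iris-setosa contains characters other than digits and a dot.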

After getting the data, we follow the standard workflow of the ML package: create an Estimator and call its fit() method to produce the corresponding Transformer. Here the KMeans class is the Estimator, and KMeansModel, the class that holds the trained model, is the Transformer:

val kmeansmodel = new KMeans().
      setK(3).
      setFeaturesCol("features").
      setPredictionCol("prediction").
      fit(df)

As in the MLlib version, KMeans in the ML package exposes parameters such as seed (random number seed), tol (convergence threshold), k (number of clusters), maxIter (maximum number of iterations), initMode (initialization method) and initSteps (number of steps of the k-means|| method). As with the other algorithms of the ML framework, you can set them through the corresponding setXXX() methods or pass them in as a ParamMap. For this introduction we only set K through its setter and leave the remaining parameters at their default values.
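The ParamMap route mentioned above could look like the sketch below. It assumes Spark's ML library is on the classpath; the object name KMeansParamMapExample and the chosen values (50 iterations, seed 1) are illustrative only and do not come from the tutorial.

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

object KMeansParamMapExample {
  val kmeans = new KMeans()

  // Illustrative values only: 3 clusters, at most 50 iterations, fixed seed.
  val params: ParamMap = ParamMap(kmeans.k -> 3, kmeans.maxIter -> 50, kmeans.seed -> 1L)

  // Fit with the ParamMap; entries in the map override values set on the estimator.
  def train(df: DataFrame): KMeansModel = kmeans.fit(df, params)
}
```

Calling KMeansParamMapExample.train(df) on the DataFrame built earlier should be equivalent to chaining setK(3), setMaxIter(50) and setSeed(1L) before fit(df).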

Unlike the MLlib implementation, KMeansModel is a Transformer: instead of a predict()-style method, it offers the uniform transform() method, which processes a whole dataset stored in a DataFrame and produces a dataset with the predicted cluster labels:

val results = kmeansmodel.transform(df)

For easier inspection, we can call collect(), which gathers all the rows of the DataFrame into an Array and returns it:

results.collect().foreach(
      row => {
        // row(0) is the feature vector, row(1) the predicted cluster index
        println(s"${row(0)} is predicted as cluster ${row(1)}")
      })

You can also obtain all the cluster centers of the model through the clusterCenters attribute of the KMeansModel class:

kmeansmodel.clusterCenters.foreach(
      center => {
        println("Clustering Center:"+center)
      })

As in the MLlib implementation, the KMeansModel class also provides a method that computes the Within Set Sum of Squared Errors (WSSSE) to measure the quality of the clustering. When the true K is unknown, the way this value changes with K is an important reference for choosing a suitable K. Note that computeCost was deprecated in Spark 2.4 and removed in Spark 3.0; on newer versions the cost on the training data is available as kmeansmodel.summary.trainingCost. On versions that still provide the method, call:

kmeansmodel.computeCost(df)
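To make the quantity concrete, here is a small sketch in plain Scala of what WSSSE measures: the sum, over all points, of the squared distance to the nearest cluster center. The object name Wssse and its helpers are hypothetical and only mirror the definition; this is not Spark's implementation.

```scala
object Wssse {
  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // WSSSE: for each point, take the squared distance to its nearest center, then sum.
  def wssse(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => squaredDistance(p, c)).min).sum
}
```

For the four points (0,0), (0,2), (10,0), (10,2), a single center at (5,1) gives a WSSSE of 104, while the two centers (0,1) and (10,1) give 4. Since adding centers tends to lower the value, one usually looks for the K after which it stops dropping sharply rather than for the minimum.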

2. Gaussian Mixture Model (GMM) clustering algorithm

2.1 Basic principles

Gaussian Mixture Model (GMM) is a probabilistic clustering method and a generative model: it assumes that all data samples are generated by multivariate Gaussian distributions with given parameters. Specifically, given the number of classes K, for a sample $x$ in the sample space, the probability density function of a Gaussian mixture model is a mixture of K multivariate Gaussian distributions:

$$p(x) = \sum_{i=1}^{K} w_i \cdot p(x \mid \mu_i, \Sigma_i)$$

where $p(x \mid \mu_i, \Sigma_i)$ is the probability density function of the multivariate Gaussian distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The Gaussian mixture model is thus composed of K different multivariate Gaussian distributions, each of which is called a component of the mixture, and $w_i$ is the weight of the $i$-th multivariate Gaussian distribution in the mixture, with

$$\sum_{i=1}^{K} w_i = 1$$

Assuming a Gaussian mixture model is given, the samples in the sample space are generated as follows: using the weights as probabilities (a weight can be understood intuitively as the share of all samples that the corresponding component generates), a mixture component is selected, and a sample is then drawn according to the probability density function of that component.
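The generation process can be sketched in a few lines of plain Scala. To keep it short, the sketch uses one-dimensional Gaussians instead of multivariate ones; the names MixtureSampler, Component, pickComponent and sample are invented for this illustration.

```scala
import scala.util.Random

object MixtureSampler {
  // One 1-D Gaussian component: mixing weight, mean and standard deviation.
  case class Component(weight: Double, mean: Double, stdDev: Double)

  // Pick a component index with probability equal to its weight.
  def pickComponent(components: Seq[Component], rng: Random): Int = {
    val u = rng.nextDouble()
    val cumulative = components.map(_.weight).scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(u < _)
    if (idx >= 0) idx else components.length - 1 // guard against rounding in the weights
  }

  // Draw n samples: first choose a component, then sample from its Gaussian.
  // Each result pairs the chosen component index with the sampled value.
  def sample(components: Seq[Component], n: Int, seed: Long): Seq[(Int, Double)] = {
    val rng = new Random(seed)
    Seq.fill(n) {
      val k = pickComponent(components, rng)
      val c = components(k)
      (k, c.mean + c.stdDev * rng.nextGaussian())
    }
  }
}
```

With two components of weights 0.3 and 0.7, roughly 30% and 70% of a large sample should come from the first and second component respectively, which is the intuitive reading of the weights given above.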

Clustering with a GMM is then the "inverse process" of generating data samples with a GMM: given the number of clusters K and a data set, a parameter estimation method is used to derive the parameters of each mixture component (the mean vector $\mu_i$, the covariance matrix $\Sigma_i$ and the weight $w_i$), and each multivariate Gaussian component corresponds to one cluster. Gaussian mixture models are trained by maximum likelihood estimation. For a data set of $m$ samples $x_1, \dots, x_m$, this means maximizing the following log-likelihood function:

$$L = \log \prod_{j=1}^{m} p(x_j) = \sum_{j=1}^{m} \log \left( \sum_{i=1}^{K} w_i \cdot p(x_j \mid \mu_i, \Sigma_i) \right)$$

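This objective cannot be solved analytically, so it is solved with the Expectation-Maximization (EM) method. The procedure is as follows (for brevity, the mathematical expressions are omitted here; see Wikipedia):

1. Given the value of K, initialize K multivariate Gaussian distributions and their weights;
2. Using Bayes' theorem, estimate the posterior probability that each sample was generated by each component (the E-step of the EM method);
3. Using the definitions of mean and covariance and the posterior probabilities from step 2, update the mean vectors, covariance matrices and weights (the M-step of the EM method);
Repeat steps 2 and 3 until the increase of the likelihood function falls below the convergence threshold or the maximum number of iterations is reached.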
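The steps above can be sketched in plain Scala for the one-dimensional case. This is a teaching sketch under simplifying assumptions, not Spark's implementation: it uses scalar variances instead of covariance matrices, takes the initial components as an argument, and does no log-space arithmetic, so it can break down on data far from every component. The names Em1D, Gaussian, eStep, mStep and fit are hypothetical.

```scala
object Em1D {
  case class Gaussian(weight: Double, mean: Double, variance: Double)

  // Density of N(mean, variance) at x.
  def pdf(x: Double, g: Gaussian): Double =
    math.exp(-(x - g.mean) * (x - g.mean) / (2 * g.variance)) /
      math.sqrt(2 * math.Pi * g.variance)

  // Log-likelihood of the data under the mixture.
  def logLikelihood(data: Seq[Double], model: Seq[Gaussian]): Double =
    data.map(x => math.log(model.map(g => g.weight * pdf(x, g)).sum)).sum

  // E-step: posterior probability of each component for each sample (Bayes' theorem).
  def eStep(data: Seq[Double], model: Seq[Gaussian]): Seq[Seq[Double]] =
    data.map { x =>
      val joint = model.map(g => g.weight * pdf(x, g))
      val total = joint.sum
      joint.map(_ / total)
    }

  // M-step: re-estimate weight, mean and variance from the posteriors.
  def mStep(data: Seq[Double], resp: Seq[Seq[Double]], k: Int): Seq[Gaussian] =
    (0 until k).map { j =>
      val r = resp.map(_(j))
      val nj = r.sum
      val mean = r.zip(data).map { case (w, x) => w * x }.sum / nj
      val variance = r.zip(data).map { case (w, x) => w * (x - mean) * (x - mean) }.sum / nj
      Gaussian(nj / data.size, mean, math.max(variance, 1e-6)) // floor avoids a collapsed component
    }

  // Repeat E and M steps until the log-likelihood gain drops below tol or maxIter is reached.
  def fit(data: Seq[Double], init: Seq[Gaussian],
          maxIter: Int = 100, tol: Double = 0.01): Seq[Gaussian] = {
    var model = init
    var previous = logLikelihood(data, model)
    var iter = 0
    var converged = false
    while (iter < maxIter && !converged) {
      model = mStep(data, eStep(data, model), model.size)
      val current = logLikelihood(data, model)
      converged = current - previous < tol
      previous = current
      iter += 1
    }
    model
  }
}
```

For example, fitting the six values 0.9, 1.0, 1.1, 4.9, 5.0, 5.1 from the initial guess Gaussian(0.5, 0.0, 1.0) and Gaussian(0.5, 6.0, 1.0) should yield two components with means close to 1.0 and 5.0 and weights close to 0.5 each.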

After the parameter estimation is complete, the posterior probability of each cluster is computed for every sample point according to Bayes' theorem, and the sample is assigned to the cluster with the largest posterior probability. In contrast to KMeans and other methods that directly assign each sample point to one cluster, a method such as GMM, which gives the probability that a sample point belongs to each cluster, is called soft clustering (or soft assignment).
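The difference between the soft and the hard assignment can be illustrated with a short plain-Scala sketch, again restricted to one-dimensional components for brevity. SoftAssignment, posteriors and hardLabel are illustrative names, not Spark API.

```scala
object SoftAssignment {
  case class Component(weight: Double, mean: Double, variance: Double)

  // Density of N(mean, variance) at x.
  def pdf(x: Double, c: Component): Double =
    math.exp(-(x - c.mean) * (x - c.mean) / (2 * c.variance)) /
      math.sqrt(2 * math.Pi * c.variance)

  // Soft assignment: posterior of each component, weight_k * pdf_k(x) normalized over all k.
  def posteriors(x: Double, comps: Seq[Component]): Seq[Double] = {
    val joint = comps.map(c => c.weight * pdf(x, c))
    val total = joint.sum
    joint.map(_ / total)
  }

  // Hard label: index of the component with the largest posterior.
  def hardLabel(x: Double, comps: Seq[Component]): Int = {
    val p = posteriors(x, comps)
    p.indexOf(p.max)
  }
}
```

With two equally weighted unit-variance components centered at 0 and 4, a point at 2 gets a posterior of 0.5 for each component, while a point at 1 belongs to the first component with a probability of about 0.98. The hard label keeps only the index of the larger value and discards this uncertainty.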

2.2 Model training and analysis

The Gaussian mixture model provided by Spark's ML library lives in the org.apache.spark.ml.clustering package. As with the other clustering methods, its implementation is split into two classes: GaussianMixture (an Estimator), which holds the GMM hyperparameters and performs the training, and GaussianMixtureModel (a Transformer), which represents the trained model. Before using them, import the required packages:

import org.apache.spark.ml.clustering.{GaussianMixture,GaussianMixtureModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}

Enable the implicit conversions needed to turn an RDD into a DataFrame:

import spark.implicits._

We again use the Iris dataset for the experiment. To make it easy to build the corresponding DataFrame, we define a case class named model_instance that serves as the data type of each DataFrame row (one data sample).

case class model_instance (features: org.apache.spark.ml.linalg.Vector)

After defining the data type, you can read the data into an RDD[model_instance] and convert that RDD to a DataFrame with the toDF() method provided by the implicit conversions:

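val rawData = sc.textFile("file:///home/hduser/iris.data")
// Skip blank lines: their single empty field would match the regex and break toDouble
val df = rawData.filter(_.trim.nonEmpty).map { line =>
  // Keep only the numeric fields (this drops the class label) and parse them
  val values = line.split(",").filter(p => p.matches("\\d*(\\.?)\\d*")).map(_.toDouble)
  model_instance(Vectors.dense(values))
}.toDF()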

As in the MLlib version, we use the filter operator to drop the class label. The regular expression \\d*(\\.?)\\d* matches real numbers: \\d* uses the * quantifier to match zero or more digits, and \\.? uses the ? quantifier to match zero or one decimal point.

You can train a GMM by creating a GaussianMixture instance, setting the corresponding hyperparameters, and calling its fit(..) method, which returns a GaussianMixtureModel. The hyperparameters that can be set before calling fit(..) are listed below:

k: the number of clusters, default is 2
maxIter: the maximum number of iterations, default is 100
seed: random number seed, default is a fixed Long value derived from the class name
tol: convergence threshold of the log-likelihood function, default is 0.01

Each hyperparameter can be set through a method named setXXX(...) (for example, setMaxIter() for maxIter). Here we create a simple GaussianMixture object, set the number of clusters to 3, and keep the default values for the other parameters.

val gm = new GaussianMixture().setK(3)
               .setPredictionCol("Prediction")
               .setProbabilityCol("Probability")
val gmm = gm.fit(df)
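The remaining hyperparameters from the list can be set in the same chained style. The sketch below assumes Spark's ML library is on the classpath; the object name GmmConfig and the values 200, 0.001 and 42 are arbitrary illustrations, not recommendations.

```scala
import org.apache.spark.ml.clustering.GaussianMixture

object GmmConfig {
  // Build an estimator with every listed hyperparameter set explicitly.
  // The values are illustrative only.
  def configured(): GaussianMixture =
    new GaussianMixture()
      .setK(3)          // number of clusters
      .setMaxIter(200)  // maximum number of EM iterations
      .setTol(0.001)    // convergence threshold on the log-likelihood
      .setSeed(42L)     // fixed seed for reproducible initialization
}
```

The returned estimator is used like gm above, for example GmmConfig.configured().fit(df).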

Unlike hard clustering methods such as KMeans, a GMM gives not only the predicted cluster of each sample but also the probability that the sample belongs to each cluster (stored here in the "Probability" column).

After calling the transform() method on the data set, print the result to see the predicted cluster of each sample and its probability distribution vector:

val result = gmm.transform(df)
result.show(150, false)

After obtaining the model, you can inspect its parameters. Unlike KMeans, a GMM does not directly give cluster centers; it gives the parameters of each mixture component (a multivariate Gaussian distribution). In the ML implementation, each mixture component is stored as a MultivariateGaussian (located in the org.apache.spark.ml.stat.distribution package). The weights member of GaussianMixtureModel returns the weight of each mixture component, and the gaussians member returns the parameters of each mixture component (mean vector and covariance matrix):

for (i <- 0 until gmm.getK) {
      println("Component %d : weight is %f \n mu vector is %s \n sigma matrix is %s" format
      (i, gmm.weights(i), gmm.gaussians(i).mean, gmm.gaussians(i).cov))
      }
