Spark ML - Clustering Algorithms
1. KMeans fast clustering
First, import the required packages:
import org.apache.spark.ml.clustering.{KMeans,KMeansModel}
import org.apache.spark.ml.linalg.Vectors
Enable the implicit conversions from RDD to DataFrame:
import spark.implicits._
To make it easy to build the corresponding DataFrame, we define a case class named model_instance as the data type of each row (one data sample):
case class model_instance (features: org.apache.spark.ml.linalg.Vector)
After defining the data type, read the data into an RDD[model_instance] and convert that RDD to a DataFrame with the implicit .toDF() method:
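val rawData = sc.textFile("file:///home/hduser/iris.data")
val df = rawData
  .filter(_.trim.nonEmpty) // skip blank lines: an empty token matches the regex but fails in toDouble
  .map { line =>
    model_instance(Vectors.dense(
      line.split(",")
        .filter(p => p.matches("\\d*(\\.?)\\d*")) // keep the numeric fields, drop the class label
        .map(_.toDouble)))
  }
  .toDF()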
As in the MLlib version of this tutorial, we use the filter operator to drop the class label. The regular expression \\d*(\\.?)\\d* matches real numbers: the quantifier * in \\d* matches zero or more digit characters, and the quantifier ? in \\.? matches zero or one decimal point.
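The behaviour of this regular expression can be checked without Spark. The snippet below is a minimal sketch in plain Scala; the helper name numericFields and the sample line are assumptions made for illustration and are not part of the original tutorial. Note that the pattern also matches the empty string and a lone ".", so the helper guards against both before calling toDouble.

```scala
// Regex from the text: optional digits, optional decimal point, optional digits.
val numberPattern = "\\d*(\\.?)\\d*"

// Hypothetical helper: keeps only the numeric fields of one CSV line.
// The extra guards reject "" and ".", which match the pattern but are not numbers.
def numericFields(line: String): Array[Double] =
  line.split(",")
    .filter(p => p.nonEmpty && p != "." && p.matches(numberPattern))
    .map(_.toDouble)

val sample = "5.1,3.5,1.4,0.2,Iris-setosa"
println(numericFields(sample).mkString(", ")) // prints 5.1, 3.5, 1.4, 0.2
println("Iris-setosa".matches(numberPattern)) // prints false: the label is filtered out
println("".matches(numberPattern))            // prints true: empty tokens need a guard
```

The pattern does not accept a sign or scientific notation, which is fine for the Iris data but would silently drop such fields in other datasets.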
With the data loaded, we follow the standard workflow of the ML package: create an Estimator and call its fit() method to produce the corresponding Transformer. Here the KMeans class is the Estimator, and KMeansModel, the class that holds the trained model, is the Transformer:
val kmeansmodel = new KMeans().
setK(3).
setFeaturesCol("features").
setPredictionCol("prediction").
fit(df)
As in the MLlib version, KMeans in the ML package exposes parameters such as seed (random number seed), tol (convergence threshold), k (number of clusters), maxIter (maximum number of iterations), initMode (initialization method) and initSteps (number of steps of the k-means|| method). As with other algorithms in the ML framework, they can be set through the corresponding setXXX() methods or passed in as a ParamMap. For simplicity, we use the setXXX() style here to set only K and leave the remaining parameters at their default values.
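The ParamMap style mentioned above might look like the following sketch. The toy data, the seed and the parameter values are made up for illustration; only KMeans, ParamMap and fit(dataset, paramMap) come from the Spark ML API.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SparkSession

// Reuses the running session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("kmeans-parammap").getOrCreate()
import spark.implicits._

// Toy data (illustrative only) with a single "features" column.
val toy = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply).toDF("features")

val kmeans = new KMeans()
// Parameters given in a ParamMap override those set on the estimator for this fit only.
val params = ParamMap(kmeans.k -> 2, kmeans.maxIter -> 20, kmeans.seed -> 1L)
val model = kmeans.fit(toy, params)
println(model.getK) // the K that was actually used for training
```

A ParamMap is convenient when the same estimator is fitted several times with different settings, because the estimator itself stays unchanged.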
Unlike the MLlib implementation, KMeansModel, being a Transformer, no longer has a predict()-style method. Instead it offers the uniform transform() method, which processes a whole dataset stored in a DataFrame and returns a DataFrame with the predicted cluster labels:
val results = kmeansmodel.transform(df)
To inspect the results, we can call collect(), which gathers all the rows of the DataFrame into an Array and returns it:
results.collect().foreach(
row => {
println( row(0) + " is predicted as cluster " + row(1))
})
You can also obtain all the cluster centers of the model through the clusterCenters attribute of the KMeansModel class:
kmeansmodel.clusterCenters.foreach(
center => {
println("Clustering Center:"+center)
})
As in the MLlib implementation, the KMeansModel class also provides a method that computes the Within Set Sum of Squared Errors (WSSSE) to measure the quality of the clustering. When the true K is unknown, the change in this value across different K can serve as an important reference for choosing an appropriate K:
kmeansmodel.computeCost(df)
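Note that computeCost was deprecated in Spark 2.4 and removed in Spark 3.0. On newer versions, a comparable workflow could look like the sketch below, which reads the training cost from the model summary and also computes the silhouette score with ClusteringEvaluator. The helper name scoreKs and the fixed seed are assumptions for illustration, not part of the original tutorial.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.DataFrame

// Hypothetical helper: for each candidate K (K >= 2), returns (K, training cost, silhouette)
// for a DataFrame that has a "features" vector column.
def scoreKs(data: DataFrame, ks: Seq[Int]): Seq[(Int, Double, Double)] = {
  val evaluator = new ClusteringEvaluator()
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(1L).fit(data)
    val cost = model.summary.trainingCost                       // sum of squared distances to the nearest center (Spark 2.4+)
    val silhouette = evaluator.evaluate(model.transform(data))  // in [-1, 1], higher is better
    (k, cost, silhouette)
  }
}
```

With the DataFrame built earlier, scoreKs(df, 2 to 6).foreach(println) prints one line per K; a common heuristic is to look for the K after which the cost stops dropping sharply.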
2. Gaussian Mixture Model (GMM) clustering algorithm
2.1 Basic principles
Gaussian Mixture Model (GMM) is a probabilistic clustering method and a generative model: it assumes that all data samples are generated by a mixture of multivariate Gaussian distributions. Specifically, given the number of classes K, for a sample $x$ in the sample space, the probability density function of a Gaussian mixture model is a mixture of K multivariate Gaussian distributions:
$$p(x) = \sum_{i=1}^{K} w_i \, p(x \mid \mu_i, \Sigma_i)$$
where $p(x \mid \mu_i, \Sigma_i)$ is the probability density function of the multivariate Gaussian distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The Gaussian mixture model is thus composed of K different multivariate Gaussian distributions, each of which is called a component of the mixture, and $w_i$ is the weight of the i-th multivariate Gaussian distribution in the mixture, with $\sum_{i=1}^{K} w_i = 1$.
Assuming a Gaussian mixture model is given, a sample in the sample space is generated as follows: using the component weights as probabilities (a weight can be understood intuitively as the proportion of all samples generated by the corresponding component), one mixture component is selected, and the sample is then drawn according to the probability density function of that component.
Clustering with GMM is then the "inverse process" of generating samples with GMM: given the number of clusters K and a data set, a parameter estimation method is used to infer the parameters of each mixture component (its mean vector $\mu_i$, covariance matrix $\Sigma_i$ and weight $w_i$), and each multivariate Gaussian component corresponds to one cluster. Gaussian mixture models are trained by maximum likelihood estimation, maximizing the following log-likelihood function:
$$\log L = \sum_{j=1}^{N} \log \sum_{i=1}^{K} w_i \, p(x_j \mid \mu_i, \Sigma_i)$$
where $x_1, \dots, x_N$ are the samples of the data set and $p(x_j \mid \mu_i, \Sigma_i)$ is the density of the i-th Gaussian component evaluated at $x_j$.
This optimization problem has no direct closed-form solution, so it is solved with the Expectation-Maximization (EM) method. The procedure is as follows (for brevity, the mathematical expressions are omitted here; see Wikipedia for details):
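1. Given the value of K, initialize K multivariate Gaussian distributions and their weights;
2. Using Bayes' theorem, estimate the posterior probability that each sample was generated by each component (the E step of EM);
3. Using the definitions of mean and covariance together with the posterior probabilities from step 2, update the mean vectors, covariance matrices and weights (the M step of EM);
Repeat steps 2 and 3 until the increase in the likelihood function falls below the convergence threshold or the maximum number of iterations is reached.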
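For readers who want the expressions behind steps 2 and 3, the standard textbook form is sketched below. The notation is an assumption of this sketch rather than the original author's: $x_1, \dots, x_N$ are the samples, $w_i$, $\mu_i$ and $\Sigma_i$ are the weight, mean vector and covariance matrix of component $i$, $p(x \mid \mu_i, \Sigma_i)$ is its Gaussian density, and $\gamma_{ji}$ is the posterior probability that sample $x_j$ was generated by component $i$.

```latex
% E step: posterior probability of component i for sample x_j (Bayes' theorem)
\gamma_{ji} = \frac{w_i \, p(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{l=1}^{K} w_l \, p(x_j \mid \mu_l, \Sigma_l)}

% M step: re-estimate the parameters from the posteriors
N_i = \sum_{j=1}^{N} \gamma_{ji}, \qquad
\mu_i = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ji} \, x_j, \qquad
\Sigma_i = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ji} \, (x_j - \mu_i)(x_j - \mu_i)^{\top}, \qquad
w_i = \frac{N_i}{N}
```

Here $N_i$ acts as the effective number of samples assigned to component $i$, so the updated weights again sum to 1.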
Once parameter estimation is complete, the posterior probability of each cluster is computed for every sample point by Bayes' theorem, and the sample is assigned to the cluster with the largest posterior probability. In contrast to KMeans and similar methods that directly assign each sample point to a cluster, a method like GMM that gives the probability of a sample point belonging to each cluster is called soft clustering (soft assignment).
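The difference between soft and hard assignment can be illustrated in plain Scala with a one-dimensional, two-component mixture. This is a minimal sketch: the weights, means and variances are made-up values, and the helper names gaussianPdf and posteriors are hypothetical.

```scala
// Density of a 1-D Gaussian with the given mean and variance.
def gaussianPdf(x: Double, mean: Double, variance: Double): Double =
  math.exp(-(x - mean) * (x - mean) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

// Posterior probability of each component for sample x (Bayes' theorem).
def posteriors(x: Double, weights: Array[Double], means: Array[Double],
               variances: Array[Double]): Array[Double] = {
  val joint = weights.indices.map(i => weights(i) * gaussianPdf(x, means(i), variances(i))).toArray
  val total = joint.sum
  joint.map(_ / total)
}

// Two equally weighted components centred at 0 and 5 (illustrative values).
val post = posteriors(1.0, Array(0.5, 0.5), Array(0.0, 5.0), Array(1.0, 1.0))
val hardLabel = post.indexOf(post.max) // hard assignment: component with the largest posterior

println(post.mkString(", ")) // soft assignment: one probability per component
println(hardLabel)           // prints 0, since x = 1.0 is much closer to the first component
```

The posterior vector is the soft assignment; taking the index of its maximum reduces it to a hard cluster label.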
2.2 Model training and analysis
The Gaussian mixture model provided by Spark's ML library lives in the org.apache.spark.ml.clustering package. As with the other clustering methods, its implementation is split into two classes: GaussianMixture (an Estimator), which holds the hyperparameters of GMM and performs training, and GaussianMixtureModel (a Transformer), which represents the trained model. Before use, import the required packages:
import org.apache.spark.ml.clustering.{GaussianMixture,GaussianMixtureModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
Enable the implicit conversions from RDD to DataFrame:
import spark.implicits._
We again use the Iris dataset for the experiment. To make it easy to build the corresponding DataFrame, we define a case class named model_instance as the data type of each row (one data sample):
case class model_instance (features: org.apache.spark.ml.linalg.Vector)
After defining the data type, read the data into an RDD[model_instance] and convert that RDD to a DataFrame with the implicit .toDF() method:
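val rawData = sc.textFile("file:///home/hduser/iris.data")
val df = rawData
  .filter(_.trim.nonEmpty) // skip blank lines: an empty token matches the regex but fails in toDouble
  .map { line =>
    model_instance(Vectors.dense(
      line.split(",")
        .filter(p => p.matches("\\d*(\\.?)\\d*")) // keep the numeric fields, drop the class label
        .map(_.toDouble)))
  }
  .toDF()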
As in the MLlib version, we use the filter operator to drop the class label. The regular expression \\d*(\\.?)\\d* matches real numbers: the quantifier * in \\d* matches zero or more digit characters, and the quantifier ? in \\.? matches zero or one decimal point.
You can train a GMM by creating a GaussianMixture object, setting its hyperparameters and calling the fit(..) method, which returns a GaussianMixtureModel. The hyperparameters that can be set before calling fit(..) are listed below:
- k : the number of clusters, the default is 2
- maxIter : the maximum number of iterations, the default is 100
- seed : random number seed, the default is a random Long value
- tol : convergence threshold of the log-likelihood function, the default is 0.01
Each hyperparameter can be set with a method named setXXX(...) (for example, setMaxIter() for maxIter). Here we create a simple GaussianMixture object, set the number of clusters to 3 and keep the default values for the other parameters.
val gm = new GaussianMixture().setK(3)
.setPredictionCol("Prediction")
.setProbabilityCol("Probability")
val gmm = gm.fit(df)
Unlike hard clustering methods such as KMeans, GMM yields not only the predicted cluster of each sample but also the probability that the sample belongs to each cluster (stored here in the "Probability" column).
After calling transform() on the dataset, printing the result shows the predicted cluster of each sample together with its probability distribution vector:
val result = gmm.transform(df)
result.show(150, false)
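To summarize how confident each assignment is, the largest entry of the probability vector can be extracted into its own column. The following is a minimal sketch; the helper name withConfidence and the output column name "Confidence" are hypothetical, and the default column name assumes the "Probability" column configured above.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: adds a "Confidence" column holding the largest posterior
// probability of each row, i.e. the probability of the predicted cluster.
def withConfidence(predictions: DataFrame, probabilityCol: String = "Probability"): DataFrame = {
  val maxProb = udf((v: Vector) => v.toArray.max)
  predictions.withColumn("Confidence", maxProb(col(probabilityCol)))
}
```

For example, withConfidence(result).select("Prediction", "Confidence").show(5) lists the predicted cluster next to its probability; values close to 1 indicate samples that lie clearly inside one component.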
After obtaining the model, you can inspect its parameters. Unlike KMeans, GMM does not give cluster centers directly; it gives the parameters of each mixture component (a multivariate Gaussian distribution). In the ML implementation, each mixture component of the GMM is stored as a MultivariateGaussian (in the org.apache.spark.ml.stat.distribution package). The weights member of the GaussianMixtureModel class returns the weight of each mixture component, and the gaussians member returns the parameters of each component (mean vector and covariance matrix):
for (i <- 0 until gmm.getK) {
println("Component %d : weight is %f \n mu vector is %s \n sigma matrix is %s" format
(i, gmm.weights(i), gmm.gaussians(i).mean, gmm.gaussians(i).cov))
}
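The weights and components can also be combined to evaluate the mixture density at a point, that is, the weighted sum of the component densities. The sketch below assumes a trained GaussianMixtureModel; the helper name mixtureDensity is hypothetical, while weights, gaussians and MultivariateGaussian.pdf are members of the Spark ML API.

```scala
import org.apache.spark.ml.clustering.GaussianMixtureModel
import org.apache.spark.ml.linalg.Vector

// Hypothetical helper: mixture density at x, the sum over components of weight * component pdf.
def mixtureDensity(model: GaussianMixtureModel, x: Vector): Double =
  model.weights.zip(model.gaussians).map { case (w, g) => w * g.pdf(x) }.sum
```

For instance, mixtureDensity(gmm, Vectors.dense(5.1, 3.5, 1.4, 0.2)) evaluates the fitted density at the measurements of one Iris sample (Vectors comes from org.apache.spark.ml.linalg).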