Scala MLlib official documentation --- spark.mllib package: Frequent pattern mining + PMML model export

8. Frequent pattern mining

Mining frequent items, itemsets, sequences, or other substructures is usually among the first steps in analyzing a large-scale dataset, and has been an active research topic in data mining for years. We refer users to Wikipedia's article on association rule learning for more information. spark.mllib provides a parallel implementation of FP-growth, a popular algorithm for mining frequent itemsets.

FP-growth

The FP-growth algorithm is described in the paper by Han et al., "Mining frequent patterns without candidate generation", where "FP" stands for frequent pattern. Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items. Different from Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix-tree (FP-tree) structure to encode transactions without generating candidate sets explicitly, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree. In spark.mllib, we implemented a parallel version of FP-growth called PFP, as described in Li et al., "PFP: Parallel FP-growth for query recommendation". PFP distributes the work of growing FP-trees based on the suffixes of transactions, and hence is more scalable than a single-machine implementation. We refer users to these papers for more details.
spark.mllib's FP-growth implementation takes the following (hyper-)parameters:

  • minSupport: the minimum support for an itemset to be identified as frequent. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
  • numPartitions: the number of partitions used to distribute the work.
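To make the support calculation above concrete, here is a minimal plain-Scala sketch (no Spark needed; the transactions and names are made up for illustration) that computes per-item support and applies a minSupport threshold:

```scala
// Illustrative only: compute each item's support (fraction of transactions
// containing it) and keep the items at or above a minSupport threshold.
val transactions = Seq(
  Seq("a", "b"), Seq("a", "c"), Seq("a", "b", "c"), Seq("b"), Seq("a")
)
val n = transactions.size.toDouble
val support: Map[String, Double] =
  transactions.flatMap(_.distinct).groupBy(identity).map { case (item, occ) =>
    item -> occ.size / n
  }
// "a" appears in 4 of 5 transactions, so its support is 4/5 = 0.8.
val minSupport = 0.6
val frequentItems = support.filter { case (_, s) => s >= minSupport }.keySet
println(frequentItems) // items "a" and "b"; "c" (support 0.4) is filtered out
```

The FP-growth sample code below performs the analogous filtering, but over itemsets rather than single items, and in parallel over RDD partitions.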

Sample Code

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("data/mllib/sample_fpgrowth.txt")

val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")},${itemset.freq}")
}

val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(s"${rule.antecedent.mkString("[", ",", "]")}=> " +
    s"${rule.consequent.mkString("[", ",", "]")},${rule.confidence}")
}

Association Rules

AssociationRules implements a parallel rule-generation algorithm for constructing rules that have a single item as the consequent.
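A rule X => Y is kept only if its confidence, freq(X ∪ Y) / freq(X), meets minConfidence. As a quick check of the arithmetic (plain Scala, no Spark), using the same counts as the sample code below:

```scala
// Confidence of X => Y is freq(X union Y) / freq(X).
// With freq({a}) = 15 and freq({a, b}) = 12:
val freqA = 15L
val freqB = 35L
val freqAB = 12L
val confAtoB = freqAB.toDouble / freqA // 12/15 = 0.8, meets minConfidence 0.8
val confBtoA = freqAB.toDouble / freqB // 12/35 ~= 0.34, below the threshold
println(confAtoB) // 0.8
```

So with minConfidence = 0.8, only the rule a => b is emitted, which matches the output of the sample below.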
Sample Code

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 15L),
  new FreqItemset(Array("b"), 35L),
  new FreqItemset(Array("a", "b"), 12L)
))

val ar = new AssociationRules()
  .setMinConfidence(0.8)
val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println(s"[${rule.antecedent.mkString(",")}=>${rule.consequent.mkString(",")}]" +
    s" ${rule.confidence}")
}

PrefixSpan

PrefixSpan is a sequential pattern mining algorithm described in Pei et al., "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach". We refer the reader to the referenced paper for formalizing the sequential pattern mining problem.
spark.mllib's PrefixSpan implementation takes the following parameters:

  • minSupport: the minimum support required for a sequential pattern to be considered frequent.
  • maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results.
  • maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.

Examples
The following example illustrates PrefixSpan running on the sequences below (using the same notation as Pei et al.):

  <(12)3>
  <1(32)(12)>
  <(12)5>
  <6>
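To make the support threshold concrete: minSupport here is a fraction of the total number of input sequences, so with minSupport = 0.5 over these 4 sequences, a pattern must occur in at least 2 of them to be frequent. A back-of-the-envelope calculation in plain Scala:

```scala
// minSupport is a fraction of the total number of input sequences;
// the implied absolute occurrence count is its ceiling over that total.
val numSequences = 4
val minSupport = 0.5
val minCount = math.ceil(minSupport * numSequences).toLong
println(minCount) // 2
```
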

Sample Code

import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}

10. PMML model export

spark.mllib supported models

spark.mllib supports model export to Predictive Model Markup Language (PMML).
The following table outlines the spark.mllib models that can be exported to PMML and their equivalent PMML models.
  spark.mllib model                 PMML model
  KMeansModel                       ClusteringModel
  LinearRegressionModel             RegressionModel (functionName="regression")
  RidgeRegressionModel              RegressionModel (functionName="regression")
  LassoModel                        RegressionModel (functionName="regression")
  SVMModel                          RegressionModel (functionName="classification", normalizationMethod="none")
  Binary LogisticRegressionModel    RegressionModel (functionName="classification", normalizationMethod="logit")

Examples

To export a supported model (see the table above) to PMML, simply call model.toPMML.
As well as exporting the PMML model to a String (model.toPMML, as in the example below), you can export the PMML model to other formats.
For more details on the API, refer to the KMeans Scala docs and the Vectors Scala docs.
Here is a complete example of building a KMeansModel and printing it out in PMML format:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Export the model to a String in PMML format
println(s"PMML Model:\n ${clusters.toPMML}")

// Export the model to a local file in PMML format
clusters.toPMML("/tmp/kmeans.xml")

// Export the model to a directory on a distributed file system in PMML format
clusters.toPMML(sc, "/tmp/kmeans")

// Export the model to the OutputStream in PMML format
clusters.toPMML(System.out)

Origin blog.csdn.net/pt798633929/article/details/103850221