Spark ML Pipelines

In Spark 2.0, MLlib is split into two packages: spark.mllib, which contains the RDD-based API, and spark.ml, which contains the DataFrame-based API. New features are no longer added to the RDD-based spark.mllib API, so the ml package is the recommended one. The RDD-based API is expected to be deprecated in Spark 2.2 and removed entirely in Spark 3.0.
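
For example, logistic regression is exposed under both packages; a minimal illustration of the split (both classes ship with Spark 2.x):

// RDD-based API (maintenance mode only), in the spark.mllib package:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
// DataFrame-based API (recommended), in the spark.ml package:
import org.apache.spark.ml.classification.LogisticRegression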

Main concepts in Pipelines

  • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

  • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

  • Parameter: All Transformers and Estimators now share a common API for specifying parameters.

Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

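As a minimal sketch of the first case, the Tokenizer feature transformer reads a text column and appends a words column. This assumes an existing SparkSession named spark; the toy data is made up for illustration.

import org.apache.spark.ml.feature.Tokenizer

// A toy DataFrame with a text column.
val df = spark.createDataFrame(Seq(
  (0L, "spark ml pipelines")
)).toDF("id", "text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform() returns a new DataFrame with the "words" column appended.
tokenizer.transform(df).show(truncate = false)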

Estimators

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.

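A minimal sketch, assuming a training DataFrame of (label, features) rows such as the one prepared in the full example further below:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()   // an Estimator
// training: a DataFrame with "label" and "features" columns (assumed to exist).
val lrModel = lr.fit(training)      // fit() returns a LogisticRegressionModel
lrModel.transform(training).show()  // the resulting Model is itself a Transformer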

Properties of pipeline components

Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).

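For instance, the unique ID of each new instance can be inspected directly; the exact values are random and differ from run to run:

import org.apache.spark.ml.classification.LogisticRegression

val lrA = new LogisticRegression()
val lrB = new LogisticRegression()
// Each instance carries its own uid, e.g. "logreg_" followed by a random suffix.
println(lrA.uid)
println(lrB.uid)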

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

  • Split each document’s text into words.

  • Convert each document’s words into a numerical feature vector.

  • Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, each of which is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each one. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.

[Figure: a Pipeline at training time]

In the figure above, the top row represents a Pipeline with three stages. The first two, Tokenizer and HashingTF, are Transformers (blue); the third, LogisticRegression, is an Estimator (red). The bottom row represents data flowing through the Pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method on the left is called on the original DataFrame, which contains raw text and labels. Tokenizer.transform() splits the text into words and adds a new words column to the DataFrame. HashingTF.transform() converts the words column into feature vectors and adds a column with those vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.

[Figure: a PipelineModel (fitted Pipeline) at test time]

The figure above shows a PipelineModel, which has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data passes through the fitted pipeline in order; each stage's transform() method is called on the dataset before it is passed on to the next stage.

Pipelines and PipelineModels help ensure that training and test data go through identical feature processing steps.

Parameters

In MLlib, Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter; a ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations.
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override parameters previously specified via setter methods. E.g., given two LogisticRegression instances lr1 and lr2, we can specify the maximum number of iterations for both in a single ParamMap: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20), as sketched after this list.
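
A small sketch of the second method with two instances; training is assumed to be a prepared DataFrame of labels and features, as in the example below:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// One ParamMap can hold parameters for several instances; these values
// override anything set earlier via the setter methods.
val params = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

val model1 = lr1.fit(training, params)  // trained with maxIter = 10
val model2 = lr2.fit(training, params)  // trained with maxIter = 20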

Code examples

Example: Estimator, Transformer, and Param

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object EstimatorTransformerParamExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("EstimatorTransformerParamExample")
      .getOrCreate()

    // $example on$
    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Create a LogisticRegression instance. This instance is an Estimator.
    val lr = new LogisticRegression()
    // Print out the parameters, documentation, and any default values.
    println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    // We may set parameters using setter methods.
    lr.setMaxIter(10)
      .setRegParam(0.01)

    // Learn a LogisticRegression model. This uses the parameters stored in lr.
    val model1 = lr.fit(training)
    // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    // we can view the parameters it used during fit().
    // This prints the parameter (name: value) pairs, where names are unique IDs for this
    // LogisticRegression instance.
    println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

    // We may alternatively specify parameters using a ParamMap,
    // which supports several methods for specifying parameters.
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

    // One can also combine ParamMaps.
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
    val paramMapCombined = paramMap ++ paramMap2

    // Now learn a new model using the paramMapCombined parameters.
    // paramMapCombined overrides all parameters set earlier via lr.set* methods.
    val model2 = lr.fit(training, paramMapCombined)
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

    // Prepare test data.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    // Make predictions on test data using the Transformer.transform() method.
    // LogisticRegression.transform will only use the 'features' column.
    // Note that model2.transform() outputs a 'myProbability' column instead of the usual
    // 'probability' column since we renamed the lr.probabilityCol parameter previously.
    model2.transform(test)
      .select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

If the example fails to run, see my previous article:
spark Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException

Example: Pipeline

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("PipelineExample")
      .getOrCreate()

    // $example on$
    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // And load it back in during production
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

Reference: the official Spark documentation


Reposted from blog.csdn.net/dillon2015/article/details/64453852