Spark ML Pipelines

As of Spark 2.0, MLlib is split into two packages: spark.mllib contains the RDD-based API, while spark.ml contains the DataFrame-based API. No new features are being added to the RDD-based spark.mllib package, so the ml package is the recommended one. The RDD-based API is slated for deprecation in Spark 2.2 and complete removal in Spark 3.0.
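The split is visible in the package names. A minimal sketch contrasting the two import paths (illustrative only; the two imports would clash if placed in the same file and are shown side by side purely for comparison):

// DataFrame-based API (recommended): lives under org.apache.spark.ml
import org.apache.spark.ml.classification.LogisticRegression
// RDD-based API (maintenance mode, no new features): lives under org.apache.spark.mllib
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS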

Main concepts in Pipelines

  • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

  • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

  • Parameter: All Transformers and Estimators now share a common API for specifying parameters.

Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

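A minimal sketch of a feature transformer in action (assumes an existing SparkSession named spark; the column names are chosen only for illustration):

import org.apache.spark.ml.feature.Tokenizer

val docs = spark.createDataFrame(Seq(
  (0L, "spark ml pipelines"),
  (1L, "hello world")
)).toDF("id", "text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform() returns a new DataFrame with the "words" column appended;
// the input DataFrame itself is not modified.
tokenizer.transform(docs).show(truncate = false)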

Estimators

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.

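A minimal sketch of the Estimator-to-Model relationship (assumes an existing SparkSession named spark; the tiny dataset is only for illustration):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

val lr = new LogisticRegression()   // an Estimator
val model = lr.fit(training)        // fit() trains and returns a LogisticRegressionModel
// The produced model is itself a Transformer:
model.transform(training).select("features", "prediction").show()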

Properties of pipeline components

Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).

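A minimal sketch of the unique-ID property (the printed IDs are examples; the actual suffix is generated randomly per instance):

import org.apache.spark.ml.classification.LogisticRegression

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()
println(lr1.uid)              // e.g. logreg_a1b2c3d4e5f6
println(lr1.uid == lr2.uid)   // false: each instance gets its own ID
// Parameter names are qualified by the instance ID (e.g. logreg_a1b2c3d4e5f6__maxIter),
// which is how a ParamMap can target lr1.maxIter and lr2.maxIter separately.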

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

  • Split each document’s text into words.

  • Convert each document’s words into a numerical feature vector.

  • Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, where each stage is either a Transformer or an Estimator. The stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.

[Figure: training-time usage of a Pipeline]

In the figure above, the top row represents a Pipeline with three stages. The first two, Tokenizer and HashingTF, are Transformers (blue); the third, LogisticRegression, is an Estimator (red). The bottom row represents data flowing through the pipeline, where the cylinders are DataFrames. On the left, Pipeline.fit() is called on the original DataFrame, which holds the raw text and labels. Tokenizer.transform() splits the text into words and adds a words column to the DataFrame; HashingTF.transform() converts the words column into feature vectors and adds the vector column to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call LogisticRegressionModel's transform() on the DataFrame before passing the DataFrame to the next stage.

[Figure: testing-time usage of a PipelineModel]

The figure above shows the corresponding PipelineModel. It has the same three stages as the original Pipeline, but all of the Estimators have become Transformers. When PipelineModel's transform() is called on a test dataset, the data passes through each stage in order: each stage calls transform() and hands the result to the next stage.

Pipelines and PipelineModels help ensure that the training and test data go through identical feature-processing steps.
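A minimal sketch of this fitting step; it reuses the pipeline and training values from the full Pipeline example further below, and inspects the fitted stages only for illustration:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

val fitted: PipelineModel = pipeline.fit(training)
// After fit(), every Estimator stage has been replaced by the Transformer it produced;
// the last stage is now a LogisticRegressionModel rather than a LogisticRegression.
val lrModel = fitted.stages.last.asInstanceOf[LogisticRegressionModel]
println(lrModel.coefficients)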

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter; a ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations.
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override parameters previously specified via setter methods. E.g., given two LogisticRegression instances lr1 and lr2, we can specify both maxIter parameters in a single ParamMap: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20).

Code examples:

Example: Estimator, Transformer, and Param

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object EstimatorTransformerParamExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("EstimatorTransformerParamExample")
      .getOrCreate()

    // $example on$
    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Create a LogisticRegression instance. This instance is an Estimator.
    val lr = new LogisticRegression()
    // Print out the parameters, documentation, and any default values.
    println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    // We may set parameters using setter methods.
    lr.setMaxIter(10)
      .setRegParam(0.01)

    // Learn a LogisticRegression model. This uses the parameters stored in lr.
    val model1 = lr.fit(training)
    // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    // we can view the parameters it used during fit().
    // This prints the parameter (name: value) pairs, where names are unique IDs for this
    // LogisticRegression instance.
    println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

    // We may alternatively specify parameters using a ParamMap,
    // which supports several methods for specifying parameters.
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

    // One can also combine ParamMaps.
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
    val paramMapCombined = paramMap ++ paramMap2

    // Now learn a new model using the paramMapCombined parameters.
    // paramMapCombined overrides all parameters set earlier via lr.set* methods.
    val model2 = lr.fit(training, paramMapCombined)
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

    // Prepare test data.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    // Make predictions on test data using the Transformer.transform() method.
    // LogisticRegression.transform will only use the 'features' column.
    // Note that model2.transform() outputs a 'myProbability' column instead of the usual
    // 'probability' column since we renamed the lr.probabilityCol parameter previously.
    model2.transform(test)
      .select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

If the example fails to run, see my previous post:
spark Exception in thread “main” java.lang.IllegalArgumentException: java.net.URISyntaxException

Example: Pipeline

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("PipelineExample")
      .getOrCreate()

    // $example on$
    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // And load it back in during production
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

Reference: the official Spark ML Pipelines documentation.
