Spark ML Pipelines

In Spark 2.0, MLlib is split into two packages: spark.mllib, which contains the RDD-based API, and spark.ml, which contains the DataFrame-based API. No new features will be added to the RDD-based API, so the spark.ml package is the recommended one. The RDD-based API is expected to be deprecated in Spark 2.2 and removed completely in Spark 3.0.

Main Concepts of Pipelines

  • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

  • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

  • Parameter: All Transformers and Estimators now share a common API for specifying parameters.

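To make the DataFrame concept above concrete, here is a minimal sketch (assuming a SparkSession named spark is already available, as in the full examples later in this article) of an ML dataset whose columns hold text, a numeric label, and a feature vector. The toy data is made up for illustration only:

import org.apache.spark.ml.linalg.Vectors

// One DataFrame can mix column types: a string column, a numeric label,
// and a vector column of features.
val dataset = spark.createDataFrame(Seq(
  ("a b c d e spark", 1.0, Vectors.dense(0.0, 1.1, 0.1)),
  ("hadoop mapreduce", 0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("text", "label", "features")

dataset.printSchema()
dataset.show(false)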

Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

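As a quick illustration, the sketch below (again assuming a SparkSession named spark) runs the built-in Tokenizer feature transformer on a tiny DataFrame; transform() returns a new DataFrame with a words column appended:

import org.apache.spark.ml.feature.Tokenizer

// A toy DataFrame with an id column and a text column.
val sentences = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "hadoop mapreduce")
)).toDF("id", "text")

// Tokenizer is a Transformer: transform() appends a "words" column
// containing the text lowercased and split on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val tokenized = tokenizer.transform(sentences)
tokenized.show(false)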

Estimators

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.

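A minimal sketch of the Estimator-to-Model step (same assumptions as above, with toy data like that in the full example below): fit() consumes a (label, features) DataFrame and returns a LogisticRegressionModel, which is itself a Transformer:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// A tiny (label, features) training DataFrame.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

// LogisticRegression is an Estimator; fit() trains and returns a Model,
// and that Model is a Transformer we can call transform() on.
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(training)
val predictions = model.transform(training)
predictions.select("features", "label", "prediction").show()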

Properties of pipeline components

Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).

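For instance, every instance exposes a uid field; a small sketch (the actual ID strings are generated when the objects are constructed, so the value hinted at in the comment is only indicative):

import org.apache.spark.ml.classification.LogisticRegression

val lrA = new LogisticRegression()
val lrB = new LogisticRegression()

// Each instance has its own unique ID, e.g. something like "logreg_3f8e...".
println(lrA.uid)
println(lrB.uid)

// Because Params are tied to a specific instance, lrA.maxIter and lrB.maxIter
// are distinct parameters even though they share the name "maxIter".
println(lrA.maxIter == lrB.maxIter)  // false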

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

  • Split each document’s text into words.

  • Convert each document’s words into a numerical feature vector.

  • Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, where each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, the fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.

(Figure: a Pipeline at training time, with the Tokenizer and HashingTF Transformer stages and the LogisticRegression Estimator stage)

The top row of the figure above represents a Pipeline with three stages. The first two, Tokenizer and HashingTF, are Transformers (blue); the third, LogisticRegression, is an Estimator (red). The bottom row represents data flowing through the Pipeline, where the cylinders are DataFrames. The Pipeline.fit() method is called on the original DataFrame, which contains raw text and labels. Tokenizer.transform() splits the text into words and adds a words column to the DataFrame. HashingTF.transform() converts the words column into feature vectors and adds a column with those vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call LogisticRegressionModel.transform() on the DataFrame before passing the DataFrame to the next stage.

(Figure: the PipelineModel, i.e. the fitted Pipeline, at test time)
The figure above shows the corresponding PipelineModel, which has the same number of stages as the original Pipeline, but every Estimator in the original Pipeline has become a Transformer. When the PipelineModel's transform() method is called on a test dataset, the data passes through the stages in order; each stage's transform() method updates the dataset and passes it on to the next stage.

Pipelines and PipelineModels help ensure that training and test data go through identical feature processing steps.

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter with self-contained documentation; a ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters on an instance. For example, if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations.
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override values previously specified via setter methods. For example, given two LogisticRegression instances lr1 and lr2, we can specify the maximum number of iterations for both at once with ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20), as in the short sketch below and the full example that follows.
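
A condensed sketch of the two-instance case just mentioned (the lr1/lr2 names are only illustrative; the complete official example follows below):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// Way 1: setter methods on a single instance.
lr1.setMaxIter(5)

// Way 2: a ParamMap can target several instances at once; when passed to
// fit(), its values override anything set earlier via setters (so lr1 would
// be fit with maxIter = 10 here, not 5).
val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)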

Code examples:

Example: Estimator, Transformer, and Param

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object EstimatorTransformerParamExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("EstimatorTransformerParamExample")
      .getOrCreate()

    // $example on$
    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Create a LogisticRegression instance. This instance is an Estimator.
    val lr = new LogisticRegression()
    // Print out the parameters, documentation, and any default values.
    println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    // We may set parameters using setter methods.
    lr.setMaxIter(10)
      .setRegParam(0.01)

    // Learn a LogisticRegression model. This uses the parameters stored in lr.
    val model1 = lr.fit(training)
    // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    // we can view the parameters it used during fit().
    // This prints the parameter (name: value) pairs, where names are unique IDs for this
    // LogisticRegression instance.
    println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

    // We may alternatively specify parameters using a ParamMap,
    // which supports several methods for specifying parameters.
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

    // One can also combine ParamMaps.
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
    val paramMapCombined = paramMap ++ paramMap2

    // Now learn a new model using the paramMapCombined parameters.
    // paramMapCombined overrides all parameters set earlier via lr.set* methods.
    val model2 = lr.fit(training, paramMapCombined)
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

    // Prepare test data.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    // Make predictions on test data using the Transformer.transform() method.
    // LogisticRegression.transform will only use the 'features' column.
    // Note that model2.transform() outputs a 'myProbability' column instead of the usual
    // 'probability' column since we renamed the lr.probabilityCol parameter previously.
    model2.transform(test)
      .select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

If running these examples fails, refer to my previous article:
spark Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException

Example: Pipeline

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("PipelineExample")
      .getOrCreate()

    // $example on$
    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // And load it back in during production
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}

Reference: the official Spark ML Pipelines documentation
