Spark ML Pipeline study notes

The goal of the spark.ml package is to provide a unified set of high-level APIs. These high-level APIs are built on top of DataFrames and help users create and tune practical machine learning pipelines. See the algorithm guides section of the spark.ml sub-package guide below, including feature transformers unique to the Pipelines API, ensembles, and more.

Table of contents:
Main concepts in Pipelines

Spark ML standardizes APIs for machine learning algorithms, making it easy to combine multiple algorithms into a single pipeline or workflow. This section introduces the main concepts of the Spark ML API; the pipeline concept is largely inspired by the scikit-learn project.

  • DataFrame: Spark ML uses the DataFrame from Spark SQL as its ML dataset. A DataFrame can hold a variety of data types; for example, a single DataFrame can have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A Transformer is an algorithm that transforms one DataFrame into another. For example, an ML model is a Transformer that converts a DataFrame with features into a DataFrame with predictions.

  • Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.

  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify a workflow.

  • Parameters: All Transformers and Estimators share a common API for specifying parameters.

DataFrame
Machine learning can be applied to many data types, such as vectors, text, images, and structured data. Spark ML adopts the DataFrame in order to support a variety of data types.
DataFrames support many basic and structured types; they support the types available in Spark SQL, as well as vectors.
A DataFrame can be created explicitly or implicitly from a regular RDD.
Columns in a DataFrame are named; the example code below uses names such as "text", "features", and "label".
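
As a minimal sketch (assuming a spark-shell style sqlContext, as in the example code later in these notes), a DataFrame with such named columns can be created explicitly from a local collection; the toy data here is hypothetical:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical toy rows; the column names match those used in the examples below.
val df = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", Vectors.dense(0.0, 1.1, 0.1), 1.0),
  (1L, "b d", Vectors.dense(2.0, 1.0, -1.0), 0.0)
)).toDF("id", "text", "features", "label")

df.printSchema() // the schema carries the column names and types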

Pipeline components

Transformers
A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:
  • A feature transformer takes a DataFrame, reads a column (e.g., text), maps it into a new column (e.g., feature vectors), and outputs a new DataFrame containing the mapped column.
  • A learning model takes a DataFrame, reads the column containing feature vectors, predicts a label for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column.
Estimators
An Estimator abstracts the concept of a learning algorithm, or of any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
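
A minimal sketch of this fit()/transform() relationship (assuming a DataFrame named training with "label" and "features" columns, as in the example code below):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()         // an Estimator
val model = lr.fit(training)              // fit() produces a LogisticRegressionModel ...
val predicted = model.transform(training) // ... which is a Transformer that appends prediction columns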

Properties of pipeline components
Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters (discussed below).
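
For example (a sketch; the exact id strings vary from run to run):

import org.apache.spark.ml.classification.LogisticRegression

val lrA = new LogisticRegression()
val lrB = new LogisticRegression()
println(lrA.uid) // e.g. logreg_xxxxxxxxxxxx
println(lrB.uid) // a different id, even though the class is the same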

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include the following stages:

  • Split each document's text into words.
  • Convert each document's words into a numerical feature vector.
  • Learn a prediction model using the feature vectors and labels.
Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, i.e., the fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.

We illustrate this with the simple text document workflow; the figure below shows the training-time usage of a Pipeline.



Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator. The bottom row represents data flowing through the pipeline, where the cylinders are DataFrames. The Pipeline.fit() method is called on the original DataFrame, which holds raw text documents and labels. The Tokenizer.transform() method splits the raw text into words, adding a new column with the words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call LogisticRegressionModel's transform() on the DataFrame before passing the DataFrame to the next stage.


A Pipeline is itself an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.



In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data passes through the fitted pipeline in order. Each stage's transform() method updates the dataset and passes it to the next stage.

Pipelines and PipelineModels help ensure that training and test data go through identical feature processing steps.

Details
DAG Pipelines: A Pipeline's stages are specified as an ordered array. The example given above is a linear Pipeline, in which each stage uses data produced by the previous stage. It is also possible to create non-linear Pipelines, as long as the data flow graph forms a DAG. The DAG is specified implicitly, based on the input and output column names of each stage. If the Pipeline forms a DAG, the stages must be specified in topological order.
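
As a small sketch of a non-linear arrangement (the column names, and the assumed numeric input column "numericFeatures", are illustrative): text is tokenized and hashed into one feature column, and a VectorAssembler then merges it with a column taken directly from the input DataFrame, so the data flow graph is a DAG rather than a simple chain. The stages array still lists the stages in topological order.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("textFeatures")
// VectorAssembler reads both the hashed text features and a raw input column.
val assembler = new VectorAssembler()
  .setInputCols(Array("textFeatures", "numericFeatures"))
  .setOutputCol("features")
val dagPipeline = new Pipeline().setStages(Array(tok, tf, assembler))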

Runtime checking: Since Pipelines can operate on DataFrames with varied types, compile-time type checking is not possible. Pipelines and PipelineModels instead perform runtime checking before actually running the Pipeline, using the DataFrame's schema.
Unique Pipeline stages: A Pipeline's stages should be unique instances. For example, the same instance myHashingTF should not be inserted into a Pipeline twice, since Pipeline stages must have unique IDs. However, two different instances (myHashingTF1 and myHashingTF2) can be put into the same Pipeline, since different instances are created with different IDs.
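
A short sketch of the distinction (hypothetical instances):

import org.apache.spark.ml.feature.HashingTF

// Two separate instances are fine in one Pipeline: each is created with its own unique id.
val myHashingTF1 = new HashingTF().setInputCol("words").setOutputCol("tf1")
val myHashingTF2 = new HashingTF().setInputCol("words").setOutputCol("tf2")
// Reusing a single instance twice in setStages would not be allowed.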


Parameters

Spark ML Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter, and a ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
1. Set parameters on the instance. For example, if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most ten iterations. This API resembles the API used in the spark.mllib package.
2. Pass a ParamMap to the fit() or transform() method. Parameters in the ParamMap override parameters previously set via setter methods.
Parameters belong to specific instances of Estimators and Transformers. For example, given two LogisticRegression instances lr1 and lr2, you can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful when a Pipeline contains two algorithms that both have a maxIter parameter.
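
A short sketch of per-instance parameters for the lr1/lr2 case described above:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()
// The same parameter name, bound to two different instances.
val maxIterMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)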


Saving and loading pipelines

It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as well as some of the more basic ML models. Please refer to an algorithm's API documentation to see whether it supports saving and loading.


Example code:

This section gives example code demonstrating the functionality discussed above. For more information, please refer to the API documentation. Some Spark ML algorithms are wrappers around spark.mllib algorithms; for details, please see the MLlib programming guide.

Example: Estimator, Transformer, and Param



This example covers the concepts of Estimator, Transformer, and Param.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
// Prepare training data from a list of (label, features) tuples.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance.  This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model.  This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30) // Specify 1 Param.  This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55) // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability") // Change output column name
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
  println(s"($features, $label) -> prob=$prob, prediction=$prediction")
}
Personal note: in this workflow, the point of the final step is setting the parameters so that the algorithm works better.

Example: Pipeline



This example follows the simple text document Pipeline illustrated in the figures above.


import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// now we can optionally save the fitted pipeline to disk
model.save("/tmp/spark-logistic-regression-model")

// we can also save this unfit pipeline to disk
pipeline.save("/tmp/unfit-lr-model")

// and load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
  println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}

Example: model selection via cross-validation




An important task in machine learning is model selection: using data to find the best model or parameters for a task, also known as tuning. It is often easier to tune an entire Pipeline at once rather than tuning each element of the Pipeline separately.
Currently, spark.ml supports model selection with the CrossValidator class, which takes an Estimator, a set of ParamMaps, and an Evaluator. CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. For example, with 3 folds, CrossValidator generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. For each ParamMap, it trains the given Estimator and evaluates it with the given Evaluator.

RegressionEvaluator is used for regression problems, BinaryClassificationEvaluator for binary classification, and MulticlassClassificationEvaluator for multiclass problems.
The default metric used to choose the best ParamMap can be overridden by each evaluator's setMetricName method.
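
For example (a sketch using the binary-classification evaluator that appears in the code below):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// areaUnderROC is the default metric; areaUnderPR is the other supported option.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")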

The ParamMap that produces the best evaluation metric is selected as the best model. CrossValidator finally fits the Estimator with the best ParamMap on the entire dataset (i.e., it runs the Estimator's fit method one more time).

The following example uses CrossValidator to choose from a grid of parameters, constructed with the ParamGridBuilder utility.
Note that cross-validation over a grid of parameters is expensive. For example, in the grid below, hashingTF.numFeatures has 3 values and lr.regParam has 2 values, and CrossValidator uses 2 folds. This multiplies out to (3 × 2) × 2 = 12 different models being trained. In realistic settings, it is common to try many more parameters and use more folds (3 or 10 are common). In other words, using CrossValidator can be very expensive.
However, it is also an established and effective method for choosing parameters.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
    // Prepare training data from a list of (id, text, label) tuples.
    val training = sqlContext.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0),
      (4L, "b spark who", 1.0),
      (5L, "g d a y", 0.0),
      (6L, "spark fly", 1.0),
      (7L, "was mapreduce", 0.0),
      (8L, "e spark program", 1.0),
      (9L, "a e c l", 0.0),
      (10L, "spark compile", 1.0),
      (11L, "hadoop software", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // We use a ParamGridBuilder to construct a grid of parameters to search over.
    // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
    // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
    // This will allow us to jointly choose parameters for all Pipeline stages.
    // A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    // Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
    // is areaUnderROC.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2) // Use 3+ in practice

    // Run cross-validation, and choose the best set of parameters.
    val cvModel = cv.fit(training)

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents. cvModel uses the best model found (lrModel).
    cvModel.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
      println(s"($id, $text) --> prob=$prob, prediction=$prediction")
    }

Personal understanding: cross-validation automatically selects the best of a set of parameter combinations and uses it to build the final, well-tuned model.

Example: model selection via train validation split




In addition to CrossValidator, Spark also provides TrainValidationSplit for hyperparameter tuning. TrainValidationSplit evaluates each parameter combination only once, as opposed to the k times of CrossValidator. It is therefore less expensive, but it will not produce as reliable results when the training dataset is not sufficiently large.

TrainValidationSplit takes an Estimator, a set of ParamMaps supplied via the estimatorParamMaps parameter, and an Evaluator. It begins by splitting the dataset into a training set and a validation set according to the trainRatio parameter. For example, with trainRatio = 0.75 (the default), TrainValidationSplit uses 75% of the data for training and 25% for validation. Like CrossValidator, TrainValidationSplit iterates through the set of ParamMaps; for each parameter combination, it trains the given Estimator and evaluates it using the given Evaluator. The ParamMap that produces the best evaluation metric is selected as the best option. TrainValidationSplit finally fits the Estimator with the best ParamMap on the entire dataset.




import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
// Prepare training and test data.
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

val lr = new LinearRegression()

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// TrainValidationSplit will try all combinations of values and determine best model using
// the evaluator.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// In this case the estimator is simply the linear regression.
// A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with combination of parameters
// that performed best.
model.transform(test)
  .select("features", "label", "prediction")
  .show()

Origin blog.csdn.net/ruiyiin/article/details/75909677