Spark CrossValidator

1. Overview

An important task in ML is model selection: using the data to find the best model or parameters for a given task. This is also called tuning.

Tuning can be done for a single estimator (e.g. LogisticRegression) or for an entire Pipeline, which may include multiple algorithms, feature transformers, and other steps. Users can tune the entire Pipeline at once, rather than tuning each element of the Pipeline separately.

MLlib supports model selection with tools such as CrossValidator and TrainValidationSplit. These tools require the following:

  •     Estimator: the algorithm or Pipeline to tune
  •     A set of ParamMaps: the parameter choices, sometimes called a "parameter grid", to search over
  •     Evaluator: a metric that measures how well a fitted Model performs on held-out test data


At a high level, these model selection tools work as follows:

  •     They split the input data into separate training and test datasets.
  •     For each (training, test) pair, they iterate over the set of ParamMaps: for each ParamMap, they fit the Estimator with those parameters, obtain the fitted Model, and then evaluate the Model's performance with the Evaluator.
  •     They select the Model produced by the best-performing set of parameters (a minimal sketch of this workflow follows below).
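
As a concrete illustration of this workflow, the following is a minimal sketch using TrainValidationSplit, the simpler of the two tools, which evaluates each ParamMap against a single (training, test) split. The LinearRegression estimator and the data DataFrame here are placeholder assumptions, not part of the full example in section 4.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// The Estimator to tune (any Estimator or Pipeline works here).
val lr = new LinearRegression().setMaxIter(10)

// The "parameter grid" to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept) // for a boolean param, both true and false are tried
  .build()

// Wire the three required pieces together: Estimator, ParamMaps, Evaluator.
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new RegressionEvaluator)
  .setTrainRatio(0.8) // 80% of the data for training, 20% for testing

// val model = tvs.fit(data) // `data`: a DataFrame with "features" and "label" columns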


The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems.

The default metric used to select the optimal ParamMap can be overridden with each evaluator's setMetricName method.
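
For example, a minimal sketch (the metric names shown are the standard ones for these evaluators):

import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

// Override the metric used when comparing ParamMaps.
val binEval = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR") // default is "areaUnderROC"

val regEval = new RegressionEvaluator()
  .setMetricName("mae") // default is "rmse"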

2. Cross-Validation

CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. For example, with k = 3 folds, CrossValidator generates three (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.

To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric over the three Models produced by fitting the Estimator on the three different (training, test) dataset pairs.
After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap on the entire dataset.
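
After fitting, the per-ParamMap averages and the re-fit best model can be inspected. A minimal sketch, assuming a fitted CrossValidatorModel named cvModel as in the code in section 4:

// avgMetrics(i) is the evaluation metric for the i-th ParamMap, averaged over the k folds.
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).foreach {
  case (params, metric) => println(s"$params -> average metric = $metric")
}

// bestModel is the Model produced by re-fitting the Estimator on the
// entire dataset with the best-performing ParamMap.
val best = cvModel.bestModel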

3. When to Use Cross-Validation

Cross-validation is most useful when the dataset is relatively small: it "makes full use of" the limited data to find suitable model parameters and helps prevent overfitting. It is generally used less often when running deep learning on large standard datasets.

4. Code

package com.home.spark.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.ml.linalg.Vector

/**
  * @Description: cross-validation for selecting the best model parameters.
  * Note that cross-validation over a grid of parameters is expensive.
  * In the example below, the grid has 3 values for hashingTF.numFeatures and
  * 2 values for lr.regParam, and CrossValidator uses 2 folds. This multiplies
  * out to (3 × 2) × 2 = 12 different models being trained. In real settings,
  * trying many more parameters and using more folds (k = 3 and k = 10 are
  * common) is typical. In other words, using CrossValidator can be very
  * expensive. However, it is also a well-established method for choosing
  * parameters, and is statistically sounder than heuristic hand-tuning.
  */
object Ex_CrossValidator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).setAppName("spark ml model selection").setMaster("local[2]")
    val spark = SparkSession.builder().config(conf).getOrCreate()

//    import spark.implicits._

    // Prepare training data from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0),
      (4L, "b spark who", 1.0),
      (5L, "g d a y", 0.0),
      (6L, "spark fly", 1.0),
      (7L, "was mapreduce", 0.0),
      (8L, "e spark program", 1.0),
      (9L, "a e c l", 0.0),
      (10L, "spark compile", 1.0),
      (11L, "hadoop software", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // We use a ParamGridBuilder to construct a grid of parameters to search over.
    // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
    // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
    // This will allow us to jointly choose parameters for all Pipeline stages.
    // A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    // Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
    // is areaUnderROC.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2)  // Use 3+ in practice
      .setParallelism(2)  // Evaluate up to 2 parameter settings in parallel

    // Run cross-validation, and choose the best set of parameters.
    val cvModel = cv.fit(training)

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents. cvModel uses the best model found (lrModel).
    cvModel.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }


    spark.stop()
  }
}

Result:

(4, spark i j k) --> prob=[0.25806842225846466,0.7419315777415353], prediction=1.0
(5, l m n) --> prob=[0.9185597412653913,0.08144025873460858], prediction=0.0
(6, mapreduce spark) --> prob=[0.43203205663918753,0.5679679433608125], prediction=1.0
(7, apache hadoop) --> prob=[0.6766082856652199,0.32339171433478003], prediction=0.0
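
To see which grid values were actually selected, the stages of the best PipelineModel can be inspected. A hedged sketch (the stage indices match the tokenizer/hashingTF/lr pipeline built above):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.HashingTF

// stages(1) is the HashingTF transformer; stages(2) is the fitted LogisticRegressionModel.
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val bestHashingTF = bestPipeline.stages(1).asInstanceOf[HashingTF]
val bestLRModel = bestPipeline.stages(2).asInstanceOf[LogisticRegressionModel]

println(s"best numFeatures = ${bestHashingTF.getNumFeatures}")
println(s"best regParam = ${bestLRModel.getRegParam}")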

 
