Spark machine learning (ML Pipeline)

1. Business objective: train a model and use it to tag the data to be processed

In the training samples, any text string containing "hello" is tagged 1, everything else is tagged 0.
The expectation is that a model trained with this machine-learning approach applies the same tagging to the test data.
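
The labeling rule itself fits in one line of Scala; a minimal sketch (the label helper below is only illustrative, the samples in this post were labeled by hand):

// Illustrative only: label 1.0 if the text contains "hello", otherwise 0.0
def label(text: String): Double =
  if (text.toLowerCase.contains("hello")) 1.0 else 0.0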

 

2. Training samples: sample.txt

Three columns (id, text, label); text containing hello is labeled 1.0, otherwise 0.0:

0,why hello world java,1.0
1,what llo java jsp,0.0
2,test hello2 scala,0.0
3,abc spark hello,1.0
4,j hello c#,1.0
5,i spark java hell,0.0
6,i java hell spark

3. Data to be tagged: w1.txt

0,hello world
1,hello java test num
2,test hello scala
3,j hello spark
4,abc hello c#
5,hell java spark
6,hello java spark
7,num he he java spark
8,hello2 java spark
9,hello do some thing java spark
10,world hello java spark

4. Code

4.1 Dependencies

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>
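
If the project is built with sbt rather than Maven, the equivalent dependencies would look roughly like this (same Spark 2.4.4 artifacts for Scala 2.11 as above; adjust to your own build):

// build.sbt (sketch)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.4",
  "org.apache.spark" %% "spark-sql"       % "2.4.4",
  "org.apache.spark" %% "spark-streaming" % "2.4.4",
  "org.apache.spark" %% "spark-mllib"     % "2.4.4"
)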

4.2 Implementation

package com.home.spark.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector

/**
  * @Description: machine learning, train a model and use it to tag data.
  * In the training samples, text containing "hello" is tagged 1, otherwise 0.
  * The trained model is expected to tag the test data the same way.
  */
object Ex_label {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).setMaster("local[*]").setAppName("spark ml label")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val error_count = spark.sparkContext.longAccumulator("error_count")

    // Load the training data (labeled by hand): rows containing "hello" carry label 1.0, rows without it 0.0
    val lineRDD: RDD[String] = spark.sparkContext.textFile("input/sample.txt")

    // Converting an RDD to a DS/DF requires the implicit conversions of the SparkSession instance.
    // Note: "spark" here is not a package name but the name of the SparkSession instance.
    import spark.implicits._


    // Generate the training data; the label column must be a Double
    val training: DataFrame = lineRDD.map(line => {
      val strings: Array[String] = line.split(",")
      if (strings.length == 3) {
        (strings(0), strings(1), strings(2).toDouble)
      }
      else {
        error_count.add(1)
        ("-1", strings.mkString(" "), 0.0)
      }

    }).filter(s => !s._1.equals("-1"))
      .toDF("id", "text", "label")

    training.printSchema()
    training.show()

    println(s"Data Error Count: $ {error_count.value} ") 


    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    // Transformer: converts the input text to lowercase and splits it on spaces into words
    val tokenizer: Tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Transformer: hashes the words into term frequencies, producing a feature vector
    val hashTF: HashingTF = new HashingTF().setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol).setOutputCol("features")

    // Estimator: logistic regression with at most 10 iterations
    val lr: LogisticRegression = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    // fit() takes a DataFrame as input and produces a model.
    // The pipeline packages two transformers and one estimator (the learning algorithm);
    // because it contains an estimator, it must be trained to obtain the final model.
    val pipeline: Pipeline = new Pipeline().setStages(Array(tokenizer, hashTF, lr))


    // Fit the pipeline to the training documents.
    // Training produces the final model.
    val model: PipelineModel = pipeline.fit(training)

    // Optionally save the fitted model to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")
    // and load it back later
    //    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")


    // The unfitted pipeline can also be saved to disk
    //    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // and loaded back later
    //    val samePipeline = Pipeline.load("/tmp/unfit-lr-model")


    // Load the data to be analyzed
    val testRDD: RDD[String] = spark.sparkContext.textFile("input/w1.txt")
    val test: DataFrame = testRDD.map(line => {
      val strings: Array[String] = line.split(",")
      if (strings.length == 2) {
        (strings(0), strings(1))
      }
      else {
        //        error_count.add(1)
        ("-1", strings.mkString(" "))
      }

    }).filter(s => !s._1.equals("-1"))
      .toDF("id", "text")

    // Make predictions on the given data
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach {
        case Row(id: String, text: String, prob: Vector, prediction: Double) =>
          println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }

    spark.stop()

  }
}
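
The commented-out load calls in the listing show that the fitted model can be reused later without retraining; a minimal sketch, assuming the save path from the listing and an existing SparkSession and test DataFrame:

// Reload the fitted pipeline model from disk and apply it to new data
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
sameModel.transform(test)
  .select("id", "text", "prediction")
  .show(false)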

/* Output:
(0, hello world) --> prob=[0.02467400198786794,0.975325998012132], prediction=1.0
(1, hello java test num) --> prob=[0.48019580016300345,0.5198041998369967], prediction=1.0
(2, test hello scala) --> prob=[0.6270035488150222,0.3729964511849778], prediction=0.0   // mispredicted: not enough sample data, or interfering samples
(3, j hello spark) --> prob=[0.031182836719302286,0.9688171632806978], prediction=1.0
(4, abc hello c#) --> prob=[0.006011466954209337,0.9939885330457907], prediction=1.0
(5, hell java spark) --> prob=[0.9210765571223096,0.07892344287769032], prediction=0.0
(6, hello java spark) --> prob=[0.1785326777978406,0.8214673222021593], prediction=1.0
(7, num he he java spark) --> prob=[0.6923088930430097,0.30769110695699026], prediction=0.0
(8, hello2 java spark) --> prob=[0.9016001424620457,0.09839985753795444], prediction=0.0
(9, hello do some thing java spark) --> prob=[0.1785326777978406,0.8214673222021593], prediction=1.0
(10, world hello java spark) --> prob=[0.05144953292014106,0.9485504670798589], prediction=1.0
*/
// probability is the predicted probability vector: the first value is the degree of non-match, the second the degree of match;
// which prediction label gets assigned depends on how strict the threshold configured in the model is.
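
If stricter (or looser) tagging is wanted, the decision threshold can be set on the logistic regression stage before fitting; a minimal sketch (the 0.6 value is just an illustrative assumption, the default is 0.5):

// Predict 1.0 only when the probability of label 1 exceeds 0.6
val strictLr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setThreshold(0.6)
val strictModel = new Pipeline().setStages(Array(tokenizer, hashTF, strictLr)).fit(training)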

 

 
