16 Classification: using logistic regression to predict whether a sentence contains a given word

Commonly used classifiers include the logistic regression classifier and the decision tree classifier. This article describes how to use logistic regression in Scala to determine whether a sentence contains a given word.

1 Systems, software, and prerequisites

  • A CentOS 7 64-bit workstation with IP 192.168.100.200 and hostname danji; readers should adjust these to their own environment
  • Spark access to Hive has been configured, as described at:
    https://www.jianshu.com/p/3965abe4d593
  • To rule out the effects of permissions, all operations are performed as root; Hadoop and Spark have been started
  • Ensure that Spark is started and that the Hive metastore is started (a quick check follows this list)
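
To quickly confirm that spark-shell can reach Hive, one minimal check (assuming the metastore is up) is to list the databases from inside spark-shell:

spark.sql("show databases").show()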

2 Procedure

  • 1 Use Xshell to log on to 192.168.100.200
  • 2 Enter the spark-shell command line
  • 3 Execute the following statements:
import org.apache.spark.ml.feature._ 
import org.apache.spark.ml.classification.LogisticRegression 
import org.apache.spark.ml.{Pipeline,PipelineModel} 
import org.apache.spark.ml.linalg.Vector 
import org.apache.spark.sql.Row
// Training set: label 1.0 marks sentences that contain the word "spark"
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0), (1L, "b d", 0.0), (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0), (4L, "apache spark", 1.0), (5L, "hello spark", 1.0)
)).toDF("id", "text", "label")
// Split each sentence into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Hash the words into a 1000-dimensional term-frequency feature vector
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
// Logistic regression: at most 10 iterations, regularization parameter 0.01
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
// Chain the three stages into a pipeline and fit it on the training set
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
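// Optional: inspect the fitted logistic regression stage. This is a sketch,
// not part of the original walkthrough; stage index 2 is lr in the pipeline above.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrModel = model.stages(2).asInstanceOf[LogisticRegressionModel]
println(s"intercept=${lrModel.intercept}, coefficient vector size=${lrModel.coefficients.size}")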
// Test set: unlabeled sentences to classify
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"), (5L, "l m n"), (6L, "spark a"), (7L, "apache hadoop"),
  (8L, "apache hadoop"), (9L, "apache spark"), (10L, "apache hadoop"), (11L, "apache spark hadoop")
)).toDF("id", "text")
// Apply the fitted pipeline to the test set and print each prediction
model.transform(test).select("id", "text", "probability", "prediction").collect().foreach {
  case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}

The observed results (each probability vector gives [P(label=0), P(label=1)]):

(4, spark i j k) --> prob=[0.06514785966116181,0.9348521403388381], prediction=1.0
(5, l m n) --> prob=[0.6594623918792804,0.3405376081207197], prediction=0.0
(6, spark a) --> prob=[0.016899270159272606,0.9831007298407275], prediction=1.0
(7, apache hadoop) --> prob=[0.672723276314924,0.3272767236850759], prediction=0.0
(8, apache hadoop) --> prob=[0.672723276314924,0.3272767236850759], prediction=0.0
(9, apache spark) --> prob=[0.013955984619361126,0.9860440153806388], prediction=1.0
(10, apache hadoop) --> prob=[0.672723276314924,0.3272767236850759], prediction=0.0
(11, apache spark hadoop) --> prob=[0.06887499770773359,0.9311250022922664], prediction=1.0

From the results we can see that the model we built correctly predicts whether a sentence contains the word "spark".
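
If you want to reuse the trained model without refitting it, the fitted pipeline can be saved to disk and loaded back later. Below is a minimal sketch; the path /tmp/spark-lr-model is an arbitrary example, not from the original article:

// Persist the fitted pipeline (tokenizer + hashingTF + lr)
model.write.overwrite().save("/tmp/spark-lr-model")
// Load it back (PipelineModel is already imported above) and reuse it
val sameModel = PipelineModel.load("/tmp/spark-lr-model")
sameModel.transform(test).select("id", "text", "prediction").show()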
