Spark ML: Binarizer

  • Binarizer
  • Binarization: the process of thresholding numerical features into binary (0/1) features.
  • Binarizer is the binarization transformer provided by Spark ML. Its parameters are inputCol, outputCol, and threshold. Feature values greater than the threshold are binarized to 1.0; values less than or equal to the threshold are binarized to 0.0. inputCol supports both Vector and Double types.
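
The threshold rule described above can be sketched in plain Scala, without Spark. This is only an illustration of the comparison, not Spark's implementation: `BinarizeSketch` and `binarize` are hypothetical names that do not exist in the Spark API.

```scala
// Hypothetical helper (not part of Spark): mirrors the rule described above.
object BinarizeSketch {

  // Strictly greater than the threshold -> 1.0; less than or equal -> 0.0.
  def binarize(value: Double, threshold: Double): Double =
    if (value > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val threshold = 0.5
    // 0.5 is included to show that a value equal to the threshold maps to 0.0.
    val features = Seq(0.1, 0.8, 0.2, 0.5)
    features.foreach { f =>
      println(s"$f -> ${binarize(f, threshold)}")
    }
  }
}
```

Note the boundary case: a value exactly equal to the threshold is mapped to 0.0, because the comparison is strict.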

Examples:

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

/**
  * Binarizer example: thresholds a numeric column into 0.0 / 1.0 values.
  *
  * @author wangjuncheng
  */
// Named BinarizerExample so it does not shadow the imported Binarizer class
object BinarizerExample extends App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("ml_learn")
    // .enableHiveSupport()  // uncomment if Hive support is needed
    .getOrCreate()

  // Sample data: (id, feature)
  val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
  val dataframe = spark.createDataFrame(data).toDF("id", "feature")

  // Binarizer: features greater than 0.5 become 1.0, all others 0.0
  val binarizer = new Binarizer()
    .setInputCol("feature")
    .setOutputCol("binarized_feature")
    .setThreshold(0.5)

  val binarizerDataFrame = binarizer.transform(dataframe)

  // Results: print the threshold, then show the transformed DataFrame
  println(binarizer.getThreshold)
  binarizerDataFrame.show()
  
  spark.stop()
}

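Output result:
The program first prints the threshold, 0.5, and then the DataFrame with the new binarized_feature column: 0.0 for the features 0.1 and 0.2, and 1.0 for 0.8. (The original post showed this as a screenshot.)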
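
Since inputCol also accepts a Vector column, the same transformer can be applied to a vector of features. The following is a minimal sketch under that assumption; the object name `BinarizerVectorSketch`, the method `binarizeVectors`, the column names, and the sample values are all illustrative and do not come from the original post.

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; all names and data here are illustrative.
object BinarizerVectorSketch {

  // Applies one threshold to every element of the vector column.
  def binarizeVectors(spark: SparkSession): DataFrame = {
    val data = Seq(
      (0, Vectors.dense(0.1, 0.8, 0.2)),
      (1, Vectors.dense(0.5, 0.6, 0.0))
    )
    val dataframe = spark.createDataFrame(data).toDF("id", "features")

    val binarizer = new Binarizer()
      .setInputCol("features")
      .setOutputCol("binarized_features")
      .setThreshold(0.5)

    binarizer.transform(dataframe)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("binarizer_vector_sketch")
      .getOrCreate()

    // truncate = false so the vectors are printed in full
    binarizeVectors(spark).show(false)

    spark.stop()
  }
}
```

Each element is compared with the threshold independently, so both rows should binarize to the values 0.0, 1.0, 0.0. Spark may display the resulting vectors in either dense or sparse notation.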

Origin blog.csdn.net/qq_33891419/article/details/103804499