- Binarizer
- Binarization: the process of thresholding numerical features into binary (0/1) features.
- Binarizer (the binarization transformer provided by Spark ML): its parameters are inputCol, outputCol, and threshold. Feature values greater than the threshold are binarized to 1.0; values less than or equal to the threshold are binarized to 0.0. inputCol supports both the Vector and Double types.
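The thresholding rule described above can be sketched in plain Scala, without Spark. This is only an illustration of the rule, not Spark's implementation; the names `BinarizeRuleSketch`, `binarize`, and `binarizeAll` are hypothetical helpers introduced here. Note that a value exactly equal to the threshold maps to 0.0, because the comparison is strict.

```scala
// Plain-Scala sketch of the thresholding rule (an illustration, not Spark's code).
object BinarizeRuleSketch {
  // Hypothetical helper: 1.0 when the value is strictly above the threshold, else 0.0.
  def binarize(value: Double, threshold: Double): Double =
    if (value > threshold) 1.0 else 0.0

  // Element-wise version, mimicking how each entry of a vector column is treated.
  def binarizeAll(values: Seq[Double], threshold: Double): Seq[Double] =
    values.map(binarize(_, threshold))

  def main(args: Array[String]): Unit = {
    // 0.5 is equal to the threshold, so it is binarized to 0.0.
    val features = Seq(0.1, 0.8, 0.2, 0.5)
    println(binarizeAll(features, 0.5)) // List(0.0, 1.0, 0.0, 0.0)
  }
}
```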
Examples:
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

/**
 * Binarizer example.
 *
 * @author wangjuncheng
 */
// Named BinarizerExample so the object does not share a name with the imported Binarizer class.
object BinarizerExample extends App {
  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("ml_learn")
    // .enableHiveSupport()
    .getOrCreate()

  // Sample data: (id, feature)
  val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
  val dataframe = spark.createDataFrame(data).toDF("id", "feature")

  // Binarizer: feature values > 0.5 become 1.0, all others become 0.0
  val binarizer = new Binarizer()
    .setInputCol("feature")
    .setOutputCol("binarized_feature")
    .setThreshold(0.5)
  val binarizerDataFrame = binarizer.transform(dataframe)

  // Results
  println(binarizer.getThreshold)
  binarizerDataFrame.show()

  spark.stop()
}
Output result: