Spark ML: StopWordsRemover

  • Stop words are words that appear frequently in a document but carry little meaning; they should not participate in the algorithm's computation.
  • StopWordsRemover removes the stop words from a sequence of input strings (typically the output of a Tokenizer).
  • The stop-word list is specified by the stopWords parameter. For some languages, a default list can be loaded by calling StopWordsRemover.loadDefaultStopWords(language); the available options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish", and "turkish".
  • The boolean parameter caseSensitive indicates whether matching is case sensitive; the default is false.
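The parameters above can be sketched together in a short snippet. This is a minimal illustration, not part of the original example; the column names "raw" and "filtered" and the extra word "saw" in the custom list are placeholders chosen here for demonstration:

```scala
import org.apache.spark.ml.feature.StopWordsRemover

// Load the built-in English stop-word list (a plain Array[String]);
// no SparkSession is needed for this static call.
val english: Array[String] = StopWordsRemover.loadDefaultStopWords("english")

// A remover configured with a customized list and case-sensitive matching.
val customRemover = new StopWordsRemover()
  .setStopWords(english ++ Array("saw")) // extend the default list
  .setCaseSensitive(true)                // default is false
  .setInputCol("raw")
  .setOutputCol("filtered")
```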

Example:

import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

/**
  *
  * @author wangjuncheng
  *   StopWordsRemover  stop-word remover
  **/
object StopWordsRemoverExample extends App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("ml_learn")
//  .enableHiveSupport()
    .getOrCreate()
  val dataSet = spark.createDataFrame(Seq(
      (0, Seq("I", "saw", "the", "red", "baloon")),
      (1, Seq("Mary", "had", "a", "little", "lamb"))
    )).toDF("id","row")

  // StopWordsRemover
  val remover = new StopWordsRemover()
      .setInputCol("row")
      .setOutputCol("filtered")

  remover.transform(dataSet).show(false)
  spark.stop()
}
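With the default English list and case-insensitive matching, the run above should drop "I", "the", "had", and "a". The expected console output of `show(false)` (based on the equivalent example in the Spark feature-transformer documentation) is roughly:

```
+---+----------------------------+-------------------+
|id |row                         |filtered           |
+---+----------------------------+-------------------+
|0  |[I, saw, the, red, baloon]  |[saw, red, baloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+-------------------+
```

Note that "I" is removed even though it is uppercase, because caseSensitive defaults to false.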
Origin blog.csdn.net/qq_33891419/article/details/103777678