Tokenizer in Spark ML

The Tokenizer in Spark ML

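  • Tokenization is the process of splitting text, such as a sentence, into individual words. Spark ML provides the Tokenizer class for this; it lowercases the input and splits it on whitespace. RegexTokenizer offers more advanced tokenization based on regular expression matching. By default, the pattern parameter (default regex: "\s+") is treated as a delimiter for splitting the input text. Alternatively, setting the gaps parameter to false makes pattern describe the tokens themselves rather than the delimiters, so every match found is returned as a token. Which mode to use depends mainly on how your own data needs to be split.
    Sample code, which is also the example given in the official documentation: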
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

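/**
  * Tokenizer and RegexTokenizer example.
  *
  * @author wjc
  */
// Named TokenizerExample so the object does not share a name with the imported Tokenizer class.
object TokenizerExample extends App {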

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("ml_learn")
    // .enableHiveSupport()
    .getOrCreate()

  val sentenceDataFrame = spark.createDataFrame(Seq(
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
  )).toDF("id","sentence")

  sentenceDataFrame.show(false)

  // Tokenizer instance
  val tokenizer = new Tokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
  // RegexTokenizer: the pattern is the delimiter, so this splits on non-word characters
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")
  // Alternatively, set gaps to false so that pattern describes the tokens rather than the delimiters
  val regexTokenizer2 = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\w+")
    .setGaps(false)

  // UDF that counts the tokens in each row
  val countTokens = udf { (words: Seq[String]) => words.length }
  // Result of Tokenizer
  val tokenized = tokenizer.transform(sentenceDataFrame)
  tokenized.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words"))).show(false)

  // Result of regexTokenizer (pattern used as delimiter)
  val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
  regexTokenized.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words"))).show(false)

  // Result of regexTokenizer2 (pattern matches the tokens, gaps = false)
  val regexTokenized2 = regexTokenizer2.transform(sentenceDataFrame)
  regexTokenized2.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words"))).show(false)
}
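
The difference between the two modes can also be tried out without starting a Spark session. What follows is only a minimal sketch in plain Scala, assuming the tokenizer lowercases the input and then either splits on the pattern (gaps = true) or collects every match of the pattern (gaps = false). The names TokenizeSketch and tokenize are hypothetical and exist only for illustration; they are not part of Spark's API.

```scala
object TokenizeSketch {

  // Hypothetical helper (not a Spark API): imitates the two RegexTokenizer modes.
  // gaps = true  -> pattern is the delimiter between tokens
  // gaps = false -> pattern describes the tokens themselves
  def tokenize(text: String, pattern: String, gaps: Boolean): List[String] = {
    val re = pattern.r
    val lower = text.toLowerCase
    val tokens =
      if (gaps) re.split(lower).toList
      else re.findAllIn(lower).toList
    // Drop empty strings, which splitting can produce at the start of the input
    tokens.filter(_.nonEmpty)
  }

  def main(args: Array[String]): Unit = {
    val sentence = "Logistic,regression,models,are,neat"

    // Whitespace delimiter: the sentence has no spaces, so it stays one token
    println(tokenize(sentence, "\\s+", gaps = true).length)   // 1

    // Non-word delimiter: the commas separate five tokens
    println(tokenize(sentence, "\\W", gaps = true).length)    // 5

    // Token pattern: five runs of word characters are matched
    println(tokenize(sentence, "\\w+", gaps = false).length)  // 5
  }
}
```

For the comma-separated sentence, the whitespace split leaves a single token, while both regex variants yield five, which is why the choice of pattern and gaps should follow the structure of your own data.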

Output
[Image: output of the run]

Reposted from blog.csdn.net/qq_33891419/article/details/103767629