TF-IDF principle and use

1. What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

It is a weighting technique commonly used in information retrieval and text mining. TF-IDF is a statistical measure of how important a word is to a document in a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.

In short: the more often a word appears in one article, and the rarer it is across all documents, the more representative that word is of the article. This is exactly the idea behind TF-IDF.

Term frequency (TF) is the number of times a given word appears in a document. This count is usually normalized (typically by dividing it by the total number of words in the article) to prevent a bias towards long documents. (The same word tends to have a higher raw count in a long document than in a short one, regardless of whether the word is actually important.)

Note, however, that some common words contribute little to the topic, while some words that appear less often express the article's topic well, so raw TF alone is not suitable as a weight. The weighting must satisfy: the stronger a word's ability to predict the topic, the larger its weight, and vice versa. Across all documents, some words appear in only a few articles; such words say a great deal about the topic of those articles and should be given larger weights. This is exactly what IDF does.

Formula:

$$\mathrm{TF}_w = \frac{\text{number of occurrences of term } w \text{ in a given document}}{\text{total number of terms in that document}}$$


The main idea of inverse document frequency (IDF) is: the fewer documents contain the term t, the larger its IDF, meaning the term discriminates well between documents. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

 

Formula:

$$\mathrm{IDF}_w = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing term } w + 1}\right)$$

The denominator is incremented by 1 to avoid division by zero when a term appears in no document.

 

A high term frequency within a particular document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
  

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
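To make the formulas concrete, here is a minimal plain-Scala sketch (no Spark involved; the toy corpus and all numbers are invented for illustration) that computes TF-IDF exactly as defined above:

// Minimal TF-IDF over an in-memory toy corpus (illustrative only).
object TfIdfDemo extends App {
  val docs: Seq[Seq[String]] = Seq(
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the fish"
  ).map(_.split(" ").toSeq)

  val numDocs = docs.size.toDouble

  // Document frequency: in how many documents each term appears.
  val df: Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

  docs.zipWithIndex.foreach { case (doc, i) =>
    val total = doc.size.toDouble
    // TF: raw count normalized by the document's total term count.
    val tf: Map[String, Double] =
      doc.groupBy(identity).mapValues(_.size / total).toMap
    // TF-IDF, using the +1 smoothing from the IDF formula above.
    val tfidf = tf.map { case (term, f) =>
      term -> f * math.log(numDocs / (df(term) + 1.0))
    }
    println(s"doc $i: " + tfidf.toSeq.sortBy(-_._2).mkString(", "))
  }
}

Note that with the +1 only in the denominator, a term appearing in every document gets a slightly negative IDF (log(3/4) here). Spark's IDF, used below, computes log((numDocs + 1) / (df + 1)) instead, which is why the term occurring in all three sentences of the later example scores exactly 0.0.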

 

2. An example

Reference: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
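The linked post works through a real corpus; here is a quick hypothetical calculation in the same spirit (all numbers are invented). Suppose the word "spark" occurs 5 times in a 100-word document, and 10 of the 1,000 documents in the corpus contain it:

$$\mathrm{TF} = \frac{5}{100} = 0.05,\qquad \mathrm{IDF} = \log\frac{1000}{10+1} \approx 4.51,\qquad \mathrm{TF\text{-}IDF} \approx 0.05 \times 4.51 \approx 0.23$$

A ubiquitous word such as "the", contained in virtually every document, gets an IDF of about log(1000/1001) ≈ 0, so its TF-IDF stays near zero no matter how often it occurs.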


3. Implementation of TF-IDF in Spark

1. TF-IDF with the DataFrame-based spark.ml package (Spark 1.4.1)

// Adapted from the Spark official guide: http://spark.apache.org/docs/latest/ml-features.html#tf-idf
// In the following code segment, we start with a set of sentences.
// We split each sentence into words using Tokenizer. For each sentence (bag of words),
// we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale
// the feature vectors; this generally improves performance when using text as features.
// Our feature vectors could then be passed to a learning algorithm.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.feature.{Tokenizer,HashingTF, IDF}
import org.apache.spark.mllib.linalg.{Vectors, Vector}
// Create example data
val sentenceData = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

//  scala> sentenceData.show
//  +-----+--------------------+
//  |label|            sentence|
//  +-----+--------------------+
//  |    0|Hi I heard about ...|
//  |    0|I wish Java could...|
//  |    1|Logistic regressi...|
//  +-----+--------------------+

// Convert each sentence into an array of words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// scala> wordsData.show
//  +-----+--------------------+--------------------+
//  |label|            sentence|               words|
//  +-----+--------------------+--------------------+
//  |    0|Hi I heard about ...|ArrayBuffer(hi, i...|
//  |    0|I wish Java could...|ArrayBuffer(i, wi...|
//  |    1|Logistic regressi...|ArrayBuffer(logis...|
//  +-----+--------------------+--------------------+

// HashingTF hashes each word to a column index and counts raw term frequencies
// (it does not filter stop words). setNumFeatures(20) limits the feature vector
// to 20 hash buckets, so distinct words may collide.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// scala> featurizedData.show
//  +-----+--------------------+--------------------+--------------------+
//  |label|            sentence|               words|         rawFeatures|
//  +-----+--------------------+--------------------+--------------------+
//  |    0|Hi I heard about ...|ArrayBuffer(hi, i...|(20,[5,6,9],[2.0,...|
//  |    0|I wish Java could...|ArrayBuffer(i, wi...|(20,[3,5,12,14,18...|
//  |    1|Logistic regressi...|ArrayBuffer(logis...|(20,[5,12,14,18],...|
//  +-----+--------------------+--------------------+--------------------+


val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
// Extract the sparse vectors from the "features" column; a SparseVector is (size, indices, values)
// rescaledData.select("features").rdd.map(row => row.getAs[Vector](0)).map(x => x.toSparse.indices).collect
rescaledData.select("features", "label").take(3).foreach(println)

//  [(20,[5,6,9],[0.0,0.6931471805599453,1.3862943611198906]),0]
//  [(20,[3,5,12,14,18],[1.3862943611198906,0.0,0.28768207245178085,0.28768207245178085,0.28768207245178085]),0]
//  [(20,[5,12,14,18],[0.0,0.5753641449035617,0.28768207245178085,0.28768207245178085]),1]
// Here, 20 is the feature vector size (numFeatures), the first array holds the
// hash indices of the words, and the second array holds the corresponding TF-IDF values.
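The three stages above can also be chained with the ml Pipeline API, so that a single fit/transform pair runs tokenization, hashing, and IDF rescaling in order. A short sketch reusing the tokenizer, hashingTF, and idf objects defined above:

import org.apache.spark.ml.Pipeline

// Chain Tokenizer -> HashingTF -> IDF into a single estimator.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
val model = pipeline.fit(sentenceData)  // fits the IDF stage on the hashed TF vectors
model.transform(sentenceData).select("features", "label").show()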

2. TF-IDF with the RDD-based MLlib package

// Reference: http://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#tf-idf
// Further reading:
// http://blog.csdn.net/jiangpeng59/article/details/52786344

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

val sc: SparkContext = ...

// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
// Raw term frequencies as hashed feature vectors, one per document.
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
// Fit IDF over the whole corpus, then rescale the TF vectors.
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
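The linked MLlib guide also notes that IDF can ignore terms that occur in fewer than a minimum number of documents; such terms get an IDF of 0 and therefore a TF-IDF of 0:

// Ignore terms that appear in fewer than 2 documents.
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)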

4. Reference

http://blog.csdn.net/google19890102/article/details/29369793
http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
http://www.cnblogs.com/rausen/p/4142838.html
http://blog.csdn.net/jiangpeng59/article/details/52786344
