TF-IDF: Principles and ML Study Notes

0x00 What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

# TF-IDF is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a term is to a document within a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.

In short: the more often a word appears in an article, and the fewer documents in the whole corpus it appears in, the better it represents that article.

This is the meaning of TF-IDF.

Term frequency (TF) is the number of times a given word appears in a document. This count is usually normalized (typically the raw count divided by the total number of words in the document) to prevent a bias toward longer documents. (The same word tends to have a higher raw count in a long document than in a short one, regardless of whether the word is actually important.)
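
As a quick illustration (not from the original post), normalized term frequency can be computed like this in Scala; tf and doc are just illustrative names:

// Normalized term frequency: occurrences of the term divided by the total number of words.
def tf(term: String, doc: Seq[String]): Double =
  doc.count(_ == term).toDouble / doc.size

val doc = "the cow jumps over the moon the end".split(" ").toSeq
println(tf("the", doc))  // 3/8 = 0.375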

# Note, however, that some very common words contribute little to the topic, while some words that appear less frequently express the topic of an article much better, so using TF alone is not enough. The weighting scheme must satisfy: the stronger a word's ability to predict the topic, the larger its weight should be, and vice versa. If, across all articles, a word appears in only a small number of them, then it plays a significant role in identifying the topic of those articles and should be given a larger weight. IDF does exactly this job.

 

Inverse document frequency (IDF) captures the following idea: the fewer documents contain a term t, the larger its IDF and the better that term distinguishes between documents (or categories). The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient.
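
A matching sketch for IDF, under the same illustrative assumptions (docs is any collection of tokenized documents; real implementations add smoothing to avoid dividing by zero when a term appears nowhere):

// IDF: log of (total number of documents / number of documents containing the term).
def idf(term: String, docs: Seq[Seq[String]]): Double =
  math.log(docs.size.toDouble / docs.count(_.contains(term)))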

 

A word with a high frequency in a given document but a low document frequency across the whole collection receives a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
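
Putting the two sketches above together, the TF-IDF weight is simply the product of the two quantities; again this is only an illustration, not the post's own code:

def tfIdf(term: String, doc: Seq[String], docs: Seq[Seq[String]]): Double =
  tf(term, doc) * idf(term, docs)

// A word that appears in every document gets idf = log(1) = 0,
// so its TF-IDF weight is 0 no matter how often it appears in one document.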

 

 

0x01 Examples

Reference:  http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

  One: There are many different formulas for computing TF-IDF; this example uses the formulas described above. Term frequency (TF) is the number of times a word appears divided by the total number of words in the document. If a document contains 100 words and the word "cow" appears 3 times, the term frequency of "cow" in that document is 3/100 = 0.03. The inverse document frequency (IDF) is obtained by dividing the total number of documents in the collection by the number of documents containing "cow", and taking the logarithm. If "cow" appears in 1,000 documents out of a collection of 10,000,000, its inverse document frequency is log(10,000,000 / 1,000) = 4. The final TF-IDF score is 0.03 * 4 = 0.12.
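
  A quick check of this arithmetic (a throwaway sketch; the example uses base-10 logarithms):

  val tfCow  = 3.0 / 100                        // 0.03
  val idfCow = math.log10(10000000.0 / 1000)    // 4.0
  println(tfCow * idfCow)                       // 0.12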

 

  Two: The relevance of a document to the query keywords k1, k2, k3 is TF1 * IDF1 + TF2 * IDF2 + TF3 * IDF3. For example, suppose document1 contains 1,000 terms in total, and k1, k2, k3 appear in it 100, 200 and 50 times respectively. The numbers of documents containing k1, k2, k3 are 1,000, 10,000 and 5,000 respectively, and the whole collection contains 10,000 documents. Then TF1 = 100/1000 = 0.1, TF2 = 200/1000 = 0.2, TF3 = 50/1000 = 0.05, and (using natural logarithms) IDF1 = log(10000/1000) = log(10) = 2.3, IDF2 = log(10000/10000) = log(1) = 0, IDF3 = log(10000/5000) = log(2) = 0.69. The relevance of the keywords k1, k2, k3 to document1 is therefore 0.1 * 2.3 + 0.2 * 0 + 0.05 * 0.69 = 0.2645, where k1 and k3 carry most of the weight and k2 contributes nothing.
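
  The same check in code, this time with natural logarithms as in the example:

  val tfs  = Seq(0.1, 0.2, 0.05)
  val idfs = Seq(math.log(10000.0 / 1000), math.log(10000.0 / 10000), math.log(10000.0 / 5000))
  println(tfs.zip(idfs).map { case (t, i) => t * i }.sum)  // ≈ 0.2649 (0.2645 in the text, which rounds the IDFs first)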

 

  Three: In a web page containing 1,000 words in total, "atomic energy", "the" and "application" appear 2 times, 35 times and 5 times respectively, so their term frequencies are 0.002, 0.035 and 0.005. Adding these three numbers gives 0.042, a simple measure of the relevance between this page and the query "application of atomic energy". In general, if a query contains the keywords w1, w2, ..., wN and their term frequencies in a particular web page are TF1, TF2, ..., TFN (TF: term frequency), then the relevance of the query to that page is TF1 + TF2 + ... + TFN.

  Readers may have spotted a loophole. In the example above, the word "the" accounts for more than 80% of the total term frequency, yet it is almost useless for determining the topic of the page. Such words are called stop words, meaning their frequency should not be counted when measuring relevance. In Chinese, stop words include "的", "是", "和", "中" and a few dozen others. After ignoring the stop words, the relevance of the page above becomes 0.007, of which "atomic energy" contributes 0.002 and "application" contributes 0.005. A careful reader may notice another, smaller loophole: in Chinese, "application" is a very common word while "atomic energy" is a very specialized one, and the latter should matter more in a relevance ranking. We therefore need to assign every word a weight, and the weights must satisfy the following two conditions:

    1. The stronger a word's ability to predict the topic, the larger its weight should be, and vice versa. If we see "atomic energy" in a web page, we learn something about the topic of the page; seeing "application" once tells us essentially nothing. Therefore the weight of "atomic energy" should be larger than that of "application".

    2. The weight of a stop word should be zero.

 

  It is easy to see that if a keyword appears in only very few pages, it lets us narrow down the search target more easily, so it should carry a large weight. Conversely, if a word appears in a huge number of pages, seeing it tells us little about what we are looking for, so its weight should be small. Roughly speaking, if a keyword w appears in Dw pages, then the larger Dw is, the smaller the weight of w should be, and vice versa. In information retrieval, the most widely used weight is the inverse document frequency (IDF), whose formula is log(D / Dw), where D is the total number of pages. For example, assume the number of Chinese web pages is D = 1 billion, and the stop word "的" appears in all of them, so Dw = 1 billion; its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears in 2 million pages, i.e. Dw = 2 million, its weight is IDF = log(500) = 6.2. Suppose the common word "application" appears in 500 million pages; its weight is then only IDF = log(2) = 0.7. In other words, a single match of "atomic energy" in a web page counts as much as more than nine matches of "application". Using IDF, the relevance formula above changes from a simple sum of term frequencies into a weighted sum, TF1 * IDF1 + TF2 * IDF2 + ... + TFN * IDFN. In the example above, the relevance between the page and "application of atomic energy" becomes 0.0161, of which "atomic energy" contributes 0.0126 and "application" only 0.0035. This ratio matches our intuition much better.
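
  A quick sanity check of these figures (a throwaway sketch; the small difference from the quoted 0.0161 comes from rounding the IDFs in the text):

  val idfAtomic = math.log(1e9 / 2e6)  // ln(500) ≈ 6.21
  val idfApply  = math.log(1e9 / 5e8)  // ln(2)   ≈ 0.69
  println(0.002 * idfAtomic + 0.005 * idfApply)  // ≈ 0.0159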

 

0x02 TF-IDF implementation in Spark

2.1 TF-IDF in the spark 1.4.1 ml package

# Reference: the official Spark guide, http://spark.apache.org/docs/latest/ml-features.html#tf-idf

// In the following code segment, we start with a set of sentences.
// We split each sentence into words using Tokenizer. For each sentence (bag of words),
// we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale
// the feature vectors; this generally improves performance when using text as features.
// Our feature vectors could then be passed to a learning algorithm.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF}
import org.apache.spark.mllib.linalg.{Vectors, Vector}

// Create the example data (sqlContext is available in the spark-shell)
val sentenceData = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")
// scala> sentenceData.show
// +-----+--------------------+
// |label|            sentence|
// +-----+--------------------+
// |    0|Hi I heard about ...|
// |    0|I wish Java could...|
// |    1|Logistic regressi...|
// +-----+--------------------+

// Split each sentence into an array of words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
// scala> wordsData.show
// +-----+--------------------+--------------------+
// |label|            sentence|               words|
// +-----+--------------------+--------------------+
// |    0|Hi I heard about ...|ArrayBuffer(hi, i...|
// |    0|I wish Java could...|ArrayBuffer(i, wi...|
// |    1|Logistic regressi...|ArrayBuffer(logis...|
// +-----+--------------------+--------------------+

// Use hashing to compute raw TF values; setNumFeatures(20) caps the feature space at 20 hash buckets
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// scala> featurizedData.show
// +-----+--------------------+--------------------+--------------------+
// |label|            sentence|               words|         rawFeatures|
// +-----+--------------------+--------------------+--------------------+
// |    0|Hi I heard about ...|ArrayBuffer(hi, i...|(20,[5,6,9],[2.0,...|
// |    0|I wish Java could...|ArrayBuffer(i, wi...|(20,[3,5,12,14,18...|
// |    1|Logistic regressi...|ArrayBuffer(logis...|(20,[5,12,14,18],...|
// +-----+--------------------+--------------------+--------------------+

// Fit the IDF model and rescale the raw term frequencies
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

// The features column holds sparse vectors: SparseVector(size, indices, values).
// To extract just the hash indices of each document's terms:
// rescaledData.select("features").rdd.map(row => row.getAs[Vector](0)).map(_.toSparse.indices).collect
rescaledData.select("features", "label").take(3).foreach(println)
// [(20,[5,6,9],[0.0,0.6931471805599453,1.3862943611198906]),0]
// [(20,[3,5,12,14,18],[1.3862943611198906,0.0,0.28768207245178085,0.28768207245178085,0.28768207245178085]),0]
// [(20,[5,12,14,18],[0.0,0.5753641449035617,0.28768207245178085,0.28768207245178085]),1]
// Here 20 is the size of the feature space, each index is the hash bucket of a word,
// and the values are the final TF-IDF scores.
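
To make sense of those numbers: Spark's IDF uses the smoothed formula idf = ln((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. A quick check against the three sentences above, assuming that formula:

println(math.log((3.0 + 1) / (3 + 1)))      // 0.0    -> a term present in all 3 documents (e.g. bucket 5) is zeroed out
println(math.log((3.0 + 1) / (1 + 1)))      // 0.6931 -> a term present in exactly 1 document
println(2 * math.log((3.0 + 1) / (1 + 1)))  // 1.3862 -> the same IDF times a raw count of 2
println(math.log((3.0 + 1) / (2 + 1)))      // 0.2877 -> a term present in 2 of the 3 documents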

 

2.2 TF-IDF in the RDD-based MLlib package

# Reference: http://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#tf-idf
# Further reading: http://blog.csdn.net/jiangpeng59/article/details/52786344

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

val sc: SparkContext = ...

// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
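
MLlib's IDF also provides a minDocFreq option for ignoring terms that occur in too few documents; a brief sketch of that variant, continuing from the tf computed above:

// Ignore terms that appear in fewer than 2 documents when fitting the IDF.
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)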


Origin www.cnblogs.com/JetpropelledSnake/p/12081053.html