TF-IDF and its algorithm

Abstract: TF-IDF (term frequency–inverse document frequency)

Concept

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. It is a statistical method for assessing how important a word is to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus. Search engines often apply some form of TF-IDF weighting to score the relevance between documents and user queries; in addition, they use link-analysis-based ranking methods to determine the order in which documents appear in search results.

  

Principle

In a given document, term frequency (TF) is the number of times a given word appears in that document. This count is usually normalized (divided by the total number of terms in the document) to prevent a bias toward long documents: the same word is likely to have a higher raw count in a long document than in a short one, regardless of how important the word actually is.

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient.


The main idea of TF-IDF is this: if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good discriminating power between categories and is well suited for classification. TF-IDF is simply TF × IDF, where TF (term frequency) is the frequency with which a given term appears in a document d, and the main idea of IDF (inverse document frequency) is that the fewer documents contain the term t (i.e., the smaller n is), the larger the IDF, and the better t distinguishes between categories. IDF has a weakness, however. Suppose m documents of some class C contain the term t, and k documents of the other classes also contain t, so that the total number of documents containing t is n = m + k. When m is large, n is also large, and by the IDF formula below the IDF value comes out small, suggesting that t discriminates poorly between categories. But if a term appears frequently in the documents of a single class, it can in fact characterize that class well; such a term should receive a higher weight and be selected as a feature to distinguish that class from the others. This is where IDF falls short.

In a given document, the raw count of a term is normalized by the document's total term count. For a term t_{i} in a particular document d_{j}, its importance can be expressed as:

                                                            \mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}}

In the above formula, the numerator n_{i,j} is the number of occurrences of the term t_{i} in document d_{j}, and the denominator is the total number of occurrences of all terms in document d_{j}.
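As a minimal sketch of this term-frequency computation in Python (the tokenized word list is assumed as input; tokenization itself is not part of the formula):

```python
from collections import Counter

def term_frequency(term, tokens):
    """tf_{i,j} = n_{i,j} / sum_k n_{k,j}: occurrences of `term`
    divided by the total number of terms in the document."""
    counts = Counter(tokens)
    return counts[term] / len(tokens)

print(term_frequency("cow", ["the", "cow", "sees", "the", "dog"]))  # 0.2
```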

Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

                                                            \mathrm{idf_{i}} =  \log \frac{|D|}{|\{j: t_{i} \in d_{j}\}|}

where

  • |D|: the total number of documents in the corpus
  • |\{ j: t_{i} \in d_{j}\}|: the number of documents containing the term t_{i} (i.e., documents with n_{i,j} \neq 0). If the term does not appear in the corpus at all, this count is zero, so in practice 1 + |\{j : t_{i} \in d_{j}\}| is used instead

Then

                                                                \mathrm{tf{}idf_{i,j}} = \mathrm{tf_{i,j}} \times  \mathrm{idf_{i}}
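A minimal sketch of the full TF-IDF computation, using the smoothed denominator 1 + |\{j : t_{i} \in d_{j}\}| noted above (the function names and toy corpus are illustrative assumptions):

```python
import math
from collections import Counter

def idf(term, corpus):
    """idf_i = log(|D| / (1 + |{j : t_i in d_j}|)); the +1 avoids
    division by zero when the term is absent from the corpus."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing))

def tf_idf(term, doc, corpus):
    """tfidf_{i,j} = tf_{i,j} * idf_i for one tokenized document."""
    tf = Counter(doc)[term] / len(doc)
    return tf * idf(term, corpus)

corpus = [["atomic", "energy", "application"],
          ["energy", "policy"],
          ["application", "form"]]
print(tf_idf("atomic", corpus[0], corpus))  # ≈ 0.135: rare term, high weight
```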

A high term frequency within a particular document, combined with a low document frequency for that term across the whole document set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.

 

Examples

One: Many different mathematical formulas can be used to compute TF-IDF; this example uses the formulas above. Term frequency (TF) is the number of times a term occurs divided by the total number of terms in the document. If a document contains 100 words in total and the word "cow" appears 3 times, the term frequency of "cow" in that document is 3/100 = 0.03. The document frequency (DF) is measured by counting how many documents contain the word "cow" and dividing by the total number of documents in the collection. If "cow" appears in 1,000 documents out of 10,000,000, the inverse document frequency is log(10,000,000 / 1,000) = 4 (base-10 logarithm). The final TF-IDF score is 0.03 × 4 = 0.12.
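A quick check of this arithmetic, as a sketch (base-10 logarithm, which the result of 4 implies):

```python
import math

tf  = 3 / 100                          # "cow": 3 occurrences in 100 words
idf = math.log10(10_000_000 / 1_000)   # log10(10000) = 4.0
print(round(tf * idf, 2))              # 0.12
```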

Two: For a query with keywords k1, k2, and k3, the relevance of a search result is TF1*IDF1 + TF2*IDF2 + TF3*IDF3. For example, suppose document1 contains 1,000 terms in total, in which k1, k2, and k3 appear 100, 200, and 50 times respectively; the numbers of documents containing k1, k2, and k3 are 1,000, 10,000, and 5,000 respectively; and the document set contains 10,000 documents in total. Then, using the natural logarithm:

TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05
IDF1 = log(10000/1000) = log(10) ≈ 2.3
IDF2 = log(10000/10000) = log(1) = 0
IDF3 = log(10000/5000) = log(2) ≈ 0.69

so the relevance of document1 to the keywords k1, k2, k3 is 0.1*2.3 + 0.2*0 + 0.05*0.69 ≈ 0.2645. k1 contributes the bulk of the score, k3 a small part, and k2 nothing at all.
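A short verification of this arithmetic (the hand computation rounds the IDF values, so the exact result differs slightly):

```python
import math

D   = 10_000                               # total documents in the set
tfs = [100 / 1000, 200 / 1000, 50 / 1000]  # TF1, TF2, TF3
dfs = [1_000, 10_000, 5_000]               # documents containing k1, k2, k3

score = sum(tf * math.log(D / df) for tf, df in zip(tfs, dfs))
print(round(score, 4))                     # 0.2649
```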

Three: In a web page with 1,000 words in total, the terms "atomic energy", "的", and "application" appear 2, 35, and 5 times respectively, so their term frequencies are 0.002, 0.035, and 0.005. Adding these three numbers gives 0.042, a simple measure of the relevance of this page to the query "applications of atomic energy". In general, if a query contains the keywords w1, w2, ..., wN and their term frequencies in a particular web page are TF1, TF2, ..., TFN, then the relevance of the query to that page is TF1 + TF2 + ... + TFN.
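The unweighted relevance for this example is just the sum of the term frequencies:

```python
tfs = [2 / 1000, 35 / 1000, 5 / 1000]  # "atomic energy", "的", "application"
print(round(sum(tfs), 3))              # 0.042
```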

Readers may have spotted a loophole. In the example above, the word "的" accounts for more than 80% of the total term frequency, yet it is nearly useless for determining the topic of the page. Such words are called stopwords, meaning their frequency should not be counted when measuring relevance. In Chinese there are dozens of stopwords, such as "是", "和", "中", "地", "得", and so on. After ignoring the stopword, the relevance of the page above becomes 0.007, with "atomic energy" contributing 0.002 and "application" contributing 0.005. An attentive reader may notice another, smaller loophole: in Chinese, "application" is a very general word while "atomic energy" is a very specialized one, and the latter should count for more than the former in the relevance ranking. We therefore need to give each word a weight, and that weight must satisfy the following two conditions:

1. The stronger a word's ability to predict the topic, the greater its weight should be, and vice versa. When we see "atomic energy" on a web page, we get at least a rough idea of the page's topic; seeing "application" once tells us essentially nothing. The weight of "atomic energy" should therefore be larger than that of "application".

2. The weight of stopwords should be zero.

It is easy to see that if a keyword appears in only a few web pages, it lets us narrow down the search target quickly, so its weight should be large. Conversely, if a word appears in a huge number of pages, seeing it still leaves the target unclear, so its weight should be small. In general, if a keyword w appears in Dw web pages, then the larger Dw is, the smaller the weight of w should be, and vice versa. In information retrieval, the most widely used such weight is the inverse document frequency (IDF), whose formula is log(D/Dw), where D is the total number of web pages.

For example, assume the number of Chinese web pages is D = 1 billion, and the stopword "的" appears in all of them, i.e. Dw = 1 billion; then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears in 2 million pages, i.e. Dw = 2 million, its weight is IDF = log(500) ≈ 6.2. Assume further that the general word "application" appears in 500 million pages, so its weight IDF = log(2) is only about 0.7. In other words, one match for "atomic energy" in a web page is worth as much as nine matches for "application".

With IDF, the relevance formula changes from a simple sum of term frequencies to a weighted sum: TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN. In the example above, the relevance between the page and "applications of atomic energy" becomes 0.0159, with "atomic energy" contributing 0.0124 and "application" only 0.0035. This ratio agrees much better with our intuition.
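Putting this example's numbers into the weighted formula (natural logarithm again; "的" drops out because its IDF is zero):

```python
import math

D  = 1_000_000_000                  # assumed total number of web pages
tf = {"atomic energy": 0.002, "的": 0.035, "application": 0.005}
dw = {"atomic energy": 2_000_000, "的": 1_000_000_000, "application": 500_000_000}

score = sum(tf[w] * math.log(D / dw[w]) for w in tf)
print(round(score, 4))              # 0.0159 = 0.0124 + 0 + 0.0035
```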

 

Reprinted from: http://blog.csdn.net/sangyongjia/article/details/52440063
