【Introduction to TF-IDF】

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. The main idea of TF-IDF is: if a word or phrase appears in an article with a high term frequency (TF) and rarely appears in other articles, it is considered to have good power to discriminate between categories and to be well suited for classification. TF-IDF is simply TF * IDF, where TF stands for term frequency and IDF for inverse document frequency. TF represents how often the term appears in a document d.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf[2].

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
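
As a rough illustration, here is a minimal sketch of that summing scheme in Python (the tfidf_scores dictionary keyed by (term, doc_id) is a made-up structure for this example, not something defined in the text):

# Rank documents by summing the tf-idf weight of each query term.
def rank_documents(query_terms, doc_ids, tfidf_scores):
    # tfidf_scores: assumed mapping of (term, doc_id) -> precomputed tf-idf weight
    scores = {}
    for doc_id in doc_ids:
        scores[doc_id] = sum(tfidf_scores.get((term, doc_id), 0.0) for term in query_terms)
    # Highest summed tf-idf first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)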


Term frequency

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

The weight of a term that occurs in a document is simply proportional to the term frequency. 
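
A minimal sketch of this counting step in Python (lowercasing and whitespace tokenization are simplifying assumptions for illustration, not part of Luhn's formulation):

from collections import Counter

def raw_term_frequency(document_text):
    # Naive tokenization: lowercase and split on whitespace.
    tokens = document_text.lower().split()
    return Counter(tokens)

# Example: occurrences of the query terms in one document.
counts = raw_term_frequency("The brown cow jumped over the brown fence")
print(counts["the"], counts["brown"], counts["cow"])   # 2 2 1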

 

Inverse document frequency

Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:

The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs. 

 

The main idea of IDF is: if fewer documents contain the term t (that is, the smaller n is), the larger the IDF, which suggests that t has good power to discriminate between categories. However, suppose the number of documents of some class C that contain the term t is m, and the number of documents of all other classes that contain t is k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF computed from the formula is small, implying that t is a weak discriminator. Yet if a term appears frequently in the documents of one class, it actually characterises that class well; such terms should receive a higher weight and be selected as feature words that distinguish this class from the others. This is a shortcoming of IDF.

 

Example: if a document contains 100 words in total and the word "cow" appears 3 times, then the term frequency of "cow" in that document is 3/100 = 0.03. One way to calculate document frequency (DF) is to count how many documents contain the word "cow" and divide by the total number of documents in the collection. So, if "cow" appears in 1,000 documents and there are 10,000,000 documents in total, the inverse document frequency is lg(10,000,000 / 1,000) = 4. The final TF-IDF score is 0.03 * 4 = 0.12.
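
The same arithmetic as a quick check in Python (log base 10, matching the lg used above):

import math

tf = 3 / 100                              # "cow" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)      # 1,000 of 10,000,000 documents contain "cow"
print(tf, idf, tf * idf)                  # 0.03 4.0 0.12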

 

In a given document, term frequency (TF) refers to how often a given word appears in that document. This count is normalized by the total number of terms in the document to prevent a bias towards long documents (the same word is likely to occur more often in a long document than in a short one, regardless of whether it is actually important). For a word t_{i} in a particular document d_{j}, its importance can be expressed as:

 \mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}}

In the formula above, n_{i,j} is the number of occurrences of the word t_{i} in document d_{j}, and the denominator \sum_k n_{k,j} is the total number of occurrences of all words in document d_{j}.
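
A direct translation of this formula into Python (documents are assumed to be pre-tokenized into lists of words):

def term_frequency(term, document_tokens):
    # n_{i,j}: occurrences of the term in document d_j
    count = document_tokens.count(term)
    # denominator: total occurrences of all words in d_j
    return count / len(document_tokens) if document_tokens else 0.0

print(term_frequency("cow", "the brown cow".split()))   # 1/3 ≈ 0.33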

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF for a particular word can be calculated by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient to get:

 \mathrm{idf_{i}} =  \log \frac{|D|}{|\{j: t_{i} \in d_{j}\}|}

where

  • |D|: The total number of documents in the corpus
  • |\{ j: t_{i} \in d_{j}\}|: the number of documents containing the word t_{i} (i.e., the number of documents for which n_{i,j} \neq 0). If the word does not occur in the corpus at all, this denominator would be zero, so 1 + |\{j : t_{i} \in d_{j}\}| is commonly used instead, as in the sketch below.
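
A minimal sketch of this idf computation in Python, including the 1 + ... denominator (log base 10 is an assumption here; the formula above does not fix the base):

import math

def inverse_document_frequency(term, documents):
    # documents: list of token lists, one per document d_j
    containing = sum(1 for doc in documents if term in doc)
    # 1 + |{j : t_i in d_j}| keeps the denominator non-zero for unseen terms
    return math.log10(len(documents) / (1 + containing))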

Then

 \mathrm{tfidf_{i,j}} = \mathrm{tf_{i,j}} \times  \mathrm{idf_{i}}

A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
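
Putting the two factors together, a small end-to-end sketch in Python (the toy corpus is invented for illustration; tf and idf follow the formulas above, as in the earlier sketches):

import math

def term_frequency(term, document_tokens):
    return document_tokens.count(term) / len(document_tokens)

def inverse_document_frequency(term, documents):
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / (1 + containing))

def tf_idf(term, document_tokens, documents):
    return term_frequency(term, document_tokens) * inverse_document_frequency(term, documents)

# Toy corpus: "the" occurs in every document and gets a low (here negative) weight,
# while the rarer "cow" scores higher.
corpus = [
    "the brown cow".split(),
    "the quick brown fox".split(),
    "the lazy dog".split(),
]
doc = corpus[0]
print(tf_idf("the", doc, corpus))   # about -0.04
print(tf_idf("cow", doc, corpus))   # about 0.06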
