Knowledge Bits - What is TF-IDF

Introduction to TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to a document within a collection or corpus. A word's importance increases with the number of times it appears in the document, but is offset by how frequently the word appears across the corpus as a whole. Search engines often apply some variant of TF-IDF weighting to score how relevant a document is to a user query.

The main idea behind TF-IDF is: if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good discriminating power between categories and is well suited for classification. TF-IDF is simply the product: TF * IDF.

Term Frequency (TF)

Term frequency is the frequency with which a given term appears in a document: the number of times word w appears in document d, count(w, d), divided by the total number of words in d, size(d).

tf(w,d) = count(w, d) / size(d)

The raw count is normalized by the document length to prevent a bias towards long documents. (The same word is likely to have a higher raw count in a long document than in a short one, regardless of its actual importance.)
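The TF formula above can be sketched directly in Python. This is a minimal illustration (the function name and whitespace tokenization are my own choices, not part of the original definition):

```python
from collections import Counter

def tf(word, document):
    """Term frequency: count(w, d) / size(d), i.e. occurrences of
    `word` in `document` divided by the document's total word count."""
    words = document.lower().split()
    return Counter(words)[word] / len(words)

print(tf("cat", "the cat sat on the cat mat"))  # 2/7 ≈ 0.2857
```

Note that the division by `len(words)` is exactly the length normalization described above: "cat" scores 2/7 here whether or not the document is padded with more filler words containing no occurrences of "cat" would lower it proportionally.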

Inverse Document Frequency (IDF)

IDF measures how much general importance a word carries. The IDF of a specific term is obtained by dividing the total number of documents by the number of documents containing the term, and taking the logarithm of the quotient: the log of the ratio of the total number of documents n to the number of documents docs(w, D) in which word w appears.

idf(w) = log(n / docs(w, D))

This basic IDF formula is usable, but it breaks down in one edge case: if a rare word does not occur in the corpus at all, the denominator docs(w, D) is 0 and the IDF is undefined. To handle this, the IDF is usually smoothed so that words absent from the corpus still receive a sensible value. There are many smoothing variants; one of the most common smoothed IDF formulas is:

idf(w) = log[(n + 1) / (docs(w, D) + 1)] + 1
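The smoothed formula can be sketched as follows. This is an illustrative implementation (function name and the toy corpus are my own), using the natural logarithm:

```python
import math

def idf(word, documents):
    """Smoothed IDF: log((n + 1) / (docs(w, D) + 1)) + 1.
    The +1 in the denominator keeps the division defined even for
    words that appear in no document at all."""
    n = len(documents)
    docs_containing = sum(1 for d in documents if word in d.lower().split())
    return math.log((n + 1) / (docs_containing + 1)) + 1

corpus = ["the cat sat", "the dog ran", "a cat and a dog"]
print(idf("the", corpus))     # common word -> low IDF: log(4/3) + 1
print(idf("unseen", corpus))  # absent word -> finite IDF: log(4/1) + 1
```

Without the smoothing, `idf("unseen", corpus)` would divide by zero; with it, unseen words simply get the largest possible IDF for the corpus.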

TF-IDF definition

TF-IDF(x) = TF(x)*IDF(x)

where TF(x) is the term frequency of word x in the current document and IDF(x) is its inverse document frequency.


Origin blog.csdn.net/guoqx/article/details/130921027