[Reprint] Understanding and Calculation of TF-IDF

Article reproduced from: Use TfidfVectorizer class to find TF-IDF - Li Bai and Wine - Blog Park

What is a TF-IDF value

The "TF-IDF value of a word" is mentioned in Polynomial Naive Bayes. TF-IDF is a statistical method used to evaluate the importance of a word to a document set or one of the documents in the document library.

TF-IDF is actually the combination of two terms, Term Frequency and Inverse Document Frequency, abbreviated TF and IDF, which stand for term frequency and inverse document frequency respectively.

Term frequency TF counts the number of times a word appears in a document. The assumption is that the importance of a word is proportional to how often it appears in the document.

Inverse document frequency IDF measures how distinctive a word is across documents. The assumption is that the fewer documents a word appears in, the better that word can distinguish those documents from the others. The larger the IDF, the more distinctive the word.

So TF-IDF is simply the product of term frequency TF and inverse document frequency IDF. With it, we tend to find words with both a high TF and a high IDF as discriminative features, that is, words that appear many times in one document but rarely in other documents. Such words are well suited for classification.

How to calculate TF-IDF

First, let's look at the formulas for term frequency TF and inverse document frequency IDF:

$$\mathrm{TF}_w = \frac{\text{number of times word } w \text{ appears in the document}}{\text{total number of words in the document}}$$

$$\mathrm{IDF}_w = \log_{10}\frac{\text{total number of documents}}{\text{number of documents containing word } w + 1}$$

In the denominator of IDF, 1 is added to the number of documents containing the word because a word might appear in no document at all; adding 1 keeps the denominator from being 0.

TF-IDF = TF × IDF

The TF-IDF value is simply TF multiplied by IDF, and it allows documents to be classified more accurately. For example, a high-frequency word such as "I" has a high TF, but its IDF is very low, so its overall TF-IDF is not high.
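To make the formulas concrete, here is a minimal sketch in Python, assuming each document is represented as a list of lowercase tokens (the helper names tf, idf and tf_idf are just for illustration):

import math

def tf(word, doc):
    # Term frequency: occurrences of `word` divided by the total number
    # of words in the document (a list of tokens).
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency with a base-10 logarithm; +1 in the
    # denominator so a word appearing in no document does not divide by 0.
    containing = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / (containing + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Example on a tiny tokenized corpus
docs = [['bayes', 'classifier'], ['this', 'one'], ['this', 'two']]
print(tf_idf('bayes', docs[0], docs))  # 0.5 * log10(3/2) ≈ 0.0880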

A concrete example

Suppose a folder contains 10 documents, and one of those documents has 1,000 words. In that document, the word "this" appears 20 times and "bayes" appears 5 times. "this" appears in all 10 documents, while "bayes" appears in only 2 of them. Calculate the TF-IDF values of these two words.

For "this", calculate the TF-IDF value:

So TF-IDF=0.02*(-0.0414)=-8.28e-4.

For "bayes", calculate the TF-IDF value:

So TF-IDF=0.005*0.5229=2.61e-3.

Clearly, the TF-IDF value of "bayes" is greater than that of "this", which means "bayes" distinguishes documents better than "this" does.
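This arithmetic is easy to verify in Python with base-10 logarithms:

import math

# "this": 20 occurrences in a 1000-word document; present in all 10 documents
tf_this = 20 / 1000                    # 0.02
idf_this = math.log10(10 / (10 + 1))   # ≈ -0.0414
print(tf_this * idf_this)              # ≈ -8.28e-04

# "bayes": 5 occurrences in the same document; present in only 2 documents
tf_bayes = 5 / 1000                    # 0.005
idf_bayes = math.log10(10 / (2 + 1))   # ≈ 0.5229
print(tf_bayes * idf_bayes)            # ≈ 2.61e-03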

How to compute TF-IDF

Use sklearn's TfidfVectorizer class directly; it computes the TF-IDF vector of each document. Note that when sklearn takes the logarithm, the base is e, not 10, so its values will not match the base-10 hand calculation above.

from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = ['is', 'the', 'and']
tfidf_vec = TfidfVectorizer(stop_words=stop_words)

# Printing the vectorizer shows its configuration; the output below is
# from an older sklearn version, which printed every default parameter:
# TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
#                 dtype=<class 'numpy.float64'>, encoding='utf-8',
#                 input='content', lowercase=True, max_df=1.0, max_features=None,
#                 min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
#                 smooth_idf=True, stop_words=['is', 'the', 'and'],
#                 strip_accents=None, sublinear_tf=False,
#                 token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
#                 vocabulary=None)
print(tfidf_vec)

documents = [
    'this is the bayes document',
    'this is the second second document',
    'and the third one',
    'is this the document'
]

# fit_transform fits the vocabulary and returns the document-term matrix,
# which holds the TF-IDF value of each word in each document
tfidf_matrix = tfidf_vec.fit_transform(documents)

# Unique words (get_feature_names() was renamed to get_feature_names_out()
# and removed in sklearn 1.2): ['bayes' 'document' 'one' 'second' 'third' 'this']
print('Unique words:', tfidf_vec.get_feature_names_out())

# Stop-word list: frozenset({'and', 'the', 'is'})
print('Stop-word list:', tfidf_vec.get_stop_words())

# ID of each word: {'this': 5, 'bayes': 0, 'document': 1, 'second': 3, 'third': 4, 'one': 2}
print('ID of each word:', tfidf_vec.vocabulary_)

# TF-IDF value of each word in each document:
#  [[0.74230628 0.47380449 0.         0.         0.         0.47380449]
#  [0.         0.29088811 0.         0.91146487 0.         0.29088811]
#  [0.         0.         0.70710678 0.         0.70710678 0.        ]
#  [0.         0.70710678 0.         0.         0.         0.70710678]]
print('TF-IDF value of each word:\n', tfidf_matrix.toarray())
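To see where these numbers come from, the first row can be reproduced by hand. The sketch below assumes the default settings shown above (smooth_idf=True, norm='l2', use_idf=True): sklearn computes the IDF as ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each document's vector.

import math

# Document 1 after stop-word removal: ['this', 'bayes', 'document']
n_docs = 4
df = {'bayes': 1, 'document': 3, 'this': 3}      # document frequencies in the corpus
counts = {'bayes': 1, 'document': 1, 'this': 1}  # term counts in document 1

# Smoothed IDF as sklearn computes it: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}
raw = {w: counts[w] * idf[w] for w in counts}

# L2-normalize the document vector (norm='l2')
norm = math.sqrt(sum(v * v for v in raw.values()))
print({w: round(v / norm, 8) for w, v in raw.items()})
# {'bayes': 0.74230628, 'document': 0.47380449, 'this': 0.47380449}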

Note:

Stop words are words that are useless for classification. They usually have a high term frequency TF but a low IDF, so they cannot help distinguish documents. To save space and computation time, they are passed to the vectorizer as stop words so that it skips them entirely. The stop_words parameter takes a list of strings, as in the example above; a built-in alternative is sketched below.
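Besides a hand-written list, TfidfVectorizer also accepts the string 'english' to use sklearn's built-in English stop-word list. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

# The built-in English stop-word list already covers words such as
# 'this', 'is' and 'the', so they disappear from the vocabulary.
vec = TfidfVectorizer(stop_words='english')
vec.fit(['this is the bayes document'])
print(vec.get_feature_names_out())  # ['bayes' 'document']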

Reference

21 | Naive Bayes Classification (Part 2): How to Classify Documents? - Geek Time

sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn 1.2.0 documentation

Origin: blog.csdn.net/tangxianyu/article/details/128516726