TF-IDF: a personal summary

TF-IDF is an entry-level algorithm that every NLP engineer should master. As a hobbyist, I had read several blog posts about it before, but I only had a general idea of how it works. Recently I read "The Beauty of Mathematics" by Wu Jun, and its introduction to TF-IDF gave me a deeper understanding of the algorithm. My understanding is organized as follows:

TF-IDF is a statistical method for evaluating how important a word is to a single document within a document set or corpus. The importance of a word ① increases in proportion to the number of times it appears in the document, but at the same time ② decreases the more frequently the word appears across the documents of the corpus.

Of these two effects, ① is measured by the TF part of the algorithm and ② by the IDF part.

TF: Term Frequency, also called "keyword frequency" or "single-text word frequency". Calculation method: the number of times the keyword appears in the document divided by the total number of words in the document (counting repeated words, not deduplicated). For example, in an article of 10,000 words, "artificial intelligence" appears 17 times, "development" appears 23 times, and "的" (a very common Chinese particle, roughly "of") appears 113 times. Their TF values are then 0.0017, 0.0023, and 0.0113, respectively.
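The TF calculation above can be sketched in a few lines of Python. This is only a minimal illustration using the numbers from the example; the function name `term_frequency` and the `counts` dictionary are my own hypothetical names, not something from the book.

```python
def term_frequency(term_count, total_words):
    """TF = occurrences of the term / total words in the document (repeats counted)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return term_count / total_words


# Counts taken from the example: a 10,000-word article.
total_words = 10_000
counts = {"artificial intelligence": 17, "development": 23, "的": 113}

tf = {term: term_frequency(count, total_words) for term, count in counts.items()}

for term, value in tf.items():
    print(f"{term}: {value:.4f}")
```

Note that by TF alone "的" would look like the most important word in the article, which is exactly why the IDF part is needed.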

IDF: Inverse Document Frequency, also called the "inverse document frequency index". Calculation method: log(D/D_{w}), where D is the number of documents in the corpus and D_{w} is the number of documents in which the keyword appears. For example, suppose the corpus contains 1000 articles in total, of which 3 contain "artificial intelligence", 20 contain "development", and all 1000 contain "的". Their IDF values are then log(1000/3), log(1000/20), and log(1000/1000), respectively.
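The IDF values from this example can be checked with a short Python sketch. The base of the logarithm is not stated here, so the natural logarithm used below is an assumption; a different base only rescales every IDF by the same constant. The names `inverse_document_frequency` and `doc_freq` are hypothetical.

```python
import math


def inverse_document_frequency(num_docs, docs_with_term):
    """IDF = log(D / D_w). Natural log is an assumption; the text does not fix the base."""
    if not 0 < docs_with_term <= num_docs:
        raise ValueError("docs_with_term must be between 1 and num_docs")
    return math.log(num_docs / docs_with_term)


# Document frequencies taken from the example: 1000 articles in the corpus.
num_docs = 1000
doc_freq = {"artificial intelligence": 3, "development": 20, "的": 1000}

idf = {term: inverse_document_frequency(num_docs, df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: {value:.4f}")
```

The rarer a word is across the corpus, the larger its IDF; a word that appears in every article, like "的", gets an IDF of exactly 0.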

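Then the relevance (similarity) of the query "artificial intelligence / 的 / development" to the article is the sum of TF × IDF over the three words:

0.0017 * log(1000/3) + 0.0023 * log(1000/20) + 0.0113 * log(1000/1000)

Because log(1000/1000) = 0, the very frequent word "的" contributes nothing to the score, even though it has the highest TF of the three.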
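As a rough check of the sum above, here is a minimal Python sketch that adds up TF × IDF over the query words. The function `tf_idf_score` and its parameter names are hypothetical, and the natural logarithm is again an assumption, so the absolute value of the score depends on that choice.

```python
import math


def tf_idf_score(query_terms, term_counts, total_words, doc_freq, num_docs):
    """Sum of TF * IDF over the query terms for a single document.

    TF  = term count / total words in the document
    IDF = log(D / D_w), natural log assumed
    """
    score = 0.0
    for term in query_terms:
        tf = term_counts.get(term, 0) / total_words
        idf = math.log(num_docs / doc_freq[term])
        score += tf * idf
    return score


# Numbers from the worked example in the text.
term_counts = {"artificial intelligence": 17, "development": 23, "的": 113}
doc_freq = {"artificial intelligence": 3, "development": 20, "的": 1000}
query = ["artificial intelligence", "的", "development"]

score = tf_idf_score(query, term_counts, 10_000, doc_freq, 1000)
print(f"{score:.4f}")  # about 0.0189 with the natural log
```

Dropping "的" from the query leaves the score unchanged, which shows how IDF automatically filters out words that carry no information about the topic.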

 

References:

https://baike.baidu.com/item/tf-idf/8816134?fr=aladdin

Wu Jun, "The Beauty of Mathematics", Chapter 11: How to Determine the Relevance of Web Pages and Queries

 


Origin blog.csdn.net/lz_peter/article/details/90676146