TF-IDF is an entry-level algorithm that every NLP engineer should master. As a hobbyist, I had read several blog posts about it before, but I only grasped the general idea. Recently I read "The Beauty of Mathematics" by Wu Jun, and its introduction to TF-IDF gave me a deeper understanding of the algorithm. My understanding is organized below.
TF-IDF is a statistical method for evaluating how important a word is to one document within a document set or corpus. A word's importance ① increases in proportion to the number of times it appears in the document, but ② decreases in inverse proportion to how frequently it appears across the corpus.
Of these, ① is captured by the TF part of the algorithm and ② by the IDF part.
TF: Term Frequency, generally referred to as "keyword frequency" or "single-text word frequency". Calculation: the number of times the keyword appears in the document divided by the total number of words in the document (counting duplicates). For example, in an article of 10,000 words, "artificial intelligence" appears 17 times, "development" appears 23 times, and "的" (the Chinese particle roughly meaning "of") appears 113 times. Their TF values are then 0.0017, 0.0023, and 0.0113.
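The TF calculation above can be sketched in a few lines of Python. This is only a minimal illustration: the function name term_frequencies and the ten-token toy article are made up for the example, and the text is assumed to be tokenized already.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return each word's count divided by the total token count (duplicates kept)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# Hypothetical tokenized article with 10 tokens, chosen only for illustration.
tokens = ["ai", "development", "of", "ai", "of", "the", "of", "field", "is", "fast"]
tf = term_frequencies(tokens)
print(tf["ai"])  # "ai" is 2 of the 10 tokens
print(tf["of"])  # "of" is 3 of the 10 tokens
```

As in the article example, a filler word such as "of" gets the highest TF, which is why TF alone is not enough.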
IDF: Inverse Document Frequency, generally called the "inverse document frequency index". Calculation: IDF = log(D / D_w), where D is the number of documents in the corpus and D_w is the number of documents in which the keyword appears. For example, if the corpus contains 1000 articles, of which 3 contain "artificial intelligence", 20 contain "development", and all 1000 contain "的", then their IDF values are log(1000/3), log(1000/20), and log(1000/1000) = 0.
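The IDF values in this example can be checked with a short Python sketch. The function name inverse_document_frequency is hypothetical, and the natural logarithm is an assumption, since the base of the log is not stated here; a different base only rescales every value by the same constant.

```python
import math


def inverse_document_frequency(total_docs, docs_with_term):
    """Log of (documents in corpus / documents containing the term).

    Natural log is an assumption; the text does not fix the base.
    """
    if docs_with_term <= 0:
        raise ValueError("the term must appear in at least one document")
    return math.log(total_docs / docs_with_term)


# Document counts taken from the example: a corpus of 1000 articles.
idf_ai = inverse_document_frequency(1000, 3)
idf_development = inverse_document_frequency(1000, 20)
idf_de = inverse_document_frequency(1000, 1000)  # "的" appears in every article

print(idf_ai, idf_development, idf_de)
```

The rarer a word is across the corpus, the larger its IDF, and a word that appears in every document gets an IDF of exactly zero.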
The relevance of the query "artificial intelligence / 的 / development" to the article is then the sum of each keyword's TF multiplied by its IDF:
0.0017*log(1000/3)+0.0023*log(1000/20)+0.0113*log(1000/1000)
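As a rough check of the sum above, here is a small Python sketch that plugs in the numbers from the example. The names query_terms and tf_idf_score are made up for illustration, and the natural logarithm is again an assumption.

```python
import math

D = 1000  # total number of articles in the corpus (from the example)

# keyword -> (TF in the article, number of articles containing the keyword)
query_terms = {
    "artificial intelligence": (0.0017, 3),
    "development": (0.0023, 20),
    "的": (0.0113, 1000),
}


def tf_idf_score(terms, total_docs):
    """Sum TF * log(D / D_w) over all keywords (natural log assumed)."""
    return sum(tf * math.log(total_docs / df) for tf, df in terms.values())


score = tf_idf_score(query_terms, D)
print(score)

# Contribution of each keyword, to see which ones actually matter.
for word, (tf, df) in query_terms.items():
    print(word, tf * math.log(D / df))
```

Although "的" has by far the highest TF, its IDF is zero, so it contributes nothing to the score; the two content words carry all of the weight.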
References:
https://baike.baidu.com/item/tf-idf/8816134?fr=aladdin
"The Beauty of Mathematics": Chapter 11 How to Determine the Relevance of Web Pages and Queries