What is the TF-IDF model?

The TF-IDF (term frequency–inverse document frequency) model is a statistical method for evaluating how important a word is to one document within a document set or corpus. The main idea of TF-IDF is that if a word or phrase appears frequently in one article but rarely in other articles, it distinguishes that article well and is suitable for use in classification.

TF-IDF combines two values: the term frequency (TF) and the inverse document frequency (IDF). The calculation method is shown in the figure.
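For reference, one common way to write the two values and their product is given here. This is a sketch of the textbook formulation, not necessarily the exact variant the figure uses: the symbols f_{t,d} (count of term t in document d), N (number of documents) and n_t (number of documents containing t) are notation assumed for illustration, and the base of the logarithm and any smoothing term vary between sources.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log \frac{N}{n_t}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```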

For example, suppose the library contains 10,000 documents, all 10,000 of which mention "cow" while only 10 mention "milk production". Consider an article about "cow milk production" that has 100 words, in which "cow" appears 5 times and "milk production" appears 2 times.

The calculation shows that although "cow" has a high term frequency, its IDF value is very low, so its TF-IDF is also very low, which means the word does little to distinguish the article. The term frequency of "milk production" is lower, but its IDF is very high, so its TF-IDF ends up high as well.
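The numbers in this example can be checked with a minimal Python sketch. It assumes the plain unsmoothed definitions TF = (occurrences in the article) / (words in the article) and IDF = log10(total documents / documents containing the term). The base-10 logarithm and the function names tf, idf and tf_idf are choices made for illustration, not something the original specifies.

```python
import math


def tf(term_count: int, total_words: int) -> float:
    """Term frequency: share of the article's words that are this term."""
    return term_count / total_words


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, unsmoothed, base-10 log (an assumption)."""
    return math.log10(total_docs / docs_with_term)


def tf_idf(term_count: int, total_words: int,
           total_docs: int, docs_with_term: int) -> float:
    """TF-IDF as the product of the two values above."""
    return tf(term_count, total_words) * idf(total_docs, docs_with_term)


# Figures from the example: 10,000 documents, a 100-word article.
TOTAL_DOCS = 10_000
ARTICLE_WORDS = 100

# "cow": 5 occurrences in the article, found in all 10,000 documents.
cow_score = tf_idf(5, ARTICLE_WORDS, TOTAL_DOCS, 10_000)

# "milk production": 2 occurrences in the article, found in 10 documents.
milk_score = tf_idf(2, ARTICLE_WORDS, TOTAL_DOCS, 10)

print("cow:", cow_score)
print("milk production:", milk_score)
```

Under these assumptions "cow" has TF 0.05 but IDF log10(1) = 0, so its score is zero, while "milk production" has TF 0.02 and IDF log10(1000) = 3, giving a score of about 0.06. Smoothed variants of IDF would change the exact values but not the ordering.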

[Figure: TF-IDF calculation method]


Origin blog.csdn.net/weixin_47542175/article/details/114735529