Overview of the principle of TF-IDF

       Today I will talk about what TF-IDF is. This topic has been nagging at me since 2017, and today I finally pinned it down and put together this article. I hope you will read it even if you have never studied it: there is no advanced mathematics here, and the principle will be clear by the end. To explain the idea in a down-to-earth way, this article involves almost no complex mathematical formulas, even though those formulas are quite beautiful in my opinion...

       Imagine an article, a very long article, as long as the equator (exaggeration used here to vividly depict its length). We want a computer to extract the article's keywords without any manual intervention. How should we do it? This question touches many frontier areas of computer science, such as information retrieval, text mining, and text processing. Hearing that, you might assume some very complicated technology is needed, but as it happens there is an algorithm that solves this problem neatly: the TF-IDF algorithm. Our story officially begins here~


       Everyone is familiar with the iris. Like most people, I both love and hate this flower, not because the French make it into perfume, but because its petals and sepals have become standard tools on the road to machine learning. Since it is so convenient, let's use the iris as our example. Suppose we have an article called "The Growth of the European Iris", and we want a computer to extract its keywords.

1. Term Frequency (TF)

       First of all, we reason that if a word is to be a keyword, it should at least appear many times in the article, which calls for word-frequency statistics. The results do not disappoint: the words that appear most often are function words such as "的" (of), "了" (le), and "是" (is). These words are clearly not keywords. They have a rather elegant collective name, "stop words", meaning words we stop using. Such words are of no help to us, and they are generally filtered out before counting word frequencies.

       Here comes the first assumption: suppose we have filtered out the stop words, so the remaining words all carry real meaning. At this point we run into another problem. We find that the three words "Europe", "Iris", and "Growth" appear the same number of times. Does that mean they are equally important keywords? Obviously not: their importance differs. "Europe" is likely far more common than "iris" or "growth" in other texts, so we have reason to believe that, within this article, "iris" and "growth" matter more than "Europe". We therefore need another parameter to measure how common a word is in general. In other words, we need to assign each word a weight in order to judge its importance.

1. Computing term frequency:

(1) Term frequency = the number of times a word appears in the article

Sometimes, to make different articles (which vary in length) easier to compare, the term frequency is normalized as follows:

(2) Term frequency = the number of times a word appears in the article ÷ the total number of words in the article
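As a small illustration, the normalized term frequency above can be computed in Python. This is a minimal sketch: the stop-word list and the toy sentence are invented for the example, and a real tokenizer would be more careful than a whitespace split.

```python
from collections import Counter

# A hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "of", "a", "in"}

def term_frequencies(text):
    """Normalized term frequency of each word:
    (occurrences of the word) / (total words after stop-word filtering)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    total = len(words)
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("the iris is a flower the iris grows in europe")
print(tf["iris"])  # 2 occurrences out of 5 remaining words -> 0.4
```

In practice the tokenizer and stop-word list would come from an NLP library rather than a hand-written set.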

2. Inverse Document Frequency (IDF)

       If a word is uncommon in general but appears frequently in this article, then we have reason to believe it reflects the characteristics of the article, that is, it is one of the keywords we are looking for. The weight mentioned above has a name here: the inverse document frequency (IDF), whose size is inversely related to how common a word is. TF-IDF is simply the product of term frequency and inverse document frequency. The more important a word is to an article, the larger its TF-IDF value, so the top-ranked words are the article's keywords.

2. Computing inverse document frequency:
    This requires a corpus to simulate the environment in which the language is used. For example, if this is a scientific article, then all scientific articles can be gathered into one large corpus.

Inverse document frequency = log(total number of documents in the corpus ÷ (number of documents containing the word + 1))
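The inverse document frequency formula above can be sketched in Python as follows. The toy corpus is invented for illustration, and base-10 logarithm is assumed (the base only rescales the scores, so the ranking is unaffected).

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log10(total documents / (documents containing the word + 1)).
    The +1 avoids division by zero for words absent from the corpus."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / (containing + 1))

# A toy corpus: each document is represented as a set of its words.
corpus = [
    {"iris", "flower", "europe"},
    {"europe", "travel"},
    {"iris", "growth"},
    {"science", "europe"},
]
print(inverse_document_frequency("iris", corpus))    # log10(4 / 3), about 0.125
print(inverse_document_frequency("europe", corpus))  # log10(4 / 4) = 0.0
```

Notice how "europe", appearing in three of the four documents, gets a much lower IDF than "iris": common words are penalized, exactly as intended.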


3. Computing TF-IDF

TF-IDF = term frequency × inverse document frequency

       It can be seen from the formula above that TF-IDF is directly proportional to how often a word appears in the document, and inversely proportional to how many documents in the corpus contain it. The algorithm for automatic keyword extraction is therefore very clear: compute the TF-IDF value of each word in the document, sort the words in descending order of that value, and take the top few.
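Putting the pieces together, the extraction procedure just described (compute TF-IDF per word, sort descending, take the top words) can be sketched like this. The tiny documents are hypothetical, and for brevity stop-word filtering is assumed to have already happened:

```python
import math
from collections import Counter

def top_keywords(doc_words, corpus, k=3):
    """Rank the words of one document by TF-IDF and return the top k."""
    total = len(doc_words)
    tf = {w: c / total for w, c in Counter(doc_words).items()}
    n_docs = len(corpus)
    scores = {}
    for word, f in tf.items():
        containing = sum(1 for d in corpus if word in d)
        idf = math.log10(n_docs / (containing + 1))
        scores[word] = f * idf
    # Sort words by TF-IDF, highest first, and keep the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [
    ["iris", "growth", "europe", "iris"],
    ["europe", "travel", "wine"],
    ["europe", "history"],
    ["flower", "garden", "iris"],
]
print(top_keywords(corpus[0], corpus, k=2))  # prints ['growth', 'iris']
```

"europe" appears in three of the four documents, so its IDF (and hence its TF-IDF) collapses toward zero, and it drops out of the keyword list even though it appears in the first document.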

       Suppose our article contains 1,000 words in total, and "Europe", "Iris", and "Growth" each appear 20 times, so the term frequency (TF) of all three words is 0.02. Then, via a web search, suppose 25 billion web pages contain the word "的" (of); we take this as the total number of pages in the corpus. 6.23 billion pages contain "Europe", 48.4 million contain "Iris", and 97.3 million contain "Growth". Their inverse document frequencies (IDF, using base-10 log) and TF-IDF values are then:

Word      Pages containing the word    IDF      TF-IDF
Europe    6.23 billion                 0.603    0.0121
Iris      48.4 million                 2.713    0.0543
Growth    97.3 million                 2.410    0.0482

       As can be seen from the above table, "Iris" has the highest TF-IDF, followed by "Growth", and "Europe" has the lowest TF-IDF.
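The figures in the table can be reproduced directly from the formulas. Here is a quick check in Python, using the page counts assumed in the example and base-10 log:

```python
import math

TOTAL_PAGES = 25e9   # pages containing "的", taken as the corpus size
TF = 20 / 1000       # each word appears 20 times in a 1,000-word article

for word, pages in [("Europe", 6.23e9), ("Iris", 48.4e6), ("Growth", 97.3e6)]:
    idf = math.log10(TOTAL_PAGES / pages)  # the "+1" is negligible at this scale
    print(f"{word}: IDF = {idf:.3f}, TF-IDF = {TF * idf:.4f}")
```

Running this reproduces the table: Europe 0.603/0.0121, Iris 2.713/0.0543, Growth 2.410/0.0482.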

 

       Okay, at this point, do you have a solid understanding of how TF-IDF works? What problems did you run into along the way? Feel free to leave a comment and let me know~

Origin blog.csdn.net/gdkyxy2013/article/details/108997570