The TF-IDF algorithm for keyword extraction

 

This article is reproduced from a blog post by Ruan Yifeng. The original (in Chinese) is at http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

————————————————————————————————————————

The title looks complicated, but what I actually want to discuss is a very simple problem.

Given a long article, how can a computer correctly extract its keywords (automatic keyphrase extraction), with no human intervention at all?

This problem touches on several frontier areas of computer science: data mining, text processing, and information retrieval. Surprisingly, there is a very simple classical algorithm that gives quite satisfactory results. It is so simple that it requires no advanced mathematics; an average person can understand it in ten minutes. That is the algorithm I want to introduce today: TF-IDF.

Let's start with an example. Suppose we have a long article titled "Bee Farming in China," and we want the computer to extract its keywords.

An easy approach that comes to mind is to find the words that appear most often. If a word is important, it should appear many times in the article. So we compute "term frequency" (TF) statistics for each word.

You have probably guessed the result: the most frequent words are the most commonly used ones, such as "the," "is," and "in" (的, 是, 在). These are called "stop words": words that contribute nothing to finding the result and must be filtered out.

Suppose we filter them out and consider only the remaining meaningful words. We then run into another problem: we may find that "China," "bee," and "farming" each appear the same number of times. Does that mean that, as keywords, they are equally important?

Clearly not. "China" is a very common word, while "bee" and "farming" are relatively uncommon. If these three words appear equally often in an article, it is reasonable to conclude that "bee" and "farming" matter more than "China"; that is, in the keyword ranking, "bee" and "farming" should come ahead of "China." (Note: whether a word is common or uncommon is relative to the whole document collection, or the whole language environment.)

So we need an importance adjustment factor that measures whether a word is a common one. If a word is rare in general but appears many times in this particular article, it very likely reflects what the article is about, and it is exactly the kind of keyword we are looking for. For example, a string like "hydrogen helium lithium beryllium" is rare in everyday language and usually appears only in chemistry articles; if it shows up in an article, we can reasonably guess that the article is about chemistry, so its importance coefficient should be high. The word "we," on the other hand, is extremely common; its appearance in an article tells us almost nothing, so its importance coefficient should be low.

In statistical terms, this means that on top of term frequency we assign each word an "importance" weight. The most common words ("the," "is," "in") get the smallest weight, fairly common words ("China") get a smaller weight, and relatively uncommon words ("bee," "farming") get a larger weight. This weight is called "inverse document frequency" (IDF), and its size is inversely related to how common the word is.

Know the "word frequency" (TF) and "inverse document frequency" (IDF) later, these two values are multiplied, you get a TF-IDF value of a word. The higher the importance of the article a word, it's TF-IDF values greater . So, at the top of a few words, it is the key word of this article.

Here are the details of the algorithm.

The first step is to calculate term frequency.

Since articles differ in length, the term frequency is normalized so that different articles can be compared:

TF = (number of times the word appears in the article) / (total number of words in the article)

Alternatively, it can be normalized by the most frequent word:

TF = (number of times the word appears in the article) / (number of occurrences of the most frequent word in the article)
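To make this first step concrete, here is a minimal Python sketch of the normalized term-frequency computation; it assumes the document has already been tokenized and had its stop words removed, and the function names are only illustrative.

```python
from collections import Counter

def term_frequencies(words):
    """Normalized term frequency: each word's count divided by the
    total number of words in the (non-empty) document."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

def term_frequencies_by_max(words):
    """Alternative normalization: divide each word's count by the count
    of the most frequent word in the document."""
    counts = Counter(words)
    max_count = counts.most_common(1)[0][1]
    return {word: count / max_count for word, count in counts.items()}
```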

The second step is to calculate inverse document frequency.

This requires a corpus, a collection of documents used to model the language environment.

IDF = log( total number of documents in the corpus / (number of documents containing the word + 1) )

The more common a word is, the larger the denominator and the smaller the IDF, which approaches 0. The denominator has 1 added to it to avoid its being zero (that is, the case in which no document contains the word). log means taking the logarithm of the value.
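A matching sketch of the IDF computation, assuming the corpus is represented simply as a list of documents, each given as a set of words; the base-10 logarithm is used here to match the worked example below, but the base is just a convention.

```python
import math

def inverse_document_frequency(word, corpus):
    """IDF = log(total documents / (documents containing the word + 1)).

    The +1 keeps the denominator from being zero when no document
    contains the word.
    """
    containing = sum(1 for document in corpus if word in document)
    return math.log10(len(corpus) / (containing + 1))
```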

The third step is to calculate TF-IDF:

TF-IDF = TF × IDF

As you can see, the TF-IDF value is proportional to the number of times a word appears in the document and inversely related to how often the word occurs across the whole language environment. The algorithm for automatic keyword extraction is therefore very clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
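Putting the two together, here is a sketch of the keyword-extraction procedure just described, reusing the term_frequencies and inverse_document_frequency helpers sketched above:

```python
def extract_keywords(document_words, corpus, top_n=5):
    """Score every word of one document by TF * IDF and return the
    top_n highest-scoring words as its keywords."""
    tf = term_frequencies(document_words)
    scores = {
        word: tf[word] * inverse_document_frequency(word, corpus)
        for word in tf
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```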

Or "China's bee breeding" as an example, assume that the message length is 1000 words, "China", "bee", "culture" appears 20 times each, the three words of "word frequency" (TF) are 0.02. Then, Google search found, including "the" word of a total of 25 billion pages, it is assumed that the total number of Chinese web page. Pages that contain "China" A total of 6.23 billion, consisting of "bee" page to 048.4 million, a page containing "culture" of 097.3 million. They inverse document frequency (IDF) and TF-IDF follows:

As the table shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" has the lowest. (If we also computed the TF-IDF of "the," it would be a value extremely close to zero.) So if we had to choose a single keyword, "bee" would be the keyword of this article.

Besides automatic keyword extraction, the TF-IDF algorithm has many other uses. For example, in information retrieval we can compute, for each document, the TF-IDF of every word in the search query ("China," "bee," "farming") and add them together to get a TF-IDF score for the whole document. The documents with the highest scores are the ones most relevant to the query.
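A sketch of that retrieval use, again reusing the helpers above: the query is a list of words, and a document's relevance is the sum of each query word's TF-IDF within that document.

```python
def query_score(query_words, document_words, corpus):
    """Relevance of one document to a query: the sum, over the query
    words, of each word's TF-IDF within the document."""
    tf = term_frequencies(document_words)
    return sum(
        tf.get(word, 0.0) * inverse_document_frequency(word, corpus)
        for word in query_words
    )
```

Documents can then be sorted by this score, highest first, to answer the query.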

The advantages of the TF-IDF algorithm are that it is simple, fast, and reasonably close to reality. The drawback is that it measures a word's importance purely by term frequency, which is not comprehensive enough; sometimes an important word may not appear very many times. Moreover, this method cannot capture positional information: a word that appears early in the article is treated as exactly as important as a word that appears late, which is not realistic. (One possible fix is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)

