Application of TF-IDF and cosine similarity (1): automatic keyword extraction

Author:  Ruan Yifeng

Date:  March 15, 2013

This title may seem complicated, but I am actually talking about a very simple issue.

Suppose we have a very long article and want a computer to extract its keywords (automatic keyphrase extraction) with no manual intervention at all. How can this be done well?

This problem touches several cutting-edge areas of computer science, such as data mining, text processing, and information retrieval. Surprisingly, though, there is a very simple classical algorithm that gives quite satisfactory results. It is so simple that it requires no advanced mathematics; an ordinary person can understand it in ten minutes. This is the TF-IDF algorithm I want to introduce today.

Let's start with an example. Suppose there is a long article, "Bee Breeding in China", whose keywords we want to extract by computer.

An obvious first idea is to find the words that occur most often: if a word is important, it should appear many times in the article. So we compute each word's "term frequency" (TF).

As you must have guessed, the words that appear most often are "of" (的), "is" (是), and "in" (在), the most commonly used words in the language. They are called "stop words": words that contribute nothing to the result and must be filtered out.

Suppose we filter them all out and consider only the remaining meaningful words. Then we run into another problem: we may find that "China", "bee", and "breeding" all appear the same number of times. Does that mean the three are equally important as keywords?

Clearly not. "China" is a very common word, while "bee" and "breeding" are relatively uncommon. If the three words appear the same number of times in an article, it is reasonable to conclude that "bee" and "breeding" are more important than "China"; that is, in the keyword ranking, "bee" and "breeding" should come before "China".

Therefore, we need an importance adjustment coefficient that measures how common a word is. If a word is relatively rare but appears many times in this article, it probably reflects the character of the article, and it is exactly the kind of keyword we need.

Expressed in statistical terms: on top of term frequency, we assign each word an "importance" weight. The most common words ("of", "is", "in") get the smallest weight, fairly common words ("China") get a small weight, and relatively rare words ("bee", "breeding") get a larger weight. This weight is called the "inverse document frequency" (IDF); the more common a word is, the smaller its IDF.

Once we know a word's term frequency (TF) and inverse document frequency (IDF), multiplying the two gives its TF-IDF value. The more important a word is to the article, the larger its TF-IDF value. The top-ranked words are therefore the keywords of the article.

Below are the details of the algorithm.

The first step is to calculate the term frequency. To account for article length and make different articles comparable, the term frequency is normalized:

    TF = (occurrences of the word in the article) / (total number of words in the article)

or, alternatively:

    TF = (occurrences of the word in the article) / (occurrences of the article's most frequent word)
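As a small sketch of the first normalization above (whitespace tokenization is an illustrative assumption; real Chinese text would need a word segmenter):

```python
from collections import Counter

def term_frequency(tokens):
    """Normalized term frequency: occurrences of each word / total words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tokens = "bee breeding in china focuses on bee health".split()
tf = term_frequency(tokens)
# "bee" appears 2 times among 8 tokens
print(tf["bee"])  # 0.25
```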

The second step is to calculate the inverse document frequency. This requires a corpus to model the environment in which the language is used:

    IDF = log( (total number of documents in the corpus) / (number of documents containing the word + 1) )

The more common a word is, the larger the denominator, and the smaller the inverse document frequency, which approaches 0. The denominator has 1 added to it to avoid division by zero (i.e., the case where no document contains the word); log denotes taking the logarithm of the resulting value.
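A minimal sketch of this formula over a tiny toy corpus (documents represented as sets of words; the corpus contents are invented for illustration):

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / (number of documents containing the word + 1))."""
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (n_containing + 1))

corpus = [
    {"china", "bee", "breeding"},
    {"china", "economy"},
    {"china", "history"},
    {"travel", "food"},
]
# "china" appears in 3 of 4 documents; "bee" in only 1,
# so "bee" receives the larger IDF
print(inverse_document_frequency("china", corpus))  # log(4/4) = 0.0
print(inverse_document_frequency("bee", corpus))    # log(4/2) ≈ 0.693
```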

The third step is to compute TF-IDF:

    TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to how often a word appears in the document, and inversely related to how often it appears in the language as a whole. So the algorithm for automatic keyword extraction is clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
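Putting the three steps together, here is a hedged end-to-end sketch of the procedure just described (whitespace tokenization and a toy corpus are illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_keywords(document_tokens, corpus, top_n=3):
    """Score each word of the document by TF * IDF; return the top_n words."""
    total = len(document_tokens)
    tf = {w: c / total for w, c in Counter(document_tokens).items()}
    n_docs = len(corpus)
    scores = {}
    for word, freq in tf.items():
        n_containing = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (n_containing + 1))
        scores[word] = freq * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

doc = "bee breeding bee hive china bee".split()
corpus = [set(doc), {"china", "economy"}, {"china", "history"}, {"travel"}]
print(tfidf_keywords(doc, corpus))  # "bee" ranks ahead of "china"
```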

Take "Bee Breeding in China" again. Suppose the article is 1,000 words long and "China", "bee", and "breeding" each appear 20 times, so the term frequency (TF) of all three words is 0.02. A Google search then shows 25 billion pages containing the word "of" (的); assume this is the total number of Chinese web pages. There are 6.23 billion pages containing "China", 48.4 million containing "bee", and 97.3 million containing "breeding". Their inverse document frequencies (IDF) and TF-IDF values are as follows:

    Word        Pages containing it   IDF     TF-IDF
    China       6.23 billion          0.603   0.0121
    bee         48.4 million          2.713   0.0543
    breeding    97.3 million          2.410   0.0482

As the table shows, "bee" has the highest TF-IDF value, "breeding" comes second, and "China" last. (If we also computed the TF-IDF of "of" (的), it would be extremely close to 0.) So if we had to pick a single word, "bee" would be the keyword of this article.
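The numbers in that table can be reproduced directly from the formula, using the page counts given above and a base-10 logarithm (at this scale the +1 in the denominator is negligible, so it is omitted here):

```python
import math

TOTAL_PAGES = 25e9  # assumed total number of Chinese web pages
pages = {"China": 6.23e9, "bee": 48.4e6, "breeding": 97.3e6}
tf = 0.02  # each word appears 20 times in a 1,000-word article

for word, n in pages.items():
    idf = math.log10(TOTAL_PAGES / n)
    print(word, round(idf, 3), round(tf * idf, 4))
# China    0.603  0.0121
# bee      2.713  0.0543
# breeding 2.41   0.0482
```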

Besides automatic keyword extraction, the TF-IDF algorithm has many other uses. In information retrieval, for example, we can compute, for each document, the TF-IDF of every term in a search query ("China", "bee", "breeding") and add them up to get a TF-IDF score for the whole document. The document with the highest score is the one most relevant to the query.
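A sketch of that retrieval idea: sum the TF-IDF of each query term per document, then rank documents by the total (toy corpus and whitespace tokenization are assumptions for illustration):

```python
import math
from collections import Counter

def score_query(query_terms, doc_tokens, corpus):
    """Sum the TF-IDF of each query term within one document."""
    total = len(doc_tokens)
    counts = Counter(doc_tokens)  # returns 0 for absent terms
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term] / total
        n_containing = sum(1 for d in corpus if term in d)
        idf = math.log(n_docs / (n_containing + 1))
        score += tf * idf
    return score

docs = [
    "china bee breeding bee keeping".split(),
    "china economy growth report".split(),
]
corpus = [set(d) for d in docs] + [{"travel"}, {"history"}]
query = ["china", "bee", "breeding"]
ranked = sorted(docs, key=lambda d: score_query(query, d, corpus), reverse=True)
# the bee-breeding document ranks first for this query
```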

The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match reality reasonably well. Its drawback is that measuring a word's importance purely by frequency is not comprehensive: sometimes important words do not appear many times. Moreover, the algorithm cannot capture word position; a word near the beginning of the text and a word near the end are treated as equally important, which is not right. (One remedy is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)

Next time, I will combine TF-IDF with cosine similarity to measure how similar two documents are.
