Keyword extraction with TextRank

TextRank is inspired by the PageRank algorithm, which is used to rank web pages by importance.

PageRank is a graph-based algorithm. Each web page can be seen as a node in the graph; if page A can jump (link) to page B, there is a directed edge A -> B. In this way we can construct a directed graph.

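Then, using the formula:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

After several iterations we obtain the weight of each page. The elements of the formula mean the following:

S(V_i): the weight of page V_i.
d: the damping factor, usually set to 0.85.
In(V_i): the pages that can jump to V_i, i.e. the incoming edges of the node in the graph.
Out(V_j): the pages that V_j can jump to, i.e. the outgoing edges of the node in the graph.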
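To make the iteration concrete, here is a minimal Python sketch. It assumes the usual PageRank update S(V_i) = (1 - d) + d * sum of S(V_j) / |Out(V_j)| over the pages V_j that link to V_i. The function name `pagerank`, the three-page graph, and the fixed iteration count are all made up for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Repeat the update S(V_i) = (1 - d) + d * sum(S(V_j) / |Out(V_j)|)."""
    # Collect every page that appears as a source or as a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    scores = {node: 1.0 for node in nodes}  # arbitrary starting weights
    for _ in range(iterations):
        new_scores = {node: 1.0 - d for node in nodes}
        for source, targets in out_links.items():
            if not targets:
                continue  # a page with no outgoing links passes nothing on
            share = d * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph C ends up with the highest weight, because both A and B link to it, and B ends up with the lowest, because it only receives half of A's weight.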

As you can see, this method works as long as there is a graph structure, and web pages happen to map onto one naturally; it is actually a fairly generic algorithm. Text is no different: as long as we can build a graph whose nodes are words or sentences, and define some relation between those nodes, we can use the algorithm above to obtain the keywords or the summary of an article.

Extracting keywords with TextRank
Extracting keywords is really much like picking out the more important web pages, so we only need to find a way to build the graph.

The nodes of the graph are easy to define: they are the words themselves. Split the article into sentences, split each sentence into words, and use the words as nodes.

So how do we define the edges? Here we can borrow the n-gram idea: simply put, a word is related only to the n words near it, so its node is connected to the nodes of those n nearby words by an undirected edge (or, equivalently, two directed edges).

In addition, some preprocessing is possible, such as removing words of certain parts of speech or words on a custom list, keeping only some of the words, and adding edges only between the words that remain.
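The graph construction described above can be sketched as follows. This is only one possible reading: it assumes whitespace tokenization, filters a custom stop-word list before applying the window, and links each word to the `window` words that follow it in the same sentence. The function name `build_word_graph` and the sample sentences are hypothetical.

```python
from collections import defaultdict


def build_word_graph(sentences, window=2, stop_words=frozenset()):
    """Link each kept word to the kept words at most `window` positions after it."""
    graph = defaultdict(set)
    for sentence in sentences:
        # Assumption: plain whitespace tokenization, lowercased.
        words = [w for w in sentence.lower().split() if w not in stop_words]
        for i, word in enumerate(words):
            for neighbor in words[i + 1 : i + 1 + window]:
                if neighbor != word:
                    # An undirected edge is stored as two directed ones.
                    graph[word].add(neighbor)
                    graph[neighbor].add(word)
    return graph


sentences = [
    "graph ranking finds the important words",
    "important words form the graph",
]
graph = build_word_graph(sentences, window=2, stop_words={"the"})
```

A real pipeline would more likely use a proper tokenizer and a part-of-speech filter; the stop-word set here only stands in for that step.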

The paper gives an example of such a graph (figure omitted).

Once the graph has been built, the formula above can be solved iteratively.

Extracting a summary with TextRank

Keyword extraction uses words as nodes, so summary extraction naturally uses sentences as nodes. What about the edges? How should they be defined? The method above is not very suitable here, because even two adjacent sentences may be about completely different things.

In the paper, the authors give a method that computes the similarity between two sentences. My understanding is that this similarity is a fairly crude way to judge whether two sentences are about the same thing: if they are, they will certainly use similar words, and so they can be connected by an edge.

Once there is a similarity, some pairs of sentences will be very similar and others not so similar, so the edges also need to carry weights.

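Below is the similarity formula given in the paper:

$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$

Put simply, it is the number of words the two sentences share divided by the lengths of the two sentences (as for why log is used, I have not figured it out, and the paper does not say). One more point: other ways of computing similarity should also work, such as cosine similarity or longest common subsequence, but the paper only mentions this in passing.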
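As a rough illustration, the overlap similarity can be written in a few lines of Python. This sketch assumes each sentence is already a list of words, counts distinct shared words, and returns 0 when the denominator would be zero or undefined; those edge cases, the function name `sentence_similarity`, and the sample sentences are my own choices rather than anything from the paper.

```python
import math


def sentence_similarity(words_a, words_b):
    """Shared distinct words divided by log(len(a)) + log(len(b))."""
    if not words_a or not words_b:
        return 0.0  # assumption: an empty sentence is similar to nothing
    shared = len(set(words_a) & set(words_b))
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if shared == 0 or denominator <= 0:
        return 0.0  # assumption: also covers two one-word sentences
    return shared / denominator


a = "textrank builds a graph of sentences".split()
b = "the graph links similar sentences".split()
score = sentence_similarity(a, b)  # 2 shared words / (log 6 + log 5)
```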

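Because weighted edges are used, the formula has to be modified accordingly:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

The formula above basically adds the weight to the part that corresponds to an edge, and the count of edges becomes a sum of weights, which is easy to understand.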
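The weighted update can be sketched in Python as well. This assumes the weighted form WS(V_i) = (1 - d) + d * sum of w_ji / (sum of w_jk over Out(V_j)) * WS(V_j), where `weights[j][i]` holds the weight of the edge from j to i. The function name `weighted_textrank` and the similarity values are invented purely for illustration.

```python
def weighted_textrank(weights, d=0.85, iterations=50):
    """Each node shares its score among its neighbors in proportion to edge weight."""
    nodes = set(weights)
    for targets in weights.values():
        nodes.update(targets)

    scores = {node: 1.0 for node in nodes}  # arbitrary starting weights
    for _ in range(iterations):
        new_scores = {node: 1.0 - d for node in nodes}
        for source, targets in weights.items():
            total = sum(targets.values())  # replaces the edge count |Out(V_j)|
            if total == 0:
                continue
            for target, weight in targets.items():
                new_scores[target] += d * scores[source] * weight / total
        scores = new_scores
    return scores


# Hypothetical symmetric similarities between three sentences.
similarity = {
    "s1": {"s2": 0.9, "s3": 0.1},
    "s2": {"s1": 0.9, "s3": 0.4},
    "s3": {"s1": 0.1, "s2": 0.4},
}
scores = weighted_textrank(similarity)
best = max(scores, key=scores.get)
```

With these made-up weights, s2 is the most strongly connected sentence and comes out on top, so it would be the first candidate for the summary.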

Origin blog.csdn.net/asdfsadfasdfsa/article/details/90705382