TextRank is a weighting algorithm for the sentences of a text, inspired by Google's PageRank; its goal is automatic summarization. It works on the principle of voting: each word votes for its neighbors (the words inside a window), and the weight of each vote depends on how many votes the voter itself has received. This is a "chicken or the egg" paradox, which PageRank resolves by iterating the matrix computation until it converges. TextRank is no exception:
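The PageRank formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
The TextRank formula:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
Here $d$ is the damping factor (usually 0.85), $In(V_i)$ is the set of nodes pointing to $V_i$, $Out(V_j)$ is the set of nodes that $V_j$ points to, and $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$.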
The formal TextRank formula builds on the PageRank formula by introducing edge weights, which represent the similarity between two sentences.
But I only want to extract keywords. If each word is treated as a sentence, then the similarity between any two such one-word sentences is 0 (no overlap, no similarity), so the edge weights carry no information and every edge is treated alike. The weight w then cancels between numerator and denominator, and the algorithm degenerates to PageRank. So it is no exaggeration to call the keyword extraction algorithm here PageRank.
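To spell out the cancellation, suppose (an assumption made here purely for illustration) that every edge carries the same constant weight $c > 0$. The coefficient inside the TextRank sum then reduces to the PageRank coefficient:

$$\frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} = \frac{c}{|Out(V_j)| \cdot c} = \frac{1}{|Out(V_j)|}$$

so each neighbor $V_j$ simply splits its score evenly among the nodes it votes for.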
Java implementation
First, a look at the test data:
A programmer (English: Programmer) is a professional engaged in program development and maintenance.
I took the Baidu Encyclopedia definition of "programmer" as the test case. Clearly the keyword of this definition should be "programmer", and "programmer" should get the highest score.
First, segment the sentence into words. Any word segmentation project will do, for example HanLP. The segmentation result:
[Programmer/n, (, English/nz, Programmer/en, ), is/v, engaged in/v, program/n,
Then remove the stop words. Here I removed punctuation, common words, and every word that is not a noun, verb, adjective, or adverb. That leaves the words that are actually useful:
[programmer, English, program, development, maintenance, professional, personnel, programmer,
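The filtering step can be sketched roughly as follows. This is a minimal illustration, not the author's code: the `Term` class, the `STOP_WORDS` set, and the rule "keep tags starting with n, v, a, or d" are assumptions made for this example, loosely modeled on the tag style shown in the segmentation output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordFilter {
    /** Hypothetical token type: a word together with its part-of-speech tag. */
    static final class Term {
        final String word;
        final String nature;

        Term(String word, String nature) {
            this.word = word;
            this.nature = nature;
        }
    }

    // Illustrative stop-word list (an assumption); a real one would come from a dictionary file.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "a", "the", "in"));

    /** Keeps nouns (n*), verbs (v*), adjectives (a*) and adverbs (d*) that are not stop words. */
    static boolean shouldInclude(Term term) {
        if (term.nature == null || term.nature.isEmpty()) {
            return false; // untagged tokens such as bare punctuation
        }
        char tag = term.nature.charAt(0);
        boolean contentWord = tag == 'n' || tag == 'v' || tag == 'a' || tag == 'd';
        return contentWord && !STOP_WORDS.contains(term.word.toLowerCase());
    }

    /** Returns the surviving words in their original order. */
    static List<String> filter(List<Term> terms) {
        List<String> kept = new ArrayList<>();
        for (Term term : terms) {
            if (shouldInclude(term)) {
                kept.add(term.word);
            }
        }
        return kept;
    }
}
```

Keeping the original order matters, because the next step links words by their distance in the text.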
The key part of the implementation comes next.
Set up two windows of size 5, one before and one after each word, so that each word votes for the words within a distance of 5 in front of and behind it (the words inside the brackets are sorted, but the original word order of the text must not be disturbed):
{development=[professional, programmer, maintenance, English, program, personnel],
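One possible way to build this neighbor map is sketched below. It assumes that "window of size 5" means two words are linked when they are at most `window` positions apart, which is one reading of the description above. The class and method names are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class CooccurrenceGraph {
    /**
     * Links every word to the words at most `window` positions before it.
     * Links are added in both directions, so each word also ends up linked
     * to the words at most `window` positions after it.
     */
    static Map<String, Set<String>> build(List<String> words, int window) {
        // Sorted containers only make the printed result easy to read;
        // the input list itself is traversed in its original order.
        Map<String, Set<String>> graph = new TreeMap<>();
        Deque<String> recent = new ArrayDeque<>(); // the last `window` words seen
        for (String word : words) {
            graph.computeIfAbsent(word, k -> new TreeSet<>());
            for (String previous : recent) {
                if (previous.equals(word)) {
                    continue; // a word does not vote for itself
                }
                graph.get(word).add(previous);
                graph.get(previous).add(word);
            }
            recent.addLast(word);
            if (recent.size() > window) {
                recent.removeFirst();
            }
        }
        return graph;
    }
}
```

Calling `CooccurrenceGraph.build(words, 5)` on the filtered word list yields a map from each word to the set of words it votes for.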
Then the voting iterations begin. The code is not hard: it simply implements the algorithm as described in the original paper. I have added brief comments here to make it easier to reread later:
for (int i = 0; i < max_iter; ++i) // the outermost loop is bounded by the maximum number of iterations set for the algorithm
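Since only the first line of the loop survives here, the following is a reconstruction sketch of what such an iteration might look like, not the author's actual code. It assumes an unweighted neighbor map, a damping factor `d`, an initial score of 1 for every word, and a convergence threshold `minDiff`. All names are chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeywordRank {
    /** Iterates S(Vi) = (1 - d) + d * sum over neighbors Vj of S(Vj) / |Out(Vj)|. */
    static Map<String, Double> rank(Map<String, Set<String>> graph, double d, int maxIter, double minDiff) {
        Map<String, Double> score = new HashMap<>();
        for (String word : graph.keySet()) {
            score.put(word, 1.0); // assumed starting value
        }
        for (int i = 0; i < maxIter; ++i) { // outer bound: maximum number of iterations
            Map<String, Double> next = new HashMap<>();
            double maxDiff = 0;
            for (Map.Entry<String, Set<String>> entry : graph.entrySet()) {
                String word = entry.getKey();
                double sum = 0;
                for (String voter : entry.getValue()) {
                    Set<String> voterNeighbors = graph.get(voter);
                    if (voterNeighbors == null || voterNeighbors.isEmpty()) {
                        continue; // a voter without outgoing links has nothing to give
                    }
                    // The voter splits its current score evenly among its neighbors.
                    sum += score.get(voter) / voterNeighbors.size();
                }
                double updated = (1 - d) + d * sum;
                next.put(word, updated);
                maxDiff = Math.max(maxDiff, Math.abs(updated - score.get(word)));
            }
            score = next;
            if (maxDiff <= minDiff) {
                break; // scores have stopped changing: converged
            }
        }
        return score;
    }

    /** Returns up to n words ordered by descending score. */
    static List<String> topKeywords(Map<String, Double> score, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(score.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); ++i) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Each round computes all new scores from the previous round's scores, so the result does not depend on the order in which the words are visited.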
The sorted voting results:
[programmer=1.9249977,
Original: Big Box TextRank extract keywords implementation principle