How TextRank keyword extraction works and how to implement it

TextRank is an algorithm, inspired by Google's PageRank, that assigns weights to the sentences of a text; its goal is automatic summarization. It works by voting: each word votes for its neighbors (in TextRank terms, the words inside its window), and the weight of each vote depends on how many votes the voter has itself received. This is a "chicken or the egg" paradox, which PageRank resolves by iterating over the matrix until the scores converge. TextRank is no exception:

PageRank calculation formula:

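$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

Here S(V_i) is the score of node V_i, d is the damping factor, In(V_i) is the set of nodes that link to V_i, and Out(V_j) is the set of nodes that V_j links to.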

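The TextRank formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

The full TextRank formula builds on the PageRank formula by introducing edge weights, which represent the similarity between two sentences. WS(V_i) is the weighted score of node V_i and w_{ji} is the weight of the edge between V_j and V_i; d, In and Out mean the same as in PageRank (damping factor, incoming neighbors, outgoing neighbors).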

Here, however, I only want to extract keywords. If each word is treated as a sentence, no edge between two such sentences (words) carries a meaningful weight: single words have no overlap, so there is no similarity to measure. Every edge can therefore be given the same weight, the weight w cancels between the numerator and the denominator, and the algorithm degenerates to PageRank. So it is no exaggeration to call this keyword extraction algorithm PageRank.
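
To make the cancellation explicit, here is a short derivation. It assumes that every edge carries the same nonzero weight c (the constant c is introduced only for this illustration), and writes WS(V_i) for the score of word V_i, In(V_i) for the words that vote for it, Out(V_j) for the words that V_j votes for, and d for the damping factor:

```latex
\begin{aligned}
WS(V_i) &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{c}{\sum_{V_k \in Out(V_j)} c}\, WS(V_j) \\
        &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{c}{c\,|Out(V_j)|}\, WS(V_j) \\
        &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|}\, WS(V_j)
\end{aligned}
```

The last line is the PageRank update. In an undirected window graph the incoming and outgoing neighbors of a word are the same set, so |Out(V_j)| is simply the number of neighbors of V_j.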

Java implementation

First, the test data:

A programmer (English: Programmer) is a professional engaged in program development and maintenance.
Programmers are generally divided into program designers and program coders, but the boundary
between the two is not very clear, especially in China. Software practitioners fall into four broad
categories: junior programmers, senior programmers, systems analysts and project managers.

I took the Baidu Encyclopedia definition of "programmer" as the test case. Clearly, the keyword of this definition should be "programmer", and "programmer" should receive the highest score.

First, segment the text into words. Any of several projects can do this, for example the HanLP segmenter. The segmentation result (each token is followed by its part-of-speech tag):

[programmer/n, (, English/nz, programmer/en, ), is/v, engaged in/v, program/n,
development/v, ,/w, maintenance/v, of/uj, professional/n, personnel/n, ./w, generally/a,
will/d, programmer/n, divided/v, program/n, design/vn, personnel/n, and/c,
program/n, coding/n, personnel/n, ,/w, but/c, both/r, of/uj, boundary/n,
and/c, not/d, very/d, clear/a, ,/w, especially/d, is/v, in/p, China/ns, ./w,
software/n, practitioner/b, personnel/n, divided/v, junior/b, programmer/n, ,/w,
senior/a, programmer/n, ,/w, system/n, analyst/n, and/c, project/n,
manager/n, four/m, big/a, category/q, ./w]

Next, remove the stop words. Here I removed punctuation, common words, and every word that is not a noun, verb, adjective or adverb. That leaves the words that are actually useful:

[programmer, English, program, development, maintenance, professional, personnel, programmer,
divided, program, design, personnel, program, coding, personnel, boundary, especially,
China, software, personnel, divided, programmer, senior, programmer, system,
analyst, project, manager]
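
The filtering step could be sketched roughly as follows. This is a minimal illustration, not the author's code and not a HanLP API: the class `KeywordFilter`, the method `filter`, and the rule "keep tags that start with n, v, a or d" are all assumptions made for this example, as is the stop-word set passed in by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of HanLP): filters tokens written as "word/tag".
class KeywordFilter {
    // Assumed rule: keep tags whose first letter is n (noun), v (verb),
    // a (adjective) or d (adverb); everything else is dropped.
    private static final String KEPT_TAG_INITIALS = "nvad";

    static List<String> filter(List<String> taggedTokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            // Tokens without a word or without a tag (for example "(") are skipped.
            if (slash <= 0 || slash == token.length() - 1) continue;
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            boolean keptTag = KEPT_TAG_INITIALS.indexOf(tag.charAt(0)) >= 0;
            if (keptTag && !stopWords.contains(word)) kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "programmer/n", "(", "is/v", "of/uj", "development/v", "./w");
        Set<String> stopWords = new HashSet<>(Arrays.asList("is"));
        System.out.println(filter(tokens, stopWords)); // prints [programmer, development]
    }
}
```

Punctuation falls out automatically because its tag (w) is not on the kept list, while common words such as "is" have to be listed as stop words because their tag alone would let them through.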

The key steps of the implementation are as follows.

Next, set up two windows of size 5, so that each word votes for the words that fall inside a 5-word window before or after it (the words inside the brackets are listed in sorted order, but the original word order of the text must not be disturbed when the windows are built):
{development = [professional, programmer, maintenance, English, program, personnel],
software = [programmer, divided, boundary, senior, China, especially, personnel],
programmer = [development, software, analyst, maintenance, system, project, manager, divided, English, program, professional, design, senior, personnel, China],
analyst = [programmer, system, project, manager, senior],
maintenance = [professional, development, programmer, divided, English, program, personnel],
system = [programmer, analyst, project, manager, divided, senior],
project = [programmer, analyst, system, manager, senior],
manager = [programmer, analyst, system, project],
divided = [professional, software, design, programmer, maintenance, system, senior, program, China, especially, personnel],
English = [professional, development, programmer, maintenance, program],
program = [professional, development, design, programmer, coding, maintenance, boundary, divided, English, especially, personnel],
especially = [software, coding, divided, boundary, program, China, personnel],
professional = [development, programmer, maintenance, divided, English, program, personnel],
design = [programmer, coding, divided, program, personnel],
coding = [design, boundary, program, China, especially, personnel],
boundary = [software, coding, program, China, especially, personnel],
senior = [programmer, software, analyst, system, project, divided, personnel],
China = [programmer, software, coding, divided, boundary, especially, personnel],
personnel = [development, programmer, software, maintenance, divided, program, especially, professional, design, coding, boundary, senior, China]}
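
One way to build such a neighbor map is sketched below. It is an illustration under assumptions rather than the author's code: the class `WindowGraph` and the method `build` are hypothetical names, and the window is read as "5 consecutive words", which is what the listing above suggests (the entry for manager has four neighbors and does not include senior, which sits five positions earlier in the word list).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch; the class and method names are assumptions.
class WindowGraph {
    // Links every pair of distinct words that occur within `window` consecutive positions.
    static Map<String, Set<String>> build(List<String> wordList, int window) {
        Map<String, Set<String>> words = new TreeMap<>();
        Deque<String> recent = new ArrayDeque<>(); // up to (window - 1) preceding words
        for (String w : wordList) {
            words.computeIfAbsent(w, k -> new TreeSet<>());
            for (String prev : recent) {
                if (prev.equals(w)) continue; // a word does not vote for itself
                words.get(w).add(prev);       // the edge is undirected:
                words.get(prev).add(w);       // both words vote for each other
            }
            recent.addLast(w);
            if (recent.size() > window - 1) recent.removeFirst();
        }
        return words;
    }

    public static void main(String[] args) {
        // The last six words of the filtered word list.
        List<String> tail = Arrays.asList(
                "senior", "programmer", "system", "analyst", "project", "manager");
        // prints [analyst, programmer, project, system]
        System.out.println(build(tail, 5).get("manager"));
    }
}
```

Sets are used for the neighbor lists so that a word that appears several times inside a window is counted as a single neighbor.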

Then the voting iterations begin. The code is not hard; it simply implements the algorithm as described in the original paper. I have added brief comments to make it easier to read later:

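// Uses variables declared elsewhere: words (Map<String, Set<String>>, the window map),
// score (Map<String, Float>), d (damping factor), max_iter and min_diff.
for (int i = 0; i < max_iter; ++i) // the outer loop is bounded by the maximum number of iterations
{
    Map<String, Float> m = new HashMap<String, Float>(); // <word, score>
    float max_diff = 0; // largest score change in this iteration, used for the convergence test
    // each entry is one window list, e.g. design=[programmer, coding, divided, program, personnel]
    // the comments below use this entry as the running example
    for (Map.Entry<String, Set<String>> entry : words.entrySet())
    {
        String key = entry.getKey(); // design
        Set<String> value = entry.getValue(); // [programmer, coding, divided, program, personnel]
        m.put(key, 1 - d); // the (1 - d) term of the formula
        for (String other : value) // iterate over the words in the window list
        {
            int size = words.get(other).size(); // degree of the neighboring word
            if (key.equals(other) || size == 0) continue; // skip the word itself and words without neighbors
            m.put(key, m.get(key) + d / size * (score.get(other) == null ? 0 : score.get(other)));
        }
        // after each score is computed, track the largest change from the previous iteration
        max_diff = Math.max(max_diff, Math.abs(m.get(key) - (score.get(key) == null ? 0 : score.get(key))));
    }
    score = m;
    if (max_diff <= min_diff) break; // converged
}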
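
To try the loop in isolation, it can be wrapped in a small self-contained class and run on a toy graph. The sketch below is only a harness for experimentation: the class `TextRankToy`, the method `rank`, the three-word graph, and the parameter values 0.85, 200 and 0.001 are assumptions chosen for this example, not values stated in the article.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical harness around the voting loop; all names here are illustrative.
class TextRankToy {
    static Map<String, Float> rank(Map<String, Set<String>> words,
                                   float d, int maxIter, float minDiff) {
        Map<String, Float> score = new HashMap<>();
        for (int i = 0; i < maxIter; ++i) {
            Map<String, Float> m = new HashMap<>();
            float maxDiff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
                String key = entry.getKey();
                float sum = 1 - d; // the (1 - d) term
                for (String other : entry.getValue()) {
                    int size = words.get(other).size(); // degree of the neighbor
                    if (key.equals(other) || size == 0) continue;
                    sum += d / size * score.getOrDefault(other, 0f);
                }
                m.put(key, sum);
                maxDiff = Math.max(maxDiff, Math.abs(sum - score.getOrDefault(key, 0f)));
            }
            score = m;
            if (maxDiff <= minDiff) break; // scores have stopped changing
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy graph: "b" is linked to "a" and "c"; "a" and "c" are not linked to each other.
        Map<String, Set<String>> words = new LinkedHashMap<>();
        words.put("a", new LinkedHashSet<>(Arrays.asList("b")));
        words.put("b", new LinkedHashSet<>(Arrays.asList("a", "c")));
        words.put("c", new LinkedHashSet<>(Arrays.asList("b")));
        System.out.println(rank(words, 0.85f, 200, 0.001f));
    }
}
```

On this graph the middle word "b" should end up with the highest score, since it receives the full vote of both of its neighbors, while "a" and "c" each receive only half of the vote of "b".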

The voting results after sorting:

[programmer = 1.9249977,
personnel = 1.6290349,
divided = 1.4027836,
program = 1.4025855,
senior = 0.9747374,
software = 0.93525416,
China = 0.93414587,
especially = 0.93352026,
maintenance = 0.9321688,
professional = 0.9321688,
system = 0.885048,
coding = 0.82671607,
boundary = 0.82206935,
development = 0.82074183,
analyst = 0.77101076,
project = 0.77101076,
English = 0.7098714,
design = 0.6992446,
manager = 0.64640945]
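
The sorting step itself is not shown in the article. A minimal sketch of one way to do it follows; the class `ScoreSorter`, the method `sortByScore`, and the alphabetical tie-break are assumptions added for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; the names and the tie-breaking rule are assumptions.
class ScoreSorter {
    // Returns the entries ordered by descending score; equal scores are ordered by word.
    static List<Map.Entry<String, Float>> sortByScore(Map<String, Float> score) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(score.entrySet());
        entries.sort((a, b) -> {
            int byScore = Float.compare(b.getValue(), a.getValue()); // higher score first
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Float> score = new HashMap<>();
        score.put("manager", 0.64640945f);
        score.put("programmer", 1.9249977f);
        score.put("personnel", 1.6290349f);
        // The first entry is the top keyword.
        System.out.println(sortByScore(score).get(0).getKey()); // prints programmer
    }
}
```

Taking the first few entries of the sorted list then gives the extracted keywords.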

Original: Big Box  TextRank extract keywords implementation principle


Origin www.cnblogs.com/chinatrump/p/11597100.html