Systematic NLP Learning (XXII) - A Summary of Keyword Extraction Algorithms

Let me first talk about methods for automatic summarization. Automatic Summarization comes in two main flavors: Extraction and Abstraction. Extractive summarization forms a digest from keywords and sentences that already exist in the document; abstractive summarization builds an abstract semantic representation and then uses natural language generation to produce a summary. Since abstractive methods require complex natural language understanding and generation technology, their field of application is limited, and extractive methods are the ones commonly used.

At present, the main methods are:

  • Statistics-based: use statistical information such as word frequency and position to compute sentence weights, then select the highest-weighted sentences as the digest. Features: easy to use, but it only exploits surface-level information about the words.
  • Graph-model-based: build a topological graph and rank the words, e.g., TextRank / LexRank.
  • Latent-semantics-based: use topic models to mine the hidden information of words, e.g., LDA, HMM.
  • Integer-programming-based: cast summarization as an integer linear programming problem and find the global optimum.

TextRank:

The TextRank algorithm was inspired by PageRank, which ranks web pages by their link relationships. It builds a network from the co-occurrence relationships between local words (within a window), computes the importance of each word, and selects the top-ranked words as keywords. The common pipeline is: raw corpus → word segmentation → stop-word filtering → keyword extraction.
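As a minimal sketch of this pipeline in Python (assuming jieba for segmentation; the tiny stop-word list and the POS filter below are illustrative placeholders, not a recommended configuration):

import jieba.posseg as pseg  # jieba's segmenter with part-of-speech tagging

# Placeholder stop-word list; real pipelines load a full list from a file.
STOP_WORDS = {'的', '了', '是', '在'}

def preprocess(text, allow_pos=('n', 'v', 'a')):
    """Raw text -> segmentation -> stop-word / POS filtering -> candidates."""
    return [w.word for w in pseg.cut(text)
            if w.word not in STOP_WORDS and w.flag[0] in allow_pos]

print(preprocess('TextRank是一种基于图的关键词提取算法'))

The candidate words produced this way are what the graph ranking described below operates on.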

Let's first look at PageRank.

PageRank

PageRank was originally used to compute the importance of web pages. The whole Web can be viewed as a directed graph whose nodes are web pages. If page A contains a link to page B, there is a directed edge from A to B. After the graph is constructed, the following formula is applied:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

S(Vi) is the importance (PR value) of page i. d is the damping coefficient, usually set to 0.85. In(Vi) is the set of pages that contain a link to page i. Out(Vj) is the set of pages that page j links to, and |Out(Vj)| is the number of elements in that set.
PageRank needs to iterate the formula above multiple times to reach a result. Initially, the importance of each page can be set to 1. The left-hand side of the formula is the PR value of page i after an iteration; the PR values on the right-hand side are the ones from before the iteration.
For example:

Consider three pages where B and C each link to A (as the table below encodes); intuitively, page A is the most important. From the link relationships we obtain the following table:

   End \ Start   A   B   C
   A             0   1   1
   B             0   0   0
   C             0   0   0


In fact, each column of the table represents a start node and each row represents an end node; if there is a link from the start node to the end node, the corresponding entry is 1.
According to the formula, each column must be normalized (each element divided by the sum of the column's elements). Here every nonzero column already sums to 1, so the normalized result is unchanged:

   End \ Start   A   B   C
   A             0   1   1
   B             0   0   0
   C             0   0   0


The result above forms a matrix M. Iterating 100 times (the original post uses MATLAB) shows the importance of each page: the final PR value of A is 0.4050, while B and C both end at 0.1500. If the edges above are treated as undirected (effectively bidirectional), the ranking comes out the same, with A on top.
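The same computation can be sketched in a few lines of Python with NumPy (standing in for the MATLAB script, which the original post does not show):

import numpy as np

# Column-normalized link matrix from the table above:
# rows are end nodes (A, B, C), columns are start nodes.
M = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
d = 0.85            # damping coefficient
pr = np.ones(3)     # initial PR value of every page is 1

for _ in range(100):
    pr = (1 - d) + d * M.dot(pr)   # S = (1-d) + d * M * S

print(pr)  # -> approximately [0.405 0.15 0.15]: A is the most important page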

TextRank

TextRank is a graph-based ranking algorithm for text. Its basic idea comes from Google's PageRank: divide the text into constituent units (words or sentences), build a graph model, and rank the important components of the text with a voting mechanism, so that keyword extraction and summarization can be done using only the information in the document itself. Unlike LDA, HMM, and similar models, TextRank needs no prior training over multiple documents, and its simplicity and effectiveness have made it widely used.

  • If a word appears after many different words, that word is relatively important.
  • If a word follows a word with a high TextRank value, its own TextRank value is raised accordingly.

  TextRank can in general be represented as a weighted directed graph G = (V, E), made up of a vertex set V and an edge set E, where E is a subset of V × V. The edge between any two vertices Vi and Vj carries a weight wji. For a given vertex Vi, In(Vi) is the set of vertices pointing to it and Out(Vi) is the set of vertices that Vi points to. The score of vertex Vi is defined as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

Here d is the damping coefficient, taking a value between 0 and 1 (generally 0.85); it smooths the data at each iteration and helps the iteration converge stably. When TextRank scores the vertices of a graph, arbitrary initial values are assigned to the vertices and the computation proceeds recursively until convergence, i.e., until the error rate of every vertex in the graph falls below a given threshold, generally taken as 0.0001.

1. Keyword extraction based on TextRank

  The keyword extraction task automatically extracts a number of meaningful words or phrases from a given text. TextRank ranks candidate keywords using the co-occurrence relationships between local words (within a window), extracting them from the text itself. The main steps are as follows:

  (1) Split the given text T into complete sentences, i.e., $T = [S_1, S_2, \dots, S_m]$.

  (2) For each sentence $S_i$, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words with the specified parts of speech, such as nouns, verbs, and adjectives, i.e., $S_i = [t_{i,1}, t_{i,2}, \dots, t_{i,n}]$, where each $t_{i,j}$ is a retained candidate keyword.

  (3) Construct the candidate-keyword graph G = (V, E), where V is the node set made up of the candidate keywords generated in (2). Edges are drawn according to co-occurrence: two nodes are connected only if their corresponding words co-occur within a window of length K, where K is the window size, i.e., a span of at most K consecutive words.

  (4) Following the formula above, iteratively propagate the weight of each node until convergence.

  (5) Sort the nodes by weight in descending order to obtain the T most important words as candidate keywords.

  (6) Mark the T most important words obtained in (5) in the original text; if some of them form adjacent phrases, combine them into multi-word keywords. For example, if the text contains the sentence "Matlab code for plotting ambiguity function" and both "Matlab" and "code" are candidate keywords, they are combined into "Matlab code" and added to the keyword sequence.
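Putting steps (1)-(5) together, here is a minimal self-contained Python sketch (a toy implementation of the idea, not jieba's code; the regex tokenizer, the tiny stop-word list, and the default parameters are illustrative assumptions):

import re
from collections import defaultdict

def textrank_keywords(text, window=5, d=0.85, max_iter=100, tol=1e-4, topk=5):
    """Toy TextRank: build a co-occurrence word graph, then rank its nodes."""
    # Steps (1)-(2), crudely: lowercase word tokens minus a tiny stop list
    # stand in for proper sentence splitting, POS tagging and filtering.
    stop_words = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'for', 'is'}
    words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in stop_words]

    # Step (3): connect words that co-occur within a window of `window` words.
    weight = defaultdict(float)   # (u, v) -> edge weight w_uv
    neighbors = defaultdict(set)
    for i, u in enumerate(words):
        for v in words[i + 1:i + window]:
            if u != v:
                weight[(u, v)] += 1.0   # undirected: count both directions
                weight[(v, u)] += 1.0
                neighbors[u].add(v)
                neighbors[v].add(u)
    if not neighbors:
        return []

    # Step (4): iterate WS(Vi) = (1-d) + d * sum_j (w_ji / sum_k w_jk) * WS(Vj)
    # until every node changes by less than the 0.0001-style threshold `tol`.
    score = {w: 1.0 for w in neighbors}
    out_sum = {u: sum(weight[(u, v)] for v in neighbors[u]) for u in neighbors}
    for _ in range(max_iter):
        new_score = {
            u: (1 - d) + d * sum(weight[(v, u)] / out_sum[v] * score[v]
                                 for v in neighbors[u])
            for u in neighbors
        }
        converged = max(abs(new_score[u] - score[u]) for u in score) < tol
        score = new_score
        if converged:
            break

    # Step (5): the highest-scoring nodes are the candidate keywords.
    return sorted(score, key=score.get, reverse=True)[:topk]

print(textrank_keywords('graph based ranking of words: the graph ranks words '
                        'by voting, and ranking words by graph voting works'))

Frequent, well-connected words end up with the highest scores, which is consistent with the observation later in this post that TextRank still favors frequent words.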

A supplementary note on the window:

Link relationships between web pages can be represented as a graph, but how do we build a graph out of a sentence (which can be viewed as a sequence of words)? TextRank treats each word as adjacent, in the graph sense, to the N words before it and the N words after it (similar to an N-gram language model). Concretely, a sliding window of length N is set up, and every word inside the window is regarded as a neighbor of the word node; the word graph TextRank builds is therefore undirected. The original post shows a word graph built from one document (with stop words removed and words filtered by part of speech); a small illustration of the window relation follows below.
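As a toy illustration of this adjacency rule (a hypothetical four-word "sentence", N = 2):

# Each word is linked to the N words before it and the N words after it.
words = ['keyword', 'extraction', 'algorithm', 'summary']
N = 2
edges = set()
for i, u in enumerate(words):
    for v in words[i + 1:i + N + 1]:   # the N words following u
        edges.add((u, v))
print(sorted(edges))
# [('algorithm', 'summary'), ('extraction', 'algorithm'),
#  ('extraction', 'summary'), ('keyword', 'algorithm'),
#  ('keyword', 'extraction')]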

Someone else's evaluation, found online:

Next, TextRank is evaluated on the keyword extraction task in terms of precision, recall, and F1-Measure, and compared against TFIDF. Per document, with TP the number of extracted keywords that also appear among the labelled keywords, the metrics are computed as

$$P = \frac{TP}{\#\,\text{extracted}}, \qquad R = \frac{TP}{\#\,\text{labelled}}, \qquad F_1 = \frac{2PR}{P + R},$$

with P and R macro-averaged over all documents (exactly as in the code below).

The test set is the NetEase (163) news annotated dataset provided by Liu Zhiyuan, 13,702 documents in total. Jieba ships complete implementations of both the TFIDF and TextRank keyword extraction algorithms; the evaluation code, based on jieba 0.39, is as follows:

import jieba.analyse
import json
import codecs


def precision_recall_fscore_support(y_true, y_pred):
    """
    evaluate macro precision, recall and f1-score.
    """
    doc_num = len(y_true)
    p_macro = 0.0
    r_macro = 0.0
    for i in range(doc_num):
        # count predicted keywords that also appear in the labelled set (TP)
        tp = 0
        true_len = len(y_true[i])
        pred_len = len(y_pred[i])
        for w in y_pred[i]:
            if w in y_true[i]:
                tp += 1
        # edge case: an empty prediction list scores 1.0 only if the
        # label list is empty too
        if pred_len == 0:
            p = 1.0 if true_len == 0 else 0.0
        else:
            p = tp / pred_len
        r = 1.0 if true_len == 0 else tp / true_len
        p_macro += p
        r_macro += r
    # macro-average over documents, then combine into F1
    p_macro /= doc_num
    r_macro /= doc_num
    return p_macro, r_macro, 2 * p_macro * r_macro / (p_macro + r_macro)


file_path = 'data/163_chinese_news_dataset_2011.dat'
with codecs.open(file_path, 'r', 'utf-8') as fr:
    y_true = []
    y_pred = []
    for line in fr.readlines():
        d = json.loads(line)
        content = d['content']
        # the labelled tags (deduplicated) serve as ground-truth keywords
        true_key_words = [w for w in set(d['tags'])]
        y_true.append(true_key_words)
        # uncomment to add the labelled keywords to jieba's custom dictionary:
        # for w in true_key_words:
        #     jieba.add_word(w)
        # parts of speech allowed for candidate keywords
        key_word_pos = ['x', 'ns', 'n', 'vn', 'v', 'l', 'j', 'nr', 'nrt', 'nt', 'nz', 'nrfg', 'm', 'i', 'an', 'f', 't',
                        'b', 'a', 'd', 'q', 's', 'z']
        # TFIDF extraction, top 2 keywords per document
        extract_key_words = jieba.analyse.extract_tags(content, topK=2, allowPOS=key_word_pos)
        # swap in TextRank by uncommenting the three lines below:
        # trank = jieba.analyse.TextRank()
        # trank.span = 5
        # extract_key_words = trank.textrank(content, topK=2, allowPOS=key_word_pos)
        y_pred.append(extract_key_words)
    prf = precision_recall_fscore_support(y_true, y_pred)
    print('precision: {}'.format(prf[0]))
    print('recall: {}'.format(prf[1]))
    print('F1: {}'.format(prf[2]))

Here the number of keywords extracted per document is 2, with POS filtering applied; span denotes the size of the sliding window in the TextRank algorithm. The evaluation results are as follows:

Method            Precision   Recall   F1-Measure
TFIDF             0.2697      0.2256   0.2457
TextRank span=5   0.2608      0.2150   0.2357
TextRank span=7   0.2614      0.2155   0.2363

If the labelled keywords are added to the custom dictionary, the evaluation results are as follows:

Method            Precision   Recall   F1-Measure
TFIDF             0.3145      0.2713   0.2913
TextRank span=5   0.2887      0.2442   0.2646
TextRank span=7   0.2903      0.2455   0.2660

To get an intuitive feel for the keyword extraction results (with the custom dictionary added):

// TFIDF, TextRank, labelled
['文强', '陈洪刚'] ['文强', '陈洪刚'] {'文强', '重庆'}
['内贾德', '伊朗'] ['伊朗', '内贾德'] {'制裁', '世博', '伊朗'}
['调控', '王珏林'] ['调控', '楼市'] {'楼市', '调控'}
['罗平县', '男子'] ['男子', '罗平县'] {'被砍', '副局长', '情感纠葛'}
['佟某', '黄玉'] ['佟某', '黄现忠'] {'盲井', '伪造矿难'}
['女生', '聚众淫乱'] ['女生', '聚众淫乱'] {'聚众淫乱', '东莞', '不雅视频'}
['马英九', '和平协议'] ['马英九', '推进'] {'国台办', '马英九', '和平协议'}
['东帝汶', '巡逻艇'] ['东帝汶', '中国'] {'东帝汶', '军舰', '澳大利亚'}
['墨西哥', '警方'] ['墨西哥', '袭击'] {'枪手', '墨西哥', '打死'}

From the two sets of experimental results above, we can observe the following:

  • Both TextRank and TFIDF depend heavily on the word segmentation result: if a keyword gets split into two tokens during segmentation, keyword extraction cannot glue the two tokens back together (TextRank has a partial gluing effect, but only when both tokens are themselves keywords). This is why adding the labelled keywords to the custom dictionary makes such a large difference in precision and recall.
  • TextRank does not outperform TFIDF.
  • Although TextRank takes the relationships between words into account, it still tends to select frequent words as keywords.

In addition, since TextRank involves building a word graph and iterating to convergence, its extraction is comparatively slow.


Origin blog.csdn.net/App_12062011/article/details/89816154