Natural language keyword extraction

Keyword extraction pulls out of a text the words that are most relevant to its meaning. It has important applications in document retrieval, automatic summarization, text clustering and classification, and so on.

Keyword extraction algorithms generally fall into two types: supervised and unsupervised.

Supervised keyword extraction is mainly carried out as classification: a rich and comprehensive vocabulary is built, and each document is then scored against each word in the vocabulary, so that keywords are extracted in a way similar to tagging. The advantage is high accuracy. The disadvantages are that it needs a large amount of labeled data, the labor cost is high, and the vocabulary has to be maintained in a timely manner.

By comparison, unsupervised methods place low demands on data: they need neither a manually built and maintained vocabulary nor a manually annotated corpus for training. The keyword extraction algorithms in common use today are unsupervised: the TF-IDF algorithm, the TextRank algorithm, and topic model algorithms (including LSA, LSI, LDA, etc.).

1.TF-IDF algorithm

TF-IDF is a statistical method used to reflect how important a word is to a document in a corpus. Its main idea is: if a word appears with high frequency in one document (high TF) and rarely appears in other documents (high IDF), the word is considered to have a good ability to distinguish between categories.

TF is the term frequency, i.e., the frequency with which a word occurs in document d: tf(word) = (number of times the word appears in the document) / (total number of words in the document)
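As a minimal sketch of the TF definition above, assuming the document has already been split into words (the function name `term_frequency` and the toy document are hypothetical and only for illustration):

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: occurrences / total number of words} for one tokenized document."""
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy document, already segmented into words.
doc = ["keyword", "extraction", "finds", "keyword", "candidates"]
tf = term_frequency(doc)
print(tf["keyword"])  # 2 occurrences out of 5 words -> 0.4
```

Dividing by the document length keeps long documents from getting higher scores simply because they contain more words.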

IDF is calculated as follows: idf(word) = log(|D| / (1 + |D_i|)), where |D| is the total number of documents in the document set and |D_i| is the number of documents in which word i appears. Adding 1 to the denominator is a form of Laplace smoothing; it avoids a zero denominator when a new word does not appear in the corpus at all. The TF-IDF value of a word is the product tf(word) × idf(word).
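The following sketch combines the two quantities, using idf(word) = log(|D| / (1 + number of documents containing the word)) and scoring each word by tf × idf. The function name `tf_idf_scores` and the toy corpus are assumptions made for illustration; this is not how jieba computes its weights internally.

```python
import math
from collections import Counter


def tf_idf_scores(tokens, corpus):
    """Score each word of `tokens` by tf * idf, with idf taken over `corpus`."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word at most once per document

    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # doc_freq[word] is 0 for unseen words; the +1 keeps the denominator non-zero.
        idf = math.log(n_docs / (1 + doc_freq[word]))
        scores[word] = tf * idf
    return scores


# Hypothetical toy corpus of four tokenized documents.
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "date"],
    ["cherry", "date", "fig"],
]
scores = tf_idf_scores(corpus[0], corpus)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(word, round(score, 3))
```

In the first document, "apple" is frequent and appears nowhere else, so it outranks "banana", which occurs in three of the four documents. Note that with this smoothing a word that occurs in every document gets a slightly negative IDF.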

2.TextRank algorithm

An important feature of this algorithm is that it works without a corpus: keywords can be extracted by analyzing a single document alone. The basic idea comes from Google's PageRank algorithm, a link analysis algorithm that Google founders Larry Page and Sergey Brin devised in 1997 while building an early prototype of their search system. It rests on two basic ideas:

1) Number of links. The more other pages link to a web page, the more important that page is.
2) Link quality. If a web page is linked to by a page with a higher weight, that also indicates the page is more important.
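The two ideas can be sketched as a small iterative computation. This is a simplified illustration rather than Google's implementation: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the toy link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """`links` maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))
```

Page c receives the most links and ends up with the highest rank (idea 1). Page a has only one incoming link, but that link comes from the heavily weighted c, so a still ranks above b and d (idea 2).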

The TextRank algorithm for keyword extraction is as follows:

(1) Split the given text T into complete sentences, i.e., T = [S1, S2, ..., Sm].

(2) For each sentence Si, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words with the designated parts of speech, such as nouns, verbs, and adjectives, i.e., Si = [ti,1, ti,2, ..., ti,n], where ti,j is a candidate keyword that was kept.

(3) Construct the candidate keyword graph G = (V, E), where V is the node set made up of the candidate keywords generated in (2). Edges are built from the co-occurrence relationship: an edge exists between two nodes only if the corresponding words co-occur within a window of length K, where K is the window size, i.e., at most K words co-occur.

(4) Using the TextRank formula, iteratively propagate the weight of each node until convergence.

(5) Sort the node weights in descending order to obtain the T most important words as candidate keywords.

(6) Mark the T most important words obtained in (5) in the original text; if they form adjacent phrases, combine them into multi-word keywords.
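Steps (3) to (5) can be sketched as follows, assuming segmentation, part-of-speech filtering, and stop-word removal from steps (1) and (2) have already produced a flat list of candidate words. The update rule is the usual weighted TextRank formula, WS(Vi) = (1 - d) + d * sum over neighbors Vj of (w_ji / sum_k w_jk) * WS(Vj). The function names, the damping factor d = 0.85, the tolerance, and the toy word list are assumptions for illustration, not jieba's internals.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=5):
    """Step (3): link two different words that co-occur within `window` consecutive words."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def textrank(graph, damping=0.85, max_iter=100, tol=1e-6):
    """Step (4): propagate node weights until the largest change falls below `tol`."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            incoming = sum(
                weight / out_weight[nbr] * scores[nbr]
                for nbr, weight in graph[node].items()
            )
            new_scores[node] = (1.0 - damping) + damping * incoming
        converged = max(abs(new_scores[n] - scores[n]) for n in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores


def top_keywords(words, top_t=3, window=5):
    """Step (5): sort nodes by weight in descending order and keep the top T."""
    scores = textrank(build_cooccurrence_graph(words, window))
    return sorted(scores, key=scores.get, reverse=True)[:top_t]


# Hypothetical candidate words left over after step (2).
words = ["graph", "keyword", "rank", "keyword", "window", "keyword", "text"]
print(top_keywords(words, top_t=1, window=2))  # ['keyword']
```

With a window of 2 only adjacent words are linked, so "keyword" becomes the hub of the graph and collects the highest weight. Step (6), merging adjacent keywords into phrases, is left out of this sketch.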

Both of these main keyword extraction algorithms are implemented in the analyse module of the jieba package:

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

sentence: the text to extract keywords from

topK: how many keywords with the highest TF-IDF weights to return; the default is 20

withWeight: whether to return each keyword's weight as well; the default is False

allowPOS: include only words with the specified parts of speech; the default is empty, i.e., no filtering
# simple version
from jieba import analyse

tfidf = analyse.extract_tags
# Load stop words
analyse.set_stop_words('stopword.txt')
# Sample text; adjacent string literals inside the parentheses are concatenated
text = ('June 19, "the 2012" China Love City "charity press conference" held in Beijing. '
        ' China Social Assistance Foundation, chairman Xu Jialu spoke at the meeting. Foundation senior adviser Zhu Fazhong, the National Aging '
        ' Zhu Yong, deputy director of the Ministry of Civil Affairs Division of Social Assistance, Assistant Inspector Zhou Ping, vice chairman of the China Social Assistance Foundation, Zhi-Yuan Geng, '
        " Chongqing Municipal Bureau of Civil Affairs Inspector Tan Mingzheng. Jinjiang City People's Congress Chairman Chen Jianqian, as well as more than 10 provinces, municipalities and autonomous regions of Civil Affairs "
        ' and more than forty leading media attended the conference.  China Social Assistance Foundation is a new introduction this year "love China Town when the Secretary-General '
        ' City" public service activities will be "the city of love publicity, the solitary care assistance projects and the Second Conference of Chinese city of love" as the main content, Chongqing '
        ', Hohhot, Changsha, Taiyuan, Bengbu City, Nanchang, Shantou City, Cangzhou City, Jinjiang City, and will actively participate in Zunhua '
        ' the charity.  China Yahoo and channel director, deputy editor Zhang Yinsheng Phoenix City Zhao Yao were introduced to the advantages of each media activities '
        ' promotional program.  At the meeting, the China Social Assistance Foundation and "The Second China Love Cities Conference" hosted Fang Jinjiang City contract, Xu Jialu Li '
        ' long accepted thing to participate in Jinjiang City, "a million lonely old Love Action" to donate to the state\'s key poverty alleviation region funds and materials worth $ 4 million. Jinjiang Municipal People\'s Congress '
        ' Standing Committee Director Chen Jianqian introduced the preparations for the General Assembly. ')
# Extract the 10 highest-weighted keywords, without weights and without POS filtering
keywords = tfidf(text, topK=10, withWeight=False, allowPOS=())
print('The results are:')
print([keyword for keyword in keywords])

The results are:
['Jinjiang City', 'relief', 'love', 'foundation', 'charity', 'City', 'China', 'Xu Jialu', 'Chen Jianqian', 'lonely old']
The following is a function version:

from jieba import analyse


def textrank_extract(text, keyword_num=10):
    textrank = analyse.textrank
    analyse.set_stop_words('stopword.txt')
    keywords = textrank(text, topK=keyword_num)
    # Print the extracted keywords
    for keyword in keywords:
        print(keyword + "/ ", end='')
    print()


def tfidf_extract(text, keyword_num=10):
    tfidf = analyse.extract_tags
    analyse.set_stop_words('stopword.txt')
    keywords = tfidf(text, topK=keyword_num)
    # Print the extracted keywords
    for keyword in keywords:
        print(keyword + "/ ", end='')
    print()

if __name__ == '__main__':
    # Sample text; adjacent string literals inside the parentheses are concatenated
    text = ('June 19, "the 2012" China Love City "charity press conference" held in Beijing. '
            ' China Social Assistance Foundation, chairman Xu Jialu spoke at the meeting. Foundation senior adviser Zhu Fazhong, the National Aging '
            ' Zhu Yong, deputy director of the Ministry of Civil Affairs Division of Social Assistance, Assistant Inspector Zhou Ping, vice chairman of the China Social Assistance Foundation, Zhi-Yuan Geng, '
            " Chongqing Municipal Bureau of Civil Affairs Inspector Tan Mingzheng. Jinjiang City People's Congress Chairman Chen Jianqian, as well as more than 10 provinces, municipalities and autonomous regions of Civil Affairs "
            ' and more than forty leading media attended the conference.  China Social Assistance Foundation is a new introduction this year "love China Town when the Secretary-General '
            ' City" public service activities will be "the city of love publicity, the solitary care assistance projects and the Second Conference of Chinese city of love" as the main content, Chongqing '
            ', Hohhot, Changsha, Taiyuan, Bengbu City, Nanchang, Shantou City, Cangzhou City, Jinjiang City, and will actively participate in Zunhua '
            ' the charity.  China Yahoo and channel director, deputy editor Zhang Yinsheng Phoenix City Zhao Yao were introduced to the advantages of each media activities '
            ' promotional program.  At the meeting, the China Social Assistance Foundation and "The Second China Love Cities Conference" hosted Fang Jinjiang City contract, Xu Jialu Li '
            ' long accepted thing to participate in Jinjiang City, "a million lonely old Love Action" to donate to the state\'s key poverty alleviation region funds and materials worth $ 4 million. Jinjiang Municipal People\'s Congress '
            ' Standing Committee Director Chen Jianqian introduced the preparations for the General Assembly. ')

    print('TF-IDF model results:')
    tfidf_extract(text)
    print('TextRank model results:')
    textrank_extract(text)

TF-IDF model results:
Jinjiang City/ rescue/ love/ Foundation/ charity/ City/ China/ Xu Jialu/ Chen Jianqian/ lonely old/
TextRank model results:
City/ love/ rescue/ China/ Society/ Jinjiang/ Foundation/ Assembly/ presentation/ charity/
----------------
Disclaimer: This is an original article by CSDN blogger "blue sky Ge", released under the CC 4.0 BY-SA license. When reproducing it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/qq_38923076/article/details/81630442

Origin blog.csdn.net/myword1314/article/details/104395945