NLP - jieba: keyword extraction (TF-IDF / TextRank)

Keyword extraction: TF-IDF

Term Frequency (TF): how many times a word appears in the document.
TF = (number of occurrences of the word in the document) / (total number of words in the document)
If a word is relatively rare in general but appears many times in this article, it very likely reflects what the article is about, which is exactly the kind of keyword we want.
Inverse Document Frequency (IDF):
IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))
TF-IDF is proportional to the number of times a word appears in the document and inversely proportional to how often the word appears across the whole corpus.
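
To make the definitions above concrete, here is a minimal sketch of the computation on a toy corpus (the word lists and helper functions are illustrative only; they are not part of jieba, which ships its own precomputed IDF file):

import math

# Toy corpus: each "document" is already segmented into a list of words.
docs = [
    ["勇士", "球员", "投篮", "勇士", "总冠军"],
    ["球员", "训练", "投篮"],
    ["新闻", "球员", "转会"],
    ["天气", "新闻"],
]

def tf(word, doc):
    # term frequency: occurrences of the word / total words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(total docs / (1 + docs containing the word))
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + containing))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "勇士" is frequent in doc 0 but rare in the corpus, so it scores high;
# "球员" appears in almost every document, so it scores low.
print(tf_idf("勇士", docs[0], docs))
print(tf_idf("球员", docs[0], docs))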

Keyword extraction based on the TF-IDF algorithm

import jieba.analyse

  • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    • sentence: the text to extract keywords from
    • topK: return the topK keywords with the largest TF-IDF weights; the default value is 20
    • withWeight: whether to return the weight along with each keyword; the default is False
    • allowPOS: only include words with the specified parts of speech; the default is empty, i.e. no filtering
import jieba
import jieba.analyse as analyse

lines=open('NBA.txt',encoding='utf-8').read()
print ("  ".join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))
韦少  杜兰特  全明星  全明星赛  MVP  威少  正赛  科尔  投篮  勇士 
球员  斯布鲁克  更衣柜  张卫平  三连庄  NBA  西部  指导  雷霆  明星队

Supplementary notes on TF-IDF keyword extraction

  • The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
    • Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path of the custom corpus
      • For a custom corpus example, see here
      • For a usage example, see here
  • The stop-word (Stop Words) corpus used for keyword extraction can be switched to a custom corpus path
    • Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path of the custom corpus
      • For a custom corpus example, see here
      • For a usage example, see here
  • Keywords can also be returned together with their weights
    • For a usage example, see here (and the sketch after this list)
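
A minimal sketch of how these options fit together (the paths custom_idf.txt and stop_words.txt are placeholders for your own corpus files, not files shipped with jieba):

import jieba.analyse as analyse

# switch to a custom IDF corpus and a custom stop-word list (placeholder paths)
analyse.set_idf_path('custom_idf.txt')
analyse.set_stop_words('stop_words.txt')

lines = open('NBA.txt', encoding='utf-8').read()
# withWeight=True returns (keyword, weight) pairs instead of bare keywords
for word, weight in analyse.extract_tags(lines, topK=10, withWeight=True):
    print('%s %.4f' % (word, weight))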

Keyword extraction based on the TextRank algorithm

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')): used directly; the interface is the same as extract_tags. Note that it filters parts of speech by default.
jieba.analyse.TextRank(): create a new custom TextRank instance.
The basic idea:

  • Segment the text from which keywords are to be extracted
  • Build a graph from the co-occurrence relationships between words within a fixed-size window (default 5, adjustable via the span attribute)
  • Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
import jieba.analyse as analyse
lines = open('NBA.txt', encoding='utf-8').read()
print("  ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))))
print("---------------------我是分割线----------------")
print("  ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n'))))

Part-of-speech tagging

  • jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom tokenizer; the tokenizer parameter specifies the internal jieba.Tokenizer to use. jieba.posseg.dt is the default part-of-speech tagging tokenizer.
  • It tags the part of speech of every word in a sentence after segmentation, using notation compatible with ictclas.
  • For the specific tags, refer to the ICTCLAS (计算所) Chinese part-of-speech tag set.
import jieba.posseg as pseg
words = pseg.cut("我爱自然语言处理")
for word, flag in words:
    print('%s %s' % (word, flag))
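
A minimal sketch of a custom part-of-speech tokenizer built on a separate jieba.Tokenizer, so that a user dictionary can be loaded without touching the default jieba.dt (the path userdict.txt is a placeholder for your own dictionary file):

import jieba
import jieba.posseg as pseg

tk = jieba.Tokenizer()            # an independent tokenizer instance
tk.load_userdict('userdict.txt')  # placeholder path to a user dictionary

pos_tk = pseg.POSTokenizer(tokenizer=tk)
for word, flag in pos_tk.cut("我爱自然语言处理"):
    print('%s %s' % (word, flag))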

Parallel segmentation

Principle: split the target text by line, assign each line to one of several Python processes to be segmented in parallel, and then merge the results, which gives a considerable speed-up. It is based on Python's built-in multiprocessing module and currently does not support Windows.

Usage:
jieba.enable_parallel(4) # enable parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel() # disable parallel segmentation mode
Result: on a 4-core 3.4 GHz Linux machine, precise-mode segmentation of the complete works of Jin Yong runs at about 1 MB/s, 3.3 times faster than the single-process version.

Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.

import sys
import time
import jieba

jieba.enable_parallel()
content = open('西游记.txt', encoding='utf-8').read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('并行分词速度为 %s bytes/second' % (len(content)/tm_cost))

jieba.disable_parallel()
content = open('西游记.txt', encoding='utf-8').read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('非并行分词速度为 %s bytes/second' % (len(content)/tm_cost))

Tokenize: return the start and end positions of each word in the original text

Note: the input parameter only accepts unicode

print "这是默认模式的tokenize"
result = jieba.tokenize(u'自然语言处理非常有用')
for tk in result:
    print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

print "\n-----------我是神奇的分割线------------\n"

print "这是搜索模式的tokenize"
result = jieba.tokenize(u'自然语言处理非常有用', mode='search')
for tk in result:
    print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))


Origin blog.csdn.net/lgy54321/article/details/90670902