Keyword extraction: TF-IDF
Term frequency (TF): the number of times a word appears in the document. The words that appear most often, however, are usually common function words.
When a word is relatively rare overall but appears repeatedly in an article,
it is likely to reflect what the article is about: exactly the keyword we are looking for.
That overall rarity is measured by the inverse document frequency (IDF).
A word's TF-IDF score is directly proportional to its number of occurrences in the document, and inversely proportional to its number of occurrences across the whole corpus.
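The proportionality above can be made concrete with a toy pure-Python sketch (the tiny corpus and whitespace tokenization are illustrative assumptions; jieba itself ships a precomputed IDF table):

```python
import math

# Toy corpus: each document is a list of tokens (whitespace-split English
# stands in for segmented Chinese here).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum computer uses qubits".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of the term, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: terms in fewer documents score higher.
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + n_containing))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "qubits" occurs in only one document, so it outscores the ubiquitous "the".
doc = docs[2]
scores = {t: tfidf(t, doc, docs) for t in set(doc)}
top = sorted(scores, key=scores.get, reverse=True)
```

Ranking all terms of a document by this score and keeping the top K is, in essence, what a TF-IDF keyword extractor does.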
Keyword extraction based on the TF-IDF algorithm
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the topK keywords with the largest TF-IDF weights; the default value is 20
- withWeight: whether to return the weight value along with each keyword; the default is False
- allowPOS: include only words with the specified parts of speech; the default is empty, i.e. no filtering
import jieba
import jieba.analyse as analyse
lines = open('NBA.txt', encoding='utf-8').read()
print(" ".join(analyse.extract_tags(lines, topK=20, withWeight=False, allowPOS=())))
韦少 杜兰特 全明星 全明星赛 MVP 威少 正赛 科尔 投篮 勇士
球员 斯布鲁克 更衣柜 张卫平 三连庄 NBA 西部 指导 雷霆 明星队
Supplementary notes on TF-IDF keyword extraction
- The inverse document frequency (IDF) text corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path to the custom corpus
- For a custom corpus example, see here
- For a usage example, see here
- The stop-word (Stop Words) text corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path to the custom corpus
- For a usage example, see here
- For an example of returning keywords together with their weight values, see here
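The effect of a stop-word list can be illustrated without jieba; in this pure-Python sketch (the word list and sentence are illustrative assumptions), filtering stop words before ranking is roughly what switching in a stop-word corpus achieves:

```python
from collections import Counter

stop_words = {"the", "of", "and", "a"}  # a tiny illustrative stop list

tokens = "the rise of the transformer and the fall of the rnn".split()

# Without filtering, function words dominate the frequency ranking.
raw_top = Counter(tokens).most_common(1)[0][0]  # "the"

# Dropping stop words first leaves only content words as candidates.
filtered = [t for t in tokens if t not in stop_words]
filtered_top = Counter(filtered).most_common(1)[0][0]
```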
Keyword extraction based on the TextRank algorithm
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same as extract_tags. Note that it filters parts of speech by default.
jieba.analyse.TextRank() creates a new custom TextRank instance.
The basic idea:
- Segment the text from which keywords are to be extracted into words
- Build a graph from the co-occurrence relations between words within a fixed-size window (default 5, adjustable via the span attribute)
- Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
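These three steps can be sketched in plain Python (the toy token list, a window of 2, and a bare power-iteration PageRank are simplifications for illustration, not jieba's implementation):

```python
from collections import defaultdict

tokens = "nba player scores player wins nba title player".split()
window = 2  # co-occurrence window; jieba's default is 5 (the span attribute)

# Steps 1-2: build an undirected weighted co-occurrence graph.
weight = defaultdict(float)
for i, a in enumerate(tokens):
    for b in tokens[i + 1:i + window]:
        if a != b:
            weight[(a, b)] += 1.0
            weight[(b, a)] += 1.0

nodes = sorted(set(tokens))
degree = defaultdict(float)  # total edge weight attached to each node
for (a, _), w in weight.items():
    degree[a] += w

# Step 3: PageRank by power iteration on the undirected weighted graph.
d = 0.85  # damping factor
rank = {n: 1.0 for n in nodes}
for _ in range(50):
    rank = {n: (1 - d) + d * sum(rank[m] * weight.get((m, n), 0.0) / degree[m]
                                 for m in nodes if m != n and degree[m] > 0)
            for n in nodes}

# Nodes sorted by rank: the most connected word surfaces as the top keyword.
keywords = sorted(rank, key=rank.get, reverse=True)
```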
import jieba.analyse as analyse
lines = open('NBA.txt', encoding='utf-8').read()
print(" ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))))
print("---------------------我是分割线----------------")
print(" ".join(analyse.textrank(lines, topK=20, withWeight=False, allowPOS=('ns', 'n'))))
Part-of-speech tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom tokenizer; the tokenizer parameter specifies the jieba.Tokenizer to use internally. jieba.posseg.dt is the default part-of-speech tagging tokenizer.
- Tags the part of speech of every word obtained after segmenting a sentence, using notation compatible with ictclas.
- For the specific tags, refer to the ICTCLAS Chinese part-of-speech tag set.
import jieba.posseg as pseg
words = pseg.cut("我爱自然语言处理")
for word, flag in words:
    print('%s %s' % (word, flag))
Parallel segmentation
Principle: split the target text by line breaks, assign the lines to multiple Python processes to be segmented in parallel, then merge the results, which yields a considerable speedup. It is based on Python's built-in multiprocessing module; Windows is currently not supported.
Usage:
jieba.enable_parallel(4)  # enable parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel()  # disable parallel segmentation mode
Experimental result: on a 4-core 3.4 GHz Linux machine, precise-mode segmentation of the complete works of Jin Yong reaches a speed of 1 MB/s, 3.3 times that of the single-process version.
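The split-by-lines principle itself does not depend on jieba. A minimal sketch with the standard multiprocessing module, using whitespace splitting as a stand-in tokenizer (function names here are illustrative; the worker must be a top-level function so it can be pickled):

```python
from multiprocessing import Pool

def segment_line(line):
    # Stand-in tokenizer: whitespace splitting. In the real case each
    # worker process would call jieba.cut on its share of the lines.
    return line.split()

def parallel_segment(text, processes=4):
    lines = text.splitlines()                    # 1. split the text by line
    with Pool(processes) as pool:
        per_line = pool.map(segment_line, lines)  # 2. segment lines in parallel
    return [w for seg in per_line for w in seg]   # 3. merge the results

if __name__ == "__main__":
    words = parallel_segment("jieba supports parallel mode\non posix systems")
```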
Note: Parallel word only supports default tokenizer jieba.dt and jieba.posseg.dt.
import sys
import time
import jieba
jieba.enable_parallel()
content = open('西游记.txt', encoding='utf-8').read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('并行分词速度为 %s bytes/second' % (len(content)/tm_cost))
jieba.disable_parallel()
content = open('西游记.txt', encoding='utf-8').read()
t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2-t1
print('非并行分词速度为 %s bytes/second' % (len(content)/tm_cost))
Tokenize: returns each word together with its start and end positions in the original text
Note that the input parameter only accepts unicode
print "这是默认模式的tokenize"
result = jieba.tokenize(u'自然语言处理非常有用')
for tk in result:
print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
print "\n-----------我是神奇的分割线------------\n"
print "这是搜索模式的tokenize"
result = jieba.tokenize(u'自然语言处理非常有用', mode='search')
for tk in result:
print("%s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
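The (word, start, end) contract can be checked without jieba. A minimal sketch that computes such offsets for an already-segmented word list (the helper name `offsets` and the word list are illustrative assumptions):

```python
def offsets(words):
    # Given an already-segmented word list, yield (word, start, end)
    # triples such that text[start:end] == word, the same shape as
    # the tuples produced by jieba.tokenize.
    pos = 0
    for w in words:
        yield (w, pos, pos + len(w))
        pos += len(w)

words = ["自然语言", "处理", "非常", "有用"]
text = "".join(words)
for w, start, end in offsets(words):
    assert text[start:end] == w  # each triple slices back to its word
```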