NLPIR Introduction
Official website: NLPIR-ICTCLAS Chinese Word Segmentation System
NLPIR Chinese Word Segmentation System
Key features include Chinese word segmentation, English tokenization, part-of-speech tagging, named entity recognition, new word recognition, keyword extraction, user-defined specialist dictionaries, and microblog analysis. NLPIR supports multiple character encodings, operating systems, and development languages and platforms.
Features
Mixed Chinese-English segmentation
Automatically segments and POS-tags mixed Chinese and English text, covering Chinese word segmentation, English tokenization, part-of-speech tagging, unknown word recognition, user dictionaries, and related functions.
Keyword extraction
Uses a cross-entropy algorithm to extract keywords automatically, covering both new words and known words; the original article illustrates this with keywords extracted from part of the report of the Third Plenary Session of the 18th Central Committee.
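NLPIR's actual keyword algorithm is not shown here, but the idea of a cross-entropy-style score can be sketched in plain Python: weight each term by how much its in-document frequency diverges from a background corpus. The function name, smoothing scheme, and sample data below are all illustrative assumptions, not NLPIR's implementation.

```python
import math
from collections import Counter

def keyword_scores(doc_tokens, background_tokens):
    """Score terms by how much more frequent they are in the document
    than in a background corpus (a KL/cross-entropy-style weight)."""
    doc = Counter(doc_tokens)
    bg = Counter(background_tokens)
    n_doc = sum(doc.values())
    n_bg = sum(bg.values())
    scores = {}
    for term, count in doc.items():
        p_doc = count / n_doc
        # add-one smoothing so terms unseen in the background corpus
        # do not cause a division by zero
        p_bg = (bg[term] + 1) / (n_bg + len(bg))
        scores[term] = p_doc * math.log(p_doc / p_bg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

doc = ['word', 'cloud', 'cloud', 'segmentation', 'the', 'the']
bg = ['the'] * 50 + ['word', 'of', 'and'] * 10
print(keyword_scores(doc, bg)[:2])
```

Common function words like "the" score near or below zero because they are no more frequent in the document than in the background, while document-specific terms rise to the top.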
New word recognition and adaptive segmentation
From longer texts, discovers new linguistic features based on information cross entropy, adapts the probability distribution model to the test corpus, and segments adaptively.
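The source does not spell out NLPIR's new-word method, but a common entropy-based heuristic can be sketched: a candidate string whose neighboring characters are highly varied (high branching entropy) is likely a self-contained word. The function and toy text below are illustrative assumptions, not NLPIR code.

```python
import math
from collections import Counter

def right_branching_entropy(text, candidate):
    """Entropy of the characters that follow `candidate` in `text`.
    High entropy means many different characters can follow the
    candidate, suggesting it is a self-contained unit (a word)."""
    followers = Counter()
    start = text.find(candidate)
    while start != -1:
        nxt = start + len(candidate)
        if nxt < len(text):
            followers[text[nxt]] += 1
        start = text.find(candidate, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in followers.values())

# 'ab' is followed by five different characters, so its entropy is log2(5)
print(right_branching_entropy('abxaby abzabq abw', 'ab'))
```

A full detector would combine left and right branching entropy with frequency and internal cohesion, but the core signal is the one computed here.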
Specialist user dictionaries
User dictionaries can be imported one at a time or in bulk. For example, an entry like "channel-report sensitive-point" defines "channel-report" as a user word and "sensitive-point" as its user-defined POS tag.
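A user dictionary is a plain UTF-8 text file. Based on common NLPIR examples, each line holds one word, optionally followed by a space and its POS tag; the exact format should be checked against the NLPIR documentation. The entries below are hypothetical:

```
自然语言处理 n
渠道报道 敏感点
词云
```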
PYNLPIR
PyNLPIR is the Python wrapper for the NLPIR API and can be installed directly with pip (pip install pynlpir); per the PyNLPIR documentation, running "pynlpir update" afterward downloads the latest license file.
Starting and closing the API
pynlpir.open()
pynlpir.close()
Adding a user dictionary (note the path must be a byte string)
pynlpir.nlpir.ImportUserDict(b'xxx.txt')
The most important step: segmentation
text_segment = pynlpir.segment(contents)
The complete code
from collections import Counter

import matplotlib.pyplot as plt
import wordcloud
import pynlpir

with open('./paper.txt', encoding='utf-8') as text:
    contents = text.read()

# Word segmentation
pynlpir.open()  # start the API
pynlpir.nlpir.ImportUserDict(b'user_dict.txt')  # load the user dictionary; the path must be a byte string
text_segment = pynlpir.segment(contents)  # segment the text

words = []
disliked_tag = ['numeral', 'time word', 'punctuation mark',
                'preposition', 'conjunction', 'noun of locality']  # unwanted POS tags
for w in text_segment:
    w0 = w[0].strip()
    if len(w0) > 1 and w[1] not in disliked_tag:  # drop single characters and unwanted tags
        print(w)
        words.append(w0)
pynlpir.close()  # close the API

# Word frequency count
word_cnt = Counter(words)
print(word_cnt)

# Generate the word cloud
wc = wordcloud.WordCloud(
    scale=8,  # image resolution; only affects the saved file, not the on-screen display
    font_path='C:/Windows/Fonts/simhei.ttf',  # a font that supports Chinese characters
    max_words=50,  # maximum number of words shown
    max_font_size=100,
    background_color='white'
)
wc.generate_from_frequencies(word_cnt)  # build the cloud from the frequency dict
wc.to_file('./1.png')  # save the word cloud image

# Display the word cloud
plt.imshow(wc)
plt.axis('off')  # hide the axes
plt.show()  # show the image
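Before rendering the cloud, it can help to inspect the most frequent terms with the standard library's Counter.most_common, which is also what backs the frequency dict above. The tokens below are made up for illustration; in the real script they would be the segmented words list:

```python
from collections import Counter

# hypothetical segmented tokens standing in for the real `words` list
words = ['词云', '分词', '词云', 'NLPIR', '分词', '词云']
word_cnt = Counter(words)
print(word_cnt.most_common(2))  # → [('词云', 3), ('分词', 2)]
```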