Chinese word segmentation with PyNLPIR and generating a word cloud image

Introduction to NLPIR

Official website: NLPIR-ICTCLAS Chinese word segmentation system

NLPIR Chinese word segmentation system

Key features include Chinese word segmentation, English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, keyword extraction, support for user-specific dictionaries, and microblog analysis. NLPIR supports multiple encodings, multiple operating systems, and a variety of development languages and platforms.

Features

Mixed Chinese-English segmentation

Automatically segments mixed Chinese and English text and tags parts of speech. This covers Chinese word segmentation, English word segmentation, part-of-speech tagging, unknown word recognition, and user dictionaries.

Keyword extraction

Keywords are computed automatically with a cross-entropy algorithm and can include both new words and known words. The example NLPIR gives is the keyword extraction result for part of a Third Plenary Session report.
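
As a rough illustration of how extracted keywords could feed the word cloud built later in this post, the sketch below turns weighted keywords into a frequency-style dictionary. It assumes `api` is the `pynlpir` module (or any stand-in object) that has already been opened and exposes `get_key_words(text, max_words=..., weighted=True)` returning `(word, weight)` pairs. The helper name `keyword_weights` is made up for this example.

```python
def keyword_weights(api, text, max_words=20):
    """Return {keyword: weight} for `text`.

    `api` is assumed to be an already opened pynlpir module, or any object
    with a compatible get_key_words() method (hypothetical helper, not part
    of pynlpir itself).
    """
    # weighted=True asks for (word, weight) tuples instead of bare words
    pairs = api.get_key_words(text, max_words=max_words, weighted=True)
    return {word: weight for word, weight in pairs}
```

The resulting dictionary has the same shape as the `Counter` used in the complete code, so it could be passed to `generate_from_frequencies` in the same way.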

New word recognition and adaptive segmentation

Working from a longer text, NLPIR uses cross-entropy to discover new words and expressions, adapts its language probability distribution model to the corpus being processed, and segments adaptively.

User-specific dictionary function

Dictionary entries can be imported one at a time or in bulk. For example, you can define the entry "channel report sensitive point", where "channel report" is the user word and "sensitive point" is a user-defined part-of-speech tag.
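
A minimal sketch of preparing such a dictionary file for bulk import follows. It assumes the layout suggested by the example above: one entry per line, the word followed by whitespace and its tag, saved as UTF-8. The helper name `write_user_dict` and the sample entries are hypothetical.

```python
from pathlib import Path


def write_user_dict(entries, path):
    """Write (word, tag) pairs to a UTF-8 text file, one 'word tag' per line.

    Assumption: NLPIR reads one whitespace-separated entry per line.
    """
    lines = [f"{word} {tag}" for word, tag in entries]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For instance, `write_user_dict([("词云", "n"), ("分词", "v")], "user_dict.txt")` would produce a two-line file that can then be passed to the import call shown further down.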

PYNLPIR

pynlpir is the Python API for NLPIR and can be installed directly with pip (pip install pynlpir).

Opening and closing the API

pynlpir.open()
pynlpir.close()
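
Because every `open()` should be paired with a `close()`, one option is to wrap the two calls in a context manager so the API is closed even if segmentation raises an error. This is only a sketch: `nlpir_session` is a name invented here, and `api` is assumed to be the `pynlpir` module or any object with `open()` and `close()` methods.

```python
from contextlib import contextmanager


@contextmanager
def nlpir_session(api):
    """Open the NLPIR API on entry and close it on exit, even after an error."""
    api.open()
    try:
        yield api
    finally:
        api.close()  # runs on normal exit and when an exception propagates
```

With pynlpir installed, usage would look like `with nlpir_session(pynlpir) as nlp: tokens = nlp.segment(contents)`.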

Add User Dictionary

pynlpir.nlpir.ImportUserDict(b'xxx.txt')

Most important of all: word segmentation

text_segment = pynlpir.segment(contents)
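
With its default settings, `pynlpir.segment` returns a list of `(word, tag)` pairs, which is what the complete code below filters and counts. The sketch here isolates that filtering step as a small function so it can be tried without NLPIR installed. The function name `count_words` and the sample tokens are hypothetical, and the tags in the sample are only illustrative.

```python
from collections import Counter


def count_words(tokens, disliked_tags=(), min_len=2):
    """Count words from (word, tag) pairs, skipping short words and unwanted tags."""
    counts = Counter()
    for word, tag in tokens:
        word = word.strip()
        if len(word) >= min_len and tag not in disliked_tags:
            counts[word] += 1
    return counts


# Hypothetical sample shaped like the default output of pynlpir.segment()
sample = [("词云", "noun"), ("的", "particle"), ("生成", "verb"),
          ("2020", "numeral"), ("词云", "noun"), ("，", "punctuation mark")]
print(count_words(sample, disliked_tags={"numeral", "punctuation mark"}))
# Counter({'词云': 2, '生成': 1})
```

Single characters are dropped by the length check, and "2020" is dropped because its tag is in the excluded set.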

The complete code

from collections import Counter
import matplotlib.pyplot as plt
import wordcloud
import pynlpir

with open('./paper.txt', encoding='utf-8') as text:
    contents = text.read()

# Word segmentation
pynlpir.open()    # start the API
pynlpir.nlpir.ImportUserDict(b'user_dict.txt')    # load the user dictionary; the path must be a bytes string
text_segment = pynlpir.segment(contents)    # segment the text into (word, tag) pairs
words = []
disliked_tag = ['numeral', 'time word', 'punctuation mark',
                'preposition', 'conjunction', 'noun of locality']  # tags to exclude
for w in text_segment:
    w0 = w[0].strip()
    if len(w0) > 1 and w[1] not in disliked_tag:    # drop single characters and excluded tags
        print(w)
        words.append(w0)
pynlpir.close()    # close the API
# Word frequency count
word_cnt = Counter(words)
print(word_cnt)

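# Generate the word cloud
wc = wordcloud.WordCloud(
    scale=8,    # image resolution scale; only affects the saved file, not the on-screen display
    font_path='C:/Windows/Fonts/simhei.ttf',    # font file to use
    max_words=50,    # maximum number of words shown
    max_font_size=100,
    background_color='white'
)
wc.generate_from_frequencies(word_cnt)  # build the word cloud from a frequency dict
wc.to_file('./1.png')    # save the word cloud image

# Display the word cloud
plt.imshow(wc)
plt.axis('off')    # hide the axes
plt.show()    # show the image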
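
The `font_path` in the code above points to a Windows font, so the script will not find it on other systems. One possible workaround is to pick the first font file that exists from a list of candidates. This is a sketch only: the helper name `pick_font` is invented here, and the macOS and Linux paths are assumptions that may not match your machine.

```python
import os

# Hypothetical candidate list; adjust to the fonts installed on your machine.
CJK_FONT_CANDIDATES = [
    "C:/Windows/Fonts/simhei.ttf",         # Windows (the path used in this post)
    "/System/Library/Fonts/PingFang.ttc",  # macOS (assumed location)
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # Linux with Noto CJK (assumed)
]


def pick_font(candidates=CJK_FONT_CANDIDATES):
    """Return the first font path that exists on disk, or None if none is found."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

The result could then be passed as `font_path=pick_font()` when constructing `WordCloud`, after checking that it is not `None`.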

Generated image

Origin blog.csdn.net/Z_Pythagoras/article/details/105036744