Chinese Word Segmentation and Word Clouds with Python

Author: Mei Haoming

1. Introduction

In the era of big data, we often see all kinds of infographics in the media and on websites. A word cloud is an important way to visualize large amounts of text: it highlights the key words and phrases in a long passage. To visualize Chinese text, we first need to segment the text into words and then build a word cloud from the resulting keywords. This article shows how to use Python to segment Chinese text and make a word cloud; you are welcome to follow the tutorial step by step.

Project Address: https://momodel.cn/workspace/5e77afb7a8a7dc6753f582b9?type=app

2. Chinese word segmentation

2.1 Getting started with word segmentation

Word segmentation means cutting a text into individual words according to its meaning, which makes the next step of the analysis easier (word frequency statistics, sentiment analysis, and so on). English words already come with spaces as separators, so segmenting English is simple compared with Chinese. Here we introduce Chinese word segmentation through an example. Python provides the jieba library for this task; the code below shows how to segment a sentence with it.

import jieba

# Text data
text = "MomodelAI是一个支持在线数据分析和AI建模的平台。"
result = jieba.cut(text)

# Join the segmented words with spaces
print("分词结果: " + " ".join(result))

'''
分词结果: MomodelAI 是 一个 支持 在线 数据分析 和 AI 建模 的 平台 。
'''
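
As one illustration of the follow-up analysis mentioned above, word frequency statistics can be computed directly from the segmented tokens. The sketch below uses only the standard library. The token list is copied from the output shown above rather than produced by calling jieba, a few tokens are repeated by assumption so that the counts differ, and the helper name top_words is hypothetical.

```python
from collections import Counter

# Tokens copied from the segmentation output above.
tokens = "MomodelAI 是 一个 支持 在线 数据分析 和 AI 建模 的 平台 。".split()
# Extra tokens added only for illustration, so that some words occur twice.
tokens += "平台 支持 建模".split()


def top_words(words, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)


print(top_words(tokens))
```

In practice the list would come from jieba.lcut(text), which returns the tokens as a list instead of a generator.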

2.2 Special terms

Some special terms should not be split during segmentation. To keep such a term in one piece, we can mark it for jieba before segmenting.

text = "Mo平台是一种支持模型开发与部署的人工智能建模平台。"

# Mark the special term so that jieba keeps it as a single word
jieba.suggest_freq('Mo平台', True)
result = jieba.cut(text)

print("分词结果: " + " ".join(result))

'''
分词结果: Mo平台 是 一种 支持 模型 开发 与 部署 的 人工智能 建模 平台 。
'''
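
When there are many special terms, jieba can also read them from a user dictionary file through jieba.load_userdict, with one term per line. The sketch below writes such a file. The helper names, the file name and the term list are assumptions made for illustration, and jieba is imported only inside the function that needs it.

```python
from pathlib import Path


def write_user_dict(terms, path):
    """Write one term per line, the simplest form of a jieba user dictionary."""
    cleaned = [t.strip() for t in terms if t.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def load_terms_into_jieba(path):
    """Register every term in the file with jieba (requires jieba to be installed)."""
    import jieba  # imported lazily so write_user_dict works without jieba

    jieba.load_userdict(str(path))
```

After write_user_dict(["Mo平台"], "user_dict.txt"), calling load_terms_into_jieba("user_dict.txt") before jieba.cut should have an effect similar to the suggest_freq call above.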

2.3 Cleaning the text

After segmentation, some special symbols end up as separate words, and these would distort our later analysis. We can use a stop-word list, stop_words.txt, to weed out the symbols. Words that are only one character long, such as those meaning "the" or "of", clearly do not help us analyze the text either, so we remove every word of length one.

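# Load the stop-word list from a file (read as UTF-8 text so that
# its entries are str, like the tokens jieba returns)
stpwrdpath = "stop_words.txt"
with open(stpwrdpath, encoding="utf-8") as stpwrd_dic:
    stpwrd_content = stpwrd_dic.read()

# Convert the stop-word list into a set of lines
stpwrdlst = set(stpwrd_content.splitlines())
segs = jieba.cut(text)
mytext_list = []

# Clean the text: drop stop words, whitespace and single-character words
for seg in segs:
    seg = seg.strip()
    if seg and seg not in stpwrdlst and len(seg) != 1:
        mytext_list.append(seg)

# The exact result depends on the contents of stop_words.txt
cloud_text = " ".join(mytext_list)
print("清洗后的分词结果: " + cloud_text)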

'''
清洗后的分词结果: Mo平台 一种 支持 模型 开发 部署 人工智能 建模 平台
'''
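
The cleaning step can also be wrapped in two small functions. One detail worth getting right is reading the stop-word file as UTF-8 text: in Python 3 a bytes entry never equals a str token, so a list read in binary mode would filter nothing. This is a minimal sketch; the function names are hypothetical and the stop word in the usage line is made up for illustration.

```python
def load_stop_words(path):
    """Read a stop-word file (one entry per line) as a set of str."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def clean_tokens(tokens, stop_words, min_len=2):
    """Keep tokens that are not stop words and have at least min_len characters."""
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) >= min_len and token not in stop_words:
            kept.append(token)
    return kept


# Tokens copied from the segmentation output in section 2.2.
tokens = "Mo平台 是 一种 支持 模型 开发 与 部署 的 人工智能 建模 平台 。".split()
print(" ".join(clean_tokens(tokens, stop_words={"一种"})))
```

Using a set for the stop words keeps each membership test fast even when the list is long.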

3. Making a word cloud

3.1 A simple word cloud

Before making a word cloud from Chinese text, we first segment the text using the method described above.

# Chinese word segmentation
from wordcloud import WordCloud

with open('./Mo.txt', encoding='utf-8', mode='r') as f:
    myText = f.read()

myText = " ".join(jieba.cut(myText))
print(myText)

Once we have the segmented text, we use the WordCloud library to make the word cloud. (Note: WordCloud's default font does not include Chinese characters, so we need to download a font that does, here simsun.ttf, and specify it as the output font.)

# Make the word cloud
wordcloud = WordCloud(background_color="white", font_path="simsun.ttf", height=300, width=400).generate(myText)

# Display the image
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# Export the word cloud image to the current folder
wordcloud.to_file("wordCloudMo.png") 
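
If word counts are already available, for example from a frequency analysis, WordCloud can also build the image from them with generate_from_frequencies instead of generate. The following is a sketch under stated assumptions: the function name cloud_from_counts, the counts and the output file name are hypothetical, and the font check is an added convenience rather than part of the library.

```python
import os


def cloud_from_counts(counts, font_path, out_path):
    """Render a word cloud from a {word: count} mapping and save it as an image."""
    # Fail early with a clear message if the Chinese font is missing.
    if not os.path.isfile(font_path):
        raise FileNotFoundError(f"font file not found: {font_path}")
    if not counts:
        raise ValueError("counts must contain at least one word")

    # Imported here so the checks above run even if wordcloud is not installed.
    from wordcloud import WordCloud

    wc = WordCloud(background_color="white", font_path=font_path,
                   width=400, height=300)
    wc.generate_from_frequencies(counts)
    wc.to_file(out_path)
    return wc


# Hypothetical counts; in practice they would come from the segmented text.
example_counts = {"平台": 5, "建模": 3, "人工智能": 2}
# cloud_from_counts(example_counts, "simsun.ttf", "wordCloudCounts.png")
```

Passing counts directly gives full control over which words appear and how large they are, which is convenient after stop words have been removed.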

3.2 A word cloud in a specified shape

To make a word cloud in a specified shape, we need to read an external image that defines the shape. Here we use the imageio library.

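# Import the word cloud library wordcloud and the Chinese word segmentation library jieba
import jieba
import wordcloud

# Import imageio and use its imread function to read a local image,
# which serves as the shape (mask) of the word cloud
import imageio
mk = imageio.imread("chinamap.png")

# Build and configure the word cloud object w; the scale parameter improves the resolution
w = wordcloud.WordCloud(width=1000, height=700, background_color='white', font_path='simsun.ttf', mask=mk, scale=15)

# Segment the text from an external file and join the words into a string
with open('新时代中国特色社会主义.txt', encoding='utf-8') as f:
    txt = f.read()
txtlist = jieba.lcut(txt)
string = " ".join(txtlist)

# Pass string to w's generate() method to give the word cloud its text
# (generate() fills w in place and returns w itself)
w.generate(string)

import matplotlib.pyplot as plt
plt.imshow(w)
plt.axis("off")
plt.show()

# Export the word cloud image to the current folder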
w.to_file('chinamapWordCloud.png')
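
The mask does not have to come from an image file. According to the WordCloud documentation, pure white entries of the mask are treated as masked out and words are drawn everywhere else, so a shape can also be generated in code. A minimal sketch, assuming a circular shape; the helper names are hypothetical and NumPy is needed only for the final conversion.

```python
def circle_mask_rows(size, radius):
    """Return a size x size grid: 0 inside the circle (drawable), 255 outside (masked)."""
    center = (size - 1) / 2
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
            row.append(0 if inside else 255)
        rows.append(row)
    return rows


def to_mask_array(rows):
    """Convert the grid to a NumPy array suitable for WordCloud's mask parameter."""
    import numpy as np  # imported lazily; only this step needs NumPy

    return np.array(rows, dtype=np.uint8)


rows = circle_mask_rows(size=300, radius=130)
```

The array returned by to_mask_array(rows) could then be passed as mask= in place of mk above.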

3.3 Results


About Mo

Mo (https://momodel.cn) is an online artificial intelligence modeling platform that supports Python and helps you quickly develop, train, and deploy models.

Mo is currently running introductory machine learning courses and paper-sharing events. You are welcome to follow our official account for the latest information!
