[python] Drawing word clouds with the wordcloud library

A word cloud is a visual representation of text data in which the importance of each term is expressed through font size or color. Word clouds are widely used on social media because they let readers quickly pick out the most prominent terms. However, word cloud output follows no uniform standard and carries little logical structure: terms whose frequencies differ greatly are easy to tell apart, but terms with similar colors and similar frequencies are not. Word clouds are therefore not well suited to scientific figures. This article draws word clouds with the Python library wordcloud, which can be installed as follows:

pip install wordcloud

0 wordcloud drawing instructions

The relevant functions of the wordcloud library for drawing word clouds are provided by its built-in class WordCloud.

The constructor of the WordCloud class is as follows:

WordCloud(font_path=None, width=400, height=200, margin=2,
          ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
          color_func=None, max_words=200, min_font_size=4,
          stopwords=None, random_state=None, background_color='black',
          max_font_size=None, font_step=1, mode="RGB",
          relative_scaling='auto', regexp=None, collocations=True,
          colormap=None, normalize_plurals=True, contour_width=0,
          contour_color='black', repeat=False,
          include_numbers=False, min_word_length=0, collocation_threshold=30)

The constructor parameters are described below:

Parameter          Type         Description
font_path          str          Path to the font file; required when drawing Chinese word clouds
width              int          Width of the output canvas
height             int          Height of the output canvas
margin             int          Margin around each word on the canvas
prefer_horizontal  float        Ratio of attempts to place a word horizontally rather than vertically
mask               numpy array  If None, words are drawn on the full rectangular canvas; if given, words are drawn only on the non-white area of the mask, and width and height are ignored
scale              float        Scaling factor applied to the canvas width and height
color_func         callable     Function that returns the color of each word
max_words          int          Maximum number of words
min_font_size      int          Minimum font size
stopwords          set of str   Words to exclude from the plot
random_state       int          Random seed; fixes the randomly chosen positions and colors
background_color   str          Background color
max_font_size      int          Maximum font size
font_step          int          Step size for the font
mode               str          Pillow image mode ("RGB" or "RGBA")
relative_scaling   float        Weight of relative word frequency in determining font size
regexp             str          Regular expression used to split the input text into tokens
collocations       bool         Whether to include collocations (bigrams) of two words
colormap           str          Matplotlib colormap from which a random color is drawn for each word; ignored if color_func is given
normalize_plurals  bool         Whether to merge English plurals with their singular form
contour_width      float        Width of the mask contour
contour_color      str          Color of the mask contour
repeat             bool         Whether to repeat words and phrases until max_words or min_font_size is reached
include_numbers    bool         Whether to include numbers as phrases
min_word_length    int          Minimum number of letters a word must have to be included

The main function interfaces provided by the WordCloud class are as follows:

  • generate_from_frequencies(frequencies): generate a word cloud from a word-frequency dictionary
  • fit_words(frequencies): equivalent to generate_from_frequencies
  • process_text(text): split the text into words, remove stop words, and return the word counts
  • generate_from_text(text): generate a word cloud from text
  • generate(text): equivalent to generate_from_text
  • to_image: return the result as a Pillow image
  • recolor: recolor the existing word cloud
  • to_array: return the result as a numpy array
  • to_file(filename): save the result to an image file
  • to_svg: return the result as an SVG string

1 Drawing example

1.1 Draw a word cloud for a single word

import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "hello"

# ogrid returns two arrays, of shape n*1 and 1*m respectively
x, y = np.ogrid[:300, :300]

# Define the drawing area: everything outside the circle is masked out (255)
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

# Draw the word cloud. repeat=True repeats the input text until max_words
# is reached; scale sets the enlargement factor
wc = WordCloud(background_color="white", repeat=True, max_words=32, mask=mask, scale=1.5)
wc.generate(text)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

# Save to a file
_ = wc.to_file("result.jpg")
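The circular mask used above can be inspected on its own. The sketch below rebuilds it with numpy only and checks how much of the canvas remains drawable; the variable names size, center, radius, and outside are introduced here for illustration.

```python
import numpy as np

size, center, radius = 300, 150, 130

# ogrid returns a (size, 1) column of row indices and a (1, size) row of
# column indices; broadcasting combines them into a (size, size) grid.
x, y = np.ogrid[:size, :size]

# True outside the circle, False inside it.
outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2

# WordCloud treats 255 (white) as masked out and draws on everything else.
mask = 255 * outside.astype(int)

drawable = (mask == 0).mean()
print(f"drawable fraction: {drawable:.3f}")
print(f"ideal circle:      {np.pi * radius ** 2 / size ** 2:.3f}")
```

The two printed fractions should be close, which confirms that words can only be placed inside the circle.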


1.2 Basic drawing


from wordcloud import WordCloud

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()
# Generate the word cloud; WordCloud tokenizes the input text itself
wordcloud = WordCloud().generate(text)

import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()

# Change the maximum font size
wordcloud = WordCloud(max_font_size=50).generate(text)

# Another way to display the result
image = wordcloud.to_image()
image.show()

1.3 Customize word cloud shape

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# To generate a word cloud with a specific shape, first prepare a mask image of that shape.
# Everything in the mask image except the target shape must be white
mask = np.array(Image.open("mask.png"))

# Words to skip
stopwords = set(STOPWORDS)
# Remove "better"
stopwords.add("better")

# contour_width sets the width of the mask outline, contour_color sets its color.
# If the outline is drawn inaccurately, set contour_width=0 to disable it
wc = WordCloud(background_color="white", max_words=2000, mask=mask,
               stopwords=stopwords, contour_width=2, contour_color='red', scale=2, repeat=True)

# Generate the word cloud
wc.generate(text)

# Save to a file
wc.to_file("result.png")

# Show the word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
# Show the mask image
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
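The example above expects a mask.png that is not supplied with the article. As a minimal sketch, a usable mask can be drawn with Pillow: the canvas size, the ellipse coordinates, and the file name ellipse_mask.png are all assumptions made for illustration.

```python
import numpy as np
from PIL import Image, ImageDraw

# Start from an all-white canvas: white (255) means "do not draw here".
img = Image.new("L", (400, 300), 255)

# Paint the target shape in black (0): words may be placed on these pixels.
draw = ImageDraw.Draw(img)
draw.ellipse((50, 25, 350, 275), fill=0)

# Hypothetical file name; pass it to Image.open in place of mask.png.
img.save("ellipse_mask.png")

mask = np.array(Image.open("ellipse_mask.png"))
print(mask.shape, mask.dtype)
```

Any image works as a mask as long as the area outside the target shape is pure white.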


1.4 Plotting with word frequency dictionary

# Install with: pip install multidict
import multidict as multidict

import numpy as np

import re
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

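# Count word frequencies
def getFrequencyDictForText(sentence):
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}

    # Split on any whitespace (spaces and newlines)
    for text in sentence.split():
        # Skip common function words to get a more customized result.
        # fullmatch is used so that words merely starting with "a", "the", ...
        # (for example "ambiguity") are not dropped as well
        if re.fullmatch("a|the|an|to|in|for|of|or|by|with|is|on|that|be", text, re.IGNORECASE):
            continue
        # Count case-insensitively
        key = text.lower()
        tmpDict[key] = tmpDict.get(key, 0) + 1
    # Build the word-frequency dictionary
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict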


def makeImage(text):
    mask = np.array(Image.open("mask.png"))

    wc = WordCloud(background_color="white", max_words=1000, mask=mask, repeat=True)
    wc.generate_from_frequencies(text)

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()



# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# Build the word-frequency dictionary
fullTermsDict = getFrequencyDictForText(text)
# Draw
makeImage(fullTermsDict)
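As an alternative sketch, the frequency dictionary can also be built with the standard library alone. The stop-word set STOP and the helper count_words are hypothetical names introduced here; unlike splitting on whitespace, the regular expression also strips punctuation such as the trailing period in "ugly.".

```python
import re
from collections import Counter

# Hypothetical stop-word set for this sketch; extend as needed.
STOP = {"a", "an", "the", "to", "in", "for", "of", "or", "by",
        "with", "is", "on", "that", "be"}

def count_words(text):
    """Return a Counter of lowercase words, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP)

freqs = count_words(
    "Now is better than never. Although never is often better than right now."
)
print(freqs.most_common(3))
```

Because Counter is a dict subclass, freqs can be passed directly to generate_from_frequencies.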


1.5 Color change

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# Image source: https://github.com/amueller/word_cloud/blob/master/examples/alice_color.png
alice_coloring = np.array(Image.open("alice_color.png"))
stopwords = set(STOPWORDS)
stopwords.add("better")

wc = WordCloud(background_color="white", max_words=500, mask=alice_coloring,
               stopwords=stopwords, max_font_size=50, random_state=42, repeat=True)
# Generate the word cloud
wc.generate(text)
# Show it
image = wc.to_image()
image.show()


# Draw a word cloud whose colors follow alice_coloring
# Extract the colors from the image
image_colors = ImageColorGenerator(alice_coloring)
# Recolor the word cloud
wc.recolor(color_func=image_colors)
# Show it
image = wc.to_image()
image.show()

# Show the mask image
plt.imshow(alice_coloring, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()


1.6 Set color for specific words

from wordcloud import (WordCloud, get_single_color_func)
import matplotlib.pyplot as plt


# Simple color function: assigns the exact given color to each word
class SimpleGroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        # Map each listed word to its color
        self.word_to_color = {word: color
                              for (color, words) in color_to_words.items()
                              for word in words}
        # Color for all other words
        self.default_color = default_color

    def __call__(self, word, **kwargs):
        return self.word_to_color.get(word, self.default_color)


class GroupedColorFunc(object):

    def __init__(self, color_to_words, default_color):
        self.color_func_to_words = [
            (get_single_color_func(color), set(words))
            for (color, words) in color_to_words.items()]

        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):
        """Returns a single_color_func associated with the word"""
        try:
            color_func = next(
                color_func for (color_func, words) in self.color_func_to_words
                if word in words)
        except StopIteration:
            color_func = self.default_color_func

        return color_func

    def __call__(self, word, **kwargs):
        return self.get_color_func(word)(word, **kwargs)


text = """The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

# collocations controls whether two-word collocations are counted
# when raw text is passed in
wc = WordCloud(collocations=False).generate(text.lower())

# Assign colors to specific words
color_to_words = {
    'green': ['beautiful', 'explicit', 'simple', 'sparse',
              'readability', 'rules', 'practicality',
              'explicitly', 'one', 'now', 'easy', 'obvious', 'better'],
    '#FF00FF': ['ugly', 'implicit', 'complex', 'complicated', 'nested',
                'dense', 'special', 'errors', 'silently', 'ambiguity',
                'guess', 'hard']
}

# All other words are drawn in grey
default_color = 'grey'

# Simple color function: draws with exactly the colors given in color_to_words,
# so the output shows no variation in shade
# grouped_color_simple = SimpleGroupedColorFunc(color_to_words, default_color)

# Finer color function: converts the colors in color_to_words to HSV space
# and varies the brightness within each color
grouped_color = GroupedColorFunc(color_to_words, default_color)

# Apply the color function
wc.recolor(color_func=grouped_color)

# Draw
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
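The classes above rely on the fact that a color function is simply a callable that receives the word (plus layout details as keyword arguments) and returns a color string. A minimal closure-based sketch of the same idea follows; make_color_func and its arguments are hypothetical names, not part of the wordcloud library.

```python
def make_color_func(word_to_color, default_color="grey"):
    """Build a callable suitable for WordCloud(color_func=...) or recolor()."""
    def color_func(word, font_size=None, position=None, orientation=None,
                   random_state=None, **kwargs):
        # Only the word is used here; the other arguments are accepted
        # because wordcloud passes them as keyword arguments.
        return word_to_color.get(word, default_color)
    return color_func

color_func = make_color_func({"beautiful": "green", "ugly": "#FF00FF"})
print(color_func("beautiful", font_size=40))  # green
print(color_func("python", font_size=12))     # grey
```

A closure like this is enough when every word maps to one fixed color; the class-based GroupedColorFunc is preferable when shades should vary within a group.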


1.7 Drawing Chinese word cloud

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import numpy as np
from PIL import Image
# Read the text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/CalltoArms.txt
with open('CalltoArms.txt','r',encoding='utf-8') as f:
    text = f.read()

# A font file must be set for Chinese text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/fonts/SourceHanSerif/SourceHanSerifK-Light.otf
font_path =  'SourceHanSerifK-Light.otf'

# List of words that should not appear in the word cloud
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/stopwords_cn_en.txt
stopwords_path = 'stopwords_cn_en.txt'
# Word cloud
# Template image
back_coloring = np.array(Image.open("alice_color.png"))

# New words to add to the jieba segmentation dictionary
userdict_list = ['阿Q', '孔乙己', '单四嫂子']

# Word segmentation
def jieba_processing_txt(text):
    for word in userdict_list:
        jieba.add_word(word)

    mywordlist = []
    # Segment the text
    seg_list = jieba.cut(text, cut_all=False)
    liststr = "/ ".join(seg_list)

    with open(stopwords_path, encoding='utf-8') as f_stop:
        f_stop_text = f_stop.read()
        f_stop_seg_list = f_stop_text.splitlines()

    # Keep words that are not stop words and are longer than one character
    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)
# Process the text
text = jieba_processing_txt(text)

# margin sets the margin around each word in the word cloud
wc = WordCloud(font_path=font_path, background_color="black", max_words=2000, mask=back_coloring,
               max_font_size=100, random_state=42, width=1000, height=860, margin=5,
               contour_width=2,contour_color='blue')


wc.generate(text)

# Extract colors from the template image
image_colors_byImg = ImageColorGenerator(back_coloring)

plt.imshow(wc.recolor(color_func=image_colors_byImg), interpolation="bilinear")
plt.axis("off")
plt.figure()
plt.imshow(back_coloring, interpolation="bilinear")
plt.axis("off")
plt.show()
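The stop-word filtering step in jieba_processing_txt can be isolated as in the sketch below. It assumes the text has already been segmented; the hand-written token list stands in for the output of jieba.cut, and filter_tokens is a hypothetical helper. Converting the stop words to a set avoids scanning a list for every token.

```python
def filter_tokens(tokens, stopwords, min_length=2):
    """Keep tokens that are not stop words and have at least min_length characters."""
    stop = set(stopwords)  # set membership is O(1) on average
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) >= min_length and token not in stop:
            kept.append(token)
    # WordCloud splits on whitespace, so join the tokens with spaces.
    return " ".join(kept)

# Hand-written tokens standing in for the output of jieba.cut
tokens = ["孔乙己", "的", "一", "长衫", " ", "没有", "长衫"]
print(filter_tokens(tokens, stopwords=["没有", "的"]))  # 孔乙己 长衫 长衫
```

The returned string can be passed to wc.generate in the same way as the output of jieba_processing_txt.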


Origin blog.csdn.net/LuohenYJ/article/details/128217880