Homepage: https://codeshellme.github.io
Today I will introduce how to use Python to make a word cloud.
A word cloud counts the most frequent words in a text and visualizes them, so that we can intuitively see the key words in the text.
The higher a word's frequency, the larger it is displayed.
1. The wordcloud module
wordcloud is a word cloud generator. It is not only a Python library, but also a command-line tool. We can learn how to use it through the official wordcloud documentation and its example gallery.
Before using wordcloud, you need to install it first:
pip install wordcloud
2. WordCloud class
The WordCloud class is used to create the word cloud object. Let's take a look at its prototype:
WordCloud(font_path=None,
width=400, height=200,
margin=2, ranks_only=None,
prefer_horizontal=0.9,
mask=None, scale=1,
color_func=None, max_words=200,
min_font_size=4, stopwords=None,
random_state=None,
background_color='black',
max_font_size=None,
font_step=1, mode='RGB',
relative_scaling='auto',
regexp=None, collocations=True,
colormap=None, normalize_plurals=True,
contour_width=0, contour_color='black',
repeat=False, include_numbers=False,
min_word_length=0,
collocation_threshold=30)
As you can see, the WordCloud class has many parameters that can be set. Here are some commonly used parameters:
- font_path : path to a font file with a .ttf suffix. If the analyzed text is Chinese, you need to set a Chinese font, otherwise the output will be garbled.
- background_color : the background color of the image; the default is black, and it can also be set to white, etc.
- mask : an image that sets the shape of the word cloud.
- max_words : the maximum number of words; the default is 200.
- stopwords : the set of stop words to ignore.
- max_font_size : the maximum font size.
- width : the width of the canvas; the default is 400.
- height : the height of the canvas; the default is 200.
- random_state : a random seed that makes the color assignment reproducible.
After creating the word cloud object, you can use the generate
method to generate the word cloud, and use the to_file
method to save the word cloud image to a file.
The prototype of the generate
method is as follows:
generate(text)
The parameter text
is a space-separated text string. If you are analyzing Chinese, you need to segment the words with jieba first; you can refer to here.
In addition to saving the word cloud image in a file, you can also use the Matplotlib module to display the word cloud image. The sample code is as follows:
import matplotlib.pyplot as plt
plt.imshow(wordcloud) # wordcloud is the word cloud object
plt.axis("off") # turn off the axes
plt.show()
3. A simple example
Here is a simple example showing how to use wordcloud.
First create the word cloud object:
from wordcloud import WordCloud
wc = WordCloud()
Generate word cloud:
text = "Python is a programming language, it is easy to use."
wc.generate(text)
The words_
attribute of the word cloud object stores the (normalized) weight of each word; the weights fall in the range (0, 1].
The words_
attribute is a dictionary, and the maximum number of keys it stores is max_words
, i.e. the parameter of the WordCloud
class, as follows:
>>> wc.words_
{'Python': 1.0, 'programming': 1.0, 'language': 1.0, 'easy': 1.0, 'use': 1.0}
# The words in this example all occur with equal frequency (once each),
# so their weights are all 1.
Use Matplotlib to display word cloud images:
import matplotlib.pyplot as plt
plt.imshow(wc)
plt.axis("off")
plt.show()
The word cloud image is as follows:
4. Do word cloud analysis of ancient poems
Here is a case I have prepared: a word cloud analysis of 1,000 ancient poems.
The code directory is as follows:
wordcloud/
├── SimHei.ttf
├── gushi.txt
└── gushi_wordcloud.py
Where:
- SimHei.ttf : a font file, used to avoid garbled Chinese characters in the word cloud.
- gushi.txt : this file contains 1,000 ancient poems.
- gushi_wordcloud.py : the word cloud analysis code.
I also put the code here for easy viewing:
#!/usr/bin/env python
# coding=utf-8

import os
import sys

import jieba
from wordcloud import WordCloud

if sys.version.startswith('2.'):
    reload(sys)
    sys.setdefaultencoding('utf-8')

# Remove some authors' names
STOPWORDS = [
    u'李白', u'杜甫', u'辛弃疾', u'李清照', u'苏轼',
    u'李商隐', u'王维', u'白居易', u'李煜', u'杜牧',
]

def load_file(file_path):
    if sys.version.startswith('2.'):
        with open(file_path) as f:
            lines = f.readlines()
    else:
        with open(file_path, encoding='utf-8') as f:
            lines = f.readlines()

    content = ''
    for line in lines:
        line = line.encode('unicode-escape').decode('unicode-escape')
        line = line.strip().rstrip('\n')
        content += line

    words = jieba.cut(content)
    l = []
    for w in words:
        # Discard words shorter than 2 characters
        if len(w) < 2:
            continue
        l.append(w)
    return ' '.join(l)

if __name__ == '__main__':
    file_path = './gushi.txt'
    content = load_file(file_path)

    wc = WordCloud(
        font_path="./SimHei.ttf",
        stopwords=STOPWORDS,
        width=2000, height=1200)
    wc.generate(content)
    wc.to_file("wordcloud.jpg")
Where:
- STOPWORDS : the stop word list, here the names of some authors.
- load_file : loads the text and segments it with jieba.
Finally, the word cloud image is saved to the file wordcloud.jpg
, as follows:
We can also view the weight of each word from the words_
attribute of the word cloud object. Here are the top ten:
('Ming Moon', 1.0)
('Today', 0.9130434782608695)
('I don't know', 0.8405797101449275)
('Where', 0.8260869565217391)
('No See', 0.8115942028985508)
('Spring Breeze', 0.7536231884057971)
('Nobody', 0.7536231884057971)
('Not possible', 0.7536231884057971)
('Wanli', 0.7536231884057971)
('Modern', 0.6666666666666666)