How to make a word cloud with Python: a word cloud analysis of 1,000 ancient poems

Homepage:  https://codeshellme.github.io

Today I will introduce how to use Python to make a word cloud.

A word cloud, also known as a tag cloud, counts the most frequent words in a piece of text and visualizes them, so that we can see the key words in the text at a glance.

The higher a word's frequency, the larger it appears in the word cloud.

1. The wordcloud module

wordcloud is a word cloud generator. It is not only a Python library but also a command line tool. We can learn how to use it from the official wordcloud documentation and its example gallery.

Before using wordcloud, you need to install it first:

pip install wordcloud
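
Since wordcloud is also a command line tool, a typical invocation of the bundled wordcloud_cli command looks like the sketch below; the file names input.txt and wordcloud.png are placeholders, and you can run wordcloud_cli --help to confirm the options available in your installed version:

wordcloud_cli --text input.txt --imagefile wordcloud.png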

2. WordCloud class

The WordCloud class is used to create word cloud objects. Let's take a look at its prototype:

WordCloud(font_path=None, 
  width=400, height=200, 
  margin=2, ranks_only=None, 
  prefer_horizontal=0.9, 
  mask=None, scale=1, 
  color_func=None, max_words=200, 
  min_font_size=4, stopwords=None, 
  random_state=None, 
  background_color='black', 
  max_font_size=None, 
  font_step=1, mode='RGB', 
  relative_scaling='auto', 
  regexp=None, collocations=True, 
  colormap=None, normalize_plurals=True, 
  contour_width=0, contour_color='black', 
  repeat=False, include_numbers=False, 
  min_word_length=0, 
  collocation_threshold=30)

As you can see, the WordCloud class has many parameters that can be set. Here are some of the commonly used ones (a short example follows the list):

  • font_path: the path of the font file to use (a file with the .ttf suffix).
    • If the analyzed text is Chinese, you must set a Chinese font, otherwise the output will be garbled.
  • background_color: the background color of the image; the default is black, and it can also be set to white, etc.
  • mask: the background image (mask) used to shape the word cloud.
  • max_words: the maximum number of words; the default is 200.
  • stopwords: the stop words to exclude.
  • max_font_size: the maximum font size.
  • width: the width of the canvas; the default is 400.
  • height: the height of the canvas; the default is 200.
  • random_state: the random seed, which controls the (reproducible) random choice of word layout and colors.
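
As a quick illustration of these parameters, here is a minimal sketch that sets several of them explicitly; the font path msyh.ttf and the stop words are assumptions for the example, not values required by the library:

from wordcloud import WordCloud

wc = WordCloud(
    font_path="msyh.ttf",        # assumed path to a .ttf font file
    background_color="white",    # white background instead of the default black
    max_words=100,               # keep at most 100 words
    stopwords={"and", "the"},    # example stop words (an assumption)
    width=800, height=600)       # canvas size in pixels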

After creating a word cloud object, you can call the generate method to generate the word cloud and the to_file method to save the word cloud image to a file.

The prototype of the generate method is as follows:

generate(text)

The text parameter is a space-separated text string. If the text being analyzed is Chinese, you need to use jieba for word segmentation first; a short sketch follows.
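
For Chinese text, a minimal sketch of the segmentation step might look like this, assuming jieba is installed (pip install jieba); the sample sentence is only an illustration:

import jieba

text = "春眠不觉晓,处处闻啼鸟。"   # a sample line of Chinese text
words = jieba.cut(text)             # segment the text into words
text = " ".join(words)              # space-separated string, ready for generate()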

In addition to saving the word cloud image to a file, you can also use the Matplotlib module to display it. The sample code is as follows:

import matplotlib.pyplot as plt

plt.imshow(wordcloud)  # wordcloud is the word cloud object
plt.axis("off")        # turn off the axes
plt.show()

3. A simple example

Here is a simple example to see how to use wordcloud.

First, create a word cloud object:

from wordcloud import WordCloud
wc = WordCloud()

Generate word cloud:

text = "Python is a programming language, it is easy to use."
wc.generate(text)
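
To save this simple example to an image file as well, you can call the to_file method mentioned earlier; the file name simple_wordcloud.png is just an example:

wc.to_file("simple_wordcloud.png")  # write the rendered word cloud to a PNG file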

The words_ attribute of the word cloud object stores the (normalized) weight of each word; the weights fall in the range (0, 1].

The words_ attribute is a dictionary, and it stores at most max_words keys, where max_words is the WordCloud parameter described above.

as follows:

>>> wc.words_
{'Python': 1.0, 'programming': 1.0, 'language': 1.0, 'easy': 1.0, 'use': 1.0}
# The words in this example all occur with the same frequency,
# so each has a weight of 1.0.

Use Matplotlib to display the word cloud image:

import matplotlib.pyplot as plt

plt.imshow(wc) 
plt.axis("off")
plt.show()

The word cloud image is as follows:

4. A word cloud analysis of ancient poems

I have prepared a case here: a word cloud analysis of 1,000 ancient poems.

The code directory is as follows:

wordcloud/
├── SimHei.ttf
├── gushi.txt
└── gushi_wordcloud.py

Where:

  • SimHei.ttf: a Chinese font file, used to avoid garbled Chinese characters in the word cloud.
  • gushi.txt: the file containing the 1,000 ancient poems.
  • gushi_wordcloud.py: the word cloud analysis code.

I also put the code here for easy viewing:

#!/usr/bin/env python
# coding=utf-8

import os
import sys
import jieba
from wordcloud import WordCloud

if sys.version.startswith('2.'):
    # On Python 2, force UTF-8 as the default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')

# Stop words: remove some authors' names
STOPWORDS = [
        u'李白', u'杜甫', u'辛弃疾', u'李清照', u'苏轼',
        u'李商隐', u'王维', u'白居易', u'李煜', u'杜牧',
        ]

def load_file(file_path):
    if sys.version.startswith('2.'):
        with open(file_path) as f:
            lines = f.readlines()
    else:
        with open(file_path, encoding='utf-8') as f:
            lines = f.readlines()

    content = ''

    for line in lines:
        line = line.encode('unicode-escape').decode('unicode-escape')
        line = line.strip().rstrip('\n')

        content += line

    words = jieba.cut(content)

    l = []
    for w in words:
        # Discard words shorter than 2 characters
        if len(w) < 2: continue

        l.append(w)

    return ' '.join(l)


if __name__ == '__main__':
    file_path = './gushi.txt'
    content = load_file(file_path)

    wc = WordCloud(
            font_path="./SimHei.ttf",
            stopwords=STOPWORDS,
            width=2000, height=1200)

    wc.generate(content)
    wc.to_file("wordcloud.jpg")

Where:

  • STOPWORDS: the stop word list, which contains the names of some authors.
  • load_file: the function that loads the text and segments it with jieba.

Finally, the word cloud image is saved to the file wordcloud.jpg, as shown below:

We can also view the weight of each word through the words_ attribute of the word cloud object. Here are the top ten:

('Ming Moon', 1.0)
('Today', 0.9130434782608695)
('I don't know', 0.8405797101449275)
('Where', 0.8260869565217391)
('No See', 0.8115942028985508)
('Spring Breeze', 0.7536231884057971)
('Nobody', 0.7536231884057971)
('Not possible', 0.7536231884057971)
('Wanli', 0.7536231884057971)
('Modern', 0.6666666666666666)
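
For reference, here is a small sketch of how such a top-ten listing can be produced from the words_ attribute, assuming wc is the word cloud object created by the script above:

# Sort the (word, weight) pairs by weight in descending order and take the first ten
top10 = sorted(wc.words_.items(), key=lambda x: x[1], reverse=True)[:10]
for word, weight in top10:
    print((word, weight))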

(End of this section.)

Origin blog.csdn.net/Python_sn/article/details/111266392