Word cloud production based on Python

1 Installation and introduction of third-party libraries

1.1 Python third-party library jieba (Chinese word segmentation)

1. Features

(1) Supports three word segmentation modes:

  • Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.
  • Full mode: scans out every word in the sentence that can form a word; very fast, but cannot resolve ambiguity.
  • Search engine mode: on the basis of precise mode, segments long words again to improve recall; suitable for search engine word segmentation.

(2) Supports traditional Chinese word segmentation

(3) Supports custom dictionaries

(4) MIT license

2. Installation and usage instructions

pip install jieba / pip3 install jieba

Import it with import jieba

3. Main functions

Only the word segmentation functions are covered here, as detailed below:

(1) The jieba.cut method accepts three input parameters:

  • A string that needs word segmentation;
  • The cut_all parameter is used to control whether to use the full mode;
  • The HMM parameter is used to control whether to use the HMM model.

(2) The jieba.cut_for_search method accepts two parameters:

  • A string that needs word segmentation;
  • Whether to use the HMM model.

This method is suitable for search engines building inverted indexes, as the segmentation granularity is relatively fine.

1. The string to be segmented can be a unicode/UTF-8 string or a GBK string.

2. Passing a GBK string directly is not recommended, as it may be unexpectedly decoded as UTF-8.

3. jieba.cut and jieba.cut_for_search return an iterable generator; you can use a for loop to obtain each word (unicode) produced by the segmentation, or use jieba.lcut and jieba.lcut_for_search to return a list directly.
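Following note 2, the safest approach is to decode the file explicitly when reading it, so the tokenizer always receives a properly decoded string — a minimal sketch with a hypothetical file name:

```python
# Write a small sample file, then read it back with an explicit
# encoding, so the text is decoded before any segmentation happens.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("轻轻的我走了")

with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(text)
```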

(3) jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer; all global tokenizer-related functions are mappings of this tokenizer.

4. Sample code

(1) Implementation code:

# coding=utf-8
import jieba

text = "Gently I left, as I gently came"

seg_list = jieba.cut(text, cut_all=False)
print("Default mode: " + "/".join(seg_list))    # precise mode

seg_list = jieba.cut(text, cut_all=True)
print("Full mode: " + "/".join(seg_list))       # full mode

seg_list = jieba.cut_for_search(text)
print("Search mode: " + "/".join(seg_list))     # search engine mode

 

(2) Result:

 

1.2 Python third-party library wordcloud (word cloud)

1. Installation and usage instructions

pip install wordcloud / pip3 install wordcloud

Import it with import wordcloud

2. Main functions

wordcloud treats the word cloud as an object: it draws the cloud from the frequencies of words in the text, and the size, color, shape, etc. of the cloud can all be configured.

The steps to generate a word cloud are as follows:

(1) Configure object parameters

(2) Load word cloud text

(3) Output the word cloud file (if no size is specified, the default image size is 400 × 200)

3. Common parameter list

The most commonly used WordCloud parameters include:

  • font_path: path to the font file used to render words (required for Chinese text)
  • width, height: canvas width and height in pixels (default 400 × 200)
  • mask: an image array whose non-white region defines the shape of the cloud
  • background_color: background color of the image (default 'black')
  • max_words: maximum number of words displayed (default 200)
  • stopwords: set of words to exclude from the cloud
  • max_font_size: maximum font size of the largest word

2 Make a word cloud

2.1  Generate the word cloud of "Management Standards for Asymptomatic Infected Patients of New Coronavirus"

(1) Implementation code:

# coding=utf-8
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud

# 1. Read the txt text data
with open("test.txt", "r", encoding="utf-8") as f:
    text = f.read()

# 2. Segment the text
cut_text = " ".join(jieba.cut(text))

# 3. Generate the word cloud
wc = WordCloud(
    font_path=r'.\simhei.ttf',
    background_color='white',
    width=1000,
    height=880,
).generate(cut_text)

# 4. Display the word cloud image
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

 

(2) Result:

 

2.2 Generate the word cloud of "Notice on Doing a Good Job in Employment and Entrepreneurship for College Graduates"

(1) Implementation code:

# coding=utf-8
import PIL.Image as image
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator

def GetWordCloud():
    path_txt = "test.txt"
    path_img = "test.jpg"
    # 1. Read the txt text data
    with open(path_txt, 'r', encoding='utf-8') as f:
        text = f.read()
    background_image = np.array(image.open(path_img))

    # 2. Segment the text
    cut_text = " ".join(jieba.cut(text))

    # 3. Generate the word cloud, using the image as a mask
    wc = WordCloud(
        font_path=r'.\simhei.ttf',
        background_color='white',
        mask=background_image
    ).generate(cut_text)

    # Generate color values from the mask image
    image_colors = ImageColorGenerator(background_image)

    # 4. Display the word cloud image
    plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
    plt.axis('off')
    plt.show()

if __name__ == "__main__":
    GetWordCloud()

 

(2) Result:

 


Origin www.cnblogs.com/yangmi511/p/12676116.html