1 Installation and introduction of third-party libraries
1.1 Python third-party library jieba (Chinese word segmentation)
1. Features
(1) Supports three word segmentation modes:
- Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans all the words in the sentence that can form a word; very fast, but cannot resolve ambiguity.
- Search engine mode: on the basis of precise mode, segments long words again to improve recall; suitable for search engine word segmentation.
(2) Supports traditional Chinese word segmentation
(3) Supports custom dictionaries
(4) MIT license
2. Installation and usage instructions
pip install jieba / pip3 install jieba
Import it with import jieba
3. Main functions
Only the word segmentation functions are covered here, as detailed below:
(1) The jieba.cut method accepts three input parameters:
- A string that needs word segmentation;
- The cut_all parameter is used to control whether to use the full mode;
- The HMM parameter is used to control whether to use the HMM model.
(2) The jieba.cut_for_search method accepts two parameters:
- A string that needs word segmentation;
- Whether to use the HMM model.
This method is suitable for building the inverted index of a search engine; the segmentation granularity is relatively fine.
Note:
1. The string to be segmented can be a unicode, UTF-8, or GBK string.
2. Passing a GBK string directly is not recommended, because it may be unexpectedly decoded as UTF-8.
3. The structure returned by jieba.cut and jieba.cut_for_search is an iterable generator; you can use a for loop to get each word (unicode) produced by the segmentation, or use jieba.lcut and jieba.lcut_for_search to return a list directly.
(3) jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer; all global segmentation functions are mappings of this tokenizer's methods.
4. Sample code
(1) Implementation code:
```python
# coding=utf-8
import jieba

text = "Gently I left as I gently came"

seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + "/".join(seg_list))  # precise mode

seg_list = jieba.cut(text, cut_all=True)
print("Full Mode: " + "/".join(seg_list))  # full mode

seg_list = jieba.cut_for_search(text)
print("Search Mode: " + "/".join(seg_list))  # search engine mode
```
(2) Operation result:
1.2 Python third-party library wordcloud (word cloud)
1. Installation and usage instructions
pip install wordcloud / pip3 install wordcloud
Import it with import wordcloud
2. Main functions
wordcloud treats the word cloud as an object. It can draw a word cloud using the frequencies of the words in a text as input, and the size, color, shape, etc. of the word cloud can be configured.
The steps to generate a word cloud are as follows:
(1) Configure object parameters
(2) Load word cloud text
(3) Output the word cloud file (if no size is specified, the default image size is 400 × 200)
3. Common parameter list
2 Make a word cloud
2.1 Generate the word cloud of "Management Standards for Asymptomatic Infected Patients of New Coronavirus"
(1) Implementation code:
```python
# coding=utf-8
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud

# 1. Read txt text data
with open("test.txt", 'r') as f:
    text = f.read()

# 2. Word segmentation
cut_text = " ".join(jieba.cut(text))

# 3. Generate the word cloud
wc = WordCloud(
    font_path=r'.\simhei.ttf',
    background_color='white',
    width=1000,
    height=880,
).generate(cut_text)

# 4. Display the word cloud image
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()
```
(2) Operation result:
2.2 Generate the word cloud of "Notice on Doing a Good Job in Employment and Entrepreneurship for College Graduates"
(1) Implementation code:
```python
# coding=utf-8
import PIL.Image as image
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator

def GetWordCloud():
    path_txt = "test.txt"
    path_img = "test.jpg"
    # 1. Read txt text data
    with open(path_txt, 'r') as f:
        text = f.read()
    background_image = np.array(image.open(path_img))

    # 2. Word segmentation
    cut_text = " ".join(jieba.cut(text))

    # 3. Generate the word cloud, using the image as a mask
    wc = WordCloud(
        font_path=r'.\simhei.ttf',
        background_color='white',
        mask=background_image
    ).generate(cut_text)

    # Generate color values from the mask image
    image_colors = ImageColorGenerator(background_image)

    # 4. Display the word cloud image
    plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
    plt.axis('off')
    plt.show()


if __name__ == "__main__":
    GetWordCloud()
```
(2) Operation result: