Text mining: word clouds and personalized word clouds
One: Word cloud (WordCloud)
Word cloud: a visualization that highlights the keywords in a text by varying their font size, color, and style according to how often they occur, and displays them together in a single image.
In my view, wordcloud is a presentation tool that takes words (English, Chinese, and other languages are supported) as its basic elements and fills an image with them very efficiently. It also offers a mask feature and can be combined with word segmentation tools to present text in a more intuitive, attractive, creative, and personalized way.
High-frequency keywords are made visually prominent, which gives the set of keywords an intuitive hierarchy and filters out a large amount of low-value text, so a viewer can grasp the gist of the text at a glance.
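To make the frequency-to-size idea concrete, here is a minimal sketch that counts words and scales font sizes linearly with frequency. The function name `font_sizes` and the size range are illustrative assumptions; the wordcloud library uses its own layout and scaling logic, which this does not reproduce.

```python
from collections import Counter


def font_sizes(words, min_size=10, max_size=80):
    """Map each word to a font size that grows linearly with its frequency.

    Illustrative only: this is not the scaling rule used by the wordcloud library.
    """
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in counts.items()
    }


words = "cloud word cloud text cloud word".split()
sizes = font_sizes(words)

# Print the words from largest to smallest font size
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

The most frequent word gets the largest size, and rarer words shrink toward `min_size`.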
Installing the WordCloud library
- Install from the command line: pip install wordcloud
- Download and install: download a prebuilt installer from the site below, making sure it matches the Python version you are using. https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
Two: jieba
jieba ("结巴", literally "stutter") is an excellent third-party Python library for Chinese word segmentation and one of the common tools in natural language processing (NLP). It uses a dictionary to estimate the association probability between Chinese characters; characters with a high association probability are grouped into words, which form the segmentation result.
jieba project address: https://github.com/fxsjy/jieba
jieba is small and compact, but its segmentation capability should not be underestimated. It has three segmentation modes, supports custom dictionaries, and supports Traditional Chinese segmentation.
Three segmentation modes:
- (A) Accurate mode
- Tries to cut the sentence as precisely as possible; suited to text analysis.
- (B) Full mode
- Scans out every string in the sentence that can form a word. It is very fast but does not resolve ambiguity.
- (C) Search engine mode
- Builds on accurate mode by splitting long words again, which improves recall; suited to segmentation for search engines.
Installing jieba
Compatible with both Python 2 and Python 3. Use it with import jieba.
- (A) Automatic installation
- easy_install jieba or pip install jieba or pip3 install jieba
- (B) Semi-automatic installation
- http://pypi.python.org/pypi/jieba/
- After downloading, unzip the archive and run python setup.py install
- (C) Manual installation
- Place the jieba directory in the current directory or in the site-packages directory.
- (D) Installation through PyCharm
- 【File】-【Settings】-【Project Interpreter】-【+】- search for jieba -【Install Package】
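Whichever installation route you choose, a quick check like the following can confirm that the packages are visible to the interpreter you are running. This is a minimal sketch using only the standard library; the helper name `is_installed` is an assumption made for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


for name in ("jieba", "wordcloud"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

If a package shows as missing even though you installed it, the install probably went to a different Python interpreter than the one running the script.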
Case: drawing a word cloud from excerpted lines of "大话西游" (A Chinese Odyssey)
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud
from PIL import Image

# Load the text data
with open('F:/data/大话西游.txt', 'r', encoding='gbk') as f:
    txt = f.read()

# Segment the text and join the tokens with spaces
txt2 = ' '.join(jieba.cut(txt))

# Stop words, method 2: read the stop-word file into a list
with open('F:/data/stopword.txt', 'r', encoding='gbk') as f:
    s = f.read()
stopword = s.split('\n')

# Basic word cloud
wordcloud = WordCloud(
    font_path="F:/data/FZSTK.TTF",
).generate(txt2)
plt.imshow(wordcloud)
wordcloud = WordCloud(
    font_path="F:/data/arial unicode ms.ttf",  # font; without it, Chinese characters are rendered incorrectly
    background_color='white',  # background color
    max_words=80,  # maximum number of words to display
    max_font_size=80,  # maximum font size (optional)
    stopwords=stopword,  # remove stop words
).generate(txt2)
plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the word cloud image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test.png', dpi=300, bbox_inches='tight')  # save as a 300 DPI PNG with minimal white margins
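Splitting the stop-word file on '\n' can leave empty strings or entries with trailing whitespace in the list. The sketch below shows one way to clean the list and to filter tokens by hand before counting; the helper names and the sample tokens are assumptions for illustration, not part of the original script.

```python
from collections import Counter


def parse_stopwords(raw_text):
    """Split a stop-word file's contents into a set, skipping blank lines."""
    return {line.strip() for line in raw_text.splitlines() if line.strip()}


def count_without_stopwords(tokens, stopwords):
    """Count tokens that are neither stop words nor pure whitespace."""
    return Counter(t for t in tokens if t.strip() and t not in stopwords)


raw = "的\n了\n\n是 \n"  # stands in for the contents of a stop-word file
stopwords = parse_stopwords(raw)

tokens = ["月光宝盒", "的", "故事", "是", "月光宝盒", " "]
frequencies = count_without_stopwords(tokens, stopwords)
print(frequencies.most_common())
```

A frequency table like this can be passed to `WordCloud.generate_from_frequencies`. Filtering by hand matters on that path, because the `stopwords` parameter is ignored when a word cloud is built from frequencies.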
Personalized word cloud
# Read the background image
alice_mask = np.array(Image.open("F:/data/heart.jpg"))
wordcloud = WordCloud(
    background_color='white',  # background color
    max_words=100,  # maximum number of words to display
    font_path="F:/data/arial unicode ms.ttf",  # font; without it, Chinese characters are rendered incorrectly
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # shape the word cloud with the background image
).generate(txt2)
plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the word cloud image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test2.png', dpi=300, bbox_inches='tight')
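WordCloud treats pure white (255) pixels in the mask as masked out and draws words on everything else. JPEG compression can leave a nominally white background slightly off-white, in which case words may spill outside the intended shape. The sketch below snaps near-white pixels to pure white; the helper name `snap_near_white` and the threshold of 240 are assumptions chosen for illustration, and it expects an RGB array like the one loaded above.

```python
import numpy as np


def snap_near_white(mask, threshold=240):
    """Return a copy of an RGB mask in which near-white pixels become pure white."""
    cleaned = mask.copy()
    # A pixel counts as near-white when all three color channels reach the threshold
    near_white = np.all(cleaned[:, :, :3] >= threshold, axis=-1)
    cleaned[near_white, :3] = 255
    return cleaned


# Tiny 2x2 RGB image standing in for a real mask image
demo = np.array(
    [[[250, 250, 250], [255, 255, 255]],
     [[10, 20, 30], [245, 100, 245]]],
    dtype=np.uint8,
)
cleaned = snap_near_white(demo)
print(cleaned[0, 0], cleaned[1, 0], cleaned[1, 1])
```

The cleaned array can then be passed as `mask=` in place of the raw image array.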
Advanced: coloring the word cloud from a template image
If you have completed the steps so far, you can already call yourself an expert. Next, let's improve the viewing experience by optimizing the image colors.
ImageColorGenerator(image, default_color=None) is a color generator based on a color image. It generates colors from an RGB image: each word is colored with the mean color of the rectangle that encloses it in the image. After construction, the object acts as a callable that can be passed as color_func to the WordCloud constructor or to the recolor method. In other words, to recolor a word cloud you set the color_func parameter of the WordCloud class.
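The mean-color idea can be sketched in a few lines of NumPy. This is an illustration of the principle only, under the assumption of an RGB array; the function name `mean_patch_color` is hypothetical, and the real ImageColorGenerator also measures the word's bounding box from the font, which is omitted here.

```python
import numpy as np


def mean_patch_color(image, row, col, height, width):
    """Return the mean RGB color of a rectangular patch as an 'rgb(r, g, b)' string."""
    patch = image[row:row + height, col:col + width, :3]
    r, g, b = patch.reshape(-1, 3).mean(axis=0)
    return "rgb(%d, %d, %d)" % (r, g, b)


# 4x4 black image with a colored 2x2 block in the top-left corner
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0:2, 0:2] = [200, 100, 50]

inside = mean_patch_color(image, 0, 0, 2, 2)    # patch lies fully on the colored block
straddle = mean_patch_color(image, 0, 0, 2, 4)  # half colored block, half black
print(inside, straddle)
```

A word placed over a uniformly colored region takes that color, while a word that straddles two regions gets a blend of both.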
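from wordcloud import WordCloud, ImageColorGenerator

# Use ImageColorGenerator to build colors from the template image
color_new = ImageColorGenerator(alice_mask)
wordcloud = WordCloud(
    background_color='white',  # background color
    max_words=100,  # maximum number of words to display
    font_path="F:/data/arial unicode ms.ttf",  # font; without it, Chinese characters are rendered incorrectly
    contour_width=25,  # width of the outline around the word cloud shape
    contour_color='red',  # color of the outline around the word cloud shape
    color_func=color_new,  # pass in the colors generated from the template image
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # shape the word cloud with the background image
).generate(txt2)
plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the word cloud image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test3.png', dpi=300, bbox_inches='tight')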
Seven: Data Modeling
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud as WC
import wordcloud as wd

# Load the text data
with open('F:/data/大话西游.txt', 'r', encoding='gbk') as f:
    txt = f.read()

# Segment the text
txt2 = ' '.join(jieba.cut(txt))

# Method 2: read the stop-word file into a list
with open('F:/data/stopword.txt', 'r', encoding='gbk') as f:
    s = f.read()
stopword = s.split('\n')

wordcloud = WC(font_path="F:/data/FZSTK.TTF").generate(txt2)

# Read the background image
alice_mask = np.array(Image.open("F:/data/heart.jpg"))

# Use ImageColorGenerator to build colors from the template image
color_new = wd.ImageColorGenerator(alice_mask)
wordcloud = WC(
    background_color='white',  # background color
    max_words=100,  # maximum number of words to display
    font_path="F:/data/arial unicode ms.ttf",  # font; without it, Chinese characters are rendered incorrectly
    contour_width=25,  # width of the outline around the word cloud shape
    contour_color='red',  # color of the outline around the word cloud shape
    color_func=color_new,  # pass in the colors generated from the template image
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # shape the word cloud with the background image
).generate(txt2)
plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the word cloud image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test22.png', dpi=300, bbox_inches='tight')