Text mining with word clouds and personalized word clouds

One: word cloud (WordCloud)

Word cloud: a visualization that displays the keywords appearing in a text together, varying font size, color, and style according to how often each keyword occurs.

In my view, wordcloud is a tool that takes words (English, Chinese, and other languages are supported) as its basic elements and fills an image with them very efficiently to present a text. It also offers a mask function and can be combined with word segmentation tools, so that text can be presented in a more intuitive, attractive, creative, and personalized way.

High-frequency keywords are made visually prominent, which gives the keyword set a clear visual hierarchy and filters out large amounts of low-value text, so that a viewer can grasp the gist of the text at a glance.
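To make the "frequency decides font size" idea concrete, here is a minimal sketch using only the standard library. It is not the WordCloud library's internal algorithm; the function name scale_font_sizes, the size range, and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def scale_font_sizes(tokens, min_size=10, max_size=80):
    """Map each distinct token to a font size that grows linearly with its count.

    Hypothetical helper for illustration; assumes tokens is non-empty.
    """
    counts = Counter(tokens)
    highest = max(counts.values())
    lowest = min(counts.values())
    # Fall back to 1 when every token has the same count, to avoid dividing by zero.
    span = (highest - lowest) or 1
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts.items()
    }


# Made-up sample tokens, standing in for a segmented text.
tokens = ["cloud", "word", "cloud", "text", "cloud", "word"]
sizes = scale_font_sizes(tokens)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, round(size))
```

The most frequent token gets the largest size and the rarest the smallest, which is the visual hierarchy described above.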

Installing the WordCloud library

The library can be installed with pip install wordcloud.

Two: jieba

jieba ("stutter" Chinese word segmentation) is an excellent third-party Python library for Chinese word segmentation and one of the tools used in natural language processing (NLP). It relies on a dictionary to determine the association probability between Chinese characters; characters with a high association probability are grouped into words, and these words form the segmentation result.

jieba project address: https://github.com/fxsjy/jieba

jieba is small, but its segmentation capability should not be underestimated. It has three segmentation modes, supports custom dictionaries, and supports Traditional Chinese segmentation.

The three segmentation modes:

  • (A) Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
  • (B) Full mode: scans out all the words that can be formed from the sentence; very fast, but cannot resolve ambiguity.
  • (C) Search engine mode: builds on accurate mode and segments long words again to improve recall; suitable for search engine segmentation.

jieba installation

Compatible with both Python 2 and Python 3. Use it with import jieba.

  • (A) Fully automatic installation: easy_install jieba, pip install jieba, or pip3 install jieba
  • (B) Semi-automatic installation: download from http://pypi.python.org/pypi/jieba/, unzip, and run python setup.py install
  • (C) Manual installation: place the jieba directory in the current directory or in the site-packages directory
  • (D) Installation in PyCharm: 【File】-【Settings】-【Project Interpreter】-【+】- search for jieba -【Install Package】

Case: drawing a word cloud from an excerpt of lines from "大话西游" (A Chinese Odyssey)

import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud
from PIL import Image

# Load the text data
with open('F:/data/大话西游.txt', 'r', encoding='gbk') as f:
    txt = f.read()

# Segment the text with jieba and join the tokens with spaces
txt2 = ' '.join(jieba.cut(txt))

# Stop words, method 2: read the stop word file into a list
with open('F:/data/stopword.txt', 'r', encoding='gbk') as f:
    s = f.read()
stopword = s.split('\n')

# Basic word cloud (stop words are removed in the later versions)
wordcloud = WordCloud(
    font_path="F:/data/FZSTK.TTF",
).generate(txt2)
plt.imshow(wordcloud)
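The stop word list loaded above is later handed to WordCloud through its stopwords parameter. To show what that filtering amounts to, here is a standard-library sketch on an in-memory token list. The helper names load_stopwords and remove_stopwords and the sample data are assumptions for illustration, not part of the original code.

```python
def load_stopwords(raw_text):
    """Turn the raw contents of a stop word file (one word per line) into a set."""
    # splitlines() handles both \n and \r\n; blank lines are dropped.
    return {line.strip() for line in raw_text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only tokens that are not blank and not in the stop word set."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Made-up stand-ins for the file contents and the jieba output.
raw = "的\n了\n\n是\r\n"
stopwords = load_stopwords(raw)
tokens = ["月光宝盒", "是", "宝物", "了", " ", "的"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)  # ['月光宝盒', '宝物']
```

Using splitlines and strip avoids empty strings or stray carriage returns ending up in the stop word collection, which a plain split('\n') can leave behind.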


wordcloud = WordCloud(
    font_path="F:/data/arial unicode ms.ttf",  # font; without it Chinese characters are garbled
    background_color='white',  # background color
    max_words=80,  # maximum number of words displayed
    max_font_size=80,  # maximum font size (optional)
    stopwords=stopword,  # remove stop words
).generate(txt2)

plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test.png', dpi=300, bbox_inches='tight')  # save as a 300 DPI PNG with minimal white border


Personalized word cloud

# Read the background image
alice_mask = np.array(Image.open("F:/data/heart.jpg"))

wordcloud = WordCloud(
    background_color='white',  # background color
    max_words=100,  # maximum number of words displayed
    font_path="F:/data/arial unicode ms.ttf",  # font; without it Chinese characters are garbled
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # background image used as the mask
).generate(txt2)

plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test2.png', dpi=300, bbox_inches='tight')
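In WordCloud, pure white pixels of the mask are treated as masked out, and words are drawn only on the remaining area. When no suitable picture is at hand, a mask can also be built as a NumPy array. The sketch below builds a circular one; the function name circle_mask and the sizes are assumptions for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Build a square uint8 mask: 255 (white, masked out) outside a centered circle, 0 inside."""
    y, x = np.ogrid[:size, :size]
    center = size // 2
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)


mask = circle_mask()
print(mask.shape, mask[0, 0], mask[150, 150])  # (300, 300) 255 0

# The array could then be used in place of alice_mask, e.g. WordCloud(mask=mask, ...).
```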


Advanced: coloring the word cloud from a template image

Anyone who has completed the steps above can already be called an expert. Next, let's improve the viewing experience by optimizing the image colors.

ImageColorGenerator(image, default_color=None) is a color generator based on a color image. It generates colors from an RGB image: each word is colored with the average color of the image region enclosed by the word's rectangle. Once constructed, the object acts as a callable that can be passed as color_func to the WordCloud constructor or to the recolor method. To recolor a word cloud, set the color_func parameter of the WordCloud class.
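ImageColorGenerator is one such callable. To illustrate what a color_func looks like in general, here is a hand-written one that needs no template image. It is a sketch: the function name size_based_color and the color choice are assumptions, and the parameter names follow the keyword arguments that WordCloud passes to color_func.

```python
def size_based_color(word, font_size, position, orientation,
                     random_state=None, **kwargs):
    """Return an HSL color string; larger words get a darker blue.

    Hypothetical color function for illustration. Extra keyword arguments
    such as font_path are accepted through **kwargs and ignored.
    """
    # Clamp lightness to the range 20..70 so words stay readable on white.
    lightness = max(20, min(70, 90 - int(font_size)))
    return "hsl(210, 80%, {}%)".format(lightness)


print(size_based_color("cloud", 80, (0, 0), None))  # hsl(210, 80%, 20%)
print(size_based_color("text", 10, (0, 0), None))   # hsl(210, 80%, 70%)

# It could then be passed as WordCloud(color_func=size_based_color, ...)
# or applied to an existing cloud with recolor(color_func=size_based_color).
```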

from wordcloud import WordCloud, ImageColorGenerator
import jieba

# Use ImageColorGenerator to generate colors from the template image and store it in a variable
color_new = ImageColorGenerator(alice_mask)

wordcloud = WordCloud(
    background_color='white',  # background color
    max_words=100,  # maximum number of words displayed
    font_path="F:/data/arial unicode ms.ttf",  # font; without it Chinese characters are garbled
    contour_width=25,  # width of the word cloud shape outline
    contour_color='red',  # color of the word cloud shape outline
    color_func=color_new,  # pass in the colors generated from the template image
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # background image used as the mask
).generate(txt2)

plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the image with bilinear interpolation
plt.axis("off")  # hide the axes
plt.savefig('F:/data/test3.png', dpi=300, bbox_inches='tight')


Seven: Data Modeling

import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud as WC
import wordcloud as wd

# Load the text data
with open('F:/data/大话西游.txt', 'r', encoding='gbk') as f:
    txt = f.read()

# Word segmentation
txt2 = ' '.join(jieba.cut(txt))

# Method 2: read the stop word file into a list
with open('F:/data/stopword.txt', 'r', encoding='gbk') as f:
    s = f.read()
stopword = s.split('\n')

# Basic word cloud
wordcloud = WC(font_path="F:/data/FZSTK.TTF").generate(txt2)

# Read the background image
alice_mask = np.array(Image.open("F:/data/heart.jpg"))

# Use ImageColorGenerator to generate colors from the template image and store it in a variable
color_new = wd.ImageColorGenerator(alice_mask)

wordcloud = WC(
    background_color='white',  # background color
    max_words=100,  # maximum number of words displayed
    font_path="F:/data/arial unicode ms.ttf",  # font; without it Chinese characters are garbled
    contour_width=25,  # width of the word cloud shape outline
    contour_color='red',  # color of the word cloud shape outline
    color_func=color_new,  # pass in the colors generated from the template image
    stopwords=stopword,  # remove stop words
    mask=alice_mask,  # background image used as the mask
).generate(txt2)

plt.figure(figsize=(18, 10), dpi=72)
plt.imshow(wordcloud, interpolation='bilinear')  # draw the image with bilinear interpolation
plt.axis("off")  # hide the axes

plt.savefig('F:/data/test22.png', dpi=300, bbox_inches='tight')

 

Feel free to follow the author: yizhiamumu


Origin www.cnblogs.com/yizhiamumu/p/12650648.html