
NLP Text Visualization Open-Source Components: A Summary of Projects such as TextGrapher (text-to-graph), wordcloud (word clouds), and shifterator (word-shift differences)

1. Text word cloud visualization based on the wordcloud/stylecloud libraries
1. Word cloud visualization with pyecharts
Apache ECharts is a data visualization library open-sourced by Baidu; its strong interactivity and polished chart design have earned it wide recognition among developers. Pyecharts is the Python implementation of ECharts, and its WordCloud chart class handles the word cloud task. Much like the way Tableau generates word clouds, pyecharts expects the input to be already filtered and counted (word, frequency) pairs, but one of its strengths is that the resulting charts are interactive.
Address: https://github.com/pyecharts/pyecharts
1) Example

from pyecharts import options as opts
from pyecharts.charts import WordCloud

# pyecharts expects already-counted (word, frequency) pairs
words = [
    ("我爱你", 10000), ("美丽", 6181), ("天空", 4386),
    ("江西", 4055), ("任务", 2467), ("红色", 2244),
    ("大海", 1868), ("人民", 1281)
]

def wordcloud_base() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100])
        .set_global_opts(title_opts=opts.TitleOpts(title="案例"))
    )
    return c

wd = wordcloud_base()
wd.render("word_demo.html")  # writes an interactive HTML chart

2. wordcloud
wordcloud.WordCloud() creates a word cloud object for a piece of text. The cloud is drawn according to parameters such as the frequency of each word in the text; its shape, size, and colors can all be configured, and an arbitrary image can be supplied as a mask so that the resulting cloud takes on that shape (see the mask sketch after the basic example below).
Project address: https://github.com/amueller/word_cloud
1) Example

import wordcloud

text = ("An angry mob of anti-vaccine protesters stormed what they thought was BBC headquarters in London on Monday, "
        "only to discover that they had been grossly misinformed about its location, among other things. "
        "Videos show the protesters trying to push through a police line and charge the doors of the Television Centre "
        "in west London on Monday in a misguided attempt to disrupt BBC Television's operations. They were about eight "
        "years too late, as the U.K.'s public broadcaster moved out of that space in 2013.")

w = wordcloud.WordCloud(background_color="white")  # treat the word cloud as an object
w.generate(text)            # count word frequencies and lay out the cloud
w.to_file("wordcloud.png")  # save the rendered image
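The arbitrary-shape behaviour mentioned above comes from the mask parameter. The following is only a minimal sketch, assuming a hypothetical local silhouette image mask.png (white pixels are treated as background and the words fill the remaining shape):

import numpy as np
from PIL import Image
import wordcloud

# mask.png is a hypothetical local silhouette image; white pixels count as
# background, and the words are laid out inside the remaining shape
mask = np.array(Image.open("mask.png"))

text = "word cloud shape mask demo " * 30  # any text with repeated words works
w = wordcloud.WordCloud(background_color="white", mask=mask)
w.generate(text)
w.to_file("wordcloud_masked.png")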

3. stylecloud
stylecloud is built on top of the wordcloud library and is even easier to use. It supports icon shapes for the word cloud (a preset set of mask shapes) and takes care of color palettes, which neatly solves the usual masking and color-matching chores. stylecloud accepts lists of words or plain strings as input, can read a csv file directly (two columns: word and freq), and stays compatible with wordcloud's parameters (a sketch using raw-string input follows the csv example below).
Address: https://github.com/minimaxir/stylecloud
1) Example data
Here we follow the open-source example from [大邓和他的Python] ("Da Deng and his Python"): given a word-frequency csv file (高考.csv) with two columns, word and freq, stylecloud can read it directly.
2) Code implementation

import stylecloud

stopwords = open('data/stopwords.txt', encoding='utf-8').read().split('\n')

stylecloud.gen_stylecloud(file_path='高考.csv',                      # csv with word/freq columns
                          font_path='SourceHanSansCN-Regular.otf',   # a font that supports Chinese
                          output_name='高考.png',
                          icon_name='fas fa-user-graduate',          # Font Awesome icon used as the mask shape
                          size=500,
                          custom_stopwords=stopwords)
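Besides reading a csv, gen_stylecloud also accepts raw text through its text parameter and counts the words itself, which is the string-input mode mentioned in the overview. A minimal sketch, assuming a hypothetical English plain-text file data/reviews_en.txt (for Chinese text you would also pass a font that supports Chinese and segment the text first, e.g. with jieba):

import stylecloud

# data/reviews_en.txt is a hypothetical plain-text file of English reviews
raw_text = open('data/reviews_en.txt', encoding='utf-8').read()

stylecloud.gen_stylecloud(text=raw_text,             # raw string input; word counting is handled internally
                          icon_name='fas fa-cloud',  # Font Awesome icon used as the shape
                          output_name='reviews.png')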

2. Text graph visualization based on TextGrapher

How best to represent the semantics of an input text as a graph, in a concise and structured way, is a hard problem. This project tackles it as follows: given a document, it extracts key information using high-frequency words, keywords, named entity recognition, subject-predicate-object (SVO) phrase extraction, and similar methods, then organizes these pieces of information into a graph, producing a graph-based display of the article's semantic content.
This is an open-source project by Lao Liu (liuhuanyong):
Project address: https://github.com/liuhuanyong/TextGrapher
1) Code implementation

from text_grapher import *

handler = CrimeMining()  # the extraction pipeline class imported from text_grapher

# sample Chinese news report used as the input document
content = """
  5月7日20时许,昌平警方针对霍营街道某小区一足疗店存在卖淫嫖娼问题的线索,组织便衣警力前往开展侦查。
  21时14分,民警发现雷某(男,29岁,家住附近)从该足疗店离开,立即跟进,亮明身份对其盘查。雷某试图逃跑,在激烈反抗中咬伤民警,并将民警所持视频拍摄设备打落摔坏,后被控制带上车。行驶中,雷某突然挣脱看管,从车后座窜至前排副驾驶位置,踢踹驾驶员迫使停车,打开车门逃跑,被再次控制。因雷某激烈反抗,为防止其再次脱逃,民警依法给其戴上手铐,并于21时45分带上车。在将雷某带回审查途中,发现其身体不适,情况异常,民警立即将其就近送往昌平区中西医结合医院,22时5分进入急诊救治。雷某经抢救无效于22时55分死亡。
  当晚,民警在足疗店内将朱某(男,33岁,黑龙江省人)、俞某(女,38岁,安徽省人)、才某(女,26岁,青海省人)、刘某(女,36岁,四川省人)和张某(女,25岁,云南省人)等5名涉嫌违法犯罪人员抓获。经审查并依法提取、检验现场相关物证,证实雷某在足疗店内进行了嫖娼活动并支付200元嫖资。目前,上述人员已被昌平警方依法采取强制措施。
  为进一步查明雷某死亡原因,征得家属同意后,将依法委托第三方在检察机关监督下进行尸检。
  男子“涉嫌嫖娼死亡”,家属提多个疑点 要求公开执法记录视频
  5月7日晚,中国人民大学环境学院2009级硕士研究生雷洋离家后身亡,昌平警方通报称,警方查处足疗店过程中,将“涉嫌嫖娼”的雷某控制并带回审查,此间雷某突然身体不适经抢救无效身亡。
  面对雷洋的突然死亡,他的家人表示现在只看到了警方的一条官方微博,对于死因其中只有一句“该人突然身体不适”的简单描述,他们希望能够公布执法纪录仪视频,尽快还原真相。
  由雷洋的同学发布的一份情况说明称,5月7日,由于雷洋夫妇刚得一女,其亲属欲来京探望,航班预计当晚23点30分到达。当晚21时左右,雷洋从家里出门去首都机场迎接亲属,之后雷洋失联。(来源:央视、新京报)
"""
handler.main(content)  # extract the key information and generate the graph display

[Figure: graph generated by TextGrapher from the article's key information]
3. Computing word differences between texts with shifterator

Shifterator is a Python package for visualizing pairwise comparisons between texts via word shifts, a general method for extracting which words contribute to the difference between two texts. The results are visualized as word shift graphs: detailed, interpretable horizontal bar charts that break down each word's contribution to the shift. It can be used for direct text comparison, sentiment analysis, and similar scenarios.
Project address: https://shifterator.readthedocs.io/en/latest/
1. Basic overview
Specifically, the tool treats text as data and lays out the often complex question of how two texts are similar or different. It covers the common text comparison measures, including relative frequency, Shannon entropy, Tsallis entropy, Kullback-Leibler divergence, and Jensen-Shannon divergence (a small sketch of the corresponding shift classes follows the figure below).
[Figure: sentiment word shift graph comparing Johnson's and Bush's presidential speeches]

For example, the figure above shows a sentiment word shift graph comparing Lyndon B. Johnson's and Bush's presidential speeches, using the decade-adaptive SocialSent sentiment lexicon (one score per word per decade).
The word shift graph shows the 50 words that contribute most to the sentiment difference. Words on the left are those that make Bush's speeches more negative than Johnson's, words on the right pull in the positive direction, and the bars at the top show the overall sentiment difference and how much each type of word contributes to it.
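Each of the measures listed in the overview corresponds to a shift class exposed by the package; a minimal sketch with toy frequency dictionaries (the numbers are made up purely for illustration, and the class names ProportionShift and JSDivergenceShift follow the shifterator documentation):

from shifterator import ProportionShift, JSDivergenceShift

# toy frequency dictionaries; real ones would come from tokenized corpora
type2freq_1 = {"good": 30, "fast": 12, "late": 3, "cold": 1}
type2freq_2 = {"good": 8, "fast": 2, "late": 25, "cold": 10}

# relative-frequency comparison
ProportionShift(type2freq_1=type2freq_1,
                type2freq_2=type2freq_2).get_shift_graph(title='Proportion shift')

# symmetric divergence-based comparison
JSDivergenceShift(type2freq_1=type2freq_1,
                  type2freq_2=type2freq_2,
                  base=2).get_shift_graph(title='Jensen-Shannon divergence shift')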
2. Practical cases
Suppose we have takeaway (food delivery) review data with 0/1 labels, where 0 means a bad review and 1 means a good review, and we want to find out how word usage differs between the two groups. The first idea that comes to mind is a word cloud comparison: segment the text, remove stop words, and compare the two clouds. Shifterator gives us another perspective.
1) Given data

label,review
0,差评,11点14订餐,13点20饭才到,2个小时才把我的午饭送到,而且还是打了2次客服电话,1次投诉电话才给送来,要是不打电话都不知道几点能吃上午饭?
0,我让多加汁也没加,怎么吃啊?干了吧唧的
0,慢,多远的距离,那么长时间
0,盒子很精致,味道还好。整体不错
0,餐呢?就为了不半价,餐没送来就确认了?必须投诉
0,不管口味如何,送了两个多小时,不骂街我已经是素质好了
0,总是很晚送到
0,"不好吃,饭根本不像饭,难吃的咽不下去"
0,除了速度慢,其他都挺好。速度忒慢了,等了一个小时二十分钟。
1,南瓜粥一般般,卷饼不错
1,味道不错,送餐时间一个多小时
1,天天吃,会不会长胖
1,很好,也很快不错
1,"虽然送餐时间稍晚,但是味道好吃,没得说,态度也很好"
1,速度好快!
1,很准时,而且这次包装很认真,用券太划算了!
1,味道不错,就是量不大
1,肘子挺好吃~
1,经济实惠,又好吃,打包的还好

2) Implementation code

import pandas as pd
import collections
import jieba
import re 
from shifterator import EntropyShift
import matplotlib

reviews_df = pd.read_csv("data/WaiMai8k.csv", encoding='utf-8')
reviews_df.head()
texts_neg = reviews_df[reviews_df['label']==0]['review'].tolist()  # bad reviews (label 0)
texts_pos = reviews_df[reviews_df['label']==1]['review'].tolist()  # good reviews (label 1)

def clean_text(docs):
    """Segment the reviews with jieba, drop stop words, and return a word->frequency Counter."""
    stop_words = open('data/stopwords.txt', encoding='utf-8').read().split('\n')
    text = "".join(docs)
    text = "".join(re.findall("[\u4e00-\u9fa5]+", text))  # keep Chinese characters only
    words = jieba.lcut(text)
    words = [w for w in words if w not in stop_words]
    wordfreq_dict = collections.Counter(words)
    return wordfreq_dict

clean_texts_neg = clean_text(texts_neg)
clean_texts_pos = clean_text(texts_pos)

# use a font that can display the Chinese labels in the shift graph
matplotlib.rc("font", family='Arial Unicode MS')

entropy_shift = EntropyShift(type2freq_1=clean_texts_neg,
                             type2freq_2=clean_texts_pos,
                             base=2)
entropy_shift.get_shift_graph(title='外卖差评 vs 外卖好评')

From the resulting shift graph we can see that the words contributing most to bad takeaway reviews concern delivery time, followed by taste, while for good reviews taste contributes most, followed by delivery time.
Address: https://github.com/ryanjgallagher/shifterator
[Figure: word shift graph of 外卖差评 vs 外卖好评]
4. Dynamic data visualization based on Matplotlib (bar chart race)

1. Basic overview
Matplotlib is a very powerful Python plotting library that makes it easy to present large amounts of data intuitively as charts. It supports line charts, scatter plots, contour plots, bar charts, histograms, 3D graphics, and even animations.
Pyplot is a sub-module of Matplotlib that provides a MATLAB-like plotting API: a collection of plotting functions, each of which modifies the current figure, for example by adding annotations, creating a new figure, or creating a new plotting area within a figure.
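A minimal sketch of that "current figure" style of API (the data and file name are made up for illustration):

import matplotlib.pyplot as plt

# every pyplot call below acts on the same implicit "current" figure/axes
plt.plot([1, 2, 3, 4], [4, 1, 5, 3], marker='o')  # add a line to the current axes
plt.title("pyplot demo")                          # add a title to that axes
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("pyplot_demo.png")                    # write the current figure to disk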
Project address: https://matplotlib.org/stable/gallery/index.html
For data with a time dimension, producing a dynamic view of how the values evolve is a common scenario, for example comparing airport throughput across countries over the past few decades, rankings of film and television stars, or import/export and GDP figures of different countries.
2. Practical case
Pratapvardhan provides a simple implementation: given data containing city populations for different years, how do we build the animated chart with matplotlib?
1) Given data
The data is loaded directly from a public gist (the url in the code below) and contains four columns: name (city), group (region), year, and value (population in thousands).
2) Implementation code
matplotlib's animation module provides a simple way to turn the per-year bar chart into an animation.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
from IPython.display import HTML
url = 'https://gist.githubusercontent.com/johnburnmurdoch/4199dbe55095c3e13de8d5b2e5e5307a/raw/fa018b25c24b7b5f47fd0568937ff6c04e384786/city_populations'
df = pd.read_csv(url, usecols=['name', 'group', 'year', 'value'])
df.head(3)
colors = dict(zip(
    ["India", "Europe", "Asia", "Latin America", "Middle East", "North America", "Africa"],
    ["#adb0ff", "#ffb3ff", "#90d595", "#e48381", "#aafbff", "#f7bb5f", "#eafb50"]
))
group_lk = df.set_index('name')['group'].to_dict()

"""Run below cell `draw_barchart(2018)` draws barchart for `year=2018`"""

fig, ax = plt.subplots(figsize=(15, 8))

def draw_barchart(current_year):
    # keep the 10 most populous cities for the given year
    dff = df[df['year'].eq(current_year)].sort_values(by='value', ascending=True).tail(10)
    ax.clear()
    ax.barh(dff['name'], dff['value'], color=[colors[group_lk[x]] for x in dff['name']])
    dx = dff['value'].max() / 200
    # annotate each bar with the city name, its group, and the population value
    for i, (value, name) in enumerate(zip(dff['value'], dff['name'])):
        ax.text(value-dx, i,     name,           size=14, weight=600, ha='right', va='bottom')
        ax.text(value-dx, i-.25, group_lk[name], size=10, color='#444444', ha='right', va='baseline')
        ax.text(value+dx, i,     f'{value:,.0f}',  size=14, ha='left',  va='center')
    # large year label in the lower right of the axes
    ax.text(1, 0.4, current_year, transform=ax.transAxes, color='#777777', size=46, ha='right', weight=800)
    ax.text(0, 1.06, 'Population (thousands)', transform=ax.transAxes, size=12, color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=12)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='major', axis='x', linestyle='-')
    ax.set_axisbelow(True)
    ax.text(0, 1.15, 'The most populous cities in the world from 1500 to 2018',
            transform=ax.transAxes, size=24, weight=600, ha='left', va='top')
    ax.text(1, 0, 'by @pratapvardhan; credit @jburnmurdoch', transform=ax.transAxes, color='#777777', ha='right',
            bbox=dict(facecolor='white', alpha=0.8, edgecolor='white'))
    plt.box(False)
draw_barchart(2018)
fig, ax = plt.subplots(figsize=(15, 8))  # fresh figure for the animation
animator = animation.FuncAnimation(fig, draw_barchart, frames=range(1900, 2019))  # one frame per year
HTML(animator.to_jshtml())
# or use animator.to_html5_video() or animator.save()
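To export the animation instead of displaying it inline, FuncAnimation.save can write it to a video file; a minimal sketch (the file name is illustrative, and it assumes ffmpeg is installed on the system):

# writes the bar chart race to an mp4 using the ffmpeg writer
animator.save('population_race.mp4', writer='ffmpeg', fps=10, dpi=144)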

Address: https://colab.research.google.com/github/pratapvardhan/notebooks/blob/master/barchart-race-matplotlib.ipynb#scrollTo=FwlbfAoVzzXd


Origin: blog.csdn.net/weixin_36378508/article/details/127847100