"Garbage" how you say? B Analysis in Python station barrage
Table of Contents
0 Preface
1 Environment
2 Requirements Analysis
3 Code Implementation
4 Postscript
0 Preface
Is a wet paper towel dry garbage, and does it become wet garbage again once soaked? Is melon rind wet garbage or not? Haven't we all been tormented by garbage sorting lately, unable to tell the categories apart? Since 2019-07-01, Shanghai has taken the lead with a mandatory waste-sorting regime, and breaking the rules means a fine.
To avoid heavy losses, I decided to brush up my garbage-sorting skills on Bilibili. Why Bilibili? Because I hear it is one of the most popular ways for young people to learn these days.
I opened Bilibili, searched for "garbage", and was immediately hooked by this clickbait title: "The correct posture for losing face in Shanghai."
Of course, the "losing face" here isn't real shame; in Chinese the same word also means throwing out the trash.
Clicking in, I found it was actually a manzai-style comic dialogue, performed by two cute AI virtual-idol sisters, and my interest was instantly piqued: the act explains how garbage sorting works.
After watching it over and over, I simply couldn't stop; brainwash mode had been switched on. The video is fun, and after all, a video with danmaku (bullet comments) is even more fun!
And fun shared beats fun alone, so why not use Python to save the danmaku and turn it into a word cloud? Happily decided!
1 environment
Operating System: Windows
Python Version: 3.7.3
2 Requirements Analysis
First, open the browser's developer tools (F12) and find the cid of this video in the network requests.
Once we have the cid, fill it into the following link:
http://comment.bilibili.com/{cid}.xml
Opening that URL shows the full danmaku list for the video.
With the danmaku data in hand, we first need to parse it and save it locally, so that it can be processed further, for example into a word cloud.
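For reference, each danmaku in that XML is a `<d>` element whose `p` attribute packs several comma-separated fields; to my understanding these include the appearance time in seconds, the display mode, font size, color, and send timestamp, among others. A minimal sketch of pulling a sample element apart (the sample string below is hand-written for illustration, not live data):

```python
from bs4 import BeautifulSoup

# a hand-written sample <d> element in the shape Bilibili's danmaku XML uses
sample = '<i><d p="12.5,1,25,16777215,1562000000,0,abcd1234,12345678">干垃圾!</d></i>'

soup = BeautifulSoup(sample, 'lxml')
d = soup.find('d')

# split the comma-separated fields of the p attribute
fields = d['p'].split(',')
appear_time = float(fields[0])   # seconds into the video when the comment appears
color = int(fields[3])           # text color as a decimal RGB value

print(d.text, appear_time, color)
```

This is why the scraping code below only needs `find_all('d')` and `.text` to collect the comment texts.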
3 Code Implementation
Here we use the requests module to fetch the page, the beautifulsoup4 module (with lxml) to parse it, and the pandas module to save the data as CSV. These are all third-party modules; if they are not in your environment, install them with pip:
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
Once the modules are installed, import them:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Request, parse, and save the danmaku data:
# request the danmaku data
url = 'http://comment.bilibili.com/99768393.xml'
html = requests.get(url).content
# parse the danmaku data
html_data = str(html, 'utf-8')
bs4 = BeautifulSoup(html_data, 'lxml')
results = bs4.find_all('d')
comments = [comment.text for comment in results]
comments_dict = {'comments': comments}
# save the danmaku data locally
br = pd.DataFrame(comments_dict)
br.to_csv('barrage.csv', encoding='utf-8')
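One detail worth noting (my addition, shown here with stand-in comments rather than live danmaku): `to_csv` writes the DataFrame index as an extra first column, which is why the later reading code takes column 1 rather than column 0. A small round-trip sketch:

```python
import pandas as pd

# stand-in comments in place of the real scraped danmaku
comments_dict = {'comments': ['干垃圾', '湿垃圾', '可回收物']}
br = pd.DataFrame(comments_dict)
br.to_csv('barrage.csv', encoding='utf-8')

# read it back with header=None: column 0 is the index that
# to_csv wrote, column 1 holds the comment text
check = pd.read_csv('barrage.csv', header=None)
print(check.head())
```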
Next, let's process the saved danmaku data further.
To make the word cloud, we need the wordcloud, matplotlib, and jieba modules. They are also third-party modules, so install them directly with pip:
pip install wordcloud
pip install matplotlib
pip install jieba
Once the modules are installed, import them. Since we also use the pandas module to read the file, import it here as well:
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import jieba
We can pick an image of our own and generate a customized word cloud shaped by it, with some custom styling. The code is as follows:
# load the background (mask) image
mask_img = plt.imread('Bulb.jpg')
'''configure the word cloud style'''
wc = WordCloud(
    # font to use (required for Chinese characters)
    font_path='SIMYOU.TTF',
    # maximum number of words
    max_words=2000,
    # maximum font size
    max_font_size=80,
    # background image that shapes the cloud
    mask=mask_img,
    # transparent background for the output image
    background_color=None, mode="RGBA",
    # number of random states, i.e. how many color schemes
    random_state=30)
Next, read the text (the danmaku data), segment it into words, and join them with spaces:
# read the contents of the file
br = pd.read_csv('barrage.csv', header=None)
# segment each comment and join the words with spaces
text = ''
for line in br[1]:
    text += ' '.join(jieba.cut(line, cut_all=False))
Finally, let's take a look at the result:
Seeing everyone's irrepressible enthusiasm for the topic of waste sorting, an inexplicable sense of amusement rises in my heart.
4 Postscript
The manzai act by the two cute AI sisters really is quite good; I wonder what Guo Degang would make of this piece. Back to the topic of garbage: the "Shanghai Municipal Domestic Waste Management Regulations" are now officially in force. Friends outside Shanghai, don't celebrate too soon; the Ministry of Housing has said that 46 other key cities nationwide will soon get their turn... hahaha, interesting!