Damn it! It turned out to be as simple as crawling the barrage of station B

The official account backstage replied "Book" to learn more about the content of the new book of the account owner

 Author: Ye Tingyun, https: //blog.csdn.net/fyfugoyfa

  • 1. Analyze the webpage

  • 2. Obtain barrage data

  • Three, draw a word cloud diagram

Video link: https://www.bilibili.com/video/BV1zE411Y7JY

1. Analyze the webpage

Click on the barrage list to view the historical barrage, and select any day's historical barrage. At this time, you can find the ajax data package that stores the barrage of that date. All the barrage data is placed in an i tag.

Looking at the relevant information of the request, you can find that the key to the Request URL is the  oid  and  date  parameters. The date is obviously the date. Changing the date can achieve page flipping and crawling the barrage. The oid should be something like a video logo. If you change the oid, you can access others. Video barrage page.

Insert picture description here

2. Obtain barrage data

This article crawls the historical barrage data of the video from January 1 to August 6, and needs to construct a time series:

import pandas as pd

start = '20200101'
end = '20200806'
# 生成时间序列
date_list = [x for x in pd.date_range(start, end).strftime('%Y-%m-%d')]
print(date_list)

The results are as follows:

['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', ... '2020-08-06']

Process finished with exit code 0

The crawler code is as follows:

# -*- coding: UTF-8 -*-
"""
@File    :spider.py
@Author  :叶庭云
@CSDN    :https://yetingyun.blog.csdn.net/
"""

import requests
import pandas as pd
import re
import time
import random
from concurrent.futures import ThreadPoolExecutor
import datetime

user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
start_time = datetime.datetime.now()


def  Grab_barrage(date):
    # 伪装请求头
    headers = {
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-site",
        "origin": "https://www.bilibili.com",
        "referer": "https://www.bilibili.com/video/BV1Z5411Y7or?from=search&seid=8575656932289970537",
        "cookie": "_uuid=0EBFC9C8-19C3-66CC-4C2B-6A5D8003261093748infoc; buvid3=4169BA78-DEBD-44E2-9780-B790212CCE76155837infoc; sid=ae7q4ujj; DedeUserID=501048197; DedeUserID__ckMd5=1d04317f8f8f1021; SESSDATA=e05321c1%2C1607514515%2C52633*61; bili_jct=98edef7bf9e5f2af6fb39b7f5140474a; CURRENT_FNVAL=16; rpdid=|(JJmlY|YukR0J'ulmumY~u~m; LIVE_BUVID=AUTO4315952457375679; CURRENT_QUALITY=80; bp_video_offset_501048197=417696779406748720; bp_t_offset_501048197=417696779406748720; PVID=2",
        "user-agent": random.choice(user_agent),
    }
    # 构造url访问   需要用到的参数
    params = {
        'type': 1,
        'oid': '128777652',
        'date': date
    }
    # 发送请求  获取响应
    response = requests.get(url, params=params, headers=headers)
    # print(response.encoding)   重新设置编码
    response.encoding = 'utf-8'
    # print(response.text)
    # 正则匹配提取数据
    comment = re.findall('<d p=".*?">(.*?)</d>', response.text)
    # 将每条弹幕数据写入txt
    with open('barrages.txt', 'a+') as f:
        for con in comment:
            f.write(con + '\n')
    time.sleep(random.randint(1, 3))   # 休眠


def main():
    # 开多线程爬取   提高爬取效率
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(Grab_barrage, date_list)
    # 计算所用时间
    delta = (datetime.datetime.now() - start_time).total_seconds()
    print(f'用时:{delta}s')


if __name__ == '__main__':
    # 目标url
    url = "https://api.bilibili.com/x/v2/dm/history"
    start = '20200101'
    end = '20200806'
    # 生成时间序列
    date_list = [x for x in pd.date_range(start, end).strftime('%Y-%m-%d')]
    count = 0
    # 调用主函数
    main()

The program runs, and the barrage data is successfully crawled and saved to txt.

用时:32.040222s

Process finished with exit code 0

Three, draw a word cloud diagram

1. Read the barrage data in txt

with open('barrages.txt') as f:
    data = f.readlines()
    print(f'弹幕数据:{len(data)}条')

The results are as follows:

弹幕数据:52708条

Process finished with exit code 0

2. Pyecharts draws word cloud

import jieba
import collections
import re
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
from pyecharts import options as opts
from pyecharts.globals import ThemeType, CurrentConfig

CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

with open('barrages.txt') as f:
    data = f.read()

# 文本预处理  去除一些无用的字符   只提取出中文出来
new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)  # 只要字符串中的中文
new_data = " ".join(new_data)

# 文本分词--精确模式分词
seg_list_exact = jieba.cut(new_data, cut_all=True)

result_list = []
with open('stop_words.txt', encoding='utf-8') as f:
    con = f.readlines()
    stop_words = set()
    for i in con:
        i = i.replace("\n", "")   # 去掉读取每一行数据的\n
        stop_words.add(i)

for word in seg_list_exact:
    # 设置停用词并去除单个词
    if word not in stop_words and len(word) > 1:
        result_list.append(word)
print(result_list)

# 筛选后统计
word_counts = collections.Counter(result_list)
# 获取前100最高频的词
word_counts_top100 = word_counts.most_common(100)
# 可以打印出来看看统计的词频
print(word_counts_top100)

word1 = WordCloud(init_opts=opts.InitOpts(width='1350px', height='750px', theme=ThemeType.MACARONS))
word1.add('词频', data_pair=word_counts_top100,
          word_size_range=[15, 108], textstyle_opts=opts.TextStyleOpts(font_family='cursive'),
          shape=SymbolType.DIAMOND)
word1.set_global_opts(title_opts=opts.TitleOpts('弹幕词云图'),
                      toolbox_opts=opts.ToolboxOpts(is_show=True, orient='vertical'),
                      tooltip_opts=opts.TooltipOpts(is_show=True, background_color='red', border_color='yellow'))
# 渲染在html页面上
word1.render("弹幕词云图.html")

The operation effect is as follows:

3. stylecloud draw word cloud

# -*- coding: UTF-8 -*-
"""
@File    :stylecloud_词云图.py
@Author  :叶庭云
@CSDN    :https://yetingyun.blog.csdn.net/
"""
from stylecloud import gen_stylecloud
import jieba
import re


# 读取数据
with open('barrages.txt') as f:
    data = f.read()

# 文本预处理  去除一些无用的字符   只提取出中文出来
new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)
new_data = " ".join(new_data)

# 文本分词
seg_list_exact = jieba.cut(new_data, cut_all=False)

result_list = []
with open('stop_words.txt', encoding='utf-8') as f:
    con = f.readlines()
    stop_words = set()
    for i in con:
        i = i.replace("\n", "")   # 去掉读取每一行数据的\n
        stop_words.add(i)

for word in seg_list_exact:
    # 设置停用词并去除单个词
    if word not in stop_words and len(word) > 1:
        result_list.append(word)
print(result_list)

# stylecloud绘制词云
gen_stylecloud(
    text=' '.join(result_list),    # 输入文本
    size=600,                      # 词云图大小
    collocations=False,            # 词语搭配
    font_path=r'C:\Windows\Fonts\msyh.ttc',   # 字体
    output_name='词云图.png',                 # stylecloud 的输出文本名
    icon_name='fas fa-apple-alt',             # 蒙版图片
    palette='cartocolors.qualitative.Bold_5'  # palettable调色方案
)

The operation effect is as follows:

◆ ◆ ◆  ◆ ◆
麟哥新书已经在京东上架了,我写了本书:《拿下Offer-数据分析师求职面试指南》,目前京东正在举行100-40活动,大家可以用相当于原价5折的预购价格购买,还是非常划算的:

数据森麟公众号的交流群已经建立,许多小伙伴已经加入其中,感谢大家的支持。大家可以在群里交流关于数据分析&数据挖掘的相关内容,还没有加入的小伙伴可以扫描下方管理员二维码,进群前一定要关注公众号奥,关注后让管理员帮忙拉进群,期待大家的加入。

管理员二维码:


猜你喜欢

● 麟哥拼了!!!亲自出镜推荐自己新书《数据分析师求职面试指南》● 厉害了!麟哥新书登顶京东销量排行榜!● 笑死人不偿命的知乎沙雕问题排行榜
● 用Python扒出B站那些“惊为天人”的阿婆主!● 你相信逛B站也能学编程吗

Guess you like

Origin blog.csdn.net/weixin_38753213/article/details/109554851