Fetching and Simply Processing Bilibili Video Danmaku with Python

Bilibili danmaku (bullet comments)

Overview

I saw people online visualizing the danmaku (bullet comment) data of Bilibili uploaders, which looked interesting, so I tried it myself (without the visualization part). Bless Wuhan, bless China!

Project realization

(1) Obtaining the danmaku data. I originally planned to fetch the danmaku of every video from a given uploader, but ran into a problem: the danmaku data is stored as JSON or XML and is identified by an id number. I assumed this id was the video's av number, but inspecting the page elements showed that it is not. After consulting some material online, I found that the danmaku of a video is served from a dedicated URL, so only the content at that URL is needed.

(2) Fetching the data. The requests module commonly used for crawlers is enough.

(3) Parsing the data

(4) Data persistence and simple data processing

Initialization

First, initialize the basic parameters used below.


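# Imports used by the methods in this post
import re
import time

import requests
import xlwt
from lxml import etree

def __init__(self):
    self.headers = {
        # Add your own request headers here
    }
    self.base_url = 'https://api.bilibili.com/x/v1/dm/list.so?oid={}'
    self.url = ''
    self.barrage_result = []
    self.danmu = []
    self.danmu_count = []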

Set your own request headers and the base URL template, then initialize the few attributes used by the later methods.
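
The `run` method later in this post calls `self.get_url(i)`, which is never shown. A minimal sketch of what it presumably does, assuming it only fills the oid into the URL template; the standalone function name `build_danmaku_url` is hypothetical and not from the original code:

```python
# URL template copied from the post; {} is filled with the oid.
BASE_URL = 'https://api.bilibili.com/x/v1/dm/list.so?oid={}'


def build_danmaku_url(oid, base_url=BASE_URL):
    """Return the danmaku URL for one oid (hypothetical helper)."""
    return base_url.format(oid)


print(build_danmaku_url(145134329))
```

Inside the crawler class, the equivalent `get_url` would likely just assign `self.url = self.base_url.format(oid)`.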

Fetching the page

Fetch the danmaku data and save it to a local file.


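# Fetch the danmaku data
def get_page(self):
    # Pause briefly to avoid crawling too fast
    time.sleep(0.5)
    response = requests.get(self.url, headers=self.headers)
    with open('bilibili.xml', 'wb') as f:
        f.write(response.content)
    # Return the raw bytes so the caller can tell whether anything was fetched
    return response.content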

Parsing the data

Parse the saved file and store every danmaku in the danmu list.

# Parse the saved file and store every danmaku in the danmu list
def param_page(self):
    time.sleep(1)
    if self.barrage_result:
        # File path and HTML parser
        html = etree.parse('bilibili.xml', etree.HTMLParser())
        # XPath: all text content under every <d> tag
        results = html.xpath('//d//text()')
        # Store every danmaku; deduplication happens later in ana_result
        for one in results:
            self.danmu.append(one)
        print('Total danmaku collected so far:', len(self.danmu))
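
To make the parsing step concrete without a network request, here is a small sketch that extracts the text of every `<d>` element from a made-up sample. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages; the sample document, its placeholder attribute values, and the function name `extract_danmaku` are all assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape the XPath '//d//text()' implies:
# one <d> element per danmaku. Attribute values are placeholders.
SAMPLE_XML = """<i>
  <d p="placeholder">武汉加油</d>
  <d p="placeholder">hello</d>
  <d p="placeholder">武汉加油</d>
</i>"""


def extract_danmaku(xml_text):
    """Return the text of every <d> element, in document order."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter('d') if d.text]


print(extract_danmaku(SAMPLE_XML))
```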

Data persistence

Do some simple processing on the collected data and persist it. Because other operations follow later, I save it in two formats: a text file and an Excel workbook.


# Simple processing of the collected danmaku
def ana_result(self):
    print('Processing danmaku...')
    # Deduplicate while keeping first-seen order
    for one_danmu in self.danmu:
        if one_danmu not in self.danmu_count:
            self.danmu_count.append(one_danmu)
    print('Number of distinct danmaku:', len(self.danmu_count))
    # Format 1: a text file with one "text:count" line per distinct danmaku
    with open('tanmu.txt', 'w', encoding='utf-8') as f:
        for danmu in self.danmu_count:
            # Count the occurrences
            amount = self.danmu.count(danmu)
            f.write(danmu + ':' + str(amount) + '\n')
    # Format 2: an Excel sheet; the headers mean "danmaku text" and "occurrences"
    book = xlwt.Workbook(encoding='utf-8-sig', style_compression=0)
    sheet = book.add_sheet('B站部分视频弹幕', cell_overwrite_ok=True)
    sheet.write(0, 0, '弹幕内容')
    sheet.write(0, 1, '弹幕出现次数')
    n = 1
    for danmu in self.danmu_count:
        amount = self.danmu.count(danmu)
        sheet.write(n, 0, danmu)
        sheet.write(n, 1, amount)
        n = n + 1
    book.save(u'B站部分视频弹幕.xls')
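
Calling `list.count` once per distinct danmaku rescans the whole list each time, which gets slow as the list grows. As an alternative sketch (not the author's method), `collections.Counter` produces the same counts in a single pass and keeps first-seen order; the function names and the sample data are made up for illustration:

```python
from collections import Counter


def count_danmaku(danmu):
    """Count how often each danmaku text occurs, in first-seen order."""
    return Counter(danmu)


def format_lines(counts):
    """Render 'text:amount' lines like those written to tanmu.txt."""
    return [f'{text}:{amount}' for text, amount in counts.items()]


sample = ['武汉加油', 'hello', '武汉加油']  # made-up sample data
counts = count_danmaku(sample)
print(len(counts))
print(format_lines(counts))
```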

Normally the crawler would end once the data is persisted, but for later use I also added a function that counts the danmaku containing a given keyword.


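# Count the danmaku that contain a given keyword
def key_count(self, key):
    value = key
    # Escape the keyword so regex metacharacters in it are matched literally
    pattern = '.*' + re.escape(value) + '.*'
    tempList = []
    for one in self.danmu:
        obj = re.findall(pattern, one)
        if len(obj) > 0:
            tempList.extend(obj)
    print('Number of danmaku containing', key, ':', len(tempList))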
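
For single-line danmaku, a plain substring test gives the same count without building a regular expression. A minimal sketch under that assumption; `count_keyword` and the sample list are hypothetical:

```python
def count_keyword(danmu, key):
    """Return how many danmaku contain key as a literal substring."""
    return sum(1 for text in danmu if key in text)


sample = ['武汉加油', '加油', '武汉', '武汉加油!']  # made-up sample data
print(count_keyword(sample, '武汉加油'))
print(count_keyword(sample, '武汉'))
```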

The following method drives the whole crawler.


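def run(self):
    # Try every oid in the range, one request per oid
    for i in range(145134329, 145144329):
        # get_url is not shown in this post; presumably it builds self.url from base_url and the oid
        self.get_url(i)
        self.barrage_result = self.get_page()
        self.param_page()
        self.key_count('武汉加油')
        self.key_count('武汉')
    # Deduplicate, count, and persist once all oids have been processed
    self.ana_result()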

If you have a list of known oids, iterate over that list instead of the range to fetch the danmaku. If you have no such list, you can scan a numeric range as I did above.
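
A sketch of the list-based variant, assuming the fetching step is passed in as a callable so the loop can be tried without network access; `crawl_oids`, `fetch`, and the fake fetcher are hypothetical names, not part of the original crawler:

```python
import time


def crawl_oids(oids, fetch, delay=0.5):
    """Call fetch(oid) for each known oid, pausing between requests."""
    results = {}
    for oid in oids:
        results[oid] = fetch(oid)
        # Pause so the requests are not sent too quickly
        time.sleep(delay)
    return results


def fake_fetch(oid):
    """Stand-in for a real download; returns a placeholder string."""
    return 'xml-for-{}'.format(oid)


demo = crawl_oids([145134329, 145134330], fake_fetch, delay=0)
print(demo)
```

In real use, `fetch` would download and return the danmaku data for one oid, and `delay` would stay above zero.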

The runtime output is as follows.
[Figure: screenshot of the crawler's runtime output]

Origin blog.51cto.com/15069472/2577358