Python 3 web crawler: crawling Bilibili video danmaku from the new seg.so interface (source code attached)


This blog post records the main ideas and the complete code for crawling danmaku (barrage comments) from Xiaopozhan, the nickname for Bilibili.

One. Preface


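In 2023, Xiaopozhan changed the response of its danmaku interface from an .xml document to a binary seg.so file. Despite the suffix, this is not a shared library; .so is simply part of the endpoint name.
For example, take the following address: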

https://api.bilibili.com/x/v2/dm/wbi/web/seg.so?type=1&oid=1258114431&pid=575703555&segment_index=1&pull_mode=1&ps=0&pe=120000&web_location=1315873&w_rid=fec78ad870a48b68b35024304ba8460f&wts=1694223505

The response body is binary. Opened as text, most of it is unreadable, so at first glance some of the data appears to be encrypted. In fact it is binary-encoded rather than encrypted.
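
To see what the example address carries, its query string can be split into individual parameters. The sketch below uses only the standard library. Judging by the script later in this post, oid holds the cid and pid holds the avid; reading w_rid and wts as a request signature and a timestamp is my own assumption, not something the post confirms.

```python
from urllib.parse import urlsplit, parse_qs

EXAMPLE_URL = (
    "https://api.bilibili.com/x/v2/dm/wbi/web/seg.so"
    "?type=1&oid=1258114431&pid=575703555&segment_index=1&pull_mode=1"
    "&ps=0&pe=120000&web_location=1315873"
    "&w_rid=fec78ad870a48b68b35024304ba8460f&wts=1694223505"
)


def split_query(url: str) -> dict[str, str]:
    """Return the query parameters of `url` as a flat name -> value dict."""
    query = urlsplit(url).query
    # parse_qs yields lists; each parameter appears once here, so keep the first value.
    return {name: values[0] for name, values in parse_qs(query).items()}


params = split_query(EXAMPLE_URL)
for name, value in params.items():
    print(f"{name:>15} = {value}")
```

Listing the parameters this way makes it easier to compare the browser's request with the much shorter parameter set used by the script in section Four.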

Two. Configure the Protobuf environment and generate the compiled file

1. Configure Protobuf environment

Searching showed that this format is called **Protobuf**, a binary encoding used for data transmission.

Protobuf (Protocol Buffers) is a lightweight data serialization format developed by Google. It serializes and deserializes structured data and is commonly used for network communication, data storage and configuration files.
Data structures are defined in a concise schema language (.proto files), and the protoc compiler generates code that reads and writes the data in a given programming language. Compared with text formats such as XML or JSON, Protobuf messages are usually smaller and faster to parse.
A schema specifies the type, name and number of every field in a message. Because Protobuf supports many programming languages, including C++, Java and Python, the same data can be exchanged between different platforms and languages.
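
To make "binary encoding" concrete, here is a minimal sketch of how a single Protobuf field is laid out on the wire: a key that packs the field number and wire type, followed by a base-128 varint value. The helper names read_varint and read_field_header are my own, and the three-byte message is the standard example from the Protobuf encoding guide, not data from Bilibili.

```python
def read_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one base-128 varint starting at `pos`; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry the payload
        if not byte & 0x80:              # high bit clear: this was the last byte
            return value, pos
        shift += 7


def read_field_header(data: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Decode a field key; return (field_number, wire_type, next_pos)."""
    key, pos = read_varint(data, pos)
    return key >> 3, key & 0x07, pos


# Field 1, wire type 0 (varint), value 150.
message = bytes([0x08, 0x96, 0x01])
field_number, wire_type, pos = read_field_header(message)
value, pos = read_varint(message, pos)
print(field_number, wire_type, value)  # prints: 1 0 150
```

Nothing in the bytes names the field or its type, which is why the schema file (dm.proto below) is needed before the danmaku response can be read.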

Simply put, it is a data format that is more compact than XML.
To decode data in this format, we need the Protobuf compiler, protoc. My computer runs 64-bit Windows, so I downloaded the win64 build:

https://github.com/protocolbuffers/protobuf/releases/tag/v3.17.3

After the download completes, unzip it. The bin directory holds the executable, so add it to the PATH environment variable. On Windows 10 the steps are: right-click "This PC" > Advanced system settings > Environment Variables > double-click Path > New > enter the path of the bin directory > OK.


Enter protoc in cmd to verify the configuration. If the console responds with protoc's own output instead of a "not recognized" error, the environment is configured correctly.
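
The same check can be done from Python, which is handy if the script is meant to run on other machines. This is a minimal sketch using only the standard library; the function name protoc_version is my own, and it relies on the --version flag of protoc.

```python
import shutil
import subprocess
from typing import Optional


def protoc_version(executable: str = "protoc") -> Optional[str]:
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    completed = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return completed.stdout.strip() or None


version = protoc_version()
print(version if version else "protoc was not found on PATH")
```

If this reports that protoc was not found, open a new console first: an environment variable change does not reach consoles that were already open.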

2. Generate compiled files

First you need to download dm.proto

https://github.com/SocialSisterYi/bilibili-API-collect/blob/master/grpc_api/bilibili/community/service/dm/v1/dm.proto

Then run the following in the console, from the directory that contains dm.proto:

protoc --python_out=. dm.proto


A dm_pb2.py file is generated in the same directory.
This file is essential: the scripts below import it to decode the responses.
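
One pitfall worth checking at this point: as far as I know, version 4 and later of the protobuf Python package reject code generated by protoc older than 3.19, failing with "Descriptors cannot be created directly". The sketch below encodes that rule as a plain version comparison. The function names are my own, and the version strings are illustrative inputs; in practice the first comes from protoc --version and the second from google.protobuf.__version__.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number from text such as 'libprotoc 3.17.3'."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def generated_code_is_outdated(protoc_text: str, runtime_text: str) -> bool:
    """True when a 4.x-or-later Python runtime meets code from protoc older than 3.19."""
    protoc = parse_version(protoc_text)
    runtime = parse_version(runtime_text)
    return runtime[0] >= 4 and protoc < (3, 19)


print(generated_code_is_outdated("libprotoc 3.17.3", "4.24.0"))  # True
print(generated_code_is_outdated("libprotoc 3.17.3", "3.20.3"))  # False
print(generated_code_is_outdated("libprotoc 24.3", "4.24.3"))    # False
```

If the check comes back True, the usual remedies are to regenerate dm_pb2.py with a newer protoc or to install a 3.20.x release of the protobuf package.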

Three. Parse danmaku

Place the dm_pb2.py file generated in the previous step in the same directory as your script. The code below demonstrates parsing a local .so file.

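import dm_pb2
from google.protobuf import text_format

# DmSegMobileReply is the top-level message for one danmaku segment
my_seg = dm_pb2.DmSegMobileReply()
with open('./seg.so', 'rb') as f:
    data = f.read()
# Decode the raw Protobuf bytes into the message object
my_seg.ParseFromString(data)

# Print the first danmaku element in human-readable text format
parse_data = text_format.MessageToString(my_seg.elems[0], as_utf8=True)
print(parse_data)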

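The output is the first danmaku element in Protobuf text format, one field per line, with fields such as progress, mode, fontsize, color, midHash, content, ctime, weight and idStr.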
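
The numeric fields are easier to read once converted. The sketch below assumes, based on my understanding of the community API notes rather than anything stated in this post, that progress is an offset into the video in milliseconds, color is a decimal RGB value and ctime is a Unix timestamp in seconds. The helper names are my own.

```python
from datetime import datetime, timezone


def format_progress(progress_ms: int) -> str:
    """Render a video offset in milliseconds as MM:SS.mmm."""
    minutes, remainder = divmod(progress_ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def color_to_hex(color: int) -> str:
    """Render a decimal RGB value as #rrggbb."""
    return f"#{color & 0xFFFFFF:06x}"


def ctime_to_utc(ctime: int) -> str:
    """Render a Unix timestamp in seconds as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ctime, tz=timezone.utc).isoformat()


print(format_progress(83_500))    # 01:23.500
print(color_to_hex(16_777_215))   # #ffffff
print(ctime_to_utc(0))            # 1970-01-01T00:00:00+00:00
```

If the units turn out to differ for some field, only the corresponding helper needs to change.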

Four. Parse danmaku automatically

Here is a class that fetches and parses danmaku automatically: just supply the BVID of the video.

import json
import re

import requests
import google.protobuf.text_format as text_format
import dm_pb2 as Danmaku


class BEngine():
    """
    Bilibili engine
    """

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}

    def do_request(self, url):
        """
        GET a URL and return the response text, or False on a non-200 status
        :param url:
        :return:
        """
        r = requests.get(url, headers=self.headers, timeout=10)
        if r.status_code == 200:
            r.encoding = 'utf-8'
            return r.text
        else:
            return False

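    def get_video_cid(self, bvid):
        """
        Get the cid for a bvid
        :param bvid:
        :return: the cid, or False if the lookup fails
        """
        api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bvid}'
        try:
            html = self.do_request(api_url)
            if html:
                _json = json.loads(html)
                cid = _json['data'].get('cid')
                return cid
            else:
                return False
        except (requests.RequestException, ValueError, KeyError, AttributeError, TypeError):
            # Network failure, invalid JSON, or a response without a usable "data" object
            return False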

    def bvid_to_avid(self, bvid):
        """
        Get the avid from a bvid
        :param bvid:
        :return:
        """
        table = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'
        # Map each character of the base-58 alphabet to its index
        tr = {}
        for i in range(58):
            tr[table[i]] = i
        s = [11, 10, 3, 8, 4, 6]  # positions in the bvid, least significant digit first
        xor = 177451812
        add = 8728348608

        def dec(x):
            r = 0
            for i in range(6):
                r += tr[x[s[i]]] * 58 ** i
            return (r - add) ^ xor

        return dec(bvid)

    def get_danmu(self, avid, cid):
        """
        Fetch the .so file and return the decoded danmaku list
        :return:
        """
        result = []
        url = 'https://api.bilibili.com/x/v2/dm/web/seg.so'
        params = {
            'type': 1,  # danmaku type
            'oid': cid,  # cid
            'pid': avid,  # avid
            'segment_index': 1  # danmaku segment
        }
        resp = requests.get(url, params=params, headers=self.headers)
        data = resp.content
        danmaku_seg = Danmaku.DmSegMobileReply()
        danmaku_seg.ParseFromString(data)
        for j in danmaku_seg.elems:
            # Flatten the multi-line text format into one comma-separated line
            parse_data = text_format.MessageToString(j, as_utf8=True)
            result.append(parse_data.replace("\n", ",").rstrip(","))
        return result

    def parse_danmu(self, danmu_list):
        """
        Parse the fields out of each danmaku line
        :param danmu_list:
        :return:
        """
        result = []
        for each_dm in danmu_list:
            # Raw string, so that \d reaches the regex engine unchanged
            res = re.findall(
                r'''id: \d+,progress: (\d+),mode: (\d+),fontsize: (\d+),color: (\d+),midHash: "(.*?)",content: "(.*?)",ctime: (\d+),weight: (\d+),idStr: "(\d+)"''',
                each_dm)
            # Lines that do not match the full pattern are skipped
            if res and len(res[0]) == 9:
                item = {
                    "progress": res[0][0],
                    "mode": res[0][1],
                    "fontsize": res[0][2],
                    "color": res[0][3],
                    "midHash": res[0][4],
                    "content": res[0][5],
                    "ctime": res[0][6],
                    "weight": res[0][7],
                    "idStr": res[0][8],
                }
                result.append(item)
        return result

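    def getdanmu_format(self, bvid):
        """
        Fetch and format the danmaku for a bvid
        :param bvid:
        :return:
        """
        # Call through self rather than the module-level instance "e"
        avid = self.bvid_to_avid(bvid)
        cid = self.get_video_cid(bvid)
        danmu_raw = self.get_danmu(avid, cid)
        return self.parse_danmu(danmu_raw)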


if __name__ == '__main__':
    e = BEngine()
    bvid = "BV1Dz4y1L7hj"
    print(e.getdanmu_format(bvid))
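
The bvid_to_avid method is easier to follow, and to test, as a standalone function. The sketch below reuses the table and constants from the script above. The sample id BV17x411w7KC is my own illustrative input, not one from this post, and I have not verified that this scheme covers every BV id in use today.

```python
TABLE = "fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF"
INDEX = {char: i for i, char in enumerate(TABLE)}
POSITIONS = [11, 10, 3, 8, 4, 6]  # bvid positions, least significant digit first
XOR = 177451812
ADD = 8728348608


def bvid_to_avid(bvid: str) -> int:
    """Decode a 12-character BV id into its numeric av id (scheme from the post)."""
    if len(bvid) != 12:
        raise ValueError("expected a 12-character id such as 'BV17x411w7KC'")
    # Six of the characters are base-58 digits; the rest are fixed filler.
    number = sum(INDEX[bvid[pos]] * 58 ** i for i, pos in enumerate(POSITIONS))
    return (number - ADD) ^ XOR


print(bvid_to_avid("BV17x411w7KC"))  # 170001
```

A character outside the table raises KeyError, which is a reasonable signal that the input is not a BV id of the form this scheme expects.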

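Example output: the final print shows a list of dictionaries, one per parsed danmaku, each with the keys progress, mode, fontsize, color, midHash, content, ctime, weight and idStr.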
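
Parsing the text format with a regular expression is fragile. As far as I understand, the text format omits fields that hold their default value, so a danmaku with weight 0 would not match the pattern and would be dropped, and quotes inside content are escaped. A possible alternative is to read the fields straight from each element. The sketch below assumes the generated class exposes attributes under the same names that appear in the text output; it uses a stand-in object so that it runs without dm_pb2.

```python
from types import SimpleNamespace

FIELDS = ("progress", "mode", "fontsize", "color", "midHash",
          "content", "ctime", "weight", "idStr")


def elem_to_dict(elem) -> dict:
    """Copy the fields of one danmaku element into a plain dict."""
    return {name: getattr(elem, name) for name in FIELDS}


def elems_to_dicts(elems) -> list[dict]:
    """Convert an iterable of danmaku elements into a list of dicts."""
    return [elem_to_dict(elem) for elem in elems]


# Stand-in for a decoded element (hypothetical values), so the sketch runs on its own.
fake_elem = SimpleNamespace(
    progress=83500, mode=1, fontsize=25, color=16777215, midHash="abcd1234",
    content='he said "hi", twice', ctime=1694223505, weight=0, idStr="1",
)
parsed = elems_to_dicts([fake_elem])
print(parsed[0]["content"], parsed[0]["weight"])
```

With the real message, the call would be elems_to_dicts(danmaku_seg.elems) right after ParseFromString. The values then keep their native types (integers stay integers) instead of arriving as strings.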

Five. Summary

This time, by studying the Protobuf format and setting up the environment, we wrote Python code that parses Xiaopozhan danmaku. Setting up the local environment may be the hard part for most readers, so a prebuilt dm_pb2.py file is provided for download; place it in the same directory as your own script. Finally, I wish you all a good time. If this helped, a thumbs up would be appreciated.



Origin blog.csdn.net/a1397852386/article/details/132773549