This blog post records the main ideas and complete code for crawling Bilibili danmaku (bullet comments).
One. Preface
In 2023, Bilibili changed the return value of its danmaku interface from an .xml document to a .so file.
For example, the following address:
Return value example:
At first glance, part of the data looks encrypted.
Two. Configure Protobuf environment & generate compiled files
1. Configure Protobuf environment
After some searching, I found that this format is called **Protobuf**. The data is not encrypted; it is transmitted in a binary encoding.
Protobuf (Protocol Buffers) is a lightweight data serialization protocol developed by Google. It can be used for serialization and deserialization of structured data, and is often used in scenarios such as network communication, data storage, and configuration files.
Protobuf uses a concise syntax to define data structures, and a compiler then generates the corresponding code for serializing and deserializing the data in different programming languages. Compared with text-based formats such as XML and JSON, Protobuf is generally faster to parse and produces smaller payloads.
Using Protobuf, you can define the field type, field name, field order and other information of the message, and then read and write data through the code generated by the compiler. Protobuf supports multiple programming languages, including C++, Java, Python, etc., so it can transmit and exchange data between different platforms and languages.
In general, the Protobuf protocol is an efficient, flexible and scalable data serialization protocol suitable for data exchange and storage needs in various scenarios.
Simply put, it is a data format that is more lightweight than XML.
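To make "binary encoding" concrete, here is a minimal sketch, using only the Python standard library, of how the Protobuf wire format stores a field: a tag (field number plus wire type) followed by either a varint or a length-prefixed payload. The helper names `read_varint` and `read_fields` are illustrative and not part of any library, and only two wire types are handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def read_fields(buf):
    """Yield (field_number, wire_type, value) for each top-level field.

    Only wire types 0 (varint) and 2 (length-delimited) are handled,
    which is enough for this illustration.
    """
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Hand-built sample: field 1 = 150 (varint), field 2 = "hi" (length-delimited)
sample = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])
decoded = list(read_fields(sample))
print(decoded)  # [(1, 0, 150), (2, 2, b'hi')]
```

Without the field names and types from the .proto file, all that can be recovered is field numbers and raw values, which is why the compiled dm_pb2.py is needed to read the response properly.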
To decode data in this format, we need to download the Protobuf compiler (my computer runs 64-bit Windows, so I downloaded the win64 build):
https://github.com/protocolbuffers/protobuf/releases/tag/v3.17.3
After the download is complete, unzip it. The bin directory contains the executable program, so we add it to the environment variables. The steps for Windows 10 are: right-click "This PC" - Advanced System Settings - Environment Variables - double-click Path - New - enter the path of the bin directory - OK.
Enter protoc in cmd to verify that the configuration succeeded. If your console output is the same as mine, the environment is configured correctly.
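The same check can also be done from Python. This is a small sketch, assuming only the standard library; `protoc_version` is a name made up for this example. It looks for protoc on PATH and, if found, runs `protoc --version`.

```python
import shutil
import subprocess


def protoc_version():
    """Return the output of `protoc --version`, or None if protoc is not on PATH."""
    path = shutil.which("protoc")
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout.strip()


version = protoc_version()
print(version if version else "protoc was not found on PATH")
```

If this prints the "not found" message right after you edited the environment variables, try opening a new console first, since already-open consoles keep the old PATH.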
2. Generate compiled files
First, download dm.proto.
Then run the following command in the console, in the directory that contains dm.proto:
protoc --python_out=. dm.proto
A dm_pb2.py file will be generated in the same directory.
This file is essential: the parsing scripts below import it.
Three. Parse danmaku
Place the dm_pb2.py file generated in the previous step in the same directory as the script. The following code demonstrates parsing a local .so file.
import dm_pb2
from google.protobuf import text_format

my_seg = dm_pb2.DmSegMobileReply()
with open('./seg.so', 'rb') as f:
    DATA = f.read()
# Deserialize the binary data into the message object
my_seg.ParseFromString(DATA)
# Convert the first danmaku element to readable text
parse_data = text_format.MessageToString(my_seg.elems[0], as_utf8=True)
print(parse_data)
Output results
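The demonstration above assumes that a seg.so file already exists locally. One way to obtain it, sketched here with the standard library only, is to request the same endpoint and parameters that the script in the next section uses. The function names `build_seg_url` and `save_segment` and the id values are placeholders for illustration, not part of the original script.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as used by the script in the next section
SEG_URL = "http://api.bilibili.com/x/v2/dm/web/seg.so"


def build_seg_url(cid, avid, segment_index=1):
    """Build the seg.so request URL for one danmaku segment."""
    params = {"type": 1, "oid": cid, "pid": avid, "segment_index": segment_index}
    return f"{SEG_URL}?{urlencode(params)}"


def save_segment(cid, avid, path="seg.so", segment_index=1):
    """Download one segment and write the raw bytes to `path`."""
    request = Request(build_seg_url(cid, avid, segment_index),
                      headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as resp, open(path, "wb") as f:
        f.write(resp.read())


# Placeholder ids for illustration only; substitute a real cid and avid.
url = build_seg_url(cid=123456, avid=170001)
print(url)
```

Calling `save_segment` with a real cid and avid should leave a seg.so file next to the script, ready for the parsing code above.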
Four. Automatically parse danmaku
Here is a script that fetches and parses danmaku automatically; you only need to enter the BVID of the video.
import json
import re

import requests
import google.protobuf.text_format as text_format

import dm_pb2 as Danmaku


class BEngine:
    """
    Bilibili engine
    """

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}

    def do_request(self, url):
        # Return the response body as text, or False on a non-200 status
        r = requests.get(url, headers=self.headers)
        if r.status_code == 200:
            r.encoding = 'utf-8'
            return r.text
        else:
            return False

    def get_video_cid(self, bvid):
        """
        Get the cid from the bvid
        :param bvid:
        :return: the cid, or False on failure
        """
        api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bvid}'
        try:
            html = self.do_request(api_url)
            if html:
                _json = json.loads(html)
                cid = _json['data'].get('cid')
                return cid
            else:
                return False
        except Exception:
            return False

    def bvid_to_avid(self, bvid):
        """
        Get the avid from the bvid
        :param bvid:
        :return:
        """
        table = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'
        # Map each character of the base-58 alphabet to its index
        tr = {}
        for i in range(58):
            tr[table[i]] = i
        s = [11, 10, 3, 8, 4, 6]
        xor = 177451812
        add = 8728348608

        def dec(x):
            r = 0
            for i in range(6):
                r += tr[x[s[i]]] * 58 ** i
            return (r - add) ^ xor

        return dec(bvid)

    def get_danmu(self, avid, cid):
        """
        Get the decoded danmaku list from the .so file
        :return:
        """
        result = []
        url = 'http://api.bilibili.com/x/v2/dm/web/seg.so'
        params = {
            'type': 1,  # danmaku type
            'oid': cid,  # cid
            'pid': avid,  # avid
            'segment_index': 1  # danmaku segment
        }
        resp = requests.get(url, params, headers=self.headers)
        data = resp.content
        danmaku_seg = Danmaku.DmSegMobileReply()
        danmaku_seg.ParseFromString(data)
        for j in danmaku_seg.elems:
            parse_data = text_format.MessageToString(j, as_utf8=True)
            # Join the "name: value" lines of one danmaku into a single line
            result.append(parse_data.replace("\n", ",").rstrip(","))
        print(result)
        return result

    def parse_danmu(self, danmu_list):
        """
        Parse the content of each danmaku in the list
        :param danmu_list:
        :return:
        """
        result = []
        for each_dm in danmu_list:
            # Raw string, so that \d is not treated as a string escape
            res = re.findall(
                r'''id: \d+,progress: (\d+),mode: (\d+),fontsize: (\d+),color: (\d+),midHash: "(.*?)",content: "(.*?)",ctime: (\d+),weight: (\d+),idStr: "(\d+)"''',
                each_dm)
            if res and len(res[0]) == 9:
                item = {
                    "progress": res[0][0],
                    "mode": res[0][1],
                    "fontsize": res[0][2],
                    "color": res[0][3],
                    "midHash": res[0][4],
                    "content": res[0][5],
                    "ctime": res[0][6],
                    "weight": res[0][7],
                    "idStr": res[0][8],
                }
                result.append(item)
            else:
                continue
        return result

    def getdanmu_format(self, bvid):
        """
        Fetch the danmaku and return them formatted
        :param bvid:
        :return:
        """
        # Use self rather than the global variable e
        avid = self.bvid_to_avid(bvid)
        cid = self.get_video_cid(bvid)
        danmu_raw = self.get_danmu(avid, cid)
        return self.parse_danmu(danmu_raw)


if __name__ == '__main__':
    e = BEngine()
    bvid = "BV1Dz4y1L7hj"
    print(e.getdanmu_format(bvid))
Example of output results
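One caveat about `parse_danmu`: the regular expression expects every field to be present and in a fixed order. As far as I understand, the text output of a proto3 message omits fields that hold their default value (for example a `progress` of 0), so such a danmaku would not match and would be skipped silently. A more tolerant alternative, sketched below under that assumption, is to split the original multi-line text output into name/value pairs. The helper `parse_text_fields` and the sample text are hypothetical, and escape sequences inside string values are not handled.

```python
def parse_text_fields(text):
    """Parse 'name: value' lines (one scalar field per line) into a dict."""
    item = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank or unexpected lines
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # drop the quotes around string fields
        item[name.strip()] = value
    return item


# Hypothetical sample shaped like text_format.MessageToString output,
# with the progress field absent
sample = 'id: 1\nmode: 1\nfontsize: 25\ncolor: 16777215\ncontent: "hello, world"\n'
parsed = parse_text_fields(sample)
print(parsed.get("progress", "0"), parsed["content"])  # 0 hello, world
```

To use it, pass the string returned by `text_format.MessageToString` before the newlines are replaced with commas, and read missing fields with `dict.get` and a default.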
Five. Summary
In this post, we looked into the Protobuf protocol, set up the environment, and wrote Python code to parse Bilibili danmaku. Setting up the local environment may be difficult for many readers, so here is the prepared dm_pb2.py file. Click to download it and place it in the same directory as your own script. Finally, I wish you all a good time. If this post helped, please give it a thumbs up.
Six. References
Python implements crawling of Bilibili video likes and other information