I wrote a Python downloader to collect Bilibili danmaku automatically

Hello everyone, this is Xiao Zhang!

In the article "Making a Word Cloud Video in Python, and Watching Miss Dancing Through a Word Cloud Image", I briefly introduced how to crawl danmaku (bullet comments) from Bilibili: once you find a video's cid parameter, you can collect all of its danmaku. The idea is simple, but I find it tedious in practice. If I want to collect the danmaku of another video some day, I have to start from scratch: find the cid parameter, write the code, and repeat the same monotonous steps.

So I wondered whether this could be done in one step: enter the link of the video I want to crawl, and let the program recognize it and download the danmaku automatically.

Result

Based on this idea, I wrote a small tool with PyQt5. You only provide the url of the target video and the path of the target txt file; the program then collects the danmaku under the video and saves it to that file. Here is a preview:

[Animation: preview of the tool]

PS: The WeChat official account platform limits the number of frames in an animation, so I cut some content when making it and the animation may not look smooth.

The tool is divided into two parts: the UI and data collection. Python libraries used:

import requests
import re
from PyQt5.QtWidgets import *
from PyQt5 import QtCore
from PyQt5.QtGui import *
from PyQt5.QtCore import QThread, pyqtSignal
from bs4 import BeautifulSoup

UI

The UI is built with PyQt5. It has two buttons (start download, save to), a line-edit control for entering the video link, and a debug window.

[Image: image-20210217185155009]

The code is as follows:

    def __init__(self, parent=None):
        super(Ui_From, self).__init__(parent=parent)
        self.setWindowTitle("B站弹幕采集")  # window title: "Bilibili danmaku collector"
        self.setWindowIcon(QIcon('pic.jpg'))  # window icon
        self.top_label = QLabel("作者:小张\n 微信公号:小张Python")  # author / official account banner
        self.top_label.setAlignment(QtCore.Qt.AlignHCenter)
        self.top_label.setStyleSheet('color:red;font-weight:bold;')
        self.label = QLabel("B站视频url")  # "Bilibili video url"
        self.label.setAlignment(QtCore.Qt.AlignHCenter)
        self.editline1 = QLineEdit()  # input field for the video link
        self.pushButton = QPushButton("开始下载")  # "Start download"
        self.pushButton.setEnabled(False)  # disabled until a url is entered
        self.Console = QListWidget()  # debug window
        self.saveButton = QPushButton("保存至")  # "Save to"
        self.layout = QGridLayout()
        self.layout.addWidget(self.top_label, 0, 0, 1, 2)
        self.layout.addWidget(self.label, 1, 0)
        self.layout.addWidget(self.editline1, 1, 1)
        self.layout.addWidget(self.pushButton, 2, 0)
        self.layout.addWidget(self.saveButton, 3, 0)
        self.layout.addWidget(self.Console, 2, 1, 3, 1)
        self.setLayout(self.layout)
        self.savepath = None  # target txt path, set by the "Save to" button

        self.pushButton.clicked.connect(self.downButton)
        self.saveButton.clicked.connect(self.savePushbutton)

        self.editline1.textChanged.connect(self.syns_lineEdit)

Once the url is not empty and the target txt path has been set, the program can move on to the data collection module.

[Image: Effect 12]

The code that implements this:

    def syns_lineEdit(self):
        # Enable the download button only while the url field is non-empty
        self.pushButton.setEnabled(bool(self.editline1.text()))

    def savePushbutton(self):
        savePath = QFileDialog.getSaveFileName(self, 'Save Path', '/', 'txt(*.txt)')
        if savePath[0]:  # a txt file path was chosen
            self.savepath = str(savePath[0])  # remember it for the download step
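
The text above names two conditions (a non-empty url and a chosen save path), while `syns_lineEdit` only looks at the url. A minimal sketch of a combined check follows. The helper `can_start_download` is hypothetical and not part of the original tool, and the video url is a placeholder; a slot such as `downButton` could call something like it before starting collection.

```python
def can_start_download(url, save_path):
    """Return True only when both a video url and a target txt path are set.

    Hypothetical helper for illustration; it is not part of the original tool.
    """
    has_url = bool(url and url.strip())  # QLineEdit.text() may be '' or only whitespace
    has_path = bool(save_path)           # self.savepath starts out as None
    return has_url and has_path


# Values as they might come from self.editline1.text() and self.savepath
print(can_start_download("https://www.bilibili.com/video/BVxxxxxxxxxx", None))       # False
print(can_start_download("https://www.bilibili.com/video/BVxxxxxxxxxx", "out.txt"))  # True
```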

Data collection

After the program gets the url, the first step is to request the page and extract the video's cid parameter (a string of digits) from it.

[Image: image-20210217194745469]

The cid is then used to construct the URL of the API that serves the video's danmaku, and the requests and bs4 packages collect the text from the response.

[Image: image-20210217195252765]

The data collection code:

        res = requests.get(url)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'lxml')
        items = soup.find_all('d')  # each d tag holds one danmaku comment

        # Write one comment per line; the with block closes the file even on error
        with open(self.savepath, 'w', encoding='utf-8') as f:
            for item in items:
                f.write(item.text + '\n')
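
To try the parsing step without network access, here is a minimal sketch. It assumes the danmaku endpoint has the form `https://comment.bilibili.com/<cid>.xml`; the article does not give the exact URL, so treat this as an assumption that may be outdated. It also assumes the response is XML with one `d` element per comment, as the code above implies. The sketch uses the standard library's `xml.etree.ElementTree` in place of bs4 so that it runs without extra packages, and `build_danmaku_url` and `extract_danmaku` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def build_danmaku_url(cid):
    # Assumed endpoint format; not confirmed by the article.
    return "https://comment.bilibili.com/{}.xml".format(cid)


def extract_danmaku(xml_text):
    """Return the text of every <d> element in the danmaku XML."""
    root = ET.fromstring(xml_text)
    return [d.text or "" for d in root.iter("d")]


# Made-up sample response containing two comments
sample = (
    '<i>'
    '<d p="1.5,1,25,16777215">first comment</d>'
    '<d p="3.0,1,25,16777215">second comment</d>'
    '</i>'
)

print(build_danmaku_url(123456))
for line in extract_danmaku(sample):
    print(line)
```

In the tool itself, the list returned by such a function would be written to the target txt file one comment per line.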

The cid parameter is not located in an ordinary html tag, so I extract it with a re regular expression. This step consumes more machine resources, so to keep the UI responsive it runs in a separate thread:

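class Parsetext(QThread):
    trigger = pyqtSignal(str)  # signal that carries the matched url back to the UI

    def __init__(self, text, parent=None):
        super(Parsetext, self).__init__(parent)
        self.text = text

    def __del__(self):
        self.wait()  # wait for the thread to finish before it is destroyed

    def run(self):
        print('解析 -----------{}'.format(self.text))  # "parsing" debug output
        # Take the first "baseUrl" value embedded in the page source
        match = re.search(r'"baseUrl":"(.*?)","base_url"', self.text)
        if match:  # emit nothing if the page does not contain the field
            self.trigger.emit(match.group(1))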
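
The core of the pattern used in `run` can be checked on its own, outside the thread. A small sketch, using a made-up fragment of page source (the real page is much longer, and the URL below is only a placeholder); `find_base_url` is a hypothetical name:

```python
import re

# Hypothetical excerpt of page source; the URL is a placeholder, not a real one.
page_text = (
    '{"id":80,"baseUrl":"https://example.com/video/123456-1-30080.m4s",'
    '"base_url":"https://example.com/video/123456-1-30080.m4s"}'
)


def find_base_url(text):
    """Return the first "baseUrl" value in text, or None if it is absent."""
    match = re.search(r'"baseUrl":"(.*?)","base_url"', text)
    return match.group(1) if match else None


print(find_base_url(page_text))        # https://example.com/video/123456-1-30080.m4s
print(find_base_url("no match here"))  # None
```

Returning None instead of indexing the result directly avoids an IndexError when the page layout changes and the field is missing.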

Summary

Well, the above is all the content of this article, I hope the content can be helpful to your work or study.

Finally, thank you all for reading, see you in the next issue~

Getting the source code

To get the source code used in this article, follow the WeChat official account Xiao Zhang Python and reply with the keyword 210217 in the backend.

Origin blog.csdn.net/weixin_42512684/article/details/113914376