How to use a Python crawler to scrape bullet-comment (barrage) data from Bilibili?

Bilibili is well known as a paradise for bullet comments (also called barrage or danmaku). The more viewers a video has, the more bullet comments it gets. Today, Xiaoqian will show you how to write a Python crawler to collect Bilibili's bullet-comment data.

1. Where are the bullet comments?

When we watch a video, the bullet comments scroll across it. On the web page itself, they are hidden in the source and loaded as XML data:

1.jpg

XML, like JSON and YAML, is a common format for representing structured data; you can think of it simply as a way of recording data. XML looks very similar to HTML, the language used to describe web pages, which is why you see tags like <d></d> in the screenshot.

So what is the URL of the bullet-comment file in the screenshot above?
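To make the format concrete, here is a minimal sketch that reads a tiny hand-written XML snippet with Python's standard-library xml.etree.ElementTree module. The snippet, its root tag, and its comment texts are made up for illustration; only the <d> tag name comes from the screenshot.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a hand-written stand-in for a real bullet-comment file.
sample_xml = """<i>
  <d>first comment</d>
  <d>second comment</d>
</i>"""

root = ET.fromstring(sample_xml)

# Each <d> element holds one comment as its text content.
comments = [d.text for d in root.iter("d")]
print(comments)
```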

https://comment.bilibili.com/92542241.xml

It consists of a fixed address + the video's cid + .xml. Once you have the cid of the video you want, substitute it into this URL to fetch all of its bullet comments (for most videos on Bilibili, the page serves at most 1000 of them).
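As a small sketch of that rule, the URL can be assembled from a cid with a helper function. The function name danmaku_url and the constant BASE_URL are hypothetical names chosen for this example; the base address is taken from the example URL above.

```python
# Base address taken from the example URL in this article.
BASE_URL = "https://comment.bilibili.com/"


def danmaku_url(cid):
    """Return the bullet-comment XML URL for a given video cid."""
    return f"{BASE_URL}{cid}.xml"


print(danmaku_url(92542241))
```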

Where do you find a video's cid? Right-click the web page, view its source, and search for "cid":

The string cid appears many times in the page source, and the value we want is written in the form "cid":xxxxxxxx. To narrow the search, include the quotation marks in the search term.

With the right cid in hand, assemble the URL, and let's write the crawler!
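The same search can be done in code with a regular expression. This is only a sketch: the function name find_cids is hypothetical, and the page fragment below is invented for illustration rather than copied from a real Bilibili page.

```python
import re

# Matches the form described above: "cid": followed by digits.
CID_PATTERN = re.compile(r'"cid":\s*(\d+)')


def find_cids(page_source):
    """Return the distinct cid values in the page source, in order of appearance."""
    seen = []
    for cid in CID_PATTERN.findall(page_source):
        if cid not in seen:
            seen.append(cid)
    return seen


# Hypothetical fragment of page source, made up for illustration.
sample_source = '{"cid":92542241,"title":"demo","pages":[{"cid":92542241}]}'
print(find_cids(sample_source))
```

Requiring the quotation marks in the pattern mirrors the manual search: an unquoted cid elsewhere in the source is not matched.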

2. What exactly are crawler libraries?

Almost everyone new to Python crawlers starts with two basic, widely used libraries: requests and BeautifulSoup. requests sends a request to a URL and retrieves the page source; BeautifulSoup parses the HTML/XML content and extracts the information you care about.

3.jpg

Together, these two libraries mimic what a person does when visiting a web page, reading it, and copying out the relevant information, and they can do so quickly and in bulk.

3. Start crawling

Looking at the page, you can see that every bullet comment sits inside a <d> tag, so we need a program that collects all the <d> tags:

The first step is to import the requests library and use the requests.get method to request the bullet-comment URL:

import requests

# Fetch the bullet-comment XML

url = 'https://comment.bilibili.com/78830153.xml'

r = requests.get(url)  # Request the URL

r.encoding = 'utf8'  # Decode the response as UTF-8

The second step is to import the BeautifulSoup library and use the lxml parser to parse the page:

from bs4 import BeautifulSoup

# Parse the page

soup = BeautifulSoup(r.text, 'lxml')  # lxml is a commonly used parser; install it beforehand with pip install lxml

d = soup.find_all('d')  # Find every <d> tag on the page

# print(d)

After this, Python has captured all the bullet comments stored in the <d> tags:

4.gif

Next, process the bullet comments: organize each comment, its URL, and its time into a dictionary, then append each dictionary to a list, for a total of 1000 records.
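One possible way to sketch this step is shown below. It rests on several assumptions that this article does not confirm: that each <d> tag carries a comma-separated p attribute, and that the first field of p is the time in seconds at which the comment appears in the video. The function name build_records and the sample data are hypothetical, and the sketch uses the standard-library xml.etree.ElementTree module instead of BeautifulSoup so that it runs without extra installs.

```python
import xml.etree.ElementTree as ET


def build_records(xml_text, url):
    """Turn every <d> element into a dict and collect the dicts in a list."""
    records = []
    for d in ET.fromstring(xml_text).iter("d"):
        # Assumption: the first comma-separated field of "p" is the time
        # (in seconds) at which the comment appears in the video.
        fields = d.get("p", "").split(",")
        try:
            seconds = float(fields[0])
        except ValueError:
            seconds = None  # attribute missing or not numeric
        records.append({"text": d.text or "", "url": url, "time": seconds})
    return records


# Hypothetical sample data, hand-written for illustration.
sample_xml = '<i><d p="12.5,1,25">hello</d><d p="3.0,1,25">world</d></i>'
url = "https://comment.bilibili.com/78830153.xml"
for record in build_records(sample_xml, url):
    print(record)
```

With a real response, the text fetched from the bullet-comment URL would take the place of sample_xml.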

8.png

6.jpg

Once the data is organized, we can analyze it further, for example by counting word frequencies, depending on what you need.
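As a rough sketch of a word-frequency count, the standard library's collections.Counter is enough for space-separated text. The function name word_frequencies and the sample comments are hypothetical. Chinese comments contain no spaces, so a real analysis would likely need a word-segmentation step first; that is left out here.

```python
import re
from collections import Counter


def word_frequencies(comments, top_n=3):
    """Count lowercase word occurrences across all comments."""
    counts = Counter()
    for text in comments:
        # \w+ splits on whitespace and punctuation.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical comments, made up for illustration.
sample_comments = ["hello world", "Hello again", "world world"]
print(word_frequencies(sample_comments))
```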

7.jpg

This article is from Qianfeng Education. Please credit the source when reprinting.

Origin blog.51cto.com/15128702/2668077