I heard Bilibili is full of danmaku (barrage) geniuses? I don't believe it


Click here for the source code; stars are welcome

In my opinion, a crawler always comes down to four major steps. Each step can be extended in countless ways, but everything from a beginner's first script to a veteran's reverse engineering of encrypted parameters is built on these same four steps.

  1. Determine the URL and construct the request header
  2. Send request, get response
  3. Parse the response and get the data
  4. Save the data
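
The four steps above can be sketched as a tiny pipeline. This is only an illustration under my own assumptions: the function names (build_request, fetch, crawl) and the User-Agent value are made up here, and the parse and save steps are supplied by the caller.

```python
from typing import Any, Callable
from urllib.request import Request, urlopen


def build_request(url: str) -> Request:
    # Step 1: determine the URL and construct the request headers.
    # The User-Agent value is a placeholder, not taken from the article.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch(request: Request) -> str:
    # Step 2: send the request and read the response body as text.
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    url: str,
    parse: Callable[[str], Any],
    save: Callable[[Any], None],
    fetch_fn: Callable[[Request], str] = fetch,
) -> Any:
    request = build_request(url)   # step 1
    text = fetch_fn(request)       # step 2
    data = parse(text)             # step 3: parse the response, get the data
    save(data)                     # step 4: save the data
    return data
```

Passing fetch_fn as a parameter is a design choice for this sketch: it lets you try the parse and save steps on canned text before sending any real request.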

Goal: given a video's BV number, fetch that video's barrage (danmaku, the bullet comments) from Bilibili

The code address is as follows:

Capture packets to determine the URL:

Import:

Every video has a unique identifier: its BV number

So the video URL follows the pattern 'https://www.bilibili.com/video/BV{BVID}'

To find the address of the danmaku, just search the captured requests directly, as follows:

Insert picture description here

From the capture above, we can see that the danmaku URL is 'https://api.bilibili.com/x/v1/dm/list.so?oid={oid}'.

Once we have the oid, this step is done
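
As a small sketch, the two URL patterns found so far can be written as helper functions. The helper names (video_url, danmaku_url) are my own, and the BV number in the usage line is only a placeholder.

```python
VIDEO_URL = "https://www.bilibili.com/video/BV{bvid}"
DANMAKU_URL = "https://api.bilibili.com/x/v1/dm/list.so?oid={oid}"


def video_url(bvid: str) -> str:
    # Accept the id with or without its leading "BV".
    if bvid.startswith("BV"):
        bvid = bvid[2:]
    return VIDEO_URL.format(bvid=bvid)


def danmaku_url(oid) -> str:
    # The oid is whatever id the danmaku endpoint expects for this video.
    return DANMAKU_URL.format(oid=oid)


print(video_url("BV1xx411c7mD"))
print(danmaku_url(123456789))
```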

Now let's look back: where does the oid come from?

Going by this old hand's years of experience, it had to be somewhere in the video URL page. (In truth, I searched for a long time and even resorted to reverse engineering, breakpoint debugging, walking the call stack, and so on. In the end the effort paid off and I found it.)

Looking back, the oid is simply the cid parameter in the video URL page (which confirmed Payne's conjecture). Getting there was painful

Insert picture description here

Now that the URL and its parameter rule have been found, we can do as we please: as long as you have the video address, you can fetch the danmaku directly. Of course!
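
For the parsing step, here is a minimal sketch. It rests on an assumption that the article does not state: that the danmaku endpoint answers with XML in which each comment is a <d> element whose text is the comment itself. The function name parse_danmaku and the sample XML are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def parse_danmaku(xml_text: str) -> List[str]:
    # Assumed layout: <i> ... <d p="...">comment text</d> ... </i>
    root = ET.fromstring(xml_text)
    return [d.text or "" for d in root.iter("d")]


# Made-up sample in the assumed layout, not a real response.
sample_xml = (
    "<i><chatid>1</chatid>"
    '<d p="1.0,1,25,16777215,0,0,a,1">first</d>'
    '<d p="2.5,1,25,16777215,0,0,b,2">second</d>'
    "</i>"
)
print(parse_danmaku(sample_xml))
```

If the real response turns out to be shaped differently, only parse_danmaku needs to change; the other steps stay the same.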

30,000 words are omitted here (requests, parsing, networking principles...)

In fact, I tried both approaches at the time. I won't go into the JS one here; if you are interested, try it yourself.

Let's talk about extracting the cid parameter. I use a regular expression. In this case a regex is the best fit, though it also comes down to personal preference.

You can look back at the second picture. At first glance it didn't look like something I could manage, haha~

After some refinement (mostly from checking other videos), I ended up with this magic regex:

cid = re.findall(r'"cid":(\d+)', text)[0]
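
Here is a slightly more defensive take on the same idea, as a sketch. It accepts any number of digits after "cid": and returns None instead of raising IndexError when nothing matches. The name extract_cid and the sample text are made up; the only thing assumed about the real page is that it embeds the id as "cid":<digits>.

```python
import re
from typing import Optional

CID_PATTERN = re.compile(r'"cid":\s*(\d+)')


def extract_cid(page_text: str) -> Optional[str]:
    # Return the first cid found in the page text, or None if there is none.
    match = CID_PATTERN.search(page_text)
    return match.group(1) if match else None


# Made-up fragment standing in for the video page source.
sample = '{"aid":1,"bvid":"BV1xx411c7mD","cid":214334689,"page":1}'
print(extract_cid(sample))  # prints 214334689
```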


Origin blog.csdn.net/wzp7081/article/details/107371508