Station B (Bilibili) video downloader (BV numbers, DASH, separate audio and video)

Why I wrote a station B video crawler

Over the past few days I have been hooked on a few clips on station B and have had them on a brainwashing loop. One of them is the full version of "Previous Life and Modern Life", the ending song of "White Snake: The Origin" (4 minutes 06 seconds, sung by Gong Xiaoxiao). The link is as follows:

https://www.bilibili.com/video/BV1Qb411q7Xu


Faced with a video that had so thoroughly won my heart, I suddenly wanted to download it to my local machine, both for offline playback and for derivative works (crediting the source, of course ~). So I went online to look for a station B video download crawler.

The story behind the station B video crawler

I had no idea until I searched: there are plenty of plug-ins and libraries that can download videos from station B, such as you-get (https://github.com/soimort/you-get/releases/tag/v0.4.486) and IDM (a multi-threaded download tool, available as an exe and as a browser plug-in, that can sniff and download media files on a web page, including pictures, audio, and video). I tried them, but there were too many pitfalls, so I gave up. Better to write it myself ~

To my surprise, I found a station B video crawler that I had written earlier (https://blog.csdn.net/ygdxt/article/details/84501500). I tried it excitedly and found that it no longer worked. After inspecting the returned data, I changed one line of the JSON parsing code and it ran again:

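import json
import re

from pyquery import PyQuery as pq


def parseHtml(self, html):
    # Parse the page with pyquery to get the video title
    doc = pq(html)
    video_title = doc('#viewbox_report > h1 > span').text()

    # Get the video URL with a regex and json; a last resort after pyquery failed
    pattern = r'<script>window\.__playinfo__=(.*?)</script>'
    result = re.findall(pattern, html)[0]
    temp = json.loads(result)
    # Changed here: this used to be temp['durl']
    video_url = None  # stays None if no entry carries a 'url' key
    for item in temp['data']['durl']:
        if 'url' in item:
            video_url = item['url']
    return {
        'title': video_title,
        'url': video_url
    }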

The results were mixed: some videos downloaded correctly, while others came out as 0 KB files, so things were clearly not that simple. Further reading showed that station B served all videos as flv before 2018 and later moved to DASH in a technical upgrade (see station B's notice: https://www.bilibili.com/read/cv855111). My earlier crawler only handled flv, so it could only download certain videos.
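
The two response shapes can be told apart before anything is downloaded. The sketch below is a minimal illustration, not the tool's actual code. The function name `pick_stream_urls` is hypothetical, the `data.durl[].url` path follows the parsing code above, and the `data.dash.video[].baseUrl` / `data.dash.audio[].baseUrl` layout is an assumption about the `window.__playinfo__` JSON on DASH pages that should be checked against a real response.

```python
def pick_stream_urls(playinfo):
    """Classify a parsed window.__playinfo__ dict as DASH or flv and return its URLs."""
    data = playinfo["data"]
    if "dash" in data:
        # Assumed DASH layout: separate lists of video and audio tracks.
        dash = data["dash"]
        return {
            "format": "dash",
            # First listed track only; no quality selection in this sketch.
            "video": dash["video"][0]["baseUrl"],
            "audio": dash["audio"][0]["baseUrl"],
        }
    # Older flv layout: one or more segments, each with a direct URL.
    return {
        "format": "flv",
        "urls": [item["url"] for item in data["durl"] if "url" in item],
    }
```

A crawler that only reads `data['durl']` has nothing to download when the page carries `dash` instead, which would be consistent with the 0 KB files described above.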

Not only that: in the newer videos, sound and picture are separated. To get a complete video, we need to download the video and the audio separately and then merge them. The video and audio addresses are obtained much as in my earlier code, but note that each audio or video download must first send an OPTIONS request to station B. We usually use GET / POST and OPTIONS is less common, but using it in the requests library is not much different. For merging audio and video, ffmpeg is the mainstream choice. That is roughly the technical route.
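
A minimal sketch of that route follows, assuming the `requests` library and an `ffmpeg` binary on the PATH. The function names, the header values, and the file names are illustrative assumptions rather than the tool's actual code. In particular, sending `Referer` and `Origin` headers is a guess at what the server expects.

```python
import subprocess


def fetch_stream(url, referer, out_path, chunk_size=1 << 20):
    """Send OPTIONS first, then GET the stream and write it to out_path."""
    # Third-party dependency; imported here so the ffmpeg helpers below
    # can be used even when requests is not installed.
    import requests

    headers = {
        "Referer": referer,  # assumption: the video page URL
        "Origin": "https://www.bilibili.com",
        "User-Agent": "Mozilla/5.0",
    }
    with requests.Session() as session:
        # The OPTIONS request that has to precede the actual download.
        session.options(url, headers=headers, timeout=30)
        with session.get(url, headers=headers, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            with open(out_path, "wb") as f:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    f.write(chunk)


def build_merge_command(video_path, audio_path, out_path):
    """Return the ffmpeg arguments that mux both inputs without re-encoding."""
    return [
        "ffmpeg", "-y",    # overwrite the output file if it exists
        "-i", video_path,  # first input: video-only track
        "-i", audio_path,  # second input: audio-only track
        "-c", "copy",      # copy both streams as they are
        out_path,
    ]


def merge(video_path, audio_path, out_path):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(build_merge_command(video_path, audio_path, out_path), check=True)
```

Calling `fetch_stream` once for the video address and once for the audio address, then `merge("video.m4s", "audio.m4s", "out.mp4")`, would produce a single playable file under these assumptions.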

As an aside, just last month (2020/3/23) station B upgraded the video AV number to the BV number. The announcement reads:

The AV number has always been an important identifier for video submissions on station B, and it has played a key role in how videos are spread and shared.

In order to protect submission information, accommodate more submissions, and safeguard the rights and interests of UP masters (uploaders), from March 23, 2020 the AV number is fully upgraded to the BV number. Unlike the purely numeric AV number, the BV number is a string of digits and uppercase and lowercase letters, generated automatically by an algorithm. From now on, the BV number will be used uniformly as the submission identifier.

At the same time, features related to AV numbers generated before March 23, 2020 remain unchanged, for example links to shared submissions, AV number search, and highlighted jump links in updates, comments, and private messages.

In addition, after copying a BV number or a link containing one, the user is taken to the video automatically when opening the station B app.

In short, a video published before March 23, 2020 can be identified by either its AV number or its BV number, while a video published after that date is identified only by its BV number.

So we simply crawl by BV number directly, whereas the vast majority of station B video crawlers on the Internet target the AV number.
As for how I turned these ideas into code and the pitfalls I hit along the way, I will skip the 10,000 words that would take ...
Let's go straight to how to get the tool and use it ~
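
For readers who want to try crawling by BV number themselves, the first step is pulling the BV number out of whatever link is pasted in. This is a minimal sketch with hypothetical helper names. The pattern assumes a BV number is "BV" followed by exactly ten letters or digits, which matches the example link earlier in this post but is not taken from the announcement.

```python
import re

# Assumption: "BV" plus ten alphanumeric characters, as in BV1Qb411q7Xu.
BV_PATTERN = re.compile(r"BV[0-9A-Za-z]{10}")


def extract_bvid(text):
    """Return the first BV number found in text, or None if there is none."""
    match = BV_PATTERN.search(text)
    return match.group(0) if match else None


def video_page_url(bvid):
    """Build the video page URL in the same form as the link in this post."""
    return f"https://www.bilibili.com/video/{bvid}"
```

An old-style link that carries only an AV number returns `None` here, so the caller can report it instead of requesting a wrong page.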

How to use

Please enjoy the demo video below; it explains everything at a glance.

Click me, the video is at the end of this link

Follow the official account "month long small water" and reply "Cheers" in its backend to get the download tool.

For ease of use, the download tool bundles ffmpeg, so there is no need to set up an ffmpeg environment or a Python environment. It works out of the box ~


Origin blog.csdn.net/ygdxt/article/details/105485580