Python crawler in practice: a video crawler for a certain website

Preface

Today I suddenly discovered that videos from a certain website cannot be downloaded on a computer. So I decided to write a crawler that downloads them, so that everyone can watch them on a computer.

Prepare

On this website, video and audio are served as separate streams. I searched online and found that a tool called ffmpeg is needed to merge them. There are plenty of tutorials online that are easy to find, so I won't go into detail here.

Start the topic

Same old routine: capture the packets. I captured the site's traffic and got lucky. The site was kind enough not to encrypt the video, so the video link can be found right away.

74e74b71c152463fb485b4254aac9aa5.png

Remember to click the format (pretty-print) button in the lower left corner, otherwise the video links are hard to spot at a glance.

This is where I got stuck, because there were too many video links.

7f3dd749e5d949ad9f96771dde46f3ab.png

At first I was not sure whether this was the link to the video I wanted. What is the easiest way to verify a link? Right, copy it and open it from the browser's search bar. But opening this link returned nothing, and that is exactly the pit I fell into. I discarded that packet and went looking through the others. The more I looked, the more I wondered whether the site had encrypted things after all. (I am self-taught in crawling and had only just learned JS decryption, which I found very interesting, so when I couldn't find the right packet, encryption was all I could think of. How else could I fail to find the video link?)

In the end I really couldn't find it, so I went online to read articles by more experienced people. I read them with a face full of disbelief, because they used the very link that had returned nothing for me. So, with a disdainful expression and the mindset of watching a show (doge, just kidding), I copied and pasted their code to see whether it could download the video. To my surprise it actually worked, using the first URL in the packet I had discarded. I opened the link in the browser again and still got nothing, yet the code could download the video with it. I still don't fully understand why, and I hope someone can explain. What I did learn is that the next time I meet this kind of link, I should just try it in code (doge, just kidding).

Then it comes to Ctrl+C and Ctrl+V. Just kidding, it's time to write code. The request part is actually the same as usual.

import requests

# url = input("Video link: ")
url = 'https://www.bilibili.com/video/BV1614y1z7yr/?spm_id_from=333.1007.tianma.8-4-30.click&vd_source=19577052a287f6b91b30d9f7ecbda428'
# def get_proxy():
#     # Fetch one proxy address from a proxy pool service
#     json_data = requests.get("http://demo.spiderpy.cn/get/").json()
#     proxies = {
#         'http': 'http://{}'.format(json_data['proxy']),
#         'https': 'https://{}'.format(json_data['proxy'])
#     }
#     return proxies
headers = {
    # The referer is required, otherwise the video cannot be requested later
    'referer': 'https://www.bilibili.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42'
}
# Request the video page itself; the stream links are embedded in its HTML
response = requests.get(url=url, headers=headers)

To use the script yourself, comment out the hard-coded url, uncomment url = input(...), and enter the link of the video you want to download. The get_proxy function simply fetches a proxy IP address from a proxy pool project on GitHub, because I am very afraid of getting my IP blocked. After all, crawl well enough and you end up in jail, although I'm not that good (doge, just kidding). The proxy is a countermeasure against anti-crawling, but sometimes the pool returns a foreign IP that cannot fetch the data, so I commented it out. Then there are the fields in headers. The referer is required: it tells the site that the request comes from the site itself. Without it, the later requests for the video fail. This may also be why the link returned nothing when I opened it in the browser, since no referer was sent. I'll try the link again later.

After requesting the data I ran into another problem: video quality. There were so many links that I wanted to work out which link corresponded to which quality. I searched for a long time but could not find a field that identifies it. So I simply downloaded the videos from the first few links and compared them. The differences are as follows:
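One possible reason the proxied requests failed, offered as a guess rather than a diagnosis: in requests, the keys of the proxies dictionary name the scheme of the target URL, while each value is the address of the proxy itself. A plain HTTP proxy is therefore usually given an http:// address under both keys. The sketch below assumes the pool returns a bare host:port string; build_proxies is a hypothetical helper, not part of the original script.

```python
def build_proxies(proxy_address):
    """Build a proxies mapping for requests from a bare 'host:port' string.

    Assumption: the proxy is a plain HTTP proxy, so it is reached over
    http:// even when the target URL uses https://.
    """
    proxy_url = 'http://{}'.format(proxy_address)
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder address from a documentation range, not a real proxy.
proxies = build_proxies('203.0.113.10:8080')
print(proxies)
```

The result would then be passed along with the headers, for example requests.get(url, headers=headers, proxies=proxies).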

66c603427ea14fce984a8bae3dd64f1a.png

I don't know exactly what the data rate and total bit rate mean, but the largest file is the clearest (doge, just kidding). Overall, among links with the same width and height, the earliest link gives the best quality, though the difference is small. At least I can't tell them apart with the naked eye.

The next step is to parse the data.

import json
import re

# Regular expressions for the page title and the embedded play info JSON
ex1 = r'<title data-vue-meta="true">(.*?)_.*?</title>'
ex2 = r'<script>window\.__playinfo__=(.*?)</script>'
title = re.findall(ex1, response.text, re.S)[0]
data = re.findall(ex2, response.text, re.S)
# Remove characters that Windows does not allow in file names
for i in title:
    if i in '\\/:*?"<>|':
        title = title.replace(i, '')
json_data = json.loads(data[0])
# Take the first entry of each list
video_url = json_data['data']['dash']['video'][0]['baseUrl']
audio_url = json_data['data']['dash']['audio'][0]['baseUrl']

# Download both streams; headers must include the referer
video_cotent = requests.get(url=video_url, headers=headers).content
audio_cotent = requests.get(url=audio_url, headers=headers).content

There are two regular expressions: one extracts the video title, the other extracts the JSON that contains the video and audio links. A for loop removes the characters that would cause an error when the title is used as a file name, in line with the Windows file naming rules. The text containing the links is then parsed as JSON, which gives us a dictionary from which the links can be read by key. Whether this conversion is possible depends on the page, but it works here. After that it is like opening matryoshka dolls: go down key by key, layer by layer, until we reach the video and audio links.

Finally, the video and audio are saved as binary files. I saved them as mp4 and mp3 respectively; readers who need other formats can change the extensions before saving.
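The parsing steps can be tried without touching the network. The sketch below runs the same two patterns against a made-up, heavily shortened page source; sample_html, the example.com links, and the extract_links helper are all hypothetical, and the real page is far larger. It also raises a clear error when a pattern does not match, instead of failing on an index lookup.

```python
import json
import re

# Hypothetical, heavily shortened page source for illustration only.
sample_html = (
    '<html><head><title data-vue-meta="true">Demo: a/b?_site_name</title>'
    '<script>window.__playinfo__={"data":{"dash":{'
    '"video":[{"baseUrl":"https://example.com/v.m4s"}],'
    '"audio":[{"baseUrl":"https://example.com/a.m4s"}]}}}</script>'
    '</head></html>'
)


def extract_links(html):
    """Return (title, video_url, audio_url); raise ValueError if not found."""
    title_match = re.search(r'<title data-vue-meta="true">(.*?)_.*?</title>', html, re.S)
    info_match = re.search(r'<script>window\.__playinfo__=(.*?)</script>', html, re.S)
    if title_match is None or info_match is None:
        raise ValueError('page layout not recognised')
    # Drop characters that Windows forbids in file names.
    title = re.sub(r'[\\/:*?"<>|]', '', title_match.group(1))
    dash = json.loads(info_match.group(1))['data']['dash']
    return title, dash['video'][0]['baseUrl'], dash['audio'][0]['baseUrl']


print(extract_links(sample_html))
```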

with open(title + '.mp4', 'wb') as f:
    f.write(video_cotent)
with open(title + '.mp3', 'wb') as f:
    f.write(audio_cotent)
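Reading .content loads the whole file into memory before it is written. For long videos, a chunked write may be gentler. The sketch below is only an illustration: save_chunks is a hypothetical helper, and the byte strings stand in for real data so the example runs offline. With requests, the chunks could come from requests.get(video_url, headers=headers, stream=True).iter_content(chunk_size=1024 * 1024).

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return total bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in data so the sketch runs without network access.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
total = save_chunks([b'abc', b'', b'defg'], demo_path)
print(total)
```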

d3735172cbc64ddfb68eba6a1f1d0438.png
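The script stops at two separate files, while the Prepare section mentioned ffmpeg for combining them. Here is a minimal sketch of that last step, assuming ffmpeg is installed and on PATH. The helper names and the output file name are made up for illustration. The -c copy option asks ffmpeg to copy both streams without re-encoding.

```python
import shutil
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Return the ffmpeg argument list that muxes one video and one audio file."""
    return [
        'ffmpeg',
        '-y',              # overwrite the output file if it already exists
        '-i', video_path,  # first input: the video-only file
        '-i', audio_path,  # second input: the audio-only file
        '-c', 'copy',      # copy both streams without re-encoding
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg if it is available; raise RuntimeError otherwise."""
    if shutil.which('ffmpeg') is None:
        raise RuntimeError('ffmpeg not found on PATH')
    subprocess.run(build_merge_command(video_path, audio_path, output_path), check=True)


# Only the command is printed here; call merge(...) to actually run ffmpeg.
print(build_merge_command('title.mp4', 'title.mp3', 'title_merged.mp4'))
```

The output path should differ from both inputs, since ffmpeg cannot write to a file it is reading.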

That's all.

Summary of pitfalls 

The first point

The first is that the link returns nothing when opened from the browser search bar, yet the video can be requested with it in code. The likely reason is that the browser request does not carry the referer.

Second point

The second is video quality. Links whose width and height values are the same give the same quality. Also, the link at the top of the list is basically the highest-quality one for the video.
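Instead of relying on the order of the list, the entry could be chosen by its width and height. The sketch below uses made-up entries shaped like json_data['data']['dash']['video']. The width and height keys follow the article; the bandwidth key used as a tie-breaker is an assumption about the JSON and should be checked against the real data. pick_best is a hypothetical helper.

```python
# Hypothetical entries for illustration; 'bandwidth' is an assumed field name.
sample_videos = [
    {'baseUrl': 'https://example.com/720.m4s', 'width': 1280, 'height': 720, 'bandwidth': 800000},
    {'baseUrl': 'https://example.com/1080a.m4s', 'width': 1920, 'height': 1080, 'bandwidth': 1500000},
    {'baseUrl': 'https://example.com/1080b.m4s', 'width': 1920, 'height': 1080, 'bandwidth': 2100000},
]


def pick_best(videos):
    """Pick the entry with the most pixels; break ties by higher bandwidth."""
    return max(
        videos,
        key=lambda v: (v.get('width', 0) * v.get('height', 0), v.get('bandwidth', 0)),
    )


best = pick_best(sample_videos)
print(best['baseUrl'])
```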

 

This article is for learning only, please do not use it for illegal purposes.

 


Origin blog.csdn.net/qq_64241302/article/details/132245825