Python crawler introductory tutorial, continued: crawling station B (Bilibili) videos with Python

Get into the habit of writing together! This is day 12 of my participation in the "Juejin (Nuggets) Daily New Plan · April Update Challenge".

Preface

In the previous post we spent quite a lot of space analyzing how station B delivers its videos. This post fills the hole left there: we will write the code.

Article source: Dream Eraser. This ID is, in fact, a combination.

I won't repeat the analysis steps and logic here. You can read the previous article, which explains them clearly.

First, remember this

30280.m4s corresponds to the audio file, and 30064.m4s corresponds to the video file.

Coding time

Although the station B video has already been analyzed, the actual coding is still difficult, so hang in there and let's get it done together.

The link used throughout this article is www.bilibili.com/video/BV1Pv… , which is a BV link. After station B's upgrade, AV links became BV links, and the anti-crawling measures have multiplied.

After capturing and analyzing the traffic with Fiddler, we reached a few conclusions. The key point is shown in the figure below: the response status code is 206, which deserves attention.

[Figure: Python crawler introductory tutorial 71-100, continued from the previous article: crawling station B video]

The picture above may make you dizzy. Don't worry: click on one of the links and look at how it requests and returns data. After some analysis you will notice something strange. For the same link, one request returns status code 200 and the other returns 206.


These two adjacent requests have the same request address, but the returned status codes differ. That is not the most important part. Look at the request methods and things get even stranger.


The request with status code 200 uses the OPTIONS method, while the one with status code 206 uses GET. Interesting. This may be our final breakthrough point, so let's first understand these request methods thoroughly.
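For context, an OPTIONS request followed by a GET is the pattern of a browser CORS preflight, and 206 Partial Content is the standard reply to a request that carries a Range header. Below is a minimal sketch of the byte-range bookkeeping involved. The helper names `range_header` and `parse_content_range` are hypothetical, and the Content-Range value in the demo is made up.

```python
import re


def range_header(start, end):
    """Build the value of a Range request header for bytes start..end (inclusive)."""
    return f"bytes={start}-{end}"


def parse_content_range(value):
    """Parse a Content-Range response header such as 'bytes 0-99/1000'.

    Returns (start, end, total); total is None when the server sends '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


# The byte range seen in the Fiddler capture for this article
print(range_header(998857, 1198790))
# A made-up Content-Range value, used only to exercise the parser
print(parse_content_range("bytes 998857-1198790/5000000"))
```

A 206 response normally carries such a Content-Range header, so the total size after the slash tells you how many more ranges are left to request.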

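For this part of the code, the link I use was copied directly from Fiddler. Once the code is finished, there is still one very hard problem to crack here, which can serve as a highlight feature for your own deeper analysis. Later on we will do some extended discussion and exploration based on this link. The link copied from Fiddler is https://9lglsr2.yfcalc.com:9940********=1 (shortened to save space; go and find your own link).

The final code is below. Note that the requests must be sent through requests.session(), because the request headers contain the keep-alive value. Parts of the links have been removed from the code below. If you need the code, follow my WeChat official account and reply with "B 站", or download the file from the top of the article.

Just search WeChat for "非本科程序员".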

import requests

"""
Host: 9lglsr2.yfcalc.com:9940
Connection: keep-alive
Origin: https://www.bilibili.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; (rest of UA removed)
range: bytes=998857-1198790
Accept: */*
Sec-Fetch-Site: cross-site
Sec-Fetch-Mode: cors
Referer: https://www.bilibili.com/video/BV1Pv41167FE
Accept-Encoding: identity
Accept-Language: en,zh-CN;q=0.9,zh;q=0.8

"""
# Your address will be different from mine
url = "https://9lglsr2.yfcalc.com:9940/upos-dash-mirrorks3u.bilivideo.com/bilibilidash_(rest of the link removed to save space)"

# Headers for the OPTIONS request that the browser sends before the GET
header_options = {
    'Host': '9lglsr2.yfcalc.com:9940',
    'Connection': 'keep-alive',
    'Access-Control-Request-Method': 'GET',
    'Origin': 'https://www.bilibili.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) (rest of UA removed)',
    'Access-Control-Request-Headers': 'range',
    'Accept': '*/*',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.bilibili.com/video/BV1Pv41167FE',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en,zh-CN;q=0.9,zh;q=0.8'
}

# Piece together the headers for the GET request; yours will probably
# differ a lot from mine, so use your own
headers = {
    'Host': '9lglsr2.yfcalc.com:9940',
    'Connection': 'keep-alive',
    'Origin': 'https://www.bilibili.com',
    'User-Agent': 'Mozilla/5.0 (rest of UA removed)',
    'range': 'bytes=0-9999999999',
    'Accept': '*/*',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.bilibili.com/video/BV1Pv41167FE',
    'Accept-Encoding': 'identity',
    'Accept-Language': 'en,zh-CN;q=0.9,zh;q=0.8'
}
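# Reuse one session so both requests share the keep-alive connection
session = requests.session()
# Send the OPTIONS request first, the same way the browser does
session.options(url=url, headers=header_options, verify=False)
# The GET request with the range header returns the data (status code 206)
res = session.get(url=url, headers=headers, verify=False)
# Stop here on an error status instead of writing an error page to disk
res.raise_for_status()
# The with block closes the file, so flush() and close() are not needed
with open('audio.mp3', 'wb') as fp:
    fp.write(res.content)
print("Download succeeded")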

It runs successfully, and what it downloads is the audio file.


The same approach downloads the video. You only need the video's address to do it. The address I got is:

url = "9ns2tr2.yfcalc.com:13357/upos-dash-m…"

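The crawled audio and video are separate files, so you just need to merge them with ffmpeg. The basic logic is now clear, and what remains is revising and polishing. The heavy work is about to begin, but I have tossed you this brick as a starting point, and the rest is for you to finish.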
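One way to do the merge from Python is to call ffmpeg through subprocess. This is only a sketch: the function names and the file names are placeholders that do not appear in the code above, and it assumes ffmpeg is installed and on PATH.

```python
import shutil
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Return an ffmpeg command that muxes one video and one audio file.

    '-c copy' copies both streams as they are, without re-encoding.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        "-i", audio_path,
        "-c", "copy",
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    command = build_merge_command(video_path, audio_path, output_path)
    subprocess.run(command, check=True)


# Placeholder file names; only the command is printed here, nothing is run
print(" ".join(build_merge_command("video.m4s", "audio.m4s", "merged.mp4")))
```

Calling `merge("video.m4s", "audio.m4s", "merged.mp4")` would then produce the combined file, provided both downloads finished completely.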

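Unfinished parts of the code

The code above covers only a very small part of the logic. For example, when downloading a video there is no progress bar, so you cannot tell whether the download has finished, which makes for a very poor experience. If you are building a real project you need to add this. While you are still learning, decide based on your own situation.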
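A simple text progress report could look like the sketch below. The helper `save_with_progress` is hypothetical and works on any iterable of byte chunks, so the demo uses in-memory data instead of a real download.

```python
import io


def save_with_progress(chunks, total_bytes, out_file, report=print):
    """Write an iterable of byte chunks to out_file and report progress.

    total_bytes may be 0 when the size is unknown; in that case only the
    running byte count is reported. Returns the number of bytes written.
    """
    written = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        out_file.write(chunk)
        written += len(chunk)
        if total_bytes:
            percent = written * 100 // total_bytes
            report(f"{written}/{total_bytes} bytes ({percent}%)")
        else:
            report(f"{written} bytes")
    return written


# Demo with in-memory data instead of a real download
buffer = io.BytesIO()
save_with_progress([b"abcd", b"efgh", b"ij"], 10, buffer)
```

With requests, you would pass `stream=True` to `session.get`, use `res.iter_content(chunk_size=65536)` as `chunks`, and take `total_bytes` from the `Content-Length` response header when the server sends one.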

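The biggest problem is that the request address above was taken directly from Fiddler. Where does this address actually come from? I tried to work it out and found that station B's anti-crawling really is quite strong. I will show roughly how far I got, but I did not solve the problem.

When writing the code, getting the link from the page source turns out to be very hard. The link found in the page source is clearly different from the link that the video is actually requested from.

Link from the page source (right-click the page and view the source):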

http://upos-sz-mirrorhw.bilivideo.com/upgcxcode/20/11/199591120/199591120-1-30280.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1591936610&gen=playurl&os=hwbv&oi=3056817678&trid=ba124908cc9b4c08b338898c507f1b2cu&platform=pc&upsig=69afcb746d8f70b3ef157b0d5d96105a&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=39801623&orderid=0,2&logo=80000000

The real link used by the video request:

https://k31k9q1.yfcalc.com:15172/upos-dash-mirrorks3u.bilivideo.com/bilibilidash_98e9c0435499a2edc99cda183af9647c2c76fee9/199591120-1-30064.m4s?scuid=F6we2NvTZYGLpLeGVVZi&timeout=1592535041&check=1258368323&sttype=90&yfdspt=1591930241786&yfpri=100&yfopt=17&yfskip=1&yfreqid=CDkypiFjBDuLtgbAAQ&yftt=100&yfhost=5p8vjn1.yfcache.com&yfpm=1

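So we need to break through this obstacle. The hardest part is finding out how the link used by the video request is assembled. Come on, let's continue the investigation.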
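To see how different the two links are, it helps to split them into host, file name, and query parameter names. The sketch below uses shortened copies of the two links above (the long `e` value and the signature values are left out), and `describe` is a hypothetical helper. Treating `deadline` as a Unix timestamp is an assumption based only on how the value looks.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Shortened copies of the two links above; dropped parameters do not
# change the outcome of this comparison
page_source_url = (
    "http://upos-sz-mirrorhw.bilivideo.com/upgcxcode/20/11/199591120/"
    "199591120-1-30280.m4s?uipk=5&nbs=1&deadline=1591936610&gen=playurl"
    "&os=hwbv&platform=pc"
)
real_url = (
    "https://k31k9q1.yfcalc.com:15172/upos-dash-mirrorks3u.bilivideo.com/"
    "bilibilidash_98e9c0435499a2edc99cda183af9647c2c76fee9/"
    "199591120-1-30064.m4s?timeout=1592535041&sttype=90&yfpri=100&yfpm=1"
)


def describe(url):
    """Split a URL into host, file name, and the set of query parameter names."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "file": parts.path.rsplit("/", 1)[-1],
        "params": set(parse_qs(parts.query)),
    }


source_info = describe(page_source_url)
real_info = describe(real_url)
print(source_info["host"], "->", real_info["host"])
print(source_info["file"], "->", real_info["file"])
print("shared parameter names:", source_info["params"] & real_info["params"])

# Assumption: "deadline" is a Unix timestamp in seconds
deadline = int(parse_qs(urlsplit(page_source_url).query)["deadline"][0])
print(datetime.fromtimestamp(deadline, tz=timezone.utc).isoformat())
```

The host, the path prefix, and every parameter name differ between the two links, which is why the real link cannot be produced by simple string replacement on the page-source link.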

Keep looking and you will come across another request. It should be the breakthrough point, because the real link shown above appears in it, and it returns a JSON string.


Run the URL used in the parameters through a URL decoding tool. Wow, deja vu: we are getting closer and closer to the truth.

https://upos-dash-mirrorks3u.bilivideo.com/upgcxcode/20/11/199591120/199591120-1-30280.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1591938786&gen=playurl&os=hwbv&oi=3056817678&trid=8106da0eee9548cb8377565645dbc1f8u&platform=pc&upsig=adeb773cf597120141a94bd78a8775ab&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=39801623&orderid=0,2&logo=80000000&[email protected]
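Instead of an online tool, the decoding can also be done with the standard library. In this sketch a shortened piece of the URL above is percent-encoded first, purely to have something to decode; the real parameter value is much longer.

```python
from urllib.parse import quote, unquote

# A shortened piece of the decoded URL above, used as sample data
plain = (
    "https://upos-dash-mirrorks3u.bilivideo.com/upgcxcode/20/11/199591120/"
    "199591120-1-30280.m4s?uipk=5&nbs=1"
)
# Percent-encode every reserved character, the way a URL looks when it is
# carried inside another URL's query string
encoded = quote(plain, safe="")

print(encoded)
print(unquote(encoded))
```

`unquote` restores the characters such as `:`, `/`, `?`, `=` and `&` that were escaped, giving back a readable URL.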

The problem now is cracking station B's encryption rules. I tried, and found that the code is heavily obfuscated, so the analysis is very time-consuming. Here is how far I got. The core part of the code has been located. If possible, unpack it, find the corresponding code in the JS file, and then port it to Python. I picked two videos on station B and found the relevant code snippets for both. Some of the content is indeed encrypted, but the obfuscated strings are the same in both videos, namely the t part and the ak part.


After that I hit a bottleneck. The encryption here is not easy to reverse, and it would take a lot of time to try (I got lazy again and did not keep thinking about it, hahaha). If you solve it, be sure to send me a message and let me know!

Easter egg time

For downloading station B videos, you-get works really well. Whether to reinvent the wheel depends on which skill you are trying to practice. In this case, though, I simply visit the mobile site: change the address to https://m.bilibili.com/video/BV1Pv41167FE and view the page source directly.


Emmm... it is a plain MP4 file, which is nicer, simpler, and easier to write code for: get the page, extract the address, download the binary, write the file, and exit. One combination of punches, take the data, and leave.

Often we fail to solve a technical problem not because our skills are lacking, but because we don't know which approach to take.
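The "extract the address" step could be sketched with a regular expression, as below. The page fragment is invented for illustration, because the real markup of the mobile page is not shown in this article and may look different. `find_mp4_urls` is a hypothetical helper.

```python
import re
from html import unescape

# Hypothetical page fragment, not copied from m.bilibili.com
sample_page = (
    '<video src="https://example.com/path/video.mp4?token=abc&amp;x=1">'
    "</video>"
)

# A URL that contains ".mp4" and ends at a quote, whitespace or angle bracket
MP4_URL = re.compile(r"""https?://[^"'\s<>]+?\.mp4[^"'\s<>]*""")


def find_mp4_urls(page):
    """Return every URL in the page text that contains '.mp4'."""
    return [unescape(url) for url in MP4_URL.findall(page)]


print(find_mp4_urls(sample_page))
```

Once an address is found, downloading it is the same binary write used for the audio file earlier, again with your own headers.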


Origin juejin.im/post/7085545132644106276