Knowledge point supplement 2 (downloading videos)

Preface

To be practical, to make downloading video material with Python more efficient in the future, and to keep what I know about downloading videos readily usable,
I am recording it in this article, which will be updated from time to time.

Download video

Method one -----> you-get

Download and use

Advantages
It runs in a terminal (cmd) with a single line of code, which is fast.
It is extremely convenient for downloading music and short videos.
Measured result: of 30 videos, only nine downloaded completely; the reason is set aside for now.

Usage

you-get https://v.youku.com/v_show/id_XMzk4NDE2Njc4OA==.html?firsttime=0

Required dependency
ffmpeg, used to merge video and audio

Options

Option  Description
-i      Display resource information such as format, resolution, and size
-u      Specify the url to download or inspect; -u can sometimes be omitted and the url given directly
-o      Set the output folder (the save path); if not specified, files are saved in the current working directory
-O      Set the output file name; if not specified, the default file name is used
-f      Force overwriting of existing files
-l      Prefer downloading the entire playlist
-P      Use a password (if the video requires one)
-t      Set the timeout in seconds
-c      Use cookies; loads cookies.txt or cookies.sqlite

Disadvantages
The supported platforms are limited, though there are many of them, mainly Bilibili (station b), Youku Video, Douban, NetEase Cloud, iQiyi, and Kugou.
Summary
you-get is practical and compact, and it gets you videos at minimal time cost. But it does only one thing and leaves little room for customization. It suits general video downloading well; it is not suited to crawling movie files.

Method two -------> you-get plus PyCharm

Direction
Build on the convenience of you-get to write programs with a simple user interaction interface; further code can also be written to download specific targets repeatedly.
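One way to start in this direction is to call you-get from a Python script. The following is a minimal sketch, assuming you-get is installed and on the PATH; the helper names build_you_get_command and download are hypothetical, and only the -i, -o and -O options from the table above are used.

```python
import subprocess


def build_you_get_command(url, output_dir=None, filename=None, info_only=False):
    """Build the argument list for a you-get call (hypothetical helper)."""
    cmd = ["you-get"]
    if info_only:
        cmd.append("-i")            # only display resource information
    if output_dir:
        cmd += ["-o", output_dir]   # output folder
    if filename:
        cmd += ["-O", filename]     # output file name
    cmd.append(url)
    return cmd


def download(url, **options):
    """Run you-get and return its exit code (needs you-get on the PATH)."""
    return subprocess.run(build_you_get_command(url, **options)).returncode


# Only builds and prints the command; nothing is downloaded here.
example = build_you_get_command(
    "https://v.youku.com/v_show/id_XMzk4NDE2Njc4OA==.html?firsttime=0",
    output_dir="D:/base",
)
print(" ".join(example))
```

Calling download(url, output_dir="D:/base") in a loop over a list of urls would be the simplest form of "repeatedly download specific targets".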

Method Three------->Code Implementation

Downloading a video

url

A URL is known to be a resource address, but video URLs come in many forms with different meanings. For a detailed explanation of URLs, see the linked article.

URLs ending with .mp4

http://ggkkmuup9wuugp6ep8d.exp.bcevod.com/mda-km671xd58s1yy16y/mda-km671xd58s1yy16y.mp4

This is a typical example: it contains the protocol name, the domain name, the virtual directory, and finally the .mp4 file name.

Once you have such a link, you can download it directly.
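The parts named above can be pulled out with the standard library. This is a small sketch using urllib.parse on the example link; the helper is_direct_mp4 is a hypothetical name, and checking the path suffix is only a rough heuristic.

```python
import posixpath
from urllib.parse import urlsplit

url = "http://ggkkmuup9wuugp6ep8d.exp.bcevod.com/mda-km671xd58s1yy16y/mda-km671xd58s1yy16y.mp4"

parts = urlsplit(url)
directory, filename = posixpath.split(parts.path)

print(parts.scheme)   # protocol name
print(parts.netloc)   # domain name
print(directory)      # virtual directory
print(filename)       # file name ending with .mp4


def is_direct_mp4(candidate):
    """Rough check: does the path part of the url end with .mp4?"""
    return urlsplit(candidate).path.lower().endswith(".mp4")
```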

First open the link to check whether it plays directly in the browser, then use one of the following two methods.

Method one:

import requests

headers = {  # Browser-like identity header sent with the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
url = "http://ggkkmuup9wuugp6ep8d.exp.bcevod.com/mda-km671xd58s1yy16y/mda-km671xd58s1yy16y.mp4"
content = requests.get(url, headers=headers).content  # the whole file is held in memory
with open("D:/base/s.mp4", "wb") as fp:
    fp.write(content)
    print("Downloaded s")

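Method two:

import requests

# url and headers are the same as in method one
r = requests.get(url, headers=headers, stream=True)  # stream=True reads the body in chunks
with open('D:/base/test.mp4', "wb") as mp4:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        if chunk:
            mp4.write(chunk)

For the background knowledge behind the second method, see the linked reference first.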

URLs that do not end with .mp4

Generally, it is not so easy to get an .mp4 url directly.

Finding the url, direction one of two ----- elements

Look at the source code: the url sits under layers of markup. The .mp4 url in the example above is the src of a tag buried two page levels down. The trick is to look at the page source first. If its structure is clear at a glance, things are mostly uncomplicated: go back to the video playback page and inspect the tags carefully with right-click and Inspect.
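When the page structure is this simple, the src can also be collected in code. The following is a minimal sketch with the standard library's html.parser; the class name VideoSrcParser and the sample markup are made up for illustration, and real pages may put the address in a different tag or attribute.

```python
from html.parser import HTMLParser


class VideoSrcParser(HTMLParser):
    """Collect the src attribute of every <video> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up page source with the video tag nested under two layers
sample_html = """
<div class="player"><div class="inner">
  <video controls src="http://example.com/media/demo.mp4"></video>
</div></div>
"""

parser = VideoSrcParser()
parser.feed(sample_html)
print(parser.sources)
```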

But if the page source is full of long passages you cannot make sense of, especially many braces or JavaScript, you can tell at a glance that this is not the right kind of page. In that case the url cannot be found by searching the elements; it takes other means such as selenium or packet capture (data packets).

Finding the url, direction two of two ----- network

This direction is a bit like the packet capture just mentioned, but real packet capture requires professional tools. Here I first introduce the browser's built-in capture function for finding our url: right-click, choose Inspect, open the Network panel, find the data stream, and check the requests one by one for useful information. This method is mainly for dynamic web pages.

The two directions are essentially the same: both filter useful information out of the data.
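Checking requests one by one can be partly automated. Browser developer tools can usually export the Network panel as a HAR file, which is JSON; the sketch below filters the request urls of such a file by suffix. The function name, the suffix list and the sample data are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Suffixes that appear in this article; extend as needed
MEDIA_SUFFIXES = (".mp4", ".m4s", ".flv", ".ts")


def media_urls_from_har(har):
    """Return request urls whose path ends with a media suffix."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if urlsplit(url).path.lower().endswith(MEDIA_SUFFIXES):
            urls.append(url)
    return urls


# Made-up stand-in for json.load() applied to an exported HAR file
sample_har = {"log": {"entries": [
    {"request": {"url": "https://example.com/index.html"}},
    {"request": {"url": "https://example.com/v/seg-1.m4s?deadline=1"}},
    {"request": {"url": "https://example.com/static/app.js"}},
]}}

print(media_urls_from_har(sample_har))
```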

At first I looked for an .mp4 url to download directly, but clearly, if the page source is very complicated, an .mp4 url will not be present at all (or, if present, it will be encrypted).

In this case, change your thinking and find the interface (api).

The interface (api) is like a wrapper function written by other developers. After passing it some parameters, the corresponding content can be obtained.

Take station b as an example: it has a video interface. Find this interface and add the right parameters to get the specified video. The interface response must contain the url we need, but that url is certainly not as simple as an .mp4 address that can be spliced together by adding parameters. And even once the url is obtained, it is unknown whether the two pieces of code above can download it.

Let's take station b as an example and find the interface it uses.

'https://api.bilibili.com/x/player/playurl?' + 'bvid=' + bvid +'&cid=' + str(cid) + '&qn=64&type=&otype=json'
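The same string can be assembled with urllib.parse.urlencode instead of manual concatenation. This is a sketch only: build_playurl is a hypothetical helper, and the cid value in the example call is made up.

```python
from urllib.parse import urlencode

PLAYURL_API = "https://api.bilibili.com/x/player/playurl"


def build_playurl(bvid, cid, qn=64):
    """Assemble the interface url with the same parameters as the string above."""
    params = {"bvid": bvid, "cid": cid, "qn": qn, "type": "", "otype": "json"}
    return PLAYURL_API + "?" + urlencode(params)


print(build_playurl("BV17o4y1976Q", 123456))  # the cid here is a made-up value
```

Keeping the parameters in a dict makes it easier to change qn or loop over several cid values.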

Alternatively, find the url in code (in the network panel some videos have a playurl request and some do not, so use a regular expression):
use a regular expression to find, in the source code of the target web page, the JSON that contains the direct url of the video.

import re
import json

# html holds the source text of the video page
pattern = r'<script>window\.__playinfo__=(.*?)</script>'
result = re.findall(pattern, html)[0]

Then parse the JSON

temp = json.loads(result)

Then extract the content

video_url = temp['data']['dash']['video'][0]['baseUrl']
print(video_url)
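Put together, the three steps (regular expression, JSON parsing, extraction) can be tried on a small made-up page. This is a sketch under the assumption that the page embeds window.__playinfo__ in the layout shown above; extract_video_url and the sample HTML are hypothetical.

```python
import json
import re

PLAYINFO_PATTERN = re.compile(r"<script>window\.__playinfo__=(.*?)</script>", re.S)


def extract_video_url(html):
    """Return the first video baseUrl from the embedded playinfo JSON, or None."""
    match = PLAYINFO_PATTERN.search(html)
    if match is None:
        return None
    playinfo = json.loads(match.group(1))
    return playinfo["data"]["dash"]["video"][0]["baseUrl"]


# Made-up page source with the same structure as the real one
sample_html = (
    "<html><head><script>window.__playinfo__="
    '{"code": 0, "data": {"dash": {"video": [{"id": 64, "baseUrl": "https://example.com/v.m4s"}]}}}'
    "</script></head></html>"
)

print(extract_video_url(sample_html))
print(extract_video_url("<html></html>"))
```

Returning None when the script tag is missing avoids the IndexError that re.findall(...)[0] raises on pages without playinfo.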

Printing the parsed JSON (temp) gives the following (truncated)

{'code': 0, 'message': '0', 'ttl': 1, 'data': {'from': 'local', 'result': 'suee', 'message': '', 'quality': 32, 'format': 'flv480', 'timelength': 8679, 'accept_format': 'flv720,flv480,mp4', 'accept_description': ['高清 720P', '清晰 480P', '流畅 360P'], 'accept_quality': [64, 32, 16], 'video_codecid': 7, 'seek_param': 'start', 'seek_type': 'offset', 'dash': {'duration': 9, 'minBufferTime': 1.5, 'min_buffer_time': 1.5, 'video': [{'id': 64, 'baseUrl': 'https://upos-sz-mirrorkodo.bilivideo.com/upgcxcode/43/29/288562943/288562943-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1613228680&gen=playurl&os=kodobv&oi=827278138&trid=3625ae10dddf49fc88c17490d5867c64u&platform=pc&upsig=1d6f8c7c311ba495bca7f2da3fe288f6&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=0&orderid=0,3&agrr=0&logo=80000000', 'base_url': 'https://upos-s

You can see links such as base_url in it. Opening such a link in a browser shows that it is inaccessible, and it is not an .mp4, but it is the url we need to download.
The base_url can also be found directly in the Network panel.

Experience downloading a video from station b

How do we download once we have the base_url?
Only the final step remains.
The final code is as follows

import requests
url = "https://cn-jszj-dx-v-09.bilivideo.com/upgcxcode/18/28/287972818/287972818-1-30074.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1613356156&gen=playurl&os=vcache&oi=827278138&trid=3f0f02d09e5a46e8943e7bb069e62f60u&platform=pc&upsig=343a85c8a72a750ba56c93eee4c28d63&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&cdnid=8192&mid=410239695&orderid=0,3&agrr=0&logo=80000000"
headers = {  # Browser-like identity header sent with the request
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
chunk_size = 1024 * 1024
begin = 0
end = chunk_size - 1
with open("D:/base/a.mp4", 'ab') as fp:
    while True:
        # Add the Range request header with the byte range to fetch
        headers.update({'Range': 'bytes=' + str(begin) + '-' + str(end)})
        res = requests.get(url=url, headers=headers)
        if res.status_code == 416:
            # 416 instead of 206: the requested range starts beyond the end
            # of the file, so everything has already been written
            break
        fp.write(res.content)
        fp.flush()
        if res.status_code != 206 or len(res.content) < chunk_size:
            # Not a partial response, or a short final chunk: download complete
            break
        begin = end + 1
        end = end + chunk_size

Without a Range header giving the byte range, the last step does not succeed.
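The byte ranges themselves are easy to compute once the total size is known. The sketch below assumes the server reports the size in the Content-Range header of a 206 response (for example "bytes 0-999/2500"); both helper names are hypothetical.

```python
def total_from_content_range(value):
    """Parse the total size out of a Content-Range value such as 'bytes 0-999/2500'."""
    total = value.rsplit("/", 1)[1]
    return None if total == "*" else int(total)


def byte_ranges(total_size, chunk_size=1024 * 1024):
    """Yield Range header values that cover total_size bytes without overlap."""
    begin = 0
    while begin < total_size:
        end = min(begin + chunk_size, total_size) - 1
        yield "bytes=%d-%d" % (begin, end)
        begin = end + 1


total = total_from_content_range("bytes 0-999/2500")
print(list(byte_ranges(total, chunk_size=1000)))
# ['bytes=0-999', 'bytes=1000-1999', 'bytes=2000-2499']
```

With the ranges known in advance, the loop no longer needs to wait for a 416 response to know when to stop.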
There is a big problem: I downloaded only one .m4s file, and an .m4s video file must be combined with an .m4s audio file before it can be watched properly...
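ffmpeg, listed earlier as the dependency for merging video and audio, can do this combination. The following is a sketch that assumes the audio .m4s has been downloaded the same way as the video (where its url comes from is not shown in this article) and that ffmpeg is on the PATH; the file names and helper names are made up.

```python
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Build an ffmpeg call that copies both streams into one file without re-encoding."""
    return [
        "ffmpeg",
        "-i", video_path,   # first input: the video track
        "-i", audio_path,   # second input: the audio track
        "-c", "copy",       # copy the streams as they are
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg and return its exit code."""
    return subprocess.run(build_merge_command(video_path, audio_path, output_path)).returncode


# Only prints the command; call merge(...) to actually run ffmpeg.
print(" ".join(build_merge_command("video.m4s", "audio.m4s", "merged.mp4")))
```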

Another line of thought is to find the flv link directly through the video interface...

The station b interface is a GET request.
It is now possible to crawl a single video, or a series of videos, with a given BV number through the interface.

Once you understand that the essence of crawling videos is finding the url or the interface, the rest follows naturally.

Things to be aware of

1. A demonstration of a mistake

import requests

headers = {  # Browser-like identity header sent with the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
for i in range(142):
    # Mistake: this is the address of the playback page, not of the video file
    url = "https://www.bilibili.com/video/BV17o4y1976Q?p=%s" % i
    content = requests.get(url, headers=headers).content
    with open("D:/base/%s.mp4" % i, "ab") as fp:
        fp.write(content)
        print("Downloading %s" % i)

Here an indirect url (the page address) is used as if it were a direct url. Naturally this goes wrong: the downloaded file cannot be opened.
Change it to

import requests

headers = {  # Browser-like identity header sent with the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

url = "http://ggkkmuup9wuugp6ep8d.exp.bcevod.com/mda-km671xd58s1yy16y/mda-km671xd58s1yy16y.mp4"
content = requests.get(url, headers=headers).content
with open("D:/base/s.mp4", "ab") as fp:
    fp.write(content)
    print("Downloading s")

and the download succeeds.
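The mistake in the first demonstration, saving a web page under an .mp4 name, can be caught before anything is written. The helper below is a hypothetical heuristic, not a complete check: it looks at the Content-Type header and at the first bytes of the body.

```python
def looks_like_web_page(content_type, first_bytes):
    """Return True when a response is probably HTML rather than video data."""
    if content_type and content_type.lower().startswith("text/html"):
        return True
    head = first_bytes.lstrip()[:15].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


print(looks_like_web_page("text/html; charset=utf-8", b""))
print(looks_like_web_page("video/mp4", b"\x00\x00\x00\x20ftypisom"))
```

With requests, the arguments would be response.headers.get("Content-Type") and the first few dozen bytes of response.content.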

2. Encrypted URL

Sometimes right-clicking a video lets you copy its address (this url is usually useless)

Like this

blob:https://www.bilibili.com/62cc1137-e104-4360-a03f-b856bf1079ca

This url looks encrypted; in fact a blob: address only refers to an object held inside the browser page, so it is useless outside that page. For details, see the article on why the video link address is a blob.
To download in batches, the crawler needs to locate where the direct url sits in the web page data.
3. Do not download the url directly after finding the interface; add a Range byte range to the request headers.
4. Normal download speed is extremely slow.
5. Resuming from a breakpoint.
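Resuming from a breakpoint comes down to asking only for the bytes that are still missing. This is a minimal sketch, assuming the server honours Range requests: the size of the partial file on disk becomes the start of the range. The helper name resume_headers is hypothetical.

```python
import os
import tempfile


def resume_headers(path, base_headers=None):
    """Return request headers that continue a download from the size of path."""
    headers = dict(base_headers or {})
    done = os.path.getsize(path) if os.path.exists(path) else 0
    if done:
        headers["Range"] = "bytes=%d-" % done   # from byte `done` to the end
    return headers


# Demonstration with a throwaway file instead of a real partial download
with tempfile.TemporaryDirectory() as folder:
    partial = os.path.join(folder, "part.mp4")
    before = resume_headers(partial)            # nothing on disk yet
    with open(partial, "wb") as fp:
        fp.write(b"x" * 300)                    # pretend 300 bytes were downloaded
    after = resume_headers(partial, {"User-Agent": "demo"})

print(before)
print(after)
```

The response to such a request should then be appended to the file in "ab" mode, as the loops in this article already do.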

The simplest case of downloading a movie

Take a movie website as an example.
Find the .ts files in the Network panel. Their names end in sequential numbers and there is no anti-crawling measure, so they can be downloaded directly (this is downloading by finding the direct url; it is easy to confuse with the first mistake above, which looks similar).
The principle involved also includes the resumable transfer mentioned above. In the HTTP protocol, Content-Length describes the transfer length of the message body.
Just a few lines of code

import requests
from urllib3.exceptions import InsecureRequestWarning
from urllib3 import disable_warnings

disable_warnings(InsecureRequestWarning)  # Suppress the https warning caused by verify=False

print("Download started!")
for i in range(1712):
    url = "url%s.ts" % i
    ret = requests.get(url, verify=False).content
    with open(r"D:/base/白日夢想家4.mp4", "ab") as f:
        f.write(ret)
        print("ts file %d written" % i)
print("Download finished!")

Advantages
There is no real advantage, but being able to download at all is a small pleasure.
You cannot download movies with you-get, and this method at least achieves that.
Disadvantages
The few lines of code above download extremely slowly. Thread pools and multithreading may increase the speed (this item is to be done); otherwise the method is of little practical use.
Summary
Not every website is this easy to crawl, so the method is rarely usable, and even when it is, it is very slow.
Even if the speed can be improved, learning how takes time, and that cost may outweigh the value of getting the movie.
If neither of these is a problem for you, for example because multithreading, thread pools, and even proxies are things you should learn anyway for crawling, or because getting movies matters to you (say you collect movies), then you can try to learn more.
ps: Practicality first. At present I do not know enough about multithreading and have no particular need to crawl movies. In this situation it is better to learn what is more practical and skip multithreading for now: learning scrapy, and downloading text, pictures, and video with the scrapy framework plus selenium, is more cost-effective than learning for the sake of downloading movies. The logic will take time to sort out.
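For the thread pool item marked as to be done above, one possible shape is sketched here with concurrent.futures from the standard library. It is an assumption-laden outline, not a tested downloader: the fetch function is passed in, a fake one stands in for requests so the example runs without a network, and all segments are held in memory before being joined.

```python
from concurrent.futures import ThreadPoolExecutor


def download_segments(urls, fetch, max_workers=8):
    """Fetch all segments in parallel and join them in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of urls, not in completion order
        return b"".join(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a real download; returns bytes derived from the url."""
    return url.encode() + b";"


segment_urls = ["url%s.ts" % i for i in range(5)]
merged = download_segments(segment_urls, fake_fetch)
print(merged)
```

A real fetch could be something like lambda u: requests.get(u, verify=False).content, with the joined bytes written to the output file once at the end.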

A more difficult case

The .ts files are not like the ones above. They appear to be encrypted and encoded, and a brief search did not solve it. This is shelved for now and will be added later.

Method four -----------> scrapy download

Omitted

Also about video processing and a powerful piece of software for it

Format Factory

No need to say more

Summary

Being able to download videos or movies with Python has been one of my motivations for learning all along, and I hoped it would bring me many benefits. Of course, along the way I gradually discovered that a lot of software already achieves this goal, and on today's Internet it is no big deal. Many tools can casually replace what Python does here, but thinking that way limits my vision.


Origin blog.csdn.net/qq_51598376/article/details/113778759