How do you batch download videos with Python? It takes only about 15 lines of code

As the saying goes: life is short, I use Python.

What would be the point of learning Python if not to download these videos!
Ahhh, you old rascal


Ahem, let's get down to business.

1. Prelude

First of all, if you don't have Python and PyCharm installed, please install them yourself; I won't cover installation here.

Next is the module we need: requests, the workhorse of crawlers. It is the module that sends data requests, and it can be installed directly with pip.

Press Win+R to open the Run box, type cmd and press Enter, then type pip install requests in the command prompt window that pops up and press Enter to complete the installation.

  • Reasons for installation failure
    1. "pip is not recognized as an internal or external command". Solution: add Python to the PATH environment variable.
    2. Lots of red error text (read timed out). Solution: the network connection timed out, so switch to a mirror source.
    3. cmd shows that the module is already installed, or that the installation succeeded, but it still cannot be imported in PyCharm. Solution: you may have several Python versions installed (keep either Anaconda or Python and uninstall the other), or the Python interpreter in PyCharm is not configured.

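Mirror sources

Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud: https://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: https://pypi.hustunique.com/
Shandong University of Technology: https://pypi.sdutlinux.org/
Douban: https://pypi.douban.com/simple/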

Installation method

For example:

pip3 install -i https://pypi.doubanio.com/simple/ module_name
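
If you have more than one Python on your machine (failure reason 3 above), one common way to avoid installing into the wrong one is to run pip through the interpreter itself. Below is a minimal sketch of that idea; build_pip_command is a hypothetical helper name of my own, and the Tsinghua mirror is just one of the addresses from the list above.

```python
import sys


def build_pip_command(package, index_url=None):
    """Build a pip install command tied to the running interpreter (hypothetical helper)."""
    # sys.executable is the path of the Python that runs this script,
    # so "-m pip" installs into exactly that interpreter.
    command = [sys.executable, "-m", "pip", "install"]
    if index_url:
        # -i selects the package index (mirror source) to download from
        command += ["-i", index_url]
    command.append(package)
    return command


if __name__ == "__main__":
    # Only prints the command; nothing is installed here.
    print(" ".join(build_pip_command("requests", "https://pypi.tuna.tsinghua.edu.cn/simple")))
```

To actually run the install, you could pass the returned list to subprocess.run(command, check=True).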

How do you configure the Python interpreter in PyCharm?

  1. Select File >>> Settings >>> Project >>> Python Interpreter
  2. Click the gear icon and select Add.
  3. Add the Python installation path.

How do you install plugins in PyCharm?

  1. Select File >>> Settings >>> Plugins
  2. Click Marketplace and enter the name of the plugin you want to install. For example, type translation for a translation plugin, or Chinese for the Chinese language plugin.
  3. Select the corresponding plugin and click Install.
  4. After the installation succeeds, a prompt to restart PyCharm will pop up. Click OK; the plugin takes effect after the restart.

2. Main text

The prelude is over, let's go straight to the topic...

I have deleted the key part of the addresses in the code below. The first one is v.6 and the second is haokan.baidu.

1. Thought process

How do we go about implementing a crawler case?

The data structure of every website is different, so you need to analyze each one and capture its packets yourself, but the overall process is basically the same for every crawler.

1. Data source analysis

  • First, decide on your target site and target data source, and determine the url address;
  • Analyze the captured packets with the browser's developer tools;

2. Code implementation process

  • Send a request to the url address we just analyzed;
  • Get the data, that is, the response data returned by the server;
  • Parse the data and extract the content we want: the video playback url and the video title;
  • Save the data to a local folder;

2. Code display

First, import the modules

import requests
import re

re is the regular expression module. It is built in, so there is nothing to install. Install requests and you are done.

Send the request

Send a request to the url address we just analyzed. Three things matter here:

I. The request url [think of it as a phone number];
II. The request method;
III. The headers used to disguise the request: which request header parameters need to be added, written as a dictionary of key-value pairs;

for page in range(26, 29):
    print(f'==================================== Collecting data from page {page} ====================================')
    url = f'https://minivideo/getMiniVideoList.php?act=recommend&page={page}&pagesize=25'
    # Disguise the request as coming from a regular browser
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)

Printing response shows <Response [200]>: a response object whose 200 status code indicates that the request succeeded.
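
The page number is the only part of the url that changes between requests, so the url building can be sketched on its own with the standard library. This is only an illustration: BASE_URL is a placeholder I made up (the real address is deleted in this article), and build_page_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder address, NOT the real one (the real address is omitted in the article)
BASE_URL = "https://example.com/getMiniVideoList.php"


def build_page_url(page, pagesize=25):
    """Return the list url for one page (hypothetical helper)."""
    # urlencode turns the dict into act=recommend&page=...&pagesize=...
    query = urlencode({"act": "recommend", "page": page, "pagesize": pagesize})
    return f"{BASE_URL}?{query}"


# Same page range as the loop above: pages 26, 27 and 28
urls = [build_page_url(page) for page in range(26, 29)]
for url in urls:
    print(url)
```

Keeping the parameters in a dictionary makes it easy to see which part changes from page to page.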

Get the data

Get the response data returned by the server

response.text    gets the response body as text            string data type
response.json()  gets the response body as JSON data       dictionary data type

If the returned data is complete JSON, you can call response.json() directly, which makes it easier to extract content later.
Dictionary lookup is more convenient: you extract data by key-value pair, using the content to the left of the colon (the key) to get the content to the right of the colon (the value).

First take content, then take list. What comes back is list data.

print(response.text)
print(response.json()['content']['list'])

The returned list contains the information for each video. I won't include a screenshot of the addresses that follow, I don't dare~
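
To make the key lookup described above concrete without touching the network, here is a small sketch on an invented sample. The sample string only imitates the content and list nesting mentioned in this article; its values are made up and are not real response data.

```python
import json

# Invented sample that imitates the nesting described above (not real data)
sample_text = '{"content": {"list": [{"title": "demo clip", "playurl": "https://example.com/demo.mp4"}]}}'

# json.loads on the text gives a dict, the same kind of result response.json() returns
data = json.loads(sample_text)

# First take 'content', then take 'list'
videos = data["content"]["list"]
first = videos[0]
print(first["title"], first["playurl"])

# .get() returns a default instead of raising KeyError when a key is missing
missing = data.get("content", {}).get("no_such_key", [])
print(missing)
```

Square-bracket lookup fails loudly when the structure changes, while .get() lets you decide what to fall back to; which one you want depends on how much you trust the response.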

Parse the data

Extract the content we want: the video playback url and the video title.

for index in response.json()['content']['list'][14:]:
    title = index['title']
    play_url = index['playurl']  # quick way to duplicate a line: Ctrl + D
    # Remove characters that are not allowed in Windows file names
    new_title = re.sub(r'[\\/:*?"|<>]', '', title)
    print(title, play_url)
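
The title cleanup step can also be pulled out into a small helper so it is easy to try by itself. This is a sketch under my own assumptions: clean_title is a hypothetical name, the character class includes the backslash because Windows does not allow it in file names, and the "untitled" fallback is something I added for titles that end up empty.

```python
import re

# Characters Windows does not allow in file names: \ / : * ? " | < >
ILLEGAL_CHARS = re.compile(r'[\\/:*?"|<>]')


def clean_title(title):
    """Strip characters that cannot appear in a file name (hypothetical helper)."""
    cleaned = ILLEGAL_CHARS.sub("", title).strip()
    # Assumed fallback so we never build a file name from an empty string
    return cleaned or "untitled"


print(clean_title('What? A "great" video: part 1/2'))
```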

Save the data

# Runs inside the for loop above; the video folder must already exist
video_content = requests.get(url=play_url).content
with open('video\\' + new_title + '.mp4', mode='wb') as f:
    f.write(video_content)
print('Video saved: ', title, play_url)
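
If the video folder does not exist, open() raises FileNotFoundError. One way around that is to let pathlib create the folder first. The sketch below is a variation and not the code used in this article: save_video is a hypothetical helper, and the demo writes a few fake bytes into a temporary directory instead of a real download.

```python
import tempfile
from pathlib import Path


def save_video(content, title, folder="video"):
    """Write video bytes to folder/title.mp4, creating the folder if needed (hypothetical helper)."""
    target_dir = Path(folder)
    # exist_ok=True means no error when the folder is already there
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{title}.mp4"
    target.write_bytes(content)
    return target


# Demo with fake bytes in a temporary directory (no real download happens here)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_video(b"not a real video", "demo", folder=Path(tmp) / "video")
    saved_name = saved.name
    saved_size = saved.stat().st_size
    print(saved_name, saved_size)
```

Path also takes care of the path separator, so the same code works outside Windows.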

Supplement

Getting data with json

import requests
import re
import json
url = 'https://com/web/search/api?pn=4&rn=10&type=video&query=%E7%BE%8E%E5%A5%B3'
# Disguise the request as coming from a regular browser
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'
}
# The search API returns JSON with a list of video detail pages
json_data = requests.get(url=url, headers=headers).json()
for index in json_data['data']['list']:
    index_url = index['url']
    html_data = requests.get(url=index_url, headers=headers).text
    # The detail page embeds its data as JSON assigned to window.__PRELOADED_STATE__
    video_info = re.findall(r'window\.__PRELOADED_STATE__ = (.*?);.*?document', html_data)[0]
    json_data_1 = json.loads(video_info)
    title = json_data_1['curVideoMeta']['title']
    # Take the last entry of the clarityUrl list
    video_url = json_data_1['curVideoMeta']['clarityUrl'][-1]['url']
    print(title, video_url)
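
The regex plus json.loads trick above can be tried offline on a made-up page. In this sketch sample_html is invented and only imitates the curVideoMeta and clarityUrl keys used in the code above; it is not a real page. I also use re.search with an explicit check, so a page without the embedded data gives None instead of an IndexError.

```python
import json
import re

# Invented HTML snippet that imitates a page with embedded JSON (not a real page)
sample_html = (
    '<script>window.__PRELOADED_STATE__ = '
    '{"curVideoMeta": {"title": "demo", "clarityUrl": '
    '[{"url": "https://example.com/sd.mp4"}, {"url": "https://example.com/hd.mp4"}]}};'
    ' document.title = "demo";</script>'
)

# Capture everything between the assignment and the first ";" that is followed by "document"
match = re.search(r'window\.__PRELOADED_STATE__ = (.*?);.*?document', sample_html, re.S)

# Parse the captured text as JSON, or fall back to an empty dict when nothing matched
state = json.loads(match.group(1)) if match else {}
meta = state.get("curVideoMeta", {})
title = meta.get("title")
# Take the last entry of the list, as the code above does
video_url = meta["clarityUrl"][-1]["url"] if meta.get("clarityUrl") else None
print(title, video_url)
```

Note that the lazy (.*?) stops at the first semicolon, so this pattern would cut the JSON short if the embedded data itself contained a semicolon before the word document.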

3. Result display

Brothers, that's all for today's share. Remember to like and bookmark!

Origin blog.csdn.net/fei347795790/article/details/123660880