14 lines of Python code to easily crawl website videos

Preface

The text and pictures in this article are from the Internet and are for learning and communication purposes only. They do not have any commercial use. The copyright belongs to the original author. If you have any questions, please contact us for processing.

Author: IvanFX revival computer community


Basic steps and preparations

1. Debugging environment:

PyCharm + Python 3

Required libraries:

  • urllib.request
  • re

(http.cookiejar is a library that later, more advanced crawlers will need. This project does not have to handle cookies, so you can leave it out.)

Both urllib.request and re are part of the Python standard library, so they need no separate installation. If an import of some other library fails, you can add it in PyCharm by clicking + on the right side of File→Settings→Project Interpreter. If you run Python or Anaconda directly, you can install it from cmd with pip install.

2. In this article, we use Python to crawl short online videos, then download and store them. The basic steps are as follows (writing them down as notes helps to sort out the approach):

(1) Analyze the characteristics of the page URLs and the video file URLs
(2) Fetch the HTML source of each web page and get past the anti-crawling mechanism
(3) Download the videos in batches and store them

Analyze page URL and file URL characteristics

1. Analyze the web page URL

Looking at the site URL, http://www.budejie.com/video/1, we can see that only the last value of the URL differs between pages, and that this value is the page number. So a fixed URL prefix plus a variable page number is enough to generate the page URLs of the site in batches.
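
The "fixed URL + variable" idea can be sketched as follows. The helper name build_page_urls is made up for illustration and is not part of the original program; only the URL pattern comes from the analysis above.

```python
BASE_URL = "http://www.budejie.com/video/%s"  # fixed part of the listing-page URL


def build_page_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page, inclusive.

    Hypothetical helper, shown only to illustrate the URL pattern.
    """
    return [BASE_URL % page for page in range(first_page, last_page + 1)]


for url in build_page_urls(1, 3):
    print(url)
```

Nothing is downloaded here; the sketch only builds the strings that the crawler later requests.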

2. Analyze the file URL

Inspecting the mp4 file names in the page source shows that each video file's URL appears there in plain text, so it can be extracted by matching with a regular expression from the re module.
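
As a rough illustration of the matching step, the sketch below applies the pattern used in the complete program at the end of this article to a made-up fragment of page source. The sample markup and the example.com URLs are assumptions; the real page carries more attributes.

```python
import re

# Hypothetical fragment of page source, invented for this example.
sample_html = (
    '<div class="j-video" data-mp4="http://example.com/videos/a1.mp4"></div>\n'
    '<div class="j-video" data-mp4="http://example.com/videos/b2.mp4"></div>'
)

# (.*?) is non-greedy, so each match stops at the first closing quote.
reg = r'data-mp4="(.*?)"'
video_urls = re.findall(reg, sample_html)
print(video_urls)
```

Because the pattern contains one capture group, re.findall returns just the captured URLs rather than the whole attribute.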

Get URLs in batches and extract video URLs from them

import urllib.request
import re

for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)

1. Batch crawl web URL

Here the page variable holds the page number. For now we crawl pages 1 through 19, since range(1, 20) stops before 20.

(1) req builds the request for the web page
(2) html receives the raw source of the web page, read from the response that urlopen returns
(3) decoding the source as UTF-8 restores the Chinese text so that it displays correctly.

However, running the code above fails with HTTP Error 403: the site's anti-crawling mechanism rejects the request, so the page cannot be fetched.
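
To see the 403 without a traceback, the request can be wrapped in a try/except for urllib.error.HTTPError. This is a minimal sketch, not part of the original program: fetch_html and its opener parameter are hypothetical names, and opener exists only so the function can be tried without a network connection.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded page source, or None if the server answers with an HTTP error.

    Hypothetical helper; opener defaults to urllib.request.urlopen.
    """
    try:
        return opener(urllib.request.Request(url)).read().decode("UTF-8")
    except urllib.error.HTTPError as err:
        # err.code holds the status code, e.g. 403 when the request is rejected
        print("Request for %s failed: HTTP Error %s" % (url, err.code))
        return None
```

Calling fetch_html("http://www.budejie.com/video/1") without a User-Agent header would be expected to take the except branch, given the behaviour described above.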
2. Add request headers taken from the browser

Visit the page in Google Chrome, press F12, switch to the Network tab and refresh the page to watch the requests being made. Select any request in the list (baisibudejie.js is selected here), copy the User-Agent value from its request headers and add it to the code. With the code modified as follows, the page can be crawled normally.

for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)
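
An equivalent way to attach the header is to pass a headers dictionary when the Request is created. The sketch below only builds the request and inspects it, without sending anything; build_request is a hypothetical helper name.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")


def build_request(page):
    """Build (but do not send) the request for one listing page (hypothetical helper)."""
    url = "http://www.budejie.com/video/%s" % page
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


req = build_request(1)
# urllib stores header names in capitalized form, so the key is "User-agent"
print(req.get_header("User-agent"))
```

Checking the header this way is a quick sanity test before running the full loop.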

Download videos in batches and create file name storage

1. Establish a loop for batch naming

Once the loop is set up, each video needs a file name to be saved under. i.split("/")[-1] splits i using '/' as the separator and keeps the last segment, which is the MP4 file name.
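
A small worked example of the split, using a made-up URL (example.com is an assumption, not an address from the site):

```python
video_url = "http://example.com/videos/2018/clip_001.mp4"  # made-up URL

parts = video_url.split("/")
print(parts)      # every segment between the slashes

filename = parts[-1]
print(filename)   # the last segment, used as the local file name
```

The double slash after "http:" produces an empty segment in the list, which does no harm because only the last element is used.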

2. Batch download

It also helps to print a message that shows progress, which makes the program more interactive: each time a video starts downloading, its file name is displayed. The files are saved into a folder named mp4, which must already exist in the working directory because urlretrieve does not create it.

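reg = r'data-mp4="(.*?)"'  # each video URL is stored in a data-mp4 attribute
for i in re.findall(reg, html):
    filename = i.split("/")[-1]  # split on '/' and keep the last segment, i.e. the MP4 file name
    print('Downloading video %s' % filename)
    urllib.request.urlretrieve(i, "mp4/%s" % filename)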
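
If a percentage is wanted in addition to the file name, urlretrieve accepts a reporthook callback that receives the block number, the block size and the total size. The following is a sketch under assumptions: percent_done, show_progress and download are hypothetical names, and creating the folder with os.makedirs is an addition that the original program does not make.

```python
import os
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Turn urlretrieve's reporthook arguments into a percentage from 0 to 100."""
    if total_size <= 0:  # the server did not report a size
        return None
    return min(100, block_num * block_size * 100 // total_size)


def show_progress(block_num, block_size, total_size):
    """reporthook callback: print the percentage on a single updating line."""
    pct = percent_done(block_num, block_size, total_size)
    if pct is not None:
        print("\r%d%%" % pct, end="")


def download(url, folder="mp4"):
    """Save url into folder under its own file name and return the local path."""
    os.makedirs(folder, exist_ok=True)  # create the target folder if it is missing
    target = os.path.join(folder, url.split("/")[-1])
    urllib.request.urlretrieve(url, target, reporthook=show_progress)
    return target
```

In the loop above, download(i) could stand in for the urlretrieve line. The percentage is capped at 100 because the last block is usually only partly filled.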

3. Establish a complete program

As a good programmer, you should tidy up the program and add comments so that it is easy to understand and to modify later.

import urllib.request
import re

def getVideo(page):
    # Request one listing page with a browser User-Agent to avoid the 403 response
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    reg = r'data-mp4="(.*?)"'  # each video URL is stored in a data-mp4 attribute
    for i in re.findall(reg, html):
        filename = i.split("/")[-1]  # split on '/' and keep the last segment, i.e. the MP4 file name
        print('Downloading video %s' % filename)
        urllib.request.urlretrieve(i, "mp4/%s" % filename)  # the mp4 folder must already exist

for i in range(1, 20):
    getVideo(i)

Origin blog.csdn.net/pythonxuexi123/article/details/112787420