Preface
The text and pictures in this article are from the Internet and are for learning and communication purposes only. They do not have any commercial use. The copyright belongs to the original author. If you have any questions, please contact us for processing.
Author: IvanFX revival computer community
Basic steps and preparations
1. Debugging environment:
PyCharm + Python 3
Required libraries:
- urllib.request
- re
(http.cookiejar is a library used by later, more advanced crawlers. This project does not need it, so you can leave it out.)
All of these modules are part of the Python 3 standard library, so they need no separate installation. If importing some other library fails because it is missing, you can add it in PyCharm by clicking + on the right side of File→Settings→Project Interpreter. If you run the project directly with Anaconda or plain Python, you can install it from cmd with pip install.
2. In this article, we use Python to crawl short online videos, then download and store them. The basic steps are as follows (writing them down as notes helps sort out the approach):
(1) Analyze the characteristics of the page URLs and the video file URLs
(2) Fetch the HTML source of each web page and get past the anti-crawling mechanism
(3) Download the videos in batches and store them
Analyze page URL and file URL characteristics
1. Analyze the web page URL
Looking at the site URL http://www.budejie.com/video/1, we can see that only the last value in the URL changes from page to page, and that this value is the page number. A fixed URL prefix plus a variable page number is therefore enough to generate the page URLs in batches.
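As a minimal sketch of the fixed-prefix-plus-page-number idea, the helper below builds a list of page URLs. The names BASE_URL and page_urls are illustrative and do not appear in the original program.

```python
# Fixed part of the URL; %s is replaced by the page number
BASE_URL = "http://www.budejie.com/video/%s"


def page_urls(first, last):
    """Return the URLs for pages first..last, inclusive (hypothetical helper)."""
    return [BASE_URL % page for page in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```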
2. Analyze the file name URL
Inspecting the mp4 file names in the page source shows that each file URL appears in plain text, so it can be extracted by matching it with a regular expression from the re module.
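To illustrate the matching step, here is a small sketch that runs a regular expression over a hand-written HTML fragment. The fragment and its example.com URLs are made up for illustration. The pattern assumes the video URL sits in a data-mp4 attribute, which is the pattern used in the complete program at the end of this article.

```python
import re

# Made-up fragment standing in for a downloaded page
sample_html = '''
<div class="j-video" data-mp4="http://example.com/videos/a1.mp4"></div>
<div class="j-video" data-mp4="http://example.com/videos/b2.mp4"></div>
'''

# The non-greedy group captures everything between the quotes
reg = r'data-mp4="(.*?)"'
video_urls = re.findall(reg, sample_html)
print(video_urls)
```

Because the group is non-greedy, each match stops at the first closing quote, so two attributes on the same page yield two separate URLs.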
Get URLs in batches and extract video URLs from them
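import urllib.request
import re

# Fetch the first 20 pages; range(1, 21) yields 1 through 20
for page in range(1, 21):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    # Read the raw bytes of the response
    html = urllib.request.urlopen(req).read()
    # Decode the bytes as UTF-8 so that Chinese text displays correctly
    html = html.decode('UTF-8')
    print(html)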
1. Crawl the page URLs in batches
Here the page variable holds the page number. For now we crawl only the first 20 pages.
(1) req is the request for the page; urlopen sends it and returns the response
(2) html receives the source code of the page through read()
(3) Decoding the source code as UTF-8 restores the display of Chinese text.
However, running the above code fails with HTTP Error 403: the site's anti-crawling mechanism rejects the request, so the page cannot be fetched.
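In Python, a rejected request surfaces as urllib.error.HTTPError. The sketch below, built around a hypothetical helper named fetch_page, catches that exception and reports the status code instead of letting the loop crash. The opener parameter is an assumption added only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_page(url, opener=urllib.request.urlopen):
    """Return the decoded page, or None if the server answers with an HTTP error."""
    req = urllib.request.Request(url)
    try:
        return opener(req).read().decode("UTF-8")
    except urllib.error.HTTPError as err:
        # err.code holds the numeric status, e.g. 403 for a refused request
        print("Request for %s failed with HTTP Error %s" % (url, err.code))
        return None
```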
2. Add a request header copied from the browser
Open the page in Google Chrome, press F12, switch to the Network tab, and refresh the page to watch the requests being made. Select one of the listed requests (baisibudejie.js was selected here), copy its User-Agent request header, and add it to the code. With the code modified as follows, the pages can be crawled normally.
import urllib.request

for page in range(1, 21):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    # Send a browser User-Agent so the site does not answer with HTTP Error 403
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)
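As a quick offline check that the header really is attached, the sketch below builds the request and reads the header back without sending anything. build_request is a hypothetical helper name. Note that urllib stores header names with only the first letter capitalized, so the lookup key is "User-agent".

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")


def build_request(page):
    """Build (but do not send) the request for one list page."""
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", USER_AGENT)
    return req


req = build_request(1)
print(req.full_url)
print(req.get_header("User-agent"))
```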
Download videos in batches and name the stored files
1. Build the loop that names the files
Once the loop is in place, each file needs a name before it is downloaded. i.split("/")[-1] splits i on '/' and keeps the last segment, which is the MP4 file name.
2. Batch download
It is also worth printing a message to show progress, which makes the program more interactive: as each video is downloaded, a progress message is displayed, and the file is finally saved into an mp4 folder. The mp4 folder must already exist, because urlretrieve does not create it.
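# Each video URL is the value of a data-mp4 attribute in the page source
reg = r'data-mp4="(.*?)"'
for i in re.findall(reg, html):
    # Split on '/' and keep the last segment, i.e. the MP4 file name
    filename = i.split("/")[-1]
    print('Downloading video %s' % filename)
    # The mp4 folder must already exist
    urllib.request.urlretrieve(i, "mp4/%s" % filename)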
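The naming step and the folder handling can be tried without downloading anything. The sketch below uses a made-up URL and a hypothetical helper, target_path, that derives the file name the same way as the loop above and creates the output folder if it is missing.

```python
import os


def target_path(video_url, folder="mp4"):
    """Return folder/<file name>, creating the folder when it does not exist."""
    os.makedirs(folder, exist_ok=True)
    # Keep the last '/'-separated segment, i.e. the MP4 file name
    filename = video_url.split("/")[-1]
    return os.path.join(folder, filename)


print(target_path("http://example.com/videos/a1.mp4"))
```

The returned path could then be passed as the second argument of urllib.request.urlretrieve.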
3. Assemble the complete program
A good programmer tidies up the program and adds comments so that it is easy to understand and to modify later.
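import os
import re
import urllib.request


def getVideo(page):
    # Request one list page with a browser User-Agent to avoid HTTP Error 403
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    # Each video URL is the value of a data-mp4 attribute
    reg = r'data-mp4="(.*?)"'
    for i in re.findall(reg, html):
        # Split on '/' and keep the last segment, i.e. the MP4 file name
        filename = i.split("/")[-1]
        print('Downloading video %s' % filename)
        urllib.request.urlretrieve(i, "mp4/%s" % filename)


# Create the output folder if needed; urlretrieve does not create it
os.makedirs("mp4", exist_ok=True)
# Crawl pages 1 through 20
for i in range(1, 21):
    getVideo(i)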