[Python web crawler] Use urllib to crawl web page source code, pictures and videos

1. Introduction to web crawlers

The basics of HTML and CSS were introduced earlier. Understanding how page elements are composed makes the source code of a page much easier to read, which in turn makes it easier to locate the data a crawler needs. This article goes deeper into crawler-related knowledge.

A web crawler is a program or script that automatically collects information from the World Wide Web according to certain rules. Typically, we start from one page of a website, read its content, extract the useful link addresses it contains, use those links to reach the next pages, and repeat the same work in a loop until a stopping strategy is met or all reachable pages have been crawled.

Classification of web crawlers

There are roughly four types of web crawlers: general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.

  • General web crawler: crawls a huge amount of target data over a very wide scope, and therefore has very high performance requirements. It is mainly used by large search engines and large data providers, where it has great application value.
  • Focused web crawler: a crawler that selectively crawls pages related to predefined topics. Restricting the crawl to pages relevant to the topic greatly saves the bandwidth and server resources required. It is mainly used to collect specific information and serve a specific group of users.
  • Incremental web crawler: only crawls newly generated pages or pages whose content has changed, and skips pages whose content has not changed. To a certain extent, this ensures that the crawled pages are as fresh as possible.
  • Deep web crawler: web pages can be divided into surface pages and deep pages according to how they exist. Surface pages are static pages that can be reached through static links without submitting a form; deep pages are hidden behind forms, cannot be reached directly through static links, and can only be obtained after submitting certain keywords.

A form is used to collect user input and represents an area of the document that contains interactive controls and sends the information entered by the user to the web server. It usually contains input fields, check boxes, radio buttons, drop-down lists and other elements. The most common real-life examples are questionnaires and account login pages.

The role of web crawlers

1) Search engines: provide users with relevant and effective content and create snapshots of all visited pages for subsequent processing. Implementing a search engine or a site search with a focused web crawler helps find the pages most relevant to the search topic.

2) Build data sets: for research, business and other purposes.

  • Understand and analyze netizens’ behavior toward a company or organization.
  • Gather marketing information and make better marketing decisions in the short term.
  • Collect information from the Internet, analyze it and conduct academic research.
  • Collect data to analyze long-term trends in an industry.
  • Monitor competitors' real-time changes.

Web crawler workflow

Set one or several initial seed URLs in advance to obtain the URL list of the initial pages. During crawling, URLs are continuously taken from the URL queue, and the corresponding pages are then accessed and downloaded.

After a page is downloaded, the page parser removes its HTML tags and obtains the page content, saving the summary, URL and other information to the web database. At the same time, it extracts the new URLs on the current page and pushes them into the URL queue, until the system's stopping conditions are met.
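As a rough illustration of this loop, the sketch below uses only the standard library: a seed URL, a URL queue, urllib to download each page, and html.parser to extract new links. The seed URL and the page limit are placeholder values chosen for the example, not part of the original article.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=5):
    queue = deque([seed_url])   # URL queue, initialised with the seed URL
    seen = {seed_url}           # remember URLs already queued to avoid repeats
    downloaded = 0
    while queue and downloaded < max_pages:   # the system's stopping condition
        url = queue.popleft()
        try:
            page = urllib.request.urlopen(url, timeout=5).read().decode('utf-8', 'ignore')
        except Exception as e:
            print('skipped', url, e)
            continue
        downloaded += 1
        print('downloaded', url, len(page), 'bytes')
        parser = LinkParser()
        parser.feed(page)                     # parse the page and extract new URLs
        for link in parser.links:
            absolute = urljoin(url, link)     # turn relative links into absolute ones
            if absolute.startswith('http') and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)        # push new URLs into the queue

if __name__ == '__main__':
    crawl('http://www.baidu.com')             # seed URL, the same site used later in this article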

The crawler's spear and shield:

  • Anti-crawling mechanism: portal websites can prevent crawler programs from scraping their data by adopting corresponding strategies or technical means.
  • Anti-anti-crawling strategy: the crawler program can break through a portal website's anti-crawling mechanisms by adopting relevant strategies or technical means, so as to obtain the site's data.

The robots.txt protocol: a gentleman's agreement. It specifies which data on a website may be crawled by crawlers and which may not.
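urllib ships a parser for this agreement (see urllib.robotparser in the next section). The short sketch below, with Baidu chosen only as an example site, checks whether a path may be fetched before crawling it.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt') # example site
rp.read() # download and parse robots.txt
# ask whether a given user agent is allowed to fetch a given path
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/baidu'))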

2. Crawling with urllib

urllib is Python's official standard library for requesting URL connections. In Python 2 it was split into urllib and urllib2; in Python 3 they were merged into a single urllib package. It is used to fetch URLs: the urlopen function opens a web page and supports retrieval over various protocols. The following four modules are the most commonly used (a short urllib.parse example follows the list):

  • urllib.request: Mainly responsible for constructing and initiating network requests, and defines functions and classes suitable for opening URLs in various complex situations.
  • urllib.error: Exception handling, including exceptions thrown by urllib.request.
  • urllib.parse: Parse various URL data formats.
  • urllib.robotparser: Parse robots.txt file.
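As promised above, a minimal urllib.parse sketch; the URLs and parameters are made up for illustration:

from urllib.parse import urlparse, urlencode, urljoin

# split a URL into its components
parts = urlparse('https://www.baidu.com/s?wd=python&pn=10')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# build a query string from a dict (useful for GET parameters or a POST body)
print(urlencode({'wd': 'crawler', 'pn': 20}))

# resolve a relative link against a base URL
print(urljoin('https://www.baidu.com/s', '/img/bd_logo1.png'))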

In Google Chrome, open the Baidu URL https://www.baidu.com/, right-click and choose Inspect (or open More tools - Developer Tools from the Chrome menu), refresh the page and click an element; you will see something like the following picture:

[Screenshot: Chrome Developer Tools showing the Baidu page elements]

2.1 Send a request

Taking the Baidu URL as an example, use urlopen to initiate a request and view the page source by running the following code.

import urllib.request # import the module
baidu_url = 'http://www.baidu.com' # Baidu URL
response = urllib.request.urlopen(baidu_url) # initiate the request and return a response object
msg = response.read().decode('utf-8') # call read to get the data and decode it
print(msg) # print the page information

[Screenshot: printed HTML source of www.baidu.com]

2.2 Data storage and exception handling

If the network is slow or the requested website responds slowly, you can set a time limit with the timeout parameter. You can also save the page source as an HTML file, and wrap the code in try...except for exception handling; see the earlier article on Python exception handling and program debugging.

Create a new .py file, enter the following code, and then run the file in a terminal, for example python 1.py. If the file cannot be found, the path is wrong; use cd to change into the directory that contains the file.


import urllib.request # import the module
try:
    baidu_url = 'http://www.baidu.com' # Baidu URL
    response = urllib.request.urlopen(baidu_url,timeout=1) # initiate the request with a 1-second timeout and return a response object
    msg = response.read().decode('utf-8') # call read to get the data and decode it
    save_file = open('baidu.html','w') # save the page as an HTML file; an absolute path can be given, here the file is written to the current working directory
    save_file.write(msg)
    save_file.close()
    print('over!')
except Exception as e:
    print(str(e))

[Screenshot: terminal output after running the script]

After execution, you can see that the HTML file has been generated in the working directory (here urllib_request). Open it and you will find that it matches what the browser shows.

[Screenshot: baidu.html opened in the browser]
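The bare except Exception in the code above catches everything. If you want to tell a timeout or an HTTP error apart from other failures, urllib.error (mentioned in the module list) provides more specific exception classes; a hedged variant of the same request:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=1)
    print('status:', response.getcode())
except urllib.error.HTTPError as e: # the server answered with an error status code
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e: # DNS failure, refused connection, or a timeout wrapped by urllib
    print('URL error:', e.reason)
except socket.timeout: # some timeouts are raised directly as socket.timeout
    print('request timed out')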

2.3 Simulate the browser to initiate a request

urllib.request.Request can imitate a browser visiting a URL, which helps get past some simple anti-crawling measures; once anti-crawling kicks in, the crawler can no longer fetch data.

Taking the JD.com website as an example, create a new .py file, enter the following code and run it.

import urllib.request # import the module
try:
    jd_url = 'http://www.jd.com' # JD.com URL
    request = urllib.request.Request(jd_url) # build a Request object
    response = urllib.request.urlopen(request) # initiate the request, open the URL and return a response object
    msg = response.read().decode('utf-8') # call read to get the data and decode it
    save_file = open('jd.html','w') # save the page as an HTML file; an absolute path can be given, here the file is written to the current working directory
    save_file.write(msg)
    save_file.close()
    print('over!')
except Exception as e:
    print(str(e))

[Screenshot: jd.html generated after running the script]

Through the above example, you can find that urlopen has the following two uses:

# both return a response object
http.client.HTTPResponse = urllib.request.urlopen(url,data,timeout,cafile,capath,context)
http.client.HTTPResponse = urllib.request.urlopen(urllib.request.Request)

2.4 Add request header

Adding a request header, often called UA disguise (User-Agent spoofing), makes the crawler imitate a browser more convincingly. Basic format: headers = {'User-Agent': 'request header string'}.

Still taking the Baidu URL as an example, right-click and select Inspect, open the Network panel, click the request for the Baidu URL and look at its headers; you will see the information below, and you can scroll down for more.

[Screenshot: request headers in the DevTools Network panel]

  • Request URL: The currently opened URL.
  • Request method: GET here; POST is also common (a minimal POST sketch follows this list).
  • Status code: 200 OK, indicating that the requested URL is opened successfully.
  • Cookie: used for user identification.
  • User-Agent: User agent.
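As mentioned in the list above, POST is the other common request method. Everything in this article uses GET, but urlopen sends a POST as soon as you pass a data argument; a minimal sketch, with httpbin.org used purely as a public test endpoint (not part of the original article):

import urllib.parse
import urllib.request

url = 'https://httpbin.org/post' # public echo service used only for testing
form = {'keyword': 'python crawler', 'page': 1}
data = urllib.parse.urlencode(form).encode('utf-8') # the POST body must be bytes

request = urllib.request.Request(url, data=data, method='POST')
response = urllib.request.urlopen(request, timeout=10) # send the request
print(response.read().decode('utf-8')) # httpbin echoes the submitted form back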

Taking the Zhihu hot list as an example, enter the following code to save the page as an HTML file.

import urllib.request
try:
    url = 'https://www.zhihu.com/billboard' # Zhihu hot list
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    requests = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(requests) # initiate the request and return a response object
    msg = response.read().decode('utf-8') # call read to get the data and decode it
    save_file = open('zhihubillboard.html','w') # save the page as an HTML file; an absolute path can be given, here the file is written to the current working directory
    save_file.write(msg)
    save_file.close()
    print('over!')
except Exception as e:
    print(str(e))

[Screenshot: zhihubillboard.html generated after running the script]

You can also add other information, such as cookies, to the request header to help the server identify the client.

import urllib.request
try:
    url = 'https://www.zhihu.com/billboard'  # Zhihu hot list
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Cookie': '_zap=a4ca5ae3-6096-4bfe-85d9-29ec029e5139; d_c0=AGDWXB3XchaPTh7JqVPL2DXt_HZ-eyoUOe8=|1678443923; __snaker__id=Rojpy7kzOo4QD1qH; YD00517437729195%3AWM_NI=jFh9c3lgUsZ2rn%2F4NoCwfBUKeMIG7qwJ7ogEPJYJx3HH2VSM74V4z8d%2FKTapadHED%2Fy2ZfMSiY6Tup5jQR7wYP6jBo%2Fqj4FqgLO9Rty4prHJCwZD%2BLUYEIzOsT3dQaIBZHc%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eea3e14285868c93d57ab2968fa3c15f928a8a87c16ab08e84adc67cf6eebaa3f92af0fea7c3b92aaa9aa8dae239b0b99b94d93ab2b2e1d5cd6581a9aaa3d841a79387a8b67bb596f990e64f949697abfb398c98feb7d247f5ec9ad5f934a8f185b5e47ca3bc88a6c725bbee9687d059ab87fab5f87cf7b69fb3b67d9b90fa8bc165edafa6d5c57ba59481dad180bbecc093cb4281a996a5aa4897878baaf945b18cfd85d46b8b9e96a9e237e2a3; YD00517437729195%3AWM_TID=RTtO5%2Fq3SYtEEAFVRBPAagEeaCt%2F26Ry; l_n_c=1; l_cap_id="OTNjZTgzNmQ4OTE5NDM2Y2FmMTJlOGY3NTlkYmE2MDI=|1703057324|7361fe71da9c72be879115517ae60a45a5cb9cc5"; r_cap_id="ZTVjMTgxZWQ2MTg3NGVlOGJjZDM3ODJjNmE0MDk2N2Y=|1703057324|a5fbcb7c58a06bccc58929001d0757465533852b"; cap_id="NzBhYTA3Mjc3ZTQ2NGZjYWI0OTI2MTY1MGU4OTQ4ZjI=|1703057324|70b91e0fc74f4df3a43ee78735905e9c4ea40e75"; n_c=1; _xsrf=lBmR1gcdJUT3in6ptJZhosQfxpcQATmo; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1703057329; captcha_session_v2=2|1:0|10:1703057329|18:captcha_session_v2|88:Sm9FR1ZqV1VZU0hpeHdoSXIvaWQyZ2FrKy84cTM4WVZ2QUtmVWxRZThPengrbFU0RkRhN2VDQWtWT1hURU0vQw==|9d94397475e180bfd9427e40143374afeb674533f873fc21c73d3c4cafc433b8; SESSIONID=o9FiVMzL4fXXqtarTVS3JsVpUea9CjwExnOnqJELoaq; JOID=VlwVAEshpRQX2k_EGibxBn9bpUAITphRfYgCulURw0hMqD-lbOs1lnXeS8Ma_QOeSua8OlDbmfBWrHqYFGnEZvI=; osd=VlgVCkwhoRQd3U_AGiz2Bntbr0cISphbeogGul8Ww0xMojilaOs_kXXaS8kd_QeeQOG8PlDRnvBSrHCfFG3EbPU=; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1703057423; gdxidpyhxdE=jcNXJIKi%5Csz7c2u2UPPNoESbzjfffl0nXY0b7rcOozmIRo20NkiY%5C%5C%2BJzbUf7rZPNZpJqaCmojnY9cwqPlBVuBw0mnPEWhzZNuvxkWzPAQWG6U7SJ26sdhgRU4RWgaDo%5Ce0oI%5Cc7gBn%5Ckm%2B5nBOG7lDlvdSiVEExPGdrBNfa5%2BTUt%2BxA%3A1703059071056; KLBRSID=fe78dd346df712f9c4f126150949b853|1703058399|1703057324'
    }
    requests = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(requests)  # initiate the request and return a response object
    msg = response.read().decode('utf-8')  # call read to get the data and decode it
    save_file = open('zhihubillboard1.html', 'w')  # save the page as an HTML file; an absolute path can be given, here the file is written to the current working directory
    save_file.write(msg)
    save_file.close()
    print('Response code:',response.getcode()) # get the response code
    print('Actual URL:',response.geturl()) # get the actual URL
    print(response.info()) # get the HTTP headers returned by the server
    print('over!')
except Exception as e:
    print(str(e))

2.5 Authentication login

For HTTPS websites whose SSL certificates cannot be verified, crawlers often report the following error:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)>

There are usually two approaches. One is to configure SSL to skip certificate verification and continue the visit. As shown below, the China Railway 12306 website is accessed without logging in.

import urllib.request
import ssl
# configure ssl to skip certificate verification and continue the visit
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://www.12306.cn/mormhweb/'
response = urllib.request.urlopen(url=url)
msg = response.read().decode('utf-8')
print(msg)

[Screenshot: printed HTML of the 12306 page]
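Overriding ssl._create_default_https_context turns off verification for every HTTPS request in the process. If you prefer to limit the effect to a single call, urlopen also accepts a context argument; a small sketch of the same idea:

import ssl
import urllib.request

context = ssl._create_unverified_context() # unverified context used only for this request
url = 'https://www.12306.cn/mormhweb/'
response = urllib.request.urlopen(url, context=context)
print(response.read().decode('utf-8')[:200]) # print only the first 200 characters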

The other approach is to log in with an account and then access the site. The common steps are as follows:

  1. Create an account and password management object.
  2. Add account and password.
  3. Get a handler object.
  4. Get the opener object.
  5. Use the open() function to initiate a request.

Taking the Cnblogs website as an example, register an account and use it to crawl.

import urllib.request
try:
    url = 'https://www.cnblogs.com/' # Cnblogs URL
    user = 'user'
    password = 'password'
    pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm() # create a password manager object
    pwdmgr.add_password(None,url,user,password) # add the account and password
    auto_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr) # get a handler object
    opener = urllib.request.build_opener(auto_handler) # get an opener object
    response = opener.open(url) # initiate the request with the open function
    msg = response.read().decode('utf-8')
    save_file = open('blogs.html','w')
    save_file.write(msg)
    save_file.close()
    print('over!')
except Exception as e:
    print(str(e))

[Screenshot: blogs.html generated after running the script]

3. Download pictures and videos

The basics of urllib were explained above, and page source code for different URLs was crawled and saved locally. Raw source code is not very intuitive to look at, so next we go a step further and use urllib to crawl the image and video files we are more familiar with.

Take Pear Video (https://www.pearvideo.com/) as an example. Click into a video, right-click and select Inspect, then click the arrow in the upper-left corner of DevTools and click the element you need to jump to its source code, where you can copy the link.

[Screenshot: locating the image link in DevTools]

import urllib.request
import os
url = 'https://image2.pearvideo.com/cont/20231208/cont-1790138-12750419.jpg' # image url
requests = urllib.request.Request(url) # build the request object
response = urllib.request.urlopen(requests) # send the request and return a response object
msg = response.read() # read the data
if not os.path.exists('download_file'): # create the directory under the current path if it does not exist
    os.makedirs('download_file')
save_file = open('download_file/2.jpg', 'wb') # save the image; mode 'wb' means binary write
save_file.write(msg) # write the image data
save_file.close() # close the file
print('over!')

Here the os module is used to create a new directory, download_file, to store the downloaded files. Run the code above to download the image.

[Screenshot: downloaded image in the download_file directory]

Similarly, we quickly locate a video source link.

[Screenshot: locating the video link in DevTools]

import urllib.request
import os
url = 'https://video.pearvideo.com/mp4/short/20231123/cont-1789641-16009997-hd.mp4' # video url
requests = urllib.request.Request(url) # build the request object
response = urllib.request.urlopen(requests) # send the request and return a response object
msg = response.read() # read the data
if not os.path.exists('download_file'): # create the directory under the current path if it does not exist
    os.makedirs('download_file')
save_file = open('download_file/1.mp4', 'wb') # save the video; mode 'wb' means binary write
save_file.write(msg) # write the video data
save_file.close() # close the file
print('over!')

Continue to save the file in the download_file directory, and the result will be as shown below.

[Screenshot: downloaded video in the download_file directory]
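response.read() above loads the whole video into memory before writing it out. For larger files it is safer to read and write in chunks; a hedged variant of the same download (the chunk size is an arbitrary choice):

import os
import urllib.request

url = 'https://video.pearvideo.com/mp4/short/20231123/cont-1789641-16009997-hd.mp4' # video url from above
os.makedirs('download_file', exist_ok=True) # create the directory if it does not exist

response = urllib.request.urlopen(urllib.request.Request(url))
with open('download_file/1_chunked.mp4', 'wb') as save_file:
    while True:
        chunk = response.read(1024 * 64) # read 64 KB at a time instead of the whole file
        if not chunk:
            break
        save_file.write(chunk)
print('over!')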

The above downloads and saves files with urllib.request together with open. urllib also provides a urlretrieve method that does the same job: urllib.request.urlretrieve(url, file_path). See the example below for details.

import urllib.request

if __name__ == '__main__':
    url1 = 'https://www.pearvideo.com/video_1789641'  # Pear Video video page URL
    url2 = 'https://image2.pearvideo.com/cont/20231208/cont-1790138-12750419.jpg'  # image url
    url3 = 'https://video.pearvideo.com/mp4/short/20231123/cont-1789641-16009997-hd.mp4'  # video url
    urllib.request.urlretrieve(url1, 'download_file/url1.html')  # save url1 as an HTML file
    urllib.request.urlretrieve(url2, 'download_file/url2.jpg')  # save url2 as an image
    urllib.request.urlretrieve(url3, 'download_file/url3.mp4')  # save url3 as a video

[Screenshot: files saved by urlretrieve in download_file]
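urlretrieve also accepts an optional reporthook callback that is called as data blocks arrive, which can be used to show download progress. A minimal sketch (the progress format and output file name are made up for the example, and download_file is assumed to exist already):

import urllib.request

def show_progress(block_num, block_size, total_size):
    """Called by urlretrieve after each block; print a percentage when the total size is known."""
    if total_size > 0:
        percent = min(block_num * block_size * 100 / total_size, 100)
        print(f'\rdownloaded {percent:.1f}%', end='')

url = 'https://video.pearvideo.com/mp4/short/20231123/cont-1789641-16009997-hd.mp4'
urllib.request.urlretrieve(url, 'download_file/url3_progress.mp4', reporthook=show_progress)
print('\nover!')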

4. Extension: Universal Video Download

you-get official website: https://you-get.org

Install it from the terminal with pip install you-get; it can be used directly once the installation finishes. The official website lists the supported sites, along with related documentation and some download examples.

[Screenshot: you-get official website with the list of supported sites]

Take a supported NetEase video as an example: right-click, select Inspect, and locate the link address of the video.

[Screenshot: locating the NetEase video link in DevTools]

Enter the following command in the terminal to download. The -o option sets the save path and -O sets the name of the downloaded file; if -O is omitted, the name is taken from the end of the link by default. Other optional parameters are documented on the official website.

you-get -o download_file -O wangyi 'https://flv3.bn.netease.com/d007d7e6e21113cbebd6ee56b312da000d3a18a34ed6f9e7a811c1aa017eda5705f4f1d6c53d0d0ff2258604ef5ce15c7673b36396bb3b4f500a1ee36ee64390bf1184a2f095c71e7704e67f17f928c03739fc8a26b9db66d943f36641e82fd9a992304050488f362fe95ad1d6b6fff4dc037de346c5be47.mp4'

[Screenshot: you-get download output in the terminal]

Typing the full you-get command on the command line every time can be hard to remember or inconvenient, so you can wrap it in Python and imitate its command-line invocation. Create a new file 12.py and enter the following code.

import sys
from you_get import common
directory = 'download_file'
url = 'https://flv3.bn.netease.com/d007d7e6e21113cbebd6ee56b312da000d3a18a34ed6f9e7a811c1aa017eda5705f4f1d6c53d0d0ff2258604ef5ce15c7673b36396bb3b4f500a1ee36ee64390bf1184a2f095c71e7704e67f17f928c03739fc8a26b9db66d943f36641e82fd9a992304050488f362fe95ad1d6b6fff4dc037de346c5be47.mp4'
sys.argv = ['you-get','-o',directory,'-O','3','--format=flv',url] # write the you-get command as a list and assign it to sys.argv
common.main()

Execute python 12.py directly on the command line, and the new video file will be saved in the download_file directory.

[Screenshot: downloaded video in the download_file directory]

This wrapping technique is very useful. In real work, when there are many tasks that need to run on the command line, you can put them in a list and then loop over it, simulating each command-line invocation, as shown below.
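A minimal sketch of that idea, following the same pattern as 12.py; the URLs in the list are placeholders to be replaced with real video addresses:

import sys
from you_get import common

directory = 'download_file'
urls = [
    'https://example.com/video1.mp4', # placeholder URL
    'https://example.com/video2.mp4', # placeholder URL
]

for index, url in enumerate(urls, start=1):
    # rebuild the command-line arguments for each task, exactly as 12.py does
    sys.argv = ['you-get', '-o', directory, '-O', f'video_{index}', url]
    common.main()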


Origin blog.csdn.net/weixin_50357986/article/details/135116908