Python Web Crawler Frameworks: Common Web Crawling Techniques


1. Introduction

  • Personal homepage: ζ Xiaocaiji
  • Hello everyone, I am Xiaocaiji ("little rookie"). Let's learn about Python's web crawler frameworks and the common techniques used by web crawlers.
  • If this article helps you, you are welcome to follow, like, and bookmark it (one-click triple).

2. Python network request

   Resolving a URL address and downloading the corresponding web page are two essential, core functions of a web crawler, and both come down to talking HTTP. This article introduces three common ways to make HTTP requests in Python: urllib, urllib3, and requests.

1. urllib module

   urllib is a module built into Python. It provides the urlopen() method, which sends a network request to a specified URL and retrieves the data. urllib is organized into several submodules; their names and purposes are listed in the table below:

Module name         Description
urllib.request      Defines the methods and classes for opening URLs (mainly HTTP), including authentication, redirection, cookies, and so on.
urllib.error        Contains the exception classes raised by urllib.request, with URLError as the base exception class.
urllib.parse        Provides functionality in two broad categories: URL parsing and URL quoting.
urllib.robotparser  Parses robots.txt files.
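
   As a quick supplementary sketch (not part of the original article), the urllib.parse submodule mentioned in the table can be used on its own to split a URL into its components and to build a query string from a dictionary:

from urllib import parse

# Split a URL into its components
parts = parse.urlparse("http://httpbin.org/get?key1=value1&key2=value2")
print(parts.scheme)  # http
print(parts.netloc)  # httpbin.org
print(parts.query)   # key1=value1&key2=value2

# Build a URL-encoded query string from a dictionary
query = parse.urlencode({"word": "hello", "page": 1})
print(query)  # word=hello&page=1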

   A simple example of sending a request and reading the page content through the urllib.request module is as follows:

import urllib.request  # import the module

# Open the web page to be crawled
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read()  # read the page source
print(html)  # print what was read

   In the example above, Baidu's home page is fetched with a GET request. The following example uses the urllib.request module to send a POST request and retrieve the page content:

import urllib.request
import urllib.parse

# Encode the form data with urlencode, then convert it to bytes as UTF-8
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf8")

# Open the web page to be crawled
response = urllib.request.urlopen("http://httpbin.org/post", data=data)

html = response.read()  # read the page source
print(html)  # print what was read

Note: The demonstration uses the site http://httpbin.org/post, which is designed for practicing HTTP requests; it echoes back what it receives and can simulate all kinds of request operations, making it a convenient playground for the urllib module.
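
   urllib can also attach custom request headers (a topic revisited in section 3) by wrapping the URL in a urllib.request.Request object, and the raw bytes returned by read() can be decoded to text. A small additional sketch, not part of the original example:

import urllib.request

# Wrap the URL in a Request object so that headers can be attached
request = urllib.request.Request(
    "http://httpbin.org/get",
    headers={"User-Agent": "Mozilla/5.0"},  # a minimal User-Agent for illustration
)

with urllib.request.urlopen(request) as response:
    print(response.status)                  # HTTP status code
    html = response.read().decode("utf-8")  # decode the raw bytes to text
    print(html)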


2. urllib3 module

   urllib3 is a powerful, well-organized HTTP client library for Python. Much of the Python ecosystem already uses urllib3. It provides many important features that are missing from the Python standard library:

  • Thread safety
  • Connection pooling
  • Client-side SSL/TLS verification
  • File uploads with multipart encoding
  • Helpers for retrying requests and dealing with HTTP redirects
  • Support for gzip and deflate encoding
  • Proxy support for HTTP and SOCKS
  • 100% test coverage

   The sample code for sending network requests through the urllib3 module is as follows:

import urllib3

# Create a PoolManager object, which handles all the details of
# connection pooling and thread safety
http = urllib3.PoolManager()

# Send a request to the page to be crawled
response = http.request("GET", "https://www.baidu.com/")
print(response.data)  # print what was read

   A POST request can also be used to retrieve the page content; the key code is as follows:

# Send a POST request to the page to be crawled
response = http.request("POST",
                        "http://httpbin.org/post",
                        fields={"word": "hello"})

Note: Before using the urllib3 module, install it with the pip install urllib3 command.
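
   As an optional extension not covered in the original text, urllib3 also lets you attach a timeout and a retry policy to a request. A minimal sketch, assuming the module is installed:

import urllib3
from urllib3.util import Retry, Timeout

http = urllib3.PoolManager()

# Limit how long connecting and reading may take, and retry up to 3 times
response = http.request(
    "GET",
    "https://www.baidu.com/",
    timeout=Timeout(connect=1.0, read=2.0),
    retries=Retry(total=3, redirect=2),
)

print(response.status)                # HTTP status code
print(response.data.decode("utf-8"))  # response body decoded to text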


3. requests module

   requests is another way to make HTTP requests in Python. It is a third-party module that is much simpler and more user-friendly than urllib. Before using it, install it with the pip install requests command. The features of requests are as follows:

  • Keep-Alive and connection pooling
  • International domain names and URLs
  • Sessions with persistent cookies
  • Browser-style SSL verification
  • Automatic content decoding
  • Basic/Digest authentication
  • Elegant key/value cookies
  • Automatic decompression
  • Unicode response bodies
  • HTTP(S) proxy support
  • Chunked file uploads
  • Streaming downloads
  • Connection timeouts
  • Chunked requests
  • .netrc support

   Taking a GET request as an example, the following sample code prints various pieces of the request and response information:

import requests

response = requests.get("http://www.baidu.com")

print(response.status_code)  # print the status code
print(response.url)  # print the request URL
print(response.headers)  # print the headers
print(response.cookies)  # print the cookies
print(response.text)  # print the page source as text
print(response.content)  # print the page source as a byte stream
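
   If response.text comes back garbled, the encoding declared by the server may be wrong. requests exposes an apparent_encoding guessed from the body itself, which can be assigned back before reading the text; a small supplementary sketch:

import requests

response = requests.get("http://www.baidu.com")

print(response.encoding)  # encoding taken from the response headers
response.encoding = response.apparent_encoding  # use the encoding detected from the body instead
print(response.text[:200])  # the first 200 characters of the decoded page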

   The sample code for sending an HTTP network request in the form of a POST request is as follows:

import requests

data = {"word": "hello"}  # form parameters

# Send a request to the page to be crawled
response = requests.post("http://httpbin.org/post", data=data)

print(response.content)  # print the page source as a byte stream
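
   Besides form data, requests can send a JSON body through the json keyword argument, which serializes the dictionary and sets the Content-Type header automatically. A supplementary sketch, not part of the original example:

import requests

# json= sends a JSON body instead of form-encoded data
response = requests.post("http://httpbin.org/post", json={"word": "hello"})

# httpbin echoes the parsed JSON body back under the "json" key
print(response.json()["json"])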

   If the request URL carries parameters after a ? (question mark), for example httpbin.org/get?key=val, the requests module lets you supply them through the params keyword argument as a dictionary. For example, to pass "key1=value1" and "key2=value2" to "httpbin.org/get", you can use the following code:

import requests

payload = {"key1": "value1", "key2": "value2"}  # parameters to pass

# Send a request to the page to be crawled
response = requests.get("http://httpbin.org/get", params=payload)

print(response.content)  # print the page source as a byte stream
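
   Since httpbin echoes the request back as JSON, the result can also be inspected with response.json(), and response.url shows how requests encoded the parameters into the query string. A small follow-up sketch:

import requests

payload = {"key1": "value1", "key2": "value2"}  # parameters to pass
response = requests.get("http://httpbin.org/get", params=payload)

print(response.url)             # e.g. http://httpbin.org/get?key1=value1&key2=value2
print(response.json()["args"])  # the parameters echoed back by httpbin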


3. Request headers processing

   Sometimes when you request a page, the server returns a 403 error no matter whether you use GET, POST, or any other method. This usually means the server has rejected your request because the site uses anti-crawler measures to prevent malicious data collection. In that case, you can get through by simulating a browser's request headers. The following uses the requests module as an example to show how to handle request headers; the specific steps are as follows:

   (1) View the header information in the browser's network monitor: open the target page in Google Chrome, open the network monitor with the <Ctrl+Shift+I> shortcut, and refresh the page. The network monitor shows the captured requests, as in the figure below.

[Figure: Chrome network monitor showing the captured requests]


   (2) Select the first request; its request headers appear in the Headers panel on the right. Copy that information, as shown in the following figure:

[Figure: request headers displayed in the Headers panel]


   (3) Implement the code: first define the URL to be crawled, then create the header information, send the request and wait for the response, and finally print the page source. The implementation code is as follows:

import requests

url = "https://www.baidu.com/"  # the page to be crawled

# Create the header information; copy the User-Agent of your own browser version
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}

response = requests.get(url, headers=headers)  # send the network request

print(response.content)  # print the page source as a byte stream
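
   When many pages are crawled with the same headers, a requests.Session can store them once and reuse the underlying connection. A sketch added here for illustration:

import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
})

response = session.get("https://www.baidu.com/")
print(response.status_code)  # print the status code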

4. Network timeout

  When you visit a web page and it does not respond for a long time, the system decides that the page has timed out and it cannot be opened. The following code simulates a network timeout:

import requests

# Send the request 50 times in a loop
for a in range(50):
    try:
        # set the timeout to 0.5 seconds
        response = requests.get("https://www.baidu.com/", timeout=0.5)
        print(response.status_code)  # print the status code

    except Exception as e:
        print("exception: " + str(e))  # print the exception message

  The print result is shown in the figure below:

[Figure: console output mixing status codes with timeout exceptions]

Explanation: The code above sends 50 requests in a loop with the timeout set to 0.5 seconds, so whenever the server fails to respond within 0.5 seconds the request is treated as timed out and the exception message is printed to the console. Based on such test results, you can choose an appropriate timeout value for different situations.

  Speaking of exceptions, the requests module also provides three commonly used exception classes; the sample code is as follows:

import requests

# Import three exception classes from requests.exceptions
from requests.exceptions import ReadTimeout, HTTPError, RequestException

# Send the request 50 times in a loop
for a in range(50):
    try:
        # set the timeout to 0.5 seconds
        response = requests.get("https://www.baidu.com/", timeout=0.5)
        print(response.status_code)  # print the status code
    except ReadTimeout:  # timeout exception
        print("timeout")
    except HTTPError:  # HTTP exception
        print("httperror")
    except RequestException:  # request exception
        print("reqerror")

5. Proxy service

  When crawling, it often happens that pages you could fetch a moment ago suddenly become unreachable, because the site's anti-crawling measures have blocked your IP address. A proxy service can get around this. To configure a proxy you first need a proxy address, for example "122.114.43.113" with port "808", giving the full form "122.114.43.113:808". The sample code is as follows:

import requests

proxy = {"http": "122.114.43.113:808", "https": "122.114.43.113:808"}  # set the proxy IP and its port

# Send a request to the page to be crawled
response = requests.get("https://www.baidu.com/", proxies=proxy)

print(response.content)  # print the page source as a byte stream

Note: The proxy IP in the example is a free one, so its lifetime is limited and the address eventually becomes invalid. When the address is invalid or wrong, the console shows an error such as the one in the figure below:

[Figure: proxy connection error shown in the console]
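
  If the proxy requires authentication, the username and password can be embedded in the proxy URL. The values below are placeholders for illustration only, not a working proxy:

import requests

# user, password, IP and port here are placeholders, not real credentials
proxy = {
    "http": "http://user:password@122.114.43.113:808",
    "https": "http://user:password@122.114.43.113:808",
}

response = requests.get("https://www.baidu.com/", proxies=proxy)
print(response.status_code)  # print the status code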


6. BeautifulSoup for HTML analysis

  BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides simple functions for navigating, searching, and modifying the parse tree, among other things. Its search and extraction features are powerful and convenient, and can save a programmer hours or days of work.

1. Installation of BeautifulSoup

  Download the BeautifulSoup source code from https://www.crummy.com/software/BeautifulSoup/bs4/download/, open a command prompt (cmd), change into the directory where BeautifulSoup4-4.8.0 was extracted, and run the python setup.py install command. (Alternatively, you can install it directly with pip install beautifulsoup4, and install the lxml parser used below with pip install lxml.)



2. Using BeautifulSoup

  With BeautifulSoup installed, the following shows how to parse HTML with it. The specific steps are as follows:

  (1) Import the bs4 library and create a string that simulates HTML code; the code is as follows:

from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 library

# Create a string that simulates HTML code
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


  (2) Create a BeautifulSoup object, specifying lxml as the parser, and finally print the parsed HTML code to the console; the code is as follows:

# Create a BeautifulSoup object and parse the page
# (the lxml parser must be installed: pip install lxml)
soup = BeautifulSoup(html_doc, features="lxml")
print(soup)  # print the parsed HTML code

  The running result is shown in the figure below:

[Figure: the parsed HTML printed to the console]


Note: If the code in the html_doc string is saved to a file named index.html, you can parse it by opening the HTML file, and the output can be formatted with the prettify() method. The code is as follows:

# Create a BeautifulSoup object from the HTML file to be parsed
soup = BeautifulSoup(open("index.html"), "lxml")

print(soup.prettify())  # print the formatted code
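
  Once the document is parsed, BeautifulSoup's search methods can extract data from it. For example, reusing the html_doc string defined in step (1) above, the following sketch pulls out the title and all the links, which is what later stages of a crawler typically do:

from bs4 import BeautifulSoup

# html_doc is the string created in step (1) above
soup = BeautifulSoup(html_doc, features="lxml")

print(soup.title.string)  # the text of the <title> tag
print(soup.find("p", class_="title").get_text())  # the text of the first <p class="title">

# Iterate over all <a> tags and print their id and href attributes
for link in soup.find_all("a"):
    print(link.get("id"), link.get("href"))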

  This concludes our look at Python web crawler frameworks and the common techniques used by web crawlers. Thank you for reading. If the article helped you, please follow, like, and bookmark it (one-click triple).


Original article: blog.csdn.net/weixin_45191386/article/details/131484413