Python web crawler from 0 to 1 (1): Detailed introduction to the Requests library

In a web crawler, network requests are the foundation: without requests and responses, there is no data for the later parsing stages to work on. In Python, network requests are most commonly made with the Requests library. This article introduces the Requests library and its basic usage.

Getting started with Requests

Introduction to the Requests library

Requests is a concise, elegant third-party Python library whose interface matches the way people naturally think about HTTP, which is why it is so popular among developers as an HTTP library. It supports many features in everyday use, such as Keep-Alive persistent connections, cookie-based sessions, SSL verification, automatic content decoding, HTTP(S) proxy support, streaming downloads, and chunked file uploads. For other features, refer to the official documentation, Requests: HTTP for Humans.

Installation of the Requests library

pip install requests

The main methods of the Requests library

The Requests library has 7 main methods, most of which correspond to the request methods of the HTTP(S) protocol:

Method	Description
requests.request()	Constructs a request; the base method that supports all the methods below
requests.get()	Retrieves a resource; corresponds to HTTP GET
requests.head()	Retrieves only the response headers of a resource; corresponds to HTTP HEAD
requests.post()	Submits a POST request to the site; corresponds to HTTP POST
requests.put()	Submits a PUT request to the site; corresponds to HTTP PUT
requests.patch()	Submits a partial modification request to the site; corresponds to HTTP PATCH
requests.delete()	Submits a resource deletion request to the site; corresponds to HTTP DELETE

The Requests library supports other HTTP methods as well; they are rarely used, so they are not listed here.
In fact, all of the other methods are implemented by calling the base method requests.request().

Use the Requests library to construct a basic request

The most commonly used request method for accessing web pages is GET. Below we use the GET method to construct a basic request.

Requests.get() method

Consider the following code:

import requests

r = requests.get(url)
# url is a string variable holding the URL of the resource to request

With this call, the Requests library constructs a Request object that asks the server for the resource, and the call returns a Response object, r, containing the server's resource.

Complete usage format of Requests.get() method

requests.get(url,params=None,**kwargs)

where:

url: the URL (Uniform Resource Locator) of the resource to request
params: optional, default None; extra parameters appended to the URL, given as a dictionary or a byte sequence
**kwargs: optional; 12 parameters that control the request, described below

In fact, requests.get() simply calls requests.request(). Inside the wrapper, the call is:

requests.request('get', url, params=params, **kwargs)

The get method singles out params as a named keyword argument with a default value and passes everything else through **kwargs. requests.request() therefore accepts the 13 control parameters covered in this article (including params), which it receives as a dictionary of keyword arguments.
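The delegation from requests.get() to requests.request() can be observed without touching the network. The sketch below temporarily swaps requests.api.request for a mock and records how requests.get() calls it. The timeout value is a placeholder chosen for illustration, and patching requests.api.request assumes the wrapper lives in the requests.api module.

```python
from unittest import mock

import requests

# Replace the underlying request() function so that no network traffic is sent.
with mock.patch("requests.api.request") as fake_request:
    requests.get("https://www.baidu.com", params={"a": 1}, timeout=5)

# Inspect how requests.get() forwarded its arguments.
call_args, call_kwargs = fake_request.call_args
print(call_args)               # positional arguments: method name and URL
print(call_kwargs["params"])   # params is forwarded as a keyword argument
print(call_kwargs["timeout"])  # everything else travels through **kwargs
```

The mock only records the call; in normal use requests.get() returns the Response produced by requests.request().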

Response object

The request above returns a Response object, which holds the information the server returned about the resource. Its main attributes are:

Attribute	Description
r.status_code	Status code returned for the HTTP request
r.text	HTTP response body as a string
r.encoding	Encoding of the response body as declared in the HTTP headers; it is also the encoding currently used to decode r.text, and it can be reassigned
r.apparent_encoding	Encoding inferred by analyzing the response body (a fallback encoding)
r.content	HTTP response body in binary form

Use the get method to construct the request

Let's use an example to show how requests.get() constructs a GET request, taking Baidu as the target:

r = requests.get("https://www.baidu.com")

View the encoding information of the response object

# the comment after each line is its return value
r.encoding
# 'ISO-8859-1'
r.apparent_encoding
# 'utf-8'

The encoding taken from the response headers is wrong here: the headers do not specify a charset, so Requests falls back to the default, ISO-8859-1, which cannot display Chinese correctly. To display Chinese correctly, set r.encoding to the actual encoding of the body, utf-8.

Once the body encoding is set correctly, the response body can be read correctly through r.text, and the response headers are available as well.
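The effect of r.encoding on r.text can be reproduced offline. The sketch below builds a Response by hand instead of contacting Baidu; assigning to the private _content attribute is a shortcut used only for illustration, and the sample text is an assumption, not the real page content.

```python
import requests

SAMPLE = "百度一下，你就知道"

# Build a Response by hand so the example runs without network access.
r = requests.Response()
r.status_code = 200
r._content = SAMPLE.encode("utf-8")  # private attribute, illustration only

# Simulate the fallback encoding used when the header names no charset.
r.encoding = "ISO-8859-1"
garbled = r.text

# Switch to the real encoding of the body; r.text is decoded again on access.
r.encoding = "utf-8"
fixed = r.text

print(garbled == SAMPLE)  # decoded with the wrong encoding
print(fixed == SAMPLE)    # decoded with the right encoding
```

With a real response, the usual one-line fix is r.encoding = r.apparent_encoding, as in the framework later in this article.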

Common code framework for crawling web pages

The common code framework is a reusable piece of code for crawling web pages. Customizing it lets you fetch page content reliably and flexibly, and also collect links to other pages. Calls to requests.get() will inevitably raise exceptions from time to time, so before studying the framework you need to understand the exceptions of the Requests library and how to handle them.

Requests library exception handling

When sending requests with the Requests library, exceptions can arise at various stages. If they are not handled, the program may terminate abnormally, so recognizing and handling exceptions is a necessary part of crawler development.

Exception	Description
requests.ConnectionError	Network connection error, such as failure to establish a connection, DNS resolution failure, or connection refused (not a 4xx response code)
requests.HTTPError	HTTP error (must be raised manually, see the note below)
requests.URLRequired	A URL is missing
requests.TooManyRedirects	The number of redirects exceeded the limit
requests.ConnectTimeout	Timed out while connecting to the server
requests.Timeout	The request timed out; covers both the connection phase and the phase after the connection is established

ConnectionError refers to exceptions raised at the network (TCP) layer; if it is not caught, it terminates the program. HTTPError refers to errors at the HTTP (application) layer and has to be raised manually by calling r.raise_for_status(), which raises the exception when the status code is a 4xx or 5xx error code.
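A quick offline check shows which status codes make raise_for_status() raise, and how the timeout exceptions relate to each other. The helper name raises_http_error is hypothetical, and the hand-built Response objects stand in for real server replies.

```python
import requests


def raises_http_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    # A hand-built Response stands in for a real server reply (illustration only).
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False


for code in (200, 302, 404, 500):
    print(code, raises_http_error(code))

# ConnectTimeout inherits from both ConnectionError and Timeout,
# so an except clause for either of them also catches it.
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))
print(issubclass(requests.ConnectTimeout, requests.Timeout))
```

Only the 4xx and 5xx codes raise; a 3xx response that was not followed passes through raise_for_status() silently.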

Common code framework

import requests


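def get_uri(url):
    try:
        # allow_redirects=False returns 3xx responses as-is instead of following them
        r = requests.get(url, allow_redirects=False)
        # raise requests.HTTPError for 4xx and 5xx status codes
        r.raise_for_status()
        # decode the body with the encoding inferred from its content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # base class of the exceptions raised by Requests
        return "An error has been thrown"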


print("Requests.get Skeleton")
print(get_uri("http://192.168.0.6:8080/a"))

The get_uri function in the example is a simple, self-contained version of the common crawler code framework.
In the example the server returned a 404 error, which the raise_for_status() method turned into an exception. If the URL is changed to a non-existent server or an invalid URL, requests.get() raises the error directly, without needing that statement. Because the try-except statement captures the exception, the program does not exit, and the error can be reported and handled in a following step.
If the URL in the example is changed to a real one, the source code of the web page is printed directly and can be analyzed and processed further.
Narrowing the conditions of the except statement makes it possible to distinguish different errors and apply a different remedy to each.
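One way to narrow the except clauses is sketched below. The function name fetch, the timeout default, and the error labels are assumptions made for illustration; a real crawler would pick its own handling for each case.

```python
import requests


def fetch(url, timeout=10):
    """Return (ok, payload): page text on success, a short error label otherwise."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return True, r.text
    except requests.Timeout:
        # Listed first: ConnectTimeout is also a ConnectionError.
        return False, "timeout"
    except requests.ConnectionError:
        return False, "connection error"
    except requests.HTTPError as err:
        return False, f"http error {err.response.status_code}"
    except requests.RequestException:
        # Any other Requests error, such as a malformed URL.
        return False, "invalid request"


# A URL without a scheme is rejected before anything is sent.
print(fetch("not-a-url"))
```

The order of the clauses matters: the most specific exceptions come first and requests.RequestException, the common base class, comes last as a catch-all.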

Use of the main methods of the Requests library

Basic request methods and parameters

We start with the base method of the Requests library, requests.request(). The other main request methods in the library are all implemented by calling requests.request().

requests.request('Method', url, **kwargs)

where:

Method is the name of the HTTP method to use
url is the uniform resource locator of the requested resource
**kwargs holds the additional request parameters. They are the same for every method built on requests.request(); some methods declare a few common ones as named keyword arguments, which makes no practical difference

Parameter	Description
params	Added to the URL to be visited as query parameters
data	Sent as the body of the request, form-encoded by default; can be a dictionary, a list of tuples, etc.
json	Request body in JSON format
headers	Frequently used; defines the header fields of the request, as a dictionary
cookies	Cookies sent with the HTTP request; a dictionary or a CookieJar
auth	Used for HTTP authentication, as a tuple
files	Request body in the form of files; a dictionary whose values are file objects
timeout	Timeout of the request, in seconds
proxies	Proxy servers to use; a dictionary that specifies a proxy for each protocol
allow_redirects	Whether redirects are followed; boolean, default True (follow redirects)
stream	Streaming download; boolean, default False (the body is downloaded immediately); when True, the body is fetched only when it is accessed
verify	SSL certificate verification switch; default True (verify the server certificate)
cert	Path to a local SSL certificate

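Note: the data parameter puts the parameters into the body of the request, while params adds them to the URL to be visited. For example, with params={'a': 1} and url='https://www.baidu.com/', the URL finally requested is 'https://www.baidu.com/?a=1'.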
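The difference between params and data can be checked without sending anything, by preparing a request and inspecting it. This is a minimal sketch; the data value {'b': 2} is an assumption added to contrast with the params value from the note.

```python
import requests

# Prepare the request without sending it, then inspect what would go on the wire.
req = requests.Request(
    "POST",
    "https://www.baidu.com/",
    params={"a": 1},  # becomes part of the URL
    data={"b": 2},    # becomes the form-encoded body
)
prepared = req.prepare()

print(prepared.url)   # https://www.baidu.com/?a=1
print(prepared.body)  # b=2
```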

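headers is a frequently used parameter that defines the header fields of the request and accepts a dictionary. If it is not set, the following default headers are sent: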

{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
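The default headers can also be read programmatically, and overriding one field leaves the others in place. The sketch below prepares a request through a Session without sending it; the User-Agent string "my-crawler/0.1" is made up for illustration.

```python
import requests

# Headers that Requests sends when none are supplied.
defaults = requests.utils.default_headers()
print(dict(defaults))

# Overriding one field keeps the remaining defaults in place.
# "my-crawler/0.1" is a made-up User-Agent used only for illustration.
session = requests.Session()
req = requests.Request(
    "GET", "https://www.baidu.com", headers={"User-Agent": "my-crawler/0.1"}
)
prepared = session.prepare_request(req)
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Encoding"])
```

The exact default values depend on the installed version of Requests, so the printed User-Agent will differ from the 2.24.0 shown above.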

Example of the files parameter

files = {'file1': open('data.xls', 'rb')}
r = requests.post('https://192.168.0.6:8080/post', files=files)
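What the files parameter does to a request can be inspected offline. In this sketch an in-memory io.BytesIO object stands in for the opened data.xls file, and its contents are placeholder bytes; the request is prepared but never sent.

```python
import io

import requests

# An in-memory file stands in for open('data.xls', 'rb') so the sketch is self-contained.
fake_file = io.BytesIO(b"example spreadsheet bytes")

req = requests.Request(
    "POST",
    "https://192.168.0.6:8080/post",
    files={"file1": ("data.xls", fake_file)},  # (filename, file object)
)
prepared = req.prepare()

# files= switches the body to multipart/form-data with a generated boundary.
print(prepared.headers["Content-Type"])
print(b'filename="data.xls"' in prepared.body)
```

When uploading a real file, opening it in a with block ensures the file handle is closed after the request completes.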

Example of the proxies parameter

proxies = {
    'http': 'http://192.168.0.4',
    'https': 'https://192.168.0.5',
}

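The get method was explained above and is not repeated here. The following sections introduce some of the other methods of the Requests library.

Requests.post() method

The Requests.post() method sends data to the server with the POST method of the HTTP protocol. Readers can look up the HTTP POST method itself on their own; only the concrete usage of Requests.post() is covered below.

Encapsulation of the post method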

def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)

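As you can see, the post method lists the data and json parameters separately: they represent the body of the POST request encoded as a web form and the body encoded as JSON, respectively. The body is form-encoded when data is a dictionary; to send data as-is (without the default form encoding), pass a string as data directly.

The following three examples use web-form format, JSON format, and a raw string as the body of a POST request, followed by excerpts of what the server received.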

payload = {'b': 2}
r = requests.post("http://192.168.0.6:8080/post", data=payload)

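import json
payload = {'b': 2}
# json= serializes its value itself, so passing json.dumps(payload) encodes the data twice
# and the server receives a JSON string; pass json=payload to send a JSON object instead
r = requests.post("http://192.168.0.6:8080/post", json=json.dumps(payload))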

r = requests.post("http://192.168.0.6:8080/post", data="b=2")

# received for the JSON example (a JSON string, because the payload was encoded twice)
{"args":{},"data":"\"{\\\"b\\\": 2}\"","files":{},"form":{},"headers":{"Content-Length":"12","Content-Type":"application/json"},"json":"{\"b\": 2}"}

# received for the web-form example
{"args":{},"data":"","form":{"b":"2"},"headers":{"Content-Length":"3","Content-Type":"application/x-www-form-urlencoded"},"json":null}

# received for the raw-string example
{"data":"b=2","headers":{"Content-Length":"3"},"json":null}

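When the request body is sent unencoded, the headers contain no Content-Type field.

If a tuple of key-value pairs is used as data, it is likewise encoded as a web form. When several values share the same key, tuples avoid the problem that a dictionary cannot hold duplicate keys.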
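The body encodings discussed in this section can be compared side by side without a server by preparing each request and looking at its headers and body. This is a sketch under the assumption that the prepared request reflects what would be sent; the helper prepare_post is hypothetical, and the fourth case shows what happens when an already serialized string is passed to json.

```python
import json

import requests

URL = "http://192.168.0.6:8080/post"  # address taken from the examples in this article
payload = {"b": 2}


def prepare_post(**kwargs):
    """Build a POST request without sending it and return the prepared form."""
    return requests.Request("POST", URL, **kwargs).prepare()


form = prepare_post(data=payload)                # dictionary -> web form
as_json = prepare_post(json=payload)             # dictionary -> JSON object
double = prepare_post(json=json.dumps(payload))  # already a string -> encoded twice
raw = prepare_post(data="b=2")                   # string -> sent unchanged

for name, p in [("form", form), ("json", as_json), ("double", double), ("raw", raw)]:
    print(name, p.headers.get("Content-Type"), p.headers["Content-Length"], p.body)
```

The raw-string request carries no Content-Type header, and the double-encoded body decodes to a string rather than a dictionary, which is why passing the dictionary itself to json is usually what is wanted.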

payload = (('b', 1), ('b', 2))
r = requests.post("http://192.168.0.6:8080/post", data=payload)

# the body received by the server is shown below
"form":{"b":["1","2"]}

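The repeated-key encoding can be confirmed offline by preparing the request and printing its body. The comparison with a dictionary holding a list value is an addition for illustration and is not part of the original example.

```python
import requests

# Two values under the same key, which a plain key-value dictionary could not hold.
payload = (("b", 1), ("b", 2))

prepared = requests.Request(
    "POST", "http://192.168.0.6:8080/post", data=payload
).prepare()

print(prepared.body)  # b=1&b=2

# A dictionary with a list value produces the same body.
same = requests.Request(
    "POST", "http://192.168.0.6:8080/post", data={"b": [1, 2]}
).prepare()
print(same.body == prepared.body)
```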
Requests.put() method

The put method in the HTTP protocol is broadly similar to the post method, except that put overwrites the existing data at the URL, while post adds new data. The two methods have almost the same form in the Requests library, so the details are not repeated here.

Requests.patch() method

Corresponds to the PATCH method in the HTTP protocol; usage is similar to Requests.put(). The difference is that patch submits only the part of the data that needs to change, while put must submit all of the data at the URL.
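The contrast between the two methods shows up in what each request carries. The sketch below prepares both requests without sending them; the resource URL and the field names are hypothetical.

```python
import requests

# Hypothetical resource URL and fields, used only for illustration.
URL = "http://192.168.0.6:8080/user/1"

# PUT replaces the resource, so every field is sent.
put_req = requests.Request("PUT", URL, data={"name": "alice", "age": 30}).prepare()

# PATCH changes part of the resource, so only the modified field is sent.
patch_req = requests.Request("PATCH", URL, data={"age": 31}).prepare()

print(put_req.method, put_req.body)
print(patch_req.method, patch_req.body)
```

To actually send them, the equivalent calls are requests.put(URL, data=...) and requests.patch(URL, data=...).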

Requests.delete() method

Corresponds to the DELETE method in the HTTP protocol: it asks the server to delete the resource identified by the URL. Authentication information may need to be added to the request parameters.

Requests.head() method

Corresponds to the HEAD method in the HTTP protocol; usage is similar to the Requests.get() method, except that the returned message contains only the header information and no body.