Network requests are the foundation of a web crawler: without requests and responses, there is no data for the later analysis stages to work on. In Python, network requests are most commonly made with the Requests library. This article introduces the Requests library and its basic usage.
Introduction to the Requests library
Requests is a concise and elegant third-party Python library designed around how people actually use HTTP, which has made it very popular among developers compared with other HTTP libraries. It supports many features in common use today, such as Keep-Alive persistent connections, cookie-based sessions, SSL verification, automatic content decoding, HTTP(S) proxy support, streaming downloads, and chunked file uploads. For other features, please refer to the official documentation, Requests: HTTP for Humans.
Installation of the Requests library
pip install requests
The main methods of the Requests library
The Requests library has 7 main methods, most of which correspond to the request methods of the HTTP(S) protocol.
Method | Description |
---|---|
requests.request() | Constructs a request; the base method that underlies all the methods below |
requests.get() | Retrieves a resource; corresponds to HTTP GET |
requests.head() | Retrieves only the response headers of a resource; corresponds to HTTP HEAD |
requests.post() | Submits data to the server; corresponds to HTTP POST |
requests.put() | Submits a replacement for a resource; corresponds to HTTP PUT |
requests.patch() | Submits a partial modification of a resource; corresponds to HTTP PATCH |
requests.delete() | Requests deletion of a resource; corresponds to HTTP DELETE |
- The Requests library also supports other HTTP methods, but they are rarely used and are not listed here.
- In fact, all the other methods are implemented by calling the basic method requests.request().
Use the Requests library to construct a basic request
The most commonly used request method in web page access is the GET method. Below we use the GET method to construct a basic request.
Requests.get() method
With the following code
r = requests.get(url)
# url is a string variable that holds the URL of the resource to request
the Requests library constructs a Request object that requests a resource from the server, and the call returns a Response object, r, containing the server's resource.
Complete usage format of Requests.get() method
requests.get(url,params=None,**kwargs)
where:
- url: the URL (uniform resource locator) of the resource you want to request
- params: optional, default None; extra parameters appended to the URL, given as a dictionary or a byte sequence
- **kwargs: optional; 12 parameters that control access, described below
In fact, requests.get() is a wrapper around requests.request(). The underlying call is
requests.request('get', url, params=params, **kwargs)
The get method lists params explicitly as a keyword parameter with a default value. The requests.request() method therefore takes 13 optional parameters (including params), which are passed as keyword arguments.
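The wrapper relationship can be checked without sending anything over the network. The sketch below is only an illustration: it temporarily replaces `requests.api.request` with a mock and records how `requests.get()` calls it. It assumes that `requests.get()` is defined in the `requests.api` module, which is an implementation detail that could differ between versions, and the URL is a placeholder.

```python
from unittest import mock

import requests

url = "http://example.invalid/page"  # placeholder URL, never contacted

# Temporarily swap the underlying request() function for a mock object
with mock.patch("requests.api.request") as fake_request:
    requests.get(url, params={"a": 1}, timeout=3)

args, kwargs = fake_request.call_args
print(args)               # positional arguments: the method name and the URL
print(kwargs["params"])   # params is forwarded as a keyword argument
print(kwargs["timeout"])  # other keyword arguments are forwarded unchanged
```

Because the mock intercepts the call, no request is actually sent.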
Response object
The request above returns a Response object, which holds the information returned by the server. Its main attributes are:
Attribute | Description |
---|---|
r.status_code | Status code returned for the HTTP request |
r.text | HTTP response body as a string |
r.encoding | Encoding of the response body as inferred from the HTTP headers; it is the encoding used to decode r.text and can be reassigned |
r.apparent_encoding | Encoding inferred from the response body itself (a fallback) |
r.content | HTTP response body in binary form |
Use the get method to construct the request
Let's use a request to Baidu as an example of how to construct a GET request with requests.get():
import requests
r = requests.get("https://www.baidu.com")
View the encoding information of the response object
# The comment after each line shows the value returned
r.encoding
#'ISO-8859-1'
r.apparent_encoding
#'utf-8'
Here the body encoding taken from the response headers is wrong: the headers do not specify an encoding, so the default, ISO-8859-1, is used, and Chinese text is not displayed correctly. Set r.encoding to the actual encoding of the body, utf-8, to display Chinese correctly.
After the encoding format of the body is set correctly, the body content of the response can be correctly obtained, and the response header information can also be obtained.
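The effect of a wrong encoding can be reproduced with plain byte strings, without any request. The sketch below uses only the standard library; the sample text is an assumption chosen for illustration and stands in for the bytes of r.content, which r.text decodes using r.encoding.

```python
# Bytes of a UTF-8 encoded page body (sample text chosen for illustration)
body = "百度一下".encode("utf-8")

# Decoding with the ISO-8859-1 default turns every byte into its own character
wrong = body.decode("iso-8859-1")
# Decoding with the actual encoding restores the original text
right = body.decode("utf-8")

print(len(body), len(wrong), len(right))
print(right)
```

Setting r.encoding = 'utf-8' before reading r.text has the same effect as the second decode call.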
Common code framework for crawling web pages
The general code framework is a reusable piece of code for crawling web pages. By customizing it, you can crawl page content reliably and flexibly and also collect links to other pages. Requests made with requests.get() will inevitably raise exceptions at some point, so you need to understand the exceptions of the Requests library and how to handle them before studying the general code framework.
Requests library exception handling
When you send requests with the Requests library, exceptions can occur at various stages. If they are not handled, the program may terminate abnormally, so recognizing and handling exceptions is a necessary step in crawler development.
Exception | Description |
---|---|
requests.ConnectionError | Network connection error, such as failure to establish a connection, DNS resolution failure, or connection refused (not a 4xx response code) |
requests.HTTPError | HTTP error (must be raised manually, see the note) |
requests.URLRequired | A URL is missing |
requests.TooManyRedirects | The number of redirects exceeded the limit |
requests.ConnectTimeout | Timed out while connecting to the server |
requests.Timeout | The request timed out (covers the whole request, including the phase after the connection is established) |
- ConnectionError covers exceptions raised at the network (TCP) layer, which terminate the program if left unhandled. HTTPError covers errors at the HTTP (application) layer and must be raised manually by calling the response's raise_for_status() method, which raises it when the status code indicates an error (4xx or 5xx).
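The behaviour of raise_for_status() and the relationship between the timeout exceptions can be explored offline. The sketch below builds Response objects by hand and sets only their status_code; this is an assumption made for illustration, since real responses come from a request.

```python
import requests

# Hand-built Response objects, used only to show raise_for_status() offline
not_found = requests.Response()
not_found.status_code = 404

redirect = requests.Response()
redirect.status_code = 302

try:
    not_found.raise_for_status()
    outcome = "no exception"
except requests.HTTPError as exc:
    outcome = f"HTTPError for status {exc.response.status_code}"
print(outcome)

# raise_for_status() raises only for 4xx and 5xx codes; a 302 passes silently
redirect.raise_for_status()

# ConnectTimeout inherits from both ConnectionError and Timeout
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))
print(issubclass(requests.ConnectTimeout, requests.Timeout))
```

Since ConnectTimeout is a subclass of both, an except clause for either parent class will catch it.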
Common code framework
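import requests

def get_uri(url):
    try:
        # Do not follow redirects, so the status of the requested URL itself is checked
        r = requests.get(url, allow_redirects=False)
        # Raise requests.HTTPError for 4xx/5xx status codes
        r.raise_for_status()
        # Decode the body with the encoding detected from its content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # RequestException is the base class of the exceptions raised by Requests
        return "An error has been thrown"

print("Requests.get Skeleton")
print(get_uri("http://192.168.0.6:8080/a"))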
- The get_uri function in the example is a simple, encapsulated version of the general crawler code framework.
- In the example, the server returned a 404 error, which raise_for_status() turned into an exception. If the url is changed to a nonexistent server or an invalid URL, the exception is raised directly by the request, without reaching that statement. Because the try-except statement catches the exception, the program does not exit; it can report the error and carry on with the next step.
- If the url in the example is changed to a real one, the source code of the web page is output directly and can be analyzed and processed further.
- Defining the conditions of the except statement more precisely makes it possible to distinguish different errors and handle each of them differently.
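As a minimal sketch of the last point, the variant below separates the error cases with several except clauses. The function name fetch_text, the 5-second timeout, and the returned messages are assumptions made for illustration, not part of the framework above.

```python
import requests

def fetch_text(url, timeout=5):
    """Return (text, error); exactly one of the two is None."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text, None
    except requests.Timeout:
        # Checked first: ConnectTimeout is also a ConnectionError
        return None, "timeout"
    except requests.ConnectionError:
        return None, "connection error"
    except requests.HTTPError as exc:
        return None, f"HTTP error {exc.response.status_code}"
    except requests.RequestException as exc:
        # Anything else raised by Requests, e.g. an invalid URL
        return None, f"request failed: {type(exc).__name__}"

# A URL without a scheme is rejected before any network access takes place
text, error = fetch_text("missing-scheme")
print(text, error)
```

The order of the clauses matters: the more specific exceptions come before the RequestException base class, which would otherwise catch everything.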
Use of the main methods of the Requests library
Basic request methods and parameters
We start with the basic method of the Requests library, requests.request(). The other main request methods in the library all work by calling requests.request().
requests.request('Method', url, **kwargs)
where
- Method: the name of the HTTP method to use
- url: the uniform resource locator of the target resource
- **kwargs: additional parameters for the request. They are the same for all the other methods built on requests.request(); some methods explicitly define a few common parameters as keyword parameters, which makes no significant difference.
Parameter | Description |
---|---|
params | Appended to the URL being requested as query parameters |
data | Body of the request, sent form-encoded; can be a dictionary, a list of tuples, etc. |
json | Body of the request, sent in json format |
headers | Commonly used; defines the header fields of the request, as a dictionary |
cookies | Cookies to send with the request, as a dictionary or CookieJar |
auth | Used for HTTP authentication, as a tuple |
files | Files to upload as the request body, as a dictionary whose values are file objects |
timeout | Timeout for the request, in seconds |
proxies | Proxy servers to use, as a dictionary that maps each protocol to a proxy |
allow_redirects | Whether to follow redirects; boolean, default True |
stream | Whether to stream the download; boolean, default False (the response body is downloaded immediately) |
verify | Switch for SSL certificate verification; default True (certificates are verified) |
cert | Path to a local SSL certificate |
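- Note: the data parameter adds its contents to the body of the request, while the params parameter adds them to the URL being requested. For example, with params={'a': 1} and url='https://www.baidu.com/', the URL finally requested is 'https://www.baidu.com/?a=1'.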
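- headers is a commonly used parameter that defines the header fields of the request and accepts a dictionary. If it is not defined, the following defaults are sent:
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}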
- Example of the files parameter:
files = {'file1': open('data.xls', 'rb')}
r = requests.post('https://192.168.0.6:8080/post', files=files)
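- Example of the proxies parameter:
proxies = {'http': 'http://192.168.0.4', 'https': 'https://192.168.0.5'}
The get method was covered above and is not repeated here. The following introduces some of the other methods of the Requests library.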
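The effect of params, data, headers, and files can be inspected without contacting a server by preparing requests instead of sending them. The sketch below is an illustration under stated assumptions: the User-Agent value my-crawler/0.1 is hypothetical, and an in-memory BytesIO object stands in for the opened data.xls file.

```python
import io

import requests

# params end up in the URL, data ends up in the request body
get_req = requests.Request("GET", "https://www.baidu.com/", params={"a": 1}).prepare()
post_req = requests.Request("POST", "https://www.baidu.com/", data={"a": 1}).prepare()
print(get_req.url)
print(post_req.url, post_req.body)

# A Session merges its default headers with the headers given for one request
with requests.Session() as session:
    custom = session.prepare_request(
        requests.Request(
            "GET", "https://www.baidu.com/", headers={"User-Agent": "my-crawler/0.1"}
        )
    )
print(custom.headers["User-Agent"])
print(custom.headers["Accept-Encoding"])

# files produces a multipart/form-data body
fake_file = io.BytesIO(b"example bytes")  # stand-in for open('data.xls', 'rb')
upload = requests.Request(
    "POST", "https://192.168.0.6:8080/post", files={"file1": ("data.xls", fake_file)}
).prepare()
print(upload.headers["Content-Type"].split(";")[0])
print(b'filename="data.xls"' in upload.body)
```

A prepared request is what Requests would put on the wire, so this is a convenient way to check parameters before running a crawler against a real site.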
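Requests.post() method
The requests.post() method sends data to the server using the POST method of the HTTP protocol. Readers are left to study HTTP POST itself; only the usage of requests.post() is covered here.
Wrapper definition of the post method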
def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)
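As you can see, the post method lists the data and json parameters separately. They hold, respectively, the POST request body encoded as a web form and the body encoded as json. The body is form-encoded when data is a dictionary; to send data without the default form encoding, pass a string directly as data.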
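The three examples below use web form format, json format, and a plain string as the body of a POST request. Each is followed by an excerpt of what the server received.
import json
import requests

# Example 1: web form body
payload = {'b': 2}
r = requests.post("http://192.168.0.6:8080/post", data=payload)

# Example 2: json body
payload = {'b': 2}
r = requests.post("http://192.168.0.6:8080/post", json=json.dumps(payload))

# Example 3: plain string body
r = requests.post("http://192.168.0.6:8080/post", data="b=2")
Received by the server for example 1 (web form):
{"args":{},"data":"","form":{"b":"2"},"headers":{"Content-Length":"3","Content-Type":"application/x-www-form-urlencoded"},"json":null}
Received by the server for example 2 (json):
{"args":{},"data":"\"{\\\"b\\\": 2}\"","files":{},"form":{},"headers":{"Content-Length":"12","Content-Type":"application/json"},"json":"{\"b\": 2}"}
Received by the server for example 3 (plain string):
{"data":"b=2","headers":{"Content-Length":"3"},"json":null}
Note that example 2 passes an already serialized string to json, so the body is JSON-encoded a second time, as the escaped quotes in the server output show. Passing the dictionary itself (json=payload) sends the object {"b": 2}.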
- When a request body is sent without encoding, the headers contain no Content-Type field.
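The Content-Type of each body style can also be checked offline by preparing the requests rather than sending them. This is a sketch for illustration: the helper name prepared is hypothetical, the server address is the placeholder used in the examples and is never contacted, and the json case passes the dictionary directly instead of a pre-serialized string.

```python
import json

import requests

url = "http://192.168.0.6:8080/post"  # placeholder server, never contacted

def prepared(**kwargs):
    """Build the request that requests.post() would send, without sending it."""
    return requests.Request("POST", url, **kwargs).prepare()

form = prepared(data={"b": 2})     # dictionary: encoded as a web form
as_json = prepared(json={"b": 2})  # dictionary passed to json: encoded as JSON
raw = prepared(data="b=2")         # string: sent as is, no Content-Type added

for name, req in [("form", form), ("json", as_json), ("raw", raw)]:
    print(name, req.headers.get("Content-Type"), req.body)

# The JSON body decodes back to the original dictionary
print(json.loads(as_json.body))
```

The form and raw bodies are identical here; only the Content-Type header tells the server how to interpret them.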
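If a tuple is used as data, it is likewise encoded as a web form. When several values share the same key, using tuples avoids the problem that key names in a dictionary cannot repeat.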
payload = (('b', 1), ('b', 2))
r = requests.post("http://192.168.0.6:8080/post", data=payload)
# Below is the body content received by the server
"form":{"b":["1","2"]}
Requests.put() method
The put method in the HTTP protocol is similar to the post method, except that put overwrites the existing data under the url, while post adds new data. The two methods are called in almost the same way in the Requests library, so the details are not repeated here.
Requests.patch() method
Similar to the patch method in the HTTP protocol, the usage is similar to Requests.put(). The difference is that patch only provides part of the data that needs to be modified, while put needs to submit all the data under the url.
Requests.delete() method
This corresponds to the delete method in the HTTP protocol and asks the server to delete the resource specified by the URL. Authentication information may need to be added to the request parameters.
Requests.head() method
Similar to the head method in the HTTP protocol, the usage is similar to the Requests.get() method, except that there is only the header information and no body in the returned message.
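The four methods above differ mainly in the HTTP method name and in whether a body is sent. The sketch below prepares one request of each kind without sending it; the resource URL, the field names, and the credentials are hypothetical values chosen for illustration.

```python
import json

import requests

url = "http://192.168.0.6:8080/item/1"  # hypothetical resource URL, never contacted

# put submits the complete resource, patch only the fields to be changed
put_req = requests.Request("PUT", url, json={"name": "a", "price": 1}).prepare()
patch_req = requests.Request("PATCH", url, json={"price": 2}).prepare()
# delete usually has no body; auth adds an Authorization header
delete_req = requests.Request("DELETE", url, auth=("user", "secret")).prepare()
# head is built like get; the server answers with headers only
head_req = requests.Request("HEAD", url).prepare()

for req in (put_req, patch_req, delete_req, head_req):
    print(req.method, req.url, req.body)

print(delete_req.headers["Authorization"].split()[0])
```

With a reachable server, the same requests would be sent with requests.put(), requests.patch(), requests.delete(), and requests.head().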