Python Crawlers from Entry to Abandonment (4): Basic Use of the Requests Library (repost)

What is Requests

Requests is an HTTP library written in Python on top of urllib, released under the Apache2 License.
If you have read the earlier article on the urllib library, you will have found urllib rather inconvenient; Requests builds on urllib and saves us a lot of work (after using requests, you will hardly want to go back to urllib). In short, requests is the simplest and most convenient HTTP library implemented in Python, and it is the library recommended for crawlers.

A default Python installation does not include the requests module; it needs to be installed separately with pip.
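For example, from a terminal:

pip install requests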

Detailed explanation of requests functionality

A demo of general functionality

 
import requests

# a quick demo of the most commonly used Response attributes
response = requests.get("https://www.baidu.com")
print(type(response))                    # <class 'requests.models.Response'>
print(response.status_code)              # HTTP status code
print(type(response.text))               # <class 'str'>
print(response.text)                     # body decoded with the guessed encoding
print(response.cookies)                  # RequestsCookieJar
print(response.content)                  # raw bytes of the body
print(response.content.decode("utf-8"))  # bytes decoded explicitly as utf-8
 

As you can see, Response objects are very convenient to use. One thing needs attention: for many sites, reading response.text directly produces garbled characters. response.content, by contrast, returns the raw bytes of the body, and decoding those bytes with decode("utf-8") avoids the garbled output that response.text can give.

After a request is made, Requests makes an educated guess about the encoding of the response based on the HTTP headers, and uses that guess when you access response.text. You can find out which encoding Requests is using, and change it, through the response.encoding property. For example:

import requests

response = requests.get("http://www.baidu.com")
response.encoding = "utf-8"
print(response.text)

Either response.content.decode("utf-8") or response.encoding = "utf-8" avoids the garbled-character problem.
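If you do not know a page's encoding ahead of time, one further option (a sketch, not from the original article) is to let requests guess the encoding from the response body via response.apparent_encoding and assign that guess to response.encoding:

import requests

response = requests.get("http://www.baidu.com")
# apparent_encoding guesses the charset from the raw bytes of the body
response.encoding = response.apparent_encoding
print(response.text)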

Various request methods

Various request methods are provided in requests

 
import requests
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
 

Making requests

Basic GET request

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

GET request with parameters, example 1

import requests

response = requests.get("http://httpbin.org/get?name=zhaofan&age=23")
print(response.text)

If we want to pass data in the URL query string, we usually append it to the URL, as in httpbin.org/get?key=val. The Requests module lets you pass these parameters as a dictionary via the params keyword argument. For example:

 
import requests
data = {
    "name":"zhaofan",
    "age":22
}
response = requests.get("http://httpbin.org/get",params=data)
print(response.url)
print(response.text)
 

The two requests above return the same result: passing a dictionary through the params argument constructs the URL for you.
Note: with the dictionary approach, any parameter whose value is None will not be added to the URL.
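A quick sketch of that behavior (the age value here is just an illustration):

import requests

data = {
    "name": "zhaofan",
    "age": None   # a None value is dropped from the query string
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.url)   # http://httpbin.org/get?name=zhaofan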

Parsing JSON

 
import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
 

As the results show, the json() method built into requests effectively runs json.loads() on the body, and the two give the same result.

Getting binary data

response.content was mentioned above: the data obtained this way is binary, and the same approach can be used to download images and video resources.
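A minimal sketch of downloading an image this way (the URL and file name are placeholders, not from the original article):

import requests

response = requests.get("http://httpbin.org/image/png")
# response.content holds the raw bytes, so write them to a file opened in binary mode
with open("example.png", "wb") as f:
    f.write(response.content)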

Adding headers

Just as with the urllib module earlier, we can also customize the header information. For example, requesting the Zhihu website directly through requests fails by default.

import requests
response =requests.get("https://www.zhihu.com")
print(response.text)

This request fails with an error.

That is because accessing Zhihu requires header information. Enter chrome://version in the Chrome browser to see the user agent string, then add it to the headers:
 
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get("https://www.zhihu.com", headers=headers)

print(response.text)
 

This will allow you to access Zhihu normally

Basic POST request

By adding a data parameter when sending a POST request, we can pass the form data as a dictionary, which makes sending POST requests very convenient.

 
import requests

data = {
    "name":"zhaofan",
    "age":23
}
response = requests.post("http://httpbin.org/post",data=data)
print(response.text)
 

Similarly, when sending a POST request you can also pass a dictionary of headers through the headers parameter, just as with a GET request.
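A sketch combining the two (the User-Agent value is only illustrative):

import requests

data = {"name": "zhaofan", "age": 23}
headers = {"User-Agent": "Mozilla/5.0"}
# data supplies the form body, headers supplies the request headers
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.text)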

Response

We can get many properties through response, examples are as follows

 
import requests

response = requests.get("http://www.baidu.com")
print(type(response.status_code),response.status_code)
print(type(response.headers),response.headers)
print(type(response.cookies),response.cookies)
print(type(response.url),response.url)
print(type(response.history),response.history)
 

Running this shows that status_code is an int, headers is a CaseInsensitiveDict, cookies is a RequestsCookieJar, url is a str, and history is a list.

Status code judgment
Requests also comes with a built-in status code lookup object, requests.codes,
which mainly includes the following contents:

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

Server Error.
500: ('internal_server_error', 'server_error', '/o\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

Test it with the following example (though it is usually more convenient to compare the numeric status code directly):

 

import requests

response= requests.get("http://www.baidu.com")
if response.status_code == requests.codes.ok:
    print("Access successful")
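Another option, not covered in the original article, is raise_for_status(), which raises requests.exceptions.HTTPError for 4xx/5xx responses; a minimal sketch:

import requests

response = requests.get("http://www.baidu.com")
# raises requests.exceptions.HTTPError for 4xx/5xx responses, otherwise does nothing
response.raise_for_status()
print("Access successful")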

Advanced usage of requests

File Upload

The approach is similar to the other parameters: construct a dictionary and pass it through the files parameter.

import requests
files = {"files": open("git.jpeg", "rb")}
response = requests.post("http://httpbin.org/post",files=files)
print(response.text)

The response body echoes the uploaded file's contents under the "files" field.

Getting cookies

 
import requests

response = requests.get("http://www.baidu.com")
print(response.cookies)

for key,value in response.cookies.items():
    print(key+"="+value)
 

Session maintenance

One use of cookies is to simulate a login and maintain a session.

import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456")
response = s.get("http://httpbin.org/cookies")
print(response.text)

This is the correct way to write it; the following way is wrong:

import requests

requests.get("http://httpbin.org/cookies/set/number/123456")
response = requests.get("http://httpbin.org/cookies")
print(response.text)

In the wrong version, the two requests are completely independent, so the cookie set by the first request is lost by the second. In the correct version, a Session object is created first and both requests go through that object, so the cookie persists.

Certificate verification

Many websites are now accessed over HTTPS, which raises the issue of certificate verification.

import requests

response = requests.get("https://www.12306.cn")
print(response.status_code)

By default, the 12306 site's certificate is not trusted, so this raises an SSLError.

To avoid this, you can pass verify=False. The page can then be accessed, but a warning is printed:

InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

The solution is:

import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)

With this, no warning message is shown. Of course, you can also pass a certificate path through the cert parameter.
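A sketch of both options (all file paths below are placeholders): verify can also point at a CA bundle file, and cert supplies a client certificate and key.

import requests

# verify the server against a custom CA bundle (path is a placeholder)
response = requests.get("https://www.12306.cn", verify="/path/to/ca-bundle.crt")
print(response.status_code)

# send a client certificate and private key (paths are placeholders)
response = requests.get("https://example.com", cert=("/path/to/client.crt", "/path/to/client.key"))
print(response.status_code)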

Proxy settings

 
import requests

proxies= {
    "http":"http://127.0.0.1:9999",
    "https":"http://127.0.0.1:8888"
}
response  = requests.get("https://www.baidu.com",proxies=proxies)
print(response.text)
 

If the proxy requires a username and password, just change the dictionary to the following:

proxies = {
    "http": "http://user:password@127.0.0.1:9999"
}

If your proxy works over SOCKS, first run pip install "requests[socks]", then:

proxies = {
    "http": "socks5://127.0.0.1:9999",
    "https": "socks5://127.0.0.1:8888"
}

Timeout setting

The timeout period can be set with the timeout parameter.
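For example (a sketch): a single number is the number of seconds to wait for the server, and a (connect, read) tuple sets the two phases separately.

import requests

# give up if the server has not responded within 1 second
response = requests.get("http://httpbin.org/get", timeout=1)
print(response.status_code)

# or set the connect and read timeouts separately
response = requests.get("http://httpbin.org/get", timeout=(3.05, 27))
print(response.status_code)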

Authentication settings

If you encounter a website that requires HTTP basic authentication, you can use the requests.auth module:

 
import requests

from requests.auth import HTTPBasicAuth

response = requests.get("http://120.27.34.24:9001/",auth=HTTPBasicAuth("user","123"))
print(response.status_code)
 

Of course there is another way

import requests

response = requests.get("http://120.27.34.24:9001/",auth=("user","123"))
print(response.status_code)

Exception handling

The exceptions raised by requests are documented here:
http://www.python-requests.org/en/master/api/#exceptions
All exceptions live in requests.exceptions.

From the source code we can see that RequestException inherits from IOError; HTTPError, ConnectionError, and Timeout inherit from RequestException; ProxyError and SSLError inherit from ConnectionError; and ReadTimeout inherits from Timeout. These are the common exception inheritance relationships; for details see:
http://cn.python-requests.org/en_US/latest/_modules/requests/exceptions.html#RequestException

Simple demonstration with the following example

 
import requests

from requests.exceptions import ReadTimeout,ConnectionError,RequestException


try:
    response = requests.get("http://httpbin.org/get", timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:
    print("connection Error")
except RequestException:
    print("error")
 

In the final test you can see that the first exception caught is the ReadTimeout. If the network cable is unplugged, a ConnectionError is caught instead, and anything not caught by the earlier handlers can still be caught by RequestException.

 Original address http://www.cnblogs.com/zhaof/p/6915127.html
