HTTP request: Analysis of advanced usage of requests | JD Cloud technical team

1 Background

The previous article covered the basic use of the requests module: request methods such as GET, PUT and POST, passing request parameters as data or json, adding common request information such as headers and cookies, and handling responses. This article moves on to the advanced usage of requests.

2 Examples of advanced methods

2.1 requests.request()

method: the request method (e.g. GET or POST);

url: the request URL;

kwargs: 14 optional keyword arguments that control the request;

The commonly used parameters params, data, json, headers and cookies were introduced in the previous article; you can review them there if interested. The following sections explain and illustrate the remaining parameters.
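
As a quick refresher, a minimal sketch of requests.request() using the familiar parameters (the URL and header value below are placeholders):

import requests

# A plain GET through the generic requests.request() entry point;
# params are appended to the URL as a query string.
res = requests.request(
    method = 'GET',
    url = 'http://127.0.0.1:8080/example/request',
    params = {'k1': 'v1'},
    headers = {'User-Agent': 'requests-demo'}
)
print(res.status_code)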

Examples:

2.1.1 files

The files parameter attaches files to a request; if a request needs to upload a file, use this parameter.

import requests

# Upload a file
f = {"files": open("favicon.ico", "rb")}
data = {"name": "file upload"}

requests.request(
    method = 'POST', 
    url = 'http://127.0.0.1:8080/example/request',  
    data = data,
    files = f
)

Note: the favicon.ico file needs to be in the same directory as the script; if it is not, replace the file name with the full file path.
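
If you also need to control the uploaded file name and content type, requests accepts a tuple for each file entry; a minimal sketch (the field name and MIME type here are illustrative):

import requests

# Each value may be a (filename, fileobj, content_type) tuple,
# which sets the file name and MIME type sent to the server.
files = {"files": ("favicon.ico", open("favicon.ico", "rb"), "image/x-icon")}

res = requests.request(
    method = 'POST',
    url = 'http://127.0.0.1:8080/example/request',
    files = files
)
print(res.status_code)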

2.1.2 auth

The auth parameter handles HTTP authentication; requests provides the HTTPBasicAuth and HTTPDigestAuth helpers for the Basic and Digest schemes.

import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

# 1. Basic Auth authentication
res = requests.request(
    method = 'GET',
    url = 'http://127.0.0.1:8080/example/request',
    auth = HTTPBasicAuth("username", "password")
)
res.encoding = "gbk"

print(res.status_code)  # 200


# 2. Digest authentication
res = requests.request(
    method = 'GET',
    url = 'http://127.0.0.1:8080/example/request',
    auth = HTTPDigestAuth("username", "password")
)
res.encoding = "gbk"

print(res.status_code)  # 200

HTTP authentication comes in two flavors here: Basic and Digest. Basic Auth provides a simple user-authentication mechanism with a clear, straightforward flow, which makes it suitable for systems or devices with low security requirements. Its drawback is that the user name and password appear in the Authorization header merely Base64-encoded, so they can easily be decoded.

So how does Digest authentication differ from Basic authentication?

  • Digest authentication is based on a server-supplied nonce: the two sides agree on which pieces of information to hash, and the hashes are used to verify each other's identity. This avoids sending the password in plain text over the network and improves security, but it still has weaknesses; for example, an attacker who intercepts the authentication messages can replay them to obtain resources.
  • Digest authentication provides a higher level of security than Basic authentication, but it is still weak compared with HTTPS client certificate authentication.
  • Digest authentication offers protection against password eavesdropping, but it has no mechanism to prevent user impersonation.
  • Like Basic authentication, Digest authentication is not particularly convenient or flexible to use, and it still falls short of the security level most websites pursue, so its range of application is limited.

2.1.3 timeout

The timeout parameter sets a time limit for the request and response. When the network is slow or the server does not respond, setting a timeout avoids waiting indefinitely.

import requests

# Set a 1-second timeout; an exception is raised if there is no response within 1 second.
# The single value is applied to both the connect and the read timeout.
requests.request(
    method = 'POST',
    url = 'http://127.0.0.1:8080/example/request',
    json = {'k1' : 'v1', 'k2' : 'v2'},
    timeout = 1
)

# Set the connect and read timeouts separately by passing a tuple
requests.request(
    method = 'POST',
    url = 'http://127.0.0.1:8080/example/request',
    json = {'k1' : 'v1', 'k2' : 'v2'},
    timeout = (5, 15)
)

# Wait indefinitely
requests.request(
    method = 'POST',
    url = 'http://127.0.0.1:8080/example/request',
    json = {'k1' : 'v1', 'k2' : 'v2'},
    timeout = None
    # or simply omit the timeout parameter
)

# Catch the timeout exception
from requests.exceptions import ReadTimeout
try:
    res = requests.get('http://127.0.0.1:8080/example/request', timeout=0.1)
    print(res.status_code)
except ReadTimeout:
    print("捕捉到超时异常")

2.1.4 allow_redirects

Controls whether redirects are followed.

>>> import requests
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'

>>> r.status_code
200

>>> r.history
[<Response [301]>]

# For GET, OPTIONS, POST, PUT, PATCH or DELETE, redirects can be disabled with the allow_redirects parameter
>>> r = requests.get('http://github.com', allow_redirects=False)

>>> r.status_code
301

>>> r.history
[]

# Enable redirects for HEAD requests
>>> r = requests.head('http://github.com', allow_redirects=True)

>>> r.url
'https://github.com/'

>>> r.history
[<Response [301]>]


import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # get the initial cookies (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # extract the CSRF token from the page

# Second request: POST to the login page with the initial cookies, the CSRF token and the account credentials
data={
    'commit':'Sign in',
    'utf8':'✓',
    'authenticity_token':authenticity_token,
    'login':'[email protected]',
    'password':'password'
}


# Test 1: without allow_redirects=False, a Location header in the response triggers a jump to the new page,
# so r2 is the response of the new page
r2=requests.post('https://github.com/session',
             data=data,
             cookies=r1_cookie
             )

print(r2.status_code)     # 200
print(r2.url)             # the URL of the page after the redirect
print(r2.history)         # the response(s) issued before the redirect
print(r2.history[0].text) # the body of the response before the redirect

# Test 2: with allow_redirects=False, the redirect is not followed even if the response carries a Location header,
# so r2 is still the response of the original page
r2=requests.post('https://github.com/session',
             data=data,
             cookies=r1_cookie,
             allow_redirects=False
             )

print(r2.status_code) # 302
print(r2.url) # the pre-redirect URL, https://github.com/session
print(r2.history) # []

2.1.5 proxies

Like headers, the proxies parameter is passed as a dict.
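
A minimal sketch of the proxies parameter, mapping each URL scheme to a proxy address (the addresses and credentials below are placeholders):

import requests

# Map each scheme to a proxy; credentials can be embedded in the proxy URL.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:1080'
}
res = requests.get('http://www.baidu.com', proxies=proxies)
print(res.status_code)

The following, longer example from the original builds such a dict by scraping a proxy-list page: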

import requests
import re
def get_html(url):
    # The proxy addresses below are sample public proxies and may no longer work
    proxy = {
        'http': '120.25.253.234:812',
        'https': '163.125.222.244:8123'
    }
    heads = {}
    heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    req = requests.get(url, headers=heads, proxies=proxy)
    html = req.text
    return html

def get_ipport(html):
    # Extract the IP, port and type columns from a proxy-list page
    regex = r'<td data-title="IP">(.+)</td>'
    iplist = re.findall(regex, html)
    regex2 = r'<td data-title="PORT">(.+)</td>'
    portlist = re.findall(regex2, html)
    regex3 = r'<td data-title="类型">(.+)</td>'  # "类型" is the type column of the target page
    typelist = re.findall(regex3, html)
    sumray = []
    for i, p, t in zip(iplist, portlist, typelist):
        sumray.append(t + ',' + i + ':' + p)
    print('proxy list')
    print(sumray)
if __name__ == '__main__':
    url = 'http://www.baidu.com'
    get_ipport(get_html(url))

Some interfaces include anti-abuse protection: large-scale, frequent requests may trigger a CAPTCHA, a redirect to a login verification page, or an outright IP ban. Setting a proxy is one way to keep access working in such cases.
In addition to plain HTTP proxies, requests also supports proxies that use the SOCKS protocol.

# Install SOCKS support
pip3 install "requests[socks]"

# Request through the SOCKS proxy
import requests

proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}
res = requests.get('http://www.baidu.com', proxies=proxies)
print(res.status_code)  # 200

2.1.6 hooks

The hooks parameter registers hook functions. The requests library only supports a response hook, i.e. a custom function that is executed when the response comes back. It can be used to print information, run checks on the response, or attach extra information to the response object.

import requests
url = 'http://www.baidu.com'

def verify_res(res, *args, **kwargs):
    print('url', res.url)
    # Attach a custom attribute to the response object
    res.status = 'PASS' if res.status_code == 200 else 'FAIL'

res = requests.get(url, hooks={'response': verify_res})
print(res.text) # <!DOCTYPE html><!--STATUS OK--><html> 
print(res.status) # PASS
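
Hooks can also be registered on a Session, so they run for every request made through that session; a minimal sketch (the logging function below is illustrative):

import requests

def log_status(res, *args, **kwargs):
    # Runs for every response produced by this session
    print(res.url, res.status_code)

s = requests.Session()
s.hooks['response'].append(log_status)

s.get('http://www.baidu.com')
s.get('http://www.baidu.com/robots.txt')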

2.1.7 stream

The stream parameter controls whether the response body is downloaded immediately. By default the whole body is loaded into memory at once; if the body is large, set stream=True and download it iteratively.

import requests

url="http://www.baidu.com"

r = requests.get(url, stream=True)

# Parse the response body line by line, split on \n
for lines in r.iter_lines():
    print("lines:", lines)

# Parse the response body in fixed-size byte chunks
for chunk in r.iter_content(chunk_size=1024):
    print("chunk:", chunk)

2.1.8 verify

The verify parameter is the SSL certificate verification switch. When an HTTPS request is sent to a site whose certificate is not trusted by a CA, the program raises an error; verify controls whether the SSL certificate is checked.

# 1. Disable verification directly
import requests

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# 2. Even with verification skipped, a warning is still printed at runtime,
# suggesting that a certificate be specified; it can be suppressed as follows
from requests.packages import urllib3  # if this import fails, use `import urllib3` instead

# 3. Suppress the warning
urllib3.disable_warnings()

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code) # 200

# 4. Declare a client certificate with the cert parameter
# A local crt and key file are required (the key must be decrypted; an encrypted key is not supported); pass their paths
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code) # 200
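
Instead of disabling verification, verify can also point at a trusted CA bundle; a minimal sketch (the bundle path is a placeholder):

import requests

# verify accepts the path to a CA bundle file (or a directory of certificates)
response = requests.get('https://www.12306.cn', verify='/path/ca-bundle.crt')
print(response.status_code)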

2.2 Exceptions in the requests library

How do you detect and handle exceptions?

2.2.1 raise_for_status()

This method checks res.status_code and raises an HTTPError when the status code indicates an error (4xx or 5xx).

Example:

# 1. HTTPError exception example
import requests
from requests.exceptions import HTTPError

def do_post():
    try:
        res = requests.post("http://127.0.0.1:8080/example/post")
        res.raise_for_status()
        # roughly equivalent to:
        # if not res.ok:  # i.e. the status code is 4xx or 5xx
        #     raise HTTPError(response=res)
        return res
    except HTTPError:
        return False

2.2.2 ReadTimeout

This exception is raised when the server does not return a response within the allotted timeout.

# Timeout exception example
import requests
from requests.exceptions import ReadTimeout

def get_with_timeout():
    try:
        res = requests.get('http://127.0.0.1:8080/example/post', timeout=0.5)
        print(res.status_code)
        return res
    except ReadTimeout:
        print('timeout')

2.2.3 RequestException

RequestException is the base class of all requests exceptions, so it catches any failure raised while sending a request, for example when the request cannot reach the server at all.

# RequestException example
import requests
from requests.exceptions import RequestException

def get_example():
    try:
        res = requests.get('http://127.0.0.1:8080/example/post')
        print(res.status_code)
        return res
    except RequestException:
        print('reqerror')
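
Since more specific exceptions such as Timeout and ConnectionError subclass RequestException, you can catch them in order from specific to general; a minimal sketch:

import requests
from requests.exceptions import ConnectionError, Timeout, RequestException

try:
    res = requests.get('http://127.0.0.1:8080/example/post', timeout=1)
    print(res.status_code)
except Timeout:
    print('request timed out')
except ConnectionError:
    print('failed to connect to the server')
except RequestException as e:
    # Any other requests failure falls through to the base class
    print('request failed:', e)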

3 Summary

By now it should be clear that requests is a third-party library that is far more concise than the urllib2 module, with the following characteristics:

  • Supports HTTP connection persistence and connection pooling
  • Supports cookies and Session objects to maintain state across requests (see the sketch after this list)
  • Supports file upload
  • Supports automatic decoding of response content
  • Supports automatic encoding of internationalized URLs and POST data
  • Supports persistent keep-alive connections automatically
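
For example, session state and connection reuse are both available through requests.Session; a minimal sketch (the URLs are placeholders):

import requests

# A Session keeps cookies across requests and reuses the underlying TCP connection
s = requests.Session()
s.get('http://127.0.0.1:8080/example/login')           # the server sets a session cookie here
res = s.get('http://127.0.0.1:8080/example/profile')   # the cookie is sent back automatically
print(res.status_code)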

As a highly encapsulated module, requests makes HTTP requests feel natural: it can easily perform almost any operation a browser can, living up to its slogan: "HTTP for Humans".

Author: JD Logistics Luo Tonglei

Source: JD Cloud Developer Community
