1 Introduction to the urllib3 module

urllib3 is a third-party HTTP request module that must be installed separately. It is more powerful than the urllib module that ships with Python.

1.1 Understanding urllib3

The urllib3 library is a powerful, well-organized Python HTTP client library that provides many important features not found in the Python standard library, for example:

Thread safety
Connection pooling
Client-side SSL/TLS verification
File uploads with multipart encoding
Helpers for retrying requests and handling HTTP redirects
Support for gzip and deflate encoding
Support for HTTP and SOCKS proxies
100% test coverage
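
The retry and redirect helpers in the list above are exposed through the Retry class. The following is a minimal sketch of how such a policy might be configured; the specific counts and status codes are illustrative assumptions, and no request is actually sent.

```python
import urllib3
from urllib3.util import Retry

# Illustrative policy (values are assumptions, not recommendations):
# at most 5 retries overall, at most 2 redirects, and retry on
# common gateway errors with a growing delay between attempts.
retry_policy = Retry(
    total=5,
    redirect=2,
    backoff_factor=0.5,
    status_forcelist=[502, 503, 504],
)

# Requests made through this manager use the policy by default.
http = urllib3.PoolManager(retries=retry_policy)

print(retry_policy.total, retry_policy.redirect)
```

A single call can still override the policy by passing its own retries argument to request().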

1.1.1 urllib3 installation command

pip install urllib3

2 Send a network request

2.1 Send a GET request

To send a network request with the urllib3 module, first create a PoolManager object, then call its request() method.

The syntax of the request() method is as follows.

request(method,url,fields=None,headers=None,**urlopen_kw)

method: Required parameter. The request method, such as GET, POST, or PUT.
url: Required parameter. The URL to request.
fields: Optional parameter. The request parameters, given as a dictionary.
headers: Optional parameter. The request headers, given as a dictionary.
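
How the fields parameter is transmitted depends on the request method. The sketch below tries to make that visible without relying on an external site: it starts a small local echo server (EchoHandler, the /get and /post paths, and the port selection are all assumptions made for this illustration) and sends the same fields with GET and with POST.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import urllib3


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server: reports what the client sent."""

    def _reply(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else b""
        payload = json.dumps({
            "method": self.command,
            "path": self.path,
            "content_type": self.headers.get("Content-Type", ""),
            "body_length": len(body),
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = do_POST = _reply

    def log_message(self, *args):  # keep the demo output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

http = urllib3.PoolManager()
fields = {"name": "xiaoli", "age": "1"}

# GET: fields are URL-encoded into the query string.
get_info = json.loads(http.request("GET", base + "/get", fields=fields).data)
# POST: fields travel in the request body (multipart/form-data by default).
post_info = json.loads(http.request("POST", base + "/post", fields=fields).data)

print(get_info["path"])
print(post_info["content_type"].split(";")[0])

server.shutdown()
server.server_close()
```

With GET the parameters show up in the path the server receives, while with POST the path stays clean and the body carries the data.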

2.1.1 Send a GET request and read the response information

import urllib3
urllib3.disable_warnings() # Suppress SSL warnings
url = "https://www.baidu.com/"
http = urllib3.PoolManager()
get = http.request('GET',url) # Returns an HTTPResponse object
print(get.status)
# Output: 200

response_header = get.info() # info() returns the response headers as a dict-like object; iterate over it with a for loop
for key in response_header:
    print(key,":",response_header.get(key))
# Accept-Ranges : bytes
# Cache-Control : no-cache
# Connection : keep-alive
# Content-Length : 227
# Content-Type : text/html
# Date : Mon, 21 Mar 2022 12:12:23 GMT
# P3p : CP=" OTI DSP COR IVA OUR IND COM ", CP=" OTI DSP COR IVA OUR IND COM "
# Pragma : no-cache
# Server : BWS/1.1
# Set-Cookie : BD_NOT_HTTPS=1; path=/; Max-Age=300, BIDUPSID=E864BF1D7795F2742A7BC13B95F89493; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, PSTM=1647864743; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BAIDUID=E864BF1D7795F27482D1B67B4F266616:FG=1; max-age=31536000; expires=Tue, 21-Mar-23 12:12:23 GMT; domain=.baidu.com; path=/; version=1; comment=bd
# Strict-Transport-Security : max-age=0
# Traceid : 1647864743283252404214760038623219429901
# X-Frame-Options : sameorigin
# X-Ua-Compatible : IE=Edge,chrome=1
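
The header object returned by info() is also available as the headers attribute of the response, and lookups on it are case-insensitive. A minimal sketch, assuming a hand-built HTTPResponse (no network involved) that carries two of the header values from the output above:

```python
from urllib3.response import HTTPResponse

# Build a response object by hand purely for illustration;
# a real one comes back from http.request(...).
response = HTTPResponse(
    body=b"<html></html>",
    headers={"Content-Type": "text/html", "Server": "BWS/1.1"},
    status=200,
)

# Lookup is case-insensitive, so either spelling finds the header.
content_type = response.headers["content-type"]
server_name = response.headers.get("SERVER")

# items() yields (name, value) pairs, avoiding a second lookup per key.
for name, value in response.headers.items():
    print(name, ":", value)
```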

2.1.2 Send a POST request

import urllib3
url = "http://www.httpbin.org/post"
params = {'name':'xiaoli','age':'1'}
http = urllib3.PoolManager()
post = http.request('POST',url,fields=params,retries=5) # retries: number of retries, default is 3
print("Response:",post.data.decode('utf-8'))
print("Response (when it contains Chinese):",post.data.decode('unicode_escape'))
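
When fields is passed with a POST request, urllib3 encodes the body as multipart/form-data by default; passing encode_multipart=False to request() sends a URL-encoded form instead. The sketch below reproduces both encodings offline so they can be compared. The fixed boundary string is an assumption made only to keep the output stable, since urllib3 normally generates a random one.

```python
from urllib.parse import urlencode

from urllib3.filepost import encode_multipart_formdata

params = {"name": "xiaoli", "age": "1"}

# Default for POST with fields=: a multipart/form-data body.
multipart_body, multipart_type = encode_multipart_formdata(
    params, boundary="demo-boundary"
)

# Equivalent of request(..., encode_multipart=False):
# an application/x-www-form-urlencoded body.
urlencoded_body = urlencode(params)

print(multipart_type)
print(urlencoded_body)
```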

2.2 Processing server return information

2.2.1 Processing the JSON information returned by the server

If the server returns JSON and only one item in it is needed, first convert the returned JSON data into a dictionary, then look up the value for the required key.

import urllib3
import json
url = "http://www.httpbin.org/post"
params = {'name':'xiaoli','age':'1'}
http = urllib3.PoolManager()
post = http.request('POST',url,fields=params,retries=5) # retries: number of retries, default is 3
post_json_EN = json.loads(post.data.decode('utf-8'))
post_json_CH = json.loads(post.data.decode('unicode_escape')) # Convert the response data to a dictionary
print("Value for name:",post_json_EN.get('form').get('name'))
# Value for name: xiaoli
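
One detail worth noting about the two decodings above: json.loads resolves \uXXXX escapes on its own, so decoding as UTF-8 is normally enough even when the JSON contains Chinese. The sketch below uses a hypothetical raw response (not actual httpbin output) to illustrate this, and also shows why unicode_escape can be risky on text that is already plain UTF-8.

```python
import json

# Hypothetical raw response bytes: JSON escapes non-ASCII as \uXXXX.
raw = b'{"form": {"age": "1", "name": "\\u5c0f\\u674e"}}'

# json.loads resolves the \uXXXX escapes itself.
result = json.loads(raw.decode("utf-8"))
name = result["form"]["name"]
print(name)

# By contrast, unicode_escape treats raw UTF-8 bytes as Latin-1,
# so text that is already unescaped comes out corrupted.
utf8_bytes = "小李".encode("utf-8")
garbled = utf8_bytes.decode("unicode_escape")
print(garbled == "小李")
```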

2.2.2 Processing the binary data (picture) returned by the server

import urllib3
urllib3.disable_warnings() # Suppress SSL warnings
url = 'https://img-blog.csdnimg.cn/2020060123063865.png'
http = urllib3.PoolManager()
get = http.request('GET',url) # Send the GET request
print(get.data) # Raw bytes of the image
with open('./p.png','wb') as f: # Open the target file in binary write mode
    f.write(get.data) # Write the data
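
For large files, loading the whole body into memory through data may be undesirable. A possible alternative is to pass preload_content=False to request() and write the body chunk by chunk with stream(). The sketch below assumes a hypothetical helper save_stream and uses an in-memory stand-in response so that it runs without network access.

```python
import io
import os
import tempfile

from urllib3.response import HTTPResponse


def save_stream(response, path, chunk_size=1024):
    """Write a response body to `path` chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in response.stream(chunk_size):
            f.write(chunk)
            written += len(chunk)
    response.release_conn()  # hand the connection back to the pool
    return written


# Stand-in for http.request('GET', url, preload_content=False).
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 5000
fake_response = HTTPResponse(body=io.BytesIO(payload), preload_content=False)

target = os.path.join(tempfile.mkdtemp(), "p.png")
size = save_stream(fake_response, target, chunk_size=1024)
print(size)
```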

2.2.3 Set the request header

import urllib3
urllib3.disable_warnings()
url = 'https://www.baidu.com/'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
http = urllib3.PoolManager()
get = http.request('GET',url,headers=headers)
print(get.data.decode('utf-8'))
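
Headers can also be set once on the PoolManager so that every request uses them. The following sketch builds the dictionary with the make_headers helper from urllib3.util; the User-Agent string is just an example value.

```python
import urllib3
from urllib3.util import make_headers

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0"

# make_headers builds a header dict from keyword shortcuts.
default_headers = make_headers(user_agent=ua, keep_alive=True)

# Headers given to the manager become the defaults for every request;
# a headers= argument on an individual request() call replaces them.
http = urllib3.PoolManager(headers=default_headers)

print(default_headers["user-agent"] == ua)
print(default_headers["connection"])
```

Because per-request headers replace the defaults rather than merge with them, a call that needs one extra header has to pass the full set.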

2.2.4 Set the timeout

import urllib3    # Import the urllib3 module
urllib3.disable_warnings()               # Suppress SSL warnings
baidu_url = 'https://www.baidu.com/'    # Baidu URL for the timeout test
python_url = 'https://www.python.org/'  # Python URL for the timeout test
http = urllib3.PoolManager()                   # Create the connection pool manager
try:
    r = http.request('GET',baidu_url,timeout=0.01)# Send a GET request with a timeout of 0.01 seconds
except  Exception as error:
    print('Baidu timed out:',error)
# Baidu timed out: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002690D2057F0>, 'Connection to www.baidu.com timed out. (connect timeout=0.01)'))

http2 = urllib3.PoolManager(timeout=0.1)  # Create the connection pool manager with a timeout of 0.1 seconds
try:
    r = http2.request('GET', python_url)  # Send a GET request
except  Exception as error:
    print('Python timed out:',error)
# Python timed out: HTTPSConnectionPool(host='www.python.org', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002690D21A910>, 'Connection to www.python.org timed out. (connect timeout=0.1)'))
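
A single number applies the same limit to both connecting and reading. When the two phases need different limits, a Timeout object can be used instead, and the specific urllib3 exceptions can be caught rather than Exception. This is a minimal sketch; the limits and the helper name fetch_status are assumptions for illustration, and the helper is defined but not called here.

```python
import urllib3
from urllib3.exceptions import MaxRetryError, TimeoutError as Urllib3TimeoutError

# Illustrative limits (assumptions): 2 s to connect, 5 s between reads.
timeout = urllib3.Timeout(connect=2.0, read=5.0)
http = urllib3.PoolManager(timeout=timeout)


def fetch_status(url):
    """Return the HTTP status, or None if the request timed out or kept failing."""
    try:
        return http.request("GET", url).status
    except MaxRetryError as error:
        # Raised once the retries are used up; .reason holds the last cause,
        # for example a ConnectTimeoutError.
        print("Gave up:", error.reason)
    except Urllib3TimeoutError as error:
        # Raised directly when retries are disabled (retries=False).
        print("Timed out:", error)
    return None


print(timeout.connect_timeout, timeout.read_timeout)
```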

2.2.5 Set an IP proxy

import urllib3    # Import the urllib3 module
url = "http://httpbin.org/ip"            # Test URL that reports the requesting IP
# Define Firefox request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
# Create the proxy manager object
proxy = urllib3.ProxyManager('http://120.27.110.143:80',headers = headers)
r = proxy.request('GET',url,timeout=2.0)  # Send the request
print(r.data.decode())                    # Print the response
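
If the proxy requires authentication, the credentials can be supplied through proxy_headers. The sketch below only constructs the manager and sends nothing; the proxy address and the user:password pair are hypothetical placeholders.

```python
import urllib3
from urllib3.util import make_headers

# Hypothetical proxy address and credentials, for illustration only.
PROXY_URL = "http://127.0.0.1:8080"
auth_headers = make_headers(proxy_basic_auth="user:password")

# proxy_headers are sent to the proxy itself, not to the target site.
proxy = urllib3.ProxyManager(PROXY_URL, proxy_headers=auth_headers)

print(proxy.proxy.host, proxy.proxy.port)
print(sorted(auth_headers))
```

SOCKS proxies are handled separately, through urllib3.contrib.socks.SOCKSProxyManager, which needs the optional socks extra (pip install urllib3[socks]).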

2.3 Upload

2.3.1 Upload text

import urllib3
import json
with open('./test.txt', encoding='utf-8') as f : # Open the text file (assumes it is UTF-8 encoded)
    data = f.read() # Read the file
url = "http://httpbin.org/post"
http = urllib3.PoolManager()
post = http.request('POST',url,fields={'filefield':('upload.txt',data)})
files = json.loads(post.data.decode('utf-8'))['files']  # Get the uploaded file content
print(files)                                         # Print the uploaded text
# {'filefield': '在学习中寻找快乐！'}
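
The tuple passed in fields may also carry a third item, the MIME type of the file. To show what that produces, the sketch below builds the multipart body offline with encode_multipart_formdata from urllib3.filepost. The field name filefield, the sample content, and the fixed boundary are assumptions chosen to keep the output reproducible.

```python
from urllib3.filepost import encode_multipart_formdata

# Hypothetical file content; in practice this comes from f.read().
data = "hello urllib3"

# (filename, content, MIME type): the third item is optional and
# sets the Content-Type of this part of the multipart body.
fields = {"filefield": ("upload.txt", data, "text/plain")}

# Build the multipart body without sending anything.
body, content_type = encode_multipart_formdata(fields, boundary="demo-boundary")
print(content_type)
print(body.decode("utf-8"))
```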

2.3.2 Upload image files

import urllib3
with open('p.png','rb') as f : # Open the image in binary read mode
    data = f.read()
url = "http://httpbin.org/post"
http = urllib3.PoolManager()
# Send the image upload request; the Content-Type matches the PNG file
post = http.request('POST',url,body = data,headers={'Content-Type':'image/png'})
print(post.data.decode())
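
Rather than hard-coding the Content-Type, it can be derived from the file name with the standard mimetypes module. A small sketch, where upload_headers is a hypothetical helper name:

```python
import mimetypes


def upload_headers(path):
    """Build a Content-Type header from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return {"Content-Type": mime_type or "application/octet-stream"}


print(upload_headers("p.png"))
print(upload_headers("data.unknownext"))
```

The returned dictionary can be passed directly as the headers argument of request().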
