Python advanced: obtaining network resources with urllib (1)

Urllib library

Let's first look at the urllib library, Python's built-in HTTP request library, which means we can use it without installing anything extra. It contains four modules:

  • The first module, request, is the most basic HTTP request module. We can use it to simulate sending a request, just as if we typed a URL into a browser and hit Enter: pass a URL (and optional extra parameters) to the library's methods and it simulates that whole process.
  • The second module, error, is the exception handling module. If a request fails, we can catch the resulting exception and then retry or take other action, so the program does not terminate unexpectedly.
  • The third module, parse, is a utility module that provides many URL processing methods, such as splitting, parsing, and merging (see the sketch after this list).
  • The fourth module, robotparser, is mainly used to parse a website's robots.txt file and determine which pages may and may not be crawled. It is rarely needed in practice.

Here we focus on the first three modules.
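
The parse module is described above but not demonstrated later, so here is a minimal sketch of its most common helpers; the URLs are made-up examples:

from urllib import parse

# Split a URL into its components
result = parse.urlparse('https://movie.douban.com/top250?start=0#comments')
print(result.scheme, result.netloc, result.path, result.query, result.fragment)

# Merge a relative path against a base URL
print(parse.urljoin('https://movie.douban.com/top250', '/subject/1292052/'))

# Encode a dict into a query string
print(parse.urlencode({'q': 'python', 'cat': '1001'}))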

1. The request module

The request module of urllib makes it easy to grab URL content: send a GET request to the specified page and get back the HTTP response.

For example, we can grab a Douban URL, https://api.douban.com/v2/book/2129650, and print the response:

(1) GET the specified page and return its data

# data: the request body (for POST requests); timeout: timeout in seconds
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None)

from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

(2) To simulate a browser sending a GET request, we need the Request object. By adding HTTP headers to the Request object, we can disguise the request as coming from a browser.

from urllib import request

baseUrl = 'http://movie.douban.com/top250?start=0'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': baseUrl,
    'Connection': 'keep-alive',
}
req = request.Request(baseUrl, headers=headers)
r = request.urlopen(req)
print(r.read().decode('utf-8'))
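
For reference, headers can also be attached one at a time with Request.add_header; this is a small sketch reusing the baseUrl defined above:

from urllib import request

req = request.Request(baseUrl)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')
req.add_header('Referer', baseUrl)
with request.urlopen(req) as r:
    print(r.status, r.reason)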

(3) POST: to send a request by POST, you only need to pass the data parameter as bytes:

from urllib import request, parse

baseUrl = 'http://movie.douban.com/top250?start=0'
# Alternatively: data = bytes(parse.urlencode({"": ""}), encoding="utf-8")
login_data = parse.urlencode([
    ('username', 'email'),
    ('password', 'passwd'),
]).encode('utf-8')  # urlencode() returns a str; urlopen() expects bytes
req = request.Request(baseUrl)
response = request.urlopen(req, data=login_data)
print(response.read().decode('utf-8'))

2. The error module

  • (1) error.URLError: raised by urlopen when a request fails
  • (2) error.HTTPError: raised for HTTP error status codes (a subclass of URLError)
  • (3) error.ContentTooShortError: raised when the downloaded size does not match the Content-Length header
  • (4) error.__all__: lists all three exception types above

from urllib import request, error

try:
    response = request.urlopen(req, data=login_data, timeout=0.01)
except error.URLError as e:
    print(e)
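
Because HTTPError is a subclass of URLError, a common pattern (sketched here, reusing the req from above) is to catch the more specific HTTPError first:

from urllib import request, error

try:
    response = request.urlopen(req, timeout=5)
except error.HTTPError as e:   # HTTP status errors such as 404 or 500
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:    # network-level failures, e.g. DNS errors
    print('URL error:', e.reason)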

requests

A better solution is to use requests, a third-party Python library that makes working with URL resources especially convenient.

If Anaconda is installed, requests is already available. Otherwise, you need to install it via pip from the command line:

$ pip install requests

If the installation fails with Permission denied, add sudo and try again.

Use requests

(1) Use GET to visit a page:

>>> import requests
>>> r = requests.get('https://www.douban.com/') # Douban homepage
>>> r.status_code
200
>>> r.text
'<!DOCTYPE HTML>\n<html>\n<head>\n<meta name="description" content="提供图书、电影、音乐唱片的推荐、评论和...'

For a URL with parameters, pass in a dict as the params parameter:

>>> r = requests.get('https://www.douban.com/search', params={'q': 'python', 'cat': '1001'})
>>> r.url # the URL actually requested
'https://www.douban.com/search?q=python&cat=1001'

requests detects the response encoding automatically; you can inspect it via the encoding attribute:

>>> r.encoding
'utf-8'

Whether the response body is text or binary, the content attribute returns it as a bytes object:

>>> r.content
b'<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n...'

Another convenience of requests is that specific response types, such as JSON, can be parsed directly:

>>> r = requests.get('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json')
>>> r.json()
{'query': {'count': 1, 'created': '2017-11-17T07:14:12Z', ...

When we need to pass in HTTP headers, we pass in a dict as the headers parameter:

>>> r = requests.get('https://www.douban.com/', headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'})
>>> r.text
'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8">\n <title>豆瓣(手机版)</title>...'

To send a POST request, just replace get() with post() and pass the POST body in the data parameter:

>>> r = requests.post('https://accounts.douban.com/login', data={'form_email': '[email protected]', 'form_password': '123456'})

By default, requests encodes POST data as application/x-www-form-urlencoded. To send JSON data instead, pass the json parameter directly:

params = {'key': 'value'}
r = requests.post(url, json=params) # serialized to JSON internally

Similarly, uploading files requires a more complex encoding format, but requests reduces it to the files parameter:

>>> upload_files = {'file': open('report.xls', 'rb')}
>>> r = requests.post(url, files=upload_files)

When reading the file, be sure to open it in binary mode ('rb'), so that the number of bytes read equals the file's actual length.

By replacing the post() method with put(), delete(), etc., you can request resources with the PUT or DELETE methods.
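
As a minimal sketch (httpbin.org is used here purely as a public test endpoint, not something from the examples above):

import requests

# PUT a JSON body, then DELETE a resource; both return Response objects like get()/post()
r = requests.put('https://httpbin.org/put', json={'key': 'value'})
print(r.status_code)
r = requests.delete('https://httpbin.org/delete')
print(r.status_code)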

In addition to easily getting the response content, requests also makes it very simple to get other information about the HTTP response.

For example, to get the response header:

>>> r.headers
{'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ...}
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
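
One detail worth knowing (a property of requests itself, not shown in the original examples): r.headers is a case-insensitive dict, so the capitalization of the header name does not matter:

>>> r.headers['content-type']  # same header, lower-case key
'text/html; charset=utf-8'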

requests also gives cookies special treatment, so we can easily read a specific cookie without parsing the Cookie headers ourselves:

>>> r.cookies['ts']
'example_cookie_12345'

To send cookies with a request, just prepare a dict and pass it as the cookies parameter:

>>> cs = {'token': '12345', 'status': 'working'}
>>> r = requests.get(url, cookies=cs)

Finally, to specify a timeout, pass the timeout parameter in seconds:

>>> r = requests.get(url, timeout=2.5) # time out after 2.5 seconds
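
If the timeout elapses, requests raises requests.exceptions.Timeout; here is a minimal sketch of guarding against it, mirroring the urllib error handling above:

import requests

try:
    r = requests.get(url, timeout=2.5)  # url as in the examples above
except requests.exceptions.Timeout:
    print('The request timed out')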

Origin blog.csdn.net/weixin_42272869/article/details/113742838