2.1 Detailed explanation of urllib library

1. Basic module

The urllib library is Python's built-in HTTP request library; it requires no additional installation and can be used directly. The urllib library contains the following four modules.

  • request: The most basic HTTP request module, which can simulate sending requests. Just pass in a URL and the corresponding parameters to simulate a browser sending a request.
  • error: The exception handling module. If a request raises an exception, we can catch it and then retry or perform other operations to ensure the program does not terminate unexpectedly.
  • parse: A tool module that provides many URL processing methods, such as splitting, parsing, and merging.
  • robotparser: Mainly used to parse a website's robots.txt file and determine which pages may be crawled and which may not. It is rarely used.
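
Since robotparser gets no example later, here is a minimal sketch of how it can be used; the rules are fed in directly with parse instead of being downloaded, and the example.com URLs are hypothetical:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse robots.txt rules supplied inline (normally set_url + read would fetch them)
rp.parse('User-agent: *\nDisallow: /private/'.splitlines())
print(rp.can_fetch('*', 'https://example.com/index.html'))      # True
print(rp.can_fetch('*', 'https://example.com/private/a.html'))  # False
```

can_fetch answers whether a given user agent may crawl a given URL under the parsed rules.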

2. Send a request (request)

urlopen

The request module provides urlopen, the most basic way to construct an HTTP request. We can use urlopen to simulate a browser initiating a request.

from urllib import request

response = request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Running this prints the HTML source code of the page (screenshot omitted).

We can use the type function to check the type of the response:

from urllib import request

response = request.urlopen('https://www.python.org')
print(type(response))

The output is:

<class 'http.client.HTTPResponse'>

It can be seen that the response is an object of type HTTPResponse.

HTTPResponse mainly includes methods such as read, readinto, getheader, getheaders, and fileno, as well as attributes such as msg, version, status, reason, debuglevel, and closed. Here are some examples.

from urllib import request

response = request.urlopen('https://www.python.org')
print(response.status)        # the status attribute gives the response status code
print(response.getheaders())  # the getheaders method returns the response headers

The output is:
200
[('Connection', 'close'), ('Content-Length', '49948'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 27 Jun 2023 13:08:24 GMT'), ('Age', '3329'), ('X-Served-By', 'cache-iad-kiad7000025-IAD, cache-nrt-rjtf7700041-NRT'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '57, 1722'), ('X-Timer', 'S1687871304.488935,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains; preload')]

We can also pass other parameters to the urlopen method. The API of the urlopen method is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • data parameter

When using the data parameter, we need to use the bytes method to convert the parameter into byte-stream-encoded content (that is, the bytes type). If this parameter is passed, the request method changes from GET to POST. An example follows.

from urllib import request, parse

# Use urlencode to convert the dict into a URL-compliant query string,
# then bytes() to encode it as bytes; the encoding argument specifies the character encoding
data = bytes(parse.urlencode({'name': 'python'}), encoding='utf-8')
response = request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

The site requested this time is www.httpbin.org, which provides HTTP request testing; appending /post to it lets us test POST requests. The output contains the data parameter we passed.

The running results (screenshot omitted) show the submitted form data in the response.

  • timeout parameter

The timeout parameter sets a timeout in seconds. If no response is received within the set time, an exception is thrown. If the parameter is not specified, the global default timeout is used. We can use timeout to skip crawling a web page that does not respond for a long time.

from urllib import request

response = request.urlopen('https://www.httpbin.org', timeout=0.1)
# with a 0.1-second timeout it is almost impossible to get a response from the server
print(response.read())

Running it raises a timeout error (screenshot omitted).
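
Following the idea of skipping unresponsive pages, a common pattern (a sketch, reusing the same test site) is to catch the URLError and check whether its reason is a socket.timeout:

```python
import socket
from urllib import request, error

try:
    response = request.urlopen('https://www.httpbin.org', timeout=0.1)
    print(response.status)
except error.URLError as e:
    # When the request times out, e.reason is a socket.timeout instance
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    else:
        print('OTHER ERROR:', e.reason)
```

Instead of crashing, the program prints a message and can move on to the next page.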

  • Other parameters

cafile and capath specify a CA certificate file and a directory of CA certificates, respectively; cadefault is deprecated and its default value is False (in recent Python versions cafile and capath are also deprecated in favor of context). The context parameter must be of type ssl.SSLContext and is used to specify SSL settings.
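
As a sketch of the context parameter, the snippet below only builds a default ssl.SSLContext and checks its verification mode; passing it to urlopen is left commented out to avoid a network request:

```python
import ssl

# create_default_context builds an SSLContext with sensible defaults:
# certificate verification is enabled out of the box
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True

# It can then be passed to urlopen, e.g.:
# from urllib import request
# response = request.urlopen('https://www.python.org', context=context)
```

Attributes such as verify_mode and check_hostname can be adjusted on the context before passing it in.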

Request

We can create an object of the Request class and then pass that object as a parameter to the urlopen method, which allows the request parameters to be configured flexibly. The constructor of the Request class is as follows:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • url
    The URL used for the request is a required parameter, and the others are optional parameters.

  • data
    needs to be converted to bytes type. If the data is a dictionary, the urlencode method in the urllib.parse module can be used to convert the dictionary parameters into a query string that conforms to the URL specification.

  • headers
    The request headers, a dictionary. We can either pass them directly through the headers parameter when constructing the request, or add them afterwards by calling the add_header method of the Request instance. The most common use of request headers is to disguise the client as a browser by modifying User-Agent. The default User-Agent is Python-urllib. For example, to disguise ourselves as Google Chrome, we can set User-Agent to:
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
    We can view the browser's User-Agent through the Headers option in the Network panel of the developer tools (screenshot omitted).

  • origin_req_host
    The host name or IP address of the requester.

  • unverifiable
    Indicates whether the request is unverifiable, i.e. the user has had no opportunity to approve it (for example, requesting an image embedded in an HTML document). The default value is False.

  • method
    indicates the method used for the request, such as GET, POST, etc.

Next, the Request class is constructed with specific parameters.

from urllib import request, parse

url = 'https://www.httpbin.org/post'
data = bytes(parse.urlencode({'name': 'python'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/114.0.0.0 Safari/537.36'
}
method = 'POST'
req = request.Request(url=url, data=data, headers=headers, method=method)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The running results are as follows (screenshot omitted).
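
As mentioned above, headers can also be added after construction with the Request instance's add_header method. A minimal sketch (no request is actually sent here):

```python
from urllib import request, parse

url = 'https://www.httpbin.org/post'
data = bytes(parse.urlencode({'name': 'python'}), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
# add_header sets one header at a time after the Request has been constructed
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36')
# Request normalizes header names, so look it up as 'User-agent'
print(req.get_header('User-agent'))
```

The resulting req object is then passed to urlopen exactly as in the example above.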

3. Exception handling (error)

When we send a request, an exception may occur; if not handled, the program may stop running due to the error. The error module in the urllib library defines the exceptions generated by the request module: when a problem occurs, the request module throws an exception defined in the error module.

URLError

Exceptions generated by the request module can be handled by catching this class. Its reason attribute returns the cause of the error. For example:

from urllib import request, error

try:
    res = request.urlopen('https://xiaohui.com/403')
except error.URLError as e:
    print(e.reason)

The running results are as follows:
Not Found

HTTPError

HTTPError is a subclass of URLError, specifically for handling HTTP error responses, such as failed authentication requests. It has the following three attributes:

  • code: returns the HTTP status code.
  • reason: returns the reason for the error.
  • headers: returns the response headers.

Here are some examples:

from urllib import request, error

try:
    res = request.urlopen('https://helloword.com/404')
except error.HTTPError as e:
    print(e.code, e.reason, e.headers, sep='\n')

The running results are as follows (screenshot omitted).
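
Since HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and fall back to URLError; a sketch reusing the same hypothetical URL as above:

```python
from urllib import request, error

try:
    response = request.urlopen('https://helloword.com/404', timeout=5)
except error.HTTPError as e:
    # The subclass must be caught before its parent class URLError
    print('HTTP Error:', e.code, e.reason)
except error.URLError as e:
    # Non-HTTP failures (DNS errors, timeouts, refused connections) land here
    print('URL Error:', e.reason)
else:
    print('Request Succeeded')
```

This way the program reports a useful message for every failure mode instead of terminating.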

4. Parse the link (parse)

urlparse

This method identifies a URL and splits it into its components. For example:

from urllib.parse import urlparse

res = urlparse('https://editor.csdn.net/md?articleId=131423019')
print(type(res))
print(res)

The running results are as follows:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='editor.csdn.net', path='/md', params='', query='articleId=131423019', fragment='')

It can be seen from the running results that the parsing result is an object of type ParseResult, including 6 parts, namely scheme, netloc, path, params, query, and fragment.

The returned ParseResult is actually a subclass of tuple, so its contents can be obtained either by attribute name or by index. For example:

from urllib.parse import urlparse

res = urlparse('https://editor.csdn.net/md?articleId=131423019')
print(res.scheme, res[0], res.netloc, res[-1], sep='\n')

The result is as follows (the fourth line is empty, because res[-1] is the fragment part, which is an empty string here):
https
https
editor.csdn.net


urlunparse

It can be understood as the inverse of urlparse and is used to construct URLs. The argument passed to this method must have a length of 6, otherwise an error is raised. For example:

from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'id=8', 'comment']
print(urlunparse(data))

The running results are as follows:
https://www.baidu.com/index.html;user?id=8#comment

urlsplit

This method is similar to urlparse, except that the urlsplit method no longer parses the params part separately, but merges this part into the path, so only 5 results will be returned. Examples are as follows:

from urllib.parse import urlsplit

res = urlsplit('https://www.baidu.com/index.html;user?id=8#comment')
print(type(res))
print(res)

The running results are as follows:
<class 'urllib.parse.SplitResult'>
SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=8', fragment='comment')

urlunsplit

Similarly, this method is the counterpart of urlunparse; the only difference is that its argument must have a length of 5.
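
Since urlunsplit has no example above, here is a minimal sketch, reusing the 5 parts from the urlsplit example:

```python
from urllib.parse import urlsplit, urlunsplit

# scheme, netloc, path, query, fragment (no separate params part)
data = ['https', 'www.baidu.com', '/index.html;user', 'id=8', 'comment']
url = urlunsplit(data)
print(url)  # https://www.baidu.com/index.html;user?id=8#comment

# Splitting and recombining gives back the same URL
print(urlunsplit(urlsplit(url)) == url)  # True
```

The round trip shows that urlsplit and urlunsplit are exact inverses of each other.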

urljoin

urljoin is used to splice a base URL and a relative URL to generate a complete URL.
The specific usage is as follows:

  • Joining absolute URLs: If the second argument is itself an absolute URL (for example, starting with "http" or "https"), the urljoin() function returns that URL directly instead of concatenating it with the base URL. For example:
from urllib.parse import urljoin

base_url = 'https://baidu.com'
relative_url = 'https://www.python.org'
print(urljoin(base_url, relative_url))

Running results: https://www.python.org

  • Resolving relative paths: If the second argument is a relative path (not an absolute URL), the urljoin() function splices it onto the base URL to generate a complete URL.
from urllib.parse import urljoin

base_url = 'https://baidu.com'
relative_url = '/about.html'
print(urljoin(base_url, relative_url))

Running results: https://baidu.com/about.html

  • Resolving path symbols: The urljoin() function resolves path symbols (such as ".." and ".") in the relative URL against the path of the base URL, ensuring that the generated URL is correct.
from urllib.parse import urljoin

base_url = 'https://baidu.com/default/html/'
relative_url_1 = '../about.html'  # go up one directory level
relative_url_2 = './about.html'   # stay in the current directory
print(urljoin(base_url, relative_url_1))
print(urljoin(base_url, relative_url_2))

Running result:
https://baidu.com/default/about.html
https://baidu.com/default/html/about.html

  • Replacing query parameters: If the relative URL contains a query string, the urljoin() function replaces the base URL's query string with it entirely; the base URL's query parameters are discarded rather than merged.
from urllib.parse import urljoin

base_url_1 = 'https://baidu.com/default/html/'
relative_url_1 = '?id=8'
base_url_2='https://baidu.com/default/html/?catagorg=2'
relative_url_2 = '?id=6'
print(urljoin(base_url_1, relative_url_1))
print(urljoin(base_url_2, relative_url_2))

Running result:
https://baidu.com/default/html/?id=8
https://baidu.com/default/html/?id=6

These behaviors make the urljoin() function convenient and reliable for URL splicing: it handles many different kinds of URLs correctly and ensures that the complete URL it generates is as expected.

urlencode

The urlencode method serializes a params dict into the query string of a GET request. For example:

from urllib.parse import urlencode

params = {
    'name': 'xiaohui',
    'age': 19
}
base_url = 'https://www.vonphy.love?'
url = base_url + urlencode(params)
print(url)

Running results: https://www.vonphy.love?name=xiaohui&age=19
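
A detail worth noting: when a value in the params dict is itself a sequence, passing doseq=True makes urlencode expand it into repeated query parameters (the hobby key below is a made-up example):

```python
from urllib.parse import urlencode

# 'hobby' maps to a list; doseq=True turns it into hobby=music&hobby=code
params = {'name': 'xiaohui', 'hobby': ['music', 'code']}
print(urlencode(params, doseq=True))  # name=xiaohui&hobby=music&hobby=code
```

Without doseq=True, the list itself would be percent-encoded as a single literal value.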

parse_qs

The parse_qs method is the opposite of the urlencode method and is used to convert GET request parameters into a dictionary. Examples are as follows:

from urllib.parse import parse_qs

query = 'name=xiaohui&age=19'
print(parse_qs(query))

Running results: {'name': ['xiaohui'], 'age': ['19']}

parse_qsl

parse_qsl is used to convert parameters into a list of tuples. The example is as follows:

from urllib.parse import parse_qsl

query = 'name=xiaohui&age=19'
print(parse_qsl(query))

Running results: [('name', 'xiaohui'), ('age', '19')]

quote

This method converts content into the URL-encoded (percent-encoded) format. When a URL contains Chinese characters, they may end up garbled; to avoid this problem, the quote method can convert Chinese characters into URL encoding. For example:

from urllib.parse import quote

keyword = '你好'
url = 'https://www.baidu.com' + quote(keyword)
print(url)

Running results: https://www.baidu.com%E4%BD%A0%E5%A5%BD
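
One more detail: quote leaves certain characters (such as '/') unescaped by default, and the safe parameter controls which characters are exempt. A small sketch with a made-up path:

```python
from urllib.parse import quote

path = '/search path/你好'
print(quote(path))           # /search%20path/%E4%BD%A0%E5%A5%BD  ('/' kept by default)
print(quote(path, safe=''))  # %2Fsearch%20path%2F%E4%BD%A0%E5%A5%BD  ('/' encoded too)
```

Use safe='' when the value being encoded is a single query parameter rather than a path.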

unquote

In contrast to the quote method, this method is used for URL decoding. Examples are as follows:

from urllib.parse import unquote

url = 'https://www.baidu.com?wd=%E4%BD%A0%E5%A5%BD'
print(unquote(url))

Running results: https://www.baidu.com?wd=你好


Origin blog.csdn.net/weixin_75094128/article/details/131423019