Python crawler hands-on study notes_3 Network requests with the urllib module: set IP proxy + handle request exceptions + parse URL + decode + encode + combine URL + URL join

1 Set up IP proxy

1.1 Methodology

Setting a proxy IP with the urllib module is fairly simple. First, create a ProxyHandler object whose parameter is a dictionary of proxy IPs: the key is the protocol type (such as http or https) and the value is the proxy address. Then pass the ProxyHandler object to the build_opener() method to build a new opener object, and finally use that opener to send the network request.

1.2 Code Implementation

import urllib.request  # import the urllib.request module
url = 'https://www.httpbin.org/get'   # request address

# Create the proxy handler
proxy_handler = urllib.request.ProxyHandler({
    'https': '58.220.95.114:10053'
})

# Create the opener object
opener = urllib.request.build_opener(proxy_handler)
response = opener.open(url, timeout=2)     # send the network request
print(response.read().decode('utf-8'))     # print the response body
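If many requests should go through the same proxy, the opener can also be registered globally with install_opener(). Below is a minimal sketch, assuming the same sample proxy address as above (it may no longer be reachable):

import urllib.request   # import the urllib.request module

proxy_handler = urllib.request.ProxyHandler({
    'https': '58.220.95.114:10053'        # sample proxy from above; may no longer work
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)     # register the opener globally
# from now on, plain urlopen() calls also go through the proxy
response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=2)
print(response.read().decode('utf-8'))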

2 Handling request exceptions

Many errors can occur when sending network requests. The urllib.error module in urllib contains two important exception classes: URLError and HTTPError.

2.1 Handling URLError exceptions

The reason attribute of the URLError class reports the cause of the exception.

2.1.1 Send a request to a non-existent address

import urllib.request    # import the urllib.request module
import urllib.error      # import the urllib.error module
try:
    # send a request to a non-existent address
    response = urllib.request.urlopen('http://www.52pojie.cn/4040.html')
except urllib.error.URLError as error:    # catch the exception
    print(error.reason)                    # print the cause of the exception
    print(error.code)                      # print the HTTP status code (only if the error is an HTTPError)
    print(error.headers)                   # print the response headers (only if the error is an HTTPError)
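Note that code and headers exist only when the caught error is actually an HTTPError; for errors such as DNS failures they are absent, and accessing them raises AttributeError. A minimal, more defensive sketch:

import urllib.request    # import the urllib.request module
import urllib.error      # import the urllib.error module

try:
    response = urllib.request.urlopen('http://www.52pojie.cn/4040.html')
except urllib.error.URLError as error:
    print(error.reason)                 # always available
    if hasattr(error, 'code'):          # present only for HTTPError
        print(error.code)
    if hasattr(error, 'headers'):       # present only for HTTPError
        print(error.headers)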

2.1.2 Catching both HTTPError and URLError

Since HTTPError is a subclass of URLError, catch HTTPError first and fall back to URLError for errors (such as DNS failures or timeouts) that carry no HTTP status code.

import urllib.request    # import the urllib.request module
import urllib.error      # import the urllib.error module
try:
    # send a request to a non-existent address
    response = urllib.request.urlopen('https://www.python12.org/', timeout=0.1)
except urllib.error.HTTPError as error:    # catch HTTPError first
    print('Status code:', error.code)                  # print the status code
    print('HTTPError reason:', error.reason)           # print the cause of the exception
    print('Response headers:\n', error.headers)        # print the response headers
except urllib.error.URLError as error:     # then catch URLError
    print('URLError reason:', error.reason)
    # output: URLError reason: [Errno 11001] getaddrinfo failed

3 Parse the URL

The urllib module provides the parse submodule, which is mainly used to parse URLs; it can split a URL into parts or combine parts into a URL, and it supports URLs of many protocols.

3.1 urlparse()

3.1.1 Introduction to urlparse()

urlparse() is used to break a URL into its component parts.

3.1.2 urlparse() function signature

 urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
  • urlstring: the URL to be split; this parameter is required.
  • scheme: optional; the default protocol to use. If the URL to be split carries no protocol, this parameter supplies one. It defaults to an empty string.
  • allow_fragments: optional; if set to False, the fragment part is not split out. It defaults to True. (See the sketch after this list.)
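
A minimal sketch of the scheme and allow_fragments parameters, using a hypothetical address on www.example.org:

import urllib.parse    # import the urllib.parse module

# no protocol in the URL: the scheme parameter supplies the default 'https'
print(urllib.parse.urlparse('www.example.org/index.html#top', scheme='https'))
# the fragment is still split out: fragment='top'

# allow_fragments=False: the fragment is not split out and stays in the path
print(urllib.parse.urlparse('https://www.example.org/index.html#top', allow_fragments=False))
# path='/index.html#top', fragment=''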

3.1.3 Using urlparse() to decompose URLs

import urllib.parse    #导入urllib.parse模块
parse_result = urllib.parse.urlparse('https://www.baidu.com/doc/library/urllib.parse.html')
print(type(parse_result))    # 打印类型
print(parse_result)          # 打印拆分后的结果
### 也可以拆分打印
print(parse_result.scheme)          # 打印拆分后的结果
print(parse_result.netloc)          # 打印拆分后的结果
print(parse_result.path)          # 打印拆分后的结果
print(parse_result.params)          # 打印拆分后的结果
print(parse_result.query)          # 打印拆分后的结果
print(parse_result.fragment)          # 打印拆分后的结果

output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/doc/library/urllib.parse.html', params='', query='', fragment='')
https
www.baidu.com
/doc/library/urllib.parse.html

(the last three prints produce empty lines, because params, query and fragment are empty strings here)
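
ParseResult is a named tuple, so the parts can also be read by index, and the whole result can be fed back into urlunparse() (covered below) to rebuild the URL. A minimal sketch:

import urllib.parse    # import the urllib.parse module

parse_result = urllib.parse.urlparse('https://www.baidu.com/doc/library/urllib.parse.html')
print(parse_result[0], parse_result[1])        # same as .scheme and .netloc
print(urllib.parse.urlunparse(parse_result))   # reassemble the original URL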

3.2 urlsplit()

3.2.1 Introduction to urlsplit()

The urlsplit() method is similar to urlparse() and also splits a URL, but it does not split out the params part separately; params is merged into path, so the result contains only five parts and its data type is SplitResult. The sample code is as follows.

3.2.2 urlsplit() code implementation

import urllib.parse    # import the urllib.parse module
# the URL to be split
url = 'https://www.baidu.com/doc/library/urllib.parse.html'
print(urllib.parse.urlsplit(url))     # split the URL with urlsplit()
# output: SplitResult(scheme='https', netloc='www.baidu.com', path='/doc/library/urllib.parse.html', query='', fragment='')
print(urllib.parse.urlparse(url))     # split the URL with urlparse()
# output: ParseResult(scheme='https', netloc='www.baidu.com', path='/doc/library/urllib.parse.html', params='', query='', fragment='')
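
The difference only becomes visible when the URL actually contains a ';parameters' segment. A minimal sketch with a hypothetical address:

import urllib.parse    # import the urllib.parse module

url = 'https://www.example.org/path;type=a?page=1#top'   # hypothetical URL with a params segment
print(urllib.parse.urlparse(url).path, urllib.parse.urlparse(url).params)
# urlparse() separates them: path='/path', params='type=a'
print(urllib.parse.urlsplit(url).path)
# urlsplit() keeps them together: path='/path;type=a'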

4 Combining URLs

4.1 urlunparse()

When using the urlunparse() method to combine a URL, note that the iterable passed in must contain exactly six elements: scheme, netloc, path, params, query and fragment.

import urllib.parse    # import the urllib.parse module
list_url = ['https','baidu.org','/3/library/urllib.parse.html','','','']
tuple_url = ('https','baidu.org','/3/library/urllib.parse.html','','','')
dict_url = {'scheme':'https','netloc':'docs.baidu.org','path':'/baidu/library/urllib.parse.html','params':'','query':'','fragment':''}
print('URL combined from a list:', urllib.parse.urlunparse(list_url))
print('URL combined from a tuple:', urllib.parse.urlunparse(tuple_url))
print('URL combined from a dict:', urllib.parse.urlunparse(dict_url.values()))

output:

URL combined from a list: https://baidu.org/3/library/urllib.parse.html
URL combined from a tuple: https://baidu.org/3/library/urllib.parse.html
URL combined from a dict: https://docs.baidu.org/baidu/library/urllib.parse.html
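
Since urlunparse() is the inverse of urlparse(), feeding a split result straight back in reproduces the original URL. A minimal round-trip sketch:

import urllib.parse    # import the urllib.parse module

url = 'https://www.baidu.com/doc/library/urllib.parse.html'
print(urllib.parse.urlunparse(urllib.parse.urlparse(url)) == url)   # True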

4.2 urlunsplit()

urlunsplit() works the same way, but the iterable must contain exactly five elements: scheme, netloc, path, query and fragment.

import urllib.parse    # import the urllib.parse module
list_url = ['https','docs.python.org','/3/library/urllib.parse.html','','']
tuple_url = ('https','docs.python.org','/3/library/urllib.parse.html','','')
dict_url = {'scheme':'https','netloc':'docs.python.org','path':'/3/library/urllib.parse.html','query':'','fragment':''}
print('URL combined from a list:', urllib.parse.urlunsplit(list_url))
print('URL combined from a tuple:', urllib.parse.urlunsplit(tuple_url))
print('URL combined from a dict:', urllib.parse.urlunsplit(dict_url.values()))

output:

URL combined from a list: https://docs.python.org/3/library/urllib.parse.html
URL combined from a tuple: https://docs.python.org/3/library/urllib.parse.html
URL combined from a dict: https://docs.python.org/3/library/urllib.parse.html
 

5 URL joining: urllib.parse.urljoin()

5.1 Function prototype

urllib.parse.urljoin(base, url, allow_fragments=True)
  • base: the base URL
  • url: the new (possibly relative) URL to resolve against the base
  • allow_fragments: optional; defaults to True

5.2 Example of using urllib.parse.urljoin()

import urllib.parse    # import the urllib.parse module
base_url = 'https://tet.baidu.org'   # define the base URL
# when the second argument is an incomplete URL, it is merged with the base
print(urllib.parse.urljoin(base_url,'3/library/urllib.parse.html'))
# when the second argument is a complete URL, it is returned as-is without merging
print(urllib.parse.urljoin(base_url,'https://docs.tet.baidu.org/3/library/urllib.parse.html#url-parsing'))

output:

https://tet.baidu.org/3/library/urllib.parse.html
https://docs.tet.baidu.org/3/library/urllib.parse.html#url-parsing
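
urljoin() resolves the second argument against the base the same way a browser resolves a relative link. A minimal sketch with a hypothetical base URL:

import urllib.parse    # import the urllib.parse module

base = 'https://www.example.org/a/b.html'   # hypothetical base URL
print(urllib.parse.urljoin(base, 'c.html'))       # relative path -> https://www.example.org/a/c.html
print(urllib.parse.urljoin(base, '/c.html'))      # absolute path -> https://www.example.org/c.html
print(urllib.parse.urljoin(base, '//other.example.org/c.html'))   # new host -> https://other.example.org/c.html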
 

6 URL encoding

The quote() method is similar to urlencode(), but urlencode() only accepts a dictionary-type parameter, while quote() can encode a plain string.

6.1 Encoding request parameters using the urlencode() method

import urllib.parse    # import the urllib.parse module
base_url = 'http://baidu.org/get?'    # define the base URL
params = {'name':'Jack','country':'中国','age':30}  # define the request parameters as a dictionary
url = base_url + urllib.parse.urlencode(params)     # build the request address
print('Encoded request address:', url)

Encoded request address: http://baidu.org/get?name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30

6.2 Using the quote() method to encode string arguments

import urllib.parse    # import the urllib.parse module
base_url = 'http://baidu.org/get?country='    # define the base URL
url = base_url + urllib.parse.quote('中国')    # encode the string
print('Encoded request address:', url)

Encoded request address: http://baidu.org/get?country=%E4%B8%AD%E5%9B%BD
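
One detail worth knowing: urlencode() encodes values with quote_plus() by default, so spaces become '+', while quote() turns spaces into %20 and leaves '/' unencoded unless told otherwise. A minimal sketch:

import urllib.parse    # import the urllib.parse module

text = 'python 爬虫/笔记'                      # sample string with a space and a slash
print(urllib.parse.quote(text))               # space -> %20, '/' kept by default
print(urllib.parse.quote(text, safe=''))      # encode '/' as well
print(urllib.parse.quote_plus(text))          # space -> '+', '/' encoded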

7 Decoding

7.1 Decoding with unquote()

import urllib.parse    # import the urllib.parse module
u = urllib.parse.urlencode({'country':'中国'})   # encode with urlencode()
q = urllib.parse.quote('country=中国')            # encode with quote()
print('urlencode() result:', u)
print('quote() result:', q)
print('unquote() applied to the urlencode() result:', urllib.parse.unquote(u))
print('unquote() applied to the quote() result:', urllib.parse.unquote(q))

7.2 Converting a query string to a dictionary with parse_qs()

import urllib.parse    # import the urllib.parse module
# define a request address
url = 'http://httpbin.org/get?name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30'
q = urllib.parse.urlsplit(url).query   # extract the query string
q_dict = urllib.parse.parse_qs(q)      # convert the parameters to a dictionary
print('Data type:', type(q_dict))
print('Converted data:', q_dict)
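
Every value in the dictionary returned by parse_qs() is a list, because a key may appear more than once in a query string. A minimal sketch that also flattens the result when each key occurs only once:

import urllib.parse    # import the urllib.parse module

q_dict = urllib.parse.parse_qs('name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30')
print(q_dict)                                  # {'name': ['Jack'], 'country': ['中国'], 'age': ['30']}
flat = {key: values[0] for key, values in q_dict.items()}   # take the first value of each key
print(flat)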

7.3 Converting a query string to a list of tuples with parse_qsl()

import urllib.parse    # import the urllib.parse module
str_params = 'name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30'  # string parameters
list_params = urllib.parse.parse_qsl(str_params)   # convert the string to a list of tuples
print('Data type:', type(list_params))
print('Converted data:', list_params)
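
The list of (key, value) tuples produced by parse_qsl() can be fed straight back into urlencode(), which also accepts a sequence of pairs, giving a simple round trip. A minimal sketch:

import urllib.parse    # import the urllib.parse module

list_params = urllib.parse.parse_qsl('name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30')
print(urllib.parse.urlencode(list_params))
# -> name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30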
