Python urllib Tutorial (urllib.request, urllib.parse, urllib.error, and urllib.robotparser) (URL Parsing, URL Encoding)

Python urllib Tutorial

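In Python, urllib is a standard-library package for working with URLs. It contains four submodules: urllib.request, urllib.parse, urllib.error, and urllib.robotparser. The sections below look at each one in turn, with example code to make the ideas concrete.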

1. urllib.request

1.1 Basic usage

The urllib.request module defines functions and classes for opening URLs. The most common usage is:

import urllib.request

response = urllib.request.urlopen('http://example.com')
html = response.read()

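This example imports urllib.request and calls urlopen to open a URL. urlopen returns a response object whose read method returns the body of the page as bytes; call decode on the result (for example html.decode('utf-8')) to get the HTML as text.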
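
For anything beyond a plain GET, urlopen also accepts a Request object that carries headers and an optional body. The following is a minimal sketch rather than a definitive recipe: the helper name build_request, the User-Agent string, and the /search URL are assumptions made up for illustration. It only builds the request, so it runs without network access.

```python
import urllib.request
from urllib.parse import urlencode


def build_request(url, params=None, user_agent="urllib-tutorial/0.1"):
    """Build (but do not send) a Request with a custom User-Agent header.

    build_request and the default user agent are hypothetical names
    used only for this illustration.
    """
    # A request body must be bytes; attaching one turns the request into a POST.
    data = urlencode(params).encode("utf-8") if params else None
    return urllib.request.Request(url, data=data, headers={"User-Agent": user_agent})


req = build_request("http://example.com/search", params={"q": "python"})
print(req.get_method())              # POST, because a request body is attached
print(req.full_url)                  # http://example.com/search
print(req.get_header("User-agent"))  # header names are stored capitalized

# To actually send it (requires network access):
# with urllib.request.urlopen(req, timeout=10) as response:
#     text = response.read().decode("utf-8")
```

Using the response in a with statement closes the connection automatically, and passing timeout keeps a slow server from blocking the program indefinitely.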

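1.2 Exception handling

When the server responds with an HTTP error status, such as 404 Not Found or 500 Internal Server Error, urlopen raises an HTTPError. It is therefore good practice to wrap calls to urlopen in exception handling.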

import urllib.request
from urllib.error import HTTPError

url = 'http://example.com'
try:
    response = urllib.request.urlopen(url)
except HTTPError as e:
    print(f'Error: {e.code} while fetching {url}')

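In this example, if the server returns an error status for the requested page, the program prints the status code instead of terminating with a traceback.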
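
An HTTPError carries more than the status code: it also exposes a reason attribute with the status text. As a rough sketch, the helper below turns an exception into a readable message. The name describe_failure and the sample URL are assumptions for illustration, and the HTTPError is constructed by hand so the example runs offline.

```python
import io
from urllib.error import HTTPError, URLError


def describe_failure(url, exc):
    """Return a one-line description of a urllib failure (hypothetical helper)."""
    # Check HTTPError first: it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"HTTP {exc.code} ({exc.reason}) while fetching {url}"
    if isinstance(exc, URLError):
        return f"Network failure ({exc.reason}) while fetching {url}"
    return f"Unexpected error {exc!r} while fetching {url}"


# Build an HTTPError by hand instead of contacting a server.
url = "http://example.com/missing"
sample = HTTPError(url, 404, "Not Found", None, io.BytesIO())
print(describe_failure(url, sample))
# HTTP 404 (Not Found) while fetching http://example.com/missing
```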

2. urllib.parse

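2.1 URL parsing

The urllib.parse module provides functions for working with URLs. Its urlparse function splits a URL into six components: scheme, netloc, path, params, query, and fragment. For example: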

from urllib.parse import urlparse

result = urlparse('http://example.com:80/path;param?query=arg#frag')
print(result)

Output:

ParseResult(scheme='http', netloc='example.com:80', path='/path', params='param', query='query=arg', fragment='frag')

In this example, urlparse splits the URL string into a ParseResult, a named tuple that holds the six components.
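
Because the result is a named tuple, its fields can be read by name, and a few companion functions in urllib.parse build on it. The sketch below is illustrative only; the extra tag parameters and the second URL are made up for the example.

```python
from urllib.parse import parse_qs, urljoin, urlparse, urlunparse

parts = urlparse("http://example.com:80/path;param?query=arg&tag=a&tag=b#frag")

# Convenience attributes derived from netloc.
print(parts.hostname)  # example.com
print(parts.port)      # 80

# parse_qs turns the query string into a dict of lists (keys may repeat).
query = parse_qs(parts.query)
print(query)           # {'query': ['arg'], 'tag': ['a', 'b']}

# Named tuples are immutable; _replace returns a modified copy,
# and urlunparse reassembles the six components into a URL string.
secure = urlunparse(parts._replace(scheme="https", netloc="example.com"))
print(secure)          # https://example.com/path;param?query=arg&tag=a&tag=b#frag

# urljoin resolves a relative reference against a base URL.
joined = urljoin("http://example.com/docs/index.html", "../img/logo.png")
print(joined)          # http://example.com/img/logo.png
```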

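2.2 URL encoding

What is a URL, and why is URL encoding needed?

Further reading: "The necessity of URL encoding (percent-encoding): passing special characters and non-ASCII characters. What happens without URL encoding?"

URL encoding is required whenever special characters or non-ASCII characters must be passed in a URL.

For example:

(Note: this .py file must be saved as UTF-8. If it is saved in a different encoding, such as the one some editors label "Unicode", running it will raise an error. A tool such as Beyond Compare shows the encoding it detects when opening the file; if that encoding is wrong, change it.)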

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from urllib.parse import urlencode

encoding = 'utf-8'


def main():
    params = {'name': '张三', 'age': 20}
    query_string = urlencode(params)
    print(query_string)


if __name__ == "__main__":
    main()

Output:

name=%E5%BC%A0%E4%B8%89&age=20

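In this example, urlencode converts a dictionary into a URL-encoded query string. By default, non-ASCII characters are encoded as UTF-8 bytes and each byte is then percent-encoded.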
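
urlencode works on whole dictionaries; for a single path segment or value, urllib.parse also offers quote, quote_plus, and unquote. The short sketch below shows how they differ. The sample strings are arbitrary and chosen only to illustrate the behavior.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

# quote percent-encodes UTF-8 bytes; "/" is left alone by default (safe="/").
print(quote("张三"))        # %E5%BC%A0%E4%B8%89
print(quote("a b/c"))       # a%20b/c

# quote_plus targets query strings: spaces become "+", and "/" is encoded too.
print(quote_plus("a b/c"))  # a+b%2Fc

# unquote reverses the percent-encoding.
print(unquote("%E5%BC%A0%E4%B8%89"))  # 张三

# doseq=True expands a list value into repeated keys.
print(urlencode({"tag": ["a", "b"]}, doseq=True))  # tag=a&tag=b
```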

3. urllib.error

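The urllib.error module defines the exceptions raised by urllib.request. When a network problem occurs, such as a failed connection or a host name that cannot be resolved, urllib.request raises a URLError. For example: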

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import urllib.request

encoding = 'utf-8'


def main():
    from urllib.error import URLError

    url = 'http://nonexistent.example.com'
    try:
        response = urllib.request.urlopen(url)
    except URLError as e:
        print(f'Error: {e.reason} while fetching {url}')


if __name__ == "__main__":
    main()

Output:

Error: [Errno -2] Name or service not known while fetching http://nonexistent.example.com

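In this example, when the program tries to reach a host that does not exist, it prints the reason for the failure instead of terminating. The exact wording of the reason depends on the operating system.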
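
Since HTTPError is a subclass of URLError, code that wants to treat the two differently has to catch HTTPError first. The sketch below shows one possible arrangement. The names fetch_text and failing_opener are assumptions for illustration, and the injectable opener parameter exists only so the example can run without network access.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_text(url, opener=urllib.request.urlopen, timeout=10):
    """Return (body, error); exactly one of the two is None (hypothetical helper)."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace"), None
    except HTTPError as e:
        # Must come before URLError, because HTTPError is a subclass of it.
        return None, f"HTTP error: {e.code}"
    except URLError as e:
        return None, f"URL error: {e.reason}"


def failing_opener(url, timeout=None):
    """Stand-in for urlopen that simulates a DNS failure (illustration only)."""
    raise URLError("Name or service not known")


body, error = fetch_text("http://nonexistent.example.com", opener=failing_opener)
print(error)  # URL error: Name or service not known
```

In real use the opener argument would simply be left at its default, urllib.request.urlopen.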

4. urllib.robotparser

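The urllib.robotparser module provides the RobotFileParser class for parsing robots.txt files. Site administrators typically create these files to tell web crawlers which pages may be fetched and which may not. For example: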

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

can_fetch = rp.can_fetch('*', '/secret/page.html')
print(can_fetch)

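This example creates a RobotFileParser object, sets the URL of the robots.txt file to read, and calls read to download and parse it. It then calls can_fetch to check whether a crawler with the given user agent may fetch a particular page.

That concludes this tour of Python's urllib library. It covers sending HTTP requests, parsing URLs, handling errors, and parsing robots.txt files. Hopefully this article helps you understand and use the library more effectively.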
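
To see how the rules are evaluated without downloading anything, RobotFileParser can also be fed the lines of a robots.txt file directly through its parse method. The rules below are a hypothetical robots.txt invented for this sketch, not the real file of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /secret/
Allow: /
"""

rp = RobotFileParser()
# parse takes an iterable of lines, so no network request is made.
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

blocked = rp.can_fetch("*", "http://example.com/secret/page.html")
allowed = rp.can_fetch("*", "http://example.com/index.html")
print(blocked)  # False: the path starts with /secret/
print(allowed)  # True: only /secret/ is disallowed
```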
