Python Crawler 1.1 - urllib Tutorial: Basic Usage

Overview

This series of documents is a simple tutorial on Python crawler techniques, written to explain and consolidate my own technical knowledge; if it happens to be useful to you, so much the better.
The Python version used is 3.7.4.

urllib Library Introduction

urllib is Python's built-in HTTP request library, which means it can be used without any extra installation. It contains four modules (this tutorial focuses mainly on the first three):

  • request: the most basic HTTP request module. We can use it to simulate sending a request, just like typing a URL into the browser and hitting Enter: pass the URL and any additional parameters to the library's methods, and it simulates that whole process.
  • error: the exception handling module. If a request fails, we can catch the exception and then retry or take other actions to make sure the program does not terminate unexpectedly.
  • parse: a utility module for handling URLs. It provides many URL-processing methods, such as splitting, parsing, and joining, as well as encoding query parameters.
  • robotparser: mainly used to parse a site's robots.txt file and determine which of the site's data may be crawled and which may not; in practice it is rarely used. Since it gets no section of its own below, a short sketch follows this list.
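
A minimal robotparser sketch (the robots.txt URL and the query URL are illustrative examples only):

        # Check whether a URL may be crawled according to the site's robots.txt
        import urllib.robotparser
        
        rp = urllib.robotparser.RobotFileParser()
        # Point the parser at the site's robots.txt (example URL)
        rp.set_url('https://www.baidu.com/robots.txt')
        rp.read()
        # can_fetch(useragent, url) returns True if the rules allow crawling the url
        print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))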

urllib.request Introduction

urlopen()

  1. Parameter description (only the four most commonly used parameters are covered)

    • url: the target URL to crawl;
    • data: the request body; if this parameter is set, the request is sent as a POST request, otherwise it defaults to a GET request;
    • timeout: timeout in seconds;
    • context: must be of type ssl.SSLContext; used to specify SSL settings, for example to skip verification of an untrusted CA certificate (a sketch of timeout and context follows the two request examples below);
  2. Specific usage

    • GET request method
        # Import the urllib library
        import urllib.request
        
        # Send a request to the given url; a file-like object with the server's response is returned
        url = "http://www.baidu.com"
        response = urllib.request.urlopen(url=url)
        print(type(response))
        
        # The file-like object supports the usual file operations, e.g. read() returns the whole body
        html = response.read()
        # html = response.readline() # read a single line
        # html = response.readlines() # read all lines, returning a list
        # Print the response body (bytes)
        print(html)
        # Print the response body decoded as utf-8
        # encode() and decode() convert between bytes and str
        # both take an optional encoding that defaults to utf-8 (not repeated below)
        print(html.decode())
        # Print the status code
        # print(response.getcode())
        print(response.status)
        # Get the response headers
        print(response.getheaders())
        # Get the Server response header
        print(response.getheader('Server'))
        # Get the response reason phrase
        print(response.reason)
    
    • POST request method
        # Import the urllib libraries
        import urllib.parse
        import urllib.request
        
        # Send a request to the given url and read the response
        post_url = 'https://fanyi.baidu.com/sug'
        # Request parameters
        form_data = {
            'kw': 'honey'
        }
        # Encode the parameters into a bytes request body
        form_data = urllib.parse.urlencode(form_data).encode()
        
        response = urllib.request.urlopen(url=post_url, data=form_data)
        # Print the type of the file-like response object
        print(type(response))
        
        # The file-like object supports the usual file operations, e.g. read() returns the whole body
        html = response.read()
        # Print the response body (bytes)
        print(html)
        # Print the response body decoded as utf-8
        print(html.decode())
        # Print the status code
        print(response.status)
        # print(response.getcode())
        # Get the response headers
        print(response.getheaders())
        # Get the Server response header
        print(response.getheader('Server'))
        # Get the response reason phrase
        print(response.reason)
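
    • timeout and context parameters (a minimal sketch, since the two examples above do not use them; skipping certificate verification is shown for illustration only)
        # Sketch: using the timeout and context parameters of urlopen()
        import ssl
        import urllib.error
        import urllib.request
        
        # Build an SSL context that skips CA certificate verification (illustration only)
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        
        try:
            # If the request takes longer than 5 seconds, a URLError is raised
            response = urllib.request.urlopen('https://www.baidu.com', timeout=5, context=context)
            print(response.status)
        except urllib.error.URLError as e:
            # an exceeded timeout surfaces here, with e.reason being a socket timeout
            print('request failed:', e.reason)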
    

urlretrieve()

  1. Parameter Description

    • url: the URL to download;
    • filename: the local path to save to (if not specified, urllib generates a temporary file to hold the data);
    • reporthook: a callback function, triggered when the connection to the server is established and again each time a data block transfer completes; we can use it to display the current download progress;
    • data: the request body to POST to the server. The method returns a two-element tuple (filename, headers), where filename is the local save path and headers is the server's response headers (a sketch capturing this return value follows the example below);
  2. Specific usage

        # Import the required libraries
        import os
        import urllib.request
        
        
        # Define the callback function
        def call_back(a, b, c):
            """
            Download progress callback
            :param a: number of data blocks downloaded so far
            :param b: size of each data block
            :param c: total size of the remote file
            :return:
            """
            per = 100.0 * a * b / c
            if per > 100:
                per = 100
            print('%.2f%%' % per)
        
        
        # Define the download URL
        url = 'http://www.baidu.com'
        # Build the local save path
        path = os.path.abspath('.')
        file_path = os.path.join(path, 'baidu.html')
        # Perform the download
        urllib.request.urlretrieve(url, file_path, call_back)
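
    As noted in the parameter description, urlretrieve() returns a (filename, headers) tuple; a minimal sketch of capturing it, reusing the url and file_path from the example above:

        # Sketch: capturing the (filename, headers) return value of urlretrieve()
        filename, headers = urllib.request.urlretrieve(url, file_path)
        # Local path the data was saved to
        print(filename)
        # Response headers returned by the server
        print(headers)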
    
    

urllib.parse Introduction

urlencode()

  1. Parameter Description

    • query: the query parameters to encode; a dict or a sequence of two-element tuples;
    • encoding: the character encoding to use;
  2. Specific usage

        # Import the required library
        import urllib.parse
        # Parameter data
        data = {
            'name': '张三',
            'age': 26
        }
        # Encode into a query string
        ret = urllib.parse.urlencode(data)
        print(ret)
    

parse_qs()

  1. Parameter Description

    • qs: the URL-encoded query string to parse;
    • encoding: the character encoding to use;
  2. Specific usage

        # Import the required library
        import urllib.parse
        # Parameter data
        data = {
            'name': '张三',
            'age': 26
        }
        # Encode
        ret1 = urllib.parse.urlencode(data)
        print(ret1)
        # Decode back into a dict mapping names to lists of values
        ret2 = urllib.parse.parse_qs(ret1)
        print(ret2)
    

urlparse()

  1. Parameter Description

    • url: the URL string to parse;
  2. Specific usage

        # Import the required library
        import urllib.parse
        # Declare the url
        url = "https://www.baidu.com/s?wd=urlparse&rsv_spt=1&rsv_iqid=0x921f00fe005646ef&issp=1&f=8"
        # Parse the url
        ret = urllib.parse.urlparse(url)
        print(ret)
        print('scheme:', ret.scheme)  # protocol
        print('netloc:', ret.netloc)  # network location (host)
        print('path:', ret.path)  # hierarchical path
        print('params:', ret.params)  # parameters of the last path segment
        print('fragment:', ret.fragment)  # fragment
        print('query:', ret.query)  # query string
        
        # urlunparse() is the inverse function of urlparse()
        # it reassembles a url from urlparse-style components, so the return value of urlparse can be passed in directly
        ret1 = urllib.parse.urlunparse(ret)
        print(ret1)
    

urlsplit()

  1. Parameter Description

    • url: the URL string to split;
  2. Specific usage

        # Import the required library
        import urllib.parse
        # Declare the url
        url = "https://www.baidu.com/s?wd=urlparse&rsv_spt=1&rsv_iqid=0x921f00fe005646ef&issp=1&f=8"
        # Split the url
        ret = urllib.parse.urlsplit(url)
        print(ret)
        print('scheme:', ret.scheme)  # protocol
        print('netloc:', ret.netloc)  # network location (host)
        print('path:', ret.path)  # hierarchical path
        print('fragment:', ret.fragment)  # fragment
        print('query:', ret.query)  # query string
        
        # urlunsplit() is the inverse function of urlsplit()
        # it reassembles a url from urlsplit-style components, so the return value of urlsplit can be passed in directly
        ret1 = urllib.parse.urlunsplit(ret)
        print(ret1)
        
        # The difference from urlparse() is that the result of urlsplit() has no params component
    

urljoin()

  1. Parameter Description

    • base: the base URL;
    • url: the (possibly relative) URL to resolve against the base;
  2. Specific usage

        # Import the required library
        import urllib.parse
        
        # Declare the base url
        url = "https://www.baidu.com/"
        # Parameter data
        data = {
            'name': '张三',
            'age': 26
        }
        # Encode the parameters
        data = urllib.parse.urlencode(data)
        # Join the encoded string onto the base url
        ret = urllib.parse.urljoin(url, data)
        print(ret)
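
    Note that urljoin() resolves a relative reference against a base URL rather than appending query parameters; a minimal sketch of its more typical use, with made-up paths for illustration:

        # Sketch: typical urljoin() behavior with relative references
        import urllib.parse
        
        base = 'https://www.baidu.com/a/b/c.html'
        # A relative path is resolved against the base's directory
        print(urllib.parse.urljoin(base, 'd.html'))  # https://www.baidu.com/a/b/d.html
        # An absolute path replaces the base's whole path
        print(urllib.parse.urljoin(base, '/index.html'))  # https://www.baidu.com/index.html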
    

urllib.error Introduction

When making requests in a crawler, errors are inevitable, for example the server may be unreachable or access may be forbidden. The error module divides these into two categories, URLError and HTTPError:

  1. URLError

    • No network connection
    • Failure to connect to the server
    • The specified server cannot be found
  2. HTTPError

    • A subclass of URLError
  3. Differences and connections between the two

    • The error information wrapped by URLError is generally caused by the network, including invalid urls
    • The error information wrapped by HTTPError is generally an error status code returned by the server
    • URLError is a subclass of OSError, and HTTPError is a subclass of URLError
    • [Note] when catching both at the same time, the subclass must be placed above the parent class
  4. Specific usage

        # Import the required libraries
        import urllib.error
        import urllib.request
        
        # A url whose access will fail
        url = 'https://www.mz.com/156427/100'
        # Catch the exceptions
        try:
            ret = urllib.request.urlopen(url)
            print(ret)
        except urllib.error.HTTPError as e:
            print(e.getcode())
        except urllib.error.URLError as e:
            print(e)
    