The Road of Data - Python Crawler - the urllib Library

Reference for learning: https://www.cnblogs.com/alex3714/articles/8359358.html

One, Introduction to the urllib library

The urllib library is Python's built-in HTTP request library. It contains four modules:

  • request: the HTTP request module; it can be used to simulate sending a request, much like typing a URL into a browser and hitting enter.

  • error: the exception handling module; if a request fails, we can catch the exception and then retry or take other action, so that the program does not terminate unexpectedly.

  • parse: the URL parsing module, a utility module that provides many URL processing methods, such as splitting, parsing, and joining URLs.

  • robotparser: the robots parsing module, mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not; it is rarely used in practice.

Two, The request module

The main function of the request module is to construct HTTP requests, simulating the process a browser goes through when it initiates a request.

The request module also handles authorization verification (authentication), redirection, browser cookies, and other content.

1. The urlopen method

 urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urlopen parameters:

  • url: the URL to request.

  • data: not passed for a GET request; for a POST request, the data to transmit, as bytes.

  • timeout: the timeout in seconds; if the request takes longer than this without a response, an exception is thrown. If this parameter is not specified, the global default time is used. Timeouts are supported for HTTP, HTTPS, and FTP requests.

  • context: must be of type ssl.SSLContext, used to specify SSL settings.

  • cafile: specifies a CA certificate.

  • capath: specifies the path to CA certificates; useful when requesting HTTPS links.

import urllib.parse
import urllib.request

# Encode the POST form data as bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
print(data)
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
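
The response returned by urlopen() also exposes the status code and headers, and the timeout parameter described above can be exercised directly. A minimal sketch: httpbin.org is just an example host, and the very short timeout is chosen deliberately to trigger the exception.

import socket
import urllib.request
import urllib.error

try:
    # A deliberately short timeout to demonstrate the timeout behaviour
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.status)        # HTTP status code, e.g. 200
    print(response.getheaders())  # list of (header, value) tuples
except urllib.error.URLError as e:
    # A timeout surfaces as a URLError whose reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')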

2. The Request class

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Request parameters:

  • url: the URL to request; this is the only required parameter, the others are optional.

  • data: the data to transmit; it must be of bytes type. If it is a dictionary, encode it first with urlencode() from the urllib.parse module.

  • headers: a dictionary of request headers. We can set them directly via the headers parameter when constructing the request, or add them by calling the request instance's add_header() method. The most common use is to modify the User-Agent in order to disguise the program as a browser.

  • origin_req_host: the host name or IP address of the requesting party.

  • unverifiable: indicates whether this request is unverifiable; the default is False. It means the user does not have sufficient permission to choose whether to receive the result of this request. For example, if we request an image embedded in an HTML document but have no permission to fetch the image automatically, then unverifiable is True.

  • method: a string indicating the HTTP method of the request, such as GET, POST, or PUT.


Many websites, in order to prevent crawler programs from crawling the site and bringing it down, require requests to carry certain header information before access is allowed.

There are two ways to add headers:

Method 1:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'alex'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Method 2:

from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

- Handler processors

The urllib.request module contains the BaseHandler class, which is the parent class of all other Handler classes.

Common Handlers:

  • HTTPDefaultErrorHandler: handles HTTP response errors; errors are raised as HTTPError exceptions.

  • HTTPRedirectHandler: handles redirections.

  • HTTPCookieProcessor: handles cookies.

  • ProxyHandler: used to set a proxy; the default proxy is empty.

  • HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.

  • HTTPBasicAuthHandler: manages authentication; if a link requires authentication when opened, this handler can be used to solve the authentication problem (a sketch follows below).
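
For example, basic authentication can be handled by combining HTTPPasswordMgrWithDefaultRealm with HTTPBasicAuthHandler. A minimal sketch; the URL, username, and password are placeholders for illustration only:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

# Placeholder credentials and URL; replace with a real protected resource
username = 'admin'
password = 'admin'
url = 'http://localhost:5000/'

# Register the credentials, then build an auth handler and an opener around it
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    print(result.read().decode('utf-8'))
except URLError as e:
    print(e.reason)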

3. Proxies: ProxyHandler

The parameter of ProxyHandler is a dictionary whose keys are protocol types (such as http or https) and whose values are proxy links; multiple proxies can be added.

Then use this Handler together with the build_opener() method to construct an Opener, and send the request through it.

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

4. Cookies

Cookies store the login information we commonly encounter; sometimes crawling a website requires carrying cookie information with the request. Here we use http.cookiejar, which is used to obtain and store cookies.

# Obtain cookies from a web page and print them line by line
import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

# Obtain cookies from a web page and save them to a file
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)  # cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

PS: MozillaCookieJar is a subclass of CookieJar. Both LWPCookieJar and MozillaCookieJar can read and save cookies, but in different formats.

Call the load() method to read the local cookies file and obtain the cookie contents:

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Three, The error module

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
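
The error module defines URLError and its subclass HTTPError, which additionally carries the status code, reason, and response headers. A sketch using the same example URL, catching the more specific HTTPError first and falling back to URLError:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError and exposes code, reason and headers
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # More general failures, e.g. DNS errors or refused connections
    print(e.reason)
else:
    print('Request Successful')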

Four, The parse module

  • urlparse(): splits a URL into its six components (scheme, netloc, path, params, query, fragment).

  • urlunparse(): the inverse of urlparse(); builds a URL from its six components.

  • urlsplit(): similar to urlparse(), but does not split out params (five components).

  • urlunsplit(): the inverse of urlsplit().

  • urljoin(): joins a base URL and a relative link into an absolute URL.

  • urlencode(): serializes a dictionary into a GET query string.

  • parse_qs(): parses a query string back into a dictionary.

  • parse_qsl(): parses a query string into a list of (key, value) tuples.

  • quote(): percent-encodes content (e.g. Chinese characters) so it can appear in a URL.

  • unquote(): decodes percent-encoded content.
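
A brief sketch exercising a few of these utilities (the URLs are arbitrary examples):

from urllib.parse import urlparse, urlencode, parse_qs, urljoin, quote, unquote

# Split a URL into its components
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.query)

# Serialize a dict into a query string and parse it back
params = urlencode({'name': 'alex', 'age': 22})
print(params)            # name=alex&age=22
print(parse_qs(params))  # {'name': ['alex'], 'age': ['22']}

# Resolve a relative link against a base URL
print(urljoin('http://www.baidu.com', 'FAQ.html'))

# Percent-encode and decode non-ASCII content
keyword = quote('爬虫')
print(keyword, unquote(keyword))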

Five, The robotparser module

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt,

generally placed in the root directory of a website, e.g. www.taobao.com/robots.txt

The robotparser module provides a class, RobotFileParser, which can determine, based on a site's robots.txt file, whether a given crawler has permission to crawl a given page.

urllib.robotparser.RobotFileParser(url='')

# set_url(): sets the link to the robots.txt file.
# read(): reads and analyzes the robots.txt file.
# parse(): parses the content of a robots.txt file, passed in as a list of lines.
# can_fetch(): takes two parameters, a User-agent and the URL to fetch, and returns whether that agent may fetch the URL.
# mtime(): returns the time robots.txt was last fetched and analyzed.
# modified(): sets the current time as the time robots.txt was last fetched and analyzed.
from urllib.robotparser import RobotFileParser
 
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
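
As noted in the method list above, parse() can be used instead of read() by feeding it the lines of robots.txt fetched separately. A small sketch of that variant, assuming the same site is reachable:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
# Fetch robots.txt ourselves and hand its lines to parse()
lines = urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n')
rp.parse(lines)
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))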
