2. The urllib library: network requests
urllib is a package that bundles several modules for working with URLs:
- urllib.request: opening and reading URLs
- urllib.error: the exceptions raised by urllib.request
- urllib.parse: parsing URLs
- urllib.robotparser: parsing robots.txt files
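Most of this article focuses on urllib.request, but as a quick taste of urllib.parse, here is a minimal sketch of its two most common helpers (the URL is just an illustrative value):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse('http://tieba.baidu.com/f?kw=python')
print(parts.scheme)  # http
print(parts.netloc)  # tieba.baidu.com
print(parts.query)   # kw=python

# Encode a dict as an application/x-www-form-urlencoded query string
query = urlencode({'name': 'test', 'page': 1})
print(query)         # name=test&page=1
```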
Initiating a request
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# Parameter description
The first parameter is either a URL string or a Request object.
data is optional content of type bytes; it can be produced with the bytes() function. When data is supplied, the request is sent as a POST form submission, using the standard application/x-www-form-urlencoded format.
timeout sets the request timeout, in seconds.
cafile and capath specify a CA certificate and the path to CA certificates; they are needed for HTTPS.
context must be of type ssl.SSLContext and is used to specify SSL settings.
cadefault is deprecated and can be ignored.
The method can also be given a urllib.request.Request object directly.
The function returns an http.client.HTTPResponse object.
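As a small illustration of the context parameter, the sketch below builds a default SSLContext with ssl.create_default_context(); the actual urlopen call is commented out because it needs network access:

```python
import ssl
import urllib.request

# Build a default SSL context; certificate verification is on by default
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True

# Pass it to urlopen for an HTTPS request (requires network access):
# response = urllib.request.urlopen('https://www.python.org', context=context)
```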
A simple web crawl
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()  # get the page source
print(html.decode('utf-8'))  # decode it as UTF-8
Setting a request timeout
When network problems keep a response from arriving, a timeout lets us abandon the request or retry it.
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
Catching a timeout
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Submitting parameters with data
Use the data parameter to submit form data with the POST method:
import urllib.parse
import urllib.request
url = "http://127.0.0.1:8000/book"
params = {
    'name': '浮生六记',
    'author': '沈复'
}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))
params is a dictionary, so it must first be encoded into a query string with urllib.parse.urlencode(), then converted into a byte stream with bytes(). Finally, urlopen() sends the request; this simulates submitting form data with the POST method.
Using Request
urlopen() only provides a simple way to initiate a request. To add request headers or specify the request method, we need urllib.request.Request.
The constructor of Request:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
# Parameter description
url is the request URL and the only required parameter; all the others are optional.
data is used in the same way as the data parameter of urlopen().
headers specifies the headers of the HTTP request, as a dictionary. Besides passing them to the Request constructor, headers can also be added by calling the add_header() method of a Request instance.
origin_req_host is the host name or IP address of the requesting party.
unverifiable indicates whether the request is unverifiable; the default is False. It means the user lacks sufficient permission to choose whether to accept the result of the request. For example, when requesting an image embedded in an HTML document without permission to fetch images automatically, unverifiable should be set to True.
method is the HTTP method of the request, such as GET, POST, DELETE, or PUT.
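The method parameter and the add_header() alternative mentioned above can be illustrated without sending anything. Note that Request normalizes header names by capitalizing them; the URL below is only a placeholder:

```python
import urllib.request

# Build a Request but do not send it
req = urllib.request.Request('http://httpbin.org/post', method='POST')
req.add_header('User-Agent', 'Mozilla/5.0')  # same effect as the headers dict

print(req.get_method())   # POST
print(req.headers)        # {'User-agent': 'Mozilla/5.0'}
```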
Basic use of Request
Spoofing the User-Agent
import urllib.request
url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Advanced usage of Request
To add a proxy to a request or to handle Cookies, we need Handler and OpenerDirector.
1) Handler
A Handler is, as the name suggests, a processor: handlers can deal with all aspects of a request (HTTP, HTTPS, FTP, and so on). The concrete embodiment is the class urllib.request.BaseHandler, the base class of all handlers, which provides the most basic handler methods, such as default_open() and protocol_request().
Many classes inherit from BaseHandler; here are some of the more common ones:
ProxyHandler: sets a proxy for the request.
HTTPCookieProcessor: handles Cookies in HTTP requests.
HTTPDefaultErrorHandler: handles HTTP response errors.
HTTPRedirectHandler: handles HTTP redirects.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler: used for login authentication, usually in combination with HTTPPasswordMgr.
2) OpenerDirector
An OpenerDirector can simply be called an Opener. The urlopen() we have been using is in fact an Opener that urllib provides for us. So how are Opener and Handler related? An opener object is created with the build_opener(handler) method. To make a custom opener the default one, use the install_opener(opener) method. Note that install_opener sets a global OpenerDirector object.
Using a proxy
import urllib.request
url = "http://tieba.baidu.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Login authentication
import urllib.request
url = "http://tieba.baidu.com/"
user = 'user'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
Setting Cookies
If the requested page requires authentication every time, we can use Cookies to log in automatically and avoid repeating the login step. To do this, instantiate a CookieJar object with http.cookiejar.CookieJar(), build a handler from it with urllib.request.HTTPCookieProcessor, and finally call the opener's open() method.
import http.cookiejar
import urllib.request
url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
f = open(fileName, 'a')
for item in cookie:
    f.write(item.name + " = " + item.value + '\n')
f.close()
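The loop above writes cookies in an ad-hoc format. As an alternative sketch, http.cookiejar.MozillaCookieJar (a CookieJar subclass) can persist cookies in the standard cookies.txt format and reload them on a later run; here the jar stays empty because no site is visited:

```python
import http.cookiejar

# Save: in real use, pass this jar to HTTPCookieProcessor, open a URL,
# then call save()
jar = http.cookiejar.MozillaCookieJar('cookie.txt')
jar.save(ignore_discard=True, ignore_expires=True)  # writes cookie.txt

# Load the saved cookies back
jar2 = http.cookiejar.MozillaCookieJar()
jar2.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(len(jar2))  # 0, since no site was visited
```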
HTTPResponse
The result of urllib.request.urlopen() or opener.open(url) is an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel, and closed, and methods such as read(), readinto(), getheader(name), getheaders(), and fileno().
Error handling
Exception handling mainly uses two classes: urllib.error.URLError and urllib.error.HTTPError.
1) URLError
URLError is the base exception class of urllib.error; it can catch exceptions raised by urllib.request. It has a single attribute, reason, i.e. the reason the error was returned.
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
2) HTTPError
HTTPError is a subclass of URLError, specialized for handling errors in HTTP and HTTPS requests. It has three attributes:
1) code: the HTTP status code returned by the request.
2) reason: as in the parent class, the reason for the error.
3) headers: the response headers of the HTTP request.
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # e.code is an int and e.headers is a message object, so convert to str
    print('code: ' + str(e.code) + '\n')
    print('reason: ' + str(e.reason) + '\n')
    print('headers: ' + str(e.headers) + '\n')
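Because HTTPError is a subclass of URLError, the two are often combined, catching the more specific HTTPError first. A sketch under my own naming (the fetch helper and the timeout value are illustrative choices):

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the page body, or a short error description."""
    try:
        response = urllib.request.urlopen(url, timeout=5)
        return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return 'HTTP error %d: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        # DNS failure, refused connection, timeout, ...
        return 'URL error: %s' % e.reason

# A connection to an unused local port is refused, so URLError is raised
print(fetch('http://localhost:1/'))
```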
Inspecting the response
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
# Result: <class 'http.client.HTTPResponse'>
response.status gives the status code, response.getheaders() and response.getheader("server") give the header information, and response.read() gives the body of the response.