2. The urllib library: network requests
urllib is a package that bundles several modules for working with URLs:
- urllib.request: opening and reading URLs
- urllib.error: the exceptions raised by urllib.request
- urllib.parse: parsing URLs
- urllib.robotparser: parsing robots.txt files
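Most of this article focuses on urllib.request, but as a quick taste of urllib.parse, here is a minimal sketch of its two most common helpers (the URL is just an illustrative value):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse('http://tieba.baidu.com/f?kw=python')
print(parts.scheme)  # http
print(parts.netloc)  # tieba.baidu.com
print(parts.query)   # kw=python

# Encode a dict as an application/x-www-form-urlencoded query string
query = urlencode({'name': 'test', 'page': 1})
print(query)         # name=test&page=1
```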
Initiating a request
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# Parameter description
The first parameter is either a URL string or a Request object.
data is optional content of type bytes; it can be produced with the bytes() function. When data is supplied, the request is sent as a POST form submission, using the standard application/x-www-form-urlencoded format.
timeout sets the request timeout, in seconds.
cafile and capath specify a CA certificate and the path to CA certificates; they are needed for HTTPS.
context must be of type ssl.SSLContext and is used to specify SSL settings.
cadefault is deprecated and can be ignored.
The method can also be given a urllib.request.Request object directly.
The function returns an http.client.HTTPResponse object.
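As a small illustration of the context parameter, the sketch below builds a default SSLContext with ssl.create_default_context(); the actual urlopen call is commented out because it needs network access:

```python
import ssl
import urllib.request

# Build a default SSL context; certificate verification is on by default
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True

# Pass it to urlopen for an HTTPS request (requires network access):
# response = urllib.request.urlopen('https://www.python.org', context=context)
```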
A simple web crawl
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()  # get the page source
print(html.decode('utf-8'))  # decode it as UTF-8
Setting a request timeout
When network problems keep a response from arriving, a timeout lets us abandon the request or retry it.
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
Catching a timeout
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Submitting parameters with data
Use the data parameter to submit form data with the POST method:
import urllib.parse
import urllib.request
url = "http://127.0.0.1:8000/book"
params = {
    'name': '浮生六记',
    'author': '沈复'
}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))
params is a dictionary, so it must first be encoded into a query string with urllib.parse.urlencode(), then converted into a byte stream with bytes(). Finally, urlopen() sends the request; this simulates submitting form data with the POST method.
Using Request
urlopen() only provides a simple way to initiate a request. To add request headers or specify the request method, we need urllib.request.Request.
The constructor of Request:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
# Parameter description
url is the request URL and the only required parameter; all the others are optional.
data is used in the same way as the data parameter of urlopen().
headers specifies the headers of the HTTP request, as a dictionary. Besides passing them to the Request constructor, headers can also be added by calling the add_header() method of a Request instance.
origin_req_host is the host name or IP address of the requesting party.
unverifiable indicates whether the request is unverifiable; the default is False. It means the user lacks sufficient permission to choose whether to accept the result of the request. For example, when requesting an image embedded in an HTML document without permission to fetch images automatically, unverifiable should be set to True.
method is the HTTP method of the request, such as GET, POST, DELETE, or PUT.
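The method parameter and the add_header() alternative mentioned above can be illustrated without sending anything. Note that Request normalizes header names by capitalizing them; the URL below is only a placeholder:

```python
import urllib.request

# Build a Request but do not send it
req = urllib.request.Request('http://httpbin.org/post', method='POST')
req.add_header('User-Agent', 'Mozilla/5.0')  # same effect as the headers dict

print(req.get_method())   # POST
print(req.headers)        # {'User-agent': 'Mozilla/5.0'}
```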
Basic use of Request
Spoofing the User-Agent
import urllib.request
url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Advanced usage of Request
To add a proxy to a request or to handle Cookies, we need Handler and OpenerDirector.
1) Handler
A Handler is, as the name suggests, a processor: handlers can deal with all aspects of a request (HTTP, HTTPS, FTP, and so on). The concrete embodiment is the class urllib.request.BaseHandler, the base class of all handlers, which provides the most basic handler methods, such as default_open() and protocol_request().
Many classes inherit from BaseHandler; here are some of the more common ones:
ProxyHandler: sets a proxy for the request.
HTTPCookieProcessor: handles Cookies in HTTP requests.
HTTPDefaultErrorHandler: handles HTTP response errors.
HTTPRedirectHandler: handles HTTP redirects.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler: used for login authentication, usually in combination with HTTPPasswordMgr.
2) OpenerDirector
An OpenerDirector can simply be called an Opener. The urlopen() we have been using is in fact an Opener that urllib provides for us. So how are Opener and Handler related? An opener object is created with the build_opener(handler) method. To make a custom opener the default one, use the install_opener(opener) method. Note that install_opener sets a global OpenerDirector object.
Using a proxy
import urllib.request
url = "http://tieba.baidu.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Login authentication
import urllib.request
url = "http://tieba.baidu.com/"
user = 'user'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
Setting Cookies
If the requested page requires authentication every time, we can use Cookies to log in automatically and avoid repeating the login step. To do this, instantiate a CookieJar object with http.cookiejar.CookieJar(), build a handler from it with urllib.request.HTTPCookieProcessor, and finally call the opener's open() method.
import http.cookiejar
import urllib.request
url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
f = open(fileName, 'a')
for item in cookie:
    f.write(item.name + " = " + item.value + '\n')
f.close()
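The loop above writes cookies in an ad-hoc format. As an alternative sketch, http.cookiejar.MozillaCookieJar (a CookieJar subclass) can persist cookies in the standard cookies.txt format and reload them on a later run; here the jar stays empty because no site is visited:

```python
import http.cookiejar

# Save: in real use, pass this jar to HTTPCookieProcessor, open a URL,
# then call save()
jar = http.cookiejar.MozillaCookieJar('cookie.txt')
jar.save(ignore_discard=True, ignore_expires=True)  # writes cookie.txt

# Load the saved cookies back
jar2 = http.cookiejar.MozillaCookieJar()
jar2.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(len(jar2))  # 0, since no site was visited
```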
HTTPResponse
The result of urllib.request.urlopen() or opener.open(url) is an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel, and closed, and methods such as read(), readinto(), getheader(name), getheaders(), and fileno().
Error handling
Exception handling mainly uses two classes: urllib.error.URLError and urllib.error.HTTPError.
1) URLError
URLError is the base exception class of urllib.error; it can catch exceptions raised by urllib.request. It has a single attribute, reason, i.e. the reason the error was returned.
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
2) HTTPError
HTTPError is a subclass of URLError, specialized for handling errors in HTTP and HTTPS requests. It has three attributes:
1) code: the HTTP status code returned by the request.
2) reason: as in the parent class, the reason for the error.
3) headers: the response headers of the HTTP request.
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # e.code is an int and e.headers is a message object, so convert to str
    print('code: ' + str(e.code) + '\n')
    print('reason: ' + str(e.reason) + '\n')
    print('headers: ' + str(e.headers) + '\n')
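Because HTTPError is a subclass of URLError, the two are often combined, catching the more specific HTTPError first. A sketch under my own naming (the fetch helper and the timeout value are illustrative choices):

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the page body, or a short error description."""
    try:
        response = urllib.request.urlopen(url, timeout=5)
        return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return 'HTTP error %d: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        # DNS failure, refused connection, timeout, ...
        return 'URL error: %s' % e.reason

# A connection to an unused local port is refused, so URLError is raised
print(fetch('http://localhost:1/'))
```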
Inspecting the response
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
# Result: <class 'http.client.HTTPResponse'>
response.status gives the status code, response.getheaders() and response.getheader("server") give the header information, and response.read() gives the body of the response.