urllib is the Python standard library's package for making web requests. It contains four modules:
urllib.request
urllib.error
urllib.parse
urllib.robotparser
1 Initiating a request
To simulate a browser sending an HTTP request, we use the urllib.request module. urllib.request not only initiates the request, but also retrieves the result the request returns. A request can be made with the urlopen() method alone. Let's look at the API of urlopen():
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None,context=None)
- data: optional. It must be of type bytes, so other content is first converted into a byte stream with the bytes() function. When data is supplied, the request is sent with the POST method, as if submitting a form, using the standard format application/x-www-form-urlencoded.
- timeout: sets the request timeout, in seconds.
- cafile and capath: the CA certificate file and the directory of CA certificates. They are needed when requesting HTTPS links.
- context: must be of type ssl.SSLContext; it is used to specify SSL settings.

The method can also be passed a urllib.request.Request object instead of a URL string. The function returns an http.client.HTTPResponse object.
Use urllib.request.urlopen() to request Baidu Tieba and get its page source.
```python
import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()  # get the page source
print(html.decode('utf-8'))  # decode it as utf-8
```
1.2 Setting a request timeout
Some requests may never get a response because of network problems, so we can set a timeout manually. When the request times out, we can take further action, such as retrying the request or giving up on it.
```python
import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
```
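To actually act on a timeout, wrap the call in try/except. A minimal sketch, assuming we want to fall back to None on failure (the non-routable address 10.255.255.1 is only a stand-in to force the request to fail):

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=0.5):
    """Return the page source, or None if the request fails or times out."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        return response.read().decode('utf-8')
    except (urllib.error.URLError, socket.timeout):
        # URLError wraps most failures (including timeouts on connect);
        # socket.timeout can also surface directly from read()
        return None

print(fetch("http://10.255.255.1"))  # a request that cannot succeed
```

Here the caller can inspect the None result and decide whether to retry or discard the request.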
1.3 Submitting data with the data parameter
When a request needs to carry some data, we use the data parameter.
```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com/"
params = {
    'name': 'TTT',
    'author': 'Miracle'
}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))
```
params is a dictionary, but data must be a byte stream, so we first use urllib.parse.urlencode() to convert the dictionary into a string, and then bytes() to convert that string into bytes. Finally urlopen() initiates the request; because data is present, the request simulates submitting a form with the POST method.

Note: when a URL contains Chinese characters or characters such as "/", they need to be percent-encoded. urlencode() takes a dictionary and converts its key-value pairs into the encoded query-string format we want.
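The encoding helpers can be tried on their own, without sending any request. A small sketch of urlencode() for dictionaries and quote() for single values:

```python
from urllib.parse import quote, urlencode

# urlencode turns a dict into a percent-encoded query string
params = {'wd': '爬虫', 'page': 1}
print(urlencode(params))  # wd=%E7%88%AC%E8%99%AB&page=1

# quote percent-encodes a single value, e.g. a Chinese path segment;
# "/" is kept by default (safe='/'); pass safe='' to encode it too
print(quote('百度贴吧'))
print(quote('a/b'))           # a/b
print(quote('a/b', safe=''))  # a%2Fb
```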
1.4 Request
We know that urlopen() alone can initiate a simple request, but a few simple parameters are not enough to build a complete one. If the request needs headers, a specific request method, or other information, we can use the more powerful Request class to construct it.

As usual, let's look at the Request constructor:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
unverifiable=False, method=None)
- data: same usage as the data parameter of urlopen().
- headers: the headers of the HTTP request to be sent, as a dictionary. Besides passing them to the Request constructor, headers can also be added by calling add_header() on a Request instance.
- origin_req_host: the host name or IP address of the requesting party.
- unverifiable: indicates whether the request is unverifiable; the default is False. It means the user does not have sufficient rights to accept the result of the request. For example, if we request an image embedded in an HTML document but have no permission to fetch images automatically, we set unverifiable to True.
- method: the HTTP method used to initiate the request, such as GET, POST, DELETE or PUT.
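These parameters can be sketched without sending anything, since building a Request only prepares the request object (httpbin.org here is just a placeholder URL):

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({'name': 'TTT'}), encoding='utf8')
headers = {'User-Agent': 'Mozilla/5.0'}

# Nothing is sent yet; urlopen(req) would actually perform the request
req = urllib.request.Request(url, data=data, headers=headers, method='POST')

# headers can also be added after construction
req.add_header('Referer', 'http://httpbin.org/')

print(req.get_method())              # POST
print(req.get_header('User-agent'))  # Mozilla/5.0
```

Note that Request normalizes header names with capitalize(), which is why the lookup key is 'User-agent'.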
1.4.1 Basic usage of Request
Use Request to disguise an HTTP request as coming from a browser. If you do not set User-Agent in the headers, the default User-Agent is Python-urllib/3.5. Some sites may intercept such requests, so it is necessary to disguise the request as a browser. Here I use the Chrome browser's User-Agent.
```python
# Change the User-Agent to Chrome's UA to disguise the request
import urllib.request

url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
1.4.2 Advanced usage of Request
If we need to add a proxy to the request or handle Cookies, we need Handler and OpenerDirector.
1) Handler

Handler means a processor. Handlers can deal with the various aspects of a request (HTTP, HTTPS, FTP, and so on). Their concrete realization is the class urllib.request.BaseHandler, the base class of all handlers, which provides the most basic handler methods, such as default_open() and protocol_request().

Many classes inherit from BaseHandler; I'll list some of the more common ones:
- ProxyHandler: sets a proxy for the request.
- HTTPCookieProcessor: handles the request's HTTP Cookies.
- HTTPDefaultErrorHandler: handles HTTP response errors.
- HTTPRedirectHandler: handles HTTP redirects.
- HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
- HTTPBasicAuthHandler: used for login authentication, generally in combination with HTTPPasswordMgr.
2) OpenerDirector

An OpenerDirector can simply be called an Opener. The urlopen() method we have been using is in fact an Opener that urllib provides for us. So what is the relationship between Opener and Handler? An opener object is created by the build_opener(handler) method. To make a custom opener the default one, we use the method install_opener(opener). It is worth noting that install_opener() replaces the global OpenerDirector object, so later calls to urlopen() go through the installed opener.
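The opener/handler relationship can be sketched without sending any request; here the handlers are only assembled:

```python
import http.cookiejar
import urllib.request

# build_opener chains one or more handlers into an OpenerDirector
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
proxy_handler = urllib.request.ProxyHandler({})  # empty dict: no proxy set
opener = urllib.request.build_opener(cookie_handler, proxy_handler)

print(isinstance(opener, urllib.request.OpenerDirector))  # True

# After install_opener, plain urlopen() calls go through this opener
urllib.request.install_opener(opener)
```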
1.5 Using a proxy
Now that we know about openers and handlers, let's study them in depth through examples. The first example sets up a proxy for HTTP requests.

Some sites limit browsing frequency. If we request a site too often, it will ban our IP and block our visits, so we use a proxy to break this restriction.
```python
import urllib.request

url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 Chrome/56.0.2924.87 Safari/537.36'
}
proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
1.6 Authentication login

Some sites require logging in with an account and password before you can continue browsing. For such sites we need authentication login. First use HTTPPasswordMgrWithDefaultRealm() to instantiate a password-management object; then use its add_password() method to add the account and password; then use HTTPBasicAuthHandler() to get a handler; then build_opener() to get an opener object; finally, use the opener's open() method to initiate the request.

The second example logs in to Cnblogs with an account and password, as follows:
```python
import urllib.request

url = "http://cnblogs.com/xtznb/"
user = '奇迹'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
```
1.7 Setting Cookies

If the requested page requires authentication every time, we can use Cookies to log in automatically, avoiding repeated login operations. To work with Cookies, first instantiate a cookie object with http.cookiejar.CookieJar(); then construct a handler object with urllib.request.HTTPCookieProcessor; finally use the opener's open() method.

The third example requests Baidu Tieba, gets the Cookies, and saves them to a file, as follows:
```python
import http.cookiejar
import urllib.request

url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
f = open(fileName, 'a')
for item in cookie:
    f.write(item.name + " = " + item.value + '\n')
f.close()
```
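Writing cookies by hand works, but http.cookiejar can also save and reload them in the standard Netscape format via MozillaCookieJar. A minimal sketch (the jar here stays empty because no request is made; normally you would call opener.open(url) first, and 'cookies.txt' is just an example filename):

```python
import http.cookiejar

# save() writes the Netscape cookie-file header plus any stored cookies
cookie = http.cookiejar.MozillaCookieJar('cookies.txt')
cookie.save(ignore_discard=True, ignore_expires=True)

# load() restores cookies from the file into a fresh jar
restored = http.cookiejar.MozillaCookieJar()
restored.load('cookies.txt', ignore_discard=True, ignore_expires=True)
print(len(restored))  # 0 here, since no request was made
```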
1.8 HTTPResponse
As the above examples show, the result of urllib.request.urlopen() or opener.open(url) is an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel and closed, and methods such as read(), readinto(), getheader(name), getheaders() and fileno().
2 Handling exceptions
Initiating a request will inevitably produce various exceptions. We need to handle them, which makes the program more robust and user-friendly.
Exception handling mainly uses two classes: urllib.error.URLError and urllib.error.HTTPError.
- URLError

URLError is the base class of the urllib.error exception classes; it can catch the exceptions produced by urllib.request. It has one attribute, reason, which returns the reason for the error.
Sample code for catching a URLError:
```python
import urllib.error
import urllib.request

url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
```
- HTTPError

HTTPError is a subclass of URLError, dedicated to errors of HTTP and HTTPS requests. It has three attributes.
1) code: the status code returned by the HTTP request.
2) reason: same usage as in the parent class; the reason the error was returned.
3) headers: the HTTP response headers returned by the server.
Sample code for an HTTP exception, printing the error status code, the error reason, and the server's response headers:
```python
import urllib.error
import urllib.request

url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # e.code is an int and e.headers is a message object,
    # so convert them to str before concatenating
    print('code: ' + str(e.code) + '\n')
    print('reason: ' + str(e.reason) + '\n')
    print('headers: ' + str(e.headers) + '\n')
```