Crawler Basics || 1.3 Handler Helpers (Authentication, Proxies, Cookies)

We now know how to construct a request, but how do we handle more advanced operations, such as processing Cookies or setting up proxies?

For that we need a more powerful tool: the Handler.

In short, Handlers can be understood as processors of various kinds: some deal with login authentication, some with Cookies, some with proxy settings. With them, we can do almost anything in an HTTP request.

First, let's look at the BaseHandler class in the urllib.request module. It is the parent of all other Handler classes and provides the most basic methods, such as default_open() and protocol_request().
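To make this concrete, here is a minimal sketch (not from the original text) of a custom Handler: the Opener calls each Handler's protocol_request() hook before sending, so a subclass of BaseHandler that defines http_request() can inspect or modify every outgoing request. The LoggingHandler name is made up for illustration.

import urllib.request

class LoggingHandler(urllib.request.BaseHandler):
    # called by the Opener before each HTTP request is sent;
    # returning the (possibly modified) Request passes it onward
    def http_request(self, req):
        print('Requesting:', req.full_url)
        return req

opener = urllib.request.build_opener(LoggingHandler())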

Various Handler subclasses inherit from BaseHandler. For example:

HTTPDefaultErrorHandler : handles HTTP response errors; an error is raised as an HTTPError exception (see the sketch after this list).
HTTPRedirectHandler : handles redirects.
HTTPCookieProcessor : handles Cookies.
ProxyHandler : sets a proxy; the default proxy is empty.
HTTPPasswordMgr : manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler : manages authentication; if a link requires authentication when opened, this handler can solve the authentication problem.
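For example, here is a minimal sketch of the first item in action (the URL is hypothetical): because HTTPDefaultErrorHandler is part of the default Handler chain, a non-2xx status is raised as an HTTPError with no extra configuration.

from urllib.request import build_opener
from urllib.error import HTTPError

opener = build_opener()  # default Handlers, including HTTPDefaultErrorHandler
try:
    opener.open('http://localhost:5000/missing')  # hypothetical URL returning 404
except HTTPError as e:
    print(e.code, e.reason)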

There are other Handler classes as well; they are not listed here, and the details can be found in the official documentation: https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler.

Another important class is OpenerDirector, which we can simply call an Opener.

The urlopen() method we used before is in fact an Opener that urllib provides for us.

So why introduce the Opener at all? Because we need more advanced features. Request and urlopen() are the library's well-packaged wrappers around the most common request methods, and we can use them to complete basic requests. But now that we want more advanced features, we need to go one layer deeper in the configuration and use lower-level instances to complete the operation. That is where the Opener comes in.

An Opener's open() method returns exactly the same type as urlopen(). So what is its relationship to the Handler? Simply put, we use Handlers to build Openers.
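As a minimal sketch of this relationship: calling build_opener() with no arguments yields an Opener backed by the default Handlers, and its open() method behaves exactly like urlopen().

import urllib.request

# build_opener() with no arguments wires up the default Handler chain;
# opener.open() returns the same response type as urlopen()
opener = urllib.request.build_opener()
response = opener.open('https://www.python.org')
print(type(response))  # <class 'http.client.HTTPResponse'>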

1. Authentication

Some sites pop up a prompt box when opened, asking you to enter a user name and password; the page can be viewed only after authentication succeeds (such sites are rare these days):

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# register the credentials for this URL, then build an authenticating Opener
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Code explanation:

1. Here we instantiate an HTTPBasicAuthHandler object whose parameter is an HTTPPasswordMgrWithDefaultRealm object; we use its add_password() method to add the user name and password, thus establishing a Handler that processes authentication.
2. Next, we use this Handler and the build_opener() method to construct an Opener. When this Opener sends a request, it is as if authentication has already succeeded.
3. Finally, we use the Opener's open() method to open the link, and authentication completes. What we obtain here is the source code of the page after successful verification.
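A side note not covered in the original: if you want plain urlopen() calls to go through this authenticated Opener as well, urllib.request provides install_opener(), which registers an Opener as the global default.

import urllib.request

# after this call, urlopen() routes through our authenticated Opener
urllib.request.install_opener(opener)
html = urllib.request.urlopen(url).read().decode('utf-8')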

2. Proxies

When writing crawlers, you will inevitably need to use proxies. To add a proxy, you can use a Handler:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# route both HTTP and HTTPS traffic through a local proxy on port 9743
proxy_handler = ProxyHandler({'http': 'http://127.0.0.1:9743',
                              'https': 'https://127.0.0.1:9743'})
opener = build_opener(proxy_handler)

try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

About the two proxies in proxy_handler: if you don't have your own proxy pool, you can find free proxies on the Xici proxy site (of course, many of them won't work; they are for learning purposes only). Here a local proxy is simulated, running on port 9743. We use ProxyHandler, whose argument is a dictionary: the keys are protocol types (such as http or https) and the values are proxy links; you can add multiple proxies. Then we use this Handler and the build_opener() method to construct an Opener, and send the request through it.
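A related sketch, with a made-up address and credentials: if the proxy itself requires authentication, the user:password@host:port form can be written directly into the proxy URL that ProxyHandler receives.

from urllib.request import ProxyHandler, build_opener

# hypothetical credentials and address; ProxyHandler parses the
# user:password@ part and sends it as proxy authorization
proxy_handler = ProxyHandler({'http': 'http://user:password@127.0.0.1:9743',
                              'https': 'http://user:password@127.0.0.1:9743'})
opener = build_opener(proxy_handler)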

3. Cookies

Carrying Cookies is a very common technique in crawlers, so let's have a Handler carry the Cookies directly. First we must declare a CookieJar object. Then we use HTTPCookieProcessor to build a Handler, construct an Opener with build_opener(), and finally call the open() method.

import http.cookiejar
import urllib.request

# collect the Cookies that the site sets into a CookieJar
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for item in cookie:
    print(item.name + '=' + item.value)

Now that we've seen the Cookies printed out, let's save them to a text file.

filename = 'cookies.txt'
# MozillaCookieJar can save Cookies in the Mozilla/Netscape file format
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here, CookieJar has been replaced with MozillaCookieJar, which is used when generating the file. It is a subclass of CookieJar that handles events related to Cookies and files, such as reading and saving Cookies, and it can save Cookies in the Mozilla-browser Cookies format.
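For symmetry, a minimal sketch (assuming the cookies.txt generated above): MozillaCookieJar can load the file it saved back in, just as the LWPCookieJar example below does.

import http.cookiejar

cookie = http.cookiejar.MozillaCookieJar()
# read the Mozilla-format Cookies back from the local file
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
for item in cookie:
    print(item.name + '=' + item.value)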

In addition, LWPCookieJar can also read and save Cookies, but in a format different from MozillaCookieJar's: it saves Cookies in the libwww-perl (LWP) file format. To save Cookies in LWP format, change the declaration:

filename = 'cookies.txt' 
cookie = http.cookiejar.LWPCookieJar(filename)  # the declaration changes here
handler = urllib.request.HTTPCookieProcessor(cookie) 
opener = urllib.request.build_opener(handler) 
response = opener.open('http://www.baidu.com') 
cookie.save(ignore_discard=True, ignore_expires=True) 

The two Cookies file formats are quite different. Now let's read Cookies back from a file, taking the LWPCookieJar format as an example:

cookie = http.cookiejar.LWPCookieJar()
# load the previously saved Cookies from the local file
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

As you can see, we call the load() method to read the local Cookies file and obtain the contents of the Cookies.

The precondition, of course, is that we first generate the Cookies in LWP format and save them to a file; after reading the Cookies in, we build the Handler and the Opener in the same way as before. If everything runs normally, the source code of the Baidu home page will be printed. With the methods above, we can implement the vast majority of request features. The two storage methods differ in format, but they are used the same way and achieve the same effect.
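Finally, one more sketch worth knowing (the proxy address is made up): build_opener() accepts any number of Handlers, so Cookies, proxies, and authentication can all be combined in a single Opener.

import http.cookiejar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# combine several Handlers into one Opener
cookie = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie),
                      ProxyHandler({'http': 'http://127.0.0.1:9743'}))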


Reference: Python 3 Web Crawler Development in Practice, by Cui Qingcai
