The Python urllib Library in Detail

urllib is a Python standard-library package for making web requests.

The package contains four modules, namely:

urllib.request

urllib.error

urllib.parse

urllib.robotparser

1 Initiating a request

To simulate a browser initiating an HTTP request, we need the urllib.request module. urllib.request not only initiates requests, it also fetches the results they return. For initiating a request on its own, the urlopen() method is all we need. Let's look at the API of urlopen():

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, context=None)
  • data is optional and must be of type bytes; use the bytes() function to convert content into a byte stream. When the data parameter is used, the request is made with the POST method, submitting the content as a form in the standard application/x-www-form-urlencoded format.
  • timeout sets the request timeout, in seconds.
  • cafile and capath are the path of the CA certificate file and the path of a directory of CA certificates; you need them when requesting HTTPS links.
  • context must be of type ssl.SSLContext; it specifies the SSL settings.
  • Instead of a URL string, the method also accepts a single urllib.request.Request object.
  • The function returns an http.client.HTTPResponse object.
Use urllib.request.urlopen() to request Baidu Tieba and fetch the source code of its page:
import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()          # fetch the page source
print(html.decode('utf-8'))     # decode it as UTF-8

1.2 Setting a request timeout

Some requests may never receive a response because of network problems, so we can set a timeout manually. When a request times out, we can take further action, such as retrying the request or abandoning it.

import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
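When the timeout expires, urlopen() raises an exception that we can catch and act on. A minimal sketch of that follow-up step (the fetch name and the give-up policy here are illustrative, not part of urllib):

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=1):
    """Return the page body as bytes, or None if the request timed out."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        return response.read()
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('request timed out')    # here we could retry instead
            return None
        raise                             # some other network problem
    except socket.timeout:                # reading the response can also time out
        print('request timed out')
        return None
```

Catching socket.timeout separately matters because the timeout can fire either while connecting (where urllib wraps it in a URLError) or while reading the response (where it is raised directly).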

1.3 Submitting data with the data parameter

When requesting certain pages we need to carry some data with the request; for that we use the data parameter.

import urllib.parse
import urllib.request

url = "http://www.baidu.com/"
params = {
  'name':'TTT',
  'author':'Miracle'
}

data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))

params must be transcoded into a byte stream, but params is a dictionary, so we first use urllib.parse.urlencode() to turn the dictionary into a string, then bytes() to turn that string into a byte stream. Finally urlopen() initiates the request; because data is present, the request simulates submitting form data with the POST method.

Note: when a URL contains Chinese characters or "/", it needs urlencode encoding conversion. urlencode takes a dictionary as its parameter and converts its key-value pairs into the key=value format we want.
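To make the note concrete: urllib.parse.quote percent-encodes unsafe characters (such as Chinese or "/") in a single URL component, while urllib.parse.urlencode converts a whole dictionary into a query string. A small illustration:

```python
from urllib.parse import quote, unquote, urlencode

# quote() percent-encodes one URL component; "/" is kept by default,
# pass safe='' to encode it as well
print(quote('奇迹'))                  # %E5%A5%87%E8%BF%B9
print(quote('a/b', safe=''))          # a%2Fb

# urlencode() turns a dict of key-value pairs into key=value&key=value
params = {'name': 'TTT', 'author': 'Miracle'}
print(urlencode(params))              # name=TTT&author=Miracle

# unquote() reverses the encoding
print(unquote('%E5%A5%87%E8%BF%B9'))  # 奇迹
```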

1.4 Request

We have seen that urlopen() can initiate a simple request, but a few simple parameters are not enough to build a complete one. If a request needs headers or a specific request method, we can use the more powerful Request class to construct it.
As usual, let's look at the Request constructor:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
  • data: used the same way as the data parameter of urlopen().
  • headers: the headers of the HTTP request to send, as a dictionary. Besides passing them to the Request constructor, you can also add headers by calling the add_header() method of the Request instance.
  • origin_req_host: the host name or IP address of the requesting party.
  • unverifiable: whether the request is unverifiable; the default is False. It means the user does not have sufficient rights to choose to receive the result of this request. For example, if we request an image embedded in an HTML document but have no permission to fetch images automatically, we set unverifiable to True.
  • method: the HTTP request method to use, e.g. GET, POST, DELETE or PUT.
1.4.1 Simple use of Request

Use Request to disguise the HTTP request as coming from a browser. If the User-Agent header is not set, the default User-Agent is Python-urllib/3.5. Some sites intercept such requests, so it is necessary to disguise the request as a browser before sending it. Here I use the Chrome browser's User-Agent.

# Disguise the request by changing the User-Agent to Chrome's UA
import urllib.request

url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
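The other constructor parameters can be exercised without touching the network, because a Request object can be inspected before it is handed to urlopen(). A sketch (the header value is abbreviated for brevity):

```python
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'name': 'TTT'}), encoding='utf8')
request = urllib.request.Request('http://tieba.baidu.com/', data=data, method='POST')

# headers can also be added after construction with add_header()
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')

print(request.get_method())              # POST
print(request.has_header('User-agent'))  # True -- header keys are stored capitalized
```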

1.4.2 Advanced usage of Request

If we need to add a proxy to a request, or handle its Cookies, we need Handler and OpenerDirector.

1) Handler
A Handler is, as its name says, a handler or processor: Handlers can deal with the various aspects of a request (HTTP, HTTPS, FTP, and so on). The concrete embodiment of this idea is the class urllib.request.BaseHandler. It is the base class of all Handlers and provides the most basic Handler methods, such as default_open() and protocol_request().
Many classes inherit from BaseHandler; here are some of the more common ones:

  • ProxyHandler: sets a proxy for the request
  • HTTPCookieProcessor: handles Cookies in the request
  • HTTPDefaultErrorHandler: handles HTTP response errors
  • HTTPRedirectHandler: handles HTTP redirects
  • HTTPPasswordMgr: manages passwords; it maintains a list of user names and passwords
  • HTTPBasicAuthHandler: used for login authentication, generally in combination with HTTPPasswordMgr
2) OpenerDirector
We can simply call an OpenerDirector an Opener. The urlopen() method we have been using is in fact an Opener that urllib provides for us. So what is the relationship between Opener and Handler? An opener object is created with the build_opener(handler) method. To make a custom opener the default one, use the install_opener(opener) method. Note that install_opener installs a global OpenerDirector object.

1.5 Using a proxy

Now that we know about openers and handlers, let's go deeper through examples. The first example sets a proxy for HTTP requests.
Some sites limit how frequently they may be browsed; if we request a site too often, it will block our IP and deny us access. So we use a proxy to break this "yoke".

import urllib.request

url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 Chrome/56.0.2924.87 Safari/537.36'
}

proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

1.6 Authentication login

Some sites require you to log in with an account and password before you can continue browsing. Faced with such a site, we need authenticated login. First use HTTPPasswordMgrWithDefaultRealm() to instantiate an account-and-password management object; then use its add_password() method to add the account and password; then use HTTPBasicAuthHandler() to get a handler; then build_opener() to get an opener object; finally call the opener's open() method to initiate the request.

The second example logs in to my cnblogs blog with an account and password, as follows:

import urllib.request
url = "http://cnblogs.com/xtznb/"
user = '奇迹'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None,url,user,password)

auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))

1.7 Setting Cookies

If the requested page requires authentication every time, we can use Cookies to log in automatically, sparing us repeated login operations. To obtain Cookies, first instantiate a Cookies object with http.cookiejar.CookieJar(); then construct a handler object with urllib.request.HTTPCookieProcessor; finally use the opener's open() method.

The third example requests Baidu Tieba, obtains its Cookies and saves them to a file, as follows:

import http.cookiejar
import urllib.request

url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

f = open(fileName,'a')
for item in cookie:
    f.write(item.name+" = "+item.value+'\n')
f.close()
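http.cookiejar also ships jars that handle the file format for you: MozillaCookieJar can save() and load() cookies in the Netscape format, so there is no need to write the file by hand. A sketch (the file name is arbitrary, and the opener.open() call is commented out so the snippet does not depend on the network):

```python
import http.cookiejar
import urllib.request

fileName = 'cookie_mozilla.txt'
cookie = http.cookiejar.MozillaCookieJar(fileName)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# opener.open("http://tieba.baidu.com/")   # would fill the jar from the response

cookie.save(ignore_discard=True, ignore_expires=True)            # write the Netscape-format file
cookie.load(fileName, ignore_discard=True, ignore_expires=True)  # and read it back later
```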

1.8 HTTPResponse

As the examples above show, the result of urllib.request.urlopen() or opener.open(url) is an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel and closed, and methods such as read(), readinto(), getheader(name), getheaders() and fileno().
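A small helper makes those attributes visible. Below is a sketch (the describe name is ours, not part of urllib) that prints the most commonly used pieces of a response:

```python
import urllib.request

def describe(response):
    """Print the commonly used parts of an http.client.HTTPResponse."""
    print('status :', response.status)     # e.g. 200
    print('reason :', response.reason)     # e.g. OK
    print('version:', response.version)    # 10 = HTTP/1.0, 11 = HTTP/1.1
    print('Content-Type:', response.getheader('Content-Type'))
    for name, value in response.getheaders():   # all headers as (name, value) pairs
        print(' ', name, '=', value)
    return response.read()                 # the body, as bytes; only readable once

# response = urllib.request.urlopen("http://tieba.baidu.com/")
# html = describe(response)
```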

 

2 Exception handling

Initiating a request will inevitably produce all kinds of exceptions; we need to handle them, which makes the program more robust and user-friendly.
Exception handling mainly uses two classes, urllib.error.URLError and urllib.error.HTTPError.

  • URLError
    URLError is the base class of the urllib.error exception classes; it can catch exceptions raised by urllib.request.
    It has a reason attribute that returns the cause of the error.

Sample code for catching a URL error:

import urllib.request
import urllib.error

url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
  • HTTPError: HTTPError is a subclass of URLError, dedicated to errors in HTTP and HTTPS requests. It has three attributes:

  1) code: the status code returned by the HTTP request.

  2) reason: as in the parent class, the cause of the error.

  3) headers: the HTTP response headers returned for the request.

Sample code for an HTTP exception, printing the error status code, the cause of the error and the server's response headers:

import urllib.request
import urllib.error

url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print('code:', e.code)
    print('reason:', e.reason)
    print('headers:', e.headers)
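In practice the two exception classes are usually combined. Because HTTPError is a subclass of URLError, it must be caught first; a sketch (the safe_get name is illustrative):

```python
import urllib.error
import urllib.request

def safe_get(url):
    """Return the body as bytes, or None if the request failed."""
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:   # subclass first: the server replied with an error status
        print('HTTP error %d: %s' % (e.code, e.reason))
    except urllib.error.URLError as e:    # base class: the server could not be reached at all
        print('failed to reach the server:', e.reason)
    return None
```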

Origin www.cnblogs.com/xtznb/p/10960396.html