In-depth understanding of urllib, urllib2 and requests

Difference between urllib and urllib2

Both urllib and urllib2 perform operations related to requesting URLs, but they provide different functionality:
urllib2.urlopen accepts either a Request object or a URL string; when given a Request object, you can set the URL's headers. urllib.urlopen only accepts a URL string.
urllib has urlencode(), while urllib2 does not; this is why urllib and urllib2 are often used together, as in the example below.

     import urllib
     from urllib2 import Request, urlopen

     r = Request(url='http://www.mysite.com')
     r.add_header('User-Agent', 'awesome fetcher')
     r.add_data(urllib.urlencode({'foo': 'bar'}))
     response = urlopen(r)    # data is set, so this sends a POST

urllib module

I. urlencode cannot handle unicode objects directly; a unicode string must be encoded first (converted to UTF-8, for example):

  urllib.urlencode({'key': u'bl'.encode('utf-8')})
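
A slightly fuller sketch of the same point (the query values here are hypothetical): encode each unicode value to UTF-8 before handing the mapping to urlencode:

  # -*- coding: utf-8 -*-
  import urllib

  params = {'q': u'中文'.encode('utf-8'), 'page': '1'}
  print urllib.urlencode(params)   # e.g. q=%E4%B8%AD%E6%96%87&page=1 (dict order may vary)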

II. Examples

 import urllib

 # Sohu mobile homepage
 url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
 resp = urllib.urlopen(url)
 page = resp.read()
 f = open('./urllib_index.html', 'w')
 f.write(page)
 f.close()
 print dir(resp)

result:

['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'next', 'read', 'readline', 'readlines', 'url']

 print resp.getcode(), resp.geturl(), resp.info(), resp.headers, resp.url
 #resp.url is the same as resp.geturl()

III. Encoding examples. urllib.quote and urllib.urlencode both perform encoding, but they are used differently:

 s = urllib.quote('This is python')              # encode; spaces become %20
 print 'quote:\t' + s
 s_un = urllib.unquote(s)                        # decode
 print 'unquote:\t' + s_un
 s_plus = urllib.quote_plus('This is python')    # encode; spaces become +
 print 'quote_plus:\t' + s_plus
 s_unplus = urllib.unquote_plus(s_plus)          # decode
 print 's_unplus:\t' + s_unplus
 s_dict = {'name': 'dkf', 'pass': '1234'}
 s_encode = urllib.urlencode(s_dict)             # encode a dict into URL query parameters
 print 's_encode:\t' + s_encode

result:

 quote: This%20is%20python
 unquote:   This is python
 quote_plus:    This+is+python
 s_unplus:  This is python
 s_encode:  name=dkf&pass=1234

IV. urlretrieve() is best suited to simple download-only tasks, optionally displaying download progress (see the reporthook sketch below).

 url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
 urllib.urlretrieve(url, './retrieve_index.html')

 # Downloads the page at url directly into retrieve_index.html;
 # suitable for simple download tasks.
 # Signature: urllib.urlretrieve(url, filename, reporthook, data)
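
To display progress, urlretrieve accepts a reporthook callback that is invoked once per block read. A minimal sketch, reusing the sample URL above:

 import urllib

 def report(block_num, block_size, total_size):
     # total_size is -1 when the server sends no Content-Length header
     if total_size > 0:
         percent = min(100, block_num * block_size * 100 / total_size)
         print 'downloaded %d%%' % percent

 url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
 urllib.urlretrieve(url, './retrieve_index.html', report)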

urllib2

I. The functions and classes defined by the urllib2 module are used to fetch URLs (mainly over HTTP). It provides interfaces for more complex tasks: basic authentication, redirection, cookies, and so on.

II. Common methods and classes

II.1 urllib2.urlopen(url[, data][, timeout])  # when passing a plain url string, usage is the same as urllib.urlopen

II.1.1 It opens the URL; the url parameter can be a URL string or a Request object. The optional timeout parameter sets a timeout in seconds for blocking operations such as the connection attempt (if it is not specified, the global default timeout is used). In fact, it only works for HTTP, HTTPS and FTP connections.

 url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
 resp = urllib2.urlopen(url)
 page = resp.read()
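
The timeout parameter described above is passed alongside the URL; a minimal sketch (the 5-second value is arbitrary):

 import urllib2

 # give up if a blocking operation such as the connection attempt exceeds 5 seconds
 resp = urllib2.urlopen('http://m.sohu.com/', timeout=5)
 print resp.getcode()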

II.1.2 The urlopen method can also be given a Request object that specifies the desired URL. Calling urlopen returns a response object for the requested URL. The response behaves like a file object, so it can be read with .read():

   url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
   req = urllib2.Request(url)
   resp = urllib2.urlopen(req)
   page = resp.read()

II.2 class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

II.2.1 The Request class represents an abstract URL request. Its five parameters are described as follows:

II.2.1.1 url - a string containing a valid URL.

II.2.1.2 data - a string of additional data to send to the server, or None if no data needs to be sent. Currently only HTTP requests use data; when a request carries the data parameter, the HTTP request is a POST rather than a GET. data should be encoded in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in this format. In plain terms, if you want to send data to a URL (usually to a CGI script or another web application), for example when filling in a form online, the browser POSTs the form's contents; that data must be encoded into the standard format and then passed to the Request object as the data parameter. The encoding is done in the urllib module, not in urllib2. Here is an example:

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)    # data present, so this sends a POST
response = urllib2.urlopen(req)
page = response.read()

II.2.1.3 headers - a dictionary. Headers can be passed in directly as a parameter when constructing the Request, or added afterwards by calling the add_header() method with a key and a value. The User-Agent header, which identifies the browser, is often used for spoofing and disguise, because some HTTP services only allow requests from common browsers rather than from scripts, or return different versions to different browsers. For example, the Mozilla Firefox browser identifies itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11". By default, urllib2 identifies itself as Python-urllib/x.y, where x.y is the major.minor version of the Python release; in Python 2.6, for instance, the default user agent string is "Python-urllib/2.6". The following example differs from the one above in that it adds a header to the request, imitating the IE browser when submitting the request.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

The standard headers (Content-Length, Content-Type and Host) are only added when the Request is passed to urlopen() (as in the example above) or to OpenerDirector.open(). The two cases are illustrated as follows: the example above constructed the Request with the headers parameter, so the headers were initialized when the Request object was created; the example below instead calls the Request object's add_header(key, val) method to attach a header (the Request object's methods are introduced later):

import urllib2

req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# HTTP is a stateless protocol: a client's previous request is unrelated
# to its next one, so most programs omit this step.
r = urllib2.urlopen(req)

OpenerDirector automatically adds a User-Agent header to each Request, so the second approach is as follows (urllib2.build_opener returns an OpenerDirector object; build_opener is discussed further below):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

II.3 urllib2.install_opener(opener) and urllib2.build_opener([handler, ...])

The two functions install_opener and build_opener are usually used together, though build_opener is sometimes used alone to obtain an OpenerDirector object.

install_opener takes an OpenerDirector instance and installs it as the global opener used by urlopen. If you would rather not touch the global opener, you can instead call OpenerDirector.open() directly in place of urlopen().

build_opener returns an OpenerDirector object. Its handler parameters can be instances of BaseHandler or its subclasses, including: ProxyHandler (added when proxy settings are detected; very important), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

import urllib2

req = urllib2.Request('http://www.python.org/')
opener = urllib2.build_opener()
urllib2.install_opener(opener)
f = opener.open(req)

Using urllib2.install_opener() as above sets urllib2's global opener. That is convenient for later use, but it does not allow fine-grained control, for example when you want to use two different proxy settings within one program. The better practice is not to change the global settings with install_opener, but to call the opener's own open method instead of the global urlopen, as the sketch below shows.
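
A minimal sketch of that practice: two openers with different proxy settings used side by side, neither installed globally (the proxy addresses are hypothetical):

import urllib2

# each opener carries its own ProxyHandler; the global opener is untouched
proxy_a = urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'})
proxy_b = urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'})
opener_a = urllib2.build_opener(proxy_a)
opener_b = urllib2.build_opener(proxy_b)

resp_a = opener_a.open('http://www.example.com/')
resp_b = opener_b.open('http://www.example.com/')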

Talk of the interplay between Opener and Handler can sound dizzying, but sorting out the idea makes it clear. When fetching a URL you use an opener (an instance of urllib2.OpenerDirector, which can be created with build_opener). Normally the program uses the default opener through urlopen (that is, when you call urlopen you are implicitly using the default opener object), but you can also create custom openers built from particular handlers. All the heavy lifting is left to these handlers: each handler knows how to open URLs for a specific protocol (http, ftp, etc.), how to handle an HTTP redirect that occurs when a URL is opened, or how to deal with HTTP cookies.

When creating an opener, if you want special handlers installed (say, an opener that handles cookies, or one that does not handle redirects), you can instantiate an OpenerDirector and call .add_handler(some_handler_instance) repeatedly. Alternatively you can use build_opener, a convenience function that creates an opener object in a single call; it adds many handlers by default and provides a quick way to add more or to disable the default ones.
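
For instance, a minimal sketch of an opener that handles cookies, built from the standard cookielib module and the HTTPCookieProcessor handler:

import cookielib
import urllib2

cj = cookielib.CookieJar()    # stores cookies across requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://www.example.com/')
print len(cj), 'cookie(s) received'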

install_opener can also be used, as described above, to create an opener object, but that object becomes the (global) default opener. This means that calls to urlopen will use the opener you just created. In other words, the code above is equivalent to the snippet below, which ends up using the default opener. In general we use build_opener to create a custom opener; there is no need to call install_opener unless it is for convenience.

import urllib2

req = urllib2.Request('http://www.python.org/')
opener = urllib2.build_opener()    # create an opener object
urllib2.install_opener(opener)     # install it as the global default opener
f = urllib2.urlopen(req)           # urlopen uses the default opener, and install_opener
                                   # has made the opener created above the global default

III. Exception handling (see http://www.jb51.net/article/63711.htm). Calling urllib2.urlopen does not always go smoothly; just as a browser sometimes reports an error when opening a URL, we need exception handling. Before discussing exceptions, let's look at several common methods of the returned response object:
geturl() - returns the URL of the resource actually retrieved; this is the real URL of the response and is commonly used to check whether a redirect occurred
info() - returns the page's meta-information, such as the headers, in the form of a mimetools.Message instance (see the description of HTTP Headers)
getcode() - returns the HTTP status code of the response; running the code above yields code=200

When it cannot handle a response, urlopen raises a URLError (built-in Python exceptions such as ValueError and TypeError may also be raised, as with other Python APIs).

  • URLError - raised by the handlers when something goes wrong at runtime (usually because there is no network connection, i.e. no route to the specified server, or the specified server does not exist)

  • HTTPError - a subclass of URLError, raised in the special case of HTTP URLs. Every HTTP response from the server carries a status code, and sometimes that code indicates the server cannot fulfill the request. The default handlers take care of some of these responses; for example, when urllib2 finds that the URL of the response differs from the URL requested, that is, when a redirect occurs, it is handled automatically. For responses it cannot process, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden) and '401' (authentication required). HTTPError carries two important attributes, reason and code.

  • Redirects are handled by the program by default
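
A minimal sketch of the usual try/except pattern (the URL is hypothetical; HTTPError must be caught before URLError because it is a subclass):

import urllib2

try:
    resp = urllib2.urlopen('http://www.example.com/nonexistent')
except urllib2.HTTPError as e:    # catch the subclass first
    print 'the server could not fulfill the request; code:', e.code
except urllib2.URLError as e:
    print 'failed to reach the server; reason:', e.reason
else:
    print resp.getcode(), resp.geturl()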

Summary

If you just need a simple download or want to show download progress, and do not need to process the downloaded content (for example when downloading images, css or js files), use urllib.urlretrieve().
If the download requires submitting a form, entering an account and password, and so on, urllib2.urlopen(urllib2.Request()) is recommended.
When encoding dictionary data, use urllib.urlencode().

requests

I. requests uses urllib3 and inherits all the features of urllib2. It supports HTTP keep-alive and connection pooling, sessions with cookie persistence, file uploads, automatic detection of the response body's encoding, and automatic encoding of internationalized URLs and POST data.

II. Example:

     import requests

     resp = requests.get('http://www.mywebsite.com/user')
     userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
     resp = requests.post('http://www.mywebsite.com/user', data=userdata)
     resp = requests.put('http://www.mywebsite.com/user/put')
     resp = requests.delete('http://www.mywebsite.com/user/delete')
     resp.json()                     # if the response body is JSON
     resp.text                       # the response body as unicode text
     resp.headers['content-type']    # e.g. 'text/html; charset=utf-8'
     f = open('request_index.html', 'w')
     f.write(resp.text.encode('utf8'))
     # Note: a page fetched with requests must be encoded before writing,
     # because resp.text is unicode; pages fetched with urllib and urllib2
     # are str and can be written directly.

III. Other Features

Internationalized domain names and URLs
Keep-alive and connection pooling
Sessions with persistent cookies (see the sketch after this list)
Browser-style SSL verification
Basic/Digest authentication
Elegant key/value cookies
Automatic decompression
Unicode response bodies
Multipart file uploads
Connection timeouts
.netrc support
Thread safety, on Python 2.6 - 3.4
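
A minimal sketch of the persistent cookie sessions mentioned above, using requests.Session (the URLs are hypothetical):

     import requests

     s = requests.Session()    # cookies set by responses persist on the session
     s.get('http://www.mywebsite.com/login')
     resp = s.get('http://www.mywebsite.com/profile')    # sends the stored cookies
     print s.cookies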

IV. requests is not part of the Python standard library; it must be installed separately, e.g. with easy_install or pip install requests.

V. A drawback of requests (as reported by others): used directly, it cannot be called asynchronously, and it can be slow. The official urllib modules can substitute for it.

VI. Personally, I do not recommend using the requests module.
