Python Crawler (2): Using urllib2

So-called web crawling means: given the URL of a network resource, read it from the network stream and save it locally. Python has many libraries that can be used to fetch web pages; we will start with urllib2.

urllib2 is a built-in Python 2.x module (no download needed; just import it and use it).
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code

In Python 3.x, urllib2 was renamed to urllib.request.

urlopen

Let's start with a simple piece of code:

# -*- coding: utf-8 -*-
# 01.urllib2_urlopen.py

# import the urllib2 library
import urllib2

# send a request to the specified url; the server returns a file-like object
response = urllib2.urlopen("http://www.baidu.com")

# the file-like object supports file operations, e.g. read() reads the whole response body
html = response.read()

# print the string
print(html)

Run the Python script we just wrote, and it will print the result:

python2 01.urllib2_urlopen.py

In fact, if you open the Baidu home page in a browser, right-click and choose "View Source", you will find that it is exactly the same as what we just printed. In other words, the four lines of code above fetched the entire source of the Baidu home page for us.
The Python code for a basic url request is really quite simple.
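Since urllib2 only exists on Python 2, here is a sketch of the same request written for Python 3, where the module is named urllib.request (the fetch() helper is just an illustrative name, not part of the library):

```python
import urllib.request

def fetch(url):
    # urlopen works the same way in Python 3: it returns a file-like object
    response = urllib.request.urlopen(url)
    return response.read()

# usage (network access required):
# html = fetch("http://www.baidu.com")
# print(html)
```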

Request

The official documentation describes urlopen as follows:

urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
    Open the URL url, which can be either a string or a Request object.

In our first example, the parameter to urlopen() was a url address;
but if you need more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen() as its parameter, with the url to access given as a parameter of the Request instance instead.

# -*- coding: utf-8 -*-
# 02.urllib2_request.py

import urllib2

# pass the url to Request() to construct and return a Request object
request = urllib2.Request("http://www.baidu.com")

# pass the Request object to urlopen(), send it to the server and receive the response
response = urllib2.urlopen(request)

html = response.read()

print(html)

The result is exactly the same as before.

When creating a new Request instance, in addition to the required url parameter, two optional parameters can be set:

  1. data (empty by default): data to submit along with the url (for example, data to POST); supplying it changes the HTTP request method from "GET" to "POST".
  2. headers (empty by default): a dictionary containing the HTTP header key-value pairs to send.

We will discuss these two parameters below.
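Point 1 can be checked without any network access. This sketch uses Python 3's urllib.request (the form fields and url are made-up placeholders): supplying data switches the request method from GET to POST.

```python
import urllib.request
from urllib.parse import urlencode

# hypothetical form data; the url is just a placeholder
data = urlencode({"kw": "python"}).encode("utf-8")

# with data supplied, the request method becomes POST; without it, GET
post_request = urllib.request.Request("http://www.example.com", data=data)
get_request = urllib.request.Request("http://www.example.com")

print(post_request.get_method())  # POST
print(get_request.get_method())   # GET
```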

User-Agent

However, sending a request to a website directly with urllib2 like this is a bit abrupt. It is as if every house has a door, but you barge in with the identity of a random passerby, which is obviously not very polite. Moreover, some sites dislike being visited by programs (non-human access) and may deny your requests.

But if we request the site with a legitimate identity, it will clearly welcome us. So we should give our code an identity of its own: the so-called User-Agent header.

  • Browsers are a universally recognized and permitted identity on the Internet. If we want our crawler to look more like a real user, the first step is to disguise it as a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
# -*- coding: utf-8 -*-
# 03.urllib2_useragent.py

import urllib2

url = "http://www.itcast.cn"

# the User-Agent of IE 9.0, wrapped in ua_header
ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

# construct the request with the url and headers; it will carry the IE 9.0 User-Agent
request = urllib2.Request(url, headers=ua_header)

# send this request to the server
response = urllib2.urlopen(request)

html = response.read()

print(html)

Adding more Header information

Add specific headers to an HTTP request to construct a complete HTTP request message.

You can add/modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().

  • Add a specific header
# -*- coding: utf-8 -*-
# 04.urllib2_headers.py

import urllib2

url = "http://www.itcast.cn"

# the User-Agent of IE 9.0
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib2.Request(url, headers=header)

# you can also add/modify a specific header by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# you can also view header information by calling Request.get_header()
request.get_header(header_name="Connection")

response = urllib2.urlopen(request)
print(response.code)    # you can view the response status code

html = response.read()
print(html)

  • Randomly add/modify the User-Agent

# -*- coding: utf-8 -*-
# 05.urllib2_add_headers.py

import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)...",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS..."
]

user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# you can add/modify a specific header by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# note: for get_header(), the first letter is capitalized and the rest are lowercase
request.get_header("User-agent")

response = urllib2.urlopen(request)

html = response.read()

print(html)
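The capitalization rule for get_header() can be demonstrated offline. This sketch uses Python 3's urllib.request (the behavior matches urllib2): add_header() stores the header name with only the first letter capitalized, so get_header() must be called with exactly that form.

```python
import urllib.request

request = urllib.request.Request("http://www.example.com")
request.add_header("User-Agent", "Mozilla/5.0")

# the stored key is "User-agent": first letter capitalized, the rest lowercase
print(request.get_header("User-agent"))   # Mozilla/5.0
print(request.get_header("User-Agent"))   # None (wrong capitalization)
```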

Note

In Python 3, the urllib2 module has been split across several modules, mainly urllib.request and urllib.error.
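A common pattern (a sketch, not from the original text) for code that must run on both versions is to try the Python 2 import first and fall back to the Python 3 name:

```python
try:
    import urllib2 as url_request         # Python 2
except ImportError:
    import urllib.request as url_request  # Python 3

# either way, url_request.urlopen() and url_request.Request() are available
print(hasattr(url_request, "urlopen"))
print(hasattr(url_request, "Request"))
```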
