Web crawling means reading the network resource identified by a URL from the network stream and saving it locally. Python has many libraries that can be used to fetch web pages; we will start with urllib2.
urllib2 is a built-in module in Python 2.x (no download needed; just import it and use it).
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
In Python 3.x, urllib2 was renamed to urllib.request.
urlopen
Let's start with a short piece of code:
```python
# -*- coding: utf-8 -*-
# 01.urllib2_urlopen.py

# import the urllib2 library
import urllib2

# send a request to the specified url; the server responds with a file-like object
response = urllib2.urlopen("http://www.baidu.com")

# the file-like object supports file operations, e.g. read() reads the content
html = response.read()

# print the string
print(html)
```
Run the Python code we just wrote, and it will print the result:
```
python2 01.urllib2_urlopen.py
```
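If you are on Python 3, urllib2 no longer exists and the equivalent call lives in urllib.request. A minimal sketch of the same idea (it uses a self-contained `data:` URL instead of a live site, so it runs without network access):

```python
# Python 3 equivalent of the urllib2 example above
from urllib.request import urlopen

# urlopen() also understands data: URLs, which embed the content directly,
# so this sketch needs no network connection
response = urlopen("data:text/plain;charset=utf-8,Hello%20crawler")

# in Python 3, read() returns bytes; decode to get a str
html = response.read().decode("utf-8")
print(html)  # -> Hello crawler
```

For a real page you would pass `"http://www.baidu.com"` just as in the urllib2 version.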
In fact, if we open the Baidu home page in a browser, right-click and choose "View Source", you will find it is exactly the same as what we just printed out. In other words, the four lines of code above fetched the entire source of the Baidu home page for us.
A basic URL request in Python is really quite simple.
Request
Looking at the official documentation, urlopen is used as follows:
```
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
```

Open the URL url, which can be either a string or a Request object.
In our first example, the parameter passed to urlopen() was a url string; but if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance to use as the urlopen() parameter, and the url to be accessed becomes a parameter of that Request instance instead.
```python
# -*- coding: utf-8 -*-
# 02.urllib2_request.py

import urllib2

# with the url as a parameter, Request() constructs and returns a Request object
request = urllib2.Request("http://www.baidu.com")

# pass the Request object to urlopen(), send it to the server and receive the response
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
The result is exactly the same as before.
When creating a new Request instance, besides the required url parameter, two optional parameters may also be provided:

- data (empty by default): the data to submit along with the url (for example, data to POST); supplying it changes the HTTP request method from "GET" to "POST".
- headers (empty by default): a dictionary containing the HTTP header key-value pairs to send.

These two parameters are discussed below.
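The effect of the data parameter can be observed without sending anything over the network: a Request built with data reports POST as its method. A small sketch in Python 3 syntax (urllib.request; the Python 2 urllib2.Request exposes the same get_method()), using a placeholder example.com address:

```python
from urllib.request import Request

url = "http://www.example.com"  # placeholder address; no request is actually sent

# without data: a plain GET request
get_req = Request(url)
print(get_req.get_method())   # -> GET

# with data: the very same call becomes a POST request
post_req = Request(url, data=b"key=value",
                   headers={"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0;)"})
print(post_req.get_method())  # -> POST

# headers passed in the constructor are stored with only the first letter capitalized
print(post_req.get_header("User-agent"))  # -> Mozilla/5.0 (compatible; MSIE 9.0;)
```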
User-Agent
But sending a request to a website with urllib2 directly like this is a bit abrupt. It is as if every house has a door, yet you barge straight in as a passerby, which is obviously not very polite. Moreover, some sites dislike programmatic (non-human) access and may deny your requests.

If instead we request the site with a legitimate identity, it will clearly welcome us, so we should give our code an identity of its own: the so-called User-Agent header.
A browser is a recognized, permitted identity in the internet world. If we want our crawler to look more like a real user, our first step is to disguise it as a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is:

```
Python-urllib/x.y
```

(where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7)
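You can inspect this default yourself. In Python 3 (urllib.request), a freshly built opener carries exactly this Python-urllib/x.y header in its default header list; a quick sketch:

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# default headers that would be sent with every request it makes
opener = urllib.request.build_opener()
name, value = opener.addheaders[0]
print(name, value)  # e.g. "User-agent Python-urllib/3.11" (version depends on your interpreter)
```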
```python
# -*- coding: utf-8 -*-
# 03.urllib2_useragent.py

import urllib2

url = "http://www.itcast.cn"

# User-Agent of IE 9.0, wrapped in a ua_header dictionary
ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

# build the request from the url together with the headers;
# the request will carry the IE 9.0 browser's User-Agent
request = urllib2.Request(url, headers=ua_header)

# send this request to the server
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
Adding more Header information
Add specific headers to the HTTP Request to construct a complete HTTP request message.
By calling Request.add_header() you can add or modify a specific header, and by calling Request.get_header() you can view an existing header.
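These two methods can be tried without any network traffic. Note that urllib normalizes header names by capitalizing only the first letter, which is why get_header() below asks for "User-agent" rather than "User-Agent". A sketch in Python 3 syntax (the Python 2 urllib2.Request behaves the same way), against a placeholder example.com url:

```python
from urllib.request import Request

request = Request("http://www.example.com")  # placeholder url; nothing is sent

# add_header() stores the header, normalizing its name to "User-agent"
request.add_header("User-Agent", "Mozilla/5.0 (compatible; MSIE 9.0;)")

# get_header() looks the header up by its normalized name
print(request.get_header("User-agent"))  # -> Mozilla/5.0 (compatible; MSIE 9.0;)
print(request.has_header("User-agent"))  # -> True
```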
- Add a specific header
```python
# -*- coding: utf-8 -*-
# 04.urllib2_headers.py

import urllib2

url = "http://www.itcast.cn"

# User-Agent header of IE 9.0
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

request = urllib2.Request(url, headers=header)

# a specific header can also be added/modified by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# and header information can be viewed by calling Request.get_header()
request.get_header(header_name="Connection")

response = urllib2.urlopen(request)

print(response.code)  # view the response status code

html = response.read()
print(html)
```

- Randomly add/modify the User-Agent

```python
# -*- coding: utf-8 -*-
# 05.urllib2_add_headers.py

import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# add/modify a specific header by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# note: get_header() expects the first letter capitalized and the rest lowercase
request.get_header("User-agent")

response = urllib2.urlopen(request)
html = response.read()
print(html)
```
Note

In Python 3, the urllib2 module has been split across several modules, named urllib.request and urllib.error.
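Code that must run under both interpreters often papers over this split with a guarded import; a common sketch:

```python
# import urllib2 on Python 2, fall back to urllib.request on Python 3
try:
    import urllib2 as url_request          # Python 2
except ImportError:
    import urllib.request as url_request   # Python 3

# either way, the familiar names are available under one alias
print(hasattr(url_request, "urlopen"))  # -> True
print(hasattr(url_request, "Request"))  # -> True
```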