Sesame HTTP: Advanced Usage of the Urllib Library for Getting Started with Python Crawlers

1. Set Headers

Some websites will not allow a program to access them directly in the way shown above. If the request cannot be identified as coming from a browser, the site may not respond at all, so to fully simulate how a browser works we need to set some Headers properties.

 

First, open your browser and press F12 to open the developer tools (I use Chrome), then switch to the network monitoring panel. Take Zhihu as an example: after clicking login, you will find that the page changes and a new interface appears. In essence, a page contains a lot of content that is not loaded all at once; many requests are actually made. Generally the HTML file is requested first, and then JS, CSS and other resources are loaded. After many requests, the skeleton and muscles of the web page are complete, and the whole page is rendered.

Splitting these requests apart, let's look at just the first one. You can see a Request URL and headers, followed by the response. The headers contain a lot of information, such as the file encoding, the compression method, the request's User-Agent, and so on.

Among them, the User-Agent identifies the client making the request. If it is missing, the server may not respond, so you can set it in the headers. The following example only shows how to set the headers; just pay attention to the format.

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username': 'cqc', 'password': 'XXXX'}
headers = {'User-Agent': user_agent}       # identify ourselves as a browser
data = urllib.urlencode(values)            # encode the form fields for the POST body
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()

In this way we have set a header, which is passed in when the Request is constructed; when the request is sent, the header is transmitted along with it. If the server recognizes the request as coming from a browser, it will return a response.

In addition, there is a way to deal with "anti-hotlinking": the server checks whether the Referer in the headers points to itself, and if not, some servers will not respond. So we can also add a Referer to the headers.

For example, we can construct the following headers:

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}

In the same way as above, the headers are passed in as a parameter when the Request is constructed, so anti-hotlinking checks can be handled, as shown in the sketch below.
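As a quick illustration (the target URL below is hypothetical; the Referer value is the one constructed above), the Referer is sent exactly like the User-Agent:

import urllib2

# The target URL is hypothetical, for illustration only.
url = 'http://www.server.com/protected-page'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
page = response.read()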

In addition, the following headers attributes need special attention:

User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.
Content-Type: when using a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed. Common values (a JSON example follows this list):
application/xml: used for XML RPC calls, such as RESTful/SOAP
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form

When using a RESTful or SOAP service provided by a server, a wrong Content-Type setting can cause the server to refuse the request.
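As a minimal sketch of the Content-Type idea (the endpoint URL below is hypothetical), a JSON body can be sent with the matching header like this:

import json
import urllib2

# Hypothetical REST endpoint; the point is that the Content-Type header
# matches the format of the body so the server knows how to parse it.
url = 'http://www.server.com/api/items'
payload = json.dumps({'name': 'item1', 'price': 10})
headers = {'Content-Type': 'application/json'}
request = urllib2.Request(url, data=payload, headers=headers)
response = urllib2.urlopen(request)
print response.read()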

For other attributes, you can check the header content in the browser if necessary and write the same data into your request when constructing it.

2. Proxy settings

By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. Some websites detect the number of visits from a given IP within a certain period of time, and if there are too many they block your access. So you can set up some proxy servers to do the work for you, switching proxies every so often, and the site will have no idea who is up to mischief. How satisfying!

The following piece of code illustrates how a proxy is set up and used:

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})  # empty dict: do not use any proxy
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)  # make this opener the global default
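Once install_opener has been called, an ordinary urlopen call goes through the chosen proxy; alternatively, the opener can be used directly without installing it globally. A short usage sketch, continuing the snippet above (the proxy address is the placeholder from it):

# every subsequent urlopen now goes through the installed opener (and its proxy)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()

# or use the opener directly without installing it globally
response = opener.open('http://www.baidu.com')
print response.read()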

3. Timeout setting

The urlopen method was mentioned in the previous section; its third parameter is the timeout setting. You can set how long to wait before timing out, which helps cope with websites that respond slowly.

For example, in the code below, if the second parameter data is not passed, the timeout must be specified as a keyword argument; if data has been passed in, the keyword is not required.

import urllib2
# no data argument, so timeout is given as a keyword argument
response = urllib2.urlopen('http://www.baidu.com', timeout=10)


import urllib2
# data (the request body prepared earlier) is the second argument, so timeout can be positional
response = urllib2.urlopen('http://www.baidu.com', data, 10)
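If the server does not answer within the timeout, urlopen raises an exception, so in practice the call is usually wrapped in a try block. A minimal sketch, assuming the same URL as above:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com', timeout=10)
    print response.read()
except socket.timeout:
    # raised if the socket times out while reading the response
    print 'request timed out'
except urllib2.URLError as e:
    # connection-stage timeouts surface as a URLError whose reason is a socket.timeout
    print 'request failed:', e.reason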

 

4. Using HTTP's PUT and DELETE methods

The HTTP protocol defines several request methods, including GET, HEAD, PUT, DELETE, POST and OPTIONS. Sometimes we need to use the PUT or DELETE method to make a request.

PUT: this method is relatively rare, and HTML forms do not support it. In essence, PUT and POST are very similar: both send data to the server, but there is an important difference. PUT usually specifies where the resource is to be stored, while with POST the server decides where to store the data.
DELETE: deletes a resource. This is also rare, but some services, such as Amazon's S3 cloud service, use this method to delete resources.

If you want to use HTTP PUT and DELETE, strictly speaking you have to use the lower-level httplib library. Even so, we can still get urllib2 to issue PUT or DELETE requests in the following way, although they are needed rarely enough that it is only mentioned here in passing.

import urllib2
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'; override the method urllib2 would otherwise use
response = urllib2.urlopen(request)
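For comparison, here is a minimal sketch of doing the same thing with the lower-level httplib mentioned above (the host and resource path are hypothetical):

import httplib

# Hypothetical host and resource path, for illustration only.
conn = httplib.HTTPConnection('www.server.com')
conn.request('DELETE', '/resource/1')
response = conn.getresponse()
print response.status, response.reason
conn.close()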

 

5. Using DebugLog

You can turn on the Debug Log in the following way, so that the contents of the packets sent and received are printed to the screen, which is convenient for debugging. This is not used very often either, so it is just mentioned here.

import urllib2
httpHandler = urllib2.HTTPHandler(debuglevel=1)    # print outgoing/incoming HTTP traffic
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)  # same for HTTPS
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

 
