Python crawler basics (1)

  • Common request header parameters

In the HTTP protocol, when a request is sent to the server, the data is divided into three parts: the first is the data in the URL, the second is the data in the request body (used in POST requests), and the third is the data in the request headers. Here are some of the request header parameters that crawlers often use (a sketch showing how to set them follows this list):
1. User-Agent : the browser name, which is used very often in web crawlers. When a page is requested, the server can tell from this parameter which browser sent the request. If we send the request with a crawler, the default User-Agent identifies itself as Python, and for sites with an anti-crawler mechanism it is easy to see that the request comes from a crawler. We should therefore regularly set this value to that of a real browser to disguise our crawler.
2. Referer : indicates which URL the current request came from. This is also commonly used as an anti-crawler technique: if the request does not come from the specified page, the server does not return the relevant response.
3. Cookie : the HTTP protocol is stateless, meaning that when the same person sends two requests, the server has no way of knowing whether they came from the same person. Cookies are therefore used for identification; in general, if a site requires login to access, the cookie information must be sent along.
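
As a concrete illustration, these headers can be set by wrapping the URL in a request.Request object. The following is a minimal sketch; the User-Agent, Referer, and Cookie values are made-up placeholders, not values required by any particular site:

from urllib import request

url = "http://www.baidu.com"
headers = {
    # pretend the request comes from a browser (placeholder User-Agent string)
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # claim the request comes from this page (placeholder)
    "Referer": "http://www.baidu.com/",
    # placeholder cookie string; a real site would need a valid session cookie
    "Cookie": "sessionid=xxxx",
}

req = request.Request(url, headers=headers)  # attach the headers to the request
resp = request.urlopen(req)
print(resp.read()[:200])  # print the first 200 bytes of the response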
  • Common response status codes

1. 200 : the request is normal and the server returns the data normally

2. 301 : permanent redirect. For example, when accessing www.jingdong.com you are redirected to www.jd.com

3. 302 : temporary redirect. For example, when accessing a page that requires login while not yet logged in, you are redirected to the login page

4. 400 : the requested URL cannot be found on the server; in other words, the request URL is wrong

5. 403 : the server refuses access; insufficient permissions

6. 500 : internal server error; there may be a bug on the server. A small sketch for checking status codes follows this list.
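
As a minimal sketch, the status code of a response can be read with getcode(), and error codes such as 403 or 500 can be caught as HTTPError (urlopen follows 301/302 redirects automatically):

from urllib import request
from urllib.error import HTTPError

try:
    resp = request.urlopen("http://www.baidu.com")
    print(resp.getcode())   # 200 for a normal response
except HTTPError as e:
    print(e.code)           # e.g. 403 or 500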

 

  • urllib library : the urllib library is Python's basic network request library. It can simulate the behavior of a browser, send a request to a specified server, and save the data returned by the server.
  • urlopen function:

In Python 3's urllib library, all methods related to network requests are collected under the urllib.request module. Let's first look at the basic usage of urlopen:

from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read())

In fact, if you visit Baidu in a browser and right-click to view the page source, you will find that the data we just printed out is exactly the same. In other words, the three lines of code above have already crawled down all the code of Baidu's homepage for us. A basic URL request in Python is really that simple. The parameters of urlopen are explained in detail below (see the sketch after this list):

1. url : the URL to request

2. data : the request data; if this value is set, the request becomes a POST request

3. Return value : an http.client.HTTPResponse object, a file-like object with methods such as read(size), readline, readlines, and getcode
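
The following minimal sketch illustrates these points: passing data turns the request into a POST, and the returned object supports readline, read(size), and getcode. The URL http://httpbin.org/post is only an assumption here, chosen because it echoes back what it receives; any endpoint that accepts POST would do:

from urllib import request, parse

# data must be URL-encoded and converted to bytes before being passed in
data = parse.urlencode({"name": "爬虫"}).encode("utf-8")
resp = request.urlopen("http://httpbin.org/post", data=data)

print(resp.getcode())    # status code, e.g. 200
print(resp.readline())   # read one line of the response body
print(resp.read(100))    # read at most 100 more bytes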

 

  • urlretrieve function
This function can easily save a web page to a local file. The following code downloads Baidu's homepage to the local file baidu.html (a variant with a progress callback follows the example):
from urllib import request
request.urlretrieve('http://www.baidu.com/', 'baidu.html')
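
urlretrieve also accepts an optional reporthook callback, which is called with (block_number, block_size, total_size) as data arrives, so download progress can be printed. A minimal sketch:

from urllib import request

def report(block_num, block_size, total_size):
    # rough progress indicator; total_size may be -1 if the server does not send it
    downloaded = block_num * block_size
    print("downloaded %s / %s bytes" % (downloaded, total_size))

request.urlretrieve('http://www.baidu.com/', 'baidu.html', report)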
  • urlencode function:

When sending a request with the browser, if the URL contains Chinese or other special characters, the browser automatically encodes them for us. If the request is sent from code, it must be encoded manually; in that case the urlencode function should be used. urlencode converts dictionary data into URL-encoded data. Sample code is as follows:

from urllib import parse

data = {"name": "爬虫", "greet": "hello world", "age": 100}
qs = parse.urlencode(data)
print(qs)

The parse_qs function can decode URL-encoded parameters back into a dictionary, as in the sketch below.
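
For example, feeding the output of the urlencode example above back into parse_qs recovers the original values (a minimal sketch; note that parse_qs returns each value as a list):

from urllib import parse

qs = "name=%E7%88%AC%E8%99%AB&greet=hello+world&age=100"
print(parse.parse_qs(qs))
# {'name': ['爬虫'], 'greet': ['hello world'], 'age': ['100']}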

 

  • ProxyHandler processor (setting a proxy)

Many websites detect the number of visits from a given IP over a certain period of time (via traffic statistics, system logs, etc.). If the number of visits does not look like that of a normal person, the site will ban the IP. So we can set up some proxy servers and switch to a different proxy from time to time; even if one IP is banned, we can change the IP and continue crawling. In urllib, a proxy server is set up with ProxyHandler. The following code shows how to use a custom opener with a proxy (a global variant using install_opener follows the example):

from urllib import request

# This request does not use a proxy
# resp = request.urlopen("http://baidu.com")
# print(resp.read().decode("utf-8"))

# This request uses a proxy
handler = request.ProxyHandler({"http": "218.66.82:32512"})

opener = request.build_opener(handler)
req = request.Request("http://www.baidu.com")
resp = opener.open(req)
print(resp.read())
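
If every subsequent call to request.urlopen should go through the proxy, the opener can also be installed globally with install_opener. A minimal sketch (the proxy address is just a placeholder, the same as above):

from urllib import request

handler = request.ProxyHandler({"http": "218.66.82:32512"})
opener = request.build_opener(handler)
request.install_opener(opener)

# from now on urlopen uses the proxy by default
resp = request.urlopen("http://www.baidu.com")
print(resp.read())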

 
