Getting started with web crawlers: your first crawler project (the requests library)

0. Why the requests library

Although urllib is widely used and, as part of Python's standard library, requires no installation, most crawlers now use the requests library to handle complex HTTP requests. requests has a simple, clear syntax, is straightforward to use, and has become the de facto standard for most web crawling.
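To see how little code a basic request takes, here is a minimal sketch (the example URL is just the one used later in this post):

import requests

# A minimal GET request: fetch a page and inspect the response
resp = requests.get('http://www.baidu.com')
print(resp.status_code)   # HTTP status code, e.g. 200
print(resp.text[:200])    # first 200 characters of the page content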

1. Installing the requests library
Install it with pip; at the command prompt, enter:

pip install requests
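If the installation succeeded, importing the library and printing its version should work (a quick sanity check, assuming a standard pip install):

import requests

# Prints the installed version, e.g. '2.x.x', if the install succeeded
print(requests.__version__)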


2. Sample code
The example below handles the HTTP request headers (a simple way to counter basic anti-crawler measures), accepts a proxy parameter, and performs exception handling.

import requests


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """Download the given URL and return the page content.
    Arguments:
        url (str): URL
    Keyword arguments:
        user_agent (str): user agent (default: wswp)
        proxies (dict): proxies (dictionary), keys: 'http', 'https',
            values: strings (e.g. 'http(s)://IP')
        num_retries (int): number of retries on 5xx errors (default: 2)
        # 5xx server errors indicate the server failed to fulfill an apparently valid request.
        # https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
    """
    print('==========================================')
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}  # set the headers; with the default headers some pages return errors or block the request
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)  # the quick-and-dirty version is simply requests.get(url)
        html = resp.text  # the page content as a string
        if resp.status_code >= 400:  # exception handling: 4xx client errors return None
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # 5xx server error
                return download(url, num_retries - 1)  # retry up to twice if the server returns an error
    except requests.exceptions.RequestException as e:  # other errors: report and return None
        print('Download error:', e)
        html = None
    return html  # return the HTML


print(download('http://www.baidu.com'))

result:

Downloading: http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK-->

</script>

<script>
if(navigator.cookieEnabled){
    document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>



</body>
</html>
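The download function can also be called with a custom user agent and a proxy dictionary. The sketch below shows the calling convention; the proxy addresses are placeholders, not working proxies:

# Hypothetical usage with a custom user agent and proxies
# (replace the proxy addresses with real ones before running)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
html = download('http://www.baidu.com', user_agent='my-crawler', proxies=proxies)
if html is None:
    print('Download failed')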

Source: www.cnblogs.com/pypypy/p/12003942.html