(Crawler) Python Crawler 03 (Hiding)

Table of Contents:

1. Simulating browser access

2. Delaying submissions

3. Using a proxy


1. Simulating browser access

Crawlers put a great deal of extra load on servers, so many servers simply shut crawlers out. How, then, do we make a server think the visitor is a human rather than crawler code?
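Without a custom header, urllib announces itself with a default User-Agent of the form Python-urllib/3.x, and that identity is exactly what header-checking servers reject. A quick way to inspect that default (the exact version string depends on your interpreter):

import urllib.request

# the opener's default headers carry the giveaway identity,
# e.g. [('User-agent', 'Python-urllib/3.8')]
opener = urllib.request.build_opener()
print(opener.addheaders)

With that in mind, the full code is as follows: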

import urllib.request   # request support
import urllib.parse     # URL-encoding support
import json             # JSON parsing support

while 1:
    # read the text to translate into a variable
    content = input("Please enter the content to be translated: ")

    # the request URL, stored in a variable for ease of use
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

    # assign the form's Form Data fields to a dictionary, data
    data = {}
    data["i"] = content       # the value of i becomes the variable content (the user's input)
    data["from"] = "AUTO"
    data["to"] = "AUTO"
    data["smartresult"] = "dict"
    data["client"] = "fanyideskweb"
    data["salt"] = "15838472422107"
    data["sign"] = "85a1e803f3f0d04882d66c8cca808347"
    data["ts"] = "1583847242210"
    data["bv"] = "d6c3cd962e29b66abe48fcb8f4dd7f7d"
    data["doctype"] = "json"
    data["version"] = "2.1"
    data["keyfrom"] = "fanyi.web"
    data["action"] = "FY_BY_CLICKBUTTION"
    # URL-encode the form data, then encode the Unicode string as UTF-8 bytes
    data_utf8 = urllib.parse.urlencode(data).encode("utf-8")

    # ******************** added compared with the Python02 post ********************
    # Make the server believe a browser is visiting, not crawler code (without a
    # custom User-Agent, the request reports the code's basic information rather
    # than a client's). Set the User-Agent in the request headers and store it in
    # the dictionary head (if you don't know what request headers and User-Agent
    # are, see my previous post, (Crawler) Python02 (Hands-on)).
    head = {}
    head["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"

    # ******************** modified compared with the Python02 post ********************
    # Build a Request object from url, data_utf8 and head (urlopen cannot attach
    # custom headers by itself, so the request is made through a Request object)
    req = urllib.request.Request(url, data_utf8, head)
    # get the response object via url and data_utf8; data_utf8 is submitted as a POST body
    response = urllib.request.urlopen(req)

    '''
    # Note: you can also remove the additions and modifications above and replace
    # them with the following code; the principle and effect are exactly the same:
    req = urllib.request.Request(url, data_utf8)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
    response = urllib.request.urlopen(req)
    '''

    # read the UTF-8 encoded document and decode it back into a Unicode string
    html = response.read().decode("utf-8")
    print("****** raw data returned ******")
    # print the output (you will notice it is a JSON structure)
    print(html)

    # load the JSON string (it turns out to be a dictionary)
    target = json.loads(html)

    # print the specified value of the dictionary (i.e. the translation)
    print("****** result after processing ******")
    print("Translation: %s" % (target["translateResult"][0][0]["tgt"]))
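For reference, the decoded response from this endpoint has roughly the following shape (an illustrative sample, not captured output; the exact fields can vary), which is why the translation is read with target["translateResult"][0][0]["tgt"]:

# illustrative sample of the decoded JSON
target = {
    "type": "ZH_CN2EN",
    "errorCode": 0,
    "translateResult": [[{"src": "你好", "tgt": "hello"}]],
}
print(target["translateResult"][0][0]["tgt"])   # -> hello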

2. Delaying submissions

Humans do not fire off requests as frequently as a crawler does, so if the same IP visits too often, the server will judge it to be a crawler and shut it out as well. To make the code look more like human access, we can delay each submission. The code is as follows (note: compared with the previous listing, only a few lines are added, marked at the beginning and end):

import urllib.request   # request support
import urllib.parse     # URL-encoding support
import json             # JSON parsing support
# *************** added compared with "1. Simulating browser access" ***************
import time             # time support, used for the delay

while 1:
    # read the text to translate into a variable
    content = input("Please enter the content to be translated (type exit to quit the program): ")

    # *************** added compared with "1. Simulating browser access" ***************
    # decide whether to quit the program
    if content == "exit":
        print("Quit successfully!")
        break

    # the request URL, stored in a variable for ease of use
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

    # assign the form's Form Data fields to a dictionary, data
    data = {}
    data["i"] = content       # the value of i becomes the variable content (the user's input)
    data["from"] = "AUTO"
    data["to"] = "AUTO"
    data["smartresult"] = "dict"
    data["client"] = "fanyideskweb"
    data["salt"] = "15838472422107"
    data["sign"] = "85a1e803f3f0d04882d66c8cca808347"
    data["ts"] = "1583847242210"
    data["bv"] = "d6c3cd962e29b66abe48fcb8f4dd7f7d"
    data["doctype"] = "json"
    data["version"] = "2.1"
    data["keyfrom"] = "fanyi.web"
    data["action"] = "FY_BY_CLICKBUTTION"
    # URL-encode the form data, then encode the Unicode string as UTF-8 bytes
    data_utf8 = urllib.parse.urlencode(data).encode("utf-8")

    # Make the server believe a browser is visiting, not crawler code (without a
    # custom User-Agent, the request reports the code's basic information rather
    # than a client's). Set the User-Agent in the request headers and store it in
    # the dictionary head (if you don't know what request headers and User-Agent
    # are, see my previous post, (Crawler) Python02 (Hands-on)).
    head = {}
    head["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"

    # Build a Request object from url, data_utf8 and head (urlopen cannot attach
    # custom headers by itself, so the request is made through a Request object)
    req = urllib.request.Request(url, data_utf8, head)
    # get the response object via url and data_utf8; data_utf8 is submitted as a POST body
    response = urllib.request.urlopen(req)

    '''
    # Note: you can also remove the additions and modifications above and replace
    # them with the following code; the principle and effect are exactly the same:
    req = urllib.request.Request(url, data_utf8)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
    response = urllib.request.urlopen(req)
    '''

    # read the UTF-8 encoded document and decode it back into a Unicode string
    html = response.read().decode("utf-8")
    print("****** raw data returned ******")
    # print the output (you will notice it is a JSON structure)
    print(html)

    # load the JSON string (it turns out to be a dictionary)
    target = json.loads(html)

    # print the specified value of the dictionary (i.e. the translation)
    print("****** result after processing ******")
    print("Translation: %s" % (target["translateResult"][0][0]["tgt"]))

    # *************** added compared with "1. Simulating browser access" ***************
    # sleep for two seconds, so that the remote server is more likely to believe
    # this is a human using a client, rather than crawler code
    time.sleep(2)
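A fixed two-second delay is itself a recognizable pattern. A common variation, if you want the timing to look less mechanical, is to randomize the sleep (a small sketch, not part of the listing above):

import random
import time

# sleep a random interval between 1 and 3 seconds instead of a fixed 2
time.sleep(random.uniform(1, 3))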

3. Using a proxy

Because delaying every submission makes the crawler inefficient, another approach can be used instead: a proxy, so that requests appear to come from a different IP address.

To be continued...
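Until then, here is a minimal sketch of the standard urllib approach, assuming a hypothetical proxy at 127.0.0.1:8888 (substitute a proxy you actually have access to):

import urllib.request

# hypothetical proxy address; replace it with a real, working proxy
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8888"})

# build an opener that routes requests through the proxy
opener = urllib.request.build_opener(proxy_handler)

# install it globally so every urlopen call goes through the proxy,
# after which requests appear to come from the proxy's IP address
urllib.request.install_opener(opener)

response = urllib.request.urlopen("http://httpbin.org/ip")
print(response.read().decode("utf-8"))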

References for this post:

Learn Python from Scratch: https://www.bilibili.com/video/av4050443?p=56
