Table of Contents:
First, simulating browser access
Second, delaying the submission time
Third, using a proxy
First, simulating browser access

Crawlers put a lot of load on a server, so many servers shut crawlers out. How, then, do we make the server believe the request comes from a human rather than from crawler code? The code is as follows:
```python
import urllib.request  # request support
import urllib.parse    # URL encoding
import json            # JSON parsing

while 1:
    # Read the text to translate into a variable
    content = input("Please enter the content to be translated: ")
    # Request URL, stored in a variable for convenience
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    # Build the Form Data as a dictionary
    data = {}
    data["i"] = content  # the value of i is the user's input
    data["from"] = "AUTO"
    data["to"] = "AUTO"
    data["smartresult"] = "dict"
    data["client"] = "fanyideskweb"
    data["salt"] = "15838472422107"
    data["sign"] = "85a1e803f3f0d04882d66c8cca808347"
    data["ts"] = "1583847242210"
    data["bv"] = "d6c3cd962e29b66abe48fcb8f4dd7f7d"
    data["doctype"] = "json"
    data["version"] = "2.1"
    data["keyfrom"] = "fanyi.web"
    data["action"] = "FY_BY_CLICKBUTTION"
    # URL-encode the form data and encode it to UTF-8 bytes,
    # stored in the variable data_utf8
    data_utf8 = urllib.parse.urlencode(data).encode("utf-8")
    # ********** added compared with the Python02 post **********
    # Make the server think this is a browser visiting, not crawler code
    # (without a custom User-Agent, the crawler identifies itself with basic
    # information about the code, not about a normal client).
    # Set the User-Agent of the request headers, stored in the dict head
    # (if you don't know what request headers or the User-Agent are, see my
    # previous post, Python02 (crawler, hands-on))
    head = {}
    head["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    # ********** end of the change compared with the Python02 post **********
    # Build a Request object from url, data_utf8 and head
    # (urlopen cannot attach headers by itself, so we use a Request object)
    req = urllib.request.Request(url, data_utf8, head)
    # Open the request; data_utf8 is submitted as a POST body
    response = urllib.request.urlopen(req)
    '''
    # Note: you can also remove the additions above and replace them with the
    # following code -- the principle and effect are exactly the same:
    req = urllib.request.Request(url, data_utf8)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
    response = urllib.request.urlopen(req)
    '''
    # Read the UTF-8 encoded response and decode it back to a str
    html = response.read().decode("utf-8")
    print("****** raw data returned ******")
    # Print it out (you will notice the response is a JSON structure)
    print(html)
    # Parse the JSON string (the result is a dictionary)
    target = json.loads(html)
    # Print the value we want from the dictionary (i.e. the translation)
    print("****** result after processing ******")
    print("Translation: %s" % (target["translateResult"][0][0]["tgt"]))
```
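To see why the last line indexes `target["translateResult"][0][0]["tgt"]`, here is a minimal sketch of parsing a hand-written sample that mimics the shape of Youdao's JSON response (the sample string is an illustration, not a real API response):

```python
import json

# Hand-written sample mimicking the structure Youdao returns
# (field values are illustrative assumptions, not real server output)
sample = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "你好"}]]}'

target = json.loads(sample)  # JSON string -> Python dict
# translateResult is a list of lists of dicts, hence the double [0][0]
print(target["translateResult"][0][0]["tgt"])  # prints 你好
```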
Second, delaying the submission time
Humans do not send requests as frequently as crawlers do. If the same IP visits too often, the server will judge the code to be a crawler and shut it out as well. To make the code behave more like a human visitor, we can delay each submission. The code is as follows (note: compared with the previous version, lines are only added near the beginning and the end of the file):
```python
import urllib.request  # request support
import urllib.parse    # URL encoding
import json            # JSON parsing (lightweight data format)
# ********** added compared with the simulated-browser version **********
import time            # time support

while 1:
    # Read the text to translate into a variable
    content = input("Please enter the content to be translated (type exit to quit the program): ")
    # ********** added compared with the simulated-browser version **********
    # Check whether to quit the program
    if content == "exit":
        print("Quit successfully!")
        break
    # Request URL, stored in a variable for convenience
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    # Build the Form Data as a dictionary
    data = {}
    data["i"] = content  # the value of i is the user's input
    data["from"] = "AUTO"
    data["to"] = "AUTO"
    data["smartresult"] = "dict"
    data["client"] = "fanyideskweb"
    data["salt"] = "15838472422107"
    data["sign"] = "85a1e803f3f0d04882d66c8cca808347"
    data["ts"] = "1583847242210"
    data["bv"] = "d6c3cd962e29b66abe48fcb8f4dd7f7d"
    data["doctype"] = "json"
    data["version"] = "2.1"
    data["keyfrom"] = "fanyi.web"
    data["action"] = "FY_BY_CLICKBUTTION"
    # URL-encode the form data and encode it to UTF-8 bytes
    data_utf8 = urllib.parse.urlencode(data).encode("utf-8")
    # Make the server think this is a browser visiting, not crawler code
    # (without a custom User-Agent, the crawler identifies itself with basic
    # information about the code, not about a normal client)
    head = {}
    head["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    # Build a Request object from url, data_utf8 and head
    # (urlopen cannot attach headers by itself, so we use a Request object)
    req = urllib.request.Request(url, data_utf8, head)
    # Open the request; data_utf8 is submitted as a POST body
    response = urllib.request.urlopen(req)
    '''
    # Note: you can also remove the additions above and replace them with the
    # following code -- the principle and effect are exactly the same:
    req = urllib.request.Request(url, data_utf8)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
    response = urllib.request.urlopen(req)
    '''
    # Read the UTF-8 encoded response and decode it back to a str
    html = response.read().decode("utf-8")
    print("****** raw data returned ******")
    # Print it out (you will notice the response is a JSON structure)
    print(html)
    # Parse the JSON string (the result is a dictionary)
    target = json.loads(html)
    # Print the value we want from the dictionary (i.e. the translation)
    print("****** result after processing ******")
    print("Translation: %s" % (target["translateResult"][0][0]["tgt"]))
    # ********** added compared with the simulated-browser version **********
    # Sleep for two seconds, so the remote server is more likely to believe
    # a human is using a client, rather than crawler code
    time.sleep(2)
```
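A fixed two-second sleep is itself a regular pattern a server could spot. A common refinement, not in the original post, is to randomize the delay; here is a minimal sketch (the `polite_sleep` helper and its bounds are my own illustration):

```python
import random
import time

def polite_sleep(low=1.0, high=3.0):
    """Sleep for a random interval between low and high seconds,
    so requests do not arrive at a perfectly regular rhythm."""
    delay = random.uniform(low, high)  # uniform value in [low, high]
    time.sleep(delay)
    return delay

# In the loop above, call polite_sleep() in place of time.sleep(2)
```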
Third, using a proxy

Since delaying each submission makes the crawler inefficient, another approach can be used instead: a proxy. The code is as follows:
To be continued...
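The post stops before the proxy code, but as a preview of the idea, urllib routes requests through a proxy via `ProxyHandler`. A minimal sketch, assuming a placeholder proxy address (`127.0.0.1:8888` is not a real proxy; substitute your own):

```python
import urllib.request

# Placeholder proxy address -- an assumption for illustration only
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8888"})
# Build an opener that routes HTTP requests through the proxy
opener = urllib.request.build_opener(proxy)
# Still present a browser-like User-Agent, as in the earlier sections
opener.addheaders = [("User-Agent",
                      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/73.0.3683.86 Safari/537.36")]
# Install globally so every urllib.request.urlopen call uses the proxy
urllib.request.install_opener(opener)
```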
Reference for this post:
Zero-Based Introduction to Learning Python: https://www.bilibili.com/video/av4050443?p=56