Winter vacation learning progress 15

  Following on from the last Python crawler post, this time I studied how to deal with the anti-crawler measures of some websites, and how to pass search keywords to a URL as parameters, simulating manual input.
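As a quick refresher on the keyword-parameter technique (a sketch, not the exact code from the earlier post), `requests` can build the query string for you via the `params` argument. Calling `prepare()` constructs the final URL locally, so nothing is sent over the network:

```python
import requests

# Hypothetical example: pass a search keyword as a URL parameter.
# prepare() builds the request object without actually sending it.
kv = {'wd': 'python'}
req = requests.Request('GET', 'https://www.baidu.com/s', params=kv).prepare()
print(req.url)  # https://www.baidu.com/s?wd=python
```

This is the same URL you would get by typing the keyword into the search box by hand.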

  As we know, a crawler, whether well-behaved or malicious, can not only leak a website's information but also put heavy load on its server. Imagine a program sending requests to a server and interacting with it at machine speed: the server is bound to come under excessive load, or even go down. For this reason, many large, reputable sites do not want crawlers scraping their data and set up anti-crawler mechanisms. The most common one checks the User-Agent: the site inspects the User-Agent header carried by each request to decide whether the visitor is a browser or an automated program, and automatically blocks the latter.
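It is easy to see why automated programs are so simple to spot: by default, `requests` announces itself in the User-Agent header. This snippet (no network access needed) prints the default header that gives a crawler away:

```python
import requests

# The default User-Agent that requests sends, e.g. "python-requests/2.x.y".
# A server only has to check this header to know the visitor is a script.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```

Any server that compares this value against known browser strings can reject the request immediately.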

  We can set the User-Agent header ourselves to make the request look like it comes from a browser, which gets around some of these anti-crawler mechanisms. Of course, when I tried this method it still could not bypass the restrictions on Baidu's search, so it seems I will have to keep learning other ways around anti-crawler mechanisms.

  The code is as follows (a case of bypassing Amazon's anti-crawler mechanism):

# -*- coding: utf-8 -*-
# @Time    : 2020/2/8 10:15
# @Author  : Duoduo
# @FileName: pc1.py
# @Software: PyCharm

import requests
import re

# exception-handling framework for the crawler
"""
def getHttp(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "access error"
"""

# simulate a browser (to get past Amazon's anti-crawler check:
# "Sorry, we just want to make sure visitors are not currently automated programs")
kv = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.amazon.cn/dp/B007J4IZNO/'
r = requests.get(url, headers=kv)
r.encoding = r.apparent_encoding
print(r.status_code)
print(r.text)
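The commented-out exception-handling framework and the User-Agent trick can be combined into one reusable function. This is a sketch rather than the original author's code; the default header set is an assumption (real sites may demand fuller browser headers):

```python
import requests

def get_http(url, headers=None):
    """Fetch a page while presenting a browser-like User-Agent.

    The default header is an assumption for illustration; some sites
    require a more complete set of browser headers.
    """
    if headers is None:
        headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()              # raise for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # guess encoding from content
        return r.text
    except requests.RequestException:
        return "access error"
```

Malformed URLs, timeouts, connection failures, and HTTP error statuses all raise subclasses of `requests.RequestException`, so callers always get a string back instead of an unhandled exception.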


Origin www.cnblogs.com/Aduorisk/p/12317770.html