Simulating a browser in a crawler -- the headers property

# Simulate the browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

Commonly used "User-Agent" values:

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

import random
user_agent = random.choice(ua_list)

There are two ways to make a crawler simulate a browser:

Method 1: Use build_opener() to modify the headers
Since urlopen() does not support some advanced HTTP features, if we want to modify the headers we can use urllib.request.build_opener(). For example:
import urllib.request

url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
header = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36")
opener = urllib.request.build_opener()  # Create an opener object
opener.addheaders = [header]  # Add the header information
data = opener.open(url).read()  # Open the URL and read the response

# At this point the request is made as if from a browser; save the crawled page
fhandle = open("F:/python/part4/3.html", "wb")
a = fhandle.write(data)  # print(a) shows the number of bytes written
fhandle.close()
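As a variation on Method 1, the customized opener can also be installed as the global default with urllib.request.install_opener(), so that every subsequent urlopen() call sends the browser-style header automatically. This is a minimal sketch reusing the header value from the example above; the network call itself is left commented out:

```python
import urllib.request

# Build an opener carrying a browser-style User-Agent, then install it
# globally so plain urlopen() calls use it too.
header = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)  # from now on, urlopen() uses this opener

# After install_opener(), a plain call would carry the header:
# data = urllib.request.urlopen("http://blog.csdn.net/weiwei_pig/article/details/51178226").read()
```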
Method 2: Use add_header() to add headers
In addition to the method above, you can also call add_header() on a urllib.request.Request() object to simulate a browser:
import urllib.request

url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
req = urllib.request.Request(url)  # Create a request object
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36")
data = urllib.request.urlopen(req).read()
data = data.decode("utf-8")  # Decode the raw bytes as UTF-8
print(data)
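The random User-Agent idea from earlier combines naturally with Method 2: pick a string from a pool with random.choice() and attach it via add_header(). A minimal sketch, where the pool entries are illustrative examples (the Chrome string is the one used above, the Firefox string is a placeholder) and the actual network call is left commented out:

```python
import random
import urllib.request

# Example User-Agent pool; entries are illustrative, not authoritative.
ua_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
req = urllib.request.Request(url)
req.add_header("User-Agent", random.choice(ua_list))  # random header per request

# The header is stored on the request and sent when it is opened:
# data = urllib.request.urlopen(req).read()
print(req.get_header("User-agent"))  # urllib normalizes the key's capitalization
```

Rotating the User-Agent this way makes repeated requests look less uniform than sending one fixed string.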


