Browser camouflage for crawlers (019)

One: The principle of browser camouflage

When we try to crawl a CSDN blog with a plain urllib request, the server returns a 403 status code, because it blocks requests it recognizes as crawlers. To get around this, we disguise the request as one coming from a browser, which is usually done by setting the request headers.
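To see why the server can tell us apart so easily, inspect the default headers that urllib sends. A minimal sketch (no network access needed):

```python
import urllib.request

# The default opener announces itself as "Python-urllib/3.x",
# which anti-crawler rules can match and block with a 403.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.x')]
```

Replacing this default User-Agent with a real browser's value is exactly what the rest of this article does.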

Two: Hands-on practice

The server inspects the value of the User-Agent field in the request headers to decide whether the client is a real browser.

Therefore, to simulate a browser, we modify the request message and set the User-Agent value to one that a real browser would send.
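Besides the opener-based approach used later in this article, urllib also lets you attach the header directly to a single request via `urllib.request.Request`. A minimal sketch, using a desktop Chrome User-Agent string as a placeholder (copy a real one from your own browser):

```python
import urllib.request

# Placeholder desktop-browser User-Agent; replace with the value
# copied from your own browser's developer tools.
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")

url = "https://blog.csdn.net/weixin_41167340"
# Attach the header to this one request only.
req = urllib.request.Request(url, headers={"User-Agent": ua})

# The request now carries the browser User-Agent instead of Python's default.
print(req.get_header("User-agent"))
# To actually fetch the page: data = urllib.request.urlopen(req).read()
```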

(1) Open my blog at https://blog.csdn.net/weixin_41167340, press F12 to open the developer tools, and press F5 to refresh. Switch to the "Network" tab, click any request (for example "wh.js"), and find the User-Agent field in the request headers. Copy its value.


With this User-Agent value in hand, we can write the code below. After running it, open the saved local HTML file in a browser: the crawled page appears.


Three: The complete code

import urllib.request

url = "https://blog.csdn.net/weixin_41167340"
# The User-Agent value copied from the browser's developer tools
headers = ("User-Agent",
           "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/55.0.2883.87 Mobile Safari/537.36")
# Build an opener and attach the browser header to every request it makes
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
# Save the fetched page locally in binary mode
fh = open("G:/BaiduDownload/python网络爬虫/WODE/019.html", "wb")
fh.write(data)
fh.close()
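As a variation on the code above, urllib also allows installing the opener globally with `install_opener`, so that every later `urlopen()` or `urlretrieve()` call sends the disguised header automatically. A minimal sketch, assuming the same User-Agent value:

```python
import urllib.request

# Same browser User-Agent as in the code above
headers = ("User-Agent",
           "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/55.0.2883.87 Mobile Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install the opener globally: from now on, plain
# urllib.request.urlopen(url) and urllib.request.urlretrieve(url, filename)
# both send the browser User-Agent.
urllib.request.install_opener(opener)
```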
