Crawler encounters blank page

About two months ago, the comment data dynamically loaded by JD.com could still be fetched normally. But after a reader left a comment on my JD.com comment-crawler tutorial, I discovered that the comment data page could no longer be viewed.

In fact, JD.com's robots.txt prohibits access to URLs containing "?", as shown in the figure below.
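You can check a URL against a site's robots rules programmatically with Python's standard library. The sketch below uses a made-up rule (JD.com's actual robots.txt is not reproduced here) just to show the mechanics:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules -- not JD.com's real robots.txt
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /comment/",
])

# The comment API lives under /comment/, so this URL would be disallowed
print(rp.can_fetch("*", "https://sclub.jd.com/comment/productPageComments.action?page=6"))  # False
# A path outside the disallowed prefix is allowed by default
print(rp.can_fetch("*", "https://sclub.jd.com/other/page.html"))  # True
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the site's actual robots.txt instead of hard-coding rules.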
[Figure: JD.com's robots.txt, disallowing URLs containing "?"]
However, JD.com previously had no anti-crawling mechanism in place, so novices like us used it for crawler practice, which actually affected the normal operation of the site. So when writing crawlers, take care not to send requests too frequently.
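One simple way to avoid hammering a site is to enforce a minimum delay between successive requests. A minimal sketch (the `Throttle` class and the 1-second delay are illustrative choices, not from the original post):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, delay):
        self.delay = delay   # minimum seconds between requests
        self.last = 0.0      # monotonic timestamp of the last request

    def wait(self):
        gap = self.delay - (time.monotonic() - self.last)
        if gap > 0:
            time.sleep(gap)  # too soon since the last request: sleep the remainder
        self.last = time.monotonic()

# Usage: call throttle.wait() before every requests.get(...)
throttle = Throttle(1.0)
```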

Cause Analysis:

First, after confirming that the URL is correct, the natural suspect is an anti-crawler mechanism.

Common anti-crawling measures include requiring a login state (as on Taobao), access-frequency detection, and so on. Common counter-strategies include constructing cookies, switching User-Agents, using proxy IPs, and simulating clicks with Selenium.

After packet-capture analysis, it turned out that the mechanism here checks the Referer request header, which records the URL of the page you were on before visiting the new one. In Chrome, refresh the page and press F12; under the Network tab you can see the Referer of each request. Requesting the dynamically generated comment URL directly sends no plausible Referer, so the request was identified as a crawler and a blank page was returned (though other anti-crawling checks may also be involved).
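To see why a bare request fails, here is a hypothetical sketch of what such a server-side Referer check might look like. The `is_allowed` function and the accepted hosts are assumptions for illustration; JD.com's actual logic is unknown:

```python
from urllib.parse import urlparse

def is_allowed(referer):
    """Hypothetical server-side check: only serve comment data to requests
    that claim to come from a jd.com page. NOT JD.com's real logic."""
    if not referer:
        return False                       # no Referer at all: direct access, likely a bot
    host = urlparse(referer).hostname or ""
    return host == "jd.com" or host.endswith(".jd.com")

print(is_allowed(None))                                     # blocked: no Referer
print(is_allowed("https://item.jd.com/100000177760.html"))  # allowed: came from a product page
```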


[Figure: the Referer field of the comment request, shown under the Network tab in Chrome DevTools]

How to fix it:

Here we take Python as an example of how to deal with Referer-based anti-crawling.

import requests

url = ("https://sclub.jd.com/comment/productPageComments.action"
       "?callback=fetchJSON_comment98vv16247&productId=100000177760"
       "&score=0&sortType=5&page=6&pageSize=10&isShadowSku=0&rid=0&fold=1")
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    # Pretend we navigated here from the product page, so the Referer check passes
    'Referer': 'https://item.jd.com/100000177760.html#comment',
}
r = requests.get(url, headers=headers)
print(r.text)

The results are as follows
[Figure: the comment data returned successfully by the request]
This bypasses the Referer anti-crawling check. Many other anti-crawling mechanisms are likewise defeated by constructing appropriate request headers. Go give it a try!
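One more practical note: because the URL carries a `callback` parameter, the endpoint returns JSONP rather than plain JSON, so `r.text` is wrapped in `fetchJSON_comment98vv16247(...)`. A small sketch of stripping the wrapper before parsing (the sample string below is made up for illustration, not real JD.com data):

```python
import json
import re

def jsonp_to_dict(text):
    """Strip a JSONP wrapper like callbackName({...}); and parse the JSON inside."""
    m = re.search(r"^\s*[\w$.]+\((.*)\)\s*;?\s*$", text, re.S)
    if not m:
        raise ValueError("not a JSONP payload")
    return json.loads(m.group(1))

# Illustrative sample in the shape of the endpoint's response
sample = 'fetchJSON_comment98vv16247({"maxPage": 10, "comments": []});'
print(jsonp_to_dict(sample))  # {'maxPage': 10, 'comments': []}
```

With the real response you would call `jsonp_to_dict(r.text)` and then read fields such as the comment list out of the resulting dict.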

Origin blog.csdn.net/weixin_42474261/article/details/90728322