An analysis of web-crawler techniques and anti-crawler strategies

                                                             Spider, Anti-Spider

Foreword: A web crawler (Spider or Crawler) is, as the name suggests, a bug that crawls across the Internet. Why does it crawl? Simple: to gather information. In the Internet era, whoever holds the information holds the initiative.

First, some simple suggestions for the crawler side:

(1) Minimize the number of requests.

(2) If the list page already contains the data you need, don't also fetch the detail pages.

(3) Reduce the load on the target server.

(4) If you genuinely need high throughput, consider multi-threading (mature frameworks such as Scrapy already support it) or even a distributed setup.
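The multi-threading suggestion in (4) can be sketched with Python's standard library alone: a thread pool caps concurrency, so throughput rises without hammering the server. The `fetch` function and the example.com URLs below are placeholders, not a real site or API.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real HTTP request (e.g. urllib.request.urlopen).
    return f"<html for {url}>"

# Hypothetical list-page URLs used only for illustration.
urls = [f"https://example.com/list?page={n}" for n in range(1, 6)]

# Fetch pages concurrently; max_workers caps the pressure on the server.
with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # one result per URL, in input order
```

`pool.map` preserves input order, which keeps downstream parsing simple even though the fetches run out of order.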

Second, anti-crawler strategies and counter-measures:
(1) Anti-crawler: Use the User-Agent header to decide whether a visitor is a crawler.
    Counter-measure: Disguise the User-Agent in the request headers.
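A minimal sketch of the User-Agent disguise, using only the standard library `urllib`; the browser string below is an illustrative value, not anything special.

```python
import urllib.request

# A typical desktop-browser User-Agent string (illustrative example value).
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

# Attach the disguised header before the request is ever sent.
req = urllib.request.Request("https://example.com/", headers={"User-Agent": UA})

# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Opening `req` with `urllib.request.urlopen(req)` would then present the browser identity instead of urllib's default `Python-urllib/3.x`.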
(2) Anti-crawler: Block the client's IP address.
    Counter-measure: Route requests through proxies to mask the real IP.
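Routing through a proxy can be sketched with `urllib`'s `ProxyHandler`; the `127.0.0.1:8080` address is a hypothetical placeholder for a working proxy of your own.

```python
import urllib.request

# Hypothetical proxy endpoint; substitute a real proxy address.
PROXY = "http://127.0.0.1:8080"

proxy = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy)

# opener.open(url) would now route traffic through the proxy,
# so the target server sees the proxy's IP, not yours.
print(proxy.proxies["http"])
```

In practice crawlers rotate through a pool of such proxies, building a fresh opener (or switching the handler's mapping) when one address gets banned.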
(3) Anti-crawler: Identify crawlers by their access frequency.
    Counter-measure: Set request intervals and crawl delays.
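A randomized delay between requests is enough to break the fixed-interval signature that frequency checks look for. The sketch below uses only the standard library; `polite_get` is a hypothetical wrapper, with the real request left as a placeholder.

```python
import random
import time

def polite_get(url, min_delay=1.0, max_delay=3.0):
    # Sleep a random interval before each request so the
    # traffic pattern looks less robotic than a fixed cadence.
    time.sleep(random.uniform(min_delay, max_delay))
    # Placeholder for the actual fetch (e.g. urllib.request.urlopen(url)).
    return url

start = time.monotonic()
result = polite_get("https://example.com/item/1", min_delay=0.1, max_delay=0.2)
elapsed = time.monotonic() - start
print(elapsed >= 0.1)  # the minimum delay was honored
```

Randomizing within a range, rather than sleeping a constant amount, matters: a perfectly regular one-request-per-second pattern is itself a crawler fingerprint.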
(4) Anti-crawler: Show a CAPTCHA once the number of requests in a given period exceeds a limit.
    Counter-measure: Solve the CAPTCHA, manually or with a recognition service.
(5) Anti-crawler: Render page data through JavaScript.
    Counter-measure: Drive a real browser, e.g. Selenium with PhantomJS (note that PhantomJS is no longer maintained; headless Chrome or Firefox is the current choice).


Origin: blog.csdn.net/Smile_Lai/article/details/101712571