Analysis of website anti-crawler schemes

One, crawler identification methods

1. HTTP log and traffic analysis
Count each IP's request frequency from the access logs. If an IP's request rate or traffic volume exceeds a set threshold within a unit of time, it can be classified as a crawler.
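As a minimal sketch of this idea (the 60-second window and 120-request threshold below are illustrative assumptions, not values from the article), the server can keep a sliding window of request timestamps per IP:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # sliding-window size (assumed)
MAX_REQUESTS = 120     # allowed requests per IP per window (assumed)

request_log = defaultdict(deque)   # ip -> timestamps of recent requests

def is_crawler(ip):
    """Return True once this IP exceeds the request-rate threshold."""
    now = time.time()
    window = request_log[ip]
    window.append(now)
    # drop timestamps that have fallen out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```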
2. Headers parameter detection
Requests generally carry headers such as User-Agent, Referer, and Cookie. The target site can inspect the User-Agent or Referer value to judge whether the client is a crawler; the Referer header can also be used to prevent hotlinking.
The User-Agent identifies the type and version of the client, while the Referer indicates where the request came from and is commonly used to detect image hotlinking. A website may also count how many times the session_id in a cookie is used, and trigger its anti-crawling policy once the count exceeds a limit.
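A minimal sketch of such a header check, written here as a Flask hook (the keyword list, domain name, and the /images/ hotlink rule are illustrative assumptions):

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings that commonly appear in crawler User-Agents (illustrative only)
BAD_AGENT_KEYWORDS = ("python-requests", "scrapy", "curl")

@app.before_request
def check_headers():
    ua = request.headers.get("User-Agent", "").lower()
    referer = request.headers.get("Referer", "")
    # Missing or suspicious User-Agent -> treat as a crawler
    if not ua or any(k in ua for k in BAD_AGENT_KEYWORDS):
        abort(403)
    # Hotlink protection: images must be requested from our own pages
    if request.path.startswith("/images/") and "example.com" not in referer:
        abort(403)
```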
3. Place a link in the page source that is invisible in the browser.
Normal users never see the link when browsing, so they never click it; a crawler that follows every link on the page will. If the link is requested, the visiting IP can be classified as a crawler.
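A sketch of such a honeypot (the /trap-link URL, the CSS hiding technique, and the in-memory ban set are all illustrative assumptions):

```python
from flask import Flask, request

app = Flask(__name__)
banned_ips = set()

# The page embeds a link hidden from humans by CSS; only a crawler
# that follows every <a> tag will ever request it.
HONEYPOT_HTML = '<a href="/trap-link" style="display:none">more</a>'

@app.route("/trap-link")
def trap():
    banned_ips.add(request.remote_addr)   # whoever got here is a crawler
    return "", 404
```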

Two, common anti-crawler strategies

Once a crawler has been identified, a site can usually take the following countermeasures:

  1. Temporarily or permanently ban the visiting IP (see the sketch after this list). Implementation difficulty: simple. Cracking difficulty: easy.

  2. Return a verification code (CAPTCHA) to block the behavior. Implementation difficulty: simple. Cracking difficulty: difficult.

  3. Load content asynchronously with Ajax: a crawler that only fetches static pages gets an empty result. Implementation difficulty: medium. Cracking difficulty: medium.

  4. Crawler trap: serve the crawler content unrelated to the site, for example by leading it into loops of links. Implementation difficulty: medium. Cracking difficulty: easy.

  5. Jiasule (加速乐) cookie authentication service: before serving content, the server first checks whether the cookie sent by the client is correct; if not, it returns an error status code. Implementation difficulty: requires third-party support. Cracking difficulty: difficult.

  6. JavaScript rendering: developers keep important information out of the HTML tags and generate it with JavaScript instead; the browser executes the script tags and displays the information, but a crawler that cannot execute JS cannot read anything produced by JS. Implementation difficulty: medium. Cracking difficulty: medium.
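For item 1, a minimal sketch of temporary and permanent IP banning (the one-hour ban length is an assumption):

```python
import time

# ip -> unix timestamp when the ban expires (None = permanent)
ban_list = {}

BAN_SECONDS = 3600  # assumed length of a temporary ban

def ban_ip(ip, permanent=False):
    ban_list[ip] = None if permanent else time.time() + BAN_SECONDS

def is_banned(ip):
    expires = ban_list.get(ip, 0)
    if expires is None:          # permanent ban
        return True
    if time.time() < expires:    # temporary ban still active
        return True
    ban_list.pop(ip, None)       # ban expired, clean up
    return False
```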

If a crawler can bypass all of these anti-crawling strategies, the website will usually give up trying to block it, because the cost of blocking becomes too high. Moreover, an overly strict anti-crawling policy interferes with normal users; simply put, the more anti-crawling measures a site has, the worse its user experience.

Three, measures attackers may take against anti-crawling strategies

From the attacker's perspective, the anti-crawling measures above can be countered as follows:

- Against the anti-crawling strategy: temporary or permanent IP bans

This can be solved by inserting wait times between requests (explicit or implicit delays) and by using high-anonymity proxy IPs.
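A minimal sketch combining both ideas with the requests library (the proxy addresses below use a documentation IP range and are placeholders; the 1–3 s delay is an arbitrary assumption):

```python
import random
import time
import requests

# Pool of high-anonymity proxies -- placeholder addresses, not real endpoints
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

def polite_get(url):
    # Explicit wait: a random delay so requests do not arrive at a fixed rate
    time.sleep(random.uniform(1.0, 3.0))
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```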

- Against the anti-crawling strategy: verification codes

If the CAPTCHA is not triggered on every request, high-anonymity proxy IPs can solve this too; if high-anonymity proxies feel unstable or inconvenient, the Tor network is an alternative.
If the CAPTCHA is triggered on every request, it has to be recognized. Simple CAPTCHAs can be cracked with code you write yourself: Python has many well-known image processing and recognition libraries (such as PIL/Pillow, Mahotas, pymorph, pytesser, tesseract-ocr, and OpenCV) and algorithms (such as the well-known KNN [k-nearest neighbors] and SVM [support vector machine]). Complex CAPTCHAs, however, such as those involving logic judgments and calculations, stuck-together or distorted characters, heavy background noise and multi-color interference, or multi-language character mashups, can realistically only be handed off to a human captcha-solving platform.
Manual recognition: suitable for more complex CAPTCHAs; high accuracy but high cost.
Machine recognition: call the API of an online CAPTCHA-recognition service; accuracy is typically above 90%.
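For the simple self-written case, a minimal sketch using Pillow and tesseract-ocr (via the pytesseract wrapper) might binarize the image and hand it to the OCR engine. The 140 threshold is an arbitrary assumption, and real CAPTCHAs usually need denoising and character segmentation first:

```python
from PIL import Image
import pytesseract  # Python wrapper around the tesseract-ocr engine

def recognize_captcha(path):
    img = Image.open(path).convert("L")                # grayscale
    img = img.point(lambda p: 255 if p > 140 else 0)   # crude binarization
    return pytesseract.image_to_string(img).strip()

# print(recognize_captcha("captcha.png"))
```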

- Against the anti-crawling strategy: Ajax asynchronous loading

① Use fiddler/wireshark (or the browser's developer tools) to capture traffic and analyze the interface the Ajax request hits, then construct an equivalent request to the server and receive the real data it returns.
② Use selenium + PhantomJS: PhantomJS is a headless, interface-less browser, and Selenium can drive it to simulate all browser operations. The drawback is obvious: crawling efficiency is low.
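A sketch of approach ①, assuming the captured traffic revealed a JSON endpoint (the URL, parameters, and headers below are hypothetical):

```python
import requests

# Hypothetical endpoint discovered by inspecting the XHR traffic in fiddler
# or the browser's network panel -- the URL and parameters are assumptions.
API = "https://example.com/api/items"

headers = {
    "User-Agent": "Mozilla/5.0",            # mimic a normal browser
    "X-Requested-With": "XMLHttpRequest",   # many Ajax endpoints expect this
}

resp = requests.get(API, params={"page": 1}, headers=headers, timeout=10)
data = resp.json()   # the real data, without rendering the page at all
```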

- Against the anti-crawling strategy: crawler traps

These are usually fairly simple loop traps. The crawler can check each link before fetching it and avoid crawling the same page repeatedly, and it can crawl a page only after confirming it contains the expected elements. For example, with scrapy you can set the LinkExtractor's unique parameter to True, or simply cap the spider's crawl depth.
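A sketch of those scrapy settings (the domain, start URL, and depth limit of 5 are illustrative assumptions; unique=True is in fact LinkExtractor's default):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SafeSpider(CrawlSpider):
    name = "safe"
    allowed_domains = ["example.com"]        # never follow links off-site
    start_urls = ["https://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 5}     # cap how deep a loop trap can pull us

    # unique=True deduplicates extracted links so the same page
    # is not queued repeatedly
    rules = (Rule(LinkExtractor(unique=True), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}
```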

- Against the anti-crawling strategy: the Jiasule (加速乐) service

This one is slightly more involved: take the JavaScript code that the server returns to the browser, put it in a string, unpack and execute it with Node.js, decrypt the relevant information, and place the resulting key values into the headers of the next request. See the write-up on Jiasule's latest anti-crawler mechanism.
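A rough sketch of that flow in Python, using the PyExecJS package to delegate execution to a local Node.js runtime. The URL, the script-extraction pattern, and the idea that the script assigns document.cookie are all assumptions for illustration:

```python
import re
import execjs      # PyExecJS: runs JS via a local runtime such as Node.js
import requests

session = requests.Session()
first = session.get("https://example.com/")        # returns JS, not the page

# Pull the anti-crawler script out of the response body (pattern is a guess)
js_code = re.search(r"<script>(.*?)</script>", first.text, re.S).group(1)

# The script normally assigns document.cookie; wrap it so it returns
# the computed value instead of touching a (non-existent) DOM.
wrapped = ("function solve(){var document={};"
           + js_code
           + ";return document.cookie;}")
cookie = execjs.compile(wrapped).call("solve")

session.headers["Cookie"] = cookie                 # replay with the computed cookie
real = session.get("https://example.com/")
```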

- Against the anti-crawling strategy: JavaScript rendering

Solve it with selenium + PhantomJS.
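A minimal sketch of the same technique. Note that PhantomJS support was removed from recent Selenium releases, so this uses headless Chrome as the usual substitute (the URL is a placeholder):

```python
from selenium import webdriver

# Headless Chrome plays the role PhantomJS used to: a browser with no window
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/")   # the browser executes the page's JS
html = driver.page_source            # fully rendered DOM, JS output included
driver.quit()
```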

Four, choosing a scheme

At present, corporate websites generally take one of two common anti-crawling approaches.

The first is for the company to select and implement anti-crawling measures based on the characteristics of its own website. This approach is flexible, but it is costly and has a high implementation threshold; it is generally used by large websites such as Taobao and JD.

The second is to adopt an integrated anti-crawling service from a third-party provider. This approach intercepts crawlers well and is cheap to implement. Domestically, Geetest (极验) currently offers a fairly professional service: https://www.geetest.com/BotSonar

Source: https://blog.csdn.net/mrliqifeng/article/details/101199357