A brief introduction to anti-crawling

The access rate and purpose of crawlers differ from those of normal users. Most crawlers scrape the target application without restraint, which puts heavy pressure on the target application's servers. For this reason, operators often call the network requests issued by crawlers "junk traffic".

  To keep their servers running normally, or to reduce server pressure and operating costs, developers have to use a variety of technical means to restrict crawlers' access to server resources. Because both crawling and anti-crawling combine many technologies, the anti-crawling measures a site adopts depend on the tools and programming languages used by crawler engineers, and even on the individual engineers' skills. As a result, the concept of anti-crawling is often vague, and the industry lacks a clear definition. In short, any behavior that restricts crawlers from accessing server resources and obtaining data is called anti-crawling. Restrictions include, but are not limited to, request throttling, rejecting responses, client-side verification, text obfuscation, and the use of dynamic rendering. Based on their starting point, these restrictions can be divided into active anti-crawling and passive anti-crawling.

  (1) Active anti-crawling: developers consciously use technical means to distinguish normal users from crawlers and to restrict crawlers' access to the website, for example by verifying request header information, limiting access frequency, or using CAPTCHAs (a minimal server-side sketch follows this list).

  (2) Passive anti-crawling: techniques adopted to improve user experience or save resources that indirectly increase the difficulty of crawling, such as loading data in segments, switching tabs on click, or previewing data on mouse hover.

  (3) In addition, anti-crawling can be divided into finer-grained categories, such as information-verification anti-crawling, dynamic-rendering anti-crawling, text-obfuscation anti-crawling, and feature-recognition anti-crawling. Note that the same restriction can fall into more than one category. For example, if JavaScript generates a random string, the string is sent to the server in a request header, and the server uses it to verify the client's identity, this restriction can be described both as information-verification anti-crawling and as dynamic-rendering anti-crawling.
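
  As a minimal sketch of the active approach (including the header-verification example in item (3)), the snippet below uses Flask; the header name `X-Sign`, the rate-limit values, and the verification rules are assumptions made for illustration, not any particular site's real scheme.

```python
import time
from collections import defaultdict, deque
from flask import Flask, request, abort

app = Flask(__name__)

# Per-IP timestamps of recent requests, used for a simple frequency limit.
recent = defaultdict(deque)
MAX_REQUESTS = 30      # assumed limit: 30 requests
WINDOW_SECONDS = 60    # per 60-second window

@app.before_request
def anti_crawler_checks():
    # 1. Reject clients that do not send a browser-like User-Agent header.
    ua = request.headers.get("User-Agent", "")
    if "Mozilla" not in ua:
        abort(403)

    # 2. Reject clients missing the JavaScript-generated token
    #    (the header name "X-Sign" is illustrative only).
    if not request.headers.get("X-Sign"):
        abort(403)

    # 3. Limit request frequency per client IP.
    now = time.time()
    q = recent[request.remote_addr]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    q.append(now)
    if len(q) > MAX_REQUESTS:
        abort(429)  # Too Many Requests

@app.route("/data")
def data():
    return {"message": "only well-behaved clients get here"}

if __name__ == "__main__":
    app.run()
```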

  Anti-crawling requires not only an understanding of the website's traffic, but also knowledge of the methods commonly used by crawler engineers, so that protection can be targeted from multiple angles. Designing, implementing, and testing an anti-crawler program takes a lot of time and often requires the cooperation of several departments. From this point of view, beyond the technical difficulty, the time cost is also very high.

  On the crawler side, economic costs usually include IP proxy fees, cloud server fees, and VIP account fees. Crawling also tends to consume more time than anti-crawling, because whenever the target website changes its algorithms or page structure, the crawler code must be adjusted accordingly and sometimes rewritten entirely. The tit-for-tat, mutually escalating relationship between the two sides can be seen in the following table:

| Crawler | Anti-crawler |
| --- | --- |
| Python code sends requests to the target website and crawls its data | Monitor abnormal traffic and reject requests that do not appear to come from a browser |
| Forge a browser identifier (User-Agent) to deceive the target website | Detect that a large number of requests carry the same browser identifier, treat it as crawler forgery, and limit access frequency |
| Use IP rotation or multiple machines to send requests to the target website | Add CAPTCHAs to some entry pages or forms to distinguish normal users from crawlers |
| Recognize simple CAPTCHAs with code, hand complex ones to a CAPTCHA-solving platform, and keep sending requests to the target website | Improve the account system so that only VIP accounts can view key information, preventing valuable data from being crawled at scale |
| Register multiple accounts and purchase the website's VIP membership | Obfuscate important information on the site with custom obfuscation rules, increasing the difficulty of recognition by crawlers |
| When deobfuscation is too costly, take screenshots to obtain key data | Distinguish users from crawlers by the characteristics of automated testing frameworks or automated browsers |
| When the cost is too high, the crawler may give up crawling | When the cost is too high, crawling cannot be restricted completely |
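
To make the first two crawler-side rows concrete, here is a minimal sketch using the requests library; the target URL, the User-Agent strings, and the proxy addresses are placeholders rather than working values, and a real site may respond with any of the countermeasures listed above.

```python
import random
import time
import requests

# Placeholder values for illustration only.
TARGET_URL = "https://example.com/data"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = [
    "http://127.0.0.1:8001",  # hypothetical proxy addresses
    "http://127.0.0.1:8002",
]

def fetch(url):
    # Forge a browser identifier (User-Agent) and rotate proxies,
    # the first two crawler moves in the table above.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    resp = requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    for _ in range(3):
        print(len(fetch(TARGET_URL)))
        time.sleep(1)  # keep the request rate modest
```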

 


Origin blog.csdn.net/someby/article/details/106154583