10 useful "anti-crawler" measures!

Article source: Farnast

Hello everyone, I am Brother Tao. Today I will share with you 10 extremely useful "anti-crawler" measures! The content is about 4,000 words and takes roughly 10 minutes to read.

Crawling is a common application scenario for Python, and many practice projects ask you to crawl a certain website. When crawling web pages, you will most likely run into anti-crawling measures. How should you respond? This article sorts out common anti-crawling measures and the countermeasures to them.

1. Control access through User-Agent

Whether it is a browser or a crawler program, when it initiates a network request to the server, it sends a set of request headers, such as Zhihu's request headers.

Most of the fields in the headers are used by the browser to identify itself to the server.

For crawlers, the field that requires the most attention is: User-Agent

Many websites establish a User-Agent whitelist; only requests whose User-Agent falls within the normal range can access the site normally.

Solution:

You can set your own User-Agent, or better yet, randomly pick one from a pool of compliant User-Agent strings for each request, as in the sketch below.
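A minimal sketch with requests, assuming a small illustrative pool of User-Agent strings (a real crawler would maintain a larger, regularly updated list):

```python
import random

import requests

# Illustrative pool of common browser User-Agent strings (assumption:
# a real crawler would keep a larger, up-to-date list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Send a randomly chosen User-Agent with every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

print(fetch("https://httpbin.org/user-agent").text)
```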

Difficulty of implementation: ★

2. IP restrictions

If a fixed IP sends a large volume of requests to a website within a short period of time, the backend administrator can add an IP restriction to block that IP from further access.

Solution:

The more mature method is: IP proxy pool


Simply put, requests are routed through different proxy IPs so that no single IP gets blocked.

However, obtaining IP proxies is itself quite troublesome. There are free and paid ones online, but their quality is uneven. For enterprise needs, you can build your own proxy pool by purchasing clustered cloud services.
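A minimal sketch of routing requests through a proxy pool (the proxy addresses here are placeholders for proxies from your own pool):

```python
import random

import requests

# Placeholder proxy addresses -- substitute proxies from your own pool.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
    "http://127.0.0.1:8003",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXY_POOL)
    # Route both http and https traffic through the chosen proxy,
    # so consecutive requests come from different IPs.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```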

Difficulty of implementation: ★

3. SESSION access restrictions

The backend tracks the operations of logged-in users, such as clicks and data requests over short periods, and compares them against normal baselines to determine whether an account is behaving abnormally. If so, it restricts that account's operation permissions.

Disadvantages: event tracking has to be added, and a poorly chosen threshold can easily flag normal users by mistake.

Solution:

Register multiple accounts and simulate normal operations.
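A rough sketch of what "multiple accounts plus normal-looking behavior" can look like (the login endpoint and the credentials are hypothetical):

```python
import random
import time

import requests

# Hypothetical credentials for several pre-registered accounts.
ACCOUNTS = [
    {"user": "alice", "password": "pw1"},
    {"user": "bob", "password": "pw2"},
]

def make_session(account):
    session = requests.Session()
    # Hypothetical login endpoint; adapt to the target site.
    session.post("https://example.com/login", data=account, timeout=10)
    return session

def polite_get(session, url):
    # Random pauses keep the request rate close to a human's,
    # so per-account statistics stay within normal thresholds.
    time.sleep(random.uniform(2, 6))
    return session.get(url, timeout=10)
```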

Difficulty of implementation: ★★★

4. Spider traps

Spider traps cause a web crawler to enter something like an infinite loop, which wastes the crawler's resources, lowers its throughput, and, for a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and do not request documents from the same server more than once every few seconds, so "polite" crawlers are affected far less than "impolite" ones.

Anti-crawling method:

  • Create an infinitely deep directory structure, such as http://example.com/bar/foo/bar/foo/bar/foo/bar/

  • Use dynamic pages to generate an unlimited number of documents for crawlers, such as algorithmically generated pages of garbled articles.

  • Fill documents with so many characters that the lexer parsing them crashes.

In addition, websites with spider traps usually have robots.txt that tells robots not to enter the trap, so legitimate "polite" robots will not fall into the trap, while "impolite" robots that ignore the robots.txt setting will be affected by the trap.

Solution

Cluster web pages by the CSS files they reference, and cap the number of pages allowed in each cluster, so that a crawler which enters a trap cannot keep pulling pages out of it. Pages that reference no CSS are penalized by limiting the number of links collected from them. This does not theoretically guarantee that a crawler never falls into an infinite loop, but in practice the scheme works quite well, because most web pages, especially dynamic ones, use CSS. A simpler crawler-side safeguard is sketched below.
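As a complementary, simpler safeguard on the crawler side, a visited-set plus a depth cap abandons an infinitely deep trap directory early. A minimal sketch (the depth limit is an arbitrary assumption):

```python
from urllib.parse import urlparse

MAX_DEPTH = 8      # assumption: paths deeper than this are treated as traps
visited = set()

def should_crawl(url):
    # Never revisit a URL, so loops cannot recur forever.
    if url in visited:
        return False
    # Skip suspiciously deep paths such as /bar/foo/bar/foo/bar/...
    depth = len([part for part in urlparse(url).path.split("/") if part])
    if depth > MAX_DEPTH:
        return False
    visited.add(url)
    return True
```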

Disadvantages: the first two anti-crawling methods above add many useless directories or files, wasting resources. They are also very unfriendly to normal SEO, and the site may be penalized by search engines.

Difficulty of implementation: ★★★

5. Verification code

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a public, fully automated program that distinguishes whether the user is a computer or a human. It can prevent malicious password cracking, ticket scalping, and forum flooding, and it effectively stops a hacker from brute-forcing a specific registered account with a program that makes continuous login attempts. Using verification codes is now common practice on many websites.

Image verification code: complex


Coding platforms employ human workers to recognize verification codes and send the results back, with the whole round trip taking only a few seconds. Such platforms also have a memory function: once an image has been recognized as, say, a "spatula", the system directly answers "spatula" the next time the same image appears. Over time, the images on the verification-code server all get labeled, and a machine can recognize them automatically.

Image verification code: simple type


Simple codes like these can be recognized directly using OCR (the Python third-party library tesserocr) without any preprocessing.

After grayscale transformation and binarization, a verification code with a blurry background becomes clear and distinct.
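A minimal sketch of that preprocessing plus OCR, assuming tesserocr and Pillow are installed and the threshold is tuned per captcha style:

```python
import tesserocr
from PIL import Image

def recognize_captcha(path, threshold=128):
    img = Image.open(path).convert("L")            # grayscale transformation
    # Binarization: pixels brighter than the threshold become white,
    # everything else black, which strips the blurry background.
    img = img.point(lambda p: 255 if p > threshold else 0)
    return tesserocr.image_to_text(img).strip()

print(recognize_captcha("captcha.png"))
```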

Image verification code: deliberately confusing

This kind of verification code is usually generated with the language's own graphics library, with distortion added to make it look like this. We can train on about 90,000 labeled images to reach human-like accuracy and thereby recognize these verification codes.

For SMS verification codes, browser-simulation technology can reproduce the user's behavior of opening the SMS, finally obtaining the verification code inside.

Arithmetic-question image verification code


Manually extract all the Chinese characters that may appear and save them as black-and-white template images. Binarize the verification code according to the font color and remove the noise, then compare it pixel by pixel with each template, compute a similarity value, and pick the most similar picture (see the sketch below).
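A minimal sketch of that pixel-similarity comparison with Pillow and NumPy, assuming the character crop and the templates have already been resized to the same shape:

```python
import numpy as np
from PIL import Image

def to_binary(path, threshold=128):
    # Threshold on brightness; with a colored font you would threshold
    # on the font color instead.
    img = Image.open(path).convert("L")
    return (np.array(img) > threshold).astype(np.uint8)

def best_match(char_img, templates):
    # templates: dict mapping each character to its binary template array.
    # Pick the template whose pixels agree most with the character crop.
    return max(templates, key=lambda ch: float((templates[ch] == char_img).mean()))
```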

Sliding verification code


For sliding verification codes, we can use the image's pixels as clues: establish a baseline pixel value, check where the difference from it exceeds that baseline, and thereby determine the approximate position of the gap in the image, as in the sketch below.
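A minimal sketch of locating the gap, assuming you can obtain both the intact background and the gapped one at the same size:

```python
from PIL import Image, ImageChops

def find_gap_x(full_path, gapped_path, threshold=60):
    # Difference image between the intact background and the version
    # with the puzzle gap; the gap is where pixels differ strongly.
    diff = ImageChops.difference(
        Image.open(full_path).convert("L"),
        Image.open(gapped_path).convert("L"),
    )
    pixels = diff.load()
    width, height = diff.size
    for x in range(width):
        for y in range(height):
            if pixels[x, y] > threshold:
                return x        # left edge of the gap
    return -1
```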

Pattern verification code


Each time, the required drag order is different, so the result differs every time. How can we identify it?

Use machine learning: train on about 10,000 images to learn the drag sequences and reproduce human-like operations, then use Selenium to simulate a human's drag sequence, exhausting the possible drag orders until recognition succeeds.

Click-the-inverted-characters verification code


Let's analyze it: Chinese has an enormous character set accumulated over five thousand years, and different fonts, distortion, and noise on the characters make recognition even more difficult.

Method: first click the first two inverted characters to determine the coordinates of the 7 characters. The positions of the 7 Chinese characters in the verification code are fixed, so you only need to record each character's coordinates in a list in advance, then manually determine the serial numbers of the inverted characters and submit the coordinates corresponding to those serial numbers from the list to log in successfully.

Solution

Connect to a third-party verification code platform to crack the website's verification code in real time.

Disadvantages: it hurts the normal user experience. The more complex the verification code, the worse the site experience.

Difficulty of implementation: ★★

6. Limit crawlers through robots.txt

robots.txt (always lowercase) is an ASCII-encoded text file stored in the root directory of a website. It tells search-engine robots (also known as web spiders) which content on the site must not be fetched and which content may be fetched.

The robots.txt protocol is not a specification, just a convention, so it does not guarantee the site's privacy. Note that robots.txt uses string comparison to decide whether a URL may be fetched, so a directory URL with a trailing slash "/" and one without are treated as different URLs.

Disadvantages: it is just a gentlemen's agreement. It works on well-behaved crawlers such as search engines, but does nothing against crawlers with a specific purpose.

Solution

If you use the scrapy framework, just set ROBOTSTXT_OBEY in the settings file to False
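For example, the relevant line in a Scrapy project's settings.py:

```python
# settings.py -- stop honoring robots.txt before crawling
ROBOTSTXT_OBEY = False
```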

Difficulty of implementation: ★

7. Dynamic data loading

Python's requests library can only crawl static pages; it cannot execute JavaScript, so it cannot crawl dynamically loaded pages. Loading data with JS therefore raises the bar for crawlers.

Solution

Capture packets to obtain the data URL

The request URL of the data can be obtained through packet capture, and then the data can be captured by analyzing and changing the URL parameters.

Example:

  • Inspect the network requests on https://image.baidu.com. Among them, the URL under search is exactly the same as the address we visited, but its response contains JS code.

  • When I scroll down the animal-picture homepage to see more, more packets appear. As the captures show, scrolling down the page fetches a series of JSON data. Inside data you can see fields such as thumbURL, whose value is a URL: this is the link to the picture.

  • Open a browser page and visit thumbURL="https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=1968180540,4118301545&fm=27&gp=0.jpg" and you will find the picture among the search results.

  • From the previous analysis, we can deduce the request

URL="https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf8&oe=utf8&adpicid=&st=-1&z=&ic=0&word=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&pn=30&rn=30&gsm=1e&1531038037275="

Visit the link in your browser to confirm it is publicly accessible.

  • Finally, look for patterns in these URLs and construct new ones to fetch all the photos; a sketch follows.
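A minimal sketch of constructing those URLs with requests. The parameter names follow the captured URL above; Baidu may have changed the interface since this was written, and the endpoint usually expects a browser-like User-Agent:

```python
import requests

BASE = "https://image.baidu.com/search/acjson"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # the endpoint expects a browser-like UA

def fetch_thumb_urls(keyword, page):
    # pn pages through the results 30 at a time, as in the captured URL.
    params = {
        "tn": "resultjson_com", "ipn": "rj", "fp": "result",
        "queryWord": keyword, "word": keyword,
        "ie": "utf-8", "oe": "utf-8", "nc": 1, "istype": 2,
        "pn": 30 * page, "rn": 30,
    }
    data = requests.get(BASE, params=params, headers=HEADERS, timeout=10).json()
    # Each item in "data" may carry a thumbURL, as seen in the captures.
    return [item["thumbURL"] for item in data.get("data", []) if item.get("thumbURL")]

for url in fetch_thumb_urls("动物图片", 0):
    print(url)
```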

Use Selenium

Use Selenium to simulate a user operating the browser, then combine it with packages such as BeautifulSoup to parse the page and obtain the data. This method is simple and intuitive, but its drawback is that it is relatively slow.
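A minimal Selenium-plus-BeautifulSoup sketch (the page URL and the scroll count are illustrative, and a matching browser driver must be available):

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()    # assumes a matching chromedriver is installed
driver.get("https://image.baidu.com/search/index?tn=baiduimage&word=animal")

# Scroll a few times so the page loads more results via JS.
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Hand the fully rendered HTML to BeautifulSoup for parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
image_urls = [img.get("src") for img in soup.find_all("img") if img.get("src")]
driver.quit()
print(len(image_urls), "images found")
```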

Disadvantages: If the data API is not encrypted, the interface is easily exposed, making it easier for crawler users to obtain data.

Difficulty of implementation: ★

8. Data Encryption – Using Encryption Algorithms

Front-end encryption generates a string of encrypted code from front-end data such as query parameters, the User-Agent, verification codes, and cookies, and sends that code as a parameter with the data request. If the encrypted parameter is empty or wrong, the server will not respond to the request.

The server side holds the same encryption logic: it generates the code itself and matches it against the one in the request. Only if they match is the data returned.

Solution

To crack JS encryption, find the JS code that performs the encryption, then use the third-party library js2py to run that JS code from Python and obtain the corresponding encrypted code, as sketched below.
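A minimal js2py sketch; the sign function here is a made-up stand-in for whatever encryption code you lift from the target site:

```python
import js2py

context = js2py.EvalJs()
# Hypothetical signing function standing in for the site's real JS.
context.execute("""
function sign(query, ts) {
    return query + "_" + (ts * 131 % 100007);
}
""")

# Call the JS function from Python to obtain the encrypted parameter.
print(context.sign("keyword", 1531038037))
```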

Case reference:

https://blog.csdn.net/lsh19950928/article/details/81585881

Disadvantages: The encryption algorithm is written in plain text in JS, and crawler users can still analyze it.

Difficulty of implementation: ★★★

9. Data encryption - using font file mapping

In fact, if you can read the JS code, the approach above is still fairly easy to crack, so the following measures are needed to raise the difficulty.

  • Encrypt JS

  • Use multiple different font files and agree on a rule for selecting one, such as taking the timestamp modulo the number of fonts (a minimal sketch of this selection follows the list). The crawled data then maps differently each time, which greatly increases the difficulty of cracking. This is harder to break than an encryption algorithm alone, because there are only a handful of fixed encryption algorithms, which are easy for the other side to obtain and crack, whereas font-file mappings can follow arbitrary rules: normal-looking data is actually displayed wrong, and crawlers find that hard to notice. Reference case: https://www.jianshu.com/p/f79d8e674768
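A minimal sketch of the timestamp-modulo font selection (the font file names are hypothetical; each file is assumed to carry a different glyph-to-character mapping generated offline):

```python
import time

# Hypothetical pre-generated font files, each with a different mapping.
FONT_FILES = ["map_a.woff", "map_b.woff", "map_c.woff"]

def pick_font():
    # Front end and back end agree to select the font by timestamp
    # modulo the pool size, so the mapping changes over time.
    return FONT_FILES[int(time.time()) % len(FONT_FILES)]
```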

Disadvantages: font files need to be generated, which increases the size of the resources the website has to load.

Difficulty of implementation: ★★★★

10. Occlusion of non-visible areas

This method mainly targets crawlers that use Selenium: if the simulated browser has not brought an element into the visible area, the unseen data is occluded to block Selenium's click() operation. This only slightly slows the crawler down; it does not stop the data from being crawled.

Difficulty of implementation: ★



Source: blog.csdn.net/wuShiJingZuo/article/details/133326969