Common crawlers and the anti-crawler struggle

00. The magnificent battle between Spider, Anti-Spider, Anti-Anti-Spider...

Day 1: Xiaomo wanted all the movies on a certain site, so he wrote a standard crawler (based on the HttpClient library) that continuously walked the site's movie list pages, parsed the movie names out of the HTML, and stored them in his own database. Xiaoli, the site's ops engineer, noticed that the number of requests spiked sharply during a certain period. After analyzing the logs, he found that all the requests came from a single IP (xxx.xxx.xxx.xxx) and that the user-agent was still Python-urllib/2.7. Judging from these two points that the visitor was not human, he blocked it directly at the server.
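As a rough illustration of what Xiaomo's first version might have looked like, here is a minimal sketch in Python. The list-page URL and the regular expression are hypothetical; a real site would need its own parsing logic:

import re
import urllib.request

# Hypothetical list-page URL pattern; the real site's layout would differ.
LIST_URL = "http://example-movie-site.com/list?page={}"

def fetch(url):
    # A plain request: one fixed IP, the default "Python-urllib/..." user-agent --
    # exactly the two signals Xiaoli later uses to block the crawler.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def parse_titles(html):
    # Naive HTML "parsing" with a regex; assumes titles sit in <a class="title"> tags.
    return re.findall(r'<a class="title"[^>]*>(.*?)</a>', html)

if __name__ == "__main__":
    for page in range(1, 1000):
        html = fetch(LIST_URL.format(page))
        for title in parse_titles(html):
            print(title)  # in the story these would go into a database
        # no delay, no IP rotation: every request looks identical in the logs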

Day 2: Xiaomo's movie list had only been crawled halfway, so he adjusted his strategy accordingly: 1. the user-agent now imitates Baidu's ("Baiduspider..."); 2. the crawler switches to a new proxy IP every half hour. Xiaoli noticed the corresponding changes, so he set a frequency limit on the server and blocked any IP that exceeded 120 requests per minute. At the same time, worried that Baidu's own crawler might be hit by accident (and mindful of the marketing department's ad spend of hundreds of thousands a month), he wrote a script to check via the hostname whether an IP really belongs to Baidu, and set up a whitelist, as sketched below.
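A common way to check whether an IP that claims to be Baiduspider really belongs to Baidu is a reverse DNS lookup: genuine Baidu crawler IPs resolve to hostnames under Baidu's domains. A minimal sketch of the whitelist check Xiaoli might have scripted (the hostname suffixes are the commonly documented ones, not taken from the original text):

import socket

def is_real_baiduspider(ip: str) -> bool:
    """Reverse-resolve the IP and accept it only if the hostname is Baidu's."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False  # no PTR record: treat as not Baidu
    # Genuine Baiduspider hosts end in .baidu.com or .baidu.jp.
    return hostname.endswith((".baidu.com", ".baidu.jp"))

# Example: anything that fails the check stays subject to the normal rate limit.
print(is_real_baiduspider("8.8.8.8"))  # False -- resolves to a Google hostname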

Day 3: After Xiao Mo discovered the new restrictions, he figured he was in no hurry for the data and could leave the server to crawl slowly, so he modified the code to make one request every 1-3 seconds at random, rest 10 seconds after every 10 requests, crawl only between 8-12 and 18-20 each day, and take a day off every few days. Xiaoli found the new logs a headache: tightening the rules any further would catch real users by accident, so he changed his approach. Whenever an IP made more than 50 requests within 3 hours, a CAPTCHA pop-up appeared; if it was not entered correctly, the IP was recorded in the blacklist.

Day 4: Xiao Mo was a little stumped when he saw the CAPTCHA, but it was not insurmountable. He first studied image recognition (keywords: PIL, tesseract), then binarized the CAPTCHA images, segmented the characters, and trained a model. In short, he eventually managed to recognize Xiaoli's CAPTCHA (CAPTCHAs, the recognition of CAPTCHAs, and anti-recognition measures are themselves another magnificent history of struggle...), and the crawler ran again. Xiaoli, a diligent and persevering student, saw that the CAPTCHA had been broken and discussed a new approach with the developers: the data is no longer rendered directly into the page, but fetched asynchronously by the front end and protected by a JavaScript library that generates a dynamic token, with the encryption library itself obfuscated (major websites really do take these steps; see the login flows of Taobao and Weibo).
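A minimal sketch of the binarize-then-OCR step described above, using Pillow and pytesseract. The threshold value and the assumption that the CAPTCHA is plain distorted text are illustrative; real CAPTCHAs usually need per-site preprocessing and often model training:

from PIL import Image
import pytesseract

def read_captcha(path: str, threshold: int = 140) -> str:
    img = Image.open(path).convert("L")  # grayscale
    # Binarize: pixels darker than the threshold become black, the rest white.
    img = img.point(lambda p: 0 if p < threshold else 255)
    # OCR the cleaned-up image as a single line; whitelist characters to cut down noise.
    return pytesseract.image_to_string(
        img,
        config="--psm 7 -c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz",
    ).strip()

print(read_captcha("captcha.png"))  # hypothetical saved CAPTCHA image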

Day 5: Is an obfuscated encryption library the end of the road? Of course not. You could debug it slowly and work out the encryption scheme, but Xiao Mo was not going to use such a time-consuming, labor-intensive method. He gave up the HttpClient-based crawler and chose one with a built-in browser engine (keywords: PhantomJS, Selenium), ran the page inside the browser engine, read the correctly rendered result directly, and got the other side's data once again. Xiao Li: .....
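A minimal sketch of the browser-engine approach, using Selenium with headless Chrome (PhantomJS has since been deprecated, so Chrome stands in here; the URL and CSS selector are hypothetical):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)

try:
    # The page's own JavaScript fetches the data, generates the token,
    # and renders the result -- the crawler just reads the finished DOM.
    driver.get("http://example-movie-site.com/list")
    driver.implicitly_wait(10)  # retry element lookups while the async requests finish
    for el in driver.find_elements(By.CSS_SELECTOR, "a.title"):
        print(el.text)
finally:
    driver.quit()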

The struggle between crawlers and anti-crawlers goes on... In general, in the battle between crawlers and anti-crawlers, the crawler ultimately wins: as long as a web page can be accessed normally by a human, a crawler with the same resources can certainly fetch it. A few suggestions for the crawler side:

1. Minimize the number of requests. If the list page already has what you need, don't fetch the detail pages as well; it reduces the load on the server, and it's hard enough for programmers to make a living as it is.
2. Don't look only at the web site; check the mobile app and H5 pages too, where anti-crawling measures are generally weaker.
3. In practice, the defending side usually just rate-limits by IP and leaves it at that. Unless the data is truly core to the business, they won't add further verification, since cost is always a consideration.
4. If you really need high throughput, consider multithreading (mature frameworks such as Scrapy already support it) or even distributed crawling, as sketched after this list.
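For point 4, Scrapy's built-in settings already cover concurrency and polite throttling; a minimal sketch of the relevant knobs (the values and the user-agent string are illustrative, to be tuned per target site):

# settings.py (excerpt)
CONCURRENT_REQUESTS = 16            # requests handled in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1.5                # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay, like Xiaomo's 1-3 second trick
AUTOTHROTTLE_ENABLED = True         # back off automatically when the server slows down
RETRY_ENABLED = True
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"  # hypothetical UA string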
