How creative can front-end engineers get in fighting back against crawlers?

1. Introduction

For a web page, we generally want it to be well structured and clearly written, so that search engines can index it accurately.
On the other hand, there are scenarios where we do not want the content to be easily harvested: the transaction volumes of e-commerce sites, the question banks of education sites, and so on. Such content is often the lifeblood of a product and must be protected effectively. This is where the topic of crawlers versus anti-crawlers comes from.

2. Common anti-crawler strategies

Let's be clear up front: no website in the world can block crawlers perfectly.

A page has to display normally for real users, so it cannot simply hide; it must instead tell humans and robots apart. Engineers have therefore tried all sorts of approaches. Most of these strategies live on the back end, and they remain the conventional, relatively effective means today, such as:

  • User-Agent and Referer checks
  • Account and cookie verification
  • CAPTCHAs
  • IP rate limiting

Yet a crawler can get arbitrarily close to a real person, for example (see the sketch after this list):

  • Headless Chrome or PhantomJS to simulate a full browser environment
  • Tesseract to recognize CAPTCHAs
  • Proxy IPs, which can be bought by the bucket (on Taobao, for instance)
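
To give a sense of how low the bar is, here is a minimal sketch that drives headless Chrome through Puppeteer. The article only names the tools; the library choice, target URL, and User-Agent string below are illustrative assumptions, not anything from the original.

```js
// Minimal sketch: headless Chrome via Puppeteer (illustrative only).
// The target URL and User-Agent string are made up.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Spoof a realistic User-Agent so naive UA checks see a "real" browser
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  console.log(await page.content()); // fully rendered HTML, scripts included
  await browser.close();
})();
```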

So, a 100% effective anti-crawling strategy? It does not exist.
Anti-crawling is ultimately a war of attrition; the only real question is how expensive you make it for the other side.

However, as front-end engineers, we can raise the difficulty of the game and design some anti-crawling schemes that can only be described as delightfully deranged.

3. The front end and anti-crawlers

3.1 Font-face patchwork

Example: Maoyan Movies

On Maoyan Movies, the box-office figures are not plain numbers.
The page defines a custom character set with font-face and displays the digits through a unicode mapping. In other words, short of image recognition, a crawler must also fetch the font file and reverse its mapping before it can read the numbers.

Moreover, the URL of the font file changes on every page refresh, which undoubtedly raises the crawling cost even further.
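
To make the idea concrete, here is a minimal sketch of what such a page might look like. The font URL and the private-use codepoints are invented for illustration; they are not Maoyan's actual values.

```html
<!-- Hypothetical sketch: digits drawn by a custom font with a scrambled
     glyph mapping. The font URL and codepoints are made up. -->
<style>
  @font-face {
    font-family: 'secret-digits';
    /* the server emits a freshly generated font per page load, so the
       codepoint-to-glyph mapping changes on every refresh */
    src: url('/fonts/a1b2c3d4.woff2') format('woff2');
  }
  .box-office { font-family: 'secret-digits'; }
</style>

<!-- the source holds private-use-area codepoints, not real digits;
     only the custom font knows that, say, &#xe137; draws as "8" -->
<span class="box-office">&#xe137;&#xe2fa;.&#xe9c4;</span>
```

A crawler that reads only the HTML gets meaningless codepoints; it has to download the font file and work out its codepoint-to-glyph mapping before it can recover the number.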

3.2 Background patchwork

Example: Meituan

Similar in spirit to the font trick, Meituan assembles its numbers from background images: each "digit" is really a slice of a picture, and different background offsets display different characters.

The ordering of characters within the picture also varies from page to page. In theory the image only needs to contain 0-9 and a decimal point, though, so it is unclear why it carries repeated characters.

(Screenshots of Page A and Page B, showing differently ordered sprite images, are omitted here.)
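
A rough sketch of the mechanism; the class name, sprite URL, and offsets below are invented for illustration.

```html
<!-- Hypothetical sketch: each "digit" is an empty element showing one
     slice of a sprite image; the background offset picks the character.
     Sprite URL and offsets are made up. -->
<style>
  .digit {
    display: inline-block;
    width: 12px;
    height: 18px;
    /* one image containing 0-9 and "." in a per-page scrambled order */
    background-image: url('/imgs/digits-sprite.png');
  }
</style>

<!-- without the sprite image, the HTML alone reveals nothing -->
<span class="digit" style="background-position: -36px 0"></span>
<span class="digit" style="background-position: -96px 0"></span>
<span class="digit" style="background-position: -12px 0"></span>
```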

3.3 Interspersed characters

Example: WeChat official account articles

Some WeChat official account articles are interspersed with mysterious extra characters, which are then hidden via styles.
Shocking as this looks... the junk is actually not that hard to detect and filter out, and it could even be done better. Still, it is an impressive bit of imagination.

By the way, can I get reimbursed for the mobile data all those junk characters burn?
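
A minimal sketch of the idea. The junk characters and the inline display:none are stand-ins; the real pages hide their decoys behind whatever styles they choose.

```html
<!-- Hypothetical sketch: decoy characters interspersed in the text and
     hidden with CSS, so style-unaware extraction picks them up -->
<p>
  Rea<span style="display:none">XK</span>l
  con<span style="display:none">Qz</span>tent
</p>
<!-- browsers render "Real content"; a parser that concatenates every
     text node without resolving CSS reads "ReaXKl conQztent" -->
```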

3.4 Pseudo-element hiding

Example: Autohome

On Autohome, key manufacturer information is placed in the content property of a pseudo-element.
This is another line of thinking: to scrape the page, a crawler now has to parse the CSS and extract the pseudo-element's content, which raises the bar.
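
A minimal sketch, with an invented class name and value:

```html
<!-- Hypothetical sketch: the visible value exists only in the CSS,
     never in an HTML text node. Class name and value are made up. -->
<style>
  .manufacturer::before {
    content: 'Some Motor Co., Ltd.';
  }
</style>
<span class="manufacturer"></span>
<!-- the element is empty in the DOM; a crawler must download and parse
     the stylesheet to learn what ::before injects -->
```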

3.5 Element positioning overlay

Example: Qunar

And now, something for the math lovers. To show a 4-digit ticket price, Qunar first renders four <i> tags, then absolutely positions two <b> tags over them, covering the deliberately wrong digits so that the correct price emerges only visually...

So a crawler must not only parse the CSS, it must also sit down and do a little geometry.
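
A sketch of that geometry, with invented digits and offsets:

```html
<!-- Hypothetical sketch: <i> elements spell a deliberately wrong price;
     absolutely positioned <b> elements cover two of them, so only the
     visual result is correct. Digits and offsets are made up. -->
<style>
  .price { position: relative; }
  .price i { display: inline-block; width: 10px; font-style: normal; }
  .price b {
    position: absolute; top: 0; width: 10px;
    background: #fff; font-weight: normal;
  }
</style>

<span class="price">
  <i>1</i><i>9</i><i>9</i><i>9</i>  <!-- the source reads "1999" -->
  <b style="left:10px">2</b>        <!-- covers the 2nd digit -->
  <b style="left:30px">8</b>        <!-- covers the 4th digit -->
</span>
<!-- visually "1298": recovering it means working out from the layout
     which <b> sits on top of which <i> -->
```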

3.6 Iframe asynchronous loading

Example: NetEase Cloud Music

When the NetEase Cloud Music page first opens, the HTML source contains almost nothing but a single iframe, and its src is blank: about:blank. Then JavaScript kicks in and asynchronously stuffs the entire page skeleton into that iframe...

That said, this method does not add much difficulty; it just forces a detour through asynchrony and iframe handling (and there may be other reasons for it besides anti-crawling). Whether you use Selenium or PhantomJS, there are APIs for reaching the content inside an iframe.
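
A minimal sketch of the shell-plus-iframe pattern; the element id and the injected markup are placeholders.

```html
<!-- Hypothetical sketch: the initial document is a near-empty shell;
     JS builds the real page inside an about:blank iframe at runtime -->
<iframe id="app" src="about:blank" style="border:0"></iframe>
<script>
  // in reality the markup would be fetched or generated dynamically;
  // this literal string is just a placeholder
  var doc = document.getElementById('app').contentDocument;
  doc.open();
  doc.write('<html><body><h1>The actual page</h1></body></html>');
  doc.close();
</script>
<!-- a plain HTTP fetch sees only this empty shell, but browser-driving
     tools can switch into the frame and read its DOM -->
```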

3.7 Character segmentation

Example: Whole Network Proxy IP (a proxy-list site)

On pages that list proxy IP information, protecting the IPs themselves takes quite a bit of effort.

They first split the digits and dots of each IP across separate DOM nodes, then insert decoy digits in between. A crawler that does not know about this trick will happily believe it has grabbed the value; a crawler that notices can strip the decoys, and the trick is defeated.
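
A minimal sketch, with made-up values:

```html
<!-- Hypothetical sketch: an IP address split across nodes, with decoy
     digits hidden by a CSS class in between. Values are made up. -->
<style>.fake { display: none; }</style>
<span>104.</span><span class="fake">22</span><span>16.</span><span class="fake">7.</span><span>52.9</span>
<!-- rendered: "104.16.52.9"; naive concatenation of every text node
     yields "104.2216.7.52.9" -->
```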

3.8 Character set substitution

Example: Qunar mobile site

The mobile version of Qunar also lies to crawlers.

The HTML plainly says 3211, yet the page visually shows 1233. It turns out they redefined the character set: the glyphs for 3 and 1 are simply swapped, which produces exactly that result...
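
In effect this is the font-face trick again, applied to ordinary digits rather than private-use codepoints. A sketch, with an invented font URL:

```html
<!-- Hypothetical sketch: a custom font whose glyphs for "1" and "3"
     are swapped, so the markup lies about what is displayed.
     The font URL is made up. -->
<style>
  @font-face {
    font-family: 'swapped-digits';
    src: url('/fonts/swapped.woff2') format('woff2');
  }
  .price { font-family: 'swapped-digits'; }
</style>
<span class="price">3211</span>  <!-- renders as "1233" under this font -->
```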
