"Want to learn Python crawler must-see series" common anti-climbing methods and solutions

Learning objectives

  1. Understand why servers implement anti-crawling

  2. Understand which kinds of crawlers servers most often block

  3. Understand some common concepts in the anti-crawling field

  4. Understand the three directions of anti-crawling

  5. Understand common anti-crawling based on identity recognition

  6. Understand common anti-crawling based on crawler behavior

  7. Understand common anti-crawling based on data encryption


1 Why servers implement anti-crawling

  • Crawlers account for a high proportion of total PV (PV is the number of page views; every time a page is opened or refreshed counts as one PV), which wastes server resources and money (especially the crawlers of March).

    What are the crawlers of March? Every March there is a crawling peak: a large number of master's students writing their theses choose to crawl websites and run public-opinion analysis. The thesis is due in May, and, well, you know how it goes: the earlier months were spent on DotA and LoL, by March it is almost too late, so they rush to grab the data in March, analyze it in April, and hand in the thesis in May. That is the rhythm.

  • Resources that the company lets people query for free are taken away in bulk, so the company loses competitiveness and earns less.

    The data can be queried directly without logging in. If login were forced, the site could make the other side pay a price by banning accounts, which is what many websites do. But since it does not force the other side to log in, without anti-crawling the other side can copy the information in bulk and the company's competitiveness drops sharply. Once competitors have captured the data, users will sooner or later realize that they only need to go to the competitor and have no reason to visit our site, which is bad for us.

  • The chance of winning a lawsuit against a crawler operator is small

    Crawling is still a legal gray area in China: a lawsuit might succeed, or it might come to nothing at all. So we still need technical means as the final line of defense.

2 What kinds of crawlers do servers most often block?

  • Very unskilled fresh graduates

    Crawlers written by fresh graduates are usually simple and crude. They pay no attention to server load, and their unpredictable numbers can easily bring a site down.

  • Very unskilled small start-up companies

    There are more and more start-ups nowadays, talked into existence by who knows whom. People start a company, find they do not know what to do, see that big data is hot, and decide to do big data. The analysis program is almost finished when they realize they have no data at hand. What to do? Write a crawler. The result is countless small crawlers relentlessly scraping data so that the company can survive.

  • Out-of-control crawlers released by mistake that nobody will ever stop

    Some websites have already deployed anti-crawling measures, yet the crawlers keep crawling tirelessly. What does that mean? It means they cannot fetch any real data at all: everything is wrong except the HTTP status code, which is 200, yet the crawlers never stop. These are most likely small crawlers hosted on some server, long since abandoned by their owners and still working hard.

  • Well-organized business rivals

    This is the biggest opponent. They have the skills, the money, and everything they need. If they decide to fight you, you can only grit your teeth and fight back.

  • Search engines

    Do not assume search engines are always well behaved. They also have fits, and when they do, server performance drops; the request volume is no different from a network attack.

 

3 Some common concepts in the anti-crawling field

Because anti-crawling is still a relatively new field, some definitions have to be made by ourselves:

  • Crawler: obtaining website information in bulk by any technical means. The key word is bulk.

  • Anti-crawler: preventing others from obtaining your own website's information in bulk, by any technical means. The key word is again bulk.

  • False positive: in the course of anti-crawling, an ordinary user is misidentified as a crawler. An anti-crawling strategy with a high false-positive rate cannot be used, no matter how effective it is.

  • Interception: successfully preventing a crawler from accessing the site. There is a corresponding interception rate. Generally, the higher the interception rate of an anti-crawling strategy, the higher the chance of false positives, so there is a trade-off.

  • Resources: the sum of machine cost and labor cost.

Keep in mind that labor is also a resource, and a more important one than machines. According to Moore's Law, machines keep getting cheaper, while by the trend of the IT industry programmer salaries keep getting more expensive. So server-side anti-crawling usually aims to make crawler engineers work overtime; the machine cost is not the expensive part.

4 The three directions of anti-crawling

  • Anti-crawling based on identity recognition

  • Anti-crawling based on crawler behavior

  • Anti-crawling based on data encryption

5 Common anti-crawling based on identity recognition

1 Anti-crawling through headers fields

There are many fields in the headers, and the server may inspect these fields to judge whether the client is a crawler.

1.1 Anti-crawling through the User-Agent field in headers

  • Anti-crawling principle: crawlers do not carry a User-Agent by default, but use the default value set by the module

  • Solution: add a User-Agent before sending the request; a better approach is to use a User-Agent pool (collect a batch of User-Agent strings, or generate them randomly), as in the sketch below
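The following is a minimal sketch of a User-Agent pool with the requests library; the strings in the pool are just sample desktop browser UAs of the kind you would collect yourself (or take from a library such as fake-useragent).

```python
import random

import requests

# A small sample pool of desktop User-Agent strings; in practice, collect more
# of them or generate them with a library such as fake-useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a random User-Agent for every request so the requests do not all
    # carry the library's default User-Agent.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

print(fetch("https://example.com").status_code)
```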

1.2 Anti-crawling through the referer field or other fields

  • Anti-crawling principle: crawlers do not carry the referer field by default; the server judges whether the request is legitimate by checking where it came from

  • Solution: add the referer field (it can go into the same headers dict as the User-Agent in the sketch above)

1.3 Anti-crawling through cookies

  • Anti-crawling principle: the server checks the cookies to see whether the user who initiated the request has the required permissions

  • Solution: perform a simulated login and start crawling the data only after the cookies have been obtained, as in the sketch below
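A minimal sketch of a simulated login with requests.Session, assuming a hypothetical form login at /login with username/password fields; the real URL and field names come from analyzing the target site's login request.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical login endpoint and form fields; adjust them to what packet
# capture of the real login request shows.
login_data = {"username": "your_username", "password": "your_password"}
session.post("https://example.com/login", data=login_data, timeout=10)

# The session keeps the cookies returned by the login response, so later
# requests automatically carry them.
resp = session.get("https://example.com/protected-page", timeout=10)
print(resp.status_code)
```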

2 Anti-crawling through request parameters

Request parameters can be obtained in many ways. When sending a request to the server, it is often necessary to carry request parameters, and the server can usually tell whether the client is a crawler by checking whether those parameters are correct.

2.1 Request data placed in static HTML files (e.g. GitHub login data)

  • Anti-crawling principle: raise the difficulty of obtaining the request parameters

  • Solution: carefully analyze every packet obtained from packet capture and work out how the requests relate to each other

2.2 Request data obtained by sending an extra request

  • Anti-crawling principle: raise the difficulty of obtaining the request parameters

  • Solution: carefully analyze every packet obtained from packet capture, work out how the requests relate to each other, and trace where each request parameter comes from

2.3 Request parameters generated by js

  • Anti-crawling principle: the request parameters are generated by js

  • Solution: analyze the js, observe how the encryption is implemented, and either execute the js with js2py to obtain its result or use selenium; a js2py sketch follows
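A minimal js2py sketch, under the assumption that packet analysis has revealed a signing function like the one below; the function body here is a made-up stand-in for a site's real logic.

```python
import js2py

# Made-up stand-in for a site's parameter-signing JavaScript, extracted from
# the page or its js files during analysis.
sign_js = """
function getSign(keyword, timestamp) {
    return keyword + "_" + (timestamp * 13 % 99991);
}
"""

context = js2py.EvalJs()
context.execute(sign_js)

# Call the JavaScript function from Python to obtain the request parameter.
sign = context.getSign("python", 1700000000)
print(sign)
```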

2.4 Anti-crawling through verification codes (captchas)

  • Anti-crawling principle: the server forcibly verifies the user's browsing behavior by popping up a captcha

  • Solution: use a captcha-solving platform or machine learning to recognize the captcha; the solving platforms are cheap and easy to use, and are the more recommended option

6 Common anti-crawling based on crawler behavior

1 Based on request frequency or total number of requests

Crawler behavior is clearly different from that of ordinary users: a crawler's request frequency and request count are much higher.

1.1 Anti-crawling through the total number of requests per unit time from one ip/account

  • Anti-crawling principle: a normal browser does not request a website very quickly; if the same ip/account sends a large number of requests to the server, it is more likely to be identified as a crawler

  • Solution: buy high-quality ips / buy multiple accounts and rotate them, as in the sketch below
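A minimal sketch of rotating purchased proxy ips with requests; the proxy addresses are placeholders for the ones supplied by a proxy vendor.

```python
import random

import requests

# Placeholder proxy addresses; replace them with the ones from your vendor.
PROXY_POOL = [
    "http://111.111.111.111:8888",
    "http://222.222.222.222:8888",
]

def fetch_with_proxy(url):
    # Route each request through a randomly chosen proxy so that no single ip
    # accumulates too many requests per unit time.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```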

1.2 Anti-crawling through the interval between requests from the same ip/account

  • Anti-crawling principle: when a real person browses a website, the intervals between requests are random, whereas the interval between a crawler's consecutive requests is usually fairly fixed and short, so it can be used for anti-crawling

  • Solution: wait a random time between requests to simulate a real user; once the interval is added, use a proxy pool as much as possible to keep the overall speed up, and if accounts are used, set a random sleep between requests from the same account (see the sketch below)
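A minimal sketch of a random wait between requests:

```python
import random
import time

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    # ... parse and save the response here ...

    # Sleep for a random interval so the gap between consecutive requests is
    # neither fixed nor too short, which mimics a human browsing pattern.
    time.sleep(random.uniform(1, 3))
```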

1.3 Anti-crawling by setting a daily request threshold per ip/account

  • Anti-crawling principle: normal browsing only produces a limited number of requests per day; once a certain value is exceeded, the server refuses to respond

  • Solution: buy high-quality ips / multiple accounts, and set a random sleep between requests

2 Anti-crawling based on the crawling workflow, usually by analyzing the crawling steps

2.1 Anti-crawling through js page jumps

  • Anti-crawling principle: the page jump is implemented in js, so the next-page url cannot be found in the page source

  • Solution: capture packets several times to obtain the jump urls and work out the pattern

2.2 Obtaining the crawler's ip (or proxy ip) through a honeypot (trap) and blocking it

  • Anti-crawling principle: after fetching a page, a crawler extracts follow-up links with regular expressions, xpath, css selectors and so on. The server can plant a trap url that these extraction rules will pick up but that a normal user can never reach, which effectively distinguishes crawlers from normal users

  • Solution: after the crawler is written, test it by crawling in batches through proxies / carefully analyze the structure of the response and find the traps hidden in the page (see the sketch below)
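One common honeypot pattern (an assumption here, not something every site uses) is a trap link hidden from real users with an inline display:none style. A minimal sketch of skipping such links with lxml:

```python
from lxml import etree

html = """
<ul>
  <li><a href="/item/1">normal link</a></li>
  <li style="display:none"><a href="/trap/abc">hidden trap link</a></li>
</ul>
"""

tree = etree.HTML(html)
# Only follow links that are not inside an element hidden with display:none.
safe_links = tree.xpath(
    '//a[not(ancestor-or-self::*[contains(@style, "display:none")])]/@href'
)
print(safe_links)  # ['/item/1']
```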

2.3 Anti-crawling through fake data

  • Anti-crawling principle: fake data is added to the returned response to pollute the crawler's database; the fake data is usually never shown to normal users

  • Solution: over long-term operation, check whether the data in the database matches the data on the actual pages, and if there is a problem, carefully analyze the response content

2.4 Clogging the task queue

  • Anti-crawling principle: the server generates a large number of junk urls to clog the task queue and reduce the crawler's real working efficiency

  • Solution: watch the request/response status while the crawler runs / carefully analyze the source to work out the rule that generates the junk urls, and then filter the urls (see the sketch below)
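A minimal sketch of filtering candidate urls against a whitelist pattern before they enter the task queue; the pattern below is a hypothetical rule for the pages actually wanted.

```python
import re

# Hypothetical pattern describing the detail pages the crawler actually wants.
ALLOWED_PATTERN = re.compile(r"^https://example\.com/article/\d+$")

def filter_urls(candidate_urls):
    # Keep only urls matching the known page pattern, so generated junk urls
    # never reach the task queue.
    return [u for u in candidate_urls if ALLOWED_PATTERN.match(u)]

print(filter_urls([
    "https://example.com/article/123",
    "https://example.com/article/123?session=aaaaaaaaaaaaaaaaaaaa",
]))
```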

2.5 Clogging network IO

  • Anti-crawling principle: sending a request and receiving the response is essentially a download. If a url pointing to a large file is mixed into the task queue, requesting it will tie up network IO, and in a multi-threaded crawler it will also tie up a thread

  • Solution: watch the crawler's running state / time the request threads in a multi-threaded crawler / check the url and its declared size before downloading, as in the sketch below
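A minimal sketch of checking the declared size with a streaming request and a short timeout before downloading the body, so oversized bait files do not tie up network IO or worker threads; the 5 MB threshold is an arbitrary assumption.

```python
import requests

MAX_BYTES = 5 * 1024 * 1024  # arbitrary upper bound for a single download

def safe_fetch(url):
    # stream=True defers the body download; a (connect, read) timeout keeps a
    # slow response from blocking the thread indefinitely.
    with requests.get(url, stream=True, timeout=(3, 10)) as resp:
        size = int(resp.headers.get("Content-Length", 0))
        if size > MAX_BYTES:
            return None  # skip suspiciously large files
        return resp.content
```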

2.6 Comprehensive auditing by an operations platform

  • Anti-crawling principle: an operations platform performs comprehensive management, usually with a composite anti-crawling strategy that applies several methods at the same time

  • Solution: observe and analyze carefully, run against the target website over a long period, check the data-collection speed, and handle it with a combination of methods

7 Common anti-crawling based on data encryption

1 Special processing of the data contained in the response

The usual special processing mainly refers to css data offsets / custom fonts / data encryption / data rendered as images / special encoding formats, and so on.

1.1 Anti-crawling through custom fonts (example: the desktop version of Maoyan Movie)

  • Anti-crawling idea: the site uses its own font file

  • Solution: switch to the mobile version / analyze the font file and translate the glyphs back (see the sketch below)
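A minimal sketch of reading a custom font file with fontTools, assuming the font has been downloaded as custom.woff; the glyph-to-character mapping below is hypothetical and has to be worked out by comparing the rendered page with the font's glyph table.

```python
from fontTools.ttLib import TTFont

# Assumes the site's custom font has already been downloaded as custom.woff.
font = TTFont("custom.woff")
cmap = font["cmap"].getBestCmap()  # unicode code point -> glyph name

# Hypothetical mapping built by hand after inspecting the glyphs.
glyph_to_char = {"uniE001": "3", "uniE002": "7"}

def decode(text):
    # Replace each obfuscated code point with the character its glyph renders.
    return "".join(
        glyph_to_char.get(cmap.get(ord(ch), ""), ch) for ch in text
    )
```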

1.2 Anti-crawling through css (example: the desktop version of Qunar)

  • Anti-crawling idea: the data in the source is not the real data; the real data is produced through css offsets

  • Solution: compute the css offsets


1.3 Anti-crawling through data generated dynamically by js

  • Anti-crawling principle: the data is generated dynamically by js

  • Solution: analyze the key js, work out how the data is generated, and simulate the generation process

1.4 Anti-crawling by rendering data as images

1.5 Anti-crawling through the encoding format

  • Anti-crawling principle: the default encoding format does not apply. After getting the response, a crawler usually decodes it as utf-8, and the result is then garbled or an error is raised

  • Solution: try several decoding formats based on the source, or find the real encoding format (see the sketch below)
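A minimal sketch with requests, falling back to the encoding detected from the response body when the default decoding produces garbage:

```python
import requests

response = requests.get("https://example.com", timeout=10)

# requests guesses the charset from the HTTP headers; when the site uses an
# unusual or undeclared encoding, fall back to the encoding detected from the
# body itself before reading response.text.
response.encoding = response.apparent_encoding
print(response.text[:200])
```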

Summary

  • Master the common anti-crawling methods, their principles, and the corresponding coping strategies


Origin blog.csdn.net/weixin_45293202/article/details/113952324