Thoughts on writing crawlers: how to deal with anti-crawling

After writing crawlers for many years, I still bump into anti-crawling mechanisms all the time. Most of them can be solved, but there are many kinds of anti-crawling measures, and when you run into one you haven't seen in a long while it can be frustrating: you can't immediately think of a way around it and you waste a lot of time. I've written a lot of crawlers recently and won't be writing any for a while, so while it's all still fresh, I'm recording this memo for future reference.

I wrote an overview of commonly used anti-crawler banning methods before, but that was mainly from the defender's perspective; this article is mainly from the perspective of the person writing the crawler.

Let me be clear from the start: when you run into an anti-crawling mechanism, there are really only four ways to get the data down:

  1. Add proxies

  2. Slow down

  3. Crack the interface

  4. Register more accounts

Many articles brag about high concurrency, distributed crawling, and machine learning to crack CAPTCHAs just to sound impressive. That is mostly nonsense. Rather than talking about those things, it is better to just get the data down honestly. If you insist on talking about something fancy, then monitoring matters more than anything else.

To clarify, this article is about small data-collection crawlers: you need to collect a lot of information from a few sites in a short period of time. It is not about search-engine-style full-web crawlers, which collect comprehensive information from a huge number of sites over a long period of time. (Full-web crawling, of course, does have to be highly concurrent.)

 

Why shouldn't crawlers chase concurrency?

We know that computer programs roughly fall into two categories by bottleneck: CPU-bound and IO-bound. CPU-bound tasks are computation, such as encoding and decoding; IO-bound tasks are network tasks, such as downloading or serving web requests. So which kind is a crawler? You would probably answer IO-bound, and congratulations, you are right. But that is not the point I want to make. The point is that a crawler is not merely an IO-bound task; I would rather call it an IP-bound task.

What is an IP-bound task? Following the definition above, for a crawler the real bottleneck is the number of IPs you hold. As a qualified crawler writer, you must be good at forging all kinds of HTTP headers and cracking JS-encrypted parameters, but there is exactly one thing you cannot forge: the source IP. Many things that look hard become easy once you have enough IPs, as long as the other side's underpowered little server can take the load; you don't have to rack your brains over clever strategies.

 

Why not use a ready-made framework?

As mentioned above, so-called "high concurrency" is of little use for this kind of crawler, so I have never really understood frameworks like Scrapy that use coroutines to improve concurrency. I wrote an article before on why I don't use Scrapy, so I won't go into it again here.

In addition, if you write a lot of crawlers, you will inevitably end up with your own set of tools, maybe even your own small framework, and that's fine. But I still want to make two points:

  1. Never add a feature that generates a new crawler project from a template. What happens when you fix a bug in the template? Do you go back and patch every previously generated crawler one by one?

  2. Keep the framework as simple as possible, and extract reusable functionality into separate utility functions or libraries. The framework will inevitably need to change or stop fitting at some point, and at that time the individual modules can still be reused.

 

How to approach a scraping task

Back to the topic: what should you do when handed a site that needs to be crawled?

 

Crawls with a small data volume

Start with easy mode. If the site you want to scrape has a fairly simple structure and you only need a little data, the first thing to consider is not writing a crawler at all. A JS expression with console.log in the browser console may be enough to export the data.

If you need a bit more data, opening pages and copying by hand gets tedious, so consider writing a small script. Just don't write a bare while True loop; sleeping at least time.sleep(1) after each page is the minimum courtesy you owe the other side's website. Your boss may be in a hurry for the data, but you should stay relaxed.
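For example, here is a minimal sketch of such a script using the requests library; the URL and the paging parameter are made up for illustration:

```python
import time
import requests

# A polite paging loop: one request per page, at least one second apart.
# BASE_URL and the "page" parameter are hypothetical placeholders.
BASE_URL = "https://example.com/list"

def crawl_pages(max_page=5):
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"  # plain browser-like UA
    pages = []
    for page in range(1, max_page + 1):
        resp = session.get(BASE_URL, params={"page": page}, timeout=10)
        resp.raise_for_status()
        pages.append(resp.text)   # parse the HTML however you need
        time.sleep(1)             # the minimum courtesy to the target site
    return pages

if __name__ == "__main__":
    print(f"fetched {len(crawl_pages())} pages")
```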

 

What about pages loaded dynamically in the browser?

Beginners may hit their first pitfall here: dynamic web pages. This can be good news or bad news. On a dynamic page the data is loaded by ajax, and if the ajax request has no parameter verification, things actually get simpler: you just switch from parsing HTML to parsing JSON.
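For instance, once you spot the ajax endpoint in the browser's network panel, the crawl can be as small as this (the endpoint and field names below are hypothetical):

```python
import requests

# Hit the JSON endpoint directly instead of rendering and parsing HTML.
API_URL = "https://example.com/api/items"   # found in the network panel

resp = requests.get(API_URL, params={"page": 1, "size": 20}, timeout=10)
resp.raise_for_status()
data = resp.json()

for item in data.get("list", []):           # field names depend on the site
    print(item.get("id"), item.get("title"))
```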

The other situation is that the interface requires parameter verification. There are two ways to handle it:

  1. If you just need the data, drive a real browser and you are done.

  2. If you feel the browser takes up too many resources, you will often need to crack the interface, which requires some JS reverse-engineering skill.

Some websites even put restrictions on the browser itself: they detect whether it is a normal user's browser or an automated one under program control. That is not a big problem either; it comes down to forging the webdriver and navigator objects. I also wrote a separate article on how to do the forging.
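As an illustration of the idea (not the exact method from that article), here is a sketch with Selenium 4 and Chrome that hides the most obvious fingerprint, navigator.webdriver; real sites may check many more properties:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome not to advertise that it is automation-controlled.
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)

# Run a script before any page script, so navigator.webdriver reads as undefined.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get("https://example.com")
print(driver.execute_script("return navigator.webdriver"))  # expect None
driver.quit()
```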

You may also run into an especially easy special case: the site's update interface is fixed and the encrypted parameters contain no timestamp, so you can simply replay the request. Generally speaking, that counts as dumb luck. In most cases the interface signature covers both the search parameters and a timestamp, so changing the keyword or replaying the request won't work, and you just have to crack it properly.

Cracking the JS is usually not that hard; there are only so many commonly used digests and encryption schemes. But don't start cracking right away. Spend five minutes searching to see whether someone else has already cracked it; it may save you anywhere from half a day to several days of work. Why not?

If you do start cracking, search globally in the browser's developer tools (Opt+Cmd+F) for keywords like AES or MD5 and you may get lucky. Otherwise, set breakpoints on the ajax request and step your way to the signing function.

Once you find the signing function, if it is simple and written as a standalone function, you can extract it and call node directly to compute the parameters, or, with a bit more effort, rewrite it in Python. The tricky part is when the function is bound to the window object or the DOM; in that case you can also consider loading the whole JS file and exposing just the interface you need.
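A minimal sketch of the "call node directly" route, assuming you have saved the signing code to a hypothetical sign.js that exports a sign() function via module.exports:

```python
import subprocess

def js_sign(keyword: str, timestamp: int) -> str:
    # Load sign.js and print the signature; arguments after -e land in process.argv[1:].
    script = (
        "const {sign} = require('./sign.js');"
        "console.log(sign(process.argv[1], process.argv[2]));"
    )
    out = subprocess.check_output(
        ["node", "-e", script, keyword, str(timestamp)], text=True
    )
    return out.strip()

# signature = js_sign("some keyword", 1610000000)
```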

 

The most common problem: IP bans

As we said earlier, for a veteran crawler writer, forging headers and cracking simple encryption are basic skills. The really painful thing is getting your IP banned. That is a problem of money, and no amount of cleverness alone will solve it.

When we crawl relatively fast, the other side may ban our IP. The ban may be temporary, it may be continuous, or it may be permanent.

A permanent ban is the harshest; there is nothing to do but change the IP. What you really need to distinguish is a temporary ban from a continuous one. With a temporary ban, requests beyond the threshold simply fail and access is restored after a while, so your program logic doesn't need to change: keep requesting and data will keep coming. A continuous ban is the trap: if you keep requesting while banned, you never get out of the penalty box; you have to stop and sleep for a while. Your program logic then has to accommodate this mechanism. For a one-off script that may not matter much, but a more formal system needs to handle it explicitly.
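A minimal sketch of that "stop and sleep" logic, assuming the ban shows up as a 403 or 429 status code (the actual signal is site-specific and might be a redirect or a CAPTCHA page instead):

```python
import time
import requests

def fetch_with_backoff(url, session, max_retries=5):
    delay = 5
    for _ in range(max_retries):
        resp = session.get(url, timeout=10)
        if resp.status_code in (403, 429):   # assumed ban / rate-limit signal
            time.sleep(delay)                # stay out of the penalty box
            delay *= 2                       # back off harder each time
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"still banned after {max_retries} retries: {url}")

# resp = fetch_with_backoff("https://example.com/list?page=1", requests.Session())
```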

When we need to change IPs, we obviously can't remember to swap them by hand every few minutes; that's far too annoying. In general you need an IP pool.

Proxy IPs fall into several categories by quality and source:

  1. Relatively junky public IPs

  2. Relatively stable datacenter IPs

  3. Residential IPs

Some sites on the Internet offer free proxy IPs, presumably harvested by scanning. These IPs may be scanned and used by countless programs, so they are effectively public. By collecting and validating them you can build a proxy pool. If you are really broke, or the crawl volume is small, these IPs are usable: they are extremely slow and fail very often, but it's still better than relying on your own single exit IP.
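A rough sketch of the "collect and validate" step; the candidate addresses and test URL here are placeholders, and a real pool would also recheck its proxies periodically:

```python
import requests

CANDIDATES = ["1.2.3.4:8080", "5.6.7.8:3128"]   # from your own collector
TEST_URL = "https://httpbin.org/ip"             # any stable endpoint works

def build_pool(candidates, timeout=5):
    pool = []
    for addr in candidates:
        proxies = {"http": f"http://{addr}", "https": f"http://{addr}"}
        try:
            requests.get(TEST_URL, proxies=proxies, timeout=timeout)
            pool.append(addr)                   # answered in time, keep it
        except requests.RequestException:
            pass                                # dead or too slow, drop it
    return pool

print(build_pool(CANDIDATES))
```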

Relatively stable datacenter IPs usually cost money. If you want to grab a decent amount of data, figure on around 100 a month to start, which is sufficient for most crawls.

Some particularly paranoid sites will even check what the source IP is used for. If they see your IP comes from an Alibaba Cloud datacenter, they may block it without a second thought. That is when you need so-called residential IPs. These come from vendors who partner with apps to provide residential network exits, or you can build your own with ADSL machines that redial for new addresses. Either way the cost is very, very high, starting at 1,000 a month at the very least.

 

Accounts and CAPTCHAs

An IP is, after all, anonymous. For sites with more sensitive data, you may be required to log in before you can access anything. If you don't need much data, just run with your own account and be done with it. If each account has an access quota, or the data to crawl is particularly large, you may need to register a lot of accounts. As long as they aren't paid accounts, that is actually not a big deal (WeChat excepted). You can either:

  • Buy accounts; Weibo accounts, for example, go for around one and a half yuan each.

  • Register some yourself. There are free email services online, and also platforms for receiving SMS verification codes.

Don't rush to script the registration process at this point; you may not need that many accounts after all.

A slightly weaker restriction than requiring an account is the CAPTCHA. Plain image CAPTCHAs are no longer a real obstacle: send them to a solving service or train a model, either is easy. The more elaborate ones, click-the-target, pick-the-images, pick-the-characters and so on, are dessert after dinner; if you can't crack them, it comes down to luck. The one point I want to make here is: figure out whether the site demands a CAPTCHA on every request, or only serves one as a warning before banning the IP. If it isn't on every request, just grow the proxy pool instead of fiddling with CAPTCHAs. Really, your time is the most precious thing.

One thing that deserves special attention here: consider the legal risks. If access requires an account, that already signals the data is not public. Crawling it may harm the other side's commercial interests or violate users' privacy, so think twice before you do it.

 

Things are not that simple

If a website used only one countermeasure, analysis would be easy. Unfortunately, there is rarely such luck. A site may detect the browser's webdriver and also ban IPs, so you have to use a browser and add a proxy, and configuring a proxy for a browser can be surprisingly fiddly. Or you get going quickly with your own account's cookies, only to have the IP banned anyway; you add the proxy pool, then discover the account cannot be used from an IP other than the one it logged in from, and all you want to do is swear.
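For the simple unauthenticated case, at least, pointing a Selenium-driven Chrome at a proxy is straightforward; a sketch (the proxy address is made up, and proxies that need a username and password usually require an extension or a local forwarding proxy):

```python
from selenium import webdriver

PROXY = "1.2.3.4:8080"   # hypothetical proxy address

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")   # route all traffic via the proxy

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")   # should report the proxy's IP
print(driver.page_source)
driver.quit()
```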

There are also some truly godlike websites, Taobao or the court judgments site for example, where the data may not even be worth that much to you anyway; fortunately I will probably never have to touch them.

 

Which interface to crawl

In fact, I've found that common crawler tasks come in only a few flavors:

  1. Find one of the site's list pages and crawl all the data in it.

  2. Start from some ids or keywords of your own and crawl all the data through a query or search interface.

  3. Keep polling the site's update or recommendation interfaces so as to crawl continuously.

The first choice is always a stateless interface; in most cases the site's search interface can be used directly without logging in. If some interfaces need a login and others don't, there is nothing to think about: crawl the ones that don't. Even when logged in, plenty of IPs will still get banned, so why bother? Some detail pages or pagination may genuinely require a login, and then you have no choice but to use those interfaces; in that case, keep the logged-in exit IPs separate from the ordinary crawler's proxy pool.

You also need to determine whether the task is to crawl the full data set or incremental data. Generally speaking, a demand for the full data set is a bit suspect; argue it out with whoever is asking. Maybe they haven't thought it through and just want to hoard the data first and figure out what to do with it later. Push back on that kind of half-baked requirement instead of rushing to please them or show off your skills. If you really must crawl everything, take it slow; there is no need to hurry and bring down the other side's website. That is nothing to brag about, only something to be despised for. Besides, most sites won't let you see the full data set anyway; for example, you may only be able to page through the first 500 pages, in which case the only way to get more is to subdivide the query conditions. Incremental crawls are generally easier: as in case 3 above, just poll the update or recommendation interface on a schedule.
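For the 500-page cap, one common workaround is to subdivide the query by a condition such as date, so each slice stays under the limit. A rough sketch, where fetch_page() is a placeholder for your real request-and-parse code:

```python
import time
from datetime import date, timedelta

def fetch_page(day, page):
    # Placeholder: call the search API filtered to `day`, at page `page`,
    # and return the parsed items (empty list when the slice is exhausted).
    return []

def crawl_by_day(start, end, max_page=500):
    day = start
    while day <= end:
        for page in range(1, max_page + 1):
            items = fetch_page(day, page)
            if not items:            # this day's results are exhausted
                break
            time.sleep(1)
        day += timedelta(days=1)

# crawl_by_day(date(2021, 1, 1), date(2021, 1, 31))
```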

Want to crawl an App instead? Generally, websites still have enough data these days, unless the target only has an App and no website, in which case there is no alternative. The approach to cracking an App is in principle the same as cracking JS, but many apps are packed, or have their encryption written in C, which is simply not in the same league as JS. I know basically nothing about App cracking myself; I make my living writing web crawlers.

Be thankful that what you need to write is only a crawler and not a posting bot; a site's defenses against that kind of spam are certainly more elaborate than its anti-crawling.

Once more, with special emphasis: cracking someone else's App may be illegal and can carry legal liability.

 

Finally, to summarize

So in summary, when facing a website the things to think through are:

  1. Estimate how much data you need and by when, and work out how much you actually have to crawl per second. Don't design an architecture first; 80% of it will never be used.

  2. If the required rate is low, just run slowly with time.sleep(5); in other words, try not to trigger the ban at all.

  3. Look for a public interface or page that works without logging in, point the proxy pool at it, and don't overthink the rest.

  4. If one interface already gives you the data, don't hit additional interfaces; minimize the number of requests.

  5. If the JS can be cracked quickly, crack it; if it is complicated, just use a browser, which sidesteps the problem by posing as a real client.

  6. Where login is required, remember that cookies can be invalidated when used from a different location. It is best to give each account a dedicated, high-quality IP, with a routing rule that ensures each cookie always goes out through the same IP (see the sketch below).
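A minimal sketch of such a routing rule with requests: each account's session is pinned to one proxy, so its cookies never show up from a different IP. The account names and proxy addresses are made up:

```python
import requests

ACCOUNT_PROXY = {
    "account_a": "http://10.0.0.1:8080",
    "account_b": "http://10.0.0.2:8080",
}

def session_for(account, cookies):
    proxy = ACCOUNT_PROXY[account]           # every request for this account uses this exit
    s = requests.Session()
    s.cookies.update(cookies)
    s.proxies.update({"http": proxy, "https": proxy})
    return s

# s = session_for("account_a", {"sessionid": "..."})
# s.get("https://example.com/profile")
```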

In short, solve one problem at a time; don't trigger two anti-crawling mechanisms at once, or you will end up playing whack-a-mole.

That's it. The core point of this article: the simplest, crudest solution is to add an IP pool, and if one isn't enough, add two. There is no problem that can't be solved by throwing money at it. Many readers may scoff: you call that a crawler, with no distributed system, where the most interesting reverse engineering is something whose answer you can just copy off the Internet? How boring! Unfortunately, that is the real world. For the business, what matters about a crawler is that you get useful data, not the code you write. Isn't it better to spend that time going home to your family?

 

References

This article has no references; it is pure stream of consciousness. It mentions a few articles I wrote before, which I haven't bothered to link; if you're interested, you can find them on my blog or official account.

PS: Monitoring really matters. The thing a crawler fears most is running along happily while the target site has been redesigned or has added new anti-crawling; only monitoring lets you catch that in time. For monitoring I highly recommend Prometheus; see my earlier article.
