Crawler getting blocked? Tianqi IP teaches you a few tricks to crack it!

With proxy IPs now so widespread, crawlers have become our go-to tool for collecting data. But if we don't use a proxy IP, our crawling will often run into all kinds of restrictions that interrupt our progress. Is there any way around them? Let's take a look together with Tianqi IP~

1. Verification code
On many websites, once the request volume gets large you will run into CAPTCHAs. The much-criticized 12306 CAPTCHA, for example, exists largely to block illegitimate requests.
For image CAPTCHAs, the picture can often be recognized with OCR; plenty of experts have shared code for this on GitHub that is worth a look.
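
A minimal OCR sketch, assuming the pytesseract and Pillow packages are installed (plus the Tesseract binary on your PATH) and that the CAPTCHA is a plain image without heavy distortion; the file name is a placeholder:

```python
# Minimal OCR sketch for simple image CAPTCHAs.
# Assumes: pip install pillow pytesseract, and the Tesseract binary installed.
# Heavily distorted CAPTCHAs usually need preprocessing or a trained model instead.
from PIL import Image
import pytesseract

def read_captcha(path: str) -> str:
    img = Image.open(path).convert("L")                 # convert to grayscale
    img = img.point(lambda p: 255 if p > 140 else 0)    # crude binarization to drop noise
    return pytesseract.image_to_string(img).strip()

if __name__ == "__main__":
    print(read_captcha("captcha.png"))  # hypothetical local file
```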

2. Headers restriction
This is probably the most common and most basic anti-crawler technique: the site mainly checks whether the request is coming from a real browser.
It is usually easy to deal with; just copy the headers from your browser and you are good to go.
Note that many websites only check the User-Agent, but some verify other fields as well. On Zhihu, for example, some pages also require authorization information. Which headers need to be added still takes some trial and error, and fields such as Referer and Accept-Encoding may also be required.
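
A minimal sketch with the requests library; the header values below are placeholders that you would replace with whatever your own browser actually sends (F12 → Network panel):

```python
# Copying browser headers onto a request; values below are placeholders --
# replace them with what your own browser sends for the target site.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # copied from the browser
    "Referer": "https://www.example.com/",   # some sites check where you came from
    "Accept-Encoding": "gzip, deflate",
    # "Authorization": "Bearer <token>",     # only if the page requires login info
}

resp = requests.get("https://www.example.com/page", headers=headers, timeout=10)
print(resp.status_code, len(resp.text))
```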

3. Returning forged information
This is a real case of programmers making life hard for other programmers, and the anti-crawler engineers have gone to great lengths here. On one hand it stops the real data from being scraped at scale; on the other it adds to your data-processing burden afterwards. If the data is forged well, you may not even notice that you are crawling fake data, and then it all comes down to how carefully you clean the data later.
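
There is no universal fix, but a few cheap sanity checks during cleaning can flag suspicious records early. A rough, hypothetical sketch; the field names and thresholds are made up for illustration:

```python
# Hypothetical sanity checks for spotting possibly-forged records after a crawl.
# Field names ("price", "title") and the threshold are illustrative only.
def looks_suspicious(record: dict) -> bool:
    price = record.get("price")
    if not isinstance(price, (int, float)) or not (0 < price < 1_000_000):
        return True                                   # missing or out-of-range value
    if record.get("title", "").strip() == "":
        return True                                   # empty required field
    return False

def filter_records(records: list) -> list:
    clean = [r for r in records if not looks_suspicious(r)]
    print(f"kept {len(clean)} of {len(records)} records")
    return clean
```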

4. Reducing the returned information
The most basic form is hiding the true amount of data, so that new information only appears as you keep loading more. Some sites go even further and only ever show you a portion of the data; since nobody can see the full set, the crawler cannot do much about it. CNKI, for example, returns only a limited amount of content per search. There seems to be no good workaround for this, but such stingy websites are a minority after all, because the approach sacrifices part of the real user experience.

5. Dynamic loading
Asynchronous loading serves anti-crawling on one hand, and on the other hand gives web browsing a different experience and enables more features. Many dynamic websites use ajax or JavaScript to load the requested page.
When you hit a dynamically loaded page, you need to analyze the ajax requests. Usually you can find the JSON response that directly contains the data we want.
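
A minimal sketch, assuming you have already located the JSON endpoint in the browser's Network panel (filter by XHR/Fetch); the URL and field names here are placeholders:

```python
# Requesting a dynamically loaded page's JSON endpoint directly.
# The endpoint URL and response keys are placeholders found via the Network panel.
import requests

api_url = "https://www.example.com/api/list?page=1"   # hypothetical ajax endpoint
headers = {"User-Agent": "Mozilla/5.0 ..."}           # copied from the browser

data = requests.get(api_url, headers=headers, timeout=10).json()
for item in data.get("items", []):                    # "items" is an assumed key
    print(item.get("title"))
```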
If the site encrypts its data files, you can use the selenium + PhantomJS framework to drive a browser engine and let PhantomJS execute the JavaScript, simulating human operations and triggering the scripts on the page. In theory, selenium is the more universal crawler solution, because what it produces really is genuine user behavior, unless the site's anti-crawler measures are so strict that they would rather block real users by mistake.
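
A minimal Selenium sketch. Note that recent Selenium releases have dropped PhantomJS support, so this example swaps in headless Chrome, which plays the same role of executing the page's JavaScript (Chrome and chromedriver must be available); the URL is a placeholder:

```python
# Driving a real browser engine so the page's JavaScript actually runs.
# PhantomJS is no longer supported by recent Selenium versions, so this
# sketch uses headless Chrome instead (requires Chrome + chromedriver).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # run without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.example.com/dynamic-page")  # hypothetical URL
    html = driver.page_source                           # HTML after JS has executed
    print(len(html))
finally:
    driver.quit()
```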

6. IP restriction
Restricting by IP is the original motivation behind many sites' anti-crawler measures. Some people just write a loop and crawl by brute force, which really does put a heavy load on the site's servers, and visits that frequent are obviously not genuine user behavior, so the site simply and decisively blocks you.
In this case, you can play by the rules and slow your crawling down, pausing for a few seconds after each request.
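
A minimal throttling sketch; the URLs and delay range are placeholders:

```python
# Slowing down: pause a few seconds between requests so the traffic
# looks less like a brute-force loop. Delay and URLs are illustrative.
import random
import time
import requests

urls = [f"https://www.example.com/page/{i}" for i in range(1, 6)]  # hypothetical pages

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(2, 5))   # wait 2-5 seconds before the next request
```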
Of course, you can also get around this restriction by changing IPs constantly. You can build an IP pool yourself and switch IPs after crawling a certain amount, or use IP-switching software such as Tianqi IP to handle it for you.
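
A rough sketch of rotating through a small proxy pool with requests; the proxy addresses are placeholders you would fill with your own pool or with IPs from a provider:

```python
# Rotating proxy IPs with requests. The proxy addresses below are placeholders;
# in practice they come from your own pool or a proxy provider.
import itertools
import requests

proxy_pool = [
    "http://111.111.111.111:8080",   # placeholder proxies
    "http://122.122.122.122:8080",
    "http://133.133.133.133:8080",
]
rotation = itertools.cycle(proxy_pool)

def fetch(url: str) -> requests.Response:
    proxy = next(rotation)
    # requests expects a scheme -> proxy mapping
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

resp = fetch("https://www.example.com/")   # hypothetical target
print(resp.status_code)
```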
