Concluding remarks: the path from crawler novice to crawler master

If you have reached this lesson, congratulations: you have completed all the content of this column. Crawler knowledge is broad and tangled, and I am sure you have been through plenty of ups and downs along the way.

In this lesson, we will summarize what a web crawler engineer needs to learn. This is also the technology stack that I personally think a crawler engineer should have. Given the limited space of the column, it is of course impossible to cover every knowledge point, but the fundamentals are all here. Below I will organize the knowledge map of web crawlers; if you want to study further, you can use it as a reference.

Learning web crawling touches computer networking, programming fundamentals, front-end development, back-end development, App development and reverse engineering, network security, databases, operations and maintenance, machine learning, data analysis, and more. Like a big web, it links most of today's mainstream technology stacks together. Because it spans so many directions, the learning path is also scattered and messy.

1. Basic crawlers

The most basic websites often have no anti-crawling measures at all. Take a blog site: if we want to crawl the whole site, we can follow the list pages to the article pages and then extract the article's publish time, author, body text, and other information.

How do we write the code? Python's requests and similar libraries are enough: write the basic logic, fetch the source code of an article, extract what you want with XPath, BeautifulSoup, PyQuery, regular expressions, or even crude string matching, and write the result to a text file.
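A minimal sketch of this kind of basic crawl; the URL and the CSS selectors here are hypothetical and depend entirely on the actual page structure:

```python
import requests
from pyquery import PyQuery as pq

# Hypothetical article URL; in practice you would follow links from the list pages.
url = "https://blog.example.com/post/1"
html = requests.get(url, timeout=10).text

doc = pq(html)
title = doc("h1").text()              # selectors depend on the real page layout
author = doc(".author").text()
content = doc(".article-body").text()

# Persist the result as a plain text file.
with open("article.txt", "w", encoding="utf-8") as f:
    f.write(f"{title}\n{author}\n\n{content}")
```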

The code is also very simple: a few method calls, a few loops, plus storage. In the end you can see the article saved on your computer. Of course, if you are not good at writing code or too lazy to write it, the basic visual scraping tools on the market (the various point-and-click collectors) can also grab the data through a visual interface.

If you expand a bit on the storage side, you can connect MySQL, MongoDB, Elasticsearch, Kafka, and so on to persist the data, which makes later querying and processing much more convenient.
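For example, writing the parsed article into MongoDB with pymongo might look like this; the database and collection names are made up for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["spider"]["articles"]   # hypothetical database / collection names

article = {
    "title": "Example title",
    "author": "Example author",
    "content": "Example body text...",
}
collection.insert_one(article)              # persisted; easy to query or update later
```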

In any case, efficiency aside, a website with no anti-crawling at all can be handled in the most basic way. At this point, can you say you know how to crawl? No, not even close.

2. Ajax and dynamic rendering

With the development of the Internet, front-end technology keeps evolving, and data is no longer loaded purely by server-side rendering. On many sites you will now find the data delivered through an interface, or embedded as JSON in the page and then rendered by JavaScript.

At this point, crawling with requests alone no longer works: what requests fetches is the server-rendered source code, which differs from what the browser displays, because the real data is produced by executing JavaScript. The data may come from Ajax requests, from JSON embedded in the page, or from iframe pages, but in most cases it comes through an Ajax interface.
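Once you have located such an interface in the browser's Network panel, simulating it is often just a plain JSON request. A rough sketch, with a hypothetical endpoint, parameters, and response layout:

```python
import requests

api = "https://example.com/api/articles"          # hypothetical Ajax endpoint
params = {"page": 1, "size": 20}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",          # some sites check this header
}

data = requests.get(api, params=params, headers=headers, timeout=10).json()
for item in data.get("list", []):                  # field names depend on the real interface
    print(item.get("title"), item.get("publish_time"))
```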

Therefore, in many cases you have to analyze the Ajax requests, figure out how the interfaces are called, and then simulate them in your program. But some interfaces carry encrypted parameters such as token or sign, which are not easy to reproduce. What do you do then?

One way is to dig into the website's JavaScript, study the code, work out how these parameters are constructed, and then simulate or reimplement the logic in your crawler. If you manage to reverse it, simulating the interface directly is far more efficient. This requires some JavaScript foundation. Of course, some sites' encryption logic is so strong that you may not crack it in a week and can only give up.
What if you cannot solve it, or do not want to? Then there is a simple and crude method: crawl by driving a real browser, with tools such as Puppeteer, Pyppeteer, Selenium, or Splash. The source code you get is the real rendered page, so the data is easy to extract, and you bypass the whole process of analyzing Ajax and JavaScript logic. In this way, what you can see you can crawl, and it is not difficult; and since a real browser is being driven, some of those problems simply do not arise.
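A minimal Selenium sketch of this "just render it in a real browser" approach, assuming Chrome and chromedriver are installed locally; the URL and selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                        # requires a local Chrome + chromedriver
try:
    driver.get("https://example.com/articles")     # the browser executes all the JavaScript
    html = driver.page_source                      # fully rendered source, ready for parsing
    titles = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "h2.title")]
    print(titles)
finally:
    driver.quit()
```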

But in practice this latter approach also runs into anti-crawling measures. Many websites now detect webdriver: once they see you are using Selenium or a similar tool, they refuse the connection or return no data, so for such sites you have to tackle that problem specifically.
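One common workaround, not guaranteed against every detection scheme, is to patch the navigator.webdriver flag before any page script runs, for example via the Chrome DevTools Protocol in Selenium:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
# Inject a script that runs before the page's own JavaScript on every new document.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)
driver.get("https://example.com")
```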

3. Multi-process, multi-thread, and coroutines

The situations above are fairly easy to handle with a single-threaded crawler, but there is one problem: it is slow.

Crawling is an I/O-bound task: most of the time is spent waiting for network responses. If the network is slow, you just keep waiting, yet the CPU could be doing other work during that idle time. So what do we do? Open more threads.

So in some scenarios we can add multi-processing and multi-threading. Although Python threads are constrained by the GIL, the impact on an I/O-bound crawler is small, so both multi-process and multi-thread crawling can raise the crawling speed severalfold. The corresponding libraries are threading, multiprocessing, and so on.
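A sketch of thread-based concurrency for an I/O-bound crawl, using only the standard library and requests; the URL list is a placeholder:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://example.com/page/{i}" for i in range(1, 51)]   # placeholder URLs

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

# Threads overlap the waiting time on network I/O; the GIL barely matters here.
with ThreadPoolExecutor(max_workers=10) as executor:
    for url, status in executor.map(fetch, urls):
        print(url, status)
```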

Asynchronous coroutines go even further: with aiohttp, gevent, Tornado, and the like, you can open about as much concurrency as you want, though you should still be restrained and not hammer the target website.
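A minimal asyncio + aiohttp sketch; the semaphore caps concurrency so the target site is not hammered, and the URLs and the limit are placeholders:

```python
import asyncio
import aiohttp

CONCURRENCY = 10                                     # be polite: cap in-flight requests
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]

async def fetch(session, sem, url):
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, resp.status

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, sem, u) for u in urls))
    for url, status in results:
        print(url, status)

asyncio.run(main())
```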
In short, with these tools the crawling speed goes up. But higher speed is not always a good thing: anti-crawling measures are sure to follow, blocking your IP, blocking your account, throwing up captchas, or returning fake data. So sometimes crawling at a turtle's pace turns out to be the answer?

4. Distributed crawling

Multi-threading, multi-processing, and coroutines speed things up, but it is still a single-machine crawler. To really scale, you have to rely on distributed crawlers.

What is the core of a distributed crawler? Shared resources: a shared crawl queue, shared deduplication fingerprints, and so on.

We can build the distribution on basic queues or components such as RabbitMQ, Celery, Kafka, or Redis, but when people try to implement a distributed crawler themselves, performance and scalability always run into problems. Many companies actually maintain their own in-house distributed crawler framework, which is closer to the business, and that is of course ideal.

The mainstream Python distributed crawlers today are still based on Scrapy, combined with Scrapy-Redis, Scrapy-Redis-BloomFilter, or Scrapy-Cluster. They all share the crawl queue through Redis, and sooner or later they run into memory problems. So some people also consider plugging in other message queues such as RabbitMQ or Kafka, which solves some of the problems with decent efficiency.
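With Scrapy-Redis, for example, turning an ordinary Scrapy project into one that shares its queue and dedup fingerprints through Redis is mostly a matter of settings; a sketch of the relevant part of settings.py, with a placeholder Redis address:

```python
# settings.py (excerpt): share scheduling and deduplication through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                       # keep the queue when a spider stops
REDIS_URL = "redis://localhost:6379"           # placeholder; point at your shared Redis
```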

In short, to improve crawling efficiency, distribution must be mastered.

5. Verification codes

When crawling, you will inevitably run into anti-crawling, and verification codes are one form of it. To keep crawling, you first have to solve them.

Many websites now use all kinds of verification codes. The simplest are graphic captchas: if the text is fairly regular, it can be recognized with OCR or a basic model library; wire one in and the problem is solved, with acceptable accuracy.
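A rough sketch of recognizing a simple graphic captcha with pytesseract; the binarization threshold is an arbitrary assumption and usually needs tuning per site:

```python
from PIL import Image
import pytesseract

img = Image.open("captcha.png").convert("L")        # grayscale
img = img.point(lambda p: 255 if p > 140 else 0)    # crude binarization; 140 is a guess
text = pytesseract.image_to_string(img).strip()
print(text)
```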

However, you may rarely see plain graphic captchas now; most are behavioral captchas from vendors like a certain "Yan" or a certain "Dun", and there are many abroad too, such as reCAPTCHA. For the simpler ones, such as slider captchas, you can find ways to identify the gap position, for example by image comparison or deep-learning-based recognition.

For the trajectory, you can write your own simulation of normal human behavior, adding jitter and so on. Once you have the track, how do you submit it? If you are skilled, you can analyze the captcha's JavaScript directly, feed in the track data, obtain the encrypted parameters it produces, and put those parameters straight into a form or interface call. Of course, you can also drag the slider in a simulated browser and obtain the encrypted parameters that way, or simply log in with the simulated browser and keep crawling with the resulting Cookies.
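For instance, a human-like slider track can be faked by accelerating for most of the distance, decelerating near the end, and sprinkling in some jitter; a sketch under exactly those assumptions:

```python
import random

def build_track(distance):
    """Return a list of per-step offsets that add up to roughly `distance`."""
    track, moved, speed = [], 0.0, 0.0
    while moved < distance:
        # speed up for the first 70% of the way, then slow down
        accel = random.uniform(2, 4) if moved < distance * 0.7 else -random.uniform(3, 5)
        speed = max(speed + accel, 0.5)
        step = min(speed, distance - moved)
        moved += step
        track.append(round(step + random.uniform(-0.3, 0.3), 2))  # small jitter
    return track

print(build_track(180))
```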

Of course, dragging is only one kind of captcha; there are also text selection, logical reasoning, and others. If you really do not want to solve them yourself, you can pay a captcha-solving platform to analyze and simulate them, but that costs money. Some experts choose instead to train their own deep learning models: collecting data, labeling it, and training a dedicated model for each business. With the core technology in hand, there is no need to pay a solving platform; after studying the captcha's logic, you can work out the encrypted parameters yourself. However, some captchas are so hard to reverse that I simply cannot crack them.
Of course, some captchas only pop up because your requests are too frequent; switching IPs may be enough to avoid them.

6. IP blocking

IP blocking is also a headache, and the effective remedy is switching proxies. There are many kinds of proxies on the market: free ones, and plenty of paid ones.

To start, you can use the free proxies on the market: build your own proxy pool, collect all the free proxies you can find across the web, and add a tester that keeps checking them, with the test URL set to the site you want to crawl. Proxies that pass the test can usually be used directly to crawl the target site. I built such a proxy pool myself; it aggregates several free proxy sources, crawls and tests them on a schedule, and exposes an API for fetching a proxy. It is on GitHub: https://github.com/Python3WebSpider/ProxyPool, with a Docker image and Kubernetes scripts provided, so you can use it directly.
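Using it, or any proxy pool with an HTTP fetch interface, from a crawler is straightforward; the endpoint below assumes a default local deployment of that project:

```python
import requests

PROXY_API = "http://localhost:5555/random"     # assumed default endpoint of the proxy pool

def get_proxy():
    return requests.get(PROXY_API, timeout=5).text.strip()

proxy = get_proxy()                             # e.g. "123.45.67.89:8080"
proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
resp = requests.get("https://example.com", proxies=proxies, timeout=10)
print(resp.status_code)
```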

Paid proxies work the same way. Many vendors provide an extraction interface where one request returns hundreds of proxies, and we can feed these into the proxy pool as well. But such services come in many packages, and the quality and block rate of shared proxies, dedicated proxies, and so on differ considerably.

Some vendors also build their proxies with tunneling, so you never see the proxy address and port yourself; the proxy pool is maintained on their side, for example by a certain cloud vendor. This is less work to use, but also less controllable.

There are also more stable proxies, such as dial-up proxies and cellular proxies, which cost more to access but solve the IP blocking problem to a certain extent.

7. Account blocking

Some information can only be crawled after simulated login. If you crawl too fast, the target site will simply ban your account, and there is nothing you can do about it. For example, if you crawl official accounts and your WX account gets banned, it is all over.

One solution is to slow down and control the pace. Another is to look at other endpoints, such as the mobile page, the App interface, or the wap page, and see whether login can be bypassed there.

A better method is diversion: if you have enough accounts, build a pool, such as a Cookies pool, a Token pool, or a Sign pool, put the Cookies and Tokens of many accounts into it, and pick one at random each time. If you want the overall crawl rate to stay the same, then with 100 accounts instead of 20, each account's Cookies or Token is used at only 1/5 of the original frequency, and the probability of being banned drops accordingly.
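A toy sketch of the diversion idea: keep the Cookies of many accounts in a pool (here a plain list; in practice it would live in Redis or a database and be refreshed regularly) and pick one at random per request:

```python
import random
import requests

# Hypothetical pool; in a real setup these would be maintained and refreshed in Redis.
COOKIES_POOL = [
    {"sessionid": "cookie-of-account-1"},
    {"sessionid": "cookie-of-account-2"},
    {"sessionid": "cookie-of-account-3"},
]

def fetch(url):
    cookies = random.choice(COOKIES_POOL)       # spread the load across accounts
    return requests.get(url, cookies=cookies, timeout=10)

resp = fetch("https://example.com/feed")
print(resp.status_code)
```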

8. Unusual anti-crawling tricks

The above are the more mainstream anti-crawling methods, but there are plenty of strange ones too: returning fake data, returning data as images, returning shuffled data, and so on, all of which need case-by-case analysis.

You have to be careful with these. I once saw an anti-crawler response that directly returned rm -rf /; if your script happened to execute the returned result, imagine the consequences yourself.

9. JavaScript reverse engineering

Now to the main point. As front-end technology advances and sites become more conscious of anti-crawling, many websites choose to put the effort on the front end, encrypting or obfuscating logic and code there. This is done partly to protect front-end code from being copied, and partly to block crawlers. For example, many Ajax interfaces carry parameters such as sign and token, which were discussed earlier. This kind of data can be crawled with browser automation such as Selenium, but that is generally too inefficient: it simulates the entire page rendering process when the real data may hide behind one small interface.

If we can work out the real logic behind these interface parameters and reproduce or execute it in code, the efficiency multiplies, and some of the anti-crawling mentioned above can be avoided. The catch? It is hard to pull off.
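One common pattern, once you have isolated the relevant JavaScript, is to execute it from Python instead of rewriting it, for example with PyExecJS; the JS file and function name here are hypothetical:

```python
import execjs   # PyExecJS; needs a local JS runtime such as Node.js

# sign.js is assumed to contain the extracted function, e.g. function getSign(path, ts) {...}
with open("sign.js", encoding="utf-8") as f:
    ctx = execjs.compile(f.read())

sign = ctx.call("getSign", "/api/articles", 1700000000)   # hypothetical arguments
print(sign)   # put the result into the request parameters
```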

Webpack is one part of it: the front-end code is bundled and minified into a few bundle files, variable names lose their meaning, and restoring them is not easy. Then some websites add obfuscators that turn the front-end code into something completely unreadable: string table shuffling, variables turned into hexadecimal names, control flow flattening, infinite debugger traps, console disabling, and so on, until the code and logic are unrecognizable. Some even compile the core logic with WebAssembly and similar technologies. Then you can only work through it slowly; there are some tricks, but it still takes a lot of time. Once you crack it, though, everything falls into place.

Many companies hiring crawler engineers ask whether you can do JavaScript reverse engineering and which sites you have cracked, such as a certain "treasure", a certain "number", or a certain "headline" site; if you happen to have solved one they need, they may hire you on the spot. Every site has its own logic and difficulty.

10. App

Of course, crawlers are not just about the web. As the Internet has evolved, more and more companies now put their data in apps; some even have only an App and no website at all, so the data can only be crawled through the App.

How do you crawl it? The foundation is a packet-capture tool: once Charles, Fiddler, or the like has captured the interface, you can simulate it directly.

What if the interface has encrypted parameters? One approach is to process traffic on the fly while crawling, for example using mitmproxy to listen to the interface data directly. Another is hooking: with Xposed and the like you can also grab the values.
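A minimal mitmproxy addon for the "process while capturing" approach; run it with `mitmdump -s capture.py` and point the phone's proxy at this machine (the target host below is a placeholder):

```python
# capture.py: dump JSON responses of a target App interface as they pass through mitmproxy.
import json

TARGET = "api.example.com"          # placeholder host of the App's interface

def response(flow):
    # mitmproxy calls this hook for every completed HTTP response.
    if TARGET in flow.request.pretty_url:
        try:
            data = json.loads(flow.response.get_text())
        except ValueError:
            return
        print(json.dumps(data, ensure_ascii=False)[:200])   # or push to a queue / database
```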

Then how do you automate the crawling? You cannot poke the screen by hand. There are actually many tools: Android's native adb, and Appium, which is now the more mainstream solution. Of course, other wizard-style automation tools can do it too.

Finally, sometimes you really do not want to go through the automation route; you just want to extract the interface logic itself. Then you need reverse engineering: tools like IDA Pro, jadx, and Frida come in handy. This process is as painful as JavaScript reverse engineering, and you may even have to read assembly.

11. Intelligent parsing

Once you have mastered all of the above, congratulations: you have surpassed 80% to 90% of crawler developers. Of course, those who specialize in JavaScript reverse engineering and App reverse engineering sit at the top of the food chain, and strictly speaking that goes beyond the category of crawling itself.

In addition to the above skills, in some situations we may also want to combine machine learning techniques to make our crawlers smarter.

For example, many blogs and news articles have quite similar page structures, and the information to extract is also quite similar.

For example: how do you tell whether a page is an index page or a detail page? How do you extract the detail-page links? How do you parse the content of an article page? All of these can actually be computed with algorithms.

So some intelligent parsing technologies have emerged. For detail-page extraction, for example, the GeneralNewsExtractor written by a friend of mine performs very well.
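Assuming its published interface (the gne package), using it looks roughly like this; the HTML file is a placeholder:

```python
from gne import GeneralNewsExtractor

with open("news_page.html", encoding="utf-8") as f:
    html = f.read()

extractor = GeneralNewsExtractor()
result = extractor.extract(html)                 # no hand-written XPath needed
print(result.get("title"), result.get("publish_time"))
print(result.get("content", "")[:200])
```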

Suppose you need to crawl data from 10,000 news websites. Do you really want to write XPath rules for each one? With intelligent parsing, and a tolerance for some error, the job is done in next to no time.

In short, if we master this as well, our crawler technology becomes even more powerful.

12. Operation and maintenance

This part is also a key piece: crawlers and operations are closely related. For example:

  • After writing a crawler, how to quickly deploy it to 100 hosts and run it.

  • How to flexibly monitor the running status of each crawler.

  • The crawler has some code changes, how to update it quickly.

  • How to monitor the memory and CPU consumption of some crawlers.

  • How to schedule crawler runs sensibly.

  • When a crawler has problems, how to get notified promptly and set up a sound alerting mechanism.

Everyone has their own deployment methods, such as Ansible. If you use Scrapy, there is Scrapyd, and you can add management tools for monitoring and scheduled tasks. What I use now is Docker + Kubernetes plus a DevOps pipeline, such as GitHub Actions, Azure Pipelines, or Jenkins, to release and deploy quickly.
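As one concrete example of the Scrapyd route, a deployed project can be started remotely through its HTTP API; a sketch assuming a default Scrapyd instance and hypothetical project and spider names:

```python
import requests

SCRAPYD = "http://localhost:6800"                     # default Scrapyd address
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "news_crawler", "spider": "articles"},   # hypothetical names
    timeout=10,
)
print(resp.json())    # contains the job id if scheduling succeeded
```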

For scheduled tasks, some people use crontab, some use apscheduler, some use management tools, and some use Kubernetes. In my case it is mostly Kubernetes, where scheduled tasks are also easy to set up.
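And for the apscheduler option, a minimal cron-style job might look like this; the schedule and the job body are placeholders:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_spider():
    # placeholder: kick off the crawl here, e.g. call the Scrapyd API or run a script
    print("spider started")

scheduler = BlockingScheduler()
scheduler.add_job(run_spider, "cron", hour=2, minute=0)   # every day at 02:00
scheduler.start()
```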

As for monitoring, there are many options. Dedicated crawler management tools come with monitoring and alerting features, and some cloud services also offer monitoring. I use Kubernetes + Prometheus + Grafana, where CPU, memory, and running status are clear at a glance, and alerting rules are easy to configure in Grafana, with support for Webhook, email, and even DingTalk.

For data storage and monitoring, Kafka and Elasticsearch both feel very convenient to me. I mainly use the latter, paired with Grafana, so crawl volume, crawl speed, and so on are visible at a glance.

13. Law

In addition, I hope you pay attention to legal issues while doing web crawling. The basics:

  • Don't touch personal privacy information.

  • Avoid unfair commercial competition and check the target site's terms and legal restrictions.

  • Limit the concurrency speed and do not affect the normal operation of the target site.

  • Don't touch illegal products, pornography, gambling and drugs.

  • Do not casually publicize or spread cracking schemes for a target site or App.

  • Be especially cautious with non-public data.

For more information, you can refer to these articles:
https://mp.weixin.qq.com/s/aXr-ZE0ZifTm2h5w8BGh_Q

https://mp.weixin.qq.com/s/zVTMQz78L16i7j8wXGjbLA

https://mp.weixin.qq.com/s/MrJbodU0tcW6FRZ3JQa3xQ

14. Conclusion

That more or less covers the knowledge points involved in crawling. Looking back over the list, you can see it touches computer networking, programming fundamentals, front-end development, back-end development, App development and reverse engineering, network security, databases, operations and maintenance, and machine learning. The summary above can be regarded as the path from crawler novice to crawler master. Each direction holds many points worth researching, and going deep on any single one of them is remarkable in itself.

Finally, thank you for taking my course, and I hope you gained something along the way.
