How to improve the efficiency of crawlers?

Web crawlers have become an indispensable skill for practitioners in all walks of life. Whether they work in technology, product, data analysis, finance, or a cold-start startup, they all want to use crawlers to collect data. For large-scale crawlers, the core issue is efficiency: getting more data in less time is the top priority of crawler optimization. So what should you do? Apocalypse IP teaches you four tricks!

1. Minimize the number of visits.

Most of the time spent on a single crawl task goes to waiting for network responses, so whenever a request can be avoided, avoid it. Fewer requests not only reduce the load on the target website, they also ease the burden on the proxy servers and cut your own workload, improving overall efficiency.
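As a rough illustration of this idea, here is a minimal sketch (not from the original article) that reuses one HTTP session for keep-alive connections and caches responses so the same URL is never fetched twice; the URL is a placeholder:

```python
# Reuse one session (persistent connections) and cache responses so
# each URL is requested from the network at most once.
import requests

session = requests.Session()     # reuses TCP connections across requests
_response_cache = {}             # url -> page body

def fetch_once(url):
    """Return the page body, hitting the network only the first time."""
    if url not in _response_cache:
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        _response_cache[url] = resp.text
    return _response_cache[url]

# Extract everything you need from a single fetch instead of
# requesting the same page again for each field.
html = fetch_once("https://example.com/item/123")
```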

2. Streamline the process and reduce duplication.

Most websites are not strictly non-intersecting tree structures but networks with many cross-links, so pages reached from different entrances will overlap heavily. Uniqueness is usually judged by URL or ID: a page that has already been crawled does not need to be crawled again. Likewise, if some data can be obtained either from one page or spread across several pages, choose to obtain it from the single page.
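A simple sketch of that uniqueness check, under the assumption that a normalized URL is a good enough key (the set could equally be Redis or a Bloom filter in a larger crawler):

```python
# Skip any URL whose normalized form has already been crawled.
from urllib.parse import urlsplit, urlunsplit

seen = set()

def normalize(url):
    parts = urlsplit(url)
    # drop the query string and fragment for uniqueness purposes
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def should_crawl(url):
    key = normalize(url)
    if key in seen:
        return False      # already reached from another entrance
    seen.add(key)
    return True

for link in ["https://example.com/a?ref=home", "https://example.com/a#top"]:
    print(link, should_crawl(link))   # the second one is skipped
```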

3. Multi-threaded tasks.

Crawling is largely an I/O-bound task, so running requests concurrently with multiple threads can effectively improve the overall speed. Multithreading also makes better use of resources and keeps the program responsive while it waits on the network.
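A minimal multi-threaded sketch using only the standard library and requests; the URLs and worker count are placeholders, and in practice you would tune `max_workers` to what the target site and proxy pool can tolerate:

```python
# Because the work is mostly waiting on network I/O, a thread pool
# lets many requests wait concurrently instead of one after another.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = [f"https://example.com/page/{i}" for i in range(100)]

def fetch(url):
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for fut in as_completed(futures):
        url, status = fut.result()
        print(url, status)
```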

4. Distributed tasks.

Even when the three points above are pushed to the limit, the number of pages a single machine can crawl per unit time may still not be enough to finish the job on schedule. In that case the only option is to have multiple machines crawl at the same time; this is a distributed crawler. For example, with 1,000,000 pages to crawl, 5 machines can each crawl 200,000 pages that do not overlap, cutting the total time to roughly one-fifth of what a single machine would need.
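One simple way to split the work without overlap, sketched below under assumed names: each machine gets an index and only crawls the URLs whose hash falls into its bucket. `MACHINE_ID` is hypothetical and would come from configuration or an environment variable; a shared task queue (for example in Redis) is another common design.

```python
# Partition the URL space by hash so the five machines never crawl
# the same page twice.
import hashlib

NUM_MACHINES = 5
MACHINE_ID = 0    # set to 0..4, one value per machine (assumed config)

def belongs_to_me(url):
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES == MACHINE_ID

all_urls = (f"https://example.com/page/{i}" for i in range(1_000_000))
my_urls = (u for u in all_urls if belongs_to_me(u))
# ...feed my_urls into the single-machine crawler from the sections above
```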

Doing the above four things will substantially improve crawler efficiency: it reduces the workload and saves time, and it also makes you less likely to trigger anti-crawler measures.
