Implementing a comprehensive crawler by modifying the Scrapy source code for spider distribution

I have recently been working on a project with the following requirements:
1. Crawl about a hundred kinds of webpages, several hundred pages in total, where each kind has its own crawl frequency: some pages are crawled once a day, others once a week.
2. The content to be crawled changes over time, that is, what needs to be crawled is adjusted according to the current requirements.
Given these requirements, the work has to be split up and run on multiple servers. With the Scrapy framework this would mean several hundred spiders, and once multiple servers are involved the question becomes how to keep every server busy. If you simply run different spiders on different servers without any coordination, the spiders on some servers may run at full capacity while those on other servers sit idle, so there is no load balancing. To get load balancing you would have to arrange the servers in a master-slave structure, start these hundreds of crawlers together, and have the master assign tasks to the slave servers; maintaining that assignment is very error-prone.
Another approach is to start these hundreds of spiders on different servers at the same time and leave them all listening. However, a scrapy_redis spider does not shut down automatically by default: even when there are no requests left, the spider keeps listening and has to be closed by hand, and maintaining the starting and stopping of hundreds of crawlers is very troublesome.

My solution is to merge these hundreds of page crawlers into one comprehensive crawler (the main crawler).
The project then starts only one crawler. For the URLs that do not need to be crawled, we only adjust our configuration file and comment out the sub-crawlers that should not run, so that they are never started. A minimal sketch of such a configuration file follows.
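A minimal sketch of what such a configuration file could look like; the module name spider_config.py, the package name crawlers, and the class names are hypothetical and only illustrate the idea:

```python
# spider_config.py -- hypothetical configuration for the main crawler.
# Each entry is the dotted path of one sub-crawler (parser) class that lives
# in a separate package; commenting out a line disables that sub-crawler.
ENABLED_PARSERS = [
    "crawlers.news_site.NewsParser",      # crawled once a day
    "crawlers.blog_site.BlogParser",      # crawled once a week
    # "crawlers.forum_site.ForumParser",  # not needed at the moment, disabled
]
```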

Doing this solves the problems described above:

First, across all the servers we run only one crawler, so starting and stopping it is very easy to manage.
Second, every server runs the same single crawler, and they all fetch requests from one shared queue, so the servers are load balanced automatically (see the settings sketch after this list).
Third, crawler management is convenient: to crawl a certain kind of webpage we only need to register the corresponding class in the configuration file and comment out the classes we do not need, and the main crawler then crawls exactly what is required.
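For reference, the standard scrapy_redis settings that make every server pull from the same Redis queue look roughly like this (the Redis address is a placeholder):

```python
# settings.py -- shared-queue configuration using standard scrapy_redis settings.
# Every server runs the same main crawler and takes its requests from one Redis
# queue, so the load spreads across the servers automatically.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"     # placeholder Redis address
```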

However, the Scrapy framework we are using treats a single spider as the basic unit by default: every crawler is just one class file, and all of its parsing functions live in that class. So we need to change the Scrapy source code and move the parsing classes of the crawler into a separate package for easier maintenance. This is the most important piece of work, and it is what requires modifying the Scrapy source.
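The actual modification of the Scrapy source is not reproduced here. As a rough illustration of the idea only, the sketch below shows a single main spider that loads the enabled parser classes (from the hypothetical spider_config shown earlier) out of a separate package and delegates requests and parsing to them:

```python
# main_spider.py -- a rough sketch of the "main crawler" idea, assuming the
# hypothetical spider_config.ENABLED_PARSERS list shown above. It illustrates
# routing work to parser classes kept in a separate package; the real project
# achieves this by changing the Scrapy source instead.
from importlib import import_module

import scrapy

from spider_config import ENABLED_PARSERS


def load_class(dotted_path):
    """Import a class from a dotted path such as 'crawlers.news_site.NewsParser'."""
    module_path, class_name = dotted_path.rsplit(".", 1)
    return getattr(import_module(module_path), class_name)


class MainSpider(scrapy.Spider):
    name = "main"

    def start_requests(self):
        # Each enabled parser contributes its own start requests; its parsing
        # callbacks stay inside its own module in the separate package.
        for dotted_path in ENABLED_PARSERS:
            parser = load_class(dotted_path)()
            for request in parser.start_requests(self):
                yield request
```

Under this sketch, each parser class only needs to provide a start_requests(spider) method that yields scrapy.Request objects whose callbacks point at its own parse methods, as sketched in the next section.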

The crawled websites mainly fall into the following two situations:
First, full-site crawling: I give the crawler a start URL together with a set of rules, and the crawler follows those rules to crawl the whole site starting from that URL.
Second, the site cannot be crawled as a whole: URLs can only be constructed from given IDs or other data, and the pages are then crawled through these URLs (both cases are sketched below).
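Sketches of the two kinds of sub-crawler classes, using the hypothetical parser interface from the previous sketch; the URLs, IDs, and CSS selectors are placeholders:

```python
# crawlers/examples.py -- hypothetical sub-crawler classes for the two cases.
import scrapy


class FullSiteParser:
    """Case 1: whole-site crawling from a start URL, following rule-like link patterns."""
    start_url = "https://example-news.com/"    # placeholder start URL
    follow_pattern = "/article/"               # rule: only follow article links

    def start_requests(self, spider):
        yield scrapy.Request(self.start_url, callback=self.parse)

    def parse(self, response):
        # Extract the data of interest on this page (placeholder selectors).
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow in-site links that match the rule, so the whole site gets crawled.
        for href in response.css("a::attr(href)").getall():
            if self.follow_pattern in href:
                yield response.follow(href, callback=self.parse)


class IdBasedParser:
    """Case 2: the site cannot be crawled as a whole; URLs are built from known IDs."""
    url_template = "https://example-shop.com/item/{}"  # placeholder URL template
    item_ids = [1001, 1002, 1003]                      # placeholder IDs

    def start_requests(self, spider):
        for item_id in self.item_ids:
            yield scrapy.Request(self.url_template.format(item_id), callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "price": response.css(".price::text").get()}
```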

Other functions:
1. Define a Logger class so that it can simply be called from anywhere else in the project.
2. Define some other helper methods, for example one that closes the crawler after the crawler queue has had no data for a certain amount of time.
3. Send me an email when the crawler starts, closes, or raises an exception, which makes monitoring the project much easier (a sketch covering points 2 and 3 follows this list).
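A rough sketch of an extension covering points 2 and 3, built only on Scrapy's documented signals and MailSender; the extension name, the idle threshold, and the mail address are assumptions, not taken from the original project:

```python
# extensions.py -- hypothetical monitoring extension for the main crawler:
# it mails on start/close/error and closes the spider once the request queue
# has stayed empty for a while (the spider_idle signal keeps firing every few
# seconds while the engine is idle).
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.mail import MailSender


class MonitorExtension:
    def __init__(self, crawler, max_idle_count=12, to_addr="me@example.com"):
        self.crawler = crawler
        self.mailer = MailSender.from_settings(crawler.settings)
        self.max_idle_count = max_idle_count   # assumed idle threshold
        self.to_addr = to_addr                 # placeholder address
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        return ext

    def spider_opened(self, spider):
        self.mailer.send(to=[self.to_addr], subject="spider started", body=spider.name)

    def spider_closed(self, spider, reason):
        self.mailer.send(to=[self.to_addr], subject="spider closed", body=reason)

    def spider_error(self, failure, response, spider):
        self.mailer.send(to=[self.to_addr], subject="spider error", body=repr(failure))

    def request_scheduled(self, request, spider):
        # New work arrived, so reset the idle counter.
        self.idle_count = 0

    def spider_idle(self, spider):
        # The queue is empty; keep the spider alive until it has been empty long enough.
        self.idle_count += 1
        if self.idle_count < self.max_idle_count:
            raise DontCloseSpider
        self.crawler.engine.close_spider(spider, "queue stayed empty")
```

Such an extension would then be registered through the EXTENSIONS setting (for example EXTENSIONS = {"myproject.extensions.MonitorExtension": 500}), and the usual MAIL_HOST / MAIL_FROM settings would have to be configured for MailSender to deliver anything.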

The full implementation of the project is fairly complex; the complete code is available at the project address.
