A brief look at distributed crawler systems in practice

Issues raised:
crawler maintenance is a problem: when a site changes, the fetching and parsing code has to be redeveloped;
the crawl cycle is also a problem, because different websites need to be crawled at different intervals;
crawlers get their IPs banned; a proxy pool is one solution, and it is best to also support multi-machine deployment;
crawlers also need to support multi-threading;
we need a unified management system, ideally one that can manage many kinds of crawlers and does not have to change when a new crawler is added;

Solution:
the management system is only responsible for scheduling tasks and saving the results;
the crawler terminal is responsible for obtaining tasks and assigning them, and supports many kinds of crawlers;
the crawler itself focuses on data acquisition and parsing, and is also responsible for generating its scheduled tasks;
after this split, adding a new crawler only requires adding it to the system; the management system and the terminal stay essentially unchanged;
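
One way to picture this split is a small registry that the terminal uses to look up crawlers by task type. This is only a sketch under assumptions: the names Crawler, REGISTRY, register, NewsCrawler and execute are hypothetical and do not come from the original system, and the run method returns a placeholder instead of actually fetching a page.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Crawler(ABC):
    """Hypothetical base class; each site gets one subclass."""

    task_type: str = ""

    @abstractmethod
    def generate_tasks(self, params: dict) -> List[dict]:
        """Turn saved task parameters into concrete tasks."""

    @abstractmethod
    def run(self, task: dict) -> dict:
        """Fetch and parse one task, returning the result."""


# The terminal only knows this mapping, never a concrete crawler class.
REGISTRY: Dict[str, Type[Crawler]] = {}


def register(cls: Type[Crawler]) -> Type[Crawler]:
    """Class decorator that adds a crawler to the registry."""
    REGISTRY[cls.task_type] = cls
    return cls


@register
class NewsCrawler(Crawler):
    """Illustrative crawler for a made-up paginated news list."""

    task_type = "news"

    def generate_tasks(self, params: dict) -> List[dict]:
        return [
            {"type": self.task_type, "url": f"{params['base_url']}?page={page}"}
            for page in range(1, params["pages"] + 1)
        ]

    def run(self, task: dict) -> dict:
        # A real crawler would download and parse task["url"] here.
        return {"url": task["url"], "status": "parsed"}


def execute(task: dict) -> dict:
    """What a terminal does: pick the crawler by task type and run it."""
    crawler = REGISTRY[task["type"]]()
    return crawler.run(task)
```

With this shape, supporting a new site means writing one more decorated subclass; execute and the registry are left as they are.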

Method:
in the data structures, add a task parameter type; the crawler uses this type to decide how to generate tasks from the parameters;
each set of task parameters is saved on the management server; when a crawler generates tasks, it first fetches the task parameters from the management server;
it then submits the generated tasks to the management server, which distributes them to the terminals for execution;
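
The flow above can be sketched in memory as follows. Everything here is an assumption made for illustration: the ManagementServer class, the "list_pages" and "single_url" parameter types, and the round-robin dispatch are hypothetical, and a real deployment would put this behind HTTP handlers and persistent storage.

```python
from collections import deque
from itertools import cycle
from typing import Deque, Dict, List, Tuple


class ManagementServer:
    """Hypothetical in-memory stand-in for the management server."""

    def __init__(self, terminals: List[str]) -> None:
        self.params: Dict[str, dict] = {}      # task type -> saved parameters
        self.queue: Deque[dict] = deque()      # tasks waiting for dispatch
        self._terminals = cycle(terminals)     # round-robin over terminals

    def save_params(self, task_type: str, params: dict) -> None:
        self.params[task_type] = params

    def get_params(self, task_type: str) -> dict:
        return self.params[task_type]

    def submit(self, tasks: List[dict]) -> None:
        self.queue.extend(tasks)

    def dispatch(self) -> List[Tuple[str, dict]]:
        """Assign every queued task to a terminal, in turn."""
        assignments = []
        while self.queue:
            assignments.append((next(self._terminals), self.queue.popleft()))
        return assignments


def generate_tasks(task_type: str, params: dict) -> List[dict]:
    """Crawler side: the parameter type decides how tasks are generated."""
    if task_type == "list_pages":
        return [
            {"type": task_type, "url": f"{params['base_url']}?page={page}"}
            for page in range(1, params["pages"] + 1)
        ]
    if task_type == "single_url":
        return [{"type": task_type, "url": params["url"]}]
    raise ValueError(f"unknown task type: {task_type}")


server = ManagementServer(["terminal-1", "terminal-2"])
server.save_params("list_pages", {"base_url": "http://example.com/list", "pages": 3})

# 1. fetch parameters, 2. generate tasks, 3. submit, 4. dispatch
tasks = generate_tasks("list_pages", server.get_params("list_pages"))
server.submit(tasks)
assignments = server.dispatch()
```

Round-robin is used only because it is the simplest policy to show; the original post does not say how tasks are distributed among terminals.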

 

A version that runs as designed was put together with Ant Design Pro + Tornado.

Origin www.cnblogs.com/van28/p/11244757.html