Pausing and resuming Scrapy spiders

Every Scrapy spider can be paused. While running, it records which URLs it is crawling; when restarted from the paused state, it continues with the URLs it has not yet crawled instead of starting over.
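For concreteness, the examples below use a spider named cnblogs, matching the commands later in the post. The spider itself is a hypothetical sketch; any spider in the project behaves the same way:

    import scrapy

    class CnblogsSpider(scrapy.Spider):
        name = "cnblogs"  # the name used with `scrapy crawl cnblogs`
        start_urls = ["https://www.cnblogs.com/"]

        def parse(self, response):
            # record the visited page
            yield {"url": response.url}
            # queue further pages, so there is pending work to pause and resume
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)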

How to record state so a crawl can be paused and restarted:

Method one:

1. First, cd into the Scrapy project directory (you can also run the spider from a Python script directly in PyCharm; see the sketch after this list).

2. Inside the Scrapy project, create a folder to hold the state information.

3. Execute the command:

    scrapy crawl <spider name> -s JOBDIR=<path to state folder>

for example:

    scrapy crawl cnblogs -s JOBDIR=zant/001

This starts the specified spider and records its state in the specified directory.
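If you run spiders from a script instead (as mentioned in step 1), the JOBDIR setting can be passed programmatically. A minimal sketch using Scrapy's CrawlerProcess; the spider name cnblogs and the path zant/001 match the command above, and run.py is an arbitrary file name:

    # run.py - start the cnblogs spider with a JOBDIR, from Python
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()   # load the project's settings.py
    settings.set("JOBDIR", "zant/001")  # same effect as -s JOBDIR=zant/001

    process = CrawlerProcess(settings)
    process.crawl("cnblogs")            # spider name, as with `scrapy crawl`
    process.start()                     # blocks until the crawl stops

Setting JOBDIR on the settings object mirrors the -s command-line override, so the state folder works the same either way.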

Once the spider has been launched, we can press Ctrl+C on the keyboard to stop it (press it only once, so the spider can shut down cleanly). After stopping, look in the state folder: it now contains three new entries. The p0 file inside the requests.queue folder is the log of pending URLs; as long as this file exists, there are unfinished URLs, and it is deleted automatically once all URLs have been crawled.

When we re-execute the command scrapy crawl cnblogs -s JOBDIR=zant/001, the spider picks up where it left off, continuing from the requests recorded in the p0 file.
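You can also check for pending work yourself by looking for that p0 file. A minimal sketch, assuming the zant/001 state folder from above:

    import os

    jobdir = "zant/001"  # the state folder used in the commands above
    p0 = os.path.join(jobdir, "requests.queue", "p0")

    if os.path.exists(p0):
        print("p0 exists: unfinished URLs remain; rerun with the same JOBDIR to resume")
    else:
        print("no p0 file: the previous crawl finished (or recorded no state)")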

 

Method two:

Add the following line to the settings.py file:

JOBDIR = 'sharejs.com'

Then run the spider with the plain command scrapy crawl <spider name>; Scrapy will automatically create a sharejs.com directory and record the work state into that folder.
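A per-spider variant of the same idea (standard Scrapy, though not shown in the original post): a spider can carry its own JOBDIR via custom_settings, so different spiders do not share one state folder. The path below is just an example:

    import scrapy

    class CnblogsSpider(scrapy.Spider):
        name = "cnblogs"
        start_urls = ["https://www.cnblogs.com/"]

        # per-spider alternative to a global JOBDIR in settings.py;
        # the path is an example
        custom_settings = {"JOBDIR": "crawls/cnblogs-001"}

        def parse(self, response):
            yield {"url": response.url}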

 


Origin: www.cnblogs.com/songzhixue/p/11491146.html