Disclaimer: This article is for study and research only and must not be used for illegal purposes; if you do, you act at your own risk. If there is any infringement, please notify us and it will be deleted. Thank you!
Project scenario:
When we first start crawling, there may be only a few spiders, which we can deploy and schedule by hand. Over time, though, they can grow from ten to a hundred, and once there are a hundred spiders, restarting them manually becomes very tedious; checking their output logs means hunting them down one by one. At that point you need a tool for quick deployment, task scheduling, and log viewing. Here we choose the scrapyd deployment tool together with SpiderKeeper, a visual crawler-management UI, to do the job.
Module overview:
Scrapy: an open-source web crawler framework written in Python, designed to crawl web pages and extract structured data.
Installation: pip install scrapy
Scrapyd: a service for running Scrapy crawlers. It lets you deploy Scrapy projects and control their spiders over an HTTP JSON API.
Installation: pip install scrapyd
Scrapyd-Client: a client for Scrapyd that lets you deploy projects to a Scrapyd server; it can also generate egg files.
Installation: pip install scrapyd-client
SpiderKeeper: a visual crawler-management UI that supports scheduled runs and statistics on collected data.
Installation: pip install SpiderKeeper
Solution:
1. Create a new Scrapy project with scrapy startproject myspider, then enter the myspider directory and create a spider: scrapy genspider spider www.baidu.com
2. Modify scrapy.cfg, adding a deployment name my after deploy (so the section header becomes [deploy:my])
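After step 2, scrapy.cfg might look like the sketch below. The url assumes scrapyd is running on its default port 6800; the project name matches the one created in step 1:

```ini
[settings]
default = myspider.settings

[deploy:my]
url = http://localhost:6800/
project = myspider
```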
3. Start the scrapyd service by running scrapyd
4. From the myspider directory, upload the crawler project: scrapyd-deploy my -p myspider
5. If the upload succeeds, the response status shows "ok". Then run the crawler: curl http://127.0.0.1:6800/schedule.json -d project=myspider -d spider=spider
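The same schedule.json call can be issued from Python instead of curl. A minimal standard-library sketch, assuming scrapyd is listening on its default address and using the project and spider names from the steps above:

```python
import json
from urllib import parse, request

SCRAPYD = "http://127.0.0.1:6800"  # assumed default scrapyd address


def schedule_payload(project, spider):
    """Form-encode the POST body that scrapyd's schedule.json expects."""
    return parse.urlencode({"project": project, "spider": spider}).encode()


def schedule(project, spider, host=SCRAPYD):
    """POST to schedule.json and return the parsed JSON response."""
    with request.urlopen(host + "/schedule.json",
                         data=schedule_payload(project, spider)) as resp:
        return json.loads(resp.read())

# schedule("myspider", "spider")
# a successful response looks like {"status": "ok", "jobid": "..."}
```

The helper is split in two so the request body can be inspected without a running server.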
6. A status of "ok" means the job ran successfully. Next we surface it in SpiderKeeper's UI: start SpiderKeeper and have it monitor scrapyd at http://localhost:6800 by running spiderkeeper --server=http://localhost:6800
7. After startup, visit http://server-ip:5000 to open the SpiderKeeper management page. The default username and password are both admin.
8. Click Create Project to create the project, then generate the egg file: scrapyd-deploy --build-egg output.egg. The command's output indicates whether the build succeeded.
9. Then upload the egg file on the SpiderKeeper project page.
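As an alternative to the SpiderKeeper upload form, scrapyd itself exposes an addversion.json endpoint that accepts the egg directly. A hypothetical standard-library sketch; the host, version string, and multipart boundary are assumptions:

```python
import json
from urllib import request


def addversion_request(project, version, egg_bytes,
                       host="http://127.0.0.1:6800"):
    """Build a multipart/form-data POST for scrapyd's addversion.json."""
    boundary = "scrapyd-egg-boundary"  # arbitrary multipart boundary
    parts = []
    for name, value in (("project", project), ("version", version)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="egg"; filename="output.egg"\r\n\r\n'.encode()
        + egg_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return request.Request(
        host + "/addversion.json",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

# with open("output.egg", "rb") as f:
#     req = addversion_request("myspider", "1.0", f.read())
# json.loads(request.urlopen(req).read())  # expects {"status": "ok", ...}
```

Building the request separately from sending it keeps the sketch testable without a live scrapyd instance.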