Use Scrapyd + SpiderKeeper to deploy Scrapy crawlers in a Linux environment

Disclaimer: This article is for study and research only and must not be used for illegal purposes; otherwise, you do so at your own risk. If there is any infringement, please notify us and it will be deleted. Thank you!

Project scenario:

When we first start crawling, there may be only a few spiders, and we can deploy and schedule them by hand. Over time, however, they can grow from ten to a hundred. Restarting a hundred spiders manually is very tedious, and if you want to look at their output logs you have to find them one by one. At that point you need a tool for quick deployment, task scheduling, and log viewing. Here we choose the Scrapyd deployment tool plus SpiderKeeper, a visual crawler management UI, to achieve this.

Module overview:

Scrapy: an open-source web crawler framework written in Python, designed for crawling websites and extracting structured data.

Installation: pip install scrapy

Scrapyd: a service for running Scrapy crawlers; it lets you deploy Scrapy projects and control their spiders through an HTTP JSON API.

Installation: pip install scrapyd

Scrapyd-Client: a Scrapyd client that lets you deploy projects to a Scrapyd server; it can also generate egg files.

Installation: pip install scrapyd-client

SpiderKeeper: a visual crawler management UI that supports scheduled runs and viewing data statistics.

Installation: pip install SpiderKeeper


Solution:


1. Create a new Scrapy crawler project with scrapy startproject myspider, then enter the myspider directory and create a spider with scrapy genspider spider www.baidu.com, as shown in the sketch below.
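
As a minimal sketch (assuming the project name myspider and the spider name spider used throughout this article), the commands are:

```bash
# Create a new Scrapy project named "myspider"
scrapy startproject myspider

# Enter the project directory and generate a spider named "spider"
# whose start domain is www.baidu.com
cd myspider
scrapy genspider spider www.baidu.com
```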

2. Modify scrapy.cfg and add a deployment name, my, after deploy (see the sketch below).
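
For reference, a minimal sketch of the resulting scrapy.cfg, assuming the project is named myspider and Scrapyd runs locally on its default port 6800:

```ini
[settings]
default = myspider.settings

# "my" is the deployment name used by scrapyd-deploy in the steps below
[deploy:my]
url = http://localhost:6800/
project = myspider
```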

3. Start scrapyd
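
Assuming Scrapyd was installed with pip as above, starting it is a single command; with the default configuration it listens on port 6800 and keeps its dbs/, eggs/ and logs/ directories under the current working directory:

```bash
# Start the Scrapyd service (runs in the foreground; use nohup or a
# process manager if you want it to keep running after you log out)
scrapyd
```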

4. Upload our crawler project from the myspider directory: scrapyd-deploy my -p myspider

5. After the upload succeeds, you can see the status is OK. Then run the crawler with: curl http://127.0.0.1:6800/schedule.json -d project=myspider -d spider=spider (see the combined sketch below).
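
Put together, the deploy-and-run sequence looks like this (a sketch assuming Scrapyd at 127.0.0.1:6800 and the deployment name my, project myspider, and spider spider from the earlier steps):

```bash
# Package and upload the project to Scrapyd under the deployment target "my"
scrapyd-deploy my -p myspider

# Schedule a run of spider "spider" in project "myspider";
# Scrapyd answers with JSON like {"status": "ok", "jobid": "..."}
curl http://127.0.0.1:6800/schedule.json -d project=myspider -d spider=spider

# Optionally check pending/running/finished jobs for the project
curl "http://127.0.0.1:6800/listjobs.json?project=myspider"
```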

6. A status of OK means the run succeeded. Next, to display it in the SpiderKeeper UI, start SpiderKeeper and point it at http://localhost:6800: spiderkeeper --server=http://localhost:6800
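
A sketch of starting SpiderKeeper in the background so it keeps running after you log out (the --server option points it at the Scrapyd instance started earlier; port 5000 is SpiderKeeper's default):

```bash
# Run SpiderKeeper against the local Scrapyd instance, keep it alive after
# the shell exits, and write its output to spiderkeeper.log
nohup spiderkeeper --server=http://localhost:6800 > spiderkeeper.log 2>&1 &
```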

7. After startup, visit http://<server ip>:5000 to open the SpiderKeeper management page. The default username and password are both admin.

8. Click Create Project to create the project, then generate the egg file with scrapyd-deploy --build-egg output.egg; the command output indicates whether the build succeeded.
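
The build command is run from the myspider directory that contains scrapy.cfg:

```bash
# Package the project into an egg file that can be uploaded through SpiderKeeper
scrapyd-deploy --build-egg output.egg
```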

9. Then upload the egg file on the SpiderKeeper page.

10. After clicking Submit, click on Projects and select the project we just created.

Summary

So far, our Scrapy project has been deployed successfully. If your Scrapy crawler code is updated later, you only need to upload the crawler to Scrapyd again: scrapyd-deploy <deployment name> -p <project name> (in this example, scrapyd-deploy my -p myspider).
Reference link: https://zhuanlan.zhihu.com/p/63302475

Origin: blog.csdn.net/qq_26079939/article/details/108599062