Detailed explanation of Scrapyd usage

Preface:
First, be clear that Scrapyd is not Scrapy.
Scrapy is a crawler framework, while Scrapyd is a tool for managing Scrapy crawlers through a web interface. Once a Scrapy crawler is written it can be run from the command line, but being able to operate it from the web is more convenient. Scrapyd solves this problem: on a web page you can view the tasks currently running, create new crawler tasks, and terminate crawler tasks. It is fairly powerful. There is also an even more powerful domestic tool, Gerapy!

Scrapyd usage details:

1. Install scrapyd

pip install scrapyd

2. Install scrapyd-client

pip install scrapyd-client
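
Both packages install command-line entry points (scrapyd and scrapyd-deploy). If you want to confirm the installation before going further, an optional check is:

$ pip show scrapyd scrapyd-client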

3. Run scrapyd

First switch the command-line path to the root directory of the Scrapy project, then run scrapyd from the command line:

$ scrapyd
(the $ here simply denotes the command-line prompt)
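
Scrapyd listens on port 6800 by default. To confirm it is running, you can open http://localhost:6800/ in a browser, or, on Scrapyd versions that expose the daemonstatus.json endpoint, query it with curl (the counts in the sample response are only illustrative):

$ curl http://localhost:6800/daemonstatus.json
{"status": "ok", "running": 0, "pending": 0, "finished": 0, "node_name": "ubuntu"}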

4. Publish the project to scrapyd

4.1: Modify the scrapy.cfg file of the crawler project

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = CZBK.settings
[deploy]
url = http://localhost:6800/   # remove the leading # that originally commented out this line
project = CZBK

[deploy:server-name] — the server (target) name can be chosen arbitrarily. It is mainly useful when the crawler needs to be published to several target servers: by giving each [deploy:...] section its own name, you can publish to a specific server by that name.
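
As an illustration, a scrapy.cfg with two named targets might look like the following (the target names and the second host address are made-up placeholders); you would then publish to one of them with scrapyd-deploy <target> -p CZBK:

[deploy:local]
url = http://localhost:6800/
project = CZBK

[deploy:server2]
url = http://192.168.1.100:6800/
project = CZBK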

4.2: Check whether the scrapy configuration is correct
scrapyd-deploy -l # Note that it is a lowercase L, not the number 1
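
If the configuration is correct, this lists each deploy target together with its URL. With the [deploy] section above, the output looks roughly like this (exact spacing depends on the scrapyd-client version):

default              http://localhost:6800/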

5. Release the crawler

scrapyd-deploy <target> -p <project> --version <version>

target is the target name written after deploy: in the configuration file above.
project can be named arbitrarily; it does not have to match the Scrapy project name.
version is a custom version number; if omitted, it defaults to the current timestamp.

Note: do not put unrelated .py files in the crawler directory; they will cause the release to fail. When the crawler is released successfully, a setup.py file is generated in the current directory; it can be deleted.

$ scrapyd-deploy -p cz
(again, $ denotes the command-line prompt; be careful not to close the terminal in which scrapyd was just started)
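
If the deploy section was given a name, that name is passed as the target, and a version can be pinned explicitly. With the hypothetical [deploy:local] target from the earlier example, the command would be:

$ scrapyd-deploy local -p cz --version 1.0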

Information after successful publishing:
Packing version 1523349647
Deploying to project "cz" in http://localhost:6800/addversion.json
Server response (200):

{"project": 
"cz", "node_name":
"ubuntu", "status":
"ok", "spiders": 
1, "version":
"1523349647"}

6. Start the crawler

curl http://127.0.0.1:6800/schedule.json -d project=<project name> -d spider=<spider name>

The project name can be found in the successful publishing message (this is not the Scrapy project name, but the value passed after -p when publishing).


$ curl http://127.0.0.1:6800/schedule.json -d project=cz -d spider=cz

Success message:

{"node_name":
"ubuntu", "status":
"ok", "jobid":
"23be21443cc411e89c37000c29e9c505"}

After running, you can view the details of the running crawler at http://127.0.0.1:6800/jobs
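
The same information is available as JSON from the listjobs.json endpoint, which reports the pending, running and finished jobs of a project; it is also a convenient way to look up a jobid for the cancel step below:

$ curl http://127.0.0.1:6800/listjobs.json?project=cz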

7. Cancel the crawler

curl http://127.0.0.1:6800/cancel.json -d project=cz -d job=jobid   # do not wrap the jobid in quotes
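
For example, using the jobid returned by the schedule call above:

$ curl http://127.0.0.1:6800/cancel.json -d project=cz -d job=23be21443cc411e89c37000c29e9c505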

Note: Gerapy is still being updated continuously.
