Lecture 50: Stop Worrying About Scrapy Deployment: The Principle and Use of Scrapyd

In the last lesson, we completed the deployment of our distributed crawler and got it running successfully, but one step in the process remains very cumbersome: code deployment.

Imagine the following scenarios:

  • If the code is deployed by uploading files, we first need to compress the code, upload the archive to the server via SFTP or FTP, then connect to the server and decompress it. Every server needs to be configured this way.

  • If we deploy the code with Git synchronization, we can first push the code to a Git repository and then connect to each host remotely to perform a Pull operation and synchronize the code. Again, every server requires this operation.

If the code is updated, we have to update every server, and if the code version on any host is not well controlled, it may also affect the overall distributed crawl.

So we need a more convenient tool for deploying Scrapy projects. If we could avoid logging in to each server one by one, deployment would be much more convenient.

In this section, we will take a look at Scrapyd, a tool that provides distributed deployment.

1. Understanding Scrapyd

Next, let's take a closer look at Scrapyd. Scrapyd is a service program for running Scrapy crawlers. It provides a series of HTTP interfaces to help us deploy, start, stop, and delete crawlers. Scrapyd supports version management and can manage multiple crawler tasks. With it, we can easily complete the deployment and task scheduling of Scrapy crawler projects.

Preparation

First, we need to install Scrapyd. The deployment server is generally Linux, so here we take Linux as an example.

It is recommended to install it with pip; the command is as follows:

pip3 install scrapyd 

In addition, for the project itself to run successfully, the environment it depends on must also be installed on the server. For example, if the project relies on libraries such as Scrapy, Scrapy-Redis, and Gerapy-Pyppeteer, these also need to be installed on the server; otherwise, deployment will fail.
After installation, you need to create a new configuration file /etc/scrapyd/scrapyd.conf, which Scrapyd reads when it runs.

Since Scrapyd version 1.2, this file is no longer created automatically, so we need to add it ourselves. First, execute the following command to create the file:

sudo mkdir /etc/scrapyd   
sudo vi /etc/scrapyd/scrapyd.conf 

Then write the following:

[scrapyd]   
eggs_dir    = eggs   
logs_dir    = logs   
items_dir   =   
jobs_to_keep = 5   
dbs_dir     = dbs   
max_proc    = 0   
max_proc_per_cpu = 10   
finished_to_keep = 100   
poll_interval = 5.0   
bind_address = 0.0.0.0   
http_port   = 6800   
debug       = off   
runner      = scrapyd.runner   
application = scrapyd.app.application   
launcher    = scrapyd.launcher.Launcher   
webroot     = scrapyd.website.Root   
​ 
[services]   
schedule.json     = scrapyd.webservice.Schedule   
cancel.json       = scrapyd.webservice.Cancel   
addversion.json   = scrapyd.webservice.AddVersion   
listprojects.json = scrapyd.webservice.ListProjects   
listversions.json = scrapyd.webservice.ListVersions   
listspiders.json  = scrapyd.webservice.ListSpiders   
delproject.json   = scrapyd.webservice.DeleteProject   
delversion.json   = scrapyd.webservice.DeleteVersion   
listjobs.json     = scrapyd.webservice.ListJobs   
daemonstatus.json = scrapyd.webservice.DaemonStatus 

The content of the configuration file can be found in the official documentation: https://scrapyd.readthedocs.io/en/stable/config.html#example-configuration-file. The configuration here has been modified in two places. One is max_proc_per_cpu: the official default is 4, meaning a host can run at most 4 Scrapy tasks per CPU, and it is raised to 10 here. The other is bind_address, which defaults to the local 127.0.0.1 and is changed to 0.0.0.0 here so that the service can be accessed from the external network.

Scrapyd is a pure Python project, so it can be run by invoking it directly. To keep the program running in the background, on Linux and Mac you can use the following command:

(scrapyd > /dev/null &) 

In this way, Scrapyd keeps running in the background, and the console output is simply discarded. Of course, if you want to keep a log of the output, you can change the output destination as follows:

(scrapyd > ~/scrapyd.log &)

At this point, Scrapyd's output will be written to the ~/scrapyd.log file. Of course, you can also use tools such as screen, tmux, or supervisor to keep the process running, as sketched below.
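For example, if you choose supervisor for process management, a minimal program section could look like the sketch below. This is only an illustration under assumed paths: the working directory and log file locations are placeholders, not part of the original setup.

[program:scrapyd]
; run the scrapyd executable installed via pip
command=scrapyd
directory=/root
autostart=true
autorestart=true
stdout_logfile=/var/log/scrapyd.out.log
stderr_logfile=/var/log/scrapyd.err.log

Save this as a supervisor configuration file and reload supervisor, and Scrapyd will be restarted automatically if it ever exits.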

After Scrapyd is installed and running, we can visit port 6800 of the server and see a WebUI page. For example, my server address is 120.27.34.25. After installing Scrapyd on it and running it successfully, I can open http://120.27.34.25:6800 in a local browser and see the Scrapyd home page. Please replace the address with your own server's when you try it, as shown in the figure:
[Figure: the Scrapyd WebUI home page]
If you can successfully access this page, then it proves that there is no problem with Scrapyd configuration.

2. Features of Scrapyd

Scrapyd provides a series of HTTP interfaces to implement various operations. Here we go through the functions of these interfaces, taking the Scrapyd host 120.27.34.25 as an example.

2.1 daemonstatus.json

This interface is responsible for viewing the current status of Scrapyd services and tasks. We can use the curl command to request this interface. The command is as follows:

curl http://120.27.34.25:6800/daemonstatus.json

So we will get the following result:

{
    "status": "ok",
    "finished": 90,
    "running": 9,
    "node_name": "datacrawl-vm",
    "pending": 0
}

The returned result is a JSON string: status is the current running status, finished is the number of finished Scrapy tasks, running is the number of running tasks, pending is the number of tasks waiting to be scheduled, and node_name is the host name.
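Besides curl, we can also call this interface from Python with the requests library. Below is a minimal sketch that assumes the same example server address used throughout this lecture:

import requests

# Query the overall status of the Scrapyd service
response = requests.get('http://120.27.34.25:6800/daemonstatus.json')
data = response.json()
print(data['status'], data['pending'], data['running'], data['finished'])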

2.2 addversion.json

This interface is mainly used to deploy the Scrapy project. When deploying, we need to first package the project into an Egg file, and then pass in the project name and deployment version.

We can implement project deployment in the following ways:

curl http://120.27.34.25:6800/addversion.json -F project=weibo -F version=first -F egg=@weibo.egg

Here -F means adding a form parameter, and we need to have packaged the project into an Egg file locally beforehand.
After making the request in this way, we can get the following results:

{
    "status": "ok",
    "spiders": 3
}

This result indicates that the deployment was successful and that the project contains 3 Spiders. Deploying this way can be cumbersome; later I will introduce more convenient tools for project deployment.
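If you would rather make this request from Python than from curl, a rough equivalent with requests is sketched below; weibo.egg is assumed to be the Egg file you have already packaged locally:

import requests

url = 'http://120.27.34.25:6800/addversion.json'
data = {'project': 'weibo', 'version': 'first'}

# Upload the locally packaged Egg as a multipart form field named "egg"
with open('weibo.egg', 'rb') as egg:
    response = requests.post(url, data=data, files={'egg': egg})

print(response.json())  # e.g. {'status': 'ok', 'spiders': 3}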

2.3 schedule.json

This interface is responsible for scheduling the deployed Scrapy project to run. We can implement task scheduling through the following interfaces:

curl http://120.27.34.25:6800/schedule.json -d project=weibo -d spider=weibocn 

Two parameters need to be passed in here, project is the name of the Scrapy project, spider is the name of the spider. The returned results are as follows:

{
    "status": "ok",
    "jobid": "6487ec79947edab326d6db28a2d86511e8247444"
}

status represents the startup status of the Scrapy project, and jobid is the identifier of the crawl task that was just started.
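The same scheduling call can be made with requests. According to the Scrapyd documentation, schedule.json also accepts a setting parameter to override a Scrapy setting for this run, and any other extra parameter is passed to the spider as an argument; the since argument in the sketch below is purely hypothetical:

import requests

response = requests.post('http://120.27.34.25:6800/schedule.json', data={
    'project': 'weibo',
    'spider': 'weibocn',
    'setting': 'DOWNLOAD_DELAY=2',  # override a Scrapy setting for this job
    'since': '2017-07-01',          # hypothetical extra spider argument
})
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}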

2.4 cancel.json

This interface can be used to cancel a crawling task. If the task is pending, then it will be removed. If the task is running, then it will be terminated.

We can use the following command to cancel the running of the task:

curl http://120.27.34.25:6800/cancel.json -d project=weibo -d job=6487ec79947edab326d6db28a2d86511e8247444 

Two parameters need to be passed in here, project is the project name, job is the code of the crawling task. The returned results are as follows:

{
    "status": "ok",
    "prevstate": "running"
}

status represents the execution of the request, and prevstate represents the previous running status.

2.5 listprojects.json

This interface is used to list all project descriptions deployed to the Scrapyd service. We can use the following command to get all the project descriptions on the Scrapyd server:

curl http://120.27.34.25:6800/listprojects.json 

There is no need to pass in any parameters. The returned results are as follows:

{
    "status": "ok",
    "projects": ["weibo", "zhihu"]
}

status represents the execution status of the request, and projects is a list of project names.

2.6 listversions.json

This interface is used to get all the version numbers of a project, the version numbers are arranged in order, and the last entry is the latest version number.

We can use the following command to get the version number of the project:

curl http://120.27.34.25:6800/listversions.json?project=weibo 

A parameter project is required here, which is the name of the project. The returned results are as follows:

{
    "status": "ok",
    "versions": ["v1", "v2"]
}

status represents the execution status of the request, and versions is a list of version numbers.

2.7 listspiders.json

This interface is used to get all the Spider names of the latest version of a project. We can use the following command to get the spider name of the project:

curl http://120.27.34.25:6800/listspiders.json?project=weibo 

A parameter project is required here, which is the name of the project. The returned results are as follows:

{
    "status": "ok",
    "spiders": ["weibocn"]
}

status represents the execution status of the request, spiders is a list of spider names.

2.8 listjobs.json

This interface is used to obtain the details of the tasks of a project, including pending, running, and finished ones. We can use the following command to get all task details:

curl http://120.27.34.25:6800/listjobs.json?project=weibo 

A parameter project is required here, which is the name of the project. The returned results are as follows:

{
    "status": "ok",
    "pending": [
        {"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "weibocn"}
    ],
    "running": [
        {"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "weibocn", "start_time": "2017-07-12 10:14:03.594664"}
    ],
    "finished": [
        {"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "weibocn", "start_time": "2017-07-12 10:14:03.594664", "end_time": "2017-07-12 10:24:03.594664"}
    ]
}

status represents the execution status of the request, pending represents the tasks waiting to be scheduled, running represents the currently running tasks, and finished represents the completed tasks.
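Combining listjobs.json with a simple polling loop, we can wait for a particular job to finish. The sketch below assumes the jobid was returned by an earlier schedule.json call:

import time
import requests

def wait_for_job(base_url, project, job_id, interval=10):
    # Poll listjobs.json until the job is no longer pending or running
    while True:
        jobs = requests.get(f'{base_url}/listjobs.json', params={'project': project}).json()
        active_ids = {job['id'] for job in jobs['pending'] + jobs['running']}
        if job_id not in active_ids:
            return
        time.sleep(interval)

wait_for_job('http://120.27.34.25:6800', 'weibo', '6487ec79947edab326d6db28a2d86511e8247444')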

2.9 delversion.json

This interface is used to delete a certain version of the project. We can delete the project version with the following command:

curl http://120.27.34.25:6800/delversion.json -d project=weibo -d version=v1 

Here we need a parameter project, which is the name of the project, and a parameter version, which is the version of the project. The returned results are as follows:

{
    "status": "ok"
}

status represents the execution status of the request; ok means the deletion was successful.

2.10 delproject.json

This interface is used to delete a project. We can delete a project with the following command:

curl http://120.27.34.25:6800/delproject.json -d project=weibo 

A parameter project is required here, which is the name of the project. The returned results are as follows:

{
    "status": "ok"
}

status represents the execution status of the request; ok means the deletion was successful.
The above are all of Scrapyd's interfaces. We can request these HTTP interfaces directly to control project deployment, startup, and operation.
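As a small example of combining these interfaces, the sketch below uses listversions.json and delversion.json to prune old versions of a project, keeping only the latest one (recall that the last entry in versions is the newest):

import requests

base_url = 'http://120.27.34.25:6800'
project = 'weibo'

# Versions are ordered, with the latest version at the end of the list
versions = requests.get(f'{base_url}/listversions.json', params={'project': project}).json()['versions']

for version in versions[:-1]:  # delete everything except the newest version
    result = requests.post(f'{base_url}/delversion.json', data={'project': project, 'version': version})
    print('Deleted', version, result.json()['status'])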

3. Use of ScrapydAPI

The above interfaces may not be very convenient to use directly, but that is not a problem: the ScrapydAPI library wraps these interfaces once more. It can be installed as follows:

pip3 install python-scrapyd-api 

Let's take a look at how ScrapydAPI is used. The core principle is the same as requesting the HTTP interfaces directly, but after being wrapped in Python it is more convenient to use.
We can create a ScrapydAPI object in the following way:

from scrapyd_api import ScrapydAPI 
scrapyd = ScrapydAPI('http://120.27.34.25:6800') 

Then you can implement the corresponding interface operation by calling its method. For example, the deployment operation can use the following methods:

egg = open('weibo.egg', 'rb') 
scrapyd.add_version('weibo', 'v1', egg) 

In this way, we can package the project as an Egg file, and then deploy the locally packaged Egg project to the remote Scrapyd.

In addition, ScrapydAPI also implements all API interfaces provided by Scrapyd, with the same names and the same parameters.

For example, we can call the list_projects method to list all deployed projects in Scrapyd:

scrapyd.list_projects() 
['weibo', 'zhihu'] 

In addition, there are other methods that are not listed here. The names and parameters are the same. For more detailed operations, please refer to its official document: http://python-scrapyd-api.readthedocs.io/ .
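As a quick sketch of a typical workflow, the calls below chain several of these methods together; the method names mirror the HTTP interfaces, and the return values shown in the comments are assumptions based on the examples above:

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://120.27.34.25:6800')

# Start a crawl and keep its job id
job_id = scrapyd.schedule('weibo', 'weibocn')

print(scrapyd.list_spiders('weibo'))  # e.g. ['weibocn']
print(scrapyd.list_jobs('weibo'))     # pending / running / finished details

# Cancel the job again if needed
scrapyd.cancel('weibo', job_id)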
With ScrapydAPI we can deploy projects and control tasks through the HTTP interface. However, the deployment process is still inconvenient: we have to package the Egg file ourselves and then upload it, which remains cumbersome. So here is one more tool, Scrapyd-Client.

4. Scrapyd-Client deployment

In order to facilitate the deployment of Scrapy projects, Scrapyd-Client provides two functions:

  • Package the project into an Egg file.

  • Deploy the packaged Egg file to Scrapyd through the addversion.json interface.

In other words, Scrapyd-Client handles the whole deployment for us. We no longer need to care about how the Egg file is generated, nor read the Egg file and request the interface to upload it ourselves; all of these operations require only a single command for one-click deployment.

To deploy a Scrapy project, we first need to modify the project's configuration file. For example, in the Scrapy project we wrote before, there is a scrapy.cfg file at the top level of the project. Its content is as follows:

[settings] 
default = scrapypyppeteer.settings 
​ 
[deploy] 
#url = http://localhost:6800/ 
project = scrapypyppeteer 

Here we need to configure deploy. For example, if we want to deploy the project to Scrapyd at 120.27.34.25, we need to modify it as follows:

[deploy] 
url = http://120.27.34.25:6800/ 
project = scrapypyppeteer 

Then we execute the following command in the path where the scrapy.cfg file is located:

scrapyd-deploy 

The results are as follows:

Packing version 1501682277
Deploying to project "scrapypyppeteer" in http://120.27.34.25:6800/addversion.json
Server response (200):
{
    "status": "ok",
    "spiders": 1,
    "node_name": "datacrawl-vm",
    "project": "scrapypyppeteer",
    "version": "1501682277"
}

Returning this result means that the deployment was successful.

We can also specify the project version. If not specified, it will default to the current timestamp. If specified, pass the version parameter, for example:

scrapyd-deploy --version 201707131455 

It is worth noting that with Scrapyd 1.2.0 under Python 3, the version number should not be a string containing letters; it needs to be purely numeric, otherwise an error may occur.

In addition, if we have multiple hosts, we can configure the alias of each host, for example, we can modify the configuration file to:

[deploy:vm1] 
url = http://120.27.34.24:6800/ 
project = scrapypyppeteer 
​ 
[deploy:vm2] 
url = http://139.217.26.30:6800/ 
project = scrapypyppeteer 

If there are multiple hosts, configure them all here, with one set of configuration per host and the host's alias appended after deploy. Then, if we want to deploy the project to the vm2 host with IP 139.217.26.30, we only need to execute the following command:

scrapyd-deploy vm2 

So we can deploy the project to the host named vm2.
In this way, if we have multiple hosts, we only need to configure each host's Scrapyd address in the scrapy.cfg file and then run the scrapyd-deploy command with the host's alias to deploy, which is very convenient.
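Deployment aside, once the project is on every host we can also start the same spider on all of them from a single script. Here is a sketch with ScrapydAPI; the host list matches the two example servers above, and the spider name is only a placeholder:

from scrapyd_api import ScrapydAPI

# The two example Scrapyd hosts configured as vm1 and vm2 in scrapy.cfg
hosts = ['http://120.27.34.24:6800', 'http://139.217.26.30:6800']

for host in hosts:
    scrapyd = ScrapydAPI(host)
    # 'example' is a placeholder spider name; replace it with a spider in your project
    job_id = scrapyd.schedule('scrapypyppeteer', 'example')
    print(host, '->', job_id)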

If Scrapyd has access restrictions configured, we can add the username and password to the configuration file and at the same time change the port to the Nginx proxy port. For example, if we used port 6801 in Module 1, then we need to change the port here to 6801. The modification is as follows:

[deploy:vm1] 
url = http://120.27.34.24:6801/ 
project = scrapypyppeteer 
username = admin 
password = admin 
​ 
[deploy:vm2] 
url = http://139.217.26.30:6801/ 
project = scrapypyppeteer 
username = germey 
password = germey 

In this way, by adding the username and password fields, we can automatically perform Auth verification during deployment, and then successfully implement the deployment.
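When we call the HTTP interfaces directly in this setup, the same credentials have to be sent along. Below is a sketch with requests and HTTP Basic Auth against the Nginx-proxied port, using the vm2 credentials above:

import requests

response = requests.get(
    'http://139.217.26.30:6801/listprojects.json',
    auth=('germey', 'germey'),  # Basic Auth credentials configured on the Nginx proxy
)
print(response.json())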

5. Summary

Above we introduced the deployment methods based on Scrapyd, ScrapydAPI, and Scrapyd-Client. I hope you will try them out for yourself.
