Scrapyd is a tool for deploying and running Scrapy projects. With it, you can upload your Scrapy projects to a cloud host and control their execution through a JSON API.
Since Scrapyd deployments generally run on Linux hosts, the installation steps in this section assume a Linux host.
1. Related Links
- GitHub: https://github.com/scrapy/scrapyd
- PyPI: https://pypi.python.org/pypi/scrapyd
- Official documentation: https://scrapyd.readthedocs.io
2. pip install
It is recommended to install Scrapyd with pip; the command is as follows:
pip3 install scrapyd
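You can verify that the installation succeeded by checking the installed package metadata, for example:
pip3 show scrapyd
This prints the installed version and location of the scrapyd package.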
3. Configuration
After the installation is complete, you need to create a configuration file at /etc/scrapyd/scrapyd.conf, which Scrapyd reads when it runs.
Since version 1.2, Scrapyd no longer creates this file automatically, so we need to add it ourselves.
First, execute the following commands to create the directory and the file:
sudo mkdir /etc/scrapyd
sudo vi /etc/scrapyd/scrapyd.conf
Then write the following:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
The content of this configuration file can be found in the official documentation: https://scrapyd.readthedocs.io/en/stable/config.html#example-configuration-file. The configuration here has been modified in two places. One is max_proc_per_cpu: the official default is 4, meaning the host can run at most 4 Scrapy processes per CPU; it is raised to 10 here, so a 4-core host, for example, could run up to 40 processes at once. The other is bind_address: the default is the local address 127.0.0.1, which is changed to 0.0.0.0 here so that Scrapyd can be accessed from the external network.
4. Running in the background
Scrapyd is a pure Python project and can be run directly with the scrapyd command. To keep it running in the background, you can use the following command on Linux or macOS:
(scrapyd > /dev/null &)
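Note that > /dev/null only redirects standard output; if you also want to discard error messages, redirect standard error as well:
(scrapyd > /dev/null 2>&1 &)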
In this way, Scrapyd keeps running in the background, and the console output is simply discarded. Of course, if you want to record the output log, you can change the redirect target, for example:
(scrapyd > ~/scrapyd.log &)
At this point, Scrapyd's output will be written to the ~/scrapyd.log file.
Of course, you can also use tools such as screen, tmux, or Supervisor to daemonize the process.
Once it is running, you can open the Web UI on port 6800 in a browser to see the current Scrapyd jobs, logs, and so on, as shown in Figure 1.
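Besides the Web UI, Scrapyd can be queried through its JSON API; the endpoints correspond to the services listed in the configuration file above. As a quick check (assuming Scrapyd is running locally on the default port), the daemonstatus.json endpoint reports the daemon's state:
curl http://127.0.0.1:6800/daemonstatus.json
The response is a JSON object with fields such as status, pending, running, and finished.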
Of course, a better way to run Scrapyd is as a daemon managed by Supervisor; if you are interested, you can refer to http://supervisord.org/.
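As a minimal sketch of such a setup (the command path and log location are assumptions to adapt for your host), a Supervisor program section could look like this:
[program:scrapyd]
; assumed install path; verify with `which scrapyd`
command=/usr/local/bin/scrapyd
autostart=true
autorestart=true
; assumed log location
stdout_logfile=/var/log/scrapyd.log
redirect_stderr=true
After saving this (for example as /etc/supervisor/conf.d/scrapyd.conf on Ubuntu) and reloading Supervisor, Scrapyd will be restarted automatically if it exits.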
In addition, Scrapyd also supports Docker, and we will introduce how to make and run Scrapyd Docker images later.
5. Access Authentication
With the configuration so far, Scrapyd and its interface are publicly accessible. If you want to add access authentication, you can set up Nginx as a reverse proxy in front of Scrapyd, which requires installing the Nginx server first.
Here, Ubuntu is used as an example; the installation command is as follows:
sudo apt-get install nginx
Then modify the Nginx configuration file nginx.conf and add the following configuration:
http {
    server {
        listen 6801;
        location / {
            proxy_pass http://127.0.0.1:6800/;
            auth_basic "Restricted";
            auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
        }
    }
}
The username and password file used here is placed in the /etc/nginx/conf.d directory, and we need to create it with the htpasswd command. For example, to create a password file for a user named admin, run the following command in that directory:
htpasswd -c .htpasswd admin
We will then be prompted to enter a password; after typing it twice, the password file is generated. Its contents can now be inspected:
cat .htpasswd
admin:5ZBxQr0rCqwbc
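If the htpasswd command is not available, it is provided on Ubuntu by the apache2-utils package:
sudo apt-get install apache2-utils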
After the configuration is complete, restart the Nginx service by running the following command:
sudo nginx -s reload
With this, access authentication for Scrapyd is successfully configured.
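To verify, request the proxied port with HTTP Basic credentials (using the admin user from the example above; curl will prompt for the password):
curl -u admin http://127.0.0.1:6801/daemonstatus.json
Without valid credentials, Nginx responds with 401 Unauthorized.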