Scrapyd is a tool for deploying and running Scrapy projects. With it, you can upload your Scrapy projects to a cloud host and control their execution through a JSON API.
Since Scrapyd deployments generally run on Linux hosts, the installation steps in this section assume a Linux host.
1. Related Links
- GitHub: https://github.com/scrapy/scrapyd
- PyPI: https://pypi.python.org/pypi/scrapyd
- Official documentation: https://scrapyd.readthedocs.io
2. pip install
It is recommended to install Scrapyd with pip; the command is as follows:
pip3 install scrapyd
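You can verify that the installation succeeded by checking the installed package metadata, for example:
pip3 show scrapyd
This prints the installed version and location of the scrapyd package.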
3. Configuration
After the installation is complete, you need to create a configuration file at /etc/scrapyd/scrapyd.conf, which Scrapyd reads when it runs.
Since version 1.2, Scrapyd no longer creates this file automatically, so we need to add it ourselves.
First, execute the following commands to create the directory and the file:
sudo mkdir /etc/scrapyd
sudo vi /etc/scrapyd/scrapyd.conf
Then write the following:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
The content of this configuration file can be found in the official documentation: https://scrapyd.readthedocs.io/en/stable/config.html#example-configuration-file. The configuration here has been modified in two places. One is max_proc_per_cpu: the official default is 4, meaning the host can run at most 4 Scrapy processes per CPU; it is raised to 10 here, so a 4-core host, for example, could run up to 40 processes at once. The other is bind_address: the default is the local address 127.0.0.1, which is changed to 0.0.0.0 here so that Scrapyd can be accessed from the external network.
4. Running in the background
Scrapyd is a pure Python project and can be run directly with the scrapyd command. To keep it running in the background, you can use the following command on Linux or macOS:
(scrapyd > /dev/null &)
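Note that > /dev/null only redirects standard output; if you also want to discard error messages, redirect standard error as well:
(scrapyd > /dev/null 2>&1 &)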
In this way, Scrapyd keeps running in the background, and the console output is simply discarded. Of course, if you want to record the output log, you can change the redirect target, for example:
(scrapyd > ~/scrapyd.log &)
At this point, Scrapyd's output will be written to the ~/scrapyd.log file.
Of course, you can also use tools such as screen, tmux, or Supervisor to daemonize the process.
Once it is running, you can open the Web UI on port 6800 in a browser to see the current Scrapyd jobs, logs, and so on, as shown in Figure 1.
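Besides the Web UI, Scrapyd can be queried through its JSON API; the endpoints correspond to the services listed in the configuration file above. As a quick check (assuming Scrapyd is running locally on the default port), the daemonstatus.json endpoint reports the daemon's state:
curl http://127.0.0.1:6800/daemonstatus.json
The response is a JSON object with fields such as status, pending, running, and finished.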
Of course, a better way to run Scrapyd is as a daemon managed by Supervisor; if you are interested, you can refer to http://supervisord.org/.
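As a minimal sketch of such a setup (the command path and log location are assumptions to adapt for your host), a Supervisor program section could look like this:
[program:scrapyd]
; assumed install path; verify with `which scrapyd`
command=/usr/local/bin/scrapyd
autostart=true
autorestart=true
; assumed log location
stdout_logfile=/var/log/scrapyd.log
redirect_stderr=true
After saving this (for example as /etc/supervisor/conf.d/scrapyd.conf on Ubuntu) and reloading Supervisor, Scrapyd will be restarted automatically if it exits.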
In addition, Scrapyd also supports Docker, and we will introduce how to make and run Scrapyd Docker images later.
5. Access Authentication
With the configuration so far, Scrapyd and its interface are publicly accessible. If you want to add access authentication, you can set up Nginx as a reverse proxy in front of Scrapyd, which requires installing the Nginx server first.
Here, Ubuntu is used as an example; the installation command is as follows:
sudo apt-get install nginx
Then modify the Nginx configuration file nginx.conf and add the following configuration:
http {
    server {
        listen 6801;
        location / {
            proxy_pass http://127.0.0.1:6800/;
            auth_basic "Restricted";
            auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
        }
    }
}
The username and password file used here is placed in the /etc/nginx/conf.d directory, and we need to create it with the htpasswd command. For example, to create a password file for a user named admin, run the following command in that directory:
htpasswd -c .htpasswd admin
We will then be prompted to enter a password; after typing it twice, the password file is generated. Its contents can now be inspected:
cat .htpasswd
admin:5ZBxQr0rCqwbc
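If the htpasswd command is not available, it is provided on Ubuntu by the apache2-utils package:
sudo apt-get install apache2-utils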
After the configuration is complete, restart the Nginx service by running the following command:
sudo nginx -s reload
With this, access authentication for Scrapyd is successfully configured.
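To verify, request the proxied port with HTTP Basic credentials (using the admin user from the example above; curl will prompt for the password):
curl -u admin http://127.0.0.1:6801/daemonstatus.json
Without valid credentials, Nginx responds with 401 Unauthorized.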