Python3 Web Crawler in Action - 12: Installing the Deployment-Related Libraries: Docker and Scrapyd

If you want to crawl data on a large scale, you will need a distributed crawler. A distributed crawler requires multiple hosts, with each host running several crawler tasks, yet there is only one copy of the source code. So what we need to do is deploy copies of that code to multiple hosts so they can work together, and how to deploy it is a question worth considering.

For Scrapy, there is an extension component called Scrapyd. We only need to install Scrapyd to manage Scrapy tasks remotely, including deploying source code, starting tasks, and monitoring running jobs. There are also ScrapydClient and ScrapydAPI to help us complete deployment and monitoring even more conveniently.
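For a quick preview of what that deployment step looks like, here is a sketch only: the target and project names below are placeholders, and it assumes ScrapydClient is installed and a deploy target has been configured in the project's scrapy.cfg, both of which are covered later in this series:

scrapyd-deploy mytarget -p myproject

This packages the project as an egg and uploads it to the Scrapyd server defined by the target.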

There is also another way to deploy: Docker cluster deployment. We only need to build a Docker image for the crawler; as long as Docker is installed on a host, the crawler can be run directly, with no need to worry about environment configuration or version issues.

In this section we will go through the installation of these environment components.

Docker installation

Docker is a container technology that packages an application together with its environment to form a standalone "application", much like an APP on iOS. This application can be distributed directly to any environment that supports Docker and started with a simple command. Docker is one of the most popular container implementations. Like virtualization technology, it greatly simplifies the deployment of applications and services; unlike virtualization technology, it packages applications and services in a far more lightweight way. With Docker, applications can be isolated from one another, so multiple applications can run simultaneously on the same machine while still sharing the same operating system. Docker's advantage is that it can manage resources at a finer granularity and also uses resources more economically than virtualization.
(The paragraph above is adapted from the DaoCloud official documentation.)
For crawlers, if we need to deploy a crawler system at scale, using Docker will greatly improve efficiency, so it is well worth mastering.
This section introduces how to install Docker on the three major platforms.

1. Related Links

2. Installation on Linux

Detailed step-by-step installation instructions can be found in the official documentation: https://docs.docker.com/engin ....
The official documentation describes the installation method for each Linux distribution in detail; following the documented steps will lead to a successful installation.
However, to make installation easier, Docker also provides an official one-click installation script, which is more convenient than executing the commands step by step, so here we introduce installation via the one-click script.
The first is the installation script provided officially by Docker. Compared with other scripts, the official one is certainly more reliable. The installation command is as follows:

curl -sSL https://get.docker.com/ | sh

With just this one command, Docker will be installed after a short wait, which is very convenient.
However, the official script has one drawback: it is slow, and the download may even time out. To speed things up, we can install from a domestic mirror instead, which is where the Alibaba Cloud and DaoCloud installation scripts come in.
Alibaba Cloud installation script:

curl -sSL http://acs-public-mirror.oss-cn-hangzhou.aliyuncs.com/docker-engine/internet | sh -

DaoCloud installation script:

curl -sSL https://get.daocloud.io/docker | sh

Either of the two scripts will do, and both download quickly.
After the script finishes, you can use Docker-related commands, for example running the Hello World test image:

docker run hello-world

The running result is as follows:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
78445dd45222: Pull complete 
Digest: sha256:c5515758d4c5e1e838e9cd307f6c6a0d620b5e07e6f927b07d05f6d12a1ac8d7
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.

If output similar to the above appears, it proves that Docker is working correctly.
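Optionally, you can confirm the installed version and, as described in Docker's post-installation steps, add your user to the docker group so that Docker can be used without sudo (the group change takes effect after you log out and back in):

docker --version
sudo usermod -aG docker $USER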

3. Installation on Mac

On the Mac platform there are also two options: Docker for Mac and Docker Toolbox.
Docker for Mac requires OS X El Capitan 10.11 or later and at least 4 GB of memory. If your system meets these requirements, it is strongly recommended to install Docker for Mac.
It can be installed with Homebrew; the installation command is as follows:

brew cask install docker

Alternatively, you can download the installation package manually; the download address is: https://download.docker.com/m ...
After the download completes, simply double-click the installation package and then drag it into the Applications folder.
Click the program icon to run Docker. You will find a new icon appears in the menu bar, the third small whale icon in Figure 1-83:


Figure 1-83 Menu bar icon
Click the icon to expand the menu, then click the Start button to start Docker. Once it starts successfully, a message will indicate that Docker is running, as shown in Figure 1-84:


Figure 1-84 Running page
After that, we can use Docker commands on the command line.
You can run the following command as a test:

sudo docker run hello-world

The running result is shown in Figure 1-85:


Figure 1-85 Running result
If similar output appears, it proves that Docker has been installed successfully.
If your system does not meet the requirements, you can download Docker Toolbox instead; its installation instructions are at: https://docs.docker.com/toolb ....
For the differences between Docker for Mac and Docker Toolbox, see: https://docs.docker.com/docke...

4. Mirror acceleration

After installing Docker, when we run the test command we will notice that it first downloads a Hello World image and then runs it. The download can sometimes be very slow, because by default it pulls from the overseas Docker Hub. To improve image download speed, we can use a domestic registry mirror to accelerate downloads; this is what the so-called Docker accelerator is for.
The recommended Docker accelerators are DaoCloud and Alibaba Cloud.
DaoCloud: https://www.daocloud.io/mirror
Alibaba Cloud: https://cr.console.aliyun.com...
For how to configure mirror acceleration on different platforms, refer to the DaoCloud official documentation: http://guide.daocloud.io/dcs/...
Once configured, you will find that images download much faster.
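On Linux, for example, configuring an accelerator usually amounts to adding a registry-mirrors entry to /etc/docker/daemon.json and restarting the Docker daemon. A minimal sketch is shown below; the mirror address is only a placeholder for the one your accelerator service assigns to you:

{
  "registry-mirrors": ["https://your-mirror-address.example.com"]
}

Then restart the daemon so the change takes effect:

sudo systemctl restart docker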
This concludes the description of how to install Docker.

Scrapyd Installation

Scrapyd is a tool for deploying and running Scrapy projects. With it, you can upload a finished Scrapy project to a cloud host and control its execution through an API.
Since it is used for deploying Scrapy projects, Linux hosts are used in almost all cases, so the installation in this section targets Linux hosts.

1. Related Links

2. Installation with pip

Installation with pip is recommended; the command is as follows:

pip3 install scrapyd

3. Configuration

After installation, you need to create a configuration file /etc/scrapyd/scrapyd.conf, which Scrapyd reads when it runs.
Since Scrapyd 1.2, this file is no longer created automatically, so we need to add it ourselves.
Run the following commands to create the file:

sudo mkdir /etc/scrapyd
sudo vi /etc/scrapyd/scrapyd.conf

Write the following content into it:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

The contents of the configuration file are described in the official documentation: https://scrapyd.readthedocs.i... The configuration here includes a few modifications. One is max_proc_per_cpu, whose official default is 4, meaning each CPU on a host can run at most 4 Scrapy jobs; here it is raised to 10. The other is bind_address, which defaults to the local address 127.0.0.1; here it is changed to 0.0.0.0 so that Scrapyd can be accessed from the external network.

4. Running in the Background

Since Scrapyd is a pure Python project, it can be run here by calling scrapyd directly. To keep the program running in the background, you can use the following command on Linux and Mac:

(scrapyd > /dev/null &)

Scrapyd will then keep running in the background, with console output simply discarded. Of course, if you want to record the output to a log, you can change the output target, for example:

(scrapyd > ~/scrapyd.log &)

Scrapyd's output will then be written to the ~/scrapyd.log file.
Once it is running, you can visit the WebUI on port 6800 in a browser to get an overview of the current Scrapyd jobs, logs, and so on, as shown in Figure 1-86:


Figure 1-86 Scrapyd home page
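Besides the WebUI, the JSON endpoints listed in the [services] section of the configuration above can be called directly over HTTP. A rough sketch follows; the project and spider names are placeholders for your own:

curl http://localhost:6800/daemonstatus.json
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider
curl "http://localhost:6800/listjobs.json?project=myproject"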
Of course, a better way to run Scrapyd is to have it supervised by a daemon manager such as Supervisor; if you are interested, see: http://supervisord.org/
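As a minimal sketch of such a setup, assuming Supervisor is installed and the scrapyd executable is on the PATH (the log path is only an example), a program section like this could be placed in /etc/supervisor/conf.d/scrapyd.conf:

[program:scrapyd]
command=scrapyd
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/scrapyd.log

After adding it, run supervisorctl reread and supervisorctl update so Supervisor picks up the new program.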
In addition, Scrapyd also supports Docker; later we will introduce how to build and run a Scrapyd Docker image.

5. Access Authentication

After the configuration above is complete, Scrapyd and its interface are open to the public with no access restrictions. If you want to add access authentication, you can do so with an Nginx reverse proxy, which requires installing the Nginx server.
Here Ubuntu is used as an example; the installation command is as follows:

sudo apt-get install nginx

Then modify the Nginx configuration file nginx.conf, adding the following configuration:

http {
    server {
        listen 6801;
        location / {
            proxy_pass    http://127.0.0.1:6800/;
            auth_basic    "Restricted";
            auth_basic_user_file    /etc/nginx/conf.d/.htpasswd;
        }
    }
}

The username and password file used in this configuration is placed in the /etc/nginx/conf.d directory and needs to be created with the htpasswd command. For example, to create a file for a user named admin, the command is as follows:

htpasswd -c .htpasswd admin

It will then prompt us to enter a password twice, after which the password file is generated. Check its contents:

cat .htpasswd 
admin:5ZBxQr0rCqwbc

After the configuration is complete, restart the Nginx service by running the following command:

sudo nginx -s reload

With that, access authentication for Scrapyd has been configured successfully.
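As a quick check, assuming the proxy runs on the same host and admin is the user created above, requests to port 6801 now require the credentials, for example:

curl -u admin:yourpassword http://127.0.0.1:6801/daemonstatus.json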

6. Conclusion

This section described how to install Scrapyd. Later we will cover in detail how to deploy Scrapy projects and monitor their running status.

Source: blog.51cto.com/14445003/2425407