How to quickly build a practical crawler management platform

This article covers a lot of ground and touches on a wide range of topics. It takes about 20 minutes to read, so please read it patiently.

Foreword

Most enterprises cannot do without crawlers: crawling is an effective way to obtain data. For search engines, crawlers are indispensable; for public opinion monitoring companies, crawlers are the foundation; for NLP, crawlers provide corpora; for startups, crawlers supply initial content. However, crawler technology is complicated, and different crawling scenarios call for different techniques. For example, a simple static page can be handled directly with an HTTP request plus an HTML parser; a dynamic page requires browser automation tools such as Puppeteer or Selenium; a website with anti-crawling measures requires proxies, captcha solving, and other techniques; and so on. Therefore, enterprises or individuals who run crawlers at scale need to handle different types of crawlers at the same time, which adds considerable extra management overhead. On top of that, crawler operators also have to deal with website content changes, continuous incremental crawling, and task failures. A mature crawler operation should therefore include a management system that can effectively handle the problems above.
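To make the simplest scenario above concrete, here is a minimal sketch of a static-page crawler built from an HTTP request plus an HTML parser. It assumes the requests and beautifulsoup4 packages are installed; the URL and the CSS selector are placeholders for illustration only.

# A minimal static-page crawler: plain HTTP request + HTML parser.
# Assumes `requests` and `beautifulsoup4` are installed; the URL and
# selector below are placeholders, not a real target site.
import requests
from bs4 import BeautifulSoup

def crawl(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Extract all article titles (the selector is site-specific)
    return [a.get_text(strip=True) for a in soup.select('h2 a')]

if __name__ == '__main__':
    for title in crawl('https://example.com/news'):
        print(title)

Dynamic pages and anti-crawling measures require considerably more machinery than this, which is exactly why managing many different crawlers becomes a problem of its own.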

Understanding what a crawler management platform is

Definition

> A crawler management platform is a one-stop management system that integrates crawler deployment, task scheduling, task monitoring, and result display. Crawler management platforms generally support distributed deployment and can run cooperatively across multiple machines.

Of course, the above definition is narrow and is usually aimed at technical staff, developers, or technical managers. Enterprises generally develop their own in-house crawler management systems to deal with complex crawler management needs; such a system is a crawler management platform in the narrow sense defined above.

Generalized crawler management platform

So what is a crawler management platform in the broad sense? You may have heard of the Sharpshooter (later rebranded as the Houyi Collector) and the Octopus. The former is a cloud-based service on which you can write, run, and monitor crawlers online; among the broadly defined crawler platforms it is the closest to the narrowly defined crawler management platform. The latter is a popular commercial scraping tool that lets novice users write and run crawlers by drag and drop and export the data. You may also have seen various API aggregation providers, such as Aggregate Data, a platform on which you can directly call a website's interface to obtain data; this is essentially a variant of the crawler platform, except that it has already done the crawler writing for you. In between sits a foreign company called Kimono Labs, which developed a Chrome extension named Kimono that let users visually click elements on a page to generate scraping rules; a crawler program was then generated on its website, the user submitted a task, and the backend automatically scraped the data from the target site. Kimono was a great crawler application, but unfortunately Kimono Labs was acquired by the big data company Palantir, and the service can no longer be tried.

In this article, we mainly focus on the narrowly defined crawler management platform, so the crawler management platform mentioned later refers to the narrow definition.

Crawler management platform modules

The following are the modules involved in a typical crawler management platform.

Crawler management platform architecture

The modules of a typical crawler management platform mainly include the following:

  • Task management: how crawler tasks are executed and scheduled, and how they are monitored, including log monitoring and so on;
  • Crawler management: crawler deployment, i.e. deploying (packaging or copying) the developed crawler to the corresponding nodes, as well as crawler configuration and version management;
  • Node management: registration and monitoring of nodes (servers/machines), communication between nodes, monitoring of node performance and status, etc.;
  • Front-end application: a visual UI that users interact with and that communicates with the back-end application.

Of course, some crawler management platforms have more than these modules; they may include other practical functions such as configurable crawling rules, visual configuration of crawling rules, proxy pools, cookie pools, exception monitoring, and so on.

Why do you need a crawler management platform

With a crawler management platform, developers, especially crawler engineers, can conveniently add crawlers, execute tasks, and view results without switching back and forth on the command line, which is very error-prone. A common scenario: in the initial technical selection, a crawler engineer manages crawler tasks with scrapy and crontab. He has to choose the intervals of the scheduled tasks carefully so that the server's CPU or memory is not exhausted. A harder problem is that he also has to write the logs produced by scrapy to files; once a crawler errors out, he has to inspect the logs one by one with shell commands to locate the cause, which in bad cases can take a whole day. An even more serious problem: as the company's business grows, he may need to write hundreds of crawlers to meet business needs, and managing them with scrapy and crontab becomes a complete nightmare. The poor crawler engineer could in fact solve his problem by choosing a suitable crawler management platform.
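To illustrate the pain point, a scrapy-plus-crontab setup typically ends up as a pile of entries like the hypothetical ones below; the paths, spider names, schedules, and log files are placeholders, and every new spider needs another line, another log file, and another guess at a safe time slot.

# Hypothetical crontab entries for managing scrapy spiders by hand;
# each spider needs its own schedule, working directory, and log file.
0 */6 * * *  cd /opt/spiders/site_a && scrapy crawl site_a >> /var/log/spiders/site_a.log 2>&1
30 2 * * *   cd /opt/spiders/site_b && scrapy crawl site_b >> /var/log/spiders/site_b.log 2>&1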

How to choose a suitable crawler management platform

Suppose you are now ready to solve the difficult problems crawler engineers face, as described above, and want to choose a suitable crawler management platform.

The first question you should answer is: Do we need to develop a system from scratch (Start from scratch)? To answer this question, you should first answer the following questions:

  1. Are our needs complex enough to fully customize a new system (such as requiring complex rights management)?
  2. Does our team have enough technical strength to develop this system (for example, experienced front-end and back-end development engineers)?
  3. Are our time resources sufficient for us to develop the system (e.g. a project cycle of one year)?

If the answer to any of the above three questions is "no", you should consider using an open source crawler management platform already on the market to meet your needs.

The following are the open source crawler management platforms available on the market:

| Platform | Technology | Advantages | Disadvantages |
| --- | --- | --- | --- |
| SpiderKeeper | Python Flask | Based on scrapyd, an open-source counterpart of Scrapinghub; very simple UI; supports scheduled tasks | Perhaps too minimal: no pagination, no node management, no support for crawlers other than scrapy |
| Gerapy | Python Django + Vue | Developed by Cui Qingcai; very simple to install and deploy; also based on scrapyd; attractive UI; supports node management, code editing, configurable rules, and other features | Also does not support crawlers other than scrapy; according to user feedback, version 1.0 has many bugs, which version 2.0 is expected to improve to some extent |
| Scrapydweb | Python Flask + Vue | Attractive UI; built-in scrapy log parser; many task statistics charts; supports node management, scheduled tasks, email alerts, and a mobile interface; a feature-complete scrapyd-based crawler management platform | Also does not support crawlers other than scrapy; the Python Flask backend imposes certain performance limitations |
| Crawlab | Golang + Vue | Not limited to scrapy: can run crawlers written in any language and framework; attractive UI; native support for distributed crawlers; node management, crawler management, task management, scheduled tasks, result export, data statistics, and other features | Deployment is a bit more involved (though it can be done with one click using Docker); the latest version does not yet support configurable crawlers |

In general, SpiderKeeper was probably the earliest crawler management platform, but its functionality is fairly limited. Gerapy is feature-complete and has an attractive interface, but it still has many bugs to work through; users who need it are advised to wait for version 2.0. Scrapydweb is a fairly complete crawler management platform, but like the previous two it is based on scrapyd, so it can only run scrapy crawlers. Crawlab is a very flexible crawler management platform that can run crawlers written in Python, NodeJS, Java, PHP, and Go, and its functionality is complete; it is somewhat more troublesome to deploy than the first three, but Docker users can deploy it with one click (more on this later).

Therefore, developers who rely heavily on scrapy crawlers and don't want extra hassle can consider Scrapydweb; developers with diverse crawler types and complex technology stacks should give priority to the more flexible Crawlab. Of course, this is not to say that Crawlab's scrapy support is unfriendly; Crawlab also integrates scrapy very well, as will be introduced later.

As the author of Crawlab, I don't want to toot my own horn; I simply hope to recommend the best technology choices to developers, so that they can decide which crawler management platform to use according to their own needs.

Introduction to Crawlab, a crawler management platform

Introduction

Crawlab is a distributed crawler management platform based on Golang that supports multiple programming languages, including Python, NodeJS, Java, Go, and PHP, as well as multiple crawler frameworks.

Since its launch in March this year, Crawlab has been well received by crawler enthusiasts and developers, and many users have said they would use Crawlab to build their company's crawler platform. After several months of iteration, Crawlab has successively launched scheduled tasks, data analytics, website information, configurable crawlers, automatic field extraction, result downloading, crawler uploading, and other features, making the platform more practical and comprehensive and genuinely helping users solve the difficult problem of crawler management. Today Crawlab has nearly 1k stars on Github, related communities (WeChat groups, a WeChat official account) have been established, and a quarter of users say they have applied Crawlab to enterprise crawler management. It is clear that Crawlab is followed and liked by developers.

Problems it solves

Crawlab mainly solves the problem of managing a large number of crawlers. For example, a project that mixes scrapy and selenium crawlers and needs to monitor hundreds of websites is not easy to manage, and managing it from the command line is costly and error-prone. Crawlab supports any language and any framework, and with task scheduling and task monitoring it becomes easy to monitor and manage large-scale crawler projects effectively.

Interface and usage

Below is a screenshot of the Crawlab crawler list page.

Crawlab crawler list

Users only need to upload the crawler to Crawlab, configure the execution command, and click the "Run" button to execute the crawler task. Crawler tasks can run on any node. As can be seen from the above figure, Crawlab has modules such as node management, crawler management, task management, timed tasks, and user management.

Overall structure

The following is the overall architecture diagram of Crawlab, which consists of five major parts:

  1. Master Node: responsible for task dispatching, the API, deploying crawlers, etc.;
  2. Worker Node: responsible for executing crawler tasks;
  3. MongoDB database: stores day-to-day operating data such as nodes, crawlers, and tasks;
  4. Redis database: stores the task message queue, node heartbeats, and other information;
  5. Front-end client: a Vue application responsible for front-end interaction and requesting data from the back end.

Crawlab Architecture

How to use Crawlab and its detailed principles are beyond the scope of this article. If you are interested, you can refer to the Github homepage or the related documentation.

Github address and Demo

Install Crawlab with Docker

Docker image

Docker is the most convenient and concise way to deploy Crawlab. Other deployment methods exist, including direct deployment, but they are not recommended for developers who want to build a platform quickly. Crawlab has published its image on Dockerhub, and developers only need to execute the `docker pull tikazyq/crawlab` command to download the Crawlab image.

Readers can go to Dockerhub to view the Crawlab image, which is less than 300 MB. Address: https://hub.docker.com/r/tikazyq/crawlab/tags

Dockerhub Page

Install Docker

To deploy Crawlab with Docker, you must first ensure that Docker is installed. Please refer to the following documents to install.

| Operating system | Documentation |
| --- | --- |
| Mac | https://docs.docker.com/docker-for-mac/install |
| Windows | https://docs.docker.com/docker-for-windows/install |
| Ubuntu | https://docs.docker.com/install/linux/docker-ce/ubuntu |
| Debian | https://docs.docker.com/install/linux/docker-ce/debian |
| CentOS | https://docs.docker.com/install/linux/docker-ce/centos |
| Fedora | https://docs.docker.com/install/linux/docker-ce/fedora |
| Other Linux distributions | https://docs.docker.com/install/linux/docker-ce/binaries |

Install Docker Compose

Docker Compose is a simple and very lightweight tool for running multi-container Docker applications; we will use Docker Compose to deploy Crawlab with one click.

Docker's official website already has a tutorial on how to install Docker Compose, click the link to view it. Here is a brief introduction.

| Operating system | Installation steps |
| --- | --- |
| Mac | Comes with Docker Desktop for Mac or Docker Toolbox; no separate installation needed |
| Windows | Comes with Docker Desktop for Windows or Docker Toolbox; no separate installation needed |
| Linux | See the commands below the table |
| Other options | Install via pip: `pip install docker-compose`; if you are not in a virtualenv, you may need to use `sudo` |

For Linux users, please use the following command to install.

# Download docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

# Make docker-compose executable
sudo chmod +x /usr/local/bin/docker-compose
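To confirm the installation worked, you can check the version; the exact output depends on the version you installed.

docker-compose --version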

pull image

Before pulling the image, you need to configure a registry mirror, because in China pulling from the original registry is not very fast and a domestic mirror of DockerHub is needed. Please create the file `/etc/docker/daemon.json` and enter the following content.

{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}
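For the mirror configuration to take effect, Docker usually needs to be restarted. On a systemd-based Linux distribution this would look roughly like the following.

sudo systemctl daemon-reload
sudo systemctl restart docker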

Pulling the image will then be much faster. Of course, you can also use other mirror sources, which you can find online. Run the following command to pull the Crawlab image.

docker pull tikazyq/crawlab:latest

The following figure shows the command line interface when pulling an image.

docker pull

Start Crawlab

We will start Crawlab and its dependencies, the MongoDB and Redis databases, with Docker Compose. First we need to look at Docker Compose's YAML configuration file, `docker-compose.yml`, which defines the container services that need to be started and their network configuration. Here we use the `docker-compose.yml` that comes with Crawlab.

version: '3.3'  # Docker Compose version
services:  # services
  master:  # service name
    image: tikazyq/crawlab:latest  # image used by the service
    container_name: master  # container name for the service
    environment:  # environment variables passed in
      CRAWLAB_API_ADDRESS: "localhost:8000"  # API address called by the front end, defaults to localhost:8000
      CRAWLAB_SERVER_MASTER: "Y"  # whether this is the master node, Y/N
      CRAWLAB_MONGO_HOST: "mongo"  # MongoDB host; inside Docker Compose the service name can be referenced
      CRAWLAB_REDIS_ADDRESS: "redis"  # Redis host; inside Docker Compose the service name can be referenced
    ports:  # mapped ports
      - "8080:8080" # front-end port
      - "8000:8000" # back-end port
    depends_on: # services this one depends on
      - mongo  # MongoDB
      - redis  # Redis
  worker:  # worker node, configured similarly to the master node, comments not repeated
    image: tikazyq/crawlab:latest
    container_name: worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    depends_on:
      - mongo
      - redis
  mongo:  # MongoDB service name
    image: mongo:latest  # MongoDB image name
    restart: always  # restart policy: always
    ports:  # mapped ports
      - "27017:27017"
  redis:  # Redis service name
    image: redis:latest  # Redis image name
    restart: always  # restart policy: always
    ports:  # mapped ports
      - "6379:6379"

Readers can adjust `docker-compose.yml` according to their own requirements. In particular, pay attention to the environment variable `CRAWLAB_API_ADDRESS`: many beginners are unable to log in because this variable is configured incorrectly. In most cases you do not need to make any configuration changes. Please refer to the Q&A for common problems, and to the detailed environment variable documentation, for help configuring Crawlab for your environment.
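For example, if you will access the Crawlab UI from another machine rather than from localhost, the change to the master service would look roughly like this (192.168.1.10 stands in for your server's IP address):

    environment:
      CRAWLAB_API_ADDRESS: "192.168.1.10:8000"  # replace localhost with the IP the browser will use to reach the back end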

Then, run the following command to start Crawlab. You can add the `-d` parameter to make Docker Compose run in the background.

docker-compose up

After running the above command, Docker Compose will pull the MongoDB and Redis images, which may take a few minutes. After the pull is complete, the four services will be started in sequence, and you will see the following on the command line.

docker-compose

Under normal circumstances, you should be able to see that all four services have started successfully, and logs can be printed smoothly. If the startup is unsuccessful, please contact the author (tikazyq1) on WeChat or raise an Issue on Github.

If you started Docker Compose on your local machine, enter http://localhost:8080 in the browser and you will see the login screen; if you started Docker Compose on another machine, enter http://<your_ip>:8080 in the browser to see the login screen, where <your_ip> is that machine's IP address (please make sure port 8080 on that machine is open to the outside).

login

The initial login username and password are admin/admin; you can log in with them. If the environment variable `CRAWLAB_API_ADDRESS` is not set correctly, you may see the login button keep spinning without any prompt after you click it. In that case, set `CRAWLAB_API_ADDRESS` correctly in `docker-compose.yml` (replace `localhost` with `<your_ip>`), restart with `docker-compose up`, and then enter http://<your_ip>:8080 in the browser.

After logging in you will see the Crawlab home page.

home

This article mainly introduces how to set up the crawler management platform Crawlab, so it will not cover how to use Crawlab in detail (a separate article may introduce that in detail; interested readers can stay tuned). If you are confused, please check the relevant documentation to learn how to use it. You can also add the author on WeChat (tikazyq1) and mention Crawlab, and the author will add you to the discussion group, where your questions can be answered.

How to integrate crawlers such as Scrapy into Crawlab

As we all know, Scrapy is a very popular crawler framework; its flexible design, high concurrency, ease of use, and extensibility have led to its wide adoption by many developers and enterprises. Almost all crawler management platforms on the market support Scrapy crawlers, and Crawlab is no exception; but Crawlab can also run other crawlers built with puppeteer, selenium, and so on. The following introduces how to run a scrapy crawler in Crawlab.

The basic principle of how Crawlab executes crawlers

The principle by which Crawlab executes a crawler is very simple: it is essentially a shell command. The user enters, in the crawler's settings, the shell command that runs the crawler, for example `scrapy crawl some_spider`; the Crawlab executor reads this command and executes it directly in a shell. So each crawler task run is simply one execution of a shell command (of course, the real situation is much more complicated than this; interested readers can refer to the official documentation). Crawlab also supports displaying and exporting crawler results, but that requires a little extra work.
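As a rough illustration of the idea only (Crawlab itself is written in Golang, and this is not its actual code), running one task boils down to executing a shell command with the task ID injected through an environment variable:

# Illustrative sketch only: Crawlab's executor is written in Golang;
# this just shows the idea of "a task = one shell command + environment variables".
import subprocess, os

def run_task(cmd, task_id, cwd):
    env = dict(os.environ, CRAWLAB_TASK_ID=task_id)   # pass the task ID to the spider via an env variable
    proc = subprocess.Popen(cmd, shell=True, cwd=cwd, env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    for line in proc.stdout:                          # stream the command's output as task logs
        print(line.decode(errors='replace'), end='')
    return proc.wait()                                # exit code decides task success/failure

# run_task('scrapy crawl some_spider', 'some-task-id', '/path/to/spider')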

Writing a Pipeline

To integrate a scrapy crawler, all we need to do is store the data the crawler scrapes into Crawlab's database and associate it with a task ID. Every time a crawler task is executed, the task ID is passed to the crawler program through an environment variable, so what we need to do is save each result, together with the task ID, into the database (Crawlab currently only supports MongoDB; support for relational databases such as MySQL, SQL Server, and Postgres will be developed later, so users who need it can stay tuned).

In Scrapy, we need to write the storage logic. Sketch code is shown below:

# Import the required libraries; pymongo is the standard library for connecting to MongoDB
import os
from pymongo import MongoClient

# MongoDB configuration parameters
MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)  # mongo connection instance
    db = mongo[MONGO_DB]  # database instance
    col_name = os.environ.get('CRAWLAB_COLLECTION')  # collection name, passed in via the environment variable CRAWLAB_COLLECTION

    # If CRAWLAB_COLLECTION is not set, default the collection name to test
    if not col_name:
        col_name = 'test'

    col = db[col_name]  # collection instance

    # Called for every item that comes through the pipeline; the arguments are the item and the spider
    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')  # set task_id to the task ID passed in via the environment variable
        self.col.insert_one(dict(item))  # save the item to the database (insert_one replaces pymongo's deprecated save())
        return item

At the same time, you also need to add a `task_id` field in `items.py` to ensure that the value can be assigned (this is important).
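For example, an items.py might look like the sketch below; the item class name and the business fields are placeholders, and only task_id is the required addition:

# items.py -- the class name and the title/url fields are placeholders for illustration;
# task_id is the field Crawlab needs so that results can be linked back to a task.
import scrapy

class JuejinItem(scrapy.Item):
    title = scrapy.Field()    # example business field
    url = scrapy.Field()      # example business field
    task_id = scrapy.Field()  # required: filled in by the pipeline from CRAWLAB_TASK_ID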

Upload and configure the crawler

Before running the crawler, you need to upload the crawler files to the master node. The steps are as follows:

  1. Package the crawler files into a zip archive (note: make sure to zip them directly from the root directory);
  2. In the sidebar, click "Crawlers" to navigate to the crawler list, click the "Add Crawler" button, and choose "Custom Crawler";
  3. Click the "Upload" button and select the zip file you just packaged;
  4. After the upload succeeds, the newly added custom crawler appears in the crawler list, and the upload is complete.

In the crawler detail page you can click the "Files" tab, select a file, and edit the code in that file.

crawler file

Next, in the "Overview" tab, enter the crawler's shell execution command in the "Execute Command" field. Scrapy is built into Crawlab's Docker image, so scrapy crawlers can be run directly. The command is scrapy crawl <some_spider>. Click the "Save" button to save the crawler configuration.

Run the crawler task

Then run the crawler task. This is very simple: click the "Run" button in the "Overview" tab, and the crawler task starts running. If the log says the scrapy command cannot be found, change scrapy to its absolute path /usr/local/bin/scrapy and it will run successfully.

The task's progress is shown on the "Tasks" page and in the crawler's "Overview" tab, refreshed every 5 seconds, so you can follow it there. In the crawler's "Results" tab you can preview the details of the results and export the data to a CSV file.

Building a continuous integration (CI) workflow

For enterprises, software development is generally an automated process that goes through requirements, development, deployment, testing, and release. This process is usually iterative and requires continuous updates and releases.

Take a crawler as an example. You have deployed a crawler that periodically scrapes website data. One day you suddenly find that no data is being scraped; you quickly locate the cause and discover that the website has been redesigned, so you need to change the crawler's scraping rules to cope with the redesign. In short, you need to release a code update. The quickest way is to change the code directly in production, but doing so is very dangerous: first, you cannot test the updated code, and can only verify whether scraping works by repeatedly tweaking the production code; second, you have no record of the change, and if problems come up later you will very likely overlook it, which leads to bugs. What you need to do is simply manage your crawler code with a version control tool. There are many version control tools; the most commonly used are git and subversion, and version control platforms include Gitlab, Bitbucket, self-hosted Git repositories, and so on.
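As a minimal sketch (the project path, remote URL, and branch name are placeholders), putting a crawler project under git could look like this:

# Hypothetical commands; the path, remote URL, and branch name are placeholders.
cd /path/to/spider-project
git init
git add .
git commit -m "Initial crawler version"
git remote add origin https://gitlab.example.com/team/spider-project.git
git push -u origin master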

Once we have updated the code, we need to release the updated code to the production server. At this point you can write your own deployment script, or more conveniently, use Jenkins as the continuous integration (CI) management platform. Jenkins is a continuous integration platform that can update and deploy code by pulling from the repository; it is a very practical tool used in many enterprises. The figure below is an example of how Crawlab crawlers can be applied in a continuous integration workflow.

Crawlab CI workflow

There are two ways to create or update a crawler in Crawlab:

  1. Upload a packaged zip file;
  2. Change the crawler files in the CRAWLAB_SPIDER_PATH directory on the master node.

Continuous integration targets the second way. The steps are as follows:

  1. Set up a code repository with Gitlab or another platform;
  2. Create a project in Jenkins and point the project's code source to the repository created earlier;
  3. In the Jenkins project, write a workflow that publishes the code to Crawlab's CRAWLAB_SPIDER_PATH; if you are running Crawlab in Docker, make sure this path is mounted to the host file system (a sketch of this deploy step is shown after this list);
  4. The Jenkins project's job can be written directly or with a Jenkinsfile; see the relevant references for details;
  5. In this way, every time a code update is committed to the repository, Jenkins publishes the updated code into Crawlab, and the Crawlab master node synchronizes the crawler code to the worker nodes, ready for scraping.
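A minimal sketch of the deploy step in item 3, assuming CRAWLAB_SPIDER_PATH is mounted at /opt/crawlab/spiders on the host; the path and project name are placeholders:

# Hypothetical deploy step run by the Jenkins job after checking out the repository;
# /opt/crawlab/spiders stands in for the host directory mounted to CRAWLAB_SPIDER_PATH.
rsync -a --delete ./ /opt/crawlab/spiders/my_spider/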

Summary

This article mainly introduced what a crawler management platform is and how to choose one, with an emphasis on how to set up the open-source crawler management platform Crawlab; it also covered how to integrate scrapy crawlers and how to build a continuous integration workflow. There is a lot this article did not cover, including the details of Crawlab's principles and architecture, how to use Crawlab, how to write large-scale crawlers, how to use Jenkins, and so on. These topics may be published in other articles; interested readers, please stay tuned. In addition, Crawlab still has some areas to improve, such as exception monitoring (zero values, null values), configurable crawlers, visual scraping, centralized log collection, and so on. These features will be developed and released gradually, so please stay tuned as well.

I hope this article is helpful to your work and study. If you have any questions, please add the author on WeChat (tikazyq1) or leave a message below, and the author will try his best to answer. Thanks!

<p align="center"> <img src="https://oscimg.oschina.net/oscnet/16c48234c8f5b366.jpg" height="360"> </p>
