Learn Python from Scratch (16): Crawler Cluster Deployment

Foreword

Today I will cover the last part of the Python framework source code topic: crawler cluster deployment. So far, this "learn Python from scratch" series has covered fifteen topics:

1. Essential programming syntax
2. Network programming
3. Multi-threading / multi-processing / coroutines
4. MySQL database
5. Redis database
6. MongoDB database
7. Machine learning
8. Full-stack development
9. NumPy / pandas / Matplotlib
10. Hadoop
11. Spark
12. Crawler engineering
13. Crawler engineering: automation and packet capture
14. Scrapy framework
15. Feapder framework

This series of articles is based on the following learning routes:

Learn Python from scratch to advanced: roadmap home page

Python resources suitable for beginners and advanced learners:
① Tencent-certified Python complete project practical tutorial notes (PDF)
② Python interview question collections from a dozen major companies (PDF)
③ Complete Python video tutorials (from zero basics to advanced JS reverse engineering)
④ Hundreds of projects with source code and notes
⑤ Complete projects and documents covering programming syntax, machine learning, full-stack development, data analysis, crawlers, APP reverse engineering, and more

Crawler Cluster Deployment

One, Scrapyd framework

1. Environment deployment

Scrapyd is a Twisted-based Python service for deploying and running Scrapy crawlers. It exposes a web service whose JSON API can be used to manage crawler deployment and execution: a Scrapy project is packaged into an egg file and uploaded to the Scrapyd server through the API, after which its spiders can be scheduled to run.

The following are the detailed steps of Scrapyd framework environment deployment:

Install Python and pip

Scrapyd is a Python-based framework, so Python and pip need to be installed first. You can download the Python installation package from the Python official website, and then use the command line to install pip.

Install Scrapy and Scrapyd

Install Scrapy and Scrapyd using pip:

pip install scrapy
pip install scrapyd

Configure Scrapyd

Scrapyd reads its configuration from several locations, including /etc/scrapyd/scrapyd.conf and a scrapyd.conf file in the current working directory. The system-wide configuration file can be edited with the following command:

sudo nano /etc/scrapyd/scrapyd.conf

In the configuration file, you can set the port number of Scrapyd, the path of the log file, the path of the crawler project, etc.
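
As a rough illustration, a minimal scrapyd.conf might look like the snippet below; the option names follow the Scrapyd documentation, and the values are only examples:

[scrapyd]
# address and port the web service listens on
bind_address = 0.0.0.0
http_port    = 6800
# where uploaded egg files, databases and logs are kept
eggs_dir     = eggs
dbs_dir      = dbs
logs_dir     = logs
# if set, scraped items are stored here and served under /items/
items_dir    = items
# limit on concurrent Scrapy processes (0 = max_proc_per_cpu * number of CPUs)
max_proc         = 0
max_proc_per_cpu = 4

After changing the configuration, restart Scrapyd for the new settings to take effect.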

Start Scrapyd

Start Scrapyd with the following command:

scrapyd

Scrapyd will start on the default port 6800. You can visit http://localhost:6800/ in your browser to view Scrapyd's web interface.
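
A quick way to check that the service is up is Scrapyd's daemonstatus.json endpoint, which reports the node name and the number of pending, running and finished jobs:

curl http://localhost:6800/daemonstatus.json
# Example response (values will differ):
# {"node_name": "my-host", "status": "ok", "pending": 0, "running": 0, "finished": 0}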

Deploy Scrapy crawler

Package the Scrapy project into an egg file and upload it to the Scrapyd server. The upload goes through the addversion.json endpoint (the scrapyd-deploy tool described in the next section does this automatically):

curl http://localhost:6800/addversion.json \
    -F project=myproject -F version=1.0 \
    -F [email protected]

Here, the project parameter is the project name, version labels this upload, and egg is the egg file to upload. Once the project is deployed, a crawl is started through the schedule.json endpoint:

curl http://localhost:6800/schedule.json \
    -d project=myproject -d spider=myspider

where spider is the name of the crawler to run; the JSON response contains the job ID of the new run.

View crawler running status

You can view the running status of the crawler in Scrapyd's web interface. You can also fetch the log of a specific run with a command such as:

curl http://localhost:6800/logs/myproject/myspider/<jobid>.log

Here, myproject and myspider are the project and crawler names, and <jobid> is the job ID returned by schedule.json; Scrapyd serves each run's log as a .log file under the /logs/ path.

The above are the detailed steps of Scrapyd framework environment deployment.

2. Managing crawlers through the Scrapyd API

Scrapyd is a Python framework for deploying and running Scrapy crawlers. It provides an HTTP-based API through which the crawler can be managed and controlled. Through the Scrapyd API, you can communicate with the Scrapyd server and send commands to manage crawler startup, stop, and view crawler status.

Here is a detailed explanation of how the Scrapyd API handles crawlers:

Install Scrapyd :

First, the Scrapyd framework needs to be installed. You can use the pip command to install: pip install scrapyd

Start the Scrapyd server :

Start the Scrapyd server with the command scrapyd. By default, Scrapyd server will run on port 6800.

Create a Scrapy crawler :

Before using the Scrapyd API, you need to create a Scrapy crawler. You can use the Scrapy command line tool to create a new crawler project and write crawler code.

Deploy the crawler :

Run the scrapyd-deploy command (provided by the scrapyd-client package, installed with pip install scrapyd-client) in the project root directory to deploy the crawler to the Scrapyd server. The command reads the deploy target from the project's scrapy.cfg file, packages the project into an egg, and uploads it.
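
As a sketch, the deploy target in scrapy.cfg might look like this (the target name local, the project name myproject and the URL are illustrative):

[settings]
default = myproject.settings

[deploy:local]
url = http://localhost:6800/
project = myproject

With this in place, running scrapyd-deploy local -p myproject from the project root builds the egg and uploads it to the Scrapyd server at the given URL.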

Using the Scrapyd API :

Scrapyd API provides a series of interfaces for managing crawlers, including starting crawlers, stopping crawlers, getting crawler status, etc.

  • Start a crawler : use the /schedule.json endpoint. The project name and spider name are required, plus any optional spider arguments. For example: curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

  • Stop a crawler : use the /cancel.json endpoint to stop a running job. The project name and job ID need to be provided. For example: curl http://localhost:6800/cancel.json -d project=myproject -d job=12345

  • View crawler status : use the /listjobs.json endpoint to get the lists of pending, running and finished jobs for a project. For example: curl http://localhost:6800/listjobs.json?project=myproject

Parse the API response :

The response of the Scrapyd API is data in JSON format. You can use Python's requests library or other HTTP request libraries to send API requests and parse the returned JSON data.
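
As a minimal sketch, assuming Scrapyd is running locally and a project named myproject with a spider named myspider has already been deployed, scheduling and cancelling a run with the requests library looks roughly like this:

import requests

BASE = "http://localhost:6800"

# Start a crawl; the JSON response contains the id of the new job
resp = requests.post(f"{BASE}/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
job_id = resp.json()["jobid"]
print("started job", job_id)

# Later, the same job can be stopped through cancel.json
requests.post(f"{BASE}/cancel.json",
              data={"project": "myproject", "job": job_id})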

Through the Scrapyd API, you can manage and control the operation of Scrapy crawlers programmatically. This allows you to easily start and monitor crawler tasks remotely.

3. Scrapyd multi-task management

In Scrapyd, multitasking management refers to the ability to run and manage multiple Scrapy crawler tasks at the same time. Scrapyd provides a set of APIs and tools to easily manage multiple crawler tasks, including starting, stopping, monitoring task status, and obtaining task results. The following is a detailed explanation of Scrapyd multitasking management:

Create multiple crawler projects :

First, you need to create multiple independent Scrapy crawler projects. Each project is in a separate directory and has its own crawler code, configuration files and dependencies.

Deploy the crawler project :

Use Scrapyd's deployment tools (such as the scrapyd-deploy command) to deploy each crawler project to the Scrapyd server. Make sure you give each project a unique project name.

Start multiple tasks :

Use the /schedule.json interface of the Scrapyd API to start multiple tasks. You can start multiple tasks at the same time by sending multiple HTTP requests, one for each task. In each request, specify the project name and the crawler name to start.

Monitor task status :

Use the /listjobs.json interface of the Scrapyd API to get the list of currently running jobs and their status. You can periodically send API requests to get the latest task status information. Based on the task status, you can tell whether the task is running, completed, or encountered errors.
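
A small script can combine these two steps: starting several spiders and then polling their status. The sketch below assumes the spiders (whose names are purely illustrative) have already been deployed under one project:

import time
import requests

BASE = "http://localhost:6800"
PROJECT = "myproject"
SPIDERS = ["spider_a", "spider_b", "spider_c"]  # illustrative spider names

# Start one job per spider and remember the job ids
jobs = []
for spider in SPIDERS:
    r = requests.post(f"{BASE}/schedule.json",
                      data={"project": PROJECT, "spider": spider})
    jobs.append(r.json()["jobid"])

# Poll listjobs.json until every job appears in the "finished" list
while True:
    state = requests.get(f"{BASE}/listjobs.json",
                         params={"project": PROJECT}).json()
    finished = {job["id"] for job in state["finished"]}
    if all(job_id in finished for job_id in jobs):
        break
    time.sleep(10)

print("all jobs finished:", jobs)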

Get task results :

When a task has finished, its output can be fetched through Scrapyd's web service: the run's log is available under /logs/<project>/<spider>/<jobid>.log, and if the items_dir option is set in scrapyd.conf, the scraped items are served under /items/<project>/<spider>/<jobid>.jl. The finished entries returned by /listjobs.json tell you which jobs have results ready.

Stop task :

If you need to stop a running task, you can use the /cancel.json interface of the Scrapyd API. Provide the project name and task ID to stop the corresponding task.

Through Scrapyd's multi-task management capabilities, you can run and manage multiple independent crawler tasks at the same time. This allows you to handle large-scale crawling tasks, improving efficiency and reducing administrative costs.

Two, Gerapy crawler deployment

1. Gerapy environment setup

Gerapy is a distributed crawler management framework built on top of Scrapy and Scrapyd. It makes it easy to manage multiple Scrapy crawlers and provides a web interface for visual operation. The following walks through setting up a Gerapy environment:

Install Python

Gerapy is developed based on Python, so Python needs to be installed first. You can download the Python installation package from the official website, or use the package management tool to install it.

Install Scrapy

Gerapy is based on Scrapy, so Scrapy needs to be installed first. It can be installed using pip:

pip install scrapy

Install Gerapy

It can be installed using pip:

pip install gerapy

Install Scrapyd (and, if needed, Redis)

Gerapy stores its own data in a local database (SQLite by default) and controls crawlers through Scrapyd, so Scrapyd should be installed on every machine that will actually run spiders (pip install scrapyd). Redis is only required if the crawlers themselves depend on it, for example scrapy-redis based distributed spiders; in that case it can be downloaded from the official Redis website or installed with the system package manager.

Configure Gerapy

Gerapy keeps its working files in a directory created by the init command:

gerapy init
cd gerapy
gerapy migrate

gerapy init creates a working directory named gerapy, and gerapy migrate initializes its database. An administrator account for the web interface can then be created with gerapy createsuperuser (or gerapy initadmin, which creates a default admin/admin account).

Start Gerapy

You can start Gerapy with the following command:

gerapy runserver

Then visit http://localhost:8000 in the browser and log in with the administrator account to reach Gerapy's web interface.

Create a Scrapy project

In Gerapy's web interface, you can create or import a Scrapy project, edit its spiders online, deploy the project to the configured Scrapyd clients, and schedule crawler runs; task status and logs can then be viewed in the web interface.

2. Gerapy server deployment

Install Python and Scrapy

To install Python and Scrapy on the server, you can use the following commands:

sudo apt-get update
sudo apt-get install python3 python3-pip
sudo pip3 install scrapy

Install Gerapy

Install Gerapy with the following command:

sudo pip3 install gerapy

Initialize Gerapy

Initialize Gerapy with the following command:

gerapy init

This will create a folder named gerapy that contains configuration files for gerapy and other necessary files.

Configure Gerapy

Inside the gerapy folder, initialize Gerapy's database (SQLite by default) with gerapy migrate, and create the administrator account used to log in to the web interface, for example with gerapy createsuperuser.

Start Gerapy

Start Gerapy with the following command:

gerapy runserver

This will launch Gerapy's web interface, which can be accessed in a browser at http://localhost:8000 to manage the crawler.

Deploy the crawler

In Gerapy's web interface, you can register Scrapyd clients, add, edit and delete crawler projects, and deploy them to multiple servers to achieve distributed crawling.

3. Packaging a Gerapy framework project

Gerapy is a distributed crawler management framework built on Scrapy and Scrapyd; it makes it easy to manage multiple Scrapy crawlers and provides a web interface for operation and monitoring. In real projects, you may need to package Gerapy into an executable file so it can be deployed and run on other machines. This section introduces how to package a Gerapy framework project.

Install pyinstaller

pyinstaller is a tool for packaging Python code into executable files, which can be installed via pip:

pip install pyinstaller

Package Gerapy

Execute the following command in the root directory of the Gerapy project:

pyinstaller gerapy.spec

Here, gerapy.spec is a PyInstaller spec file that records the packaging parameters and options. If it does not exist yet, it is generated as a side effect of running PyInstaller with explicit options (pyi-makespec can also be used to generate only the spec file), for example:

pyinstaller --name=gerapy -y --clean --windowed --icon=gerapy.ico --add-data=gerapy.ico;. gerapy/__main__.py

This command will generate an executable named gerapy with the following arguments and options:

  • --name: name the generated executable gerapy;
  • -y: automatically overwrite the existing output directory;
  • --clean: clean the PyInstaller cache and temporary files before packaging;
  • --windowed: build a windowed application without a console window;
  • --icon: specify the application icon;
  • --add-data: bundle the gerapy.ico file into the executable (the ; path separator shown here is for Windows; on Linux/macOS use : instead).

Run Gerapy

After the packaging is complete, an executable file named gerapy will be generated in the dist directory. Copy this file to another machine, and you can run the Gerapy framework project on that machine.

Three, Feapder deployment

1. Feapder application scenarios and principles

Feapder is a lightweight distributed crawler framework developed based on Python, aiming to provide simple, easy-to-use and efficient crawler solutions. It has the following application scenarios and principles:

Application scenario:
  • Data collection : Feapder can be used to collect data from various websites and data sources. Whether it is crawling structured data or unstructured data, Feapder provides a wealth of functions and flexible configuration options to meet the needs of different data collection.

  • Website monitoring : Feapder can periodically monitor the changes of website content and remind users in time. This is very useful in situations where you need to monitor the target website in real time, such as news updates, price changes, etc.

  • Data cleaning and processing : Feapder supports custom processing functions and pipelines to clean and process the crawled data. You can use the data processing functions provided by Feapder, such as deduplication, encoding conversion, data filtering, etc., to convert the crawled raw data into usable structured data.

  • Data storage and export : Feapder provides a variety of data storage options, including database storage, file storage, and message queues. You can choose a suitable storage method according to your needs, and support data export to various formats, such as CSV, JSON, etc.

Principle analysis:

The core principle of Feapder is based on distributed asynchronous task scheduling and processing. The following is the principle analysis of Feapder:

  • Distributed architecture : Feapder uses a distributed architecture to improve crawling efficiency and scalability. Task scheduling and data processing are distributed on multiple nodes, and each node can run crawler tasks independently, and communicate and transmit data through message queues.

  • Concurrent task scheduling : Feapder runs crawl tasks concurrently through its built-in multi-threaded scheduler, and distributed spiders coordinate work through a Redis-based task queue. Each crawler task runs independently and receives and hands off requests through the queue.

  • Task scheduling and monitoring : Feapder provides task scheduling and monitoring functions, which can monitor the status, progress and error information of tasks in real time. You can start, stop, pause, and reschedule tasks through Feapder's management interface or API, and view task logs and statistics in real time.

  • Data processing and storage : Feapder supports custom data processing functions and processing pipelines, which can clean, transform and process crawled data. At the same time, Feapder provides a variety of data storage options, which can store the processed data in the database, file system or message queue, and support data export and import.

In summary, Feapder implements an efficient, flexible and scalable crawler framework through distributed asynchronous task scheduling and processing. It is designed so that users can easily configure and manage crawler tasks, and facilitate data processing and storage. Whether it is small-scale data collection or large-scale distributed crawling tasks, Feapder is a powerful choice.
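
To make this concrete, here is a minimal single-process spider following the AirSpider pattern from Feapder's documentation; the class name and URL are illustrative:

import feapder


class DemoSpider(feapder.AirSpider):
    """Minimal Feapder spider: fetch one page and print its title."""

    def start_requests(self):
        # Seed the task queue with an initial request
        yield feapder.Request("https://www.example.com")

    def parse(self, request, response):
        # The response supports xpath/css style extraction
        print(response.xpath("//title/text()").extract_first())


if __name__ == "__main__":
    DemoSpider().start()

Distributed variants (feapder.Spider, feapder.BatchSpider) follow the same structure but pull their requests from the Redis task queue, which is what allows multiple nodes to share one crawl.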

2. Pulling the feapder image

feapder is a Python-based distributed crawler framework that helps users quickly build an efficient and stable crawler system. Before deploying feapder with Docker, you first need to pull its image.

The image pull command is as follows :

docker pull feapder/feapder

This command will pull the latest version image of feapder from Docker Hub. After the pull is complete, you can use the following command to view the pulled image:

docker images

The feapder image contains all required dependencies and configurations and can be used directly. When using feapder, you can run the feapder image through Docker, or deploy the image to a Kubernetes cluster.

The command to run the feapder image using Docker is as follows:

docker run -it --name feapder feapder/feapder

This command starts feapder in the Docker container and enters the container's interactive terminal. In the container, you can use the command line tools provided by feapder to create and manage crawler tasks.

In short, pulling the feapder image takes a single command, and the image is convenient to use: it can run directly in a Docker container or be deployed to a Kubernetes cluster.

3. Deploying a feapder environment with Docker

feapder is a Python-based distributed crawler framework that can be used to quickly develop various types of crawlers. When using feapder, you can choose to use docker for deployment in order to manage and deploy crawlers more conveniently.

The following are the detailed steps to deploy feapder using docker:

Install docker and docker-compose

Before starting, docker and docker-compose need to be installed. You can refer to the official documentation for installation.

Pull the feapder image

The feapder image can be pulled from Docker Hub with the following command:

docker pull feapder/feapder

Create docker-compose.yml file

Create a docker-compose.yml file locally to define the container and related configuration of feapder. Here is an example file:

version: '3'

services:
  redis:
    image: redis:latest
    ports:
      - "6379:6379"
    volumes:
      - ./redis-data:/data

  mysql:
    image: mysql:latest
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_DATABASE: feapder
    ports:
      - "3306:3306"
    volumes:
      - ./mysql-data:/var/lib/mysql

  feapder:
    image: feapder/feapder
    environment:
      - REDIS_HOST=redis
      - MYSQL_HOST=mysql
      - MYSQL_USER=root
      - MYSQL_PASSWORD=root
      - MYSQL_DATABASE=feapder
    volumes:
      - ./feapder-data:/app/data
    depends_on:
      - redis
      - mysql

In this file, three services are defined: redis, mysql and feapder. Among them, redis and mysql are used to store the task queue and data of the crawler respectively, and feapder is the running environment of the crawler.

Start the container

In the local project directory, run the following command to start the container:

docker-compose up -d

This command will start all the services defined in the docker-compose.yml file and run them in the background.

Enter the feapder container

You can enter the feapder container with the following command:

docker exec -it feapder_feapder_1 /bin/bash

Among them, feapder_feapder_1 is the name of the container, which can be viewed by using the docker ps command.

Run the crawler

In the feapder container, crawlers are run the same way as on a local machine: by executing the spider's Python entry script. For example, a spider whose entry file is demo_spider.py (the file name here is illustrative) can be started with:

python demo_spider.py

This runs the spider defined in that file inside the container.

The above are the detailed steps to deploy feapder using docker. By using docker, it is easier to manage and deploy feapder crawlers.

4. Deploying a Scrapy project with Feapder

Feapder is a distributed crawler framework whose API style resembles Scrapy's, but it is an independent framework rather than an extension of Scrapy. If you want to use an existing Scrapy project together with Feapder, the Scrapy project still has to be set up and deployed in the usual way. The following are the detailed steps to set up a Scrapy project:

1. Create a Scrapy project

Create a new Scrapy project using the Scrapy command line tool, for example:

scrapy startproject myproject

2. Write Spider

In the Scrapy project, Spider is the core part of the crawler, responsible for defining how to crawl the data of the website. In a Scrapy project, a Spider is usually a Python class that needs to inherit the Spider class provided by Scrapy and implement some necessary methods.

For example, here is a simple Spider example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Parse the page content, e.g. extract the page title
        yield {'title': response.css('title::text').get()}

3. Configure the Scrapy project

The configuration file of the Scrapy project is settings.py , which contains some Scrapy configuration options, such as the crawler's User-Agent, download delay, and so on. In the configuration file, you can also set the middleware, pipelines, etc. used by Scrapy.

For example, here is a simple sample configuration file:

BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

DOWNLOAD_DELAY = 3

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
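
The pipeline referenced in ITEM_PIPELINES above would live in myproject/pipelines.py. A minimal sketch (the class name matches the setting; the processing step is illustrative):

class MyPipeline:
    def process_item(self, item, spider):
        # Clean, validate or store the item here before returning it
        return item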

4. Run the Spider

Run the spider using the Scrapy command line tool, for example:

scrapy crawl myspider

The above are the detailed steps to set up and run a Scrapy project. When deploying with Feapder, the Scrapy project can be kept as a separate sub-project, with its spiders invoked from Feapder to handle specific crawling tasks.

