Python: Spider Crawler Engineering from Beginner to Advanced (2) Use Spider Admin Pro to Manage Scrapy Crawler Projects

This article is part of the series: Python: Spider Crawler Engineering from Beginner to Advanced.

It builds on the scrapy-project directory created in the previous article, which needs to be prepared in advance:

Python: Spider Crawler Engineering from Beginner to Advanced (1) Create a Scrapy Crawler Project

In total, this article involves three directories, which can be created ahead of time:

$ tree -L 1
.
├── scrapy-project
├── scrapyd-project
└── spider-admin-project

1. Use scrapyd to run the crawler

scrapyd is a service that lets us deploy, run, and manage scrapy crawler projects

Prepare the environment

# Create the directory and enter it
$ mkdir scrapyd-project && cd scrapyd-project

# Create and activate a virtual environment
$ python3 -m venv venv && source venv/bin/activate

Install scrapyd

# Install scrapyd
$ pip install scrapyd

$ scrapyd --version
Scrapyd 1.4.2

Start the scrapyd service

$ scrapyd
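
This runs scrapyd in the foreground. If you prefer to keep it running while you continue working in the same terminal, a plain shell trick (a convenience, not part of the original steps) is:

# Run scrapyd in the background and send its output to a log file
$ nohup scrapyd > scrapyd.log 2>&1 &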

Open http://127.0.0.1:6800/ in a browser:

(Screenshot: the scrapyd web console)
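
The service can also be checked from the command line with scrapyd's daemonstatus.json endpoint; the counters will vary, but a healthy instance responds roughly like this:

$ curl http://127.0.0.1:6800/daemonstatus.json
{"node_name": "bogon", "status": "ok", "pending": 0, "running": 0, "finished": 0}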

2. Deploy the Scrapy crawler project

2.1. Modify the configuration file

Go back to the crawler project directory scrapy-project and modify the configuration file scrapy.cfg

Uncomment the url setting under [deploy]; port 6800 is the port of the scrapyd service we started above

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = web_spiders.settings

[deploy]
# url = http://localhost:6800/
url = http://localhost:6800/
project = web_spiders
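
As a side note, scrapyd-client also supports named deploy targets, which helps when more than one scrapyd instance is in use; a minimal sketch, where the second host is only a placeholder:

[deploy:production]
url = http://other-scrapyd-host:6800/
project = web_spiders

Such a target is deployed with scrapyd-deploy production once scrapyd-client (installed in the next step) is available.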

2.2. Deploy the project

Install scrapyd-client

pip install scrapyd-client

Deploy the project

$ scrapyd-deploy

Packing version 1691131715
Deploying to project "web_spiders" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "bogon", "status": "ok", "project": "web_spiders", "version": "1691131715", "spiders": 1}

If the response contains "status": "ok", the deployment succeeded
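
As an extra check, scrapyd's query endpoints can confirm what was deployed; the responses below are abridged, but the project and spider names match this walkthrough:

$ curl http://127.0.0.1:6800/listprojects.json
{"status": "ok", "projects": ["web_spiders"]}

$ curl "http://127.0.0.1:6800/listspiders.json?project=web_spiders"
{"status": "ok", "spiders": ["wallpaper"]}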

3. Use Spider Admin Pro to run crawlers on a schedule

Spider Admin Pro builds a visual crawler management platform on top of the API provided by scrapyd, making it easy for us to manage and schedule crawlers

3.1. Install Spider Admin Pro

At this point, we need to create a new directory: spider-admin-project

# Create the directory and enter it
$ mkdir spider-admin-project && cd spider-admin-project

# Create and activate a virtual environment
$ python3 -m venv venv && source venv/bin/activate

Install spider-admin-pro

pip3 install spider-admin-pro

Start spider-admin-pro

gunicorn 'spider_admin_pro.main:app'
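
gunicorn binds to 127.0.0.1:8000 by default. If that port is already taken, its standard -b/--bind option lets you pick another one (the alternative port below is just an example; if you change it, the spider-admin-pro address used later must change accordingly):

# Bind spider-admin-pro to a different port (example only)
gunicorn -b 127.0.0.1:8001 'spider_admin_pro.main:app'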

Open http://127.0.0.1:8000/ in a browser:

Default credentials:

  • Account: admin
  • Password: 123456

(Screenshot: the Spider Admin Pro login page)

3.2. Add scheduled tasks

Click 定时任务 (Scheduled Tasks) in the left sidebar and add a task.

Our project has only one crawler, so its name is selected by default.

The cron expression means the task runs once every minute.
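
For reference, assuming spider-admin-pro accepts standard five-field crontab expressions (minute, hour, day of month, month, day of week), a few illustrative values are:

* * * * *      # every minute
*/5 * * * *    # every 5 minutes
0 8 * * *      # every day at 08:00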

Leave everything at its default value and click 确定 (OK). Because the task has not run yet, the logs are empty; we need to wait a while.

(Screenshot: the scheduled task form)

3.3. Check the scheduling log

Click 调度日志 (Scheduling Logs) in the left sidebar. After a while you will see that the crawler project has been executed, and you can view its scheduling log here.

(Screenshot: the scheduling log list)

Note that content printed with print in our code does not appear in the log file.

So we modify the code, replacing print with self.logger.debug:

web_spiders/spiders/wallpaper.py

import scrapy
from scrapy.http import Response


class WallpaperSpider(scrapy.Spider):
    name = "wallpaper"

    allowed_domains = ["mouday.github.io"]

    # Replace the default start URL with the address we actually want to crawl
    # start_urls = ["https://mouday.github.io"]
    start_urls = ["https://mouday.github.io/wallpaper-database/2023/08/03.json"]

    # Add the type annotation so the IDE can help us write code faster
    # def parse(self, response):
    def parse(self, response: Response, **kwargs):
        # Do nothing for now; just log the fetched text

        # Content printed with `print` does not appear in the log file
        # print(response.text)
        self.logger.debug(response.text)

Redeploy

$ scrapyd-deploy
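
If you do not want to wait for the next scheduled run, scrapyd's schedule.json endpoint can also trigger the spider immediately (the job id in the response is only illustrative):

$ curl http://127.0.0.1:6800/schedule.json -d project=web_spiders -d spider=wallpaper
{"status": "ok", "jobid": "3f0b8c2e331a11ee..."}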

Wait for the crawler we just deployed to finish running, and you will be able to see the log:

(Screenshot: the scheduling log showing the crawler output)

4. Collect crawler data

4.1. Return an Item object

The data structure returned by our target website is as follows

{
    "date":"2023-08-03",
    "headline":"绿松石般的泉水",
    "title":"泽伦西自然保护区,斯洛文尼亚",
    "description":"泽伦西温泉位于意大利、奥地利和斯洛文尼亚三国的交界处,多个泉眼汇集形成了这个清澈的海蓝色湖泊。在这里,游客们可以尽情欣赏大自然色彩瑰丽的调色盘。",
    "image_url":"https://cn.bing.com/th?id=OHR.ZelenciSprings_ZH-CN8022746409_1920x1080.webp",
    "main_text":"泽伦西自然保护区毗邻意大利和奥地利边境,距离斯洛文尼亚的克拉尼斯卡戈拉不到5公里。"
}

Therefore, create the following Item class with the corresponding fields.

web_spiders/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WebSpidersItem(scrapy.Item):
    # define the fields for your item here like:
    date = scrapy.Field()
    headline = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    image_url = scrapy.Field()
    main_text = scrapy.Field()


At the same time, modify the crawler file to wrap the data in a WebSpidersItem object (a subclass of Item) and return it.

web_spiders/spiders/wallpaper.py

import json

import scrapy
from scrapy.http import Response

from web_spiders.items import WebSpidersItem


class WallpaperSpider(scrapy.Spider):
    name = "wallpaper"

    allowed_domains = ["mouday.github.io"]

    # Replace the default start URL with the address we actually want to crawl
    # start_urls = ["https://mouday.github.io"]
    start_urls = ["https://mouday.github.io/wallpaper-database/2023/08/03.json"]

    # Add the type annotation so the IDE can help us write code faster
    # def parse(self, response):
    def parse(self, response: Response, **kwargs):
        # Do nothing for now; just log the fetched text

        # Content printed with `print` does not appear in the log file
        # print(response.text)
        self.logger.debug(response.text)

        # Deserialize the JSON string into a dict
        data = json.loads(response.text)

        # Collect the data we need
        item = WebSpidersItem()
        item['date'] = data['date']
        item['headline'] = data['headline']
        item['title'] = data['title']
        item['description'] = data['description']
        item['image_url'] = data['image_url']
        item['main_text'] = data['main_text']

        return item
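
Before redeploying, the spider can optionally be tested locally from the scrapy-project directory; the output file name is only an example:

# Run the spider locally and write the collected items to a JSON file
$ scrapy crawl wallpaper -o items.json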

Redeploy

$ scrapyd-deploy

We can see that, in addition to the logged text, an extra record is printed: the Item object we just returned.
(Screenshot: the log showing the returned Item)

4.2. Collect Item data

We can see that the running-status column shows unknown for every run. We need to know the actual status of each crawler run, that is, whether it succeeded or failed.
(Screenshot: the scheduling log list with the running status shown as unknown)

scrapy-util can help us collect statistics about each crawler run.

Return to the scrapy-project directory.

Install scrapy-util

pip install scrapy-util

Modify the configuration file web_spiders/settings.py

Add the following configuration to the settings file, changing the port number to the one spider-admin-pro is actually running on; here it is 8000.

# Set the URL that run statistics are submitted to; JSON data is POSTed to spider-admin-pro
# Note: this value is only an example; set it to the real address of spider-admin-pro
# Here we assume spider-admin-pro is running at http://127.0.0.1:8000
STATS_COLLECTION_URL = "http://127.0.0.1:8000/api/statsCollection/addItem"

# Enable the data collection extensions
EXTENSIONS = {
    # ===========================================
    # Optional: if the collected timestamps are in UTC,
    # use the local-time extension to collect them instead
    'scrapy.extensions.corestats.CoreStats': None,
    'scrapy_util.extensions.LocaltimeCoreStats': 0,
    # ===========================================

    # Optional: print how long the program ran
    'scrapy_util.extensions.ShowDurationExtension': 100,

    # Enable the data collection extension
    'scrapy_util.extensions.StatsCollectorExtension': 100
}

Redeploy

$ scrapyd-deploy

This time, however, scrapyd's console outputs the following error:

ModuleNotFoundError: No module named 'scrapy_util'

This shows that something is wrong: we have not installed scrapy-util in the environment that scrapyd runs in.

Stop scrapyd and install scrapy-util in scrapyd's own virtual environment:

pip install scrapy-util

After the installation is complete, restart scrapyd
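
A minimal sketch of that sequence, assuming scrapyd was installed in the scrapyd-project virtual environment created earlier:

# Switch to scrapyd's environment, install scrapy-util, then restart scrapyd
$ cd scrapyd-project && source venv/bin/activate
$ pip install scrapy-util
$ scrapyd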

Let the crawler run for a while, and the scheduling log list will show more information:

  • The running status is finished instead of unknown
  • The item count is 1, matching the single Item object we returned
  • The error field is empty, indicating that the program did not report any errors
  • The duration is 1 second, so the run starts and finishes quickly

(Screenshot: the scheduling log list with the collected run statistics)

5. Summary

This article pulls in quite a few third-party modules; integrating them into our project can greatly improve our productivity.

Third-party library    Description                              Documentation
scrapy                 Create an engineered crawler project     github
scrapyd                Run the scrapy crawler                   github, docs
scrapyd-client         Deploy the scrapy crawler                github
spider-admin-pro       Schedule scrapy crawlers                 github
scrapy-util            Collect crawler run statistics           github
gunicorn               Run the spider-admin-pro application     docs


Origin blog.csdn.net/mouday/article/details/132104351