Python: Spider crawler engineering, from beginner to advanced series:
- Python: Spider crawler engineering, from beginner to advanced (1): Create a Scrapy crawler project
- Python: Spider crawler engineering, from beginner to advanced (2): Use Spider Admin Pro to manage Scrapy crawler projects
This article builds on the `scrapy-project` directory from the previous article, which needs to be created in advance:

Python: Spider crawler engineering, from beginner to advanced (1): Create a Scrapy crawler project

In total, this article involves 3 directories, which can be created ahead of time:
```shell
$ tree -L 1
.
├── scrapy-project
├── scrapyd-project
└── spider-admin-project
```
1. Use scrapyd to run the crawler
scrapyd is a service for deploying, running, and managing Scrapy crawler projects.
Prepare the environment:

```shell
# Create the directory and enter it
$ mkdir scrapyd-project && cd scrapyd-project

# Create a virtual environment and activate it
$ python3 -m venv venv && source venv/bin/activate
```
Install scrapyd:

```shell
# Install scrapyd
$ pip install scrapyd

$ scrapyd --version
Scrapyd 1.4.2
```
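scrapyd listens on 127.0.0.1:6800 by default. If you need a different port or bind address, scrapyd also reads an optional `scrapyd.conf` from the directory it starts in; a minimal sketch, only needed when overriding the defaults:

```ini
# scrapyd.conf -- optional; the values shown here are the defaults
[scrapyd]
bind_address = 127.0.0.1
http_port    = 6800
```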
Start the scrapyd service:

```shell
$ scrapyd
```

Browser access: http://127.0.0.1:6800/
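To confirm the service is up without a browser, scrapyd also exposes a status endpoint (a quick check, assuming the default port):

```shell
$ curl http://127.0.0.1:6800/daemonstatus.json
# Expected shape (values will vary):
# {"node_name": "...", "status": "ok", "pending": 0, "running": 0, "finished": 0}
```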
2. Deploy the Scrapy crawler project
2.1. Modify the configuration file

Go back to the crawler project directory `scrapy-project` and edit the configuration file `scrapy.cfg`.

Uncomment the `url` line in the `[deploy]` section; 6800 is the port of the scrapyd service we started above:
```ini
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = web_spiders.settings

[deploy]
# url = http://localhost:6800/
url = http://localhost:6800/
project = web_spiders
```
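As an aside, scrapyd-client also supports named deploy targets, which helps once more than one scrapyd instance is involved. The target name `local` below is a hypothetical example, not part of this project's config:

```ini
# scrapy.cfg -- an optional named target
[deploy:local]
url = http://localhost:6800/
project = web_spiders
```

With that section in place, `scrapyd-deploy local` deploys to this target explicitly.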
2.2. Deploy the project

Install scrapyd-client:

```shell
$ pip install scrapyd-client
```
Deploy the project:

```shell
$ scrapyd-deploy
Packing version 1691131715
Deploying to project "web_spiders" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "bogon", "status": "ok", "project": "web_spiders", "version": "1691131715", "spiders": 1}
```
If the server response contains `"status": "ok"`, the deployment succeeded.
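At this point the crawler can already be triggered by hand through scrapyd's HTTP API, even before any scheduler is set up (using the project and spider names from this series):

```shell
# Schedule one run of the "wallpaper" spider
$ curl http://localhost:6800/schedule.json -d project=web_spiders -d spider=wallpaper
```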
3. Use Spider Admin Pro to execute crawlers on a schedule

The Spider Admin Pro project uses the API provided by scrapyd to implement a visual crawler management platform, making it convenient for us to manage and schedule crawlers.
3.1. Install Spider Admin Pro
At this point, we need to create a new directory, spider-admin-project:

```shell
# Create the directory and enter it
$ mkdir spider-admin-project && cd spider-admin-project

# Create a virtual environment and activate it
$ python3 -m venv venv && source venv/bin/activate
```
Install spider-admin-pro:

```shell
$ pip3 install spider-admin-pro
```

Start spider-admin-pro:

```shell
$ gunicorn 'spider_admin_pro.main:app'
```

Browser access: http://127.0.0.1:8000/
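gunicorn binds to 127.0.0.1:8000 by default; if you need another address or port, its standard flags apply (shown only as an example, not required for this article):

```shell
# Bind to a specific address/port and use 2 worker processes
$ gunicorn -b 127.0.0.1:8000 -w 2 'spider_admin_pro.main:app'
```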
Default credentials:

- Account: admin
- Password: 123456
3.2. Add a scheduled task

Click the tab in the left sidebar: 定时任务 (Scheduled Tasks), then add a task.

Our project has only one crawler, so its name is selected by default. The cron expression here means the task executes once every minute.
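For reference, a standard five-field cron expression that fires every minute looks like this (the fields are minute, hour, day of month, month, day of week):

```
* * * * *
```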
Leave everything else at the defaults and click 确定 (OK). Because the task has not run yet, the logs are empty; we need to wait a while.
3.3. Check the scheduling log

Click the tab in the left sidebar: 调度日志 (Scheduling Log). After a while you will see that the crawler project has been executed, and you can view the scheduling logs here.

Note that content printed with `print` in our code will not appear in the log file, so we modify the code, replacing `print` with `self.logger.debug`:

web_spiders/spiders/wallpaper.py
```python
import scrapy
from scrapy.http import Response


class WallpaperSpider(scrapy.Spider):
    name = "wallpaper"
    allowed_domains = ["mouday.github.io"]

    # Replace the start URL with the address we actually want to crawl
    # start_urls = ["https://mouday.github.io"]
    start_urls = ["https://mouday.github.io/wallpaper-database/2023/08/03.json"]

    # Add the type annotation so the IDE can help us write code faster
    # def parse(self, response):
    def parse(self, response: Response, **kwargs):
        # Do nothing except log the crawled text
        # Content printed with `print` does not appear in the log file
        # print(response.text)
        self.logger.debug(response.text)
```
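Scrapy's log level defaults to DEBUG, so `self.logger.debug` output appears without extra configuration; if the level was raised elsewhere, it can be set explicitly in web_spiders/settings.py (a standard Scrapy setting, shown here as a reminder):

```python
# web_spiders/settings.py
# Ensure DEBUG messages are not filtered out of the logs
LOG_LEVEL = "DEBUG"
```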
Redeploy:

```shell
$ scrapyd-deploy
```

Wait for the newly deployed crawler to finish running, and you can see the log.
4. Collect crawler data
4.1. Return an Item object

The data structure returned by our target website is as follows:
```json
{
    "date": "2023-08-03",
    "headline": "绿松石般的泉水",
    "title": "泽伦西自然保护区,斯洛文尼亚",
    "description": "泽伦西温泉位于意大利、奥地利和斯洛文尼亚三国的交界处,多个泉眼汇集形成了这个清澈的海蓝色湖泊。在这里,游客们可以尽情欣赏大自然色彩瑰丽的调色盘。",
    "image_url": "https://cn.bing.com/th?id=OHR.ZelenciSprings_ZH-CN8022746409_1920x1080.webp",
    "main_text": "泽伦西自然保护区毗邻意大利和奥地利边境,距离斯洛文尼亚的克拉尼斯卡戈拉不到5公里。"
}
```
Therefore, we create an Item with fields matching this structure:

web_spiders/items.py
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WebSpidersItem(scrapy.Item):
    # define the fields for your item here like:
    date = scrapy.Field()
    headline = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    image_url = scrapy.Field()
    main_text = scrapy.Field()
```
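Item instances behave much like dicts: fields are read and written with subscript syntax, and assigning a key that was never declared as a Field raises a KeyError, which catches typos early. A quick illustration with hypothetical values:

```python
from web_spiders.items import WebSpidersItem

item = WebSpidersItem()
item["date"] = "2023-08-03"  # OK: "date" is a declared field
print(dict(item))            # {'date': '2023-08-03'}
# item["datee"] = "..."      # would raise KeyError: no such declared field
```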
At the same time, we modify the crawler file to wrap the data in a WebSpidersItem (a subclass of Item) and return it:

web_spiders/spiders/wallpaper.py
```python
import json

import scrapy
from scrapy.http import Response

from web_spiders.items import WebSpidersItem


class WallpaperSpider(scrapy.Spider):
    name = "wallpaper"
    allowed_domains = ["mouday.github.io"]

    # Replace the start URL with the address we actually want to crawl
    # start_urls = ["https://mouday.github.io"]
    start_urls = ["https://mouday.github.io/wallpaper-database/2023/08/03.json"]

    # Add the type annotation so the IDE can help us write code faster
    # def parse(self, response):
    def parse(self, response: Response, **kwargs):
        # Log the crawled text
        # Content printed with `print` does not appear in the log file
        # print(response.text)
        self.logger.debug(response.text)

        # Deserialize the JSON string into a dict
        data = json.loads(response.text)

        # Collect the data we need
        item = WebSpidersItem()
        item['date'] = data['date']
        item['headline'] = data['headline']
        item['title'] = data['title']
        item['description'] = data['description']
        item['image_url'] = data['image_url']
        item['main_text'] = data['main_text']

        return item
```
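Before redeploying, the spider can be run locally from the scrapy-project directory to confirm the item comes out as expected (the output file name here is just an example):

```shell
# Run the spider once and write the scraped items to a local JSON file
$ scrapy crawl wallpaper -o items.json
```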
Redeploy:

```shell
$ scrapyd-deploy
```

In the log we can see that, in addition to the debug output, one more record is printed: the Item object we just returned.
4.2. Collect Item data

We can see that the run status column shows unknown everywhere; we need to know the crawler's actual run status, whether it succeeded or failed.

scrapy-util can help us collect statistics about each run.

Return to the project directory scrapy-project and install scrapy-util:

```shell
$ pip install scrapy-util
```
Modify the configuration file web_spiders/settings.py, adding the following configuration. Change the port number to the actual port of your spider-admin-pro instance; here it is 8000:
```python
# Set the URL for collecting run statistics; JSON data is POSTed to spider-admin-pro
# Note: this is only an example; set it to the real address of spider-admin-pro
# Here we assume spider-admin-pro is running at http://127.0.0.1:8000
STATS_COLLECTION_URL = "http://127.0.0.1:8000/api/statsCollection/addItem"

# Enable the stats collection extensions
EXTENSIONS = {
    # ===========================================
    # Optional: if the collected times are in UTC, use the local-time extension instead
    'scrapy.extensions.corestats.CoreStats': None,
    'scrapy_util.extensions.LocaltimeCoreStats': 0,

    # ===========================================
    # Optional: print the program's run duration
    'scrapy_util.extensions.ShowDurationExtension': 100,

    # Enable the stats collection extension
    'scrapy_util.extensions.StatsCollectorExtension': 100
}
```
Redeploy:

```shell
$ scrapyd-deploy
```

This time scrapyd's console outputs the following error:

```
ModuleNotFoundError: No module named 'scrapy_util'
```

Something is wrong: we have not installed scrapy-util in scrapyd's own running environment.
Stop scrapyd and, inside the scrapyd-project virtual environment we created earlier, install scrapy-util:

```shell
$ pip install scrapy-util
```

After the installation is complete, restart scrapyd.
Let the crawler run for a while, and we can see more information in the scheduling log list:

- Run status: finished instead of unknown
- Item count: 1, matching the one Item object we returned
- The error time is empty, indicating the program reported no errors
- Duration: 1 second; the run is short and ends quickly
5. Summary
This article uses many third-party modules; integrating them into our project can greatly improve our work efficiency.
| Third-party library | Description | Documentation |
|---|---|---|
| scrapy | Create an engineered crawler project | github |
| scrapyd | Run scrapy crawlers | github, docs |
| scrapyd-client | Deploy scrapy crawlers | github |
| spider-admin-pro | Schedule scrapy crawlers | github |
| scrapy-util | Collect crawler run results | github |
| gunicorn | Run the spider-admin-pro application | docs |