Python Celery Distributed Tasks

This article uses a worked example to show how multiple Celery workers can consume tasks from different queues.

1. Background

Our business spans multiple cloud platforms, and jobs such as resource synchronization, data backup, and crawling involve large numbers of long-running tasks. A multi-process worker on a single machine can no longer keep up with the volume, so we need to weigh options for scaling out to meet the demand.

2. Celery Distributed Task Architecture

2.1 Roles

celery beat:    dispatches tasks to the task queues
celery worker:  consumes tasks from the queues; we run two of them, one consuming queue_resource and the other consuming queue_monitor
celery flower:  monitors task result information

broker:  Redis in this example; backed by kombu, it stores the worker nodes, the registered task queues, and unconsumed tasks
backend: Redis in this example; stores task execution results

2.2 Task flow diagram

(figure: beat dispatches tasks into the broker queues, the two workers consume their own queues, and results land in the backend)

3. Celery Task Deployment

3.1 Install dependencies

python 3.9
pip install celery==5.2.7
pip install redis
pip install flower
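
The commands in the following steps invoke Celery as ./venv/bin/celery, which assumes a virtualenv named venv in the project root; a minimal setup along those lines (paths are illustrative):

python3.9 -m venv venv
./venv/bin/pip install celery==5.2.7 redis flower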

3.2 Start beat

As the output below shows, Celery beat starts dispatching tasks: Scheduler: Sending due task …

./venv/bin/celery -A scheduler.celery beat -l INFO
celery beat v5.2.7 (dawn-chorus) is starting.
__    -    ... __   -        _
LocalTime -> 2023-04-08 17:00:40
Configuration ->
    . broker -> redis://localhost:6379/10
    . loader -> celery.loaders.app.AppLoader
    . scheduler -> celery.beat.PersistentScheduler
    . db -> celerybeat-schedule
    . logfile -> [stderr]@%INFO
    . maxinterval -> 5.00 minutes (300s)
[2023-04-08 17:00:40,273: INFO/MainProcess] beat: Starting...
[2023-04-08 17:00:40,314: INFO/MainProcess] Scheduler: Sending due task sync-monitor-001 (sync_monitor)
[2023-04-08 17:00:40,340: INFO/MainProcess] Scheduler: Sending due task sync-host-001 (sync_host)
[2023-04-08 17:00:40,343: INFO/MainProcess] Scheduler: Sending due task sync-database-001 (sync_database)
[2023-04-08 17:01:00,001: INFO/MainProcess] Scheduler: Sending due task sync-database-001 (sync_database)
[2023-04-08 17:01:00,011: INFO/MainProcess] Scheduler: Sending due task sync-host-001 (sync_host)
[2023-04-08 17:01:00,018: INFO/MainProcess] Scheduler: Sending due task sync-monitor-001 (sync_monitor)

Inspecting the task queues

(screenshot: queued messages in Redis db 10)
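
The queues can also be inspected programmatically with the redis client installed above. A minimal sketch, assuming the broker from this demo (localhost:6379, db 10); Celery's Redis transport keeps each queue as a plain list named after the queue:

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=10)

# Pending message count per queue
print(r.llen('queue_resource'), r.llen('queue_monitor'))

# Peek at one pending message without consuming it
raw = r.lindex('queue_resource', 0)
if raw:
    print(json.dumps(json.loads(raw), indent=2))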

3.3 Start the workers

Below we start two workers, each bound to its own queue via -Q (-c 4 gives four child processes, -E enables task events); each receives only the tasks from its queue and finishes them with succeeded.

./venv/bin/celery -A scheduler.celery worker -l INFO -E -c 4 -Q queue_monitor


 -------------- celery@yuehuadeMacBook-Pro.local v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-11.5.2-arm64-arm-64bit 2023-04-08 17:03:38
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         celery_app:0x104308670
- ** ---------- .> transport:   redis://localhost:6379/10
- ** ---------- .> results:     redis://localhost:6379/11
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ---- .> task events: ON
--- ***** ----- 
 -------------- [queues]
                .> queue_monitor    exchange=queue_monitor(direct) key=queue_monitor
                

[tasks]
  . sync_database
  . sync_host
  . sync_monitor

[2023-04-08 17:03:38,768: INFO/MainProcess] Connected to redis://localhost:6379/10
[2023-04-08 17:03:38,780: INFO/MainProcess] mingle: searching for neighbors
[2023-04-08 17:03:39,809: INFO/MainProcess] mingle: all alone
[2023-04-08 17:03:39,852: INFO/MainProcess] celery@yuehuadeMacBook-Pro.local ready.
[2023-04-08 17:03:39,885: INFO/MainProcess] Task sync_monitor[fc4faff3-16ab-40ae-9c6f-fd8e25c5d933] received
[2023-04-08 17:03:39,889: INFO/MainProcess] Task sync_monitor[3348aac6-ffaa-4821-82e2-6cec6ab0a9e5] received
[2023-04-08 17:03:39,890: WARNING/ForkPoolWorker-2] this is a scheduled monitor task
[2023-04-08 17:03:39,890: WARNING/ForkPoolWorker-4] this is a scheduled monitor task
[2023-04-08 17:03:39,896: INFO/MainProcess] Task sync_monitor[467bada6-61aa-472e-bff2-6f27d81b6172] received
[2023-04-08 17:03:39,910: INFO/MainProcess] Task sync_monitor[2a267e9f-5d6e-439d-9698-83bcd0665c49] received
[2023-04-08 17:03:39,911: WARNING/ForkPoolWorker-1] this is a scheduled monitor task
[2023-04-08 17:03:39,911: WARNING/ForkPoolWorker-3] this is a scheduled monitor task
[2023-04-08 17:03:39,944: WARNING/ForkPoolWorker-2]  
[2023-04-08 17:03:39,944: WARNING/ForkPoolWorker-4]  
[2023-04-08 17:03:39,945: WARNING/ForkPoolWorker-2] 阿里云
[2023-04-08 17:03:39,945: WARNING/ForkPoolWorker-4] 阿里云
[2023-04-08 17:03:39,946: WARNING/ForkPoolWorker-3]  
[2023-04-08 17:03:39,947: WARNING/ForkPoolWorker-3] 阿里云
[2023-04-08 17:03:39,947: WARNING/ForkPoolWorker-1]  
[2023-04-08 17:03:39,947: WARNING/ForkPoolWorker-1] 阿里云
[2023-04-08 17:03:39,955: INFO/ForkPoolWorker-4] Task sync_monitor[3348aac6-ffaa-4821-82e2-6cec6ab0a9e5] succeeded in 0.06502870800000005s: []
[2023-04-08 17:03:39,955: INFO/ForkPoolWorker-2] Task sync_monitor[fc4faff3-16ab-40ae-9c6f-fd8e25c5d933] succeeded in 0.0649843750000001s: []
[2023-04-08 17:03:39,957: INFO/ForkPoolWorker-3] Task sync_monitor[2a267e9f-5d6e-439d-9698-83bcd0665c49] succeeded in 0.04604374999999994s: []
[2023-04-08 17:03:39,959: INFO/ForkPoolWorker-1] Task sync_monitor[467bada6-61aa-472e-bff2-6f27d81b6172] succeeded in 0.048174624999999915s: []



./venv/bin/celery -A scheduler.celery worker -l INFO -E -c 4 -Q queue_resource
  
 -------------- celery@yuehuadeMacBook-Pro.local v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-11.5.2-arm64-arm-64bit 2023-04-08 17:05:26
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         celery_app:0x102bd4670
- ** ---------- .> transport:   redis://localhost:6379/10
- ** ---------- .> results:     redis://localhost:6379/11
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ---- .> task events: ON
--- ***** ----- 
 -------------- [queues]
                .> queue_resource   exchange=queue_resource(direct) key=queue_resource
                

[tasks]
  . sync_database
  . sync_host
  . sync_monitor

[2023-04-08 17:05:26,987: INFO/MainProcess] Connected to redis://localhost:6379/10
[2023-04-08 17:05:26,993: INFO/MainProcess] mingle: searching for neighbors
[2023-04-08 17:05:28,021: INFO/MainProcess] mingle: all alone
[2023-04-08 17:05:28,059: INFO/MainProcess] celery@yuehuadeMacBook-Pro.local ready.
[2023-04-08 17:05:28,079: INFO/MainProcess] Task sync_host[ceb41611-cd7a-4d39-bf5e-aad1d4a31853] received
[2023-04-08 17:05:28,083: INFO/MainProcess] Task sync_database[f44e60a4-20bc-4bdc-a011-383f0dbd5593] received
[2023-04-08 17:05:28,084: WARNING/ForkPoolWorker-2] this is a scheduled host task
[2023-04-08 17:05:28,084: WARNING/ForkPoolWorker-4] this is a scheduled database task
[2023-04-08 17:05:28,088: INFO/MainProcess] Task sync_database[4cc7f4f2-6921-4417-91fb-10c80a3db2a6] received
[2023-04-08 17:05:28,095: INFO/MainProcess] Task sync_host[4eee9ac3-9f66-4239-802d-d59f59fc895b] received
[2023-04-08 17:05:28,095: WARNING/ForkPoolWorker-1] this is a scheduled database task
[2023-04-08 17:05:28,095: WARNING/ForkPoolWorker-3] this is a scheduled host task
[2023-04-08 17:05:28,101: INFO/MainProcess] Task sync_host[92dc3658-b3a0-4fb7-baea-a4c9d266e3ab] received
[2023-04-08 17:05:28,113: INFO/MainProcess] Task sync_database[d81e651f-31c0-4fb2-b5b2-11a02465c3f9] received
[2023-04-08 17:05:28,116: INFO/MainProcess] Task sync_host[894e4e4e-60b7-48f4-8594-78009e13185d] received
[2023-04-08 17:05:28,123: INFO/MainProcess] Task sync_database[c7c17430-cfc3-4e46-87dd-cf436612a04c] received
[2023-04-08 17:05:28,124: WARNING/ForkPoolWorker-2]  
[2023-04-08 17:05:28,124: WARNING/ForkPoolWorker-2] cn-zhangjiakou
[2023-04-08 17:05:28,125: WARNING/ForkPoolWorker-4]  
[2023-04-08 17:05:28,125: WARNING/ForkPoolWorker-4] cn-zhangjiakou
[2023-04-08 17:05:28,128: INFO/MainProcess] Task sync_database[8b47f800-bb6e-44cf-8d46-9e59798f37a1] received
[2023-04-08 17:05:28,131: INFO/MainProcess] Task sync_host[f9842aab-9a5c-4cf6-9f76-d38f9c86bcaf] received
[2023-04-08 17:05:28,131: INFO/ForkPoolWorker-2] Task sync_host[ceb41611-cd7a-4d39-bf5e-aad1d4a31853] succeeded in 0.047811792000000075s: []
[2023-04-08 17:05:28,132: INFO/ForkPoolWorker-4] Task sync_database[f44e60a4-20bc-4bdc-a011-383f0dbd5593] succeeded in 0.048283166000000044s: []
[2023-04-08 17:05:28,134: INFO/MainProcess] Task sync_database[7251dcf9-c304-44c8-9d33-727aac6c3958] received
[2023-04-08 17:05:28,136: WARNING/ForkPoolWorker-3]  
[2023-04-08 17:05:28,136: WARNING/ForkPoolWorker-3] cn-zhangjiakou
[2023-04-08 17:05:28,136: WARNING/ForkPoolWorker-1]  
[2023-04-08 17:05:28,136: WARNING/ForkPoolWorker-1] cn-zhangjiakou
[2023-04-08 17:05:28,138: INFO/MainProcess] Task sync_host[3150cd69-aca3-4836-b4cc-35c54dff44dc] received
[2023-04-08 17:05:28,139: WARNING/ForkPoolWorker-4] this is a scheduled database task
[2023-04-08 17:05:28,139: WARNING/ForkPoolWorker-2] this is a scheduled host task
[2023-04-08 17:05:28,144: INFO/ForkPoolWorker-1] Task sync_database[4cc7f4f2-6921-4417-91fb-10c80a3db2a6] succeeded in 0.048955582999999914s: []
[2023-04-08 17:05:28,144: INFO/ForkPoolWorker-3] Task sync_host[4eee9ac3-9f66-4239-802d-d59f59fc895b] succeeded in 0.04881654199999996s: []

Checking the broker again, we find the previously queued tasks have all been consumed, and each worker is bound to its own queue.


3.4 Start flower

Flower should be started first; otherwise it cannot show the results of tasks that were already consumed.

./venv/bin/celery -A scheduler.celery flower --port=5555
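
The dashboard is then available at http://localhost:5555. Flower also exposes a REST API that is handy for scripted checks; for example (endpoint paths per the flower docs, adjust to your version):

curl http://localhost:5555/api/workers
curl 'http://localhost:5555/api/tasks?limit=10'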

Viewing historical task status: because both workers run on the same node here, their worker names are identical.


3.5 broker & backend

redis db 10: broker address
redis db 11: backend address

broker

(screenshot: broker keys in db 10)

backend

The tasks have completed. The odd timestamps here come from the timezone of my Docker Redis container, not from Celery; within Celery the timezone can be set via the timezone setting.

(screenshot: task results in db 11)
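
Both databases can also be checked without a GUI. A sketch with the redis client, assuming the layout above; the broker db holds the queue lists plus kombu's _kombu.binding.* sets, while the backend db holds one celery-task-meta-<task_id> key per stored result:

import redis

broker = redis.Redis(host='localhost', port=6379, db=10)
backend = redis.Redis(host='localhost', port=6379, db=11)

print(broker.keys('*'))  # queue lists and _kombu.binding.* sets

# Each result is a JSON blob with status, result, and date_done
for key in backend.scan_iter('celery-task-meta-*'):
    print(key, backend.get(key))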

4. Configuration Parameters

4.1 Global settings

broker_url

Address of the broker host; holds the task queues that beat dispatches into and the broker registration info

task_serializer

Serialization format for task messages; defaults to json

accept_content

Accepted content types; defaults to json

result_backend

Address of the backend where task results are stored

result_expires

Expiry time for task results, in seconds; defaults to 60 * 60 * 24 (one day)

result_serializer

Serialization format for task results; defaults to json

timezone

Timezone setting; if unset, UTC is used

enable_utc

Enables UTC timestamps

4.2 celery worker settings

worker_concurrency

Number of child processes executing tasks concurrently; defaults to the number of vCPU cores; set by the -c flag when starting a worker

worker_prefetch_multiplier

Number of messages each child process prefetches at a time; defaults to 4, i.e. each child reserves 4 task messages. Tune it to your task durations: if some tasks run for a long time, 1 is the better choice

worker_max_tasks_per_child

Maximum number of tasks a child process executes before being replaced by a fresh one; unlimited by default; useful for releasing memory

worker_max_memory_per_child

Maximum resident memory (KB) a child may consume before being replaced by a new one; if a single task pushes a child over the limit, it is replaced once that task completes; unlimited by default

task_time_limit

Hard limit, in seconds, on how long a worker may spend on one task; past it, the child processing the task is killed and replaced by a new one; unlimited by default

(screenshot: child process start times right after the worker starts)

(screenshot: child processes restarted after a task exceeded the hard time limit)

task_soft_time_limit

Soft limit, in seconds, on how long a worker may spend on one task; past it, a SoftTimeLimitExceeded is raised inside the task, giving it a chance to clean up before the hard limit arrives; unlimited by default
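
As a hedged sketch of reacting to the soft limit (the task below is hypothetical and only for illustration; note that both limits can also be set per task on the decorator, overriding the global settings):

import time

from celery.exceptions import SoftTimeLimitExceeded
from scheduler.celery import celery_app


@celery_app.task(name="sync_host_slow", soft_time_limit=5, time_limit=10)
def sync_hosts_slow():
    try:
        time.sleep(60)  # stand-in for a long-running sync
    except SoftTimeLimitExceeded:
        # Raised inside the task when the soft limit fires;
        # clean up here before the hard limit kills the child process.
        return []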

imports

Task modules to import at worker startup, in the configured order

4.3 celery queues/routes/annotations settings

task_queues

Static queue declarations for tasks

task_annotations

Per-task overrides such as rate limits
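
For example, a hypothetical annotation capping sync_host at ten runs per minute per worker ('*' as the key would apply it to every task):

task_annotations = {
    'sync_host': {'rate_limit': '10/m'},
}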

task_routes

Routing configuration for tasks. This demo does not declare task_queues statically, so the queues are created in dynamic-routing mode
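
For reference, a sketch of the static alternative using kombu, declaring both queues up front instead of creating them on first use:

from kombu import Queue

task_queues = (
    Queue('queue_resource', routing_key='queue_resource'),
    Queue('queue_monitor', routing_key='queue_monitor'),
)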

4.4 celery events settings

worker_send_task_events

Sends task-related events for monitoring; this is what flower consumes; enabled with the -E startup flag; off by default

4.5 celery beat settings

beat_schedule

Periodic task configuration; beat dispatches each task according to its schedule entry
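
The schedule value is not limited to crontab; plain intervals are accepted too. A sketch of the sync-host-001 entry in interval form:

from datetime import timedelta

beat_schedule = {
    'sync-host-001': {
        'task': 'sync_host',
        'schedule': timedelta(minutes=1),  # or simply 60.0 (seconds)
        'options': {'queue': 'queue_resource'},
    },
}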

5. Complete Code

5.1 Directory layout

celery.py: Celery application entry point
celeryconfig.py: Celery configuration
monitor: tasks consumed from queue_monitor
tasks: tasks consumed from queue_resource


5.2 Complete code

./celery.py
from celery import Celery

# Celery application object
celery_app = Celery('celery_app')

# Load the configuration module
celery_app.config_from_object('scheduler.celeryconfig')

./celeryconfig.py
from celery.schedules import crontab
from settings import config


# -------------------- Global settings -------------------- #
# Broker address
broker_url = config.CELERY_BROKER

# Result backend address
result_backend = config.CELERY_BACKEND

# Serialization format for task messages
task_serializer = 'json'

# Serialization format for task results
result_serializer = 'json'

# Retention time for task results in the redis backend, in seconds (default is 1 day; 3 days here)
result_expires = 60 * 60 * 24 * 3

# Accepted content types
accept_content = ['json']

# Enable UTC
# enable_utc = True
# If unset, the UTC timezone is used
timezone = 'Asia/Shanghai'



# -------------------- celery worker settings -------------------- #
# Number of child processes executing tasks concurrently; defaults to vCPU count; the -c startup flag
worker_concurrency = 4

# Messages prefetched per child process; default 4, i.e. each child reserves 4 task messages; if some tasks run long, 1 is the better choice
worker_prefetch_multiplier = 4

# Replace a child process with a fresh one after this many tasks; unlimited by default; releases memory
worker_max_tasks_per_child = 500

# Maximum resident memory (KB) a child may consume before being replaced; if a single task exceeds it, the child is replaced after that task completes; unlimited by default
worker_max_memory_per_child = 50000

# Hard time limit per task, in seconds; past it the child is killed and replaced; unlimited by default
task_time_limit = 3600

# Soft time limit per task, in seconds; past it SoftTimeLimitExceeded is raised so the task can clean up before the hard limit; unlimited by default
task_soft_time_limit = 1800

# Task modules to import at worker startup, in the configured order
imports = (
    'scheduler.tasks.sync_host',
    'scheduler.tasks.sync_database',
    'scheduler.monitor.sync_monitor'
)


# -------------------- celery queues/routes/annotations settings -------------------- #
# task_queues: queue declarations / traffic splitting
# task_annotations: rate limiting
task_routes = {
    'scheduler.tasks.*': {
        'queue': 'queue_resource',
    },
    'scheduler.monitor.*': {
        'queue': 'queue_monitor',
    }
}


# -------------------- celery events settings -------------------- #
# Task event monitoring for flower; the -E startup flag; off by default
worker_send_task_events = True


# -------------------- celery beat settings -------------------- #
beat_schedule = {
    'sync-host-001': {
        'task': 'sync_host',
        'schedule': crontab(minute='*/1'),
        'options': {'queue': 'queue_resource'}
    },
    'sync-database-001': {
        'task': 'sync_database',
        'schedule': crontab(minute='*/1'),
        'options': {'queue': 'queue_resource'}
    },
    'sync-monitor-001': {
        'task': 'sync_monitor',
        'schedule': crontab(minute='*/1'),
        'options': {'queue': 'queue_monitor'}
    }
}

./monitor/sync_monitor.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models.database import CloudDatabase
from scheduler.celery import celery_app
from settings import config


engine = create_engine(config.DATABASE_URL, echo=False)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


@celery_app.task(name="sync_monitor", bind=True)
def sync_monitors(self):
    print("this is a scheduled monitor task")
    db = SessionLocal()
    objs = db.query(CloudDatabase).filter_by(platform="阿里云").first()
    print(objs.account_name, objs.platform)
    return []

./tasks/sync_host.py
from scheduler.celery import celery_app
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from models.database import CloudDatabase
from settings import config


engine = create_engine(config.DATABASE_URL, echo=False)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


@celery_app.task(name="sync_host", bind=True)
def sync_hosts(self):
    print("this is a scheduled host task")
    db = SessionLocal()
    objs = db.query(CloudDatabase).filter_by(platform="阿里云").first()
    print(objs.account_name, objs.region)
    return []

./tasks/sync_database.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models.database import CloudDatabase
from scheduler.celery import celery_app
from settings import config


engine = create_engine(config.DATABASE_URL, echo=False)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


@celery_app.task(name="sync_database", bind=True)
def sync_databases(self):
    print("this is a scheduled database task")
    db = SessionLocal()
    objs = db.query(CloudDatabase).filter_by(platform="阿里云").first()
    print(objs.account_name, objs.region)
    return []
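
Beat is not the only producer: tasks can also be queued by hand, which is convenient for testing the routing. A sketch, assuming the app and tasks above are importable:

from scheduler.celery import celery_app
from scheduler.tasks.sync_host import sync_hosts

# Enqueue via the imported task function...
result = sync_hosts.apply_async(queue='queue_resource')

# ...or by registered name, without importing the task module.
result = celery_app.send_task('sync_host', queue='queue_resource')

print(result.id)
print(result.get(timeout=30))  # blocks until the backend has the result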

Reprinted from blog.csdn.net/wanglei_storage/article/details/130029916