Why use Celery
Celery is a distributed task queue written in Python, so it integrates almost seamlessly with the many systems built in Python. Celery focuses on real-time task processing and also supports scheduled (periodic) tasks, which makes it a good fit for both real-time asynchronous jobs and timed jobs. Celery relies on a message broker such as RabbitMQ, and also supports Redis and even MySQL, MongoDB, and others; the official recommendation is RabbitMQ.
Broker selection
Although many brokers are officially supported, including RabbitMQ, Redis, and even databases, using a database is not recommended: a database has to hit the disk continuously, which causes serious performance problems under heavy load. Your application is also quite likely using the same database, so an overloaded broker could bring the application down with it. If your environment is relatively simple, Redis is a fine choice; if it is more complex, choose RabbitMQ, which is officially recommended but more complicated to operate than Redis. My choice is RabbitMQ as the broker and Redis as the result backend.
Celery cannot be started as root: the C_FORCE_ROOT environment variable
If you start Celery as the root user, you will see the following error:
Running a worker with superuser privileges when the
worker accepts messages serialized with pickle is a very bad idea!
If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).
Solution:
from celery import Celery, platforms
platforms.C_FORCE_ROOT = True  # add this line
Duplicate task execution
Celery ran scheduled tasks more than once in my setup, with Redis used as both broker and backend.
There is a relevant explanation in the official documentation:
If a task is not acknowledged within the Visibility Timeout the task will be redelivered to another worker and executed. This causes problems with ETA/countdown/retry tasks where the time to execute exceeds the visibility timeout; in fact if that happens it will be executed again, and again in a loop.
So you have to increase the visibility timeout to match the time of the longest ETA you are planning to use.
Note that Celery will redeliver messages at worker shutdown, so having a long visibility timeout will only delay the redelivery of ‘lost’ tasks in the event of a power failure or forcefully terminated workers.
Periodic tasks will not be affected by the visibility timeout, as this is a concept separate from ETA/countdown.
You can increase this timeout by configuring a transport option with the same name:
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}
The value must be an int describing the number of seconds.
In other words, if we schedule a task whose ETA is further away than visibility_timeout, then every time visibility_timeout elapses, Celery concludes that the task was never successfully executed by a worker and redelivers it to another worker.
The fix is to raise the visibility_timeout setting above the largest ETA offset we use. Celery itself is positioned primarily as a real-time asynchronous queue; its support for such long-delayed scheduled execution is not great.
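Applied to an app instance, raising the timeout might look like the following sketch (assuming a local Redis instance as broker and backend; the app name and URLs are illustrative):

```python
from celery import Celery

# minimal sketch, assuming Redis on localhost as both broker and backend
app = Celery('proj',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

# 43200 s = 12 h; must exceed the longest ETA/countdown you plan to use
app.conf.BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}
```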
But the task still ran twice the next day...
In the end, my solution was to write a unique key with a timestamp to Redis after each scheduled run. Before the next run, the task fetches that key's value from Redis and compares it with the current time; it proceeds only if the required interval has passed. This guarantees that the same task executes at most once within the given window.
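A minimal sketch of that guard, using a plain dict in place of the Redis client (a real implementation would call `get`/`set` on a `redis.Redis` connection; the task key and 24-hour interval here are illustrative assumptions):

```python
import time

INTERVAL = 24 * 60 * 60  # required gap between two runs, in seconds

def should_run(store, task_key, now=None):
    """Return True only if at least INTERVAL seconds have passed since
    the last recorded run of task_key, recording the new timestamp if so.

    `store` maps task key to last-run timestamp -- a dict here for
    illustration, a Redis client in the real setup.
    """
    now = time.time() if now is None else now
    last = store.get(task_key)
    if last is not None and now - float(last) < INTERVAL:
        return False  # a duplicate delivery inside the window: skip it
    store[task_key] = now
    return True
```

Called at the top of the task body, this makes a redelivered duplicate a cheap no-op instead of a second full run.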
Use different queues
When you have many tasks to execute, don't be lazy and route everything to the default queue: tasks will interfere with each other and slow one another down, and important tasks may not run promptly. Everyone knows not to put all your eggs in one basket.
There is an easy way to set up queues:
Automatic routing
The simplest way to do routing is to use the CELERY_CREATE_MISSING_QUEUES setting (on by default).
With this setting on, a named queue that is not already defined in CELERY_QUEUES will be created automatically. This makes it easy to perform simple routing tasks.
Say you have two servers, x and y, that handle regular tasks, and one server z that only handles feed-related tasks. You can use this configuration:
CELERY_ROUTES = {'feed.tasks.import_feed': {'queue': 'feeds'}}
With this route enabled import feed tasks will be routed to the “feeds” queue, while all other tasks will be routed to the default queue (named “celery” for historical reasons).
Now you can start server z to only process the feeds queue like this:
user@z:/$ celery -A proj worker -Q feeds
You can specify as many queues as you want, so you can make this server process the default queue as well:
user@z:/$ celery -A proj worker -Q feeds,celery
In short, simply use:
CELERY_ROUTES = {'feed.tasks.import_feed': {'queue': 'feeds'}}
user@z:/$ celery -A proj worker -Q feeds,celery
Specify the routes, and the corresponding queues will be created automatically; then use -Q to choose which queues a worker consumes. The default queue is named celery; see the official documentation if you want to rename it.
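Wired into an app, the route might look like this sketch (the `feed.tasks.import_feed` module path follows the docs example above; app name and broker URL are assumptions):

```python
from celery import Celery

# sketch only: broker URL and app name are illustrative
app = Celery('proj', broker='redis://localhost:6379/0')

# with CELERY_CREATE_MISSING_QUEUES on (the default), the 'feeds'
# queue is created automatically the first time it is referenced
app.conf.CELERY_ROUTES = {'feed.tasks.import_feed': {'queue': 'feeds'}}
```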
Start multiple workers to perform different tasks
On the same machine, it is best to start separate workers for tasks of different priorities: separate real-time tasks from scheduled ones, and high-frequency tasks from low-frequency ones. This helps ensure that higher-priority tasks get more system resources. High-frequency task logs would also drown out the logs of other real-time tasks; writing them to separate log files makes each easier to review.
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker2.%h
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker3.%h
You can start different workers like this; %h expands to the hostname (see the official documentation for details). High-priority tasks can be given more concurrency, but more workers and higher concurrency are not always better; it is enough to ensure that tasks don't pile up.
Whether to track task execution status
This depends on the business scenario. If you don't care about the result, or the task itself modifies data so that you can infer the outcome from the data, then there is no need to store the task's return status, and you can set:
CELERY_IGNORE_RESULT = True
or
@app.task(ignore_result=True)
def mytask(…):
    something()
However, if the business needs to react to the task's execution status, do not set this.
Memory leaks
Celery may leak memory when running for a long time; this can be mitigated with:
CELERYD_MAX_TASKS_PER_CHILD = 40  # a worker child process is replaced after executing this many tasks
Author: Li Junwei. Link: http://www.jianshu.com/p/9e422d9f1ce2 Source: Jianshu. Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.