A summary of experience using Celery


Why use Celery

Celery is a distributed task queue written in Python, so for the many systems already built in Python it integrates almost seamlessly. Celery focuses on real-time task processing and also supports scheduled (periodic) tasks, which makes it a good fit for real-time asynchronous work as well as timed jobs. Celery relies on a message broker such as RabbitMQ, and also supports Redis and even MySQL, MongoDB, etc.; the official default recommendation is RabbitMQ.

Broker selection

Although many brokers are officially supported, including RabbitMQ, Redis and even databases, using a database is not recommended: a database has to hit the disk constantly, which causes serious performance problems under heavy load, and your application is very likely using the same database, so a flood of task traffic could bring the application down with it. If your environment is relatively simple, Redis is a fine choice; if it is more complex, choose RabbitMQ, since it is the officially recommended broker, although it is more involved to operate than Redis. My choice is RabbitMQ as the broker and Redis as the result backend.
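A minimal sketch of that combination; the host names, credentials and the vhost below are placeholders, not values from this article:

from celery import Celery

# RabbitMQ as the message broker, Redis as the result backend
app = Celery(
    'proj',
    broker='amqp://user:password@localhost:5672/myvhost',
    backend='redis://localhost:6379/0',
)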

Celery cannot be started as root: the C_FORCE_ROOT environment variable

If you start Celery as the root user, you will run into the following problem:

Running a worker with superuser privileges when the
worker accepts messages serialized with pickle is a very bad idea!
If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).

Solution:

from celery import Celery, platforms

platforms.C_FORCE_ROOT = True  # add this line
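For context, a minimal sketch of where that line usually goes: the module that defines the Celery app, so it takes effect before the worker performs its privilege check (the app name and broker URL here are placeholders):

from celery import Celery, platforms

platforms.C_FORCE_ROOT = True  # allow the worker to run as root

app = Celery('proj', broker='redis://localhost:6379/0')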

Repeated task execution

Celery ran into repeated execution of scheduled tasks; at the time, Redis was being used as both broker and backend.
The official documentation describes the relevant behaviour:

If a task is not acknowledged within the Visibility Timeout the task will
be redelivered to another worker and executed.

This causes problems with ETA/countdown/retry tasks where the time to execute exceeds the visibility timeout; in fact if that happens it will be executed again, and again in a loop.

So you have to increase the visibility timeout to match the time of the longest ETA you are planning to use.

Note that Celery will redeliver messages at worker shutdown, so having a long visibility timeout will only delay the redelivery of ‘lost’ tasks in the event of a power failure or forcefully terminated workers.

Periodic tasks will not be affected by the visibility timeout, as this is a concept separate from ETA/countdown.

You can increase this timeout by configuring a transport option with the same name:

BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}

The value must be an int describing the number of seconds.

In other words, if we schedule a task whose ETA is further away than visibility_timeout, then every time visibility_timeout elapses Celery assumes the task has not been executed successfully and redelivers it to another worker.
The fix is to raise visibility_timeout so it is larger than the ETA delay we plan to use. Celery positions itself mainly as a real-time asynchronous queue, so its support for tasks scheduled this far in the future is not great.
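To make the relationship concrete, a small sketch under the assumption of a Redis broker and a 12-hour ETA (the app, task and URL are placeholders; the numbers are illustrative only):

from datetime import datetime, timedelta
from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')
# Must cover the longest ETA/countdown you plan to use (here, 12 hours).
app.conf.BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}

@app.task
def mytask():
    pass  # placeholder job body

# An ETA 12 hours out fits just inside the 43200-second visibility timeout above.
mytask.apply_async(eta=datetime.utcnow() + timedelta(hours=12))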
Even with that change, the task ran again the next day...

In the end my solution was this: after each run of the scheduled task, write a unique key to Redis with the current timestamp as its value. The next time the task runs, read that key back from Redis and compare it with the current time; only execute the job if the required scheduling interval has passed. This guarantees that the same task runs at most once within the configured period.
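A minimal sketch of that guard, assuming redis-py and a 'proj' Celery app; the key name, interval and job body are placeholders, not the original code:

import time

import redis
from celery import Celery

app = Celery('proj', broker='amqp://localhost', backend='redis://localhost:6379/0')
r = redis.StrictRedis(host='localhost', port=6379, db=1)

MIN_INTERVAL = 6 * 3600  # the intended scheduling interval, in seconds

@app.task
def scheduled_job():
    key = 'last_run:scheduled_job'
    last_run = r.get(key)
    now = time.time()
    # If the previous run started less than one interval ago, this delivery
    # is a duplicate and becomes a no-op.
    if last_run is not None and now - float(last_run) < MIN_INTERVAL:
        return 'skipped duplicate delivery'
    r.set(key, now)
    # ... the real job body goes here ...
    return 'done'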

Use separate queues

When you have many tasks to execute, don't be lazy and run everything through the default queue: tasks will interfere with one another and slow each other down, and important tasks may not get executed promptly. Everyone knows not to put all their eggs in one basket.
There is an easy way to set up queues:

Automatic routing

The simplest way to do routing is to use the CELERY_CREATE_MISSING_QUEUES setting (on by default).

With this setting on, a named queue that is not already defined in CELERY_QUEUES will be created automatically. This makes it easy to perform simple routing tasks.

Say you have two servers, x and y, that handle regular tasks, and one server, z, that only handles feed-related tasks. You can use this configuration:

CELERY_ROUTES = {'feed.tasks.import_feed': {'queue': 'feeds'}}

With this route enabled import feed tasks will be routed to the “feeds” queue, while all other tasks will be routed to the default queue (named “celery” for historical reasons).

Now you can start server z to only process the feeds queue like this:

user@z:/$ celery -A proj worker -Q feeds

You can specify as many queues as you want, so you can make this server process the default queue as well:

user@z:/$ celery -A proj worker -Q feeds,celery

Putting it together, you simply use:

CELERY_ROUTES = {'feed.tasks.import_feed': {'queue': 'feeds'}}
user@z:/$ celery -A proj worker -Q feeds,celery

Specify the routes, and the corresponding queues will be created automatically; then use -Q to tell the worker which queues to consume when it starts. The default queue is named celery; see the official documentation if you want to change the default queue's name.
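A small sketch of the same idea inside the app module; the task name follows the docs' example above, and the broker URL is a placeholder:

from celery import Celery

app = Celery('proj', broker='amqp://localhost')

# Route by task name; anything not listed falls back to the default 'celery' queue.
app.conf.CELERY_ROUTES = {
    'feed.tasks.import_feed': {'queue': 'feeds'},
}

@app.task(name='feed.tasks.import_feed')
def import_feed(url):
    pass  # fetch and import the feed

# A queue can also be chosen per call:
# import_feed.apply_async(args=('http://example.com/rss',), queue='feeds')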

Start multiple workers to perform different tasks

On the same machine, it is best to start separate workers for tasks of different priorities: for example, separate real-time tasks from scheduled tasks, and high-frequency tasks from low-frequency ones. This helps ensure that higher-priority tasks get more system resources, and it also keeps high-frequency task logs from drowning out the logs of the tasks you actually want to inspect, since each worker can write to its own log file.

$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker2.%h
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker3.%h

You can start different workers like this; %h expands to the hostname, see the official documentation for details.
Higher-priority tasks can be given more concurrency, but more workers and higher concurrency are not automatically better: it is enough that tasks do not pile up.

Whether to track task execution status

This depends on the business scenario. If you don't care about the result, or the task's execution itself changes data so that you can tell from the data whether it succeeded, then there is no need to store the task's return status, and you can set:

CELERY_IGNORE_RESULT = True

or

@app.task(ignore_result=True)
def mytask(…):
    something()

However, if the business needs to react to the task's execution status, do not configure it this way.
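If you do need the status, a short sketch of reading it back through the AsyncResult that delay() returns; the app, task body and URLs are placeholders, and a result backend must be configured with ignore_result left off:

from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0', backend='redis://localhost:6379/1')

@app.task
def mytask():
    return 'something'

result = mytask.delay()
print(result.id, result.state)   # e.g. PENDING, STARTED, SUCCESS
print(result.get(timeout=30))    # blocks until the worker finishes the task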

Memory leaks

Celery workers may leak memory when they run for a long time; this can be mitigated with the following setting:

CELERYD_MAX_TASKS_PER_CHILD = 40  # each worker child process is replaced after executing this many tasks

Author: Li Junwei
Link: http://www.jianshu.com/p/9e422d9f1ce2
Source: Jianshu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
