Disjob: a distributed task scheduling framework

Introduction

Disjob is a distributed task scheduling framework designed from the ground up to support distributed execution of long-running tasks. In addition to regular task scheduling, it provides: task splitting and distributed parallel execution, pausing and canceling running tasks, resuming paused tasks, retrying failed tasks, saving task execution snapshots (Savepoint), task dependencies, task orchestration (DAG), broadcast tasks, and more. The following is the overall flow chart of Disjob:

Framework flow chart

Examples of application scenarios

To give a simple example: count the number of primes in the interval (0, 1 trillion]. On a single machine with a single thread, this would take a very long time. Here we can use the distributed parallel execution capability provided by Disjob to solve this type of problem.

  1. Split tasks

First, decide how many subtasks to split the job into based on the available machine resources. For example, we have 5 machines, each with a 2-core CPU (counting primes is CPU-intensive), so we decide to split the job into 10 tasks.
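The sizing decision above can be sketched as follows. This is a minimal illustration, not Disjob's API: `SplitPlanner`, `subtaskCount`, `splitParams`, and the parameter string format are all hypothetical. It also assumes an interleaved layout in which task i starts at its own block and then skips over the blocks of the other tasks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Disjob): plans how a job is split into subtasks. */
final class SplitPlanner {

    /** One subtask per CPU core across all workers, because the work is CPU-bound. */
    static int subtaskCount(int machines, int coresPerMachine) {
        return Math.multiplyExact(machines, coresPerMachine);
    }

    /** Builds one illustrative parameter string per subtask (format is an assumption). */
    static List<String> splitParams(int parallel, long blockSize) {
        List<String> params = new ArrayList<>();
        long stride = parallel * blockSize;        // distance between two blocks of the same task
        for (int i = 0; i < parallel; i++) {
            long firstBlockStart = i * blockSize;  // exclusive lower bound of the task's first block
            params.add("task-" + (i + 1) + ":start=" + firstBlockStart + ",stride=" + stride);
        }
        return params;
    }

    public static void main(String[] args) {
        int parallel = subtaskCount(5, 2);         // 5 machines x 2 cores
        splitParams(parallel, 100_000_000L).forEach(System.out::println);
    }
}
```

With 5 machines and 2 cores each, `subtaskCount` returns 10, and the first entry describes task-1 starting at 0 with a stride of 1 billion.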

  2. Dispatch tasks

The Supervisor uses the specified routing algorithm to dispatch the 10 split subtasks to these Worker machines.
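The article does not say which routing algorithm is configured, so the sketch below assumes plain round-robin purely for illustration. `RoundRobinRouter` and the worker names are hypothetical and do not reflect Disjob's actual Supervisor code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical round-robin router; Disjob's real routing strategies may differ. */
final class RoundRobinRouter {

    /** Assigns subtasks 1..subtasks to the workers in turn and returns worker -> task numbers. */
    static Map<String, List<Integer>> route(List<String> workers, int subtasks) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String worker : workers) {
            assignment.put(worker, new ArrayList<>());
        }
        for (int task = 1; task <= subtasks; task++) {
            String worker = workers.get((task - 1) % workers.size());
            assignment.get(worker).add(task);
        }
        return assignment;
    }
}
```

With 5 workers and 10 subtasks, every worker receives exactly 2 subtasks, which matches the 2 cores per machine in the example.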

  3. Receive tasks

After a Worker receives a subtask, it submits the subtask to the thread pool defined by the framework for execution.
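As a rough picture of what "submitting to a thread pool" means on the Worker side, here is a sketch using the standard `java.util.concurrent` API. The framework's own pool is not shown in the article, so `WorkerPoolSketch` and `runAll` are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a Worker's thread pool; not Disjob's implementation. */
final class WorkerPoolSketch {

    /** Submits every subtask to a fixed-size pool and returns the results in submission order. */
    static List<Long> runAll(List<Callable<Long>> subtasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Callable<Long> subtask : subtasks) {
                futures.add(pool.submit(subtask));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : futures) {
                results.add(future.get());   // blocks until that subtask has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for subtasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("subtask failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A pool size equal to the number of cores (2 in the example) is a reasonable choice for CPU-bound work such as prime counting.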

  4. Distributed parallel execution

During execution, we can count in batches (using a loop in the code). Here task-1 counts the interval (0, 100 million] in the first loop, (1 billion, 1.1 billion] in the second loop, and so on, up to (999 billion, 999.1 billion] in the last loop. The other tasks count their own intervals in the same way, so all tasks run in parallel across the machines.

P.S. The Riemann Hypothesis suggests that prime numbers are distributed fairly regularly. There are many ways to determine whether a number is prime, such as the sieve of Eratosthenes, Euler's sieve, and the Miller-Rabin primality test. Here we can use the primality test provided by the Guava library.
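The batch loop described above might look roughly like this. To stay dependency-free, the sketch uses simple trial division rather than Guava (presumably the article means `LongMath.isPrime`); trial division is far too slow for the real 1 trillion range and is only a stand-in. `PrimeBlockCounter`, `blockStart`, and the zero-based task and loop indices are assumptions for illustration.

```java
/** Hypothetical sketch of one subtask's batch counting; not the project's handler code. */
final class PrimeBlockCounter {

    /** Exclusive lower bound of the block handled by a task in a given loop (both zero-based). */
    static long blockStart(int taskIndex, int parallel, long blockSize, long loop) {
        return (loop * parallel + taskIndex) * blockSize;
    }

    /** Counts the primes in the interval (lo, hi]. */
    static long countPrimes(long lo, long hi) {
        long count = 0;
        for (long n = lo + 1; n <= hi; n++) {
            if (isPrime(n)) {
                count++;
            }
        }
        return count;
    }

    /** Deterministic trial division by 2, 3 and numbers of the form 6k +/- 1. */
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n < 4) return true;                       // 2 and 3
        if (n % 2 == 0 || n % 3 == 0) return false;
        for (long d = 5; d <= n / d; d += 6) {        // d <= n / d avoids overflow of d * d
            if (n % d == 0 || n % (d + 2) == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // task-1 (index 0) of 10 tasks, block size 100 million: second loop starts at 1 billion.
        System.out.println(blockStart(0, 10, 100_000_000L, 1));
        System.out.println(countPrimes(0, 100));      // 25 primes up to 100
    }
}
```

With these indices, task-1's last loop (loop 999) starts at 999 billion, matching the interval (999 billion, 999.1 billion] in the text.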

  5. Savepoint

What happens if a machine goes down in the middle of counting? Do we have to start over from scratch? No! We can use Savepoint to save a snapshot of task-1's current execution state every 10 loop iterations (or whenever more than 1 minute has passed since the last save). When the task is restarted after a crash, it reads this snapshot and continues counting from the last saved state. The following is a sample snapshot saved by task-1:

{
  "next": 4000000001, // the next loop will count the interval (4 billion, 4.1 billion]
  "count": 19819734,  // 19819734 primes have been counted so far
  "finished": false   // whether this task has finished counting: true = yes; false = no
}
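A minimal sketch of how such a snapshot could drive the loop is shown below. It assumes that `next` is the first number of the next block and that saving happens through a caller-supplied callback; `SavepointSketch`, `Snapshot`, and `run` are hypothetical names, and the framework's real Savepoint API is not shown in the article.

```java
import java.util.function.Consumer;

/** Hypothetical savepoint loop; mirrors the snapshot fields next, count and finished. */
final class SavepointSketch {

    static final long ONE_MINUTE_NANOS = 60_000_000_000L;

    static final class Snapshot {
        final long next;
        final long count;
        final boolean finished;

        Snapshot(long next, long count, boolean finished) {
            this.next = next;
            this.count = count;
            this.finished = finished;
        }
    }

    /** Resumes from start, saving every saveEvery loops, after one minute, and on completion. */
    static Snapshot run(Snapshot start, long blockSize, long stride, long limit,
                        int saveEvery, Consumer<Snapshot> savepoint) {
        long next = start.next;
        long count = start.count;
        boolean finished = start.finished;
        long lastSave = System.nanoTime();
        for (int loop = 1; !finished; loop++) {
            long hi = Math.min(next - 1 + blockSize, limit);   // block is [next, hi]
            for (long n = next; n <= hi; n++) {
                if (isPrime(n)) count++;
            }
            next += stride;                                    // jump to this task's next block
            finished = next > limit;
            boolean due = loop % saveEvery == 0 || System.nanoTime() - lastSave > ONE_MINUTE_NANOS;
            if (due || finished) {
                savepoint.accept(new Snapshot(next, count, finished));
                lastSave = System.nanoTime();
            }
        }
        return new Snapshot(next, count, finished);
    }

    /** Trial division; a slow stand-in for a real primality test. */
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n < 4) return true;
        if (n % 2 == 0 || n % 3 == 0) return false;
        for (long d = 5; d <= n / d; d += 6) {
            if (n % d == 0 || n % (d + 2) == 0) return false;
        }
        return true;
    }
}
```

Restarting after a crash then amounts to calling `run` again with the last saved snapshot instead of a fresh `Snapshot(1, 0, false)`.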

  6. Pause and resume

If the machines are temporarily needed for something else, we may want to pause the running counting task for a while. No problem! The framework supports pausing running tasks. You only need to find the task under "Schedule Instances" on the admin console page and click the "Pause" button. The task receives an interrupt signal when it is paused, and on receiving the signal the code can also use Savepoint to save the current execution snapshot.

Once the other work is done, we can find the paused task under "Schedule Instances" on the admin console page and click the "Resume" button. The task then resumes from the last saved state and continues execution.
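The sketch below illustrates the pause-then-save idea, under the assumption that the "interrupt signal" is delivered as a standard Java thread interrupt. `PauseSketch`, `runUntilPaused`, and `pauseDemo` are hypothetical, and the loop body is a stand-in for counting one block.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical pause handling via thread interruption; not Disjob's implementation. */
final class PauseSketch {

    /** Works loop by loop until interrupted, then stores the loop to resume from. */
    static void runUntilPaused(long startLoop, AtomicLong savepoint) {
        long loop = startLoop;
        while (!Thread.currentThread().isInterrupted()) {
            loop++;                       // stand-in for counting one block
        }
        savepoint.set(loop);              // save progress before the task stops
    }

    /** Starts the task, pauses it with an interrupt, and returns the saved loop index. */
    static long pauseDemo(long startLoop) {
        AtomicLong savepoint = new AtomicLong(-1);
        Thread task = new Thread(() -> runUntilPaused(startLoop, savepoint));
        task.start();
        task.interrupt();                 // plays the role of clicking "Pause"
        try {
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        }
        return savepoint.get();
    }
}
```

Resuming corresponds to calling `runUntilPaused` again with the saved value as `startLoop`.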

  7. Exception

If a subtask throws the framework's PauseTaskException during execution, all 10 subtasks of the corresponding instance are paused (including those dispatched to other machines). Similarly, if a CancelTaskException is thrown, all 10 subtasks of the instance are canceled. If any other type of exception is thrown, only the current subtask is canceled, and the other subtasks of the instance are not affected.
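These rules can be summarized in a small decision sketch. The two exception classes below are local stand-ins that only borrow the names from the text; they are not the framework's classes, and `ExceptionEffectSketch`, `Effect`, and `effectOf` are hypothetical.

```java
/** Hypothetical summary of how an exception thrown by one subtask affects the instance. */
final class ExceptionEffectSketch {

    // Local stand-ins named after the framework's exceptions (not the real classes).
    static class PauseTaskException extends RuntimeException { }
    static class CancelTaskException extends RuntimeException { }

    enum Effect { NONE, PAUSE_ALL_SUBTASKS, CANCEL_ALL_SUBTASKS, CANCEL_CURRENT_SUBTASK }

    /** Runs the subtask and maps its outcome to the effect described in the text. */
    static Effect effectOf(Runnable subtask) {
        try {
            subtask.run();
            return Effect.NONE;
        } catch (PauseTaskException e) {
            return Effect.PAUSE_ALL_SUBTASKS;       // every subtask of the instance is paused
        } catch (CancelTaskException e) {
            return Effect.CANCEL_ALL_SUBTASKS;      // every subtask of the instance is canceled
        } catch (RuntimeException e) {
            return Effect.CANCEL_CURRENT_SUBTASK;   // only this subtask is affected
        }
    }
}
```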

  8. Task orchestration

Now the whole prime counting job has finished: there are 10 subtasks, and each has counted its own part of the result. Can Disjob automatically summarize the results for me? Yes! The framework provides powerful and convenient expressions for orchestrating tasks, such as: A->B,C,(D->E)->D,F->G. Now we can create a summary task and orchestrate the two tasks together.

The following is the job data for the prime counting example. Only some of the main fields are listed; note how job_handler orchestrates the two task handlers (their code is in the project source):

{
  "jobGroup": "default",
  "jobName": "prime-count-dag",
  "jobState": 1, // job state: 0 = disabled; 1 = enabled
  "jobType": 2,  // job type: 1 = Normal; 2 = Workflow (DAG)
  "jobHandler": "cn.ponfee.disjob.test.handler.PrimeCountJobHandler -> cn.ponfee.disjob.test.handler.PrimeAccumulateJobHandler",
  "jobParam": "{\"m\":1,\"n\":10000000000,\"blockSize\":100000000,\"parallel\":10}",
  "triggerType": 2,
  "triggerValue": "2023-09-02 18:00:00"
}
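To see what the jobParam values above imply, the sketch below works out the number of blocks and shows the kind of summation the second handler would perform. It assumes `m` and `n` are inclusive bounds and that blocks are interleaved across tasks; `PrimeJobMath` and its methods are hypothetical and are not taken from the project's handlers.

```java
/** Hypothetical arithmetic for the jobParam fields m, n, blockSize and parallel. */
final class PrimeJobMath {

    /** Number of blocks needed to cover [m, n] (assumed inclusive), rounding up. */
    static long totalBlocks(long m, long n, long blockSize) {
        long numbers = n - m + 1;
        return (numbers + blockSize - 1) / blockSize;
    }

    /** Blocks handled by one task (zero-based index) when blocks are dealt out in turn. */
    static long blocksForTask(long totalBlocks, int parallel, int taskIndex) {
        return totalBlocks / parallel + (taskIndex < totalBlocks % parallel ? 1 : 0);
    }

    /** The summary step: adds up the partial counts reported by the subtasks. */
    static long accumulate(long[] subtaskCounts) {
        long total = 0;
        for (long count : subtaskCounts) {
            total = Math.addExact(total, count);
        }
        return total;
    }

    public static void main(String[] args) {
        long blocks = totalBlocks(1, 10_000_000_000L, 100_000_000L);
        System.out.println(blocks);                        // 100 blocks in total
        System.out.println(blocksForTask(blocks, 10, 0));  // 10 blocks per task
    }
}
```

With the values in the sample, 10 billion numbers in blocks of 100 million give 100 blocks, or 10 loops for each of the 10 parallel subtasks.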

The flow chart for the prime counting example is as follows:

Sample flow chart

Project address

Gitee: https://gitee.com/dromara/disjob

GitHub: https://github.com/dromara/disjob


Origin my.oschina.net/u/4192694/blog/10117757