Apache DolphinScheduler 2.0.6 released, with new Master recall strategy

Version release  2022/7/12 

Apache DolphinScheduler recently released version 2.0.6. The new version makes important changes to dependency handling and task distribution, and fixes bugs found in 2.0.5. Details follow.

1  Major changes

First, this version addresses two problems: dependency judgment and uneven task distribution.

01  Dependency-related issues

In the previous version, when depending on an entire workflow, if only one task in that workflow was started and it completed successfully, the whole workflow was judged successful, with no check on the other tasks in the workflow. When depending on a specific task, if a different task in the workflow was started and completed successfully, the dependency got stuck waiting because it could not find the dependent task in the last workflow run, and the user had no way to know why it was waiting. In this version, we have rewritten the dependency decision logic:

  • Added logic that checks every 5 seconds whether dependencies are complete

  • Print the reason why a dependency is waiting or has failed

  • When depending on an entire workflow, the last state of every task in that workflow within the dependency cycle is checked, no matter how many tasks the last workflow run started. If any task has not run, the dependency fails; if any task failed, the dependency fails; if any task is still running, the dependency waits.

  • When depending on a single task in a workflow, the last state of that task within the dependency cycle is judged directly
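The decision rules above can be sketched as a small function. This is a minimal illustration, not DolphinScheduler's actual code: the names `TaskState`, `DependResult`, `judge_task_dependency`, and `judge_workflow_dependency` are hypothetical, `None` stands for a task with no run in the dependency cycle, and applying the same state mapping to a single-task dependency is an assumption.

```python
from enum import Enum
from typing import Dict, Optional


class TaskState(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


class DependResult(Enum):
    SUCCESS = "success"
    WAITING = "waiting"
    FAILED = "failed"


# The article states that dependencies are re-checked every 5 seconds.
CHECK_INTERVAL_SECONDS = 5


def judge_task_dependency(last_state: Optional[TaskState]) -> DependResult:
    """Judge a dependency on one task from its last state in the dependency cycle."""
    # Assumption: a task that never ran (None) fails the dependency,
    # mirroring the whole-workflow rule.
    if last_state is None or last_state is TaskState.FAILURE:
        return DependResult.FAILED
    if last_state is TaskState.RUNNING:
        return DependResult.WAITING
    return DependResult.SUCCESS


def judge_workflow_dependency(
    last_states: Dict[str, Optional[TaskState]]
) -> DependResult:
    """Judge a dependency on a whole workflow from the last state of every task."""
    results = [judge_task_dependency(state) for state in last_states.values()]
    # Assumption: failure takes precedence over waiting, following the
    # order in which the article lists the cases.
    if DependResult.FAILED in results:
        return DependResult.FAILED
    if DependResult.WAITING in results:
        return DependResult.WAITING
    return DependResult.SUCCESS
```

A caller would re-run `judge_workflow_dependency` every `CHECK_INTERVAL_SECONDS` while the result is `WAITING`, and report the reason once it becomes `FAILED`.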

02  Unbalanced task distribution

In the previous version, tasks were assigned by worker load by default. Suppose there are 3 worker nodes with loads of 0.1, 0.2, and 0.2. If 100 tasks are started at once, then within one heartbeat cycle all of them may be assigned to the worker with a load of 0.1, while the other two workers receive none. If a worker's concurrency is 10 tasks, the other 90 tasks queue on that same node, which prolongs the overall running time. To address this problem, the new version adds the Master recall strategy.
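The arithmetic of this example can be reproduced with a short simulation. It is only a sketch of the behavior described above, assuming the load values stay fixed for the whole heartbeat cycle; the function name `assign_by_load_snapshot` and the worker names are hypothetical.

```python
def assign_by_load_snapshot(loads, task_count):
    """Assign every task to the lowest-load worker in a fixed load snapshot.

    Loads are only refreshed once per heartbeat, so within one heartbeat
    the snapshot never changes and the same worker wins every time.
    """
    assigned = {name: 0 for name in loads}
    for _ in range(task_count):
        target = min(loads, key=loads.get)
        assigned[target] += 1
    return assigned


loads = {"worker-1": 0.1, "worker-2": 0.2, "worker-3": 0.2}
assigned = assign_by_load_snapshot(loads, task_count=100)

concurrency = 10  # tasks a worker runs at once, as in the example above
queued = {name: max(0, count - concurrency) for name, count in assigned.items()}

print(assigned)  # {'worker-1': 100, 'worker-2': 0, 'worker-3': 0}
print(queued)    # {'worker-1': 90, 'worker-2': 0, 'worker-3': 0}
```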

  • On the worker side, there are three queues: the waiting allocation queue (tasks wait here before moving on to the execution queues), the waiting execution queue, and the execution queue. In the previous version, the waiting allocation queue and the waiting execution queue were unbounded; they are now limited to the same size as the execution queue.

  • When the master selects a worker, if the workers' waiting execution queues are empty, it assigns the task to the worker with the least load; if the waiting queues are not empty, it assigns the task to the worker with the shortest waiting queue. If the waiting queues are the same length, the worker is selected by load; if the waiting queues of all workers are full, the master blocks for 1 second and then selects again.

  • A worker updates the size of its waiting queue in ZooKeeper (zk) only once per heartbeat, so if a large number of tasks are started within one heartbeat cycle, the queue size seen by the master has not yet been updated. When a worker receives a task, it first puts the task into the waiting allocation queue, which then hands tasks to the execution queue. When the execution queue is full, tasks go to the waiting execution queue. When both the execution queue and the waiting execution queue are full, the waiting allocation queue retries allocating a task every 1 second. When the waiting allocation queue is also full, the Master recall strategy is triggered: the worker returns the task to the master, and the master reassigns it.
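The selection and recall rules above can be sketched as follows. This is a simplified illustration under several assumptions, not the project's implementation: `Worker`, `offer`, and `select_worker` are hypothetical names, the master reads queue sizes directly instead of through ZooKeeper heartbeats, and the worker's three-stage hand-off is collapsed into a single `offer` call without the 1-second retry.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Worker:
    name: str
    load: float
    capacity: int  # execution queue size; both waiting queues share this limit
    executing: Deque[str] = field(default_factory=deque)
    waiting_execution: Deque[str] = field(default_factory=deque)
    waiting_allocation: Deque[str] = field(default_factory=deque)

    def offer(self, task: str) -> bool:
        """Accept a task, or return False to hand it back to the master (recall)."""
        if len(self.executing) < self.capacity:
            self.executing.append(task)
        elif len(self.waiting_execution) < self.capacity:
            self.waiting_execution.append(task)
        elif len(self.waiting_allocation) < self.capacity:
            # A real worker would retry moving this task forward every second.
            self.waiting_allocation.append(task)
        else:
            return False  # all three queues are full: trigger Master recall
        return True


def select_worker(workers: List[Worker]) -> Optional[Worker]:
    """Pick the worker with the shortest waiting queue, breaking ties by load.

    Returns None when every waiting queue is full; per the article, the
    master would then block for 1 second and select again.
    """
    candidates = [w for w in workers if len(w.waiting_execution) < w.capacity]
    if not candidates:
        return None
    # When all waiting queues are empty, this reduces to "least load wins".
    return min(candidates, key=lambda w: (len(w.waiting_execution), w.load))
```

For example, with three workers of capacity 2 and loads 0.1, 0.2, and 0.2, `select_worker` first returns the 0.1 worker; once that worker has a task in its waiting execution queue, the next task goes to one of the other two. A worker whose three queues are all full returns `False` from `offer`, which stands in for returning the task to the master.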

2  Bug fixes

Second, version 2.0.6 fixes problems left over from the previous version, including:

  • Fix the duplicate-name prompt shown when re-uploading a resource

  • Fix the problem of jumping directly to the list page after saving the workflow

  • Fix the failure to save when the resource name is long

  • Fix the LDAP login failure

  • Fix the dividing line problem in the email alert template

  • Fix the problem that task retry does not work when failover occurs

  • Fix the problem that the workflow succeeds when it contains failed tasks and the recovery fails

  • Fix the problem of the master/worker frequently printing repeated logs

  • Fix the problem that sub-workflows and dependent nodes cannot be stopped when the workflow is stopped

  • Fix the problem that the task execution status occasionally could not be obtained

  • Fix the problem that the sub-workflow occasionally gets stuck after workflow execution is completed

  • Fix the problem that the workflow state does not change after a task is forced to succeed

  • Fix dependency-related issues

  • Fix the very uneven task distribution when there are multiple worker nodes

3  Release notes

https://github.com/apache/dolphinscheduler/releases/tag/2.0.6

4  Resource download

https://dolphinscheduler.apache.org/en-us/download/download.html

5  Acknowledgements

Apache DolphinScheduler version 2.0.6 was tested in the production environment of Zhengcaiyun for nearly four months. The Zhengcaiyun big data platform team fixed a series of core problems and submitted the corresponding PR (https://github.com/apache/dolphinscheduler/pull/10541). The community would like to thank Zhengcaiyun and the Zhengcaiyun big data platform team, as well as all the contributors to version 2.0.6 (in alphabetical order). It is your unremitting efforts that keep the community improving!

Amy0104, calvinjiang, caishunfeng, JinyLeeChina, liqingwang, Hou-Shuaishhuai, songjianet, Tianqi-Dotes, weeway, zhanqian-1993, zwZjut

Participate and contribute

With the rapid rise of open source in China, the Apache DolphinScheduler community is growing vigorously. To make scheduling more usable and easier to use, we sincerely welcome everyone who loves open source to join the community, contribute to the rise of open source in China, and help local open source go global.

There are many ways to participate in the DolphinScheduler community, including:

Contribute your first PR (documentation or code). We hope it is a simple one: the first PR is for getting familiar with the submission process and community collaboration, and for experiencing the friendliness of the community.

The DolphinScheduler open source community needs your participation to contribute to the rise of open source in China. Even a small contribution matters, because the combined power is huge.

The open source community is looking forward to your participation.

Origin www.oschina.net/news/202752/apache-dolphinscheduler-2-0-6-released