FlowX: the workflow orchestration and scheduling system behind Volcano Engine DataLeap

For more technical discussion and job openings, follow the ByteDance Data Platform WeChat official account and reply [1] to join the official community group.

Background

Business scenarios

In day-to-day work we often need to run a piece of logic repeatedly, which calls for a scheduling system. Based on the scheduling requirements, these workloads fall broadly into two categories:

Time-based scheduling

Tasks are triggered repeatedly on a fixed period. This type is relatively easy to implement: a crontab entry is usually enough to run a task on a schedule. In real production, however, plain crontab jobs face challenges such as failure handling, monitoring and deployment, cross-machine deployment, and retries.
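To make these challenges concrete, here is a minimal, purely illustrative Python sketch of a periodic job with a retry loop; the task name, period, and retry settings are made up, and monitoring, alerting, and cross-machine deployment are exactly the parts such a script still leaves unsolved.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)

def run_with_retry(task, max_retries=3, retry_interval_s=60):
    """Run a task, retrying on failure -- one of the concerns plain crontab leaves to the user."""
    for attempt in range(1, max_retries + 1):
        try:
            task()
            return True
        except Exception:
            logging.exception("attempt %d/%d failed", attempt, max_retries)
            time.sleep(retry_interval_s)
    return False

def extract_daily_logs():
    # placeholder for the real business logic
    print("extracting logs ...")

if __name__ == "__main__":
    period_s = 24 * 60 * 60          # run once a day
    while True:
        started = time.time()
        if not run_with_retry(extract_daily_logs):
            logging.error("task failed after all retries; alerting would go here")
        # sleep until the next period boundary
        time.sleep(max(0, period_s - (time.time() - started)))
```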

Dependency-based scheduling

Dependency-based scheduling usually means that a piece of logic should only be triggered after a specific "event" occurs. The event can be the completion of an upstream task, the readiness of data at a given path, or some other external trigger. The dependencies between tasks form a workflow; a typical simple workflow is shown below:
In the figure above, "calculate user retention rate" must wait for "data preprocessing" to complete, so "calculate user retention rate" depends on the "data preprocessing" task. Dependencies between tasks may also require a "business time offset". For example, if "calculate user retention rate" is computed from today's data and the data from 7 days ago, then this node must depend on both the "data preprocessing" instance for the current business date and the instance from 7 days earlier. Only when the instances for both business dates have succeeded is that day's "calculate user retention rate" task triggered, which avoids producing dirty data.
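As a rough illustration (not FlowX code), computing the upstream business dates for the retention example might look like this; the task name and offsets are the ones from the example above.

```python
from datetime import date, timedelta

def upstream_business_dates(business_date: date, offsets_days=(0, 7)):
    """For the retention example: the instance for `business_date` depends on the
    'data preprocessing' instances for today and for 7 days earlier."""
    return [business_date - timedelta(days=d) for d in offsets_days]

# The "calculate user retention rate" instance for 2023-11-20 is only triggered
# once both of these upstream instances have succeeded.
deps = [("data_preprocessing", d) for d in upstream_business_dates(date(2023, 11, 20))]
print(deps)  # [('data_preprocessing', date(2023, 11, 20)), ('data_preprocessing', date(2023, 11, 13))]
```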

Industry options

The industry already offers many scheduling systems, and we investigated the relevant open source options early on. The main ones are the following.

Airflow

Airflow was originally developed at Airbnb and later donated to Apache. It is widely used and has an active community. Users define workflows and their scheduling frequency in Python. Airflow is positioned as a general-purpose scheduling system and supports both single-node and multi-node deployment. The overall architecture is shown below.
The core scheduling logic sits in the Scheduler module. The Scheduler polls the database for tasks that need to run and hands them to Workers for execution. In multi-node mode, the Scheduler distributes tasks to multiple Workers through Celery. Note that even in multi-node mode, the Scheduler itself remains a single point of failure.
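For reference, this is roughly what defining a workflow in Airflow looks like, written against Airflow 2.x's public Python API; the DAG and task names here are made up.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal Airflow DAG: two tasks scheduled daily, with a dependency between them.
with DAG(
    dag_id="retention_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess = BashOperator(task_id="data_preprocessing",
                              bash_command="echo preprocess")
    retention = BashOperator(task_id="calculate_user_retention",
                             bash_command="echo retention")
    preprocess >> retention  # retention depends on preprocessing
```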

Azkaban/Oozie

Azkaban and Oozie are open source scheduling systems from LinkedIn and Apache respectively. They focus on scheduling Hadoop batch jobs and integrate closely with the Hadoop ecosystem, so users can easily run Spark/Hive and similar tasks. Unlike Airflow, Azkaban and Oozie define DAGs through configuration files/DSL. Their communities are also noticeably less active than Airflow's.

Other open source systems

Other open source systems include DolphinScheduler and more. With so many open source options available, why did we decide to reinvent the wheel and build FlowX?
  • We need a general-purpose scheduling system that can handle multiple node types;
  • High availability and scalability. The system will carry core pipelines such as the foundational data warehouse, so scheduling must be highly available; meanwhile, as the company's business grows, the number of scheduled tasks is expected to increase rapidly, which requires horizontal scalability;
  • Easy secondary development. The business has customized requirements for the scheduler, such as custom images, additional control nodes, and automatic retry on timeout, so the system must be modifiable at low cost;
  • Easy integration. As the centralized scheduling system, it is planned to integrate with the company's other systems; for example, task dependencies can provide data lineage for data map tools;

Introduction to scheduling capabilities

Functional

  • Supports periodic scheduling (minute-level, hour-level, day-level, and specific days of the week or month)
  • Supports dependency-triggered execution
    • dependencies between tasks
    • external HDFS/Hive partition dependencies
    • task self-dependence (depending on the instance for the previous business time)
    • dependencies between tasks of different periods, e.g. an hour-level task can depend on a day-level task
    • dependencies with a business time offset (e.g. the current instance depends on the upstream instance from n days ago, or on upstream instances over a historical time range)
  • Supports pausing and canceling running instances, with automatic retries and alerts on failure
  • Backfilling of historical data
  • A specified node in a workflow, together with all of its downstream nodes, can be rerun to fix problems caused by data quality issues
  • Control of task parallelism
  • Dependency recommendation (a sketch of the idea follows this list)
    • the system automatically extracts the required upstream tables from the user's SQL logic
    • if an upstream table is produced by a task inside the scheduling system, that task is recommended as the upstream dependency
    • if an upstream table is not produced by a task inside the system, a Sensor probe task is recommended instead
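As a rough illustration of the dependency recommendation idea (not FlowX's actual implementation), a deliberately naive sketch of extracting upstream tables from SQL could look like this; a production system would use a real SQL parser to handle CTEs, subqueries, quoting, and comments.

```python
import re

def extract_upstream_tables(sql: str) -> set[str]:
    """Very naive extraction of tables referenced after FROM/JOIN."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))

sql = """
SELECT e.user_id, count(*) AS pv
FROM dwd.user_event_daily e
JOIN dim.user_profile u ON e.user_id = u.user_id
GROUP BY e.user_id
"""
print(extract_upstream_tables(sql))  # {'dwd.user_event_daily', 'dim.user_profile'}
```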

Non-functional

  • High availability, scalability, and accurate fault recovery, with no missed or duplicated scheduling
  • Second-level scheduling latency
  • Multiple configuration methods via both UI and API

Technical implementation

Basic concepts

DAG

DAG stands for Directed Acyclic Graph. In the scheduling system, a DAG represents a set of related tasks, with dependencies between tasks represented by directed edges. As shown in the figure below, an edge from A to B means that A is a predecessor of B, i.e. task B depends on task A having run.
As shown in the figure, a valid execution sequence is A -> G -> B -> D -> C -> E -> F.
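A valid execution sequence is simply a topological order of the DAG. Here is a small illustration using Python's standard library; the edges below form a hypothetical graph for demonstration, not necessarily the exact one in the figure.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key lists the tasks it depends on.
# An edge A -> B ("B depends on A") becomes graph["B"] = {"A"}.
graph = {
    "B": {"A"},
    "C": {"B"},
    "D": {"B"},
    "E": {"D"},
    "F": {"C", "E"},
    "G": {"A"},
}
print(list(TopologicalSorter(graph).static_order()))  # prints a valid topological order of the tasks
```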

Task

Each node in a DAG represents a task, i.e. a piece of logic; users implement their business logic inside tasks.

Instance

The system generates an instance for each task according to the specified business date. The instance is the basic unit of scheduling, and dependencies between tasks are ultimately translated into dependencies between instances.
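A minimal sketch of the idea, with hypothetical names: an instance is identified by a task plus a business date, and task-level dependencies (with optional day offsets) expand into instance-level dependencies.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Instance:
    task_id: str
    business_date: date   # the business date this run is responsible for

def instance_dependencies(task_id: str, business_date: date,
                          task_deps: dict[str, list[tuple[str, int]]]) -> list[Instance]:
    """Expand task-level dependencies (upstream_task, offset_days) into
    dependencies between concrete instances."""
    return [Instance(up, business_date - timedelta(days=offset))
            for up, offset in task_deps.get(task_id, [])]

# "calculate_user_retention" depends on "data_preprocessing" at offsets of 0 and 7 days.
deps = {"calculate_user_retention": [("data_preprocessing", 0), ("data_preprocessing", 7)]}
print(instance_dependencies("calculate_user_retention", date(2023, 11, 20), deps))
```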

System architecture diagram

Module breakdown

WebService

WebService is the main entry point through which users and external systems interact with the scheduler, e.g. creating tasks via the UI/API. Its main functions are as follows:
  • Permission checks
  • Task development and operations
  • Instance operations
  • Log retrieval
  • Project management

Master

The Master is the "heart" of the system; its failover is currently handled through ZK. The Master's main functions include task dependency graph management, scheduling priority management, quota management, instance distribution, and Scheduler/Worker monitoring.
  • Task dependency graph management
    • Maintain dependencies between tasks and provide services to other modules, such as querying the upstream and downstream information of a task.
    • Generate planned/rerun instances and send INSTANCE_CREATE events to the Scheduler. The Master also periodically generates, ahead of time, the instances that will need to run in the future.
  • Scheduling priority management
    • Borrowing from YARN's fair scheduling algorithm to handle scheduling order under high load: tasks are divided into priority queues by their attributes so that they are scheduled in priority order, achieving flow control and weighted balancing (a sketch follows this list).
  • Quota management
    • Through multi-dimensional attributes, forward/reverse matching, and time-window restrictions, target tasks can be matched flexibly and their concurrency capped, e.g. guaranteeing scheduling resources for routine runs in the early morning while reserving resources for backfills and reruns during the day, or limiting the resources occupied by evaluation tasks, thereby improving overall system resource utilization.
  • Instance distribution
    • Instances that pass the dependency check and reach their planned time are distributed by the Master.
    • Depending on the task type, the Master decides whether to hand the instance to a Worker for execution or to submit it directly to K8s.
  • Module monitoring
    • Maintain the list of currently active Schedulers; newly created instances are handed to the corresponding Scheduler for scheduling checks.
    • Maintain the list of currently active Workers and distribute instances to the corresponding Workers/K8s for execution.
    • Monitor Scheduler and Worker health, and proactively redistribute instances to other nodes when a node becomes abnormal.
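The following toy sketch illustrates the combination of priority queues and per-group concurrency quotas described above; it is only an illustration of the idea, not the actual Master implementation, and all names and numbers are made up.

```python
import heapq
from collections import defaultdict

class DispatchQueue:
    """Toy model of priority-ordered dispatch with a per-group concurrency quota."""

    def __init__(self, quota_per_group: int):
        self._heap = []                      # (priority, plan_time, instance_id, group)
        self._running = defaultdict(int)     # group -> currently running count
        self._quota = quota_per_group

    def push(self, priority, plan_time, instance_id, group):
        heapq.heappush(self._heap, (priority, plan_time, instance_id, group))

    def pop_dispatchable(self):
        """Return the highest-priority instance whose group still has quota left."""
        skipped = []
        try:
            while self._heap:
                item = heapq.heappop(self._heap)
                group = item[3]
                if self._running[group] < self._quota:
                    self._running[group] += 1
                    return item
                skipped.append(item)         # over quota: keep it for later
            return None
        finally:
            for item in skipped:             # put the skipped items back
                heapq.heappush(self._heap, item)

    def mark_done(self, group):
        self._running[group] -= 1

q = DispatchQueue(quota_per_group=1)
q.push(0, "2023-11-20 00:00", "inst-a", "backfill")
q.push(0, "2023-11-20 00:05", "inst-b", "backfill")
print(q.pop_dispatchable())  # the inst-a tuple
print(q.pop_dispatchable())  # None: the 'backfill' group has hit its quota
```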

Scheduler

The Scheduler consists of three main sub-modules:
  • Dependency Checker
    • Takes the events distributed by the Master from the event queue and checks the upstream dependencies of the corresponding instances. If all dependencies are satisfied, the event is passed on to the next queue.
    • If the dependencies are not yet satisfied, the event is discarded; the instance will later be triggered by its upstream's success event, which avoids spending resources polling the upstream status.
  • Time Checker
    • Takes events (instances) that have passed the dependency check and reached their run time out of a delayed queue (DelayedQueue). Normal tasks are handed back to the Master for distribution and execution; Sensor probe tasks are passed to the Sensor Processor to check the readiness of external data (a minimal sketch follows this section).
  • Sensor Processor
    • Currently, two types of Sensor checks are implemented, HDFS path and Hive table/partition.
    • The Sensor checks whether the corresponding HDFS/Hive data is ready and, if so, triggers the downstream. If not, the Sensor does not keep polling within a single probe; instead it reuses the task auto-retry mechanism and waits a configured interval (currently 5 minutes) before checking again, until the external data is ready or the retry limit is exceeded.
The Scheduler also registers itself in ZK, and the Master uses ZK to discover which Schedulers are available. When a Scheduler restarts, the instances it was processing are reloaded from the database for recovery.
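Below is a rough Python sketch of the Time Checker / Sensor Processor interaction described above, using a priority queue in place of a real DelayQueue; the instance fields (`type`, `check_ready`, `on_ready`) are hypothetical stand-ins, not FlowX's actual data model.

```python
import time
import queue
import itertools

_seq = itertools.count()   # tie-breaker so equal run times never compare the payload dicts

def enqueue(delayed_queue: queue.PriorityQueue, run_at: float, instance: dict):
    delayed_queue.put((run_at, next(_seq), instance))

def probe_sensor(delayed_queue, instance, retry_interval_s=300, max_retries=12):
    """Sensor Processor sketch: if the external HDFS/Hive data is not ready, do not
    busy-poll; re-enqueue the probe to run again after the retry interval (the
    article mentions 5 minutes), until the data is ready or retries run out."""
    if instance["check_ready"]():          # hypothetical readiness callback
        instance["on_ready"]()             # trigger the downstream
        return
    instance["retries"] = instance.get("retries", 0) + 1
    if instance["retries"] <= max_retries:
        enqueue(delayed_queue, time.time() + retry_interval_s, instance)

def time_checker_loop(delayed_queue, dispatch):
    """Time Checker sketch: pop instances (which already passed the dependency check)
    once their run time arrives; normal tasks go back to the Master for dispatch,
    Sensor probe tasks go to the Sensor Processor."""
    while True:
        run_at, _, instance = delayed_queue.get()
        wait = run_at - time.time()
        if wait > 0:
            time.sleep(wait)               # a real DelayQueue would also wake for earlier arrivals
        if instance["type"] == "sensor":
            probe_sensor(delayed_queue, instance)
        else:
            dispatch(instance)             # hand back to the Master for execution
```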

Worker

The Worker is the module responsible for actually executing tasks. Instances that pass the dependency check are distributed by the Master to Workers, which execute them and monitor their running status. A Worker starts sub-threads to submit and monitor tasks, proactively reports status to the Master, and handles operations such as retrying on failure. The Worker also registers itself in ZK so that the Master can discover it.
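A toy sketch of the Worker's execution loop under these assumptions: thread-per-instance execution, with a `report_status` callback standing in for the real RPC back to the Master and `run_task` standing in for the real execution logic.

```python
from concurrent.futures import ThreadPoolExecutor

class Worker:
    """Toy Worker sketch: run each instance in its own thread, report status back
    to the Master, and retry on failure."""

    def __init__(self, report_status, run_task, max_workers=8, max_retries=3):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._report = report_status
        self._run = run_task
        self._max_retries = max_retries

    def submit(self, instance):
        self._pool.submit(self._execute, instance)

    def _execute(self, instance):
        self._report(instance, "RUNNING")
        for attempt in range(self._max_retries + 1):
            try:
                self._run(instance)
                self._report(instance, "SUCCESS")
                return
            except Exception as exc:
                self._report(instance, f"RETRYING ({attempt + 1}): {exc}")
        self._report(instance, "FAILED")
```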

Zookeeper

The ZK used in the system is mainly for the following purposes:
  • Leader election: the active Master is elected via ZK, giving an active/standby setup for high system availability.
  • Liveness detection: the Master uses ZK to track the list of available Schedulers and Workers.
  • Service discovery: Schedulers and Workers discover the Master's listening IP and port through ZK.
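A minimal sketch of these three uses, assuming the third-party kazoo client for ZooKeeper; the ensemble addresses, paths, and identifiers are all made up.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

# Registration: a Worker/Scheduler creates an ephemeral node so the Master
# notices it disappearing when the process dies.
zk.create("/flowx/workers/worker-1", b"10.0.0.5:8081",
          ephemeral=True, makepath=True)

# Liveness detection / discovery: the Master watches the children of the path.
def on_workers_changed(children):
    print("active workers:", children)

zk.ChildrenWatch("/flowx/workers", on_workers_changed)

# Leader election: whichever candidate wins runs as the active Master;
# the others block as standbys until the leader goes away.
def run_as_active_master():
    print("I am the active Master")

election = zk.Election("/flowx/master-election", identifier="master-candidate-1")
election.run(run_as_active_master)   # blocks while acting as leader/standby
```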

Future plans

Going forward, the scheduling system will be improved mainly along two lines: functional enhancements and ease of use. Plans include:
  • Provide more interactive methods, including CLI and configuration files
  • Improve node types (such as control nodes)
  • Connect to more systems, such as the company’s Cronjob and FaaS platform
  • Lightweight deployment

Summary

The self-developed scheduling system FlowX now has fairly complete functionality and has been made available externally through Volcano Engine DataLeap. After more than a year of refinement, its stability is well established. The system already carries many foundational data pipelines and business applications across multiple areas, genuinely tying together data generation, data transmission, data processing, and business processes. In terms of interaction, besides the Web UI it also provides a degree of API access.
 
Click through to the big data R&D and governance suite DataLeap to learn more