Overall Process Analysis of Taier, a Distributed Visual DAG Task Scheduling System

As one of Kangaroo Cloud's open source projects, Taier is a distributed, visual DAG task scheduling system. It aims to reduce the cost of ETL development and improve the stability of the big data platform, so that big data developers can develop business logic directly in Taier without worrying about intricate task dependencies or the architecture of the underlying big data platform, and can focus on the business itself.

This article analyzes Taier's overall process from three aspects: a brief description of the process, structural analysis, and extensibility points.

Brief description of the Taier process

Taier master-slave division

Taier is a standalone application, and a single process has no master-slave division. When multiple instances are running, the division is realized through ZooKeeper (ZK) using LeaderLatch: the node that grabs the lock at startup becomes the master (Master), and every node that does not becomes a slave (Worker). The result is one master and multiple slaves.

If the workers learn through ZK that the master has gone down, they compete for the lock again, and the one that grabs it becomes the new master.

In Taier, the master's main responsibilities include periodic instance generation, instance pre-distribution, disaster recovery for tasks on worker nodes, and instance submission. The slaves are mainly responsible for instance submission.
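The election and failover described above can be pictured with a small in-memory sketch. It is not Taier's code and does not talk to ZK: the `SharedLock` class and the node ids are hypothetical stand-ins for the ZK-backed lock, used only to show the "first to grab the lock is master, the rest compete again on failure" behavior.

```java
import java.util.concurrent.atomic.AtomicReference;

class LeaderElectionSketch {
    /** Hypothetical stand-in for the ZK lock: holds the current master's id, or null. */
    static final class SharedLock {
        private final AtomicReference<String> holder = new AtomicReference<>();

        /** Atomically claims the lock; only one node can succeed while it is held. */
        boolean tryAcquire(String nodeId) {
            return holder.compareAndSet(null, nodeId);
        }

        /** Frees the lock if nodeId holds it (models the master going down). */
        void release(String nodeId) {
            holder.compareAndSet(nodeId, null);
        }

        /** Returns the id of the current master, or null if there is none. */
        String master() {
            return holder.get();
        }
    }

    public static void main(String[] args) {
        SharedLock lock = new SharedLock();
        // Startup: both nodes race for the lock; the first caller wins.
        System.out.println("node-1 acquired: " + lock.tryAcquire("node-1"));
        System.out.println("node-2 acquired: " + lock.tryAcquire("node-2"));
        // The master goes down: the lock is freed and the worker competes again.
        lock.release("node-1");
        System.out.println("node-2 acquired after failover: " + lock.tryAcquire("node-2"));
        System.out.println("current master: " + lock.master());
    }
}
```

In a real deployment the lock lives in ZK, so a crashed master loses it without having to call anything like `release` itself.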

Taier periodic instance (T+1)

Periodic instance is a Taier-specific term: each run of a task at its configured scheduling time is one instance. Periodic instances are currently generated with a T+1 mechanism.

The periodic instances for all of tomorrow's tasks are pre-generated today; in other words, today's periodic instances were generated yesterday. By default, Taier generates the next day's periodic instances at 22:00. If a task is submitted to the scheduling system after 22:00, no periodic instance is generated for it in that run.

The configuration item is job.graph.build.cron=22:00:00, and the time can be adjusted as needed.
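The T+1 cutoff can be worked through with a short sketch. The class and method names are hypothetical, not Taier's. Two assumptions go beyond the text: a task submitted exactly at the build time is treated as caught by that build, and a task that misses today's build is picked up by tomorrow's build, so its first instance day is the day after tomorrow.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

class PeriodicInstanceSketch {
    /** Default build time, taken from job.graph.build.cron=22:00:00. */
    static final LocalTime DEFAULT_BUILD_TIME = LocalTime.of(22, 0, 0);

    /**
     * Returns the first day for which a task submitted at submittedAt has
     * periodic instances, given the daily build time.
     */
    static LocalDate firstInstanceDay(LocalDateTime submittedAt, LocalTime buildTime) {
        LocalDate submitDay = submittedAt.toLocalDate();
        // Assumption: a submission exactly at the build time still makes the build.
        boolean caughtTodaysBuild = !submittedAt.toLocalTime().isAfter(buildTime);
        // Caught: today's build generates tomorrow's instances (T+1).
        // Missed (assumption): tomorrow's build generates the day after tomorrow's.
        return caughtTodaysBuild ? submitDay.plusDays(1) : submitDay.plusDays(2);
    }

    public static void main(String[] args) {
        LocalDateTime early = LocalDateTime.of(2023, 8, 1, 21, 30);
        LocalDateTime late = LocalDateTime.of(2023, 8, 1, 22, 30);
        System.out.println("submitted 21:30, first instance day: "
                + firstInstanceDay(early, DEFAULT_BUILD_TIME));
        System.out.println("submitted 22:30, first instance day: "
                + firstInstanceDay(late, DEFAULT_BUILD_TIME));
    }
}
```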

Taier task submission

[Figure: flow chart of Taier task submission]

Before a task is submitted, two pre-checks are performed. The first is upstream/downstream dependency verification, which checks whether the upstream of the periodic instance has completed; if an upstream run failed, the task is not ready to be submitted. The second is resource verification: because tasks run on the cluster and occupy a lot of resources, Taier checks before submission whether the current cluster has sufficient resources, and delays the submission if it does not.
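The two pre-checks can be sketched as a single decision function. This is an illustration under stated assumptions, not Taier's implementation: the enum names, the `check` method, and the use of "slots" as the resource unit are all hypothetical.

```java
import java.util.List;

class PreSubmitCheckSketch {
    /** Hypothetical upstream states; Taier's real status set is richer. */
    enum UpstreamStatus { RUNNING, FINISHED, FAILED }

    /** Possible outcomes of the pre-submission checks. */
    enum Decision { SUBMIT, NOT_READY, DELAY_SUBMISSION }

    /**
     * Check 1: every upstream instance must have finished.
     * Check 2: the cluster must have enough free resources (modelled as slots).
     */
    static Decision check(List<UpstreamStatus> upstream, int requiredSlots, int freeSlots) {
        for (UpstreamStatus status : upstream) {
            if (status != UpstreamStatus.FINISHED) {
                return Decision.NOT_READY; // upstream failed or has not completed
            }
        }
        if (freeSlots < requiredSlots) {
            return Decision.DELAY_SUBMISSION; // resources insufficient, retry later
        }
        return Decision.SUBMIT;
    }

    public static void main(String[] args) {
        List<UpstreamStatus> done = List.of(UpstreamStatus.FINISHED, UpstreamStatus.FINISHED);
        List<UpstreamStatus> broken = List.of(UpstreamStatus.FINISHED, UpstreamStatus.FAILED);
        System.out.println(check(done, 2, 4));
        System.out.println(check(broken, 2, 4));
        System.out.println(check(done, 8, 4));
    }
}
```

Running the dependency check first means an instance whose upstream has failed never consumes a resource lookup.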

[Figure: code for Taier task submission]

Taier structure analysis

Taier's structure is divided into three main parts: the UI, the application layer, and plug-ins. Among the plug-ins, Taier worker-plugin and Taier datasource-plugin are the two most important.

[Figure: Taier structure (UI, application layer, and plug-ins)]

Taier worker-plugin

The main uses of the Taier worker-plugin include:

· Task resource judgment

· Task submission

· Obtain task status

· Task log acquisition

· Kill task

[Figure: Taier worker-plugin code]

Related classes include:

· com.dtstack.taier.common.client.ClientFactory

· com.dtstack.taier.common.client.ClientProxy

· com.dtstack.taier.common.client.ClientCache
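As a rough picture of how a worker plug-in could expose the five uses listed above, here is a minimal sketch. Everything in it is hypothetical: the `WorkerClient` interface, the in-memory implementation, and the `ClientRegistry` are illustrations only and do not reproduce the actual ClientFactory, ClientProxy, or ClientCache classes. The registry simply shows the general idea of creating one client per engine type and reusing it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class WorkerPluginSketch {
    enum JobStatus { SUBMITTED, KILLED, UNKNOWN }

    /** Hypothetical interface mirroring the five worker-plugin uses. */
    interface WorkerClient {
        boolean hasEnoughResources(int requiredSlots); // task resource judgment
        void submit(String jobId);                     // task submission
        JobStatus status(String jobId);                // obtain task status
        String log(String jobId);                      // task log acquisition
        void kill(String jobId);                       // kill task
    }

    /** In-memory stand-in for a real engine client. */
    static final class InMemoryClient implements WorkerClient {
        private final int freeSlots;
        private final Map<String, JobStatus> jobs = new ConcurrentHashMap<>();

        InMemoryClient(int freeSlots) { this.freeSlots = freeSlots; }

        public boolean hasEnoughResources(int requiredSlots) { return requiredSlots <= freeSlots; }
        public void submit(String jobId) { jobs.put(jobId, JobStatus.SUBMITTED); }
        public JobStatus status(String jobId) { return jobs.getOrDefault(jobId, JobStatus.UNKNOWN); }
        public String log(String jobId) { return "job " + jobId + " is " + status(jobId); }
        public void kill(String jobId) { jobs.replace(jobId, JobStatus.KILLED); } // no-op for unknown jobs
    }

    /** Creates each engine's client once and reuses it afterwards. */
    static final class ClientRegistry {
        private final Map<String, WorkerClient> cache = new ConcurrentHashMap<>();
        private final Function<String, WorkerClient> factory;

        ClientRegistry(Function<String, WorkerClient> factory) { this.factory = factory; }

        WorkerClient get(String engineType) { return cache.computeIfAbsent(engineType, factory); }
    }

    public static void main(String[] args) {
        ClientRegistry registry = new ClientRegistry(type -> new InMemoryClient(4));
        WorkerClient client = registry.get("flink");
        if (client.hasEnoughResources(2)) {
            client.submit("job-1");
        }
        System.out.println(client.log("job-1"));
        client.kill("job-1");
        System.out.println(client.status("job-1"));
    }
}
```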

Taier datasource-plugin

The main uses of Taier datasource-plugin include:

· Connectivity check

· Execute SQL

· Get schema information

· Get table list

· Get table metadata

· Download data

· Download logs

[Figure: Taier datasource-plugin code]
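The metadata-related uses of the datasource-plugin (connectivity check, schema information, table list, table metadata) can be pictured with a small sketch. The `DataSourceClient` interface and its in-memory implementation are hypothetical and are not the plug-in's real API; a real client would query the data source instead of a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DataSourcePluginSketch {
    /** Hypothetical interface covering the metadata-related uses. */
    interface DataSourceClient {
        boolean testConnection();                                // connectivity check
        List<String> listSchemas();                              // schema information
        List<String> listTables(String schema);                  // table list
        List<String> tableColumns(String schema, String table);  // table metadata
    }

    /** In-memory stand-in: schema -> table -> column names. */
    static final class InMemoryClient implements DataSourceClient {
        private final Map<String, Map<String, List<String>>> catalog = new TreeMap<>();

        void addTable(String schema, String table, List<String> columns) {
            catalog.computeIfAbsent(schema, s -> new TreeMap<>()).put(table, List.copyOf(columns));
        }

        public boolean testConnection() { return true; } // nothing to connect to in memory

        public List<String> listSchemas() { return new ArrayList<>(catalog.keySet()); }

        public List<String> listTables(String schema) {
            return new ArrayList<>(catalog.getOrDefault(schema, Map.of()).keySet());
        }

        public List<String> tableColumns(String schema, String table) {
            return catalog.getOrDefault(schema, Map.of()).getOrDefault(table, List.of());
        }
    }

    public static void main(String[] args) {
        InMemoryClient client = new InMemoryClient();
        client.addTable("ods", "orders", List.of("id", "amount"));
        client.addTable("ods", "users", List.of("id", "name"));
        System.out.println("connected: " + client.testConnection());
        System.out.println("schemas: " + client.listSchemas());
        System.out.println("tables in ods: " + client.listTables("ods"));
        System.out.println("columns of ods.orders: " + client.tableColumns("ods", "orders"));
    }
}
```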

Taier task submission plugin

Reference Code:

com.dtstack.taier.scheduler.jobdealer.JobSubmitDealer#submitJob

Taier extensibility points

Taier has currently open-sourced its core functions. Developers can extend the other functions themselves, including:

· Generating an instance immediately for a task

· ChunJun wizard mode extension

· DataX supports wizard mode configuration

· Data source plug-in version extension

· New calculation engine

· Hadoop multi-cluster version support

· Additional instance distribution strategies

Video course and slides

Video course:

https://www.bilibili.com/video/BV1wP411z7rf/?spm_id_from=333.999.0.0

Courseware:

https://www.dtstack.com/resources/1047

"DTStack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper": https://www.dtstack.com/resources/1001?src=szsm

To learn more about Kangaroo Cloud's big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Anyone interested in big data open source projects is also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open source technology information. Group number: 30537511. Project address: https://github.com/DTStack

Origin my.oschina.net/u/3869098/blog/10097261