Taier is one of Kangaroo Cloud's open source projects: a distributed, visual DAG task scheduling system. It aims to reduce the cost of ETL development and improve the stability of the big data platform, so that big data developers can write business logic directly in Taier without worrying about intricate task dependencies or the architecture of the underlying platform, and can focus on the business itself.
This article walks through Taier's overall workflow from three angles: a brief description of the process, an analysis of the structure, and the extensibility points.
Brief description of the Taier process
Taier master-slave division
Taier is deployed as a single application, and an individual process has no built-in master or slave role. When multiple instances are running, the master-slave division is established through ZooKeeper (ZK), based on LeaderLatch: the node that grabs the lock at startup becomes the master (Master), and the nodes that do not grab it become slaves (Workers). The result is one master and multiple slaves.
If the workers learn through ZK that the master has gone down, they compete for the lock again, and the one that grabs it becomes the new master.
In Taier, the master is mainly responsible for periodic instance generation, instance pre-distribution, disaster recovery for tasks on worker nodes, and instance submission. The slaves are mainly responsible for instance submission.
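The election and failover behaviour described above can be pictured with a small in-memory stand-in. This is only a sketch: the class `ElectionSketch` and its methods are hypothetical, it does not talk to ZooKeeper or use the LeaderLatch API, and the rule that the longest-waiting worker wins the re-election is an assumption made to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/**
 * In-memory stand-in for the ZK-backed election described above.
 * Hypothetical class: it mimics the behaviour (the first node to grab the
 * lock leads, survivors compete again on failure), not the LeaderLatch API.
 */
final class ElectionSketch {
    private final List<String> waiting = new ArrayList<>(); // workers, in arrival order
    private String master;                                  // current lock holder, or null

    /** A node starts up and tries to grab the lock. Returns true if it became master. */
    synchronized boolean join(String nodeId) {
        if (master == null) {
            master = nodeId;
            return true;
        }
        waiting.add(nodeId);
        return false;
    }

    /** A node goes down. If it was the master, the remaining workers compete again. */
    synchronized void leave(String nodeId) {
        if (nodeId.equals(master)) {
            // Assumption: the longest-waiting worker wins the re-election.
            master = waiting.isEmpty() ? null : waiting.remove(0);
        } else {
            waiting.remove(nodeId);
        }
    }

    /** The current master, if any node holds the lock. */
    synchronized Optional<String> master() {
        return Optional.ofNullable(master);
    }

    /** A snapshot of the nodes currently acting as workers. */
    synchronized List<String> workers() {
        return new ArrayList<>(waiting);
    }
}
```

For example, if `node-1`, `node-2` and `node-3` join in that order and `node-1` then leaves, `master()` holds `node-2` and `workers()` contains only `node-3`.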
Taier periodic instance (T+1)
"Periodic instance" is a Taier-specific term: each run of a task at its configured scheduling time is one instance. Periodic instances are currently generated on a T+1 basis.
The periodic instances for all of tomorrow's tasks are pre-generated today; in other words, today's periodic instances were generated yesterday. By default, Taier generates the next day's periodic instances at 22:00. If a task is submitted to the scheduling system after 22:00, it misses that day's generation run and gets no periodic instance for the next day.
The configuration item is job.graph.build.cron=22:00:00, and the time can be adjusted as needed.
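The T+1 cutoff can be made concrete with a small date calculation. The helper below is a hypothetical illustration, not Taier code. It assumes that a task submitted after the build time is simply picked up by the following night's build, and that a submission exactly at the build time counts as too late.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical helper illustrating the T+1 cutoff; not Taier code. */
final class PeriodicInstanceSketch {
    /** Default build time, matching job.graph.build.cron=22:00:00. */
    static final LocalTime DEFAULT_BUILD_TIME = LocalTime.of(22, 0, 0);

    /**
     * Returns the first day on which a task has a periodic instance,
     * given the moment it was submitted to the scheduler.
     */
    static LocalDate firstInstanceDate(LocalDateTime submittedAt, LocalTime buildTime) {
        LocalDate day = submittedAt.toLocalDate();
        // Assumption: a submission exactly at the build time is treated as too late.
        boolean beforeBuild = submittedAt.toLocalTime().isBefore(buildTime);
        // Before the build: tonight's run generates tomorrow's instance.
        // After the build: tomorrow night's run generates the instance for the day after.
        return beforeBuild ? day.plusDays(1) : day.plusDays(2);
    }
}
```

Under these assumptions, a task submitted at 21:59 on 1 August first runs on 2 August, while one submitted at 22:30 the same evening first runs on 3 August.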
Taier task submission
The specific flow chart of Taier task submission is as follows:
Before a task is submitted, two checks are performed. The first is dependency verification between upstream and downstream tasks, which determines whether the upstream of the periodic instance has completed; if an upstream run has failed, the task is not ready to be submitted. The second is resource verification: because tasks run on the cluster and consume significant resources, a check before submission determines whether the current cluster has enough resources. If it does not, submission is delayed.
The specific code for Taier task submission is as follows:
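The two checks can be sketched as a single decision function. Everything here is illustrative: the class, the enums, and the use of "slots" as the resource unit are assumptions, and the real scheduler is likely to consider more states than these three.

```java
import java.util.List;

/** Hypothetical sketch of the two pre-submission checks; names are illustrative. */
final class SubmitCheckSketch {
    enum UpstreamStatus { FINISHED, RUNNING, FAILED }
    enum Decision { SUBMIT, NOT_READY, DELAY }

    /**
     * @param upstream      status of each upstream instance
     * @param freeSlots     resources currently free on the cluster (assumed unit)
     * @param requiredSlots resources the task needs (assumed unit)
     */
    static Decision check(List<UpstreamStatus> upstream, int freeSlots, int requiredSlots) {
        // 1. Dependency check: every upstream instance must have finished successfully.
        for (UpstreamStatus status : upstream) {
            if (status != UpstreamStatus.FINISHED) {
                return Decision.NOT_READY;
            }
        }
        // 2. Resource check: delay the submission when the cluster is short on resources.
        if (freeSlots < requiredSlots) {
            return Decision.DELAY;
        }
        return Decision.SUBMIT;
    }
}
```

Running the dependency check first means a task with a failed upstream is never held in the delay queue waiting for resources it cannot use yet.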
Taier structure analysis
The structure of Taier is mainly divided into three parts: UI, application layer and plug-ins. Among the plug-ins, Taier worker-plugin and Taier datasource-plugin are the two most important plug-ins.
Taier worker-plugin
The main uses of the Taier worker-plugin include:
· Task submission
· Obtain task status
· Task log acquisition
· Kill task
The specific code is as follows:
Related classes include:
· com.dtstack.taier.common.client.ClientFactory
· com.dtstack.taier.common.client.ClientProxy
· com.dtstack.taier.common.client.ClientCache
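One way to picture the four worker-plugin duties is as a small client contract with a cache that creates one client per plugin type. The sketch below is hypothetical: the interface, method names, and caching rule are illustrative and are not taken from the Taier classes listed above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical client contract covering the four worker-plugin duties. */
interface EngineClientSketch {
    String submit(String jobDefinition); // returns an engine-side job id
    String status(String jobId);
    String log(String jobId);
    boolean kill(String jobId);
}

/** Caches one client per plugin type so that each is created only once. */
final class ClientCacheSketch {
    private final Map<String, EngineClientSketch> clients = new ConcurrentHashMap<>();
    private final Function<String, EngineClientSketch> factory;

    ClientCacheSketch(Function<String, EngineClientSketch> factory) {
        this.factory = factory;
    }

    EngineClientSketch get(String pluginType) {
        // computeIfAbsent creates the client on first use and reuses it afterwards.
        return clients.computeIfAbsent(pluginType, factory);
    }
}

/** Toy in-memory client, used only to exercise the sketch. */
final class InMemoryClientSketch implements EngineClientSketch {
    private final Map<String, String> states = new ConcurrentHashMap<>();
    private final AtomicInteger ids = new AtomicInteger();

    @Override
    public String submit(String jobDefinition) {
        String id = "job-" + ids.incrementAndGet();
        states.put(id, "RUNNING");
        return id;
    }

    @Override
    public String status(String jobId) {
        return states.getOrDefault(jobId, "NOT_FOUND");
    }

    @Override
    public String log(String jobId) {
        return "log of " + jobId;
    }

    @Override
    public boolean kill(String jobId) {
        // Only a running job can be killed; returns false otherwise.
        return states.replace(jobId, "RUNNING", "KILLED");
    }
}
```

A caller would obtain a client with `cache.get(pluginType)` and then call `submit`, `status`, `log`, or `kill` on it. Keeping the cache separate from the client contract lets a new engine be added by supplying a different factory.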
Taier datasource-plugin
The main uses of Taier datasource-plugin include:
· Connectivity check
· Execute SQL
· Get Schema information
· Get Table list
· Get table metadata
· Download data
· Download logs
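As an illustration of the first item, a connectivity check for a JDBC-accessible data source can be written with the standard `java.sql` API. This is a sketch under that assumption only: `ConnectivitySketch` is a hypothetical name, it is not the Taier datasource-plugin interface, and non-JDBC sources would need a different check.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical connectivity check built on plain JDBC; not the Taier plug-in API. */
final class ConnectivitySketch {
    /** Returns true if a connection can be opened and validated within the timeout. */
    static boolean canConnect(String jdbcUrl, String user, String password, int timeoutSeconds) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            return conn.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // Unreachable host, bad credentials, or no matching driver on the classpath.
            return false;
        }
    }
}
```

Returning a boolean keeps the check simple for a UI "test connection" button; a fuller version would probably surface the exception message so the user can see why the connection failed.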
The specific code is as follows:
Taier task submission plugin
Reference Code:
com.dtstack.taier.scheduler.jobdealer.JobSubmitDealer#submitJob
Taier extensibility points
Taier has currently open-sourced its core functionality. Developers can extend it themselves with additional features, including:
· Generating an instance immediately for a task
· ChunJun wizard mode extension
· Wizard mode configuration for DataX
· Data source plug-in version extension
· New compute engines
· Hadoop multi-cluster version support
· Additional instance distribution strategies
Video course & slides
Video lessons:
https://www.bilibili.com/video/BV1wP411z7rf/?spm_id_from=333.999.0.0
Courseware download:
https://www.dtstack.com/resources/1047
Readers interested in big data open source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open source technology news. Group number: 30537511. Project address: https://github.com/DTStack