ETL scheduling systems and common tools compared: Azkaban, Oozie, and Shuqi Cloud (数栖云)

Recently, many readers who are learning ETL tools have complained to us: we all use Kettle and we all started from the same place, so why do other people get their ETL work done so well and so quickly, without stumbling into every pitfall along the way?

In fact, open-source tools such as Kettle already cover most of the functions needed in daily work, and a direct deployment can meet the basic needs of an enterprise. In actual use, however, we find that out-of-the-box Kettle is like a smartphone that can only make calls and send text messages: without apps that add other functions, it is no different from an old feature phone.

Today we pick one of the hottest "apps", the scheduling tool, and make a simple comparison to help you quickly unlock new ways of working with open-source ETL tools.

First, why do we need a scheduling system?

Let's start with the basics.

As we all know, data computation, analysis, and processing are typically carried out by multiple task units (Hive, Spark SQL, Spark, Shell, etc.), each of which implements a specific piece of data processing logic.

These task units often have strong dependencies on one another: a downstream task can run only after its upstream task has executed successfully. For example, if the upstream task produces result A and the downstream task needs A in order to produce result B, then the downstream task must not start until the upstream task has succeeded and A is available.
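As a minimal sketch of this "upstream before downstream" rule, the dependencies can be written as a graph and sorted topologically. The task names below are hypothetical, and the example assumes Python 3.9 or later for the standard-library `graphlib` module. It only illustrates the ordering idea and is not how any of the tools discussed here are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each task maps to the set of upstream tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},  # produces result A
    "daily_report": {"join_orders_users"},                     # needs A to produce result B
}

def execution_order(deps):
    """Return the tasks in an order where every upstream task comes first."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

Any order returned this way places both extract tasks before the join, and the join before the report. A real scheduler additionally has to wait for each upstream task to actually succeed before starting the next one.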

To guarantee correct results, these tasks must be executed efficiently and in dependency order, upstream before downstream. A basic approach is to estimate how long each task takes, work out the start and end time of each task in order, and run the tasks on a timer so that the system keeps operating stably.

For a data analysis job that runs only once, or for low-frequency processing with small data volumes and simple dependencies, this kind of scheduling is enough. In enterprise scenarios, however, most jobs must run every day. When the number of tasks is large, simply calculating the start times takes a great deal of effort, and if an upstream task runs longer than estimated or fails with an exception, this approach cannot cope at all, and manpower and resources are wasted again and again. For enterprise data development, therefore, a complete and efficient workflow scheduling system plays a crucial role.
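The time-estimate approach and its weakness can be sketched in a few lines. The pipeline, the durations, and the function names are all made up for illustration: each task gets a fixed start time derived from the estimates, and the second function reports which tasks would start before their upstream task has really finished.

```python
from datetime import datetime, timedelta

# Hypothetical pipeline: (task name, estimated minutes), listed in dependency order.
estimates = [("extract", 30), ("transform", 45), ("load", 15)]

def plan_start_times(first_start, tasks):
    """Give each task a fixed start time equal to the estimated end of the previous one."""
    plan, current = {}, first_start
    for name, minutes in tasks:
        plan[name] = current
        current += timedelta(minutes=minutes)
    return plan

def tasks_started_too_early(plan, tasks, actual_minutes):
    """List tasks whose fixed start time arrives before the previous task really finished."""
    early, prev_end = [], None
    for name, _ in tasks:
        start = plan[name]
        if prev_end is not None and start < prev_end:
            early.append(name)
        # The timer fires at the planned time no matter what happened upstream.
        prev_end = start + timedelta(minutes=actual_minutes[name])
    return early

plan = plan_start_times(datetime(2024, 1, 1, 1, 0), estimates)
on_time = {"extract": 30, "transform": 45, "load": 15}
overrun = {"extract": 50, "transform": 45, "load": 15}  # extract runs 20 minutes long

print(tasks_started_too_early(plan, estimates, on_time))
print(tasks_started_too_early(plan, estimates, overrun))
```

When every task meets its estimate nothing starts early, but a single overrun in `extract` is enough for `transform` to be launched on incomplete data. A dependency-aware scheduler avoids this by triggering on upstream success rather than on the clock.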

Second, a comparison of scheduling tools

For many people who start doing ETL work, the first scheduler they come across is Crontab, the built-in Linux facility for running commands on a regular schedule. It is easy to use and stable, and it is started by default once the operating system is installed. It is easy to get started with, but it also has shortcomings: tasks become unmanageable as their number grows, and crontab lives on a single machine, with no backup and no hooks. So we will not say much more about crontab here and will instead compare more mature workflow scheduling tools: Apache Oozie, Azkaban, and Shuqi Cloud.

1, Oozie

Oozie: the elephant trainer (it schedules MapReduce). Oozie is an open-source framework built on a workflow engine. It must be deployed in a Java servlet container to run, and it is mainly used for timed scheduling and for running multiple tasks in their logical order of execution.

Oozie Download: https://oozie.apache.org

It has the following features:

unified scheduling of common Hadoop tasks: starting MR jobs, HDFS operations, shell, and Hive operations;
complex dependencies, time triggers, and event triggers are expressed in XML, which is said to improve development efficiency (not necessarily: I personally dislike XML and do not find it efficient ...);
a set of tasks is represented as a DAG and shown graphically, so the process is clear;
supports many kinds of task scheduling and can handle most Hadoop tasks;
rich EL expression support for program-defined functions and constants;
can send an email notification after a job has completed;
supports operation through the Web, REST API, and Java API, whereas Azkaban is operated through the Web;

2, Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is used to run a set of jobs and processes in a particular order within a workflow. Azkaban defines a KV (key-value) file format to establish dependencies between tasks, and provides an easy-to-use web user interface for maintaining and tracking your workflows.

Azkaban Download: https://azkaban.github.io/downloads.html

It has the following features:

compatible with any version of Hadoop;
easy-to-use web interface;
simple workflow upload;
convenient configuration of dependencies between tasks;
workflow scheduling;
modular and pluggable plugin mechanism;
authentication / authorization;
ability to kill and restart workflows;
email alerts on failure and success;
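To make the key-value job format mentioned above more concrete, here is a small sketch that reads the text of a few Azkaban-style `.job` files and builds a dependency graph from their `dependencies` property. The job names and commands are hypothetical, and the parser is deliberately simplified (one `key=value` per line, `#` comments); it is not Azkaban's own loader.

```python
import os

# Hypothetical job files: file name -> file content in key=value form.
job_files = {
    "extract.job": "type=command\ncommand=echo extract",
    "transform.job": "type=command\ncommand=echo transform\ndependencies=extract",
    "load.job": "# final step\ntype=command\ncommand=echo load\ndependencies=transform",
}

def parse_job(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def dependency_graph(files):
    """Map each job name to the set of job names listed in its dependencies property."""
    graph = {}
    for filename, text in files.items():
        name = os.path.splitext(filename)[0]
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return graph

graph = dependency_graph(job_files)
print(graph)
```

The point of the format is that each job declares only its direct upstream jobs, and the scheduler derives the full workflow from those declarations.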

3, Shuqi Cloud (数栖云)

Shuqi Cloud is built by Dtwave (数澜) on the technology of its Shuqi platform 4.0 and deployed in the cloud. It provides individuals, business owners, and independent application developers with a one-stop big data tool platform and community. The basic package is permanently free! With the Shuqi platform, individuals and businesses do not need to worry much about the complex installation of the underlying big data storage and compute engines, the tedious configuration, or the daily operation and maintenance. They can integrate multi-source data from their own systems and do business development on it, build up data assets, apply them in their own business scenarios, and easily build their own data middle platform in the cloud.

Shuqi Cloud product page: http://dtcloud.dtwave.com

Shuqi Cloud online registration and trial address: http://shuqi.dtwave.com

The scheduling features of Shuqi Cloud are as follows:

scheduling adapted to 20 kinds of data sources: MySQL, Oracle, Hive, HBase, Redis, MongoDB, ODPS, PostgreSQL, ElasticSearch, API, and the like;
modular and pluggable plugin mechanism;
supports visual workflow configuration;
supports task alerts: email, telephone, text message;
diverse scheduling types: normal scheduling, plus manually running or suspending a schedule;
supports task priority configuration;
simple scheduling-cycle configuration: just a mouse click away;
supports combining workflows with other workflows;
supports workflow test runs;
operations available from the workflow view: viewing code, viewing run logs, rerunning, marking as successful and running downstream, rerunning downstream, and so on;
quick location of task errors;
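An operation such as "rerun downstream" comes down to finding every task that directly or indirectly depends on a given task. The sketch below shows one way to compute that set with a breadth-first walk over a hypothetical workflow. It illustrates the idea only and makes no claim about how Shuqi Cloud, Azkaban, or Oozie implement it.

```python
from collections import deque

# Hypothetical workflow: each task maps to the set of its direct upstream tasks.
upstream = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "audit": {"extract"},
}

def downstream_of(task, upstream_map):
    """Return every task that depends on `task`, directly or transitively."""
    # Invert the edges so each task knows its direct downstream tasks.
    children = {name: set() for name in upstream_map}
    for name, parents in upstream_map.items():
        for parent in parents:
            children[parent].add(name)
    seen, queue = set(), deque([task])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream_of("clean", upstream)))
```

After fixing and rerunning `clean`, only `aggregate` and `report` need to run again, while `audit` is left untouched because it does not depend on `clean`.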

(Feature comparison of Oozie, Azkaban, and Shuqi Cloud)

Third, a brief summary

Apache Oozie is a heavyweight task scheduling system. It is full-featured, but deployment and configuration are more troublesome, and moving from crontab to Oozie takes some effort. Azkaban sits between Oozie and Crontab, but it is less safe than Oozie: if a failure occurs, Azkaban loses all of its workflows, whereas Oozie can continue running. Compared with these two tools, Shuqi Cloud removes the burden of complex configuration and deployment, scales easily, and gives its workflows more features that make development and operations easier.

(Advantages of Shuqi Cloud)

Of course, Shuqi Cloud is not just a full-featured workflow scheduling tool. As a one-stop big data platform it also covers the functions shown above, so whether the job is simple ETL work or the complex work of building a data middle platform, Shuqi Cloud can handle it. The Basic Edition is permanently free! Whatever the problem, you can turn to customer service to solve it, an experience 100 times better than with open-source tools. Are you sure you don't want to give it a try?

For more details, please visit: http://dtcloud.dtwave.com
or go straight to Shuqi Cloud to get started: http://shuqi.dtwave.com

Origin blog.51cto.com/14463231/2452286