Data scheduling component: coordinate execution of time series tasks based on Azkaban

1. Overview of Azkaban

1. Task timing

In data-service scenarios, a very common workflow is: log files are analyzed by a big data platform, and the result data is then delivered back to the business. This process involves many tasks, and it is hard to predict exactly when each task will finish; at the same time, we want the whole task chain to complete as early as possible so that resources are released.


The approximate order of execution is as follows:

  • The business log files are synchronized to the HDFS file system;
  • Execute the analysis and calculation process through Hadoop;
  • The result data is imported into the data warehouse;
  • Finally, the data in the data warehouse needs to be synchronized to the business database;

When execution times are roughly predictable, such a process can run without explicit task scheduling: just leave enough slack between tasks. But in big data pipelines, each task should usually start as soon as the previous one ends, to cut the total time cost and deliver results sooner. Reliability matters too: at one data-service company, a synchronization task was reported complete while the final CSV data file had not actually been generated, so nearly one million analysis records failed to be synchronized to the business database.
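The four steps above map naturally onto an Azkaban flow. A hypothetical sketch of the job files (the file names and commands below are illustrative, not from the original setup):

```properties
# sync_hdfs.job -- step 1: push business logs to HDFS
type=command
command=hdfs dfs -put /data/logs /logs/input

# analyze.job -- step 2: run the Hadoop computation, after the sync finishes
type=command
dependencies=sync_hdfs
command=hadoop jar analyze.jar

# load_warehouse.job -- step 3: import the results into the data warehouse
type=command
dependencies=analyze
command=sh load_warehouse.sh

# sync_business_db.job -- step 4: sync warehouse data back to the business database
type=command
dependencies=load_warehouse
command=sh sync_business_db.sh
```

Each job starts as soon as the jobs listed in its dependencies complete, so no slack interval has to be guessed between steps.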

2. Introduction to Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It runs a set of jobs and processes in a specific order within a workflow, uses job configuration files to establish dependencies between tasks, and provides an easy-to-use web user interface to maintain and track your workflows.

Azkaban features and advantages

  • Provide a clear and easy-to-use Web UI interface;
  • Simple job configuration, clear task and job dependencies;
  • Extensible plugin components;
  • Developed in Java, making secondary development straightforward;

Compared with Oozie, where configuring a workflow means writing a large amount of XML, the code complexity is relatively high, and secondary development is difficult, Azkaban is lightweight, and its features and usage are simpler and easier to pick up.

2. Service installation

1. Core package

Web service

azkaban-web-server-2.5.0.tar.gz

Executor service

azkaban-executor-server-2.5.0.tar.gz

SQL script

azkaban-sql-script-2.5.0.tar.gz

2. Installation path

Upload the above three installation packages and unzip them.

[root@hop01 azkaban]# pwd
/opt/azkaban
[root@hop01 azkaban]# tar -zxvf azkaban-web-server-2.5.0.tar.gz
[root@hop01 azkaban]# tar -zxvf azkaban-executor-server-2.5.0.tar.gz
[root@hop01 azkaban]# tar -zxvf azkaban-sql-script-2.5.0.tar.gz
[root@hop01 azkaban]# mv azkaban-web-2.5.0/ server
[root@hop01 azkaban]# mv azkaban-executor-2.5.0/ executor

3. MySQL import script

[root@hop01 ~]# mysql -uroot -p123456
mysql> create database azkaban_test;
mysql> use azkaban_test;
mysql> source /opt/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql

Verify with show tables; the Azkaban tables should now be listed in the azkaban_test database.

4. SSL configuration

[root@hop01 opt]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA

This generates a keystore file. When prompted, set the keystore password to match the jetty.password value configured later (123456 in this example).

Copy it to the Azkaban web server directory:

[root@hop01 opt]# mv keystore /opt/azkaban/server/

5. Web service configuration

Basic configuration

[root@hop01 conf]# pwd
/opt/azkaban/server/conf
[root@hop01 conf]# vim azkaban.properties

Core modifications: MySQL and Jetty.

default.timezone.id=Asia/Shanghai

# Azkaban MySQL server properties.
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban_test
mysql.user=root
mysql.password=123456
mysql.numconnections=100

# Azkaban Jetty server properties.
jetty.maxThreads=25
jetty.ssl.port=8443
jetty.port=8081
jetty.keystore=keystore
jetty.password=123456
jetty.keypassword=123456
jetty.truststore=keystore
jetty.trustpassword=123456

These parameters are sufficient for a local setup; adjust them for your own environment.

User configuration

[root@hop01 conf]# vim azkaban-users.xml

Add an administrator user:

<azkaban-users>
    <user username="admin" password="admin" roles="admin,metrics" />
</azkaban-users>
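Additional users and roles can be declared in the same file. A sketch, assuming the built-in permission names used by Azkaban's role model (the viewer account below is hypothetical):

```xml
<azkaban-users>
    <user username="admin" password="admin" roles="admin,metrics" />
    <!-- a read-only account, e.g. for monitoring -->
    <user username="viewer" password="viewer" roles="read" />

    <role name="admin" permissions="ADMIN" />
    <role name="metrics" permissions="METRICS" />
    <role name="read" permissions="READ" />
</azkaban-users>
```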


6. Executor service configuration

[root@hop01 conf]# pwd
/opt/azkaban/executor/conf
[root@hop01 conf]# vim azkaban.properties

Core changes: MySQL and time zone.

default.timezone.id=Asia/Shanghai

# Azkaban MySQL server properties.
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban_test
mysql.user=root
mysql.password=123456
mysql.numconnections=100

7. Start the services

Web service

[root@hop01 bin]# pwd
/opt/azkaban/server/bin
[root@hop01 bin]# ll
total 16
-rwxr-xr-x 1 root root  161 Apr 21  2014 azkaban-web-shutdown.sh
-rwxr-xr-x 1 root root 1275 Apr 21  2014 azkaban-web-start.sh

Here are the startup and shutdown scripts.

[root@hop01 bin]# /opt/azkaban/server/bin/azkaban-web-start.sh

Executor service

[root@hop01 bin]# /opt/azkaban/executor/bin/azkaban-executor-start.sh

Startup log

The key lines at the end of each service's startup log:

Azkaban Server running on ssl port 8443.
Azkaban Executor Server started on port 12321

Login interface

Note that this is based on the https protocol:

https://hop01:8443/


3. Operation case

1. Introductory case

Create a command-type job

[root@hop01 flow_01]# pwd
/opt/azkaban/testJob/flow_01
[root@hop01 flow_01]# vim simple.job

type=command
command=echo 'mySimpleJob'

Zip package

[root@hop01 flow_01]# zip -q -r simpleJob.zip simple.job
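The two steps above can also be scripted. A minimal sketch that writes the same simple.job file and packages it (the flow_01 directory name follows the example; the zip guard is only there so the sketch degrades gracefully where zip is absent):

```shell
# Generate the command-type job file used above.
mkdir -p flow_01
cat > flow_01/simple.job <<'EOF'
type=command
command=echo 'mySimpleJob'
EOF

# Azkaban expects a flat zip of .job files at the archive root.
command -v zip >/dev/null \
  && (cd flow_01 && zip -q simpleJob.zip simple.job) \
  || echo "zip not installed; create the archive manually"
```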

Create project


Upload task package


Execute the task


2. Task sequence execution

Create task A

[root@hop01 flow_02]# vim simpleA.job

type=command
command=echo 'simplejobA'

Create task B

[root@hop01 flow_02]# vim simpleB.job

type=command
dependencies=simpleA
command=echo 'simplejobB'

Package the tasks

[root@hop01 flow_02]# zip -q -r simpleTwoJob.zip simpleA.job simpleB.job


Operating in the same way as before, the two tasks are placed in one zip package and uploaded through the web UI, where the execution order of the dependency chain can be observed.
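The dependencies property accepts a comma-separated list, so fan-in flows are possible as well. A hypothetical three-job sketch where C waits for both A and B to succeed:

```properties
# simpleA.job
type=command
command=echo 'simplejobA'

# simpleB.job
type=command
command=echo 'simplejobB'

# simpleC.job -- runs only after both A and B succeed
type=command
dependencies=simpleA,simpleB
command=echo 'simplejobC'
```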

4. Source code address

GitHub address
https://github.com/cicadasmile/big-data-parent
GitEE address
https://gitee.com/cicadasmile/big-data-parent


Origin blog.51cto.com/14439672/2676904