1: Azkaban Overview
Azkaban is a batch workflow job scheduler created at LinkedIn to solve the problem of Hadoop job dependencies. We often need jobs to run in a fixed order, from ETL jobs to data analysis products.
2: Why a Workflow Scheduling System
1) A complete data analysis system usually consists of a large number of task units:
shell scripts, Java programs, MapReduce programs, Hive scripts, and so on.
2) There are ordering dependencies between these task units.
3) To organize the execution of such a complex set of programs well, a workflow scheduling system is needed to schedule their execution.
For example, we might have the following requirement: the business system generates 20 GB of raw data every day, which we need to process daily. The processing steps are as follows:
(1) first sync the raw data to HDFS with Hadoop;
(2) use the MapReduce computing framework to process the raw data, generating several Hive tables stored as partitioned tables;
(3) JOIN the data in those Hive tables to produce one large detail table in Hive;
(4) run complex statistical analysis on the detail data to produce report data;
(5) sync the resulting statistics to the business system for business use.
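In Azkaban, a daily pipeline like the five steps above would be expressed as one job file per step, chained together with the `dependencies` key. The file names, commands, and paths below are illustrative sketches, not an actual pipeline:

```properties
# upload.job -- step (1): sync the raw data to HDFS (paths are hypothetical)
type=command
command=hadoop fs -put /data/raw/20181128 /warehouse/raw

# mr.job -- step (2): MapReduce processing into partitioned Hive tables
type=command
dependencies=upload
command=hadoop jar etl.jar com.example.DailyEtl /warehouse/raw /warehouse/ods

# join.job -- step (3): JOIN several Hive tables into one detail table
type=command
dependencies=mr
command=hive -f join.sql

# report.job -- step (4): statistical analysis producing report data
type=command
dependencies=join
command=hive -f report.sql

# export.job -- step (5): sync results back to the business system
type=command
dependencies=report
command=sh export_to_business_db.sh
```

Each block above would live in its own .job file; all of them are zipped together and uploaded as one flow.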
As shown below:
3: Features
1) A very friendly visual interface for users -> web UI
2) Very convenient workflow upload -> zip archive upload
3) Dependencies between tasks can be configured
4) Permission settings -> user access control
5) Modular and pluggable
6) Tasks can be stopped and restarted at any time
7) Log records can be viewed
4: Comparison with Oozie
Compared with Oozie, Azkaban is a lightweight scheduling tool.
For enterprise applications that do not need many advanced features, Azkaban is sufficient.
1) Functionality
Both workflow schedulers can schedule workflows made of MR, Java, and script tasks.
Both can run tasks on a schedule.
2) Usage
Azkaban passes parameters directly.
Oozie passes parameters directly and also supports EL expressions.
3) Scheduling
Azkaban triggers tasks by time.
Oozie triggers tasks by time and by data availability.
4) Permissions
Azkaban has strict access control.
Oozie has no strict access control for now.
5: Azkaban Installation and Deployment
Preparation
1) Take a snapshot (of the virtual machine)
2) Upload the installation package
alt + p (open the SFTP window in SecureCRT)
3) Unpack and rename
tar -zxvf
mv
4) Import the Azkaban scripts into MySQL
source /root/hd/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql
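For Azkaban 2.5 the database usually has to be created before importing; a typical MySQL session might look like this (the database name `azkaban` is an assumption, the script path is taken from above):

```sql
-- run inside the mysql client, e.g. after: mysql -uroot -p
CREATE DATABASE azkaban;
USE azkaban;
source /root/hd/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql;
```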
Installation and deployment
1) Create the SSL (secure connection) configuration
The server needs a certificate:
keytool -keystore keystore -alias jetty -genkey -keyalg RSA
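After generating the keystore, the web server is pointed at it in conf/azkaban.properties. The passwords below are placeholders and must match what keytool was given:

```properties
# Jetty SSL settings in azkaban.properties (values are placeholders)
jetty.maxThreads=25
jetty.ssl.port=8443
jetty.keystore=keystore
jetty.password=123456
jetty.keypassword=123456
jetty.truststore=keystore
jetty.trustpassword=123456
```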
2) Time synchronization settings
Generate a time zone file:
tzselect
(choose 5 -> 9 -> 1 -> yes for Asia/Shanghai)
Copy the time zone file:
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
Cluster time synchronization:
use CRT's interactive window to send the command to all sessions at once:
sudo date -s '2018-11-28 20:41:33'
3) Modify the configuration files
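The main edits are the MySQL connection settings in azkaban.properties, for both the web server and the executor. Host, user, and password below are assumptions for this setup:

```properties
# database section of conf/azkaban.properties (values assumed)
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
mysql.user=root
mysql.password=root
mysql.numconnections=100
```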
4) Start the web server
bin/azkaban-web-start.sh
5) Start the executor
bin/azkaban-executor-start.sh
6) Access the web UI in a browser
The installation steps I wrote here are very rough; you can refer to this article for installation and deployment:
https://www.cnblogs.com/chenmingjun/p/10506488.html
Hands-on Examples
Case 1: a single Command-type job
Create a job description file.
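For example, a minimal job file (the file name and command are illustrative):

```properties
# first.job -- a single Command-type job
type=command
command=echo 'hello azkaban'
```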
Then package it into a zip file and upload it to Azkaban.
Case 2: multiple Command-type jobs
Create f.job
Create b.job
where b depends on f.
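The two job files might look like this, with b declaring its dependency on f via the `dependencies` key (the commands are illustrative):

```properties
# f.job
type=command
command=echo 'run f first'

# b.job -- runs only after f succeeds
type=command
dependencies=f
command=echo 'then run b'
```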
Then package these two job files into a zip and upload it to Azkaban.
Case 3: an HDFS task
Create a job file.
Note: when using hdfs commands, you must use the command's full Linux path.
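A sketch of such a job file, assuming Hadoop is installed under /root/hd/hadoop-2.8.4 (adjust the path to your installation):

```properties
# fs.job -- create a directory on HDFS (install path is an assumption)
type=command
command=/root/hd/hadoop-2.8.4/bin/hadoop fs -mkdir /azkaban
```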
Then package this job into a zip file and upload it to Azkaban.
Case 4: Running a MapReduce Program
Here we use one of the example programs that ships with Hadoop. Next, write the job file.
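A sketch of the job file, assuming Hadoop 2.8.4 and the bundled examples jar (the jar name and HDFS paths are assumptions):

```properties
# mrwc.job -- run the built-in wordcount example
type=command
command=/root/hd/hadoop-2.8.4/bin/hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount /wordcount/input /wordcount/output
```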
Package the word-count jar file together with the job file and upload them to Azkaban.
Case 5: a Hive Script Task
1: Create the Hive script
2: Write the job file
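A sketch of both files, with a hypothetical table name and Hive install path:

```sql
-- test.sql (hypothetical script)
use default;
select count(*) from student;
```

```properties
# hivef.job -- run the script with hive -f (Hive path is an assumption)
type=command
command=/root/hd/hive/bin/hive -f 'test.sql'
```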
Then package the job file and the SQL file into a zip and upload it to Azkaban.
Something is broken here!! I don't know why running Hive fails.
Executor error message
WebServer error message