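Azkaban Summary

Official site: http://azkaban.github.io/

Overview

Azkaban is a batch workflow job scheduler open-sourced by LinkedIn. It runs a set of jobs and processes in a particular order within a workflow.

Azkaban defines a key-value (properties) file format for declaring the dependencies between jobs, and provides an easy-to-use web UI for maintaining and tracking your workflows.

Its main features are:

1. Web user interface

2. Easy workflow upload

3. Easy setup of dependencies between jobs

4. Workflow scheduling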

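5. Authentication and authorization (permissions)

6. Ability to kill and restart workflows

7. Modular, pluggable plugin mechanism

8. Project workspaces

9. Logging and auditing of workflows and jobs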

Comparison with other schedulers

Feature	Hamake	Oozie	Azkaban	Cascading
Workflow description language	XML	XML (xPDL based)	text file with key/value pairs	Java API
Dependency mechanism	data-driven	explicit	explicit	explicit
Requires a web container	No	Yes	Yes	No
Progress tracking	console/log messages	web page	web page	Java API
Hadoop job scheduling support	no	yes	yes	yes
Run mode	command line utility	daemon	daemon	API
Pig support	yes	yes	yes	yes
Event notification	no	no	no	yes
Requires installation	no	yes	yes	no
Supported Hadoop versions	0.18+	0.20+	currently unknown	0.18+
Retry support	no	workflow node level	yes	yes
Run arbitrary commands	yes	yes	yes	yes
Amazon EMR support	yes	no	currently unknown	yes

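Why a workflow scheduling system is needed

1. A complete data analysis system usually consists of a large number of task units:

shell scripts, Java programs, MapReduce programs, Hive scripts, and so on

2. These task units depend on one another and must run in a particular order

3. Organizing such a complex execution plan well requires a workflow scheduling system to schedule and run it.

How to implement it

Simple scheduling: define the jobs directly with the Linux crontab.

Complex scheduling: build your own scheduling platform or use an existing open-source scheduler such as Oozie or Azkaban.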

Installing Azkaban

Software download: http://pan.baidu.com/s/1b4mJWq (password: jh75). If the download fails, please contact the author.

1-1) Install

[root@hadoop1 azkaban]# ls

azkaban-2.5.0 azkaban-executor-2.5.0 azkaban-web-2.5.0

[root@hadoop1 azkaban]# mv azkaban-executor-2.5.0 executor

[root@hadoop1 azkaban]# mv azkaban-web-2.5.0 webserver

1-2) Create the database

[root@hadoop1 ~]# mysql -uroot -p

mysql> create database azkaban;

Query OK, 1 row affected (0.00 sec)

mysql> use azkaban;

Database changed

mysql> source /usr/local/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql;

Query OK, 0 rows affected (0.25 sec)

*******

mysql> show tables;

+------------------------+

| Tables_in_azkaban |

+------------------------+

| active_executing_flows |

| active_sla |

| execution_flows |

| execution_jobs |

| execution_logs |

| project_events |

| project_files |

| project_flows |

| project_permissions |

| project_properties |

| project_versions |

| projects |

| properties |

| schedules |

| triggers |

+------------------------+

15 rows in set (0.00 sec)

1-3) Create the SSL configuration

Reference: http://docs.codehaus.org/display/JETTY/How+to+configure+SSL

Command: keytool -keystore keystore -alias jetty -genkey -keyalg RSA

Running this command prompts for a password for the keystore being generated and for some identity information. Remember the password you enter, because the Jetty settings below need it. Pressing Enter at an identity prompt keeps the default shown in brackets, and pressing Enter at the key password prompt reuses the keystore password. The session looks like this:

[root@hadoop1 azkaban]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA

Enter keystore password:

Re-enter new password:

What is your first and last name?

[Unknown]:

What is the name of your organizational unit?

[Unknown]:

What is the name of your organization?

[Unknown]:

What is the name of your City or Locality?

[Unknown]:

What is the name of your State or Province?

[Unknown]:

What is the two-letter country code for this unit?

[Unknown]: CN

Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN correct?

[no]: Y

Enter key password for <jetty>

(RETURN if same as keystore password):

Re-enter new password:

[root@hadoop1 azkaban]# ls

azkaban-2.5.0 executor webserver jobs keystore

The web server serves its UI over SSL, so move the keystore into the webserver directory:

[root@hadoop1 azkaban]# mv keystore webserver
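
For scripted installs, the same keystore can be generated without the interactive prompts. This is a minimal sketch rather than part of the original steps: the password changeit and the temporary output directory are placeholders, the -dname value simply mirrors the answers given above, and a JDK keytool is assumed to be on the PATH.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions): change them for a real install.
out_dir=$(mktemp -d)
ks="$out_dir/keystore"
ks_pass='changeit'

if command -v keytool >/dev/null 2>&1; then
  # -dname answers the identity prompts up front;
  # -storepass/-keypass answer the two password prompts.
  keytool -genkeypair -keystore "$ks" -alias jetty -keyalg RSA -keysize 2048 \
    -validity 365 \
    -dname "CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN" \
    -storepass "$ks_pass" -keypass "$ks_pass"
  # List the keystore to confirm that the jetty entry exists.
  keytool -list -keystore "$ks" -storepass "$ks_pass"
else
  echo "keytool not found; install a JDK first" >&2
fi
```

The resulting file is moved into the webserver directory exactly like the interactively generated one, and the same password goes into the jetty.password and jetty.keypassword settings.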

1-4) Configure the time zone

[root@hadoop1 conf]# tzselect

Please identify a location so that time zone rules can be set correctly.

Please select a continent or ocean.

1) Africa

2) Americas

3) Antarctica

4) Arctic Ocean

5) Asia

6) Atlantic Ocean

7) Australia

8) Europe

9) Indian Ocean

10) Pacific Ocean

11) none - I want to specify the time zone using the Posix TZ format.

#? 5

Please select a country.

1) Afghanistan 18) Israel 35) Palestine

2) Armenia 19) Japan 36) Philippines

3) Azerbaijan 20) Jordan 37) Qatar

4) Bahrain 21) Kazakhstan 38) Russia

5) Bangladesh 22) Korea (North) 39) Saudi Arabia

6) Bhutan 23) Korea (South) 40) Singapore

7) Brunei 24) Kuwait 41) Sri Lanka

8) Cambodia 25) Kyrgyzstan 42) Syria

9) China 26) Laos 43) Taiwan

10) Cyprus 27) Lebanon 44) Tajikistan

11) East Timor 28) Macau 45) Thailand

12) Georgia 29) Malaysia 46) Turkmenistan

13) Hong Kong 30) Mongolia 47) United Arab Emirates

14) India 31) Myanmar (Burma) 48) Uzbekistan

15) Indonesia 32) Nepal 49) Vietnam

16) Iran 33) Oman 50) Yemen

17) Iraq 34) Pakistan

#? 9

Please select one of the following time zone regions.

1) Beijing Time

2) Xinjiang Time

#? 1

The following information has been given:

China

Beijing Time

Therefore TZ='Asia/Shanghai' will be used.

Local time is now: Tue Sep 27 10:13:25 CST 2016.

Universal Time is now: Tue Sep 27 02:13:25 UTC 2016.

Is the above information OK?

1) Yes

2) No

#? yes

Please enter 1 for Yes, or 2 for No.

#? 1

You can make this change permanent for yourself by appending the line

TZ='Asia/Shanghai'; export TZ

to the file '.profile' in your home directory; then log out and log in again.

Here is that TZ value again, this time on standard output so that you

can use the /usr/bin/tzselect command in shell scripts:

Asia/Shanghai

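A rather unfriendly design: tzselect only prints the TZ value and does not change the system time zone. Copy the zone file into place to apply it: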

[root@hadoop1 conf]# cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

1-5) Edit the configuration files

A) Edit the azkaban-web-2.5.0 configuration

[root@hadoop1 conf]# cat azkaban.properties

#Azkaban Personalization Settings

azkaban.name=Test

azkaban.label=My Local Azkaban

azkaban.color=#FF3601

azkaban.default.servlet.path=/index

web.resource.dir=web/

default.timezone.id=Asia/Shanghai

#Azkaban UserManager class

user.manager.class=azkaban.user.XmlUserManager

user.manager.xml.file=conf/azkaban-users.xml

#Loader for projects

executor.global.properties=conf/global.properties

azkaban.project.dir=projects

database.type=mysql

mysql.port=3306

mysql.host=localhost

mysql.database=azkaban

mysql.user=it

mysql.password=it

mysql.numconnections=100

# Velocity dev mode

velocity.dev.mode=false

# Azkaban Jetty server properties.

jetty.maxThreads=25

jetty.ssl.port=8443

jetty.port=8081

jetty.keystore=keystore

jetty.password=123456

jetty.keypassword=123456

jetty.truststore=keystore

jetty.trustpassword=123456

# Azkaban Executor settings

executor.port=12321

# mail settings

mail.sender=

mail.host=

job.failure.email=

job.success.email=

lockdown.create.projects=false

cache.directory=cache

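Mainly configure the highlighted settings above (typically the time zone, the MySQL connection, and the Jetty settings). The Jetty passwords are the password chosen when generating the keystore above.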

B) The azkaban-web-2.5.0 users file

[root@hadoop1 conf]# vi azkaban-users.xml

<azkaban-users>

</azkaban-users>

Add the highlighted entries between the azkaban-users tags shown above.
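
The listing above shows only the root element. As a rough sketch, assuming the XmlUserManager file format that ships with Azkaban 2.5, an admin user matching the admin/admin login used in the access step might look like the following. The user name, password, and role name are assumptions, and the file is written to a temporary directory instead of the real conf directory.

```shell
#!/bin/sh
set -eu

# Write to a scratch directory so the real conf/azkaban-users.xml is untouched.
conf_dir=$(mktemp -d)

cat > "$conf_dir/azkaban-users.xml" <<'EOF'
<azkaban-users>
  <!-- assumed account: matches the admin/admin login used in the access step -->
  <user username="admin" password="admin" roles="admin" />
  <!-- the role referenced above, granted full permissions -->
  <role name="admin" permissions="ADMIN" />
</azkaban-users>
EOF

# Show the result for review before copying it into webserver/conf.
cat "$conf_dir/azkaban-users.xml"
```

Plain-text passwords in this file are only reasonable for a test setup like this one.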

C) The azkaban-executor-2.5.0 configuration

[root@hadoop1 conf]# cat azkaban.properties

#Azkaban

default.timezone.id=Asia/Shanghai

# Azkaban JobTypes Plugins

azkaban.jobtype.plugin.dir=plugins/jobtypes

#Loader for projects

executor.global.properties=conf/global.properties

azkaban.project.dir=projects

database.type=mysql

mysql.port=3306

mysql.host=localhost

mysql.database=azkaban

mysql.user=it

mysql.password=it

mysql.numconnections=100

# Azkaban Executor settings

executor.maxThreads=50

executor.port=12321

executor.flow.threads=30

Modify the highlighted settings above.

1-6) Start

A) Start the executor server

[root@hadoop1 executor]# ./bin/azkaban-executor-start.sh

Using Hadoop from /usr/local/hadoop-2.6.4

Using Hive from

./bin/..

B) Start the web server

[root@hadoop1 webserver]# ./bin/azkaban-web-start.sh

Using Hadoop from /usr/local/hadoop-2.6.4

Using Hive from

./bin/..

Start the executor first, then the web server.

C) Start in the background

[root@hadoop1 azkaban-web-2.5.0]# nohup bin/azkaban-web-start.sh 1>/tmp/azstd.out 2>/tmp/azerr.out &

D) Troubleshooting

[root@hadoop1 azkaban-web-2.5.0]# 2016/09/26 20:47:47.686 -0700 ERROR [AzkabanWebServer] [Azkaban] Starting Jetty Azkaban Executor...

Start the Jetty service first.

org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory

(Access denied for user 'it'@'localhost' (using password: YES))

mysql> GRANT ALL PRIVILEGES ON *.* TO 'it'@'localhost' IDENTIFIED BY 'it' WITH GRANT OPTION;

Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;

Query OK, 0 rows affected (0.00 sec)

E) Access

https://hadoop1:8443/

Note that the URL uses HTTPS. The username and password are both admin, as configured above.

Azkaban examples

Azkaban's built-in job types include command and java.

1-1) Create a job description file

First create the file test.job on Windows:

#command.job

type=command

command=echo "helloword"

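The file must be saved in UTF-8 BOM format; otherwise it will not be recognized.

The upload process: package test.job into a zip file, create a project in the web UI, upload the zip to it, then execute the flow.

For the other operations, it is worth exploring the UI yourself.

1-2) A multi-job workflow (flow) of type command

A) Create the folder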

[root@hadoop1 azkaban]# mkdir test

[root@hadoop1 azkaban]# touch test.text

[root@hadoop1 azkaban]# cat test

cat: test: Is a directory

[root@hadoop1 azkaban]# cat test.text

[root@hadoop1 azkaban]#

B) Write the scripts on Windows

test.sh

#!/bin/bash

echo "1234567890" > /usr/local/azkaban/test.text

command.job

#command.job

type=command

command=sh test.sh

C) Package them into a zip archive

command.zip

Upload and run it following the steps above.
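
Because the scripts are written on Windows, they may be saved with CRLF line endings, and a shell on Linux then treats the trailing carriage return as part of each command or file name. The following is a minimal sketch of stripping the carriage returns before packaging. It works in a temporary directory with copies of the two files from this step, and the zip call is skipped when the zip tool is not installed.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Simulate the two files as saved on Windows (CRLF line endings).
printf '#!/bin/bash\r\necho "1234567890" > /usr/local/azkaban/test.text\r\n' > "$workdir/test.sh"
printf '#command.job\r\ntype=command\r\ncommand=sh test.sh\r\n' > "$workdir/command.job"

# Strip every carriage return so that only LF line endings remain.
for f in "$workdir/test.sh" "$workdir/command.job"; do
  tr -d '\r' < "$f" > "$f.tmp"
  mv "$f.tmp" "$f"
done

# Package both files at the top level of the archive, if zip is available.
if command -v zip >/dev/null 2>&1; then
  (cd "$workdir" && zip -q command.zip test.sh command.job)
fi
```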

D) Check the result

[root@hadoop1 azkaban]# cat test.text

1234567890
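
The example above contains a single job. In Azkaban's properties format, a flow with several jobs is built by adding a dependencies line to the downstream job. A minimal sketch with two hypothetical jobs, foo and bar, where bar runs only after foo has succeeded:

```shell
#!/bin/sh
set -eu

flow_dir=$(mktemp -d)

# foo.job: the first step of the flow (hypothetical job name).
cat > "$flow_dir/foo.job" <<'EOF'
# foo.job
type=command
command=echo "foo"
EOF

# bar.job: depends on foo, so Azkaban schedules it after foo finishes.
cat > "$flow_dir/bar.job" <<'EOF'
# bar.job
type=command
dependencies=foo
command=echo "bar"
EOF

# Both job files go into one zip, which is uploaded as a single project.
if command -v zip >/dev/null 2>&1; then
  (cd "$flow_dir" && zip -q flow.zip foo.job bar.job)
fi
```

The value of dependencies is the job file name without the .job extension. Several upstream jobs can be listed separated by commas.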

1-3) HDFS job

A) The job file fs.job

# fs.job

type=command

command=hadoop fs -mkdir /azkabanTest

B) Package it as a zip file

Hdfs.zip

Upload and run it following the steps above.

C) Check the result

[root@hadoop1 azkaban]# hadoop fs -ls /

Found 7 items

drwxr-xr-x - root supergroup 0 2016-09-28 10:59 /azkabanTest

drwxr-xr-x - root supergroup 0 2016-09-25 04:50 /data

drwxr-xr-x - root supergroup 0 2016-09-26 00:28 /flume

drwxr-xr-x - root supergroup 0 2016-09-28 10:53 /hadoopTest

drwxr-xr-x - root supergroup 0 2016-09-26 02:56 /home

drwx-wx-wx - root supergroup 0 2016-09-24 00:39 /tmp

drwxr-xr-x - root supergroup 0 2016-09-24 20:00 /user

1-4) MapReduce job

A) Upload the input files

[root@hadoop1 hadoop]# hadoop fs -put /usr/local/hadoop-2.6.4/etc/hadoop/*.xml /wordcount

B) Write the job file mapReduce.job

# mapReduce.job

type=command

command=hadoop jar /usr/local/azkaban/hadoop-mapreduce-examples-2.6.4.jar wordcount /wordcount /wordcountOuput

C) Package it as a zip file

mapReduce.zip

Upload and run it following the steps above.

D) Check the result

[root@hadoop1 azkaban]# hadoop fs -ls /wordcountOuput

Found 2 items

-rw-r--r-- 3 root supergroup 0 2016-09-28 11:36 /wordcountOuput/_SUCCESS

-rw-r--r-- 3 root supergroup 10544 2016-09-28 11:36 /wordcountOuput/part-r-00000

[root@hadoop1 azkaban]# hadoop fs -cat /wordcountOuput/part-r-00000

"*" 18

"AS 8

"License"); 8

"alice,bob 18

"kerberos". 1

"simple" 1

'HTTP/' 1

'none' 1

'random' 1

'sasl' 1

'string' 1

'zookeeper' 2

****************
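
A MapReduce job fails when its output directory already exists, so running this flow a second time would stop at the wordcount step. One possible way to make the job rerunnable, sketched here as a variant of mapReduce.job, is to delete the old output first and run the word count as a second command. The sketch relies on the command.n property of the command job type for sequential commands, and it writes the job file to a temporary directory.

```shell
#!/bin/sh
set -eu

job_dir=$(mktemp -d)

cat > "$job_dir/mapReduce.job" <<'EOF'
# mapReduce.job (rerunnable variant)
type=command
# Remove the previous output; -f keeps this quiet when the path does not exist.
command=hadoop fs -rm -r -f /wordcountOuput
# command.1 runs after command has finished.
command.1=hadoop jar /usr/local/azkaban/hadoop-mapreduce-examples-2.6.4.jar wordcount /wordcount /wordcountOuput
EOF

cat "$job_dir/mapReduce.job"
```

The path /wordcountOuput is spelled as in the original job so that it matches the listing above.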

1-5) Azkaban and Hive

A) Run show databases

1-1) Write the job file azkaban-hive.job

# azkaban-hive.job

type=command

command=/usr/local/hive/bin/hive -e "show databases"

1-2) Zip it on Windows

azkaban-hive.zip

Upload the zip file following the steps above.

1-3) Check the result

B) A more complex Hive job

1-1) Prepare the data

[root@hadoop1 testData]# vi test.text

1,dsdefe

2,dfegf

3,edfgrgrg

4,fhthty

5,ghjyjyj

6,fhgththjy

1-2) Write the Azkaban job file

hive.job

# hive.job

type=command

command=/usr/local/hive/bin/hive -f 'test.sql'

1-3) The test.sql script

create database azkabanHive2;

use azkabanHive2;

drop table hive2;

create table hive2(id int,name string) row format delimited fields terminated by ',';

load data inpath '/azkabanTest/test.text' into table hive2;

create table hive3 as select * from hive2;

1-4) Upload the data

[root@hadoop1 testData]# hadoop fs -put test.text /azkabanTest

[root@hadoop1 testData]# hadoop fs -cat /azkabanTest/test.text

1,dsdefe

2,dfegf

3,edfgrgrg

4,fhthty

5,ghjyjyj

6,fhgththjy

1-5) Check the result

hive> show databases;

azkabanhive2

Time taken: 0.212 seconds, Fetched: 5 row(s)

hive> use azkabanhive2;

Time taken: 0.079 seconds

hive> show tables;

hive2

hive3

Time taken: 0.084 seconds, Fetched: 2 row(s)

hive> select * from hive2;

1 dsdefe

2 dfegf

3 edfgrgrg

4 fhthty

5 ghjyjyj

6 fhgththjy

Time taken: 0.413 seconds, Fetched: 6 row(s)

hive> select * from hive3;

1 dsdefe

2 dfegf

3 edfgrgrg

4 fhthty

5 ghjyjyj

6 fhgththjy

Time taken: 0.167 seconds, Fetched: 6 row(s)

Quick Study Big Data -- Azkaban (Part 16)
