Azkaban Distributed Scheduling Framework: Using Flow 1.0

I. Introduction

Azkaban schedules tasks mainly by uploading configuration files through its web interface. It has two important concepts:

  • Job: a single task that you want to schedule and execute;
  • Flow: a graph formed by multiple Jobs and the dependencies between them.

Azkaban 3.x currently supports both Flow 1.0 and Flow 2.0. This article covers the use of Flow 1.0; the next article will cover Flow 2.0.

II. Basic Task Scheduling

2.1 New Project

Create the corresponding project from the Azkaban main interface:

2.2 Task Configuration

Create a new job configuration file named Hello-Azkaban.job with the following content. The task here simply prints 'Hello Azkaban!':

# Hello-Azkaban.job
type=command
command=echo 'Hello Azkaban!'

2.3 Package and Upload

Package Hello-Azkaban.job into a zip archive:
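From the command line this is a single command (a minimal sketch, assuming the zip utility is installed and the job file is in the current directory):

zip Hello-Azkaban.zip Hello-Azkaban.job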

Upload it through the Web UI:

After a successful upload, you can see the corresponding Flow:

2.4 Execute the Task

Click Execute Flow on the page to run the task:

2.5 Execution Results

Click Details to view the task execution log:

III. Multi-Task Scheduling

3.1 Dependency Configuration

Suppose we have five tasks (Task-A through Task-E). Task-D needs to run after Task-A, Task-B, and Task-C have all finished, and Task-E needs to run after Task-D finishes. In this case, use the dependencies property to define the dependencies between them. Each task's configuration is as follows:

Task-A.job:

type=command
command=echo 'Task A'

Task-B.job:

type=command
command=echo 'Task B'

Task-C.job:

type=command
command=echo 'Task C'

Task-D.job:

type=command
command=echo 'Task D'
dependencies=Task-A,Task-B,Task-C

Task-E.job:

type=command
command=echo 'Task E'
dependencies=Task-D

3.2 Compress and Upload

Compress the files and upload the archive. Note that a Project can hold only one archive at a time; here I reuse the Project from above, and by default a later upload overwrites the earlier archive:
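The packaging step from the command line might look like this (a sketch; the archive name Tasks.zip is illustrative, and all five job files are assumed to be in the current directory):

zip Tasks.zip Task-A.job Task-B.job Task-C.job Task-D.job Task-E.job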

3.3 Dependencies

When multiple tasks have dependencies, the file name of the last task is used as the Flow name by default. The dependency graph looks like this:
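In text form (reconstructed from the dependencies properties above):

Task-A ──┐
Task-B ──┼──> Task-D ──> Task-E
Task-C ──┘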

3.4 Execution Results

As this example shows, Flow 1.0 cannot configure multiple tasks in a single job file, a problem that Flow 2.0 solves nicely.

IV. Scheduling HDFS Jobs

The steps are the same as above; here we use listing the files on HDFS as an example. It is recommended to use the full path for the command. The configuration file is as follows:

type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop fs -ls /

Execution result:

V. Scheduling MapReduce Jobs

MapReduce job configuration. This runs the bundled pi example, which estimates π using 3 map tasks with 3 samples each:

type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop jar /usr/app/hadoop-2.6.0-cdh5.15.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.2.jar pi 3 3

Execution result:

VI. Scheduling Hive Jobs

Job configuration:

type=command
command=/usr/app/hive-1.1.0-cdh5.15.2/bin/hive -f 'test.sql'

The content of test.sql is as follows; it creates an employee table and then inspects its structure:

CREATE DATABASE IF NOT EXISTS hive;
use hive;
drop table if exists emp;
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- view the structure of the emp table
desc emp;

When packaging, include both the job file and the SQL file in the same archive, as in the command below:
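For example (a sketch; the job file name Hive-Task.job is illustrative):

zip Hive-Task.zip Hive-Task.job test.sql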

The execution result is as follows:

VII. Modifying Job Configurations Online

During testing we may need to change the configuration frequently, and repackaging and re-uploading for every change would be tedious. Azkaban therefore supports modifying configurations online: click the Flow you want to change to open its detail page:

Click the Edit button on the detail page to open the edit page:

On the edit page, you can add new configuration entries or modify existing ones:

Appendix: Possible Problems

If the following exception occurs, it is most likely because the executor host does not have enough memory; Azkaban requires more than 3 GB of available memory on the executor host before it will run tasks:

Cannot request memory (Xms 0 kb, Xmx 0 kb) from system for job

If you cannot increase the executor host's memory, you can turn off the memory check by modifying the commonprivate.properties file in the plugins/jobtypes/ directory with the following setting:

memCheck.enabled=false
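Applied from the shell, the change might look like the sketch below (the install path is assumed, and start-exec.sh / shutdown-exec.sh are the stock executor-server scripts; adjust both to your installation):

cd /usr/app/azkaban-exec-server    # assumed install directory
# append the setting that disables the memory check
echo 'memCheck.enabled=false' >> plugins/jobtypes/commonprivate.properties
# restart the executor so the change takes effect
bin/shutdown-exec.sh && bin/start-exec.sh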

More articles in this big data series can be found in the GitHub open-source project 大数据入门指南 (Big Data Getting Started Guide).

Source: www.cnblogs.com/heibaiying/p/11441369.html