1. Flow 2.0 Introduction
1.1 How Flow 2.0 Came About
Azkaban currently supports both Flow 1.0 and Flow 2.0, but the official documentation recommends Flow 2.0, because Flow 1.0 will be removed in a future release. The main design idea of Flow 2.0 is to provide the flow-level definitions that 1.0 lacks: users can merge all the job and properties files belonging to a given flow into a single flow definition file, whose contents are written in YAML syntax. A flow can also be defined inside another flow, known as an embedded flow or subflow.
1.2 Basic Structure
A project zip file contains multiple flow YAML files, a project YAML file, and optionally libraries and source code. The basic structure of a flow YAML file is as follows:
- Each flow is defined in a single flow YAML file;
- The flow file is named after the flow, e.g. my-flow-name.flow;
- It contains all the nodes in the DAG;
- Each node can be either a job or a flow;
- Each node can have a name, type, config, dependsOn and nodes sections, among other properties;
- Dependencies are specified by listing the parent nodes in a node's dependsOn list;
- It contains other flow-related configuration;
- All the common properties from the flow's current properties files are migrated into the config section of each flow YAML file.
The official documentation provides a fairly complete sample configuration:
config:
  user.to.proxy: azktest
  param.hadoopOutData: /tmp/wordcounthadoopout
  param.inData: /tmp/wordcountpigin
  param.outData: /tmp/wordcountpigout

# This section defines the list of jobs
# A node can be a job or a flow
# In this example, all nodes are jobs
nodes:
  # Job definition
  # The job definition is like a YAMLified version of properties file
  # with one major difference. All custom properties are now clubbed together
  # in a config section in the definition.
  # The first line describes the name of the job
  - name: AZTest
    type: noop
    # The dependsOn section contains the list of parent nodes the current
    # node depends on
    dependsOn:
      - hadoopWC1
      - NoOpTest1
      - hive2
      - java1
      - jobCommand2

  - name: pigWordCount1
    type: pig
    # The config section contains custom arguments or parameters which are
    # required by the job
    config:
      pig.script: src/main/pig/wordCountText.pig

  - name: hadoopWC1
    type: hadoopJava
    dependsOn:
      - pigWordCount1
    config:
      classpath: ./*
      force.output.overwrite: true
      input.path: ${param.inData}
      job.class: com.linkedin.wordcount.WordCount
      main.args: ${param.inData} ${param.hadoopOutData}
      output.path: ${param.hadoopOutData}

  - name: hive1
    type: hive
    config:
      hive.script: src/main/hive/showdb.q

  - name: NoOpTest1
    type: noop

  - name: hive2
    type: hive
    dependsOn:
      - hive1
    config:
      hive.script: src/main/hive/showTables.sql

  - name: java1
    type: javaprocess
    config:
      Xms: 96M
      java.class: com.linkedin.foo.HelloJavaProcessJob

  - name: jobCommand1
    type: command
    config:
      command: echo "hello world from job_command_1"

  - name: jobCommand2
    type: command
    dependsOn:
      - jobCommand1
    config:
      command: echo "hello world from job_command_2"
2. YAML Syntax
To configure workflows with Flow 2.0, you first need to understand YAML. YAML is a concise non-markup language with strict format requirements; if your configuration is badly formatted, Azkaban will throw a parsing exception when it is uploaded.
2.1 Basic Rules
- Case sensitive;
- Hierarchy is expressed through indentation;
- The amount of indentation is not fixed; as long as elements are aligned at the same indentation, they belong to the same level;
- Comments start with #;
- Strings need no quotes by default, but both single and double quotes may be used; escape sequences for special characters are only interpreted inside double quotes;
- YAML provides several constant types: integer, floating point, string, null, date, boolean, time.
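The rules above can be illustrated with a small snippet (a generic example for illustration only, not an Azkaban configuration):

```yaml
# A comment: everything after '#' up to the end of the line is ignored
server:
  port: 8081            # integer
  timeout: 2.5          # floating point
  name: azkaban-web     # plain string, no quotes required
  debug: false          # boolean
  released: 2018-03-21  # date
  parent: null          # null value
```

Because port, timeout and the other keys are aligned at the same indentation, they all belong to the server map.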
2.2 Object Syntax
# there must be a space between the : symbol and the value
key: value
2.3 Map Syntax
# syntax 1: all key-value pairs at the same indentation belong to one map
key:
  key1: value1
  key2: value2
# syntax 2: inline
{key1: value1, key2: value2}
2.4 Array Syntax
# syntax 1: a dash followed by a space denotes one array item
- a
- b
- c
# syntax 2: inline
[a, b, c]
2.5 Single and Double Quotes
Both single and double quotes are supported; inside double quotes escape sequences such as \n are interpreted, while single quotes keep them literally:
s1: 'content\n string'
s2: "content\n string"
After parsing:
{ s1: 'content\\n string', s2: 'content\n string' }
2.6 Special Symbols
A single YAML file can contain multiple documents, separated by ---.
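For example, the following file (a generic YAML illustration, not specific to Azkaban) contains two separate documents:

```yaml
# document 1
name: first
---
# document 2
name: second
```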
2.7 Configuration References
Flow 2.0 recommends defining common parameters under config and referencing them with ${}.
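A minimal sketch of this convention, assuming a hypothetical parameter param.message defined under the flow-level config and referenced from a node:

```yaml
config:
  param.message: Hello from flow config

nodes:
  - name: jobA
    type: command
    config:
      command: echo ${param.message}
```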
3. Simple Task Scheduling
3.1 Task Configuration
Create a new flow configuration file:
nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello Azkaban Flow 2.0."
In the current version, Azkaban supports both Flow 1.0 and Flow 2.0. If you want a project to run as Flow 2.0, you also need to create a project file that states you are using Flow 2.0:
azkaban-flow-version: 2.0
3.2 Package Upload
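As with Flow 1.0, the flow file and the project file are packaged together into a zip archive and uploaded through the Web UI. For the example above, the archive layout would look like this (file names are illustrative; only the .project and .flow files are required for a command-type flow):

```
basic.zip
├── flow20.project
└── basic.flow
```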
3.3 Execution Results
Since using the Web UI was already covered in the 1.0 section, it is not repeated here. Versions 1.0 and 2.0 differ only in how they are configured; uploading and execution work exactly the same way. (Execution result screenshots omitted.)
4. Multi-task Scheduling
This is the same example as in the 1.0 section: suppose we have five tasks (jobA to jobE). Task D needs to run after tasks A, B and C have finished, and task E needs to run after task D has finished. In 1.0 we had to define five separate configuration files; in 2.0 a single configuration file completes the whole setup:
nodes:
  - name: jobE
    type: command
    config:
      command: echo "This is job E"
    # jobE depends on jobD
    dependsOn:
      - jobD

  - name: jobD
    type: command
    config:
      command: echo "This is job D"
    # jobD depends on jobA, jobB and jobC
    dependsOn:
      - jobA
      - jobB
      - jobC

  - name: jobA
    type: command
    config:
      command: echo "This is job A"

  - name: jobB
    type: command
    config:
      command: echo "This is job B"

  - name: jobC
    type: command
    config:
      command: echo "This is job C"
5. Embedded Flows
Flow 2.0 supports defining one flow inside another flow, known as an embedded flow or subflow. Below is an example of an embedded flow; its flow configuration file is as follows:
nodes:
  - name: jobC
    type: command
    config:
      command: echo "This is job C"
    dependsOn:
      - embedded_flow

  - name: embedded_flow
    type: flow
    config:
      prop: value
    nodes:
      - name: jobB
        type: command
        config:
          command: echo "This is job B"
        dependsOn:
          - jobA

      - name: jobA
        type: command
        config:
          command: echo "This is job A"
(The DAG diagram of the embedded flow and the execution result screenshots are omitted here.)
Reference Material
More articles in this big data series can be found in the GitHub open-source project: 大数据入门指南 (BigData Getting Started Guide).