Azkaban Distributed Scheduling Framework: Using Flow 2.0

1. Introduction to Flow 2.0

1.1 The Origin of Flow 2.0

Azkaban currently supports both Flow 1.0 and Flow 2.0, but the official documentation recommends Flow 2.0, since Flow 1.0 will be removed in a future release. The main design idea of Flow 2.0 is to provide flow-level definitions that 1.0 lacks: users can merge all of the job/properties files belonging to a given flow into a single flow definition file written in YAML syntax. Flow 2.0 also supports defining flows inside other flows, known as embedded flows or subflows.

1.2 Basic Structure

A project zip file contains multiple flow YAML files, one project YAML file, and optionally libraries and source code. The basic structure of a flow YAML file is as follows:

  • Each flow is defined in a single YAML file;
  • The file is named after the flow, e.g. my-flow-name.flow;
  • It contains all the nodes in the DAG;
  • Each node can be either a job or a flow;
  • Each node can have name, type, config, dependsOn, and nodes sections, among other attributes;
  • Dependencies are specified by listing the parent nodes in a node's dependsOn list;
  • It contains other flow-related configuration;
  • All common properties from the current flow's properties files are migrated into the config section of each flow YAML file.

The official documentation provides a fairly complete sample configuration, shown below:

config:
  user.to.proxy: azktest
  param.hadoopOutData: /tmp/wordcounthadoopout
  param.inData: /tmp/wordcountpigin
  param.outData: /tmp/wordcountpigout

# This section defines the list of jobs
# A node can be a job or a flow
# In this example, all nodes are jobs
nodes:
 # Job definition
 # The job definition is like a YAMLified version of properties file
 # with one major difference. All custom properties are now clubbed together
 # in a config section in the definition.
 # The first line describes the name of the job
 - name: AZTest
   type: noop
   # The dependsOn section contains the list of parent nodes the current
   # node depends on
   dependsOn:
     - hadoopWC1
     - NoOpTest1
     - hive2
     - java1
     - jobCommand2

 - name: pigWordCount1
   type: pig
   # The config section contains custom arguments or parameters which are
   # required by the job
   config:
     pig.script: src/main/pig/wordCountText.pig

 - name: hadoopWC1
   type: hadoopJava
   dependsOn:
     - pigWordCount1
   config:
     classpath: ./*
     force.output.overwrite: true
     input.path: ${param.inData}
     job.class: com.linkedin.wordcount.WordCount
     main.args: ${param.inData} ${param.hadoopOutData}
     output.path: ${param.hadoopOutData}

 - name: hive1
   type: hive
   config:
     hive.script: src/main/hive/showdb.q

 - name: NoOpTest1
   type: noop

 - name: hive2
   type: hive
   dependsOn:
     - hive1
   config:
     hive.script: src/main/hive/showTables.sql

 - name: java1
   type: javaprocess
   config:
     Xms: 96M
     java.class: com.linkedin.foo.HelloJavaProcessJob

 - name: jobCommand1
   type: command
   config:
     command: echo "hello world from job_command_1"

 - name: jobCommand2
   type: command
   dependsOn:
     - jobCommand1
   config:
     command: echo "hello world from job_command_2"

2. YAML Syntax

To configure workflows with Flow 2.0, you first need to understand YAML. YAML is a concise, non-markup language with strict formatting requirements; if your configuration is badly formatted, Azkaban will throw a parsing exception when you upload it.

2.1 Basic Rules

  1. Case sensitive;
  2. Hierarchy is expressed through indentation;
  3. The amount of indentation is not fixed; elements aligned at the same indentation level belong to the same level;
  4. Use # for comments;
  5. Strings do not require single or double quotes by default, but both may be used; double quotes interpret escape sequences such as \n as special characters, while single quotes treat them as plain text;
  6. YAML provides several scalar types, including integer, float, string, NULL, date, boolean, and time (see the sketch below).
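
As a quick illustration, here is a minimal snippet exercising these rules; the keys and values are purely illustrative:

# '#' starts a comment; keys are case sensitive
count: 12                # integer
price: 24.5              # float
title: Azkaban           # plain string, quotes optional
nothing: ~               # NULL
released: 2017-03-20     # date
enabled: true            # boolean
parent:
  child: value           # indentation expresses hierarchy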

2.2 Object Syntax

# there must be a space between the : symbol and the value
key: value

2.3 Map Syntax

# Style 1: all key-value pairs at the same indentation level belong to one map
key: 
    key1: value1
    key2: value2

# Style 2: inline
{key1: value1, key2: value2}

2.4 Array Syntax

# Style 1: a dash followed by a space marks one array item
- a
- b
- c

# Style 2: inline
[a,b,c]

2.5 Single and Double Quotes

Both single and double quotes are supported. Double quotes interpret escape sequences such as \n, while single quotes treat them as literal text:

s1: 'content\n string'
s2: "content\n string"

After parsing:
{ s1: 'content\\n string', s2: 'content\n string' }

2.6 Special Symbols

A YAML file can contain multiple documents, separated by ---.
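
For example, a single file can carry two documents like this (a minimal sketch; the keys are illustrative):

# first document
name: document-one
---
# second document
name: document-two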

2.7 Configuration References

Flow 2.0 recommends defining common parameters under the config section and referencing them with ${}.
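
Following the pattern of the official sample above, a minimal sketch might look like this (the parameter name param.logDir and the jobA node are illustrative, not from the source):

config:
  # a common parameter shared by all nodes in the flow
  param.logDir: /tmp/logs

nodes:
  - name: jobA
    type: command
    config:
      # reference the flow-level parameter with ${}
      command: echo "log dir is ${param.logDir}"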

3. Simple Task Scheduling

3.1 Task Configuration

Create a new flow configuration file:

nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello Azkaban Flow 2.0."

In the current version, Azkaban supports both Flow 1.0 and Flow 2.0. To run a project as Flow 2.0, you also need to create a new .project file stating that Flow 2.0 is being used:

azkaban-flow-version: 2.0

3.2 Packaging and Uploading

Zip the .flow file and the .project file together, then upload the archive through the Web UI, just as in version 1.0.

3.3 Execution Results

Since the Web UI usage was already covered in the Flow 1.0 article, it is not repeated here. Versions 1.0 and 2.0 differ only in how they are configured; packaging, uploading, and execution are the same.

4. Multi-Task Scheduling

As in the 1.0 example, suppose we have five tasks (jobA through jobE). Task D can run only after tasks A, B, and C have all finished, and task E can run only after task D has finished. In 1.0 we had to define five separate configuration files; in 2.0 a single configuration file is enough:

nodes:
  - name: jobE
    type: command
    config:
      command: echo "This is job E"
    # jobE depends on jobD
    dependsOn: 
      - jobD
    
  - name: jobD
    type: command
    config:
      command: echo "This is job D"
    # jobD depends on jobA、jobB、jobC
    dependsOn:
      - jobA
      - jobB
      - jobC

  - name: jobA
    type: command
    config:
      command: echo "This is job A"

  - name: jobB
    type: command
    config:
      command: echo "This is job B"

  - name: jobC
    type: command
    config:
      command: echo "This is job C"

5. Embedded Flows

Flow 2.0 supports defining one flow inside another, known as an embedded flow or subflow. Below is an example of an embedded flow, whose flow configuration file is as follows:

nodes:
  - name: jobC
    type: command
    config:
      command: echo "This is job C"
    dependsOn:
      - embedded_flow

  - name: embedded_flow
    type: flow
    config:
      prop: value
    nodes:
      - name: jobB
        type: command
        config:
          command: echo "This is job B"
        dependsOn:
          - jobA

      - name: jobA
        type: command
        config:
          command: echo "This is job A"

In the resulting DAG, jobA and jobB form the internal subgraph of embedded_flow, and jobC runs only after the entire embedded flow has completed.

References

  1. Azkaban Flow 2.0 Design
  2. Getting started with Azkaban Flow 2.0

More articles in this big data series can be found in the GitHub open-source project 大数据入门指南 (Big Data Getting Started Guide).
