3.4.3 Azkaban workflow scheduling system: overview, installation and deployment, and usage (shell scheduling, job dependencies, HDFS scheduling, MR scheduling, Hive script scheduling)

Table of Contents

Azkaban workflow scheduling system

Section 1 Overview

1.1 Workflow scheduling system

1.2 Implementation of Workflow Scheduling

1.3 Comparison between Azkaban and Oozie

Section 2 Introduction to Azkaban

Section 3 Azkaban Installation and Deployment

3.1 Installation preparations for Azkaban

3.2 solo-server mode deployment

3.3 Multiple-executor mode deployment

Section 4 Use of Azkaban

1 Shell command scheduling

2 Job dependency scheduling

3 HDFS task scheduling

4 MapReduce task scheduling

5 Hive script task scheduling

6 Timed task scheduling


 

Azkaban workflow scheduling system

Section 1 Overview

1.1 Workflow scheduling system

A complete data analysis system is usually composed of a large number of task units:

  • Shell scripts
  • Java programs
  • MapReduce programs
  • Hive scripts, etc.

These task units must run in a particular order and depend on one another. To organize such a complex execution plan, a workflow scheduling system is needed to schedule task execution.

Suppose a business system produces 20 GB of raw data every day, and that data must be processed daily. The processing steps are as follows:

  • Sync the raw data to HDFS through Hadoop;
  • Transform the raw data with the MapReduce computing framework and store the results in multiple Hive tables as partitioned tables;
  • JOIN the data of several Hive tables to produce a detailed-data Hive table;
  • Run various statistical analyses on the detailed data to produce the report results;
  • Sync the result data from the statistical analysis back to the business system, where the business applications use it.
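
The ordering constraint among these five steps can be made concrete with a small dependency graph. The following is a minimal Python sketch, not Azkaban code: the step names are hypothetical labels for the bullets above, and the standard `graphlib` module (Python 3.9+) computes an order in which every step runs after the step it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
pipeline = {
    "sync_to_hdfs": set(),
    "mapreduce_transform": {"sync_to_hdfs"},
    "hive_join": {"mapreduce_transform"},
    "statistical_analysis": {"hive_join"},
    "export_to_business_system": {"statistical_analysis"},
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


print(execution_order(pipeline))
```

A workflow scheduler performs this kind of ordering for you and adds what a plain script lacks, such as retries, logging, and a user interface.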

 

1.2 Implementation of Workflow Scheduling

Simple task scheduling

  • Use the Linux crontab directly;

Complex task scheduling

  • Build a scheduling platform in-house or use an existing open source scheduling system, such as Oozie, Azkaban, or Airflow.

 

1.3 Comparison between Azkaban and Oozie

This section compares two widely used schedulers. In general, Oozie is a heavyweight task scheduling system compared with Azkaban: it offers more features, but its configuration and use (XML) are more complicated. If you can live without certain features, the lightweight scheduler Azkaban is a good candidate.

Features

  • Both can schedule MapReduce, Pig, Java, and script jobs as workflow tasks
  • Both can run workflow tasks on a schedule

Workflow definition

  • Azkaban uses the Properties file to define the workflow
  • Oozie uses XML files to define workflow

Parameter passing

  • Azkaban supports direct parameter passing, such as ${input}
  • Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}
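
To illustrate what direct parameter passing means, here is a minimal Python sketch of `${name}` substitution. It is an illustration only and not how Azkaban implements it; the function name `substitute` and the sample path are assumptions.

```python
import re


def substitute(text, params):
    """Replace each ${name} in text with params[name]; unknown names stay as-is."""
    def repl(match):
        name = match.group(1)
        return str(params.get(name, match.group(0)))

    return re.sub(r"\$\{([^}]+)\}", repl, text)


# Hypothetical command line that refers to an input parameter.
print(substitute("hadoop fs -ls ${input}", {"input": "/data/raw"}))
```

An EL expression such as the Oozie example above goes further than this, because it calls a function instead of looking up a fixed value.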

Timed execution

  • Azkaban's scheduled tasks are based on time
  • Oozie's scheduled tasks are based on time and input data

Resource management

  • Azkaban has strict permission control, such as user read/write/execute operations on the workflow
  • Oozie currently has no strict permission control

Workflow execution

  • Azkaban has two modes of operation: solo server mode (executor server and web server deployed on the same node) and multi server mode (executor server and web server can be deployed on different nodes)
  • Oozie runs as a workflow server and supports multiple users and multiple workflows

 

Section 2 Introduction to Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is used to run a set of tasks and processes in a specific order within a workflow. Azkaban defines a key-value (properties) file format for declaring dependencies between tasks, and provides an easy-to-use web user interface for maintaining and tracking workflows.

It has the following features:

  • Web user interface
  • Easy workflow upload
  • Easy to set the relationships between tasks
  • Workflow scheduling

Architecture

MySQL server: stores metadata such as project names, project descriptions, project permissions, task status, and SLA rules.

AzkabanWebServer: provides the web service through which users manage Azkaban. Its responsibilities include project management, authentication and authorization, task scheduling, and monitoring the executors.

AzkabanExecutorServer: responsible for submitting and executing the actual workflows.

 

Section 3 Azkaban Installation and Deployment

3.1 Installation preparations for Azkaban

1 Compile
Here we compile Azkaban 3.51.0 from source. When the build completes, it produces the installation packages needed for deployment.

cd /opt/lagou/software/

wget https://github.com/azkaban/azkaban/archive/3.51.0.tar.gz

tar -zxvf 3.51.0.tar.gz -C ../servers/

cd /opt/lagou/servers/azkaban-3.51.0/

yum -y install git

yum -y install gcc-c++

./gradlew build installDist -x test

Gradle is a build automation tool based on the concepts of Apache Ant and Apache Maven. The -x test option skips the tests. (Note: downloading the dependency jars during the build may be slow or may fail.)

2 Upload the compiled installation file

Create a directory on the linux122 node

mkdir /opt/lagou/servers/azkaban

 

3.2 solo-server mode deployment

1 Single service mode installation

1 Unzip

Azkaban's solo server starts the service in single-node mode. It needs only the azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz installation package, and all data is stored in H2, Azkaban's default database.

tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ../../servers/azkaban

2 Modify the configuration file

Set the time zone in the azkaban.properties configuration file

cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT/conf

vim azkaban.properties

default.timezone.id=Asia/Shanghai

Modify the commonprivate.properties configuration file

cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT/plugins/jobtypes

vim commonprivate.properties

execute.as.user=false
memCheck.enabled=false

Azkaban requires 3 GB of free memory by default and reports an exception if less is available; memCheck.enabled=false turns this check off.

3 Start solo-server

cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT

bin/start-solo.sh

4 Browser page access

http://linux122:8081/index

Login information

User name: azkaban
Password: azkaban

 

2 Single service mode use

Requirement: use Azkaban to schedule shell scripts and run Linux shell commands.

Steps:
Develop the job file
Create a plain text file named foo.job with the following content:

type=command
command=echo 'hello world'

Compress foo.job into a zip archive
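
The job file and its zip archive can also be produced programmatically. This is a minimal sketch using Python's standard `zipfile` module; the helper name `build_job_zip` and the archive name foo.zip are assumptions chosen for illustration, and the sketch works in a temporary directory so that it leaves nothing behind.

```python
import tempfile
import zipfile
from pathlib import Path


def build_job_zip(workdir, job_name, command):
    """Write <job_name>.job as a command-type job and zip it into <job_name>.zip."""
    workdir = Path(workdir)
    job_file = workdir / f"{job_name}.job"
    job_file.write_text(f"type=command\ncommand={command}\n")
    zip_path = workdir / f"{job_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the job file at the root of the archive.
        zf.write(job_file, arcname=job_file.name)
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    archive = build_job_zip(tmp, "foo", "echo 'hello world'")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        content = zf.read("foo.job").decode()

print(names)
print(content)
```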

Upload the compressed package to Azkaban

Create project

Specify project name and description information

Upload the compressed package in Azkaban

View the workflow and execute it

Execution results page

Stop the program

bin/shutdown-solo.sh

 

3.3 Multiple-executor mode deployment

1 Install the required software

Azkaban web service installation package
azkaban-web-server-0.1.0-SNAPSHOT.tar.gz

Azkaban execution service installation package
azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz

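SQL script package
azkaban-db-0.1.0-SNAPSHOT.tar.gz

Node planning

linux121: exec-server
linux122: web-server
linux123: exec-server, MySQL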

 

2 Database preparation

On linux123, open the MySQL client:

mysql -uroot -p

Then execute the following statements:

SET GLOBAL validate_password_length=5;

SET GLOBAL validate_password_policy=0;

CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban'; 

GRANT all privileges ON azkaban.* to 'azkaban'@'%' identified by 'azkaban' WITH GRANT OPTION;

CREATE DATABASE azkaban;

use azkaban;

[root@linux123 software]# mkdir /opt/lagou/servers/azkaban

[root@linux122 software]# scp azkaban-db-0.1.0-SNAPSHOT.tar.gz linux123:/opt/lagou/servers/azkaban/

# Unzip the database script

tar -zxvf azkaban-db-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban

 

# Load the initialization SQL to create the tables

mysql> source /opt/lagou/servers/azkaban/azkaban-db-0.1.0-SNAPSHOT/create-all-sql-0.1.0-SNAPSHOT.sql;

3 Configure Azkaban-web-server

Enter linux122 node

Unzip azkaban-web-server

[root@linux122 software]# tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban

Go to the root directory of azkaban-web-server

[root@linux122 software]# cd /opt/lagou/servers/azkaban/azkaban-web-server-0.1.0-SNAPSHOT

# Generate the SSL certificate:
[root@linux122 azkaban-web-server-0.1.0-SNAPSHOT]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA

# Enter azkaban as the password and press Enter to skip the other prompts

Note: this command prompts for the keystore password and related information. Remember the password you enter (this example uses azkaban for every password).

Modify the configuration file of azkaban-web-server

cd /opt/lagou/servers/azkaban/azkaban-web-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
# Time zone. Make sure there is no trailing whitespace after the value
default.timezone.id=Asia/Shanghai

# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml

# Azkaban Jetty server properties. Enable SSL and specify the port
jetty.use.ssl=true
jetty.port=8443
jetty.maxThreads=25

# KeyStore for SSL. SSL-related settings: check the passwords and the certificate path
jetty.keystore=keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=keystore
jetty.trustpassword=azkaban

# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=linux123
mysql.database=azkaban
mysql.user=root
mysql.password=12345678
mysql.numconnections=100

# Multiple Executor: must be true in multiple-executor mode
azkaban.use.multiple.executors=true
#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1

Add the job type properties

mkdir -p plugins/jobtypes

cd plugins/jobtypes/
vim commonprivate.properties

azkaban.native.lib=false
execute.as.user=false
memCheck.enabled=false

4 Configure Azkaban-exec-server

linux123 node, upload the exec installation package to /opt/lagou/software

tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban/

Modify the configuration file of azkaban-exec-server

cd /opt/lagou/servers/azkaban/azkaban-exec-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai

# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml

# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects

# Where the Azkaban web server is located
azkaban.webserver.url=https://linux122:8443

# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=linux123
mysql.database=azkaban
mysql.user=root
mysql.password=12345678
mysql.numconnections=100

# Azkaban Executor settings
executor.maxThreads=50
executor.port=12321
executor.flow.threads=30

Distribute exec-server to linux121 node

cd /opt/lagou/servers

scp -r azkaban linux121:$PWD

 

5 Start the service

Start the exec-servers first, then start the web-server.

# linux121, 123 start exec-server
bin/start-exec.sh

# linux122 start web-server
bin/start-web.sh

Activate the exec-servers

If the web-server process exits soon after starting, check the startup log in the root directory of the installation. The executors must be activated manually:

cd /opt/lagou/servers/azkaban/azkaban-exec-server-0.1.0-SNAPSHOT

curl -G "linux121:$(<./executor.port)/executor?action=activate" && echo

curl -G "linux123:$(<./executor.port)/executor?action=activate" && echo

These activation commands must be run again after every restart.
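
Because activation has to be repeated, it can help to script it. The following is a minimal Python sketch, assuming every executor listens on port 12321 (the executor.port value in the configuration above) and is reachable over plain HTTP as in the curl commands; the helper names `activation_url` and `activate` are hypothetical. As written it only prints the URLs; the commented call sends the request on a live cluster.

```python
from urllib.request import urlopen

EXECUTOR_HOSTS = ("linux121", "linux123")
EXECUTOR_PORT = 12321  # assumed to match executor.port in azkaban.properties


def activation_url(host, port=EXECUTOR_PORT):
    """Build the same URL that the curl commands above request."""
    return f"http://{host}:{port}/executor?action=activate"


def activate(host, port=EXECUTOR_PORT, timeout=10):
    """Send the activation request and return the response body as text."""
    with urlopen(activation_url(host, port), timeout=timeout) as resp:
        return resp.read().decode()


for host in EXECUTOR_HOSTS:
    print(activation_url(host))
    # On a running cluster: print(activate(host))
```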

Visit address:
https://linux122:8443

 

Section 4 Use of Azkaban

1 shell command scheduling

Create job description file
vi command.job

type=command
command=echo 'hello'

Package job resource files into zip files

zip command.zip command.job

Create a project and upload the job compression package through azkaban's web management platform

First, create a project

Upload the zip package

Start the job

 

2 Job dependency scheduling

Create multiple job descriptions with dependencies

The first job: foo.job

type=command
command=echo 'foo'

The second job: bar.job depends on foo.job

type=command
dependencies=foo
command=echo 'bar'

Package all job resource files into a zip archive

Create a project in Azkaban's web management interface and upload the zip package
Start the workflow
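
How the dependencies property determines the run order can be sketched in Python. This is an illustration rather than Azkaban's own scheduling logic; `parse_job` and `run_order` are hypothetical helper names, and the job texts are the two files from this section.

```python
from graphlib import TopologicalSorter


def parse_job(text):
    """Parse key=value lines of a .job file into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def run_order(jobs):
    """jobs maps a job name to its file text; return names in a valid run order."""
    graph = {}
    for name, text in jobs.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    # Reject dependencies that do not match any job in the package.
    missing = {d for deps in graph.values() for d in deps} - set(graph)
    if missing:
        raise ValueError(f"unknown dependencies: {sorted(missing)}")
    return list(TopologicalSorter(graph).static_order())


jobs = {
    "foo": "type=command\ncommand=echo 'foo'\n",
    "bar": "type=command\ndependencies=foo\ncommand=echo 'bar'\n",
}
print(run_order(jobs))  # ['foo', 'bar']
```

Note that the dependencies value is the job name without the .job extension, which is why bar.job refers to foo and not foo.job.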

 

3 HDFS task scheduling

Create job description file
fs.job

type=command
command=/opt/lagou/servers/hadoop-2.9.2/bin/hadoop fs -mkdir /azkaban

Package job resource files into zip files

Create a project and upload the job compression package through azkaban's web management platform

Start the job

 

4 MAPREDUCE task scheduling

An MR task can also be run with the command job type.

Create the job description file and provide the MR program jar (this example uses the examples jar that ships with Hadoop).

mrwc.job

type=command
command=/opt/lagou/servers/hadoop-2.9.2/bin/hadoop jar hadoop-mapreduce-examples-2.9.2.jar wordcount /wordcount/input /wordcount/azout

Package all job resource files into a zip archive

Create a project in Azkaban's web management interface and upload the zip package

Start the job

If the virtual machine runs short of memory:

1. Increase the machine's memory
2. Clear the system caches to release some memory temporarily

[root@linux123 mapreduce]# echo 1 >/proc/sys/vm/drop_caches
[root@linux123 mapreduce]# echo 2 >/proc/sys/vm/drop_caches
[root@linux123 mapreduce]# echo 3 >/proc/sys/vm/drop_caches

 

5 HIVE script task scheduling

Create the job description file and the Hive script
Hive script: test.sql

use default;

drop table aztest;

create table aztest(id int,name string)
row format delimited fields terminated by ',';

Job description file: hivef.job

type=command
command=/opt/lagou/servers/hive-2.3.7/bin/hive -f 'test.sql'

Package all job resource files into a zip archive, create a project and upload the zip package, then start the job.

 

6 Timed task scheduling

Besides running workflow tasks manually, Azkaban also supports scheduled execution. To set it up:

Select the project to run, then click Schedule on the left to configure the schedule, or click Execute on the right to run the workflow immediately.

Origin blog.csdn.net/chengh1993/article/details/112390515