Table of Contents
Workflow Scheduling System: Azkaban
Section 1 Overview
1.1 Workflow scheduling system
1.2 Implementation of Workflow Scheduling
1.3 Comparison between Azkaban and Oozie
Section 2 Introduction to Azkaban
Section 3 Azkaban Installation and Deployment
3.1 Installation preparations for Azkaban
3.2 solo-server mode deployment
3.3 Multiple-executor mode deployment
Section 4 Use of Azkaban
Workflow Scheduling System: Azkaban
Section 1 Overview
1.1 Workflow scheduling system
A complete data analysis system is usually composed of a large number of task units:
- shell scripts
- Java programs
- MapReduce programs
- Hive scripts, etc.
These task units have a temporal order and dependencies between them. To organize such a complex execution plan, a workflow scheduling system is needed to schedule task execution.
Suppose a business system produces 20 GB of raw data every day, and this data must be processed daily. The processing steps are as follows:
- Sync the raw data to HDFS through Hadoop;
- Transform the raw data with the MapReduce computing framework and store the results in multiple Hive tables as partitioned tables;
- JOIN the data of multiple Hive tables to produce a detailed-data Hive table;
- Run various statistical analyses on the detailed data to obtain the report results;
- Synchronize the result data from the statistical analysis back to the business system for business use.
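The ordering constraint in the steps above can be sketched in plain shell, where each step runs only if the previous one succeeded. The step names and the run_step helper are hypothetical placeholders that only print a message; a real pipeline would call Hadoop, Hive, and the export tool instead.

```shell
#!/bin/sh
# Hypothetical placeholder: a real step would launch the actual program.
run_step() {
  echo "running: $1"
}

# Each step starts only if the previous one returned success (&&).
pipeline() {
  run_step sync_raw_data_to_hdfs &&
  run_step mapreduce_transform &&
  run_step hive_join_detail_table &&
  run_step hive_statistics &&
  run_step export_results_to_business_system
}

pipeline_output=$(pipeline)
echo "$pipeline_output"
```

A hand-written chain like this has no retries, no parallel branches, and no monitoring, which is the gap a workflow scheduler is meant to fill.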
1.2 Implementation of Workflow Scheduling
Simple task scheduling
- Use Linux crontab directly;
Complex task scheduling
- Develop a scheduling platform in-house or use an existing open-source scheduling system, such as Oozie, Azkaban, Airflow, etc.
1.3 Comparison between Azkaban and Oozie
The following compares two of the most popular schedulers. In general, Oozie is a heavyweight task scheduling system compared to Azkaban: it is full-featured, but its configuration and use (XML) are more complicated. If you can live without certain features, the lightweight scheduler Azkaban is a good candidate.
Features
- Both can schedule MapReduce, Pig, Java, and script workflow tasks
- Both can run workflow tasks on a schedule
Workflow definition
- Azkaban uses properties files to define the workflow
- Oozie uses XML files to define the workflow
Parameter passing
- Azkaban supports direct parameter passing, such as ${input}
- Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}
Timed execution
- Azkaban's scheduled tasks are based on time
- Oozie's scheduled tasks are based on time and input data
Resource management
- Azkaban has strict permission control, such as user read/write/execute permissions on a workflow
- Oozie currently has no strict permission control
Workflow execution
- Azkaban has two modes of operation: solo server mode (executor server and web server are deployed on the same node) and multi server mode (executor server and web server can be deployed on different nodes)
- Oozie runs as a workflow server and supports multiple users and multiple workflows
Section 2 Introduction to Azkaban
Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It runs a set of tasks and processes in a specific order within a workflow. Azkaban defines a KV file (properties) format for job configuration files to build dependencies between tasks, and provides an easy-to-use web user interface to maintain and track your workflows.
It has the following features:
- Web user interface
- Easy upload of workflows
- Easy to set the relationships between tasks
- Workflow scheduling
Architecture
MySQL server: stores metadata, such as project name, project description, project permissions, task status, SLA rules, etc.
AzkabanWebServer: provides the external web service so that users can manage Azkaban through web pages. Its responsibilities include project management, authorization, task scheduling, and monitoring executors.
AzkabanExecutorServer: responsible for submitting and executing specific workflows.
Section 3 Azkaban Installation and Deployment
3.1 Installation preparations for Azkaban
1 Compile
This tutorial compiles Azkaban 3.51.0 from source; when the compilation completes, we get the installation packages needed for deployment.
cd /opt/lagou/software/
wget https://github.com/azkaban/azkaban/archive/3.51.0.tar.gz
tar -zxvf 3.51.0.tar.gz -C ../servers/
cd /opt/lagou/servers/azkaban-3.51.0/
yum -y install git
yum -y install gcc-c++
./gradlew build installDist -x test
Gradle is an automated project build tool based on the concepts of Apache Ant and Apache Maven. The -x test option skips the tests. (Note that downloading jars online during the build may be slow or fail.)
2 Upload the compiled installation files
Create a directory on the linux122 node
mkdir /opt/lagou/servers/azkaban
3.2 solo-server mode deployment
1 Single service mode installation
1 Unzip
Azkaban's solo server starts the service in single-node mode. It only needs the azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz installation package, and all data is stored in H2, Azkaban's default database.
tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ../../servers/azkaban
2 Modify the configuration file
Modify the time zone configuration file
cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties
default.timezone.id=Asia/Shanghai
Modify the commonprivate.properties configuration file
cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT/plugins/jobtypes
vim commonprivate.properties
execute.as.user=false
memCheck.enabled=false
By default, Azkaban requires 3 GB of free memory; if the remaining memory is insufficient, an exception is reported. Setting memCheck.enabled=false turns this check off.
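The manual vim edits above can also be scripted. The following is a minimal sketch, assuming a hypothetical helper named set_prop that replaces an existing key=value line or appends a new one. It is demonstrated on a temporary file rather than on the real commonprivate.properties. The key is used as a plain sed pattern, so it is only suited to simple keys and values such as the ones in this section.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE (hypothetical helper, not part of Azkaban):
# replace the line starting with KEY= if present, otherwise append KEY=VALUE.
set_prop() {
  file=$1
  key=$2
  value=$3
  if grep -q "^${key}=" "$file" 2>/dev/null; then
    tmp=$(mktemp)
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Demonstration on a temporary file standing in for commonprivate.properties.
demo=$(mktemp)
printf 'execute.as.user=true\n' > "$demo"
set_prop "$demo" execute.as.user false
set_prop "$demo" memCheck.enabled false
cat "$demo"
```

Running the helper twice with the same arguments leaves the file unchanged, which makes it safe to re-run during repeated deployments.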
3 Start solo-server
cd /opt/lagou/servers/azkaban/azkaban-solo-server-0.1.0-SNAPSHOT
bin/start-solo.sh
4 Browser page access
Login information:
User name: azkaban
Password: azkaban
2 Single service mode use
Requirement: use Azkaban to schedule a shell script that executes Linux shell commands
Specific steps:
Develop the job file
Create a plain text file foo.job with the following content
type=command
command=echo 'hello world'
Compress foo.job into a zip package
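The two steps above (writing foo.job and zipping it) can be done from the shell. This is a sketch under a few assumptions: it works in a temporary directory, the archive name foo.zip is arbitrary, and the zip command is only invoked if it is installed.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Write the job file with the same content as shown above.
cat > "$workdir/foo.job" <<'EOF'
type=command
command=echo 'hello world'
EOF

# Package the job file; the subshell keeps foo.job at the top level of the archive.
if command -v zip >/dev/null 2>&1; then
  (cd "$workdir" && zip -q foo.zip foo.job)
else
  echo "zip not found; install it before packaging" >&2
fi

echo "job files are in $workdir"
```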
Upload the compressed package to Azkaban
Create a project
Specify the project name and description
Upload the compressed package
View the workflow and execute it
View the execution results page
Stop the solo server
bin/shutdown-solo.sh
3.3 Multiple-executor mode deployment
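1 Install the required software
Azkaban web service installation package
azkaban-web-server-0.1.0-SNAPSHOT.tar.gz
Azkaban execution service installation package
azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz
SQL script
azkaban-db-0.1.0-SNAPSHOT.tar.gz
Node planning
linux121: azkaban-exec-server
linux122: azkaban-web-server
linux123: MySQL, azkaban-exec-server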
2 Database preparation
On the linux123 node, enter the mysql client
mysql -uroot -p
Execute the following statements:
SET GLOBAL validate_password_length=5;
SET GLOBAL validate_password_policy=0;
CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban';
GRANT all privileges ON azkaban.* to 'azkaban'@'%' identified by 'azkaban' WITH GRANT OPTION;
CREATE DATABASE azkaban;
use azkaban;
[root@linux123 software]# mkdir /opt/lagou/servers/azkaban
[root@linux122 software]# scp azkaban-db-0.1.0-SNAPSHOT.tar.gz linux123:/opt/lagou/servers/azkaban/
# Unzip the database script
tar -zxvf azkaban-db-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban
# Load the initialization SQL to create the tables
mysql> source /opt/lagou/servers/azkaban/azkaban-db-0.1.0-SNAPSHOT/create-all-sql-0.1.0-SNAPSHOT.sql;
3 Configure Azkaban-web-server
Go to the linux122 node
Unzip azkaban-web-server
[root@linux122 software]# tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban
Go to the root directory of azkaban-web-server
[root@linux122 software]# cd /opt/lagou/servers/azkaban/azkaban-web-server-0.1.0-SNAPSHOT
# Generate the SSL certificate:
[root@linux122 azkaban-web-server-0.1.0-SNAPSHOT]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA
# Enter azkaban as the password and press Enter to skip the other prompts
Note: After running this command, you are prompted for the keystore password and related information. Remember the password you enter (this tutorial uses azkaban for every password).
Modify the configuration file of azkaban-web-server
cd /opt/lagou/servers/azkaban/azkaban-web-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties
# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
# Time zone; make sure there is no trailing whitespace after the value
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Azkaban Jetty server properties. Enable SSL and specify the port
jetty.use.ssl=true
jetty.port=8443
jetty.maxThreads=25
# KeyStore for SSL; check the passwords and the keystore path
jetty.keystore=keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=keystore
jetty.trustpassword=azkaban
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=linux123
mysql.database=azkaban
mysql.user=root
mysql.password=12345678
mysql.numconnections=100
# Multiple Executor settings (enabled here)
azkaban.use.multiple.executors=true
#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1
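Because the configuration warns against trailing whitespace after the time zone value, a quick check before starting the server can help. This is a minimal sketch, assuming a hypothetical helper named check_trailing_ws, demonstrated on a temporary sample file; point it at conf/azkaban.properties in a real deployment.

```shell
#!/bin/sh
# check_trailing_ws FILE (hypothetical helper): print every line that ends in
# whitespace, prefixed with its line number; return 0 only if the file is clean.
check_trailing_ws() {
  if grep -n '[[:space:]]$' "$1"; then
    return 1
  fi
  return 0
}

# Sample file with one deliberately bad line (note the space after Shanghai).
sample=$(mktemp)
printf 'default.timezone.id=Asia/Shanghai \njetty.port=8443\n' > "$sample"

if check_trailing_ws "$sample"; then
  ws_status=clean
else
  ws_status=dirty
fi
echo "status: $ws_status"
```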
Add the jobtype properties
cd /opt/lagou/servers/azkaban/azkaban-web-server-0.1.0-SNAPSHOT
mkdir -p plugins/jobtypes
cd plugins/jobtypes/
vim commonprivate.properties
azkaban.native.lib=false
execute.as.user=false
memCheck.enabled=false
4 Configure Azkaban-exec-server
On the linux123 node, upload the exec installation package to /opt/lagou/software and unzip it
tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C /opt/lagou/servers/azkaban/
Modify the configuration file of azkaban-exec-server
cd /opt/lagou/servers/azkaban/azkaban-exec-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties
# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Where the Azkaban web server is located
azkaban.webserver.url=https://linux122:8443
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=linux123
mysql.database=azkaban
mysql.user=root
mysql.password=12345678
mysql.numconnections=100
# Azkaban Executor settings
executor.maxThreads=50
executor.port=12321
executor.flow.threads=30
Distribute exec-server to the linux121 node
cd /opt/lagou/servers
scp -r azkaban linux121:$PWD
5 Start the service
Start exec-server first, then start web-server
# On linux121 and linux123, start exec-server (from the azkaban-exec-server-0.1.0-SNAPSHOT directory)
bin/start-exec.sh
# On linux122, start web-server (from the azkaban-web-server-0.1.0-SNAPSHOT directory)
bin/start-web.sh
Activate exec-server
If the web server process exits shortly after starting, check the startup log in the root directory of the installation; this often happens because no executor is active yet.
The executors need to be activated manually:
cd /opt/lagou/servers/azkaban/azkaban-exec-server-0.1.0-SNAPSHOT
curl -G "linux121:$(<./executor.port)/executor?action=activate" && echo
curl -G "linux123:$(<./executor.port)/executor?action=activate" && echo
These activation commands must be run again after every restart.
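Since activation has to be repeated after every restart, it can be wrapped in a small loop. This is a sketch under stated assumptions: both executors listen on the executor.port configured earlier (12321), the host list matches the node plan of this tutorial, and the activate_url helper and the DRY_RUN switch are hypothetical conveniences. By default the script only prints the URLs it would call.

```shell
#!/bin/sh
# Assumption: every executor uses the executor.port value from azkaban.properties.
EXECUTOR_PORT=12321
EXECUTOR_HOSTS="linux121 linux123"

# Build the activation URL for one executor (hypothetical helper).
activate_url() {
  printf 'http://%s:%s/executor?action=activate' "$1" "$2"
}

# DRY_RUN=1 (the default) only prints the URLs; set DRY_RUN=0 to send the requests.
for host in $EXECUTOR_HOSTS; do
  url=$(activate_url "$host" "$EXECUTOR_PORT")
  if [ "${DRY_RUN:-1}" = "0" ]; then
    curl -sG "$url" && echo
  else
    echo "would call: $url"
  fi
done
```

To send the requests for real, run the script with DRY_RUN=0 from a machine that can reach both executor nodes.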
Visit address:
https://linux122:8443
Section 4 Use of Azkaban
1 shell command scheduling
Create the job description file
vi command.job
type=command
command=echo 'hello'
Package the job resource files into a zip file
zip command.zip command.job
Create a project and upload the job zip package through Azkaban's web management platform
First, create the project
Upload the zip package
Run the job
2 Job dependency scheduling
Create multiple job descriptions with dependencies
The first job: foo.job
type=command
command=echo 'foo'
The second job: bar.job, which depends on foo.job
type=command
dependencies=foo
command=echo 'bar'
Package all job resource files into a zip file
Create a project in Azkaban's web management interface and upload the zip package
Start the workflow
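Before uploading, it can be useful to confirm that every name listed under dependencies matches an existing .job file, since a typo there breaks the flow. The sketch below recreates foo.job and bar.job in a temporary directory and runs that check. The archive name flow.zip and the missing flag are illustrative choices, and zip is only called if it is installed.

```shell
#!/bin/sh
workdir=$(mktemp -d)

cat > "$workdir/foo.job" <<'EOF'
type=command
command=echo 'foo'
EOF

cat > "$workdir/bar.job" <<'EOF'
type=command
dependencies=foo
command=echo 'bar'
EOF

# For each job, split the comma-separated dependencies value and check
# that a matching <name>.job file exists next to it.
missing=0
for job in "$workdir"/*.job; do
  deps=$(sed -n 's/^dependencies=//p' "$job" | tr ',' ' ')
  for dep in $deps; do
    if [ ! -f "$workdir/$dep.job" ]; then
      echo "$(basename "$job"): missing dependency $dep" >&2
      missing=1
    fi
  done
done

# Package only if every dependency resolved.
if [ "$missing" -eq 0 ] && command -v zip >/dev/null 2>&1; then
  (cd "$workdir" && zip -q flow.zip foo.job bar.job)
fi
echo "missing dependencies: $missing"
```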
3 HDFS task scheduling
Create job description file
fs.job
type=command
command=/opt/lagou/servers/hadoop-2.9.2/bin/hadoop fs -mkdir /azkaban
Package the job resource files into a zip file
Create a project and upload the job zip package through Azkaban's web management platform
Run the job
4 MapReduce task scheduling
MapReduce tasks can also be executed with the command job type
Create the job description file and the MapReduce program jar (this example uses the examples jar that ships with Hadoop)
mrwc.job
type=command
command=/opt/lagou/servers/hadoop-2.9.2/bin/hadoop jar hadoop-mapreduce-examples-2.9.2.jar wordcount /wordcount/input /wordcount/azout
Package all job resource files into a zip file
Create a project in Azkaban's web management interface and upload the zip package
Start the job
If the virtual machine runs out of memory:
1. Increase the machine's memory
2. Clear the system cache to temporarily free some memory
[root@linux123 mapreduce]# echo 1 >/proc/sys/vm/drop_caches
[root@linux123 mapreduce]# echo 2 >/proc/sys/vm/drop_caches
[root@linux123 mapreduce]# echo 3 >/proc/sys/vm/drop_caches
5 HIVE script task scheduling
Create the job description file and the Hive script
Hive script: test.sql
use default;
drop table aztest;
create table aztest(id int,name string)
row format delimited fields terminated by ',';
Job description file: hivef.job
type=command
command=/opt/lagou/servers/hive-2.3.7/bin/hive -f 'test.sql'
Package all job resource files into a zip file, create a project, upload the zip package, and start the job
6 Timed task scheduling
In addition to manual execution of workflow tasks, Azkaban also supports scheduled execution. To enable it:
Select the project to run, then choose Schedule on the left to configure the scheduling information, or choose Execute on the right to run the workflow immediately.