Summary: I have been using Oozie recently. After an awkward start it has become more and more interesting, so I am organizing what I know into an Oozie series. There is relatively little material on Oozie available, and I hope that writing this up gives me my own understanding of Oozie and a more complete grasp of it.
I. Common scheduling frameworks
1.1. crontab timer
crontab is the timer that comes with Linux. It has no web interface, which makes it inconvenient for monitoring and scheduling tasks. When the workload is relatively small, the Linux crontab command is the recommended way to run jobs on a timer.
## crontab command
* * * * * command of the job to schedule
minute hour day month weekday
## Simple example (runs every day at 00:11)
11 0 * * * /home/hduser/lubians/intelligentDevice/intelligentDevice.sh
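To make the five-field layout concrete, here is a minimal Python sketch that checks whether a given moment matches a crontab schedule. It is an illustration only, not how cron itself is implemented: the helper names `field_matches` and `cron_matches` are made up for this example, and only `*`, single numbers, and comma-separated numbers are supported (no ranges, no steps, and none of cron's special handling when both the day and weekday fields are restricted).

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Return True if one cron field accepts the given value.

    Supports only '*', plain integers, and comma-separated integers,
    which is enough for the example in this article.
    """
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a five-field schedule: minute hour day month weekday."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cron_weekday = when.isoweekday() % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )


schedule = "11 0 * * *"  # the schedule from the example above
print(cron_matches(schedule, datetime(2020, 1, 1, 0, 11)))  # 00:11
print(cron_matches(schedule, datetime(2020, 1, 1, 11, 0)))  # 11:00
```

The first field is the minute and the second is the hour, so `11 0` means 00:11 and not 11:00, which is an easy mistake to make when reading the line quickly.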
1.2. Azkaban scheduling
An open source project configured with key/value files. It is easy to operate and comes with a web interface.
1.3. Oozie scheduling
An Apache project configured with XML files. It is somewhat harder to operate, comes with a web interface for viewing jobs, and is commonly used to schedule Hadoop-related tasks.
II. Background
In the second half of the year the company upgraded its technology infrastructure. Management of the whole big data cluster became more process-driven and larger in scale, and more technology components were introduced, Oozie among them.
2.1. Scheduling tools used previously
Previously the company mainly used TASKCTL and Kettle for scheduling. TASKCTL is divided into three layers: Manage, Server, and Agent.
It can be thought of as hierarchical scheduling.
TASKCTL provides the core scheduling functions: serial, parallel, dependent, and mutually exclusive execution, program execution, timing, fault tolerance, loops, conditional branching, remote execution, load balancing, and custom conditions.
By function, the TASKCTL client is split into three separate applications: Admin (management platform), Designer (integrated development environment for workflows), and Monitor (workflow monitoring and management).
Admin: platform node management, task type management, project management, application settings, global variables, and workflow import and export.
Designer: workflow code management, code editing and design, graphical workflow editing, real-time syntax checking, compiling, and publishing.
Monitor: graphical monitoring, multi-angle statistical monitoring, starting, stopping, and resetting workflows, locking tasks, redoing tasks, and querying object information.
2.2. Why Oozie
The biggest problem with TASKCTL is that, as a scheduling system, it needs a separate scheduling server and does not integrate well with the Hadoop ecosystem and its products, so we looked for a replacement scheduling tool that runs on the Hadoop cluster.
We chose Oozie for two reasons: the company manages its cluster with Ambari, which comes with an Oozie component ready to install, and Oozie supports scheduling through a Java API, and Java is the language we use at work.
III. Introduction to Oozie
3.1. What Oozie is
Oozie is a workflow coordination system, originally developed at Yahoo and later contributed to Apache, that is mainly used to manage Hadoop jobs. It is a web application made up of two components: the Oozie client and the Oozie server.
The Oozie server is a web application that runs in a Java servlet container (Tomcat).
3.2. Why do we need Oozie
① For a more complex Hadoop job system, relying only on shell scripts and manual scheduling makes the process hard to control.
② A system with complex algorithms requires many different kinds of jobs (for example MR, Java programs, shell scripts, Hive SQL, Sqoop, and Spark) to run in a particular order, serially or in parallel, at different times and under different conditions. Scheduling like this needs the support of a system such as Oozie, which simplifies the problem.
3.3. What Oozie brings
① Common tasks in the Hadoop ecosystem, such as starting MR jobs, HDFS operations, shell scheduling, and Hive operations, are scheduled in one unified, coherent way.
② Complex dependencies, time triggers, and event triggers are expressed in XML, which improves development efficiency.
③ A set of tasks is expressed as a DAG (Directed Acyclic Graph), which makes the process logic clearer.
④ Many kinds of tasks can be scheduled, covering most Hadoop task processing.
⑤ Workflow definitions support EL (Expression Language) constants and functions, so anyone who has written shell scripts will find them easy to pick up.
IV. Oozie architecture
Here is an Oozie architecture diagram found online:
Oozie includes four service components:
workflow: supports designing and running a directed acyclic graph (DAG) of actions; nodes such as MR, Hive, and shell can be executed in a particular order.
coordinator: runs a specific workflow on a timed schedule; execution can also be triggered automatically by an event or by the availability of a resource, and parameters can be passed to the workflow.
bundle: groups a set of coordinators so they can be run as a batch.
SLA (Service Level Agreement): used to track program execution through logs.
4.1. Simple Oozie architecture
As the figure shows, Oozie's scheduling is itself an MR program that starts executing and then either ends or fails, which is easy to understand.
So when Oozie schedules an MR program, two MR jobs are actually running at the same time: one is the scheduling job itself and the other is the task.
4.2. Directed acyclic graph
A task is itself a directed acyclic graph (DAG).
In the figure, the MR job and the Hive job that follow the fork node run in parallel, and once both succeed they are merged at the join node.
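The fork/join shape can be illustrated with a small Python sketch. This is not Oozie code and not how Oozie executes workflows internally; it only models the same graph with the standard library's `graphlib` module (Python 3.9 or later) to show which nodes may run at the same time. The node names (`start`, `fork`, `mr_job`, `hive_job`, `join`, `end`) and the helper `execution_waves` are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must finish before it can run.
# Node names are hypothetical; they mirror the fork/join shape described above.
workflow = {
    "fork": {"start"},
    "mr_job": {"fork"},
    "hive_job": {"fork"},
    "join": {"mr_job", "hive_job"},
    "end": {"join"},
}


def execution_waves(graph):
    """Group nodes into waves; nodes in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in execution_waves(workflow):
    print(wave)
```

The MR job and the Hive job land in the same wave because both depend only on the fork node, and the join node appears only after both of them are done.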
4.3. Coordinator life cycle
A coordinator is a timer service that runs tasks at a fixed frequency; in this respect it works like crontab.
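As a rough illustration of running at a fixed frequency, the Python sketch below lists the moments at which a schedule with a start time, an end time, and a frequency in minutes would fire. It is a simplified model and not Oozie's implementation: the function name `nominal_times` is made up, and treating the end time as exclusive is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency_minutes):
    """List the times at which a fixed-frequency schedule would fire.

    The window is treated as [start, end): start is included, end is not.
    This boundary choice is an assumption of the sketch.
    """
    if frequency_minutes <= 0:
        raise ValueError("frequency_minutes must be positive")
    step = timedelta(minutes=frequency_minutes)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


# Every 6 hours over one day.
runs = nominal_times(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 2, 0, 0), 360)
for run in runs:
    print(run.isoformat())
```

Unlike a crontab entry, which repeats indefinitely, a schedule described this way has an explicit start and end, so the full list of runs is known up front.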
4.4. Bundle job
A bundle groups multiple coordinator jobs so that they run together as one batch of timed services, which also lets multiple tasks form a DAG.
V. Oozie installation and configuration
5.1. Oozie installation
Standalone installation: install both the client side and the server side.
Component installation: add the Oozie component through Ambari (HA can be used).
Note: if you use the CDH cluster management tool, configuration is also one-click. Because I installed Oozie directly as a component, I will not go into detail here. If readers need it, contact me and, depending on demand, I may write about configuring Oozie through Ambari.
5.2. Oozie configuration
Node memory configuration:
The memory configuration of a node can cause Oozie scheduling to block. I will write up the symptoms and the solution in full later; for now, here is a look at the relevant settings:
# (node concurrency) determines how many actions you can run at the same time
oozie.service.CallableQueueService.callable.concurrency
# (queue size)
oozie.service.CallableQueueService.queue.size
# (extensions) extension-related settings
oozie.service.ActionService.executor.ext.classes
5.3. Changing the Oozie metadata database
Configure the Oozie metadata database in Ambari.
Ambari uses Derby as the default database.
Unless there are special requirements, we generally choose MySQL instead when configuring.
Select the database type and fill in the database name, user name, password, URL connection string, and driver.
You can then test whether the connection succeeds.
5.4. Adding ext-2.2
Go into the Oozie folder.
Extract ext-2.2.tar.gz into the directory ./libext/ext-2.2
5.5. Adding third-party jar packages
- Runtime shared library directory (on HDFS)
- libserver directory
- libtools directory
VI. Oozie management
6.1. Oozie administrative web interface
Sometimes the Oozie UI cannot be accessed. I will briefly explain this problem in a later update to the article.
6.2. Using Oozie
- Task List View
- Task Status View
- Workflow return information
- Node View
- Workflow graph
- Log Viewer
- View system information and configuration
6.3. Understanding job states
Status | Description |
---|---|
PREP | When a workflow job is first created it is in the PREP state. The workflow job has been defined but is not running. |
RUNNING | When a created workflow job is started, it enters the RUNNING state. It stays in RUNNING until it reaches its end state, ends in error, or is suspended. |
SUSPENDED | A workflow job in the RUNNING state can become SUSPENDED. It stays in that state until the workflow job is resumed or killed. |
SUCCEEDED | When a workflow job in the RUNNING state reaches the end node, it enters the final SUCCEEDED state. |
KILLED | When a workflow job that has been created, is RUNNING, or is SUSPENDED gets killed, it enters the KILLED state. |
FAILED | When a workflow job terminates because of an unexpected error, it enters the FAILED state. |
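The state table can also be read as a small state machine. The Python sketch below encodes the transitions as described in the table and checks whether a sequence of states is a legal job history. It is only a reading aid: the names `TRANSITIONS`, `is_final`, `can_transition`, and `validate_history` are invented for this example and are not part of Oozie.

```python
# Allowed transitions, read off the state table above.
TRANSITIONS = {
    "PREP": {"RUNNING", "KILLED"},
    "RUNNING": {"SUSPENDED", "SUCCEEDED", "KILLED", "FAILED"},
    "SUSPENDED": {"RUNNING", "KILLED"},
    "SUCCEEDED": set(),
    "KILLED": set(),
    "FAILED": set(),
}


def is_final(state):
    """A state is final when no transition leaves it."""
    return not TRANSITIONS[state]


def can_transition(current, target):
    """Return True if a job may move directly from current to target."""
    return target in TRANSITIONS[current]


def validate_history(states):
    """Return True if a sequence of states is a legal job history."""
    if not states or states[0] != "PREP":
        return False
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


print(validate_history(["PREP", "RUNNING", "SUSPENDED", "RUNNING", "SUCCEEDED"]))
print(validate_history(["PREP", "SUCCEEDED"]))
```

Seen this way, SUCCEEDED, KILLED, and FAILED are the only final states, and a job can reach SUCCEEDED or FAILED only by passing through RUNNING.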
I am Lu side. 2020: peace and love.
Don't be surprised: this year's theme is love and peace, and I hope I can keep using it ...
As usual, my personal public account is Lu Fabian Society; you are welcome to follow it.