Oozie task scheduling framework: detailed explanation and usage (Part 1)

Summary: I have been using Oozie recently. It felt awkward at first, but I am finding it more and more interesting, so I am sorting out what I know about it into a series of posts. There is relatively little material on Oozie available, and I hope that writing this series will help me form my own understanding of Oozie and grasp it more completely.

I. Common scheduling frameworks

1.1. crontab timer

crontab is the timer that ships with Linux. It has no web interface, which makes it inconvenient for monitoring and scheduling tasks. When the workload is relatively small, the Linux crontab command is the recommended way to run jobs on a schedule.

## crontab entry format
*   *   *   *   *    command of the job to schedule
min hour day month weekday
## Simple example (runs every day at 00:11)
11 0 * * * /home/hduser/lubians/intelligentDevice/intelligentDevice.sh
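
To make the five time fields concrete, here is a minimal Python sketch that splits a crontab entry into its fields and checks whether it fires at a given minute. The function names are hypothetical, and the matcher is deliberately simplified: it understands only `*` and plain integers (no ranges, lists, or steps) and simply ANDs all five fields, which is not how real cron combines the day and weekday fields when both are restricted.

```python
from datetime import datetime

FIELD_NAMES = ("minute", "hour", "day", "month", "weekday")


def parse_crontab_line(line):
    """Split a crontab entry into its five time fields and the command."""
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = dict(zip(FIELD_NAMES, parts[:5]))
    return schedule, parts[5]


def field_matches(field, value):
    """Simplified matcher: handles only '*' and plain integers."""
    return field == "*" or int(field) == value


def matches(schedule, moment):
    """Return True if the simplified schedule fires at the given minute."""
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (
        field_matches(schedule["minute"], moment.minute)
        and field_matches(schedule["hour"], moment.hour)
        and field_matches(schedule["day"], moment.day)
        and field_matches(schedule["month"], moment.month)
        and field_matches(schedule["weekday"], cron_weekday)
    )


schedule, command = parse_crontab_line(
    "11 0 * * * /home/hduser/lubians/intelligentDevice/intelligentDevice.sh"
)
print(schedule["minute"], schedule["hour"], command)
print(matches(schedule, datetime(2020, 1, 15, 0, 11)))
```

The first field is the minute and the second is the hour, which is why `11 0` means 00:11 rather than 11:00.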
1.2. Azkaban scheduling

An open-source project. Jobs are configured with key/value pairs, it is easy to operate, and it has a web interface.

Azkaban open source website

1.3. Oozie scheduling

An Apache project. It is configured with XML files and is somewhat harder to operate. It has a web interface for viewing jobs and is commonly used to schedule Hadoop-related tasks.

Oozie official website

II. Background

In the second half of the year the company upgraded its technology infrastructure, reworking the management processes and scale of the whole big data cluster and introducing more technical components, one of which was Oozie.

2.1. Scheduling tools used before

The scheduling tools the company used before were mainly TaskCtl and Kettle. TaskCtl is divided into three layers: Manage, Server, and Agent.

It can be understood as hierarchical scheduling.

TASKCTL mainly provides core scheduling functions such as serial, parallel, dependent, and mutually exclusive execution, execution plans, timing, fault tolerance, loops, conditional branching, remote execution, load balancing, and custom conditions.

By function, the TASKCTL client is divided into three separate applications: Admin (management platform), Designer (integrated development environment for flows), and Monitor (flow monitoring and maintenance).

Admin: platform node management, task type management, project management, application settings, global variables, and flow import and export.

Designer: flow code management, code editing and design, graphical flow editing, real-time syntax checking, and compiling and publishing, among other functions.

Monitor: graphical monitoring, multi-angle statistical monitoring, starting, stopping, and resetting flows, locking tasks, redoing tasks, and querying object information.

2.2. Why Oozie

The biggest problem with TaskCtl is that it needs a separate scheduling server and does not integrate well with Hadoop ecosystem products, so we looked for a scheduling tool that runs on the Hadoop cluster instead.

We chose Oozie because the company uses Ambari as its cluster management tool, which ships with Oozie as an installable component, and because Oozie supports scheduling through a Java API, and Java is the language we use at work.

III. Introduction to Oozie

3.1. What Oozie is

Oozie is a workflow coordination system, originally developed at Yahoo and now an Apache project. It is mainly used to manage Hadoop jobs. It is a web application made up of two components, the Oozie client and the Oozie server.

The Oozie server is a web application that runs in a Java servlet container (Tomcat).

image_1akhmftbi11bjakq13n210q216db2a.png

3.2. Why Oozie is needed

① In a fairly complex Hadoop job system, relying only on shell scripts and manual scheduling makes the process hard to control.

② A system with complex algorithms needs many different kinds of jobs (for example MapReduce, Java programs, shell scripts, Hive SQL, Sqoop, and Spark) to run in a particular order, serially or in parallel, at different times and under different conditions. Supporting this kind of scheduling with a system such as Oozie makes a complex problem simpler.

3.3. What Oozie brings

① Common tasks in the Hadoop ecosystem, such as launching MapReduce jobs, HDFS operations, shell scripts, and Hive operations, are scheduled in one unified, coherent way.

② Complex dependencies, time triggers, and event triggers are expressed in XML, which improves development efficiency.

③ A set of tasks is represented as a DAG (Directed Acyclic Graph), and the graphical form makes the process logic clearer.

④ Many task types are supported, enough to cover most Hadoop jobs.

⑤ Workflow definitions support EL constants and functions, which anyone who has written a little shell script will find easy to use.

IV. Oozie architecture

An Oozie architecture diagram found online:

image_1akhmf44nvfh140gqc68pmhs1t.png

Oozie includes four functional components:

workflow: supports defining and executing actions as a directed acyclic graph (DAG), so that MapReduce, Hive, and shell nodes can run in a particular order.

coordinator: runs a specific workflow on a schedule, and can also trigger it automatically based on an event or the availability of a resource, passing parameters to the workflow.

bundle: groups a set of coordinators so they can be set up and run as a batch.

SLA (Service Level Agreement): used to track and log program execution.

4.1. Oozie simple architecture

image_1akhm3bv8muv162h1u41ve318189.png

As the figure shows, Oozie's scheduling is itself carried out by an MR program: it starts, executes, and then ends or fails, which is easy to understand.

So when Oozie schedules an MR program, two MR jobs are in fact running at the same time: one does the scheduling, and the other is the task itself.

4.2. Directed acyclic graph

The task flow itself is a directed acyclic graph (DAG).

image_1akhmaidd6a3d6m11ak15oe9p0m.png

In the figure, the MR job and the Hive job after the fork node run in parallel, and once both succeed they are merged by the join node.
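
The fork and join pattern can be sketched as a workflow skeleton. The Python snippet below uses the standard library to generate the XML; the element names (`workflow-app`, `start`, `fork`, `path`, `action`, `ok`, `error`, `join`, `kill`, `end`) follow the Oozie workflow schema, while the node names (`fork-node`, `mr-job`, `hive-job`, `join-node`, `fail`), the workflow name, and the schema version are assumptions made for illustration. The action bodies are left out, so the output shows only the shape of the graph and is not a workflow that could be submitted as-is.

```python
import xml.etree.ElementTree as ET


def build_fork_join_workflow(name):
    """Build a skeleton: start -> fork -> (mr-job, hive-job) -> join -> end."""
    app = ET.Element(
        "workflow-app", {"name": name, "xmlns": "uri:oozie:workflow:0.5"}
    )
    ET.SubElement(app, "start", {"to": "fork-node"})

    # The fork node lists one <path> per branch that runs in parallel.
    fork = ET.SubElement(app, "fork", {"name": "fork-node"})
    for action_name in ("mr-job", "hive-job"):
        ET.SubElement(fork, "path", {"start": action_name})

    for action_name in ("mr-job", "hive-job"):
        action = ET.SubElement(app, "action", {"name": action_name})
        # The action body (<map-reduce>, <hive>, ...) is omitted in this sketch.
        ET.SubElement(action, "ok", {"to": "join-node"})
        ET.SubElement(action, "error", {"to": "fail"})

    # The join node waits for every forked branch before moving on.
    ET.SubElement(app, "join", {"name": "join-node", "to": "end"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "A parallel branch failed"
    ET.SubElement(app, "end", {"name": "end"})
    return app


workflow = build_fork_join_workflow("fork-join-demo")
print(ET.tostring(workflow, encoding="unicode"))
```

Both actions point their `ok` transition at the same join node, which is what merges the parallel branches back into a single path.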

4.3. Coordinator life cycle

image_1akhmbdfe32m1m6f363juu27713.png

A coordinator is a timing service: it runs tasks at a fixed frequency, which makes it similar to crontab.
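
As a rough illustration of running at a fixed frequency, the Python sketch below lists the planned run times between a start and an end time. The function name and the sample dates are hypothetical, and treating the end time as exclusive is an assumption of this sketch rather than a statement about Oozie's exact behavior.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency_minutes):
    """Yield planned run times from start (inclusive) to end (exclusive)."""
    if frequency_minutes <= 0:
        raise ValueError("frequency must be a positive number of minutes")
    current = start
    step = timedelta(minutes=frequency_minutes)
    while current < end:
        yield current
        current += step


# One day of runs, every 360 minutes (6 hours).
runs = list(
    nominal_times(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 2, 0, 0), 360)
)
for run in runs:
    print(run.isoformat())
```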

4.4. Bundle job

image_1akhmcmbn12hu12g412tv1cjo19g11g.png

A bundle groups multiple coordinators so that they can be set up and run together as a batch of timed services; in that sense it also lets multiple tasks form a DAG.

V. Oozie installation and configuration

5.1. Oozie installation

Standalone installation: install both the client and the server.

Component installation: add the Oozie component through Ambari (HA can be used).

Note: If you use CDH as the cluster management tool, installation is also a one-click configuration. Because I installed Oozie directly as a component, I will not go into detail here. If readers need it, contact me and I may write about configuring Oozie with Ambari.

5.2. Oozie configuration

Node memory configuration:

The node memory configuration can be involved when Oozie scheduling becomes blocked. I will write up the symptoms and the solutions to that problem in a later post; for now, here is a look at the relevant settings.

# (node concurrency) determines how many actions can run at the same time
oozie.service.CallableQueueService.callable.concurrency
# (queue size)
oozie.service.CallableQueueService.queue.size
# (extensions) extension-related settings
oozie.service.ActionService.executor.ext.classes

clipboard.png

5.3. Changing the Oozie metadata database

Configure the Oozie metadata database in Ambari:

clipboard1.png

The Ambari default database is Derby.

When configuring, unless there are special requirements, MySQL is the usual choice.

Select the database type, then fill in the database name, user name, password, JDBC driver, and connection URL.

You can then test whether the connection succeeds.

5.4. Adding ext-2.2

Go to the Oozie folder.

Extract ext-2.2.tar.gz into the directory ./libext/ext-2.2

5.5. Adding third-party jar packages
  • Runtime shared directory (on HDFS)
  • libserver directory
  • libtools directory

VI. Oozie management

6.1. Oozie web administration interface

http://ip:11000/oozie/

The Oozie UI is sometimes inaccessible; I will explain this briefly in a later update to this article.
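
Besides the browser, the server on the same port can also be checked from a script. The Python sketch below builds the URL of the admin status endpoint of Oozie's web services API and, when called against a reachable server, reads the reported system mode. The helper names are hypothetical, `ip` is the same placeholder as in the URL above, and the sketch assumes an unsecured server on the default port 11000.

```python
import json
from urllib.request import urlopen


def admin_status_url(host, port=11000):
    """Build the URL of the Oozie admin status endpoint."""
    return "http://{}:{}/oozie/v1/admin/status".format(host, port)


def fetch_system_mode(host, port=11000, timeout=5):
    """Ask the server for its system mode (for example NORMAL)."""
    with urlopen(admin_status_url(host, port), timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload.get("systemMode")


# Only the URL is printed here; fetch_system_mode() needs a live server.
print(admin_status_url("ip"))
```

If the web page does not load, calling `fetch_system_mode` with the real host is a quick way to tell whether the Oozie server itself is down or only the UI is unavailable.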

6.2. Using Oozie
  • View the task list
  • View task status
  • View information returned by the flow
  • View nodes
  • View the flow chart
  • View logs
  • View system information and configuration

clipboard2.png

6.3. Status descriptions
Status: Meaning
PREP: When a workflow job is first created, it is in the PREP state: the job has been defined but is not yet running.
RUNNING: Once a created workflow job is started, it is in the RUNNING state. It stays there as long as it has not reached its end state, ended in an error, or been suspended.
SUSPENDED: A RUNNING workflow job can be moved to the SUSPENDED state, where it remains until the job is resumed or killed.
SUCCEEDED: When a RUNNING workflow job reaches the end node, it moves to the final SUCCEEDED state.
KILLED: When a workflow job is killed while in the PREP (created), RUNNING, or SUSPENDED state, it moves to the KILLED state.
FAILED: When a workflow job terminates because of an unexpected error, it moves to the FAILED state.
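
The status descriptions above amount to a small state machine. The following Python sketch encodes the transitions as read from those descriptions; the helper names are hypothetical, and the table is a reading of this section rather than something taken from Oozie's source code.

```python
# Allowed next states for each workflow job status, as described above.
TRANSITIONS = {
    "PREP": {"RUNNING", "KILLED"},
    "RUNNING": {"SUSPENDED", "SUCCEEDED", "KILLED", "FAILED"},
    "SUSPENDED": {"RUNNING", "KILLED"},
    "SUCCEEDED": set(),
    "KILLED": set(),
    "FAILED": set(),
}


def can_transition(current, target):
    """Return True if a job in `current` may move to `target`."""
    return target in TRANSITIONS[current]


def is_final(status):
    """A status is final when no further transition is possible."""
    return not TRANSITIONS[status]


print(can_transition("PREP", "RUNNING"))
print(can_transition("SUSPENDED", "SUCCEEDED"))
print(sorted(s for s in TRANSITIONS if is_final(s)))
```

A suspended job cannot succeed directly; it has to be resumed into RUNNING first, which is why the second check is false.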

clipboard3.png

I am Lu side, 2020 peace and love

Don't be surprised: this year's theme is love and peace, and I hope I can keep using it ...

As usual, my personal WeChat official account is Lu Fabian Society; you are welcome to follow it.


Origin www.cnblogs.com/lubians/p/12194612.html