Zhuangshi Data Technology 05: Data Scheduling

Zhao Zhuangshi / A Data Person's Private Place
Hello everyone, I am Zhao Zhuangshi, your girl of uncanny strength.

I'm very happy to meet you again on a Saturday morning~ In the last installment, "Zhuangshi Data Technology 04: ETL", we discussed the development of data warehouses. Today we will pick up from there and talk about the link that connects data processing with data report production: data scheduling.

01 What is data scheduling?

In data development, "data scheduling" usually refers to task scheduling or job scheduling. Before we go further, let's clarify two concepts: job and task.

The difference between a job and a task depends on the context:

In the Spark context

In Spark, a task is the smallest unit of computation, produced by splitting up a job. In general, an RDD yields as many tasks as it has partitions, because each task processes the data of exactly one partition. Tasks are grouped into batches called stages. Spark establishes dependencies between stages and between tasks to guarantee the correctness and completeness of the whole job; when the last ResultTask finishes, the job has run successfully.

job > stage > task
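For intuition, here is a minimal PySpark sketch (assuming a local Spark installation; the app name is illustrative): an RDD with 4 partitions yields 4 tasks per stage, and the shuffle introduced by reduceByKey splits the single job into two stages.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "job-stage-task-demo")

# An RDD with 4 partitions: each partition becomes one task per stage.
rdd = sc.parallelize(range(100), 4)

# map() is a narrow transformation, so it stays in the same stage;
# reduceByKey() needs a shuffle, which cuts the job into two stages.
pairs = rdd.map(lambda x: (x % 10, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# collect() is an action: it triggers exactly one job, which Spark runs
# as 2 stages of 4 tasks each (one task per partition per stage).
print(counts.collect())
sc.stop()
```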

In the Hadoop context

In Hadoop, a unit of work is called a Job. A job is divided into a Map Task phase and a Reduce Task phase. Each task runs in its own process, and when the task ends, the process ends with it.

job > task
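For intuition, here is a word-count sketch in plain Python that simulates this split (a simulation, not actual Hadoop code): in a real MapReduce job, each map_task call below would run as its own Map Task process, and each key group would be handled by a Reduce Task.

```python
from collections import defaultdict

def map_task(split):
    # one Map Task: turn its input split into <word, 1> pairs
    return [(word, 1) for word in split.split()]

def reduce_task(word, counts):
    # one Reduce Task group: aggregate all the counts for one key
    return word, sum(counts)

splits = ["big data scheduling", "data scheduling matters"]  # input splits

shuffled = defaultdict(list)
for split in splits:                  # the job fans out into Map Tasks...
    for word, one in map_task(split):
        shuffled[word].append(one)    # shuffle: group values by key

for word in sorted(shuffled):         # ...then into Reduce Tasks
    print(reduce_task(word, shuffled[word]))
```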

In the context of a scheduling product

Task: a task, i.e., a defined piece of work to be done.

TaskType: the type of a task, such as ETL, MR job, or Simple.

Job: a job, i.e., one execution of a task at runtime.

To sum up, the relationship between job and task differs from context to context, so when you move between data scheduling products, pay attention to how each one defines them.

Let's summarize: data scheduling is about when a task starts, when it ends, and how the dependencies between tasks are handled correctly. Our primary concern is to start the right job at the right time, ensuring that jobs execute promptly and accurately while following the correct dependencies.

02 What modules does the data scheduling product include?

When designing a scheduling product, we need to think through several issues (a sketch combining the first three appears after this list):

1. Trigger mechanism: time, dependency, or a mix of both

·Time: tasks are scheduled by time (year/month/day/hour/minute/second/millisecond)

·Dependency: tasks are scheduled according to their dependencies

·Mixed: the two mechanisms are combined and schedule each other

2. Workflow: task status (interrupted & running), task management and governance (types, changes), task types, task sharding.

3. Scheduling strategy: readiness & timeout; retries, retry count & retry interval.

4. Task isolation: the relationship between tasks and their executions.
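To make the first three points concrete, here is a minimal sketch written as an Airflow DAG (assuming Airflow 2.x-style imports; the scripts extract.sh and load.sh are hypothetical): the DAG as a whole is triggered by time, the tasks inside it are triggered by dependency, and the retry/timeout strategy rides along in default_args.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                             # scheduling strategy: retry count
    "retry_delay": timedelta(minutes=5),      # ...and the wait between retries
    "execution_timeout": timedelta(hours=1),  # kill a task that hangs
}

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 2 * * *",  # time trigger: 02:00 every day
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="sh extract.sh")
    load = BashOperator(task_id="load", bash_command="sh load.sh")

    # dependency trigger: load runs only after extract succeeds
    extract >> load
```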

Task scheduling systems currently on the market include Oozie, Azkaban, Airflow, and so on. In addition, there are TBSchedule from Alibaba, Lhotse from Tencent, and elastic-job from Dangdang.

We can divide them into two categories, DAG workflow systems and timed sharding systems:

One category is the DAG workflow systems: Oozie, Azkaban, Chronos, Lhotse

The other is the sharding systems: TBSchedule, elastic-job, Saturn

Here, DAG stands for Directed Acyclic Graph: a graph in which every edge has a direction and no loops exist. A soul painter has drawn one below, so we can feel what "directed and acyclic" means.

(Figure: a hand-drawn directed acyclic graph)

If we choose the DAG workflow style, we must pay attention to both time and completion status, to ensure a rich and flexible trigger mechanism.
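To make "directed and acyclic" concrete, here is a minimal Python sketch (generic, not tied to any particular product) of Kahn's topological sort: it yields a valid execution order for a DAG, and fails when the graph contains a loop.

```python
from collections import deque

def run_order(tasks, edges):
    # tasks: list of task names; edges: (upstream, downstream) pairs
    indegree = {t: 0 for t in tasks}
    downstream = {t: [] for t in tasks}
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1

    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()       # "run" the next task with no pending deps
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1       # one of nxt's dependencies is now done
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(tasks):     # leftover tasks mean a loop: not a DAG
        raise ValueError("cycle detected: this is not a DAG")
    return order

# a diamond dependency: a -> b, a -> c, b -> d, c -> d
print(run_order(["a", "b", "c", "d"],
                [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```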

What is sharding? Let's take an example: suppose we have 3 physical machines and 10 timed tasks that run every 5 seconds, and every single task happens to execute on the first machine. To avoid one machine drowning in work while the others sit idle, we need to spread the tasks evenly across all machines currently able to execute them. This is the so-called sharding mechanism. Common sharding strategies include even distribution, hashing, and round-robin; we use these algorithms to make sure the physical machines "wear" evenly.
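Here is a minimal Python sketch of that example (machine and task names are illustrative), showing two of the strategies just mentioned: round-robin and a stable hash.

```python
import zlib

machines = ["node-1", "node-2", "node-3"]
tasks = [f"task-{i}" for i in range(10)]

# Round-robin: task i goes to machine i mod 3, so the load spreads evenly.
round_robin = {t: machines[i % len(machines)] for i, t in enumerate(tasks)}

# Stable hash (CRC32 of the task name): the same task always lands on the
# same machine, even across scheduler restarts.
hashed = {t: machines[zlib.crc32(t.encode()) % len(machines)] for t in tasks}

for t in tasks:
    print(f"{t}: round-robin -> {round_robin[t]}, hash -> {hashed[t]}")
```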

If we choose the sharding approach, we must pay attention to accurate and punctual triggering.

03 Introduction to data scheduling products

For simple offline data migration jobs, we generally write shell scripts and run them periodically through crontab (for example, an entry like "0 2 * * * sh etl.sh", with an illustrative script name, runs it at 02:00 every day). However, as jobs multiply and grow more complex, coordination and task monitoring become troublesome, so we turn to dedicated tools for scheduling and monitoring.

3.1 DAG workflow systems

Oozie

Oozie is an open-source workflow scheduling engine for the Hadoop platform that manages Hadoop jobs. It is a web application composed of two components, the Oozie client and the Oozie server. Oozie can do more than configure individual MR (MapReduce) workflows: it can run an MR job MR1, then a Java program, then a shell script, then a Hive script, then a Pig script, and finally another MR job MR2. When using Oozie, if a task fails to execute, the next task will not be scheduled.

Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn, used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a KV (key-value) file format to establish dependencies between tasks; for example, a .job file might declare type=command, command=sh load.sh, and dependencies=extract (illustrative names).


Chronos

Chronos is an open-source product from Airbnb designed to replace crontab. With it, users can schedule jobs; it supports using Mesos as the job executor and can interact with Hadoop. Chronos can also define triggers that fire after a job finishes, and it supports dependency chains of arbitrary length.


3.2 Sharding systems

TBSchedule: Taobao's open-source distributed scheduling framework, implemented in Java on top of ZooKeeper. It dynamically assigns batch tasks, or constantly changing tasks, to different thread groups in the JVMs of multiple hosts for parallel execution, so that all tasks are processed quickly, with no repetition and no omission.

elastic-job: an elastic distributed task scheduling system developed by Dangdang. It uses ZooKeeper for distributed coordination, achieving high availability and task sharding, and it also has cloud support.

Saturn: a distributed timed-task scheduling platform developed in-house by Vipshop. It is built on version 1 of Dangdang's elastic-job and deploys well into Docker containers.


Well, today was another brain-burning day.

I still remember when the cute and lovely Wenyu first told me about crontab; back then I thought it was something extremely difficult.

But looking at it now from the vantage point of whole scheduling products and technical frameworks, crontab is just the starter village in Zelda.

So you will experience the misery of hunting for winter clothes, and also the joy of finally getting the paraglider.

To define is to limit.


That's enough musing for today. I'm Zhao Zhuangshi. See you next week in "Zhuangshi Data Technology 06"!


A Data Person's Private Place is a big family that helps data people grow, helping friends who are interested in data clarify their learning direction and sharpen their skills. Follow me, and I'll take you exploring the magical mysteries of data.

1. Reply "Data Products" to get <Interview Questions for Data Products at Big Tech Companies>;

2. Reply "Data Center" to get <Big Tech Data Center Materials>;

3. Reply "Business Analysis" to get <Big Tech Business Analysis Interview Questions>;

4. Reply "make friends" to join the exchange group and get to know more data friends.
