Introduction to Apache Tez

Tez is an open source computing framework from Apache that supports DAG (directed acyclic graph) jobs. It can combine multiple dependent jobs into a single job, which greatly improves the performance of DAG workloads. Tez is not aimed directly at end users; rather, it lets developers build faster and more scalable applications for end users. Hadoop has traditionally been a batch-processing platform for massive data sets. However, many use cases require near-real-time query performance, and some workloads, such as machine learning, are a poor fit for MapReduce. The purpose of Tez is to help Hadoop handle these use cases.


The two main design themes for Tez are:

Empowering end users by:

Expressive dataflow definition APIs

Flexible Input-Processor-Output runtime model

Data type agnostic

Simplifying deployment

 

 

Execution performance:

Performance gains over MapReduce

Optimal resource management

Plan reconfiguration at runtime

Dynamic physical data flow decisions


The goal of the Tez project is to support a high degree of customization so that it can meet the needs of a wide range of use cases, letting people get their work done without resorting to external tools. If projects like Hive and Pig use Tez instead of MapReduce as the backbone of their data processing, their response times will improve significantly. Tez is built on YARN, the resource management framework used by Hadoop.

 

 

The main motivation for Tez was to get around the limitations imposed by MapReduce. Besides the constraint of having to write Mappers and Reducers, forcing every kind of computation into this paradigm is inefficient. For example, temporary data passed between multiple MR jobs is stored in HDFS, which adds load. In Hive, it is very common for a query to perform multiple shuffle operations on unrelated keys, such as join - group by - window function - order by.
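The cost of chaining MR jobs can be illustrated with a toy counting model. This is only a sketch under simplifying assumptions: it is not Tez or Hadoop code, the function names mapreduce_plan and dag_plan are hypothetical, and it assumes one MR job per shuffle stage, with every job except the last writing its output to HDFS for the next job to read.

```python
# Toy cost model, not Tez or Hadoop code: it only counts jobs and the
# intermediate results that have to be written to HDFS between them.

def mapreduce_plan(stages):
    """Assume one MapReduce job per shuffle stage; every job except the
    last writes its output to HDFS so that the next job can read it."""
    jobs = len(stages)
    return {"jobs": jobs, "hdfs_intermediate_writes": max(jobs - 1, 0)}


def dag_plan(stages):
    """Assume the whole pipeline is submitted as a single DAG job, with
    intermediate data moving along edges instead of through HDFS."""
    return {"jobs": 1 if stages else 0, "hdfs_intermediate_writes": 0}


# The Hive-style pipeline mentioned in the text.
stages = ["join", "group by", "window function", "order by"]

mr = mapreduce_plan(stages)
dag = dag_plan(stages)

print("MapReduce chain:", mr)
print("Single DAG job: ", dag)
```

Under these assumptions, the four-stage query needs four MR jobs and three intermediate HDFS writes, while the DAG form is a single job with none. Real plans depend on the query planner, so the numbers are illustrative only.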

 

Key elements of Tez's design philosophy include:

 

Allowing developers (and end users) to do what they want in the most efficient way

Better execution performance

Tez's ability to achieve these goals relies on the following:

 

Expressive Dataflow API - The Tez team wants users to describe the directed acyclic graph (DAG) of the computation they want to run through a set of expressive dataflow definition APIs. To achieve this, Tez implements a structured API in which you add all processors and edges and can visualize the graph that was actually constructed.

Flexible Input-Processor-Output Runtime Model - Runtime executors can be built dynamically by connecting different inputs, processors and outputs.

Data type independence - Tez is only concerned with the movement of data, not with its format (key-value pairs, tuple-oriented formats, etc.).

Dynamic graph reconfiguration

Simple to deploy - Tez is entirely a client-side application that leverages YARN's local resources and distributed cache. To use Tez, you do not need to deploy anything on your cluster; you only need to upload the relevant Tez libraries to HDFS and then use the Tez client to submit jobs with those libraries.

You can even keep two copies of the libraries on your cluster: one for production, using the stable version for all production jobs, and another using the latest version for users to try out. The two copies are independent and do not affect each other.
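The dataflow ideas in the list above (a graph of processors connected by edges, inputs feeding processors that produce outputs, and data moved without regard to its type) can be sketched with a small in-memory model. This is a minimal illustration and not the Tez API: the class ToyDAG and its methods add_vertex, add_edge and run are hypothetical names, and everything runs in one process using Python's standard graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ToyDAG:
    """Hypothetical stand-in for a dataflow graph; this is not the Tez API."""

    def __init__(self, name):
        self.name = name
        self.processors = {}  # vertex name -> callable(list of inputs) -> output
        self.upstream = {}    # vertex name -> list of upstream vertex names

    def add_vertex(self, name, processor):
        self.processors[name] = processor
        self.upstream.setdefault(name, [])
        return self

    def add_edge(self, src, dst):
        if src not in self.processors or dst not in self.processors:
            raise KeyError("both vertices must be added before the edge")
        self.upstream[dst].append(src)
        return self

    def run(self):
        """Run every vertex after its upstream vertices; return all outputs."""
        outputs = {}
        # static_order() yields each vertex after all of its predecessors
        # and raises graphlib.CycleError if the graph is not acyclic.
        for vertex in TopologicalSorter(self.upstream).static_order():
            inputs = [outputs[u] for u in self.upstream[vertex]]
            outputs[vertex] = self.processors[vertex](inputs)
        return outputs


# Three processors wired input -> processor -> output. The edges carry a
# list, then a dict, then an int; the graph never inspects the data type.
dag = (
    ToyDAG("word_count")
    .add_vertex("tokenize", lambda inputs: "to be or not to be".split())
    .add_vertex("count", lambda inputs: {w: inputs[0].count(w) for w in set(inputs[0])})
    .add_vertex("top", lambda inputs: max(inputs[0].values()))
    .add_edge("tokenize", "count")
    .add_edge("count", "top")
)

result = dag.run()
print(result["count"])
print(result["top"])
```

The point of the sketch is the shape of the model: the graph only knows which vertex feeds which, while each processor decides what to do with whatever arrives on its inputs.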

 

Tez can run arbitrary MR jobs without any changes. This enables step-by-step migration for tools that currently rely on MR.
