Introduction to frameworks running on YARN

Offline computing framework MapReduce

Principle: The computation is divided into two stages, Map and Reduce. The Map stage processes the input data in parallel, and the Reduce stage aggregates the Map results. Shuffle connects the two stages: each MapTask writes its output to local disk, and each ReduceTask fetches its portion of the data from every MapTask.
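The Map, Shuffle and Reduce flow described above can be sketched as a single-process word count. This is only an illustration in plain Python, not Hadoop's API: the function names (map_task, partition, reduce_task, run_job) and the byte-sum partitioner are assumptions made for the example, and in-memory lists stand in for the local-disk files a real MapTask would write.

```python
from collections import defaultdict

def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def partition(pairs, num_reducers):
    """Map-side shuffle: bucket the map output by target reducer.
    A real MapTask would write these buckets to local disk."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        # Hypothetical deterministic partitioner: byte sum of the key.
        buckets[sum(key.encode()) % num_reducers].append((key, value))
    return buckets

def reduce_task(reducer_id, map_outputs):
    """Reduce: fetch this reducer's bucket from every map task, then sum per key."""
    grouped = defaultdict(list)
    for buckets in map_outputs:
        for key, value in buckets[reducer_id]:
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

def run_job(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, and merge the reducer outputs."""
    map_outputs = [partition(map_task(s), num_reducers) for s in splits]
    result = {}
    for reducer_id in range(num_reducers):
        result.update(reduce_task(reducer_id, map_outputs))
    return result

word_counts = run_job(["yarn runs mapreduce", "yarn runs tez", "tez runs on yarn"])
```

Because every occurrence of a key lands in the same bucket index, each reducer sees all values for its keys, which is the property Shuffle has to guarantee.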

Advantages: good fault tolerance and scalability; well suited to simple batch processing tasks

Disadvantages: only suitable for offline batch processing; high startup overhead; inefficiency due to heavy use of disk, etc.

MapReduce 2.0 and YARN:

Running an MR application successfully requires several modules: task management and resource scheduling; task-driven modules (MapTask, ReduceTask); user code (Mapper, Reducer...).

Differences between MapReduce 2.0 and YARN: YARN is a resource management system responsible for resource management and scheduling; MapReduce is just one application running on YARN. If YARN is regarded as "Android", then MapReduce is just an "app".

MapReduce 2.0 consists of: YARN (only one for the entire cluster), MRAppMaster (one per application), and user code (Mapper, Reducer...).

The difference between MapReduce 1.0 and MapReduce 2.0: MapReduce 1.0 is an independent system that runs directly on Linux. MapReduce 2.0 is a framework that runs on YARN, where it can run alongside various other frameworks.

DAG computing framework Tez

1. When multiple jobs have data dependencies on one another, the dependencies form a directed acyclic graph (DAG). Computing over this graph is called "DAG computing".
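As a rough illustration of DAG computing, the sketch below orders a handful of dependent jobs into "waves" that could run in parallel. It uses Python's standard graphlib module, not Tez; the job names and the schedule_in_waves helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs whose output it reads.
job_dependencies = {
    "extract_a": set(),
    "extract_b": set(),
    "join": {"extract_a", "extract_b"},
    "aggregate": {"join"},
}

def schedule_in_waves(dependencies):
    """Group jobs into waves; every job in a wave has all its inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking its successors
    return waves

waves = schedule_in_waves(job_dependencies)
```

Here the two extract jobs share the first wave because neither depends on the other, while join and aggregate must each wait for the previous wave.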

2. Apache Tez: a DAG computing framework based on YARN. It runs on top of YARN and makes full use of YARN's resource management and fault tolerance functions; provides a rich dataflow API; offers an extensible "Input-Processor-Output" runtime model; and dynamically generates physical dataflow relationships.

3. Tez optimization techniques:

ApplicationMaster buffer pool: jobs are submitted to the AMPoolServer service, which pre-starts several ApplicationMasters to form an ApplicationMaster buffer pool;

Container pre-starting: several Containers can be pre-started when the ApplicationMaster starts;

Container reuse: after a task completes, the ApplicationMaster does not immediately release the Container it used, but reassigns it to other tasks that have not yet run.
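The effect of pre-starting and reusing Containers can be shown with a toy model. This is not Tez or YARN code: ToyApplicationMaster, its methods and the container names are all hypothetical, and "running" a task is reduced to recording which container it was given.

```python
from collections import deque

class ToyApplicationMaster:
    """Toy model: keeps finished containers in an idle pool instead of releasing them."""

    def __init__(self, prestarted):
        # Pre-started containers are available before any task is scheduled.
        self.idle = deque(f"container_{i}" for i in range(prestarted))
        self.launched = prestarted  # total containers ever started

    def acquire(self):
        """Reuse an idle container if one exists, otherwise start a new one."""
        if self.idle:
            return self.idle.popleft()
        container = f"container_{self.launched}"
        self.launched += 1
        return container

    def run_tasks(self, tasks, parallelism):
        """Run tasks in batches of `parallelism`; return the task-to-container map."""
        assignments = {}
        pending = deque(tasks)
        while pending:
            batch = [pending.popleft() for _ in range(min(parallelism, len(pending)))]
            held = []
            for task in batch:
                container = self.acquire()
                assignments[task] = container
                held.append(container)
            # The batch has finished: keep its containers for the next pending tasks.
            self.idle.extend(held)
        return assignments

am = ToyApplicationMaster(prestarted=2)
assignments = am.run_tasks(["t1", "t2", "t3", "t4", "t5"], parallelism=2)
```

In this run five tasks are served by the two pre-started containers, so no container start-up cost is paid after the ApplicationMaster comes up.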

4. Tez application scenarios

Writing applications directly: Tez provides a common programming interface suitable for writing jobs with dependencies.

Optimizing engines such as Pig and Hive: the next generation of Hive (Stinger)

Benefit 1: Avoids the large amount of unnecessary network and disk I/O incurred when a query statement is converted into too many MapReduce jobs

Benefit 2: A more intelligent task processing engine

Stream computing framework Storm

Stream computing means that the data to be processed flows into the system continuously, like a stream, and the system must process and compute on each piece of data in real time, never stopping (until the user explicitly kills the process);

Traditional approach: real-time computing built on a processing network of message queues and message processors; it lacks automation and robustness and scales poorly
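The record-at-a-time model of stream computing can be sketched with Python generators. This is not Storm's API: sentence_source, split_words and running_count are hypothetical stages, and the source simply cycles through two fixed sentences to imitate a stream that never ends.

```python
from collections import Counter
from itertools import islice

def sentence_source():
    """Unbounded source: yields sentences forever, like a stream with no end."""
    sentences = ["storm processes streams", "streams never stop"]
    while True:
        for sentence in sentences:
            yield sentence

def split_words(sentences):
    """Processing stage: split each incoming sentence into words as it arrives."""
    for sentence in sentences:
        yield from sentence.split()

def running_count(words):
    """Processing stage: update the count and emit a result for every single word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]

pipeline = running_count(split_words(sentence_source()))
# The pipeline would run forever; take only the first seven updates to inspect it.
first_updates = list(islice(pipeline, 7))
```

Unlike a batch job, the pipeline produces an updated result after each record and only stops because the caller stops pulling from it.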

In-memory computing framework Spark

Overcomes the shortcomings of MapReduce in iterative and interactive computing;

Introduces the RDD (Resilient Distributed Datasets) data representation model: an RDD is a fault-tolerant data collection that can be operated on in parallel and cached in memory or on disk.
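The idea behind an RDD can be illustrated with a toy class that remembers how each dataset was derived, so a lost cached copy can be rebuilt by recomputation. This is not Spark's API: ToyRDD, evict and the computations counter are hypothetical, and nested lists stand in for distributed partitions.

```python
class ToyRDD:
    """Toy dataset that records how to recompute its partitions."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the list of partitions
        self._cached = False      # whether results should be kept after computing
        self._cache = None        # kept partitions, if any
        self.computations = 0     # how many times the partitions were rebuilt

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split a local list into partitions (round-robin)."""
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda: parts)

    def map(self, fn):
        """Derive a new dataset; nothing is computed until results are collected."""
        return ToyRDD(lambda: [[fn(x) for x in part] for part in self.partitions()])

    def cache(self):
        self._cached = True
        return self

    def evict(self):
        """Simulate losing the cached copy, e.g. after a node failure."""
        self._cache = None

    def partitions(self):
        if self._cached and self._cache is not None:
            return self._cache
        self.computations += 1
        parts = self._compute()
        if self._cached:
            self._cache = parts
        return parts

    def collect(self):
        return [x for part in self.partitions() for x in part]

numbers = ToyRDD.parallelize([1, 2, 3, 4], num_partitions=2)
squares = numbers.map(lambda x: x * x).cache()
first = squares.collect()    # computed once and cached
second = squares.collect()   # served from the cache
squares.evict()
third = squares.collect()    # cache lost, so rebuilt from the parent dataset
```

Repeated reads of a cached dataset skip recomputation, which is what makes iterative and interactive workloads cheaper than re-running a full MapReduce job each time.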