Comparison of Hadoop v2 Yarn and Hadoop v1 MapReduce

For big data storage and distributed processing in the industry, Hadoop is a well-known and widely used open source framework that combines distributed file storage with a distributed processing engine.

1. Hadoop v1

1.1 Hadoop v1 MapReduce architecture diagram

[Figure: Hadoop v1 MapReduce architecture]

1.2 The workflow and design ideas of a Hadoop v1 MapReduce program

  • First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates with the machines in the cluster regularly through heartbeats, decides which programs should run on which machines, and manages all job failures, restarts, and similar operations. (A minimal sketch of this submission path follows the list.)
  • A TaskTracker runs on every machine in the MapReduce cluster. Its main job is to monitor the resource situation of its own machine.
  • The TaskTracker also monitors the running status of the tasks on its machine. It sends this information to the JobTracker through heartbeats, and the JobTracker collects it to decide which machines a newly submitted job should run on. The dotted arrows in the figure above represent this exchange of messages.
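As a concrete illustration of this submission path, here is a minimal sketch using the classic Hadoop v1 (org.apache.hadoop.mapred) API. The mapper and reducer class names are hypothetical and left commented out, so this job would fall back to the identity mapper/reducer:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hadoop v1 submission path: JobClient ships the job description to the
// JobTracker, which then schedules map/reduce tasks onto TaskTrackers.
public class V1SubmitSketch {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(V1SubmitSketch.class);
        conf.setJobName("v1-example");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // A real job would set its own classes here, e.g.:
        // conf.setMapperClass(MyMapper.class);   // hypothetical
        // conf.setReducerClass(MyReducer.class); // hypothetical

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Blocks until the JobTracker reports job completion.
        JobClient.runJob(conf);
    }
}
```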

1.3 Problems with the Hadoop v1 MapReduce framework

  • The JobTracker is the centralized processing point of MapReduce, and it is a single point of failure.
  • The JobTracker takes on too many duties, which causes excessive resource consumption. When there are too many MapReduce jobs, memory overhead grows large, which in turn increases the risk of JobTracker failure. The industry consensus is that the old Hadoop MapReduce can only scale to an upper limit of about 4,000 nodes.
  • On the TaskTracker side, using the number of map/reduce tasks to represent resources is too simplistic, since CPU and memory occupancy are not considered. If two tasks with large memory consumption are scheduled together, an out-of-memory (OOM) error is likely.
  • On the TaskTracker side, resources are forcibly divided into map task slots and reduce task slots. If at some moment the system has only map tasks or only reduce tasks, the other kind of slot sits idle and resources are wasted, which is the cluster resource utilization problem mentioned earlier. (A configuration sketch of this slot split follows the list.)
  • At the source code level, the code is very difficult to read: a single class often does too many things and runs to more than 3,000 lines, which makes its responsibilities unclear and increases the difficulty of bug fixes and version maintenance.
  • From an operational point of view, any change to the Hadoop MapReduce framework, important or not (bug fixes, performance improvements, new features), forces a system-level upgrade. Worse, regardless of user preference, it forces every client of the distributed cluster to update at the same time, and users waste a lot of time verifying that their existing applications are compatible with the new Hadoop version.
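To make the slot split concrete, here is a small sketch using the Hadoop Configuration API with the Hadoop 1.x TaskTracker slot keys; the values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

// Illustration of the v1 slot model: each TaskTracker advertises a fixed
// number of map slots and reduce slots, regardless of actual CPU/memory use.
public class V1SlotConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Slots reserved for map tasks only (example value) ...
        conf.set("mapred.tasktracker.map.tasks.maximum", "4");
        // ... and slots reserved for reduce tasks only. A map-only workload
        // leaves the reduce slots idle, and vice versa.
        conf.set("mapred.tasktracker.reduce.tasks.maximum", "2");

        System.out.println("map slots    = " + conf.get("mapred.tasktracker.map.tasks.maximum"));
        System.out.println("reduce slots = " + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
    }
}
```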

2. Hadoop v2

2.1 Hadoop v2 MapReduce (Yarn) architecture diagram

[Figure: Hadoop v2 MapReduce (Yarn) architecture]

2.2 Detailed explanation of the three parts of Hadoop v2 MapReduce (Yarn): ResourceManager, ApplicationMaster, and NodeManager

  • The ResourceManager is a central service. Its job is to schedule and start the ApplicationMaster that each job belongs to, and to monitor the liveness of each ApplicationMaster. Careful readers will notice that monitoring and restarting the tasks inside a job are missing here; this is exactly why the ApplicationMaster exists. The ResourceManager is responsible for scheduling jobs and resources: it receives the job submitted by the JobSubmitter, starts the scheduling process based on the job's context information and the status information collected from the NodeManagers, and allocates a Container in which to start the ApplicationMaster. (A minimal client-side sketch of talking to the ResourceManager follows this list.)

  • The NodeManager's function is more specific: it is responsible for maintaining the state of the Containers on its machine and keeping a heartbeat to the ResourceManager (RM).

  • The ApplicationMaster is responsible for all the work within a job's life cycle, similar to the JobTracker in the old framework. But note that every job (not every kind of job) has its own ApplicationMaster, which can run on machines other than the one hosting the ResourceManager.
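Here is a minimal sketch of the client side of this flow using the Yarn client API: the client talks only to the ResourceManager, which allocates the first Container and launches the ApplicationMaster inside it. A full submission would also fill in an ApplicationSubmissionContext (AM jar, launch command, memory) before calling submitApplication; that part is omitted here:

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// The client's only contact point is the ResourceManager: it asks the RM
// for a new application id, then submits a context describing the
// ApplicationMaster container. The RM handles the rest of the lifecycle.
public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        ApplicationId appId = yarnClient.createApplication()
                .getNewApplicationResponse()
                .getApplicationId();
        System.out.println("Granted application id: " + appId);

        // yarnClient.submitApplication(appContext) would go here in a real client.
        yarnClient.stop();
    }
}
```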

2.3 Advantages of the Yarn framework over the old MapReduce framework

  • This design greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and distributing the programs that monitor the status of each job's subtasks (tasks) makes the system safer and more elegant.
  • In the new Yarn, the ApplicationMaster is a replaceable part. Users can write their own ApplicationMaster for different programming models, so that more types of programming models can run in a Hadoop cluster. You can refer to the mapred-site.xml configuration in the official hadoop Yarn configuration template.
  • Resources are represented in terms of memory (in the current version of Yarn, CPU occupancy is not considered), which is more reasonable than the earlier count of remaining slots. (A container-request sketch follows this list.)
  • In the old framework, a big burden on the JobTracker was monitoring the running status of the tasks under each job. Now this work is handed to the ApplicationMaster, and a module in the ResourceManager called ApplicationsMasters (note that it is not ApplicationMaster) monitors the health of each ApplicationMaster and restarts it on another machine if something goes wrong.
  • Container is a framework proposed by Yarn for resource isolation going forward, an idea that should draw on the work of Mesos. At present it only provides isolation of Java virtual machine memory, but the hadoop team's design should be able to support more kinds of resource scheduling and control in the future. Since resources are expressed as amounts of memory, there is no longer the embarrassing situation in which the old map slot/reduce slot separation left cluster resources idle.
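To show what memory-based resource requests look like, here is a sketch of an ApplicationMaster asking the ResourceManager for containers via the AMRMClient API. The values are illustrative, and the registration/allocation loop of a real ApplicationMaster is omitted (note that virtual cores were only added to the resource model in later Yarn versions):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Requests are phrased in resource amounts (memory, plus vcores in later
// versions), not in map/reduce slots, so any mix of work can share a node.
public class AmResourceRequestSketch {
    public static void main(String[] args) {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // 1024 MB of memory and 1 virtual core per container (example values).
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);

        // null nodes/racks: no locality constraint in this sketch. A real
        // ApplicationMaster would register with the RM first and then call
        // allocate() in a loop to receive the granted containers.
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }
}
```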

Reference: Detailed explanation of Hadoop's new MapReduce framework Yarn

Source: blog.csdn.net/ytangdigl/article/details/109235244