Comparison of the task-processing architecture of Hadoop 1.0 and Hadoop 2.0

I recently read an article explaining Hadoop 1 and Hadoop 2. The diagrams are good, so let's take a look.


 

 Hadoop 1.0

[Figure: classic MapReduce (Hadoop 1.0) architecture: JobClient, JobTracker, TaskTrackers]
From the figure above, we can clearly see the workflow and design ideas of the original MapReduce framework:

  1. First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster (heartbeat), decides which programs should run on which machines, and manages all job failures, restarts, and so on.
  2. A TaskTracker runs on every machine in the MapReduce cluster; its main job is to monitor the resources of its own machine.
  3. The TaskTracker also monitors the health of the tasks on its machine. It sends this information to the JobTracker via heartbeat, and the JobTracker collects it to decide which machines a newly submitted job should be assigned to. The dotted arrows in the figure above represent this exchange of messages.
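The heartbeat-and-assignment flow described above can be sketched as a toy model. This is a hypothetical illustration, not the real Hadoop API; the class and field names (`free_slots`, `cluster_state`, etc.) are invented for the example.

```python
class TaskTracker:
    """Per-machine agent: watches local resources, reports via heartbeat."""
    def __init__(self, host, free_slots):
        self.host = host
        self.free_slots = free_slots

    def heartbeat(self):
        # Report this machine's status to the JobTracker.
        return {"host": self.host, "free_slots": self.free_slots}


class JobTracker:
    """Central coordinator: collects heartbeats, assigns tasks to machines."""
    def __init__(self):
        self.cluster_state = {}  # host -> latest heartbeat report

    def receive_heartbeat(self, report):
        self.cluster_state[report["host"]] = report

    def assign(self, job_tasks):
        # Greedily place tasks on the machines with the most free slots.
        hosts = sorted(self.cluster_state.values(),
                       key=lambda r: r["free_slots"], reverse=True)
        return {task: report["host"] for task, report in zip(job_tasks, hosts)}


jt = JobTracker()
for tt in (TaskTracker("node1", 2), TaskTracker("node2", 4)):
    jt.receive_heartbeat(tt.heartbeat())
assignments = jt.assign(["map_0", "map_1"])
print(assignments)  # {'map_0': 'node2', 'map_1': 'node1'}
```

Note how everything flows through the single JobTracker: it holds all cluster state and makes every placement decision, which is exactly the bottleneck discussed next.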

It can be seen that the original MapReduce architecture is simple and clear. In its first few years it accumulated numerous success stories and won widespread support and recognition in the industry. However, as the scale of distributed clusters and their workloads grew, the problems of the original framework gradually surfaced. The main problems are as follows:

  1. The JobTracker is the centralized processing point of MapReduce and constitutes a single point of failure.
  2. The JobTracker performs too many duties, resulting in excessive resource consumption. When there are too many MapReduce jobs, memory overhead grows large, which also increases the risk of JobTracker failure. The commonly cited conclusion in the industry is that the old Hadoop MapReduce can only scale to an upper limit of about 4,000 nodes.
  3. On the TaskTracker side, representing resources simply as a number of map/reduce task slots is too crude and does not take CPU/memory usage into account. If two tasks with large memory consumption are scheduled together, an OOM is likely to occur.
  4. On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If the system has only map tasks or only reduce tasks at a given moment, resources are wasted; this is the cluster resource utilization problem mentioned above.
  5. Reading the source code, one finds it very difficult to follow, often because a single class does too many things and runs to more than 3,000 lines, which blurs the class's responsibilities and increases the difficulty of bug fixing and version maintenance.
  6. From an operational point of view, the old Hadoop MapReduce framework forces a system-level update for every change, important or not (bug fixes, performance improvements, new features). Worse, it forces every client of the distributed cluster to update at the same time, regardless of user preference. These updates make users waste a lot of time verifying that their existing applications still work with the new Hadoop version.
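Problem 4 above, the waste caused by fixed map/reduce slots, can be shown with a few lines of arithmetic. This is a simplified illustration with made-up slot counts, not measured Hadoop behavior:

```python
# Hypothetical cluster: slots are statically split between map and reduce.
MAP_SLOTS, REDUCE_SLOTS = 4, 4

def utilization(pending_maps, pending_reduces):
    """Fraction of all slots that can actually be used, given the
    pending workload; map work can never use reduce slots and vice versa."""
    used = min(pending_maps, MAP_SLOTS) + min(pending_reduces, REDUCE_SLOTS)
    return used / (MAP_SLOTS + REDUCE_SLOTS)

# A burst of map-only work saturates the map slots but leaves every
# reduce slot idle: at most half the cluster can do useful work.
print(utilization(pending_maps=10, pending_reduces=0))  # 0.5
```

Under a flexible container model (as in YARN below), those idle reduce slots could instead be handed to the waiting map tasks.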

 

 Hadoop 2.0

[Figure: Hadoop 2.0 (YARN) architecture: ResourceManager, NodeManagers, ApplicationMasters]
Judging from industry trends in the use of distributed systems and the long-term development of the Hadoop framework, MapReduce's JobTracker/TaskTracker mechanism needed large-scale adjustment to fix its shortcomings in scalability, memory consumption, threading model, reliability, and performance. Over the past few years the Hadoop development team has made some bug fixes, but these fixes have recently grown more and more expensive, which shows how difficult it is to make changes to the original framework.

In order to fundamentally solve the performance bottlenecks of the old MapReduce framework and promote the long-term development of Hadoop, starting from version 0.23.0 the MapReduce framework in Hadoop was completely refactored and underwent fundamental changes. The new Hadoop MapReduce framework is named MapReduce V2, or YARN.

 

The fundamental idea of the refactoring is to split the two main functions of the JobTracker, resource management and task scheduling/monitoring, into separate components. The new ResourceManager globally manages the allocation of computing resources for all applications, while each application's ApplicationMaster is responsible for that application's scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (Directed Acyclic Graph) of jobs. The ResourceManager, together with each machine's NodeManager server, which manages the user processes on that machine, organizes the computation.
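The split of responsibilities can be sketched as a toy model: the ResourceManager only hands out resources and knows nothing about tasks, while each ApplicationMaster drives its own application. This is a hypothetical sketch, not the real YARN API; all names and the single memory dimension are simplifications.

```python
class ResourceManager:
    """Global arbiter of cluster resources; knows nothing about tasks."""
    def __init__(self, total_mem_mb):
        self.free_mem_mb = total_mem_mb

    def allocate(self, mem_mb):
        if mem_mb <= self.free_mem_mb:
            self.free_mem_mb -= mem_mb
            return {"mem_mb": mem_mb}  # a granted container
        return None  # request denied; the AM must wait or retry


class ApplicationMaster:
    """Per-application coordinator: requests containers, tracks its tasks."""
    def __init__(self, rm, tasks):
        self.rm, self.tasks, self.completed = rm, list(tasks), []

    def run(self, mem_per_task_mb):
        for task in self.tasks:
            container = self.rm.allocate(mem_per_task_mb)
            if container is None:
                break  # out of resources; a real AM would wait and re-request
            self.completed.append(task)  # pretend the task ran in the container
        return self.completed


rm = ResourceManager(total_mem_mb=4096)
am = ApplicationMaster(rm, ["map_0", "map_1", "reduce_0"])
print(am.run(mem_per_task_mb=1024))  # all three tasks get containers
```

Contrast this with the Hadoop 1.0 JobTracker, which performed both roles in one process; here a crashing ApplicationMaster takes down only its own application, not the cluster.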

In fact, each application's ApplicationMaster is a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the tasks.

The ResourceManager in the figure above supports hierarchical application queues, each of which is guaranteed a certain share of the cluster's resources. In a sense it is a pure scheduler: it does not monitor or track applications during execution, and likewise it does not restart tasks that fail due to application errors or hardware faults.

The ResourceManager schedules based on the applications' resource requirements; each application may require different types of resources and therefore different containers. Resources include memory, CPU, disk, network, and so on. This is significantly different from the fixed-slot resource model of the old MapReduce, whose inflexibility hurt cluster utilization. The ResourceManager provides a plugin interface for scheduling policies, which are responsible for allocating cluster resources among queues and applications; the scheduling plugins can be based on the existing capacity-scheduling and fair-scheduling models.
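Scheduling by multi-dimensional resource requests, instead of fixed slots, can be illustrated with a small first-fit sketch. This is a deliberately simplified, hypothetical scheduler (a real YARN scheduler also handles queues, priorities, and locality):

```python
def schedule(free, requests):
    """Grant each container request in order if it fits in the remaining
    capacity; 'free' and each request are dicts such as
    {"mem_mb": ..., "vcores": ...}."""
    granted = []
    for req in requests:
        if all(req[dim] <= free[dim] for dim in free):
            for dim in free:
                free[dim] -= req[dim]
            granted.append(req)
    return granted


free = {"mem_mb": 8192, "vcores": 4}
requests = [{"mem_mb": 4096, "vcores": 2},
            {"mem_mb": 4096, "vcores": 2},
            {"mem_mb": 1024, "vcores": 1}]
granted = schedule(free, requests)
print(granted)  # first two requests fit; the third finds no vcores left
```

Because every dimension is checked, two memory-hungry containers can no longer be packed onto a node that only has "slots" to spare, which addresses the OOM scenario described for Hadoop 1.0.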

The NodeManager in the figure above is the per-machine framework agent. It is responsible for the containers that execute applications, monitors their resource usage (CPU, memory, disk, network), and reports it to the scheduler.

The responsibilities of each application's ApplicationMaster are: request appropriate resource containers from the scheduler, run tasks, track the application's status and monitor its progress, and handle the causes of task failures.
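The last responsibility, handling task failures, typically means re-requesting a container and retrying the task. A minimal, hypothetical sketch of that retry loop (names are illustrative; this is not the YARN client API):

```python
def run_with_retries(task, max_attempts=3):
    """Run a task, retrying on failure; after max_attempts, give up
    and surface the last recorded failure reason."""
    last = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()  # run the task in its container
        except RuntimeError as err:
            last = err  # record the failure reason, then retry
    raise RuntimeError(f"task failed after {max_attempts} attempts: {last}")


# A task that fails twice (e.g. "container lost") before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("container lost")
    return "done"

result = run_with_retries(flaky)
print(result)  # "done" on the third attempt
```

In Hadoop 1.0 this recovery logic lived inside the single JobTracker; in YARN each ApplicationMaster owns it for its own tasks.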

 

Detailed configuration reference:

http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/
