Why is Yarn the king of big data resource scheduling frameworks?

This article is shared from the HUAWEI CLOUD Community article "Why is Yarn the King of Resource Scheduling Frameworks?" by JavaEdge.

Hadoop has three main components:

  • HDFS, the distributed file system

  • MapReduce, the distributed computing framework

  • Yarn, the distributed cluster resource scheduling framework

Yarn emerged alongside the development of Hadoop. It turned Hadoop from a single big data computing engine into a complete big data platform integrating storage, computing, and resource management, which then grew its own ecosystem and became synonymous with big data.

When a MapReduce application starts, the most important step is distributing the MapReduce program to the servers of the big data cluster. In Hadoop 1, this was accomplished mainly through communication between the TaskTracker and JobTracker processes.

Disadvantages of this design

Cluster resource scheduling and the MapReduce execution process are coupled together. If you want to run other computing frameworks on the same cluster, such as Spark or Storm, they cannot share the cluster's resources in a unified way.

In the early days of Hadoop, when Hadoop was essentially the only big data technology, this disadvantage was not obvious. But as big data developed and new computing frameworks kept appearing, it became impractical to deploy a separate server cluster for each framework, and even if you could deploy a new cluster, the data would still live on the original cluster's HDFS. It therefore became necessary to separate resource management from the MapReduce computing framework. This was the main change in Hadoop 2: Yarn was split out of MapReduce and became an independent resource scheduling framework.

Yarn stands for Yet Another Resource Negotiator. When the Hadoop community decided to separate resource management from Hadoop 1 and develop Yarn independently, there were already resource management products in the industry, such as Mesos, so Yarn's developers simply called their product "yet another resource negotiator". The naming echoes Java's Ant, which is short for "Another Neat Tool".

Yarn Architecture

Yarn includes:

NodeManager

The NodeManager process is responsible for resource and task management on an individual server. It runs on every compute server in the cluster, alongside HDFS's DataNode process.

ResourceManager

The ResourceManager process is responsible for resource scheduling and management across the entire cluster and is usually deployed on a dedicated server.

The ResourceManager consists of two main components:

Scheduler

The scheduler is essentially a resource allocation algorithm: it allocates resources according to the resource requests submitted by client applications and the current resource status of the server cluster.

Yarn's built-in resource scheduling algorithms

These include the Fair Scheduler, the Capacity Scheduler, and others; you can also implement your own scheduling algorithm for Yarn to call.
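
As a minimal sketch of where this choice is made: in a real cluster the scheduler class is normally configured in yarn-site.xml rather than in code, so the snippet below only illustrates which property selects the algorithm (here pointing at the built-in Capacity Scheduler):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        // In a real cluster this property lives in yarn-site.xml; setting it in
        // code here only shows which knob selects the scheduling algorithm.
        YarnConfiguration conf = new YarnConfiguration();
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        System.out.println("Scheduler: " + conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```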

Yarn's unit of resource allocation is the container. Each container bundles a certain amount of computing resources such as memory and CPU; by default a container is assigned one CPU core. Containers are started and managed by the NodeManager process, which monitors the running state of the containers on its node and reports it to the ResourceManager process.
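
To make the idea concrete, here is a tiny, hypothetical sketch of how a container's size is expressed through Yarn's Java API; the 1024 MB figure is an arbitrary illustrative number:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSize {
    public static void main(String[] args) {
        // A container's "size" is just a Resource record: memory in MB plus CPU vcores.
        Resource oneContainer = Resource.newInstance(1024, 1);
        System.out.println(oneContainer); // prints something like <memory:1024, vCores:1>
    }
}
```

A full container request built from such a Resource appears in the ApplicationMaster sketch below.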

Application Manager

The application manager is responsible for submitting applications, monitoring their running status, and so on. After an application is submitted, an ApplicationMaster must run for it in the cluster, and the ApplicationMaster itself runs inside a container. Each application first starts its own ApplicationMaster; the ApplicationMaster then requests container resources from the ResourceManager process according to the application's needs, and once the containers are granted, it distributes the application code to them and starts it, at which point the distributed computation begins.
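
The sketch below is a simplified, hypothetical example rather than how any particular engine's ApplicationMaster is actually written, but it shows that lifecycle with Yarn's client API: register with the ResourceManager, request one container, ask the owning NodeManager to launch a command in it, then deregister. The 1 GB / 1 vcore request and the echo command are placeholders.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // 1. Register this ApplicationMaster with the ResourceManager.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // 2. Ask the ResourceManager for one container (1 GB memory, 1 vcore).
        Resource capability = Resource.newInstance(1024, 1);
        rm.addContainerRequest(new ContainerRequest(capability, null, null,
                Priority.newInstance(0)));

        // 3. Poll the ResourceManager until the container is allocated, then ask
        //    the NodeManager that owns it to launch our (placeholder) command.
        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();
        boolean launched = false;
        while (!launched) {
            AllocateResponse response = rm.allocate(0.1f);
            for (Container container : response.getAllocatedContainers()) {
                ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                        Collections.emptyMap(), Collections.emptyMap(),
                        Collections.singletonList("echo hello-from-yarn"),
                        Collections.emptyMap(), null, Collections.emptyMap());
                nm.startContainer(container, ctx);
                launched = true;
            }
            Thread.sleep(1000);
        }

        // 4. Tell the ResourceManager we are done and release all resources.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```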

Yarn's workflow

1. We submit the application to Yarn (a client-side sketch of this step appears after the workflow), including:

  • MapReduce ApplicationMaster
  • Our MapReduce program
  • MapReduce Application startup command

2. The ResourceManager process communicates with a NodeManager process and, based on the cluster's resources, allocates the first container for the user program, distributes the MapReduce ApplicationMaster to that container, and starts the MapReduce ApplicationMaster inside it.

3. The MapReduce ApplicationMaster registers with the ResourceManager process immediately after it starts, and applies for container resources for its own application.

4. Once the MapReduce ApplicationMaster has been granted the containers it requested, it communicates with the corresponding NodeManager processes, distributes the user's MapReduce program to the servers where those NodeManager processes run, and starts Map or Reduce tasks in the containers.

5. Map and Reduce tasks communicate with the MapReduce ApplicationMaster while they run and report their status. When the job finishes, the MapReduce ApplicationMaster deregisters from the ResourceManager process and releases all container resources.
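
For step 1, the client side roughly looks like the following sketch. It is only an illustration built on Yarn's YarnClient API, not the exact code any product uses; the application name and the ApplicationMaster launch command are placeholders:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-app"); // placeholder name

        // Describe how to launch the ApplicationMaster in its first container.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),   // local resources (our program jars would go here)
                Collections.emptyMap(),   // environment variables
                Collections.singletonList("echo launch-my-application-master"), // placeholder AM command
                Collections.emptyMap(), null, Collections.emptyMap());
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // resources for the AM container

        // Hand everything to the ResourceManager; Yarn takes over from here.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```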

For MapReduce to run on Yarn, a MapReduce ApplicationMaster that follows the Yarn specification had to be developed. Other big data computing frameworks can likewise develop their own Yarn-compliant ApplicationMasters. In this way, different big data computing frameworks can run concurrently in the same Yarn cluster, with resources scheduled and managed in a unified way.
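
From an application developer's point of view this is mostly invisible: in Hadoop 2, submitting an ordinary MapReduce job with the mapreduce.framework.name property set to "yarn" is what causes the MapReduce ApplicationMaster to be launched on Yarn. A minimal sketch, with the job name as a placeholder and the mapper/reducer setup elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RunOnYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tells the MapReduce client to submit the job to Yarn, which then
        // launches the MapReduce ApplicationMaster for this job.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "word-count"); // placeholder job name
        // ... set mapper, reducer, input and output paths here ...
        // job.waitForCompletion(true);
    }
}
```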

Why HDFS is a system while MapReduce and Yarn are frameworks

A framework follows the dependency inversion principle: high-level modules must not depend on low-level modules; both should depend on an abstraction, which is defined by the high-level module and implemented by the low-level module.

High-level and low-level modules are divided along the call chain: whatever comes earlier in the chain is high-level, and whatever comes later is low-level. Take a web application as an example. After a user request reaches the server:

  • The first thing to process the user request is the web container, Tomcat, which listens on port 80 and wraps the HTTP byte stream into a Request object.
  • The Spring MVC framework then extracts the user parameters from the Request object and, according to the requested URL, dispatches them to the corresponding Model object for processing.
  • Finally, the application code handles the user request.

Tomcat is the high-level module relative to Spring MVC, and Spring MVC in turn is the high-level module relative to the application code. Tomcat does call Spring MVC, since it has to hand the Request over to it, yet Tomcat does not depend on Spring MVC. How can Tomcat call Spring MVC without depending on it?

Because both Tomcat and Spring MVC depend on the J2EE specification: Spring MVC implements the HttpServlet abstract class of the J2EE specification with its DispatcherServlet and registers it in web.xml. In this way Tomcat can call DispatcherServlet to handle user requests without knowing anything about Spring MVC itself.
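
To make the inversion concrete, here is a minimal, hypothetical servlet. Tomcat only ever sees the HttpServlet abstraction from the Servlet specification; the concrete class is wired in through web.xml (or an annotation), which is exactly how DispatcherServlet gets called too:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Tomcat never depends on this class directly. It only knows the HttpServlet
// abstraction; web.xml (or an annotation) tells it which concrete class to call.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.getWriter().write("handled by a low-level module Tomcat has never heard of");
    }
}
```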

Similarly, Spring MVC does not depend on the Java code we write; it calls our code through the abstraction of its configuration files or annotations. That is why both Tomcat and Spring MVC can be called frameworks: both follow the dependency inversion principle.

In the same way, a program that implements the MapReduce programming interface and follows the MapReduce programming specification can be invoked by the MapReduce framework to compute over large-scale data on a distributed cluster; a component that implements the Yarn interface specification, such as MapReduce in Hadoop 2, can be scheduled and managed by Yarn, which allocates server resources in a unified way. So MapReduce and Yarn are both frameworks.
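
For example, a word-count mapper only implements the Mapper interface the framework defines; the framework decides when, and in which Yarn container, map() is actually invoked. A minimal sketch (class and field names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// We only implement the interface the MapReduce framework defines; the framework
// calls map() for each input record, wherever Yarn happens to run the task.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```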

HDFS, by contrast, is not a framework. To use HDFS you call the APIs it provides directly; your application depends on HDFS as an underlying module, and HDFS never calls back into your code.
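
A minimal sketch of that direct dependency, using the standard HDFS FileSystem client API (the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Our code calls the HDFS client API directly: we depend on HDFS,
        // not the other way around. The path below is a placeholder.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```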

Summary

Yarn, the big data resource scheduling framework, schedules the big data computing engines themselves. This is unlike MapReduce or Spark programming, where every big data application developer writes their own MapReduce or Spark program according to their needs.

The Yarn-facing modules inside the big data computing engines have already been written by the developers of those engines, so ordinary big data developers rarely get the chance to write Yarn-related code themselves.

