A summary of several key points about YARN

In Hadoop 1.0, the JobTracker handled two functions: resource management and job control. Once the cluster grows large, the JobTracker
has the following shortcomings:
1) The JobTracker is a single point of failure.
2) The JobTracker bears heavy access pressure, which limits the scalability of the system.
3) It does not support computing frameworks other than MapReduce, such as Storm, Spark, and Flink.

Therefore, YARN separates resource management from job control: the JobTracker is replaced by the ResourceManager and the ApplicationMaster.

  ● The ResourceManager (RM) is the global resource manager. Its job is to schedule and start the ApplicationMaster that each Job belongs to, and to monitor whether that ApplicationMaster is still alive. Note: the RM is only responsible for monitoring the AM and restarting it when the AM fails; it is not responsible for fault tolerance of the tasks inside the application, which is the AM's job. (Inside the RM, this monitoring is handled by the ApplicationsManager.)
  ● The ApplicationMaster (AM) belongs to each individual Job (not to each job type). It can run on machines other than the one hosting the ResourceManager, and each application has its own ApplicationMaster.
  ● The NodeManager (NM) is the ResourceManager's agent on each node. It maintains the state of the Containers on that node and sends heartbeats to the RM.
  ● In addition, YARN abstracts resources as Containers. A Container encapsulates a certain amount of resources on a node (currently YARN only supports CPU and memory). When the AM requests resources from the RM, the RM returns them to the AM in the form of Containers. YARN assigns one or more Containers to each task, and a task can only use the resources described in its Container. (Note: the AM itself also runs in a Container.) With this design, multiple computing frameworks can run on YARN, such as MapReduce, Storm, Spark, and Flink. A minimal sketch of this request flow follows the list.
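To make the AM/RM interaction above concrete, here is a minimal sketch using the YARN client library (org.apache.hadoop.yarn.client.api). The resource sizes and the empty host/tracking-URL values are placeholder assumptions, not values from the text above.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmResourceRequestSketch {
    public static void main(String[] args) throws Exception {
        // The ApplicationMaster talks to the ResourceManager through an AMRMClient.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this AM with the RM (host, RPC port, and tracking URL are placeholders).
        rmClient.registerApplicationMaster("", 0, "");

        // Ask the RM for one Container with 1024 MB of memory and 1 vcore
        // (the two resource dimensions YARN currently supports, as noted above).
        Resource capability = Resource.newInstance(1024, 1);
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(0));
        rmClient.addContainerRequest(request);

        // The RM answers later allocate() calls with Container objects describing
        // the resources it has granted; the AM then launches its tasks in them.
    }
}
```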

More about the Container
The Container is YARN's resource abstraction. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When the AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and the task can only use the resources described in that Container.
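Once the RM has granted a Container, the AM asks the NodeManager that owns it to launch a task inside it. A minimal sketch follows; the launch command is a placeholder, and a real AM would also ship the task's binaries and environment.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class TaskLaunchSketch {
    // `container` is one of the Containers the RM returned to the AM.
    public static void launch(Container container) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();

        // Describe what to run inside the Container; the command here is a placeholder.
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("echo running in " + container.getId()));

        // The NodeManager starts the task and enforces the memory/vcore limits
        // recorded in the Container, so the task can only use those resources.
        nmClient.startContainer(container, ctx);
    }
}
```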

To use a YARN cluster, a client first submits a request that contains an application.
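A minimal sketch of such a client request, using the YarnClient API; the application name, AM launch command, and AM resource size are placeholder assumptions.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApplicationSketch {
    public static void main(String[] args) throws Exception {
        // The client connects to the ResourceManager through a YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application and fill in its submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-app");                        // placeholder name

        // Describe how to start this application's ApplicationMaster (placeholder command).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java -jar my-app-master.jar"));
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1));          // Container for the AM itself

        // Submit; the RM will pick a node, start the AM in a Container, and monitor it.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```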

Advantages of the YARN design
  ● Separating resource management from job control reduces the pressure on the JobTracker
      ○ The YARN design greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and the programs that monitor the status of each Job's subtasks (the ApplicationMasters) are distributed across the cluster, which is safer and more elegant.
      ○ In the old framework, a large part of the JobTracker's burden was monitoring the running status of the tasks under each job. This work is now handled by the ApplicationMaster, while a module in the ResourceManager called the ApplicationsManager (ASM) is responsible for monitoring the ApplicationMaster itself.
  ● Ability to support different computing frameworks

Working principle



MapReduce on YARN
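A minimal sketch of running a classic MapReduce job on YARN: setting mapreduce.framework.name to "yarn" makes the job submit itself to the ResourceManager, which starts an MRAppMaster as the job's ApplicationMaster. The RM hostname, input/output paths, and mapper/reducer classes are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceOnYarnSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");         // run on YARN instead of the local runner
        conf.set("yarn.resourcemanager.hostname", "rm-host"); // placeholder RM address

        Job job = Job.getInstance(conf, "wordcount-on-yarn");
        job.setJarByClass(MapReduceOnYarnSketch.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // application-specific classes
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The job client submits to the RM; YARN runs the map and reduce tasks in Containers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```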



Shortcomings and prospects of YARN
YARN is a two-level scheduler. It solves the shortcomings of the monolithic scheduler (the typical central scheduler being the JobTracker). The two-level scheduling architecture appears to add flexibility and concurrency to scheduling, but in practice its conservative resource visibility and its locking algorithm (pessimistic concurrency) also limit flexibility and concurrency. First, conservative resource visibility means that no framework can perceive the resource usage of the entire cluster, so idle resources cannot be announced to queued work, which easily wastes resources. Second, the locking algorithm reduces concurrency: the scheduler allocates resources to one framework, and only after that framework returns them will the scheduler allocate that portion of the resources to another framework. During the first allocation, those resources are effectively locked, which reduces concurrency. In summary, YARN and other schedulers with a two-level architecture (e.g. Mesos) have the following disadvantages:
  ● Each application cannot perceive the overall resource usage of the cluster; it can only wait for the upper-layer scheduler to push information to it.
  ● Resource allocation uses a polling / resource-offer mechanism (as in Mesos); a pessimistic lock is held during allocation (sketched after this list), so the concurrency granularity is small.
  ● Lack of an effective competition or preemption mechanism.
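A minimal, simplified sketch (hypothetical class names, not YARN or Mesos code) of the pessimistic, offer-style allocation described above: while one framework is deciding on an offer, the free resources are locked and invisible to every other framework.

```java
import java.util.concurrent.locks.ReentrantLock;

// Toy model of a two-level scheduler's resource-offer path.
class OfferBasedScheduler {
    private final ReentrantLock offerLock = new ReentrantLock(); // pessimistic lock
    private int freeVcores = 100;                                // resources in the pool

    interface Framework {
        int decide(int offeredVcores); // how many of the offered vcores the framework keeps
    }

    int offerTo(Framework framework) {
        offerLock.lock(); // from here on, no other framework can see or take these resources
        try {
            int accepted = framework.decide(freeVcores); // framework inspects the whole offer
            freeVcores -= accepted;                      // the remainder returns to the pool
            return accepted;
        } finally {
            offerLock.unlock(); // only now can the scheduler make an offer to another framework
        }
    }
}
```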
To address the shortcomings of the two-level scheduling system, especially each application's inability to perceive overall cluster resource usage and the low concurrency caused by pessimistic locking, the shared-state scheduler has attracted more and more attention. Its most representative example is Google's Omega. The shared-state scheduler improves on the two-level scheduler in two ways:
  ● The global resource manager of the two-level scheduler is simplified into a Cell State that records resource usage across the cluster. This usage information is shared data, which achieves the same effect as a global resource manager.
  ● All tasks use optimistic concurrency control when accessing the shared data.
Shared-state schedulers also have shortcomings. For example, when the same resource is accessed by different tasks at the same time, conflicts are likely to occur; the more tasks access it, the more conflicts arise, and the more conflicts there are, the faster the scheduler's performance degrades, which hurts its efficiency.
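A minimal, simplified sketch (hypothetical names, not actual Omega code) of the shared-state approach: every scheduler reads the shared cell state and commits its claim with an optimistic compare-and-set. A failed compare-and-set is a conflict, and as noted above, more conflicts mean more retries and lower scheduler performance.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a shared cell state that several schedulers update optimistically.
class SharedCellState {
    private final AtomicInteger freeVcores = new AtomicInteger(100); // shared cluster-wide view

    // Try once to claim `wanted` vcores. Returns false if there are not enough
    // resources, or if another scheduler changed the state first (a conflict).
    boolean tryClaim(int wanted) {
        int current = freeVcores.get();
        if (current < wanted) {
            return false;
        }
        return freeVcores.compareAndSet(current, current - wanted); // optimistic commit
    }

    // Retry on conflicts instead of holding a lock; heavy contention means many retries.
    boolean claimWithRetry(int wanted, int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            if (tryClaim(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```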
