[BigData - Hadoop - YARN] YARN: Hadoop's next-generation computing platform

Apache Hadoop is one of the most popular tools for Big Data processing, and it has been successfully deployed in production for many years. Although Hadoop is regarded as a reliable, scalable, and cost-effective solution, a large community of developers keeps improving it. Finally, version 2.0 offers several revolutionary features, including Yet Another Resource Negotiator (YARN), HDFS Federation, and a highly available NameNode, which make Hadoop clusters much more efficient, powerful, and reliable. In this article, YARN is compared with the distributed processing layer in previous versions of Hadoop to show the advantages YARN brings.

Brief introduction

Apache Hadoop 2.0 includes YARN, which separates the resource management and processing components. The YARN-based architecture is not constrained to MapReduce. This article describes YARN and its advantages over the previous distributed processing layer in Hadoop. You will learn how YARN lets you enhance your cluster with scalability, efficiency, and flexibility.

Introduction to Apache Hadoop

Apache Hadoop is an open source software framework that can be installed on a cluster of commodity machines so the machines can communicate and work together to store and process large amounts of data in a highly distributed manner. Initially, Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS) and a distributed computing engine that lets you implement and run programs as MapReduce jobs.

MapReduce, popularized by Google, is a simple programming model that is useful for processing large data sets in a highly parallel and scalable way. Inspired by functional programming, MapReduce lets users express a computation as map and reduce functions that process data as key-value pairs. Hadoop provides a high-level API for implementing custom map and reduce functions in a variety of languages.
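
As a brief illustration, the sketch below shows the classic word count example written against Hadoop's Java MapReduce API (the class names are ours, and the job wiring that submits these functions is omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (offset, line of text) -> (word, 1) for every word in the line
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total occurrences)
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }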

Hadoop also provides the software infrastructure for running MapReduce jobs as a series of map and reduce tasks. Map tasks invoke the map function on subsets of the input data. When they are done, reduce tasks start invoking the reduce function on the intermediate data generated by the map functions to produce the final output. Map and reduce tasks run in isolation from one another, which enables parallel computation and fault tolerance.

Most importantly, the Hadoop infrastructure takes care of all the complex aspects of distributed processing: parallelization, scheduling, resource management, inter-machine communication, handling software and hardware failures, and more. Thanks to this clean abstraction, implementing distributed applications that process terabytes of data on hundreds (or even thousands) of machines has never been so easy, even for developers with no prior experience with distributed systems.

Hadoop's golden age

Although there are multiple open source implementations of the MapReduce model, Hadoop MapReduce quickly became the most popular. Hadoop is also one of the most exciting open source projects in the world, thanks to a number of outstanding features: a high-level API, near-linear scalability, an open source license, the ability to run on commodity hardware, and fault tolerance. It has been successfully deployed by hundreds (perhaps thousands) of companies and is the current standard for large-scale distributed storage and processing.

Some early adopters of Hadoop, such as Yahoo! and Facebook, built large clusters of about 4,000 nodes to satisfy their constantly growing and changing data processing needs. After building their clusters, however, they started noticing limitations of the Hadoop MapReduce framework.

Limitations of classic MapReduce

The most serious limitations of classic MapReduce concern scalability, resource utilization, and support for workloads other than MapReduce. In the MapReduce framework, job execution is controlled by two types of processes:

  • A single master process called the JobTracker, which coordinates all jobs running on the cluster and assigns map and reduce tasks to run on the TaskTrackers
  • A number of subordinate processes called TaskTrackers, which run assigned tasks and periodically report their progress to the JobTracker
Figure: The classic version of Apache Hadoop (MRv1)

Large Hadoop clusters revealed a scalability bottleneck caused by the single JobTracker. According to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes running 40,000 tasks concurrently. Because of this limitation, smaller and less powerful clusters had to be created and maintained.

Moreover, neither small nor large Hadoop clusters ever used their computational resources with maximum efficiency. In Hadoop MapReduce, the computational resources on each slave node are divided by the cluster administrator into a fixed number of map and reduce slots, which are not fungible. Once the numbers of map and reduce slots are set, a node cannot run more map tasks than it has map slots at any given moment, even if no reduce tasks are running. This hurts cluster utilization: when all the map slots are taken (and we still want more), we cannot use any of the reduce slots even if they are available, and vice versa. The slot counts are fixed in the slave configuration, as the sketch below illustrates.
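
For context, the following hedged sketch names the classic MRv1 properties that fix the slot counts (these settings normally live in mapred-site.xml on every slave node; the values here are purely illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfig {
        public static void main(String[] args) {
            // MRv1 gives every TaskTracker a fixed number of map and reduce
            // slots that cannot adapt to the workload of the moment.
            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);    // usable by map tasks only
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4); // usable by reduce tasks only
            System.out.println("Map slots per node: "
                    + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        }
    }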

Last but not least, Hadoop was designed to run MapReduce jobs only. With the advent of alternative programming models (such as the graph processing provided by Apache Giraph), there was an increasing need to support paradigms other than MapReduce that could run on the same cluster and share its resources in an efficient and fair way.

In 2010, engineers at Yahoo! began working on a completely new architecture for Hadoop, one that solves all the limitations above and adds a variety of extra features.

Addressing the scalability issue

In Hadoop MapReduce, the JobTracker has two distinct responsibilities:

  • Management of the computational resources in the cluster, which involves maintaining the list of live nodes and the lists of available and occupied map and reduce slots, and allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy
  • Coordination of all the tasks running on the cluster, which involves instructing TaskTrackers to start map and reduce tasks, monitoring the execution of the tasks, restarting failed tasks, speculatively running slow tasks, calculating total values of job counters, and more

Giving so many responsibilities to a single process causes significant scalability issues, especially on larger clusters where the JobTracker has to constantly keep track of thousands of TaskTrackers, hundreds of jobs, and tens of thousands of map and reduce tasks. The figure below illustrates the problem. By contrast, each TaskTracker usually runs only around a dozen tasks, assigned to it by the hard-working JobTracker.

Figure: A busy JobTracker on a large Apache Hadoop cluster (MRv1)

To address the scalability issue, a simple but brilliant idea was proposed: reduce the responsibilities of the single JobTracker and delegate some of them to the TaskTrackers, since there are many of them in a cluster. In the new design, this concept is reflected by separating the dual role of the JobTracker (cluster resource management and task coordination) into two different types of processes.

Instead of a single JobTracker, the new approach introduces a cluster manager whose only responsibility is tracking the live nodes and available resources in the cluster and assigning them to tasks. For each job submitted to the cluster, a dedicated, short-lived JobTracker is started to control the execution of the tasks within that job only. Interestingly, the short-lived JobTrackers are started by the TaskTrackers running on slave nodes. Thus, the coordination of a job's life cycle is spread across all the available machines in the cluster. Thanks to this behavior, more jobs can run in parallel, and scalability is dramatically improved.

YARN: the next generation of Hadoop's computing platform

Let us now change the vocabulary a little. The following renaming gives a better idea of YARN's design:

  • ResourceManager instead of a cluster manager
  • ApplicationMaster instead of a dedicated, short-lived JobTracker
  • NodeManager instead of a TaskTracker
  • A distributed application instead of a MapReduce job

YARN is the next-generation computing platform for Hadoop, as shown in the figure below.

Figure: The architecture of YARN

In the YARN architecture, a global ResourceManager runs as the master daemon, usually on a dedicated machine, and arbitrates the available cluster resources among competing applications. The ResourceManager keeps track of how many live nodes and resources are available on the cluster and coordinates which user-submitted applications should get these resources and when. The ResourceManager is the single process that has this information, so it can make allocation (or scheduling) decisions in a shared, secure, and multi-tenant manner (for instance, according to application priority, queue capacity, ACLs, data locality, and so on).

When a user submits an application, an instance of a lightweight process called the ApplicationMaster is started to coordinate the execution of all the tasks within that application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and calculating total values of application counters. These responsibilities were previously assigned to the single JobTracker for all jobs. The ApplicationMaster and the tasks that belong to its application run in resource containers controlled by the NodeManagers.

The NodeManager is a more generic and efficient version of the TaskTracker. Instead of having a fixed number of map and reduce slots, the NodeManager has a number of dynamically created resource containers. The size of a container depends on the amount of resources it contains, such as memory, CPU, disk, and network IO. Currently, only memory and CPU are supported (YARN-3); cgroups might be used to control disk and network IO in the future. The number of containers on a node is a product of configuration parameters and the total amount of node resources (such as total CPU cores and total memory) outside the resources dedicated to the slave daemons and the operating system.
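
To make the sizing concrete, here is a minimal, hedged sketch that reads the relevant YARN settings (the property names are the real keys from yarn-site.xml; the fallback values are only illustrative) and estimates how many minimum-size containers a node could host:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ContainerCapacity {
        public static void main(String[] args) {
            // Picks up yarn-site.xml from the classpath, if present.
            YarnConfiguration conf = new YarnConfiguration();
            int nodeMemMb  = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
            int nodeVcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
            int minAllocMb = conf.getInt("yarn.scheduler.minimum-allocation-mb", 1024);

            // Upper bound: the memory the NodeManager offers, divided by the
            // smallest container the scheduler will hand out.
            System.out.printf("Node offers %d MB and %d vcores: at most %d containers of %d MB%n",
                    nodeMemMb, nodeVcores, nodeMemMb / minAllocMb, minAllocMb);
        }
    }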

Interestingly, the ApplicationMaster can run tasks of any type inside a container. For example, the MapReduce ApplicationMaster requests containers to launch map or reduce tasks, while the Giraph ApplicationMaster requests containers to run Giraph tasks. You can also implement a custom ApplicationMaster that runs specific tasks and, in this way, invent a shiny new distributed application framework that changes the big data world. Have a look at Apache Twill, which aims to simplify writing distributed applications on top of YARN.

In YARN, MapReduce is simply degraded to the role of one distributed application (although still a very popular and useful one), now called MRv2. MRv2 is the reimplementation of the classic MapReduce engine (now called MRv1) that runs on top of YARN.


A cluster can run any distributed application

The ResourceManager, the NodeManager, and a container do not care about the type of application or task. All the application-framework-specific code is simply moved into an ApplicationMaster, so that any distributed framework can be supported by YARN, as long as someone implements an appropriate ApplicationMaster for it.

Thanks to this generic approach, the dream of a Hadoop YARN cluster that runs many different workloads comes true. Imagine: a single Hadoop cluster in your data center that can run MapReduce, Giraph, Storm, Spark, Tez/Impala, MPI, and more.

The single-cluster approach provides a number of significant advantages, including:

  • Higher cluster utilization, whereby resources not used by one framework can be consumed by another
  • Lower operational costs, because only one "do-it-all" cluster needs to be managed and tuned
  • Less data movement, as there is no need to transfer data between Hadoop YARN and systems running on different clusters of machines

Managing a single cluster also results in a greener solution to data processing. Less data center space is used, less silicon is wasted, less power is consumed, and less carbon is emitted, simply because we run the same computation on a smaller but more efficient Hadoop cluster.


Application submission in YARN

This section discusses how the ResourceManager, the ApplicationMaster, the NodeManagers, and the containers interact with one another when an application is submitted to a YARN cluster. The figure below shows an example.

Figure: Application submission in YARN

Suppose users submit applications to the ResourceManager by typing the hadoop jar command, just as in MRv1. The ResourceManager maintains the list of applications running on the cluster and the list of available resources on each live NodeManager. The ResourceManager then needs to determine which application should get a portion of the cluster's resources next. The decision is subject to many constraints, such as queue capacity, ACLs, and fairness. The ResourceManager uses a pluggable Scheduler. The Scheduler performs scheduling only: it decides who gets cluster resources (in the form of containers) and when, but it does not monitor the tasks within an application, so it does not attempt to restart failed tasks.
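
For reference, a client can also hand an application to the ResourceManager programmatically through the YARN client API. The following is a minimal, hedged sketch of that submission path (the application name and the launch command are hypothetical placeholders):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitApp {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("my-distributed-app"); // hypothetical name

            // Describe the container that will run the ApplicationMaster.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "my-appmaster-launch-command")); // hypothetical AM command
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

            // Hand the application over to the ResourceManager.
            yarnClient.submitApplication(ctx);
        }
    }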

When the ResourceManager accepts a new application submission, one of the first decisions the Scheduler makes is selecting a container in which the ApplicationMaster will run. After the ApplicationMaster has started, it is responsible for the whole life cycle of its application. First and foremost, it sends resource requests to the ResourceManager, asking for the containers needed to run the application's tasks. A resource request is simply a request for a number of containers that satisfy certain resource requirements, such as (a code sketch follows this list):

  • An amount of resources, today expressed as megabytes of memory and CPU shares
  • A preferred location, specified by a host name or a rack name, or * to indicate no preference
  • A priority within this application, rather than across multiple applications
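
A hedged sketch of such a request through the AMRMClient API follows (the preferred host name is a hypothetical placeholder):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class RequestContainers {
        public static void main(String[] args) throws Exception {
            // The ApplicationMaster registers itself with the ResourceManager first.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, ""); // host/port/tracking URL unused here

            // One request: 2048 MB of memory and 1 vcore, preferably on the
            // given host, at priority 0 within this application.
            Resource capability = Resource.newInstance(2048, 1);
            String[] nodes = { "node1.example.com" }; // hypothetical preferred host
            ContainerRequest request = new ContainerRequest(
                    capability, nodes, null /* racks: no preference */,
                    Priority.newInstance(0));
            rmClient.addContainerRequest(request);

            // The request travels to the ResourceManager on the next heartbeat.
            rmClient.allocate(0.0f);
        }
    }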

If and when it is possible, the ResourceManager grants a container (expressed as a container ID and a host name) that satisfies the requirements specified by the ApplicationMaster in the resource request. A container allows the application to use a given amount of resources on a specific host. After a container is granted, the ApplicationMaster asks the NodeManager that manages the host on which the container was allocated to use these resources to launch an application-specific task. That task can be any process written in any framework (such as a MapReduce task or a Giraph task). The NodeManager does not monitor tasks; it only monitors the resource usage in the containers and, for example, kills a container if it consumes more memory than was initially allocated.
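
The launch step can be sketched with the NMClient API as follows (a hedged sketch; the task command is a hypothetical placeholder, and the granted container would come from the response to the earlier resource request):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class LaunchTask {
        // "granted" is a container allocated by the ResourceManager, for
        // example taken from AllocateResponse.getAllocatedContainers().
        static void launch(Container granted) throws Exception {
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(new YarnConfiguration());
            nmClient.start();

            // Describe the task process to run inside the container.
            ContainerLaunchContext taskCtx = Records.newRecord(ContainerLaunchContext.class);
            taskCtx.setCommands(Collections.singletonList("my-task-command")); // hypothetical

            // Ask the NodeManager that owns the container to start the task.
            nmClient.startContainer(granted, taskCtx);
        }
    }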

The ApplicationMaster spends its whole life negotiating containers to launch all of the tasks needed to complete its application. It also monitors the progress of the application and its tasks, restarts failed tasks in newly requested containers, and reports progress back to the client that submitted the application. After the application is complete, the ApplicationMaster shuts itself down and releases its own container.

Although the ResourceManager does not monitor the tasks within an application, it does check the health of the ApplicationMasters. If an ApplicationMaster fails, the ResourceManager can restart it in a new container. You could say that the ResourceManager takes care of the ApplicationMasters, while the ApplicationMasters take care of the tasks.


Interesting facts and features

YARN offers a variety of other excellent features. Describing them all is beyond the scope of this article, so I list only a few noteworthy ones:

  • Uberization is the possibility to run all the tasks of a MapReduce job in the ApplicationMaster's JVM, if the job is small enough. This way, you avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start (possibly tiny) tasks; a configuration sketch follows this list.
  • Binary or source compatibility with MapReduce jobs written for MRv1 (MAPREDUCE-5108).
  • High availability for the ResourceManager (YARN-149). This work is in progress, and it is already done by some vendors.
  • Application recovery after a ResourceManager restart (YARN-128). The ResourceManager stores information about running applications and completed tasks in HDFS. If the ResourceManager is restarted, it recreates the state of the applications and re-runs only the incomplete tasks. This work is close to completion and has been actively tested by the community. It is already done by some vendors.
  • Simplified management of and access to user logs. Logs generated by applications are no longer left on individual slave nodes (as in MRv1) but are moved to central storage, such as HDFS. Later, they can be used for debugging purposes or for historical analyses to discover performance issues.
  • A new look and feel of the web interface.
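
As a hedged illustration of the uberization item above, a small job can opt in through client-side configuration (the property names are the real MRv2 keys; the threshold values shown are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Let a sufficiently small job run entirely inside the ApplicationMaster's JVM.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);
            // The job qualifies only if it stays below these thresholds.
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
            Job job = Job.getInstance(conf, "small-uber-job");
            // ... set the mapper, reducer, and input/output paths as usual ...
        }
    }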

Conclusion

YARN is a completely rewritten architecture of the Hadoop cluster. It seems to be a real game-changer in the way distributed applications are implemented and executed on clusters of commodity machines.

Compared to the classic MapReduce engine in Hadoop's first version, YARN offers clear advantages in scalability, efficiency, and flexibility. Both small and large Hadoop clusters greatly benefit from YARN. To the end user (a developer, not an administrator), the changes are almost invisible, because unmodified MapReduce jobs can be run using the same MapReduce API and CLI.

There is no reason not to migrate from MRv1 to YARN. The biggest Hadoop vendors agree on this point and offer extensive support for running Hadoop YARN. Today, YARN is successfully used in production by many companies, such as Yahoo!, eBay, Spotify, Xing, Allegro, and more.


Acknowledgments

Thanks to Piotr Krewski and Fabian Alenius for their technical review of this article.



Reproduced from: https://www.cnblogs.com/licheng/p/6685934.html
