A Full Analysis of YARN, the Hadoop Data Operating System

"The introduction of YARN in Hadoop 2.0 greatly improves the resource utilization of the cluster and reduces the cost of cluster management. How is it applied in heterogeneous clusters? What other successful practices can Hulu share?

  In order to manage and schedule cluster resources in a unified way, Hadoop 2.0 introduced the data operating system YARN. The introduction of YARN greatly improves cluster resource utilization and reduces the cost of cluster management. First, YARN allows multiple applications to run in one cluster and allocates resources to them on demand, which greatly improves resource utilization. Second, YARN allows various types of short jobs and long-running services to be mixed in one cluster and provides support for fault tolerance, resource isolation, and load balancing, which greatly simplifies the deployment and management of jobs and services.

  YARN adopts a master/slave architecture, as shown in Figure 1, where the master is called the ResourceManager and the slaves are called NodeManagers. The ResourceManager is responsible for the unified management and scheduling of the resources on each NodeManager. When a user submits an application, an ApplicationMaster must be provided to track and manage the program; it is responsible for requesting resources from the ResourceManager and asking NodeManagers to start Containers that occupy a certain amount of resources. Since the ApplicationMasters of different applications are distributed to different nodes and their resources are separated through an isolation mechanism, they do not affect each other.

  


 

  Figure 1 The basic architecture of Apache YARN
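
  To make these roles concrete, the following is a minimal sketch (not a complete program) of how a client submits an application to the ResourceManager with the Java client API; the queue name, ApplicationMaster class, and resource sizes are illustrative assumptions, and error handling is omitted.

  import java.util.Collections;

  import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.YarnClient;
  import org.apache.hadoop.yarn.client.api.YarnClientApplication;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;

  public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
      YarnClient yarnClient = YarnClient.createYarnClient();
      yarnClient.init(new YarnConfiguration());
      yarnClient.start();

      // Ask the ResourceManager for a new application id.
      YarnClientApplication app = yarnClient.createApplication();
      ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
      ctx.setApplicationName("demo-app");
      ctx.setQueue("default");                         // illustrative queue name

      // Describe how to launch the ApplicationMaster container.
      ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
          Collections.emptyMap(),                      // local resources (jars, configs)
          Collections.emptyMap(),                      // environment variables
          Collections.singletonList(                   // AM start command (hypothetical class)
              "java com.example.MyApplicationMaster 1><LOG_DIR>/am.stdout 2><LOG_DIR>/am.stderr"),
          null, null, null);
      ctx.setAMContainerSpec(amContainer);
      ctx.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

      // The scheduler places the AM on some NodeManager; the AM then requests
      // Containers for the application's own tasks.
      yarnClient.submitApplication(ctx);
    }
  }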

  Resource management and scheduling in YARN are handled by the resource scheduler, which is one of the core components of Hadoop YARN and a pluggable service component inside the ResourceManager. YARN organizes and divides resources through hierarchical queues and provides several multi-tenant resource schedulers. Such a scheduler allows administrators to group users or applications according to application requirements and to allocate different amounts of resources to different groups; at the same time, it can add various constraints to prevent a single user or application from monopolizing resources, thus meeting a variety of QoS requirements. Typical representatives are Yahoo!'s Capacity Scheduler and Facebook's Fair Scheduler.

  As a general-purpose data operating system, YARN can run short jobs such as MapReduce and Spark as well as long-running services such as a web server or MySQL Server, truly realizing a multi-purpose cluster. Such a cluster is usually called a lightweight elastic computing platform: it is lightweight because YARN adopts the lightweight cgroups isolation scheme, and it is elastic because YARN can adjust the resources occupied by the various computing frameworks or applications according to load or demand, realizing cluster resource sharing and elastic scaling of resources.

  


 

  Figure 2 The ecosystem with YARN at its core

  Application of Hadoop YARN in Heterogeneous Clusters

  Starting from version 2.6.0, YARN introduces a new scheduling strategy: label-based scheduling. The main motivation for this mechanism is to let YARN run better in heterogeneous clusters and thus better manage and schedule mixed types of applications.

  1. What is label-based scheduling

  As the name suggests, label-based scheduling is a scheduling strategy; like priority-based scheduling, it is one of many strategies in the scheduler and can be combined with other scheduling strategies. The basic idea is as follows: the user can attach labels to each NodeManager, such as highmem or highdisk, as basic attributes of the NodeManager; at the same time, the user can bind several labels to a queue in the scheduler, restricting that queue to using only the resources of nodes that carry the corresponding labels, so that jobs submitted to the queue can only run on specific nodes. Through labeling, users can divide a Hadoop cluster into several logical sub-clusters and run applications on nodes that satisfy certain characteristics, for example running memory-intensive applications (such as Spark) on large-memory nodes.
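
  Node labels are configured by the administrator, who labels NodeManagers and binds labels to queues. On the application side, a Hadoop 2.6+ ApplicationMaster can also attach a node label expression when requesting containers. A minimal sketch, with the label name and resource sizes chosen purely for illustration:

  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  public class LabelRequestSketch {
    // Called from an ApplicationMaster whose AMRMClient has already been
    // initialized, started, and registered with the ResourceManager.
    static void requestHighMemContainer(AMRMClient<ContainerRequest> amClient) {
      Resource capability = Resource.newInstance(8192, 4);  // 8 GB, 4 vcores (illustrative)
      ContainerRequest request = new ContainerRequest(
          capability,
          null,                        // no specific nodes
          null,                        // no specific racks
          Priority.newInstance(0),
          true,                        // relax locality
          "highmem");                  // node label expression: only nodes labeled highmem
      amClient.addContainerRequest(request);
    }
  }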

  2. Hulu application case

  Label-based scheduling is widely used within Hulu. This mechanism was enabled mainly for the following three reasons:

  Clusters are heterogeneous. As a Hadoop cluster evolves, machines added later usually have better configurations than the older ones, which eventually turns the cluster into a heterogeneous one. Many of Hadoop's original design mechanisms assumed a homogeneous cluster, and even now Hadoop's support for heterogeneous clusters is still far from perfect; for example, the MapReduce speculative execution mechanism still does not take heterogeneous clusters into account.

  Applications are diverse. Hulu deploys MapReduce, Spark, Spark Streaming, Docker services, and other types of applications on the same YARN cluster. When multiple types of applications are mixed in a heterogeneous cluster, the completion times of parallel tasks often vary greatly because of differing machine configurations, which is very unfavorable for the efficient execution of distributed programs. In addition, since YARN cannot achieve complete resource isolation, multiple applications running together on one node can easily interfere with each other, which is usually intolerable for low-latency applications.

  Some applications have special machine requirements. Due to dependencies on special environments, some applications can only run on specific nodes in a large cluster. Typical examples are Spark and Docker: Spark MLlib may use some native libraries which, to avoid polluting the system, are usually installed on only a few nodes; Docker containers depend on the Docker engine, and to keep operation and maintenance costs down, Docker is allowed to run only on a number of designated nodes.

  In order to solve the above problems, Hulu enabled the label-based scheduling policy on top of the Capacity Scheduler. As shown in Figure 3, we attach labels to the nodes in the cluster based on machine configuration and application requirements, including:

  - spark-node: machines used to run Spark jobs; these machines usually have higher specifications, especially larger memory;

  - mr-node: machines used to run MapReduce jobs; these machines have various configurations;

  - docker-node: machines used to run Docker applications; these machines have the Docker engine installed;

  - streaming-node: machines used to run Spark Streaming applications.

  


 

  Figure 3 Example of YARN deployment

  It should be noted that YARN allows a node to carry multiple labels at the same time, so one machine can run multiple types of applications (at Hulu we allow some nodes to be shared and to run several kinds of applications simultaneously). On the surface, introducing labels divides the cluster into multiple physical sub-clusters, but unlike traditional, completely isolated clusters, these sub-clusters are both independent and related to each other, and users can easily and dynamically change the purpose of a node by modifying its labels.

  Hadoop YARN application cases and experience summary

  1. Hadoop YARN application cases

  As a data operating system, Hadoop YARN provides rich APIs for users to develop applications. Hulu has done a lot of exploration and practice in the design of YARN applications and has developed a number of distributed computing frameworks and computing engines that run directly on YARN; typical representatives are Voidbox and Nesto.

  (1) Docker-based container computing framework Voidbox

  Docker has been a very popular container virtualization technology in recent years. It can automatically package and deploy most applications and enables any program to run in a resource-isolated container environment, thereby providing a more elegant solution for building, releasing, and running projects.

  In order to combine the unique advantages of YARN and Docker, the Hulu Beijing big data team developed Voidbox. Voidbox is a distributed computing framework that uses YARN as its resource management module and Docker as the engine for executing tasks, so that YARN can schedule not only traditional MapReduce and Spark applications but also applications packaged as Docker images.

  Voidbox supports Docker Container based DAG (Directed Acyclic Graph) tasks and long-running services (such as web services), and provides several application submission methods, such as command-line mode and IDE mode, meeting the needs of both production and development environments. In addition, Voidbox can work with Jenkins, GitLab, and a private Docker repository to complete a full development, testing, and automatic release workflow.

  


 

  Figure 4 Voidbox system architecture

  In Voidbox, YARN is responsible for cluster resource scheduling, and Docker, as the execution engine, pulls images from the Docker Registry for execution. Voidbox is responsible for requesting resources for Container-based DAG tasks and running Docker tasks. As shown in Figure 4, each black frame represents a machine on which several modules run, as follows:

  Voidbox components:

  VoidboxClient: client program. Users can manage Voidbox applications through this component (a Voidbox application contains one or more Docker jobs, and a job contains one or more Docker tasks), such as submitting and killing Voidbox applications.

  VoidboxMaster: actually a YARN ApplicationMaster, responsible for requesting resources from YARN and further assigning the obtained resources to internal Docker tasks.

  VoidboxDriver: responsible for task scheduling within a single Voidbox application. Voidbox supports DAG scheduling of Docker Container based tasks, and user code can be inserted between tasks; Voidbox Driver handles the dependency-ordered scheduling of the DAG tasks and runs the user code.

  VoidboxProxy: a bridge between YARN and the Docker engine, responsible for relaying commands from YARN to the Docker engine, such as starting or killing Docker containers.

  StateServer: maintains the health status of each Docker engine and provides the Voidbox Master with a list of machines that can run Docker Containers, so that the Voidbox Master can request resources more efficiently.

  Docker components:

  DockerRegistry: Stores Docker images as a version management tool for internal Docker images.

  DockerEngine: the engine that executes Docker Containers; it obtains the corresponding Docker image from the Docker Registry and executes Docker-related commands.

  Jenkins: Cooperate with GitLab for application version management. When the application version is updated, Jenkins is responsible for compiling and packaging, generating a Docker image, and uploading it to the Docker Registry to complete the process of automatic application release.

  Similar to Spark on YARN, Voidbox provides two application running modes: yarn-cluster mode and yarn-client mode. In yarn-cluster mode, the application's control components and resource management components run in the cluster; after a Voidbox application is successfully submitted, the client can exit at any time without affecting the application running in the cluster, so yarn-cluster mode is suitable for submitting applications in a production environment. In yarn-client mode, the application's control component runs on the client while the other components run in the cluster; the client can see more information about the application's running status, and if the client exits, the application running in the cluster exits immediately as well, so yarn-client mode is convenient for debugging.

  (2) Parallel computing engine Nesto

  Nesto is Hulu's internal MPP computing engine, similar to Presto and Impala. It is specially designed for processing complex nested data and supports complex data processing logic that is difficult to express in SQL, and it uses columnar storage, code generation, and other optimization techniques to speed up data processing. Nesto's architecture is similar to Presto/Impala: it is decentralized, and multiple Nesto servers perform service discovery through ZooKeeper.

  In order to reduce the cost of deploying and managing Nesto, Hulu deploys Nesto directly on YARN. Installation and deployment then become very simple: the Nesto installation program (including configuration files and jar packages) is packaged into a self-contained archive and stored on HDFS, and by running a single submit command that specifies the number of Nesto servers to start and the resources required by each server, users can quickly deploy a Nesto cluster.

  The Nesto on YARN program consists of an ApplicationMaster and multiple Executors. The ApplicationMaster is responsible for requesting resources from YARN and starting the Executors, and the function of each Executor is to start a Nesto server. The key design point is the ApplicationMaster; its functions include the following (a simplified sketch of the resource-request logic follows the list):

  Communicate with the ResourceManager to request resources; these resources must be guaranteed to come from different nodes, so that only one Executor is started per node;

  Communicate with NodeManagers to start Executors and monitor their health; once an Executor is found to be faulty, restart a new Executor on another node;

  Provide an embedded web server to display the running status of tasks in each Nesto server.
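
  The following is a simplified, hypothetical sketch of this logic using the asynchronous ApplicationMaster client: it requests exactly one container on each target host (a node-specific request with relaxed locality turned off) and requests a replacement container when an Executor fails. The host list, resource sizes, and Executor launch code are illustrative, and registration with the ResourceManager is omitted.

  import java.util.List;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.hadoop.yarn.api.records.Container;
  import org.apache.hadoop.yarn.api.records.ContainerId;
  import org.apache.hadoop.yarn.api.records.ContainerStatus;
  import org.apache.hadoop.yarn.api.records.NodeReport;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
  import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

  public class OnePerNodeAM implements AMRMClientAsync.CallbackHandler {
    private final AMRMClientAsync<ContainerRequest> amClient =
        AMRMClientAsync.createAMRMClientAsync(1000, this);
    private final Resource capability = Resource.newInstance(16384, 8); // illustrative
    private final Map<ContainerId, String> hostOfContainer = new ConcurrentHashMap<>();

    // A node-specific request with relaxLocality=false must be satisfied on
    // that exact host, which approximates "one Executor per node".
    void requestExecutors(List<String> hosts) {
      for (String host : hosts) {
        amClient.addContainerRequest(new ContainerRequest(
            capability, new String[] {host}, null, Priority.newInstance(0), false));
      }
    }

    @Override public void onContainersAllocated(List<Container> containers) {
      for (Container c : containers) {
        hostOfContainer.put(c.getId(), c.getNodeId().getHost());
        // Launch the Nesto server process in the container via an NMClient here.
      }
    }

    @Override public void onContainersCompleted(List<ContainerStatus> statuses) {
      for (ContainerStatus s : statuses) {
        String host = hostOfContainer.remove(s.getContainerId());
        if (s.getExitStatus() != 0 && host != null) {
          // Executor failed: request a replacement container. A real AM would
          // typically pick a different healthy host instead of the same one.
          amClient.addContainerRequest(new ContainerRequest(
              capability, new String[] {host}, null, Priority.newInstance(0), false));
        }
      }
    }

    @Override public void onShutdownRequest() { }
    @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
    @Override public void onError(Throwable e) { }
    @Override public float getProgress() { return 0f; }
  }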

  2. Hadoop YARN development experience summary

  (1) Make good use of the resource request API

  Hadoop YARN provides rich resource request semantics. Users can request resources on a specific node or rack, and can also use a blacklist to stop accepting resources from particular nodes.
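
  A hedged sketch of these request semantics with the Java ApplicationMaster client; the host names, rack name, and resource values are illustrative:

  import java.util.Collections;

  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  public class ResourceRequestSketch {
    static void examples(AMRMClient<ContainerRequest> amClient) {
      Resource capability = Resource.newInstance(4096, 2);  // 4 GB, 2 vcores
      Priority priority = Priority.newInstance(0);

      // 1. Prefer a specific node; relaxLocality defaults to true, so the
      //    request may fall back to the node's rack or to any node.
      amClient.addContainerRequest(new ContainerRequest(
          capability, new String[] {"node-17.example.com"}, null, priority));

      // 2. Restrict a request to a given rack only (relaxLocality = false).
      amClient.addContainerRequest(new ContainerRequest(
          capability, null, new String[] {"/rack-2"}, priority, false));

      // 3. Stop accepting resources on a misbehaving node via the blacklist.
      amClient.updateBlacklist(
          Collections.singletonList("node-23.example.com"),  // additions
          Collections.<String>emptyList());                   // removals
    }
  }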

  (2) Pay attention to memory overhead

  The memory of a container consists of three parts: the Java heap, JVM overhead, and non-Java memory. If the user sets the application's heap size to X GB (-XmxXg), the container memory requested by the ApplicationMaster should be X + D, where D is the JVM overhead; otherwise the container may be killed by YARN for exceeding its memory limit.
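
  A small sketch of this sizing rule; the 10% / 384 MB overhead heuristic below is an assumption borrowed from Spark's default memory-overhead setting rather than a YARN requirement, and should be tuned per workload:

  import org.apache.hadoop.yarn.api.records.Resource;

  public class ContainerSizing {
    // Container memory = JVM heap (-Xmx) + estimated JVM/native overhead.
    static Resource containerResource(int heapMb, int vcores) {
      int overheadMb = Math.max(384, (int) (heapMb * 0.10)); // assumed heuristic
      return Resource.newInstance(heapMb + overheadMb, vcores);
    }
    // Example: a 4 GB heap (-Xmx4096m) yields a request of roughly 4.4 GB,
    // which the scheduler then rounds up to a multiple of its minimum allocation.
  }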

  (3) Log rotation

  For long-running services, service logs accumulate over time, so log rotation is particularly important. Since an application cannot know the exact storage location of its logs (that is, which directory on which node) before it is started, YARN provides the macro <LOG_DIR> to make it easier to refer to the log directory: when the macro appears in the startup command, YARN automatically replaces it with the container's actual log directory, for example:

  echo $log4jcontent > $PWD/log4j.properties && java -Dlog4j.configuration=log4j.properties …

  com.example.NestoServer 1>><LOG_DIR>/server.log 2>><LOG_DIR>/server.log

  The content of the variable log4jcontent is as follows:

  

  [Figure: contents of the log4jcontent variable]

 

  (4) Debugging skills

  Before the NodeManager starts a Container, it writes the Container's environment variables, startup command, and other information into a shell script, and then starts the Container by executing that script. In some cases, a Container fails to start because the startup command is wrong (for example, some special characters were escaped). You can therefore check whether the startup command has a problem by examining the content of the script that was actually executed. A simple way to do this is to add a command that prints the script's contents before the Container executes its own command, as sketched below.
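
  As a hedged sketch, one way is to prepend a cat of the generated launch script to the Container's command when building the ContainerLaunchContext. This assumes the NodeManager materializes the script as launch_container.sh in the container's working directory, which is the usual layout in Hadoop 2.x; adjust the file name if your version differs.

  import java.util.Collections;
  import java.util.List;

  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

  public class DebugLaunchCommand {
    static void setDebugCommand(ContainerLaunchContext ctx) {
      // Dump the generated launch script into the container log before the real
      // command runs, so a wrongly escaped command can be inspected afterwards.
      List<String> commands = Collections.singletonList(
          "cat launch_container.sh 1>><LOG_DIR>/stdout 2>><LOG_DIR>/stderr; "
          + "exec java com.example.NestoServer 1>><LOG_DIR>/server.log 2>><LOG_DIR>/server.log");
      ctx.setCommands(commands);
    }
  }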

  


 

  (5) Performance problems caused by shared clusters

  When multiple applications run in a YARN cluster at the same time, the load on different nodes may vary, which in turn causes tasks on some nodes to run more slowly than on others; this is unacceptable for applications with OLAP-style requirements. There are usually two ways to solve this problem: 1) run such applications on dedicated nodes by using node labels; 2) implement a speculative execution mechanism inside the application, similar to that of MapReduce and Spark, which launches one or more duplicate copies of a slow task and trades space for time, so that slow tasks do not drag down the running efficiency of the whole application.
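
  As an illustration of the second approach, here is a very simplified, hypothetical sketch of application-side speculation: if a running task takes much longer than the median runtime of already-finished tasks, launch a duplicate attempt and keep whichever copy finishes first. The threshold and minimum sample count are arbitrary assumptions.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public class SpeculativeLauncher {
    private static final double SLOW_FACTOR = 1.5; // assumed slowness threshold
    private static final int MIN_SAMPLES = 10;     // wait for enough finished tasks
    private final List<Long> finishedRuntimesMs = new ArrayList<>();

    synchronized void taskFinished(long runtimeMs) {
      finishedRuntimesMs.add(runtimeMs);
    }

    // Decide whether a still-running task should get a speculative duplicate.
    synchronized boolean shouldSpeculate(long elapsedMs) {
      if (finishedRuntimesMs.size() < MIN_SAMPLES) {
        return false;
      }
      List<Long> sorted = new ArrayList<>(finishedRuntimesMs);
      Collections.sort(sorted);
      long median = sorted.get(sorted.size() / 2);
      // Trade space for time: start a second copy of the slow task elsewhere
      // and use whichever attempt completes first.
      return elapsedMs > SLOW_FACTOR * median;
    }
  }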

  Hadoop YARN development trend

  YARN will continue to evolve toward general-purpose resource management and scheduling, no longer limited to the big data processing field: in addition to supporting short jobs such as MapReduce and Spark, it will increasingly support long-running services such as web services.
