3.1.4 Hadoop YARN resource scheduling strategy, Apache Hadoop core source code analysis, and Hadoop 3.x new features overview

Table of Contents

Part 7 YARN Resource Scheduling

Section 1 Yarn architecture

Section 2 Yarn task submission (work mechanism)

Section 3 Yarn scheduling strategy

Section 4 Yarn multi-tenant resource isolation configuration

Part 2 Apache Hadoop HDFS Core Source Code Analysis

Section 1 Source code reading preparation

Section 2 NameNode startup process

Section 3 DataNode startup process

Section 4 Data Writing Process

Section 5 How NameNode Supports High Concurrent Access (Double Buffering Mechanism)

Extension: Hadoop 3.x New Features Overview

Common improvements in Hadoop 3.x

HDFS improvements in Hadoop 3.x

YARN improvements in Hadoop 3.x

MapReduce improvements in Hadoop 3.x

Other new features of Hadoop 3.x


 

Part 7 YARN Resource Scheduling

Section 1 Yarn architecture

ResourceManager (RM): processes client requests, starts and monitors ApplicationMasters, monitors NodeManagers, and handles resource allocation and scheduling.

NodeManager (NM): manages resources on a single node, and processes commands from the ResourceManager and the ApplicationMaster.

ApplicationMaster (AM): splits the input data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance.

Container: an abstraction of the task execution environment that encapsulates multi-dimensional resources such as CPU and memory, as well as environment variables, startup commands, and other task-related information.

 

Section 2 Yarn task submission (work mechanism)

The YARN job submission process (you should be able to recount it)

Job submission

Step 1: The client calls the job.waitForCompletion() method to submit a MapReduce job to the cluster.

Step 2: The client applies to the RM for a job id.

Step 3: The RM returns the job resource submission path and the job id to the client.

Step 4: The client submits the jar package, split information, and configuration files to the specified resource submission path.

Step 5: After submitting the resources, the client applies to the RM to run the MRAppMaster.

Job initialization

Step 6: When the RM receives the client's request, it adds the job to the capacity scheduler.

Step 7: An idle NM picks up the job.

Step 8: That NM creates a Container and spawns the MRAppMaster.

Step 9: The MRAppMaster downloads the resources submitted by the client to the local node.

Task Assignment

Step 10: The MRAppMaster applies to the RM for resources to run multiple MapTasks.

Step 11: The RM assigns the MapTasks to two other NodeManagers, which accept the tasks and each create a container.

Task execution

Step 12: The MRAppMaster sends the program startup script to the two NodeManagers that accepted the tasks. The two NodeManagers each start a MapTask, and the MapTasks partition and sort the data.

Step 13: Once all MapTasks have finished, the MRAppMaster applies to the RM for containers and runs the ReduceTasks.

Step 14: Each ReduceTask fetches the data of its corresponding partition from the MapTasks.

Step 15: After the program finishes, the MRAppMaster applies to the RM to unregister itself.

Progress and status updates: tasks in YARN report their progress and status to the ApplicationMaster, and the client polls the ApplicationMaster for progress every second (set by mapreduce.client.progressmonitor.pollinterval) and displays it to the user.

Job completion

In addition to polling the ApplicationMaster for progress, the client checks every 5 seconds, via waitForCompletion(), whether the job has finished.

This interval can be set via mapreduce.client.completion.pollinterval.

After the job completes, the ApplicationMaster and its containers clean up their working state. The job information is stored by the job history server so users can inspect it later.
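Both polling intervals mentioned in this section can be tuned in mapred-site.xml; the sketch below shows the relevant properties with their default values (in milliseconds):

```xml
<!-- How often the client polls the ApplicationMaster for progress -->
<property>
    <name>mapreduce.client.progressmonitor.pollinterval</name>
    <value>1000</value>
</property>
<!-- How often waitForCompletion() checks whether the job has finished -->
<property>
    <name>mapreduce.client.completion.pollinterval</name>
    <value>5000</value>
</property>
```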

 

Section 3 Yarn scheduling strategy

There are three main Hadoop job schedulers: FIFO, Capacity Scheduler, and Fair Scheduler.

The default resource scheduler in Hadoop 2.9.2 is the Capacity Scheduler.

You can verify this in yarn-default.xml.

1. FIFO (first-in, first-out scheduler)


2. Capacity Scheduler (the default scheduler)

This is the scheduling strategy Apache Hadoop uses by default. The Capacity Scheduler allows multiple organizations to share the entire cluster, with each organization getting a portion of its computing capacity. By assigning each organization a dedicated queue and giving each queue a share of the cluster's resources, a cluster with multiple queues can serve multiple organizations. Queues can also be subdivided, so multiple members within an organization can share the queue's resources. Within a queue, resources are scheduled on a first-in, first-out (FIFO) basis.
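As a sketch of this queue setup, the Capacity Scheduler's queues are declared in capacity-scheduler.xml. The property names below are the scheduler's real ones, but the queue names (prod, dev) and the percentages are illustrative assumptions, not values from the text:

```xml
<!-- Two top-level queues under root; capacities must sum to 100 -->
<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>60</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>40</value>
</property>
```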


3. Fair Scheduler (the default scheduler in CDH distributions of Hadoop)

The design goal of the Fair Scheduler is to allocate resources fairly to all applications (the definition of fairness is configurable). Fair scheduling can also work across multiple queues.

For example, suppose there are two users, A and B, each with their own queue.
When A starts a job and B has none, A gets all the cluster resources. When B then starts a job, A's job keeps running, and after a while each job holds half of the cluster resources. If B starts a second job while the others are still running, it shares queue B's resources with B's first job, so B's two jobs each get a quarter of the cluster while A's job still holds half. The end result is that resources are shared equally between the two users.
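The arithmetic in this example can be sketched as follows (a hypothetical helper for illustration, not Hadoop code):

```java
// Sketch of the fair-share arithmetic described above: busy queues split
// the cluster equally, and jobs inside a queue split that queue's share.
public class FairShareDemo {
    // Fraction of the cluster each job in a queue receives, assuming
    // activeQueues queues are busy and jobsInQueue jobs share one queue.
    static double jobShare(int activeQueues, int jobsInQueue) {
        return 1.0 / activeQueues / jobsInQueue;
    }

    public static void main(String[] args) {
        // A alone: the whole cluster.
        System.out.println(jobShare(1, 1)); // 1.0
        // A and B each run one job: half each.
        System.out.println(jobShare(2, 1)); // 0.5
        // B starts a second job: B's jobs get a quarter each.
        System.out.println(jobShare(2, 2)); // 0.25
    }
}
```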

 

Section 4 Yarn multi-tenant resource isolation configuration

The YARN cluster resources are split into two queues, A and B. Queue A is configured with 70% of the resources and mainly runs regular scheduled tasks; queue B is configured with 30% and runs ad-hoc tasks. The two queues can share resources with each other: if queue A's resources are exhausted while queue B has spare capacity, queue A can borrow queue B's resources to maximize overall utilization.

The Fair Scheduler scheduling strategy is chosen here.

Specific configuration

1. yarn-site.xml

<!-- Use the FairScheduler scheduling policy for tasks -->
<property>
     <name>yarn.resourcemanager.scheduler.class</name>
     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
     <description>In case you do not want to use the default scheduler</description>
</property>

2. Create the fair-scheduler.xml file
Create the file under etc/hadoop in the Hadoop installation directory.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
      <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
      <queue name="root" >
          <queue name="default">
              <aclAdministerApps>*</aclAdministerApps>
              <aclSubmitApps>*</aclSubmitApps>
              <maxResources>9216 mb, 4 vcores</maxResources>
              <maxRunningApps>100</maxRunningApps>
              <minResources>1024 mb, 1 vcores</minResources>
              <minSharePreemptionTimeout>1000</minSharePreemptionTimeout>
              <schedulingPolicy>fair</schedulingPolicy>
              <weight>7</weight>
          </queue>
          <queue name="queue1">
              <aclAdministerApps>*</aclAdministerApps>
              <aclSubmitApps>*</aclSubmitApps>
              <maxResources>4096 mb, 4 vcores</maxResources>
              <maxRunningApps>5</maxRunningApps>
              <minResources>1024 mb, 1 vcores</minResources>
              <minSharePreemptionTimeout>1000</minSharePreemptionTimeout>
              <schedulingPolicy>fair</schedulingPolicy>
              <weight>3</weight>
          </queue>
      </queue> 
      <queuePlacementPolicy>
         <rule create="false" name="specified"/>
          <rule create="true" name="default"/>
      </queuePlacementPolicy>
</allocations>

 

Boundary verification

 

Part 2 Apache Hadoop HDFS Core Source Code Analysis

Section 1 Source code reading preparation

1. Download the official Apache Hadoop 2.9.2 source code.
2. Import the source code into IDEA.

Start IDEA and choose Import from the welcome screen.

Wait for the dependencies to download and resolve; the source import is then complete.

 

Section 2 NameNode startup process

Command to start the HDFS cluster

start-dfs.sh


This command starts the HDFS NameNode and DataNodes. The NameNode is started mainly through the org.apache.hadoop.hdfs.server.namenode.NameNode class.

We focus on what the NameNode does during startup (technical details off the main line are not covered).

The startup analysis is mainly concerned with two parts of the code:

The NameNode's main responsibilities are managing file metadata and the block mapping. Correspondingly, the NameNode startup process needs to set up the worker threads that communicate with clients and DataNodes, the file metadata management machinery, and the block management machinery. Among these, the RpcServer is mainly responsible for communicating with clients and DataNodes, and FSDirectory is mainly responsible for managing file metadata.

 

Section 3 DataNode startup process

The main class of the DataNode is DataNode; start by finding DataNode.main().

 

Section 4 Data Writing Process

There are many important worker threads in the DataNode. Among them, DataXceiverServer and BPServiceActor are the most closely related to the block-writing process. The client and the DataNode complete block reads and writes mainly through the data transfer protocol.

DataTransferProtocol is used for streaming communication between clients and DataNodes throughout the pipeline. In it, DataTransferProtocol.writeBlock() is responsible for writing a data block:

 

Section 5 How NameNode Supports High Concurrent Access (Double Buffering Mechanism)

What problems arise when the NameNode is accessed concurrently?

From the HDFS metadata management mechanism, we know that every client request that modifies a piece of metadata on the NameNode

(for example, an application to upload a file) must write an edits log entry, which involves two steps:

writing it to the local disk (the edits file), and

transmitting it over the network to the JournalNode cluster (in a Hadoop HA cluster; studied together with ZooKeeper).


The difficulty of high concurrency lies in thread safety of the data and the efficiency of each operation.

For thread safety:

The NameNode follows a few principles when writing the edits log:

When writing to the edits log, each edit must be given a globally ordered, increasing transaction id (txid for short), so that the order of edits can be identified.

To guarantee that each edit's txid is increasing, a synchronization lock must be used. That is, each thread that modifies metadata must acquire the lock in turn to generate the incremental txid, which is the sequence number of the edit about to be written.
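As an illustration of this rule, here is a minimal sketch (plain Java, not the NameNode source; all names are invented) of generating globally ordered txids under a synchronization lock:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the principle above: every edit acquires the lock, receives
// a strictly increasing txid, and is recorded in txid order.
public class TxidDemo {
    private long txid = 0;
    private final List<Long> log = Collections.synchronizedList(new ArrayList<>());

    // Each "edit" takes the instance lock, bumps the txid, records the edit.
    synchronized long logEdit() {
        txid++;        // globally ordered sequence number for this edit
        log.add(txid);
        return txid;
    }

    public List<Long> getLog() { return log; }

    public static void main(String[] args) throws InterruptedException {
        TxidDemo demo = new TxidDemo();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) pool.execute(() -> demo.logEdit());
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(demo.getLog().size()); // 100 edits, txids 1..100
    }
}
```

Because both the increment and the append happen under the same lock, the recorded txids are unique and strictly increasing, exactly the invariant the NameNode needs.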

 

The resulting problem:
If every locked section both generates a txid and then writes the edits log to disk, the combination of holding the synchronization lock and performing the disk write is very time-consuming.

The HDFS optimization
The root cause of the problem is that serializing and queuing to increment the txid when writing edits, plus the disk write itself, takes time.

The HDFS solution:
1. Serialization: use a segmented lock.
2. Disk writes: use double buffering.

In the segmented-lock mechanism, each thread first acquires the lock in order, generates a sequentially increasing txid, writes its edit into memory double-buffer area 1, and then releases the lock. In that gap, the next thread can acquire the lock and immediately write its own edit into the memory buffer.

With the double-buffer mechanism, the program allocates two identical memory areas. One is bufCurrent, into which newly generated data is written directly; the other is bufReady. When the data in bufCurrent reaches a set threshold, the two buffers are swapped (area 1 and area 2 are exchanged directly). This ensures that all client write requests are absorbed by the in-memory buffers rather than written synchronously to disk.
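Before looking at the actual source, the idea can be captured in a minimal sketch (plain Java, not Hadoop code; all names are invented): edits land in bufCurrent, and when a threshold is reached the buffers are swapped and the ready buffer is flushed.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the double-buffer idea described above: writers append to
// bufCurrent in memory; at the threshold the two buffers are swapped and
// the ready buffer is persisted, standing in for the edits file on disk.
public class DoubleBufferDemo {
    private List<String> bufCurrent = new ArrayList<>(); // receives new edits
    private List<String> bufReady   = new ArrayList<>(); // being flushed
    private final List<String> disk = new ArrayList<>(); // stand-in for disk
    private final int threshold;

    DoubleBufferDemo(int threshold) { this.threshold = threshold; }

    // Writers only touch bufCurrent: this is the fast, in-memory path.
    synchronized void logEdit(String edit) {
        bufCurrent.add(edit);
        if (bufCurrent.size() >= threshold) swapAndFlush();
    }

    // Exchange the two buffers, then persist the ready buffer.
    synchronized void swapAndFlush() {
        List<String> tmp = bufCurrent;
        bufCurrent = bufReady;
        bufReady = tmp;
        disk.addAll(bufReady);  // in HDFS this flush happens off the hot path
        bufReady.clear();
    }

    List<String> getDisk() { return disk; }
    int pending() { return bufCurrent.size(); }

    public static void main(String[] args) {
        DoubleBufferDemo db = new DoubleBufferDemo(2);
        db.logEdit("edit-1");
        db.logEdit("edit-2"); // threshold reached: buffers swap, edits flushed
        System.out.println(db.getDisk()); // [edit-1, edit-2]
    }
}
```

In the real NameNode the flush of the ready buffer runs concurrently with new writes into the current buffer; the sketch keeps both under one lock only for brevity.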

Double buffer source code analysis

 

Extension: Hadoop 3.x New Features Overview

Many features have been enhanced in Hadoop 3.x. Hadoop 3.x no longer supports JDK 1.7 and requires JDK 1.8 or above. Hadoop 2.0 was developed against JDK 1.7, and JDK 1.7 stopped receiving updates in April 2015, which pushed the Hadoop community to release a new Hadoop version based on JDK 1.8: Hadoop 3.x. Hadoop 3.x will also adjust its project architecture going forward, with MapReduce processing data based on memory, I/O, and disk together.

Hadoop 3.x introduces some important features and optimizations, including HDFS erasure coding, multi-NameNode support, MR NativeTask optimization, cgroup-based memory and disk I/O isolation in YARN, YARN container resizing, and more.

The official Hadoop 3.x documentation is available at:

http://hadoop.apache.org/docs/r3.0.1/

Common improvements in Hadoop 3.x

Hadoop Common improvements:

1. A streamlined Hadoop kernel, including removing outdated APIs and implementations and replacing default component implementations with the most efficient ones (for example, the default FileOutputCommitter implementation is changed to v2, hftp is removed in favor of webhdfs, and Hadoop's own serialization library org.apache.hadoop.Records is removed).

2. Classpath isolation, which prevents conflicts between different versions of jar packages; for example, Google Guava easily causes conflicts when Hadoop, HBase, and Spark are mixed. (https://issues.apache.org/jira/browse/HADOOP-11656)

3. Shell script refactoring. Hadoop 3.0 refactored the Hadoop management scripts, fixed a large number of bugs, added new features, and supports dynamic commands; usage is consistent with previous versions. (https://issues.apache.org/jira/browse/HADOOP-9902)

 

HDFS improvements in Hadoop 3.x

The most significant change in Hadoop 3.x is to HDFS. HDFS computes against the nearest data block: following the nearest-computation principle, the local block is loaded into memory and computed first, and the result is formed quickly through I/O and a shared in-memory compute area.

1. HDFS supports erasure coding of data, which lets HDFS save half the storage space without reducing reliability. (https://issues.apache.org/jira/browse/HDFS-7285)

2. Multi-NameNode support, i.e., deploying one active and multiple standby NameNodes in a cluster. Note: the multi-ResourceManager feature was already supported in Hadoop 2.0. (https://issues.apache.org/jira/browse/HDFS-6440)

The official documentation for these two features:

http://hadoop.apache.org/docs/r3.0.1/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html

http://hadoop.apache.org/docs/r3.0.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html


YARN improvements in Hadoop 3.x

1. Cgroup-based memory isolation and disk I/O isolation (https://issues.apache.org/jira/browse/YARN-2619)

2. Use of Curator to implement RM leader election (https://issues.apache.org/jira/browse/YARN-4438)

3. Container resizing (https://issues.apache.org/jira/browse/YARN-1197)

4. Timeline Server next generation (https://issues.apache.org/jira/browse/YARN-2928)

Official document address:

http://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html


MapReduce improvements in Hadoop 3.x

1. Task native optimization. A C/C++ map output collector implementation (covering spill, sort, IFile, etc.) was added to MapReduce and can be enabled via a job-level parameter. For shuffle-intensive applications, performance can improve by about 30%. (https://issues.apache.org/jira/browse/MAPREDUCE-2841)

2. Automatic inference of MapReduce memory parameters. In Hadoop 2.0, setting memory parameters for MapReduce jobs is cumbersome, involving two parameters: mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts. Unreasonable settings cause serious waste of memory resources: if the former is set to 4096 MB but the latter to "-Xmx2g", the remaining 2 GB cannot actually be used by the Java heap. (https://issues.apache.org/jira/browse/MAPREDUCE-5785)
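For illustration, a consistent pairing of the two parameters in mapred-site.xml might look like the following; the roughly 80% heap-to-container ratio is an assumption for the example, not a value from the text:

```xml
<!-- Container size for each map task -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>
<!-- JVM heap kept below the container size so the 4096 MB is usable -->
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3276m</value>
</property>
```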

 

Other new features of Hadoop 3.x

1. The new hadoop-client-api and hadoop-client-runtime components bundle the client's dependencies into a single jar package, solving dependency incompatibility problems. (https://issues.apache.org/jira/browse/HADOOP-11804)
2. Support for Microsoft's Azure distributed file system and Alibaba's Aliyun distributed file system.


Origin blog.csdn.net/chengh1993/article/details/111771134