Be sure to read this! The core knowledge points of big data YARN are coming!

01 Let's learn big data together

Hello everyone, today I am sharing the core knowledge points of big data YARN. Lao Liu tries his best to explain the knowledge points of YARN in easy-to-understand words, so that after reading them everyone can express them in their own colloquial way and truly master them! (If you think Lao Liu wrote well, give Lao Liu a thumbs up)

02 YARN core knowledge points

Point 1: What is YARN?

YARN is the resource scheduling engine module in the Hadoop architecture. As can be seen from the name of this module, YARN is used to provide resource management and scheduling for applications.

Similar to HDFS, YARN is also a classic master-slave architecture; the details of that architecture are covered in the second point. If an interview asks you to introduce YARN, Lao Liu suggests discussing the first and second points together.

Point 2: YARN architecture

(Figure: YARN architecture diagram)

First look at this architecture diagram and you can see that YARN is a very typical master-slave architecture. YARN consists of one ResourceManager (RM) and multiple NodeManagers (NM); the RM is the master node and the NMs are the slave nodes.

What is ResourceManager?

The RM is the global resource manager; there is only one in a cluster. It is mainly responsible for resource management and allocation across the entire system: starting and monitoring the ApplicationMasters, monitoring the NodeManagers, and allocating and scheduling resources.

The RM mainly consists of two components: the Scheduler and the Applications Manager.

What is a scheduler (Scheduler)?

The scheduler allocates the resources in the system to each running application according to capacity and queue restrictions. One thing to emphasize here: the scheduler is a pure scheduler, that is, it only takes care of resource allocation and does not participate in any work related to the specific application.

What is the Applications Manager (ApplicationsManager)?

The applications manager is responsible for managing all the applications in the entire system: accepting job submissions, negotiating the resources to start each application's ApplicationMaster, and monitoring the ApplicationMaster and restarting it on failure.

What is NodeManager?

The NodeManager is the slave service; there are multiple NodeManagers in a cluster. It is responsible for receiving resource allocation requests from the RM and assigning specific Containers to applications, and also for monitoring the Containers and reporting their usage information to the RM.

What is Container?

A Container is the unit of resource allocation in YARN, encapsulating resources such as memory and CPU. YARN allocates resources in units of Containers.
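To make "allocating in units of Containers" concrete, here is a hedged little sketch, in Java, of how one Container's worth of resources is described when requesting containers from the RM through Hadoop's AMRMClient API. The 2048 MB / 2 vcore numbers are arbitrary examples, not values from this article.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerUnit {
    public static void main(String[] args) {
        // A Container bundles memory and CPU; the numbers are arbitrary examples
        Resource oneContainer = Resource.newInstance(2048, 2); // 2048 MB, 2 vcores
        // A request for one such Container, with no node or rack preference
        ContainerRequest request = new ContainerRequest(
                oneContainer, null, null, Priority.newInstance(0));
        System.out.println("Requesting: " + request.getCapability());
    }
}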

Point 3: YARN application submission process
(Figure: YARN application submission flow)
Lao Liu will briefly describe the YARN application submission process; what he mainly wants to talk about is MapReduce On YARN.

According to the above figure, you can see the YARN application submission process.

The first step is that the user submits the application to the ResourceManager;

The second step is that the ResourceManager allocates resources for the ApplicationMaster and communicates with a NodeManager to start the first Container, in which the ApplicationMaster runs;

The third step is that the ApplicationMaster registers with the ResourceManager and communicates with it, applying for resources for the tasks to be executed; once the resources are obtained, it communicates with the NodeManagers to start the corresponding Tasks;

The fourth step is that after all the tasks are finished, the ApplicationMaster unregisters from the ResourceManager, and the entire application ends. A minimal client-side sketch of the first step follows.
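Here is a minimal sketch of how a client submits an application to the ResourceManager using Hadoop's YarnClient API. The application name is made up for illustration, and a real submission would also fill in the AM's launch command and local resources in the ContainerLaunchContext.

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application (step 1)
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app"); // made-up name

        // Resources for the first Container, which will run the ApplicationMaster
        ctx.setResource(Resource.newInstance(1024, 1)); // 1024 MB, 1 vcore

        // Launch context for the AM; the command, environment and jars
        // would be set here in a real submission
        ContainerLaunchContext amContainer =
                Records.newRecord(ContainerLaunchContext.class);
        ctx.setAMContainerSpec(amContainer);

        // Hand the application to the RM; step 2 happens on the RM side
        yarnClient.submitApplication(ctx);
        yarnClient.stop();
    }
}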

The following is the key content: a detailed look at MapReduce On YARN.

Point 4: MapReduce On YARN

(Figure: MapReduce On YARN flow)

This point is very important. Point 3 only scratched the surface; Point 4 describes the whole process in detail. Lao Liu has tried to make it easy to understand. The operation process of MapReduce On YARN is as follows:

1. First, package the program into a jar, then run the hadoop jar command on the client side, and the job is submitted to the cluster to run. In this process, job.waitForCompletion() in the program calls the submit() method of Job (a minimal driver sketch follows this list).

2. Then the client remotely calls the ResourceManager to obtain an id for the MR job. At the same time, the output directory is checked: if the output directory is not specified or already exists, an error is reported. The input splits of the job are also computed; if the splits cannot be computed (for example, because the input path does not exist), an error is reported.

3. Next, the resources related to the job (the configuration file, the jar package, and the split information) are uploaded to HDFS.

4. The client submits the application to the RM, and the job starts to run.

5. When the RM receives the job submission, it communicates with a designated NodeManager and notifies it to start a container. The NodeManager creates a Container occupying specific resources and runs the MRAppMaster process in this Container.

6. The MRAppMaster process initializes the job and creates multiple bookkeeping objects to record the progress and status information of each map task and reduce task.

The bookkeeping objects here deserve some explanation. Many sources do not explain what they are and simply skip over them; in Lao Liu's opinion, that is a bit sloppy.

After some searching, Lao Liu would introduce them like this: the application master of a MapReduce job is a Java application whose main class is MRAppMaster. It initializes the job by creating a certain number of bookkeeping objects to track the job's progress, and these objects receive the progress and completion reports of the tasks.

7. The AppMaster needs to start the Tasks, but it does not know how many map tasks to start or on which nodes to start them, so at this point the AppMaster obtains the split information from HDFS.

8. After obtaining this information, task allocation begins. The AppMaster requests resources from the RM for each task. After receiving the request, the RM calculates the resources. What resources does it calculate? Generally, how much memory to allocate to each map task and reduce task, and how many virtual cores (vcores) to use.

What Lao Liu wants to say at this step is that when reading study material, you should try asking yourself questions. For example, when you read that the RM calculates resources, ask yourself what resources it calculates; this is very helpful for your own progress.

9. After the AppMaster receives the returned resource information, it communicates with the NodeManagers, and each NodeManager starts a JVM (a container).

10. But before the task (YarnChild) runs in the container, the resources needed to run the task are first pulled to the local node: the jar package, the configuration file, and the distributed cache files.

11. The Task runs.
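To make step 1 concrete, here is a minimal driver sketch of the kind the client runs. The class name WordCountDriver and the word-count logic are made up for illustration; the key point is that job.waitForCompletion(true) internally calls Job's submit() method, which kicks off steps 2 to 4 above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class); // the jar run via `hadoop jar` (step 1)
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist (step 2)
        // waitForCompletion() internally calls submit(), then polls progress
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}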

Lao Liu tried his best to express it in colloquial form. I hope everyone can remember it.

Point 5: YARN application life cycle

1. The Client submits the application to the RM, including the AM program and the command to start the AM.

2. The RM allocates the first container to the AM and communicates with the corresponding NM to start the application's AM in that container.

3. When the AM starts, it registers with the RM, which allows the Client to obtain AM information from the RM and then communicate with the AM directly.

4. The AM negotiates container resources for the application through the resource request protocol.

5. If a container is allocated successfully, the AM asks the NM to start the application in the container. After starting, the application can communicate with the AM independently.

6. The application executes in the container and reports its progress to the AM.

7. During execution, the Client communicates with the AM to obtain the application's status.

8. After the application finishes executing, the AM unregisters from the RM and shuts down, releasing its resources (a minimal AM-side sketch of steps 3 and 8 follows).
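As a hedged illustration of steps 3 and 8, here is a minimal sketch of the AM side using Hadoop's AMRMClient; the host name, port, and messages are placeholder values.

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MiniAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Step 3: register with the RM ("am-host", 0 and "" are placeholders)
        rmClient.registerApplicationMaster("am-host", 0, "");

        // Steps 4 to 7 would happen here: addContainerRequest(), allocate(),
        // launching containers through an NMClient, and tracking progress.

        // Step 8: unregister from the RM and release resources
        rmClient.unregisterApplicationMaster(
                FinalApplicationStatus.SUCCEEDED, "done", null);
        rmClient.stop();
    }
}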

Point 6: Common YARN commands

Start YARN
start-yarn.sh

Stop the RM and NMs
stop-yarn.sh

List the running applications
yarn application -list

Kill a running application
yarn application -kill <application id>

List the nodes
yarn node -list

Point 7: YARN scheduler

Think first about why a scheduler is needed. In real life, you will definitely encounter scenarios where multiple tasks are submitted at the same time. How should resources be allocated to satisfy these tasks? Which one should execute first? These questions all take careful thought!

So in the YARN framework, the scheduler is a very important component. With proper scheduling rules, multiple applications can work at the same time in an orderly fashion.

The most primitive scheduling rule in YARN is FIFO: whoever submits a task first executes first. But this easily leads to two situations: ① a large task monopolizes the resources, and the other tasks have to keep waiting for it to finish; ② a pile of small tasks occupies the resources, and a large task can never obtain adequate resources. So although FIFO is very simple, it cannot meet our needs.

So there are three schedulers to choose from in YARN: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.

The FIFO Scheduler arranges applications into a queue in the order they are submitted; this is a first-in, first-out queue. When resources are allocated, the application at the head of the queue is served first, and only after its needs are satisfied is the second application allocated resources, and so on.

The FIFO Scheduler is the simplest and easiest scheduler to understand and does not require any configuration, but its shortcoming was just mentioned: a large application may occupy all the cluster resources, which blocks the other applications. So in a shared cluster, it is more suitable to use the Capacity Scheduler or the Fair Scheduler.

As for the Capacity Scheduler, it keeps a dedicated queue for running small tasks, but setting up a queue for small tasks pre-occupies a certain amount of cluster resources, which causes the execution time of large tasks to lag behind what it would be under the FIFO scheduler.

With the Fair Scheduler, however, there is no need to pre-occupy system resources; the Fair Scheduler dynamically adjusts resources among all running jobs. When the first (large) job is submitted and is the only one running, it holds all the cluster resources; when a second (small) task is submitted, the Fair Scheduler allocates half of the resources to the small task, letting the two tasks share the cluster resources fairly. After the small task finishes, it releases its resources and the large task obtains all the resources again. In this way the Fair Scheduler achieves both high resource utilization and the timely completion of small tasks.
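As a small hedged example of how these queues are used, here is how a job could be pointed at a specific scheduler queue from the driver code. The queue name "small" is made up and assumes such a queue has been configured for the Capacity or Fair Scheduler.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "small" is a hypothetical queue defined in the scheduler configuration;
        // mapreduce.job.queuename tells YARN which queue this job should run in
        conf.set("mapreduce.job.queuename", "small");
        Job job = Job.getInstance(conf, "small-queue-job");
        // ... set the mapper, reducer, and input/output paths as usual ...
    }
}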

03 Summary

Alright, the YARN knowledge points are almost all covered. Lao Liu has tried his best to express these knowledge points in an easy-to-understand, colloquial form, hoping to be helpful to students who are interested in big data, and also hoping to receive criticism and guidance from the experts.

If you think Lao Liu wrote well, please give Lao Liu a thumbs up!

Finally, if you have questions, contact the official account: Lao Liu who works hard; if not, just keep learning big data with Lao Liu.


Origin: blog.csdn.net/qq_36780184/article/details/110038577