Spark architecture and operating modes

One, what is Spark

Spark is a memory-based computing framework.
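To make that concrete, here is a minimal sketch of a Spark program in Scala (a hypothetical word count; the app name and input path are placeholders). The pipeline is expressed as a chain of in-memory RDDs rather than as separate disk-to-disk jobs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*]: run in-process with one thread per core
    // (see the deployment modes section at the end).
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each transformation produces a new RDD; the whole pipeline is
    // described in memory and only runs when an action is called.
    val counts = sc.textFile("input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```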

Two, the background of Spark's birth

Spark was created mainly to solve the drawbacks of Hadoop, and a timeline makes the story of Spark's birth clear.

1, Hadoop 1.x -- 2011

[Figure: Hadoop 1.x architecture]

[Figure: MapReduce in the Hadoop 1.x architecture]

Hadoop 1.x was officially released in 2011 and quickly gained many users, but after some time people discovered that Hadoop 1.x had problems (mainly problems with MapReduce):

1, MapReduce computes over static data sets: it always reads the input from disk files, computes, writes the result back to disk, and the job ends. That seems fine on its own, but as needs grew we had to handle real-time data, streaming data, and graph data; with disk reads becoming more and more frequent, MapReduce was clearly not fast enough, so a new way of computing was needed to solve this problem. But there was a catch (see 2).

2, In the Hadoop 1.x era, MapReduce was responsible not only for computation but also for task scheduling and resource scheduling; resource scheduling and computation were tightly coupled. If we wanted to swap in a new computation framework, that was clearly impossible: without resource scheduling, how would computing resources be allocated?

More concretely, in Hadoop 1.x MapReduce had a process called JobTracker, which was responsible for both resource scheduling and task scheduling, and under the JobTracker a process called TaskTracker, which was responsible for computation. The JobTracker's burden was far too heavy, and everything was tightly coupled.

At this point people could no longer stand the problems of Hadoop 1.x, and so Spark was born.

2, the birth of Spark -- June 2013

Spark was born to solve the MapReduce problems described above. To solve the slowness problem, it adopted a memory-based computing strategy and support for iterative computation; to solve the coupling problem, Spark separates resource scheduling from task scheduling:

[Figure: Spark architecture]

Explanation of the elements in the figure:

Master: responsible for resource scheduling

Driver (playing the role of an Application Master): responsible for task scheduling. The Driver communicates with the Master through some mechanism, but the Driver is not the Master; the Master is only responsible for resource scheduling. When Spark was born, YARN did not exist yet, so Spark used its own resource scheduler, called the Master. Once YARN appeared, the Driver could connect to the ResourceManager instead, so that YARN replaces the Master.

Worker: responsible for computation and resource scheduling on its own node

Executor: responsible for performing calculations

As can be seen from the figure, the Worker connects directly to the Master, and the Executor connects directly to the Driver.

So you can see that the two most important concepts in the Spark architecture are the Driver and the Executor, because only these two are irreplaceable.
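As a small sketch of the memory-based, iterative style of computation described above (the file path and iteration count are made up for the example): an RDD can be cached once and reused across passes, where MapReduce would re-read the input from disk every time.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // cache() keeps the partitions in executor memory after the first
    // pass, so later iterations do not go back to disk.
    val data = sc.textFile("hdfs:///data/points.txt").cache() // hypothetical path

    var result = 0L
    for (_ <- 1 to 10) {
      // Each pass reuses the cached RDD instead of re-reading the file.
      result += data.filter(_.nonEmpty).count()
    }
    println(s"total across 10 passes: $result")
    sc.stop()
  }
}
```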

3, Hadoop 2.x -- November 2013

[Figure: Hadoop 2.x architecture]

Hadoop 2.x introduced YARN, and it was YARN that saved Hadoop: YARN separates resource scheduling, task scheduling, and computation from one another.

[Figure: YARN architecture]

There are two major elements in the figure: one is the ResourceManager, the other is the NodeManager.

**ResourceManager:** as you can see, the ResourceManager is responsible for resource management and scheduling, plus starting the ApplicationMaster. Note: the scheduling here is pure resource scheduling and does not involve task scheduling; task scheduling is done in the AM.

**NodeManager:** the NodeManager is the resource and task manager on each node; the actual computation is carried out on the NodeManager.

**ApplicationMaster:** responsible for applying to the RM for resources, and for starting and monitoring tasks. When a user submits a job to the ResourceManager, the RM first finds an available NM, allocates a Container on it, and starts the AM inside that Container. The AM is not fixed; it changes with what the user submits: if you submit an MR job, it is the MR AM; if it is Spark, the AM is Spark's.

**Container:** a Container is an abstraction of resources, like a virtual machine on a Windows system: the virtual machine can use Windows' resources and can boot Linux or macOS on top of them. Likewise, inside a YARN Container we can start MapReduce or Spark tasks.

To really understand the role of these elements, you still have to understand the framework's overall workflow.

In summary, YARN achieves decoupling: if we come up with a new computation framework, then as long as we plug our own implementations into the AM and Task slots, it can run on Hadoop 2.0. That is really powerful!

Three, Spark's workflow on YARN

[Figure: Spark's workflow on YARN]

From the summary above, we can see that Spark's standalone mode of operation is structurally very similar to YARN: in Spark standalone, the Driver is equivalent to the AM, the Master is equivalent to the RM, the Worker is equivalent to the NM, and the Executor is equivalent to the Task.

So if we want to run Spark on Hadoop 2.0, we just need to put the Spark Driver into YARN's ApplicationMaster and the Spark Executor into YARN's Task.

Four, Spark's two most important components

[Figure: Spark's core components]

After the birth of YARN, Spark's resource scheduler can be either the ResourceManager or the Master, and Spark's worker can be either Spark's own Worker or YARN's NodeManager; both of these are replaceable. Therefore, the two core components of Spark are the Driver and the Executor.

1、Driver

The Driver is the process that runs the main method of the program we develop.

2、Executor

The Executor is a worker process, responsible for running the tasks in a Spark job; the tasks are independent of one another.
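A small sketch of how the two components split the work (the names are illustrative, and the master URL is assumed to be supplied externally, e.g. by spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverExecutorDemo {
  // main() itself runs inside the Driver process.
  def main(args: Array[String]): Unit = {
    // No setMaster here: assume spark-submit supplies the master URL.
    val sc = new SparkContext(new SparkConf().setAppName("driver-executor-demo"))

    val nums = sc.parallelize(1L to 1000000L)

    // The closure passed to map() is shipped by the Driver and executed
    // as independent tasks inside the Executor processes.
    val squares = nums.map(n => n * n)

    // reduce() triggers the job: partial sums are computed on the
    // Executors, and the final value comes back to the Driver.
    println(squares.reduce(_ + _))

    sc.stop()
  }
}
```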

Five, Spark deployment modes

1、local

The Spark job runs locally; local[k] means it runs with k threads, and local[*] means it runs with as many threads as there are CPU cores.

2、standalone

Spark runs independently on its own dedicated cluster, using its built-in Master for resource scheduling.

3、yarn

Runs on a resource management system such as YARN or Mesos (a master-URL sketch follows this list).

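Each of these deployment modes is selected purely through the master URL passed to the configuration; a sketch (the standalone host and port are hypothetical):

```scala
import org.apache.spark.SparkConf

object DeployModeExamples {
  // local: 4 worker threads inside a single JVM
  val local = new SparkConf().setMaster("local[4]")
  // local[*]: one worker thread per available core
  val localAll = new SparkConf().setMaster("local[*]")
  // standalone: connect to Spark's own Master (hypothetical host/port)
  val standalone = new SparkConf().setMaster("spark://master:7077")
  // yarn: the ResourceManager is found via the Hadoop configuration
  val onYarn = new SparkConf().setMaster("yarn")
}
```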
