I. What is Spark
Spark is a memory-based computing framework.
II. Why Spark was created
Spark was created mainly to solve the drawbacks of Hadoop; a timeline helps explain its birth.
1. Hadoop 1.x (2011)
(Hadoop 1.x architecture)
(MapReduce in the Hadoop 1.x architecture)
Hadoop 1.x was officially released in 2011 and attracted many users, but after some time people discovered that Hadoop 1.x had problems (mainly MapReduce problems):
1. MapReduce computes over on-disk datasets: it always reads the input file from disk, computes, writes the result back to disk, and the task ends. That seems fine at first, but as the need grew to process real-time data, streaming data, and graph data, MapReduce's increasingly frequent disk reads and writes were clearly too slow, so we wanted to introduce a new computing model to solve this problem. However (see point 2):
2. In Hadoop 1.x, MapReduce is not only responsible for computation; it is also responsible for task scheduling and resource scheduling. Resource scheduling and computation are tightly coupled, so swapping in a new computing framework is clearly impossible: without resource scheduling, how would a new framework be allocated compute resources? Concretely, MapReduce runs a process called the JobTracker, which is responsible for both resource scheduling and task scheduling; under the JobTracker there are TaskTracker processes, which carry out the actual computation. The JobTracker's burden is far too heavy, and everything is tightly coupled.
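The slowness described in point 1 can be sketched in plain Python (a toy model, not real MapReduce or Spark): the "MapReduce-style" loop round-trips its working set through disk on every iteration, while the "Spark-style" loop keeps it in memory. The function names and the doubling computation are purely illustrative.

```python
import json
import os
import tempfile

def mapreduce_style_iterations(data, n_iters):
    """Toy model of chained MapReduce jobs: every iteration reads its
    input from disk and writes its output back to disk."""
    path = os.path.join(tempfile.mkdtemp(), "stage.json")
    with open(path, "w") as f:
        json.dump(data, f)
    for _ in range(n_iters):
        with open(path) as f:              # read the previous stage from disk
            current = json.load(f)
        result = [x * 2 for x in current]  # the actual computation
        with open(path, "w") as f:         # write the result back to disk
            json.dump(result, f)
    with open(path) as f:
        return json.load(f)

def in_memory_iterations(data, n_iters):
    """Toy model of Spark's strategy: the working set stays in memory
    across iterations, with no disk round-trips."""
    current = data
    for _ in range(n_iters):
        current = [x * 2 for x in current]
    return current

print(mapreduce_style_iterations([1, 2, 3], 3))  # [8, 16, 24]
print(in_memory_iterations([1, 2, 3], 3))        # [8, 16, 24]
```

Both loops compute the same answer; the difference is that the disk version pays a read plus a write per iteration, which is exactly what makes iterative workloads slow on MapReduce.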
People could no longer tolerate these Hadoop 1.x problems, and so Spark was born.
2. The birth of Spark (June 2013)
Spark was born to solve the MapReduce problems mentioned above. To solve the speed problem, it adopted a strategy of memory-based computation and supports iterative computation; to solve the coupling problem, Spark separates resource scheduling from task scheduling:
(Spark architecture)
The elements in the figure, explained:
Master: responsible for resource scheduling
Driver (Application Master): responsible for task scheduling. The Driver communicates with the Master through some mechanism, but the Driver is not the Master; the Master is only responsible for resource scheduling. When Spark was born, YARN did not yet exist, so Spark used its own resource scheduler, called the Master. After YARN appeared, the Driver could connect to YARN's ResourceManager instead of the Master.
Worker: responsible for computation and resource scheduling on its own node
Executor: responsible for performing the actual computation
As the figure shows, the Worker connects directly to the Master, while the Executor connects directly to the Driver.
So you can see that the two most important concepts in the Spark architecture are the Driver and the Executor, because only these two are irreplaceable.
3. Hadoop 2.x (November 2013)
Hadoop 2.x introduced YARN, and it was YARN that saved Hadoop: YARN separates resource scheduling, task scheduling, and computation from one another.
There are two major elements in the diagram: the ResourceManager and the NodeManager.
ResourceManager: as you can see, the ResourceManager is responsible for resource management and scheduling, plus starting the ApplicationMaster. Note: the scheduling here is pure resource scheduling and does not involve task scheduling; task scheduling is done in the AM.
**NodeManager:** the resource and task manager on each node; the actual computation runs on the NodeManager.
**ApplicationMaster:** responsible for applying to the RM for resources, and for starting and monitoring tasks. When a user submits a job to the ResourceManager, the RM first finds an available NM, allocates a Container on it, and starts the AM inside that Container. The AM is not fixed; it changes according to what the user submits: if the user submits an MR job, it is the MR AM; if a Spark job, it is the Spark AM.
Container: a Container is an abstraction over resources. Just as a virtual machine on Windows can use the resources Windows allocates to it to boot Linux or macOS, inside a Container YARN can start a MapReduce or Spark task.
To understand the role of these elements, you mainly still have to understand the whole framework's workflow.
In summary, YARN achieves decoupling: if we come up with a new computing framework, as long as we plug in our own AM and Task, it can run on Hadoop 2.0. That is really powerful!
III. Spark's workflow on YARN
(Spark-on-YARN workflow diagram)
From the summary above, we can see that Spark's standalone mode of operation is structurally very similar to YARN: Spark standalone's Driver corresponds to the AM, the Master to the RM, the Worker to the NM, and the Executor to the Task.
So if we want to run Spark on Hadoop 2.0, we just need to run Spark's Driver as YARN's ApplicationMaster and Spark's Executors as YARN's Tasks.
IV. Spark's two important components
After the birth of YARN, Spark's resource scheduler can be either the ResourceManager or the Master, and Spark's workers can be either Spark's own Workers or YARN's NodeManagers; both are replaceable. Therefore, the two core components of Spark are the Driver and the Executor.
1. Driver
The Driver is the process that runs the main method of the program you develop.
2. Executor
An Executor is a worker process, responsible for running the tasks of a Spark job; the tasks are independent of one another.
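As a rough illustration in plain Python (no Spark involved; all names here are made up), think of the Driver as the coordinating process that splits a job into independent tasks and collects the results, and the Executors as workers that run those tasks in parallel. Real Executors are separate processes on cluster nodes; a thread pool stands in for them here.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    """One independent task: 'Executors' run these without
    communicating with each other."""
    return sum(x * x for x in partition)

def driver_main():
    """Toy 'Driver': splits the dataset into partitions, hands one
    task per partition to the worker pool, and merges the results."""
    data = list(range(10))
    partitions = [data[0:5], data[5:10]]              # split the dataset
    with ThreadPoolExecutor(max_workers=2) as pool:   # stand-in 'Executors'
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)                          # Driver merges results

if __name__ == "__main__":
    print(driver_main())  # sum of squares of 0..9 = 285
```

The tasks never talk to each other, which mirrors the sentence above: tasks in a Spark job are independent, and only the Driver sees the whole picture.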
V. Spark deployment modes
1. local
The Spark job runs locally; local[k] means run with k threads, and local[*] means run with as many threads as there are cores.
2. standalone
Runs independently on Spark's own cluster (the Master/Worker architecture above).
3. yarn
Runs on a resource management system, such as YARN or Mesos.
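As a sketch, the deployment mode is chosen through the `--master` option of `spark-submit`; the application file and host name below are placeholders.

```shell
# local mode with 4 threads
spark-submit --master local[4] my_app.py

# local mode with one thread per core
spark-submit --master "local[*]" my_app.py

# standalone mode, pointing at the Spark Master (placeholder host/port)
spark-submit --master spark://master-host:7077 my_app.py

# YARN mode (assumes HADOOP_CONF_DIR points at the cluster config)
spark-submit --master yarn my_app.py
```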