Chapter 4 Spark Runtime Architecture (Silicon Valley Notes)


4.1 Runtime architecture

The core of the Spark framework is a computing engine. Overall, it adopts a standard master-slave structure.
The figure below shows the basic structure of a Spark run. The Driver in the figure represents the master, which is responsible for scheduling jobs and tasks across the entire cluster. The Executor in the figure is the slave, responsible for actually executing the tasks.

[Figure: Spark master-slave runtime architecture]

4.2 Core components

As the figure above shows, the Spark framework has two core components:

4.2.1 Driver

The Spark driver node executes the main method of a Spark application and is responsible for the actual execution of user code.
During Spark job execution, the Driver is mainly responsible for:

  • ➢ Converting the user program into jobs
  • ➢ Scheduling tasks among Executors
  • ➢ Tracking the execution of Executors
  • ➢ Displaying the running status of queries through a UI

In fact, it is hard to define the Driver precisely, because the word "Driver" never appears anywhere in the code we write. Simply put, the Driver is the program that drives the entire application to run, also called the Driver class.

4.2.2 Executor

A Spark Executor is a JVM process on a worker node (Worker) in the cluster. It is responsible for running the concrete tasks (Task) in a Spark job, and those tasks are independent of each other. Executors are launched when the Spark application starts and exist for the application's entire lifecycle. If an Executor node fails or crashes, the Spark application can continue to execute: the tasks on the failed node are rescheduled onto other Executor nodes.


An Executor has two core functions:

  • ➢ Running the tasks that make up the Spark application and returning the results to the driver process
  • ➢ Providing in-memory storage, through their own block managers, for RDDs that the user program asks to cache. Because RDDs are cached directly inside the
    Executor process, tasks can make full use of the cached data to speed up computation at runtime.
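The caching idea above can be illustrated with a toy sketch. This is not Spark's real BlockManager API; it is a hypothetical Python model of one point only: once an executor caches a computed partition in memory, later tasks reuse it instead of recomputing.

```python
# Toy sketch (NOT Spark's real BlockManager): an executor-side
# in-memory cache keyed by (rdd_id, partition_id). The first task
# that needs a partition computes and stores it; later tasks reuse it.
class BlockManagerSketch:
    def __init__(self):
        self.store = {}        # (rdd_id, partition) -> cached data
        self.computations = 0  # how many times we actually computed

    def get_or_compute(self, rdd_id, partition, compute):
        key = (rdd_id, partition)
        if key not in self.store:
            self.store[key] = compute()  # compute once, cache in memory
            self.computations += 1
        return self.store[key]

bm = BlockManagerSketch()
expensive = lambda: [x * x for x in range(5)]  # stand-in for a partition computation

first = bm.get_or_compute(1, 0, expensive)   # computed and cached
second = bm.get_or_compute(1, 0, expensive)  # served from the cache
print(bm.computations)  # 1 -- the partition was computed only once
```

The real BlockManager also handles disk spill, serialization, and replication; this sketch keeps only the compute-once, reuse-many behavior that the text describes.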

4.2.3 Master & Worker

In a Spark cluster's standalone deployment, there is no need to rely on an external resource-scheduling framework: Spark implements the resource-scheduling function itself, so this environment has two more core components: Master and Worker. The Master is a process mainly responsible for resource scheduling and allocation, as well as cluster monitoring, similar to the ResourceManager (RM) in a Yarn environment. The Worker is also a process; a Worker runs on a server in the cluster, receives resources allocated by the Master to process and compute data in parallel, similar to the NodeManager (NM) in Yarn.

4.2.4 ApplicationMaster

When a Hadoop user submits an application to a YARN cluster, the submitted program includes an ApplicationMaster. The ApplicationMaster applies to the resource scheduler for resource containers (Container) to run tasks, runs the user's own Task program, monitors the execution of the whole job, tracks its status, and handles exceptions such as task failures. Simply put, the decoupling of ResourceManager (resources) from Driver (computation) relies on the ApplicationMaster.


4.3 Core concepts

4.3.1 Executor and Core

A Spark Executor is a JVM process running on a worker node (Worker) in the cluster; it is a node dedicated to computation in the cluster. When submitting an application, parameters can be provided to specify the number of compute nodes and the corresponding resources. The resources here generally mean the memory size of each Executor and the number of virtual CPU cores (Core) it uses.

The relevant application startup parameters are as follows:
[Table: application startup parameters]
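As an illustration of such parameters, a spark-submit invocation on Yarn might look like the following. The flag values, the class name `org.example.WordCount`, and `wordcount.jar` are placeholders, not recommendations.

```shell
# Example spark-submit resource flags (values are illustrative):
#   --num-executors    number of Executors to launch (Yarn)
#   --executor-memory  memory per Executor process
#   --executor-cores   virtual CPU cores (Core) per Executor
spark-submit \
  --class org.example.WordCount \
  --master yarn \
  --num-executors 3 \
  --executor-memory 2g \
  --executor-cores 2 \
  wordcount.jar
```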

4.3.2 Parallelism

In a distributed computing framework, multiple tasks generally execute at the same time. Because the tasks are distributed across different compute nodes, truly parallel multi-task execution can be achieved. Remember, this is parallelism, not concurrency. We call the number of tasks executing in parallel across the whole cluster the degree of parallelism. So what is the parallelism of a job? It depends on the framework's default configuration, and applications can also modify it dynamically at runtime.
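As a rough sketch of the default configuration mentioned above: on a cluster manager, Spark's `spark.default.parallelism` is typically derived from the total cores granted to the application, with a floor of 2. The function below models that rule; it is an approximation of the documented behavior, not Spark source code.

```python
# Sketch of how default parallelism can follow from resources:
# total cores across all Executors, but never less than 2
# (approximates spark.default.parallelism on a cluster manager).
def default_parallelism(num_executors, cores_per_executor):
    total_cores = num_executors * cores_per_executor
    return max(2, total_cores)

print(default_parallelism(3, 4))  # 12 tasks can truly run in parallel
print(default_parallelism(1, 1))  # floor of 2
```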

4.3.3 Directed Acyclic Graph (DAG)

[Figure: DAG example]

Big data computing engine frameworks are generally divided into four generations. The first generation is MapReduce, carried by Hadoop, which splits computation into two stages: a Map stage and a Reduce stage. Upper-layer applications had to find ways to split their algorithms into these two stages, and even chain multiple Jobs together at the application level to complete a full algorithm, such as an iterative computation. These drawbacks gave birth to DAG-based frameworks, so frameworks that support DAGs are classified as the second generation of computing engines, such as Tez and the higher-level Oozie. We will not go into the differences between the various DAG implementations here, but Tez and Oozie at that time were mostly used for batch-processing tasks. Next came the third generation of computing engines, represented by Spark. Its main features are DAG support within a Job (not across Jobs) and real-time computing.



The directed acyclic graph here is not a literal picture but a high-level abstraction of the data flow mapped directly from the Spark program.
Simply put, it represents the execution process of the whole computation graphically, which is more intuitive, easier to understand, and can express the program's topological structure.

A DAG (Directed Acyclic Graph) is a topological graph composed of vertices and edges; the edges have direction, and the graph never closes into a loop.
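A DAG of stages can be made concrete with a small sketch. The four-stage job below is hypothetical; the point is that a topological ordering (Kahn's algorithm) exists exactly when the graph has direction and no cycles, which is what lets an engine schedule each stage after its dependencies.

```python
from collections import deque

# Hypothetical 4-stage job: edges point from a stage to the
# stages that depend on its output.
edges = {
    "stage0": ["stage2"],
    "stage1": ["stage2"],
    "stage2": ["stage3"],
    "stage3": [],
}

def topo_order(graph):
    """Kahn's algorithm: succeeds iff the graph is a DAG."""
    indeg = {n: 0 for n in graph}
    for deps in graph.values():
        for d in deps:
            indeg[d] += 1
    queue = deque(n for n, deg in indeg.items() if deg == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for d in graph[n]:       # "complete" n, unblock its dependents
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    if len(order) != len(graph): # leftover nodes => a cycle exists
        raise ValueError("cycle detected: not a DAG")
    return order

print(topo_order(edges))  # a valid stage schedule, e.g. stage0/1 before stage2 before stage3
```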

4.4 Submission process

The so-called submission process is simply the process in which an application, written by developers according to requirements, is submitted through a Spark client to the Spark runtime environment for computation. The submission process is basically the same across deployment environments, with only subtle differences; we will not compare them in detail here. Because in domestic production work Spark applications are most often deployed in a Yarn environment, the submission process in this course is based on Yarn.
[Figure: Yarn submission flow]

When a Spark application is submitted to a Yarn environment for execution, there are generally two deployment and execution modes: Client and Cluster. The main difference between the two modes is where the Driver program runs.
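On the command line, the choice between the two modes is the `--deploy-mode` flag of `spark-submit`. The class name `org.example.Main` and `app.jar` below are placeholders.

```shell
# The same application submitted in the two Yarn modes; only
# --deploy-mode changes, which decides where the Driver runs.

# Client mode: Driver runs on the submitting machine (good for testing)
spark-submit --master yarn --deploy-mode client \
  --class org.example.Main app.jar

# Cluster mode: Driver runs inside the ApplicationMaster in the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class org.example.Main app.jar
```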

4.4.1 Yarn Client mode

In Client mode, the Driver module used for monitoring and scheduling runs on the client machine rather than inside Yarn, so this mode is generally used for testing.

  • ➢ The Driver runs on the local machine where the task is submitted.
  • ➢ After starting, the Driver communicates with the ResourceManager to request launching an ApplicationMaster.
  • ➢ The ResourceManager allocates a container and starts the ApplicationMaster on an appropriate NodeManager; this
    ApplicationMaster is responsible for requesting Executor memory from the ResourceManager.
  • ➢ After receiving the resource request from the ApplicationMaster, the ResourceManager allocates containers, and the
    ApplicationMaster then starts Executor processes on the NodeManagers designated by the resource allocation.
  • ➢ After each Executor process starts, it registers back with the Driver; once all Executors have registered, the Driver begins executing the main function.
  • ➢ Later, when an Action operator is executed, a Job is triggered and stages are divided according to wide dependencies. Each stage generates a corresponding
    TaskSet, and the tasks are then distributed to the Executors for execution.

4.4.2 Yarn Cluster mode

In Cluster mode, the Driver module used for monitoring and scheduling runs inside the Yarn cluster. This mode is generally used in actual production environments.

  • ➢ In YARN Cluster mode, after the task is submitted, the client communicates with the ResourceManager to request launching an
    ApplicationMaster.
  • ➢ The ResourceManager then allocates a container and starts the ApplicationMaster on an appropriate NodeManager;
    this ApplicationMaster is the Driver.
  • ➢ After starting, the Driver requests Executor memory from the ResourceManager. On receiving the resource request from the
    ApplicationMaster, the ResourceManager allocates containers and starts Executor processes on appropriate NodeManagers.
  • ➢ After each Executor process starts, it registers back with the Driver; once all Executors have registered, the Driver begins executing the main function.
  • ➢ Later, when an Action operator is executed, a Job is triggered and stages are divided according to wide dependencies. Each stage generates a corresponding
    TaskSet, and the tasks are then distributed to the Executors for execution.

Origin blog.csdn.net/Argonaut_/article/details/129504211