Spark Kernel Analysis - Overall Overview 1 (6)

1. Overall overview of Spark

1.1 Overall concept

Apache Spark is an open-source, general-purpose cluster computing system that provides a high-level programming API and supports three programming languages: Scala, Java, and Python. The Spark kernel is written in Scala; building on Scala's functional programming features, it provides abstractions at different levels of the computing stack, and the code design is excellent.

1.2 RDD abstraction

RDD (Resilient Distributed Dataset) is an in-memory abstraction of a distributed dataset that provides fault tolerance through restricted shared memory. At the same time, this memory model makes computation much more efficient than traditional data-flow models. An RDD has five important characteristics, as shown in the figure below:

[Figure: two RDDs performing a JOIN, illustrating the five characteristics of an RDD]
The figure above shows two RDDs performing a JOIN operation, which reflects the five main characteristics of an RDD, as follows:
1) A list of partitions
2) A function for computing each split (partition)
3) A list of dependencies on other RDDs
4) Optionally, a Partitioner for key-value RDDs (usually a HashPartitioner)
5) Optionally, a list of preferred locations for each split (for example, the block locations of an HDFS file)
With the above characteristics, an RDD can express a distributed dataset very well and serve as the basis for constructing a DAG: first the logical representation of a distributed computing task is abstracted, and then the task is processed and executed in the actual physical computing environment.
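These five characteristics map directly onto the members of Spark's RDD abstract class. The following is an abridged sketch: the member names follow org.apache.spark.rdd.RDD, but the visibility modifiers and surrounding machinery are simplified.

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Abridged sketch of the RDD contract; member names match the real RDD class,
// but bodies, visibility and many details are omitted.
abstract class SketchRDD[T] {
  // 1) a list of partitions
  protected def getPartitions: Array[Partition]
  // 2) a function for computing each split
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 3) a list of dependencies on other RDDs
  protected def getDependencies: Seq[Dependency[_]]
  // 4) optionally, a Partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
  // 5) optionally, preferred locations for each split (e.g. HDFS block locations)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```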

1.3 Computational abstraction

When describing the computing abstraction in Spark, we first need to understand the following concepts:
1) Application
The Spark program written by the user to complete a computing task. It consists of a Driver program and a set of Executors running on the Spark cluster.
2) Job
In the user program, each call to an action logically generates a Job, and a Job contains multiple Stages.
3) Stage
Stages fall into two categories: ShuffleMapStage and ResultStage. If the user program calls an operator that requires a shuffle, such as groupByKey, the computation is divided into ShuffleMapStage and ResultStage with the shuffle as the boundary.
4) TaskSet
A Stage can be mapped directly to a TaskSet. A TaskSet encapsulates the Tasks that need to be computed in one round and share the same processing logic; these Tasks can be computed in parallel. Coarse-grained scheduling is based on the TaskSet.
5) Task
The Task is the basic unit of execution on a physical node. Tasks fall into two categories: ShuffleMapTask and ResultTask, corresponding to a basic execution unit of a ShuffleMapStage and a ResultStage, respectively.
Next, let’s take a look at the relationship between the above basic concepts, as shown in the figure below:
[Figure: relationship between Application, Job, Stage, TaskSet, and Task]
In the figure above, for simplicity, each Job is assumed to require only one shuffle, so it corresponds to 2 Stages. In real applications, a Job may contain several Stages, or even a fairly complex Stage DAG.
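As a concrete illustration, consider a minimal word count (the input path is just a placeholder): the single collect action submits one Job, the reduceByKey shuffle splits it into a ShuffleMapStage and a ResultStage, and each Stage runs as many Tasks as its RDD has partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-stage-demo").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")   // assumed sample input path
      .flatMap(_.split("\\s+"))             // narrow dependency, same stage
      .map(word => (word, 1))               // narrow dependency, same stage
      .reduceByKey(_ + _)                   // shuffle: stage boundary

    counts.collect()   // the action: 1 Job = ShuffleMapStage + ResultStage
    sc.stop()
  }
}
```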
In Standalone mode, a simple FIFO scheduling strategy is used by default. The general scheduling process is shown in the figure below:

[Figure: FIFO scheduling of TaskSets in Standalone mode]
The user submits a Spark program, which ultimately generates TaskSets. During scheduling, a TaskSet (containing a set of Tasks that can be executed on physical nodes) is managed by a TaskSetManager. The TaskSets must be executed in order to guarantee the correctness of the computation, because TaskSets depend on one another sequentially (tracing back to ShuffleMapStage and ResultStage): only after all Tasks in one TaskSet have finished running can the Tasks in the next TaskSet be scheduled for execution.

Let's start with Executor and SchedulerBackend. An Executor is the process that actually executes tasks; it owns a number of CPU cores and some memory and can run computing tasks at thread granularity. It is the smallest unit that the resource management system can provide. SchedulerBackend is an interface provided by Spark that defines the handling of many Executor-related events, including: when a new executor registers, record its information, increase the global resource count (number of cores), and perform a makeOffer; when an executor reports a status update, for example a finished task, reclaim the core and perform a makeOffer; and other events such as stopping or removing an executor. The following discussion expands on makeOffer.

The purpose of makeOffer is to trigger, whenever resources are updated, a round of task allocation by calling the scheduler's resourceOffers method over the existing tasks, and finally to launch new tasks. The global scheduler here is TaskScheduler, whose implementation is TaskSchedulerImpl; it can work with various SchedulerBackend implementations, including standalone, YARN, and Mesos. When the SchedulerBackend performs a makeOffer, it hands the existing executor resources to the scheduler as a list of WorkerOffers, that is, per worker: the worker's information and its available resources. After the scheduler obtains these cluster resources, it traverses the submitted tasks and decides, based on locality, how to launch them.
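The flow can be sketched as follows. This is a simplified stand-in, not Spark's actual code: the real CoarseGrainedSchedulerBackend and WorkerOffer live in Spark's private scheduler package and track much more state; the class and field names here are illustrative.

```scala
import scala.collection.mutable

// Toy model of the makeOffer flow: the backend turns its view of executors
// into WorkerOffers, asks the scheduler for placements, and launches them.
case class WorkerOffer(executorId: String, host: String, freeCores: Int)
case class TaskDescription(taskId: Long, executorId: String)

trait SimpleTaskScheduler {
  def resourceOffers(offers: Seq[WorkerOffer]): Seq[TaskDescription]
}

class SimpleSchedulerBackend(scheduler: SimpleTaskScheduler) {
  private val executors = mutable.Map[String, (String, Int)]() // id -> (host, free cores)

  def registerExecutor(id: String, host: String, cores: Int): Unit = {
    executors(id) = (host, cores)
    makeOffers() // new resources registered: try to place waiting tasks
  }

  def statusUpdate(executorId: String, freedCores: Int): Unit = {
    executors.get(executorId).foreach { case (host, free) =>
      executors(executorId) = (host, free + freedCores) // reclaim cores of a finished task
    }
    makeOffers() // resources were freed: offer them to the scheduler again
  }

  private def makeOffers(): Unit = {
    // Hand the current free resources, per worker, to the global scheduler...
    val offers = executors.map { case (id, (host, cores)) => WorkerOffer(id, host, cores) }.toSeq
    // ...which decides, task by task and based on locality, what to launch where.
    scheduler.resourceOffers(offers).foreach(launchTask)
  }

  private def launchTask(task: TaskDescription): Unit =
    println(s"launching task ${task.taskId} on executor ${task.executorId}")
}
```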

In TaskScheduler, the resourceOffers method first sorts the submitted tasks by priority; there are currently two sorting algorithms, FIFO and FAIR. After obtaining the tasks to run, the next step is to allocate to them, in a reasonable way, the worker resources handed over by the SchedulerBackend. Before allocation, to avoid always assigning tasks to the first few workers, the WorkerOffer list is randomly shuffled. The tasks are then traversed to check whether a worker's resources are "sufficient" and "suitable" for each task; if so, the task is formally launched. Note that judging whether the resources are "sufficient" is easy: the number of CPUs required to launch each task is configured in the TaskScheduler (the default is 1), so one only needs to compare the core count and subtract 1 for each assignment while traversing. Whether the resources are "suitable" depends on each task's locality setting.
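Continuing the sketch above, a toy FIFO resourceOffers might look like this: it shuffles the offers, then hands out one pending task per free core, with the CPUs-per-task value fixed at 1 and locality ignored for brevity.

```scala
import scala.util.Random

// Toy FIFO scheduler matching the SimpleTaskScheduler trait from the previous sketch.
class FifoTaskScheduler(cpusPerTask: Int = 1) extends SimpleTaskScheduler {
  private var pending = List[Long]()               // task ids in FIFO order
  def submit(taskIds: Seq[Long]): Unit = pending ++= taskIds

  override def resourceOffers(offers: Seq[WorkerOffer]): Seq[TaskDescription] = {
    val launched = Seq.newBuilder[TaskDescription]
    // Shuffle the offers so the same few workers are not always picked first.
    for (offer <- Random.shuffle(offers)) {
      var freeCores = offer.freeCores
      while (freeCores >= cpusPerTask && pending.nonEmpty) {
        val taskId = pending.head
        pending = pending.tail
        launched += TaskDescription(taskId, offer.executorId)
        freeCores -= cpusPerTask                   // "sufficient?" is just a core-count check
      }
    }
    launched.result()
  }
}
```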

A task's locality has five levels, ranked from highest to lowest priority: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, and ANY. That is, running in the same process is best, the same node (i.e., machine) is second best, then the same rack, and finally anywhere will do. Each task has its own locality preference. What if the offered resources do not include the locality a task wants? Spark has a spark.locality.wait parameter, which defaults to 3000 ms; for the process, node, and rack levels, this value is used by default as the time to wait for a resource with the desired locality. So once a task requires locality, delay scheduling may be triggered.
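These waits can be tuned through real Spark configuration keys: spark.locality.wait is the base value, and per-level overrides exist for the process, node, and rack levels. The values below are just the defaults, shown for illustration.

```scala
import org.apache.spark.SparkConf

// Delay-scheduling waits; all four keys are real Spark settings.
val conf = new SparkConf()
  .setAppName("locality-demo")
  .set("spark.locality.wait", "3s")          // base wait before downgrading locality
  .set("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL
```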

At this point you have a general picture of task allocation and resource usage. In fact, the TaskScheduler's resourceOffers also triggers the TaskSetManager's resourceOffer method; the TaskSetManager's resourceOffer checks the task's locality and finally calls into the DAGScheduler to launch the task. The names of these classes and their mutual call relationships can look rather confusing, so let me briefly summarize.

This starts with how Spark cuts the DAG. Spark RDDs are strung together into a DAG by transformation and action operations; calling an action triggers the submission of the DAG and the execution of the whole job. Once triggered, the DAGScheduler, the globally unique, stage-oriented DAG scheduler, splits the DAG into multiple smaller DAGs, i.e. stages, using shuffles as the boundaries. RDDs connected only by narrow dependencies are grouped into the same stage, where the operations correspond to MapTasks and the degree of parallelism is the number of partitions of the RDDs; whenever an operation with a wide dependency is encountered, a new stage is cut, and the operations of the final stage correspond to ResultTasks, whose degree of parallelism is the number of partitions of the resulting RDD. MapTask and ResultTask can be loosely understood as the Map and Reduce of traditional MapReduce, and the basis for dividing them is essentially the shuffle; consequently, before a shuffle, a large number of map operations can be performed within the same partition.

Each stage corresponds to multiple MapTasks or multiple ResultTasks, and the tasks of a stage are collected into a TaskSet. A TaskSetManager manages the running state of these tasks and handles locality (for example, delay scheduling when needed). The TaskSetManager is how Spark manages its own tasks, i.e. task threads, and this layer is decoupled from the underlying resource management. The resourceOffer method of TaskSetManager mentioned above is where tasks interact with the underlying resources, and the coordinator of that interaction is the TaskScheduler, which is also global. The TaskScheduler plugs into different SchedulerBackend implementations (such as mesos, yarn, standalone) in order to connect to different resource management systems. The resource management systems, in turn, are responsible for processes: how many processes are started on each worker, and how much resource each process gets. So the two layers are cleanly separated: Spark manages thread-level tasks within its own computing framework; each stage has a TaskSet, which is itself a small DAG that can be thrown into the globally available resource pool to run. The lower, resource-management layer controls process-level executors and does not care how tasks are placed or how they are running; that is the TaskSetManager's business. The coordinator between the two layers is the TaskScheduler together with the SchedulerBackend implementation inside it.
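The stage cut can be observed directly from an RDD's dependencies: narrow dependencies show up as OneToOneDependency, while reduceByKey introduces a ShuffleDependency, and toDebugString prints the resulting two-stage lineage. The snippet below reuses the word-count pipeline from the earlier example and assumes sc is already in scope.

```scala
// Assumes sc (a SparkContext) and an "input.txt" sample file, as in the earlier example.
val pairs  = sc.textFile("input.txt").flatMap(_.split("\\s+")).map((_, 1))
val counts = pairs.reduceByKey(_ + _)

println(pairs.dependencies)   // a OneToOneDependency -> narrow, same stage
println(counts.dependencies)  // a ShuffleDependency  -> wide, new stage
println(counts.toDebugString) // indented lineage showing the shuffle boundary
```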

The implementations of SchedulerBackend, excluding local mode, are divided into fine-grained and coarse-grained. Fine-grained is implemented only by Mesos (Mesos supports both coarse and fine granularity); coarse-grained implementations include YARN, Mesos, and standalone. Take standalone mode as a coarse-grained example: each physical machine is a worker, and a worker has a total amount of CPU and memory it can use; at startup you can specify how many executors, i.e. processes, each worker runs, and how much CPU and memory each executor gets. In my view, the main difference between coarse and fine granularity is that a coarse-grained process is long-running and computing threads can be handed to the executor to run, but the executor's CPU and memory are more likely to be wasted. With fine granularity, reuse and preemption become possible, along with other mechanisms that are more demanding but improve resource utilization. These two concepts were first proposed in AMPLab papers and implemented in Mesos; AMPLab has many papers in the area of resource usage granularity and even optimal task allocation, including Mesos's DRF algorithm and the Sparrow scheduler. Therefore, in standalone mode, based on the number of partitions of an RDD and the number of CPUs each task requires, it is easy to calculate the load and resource consumption of each physical machine, and even how many batches a TaskSet needs to run in to complete a stage.
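In standalone mode, this coarse-grained sizing is expressed through ordinary Spark configuration keys; the keys below are real, while the master URL and the concrete numbers are made up for illustration.

```scala
import org.apache.spark.SparkConf

// Coarse-grained resource sizing for standalone mode.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // hypothetical master address
  .setAppName("standalone-sizing-demo")
  .set("spark.executor.cores", "4")       // cores per executor process
  .set("spark.executor.memory", "8g")     // memory per executor process
  .set("spark.cores.max", "32")           // total cores the app may take from the cluster
// With 32 max cores, 4 cores per executor and 1 CPU per task (spark.task.cpus default),
// at most 32 tasks run concurrently, so a 128-partition stage runs in about 4 waves.
```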

1.4 Cluster mode

When the Spark cluster was designed, resource management was not a closed design; instead, integration with more powerful resource management systems, such as YARN and Mesos, was fully considered from the start. Therefore, the Spark architecture abstracts resource management into a separate layer; through this abstraction, a pluggable resource management module suited to an enterprise's current technology stack can be built, providing different resource allocation and scheduling strategies for different computing scenarios. The Spark cluster mode architecture is shown in the figure below:

[Figure: Spark cluster mode architecture]
As shown in the figure above, the Cluster Manager of a Spark cluster currently supports the following three modes:
1) Standalone mode
Standalone mode is the cluster management mode implemented inside Spark by default. In this mode, the Master in the cluster manages resources uniformly, and the party that negotiates resource requests with the Master is the StandaloneSchedulerBackend inside the Driver (it is actually its internal StandaloneAppClient that communicates with the Master), which will be explained in detail later.
2) YARN mode
In YARN mode, resource management is handed over to the ResourceManager of the YARN cluster. Choosing this mode can fit an enterprise's existing technology stack to a greater extent, for example when the enterprise has already built its big data processing platform on the Hadoop stack.
3) Mesos mode
As Apache Mesos continues to mature, some companies are already trying to use Mesos to build a data center operating system (DCOS). Spark running on Mesos can support both fine-grained and coarse-grained resource scheduling strategies (an advantage of Mesos), and can also fit well into an enterprise's existing technology stack.
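Which cluster manager is used is determined simply by the master URL passed to Spark; the host names and ports below are placeholders.

```scala
import org.apache.spark.SparkConf

// Selecting the cluster manager through the master URL.
val standalone = new SparkConf().setMaster("spark://master-host:7077")
val onYarn     = new SparkConf().setMaster("yarn")                    // reads HADOOP_CONF_DIR / YARN_CONF_DIR
val onMesos    = new SparkConf().setMaster("mesos://mesos-host:5050")
```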
So how does Spark realize this important design decision? In other words, how does it ensure that Spark can easily connect to third-party resource management systems? Let's take a deeper look at the class design level, as shown in the class diagram below:

[Class diagram: Task scheduling and resource management (TaskScheduler, SchedulerBackend)]
It can be seen that Task scheduling depends directly on SchedulerBackend, and SchedulerBackend interacts with the actual resource management module to request resources. Here, CoarseGrainedSchedulerBackend is the most important abstraction related to resource scheduling in Spark: it abstracts the logic of communicating with the TaskScheduler, and at the same time it must be able to interact seamlessly with various third-party resource management systems. Internally, CoarseGrainedSchedulerBackend handles resource requests through a ResourceOffer approach.

1.5 RPC network communication abstraction

The Spark RPC layer is designed and developed on top of the excellent network communication framework Netty, but Spark provides a good abstraction to shield the underlying communication details, and the design also leaves room for extensibility: for example, if new RPC requirements arise that are not based on the Netty communication framework, they can be accommodated without affecting the design of the upper layers. The RPC layer design is shown in the class diagram below:

[Class diagram: RPC layer (RpcEnv, RpcEndpoint, RpcEndpointRef)]
Any two Endpoints can only communicate through messages, and an RpcEndpoint is implemented together with an RpcEndpointRef: to communicate with an RpcEndpoint, you need to obtain the RpcEndpointRef corresponding to that RpcEndpoint, and the logic for creating RpcEndpoints and RpcEndpointRefs and managing their communication is handled uniformly by the RpcEnv object.
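The idea can be sketched with a toy, in-process model. Spark's real RPC classes (RpcEnv, RpcEndpoint, RpcEndpointRef in org.apache.spark.rpc) are internal and built on Netty; the sketch below only mirrors the shape of the abstraction, with no networking at all.

```scala
import scala.collection.mutable

// Toy mirror of Spark's RPC abstraction: endpoints receive messages,
// refs are the only way to reach an endpoint, and RpcEnv manages both.
trait RpcEndpoint {
  def receive: PartialFunction[Any, Unit]   // handles one-way messages
}

class RpcEndpointRef(endpoint: RpcEndpoint) {
  def send(msg: Any): Unit = endpoint.receive.applyOrElse(msg, (_: Any) => ())
}

class RpcEnv {
  private val endpoints = mutable.Map[String, RpcEndpointRef]()

  // Register an endpoint under a name and hand back the ref used to reach it.
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
    val ref = new RpcEndpointRef(endpoint)
    endpoints(name) = ref
    ref
  }
  def endpointRef(name: String): RpcEndpointRef = endpoints(name)
}

object RpcDemo extends App {
  case class Echo(text: String)

  val rpcEnv = new RpcEnv
  rpcEnv.setupEndpoint("echo", new RpcEndpoint {
    def receive = { case Echo(text) => println(s"got: $text") }
  })
  // Any other endpoint talks to "echo" only through its RpcEndpointRef.
  rpcEnv.endpointRef("echo").send(Echo("hello"))
}
```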

1.6 Start the Standalone cluster

In Standalone mode, the Spark cluster adopts a simple Master-Slave architecture, in which the Master manages all Workers uniformly. This pattern is very common. Let's briefly look at the basic process of starting a Spark Standalone cluster, as shown in the figure below:

[Figure: Spark Standalone cluster startup and Worker registration sequence]
It can be seen that the Spark cluster communicates via messages, i.e. an EDA (event-driven architecture) style. With the elegant design of the RPC layer, any two Endpoints that want to communicate simply send messages carrying data. The process in the figure above is as follows:
1) When the Master starts, it first creates an RpcEnv object, which is responsible for managing all communication logic.
2) The Master creates an Endpoint through the RpcEnv object; the Master itself is an Endpoint, and Workers can communicate with it.
3) When a Worker starts, it also creates an RpcEnv object.
4) The Worker creates an Endpoint through its RpcEnv object.
5) The Worker establishes a connection to the Master through the RpcEnv object and obtains an RpcEndpointRef object, through which it can communicate with the Master.
6) The Worker registers with the Master; the registration content includes the host name, port, number of CPU cores, and amount of memory.
7) The Master receives the Worker's registration and keeps the registration information in a table in memory, which also contains an RpcEndpointRef object referring to the Worker.
8) The Master replies to the Worker that it has received the registration, informing the Worker that registration succeeded.
9) At this point, if a user submits a Spark program, the Master needs to coordinate and start a Driver; and after the Worker receives the successful-registration response, it starts sending heartbeats to the Master periodically.
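Steps 6) to 9) can be sketched with message case classes whose names echo Spark's internal deploy messages; the fields and the Master endpoint below are simplified stand-ins, not the real implementation.

```scala
import scala.collection.mutable

// Toy version of the registration handshake in steps 6)-9).
case class RegisterWorker(workerId: String, host: String, port: Int, cores: Int, memoryMb: Int)
case class RegisteredWorker(masterUrl: String)
case class Heartbeat(workerId: String)

class MasterEndpoint(masterUrl: String) {
  // Step 7: the in-memory table of registered workers
  // (real Spark also keeps each worker's RpcEndpointRef here).
  private val workers = mutable.Map[String, RegisterWorker]()

  def receive: PartialFunction[Any, Unit] = {
    case reg @ RegisterWorker(id, host, port, cores, mem) =>
      workers(id) = reg
      // Step 8: acknowledge the registration back to the worker.
      println(s"worker $id at $host:$port registered ($cores cores, ${mem}MB); replying ${RegisteredWorker(masterUrl)}")
    case Heartbeat(id) =>
      // Step 9: workers send heartbeats periodically once registered.
      println(s"heartbeat from $id")
  }
}
```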

1.7 Core components

At runtime, when the cluster is processing computing tasks (user-submitted Spark programs), the core top-level components are the Driver and the Executors. Internally they manage many important components that cooperate to complete the computing tasks. The core component stack is shown in the figure below:

[Figure: core component stack of Driver and Executor]
Driver and Executor are both components created at runtime. Once the user program ends, they release their resources and wait for the next user program to be submitted to the cluster for scheduling. The figure above lists most of the components; among them, SparkEnv is a heavyweight component that contains the main components needed in the computation process, and many of the components required by both the Driver and the Executors live inside SparkEnv. We will not go into detail here; the responsibilities of most components will be explained later in the interaction flows.

1.8 Core component interaction process

In Standalone mode, the interactions between the various components in Spark are still fairly complex, but for a general-purpose distributed computing system these interactions are essential and fundamental. To understand the main interaction flows between components, we first state a few basic facts:
An Application starts one Driver.
A Driver is responsible for tracking and managing all resource status and task status while the Application runs.
A Driver manages a group of Executors.
An Executor only executes Tasks belonging to its Driver.
The main interaction flows between the core components are shown in the figure below:

[Figure: core component interaction flows, distinguished by color]
In the figure above, different colors or line styles mark the following six core interaction flows, which we explain in detail:
Orange: submitting the user's Spark program
The user submits a Spark program; the main process is as follows:
1) The user submits the Spark program through the spark-submit script, and a ClientEndpoint object is created, which is responsible for communicating with the Master.
2) ClientEndpoint sends a RequestSubmitDriver message to the Master, indicating that a user program is being submitted.
3) The Master receives the RequestSubmitDriver message and replies to ClientEndpoint with SubmitDriverResponse, indicating that the user program has been registered.
4) ClientEndpoint sends a RequestDriverStatus message to the Master to query the Driver's status.
5) If the Driver for the current user program has been started, ClientEndpoint exits directly, completing the submission of the user program.
Purple: starting the Driver process
When the user submits a Spark program, a Driver must be started to run the program's computation logic and complete the computing task, so the Master needs to launch a Driver. The specific process is as follows:
1) The Master maintains in memory the Applications submitted by users for computation; every time this in-memory structure changes, scheduling is triggered and a LaunchDriver request is sent to a Worker.
2) When the Worker receives the LaunchDriver message, it starts a DriverRunner thread to carry out the LaunchDriver task.
3) The DriverRunner thread starts a new JVM instance on the Worker; a Driver process runs inside this JVM, and the Driver creates a SparkContext object.
Red: registering the Application
After the Driver starts, it creates a SparkContext object, initializes the basic components needed for computation, and registers the Application with the Master. The process is as follows:
1) Create a SparkEnv object, which creates and manages some basic components.
2) Create a TaskScheduler, responsible for Task scheduling.
3) Create a StandaloneSchedulerBackend, responsible for negotiating resources with the Cluster Manager.
4) Create a DriverEndpoint, through which other components can communicate with the Driver.
5) Inside the StandaloneSchedulerBackend, create a StandaloneAppClient, responsible for handling communication with the Master.
6) StandaloneAppClient creates a ClientEndpoint, which actually communicates with the Master.
7) ClientEndpoint sends a RegisterApplication message to the Master to register the Application.
8) After the Master receives the RegisterApplication request, it replies to ClientEndpoint with a RegisteredApplication message, indicating that registration succeeded.
Blue: starting the Executor processes
1) The Master sends a LaunchExecutor message to a Worker, requesting it to start an Executor; at the same time, the Master sends an ExecutorAdded message to the Driver, indicating that the Master has added an Executor (not yet started).
2) When the Worker receives the LaunchExecutor message, it starts an ExecutorRunner thread to carry out the LaunchExecutor task.
3) The Worker sends an ExecutorStateChanged message to the Master, notifying it that the Executor's state has changed.
4) The Master sends an ExecutorUpdated message to the Driver; at this point the Executor has started.
Pink: starting Task execution
1) StandaloneSchedulerBackend starts a DriverEndpoint.
2) After the DriverEndpoint starts, it periodically checks the state of the Executors maintained by the Driver; if there are idle Executors, Tasks are scheduled onto them for execution.
3) The DriverEndpoint issues a resource offer to the TaskScheduler.
4) If there are resources available to start a Task, the DriverEndpoint sends a LaunchTask request to the Executor.
5) Inside the Executor process, CoarseGrainedExecutorBackend calls the launchTask method of its internal Executor instance to start the Task.
6) The Executor instance maintains a thread pool internally; it creates a TaskRunner and submits it to the thread pool for execution.
Green: Task completion
1) The Executor instance inside the Executor process notifies CoarseGrainedExecutorBackend that the Task has finished.
2) CoarseGrainedExecutorBackend sends a StatusUpdate message to the DriverEndpoint, notifying the Driver that the Task's running state has changed.
3) StandaloneSchedulerBackend calls the TaskScheduler's statusUpdate method to update the Task's status.
4) StandaloneSchedulerBackend then calls the TaskScheduler's resourceOffers method again to schedule other Tasks to run.

1.9 Block management

Block management mainly provides support for the Broadcast mechanism offered by Spark, which uses TorrentBroadcast by default. The data (Task data) or data set (such as an RDD) corresponding to a Broadcast variable is by default split into Blocks of 4MB. When the Broadcast variable is read while a Task is running, data is pulled in units of Blocks, the smallest unit of transfer, and finally all Blocks are merged back into the complete data or data set of the Broadcast variable. Splitting the data into 4MB Blocks and having Tasks pull Blocks from multiple Executors balances the network transmission load well and improves the stability of the whole computing cluster.
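Typical broadcast usage looks like the snippet below; spark.broadcast.blockSize is the real configuration key behind the 4MB block splitting described above (its default is already 4m, so it is set here only for illustration).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Broadcast a small lookup table; tasks read it locally instead of
// shipping it with every closure.
val conf = new SparkConf()
  .setAppName("broadcast-demo")
  .setMaster("local[*]")
  .set("spark.broadcast.blockSize", "4m")   // default value, shown explicitly

val sc = new SparkContext(conf)
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // driver-side value, shipped in blocks

val data = sc.parallelize(Seq("a", "b", "a"))
val resolved = data.map(k => lookup.value.getOrElse(k, 0))  // tasks read the broadcast locally
println(resolved.collect().mkString(","))
```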

Usually, in the user program a certain variable is broadcast, and it is called a broadcast variable. When a Task executes on the Executor of an actual physical node and needs to read the data set corresponding to the Broadcast variable, the data set already produced upstream in the DAG execution flow is pulled as needed. Using the Broadcast mechanism can effectively reduce the cost of data transfer in a computing cluster. Specifically, if a Broadcast variable in a user's program corresponds to a data set, that data must be pulled during computation; if multiple Tasks running on the same physical node all need this data, then with the Broadcast mechanism the data only needs to be pulled once and stored on the local machine's disk, where it is shared by the computations of the multiple Tasks.

In addition, during scheduling, the user program's Task computation logic (code) is moved to the chosen Worker node according to the scheduling strategy; the ideal case is to process local data. The code (in serialized form) therefore also needs to be transmitted over the network, and this is likewise done through the Broadcast mechanism. However, in this case the code is first serialized on the Worker node where the Driver resides; if a Task subsequently executes on another Worker, it needs to read the Broadcast variable containing the code: the first such Task pulls the code data from the Driver, and Tasks scheduled later may pull the code data directly from Executors on other Workers.

We take the Broadcast variable taskBinary as an example to illustrate how Blocks are managed, as shown in the figure below:

[Figure: Block management for the Broadcast variable taskBinary]
In the figure above, the Driver keeps track of which Executors hold the data corresponding to the Broadcast variable, that is, for each Executor it maintains a list of Blocks. When a Task runs in an Executor and uses the Broadcast variable taskBinary, if the corresponding data is not available locally, the Executor asks the Driver for the data of the Broadcast variable, which includes the list of Executors holding one or more of its Blocks. The Executor then directly requests those Executors from the list returned by the Driver, pulling the Blocks through the underlying BlockTransferService component. The pulled Blocks are cached locally, and the Executor reports the Blocks it now holds to the Driver, so that other Executors can obtain the Broadcast variable's data when executing their Tasks.

1.10 Overall application

The user submits a program through spark-submit or runs the spark-shell REPL; the cluster creates the Driver, and the Driver loads the Application. Finally, the Application is converted into RDDs according to the user code, the RDDs are decomposed into Tasks, and the Executors execute the Tasks, among other things. The overall interaction blueprint is as follows:

[Figure: overall application interaction blueprint]
1) When the Client runs, it sends a request to the Master to start a Driver (sends the RequestSubmitDriver command).
2) The Master schedules available Worker resources on which to install the Driver (sends the LaunchDriver command).
3) The Worker runs a DriverRunner to load the Driver, and the Driver sends an application registration request to the Master (sends the RegisterApplication command).
4) The Master schedules available Worker resources on which to install the application's Executors (sends the LaunchExecutor command).
5) Once installed, the Executors register their available resources with the Driver (send the RegisterExecutor command).
6) Finally, when the user code runs, it is encapsulated by the DAGScheduler and TaskScheduler into executable TaskSetManager objects.
7) The TaskSetManager objects are matched against the Executor resources held by the Driver, and Tasks are launched on the matching Executors (sends the LaunchTask command).
8) After a TaskRunner finishes, the result is submitted via the DriverRunner back to the DAGScheduler, and step 7 repeats until the whole job is complete.

Source: blog.csdn.net/qq_44696532/article/details/135390025