Detailed analysis of the underlying principles of Spark

Introduction to Spark

Apache Spark is a unified analytics engine for large-scale data processing. Because it computes in memory, it improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and scalability, and it lets users deploy Spark on top of a large amount of hardware to form a cluster.

The Spark source code has grown from roughly 400,000 lines in the 1.x releases to more than 1,000,000 lines today, with more than 1,400 contributors. The whole Spark framework is a huge project. Let's take a look at Spark's underlying execution principles.

Spark running process

(Figure: Spark running process)

The specific running process is as follows (a minimal driver example is sketched after the list):

  1. SparkContext registers with the resource manager and applies to it to launch Executors

  2. The resource manager allocates resources for the Executors and then starts them

  3. The Executors send heartbeats to the resource manager

  4. SparkContext builds the DAG (directed acyclic graph)

  5. The DAG is decomposed into Stages (TaskSets)

  6. The Stages are sent to the TaskScheduler

  7. The Executors request Tasks from SparkContext

  8. The TaskScheduler sends Tasks to the Executors to run

  9. At the same time, SparkContext distributes the application code to the Executors

  10. Tasks run on the Executors, and all resources are released when they finish
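To make the flow concrete, here is a minimal sketch of a driver program (the master URL and application name are placeholders, assuming a standalone cluster): creating the SparkContext corresponds to steps 1-3, the transformations to step 4, and the final action to steps 5-10.

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext registers the application with the
    // resource manager and requests Executors (steps 1-3 above).
    val conf = new SparkConf()
      .setAppName("minimal-driver")
      .setMaster("spark://master-host:7077") // hypothetical standalone master URL
    val sc = new SparkContext(conf)

    // Transformations only build up the DAG (step 4); nothing runs yet.
    val doubled = sc.parallelize(1 to 100).map(_ * 2)

    // The action triggers DAG -> Stage -> Task scheduling and execution (steps 5-10).
    println(doubled.reduce(_ + _))

    sc.stop() // release all resources
  }
}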

1. Building the DAG, from the code's perspective

val lines1 = sc.textFile(inputPath1).map(...).map(...)
val lines2 = sc.textFile(inputPath2).map(...)
val lines3 = sc.textFile(inputPath3)
val dtinone1 = lines2.union(lines3)
val dtinone = lines1.join(dtinone1)
dtinone.saveAsTextFile(...)
dtinone.filter(...).foreach(...)

The DAG diagram of the above code looks like this:

(Figure: Building the DAG graph)

The Spark kernel draws the directed acyclic graph of the computation path at the moment the computation actually needs to happen; that graph is the DAG shown above.

Spark's computation is triggered by an Action operation on an RDD. For all the Transformations before the Action, Spark only records the lineage of the RDDs and does not trigger any real computation.
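One way to see this without triggering a job is RDD.toDebugString, which prints the recorded lineage. A minimal sketch (the input path is a placeholder):

val lines1 = sc.textFile("/input/path1")      // hypothetical path
val lengths = lines1.map(line => line.length)

// Nothing has executed yet: Spark has only recorded the lineage.
// toDebugString prints that recorded lineage without running a job.
println(lengths.toDebugString)

// Only the Action below actually triggers computation.
println(lengths.count())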

2. The core algorithm for dividing the DAG into Stages

An Application can have multiple jobs and multiple Stages:

Within a Spark Application, different Actions can trigger many jobs, so one Application can contain many jobs. Each job is composed of one or more Stages, and later Stages depend on earlier ones; a Stage only runs after the Stages it depends on have finished computing.

Basis for the division:

Stages are divided at wide dependencies. Operators such as reduceByKey and groupByKey introduce wide dependencies.

Review of how narrow and wide dependencies are distinguished:
Narrow dependency: each partition of the parent RDD is depended on by at most one partition of the child RDD, i.e. a one-to-one or many-to-one relationship (think of it as an "only child"). Common narrow dependencies: map, filter, union, mapPartitions, mapValues, join (when the parent RDDs are co-partitioned with the same hash partitioner), etc.
Wide dependency: a partition of the parent RDD is depended on by multiple partitions of the child RDD, which involves a shuffle, i.e. a one-to-many relationship. Common wide dependencies: groupByKey, partitionBy, reduceByKey, join (when the parent RDDs are not co-partitioned), etc.
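These two kinds of dependencies can be inspected directly through rdd.dependencies. A small sketch (assuming the input has no pre-existing partitioner, so reduceByKey really does shuffle):

import org.apache.spark.{OneToOneDependency, ShuffleDependency}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val narrow = pairs.mapValues(_ + 1)     // narrow: each child partition reads one parent partition
val wide   = pairs.reduceByKey(_ + _)   // wide: requires a shuffle

// mapValues records a OneToOneDependency on its parent RDD, while
// reduceByKey (with no pre-existing partitioner) records a ShuffleDependency.
println(narrow.dependencies.head.isInstanceOf[OneToOneDependency[_]])    // true
println(wide.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]) // true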

Core algorithm: backtracking algorithm

Work backwards from the end of the lineage: when a narrow dependency is encountered, the RDD is added to the current Stage; when a wide dependency is encountered, the Stage is split.

The Spark kernel starts from the RDD that triggered the Action and traverses backwards. It first creates a Stage for the last RDD and then keeps pushing backwards; whenever it finds that an RDD has a wide dependency, it creates a new Stage for that wide-dependent RDD, which becomes the last RDD of the new Stage. It continues in this way, splitting Stages at wide dependencies and extending them across narrow dependencies, until all RDDs have been traversed.
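The following is a deliberately simplified, self-contained sketch of that backward walk over a toy lineage model. It is not the real DAGScheduler code (which works on ShuffleDependency objects and caches stages); it only illustrates splitting at wide dependencies and merging across narrow ones.

// Toy lineage model: an RDD node, its parents, and whether each edge is wide.
case class Node(name: String, parents: List[(Node, Boolean)]) // Boolean = isWideDependency

// Walk the lineage backwards from the final RDD. A narrow edge keeps the
// parent in the same stage; a wide edge makes the parent start a new stage.
def splitStages(last: Node): List[List[String]] = {
  def visit(node: Node): List[List[String]] = {
    val parentResults = node.parents.map { case (p, isWide) => (visit(p), isWide) }
    // Parents reached over narrow edges merge their own stage into this one;
    // anything further upstream of them is passed through unchanged.
    val currentStage = node.name :: parentResults.collect {
      case (stages, false) => stages.head
    }.flatten
    val upstreamStages = parentResults.flatMap {
      case (stages, true)  => stages        // wide edge: parent's stages stay separate
      case (stages, false) => stages.tail   // narrow edge: only its upstream stages remain
    }
    currentStage :: upstreamStages
  }
  visit(last)
}

// Example lineage: textFile -> flatMap -> map -(shuffle)-> reduceByKey
val textFile = Node("textFile", Nil)
val flatMap  = Node("flatMap", List((textFile, false)))
val mapNode  = Node("map", List((flatMap, false)))
val reduce   = Node("reduceByKey", List((mapNode, true)))

splitStages(reduce).foreach(s => println(s.reverse.mkString(" -> ")))
// prints: reduceByKey
//         textFile -> flatMap -> map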

3. Analysis of dividing the DAG into Stages

(Figure: The DAG divided into Stages)

A Spark program can have multiple DAGs: there is one DAG per Action, so a program with several Actions has several DAGs. The figure above ends with a single Action (not shown in the figure), so it corresponds to one DAG.

A DAG can have multiple Stages (divided at wide dependencies, i.e. shuffles).

The same Stage can have multiple Tasks executing in parallel (the number of Tasks equals the number of partitions; in the figure above, Stage1 has three partitions P1, P2 and P3, and correspondingly three Tasks).

It can be seen that the only wide dependency in this DAG is the reduceByKey operation, so the Spark kernel uses it as the boundary when dividing the DAG into Stages.

Notice also that in Stage1 of the figure, the steps from textFile to flatMap to map are all narrow dependencies, so they can be pipelined: a partition produced by flatMap can be passed on to map without waiting for the computation of the whole RDD to finish, which greatly improves computational efficiency.
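This pipelining can be checked informally from user code: narrow transformations keep the partition count, so the first stage runs one task per partition, and each task executes the whole flatMap -> map chain over its own partition. A small sketch:

val lines = sc.parallelize(Seq("a b", "b c", "c d"), numSlices = 3)
val words = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Narrow transformations preserve the partition count, so the first stage
// runs 3 tasks, each executing flatMap -> map as one pipeline over its own
// partition, without materializing an intermediate RDD.
println(words.getNumPartitions)   // 3

// reduceByKey introduces a shuffle dependency and therefore a new stage.
val counts = words.reduceByKey(_ + _)
counts.collect()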

4. Submit Stages

The submission of a scheduling stage is ultimately converted into the submission of a task set. DAGScheduler submits the task set through the TaskScheduler interface, which eventually triggers TaskScheduler to build a TaskSetManager instance to manage the life cycle of that task set. For DAGScheduler, the work of submitting the scheduling stage is then complete.

The concrete TaskScheduler implementation, once it has obtained computing resources, uses the TaskSetManager to schedule the individual tasks onto the corresponding Executor nodes for execution.

5. Monitor Job, Task, Executor

  1. DAGScheduler monitors Job and Task:

To ensure that interdependent scheduling stages can be scheduled and executed successfully, DAGScheduler needs to monitor the current scheduling stages and even the completion status of individual tasks.

This is done by exposing a series of callback functions. For TaskScheduler, these callbacks mainly cover the start, completion, and failure of tasks and the failure of task sets. Based on this life-cycle information, DAGScheduler maintains the state of jobs and scheduling stages.

  2. DAGScheduler monitors the life status of Executors:

TaskScheduler notifies DAGScheduler of the life status of each Executor through callbacks. If an Executor crashes, the output of the ShuffleMapTasks of the corresponding task set is marked as unavailable, which changes the state of that task set and causes the related computation tasks to be re-executed so that the lost data can be recovered.

6. Get the task execution result

  1. Results are returned to DAGScheduler:

After a specific task is executed in the Executor, the result needs to be returned to DAGScheduler in some form. Depending on the type of task, the return method of the task result is also different.

  2. Two kinds of results, intermediate and final:

For tasks corresponding to FinalStage, what is returned to DAGScheduler is the operation result itself.

For a ShuffleMapTask, which corresponds to an intermediate scheduling stage, what is returned to DAGScheduler is the relevant storage information wrapped in a MapStatus object, rather than the result itself. This storage location information serves as the basis for the next scheduling stage to fetch its input data.

  3. Two types, DirectTaskResult and IndirectTaskResult:

According to the size of the task result, the results returned by ResultTask are divided into two categories:

If the result is small enough, it is placed directly inside the DirectTaskResult object.

If the result exceeds a certain size, the DirectTaskResult is first serialized on the Executor side and the serialized result is stored in the BlockManager as a data block; the BlockID returned by the BlockManager is then wrapped in an IndirectTaskResult object and returned to TaskScheduler. TaskScheduler then uses TaskResultGetter to extract the BlockID from the IndirectTaskResult and finally obtains the corresponding DirectTaskResult through the BlockManager.
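A simplified sketch of that size-based decision, using toy types in place of Spark's internal classes (this is not the actual Executor code; in real Spark the cut-off is governed by settings such as spark.task.maxDirectResultSize):

// Toy stand-ins for Spark's internal result types (illustration only).
sealed trait ToyTaskResult
case class ToyDirectTaskResult(bytes: Array[Byte]) extends ToyTaskResult
case class ToyIndirectTaskResult(blockId: String, size: Long) extends ToyTaskResult

// Decide how to ship a task result back, mirroring the idea described above:
// small results go back directly, large ones go through the block manager
// and only a block reference travels over the wire.
def packageResult(serialized: Array[Byte],
                  maxDirectResultSize: Long,
                  storeInBlockManager: Array[Byte] => String): ToyTaskResult = {
  if (serialized.length <= maxDirectResultSize) {
    ToyDirectTaskResult(serialized)                    // small: send the bytes themselves
  } else {
    val blockId = storeInBlockManager(serialized)      // large: store as a block first
    ToyIndirectTaskResult(blockId, serialized.length)  // ship only the block reference
  }
}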

7. Overall view of task scheduling

A diagram illustrating the overall scheduling of tasks:

(Figure: Overall task scheduling)

Spark running architecture features

1. Executor process exclusive

Each Application gets its own dedicated Executor processes, which stay alive for the entire lifetime of the Application and run Tasks in a multi-threaded manner.

Spark Applications cannot share data with one another directly, unless the data is written to an external storage system. As shown in the figure:

(Figure: Executor processes exclusive to each Application)
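The sizing of this per-Application Executor pool is controlled by configuration. A sketch using real property keys with placeholder values (spark.executor.instances applies on YARN and Kubernetes):

import org.apache.spark.SparkConf

// One Application -> its own set of Executor JVM processes.
// Each Executor runs up to spark.executor.cores tasks concurrently
// as threads inside that single process.
val conf = new SparkConf()
  .setAppName("executor-sizing-example")
  .set("spark.executor.instances", "4")   // number of Executor processes (YARN/K8s)
  .set("spark.executor.cores", "4")       // task threads per Executor process
  .set("spark.executor.memory", "8g")     // heap per Executor process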

2. Support multiple resource managers

Spark is agnostic to the resource manager: it only needs to be able to obtain Executor processes and keep communicating with them.

Resource managers supported by Spark include Standalone, Mesos, YARN, and EC2. As shown in the figure:

(Figure: Supports multiple resource managers)
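In practice, the choice of resource manager shows up as the master URL the application is given. A sketch of the common forms (host names and ports are placeholders):

import org.apache.spark.SparkConf

// The same application code runs on different resource managers;
// only the master URL changes (hosts and ports below are placeholders).
val local      = new SparkConf().setMaster("local[*]")                  // run locally on all cores
val standalone = new SparkConf().setMaster("spark://master-host:7077")  // Spark Standalone
val yarn       = new SparkConf().setMaster("yarn")                      // Hadoop YARN
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")   // Apache Mesos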

3. Job submission proximity principle

The Client that submits the SparkContext should be close to the Worker nodes (the nodes that run the Executors), preferably in the same rack, because there is a great deal of communication between the SparkContext and the Executors while a Spark Application is running.

If you need to run against a remote cluster, it is better to submit the SparkContext to the cluster via RPC than to run the SparkContext far away from the Workers.

As shown in the figure:

(Figure: Job submission proximity principle)

4. Moving computation rather than moving data

Spark follows the principle of moving computation to the data rather than moving the data to the computation. Task scheduling applies data-locality and speculative-execution optimizations.

Key methods: taskIdToLocations, getPreferredLocations.

(Figure: Data locality)
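The preferred locations that drive this placement can be inspected from user code via the public RDD.preferredLocations method, which wraps the getPreferredLocations logic mentioned above. A small sketch (the HDFS path is a placeholder):

val lines = sc.textFile("hdfs:///path/to/input")   // hypothetical path

// For each partition, ask which hosts hold its data (e.g. the HDFS block
// locations); the scheduler tries to run the task on one of those hosts.
lines.partitions.foreach { p =>
  println(s"partition ${p.index} prefers: ${lines.preferredLocations(p).mkString(", ")}")
}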

