Understanding the principles of Spark's highly flexible architecture

Abstract: Compared with MapReduce's rigid two-phase Map and Reduce model, Spark's computing framework is far more flexible and delivers better runtime performance.

This article is shared from the Huawei Cloud Community article "Spark Architecture Principles", by author JavaEdge.

Compared with MapReduce's rigid two-phase Map and Reduce model, Spark's computing framework is far more flexible and delivers better runtime performance.

Spark's Computation Stages

  • MapReduce: an application runs only one map phase and one reduce phase at a time
  • Spark: an application can be split into many more computation stages according to its complexity; these stages form a directed acyclic graph (DAG), and the Spark task scheduler executes them according to the DAG's dependencies

For machine learning workloads such as logistic regression, Spark can run more than 100 times faster than MapReduce. Some machine learning algorithms require many iterations, which can produce tens of thousands of computation stages; Spark processes all of them inside a single application instead of launching tens of thousands of separate applications as MapReduce would, so it runs far more efficiently.
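
As an illustration, here is a hedged, self-contained sketch of why iterative algorithms benefit (it assumes a local run; the toy objective and numbers are made up and merely stand in for logistic regression): the dataset is cached once and reused across iterations, and every iteration just adds more stages to the same application rather than submitting a whole new job the way a MapReduce-based workflow would.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeDemo").setMaster("local[2]"))

    // Cache the training data once; every iteration below re-reads it from memory.
    val xs = sc.parallelize(1 to 100000).map(_.toDouble).cache()

    // Toy gradient descent on f(w) = mean((w - x)^2): each pass adds stages to the
    // same DAG instead of launching a new application.
    var w = 0.0
    for (_ <- 1 to 50) {
      val grad = xs.map(x => 2.0 * (w - x)).mean()
      w -= 0.1 * grad
    }
    println(s"fitted w = $w")
    sc.stop()
  }
}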

A DAG is a directed acyclic graph: the dependencies between stages are directed, and computation can only proceed along the direction of those dependencies; a stage cannot start until every stage it depends on has finished. The dependencies must not form a cycle, otherwise the job would loop forever.

The different stages of a typical Spark job's DAG:

The whole application is divided into three stages. Stage 3 depends on stages 1 and 2, while stages 1 and 2 are independent of each other. When Spark schedules this job, stages 1 and 2 are executed first, and stage 3 runs after they complete. The corresponding Spark pseudocode:

val rddB = rddA.groupBy(key)    // wide dependency: shuffle, computed in stage 1
val rddD = rddC.map(func)       // narrow dependency: stage 2
val rddF = rddD.union(rddE)     // narrow dependency: still stage 2
val rddG = rddB.join(rddF)      // wide dependency: shuffle, computed in stage 3

The core of Spark's job scheduling and execution is therefore the DAG. The whole application is divided into several stages, and the dependencies between the stages are clear. Within each stage, a task set (TaskSet) is generated according to the amount of data to be processed, and each task is handed to a task process to execute. This is how Spark performs distributed computation over big data.
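
As a hedged sketch (it assumes a live SparkContext named sc, and the small literal datasets are placeholders for the RDDs in the diagram), RDD.toDebugString prints an RDD's lineage, and the indentation in its output roughly marks the shuffle boundaries where the DAGScheduler will cut new stages:

val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)))
val rddC = sc.parallelize(Seq(("a", 3), ("c", 4)))
val rddE = sc.parallelize(Seq(("b", 5)))

val rddB = rddA.groupBy(_._1)               // wide: shuffle, stage boundary
val rddF = rddC.map(identity).union(rddE)   // narrow: no new stage
val rddG = rddB.join(rddF)                  // wide: shuffle, stage boundary
println(rddG.toDebugString)                 // prints the lineage with its stage structure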

The component responsible for generating and managing a Spark application's DAG is the DAGScheduler. The DAGScheduler builds the DAG from the program code, distributes the program to the distributed computing cluster, and schedules the computation stages for execution in dependency order.

So on what basis does Spark divide the computation stages? Obviously, not every transformation on an RDD generates a new stage: the example above has four transformation functions but only three stages.

Look at the DAG diagram above again; the stage boundaries can be read directly from it. A new stage is generated wherever the transformation lines between RDDs form a many-to-many cross-connection. An RDD represents a dataset; each RDD in the figure contains several small blocks, and each small block represents one partition (shard) of that RDD.

The data in the partitions of one dataset needs to be redistributed by partition and written into different partitions of another dataset. We saw the same cross-partition data transfer when looking at how MapReduce runs.

Yes, this is the shuffle. Spark also relies on the shuffle to regroup data, bringing records with the same key together for aggregation, joins and other operations, and every shuffle produces a new computation stage. This is also why stages have dependencies: the data a stage needs comes from one or more earlier stages, so it must wait for those stages to finish before it can shuffle and fetch that data.

Stages are therefore divided on the basis of shuffle, not on the type of transformation function; the same function sometimes triggers a shuffle and sometimes does not. In the example above, RDD B and RDD F are joined to produce RDD G: here RDD F needs to be shuffled, but RDD B does not.

That is because RDD B was already partitioned by the shuffle in stage 1, and both the number of partitions and the partitioning key stay the same, so no further shuffle is needed (a sketch follows the list below):

  • A dependency that requires no shuffle is called a narrow dependency in Spark
  • A dependency that requires a shuffle is called a wide dependency
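
A hedged sketch of the same idea (assuming a live SparkContext sc; the data and the partition count are made up): an RDD that has already been hash-partitioned keeps its partitioner, so its side of a later join on the same key and partition count is a narrow dependency, while the unpartitioned side must be shuffled, i.e. a wide dependency.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)
val pairsA = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .partitionBy(partitioner)                    // shuffled once here; the partitioner is retained
val pairsB = sc.parallelize(Seq(("a", 3), ("c", 4)))

val joined = pairsA.join(pairsB, partitioner)  // pairsA side: narrow; pairsB side: wide (shuffled)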

As with MapReduce, the shuffle is essential to Spark: only through the shuffle can related data be brought together and computed against each other.

Since shuffle is required, why is Spark more efficient?

Essentially, Spark is a different implementation of the MapReduce computing model. Hadoop MapReduce is simple and crude: based on the shuffle, it splits a big-data computation into exactly two phases, Map and Reduce. Spark splits things more finely, stitching the Reduce of one round and the Map of the next into a single stage of continuous computation, which makes for a more elegant and efficient computing model even though the essence is still map and reduce. By chaining multiple dependent computation stages inside one job, this scheme greatly reduces accesses to HDFS and the number of job scheduling rounds, so execution is much faster.

Also unlike Hadoop MapReduce, which mainly stores shuffle data on disk, Spark prefers to keep data in memory, including RDD data, and falls back to disk only when memory is insufficient. Using memory as much as possible is the other reason Spark outperforms Hadoop.
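
As a hedged sketch of this memory-first behaviour (assuming a live SparkContext sc; the file paths are placeholders), the RDD persist API lets you choose how cached data is kept; MEMORY_AND_DISK prefers memory and spills to disk only when memory runs out.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///tmp/app.log")          // placeholder input path
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)           // keep in memory, spill to disk only if needed
println(errors.count())                                // first action materializes the cached partitions
println(errors.filter(_.contains("timeout")).count())  // later actions reuse them from memory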

Spark job management

There are two types of RDD functions in Spark:

  • Transformation functions return another RDD when called; an RDD's computation logic is mainly expressed through transformations
  • Action functions do not return an RDD. For example, count() returns the number of elements in the RDD, and saveAsTextFile(path) writes the RDD's data out to the given path. When Spark's DAGScheduler encounters a shuffle it generates a new computation stage, and when it encounters an action function it generates a job (see the sketch below)
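
A minimal sketch of this behaviour (assuming a live SparkContext sc; the input and output paths are placeholders): transformations only build up the RDD lineage, and each action submits a separate job.

val wordCounts = sc.textFile("input.txt")    // placeholder path; nothing runs yet
  .flatMap(_.split("\\s+"))                  // transformation
  .map(w => (w, 1))                          // transformation
  .reduceByKey(_ + _)                        // transformation with shuffle -> stage boundary, still no job

println(wordCounts.count())                  // action: submits job 1
wordCounts.saveAsTextFile("wordcounts-out")  // action: submits job 2 (placeholder output path)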

For each data partition (shard) of an RDD, Spark creates one computing task to process it, so a computation stage contains multiple computing tasks.
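
For example, a rough sketch (again assuming a live SparkContext sc): an RDD with 8 partitions yields 8 parallel tasks in each stage that computes it.

val nums = sc.parallelize(1 to 100000, 8)    // explicitly request 8 partitions
println(nums.getNumPartitions)               // 8, so 8 tasks per stage over this RDD
println(nums.map(_ * 2).sum())               // the map stage runs as 8 tasks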

The dependencies and temporal relationships among jobs, computation stages, and tasks:

Time runs along the horizontal axis and tasks along the vertical axis. The area between two thick black lines is a job, and the area between two thin lines is a computation stage. A job contains at least one computation stage. Each horizontal red line is a task; every stage consists of many tasks, and those tasks form a task set.

After the DAGScheduler builds the DAG from the code, Spark schedules work in units of tasks, assigning the tasks to different machines in the distributed cluster for execution.

Spark execution flow

Spark supports multiple deployment schemes, including Standalone, YARN, Mesos, and Kubernetes (K8s). The principles are similar; only the naming of the roles of the different components differs.
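
As a hedged configuration sketch (the host names and ports below are placeholders, and only the local-mode line is active), the deployment scheme an application uses is selected through the master URL passed to SparkConf (or to spark-submit via --master):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ClusterDemo")
  .setMaster("local[*]")                              // local mode, for testing
// .setMaster("spark://master-host:7077")             // Standalone cluster manager (placeholder host)
// .setMaster("yarn")                                 // YARN; cluster location comes from the Hadoop config
// .setMaster("mesos://mesos-master:5050")            // Mesos (placeholder host)
// .setMaster("k8s://https://k8s-apiserver:6443")     // Kubernetes (placeholder API server)
val sc = new SparkContext(conf)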

Spark cluster components:

First, the Spark application starts in its own JVM process, the Driver. Once started, it calls SparkContext to initialize the execution configuration and input data. SparkContext then starts the DAGScheduler, which constructs the DAG to be executed and splits it into the smallest execution units, the computing tasks.

Next, the Driver requests computing resources from the Cluster Manager for the distributed computation of the DAG. After receiving the request, the Cluster Manager sends information such as the Driver's host address to all Worker nodes in the cluster.

On receiving this information, each Worker uses the Driver's host address to communicate with and register with the Driver, and then reports how many tasks it can take based on its free resources. The Driver then starts assigning tasks to the registered Workers according to the DAG.

After a Worker receives its tasks, it starts an Executor process to run them. The Executor first checks whether it already has the Driver's execution code; if not, it downloads the code from the Driver, loads it via Java reflection, and starts executing.

Summary

Compared with MapReduce, Spark's main features are:

  • The RDD programming model is simpler
  • Splitting the computation into multiple DAG stages makes execution faster
  • Storing intermediate results preferentially in memory is more efficient

Spark became popular in 2012. By then, memory capacity had grown and memory costs had fallen by an order of magnitude compared with a decade earlier when MapReduce appeared, so the conditions for Spark's memory-first design were ripe.

Reference

  • https://spark.apache.org/docs/3.2.1/cluster-overview.html

 

Click Follow to be the first to learn about HUAWEI CLOUD's latest technologies~
