Detailed explanation of Spark RDD core

This article mainly explains Spark's programming model and its job execution and scheduling process. The core of Spark is the RDD (Resilient Distributed Dataset), a special collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. Let's take a closer look at this abstract dataset.

1. Spark programming model

Features of RDD
An RDD has five features in total: three basic and two optional. A small code sketch after the list shows where each one appears in the RDD API.
(1) Partition: an RDD has a list of partitions (data shards) into which its data is divided. The partitions can be computed in parallel and are the atomic pieces of the dataset.
(2) Compute function: for each partition there is a function that is applied to iterate over/compute it.
(3) Dependency: each RDD records its dependencies on its parent RDDs (a source RDD has no dependency). The lineage between RDDs is established through these dependencies.
(4) Preferred locations (optional): each partition has a list of preferred locations, i.e. the machines on which the task would best be executed (data locality).
(5) Partitioning strategy (optional): for key-value RDDs, a partitioner can be specified to tell Spark how the data is sharded, e.g. via the partitionBy function.
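
These five features correspond directly to members of the RDD API. The sketch below is a rough, simplified rendering of those signatures (not the actual org.apache.spark.rdd.RDD class), just to show where each feature lives:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the five RDD features as API members; not the real class.
abstract class SketchRDD[T] {
  // (1) the list of partitions (shards) that can be computed in parallel
  protected def getPartitions: Array[Partition]
  // (2) the compute function applied to each partition
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // (3) dependencies on parent RDDs, which record the lineage
  protected def getDependencies: Seq[Dependency[_]]
  // (4) optional: preferred locations of a partition (data locality)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // (5) optional: the partitioner of a key-value RDD
  val partitioner: Option[Partitioner] = None
}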

RDD dependencies
RDD dependencies fall into two categories: narrow dependencies and wide dependencies.
1. Narrow dependencies
In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD. That is, one partition of a parent RDD corresponds to one partition of the child RDD (the first case), or the partitions of several parent RDDs correspond to a single partition of the child RDD (the second case); a single partition of a parent RDD never corresponds to multiple partitions of the child RDD.
As shown in the figure below, a join whose inputs are co-partitioned belongs to the second case. When a partition of the child RDD depends on a partition of a single parent RDD, the partition structure does not change, as with the map and filter operations in the figure. In contrast, when a partition of the child RDD depends on partitions of multiple parent RDDs, the partition structure changes, as with the union operation in the figure.
2. Wide dependencies
In a wide dependency, each partition of the child RDD depends on all or several partitions of every parent RDD. In other words, one partition of a parent RDD corresponds to multiple partitions of the child RDD. The groupByKey shown in the figure below is a wide dependency. A wide dependency triggers a shuffle operation, which is described in detail below.
[Figure: examples of narrow dependencies (map, filter, union, co-partitioned join) and a wide dependency (groupByKey)]
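
One way to see the two kinds of dependency in the spark-shell is to inspect an RDD's dependencies field: map produces a narrow OneToOneDependency, while groupByKey produces a ShuffleDependency (the output below is abbreviated and may vary slightly between Spark versions):

scala> val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

scala> pairs.map(kv => (kv._1, kv._2 * 10)).dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@...)

scala> pairs.groupByKey().dependencies
res1: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@...)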
Creating an RDD
There are two ways to create an RDD:

  1. Parallelized Collections (parallelizing an existing collection)
  2. External Datasets (referencing an external dataset)

Parallelized Collections

scala> val data=Array(1,2,3,4,5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData=sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:23

scala> distData.reduce((a,b)=>a+b)
res5: Int = 15

External Datasets
Take reading a file on the HDFS file system and counting word frequencies as an example:

scala> val rdd = sc.textFile("hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/spark/wc.input")

scala> val wordRdd = rdd.flatMap(_.split(" "))

scala> val kvRdd = wordRdd.map((_,1))

scala> val wordCountRdd = kvRdd.reduceByKey(_ + _)

scala> wordCountRdd.collect
// Save the result to the HDFS file system
scala> wordCountRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/sparkOutput")

Operations on RDDs fall into two categories: transformations and actions. By default, all Spark transformations are lazy: a transformed RDD does not compute its result immediately, but only records the transformation applied to the base dataset. There can be many chained transformations; only when an action is encountered are all of the preceding transformations actually executed and the final result produced (Spark job execution actually builds a DAG, a directed acyclic graph, which is described below). After the job finishes, the intermediate RDDs are cleared, so if a later job needs an RDD from another job, that RDD has to be recomputed. To avoid this recomputation, we can use the persist or cache operation to persist an RDD in memory, or cache it to disk.
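
A small spark-shell illustration of this laziness (the path is just a placeholder): no data is read and nothing is computed until the first action, and cache keeps the filtered RDD around so the second action does not recompute it from the file.

scala> val lines = sc.textFile("hdfs://.../app.log")       // transformation: nothing read yet
scala> val errors = lines.filter(_.contains("ERROR"))      // transformation: still lazy
scala> errors.cache()                                      // mark for persistence (MEMORY_ONLY)
scala> errors.count()                                      // action: triggers the whole chain
scala> errors.filter(_.contains("timeout")).count()        // action: reuses the cached errors RDD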

RDDs provide a series of transformation operations. A few examples are given below; for details, please refer to the official Spark documentation.

// Initialize a list
scala> val data=List(("a",1),("b",1),("c",1),("a",2),("b",2),("c",2))
// Create an RDD by parallelizing the collection
scala> val rdd=sc.parallelize(data)
// reduceByKey operation
scala> val rbk=rdd.reduceByKey(_+_).collect
rbk: Array[(String, Int)] = Array((b,3), (a,3), (c,3))

// groupByKey operation
scala> val grk=rdd.groupByKey().collect
grk: Array[(String, Iterable[Int])] = Array((b,CompactBuffer(1, 2)), (a,CompactBuffer(1, 2)), (c,CompactBuffer(1, 2)))

// sortByKey operation
scala> val srk=rdd.sortByKey().collect
srk: Array[(String, Int)] = Array((a,1), (a,2), (b,1), (b,2), (c,1), (c,2))

RDD control operations
RDD control operations mainly include operations such as failure recovery, data persistence, and data removal.

Failure Recovery
For a cluster, Spark makes two assumptions:

  1. Processing time is finite.
  2. Keeping data persistent is the responsibility of the external data source; Spark's job is mainly to keep the data stable during processing.

Based on these assumptions, Spark chooses a trade-off recovery scheme. Relying on the dependencies between RDDs, if a partition of an RDD is lost, only the corresponding partitions of its parent RDD are re-executed, rather than re-running the whole job.
Re-execution across a wide dependency involves multiple parent RDDs (because a wide dependency triggers a shuffle, i.e. it spans multiple stages), which could cause the entire job to be re-executed. To avoid this, Spark persists the intermediate output of the map stage; in the event of a failure, the intermediate data can be recovered by backtracking only to the corresponding partitions produced by the mappers.
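
The lineage that this recovery relies on can be inspected with toDebugString. For the word-count RDDs built earlier it looks roughly like the following (output trimmed, paths and indices illustrative); the "+-" line marks a shuffle (stage) boundary:

scala> wordCountRdd.toDebugString
res6: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:29 []
 +-(2) MapPartitionsRDD[3] at map at <console>:27 []
    |  MapPartitionsRDD[2] at flatMap at <console>:26 []
    |  hdfs://.../wc.input MapPartitionsRDD[1] at textFile at <console>:25 []
    |  hdfs://.../wc.input HadoopRDD[0] at textFile at <console>:25 []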
RDD persistence
RDD persistence is divided into active persistence and automatic persistence.

Automatic persistence means that the user does not need to call any persistence operation; Spark automatically saves some intermediate results of shuffle operations (to disk) to avoid recomputing the entire input when a node crashes.

Active persistence requires the user to call persist or cache on the RDD that needs to be persisted; by default the data is cached in memory. The persistence level is chosen by passing a StorageLevel object to the persist method; the cache method calls persist() with the default level MEMORY_ONLY (in memory). The following are the persistence level options from the official documentation:
[Figure: the table of storage levels from the official Spark documentation]
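
A minimal spark-shell sketch of choosing a level explicitly (MEMORY_AND_DISK here is just one option from that table):

scala> import org.apache.spark.storage.StorageLevel

scala> val nums = sc.parallelize(1 to 100000)

scala> nums.persist(StorageLevel.MEMORY_AND_DISK)   // explicit level; cache() would mean MEMORY_ONLY

scala> nums.count()                                  // first action: computes and stores the partitions

scala> nums.count()                                  // later actions are served from the cache

scala> nums.unpersist()                              // manual removal, see "RDD data removal" below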

RDD data removal
RDDs can be cached in memory. Spark monitors the cache usage on each node and, if the cluster does not have enough memory, evicts data partitions using the LRU algorithm (least recently used, familiar from the memory-management chapter of operating systems).
If you want to remove a cached RDD manually, call the unpersist method on that RDD; it takes effect immediately.

2. Spark job execution scheduling

1. Spark components
From an architectural point of view, each Spark application consists of a master (control) node, a cluster manager responsible for cluster resource management, worker nodes with executor processes that run the concrete tasks, a client responsible for job submission, and a Driver process responsible for job scheduling.
[Figure: Spark architecture: client, Driver, cluster manager, Worker nodes and Executors]

The client is responsible for submitting the user's job; the Driver process runs the user-defined main function and executes the various parallel computations and operations on the cluster. The SparkContext is the application's only channel for interacting with the cluster: it obtains data, builds the RDD DAG, and schedules tasks through the TaskScheduler, among other things.
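
Outside the shell, the SparkContext is created explicitly in the application's main function. A minimal sketch, assuming a hypothetical standalone master at spark://master:7077 and input/output paths passed as arguments:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The SparkContext is the application's single channel to the cluster:
    // it obtains resources via the cluster manager, builds the RDD DAG,
    // and submits tasks through the scheduler.
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("spark://master:7077") // hypothetical standalone master URL
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}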
When the user submits a job, the Driver organizes all of the RDDs and their dependencies into a DAG. When an action is performed, the TaskScheduler schedules the execution of all the RDD tasks and requests a unified allocation of resources through the ClusterManager. The concrete tasks run on the Worker nodes, where Task threads execute the individual tasks and the BlockManager handles storage management; data can be kept in multiple replicas in memory. While tasks are executing, retried tasks (RetryTask) and straggler tasks (StragglingTask) are also handled so that results can be recovered quickly.

When the DAG is executed, it is divided into multiple stages according to the wide and narrow dependencies between RDDs, and a shuffle operation is started between stages. The first stage therefore produces intermediate results, which become the input of the second stage. How are these intermediate results handed over to the second stage? The intermediate results contain some computation state and are written to disk, so the next stage reads them from disk through the BlockManager for its own computation.

2. Spark job execution flow
The following is the Spark job execution flow chart (drawn by myself, it's a bit ugly, don't mind ha ^_^):
[Figure: Spark job execution flow]
To submit a task to the cluster, take standalone mode as an example. It has two modes, configured through --deploy-mode; the default is client mode (the Driver runs on the client). The figure above shows cluster mode, i.e. the Driver is started on a Worker. First the Master is started, then the Worker nodes. After a Worker node starts, it first registers with the Master node. The client then registers and creates a SparkContext, through which the job's RDDs are generated. The Master notifies a Worker node to start the Driver process, and that Worker node spawns the corresponding Driver process. The Driver registers the application with the Master node, and the Master, according to the submitted task, notifies the Worker nodes to start Executors (which can be understood by analogy with the Executor framework in the Java concurrency library); the Worker nodes then spawn the corresponding Executor processes. The Driver builds a DAG from the RDDs generated by the application, produces TaskSets, and submits them to the TaskScheduler, which schedules them onto the Worker nodes for execution. Finally, every Worker node sends heartbeat reports to the Master node.

The Worker nodes send heartbeat reports to the Master to report their health. The following failure scenarios are handled as follows:
1. When a Worker node fails, its Executor processes are killed with it. The Master stops receiving the Worker node's heartbeat, judges it to be faulty, and removes the node.

2. When an Executor fails, the ExecutorRunner reports it to the Master. Since the Worker itself is still healthy, the Master sends a LaunchExecutor command to the Worker node to start the Executor process again.

3. When the Master fails, HA can be built with ZooKeeper for automatic failover.

Let's look at how Spark schedules a job: how it generates a series of RDDs, builds the DAG, and dispatches tasks to the Worker nodes, as shown in the following figure:
[Figure: DAG construction and task scheduling from the Driver to the Worker nodes]
First, when a user submits a job, RDDs are generated through a series of operations such as join, groupByKey and filter. The DAGScheduler then builds a DAG that records the dependencies between the RDDs, but the job is not executed yet; only when an RDD action is encountered is the execution of all the preceding tasks triggered. The DAGScheduler divides the tasks into stages according to the dependencies and submits TaskSets to the TaskScheduler. The TaskScheduler applies to the ClusterManager for resources on the Worker nodes according to the tasks, and then submits them to the Worker nodes for execution.

The DAGScheduler first treats the constructed DAG as one complete stage and then backtracks from the last RDD in that stage. During the backtracking it keeps checking the dependency type of each RDD: if it is a narrow dependency, the backtracking continues within the current stage; if it is a wide dependency, a new stage is split off. In this way the whole DAG is divided into multiple stages, and each stage consists of multiple tasks.
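
A highly simplified sketch of this backtracking idea is shown below. This is not Spark's actual DAGScheduler code, only an illustration: narrow dependencies keep the parent RDD in the current stage, while each ShuffleDependency closes the stage and opens a new, earlier one.

import scala.collection.mutable
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Illustrative only: walk backwards from the final RDD of the job.
def buildStages(finalRdd: RDD[_]): List[Set[RDD[_]]] = {
  val earlierStages = mutable.ListBuffer.empty[Set[RDD[_]]]

  def buildStage(rdd: RDD[_]): Set[RDD[_]] = {
    val inThisStage = mutable.Set[RDD[_]](rdd)
    for (dep <- rdd.dependencies) dep match {
      case shuffle: ShuffleDependency[_, _, _] =>
        // wide dependency: the parent RDD starts a new (earlier) stage
        earlierStages += buildStage(shuffle.rdd)
      case narrow =>
        // narrow dependency: the parent RDD stays in the current stage
        inThisStage ++= buildStage(narrow.rdd)
    }
    inThisStage.toSet
  }

  val lastStage = buildStage(finalRdd)
  earlierStages.toList :+ lastStage
}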

The main functions of the DAGScheduler and the TaskScheduler are summarized below.
DAGScheduler:

  • Receive the job submitted by the user.
  • Build the stages, recording which RDD or stage outputs have been materialized.
  • Submit the TaskSet to the underlying scheduler.

TaskScheduler:

  • Submit a TaskSet (a set of Tasks) to the cluster to run and monitor.

  • Build a TaskSetManager for each TaskSet to manage its life cycle.
  • Determine the best location for each task according to data locality.
  • Speculative execution: when a straggler task is encountered, it is re-executed on another node; when shuffle output is lost, a fetch-failed report is issued.

Finally, let's talk about tasks. A Task is the smallest execution unit of an Executor. The data processed by a Task usually comes from one of two sources: shuffle data or external data. A Task can run on any node of the cluster, and it writes its shuffle output to memory or disk for fault tolerance.

