Advanced RDD (Chapter 3)

Analyze WordCount

sc.textFile("hdfs:///demo/words/t_word") //RDD0
    .flatMap(_.split(" ")) //RDD1
    .map((_,1)) //RDD2
    .reduceByKey(_+_) //RDD3 finalRDD
    .collect //Array, job submission

What are the characteristics of RDD?

An RDD has partitions; the number of partitions determines the degree of parallelism of the computation on that RDD.
Each partition is computed independently, so that partition-local computation is achieved as far as possible.
An RDD is a read-only dataset, and there are dependencies between RDDs.
For key-value RDDs, a partitioning strategy (Partitioner) can be specified [Optional].
Based on where the data is located, the best location is chosen for the computation, achieving data locality [Optional].
These characteristics can be inspected from the spark-shell, as sketched below.
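A minimal sketch (reusing the t_word path from the earlier example; these are all public RDD API calls, and the exact output depends on the cluster):

scala> val pairs = sc.textFile("hdfs:///demo/words/t_word").map((_, 1)).reduceByKey(_ + _)
scala> pairs.getNumPartitions                        // how many partitions the RDD has
scala> pairs.dependencies                            // dependencies on the parent RDD
scala> pairs.partitioner                             // Some(HashPartitioner) after reduceByKey
scala> pairs.preferredLocations(pairs.partitions(0)) // preferred locations of the first partition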

RDD fault tolerance

The prerequisite for understanding how the DAGScheduler divides stages is the term lineage, which is usually called the RDD's bloodline.

The essence of Spark's computation is to perform a series of transformations on RDDs. Because an RDD is an immutable, read-only collection, each transformation takes the previous RDD as its input, so the lineage of an RDD describes the interdependence between RDDs.
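The lineage can be printed in the spark-shell with toDebugString; a small sketch (reusing the word-count pipeline above, output omitted since it depends on the environment):

scala> val finalRDD = sc.textFile("hdfs:///demo/words/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
scala> finalRDD.toDebugString // prints the dependency chain: ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD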

Spark categorizes the relationship between RDDs into wide dependencies and narrow dependencies.

Based on the dependencies recorded in the lineage, Spark can tolerate failures during RDD computation. At present Spark provides three fault-tolerance strategies for RDD computation: recomputation based on RDD dependencies (no user intervention required), caching the RDD (temporary cache), and checkpointing the RDD (persistence).

RDD cache

Caching is one means of fault tolerance for RDD computation. When RDD data is lost, the program can quickly recover the current RDD from the cache instead of recomputing the whole lineage. Therefore, when Spark needs to reuse a certain RDD several times, users can consider caching that RDD to improve execution efficiency.

scala> var finalRDD=sc.textFile("hdfs:///demo/words/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
finalRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:24
scala> finalRDD.cache              //finalRDD.persist(StorageLevel.MEMORY_ONLY)
res7: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:24
scala> finalRDD.collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (day,2), (come,1), (hello,1),(baby,1), (up,1), (spark,1), (a,1), (on,1), (demo,1), (good,2), (study,1))
scala> finalRDD.collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (day,2), (come,1), (hello,1),
(baby,1), (up,1), (spark,1), (a,1), (on,1), (demo,1), (good,2), (study,1))

The user can call the unpersist method to clear the cache

scala> finalRDD.unpersist()          // clear finalRDD from the cache
res11: org.apache.spark.rdd.RDD[(String, Int)]
@scala.reflect.internal.annotations.uncheckedBounds = ShuffledRDD[25] at reduceByKey at <console>:24

The storage levels currently supported by Spark are as follows:

object StorageLevel {
 val NONE = new StorageLevel(false, false, false, false)
 val DISK_ONLY = new StorageLevel(true, false, false, false) // store on disk only
 val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2) // store on disk only, with 2 replicas
 val MEMORY_ONLY = new StorageLevel(false, true, false, true)       // cache in memory
 val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
 val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false) // serialize before storing in memory: costs CPU, saves memory
 val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
 val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)       // recommended
 val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
 val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
 val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
 val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
...

How to choose?

By default, MEMORY_ONLY gives the best performance, but only if your memory is large enough to hold all of the RDD's data. Because no serialization or deserialization is performed, that part of the overhead is avoided; subsequent operations on the RDD work on data purely in memory and never read from disk, so performance is very high; and there is no need to copy the data and ship it to other nodes. Note, however, that in a real production environment the scenarios where this level can be used directly are limited: if the RDD contains a lot of data (say, hundreds of millions of records), using this level directly may cause the JVM to throw an OOM (out-of-memory) exception.

If memory overflows at the MEMORY_ONLY level, it is recommended to try MEMORY_ONLY_SER. This level serializes the RDD data before keeping it in memory, so each partition becomes just a byte array, which greatly reduces the number of objects and the memory footprint. The extra overhead compared with MEMORY_ONLY is mainly serialization and deserialization, but subsequent computation still happens purely in memory, so overall performance remains relatively high. The possible problem is the same as above: if the RDD holds too much data, an OOM exception may still occur.

Do not spill to disk unless recomputing the data in memory would be very expensive, or the computation filters out a large amount of data so that only the relatively important part needs to be kept in memory. Otherwise, reading the data back from disk is slow and performance drops sharply.

Levels with the _2 suffix replicate all the data and send a copy to other nodes. Data replication and network transfer incur a large performance overhead, so these levels are not recommended unless the job requires high availability.
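As a rough illustration of the guidance above (reusing the word-count RDD from earlier; not part of the original post), falling back from the default MEMORY_ONLY to a serialized level might look like this:

import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///demo/words/t_word")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

pairs.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: more CPU, less memory
pairs.count()                               // the first action materializes the cache
pairs.collect()                             // later actions reuse the cached data
pairs.unpersist()                           // release the cache once the RDD is no longer reused

If even the serialized data does not fit in memory, MEMORY_AND_DISK or MEMORY_AND_DISK_SER is the usual next step.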

Checkpoint mechanism

In addition to the caching mechanism, which effectively supports RDD failure recovery, a lost cache still forces the system to recompute the RDD. Therefore, for scenarios where the RDD lineage is long and the computation is time-consuming, users can try the checkpoint mechanism to store RDD results.

The most important difference from caching is that with a checkpoint the RDD data is persisted directly in a file system; it is generally recommended to write the results to HDFS. This kind of checkpoint is not cleared automatically.

Note that checkpointing first marks the RDD during the computation, and only after the job has finished is the checkpoint actually performed on the marked RDD. This means the dependencies of the marked RDD have to be recomputed in order to produce the data that is written out.

sc.setCheckpointDir("hdfs://CentOS:9000/checkpoints")
val rdd1 = sc.textFile("hdfs://CentOS:9000/demo/words/")
.map(line => {
 println(line)
})
// mark the current RDD for checkpointing
rdd1.checkpoint()
rdd1.collect()

Therefore, checkpoint is generally used together with cache, so that the RDD only has to be computed once:

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs://CentOS:9000/checkpoints")
val rdd1 = sc.textFile("hdfs://CentOS:9000/demo/words/")
.map(line => {
 println(line)
})
rdd1.persist(StorageLevel.MEMORY_AND_DISK) // cache first, so the transformations are not redone when the checkpoint is written
// mark the current RDD for checkpointing
rdd1.checkpoint()
rdd1.collect()
rdd1.unpersist() // remove the cache
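As a small follow-up check (not from the original post), the shell can confirm that the checkpoint was actually written once the action has run:

rdd1.isCheckpointed    // true after the checkpoint data has been written
rdd1.getCheckpointFile // Some(hdfs://CentOS:9000/checkpoints/...) pointing at the checkpoint data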

Detailed explanation of Spark RDD computation

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

Analysis of Spark task calculation source code

Theoretical guidance

sc.textFile("hdfs:///demo/words/t_word") //RDD0
 .flatMap(_.split(" ")) //RDD1
 .map((_,1)) //RDD2
 .reduceByKey(_+_) //RDD3 finalRDD
 .collect //Array, job submission


By analyzing the code above, it is not hard to see that in the early phase of job execution Spark builds a DAG for the job according to the transformation relationships between RDDs and divides the job into several stages. The stage division is based on the dependencies between RDDs: Spark classifies each transformation as either a ShuffleDependency (wide dependency) or a NarrowDependency (narrow dependency). When Spark finds a narrow dependency between two RDDs, it automatically merges the narrowly dependent RDDs into the current stage; when it encounters a wide dependency, it opens a new stage.

How Spark decides between wide and narrow dependencies

Wide dependency: one partition of the parent RDD corresponds to multiple partitions of the child RDD; whenever such a fork occurs it is treated as a wide dependency. ShuffleDependency

Narrow dependency: one partition of the parent RDD (or of several parent RDDs) corresponds to exactly one partition of the child RDD; this is treated as a narrow dependency. OneToOneDependency | RangeDependency | PruneDependency

In the early phase of job submission, Spark traces backwards from the finalRDD through all the RDDs it depends on and their dependencies. When a narrow dependency is encountered, the RDD is merged into the current stage; when a wide dependency is encountered, a new stage is opened.
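A brief spark-shell sketch (reusing the word-count pipeline; not part of the original post) that makes the two dependency types visible:

scala> val words = sc.textFile("hdfs:///demo/words/t_word").flatMap(_.split(" ")).map((_, 1))
scala> words.dependencies  // a OneToOneDependency -> stays in the current stage
scala> val counts = words.reduceByKey(_ + _)
scala> counts.dependencies // a ShuffleDependency  -> opens a new stage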

Source tracking


getMissingParentStages

private def getMissingParentStages(stage: Stage): List[Stage] = {
    val missing = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new ArrayStack[RDD[_]]
    def visit(rdd: RDD[_]) {
        if (!visited(rdd)) {
            visited += rdd
            val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
            if (rddHasUncachedPartitions) {
                for (dep <- rdd.dependencies) {
                    dep match {
                        case shufDep: ShuffleDependency[_, _, _] =>
                        val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
                        if (!mapStage.isAvailable) {
                            missing += mapStage
                        }
                        case narrowDep: NarrowDependency[_] =>
                        waitingForVisit.push(narrowDep.rdd)
                    }
                }
            }
        }
    }
    waitingForVisit.push(stage.rdd)
    while (waitingForVisit.nonEmpty) {
        visit(waitingForVisit.pop())
    }
    missing.toList
}

When a wide dependency is encountered, the system automatically creates a ShuffleMapStage

submitMissingTasks

private def submitMissingTasks(stage: Stage, jobId: Int) {

    // compute the partitions that still need to be computed
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
    ...
    // compute the preferred locations (best locality) for each partition
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
        stage match {
            case s: ShuffleMapStage =>
            partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
            case s: ResultStage =>
            partitionsToCompute.map { id =>
                val p = s.partitions(id)
                (id, getPreferredLocs(stage.rdd, p))
            }.toMap
        }
    } catch {
        case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}",
                   Some(e))
        runningStages -= stage
        return
    }
    // map the partitions to tasks (ShuffleMapTask / ResultTask) in a TaskSet
    val tasks: Seq[Task[_]] = try {
        val serializedTaskMetrics =
        closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
        stage match {
            case stage: ShuffleMapStage =>
            stage.pendingPartitions.clear()
            partitionsToCompute.map { id =>
                val locs = taskIdToLocations(id)
                val part = partitions(id)
                stage.pendingPartitions += id
                new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
                                   taskBinary, part, locs, properties, serializedTaskMetrics,
                                   Option(jobId),
                                   Option(sc.applicationId), sc.applicationAttemptId,
                                   stage.rdd.isBarrier())
            }
            case stage: ResultStage =>
            partitionsToCompute.map { id =>
                val p: Int = stage.partitions(id)
                val part = partitions(p)
                val locs = taskIdToLocations(id)
                new ResultTask(stage.id, stage.latestInfo.attemptNumber,
                               taskBinary, part, locs, id, properties, serializedTaskMetrics,
                               Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
                               stage.rdd.isBarrier())
            }
        }
    } catch {
        case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}",
                   Some(e))
        runningStages -= stage
        return
    }
    // call taskScheduler#submitTasks with the TaskSet
    if (tasks.size > 0) {
        logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
                s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
        taskScheduler.submitTasks(new TaskSet(
            tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
    }
    ...
}

Key terms to remember: backward derivation (from the finalRDD), finalRDD, ResultStage, ShuffleMapStage, ShuffleMapTask, ResultTask, ShuffleDependency, NarrowDependency, DAGScheduler, TaskScheduler, SchedulerBackend, DAGSchedulerEventProcessLoop

Origin blog.csdn.net/origin_cx/article/details/104434056