Spark RDD fault tolerance mechanism


1. RDD fault tolerance mechanism

When a node in a Spark cluster goes down and its data is lost, RDDs can be used to recover the lost data through fault tolerance. RDD provides two fault-recovery methods: lineage and checkpoint.

(1) Lineage

Data recovery is performed on the RDD whose data is lost according to the dependencies between RDDs. If the lost partition belongs to a child RDD produced by a narrow-dependency operation, only the corresponding partition of the parent RDD needs to be recomputed; no other nodes are involved, and there is no redundant computation. If the lost data belongs to an RDD produced by a wide-dependency operation, all partitions of the parent RDD must be recomputed from the beginning, so the recovery involves redundant computation.
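As a rough illustration of lineage (a minimal sketch, assuming it runs in spark-shell where sc is the SparkContext), the code below builds one RDD through a narrow dependency (map) and another through a wide dependency (reduceByKey); toDebugString prints the lineage Spark would follow to recompute a lost partition.

val nums = sc.makeRDD(List(1, 2, 3, 4, 5), 2) // parent RDD with 2 partitions
val doubled = nums.map(_ * 2) // narrow dependency: each child partition depends on exactly one parent partition
val summed = doubled.map(n => (n % 2, n)).reduceByKey(_ + _) // wide dependency: a shuffle, so a child partition may depend on all parent partitions
println(doubled.toDebugString) // lineage with no shuffle stage
println(summed.toDebugString) // lineage includes a ShuffledRDD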

(2) Checkpoint

The essence of a checkpoint is writing RDD data to disk. When an RDD is computed through wide dependencies, it is enough to set a checkpoint at an intermediate stage for fault tolerance: the SparkContext calls the setCheckpointDir() method to designate a directory in a fault-tolerant file system as the checkpoint location, and the checkpoint data is persisted to that file system. If a node later crashes and a partition's data is lost, the computation is restarted from the checkpointed RDD rather than from the very beginning, which reduces the overhead.

2. RDD checkpoint

(1) RDD checkpoint mechanism

RDD's checkpoint mechanism (Checkpoint) is equivalent to taking a snapshot of RDD data: frequently used RDDs can be snapshotted to a specified file system, preferably a shared one such as HDFS. When a machine fails and the RDD data in memory or on disk is lost, the RDD can be quickly restored from the snapshot without recomputing it along its dependency chain, which greatly improves computation efficiency.

(2) Differences from RDD persistence

cache() and persist() store data in the local memory or disk of a machine; when that machine fails, the data cannot be recovered. A checkpoint stores the RDD data in an external shared file system (such as HDFS), whose replication mechanism guarantees data reliability.

After the Spark application finishes, the data stored by cache() or persist() is cleared, while checkpointed data is unaffected and remains permanently unless it is removed manually. Therefore, checkpoint data can be used by the next Spark application, while cache() or persist() data can only be used by the current Spark application.

(3) RDD checkpoint case demonstration

Create a day06 subpackage in the net.army.rdd package, and then create a CheckPointDemo object in the subpackage:

package net.army.rdd.day06

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: 梁辰兴
 * Date: 2023/6/6
 * Purpose: checkpoint demo
 */
object CheckPointDemo {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object
    val conf = new SparkConf()
      .setAppName("CheckPointDemo") // set the application name
      .setMaster("local[*]") // set the master (local debugging)
    // Create the SparkContext based on the Spark configuration object
    val sc = new SparkContext(conf)

    // Set the checkpoint data storage path (the directory is created automatically)
    sc.setCheckpointDir("hdfs://master:9000/spark-ck")
    // Create an RDD from a collection
    val rdd = sc.makeRDD(List(1, 1, 2, 3, 5, 8, 13))
    // Filter the elements greater than or equal to 5, producing a new RDD
    val rdd1 = rdd.filter(_ >= 5)
    // Persist rdd1 in memory
    rdd1.cache() // equivalent to persist() with no arguments
    // Mark rdd1 as a checkpoint
    rdd1.checkpoint()

    // First action - collect the data, which stores the checkpointed RDD data to the specified location
    val result = rdd1.collect.mkString(", ")
    println("Elements of rdd1: " + result)
    // Second action - count the elements, reading rdd1's data directly from the cache instead of recomputing
    val count = rdd1.count
    println("Number of elements in rdd1: " + count)

    // Stop the SparkContext
    sc.stop()
  }
}

The above code uses the checkpoint() method to mark the RDD as a checkpoint (this only sets a mark; the checkpoint is materialized when an action operator is encountered). During the first action, the data of the RDD marked as a checkpoint is saved as files in the file system directory specified by setCheckpointDir(), and all of the RDD's parent dependencies are removed, because the next time this RDD is computed its data is read directly from the file system instead of being recomputed from its dependencies.

Spark recommends persisting an RDD to memory before marking it as a checkpoint, because Spark starts a separate job to write the checkpointed RDD's data to the file system. If the RDD has already been persisted to memory, the data is read directly from memory and written out, which improves write efficiency; otherwise the RDD's data has to be computed again.

Run the program and view the results on the console. Then use the Hadoop WebUI to view the HDFS checkpoint directory, open the subdirectory created for this application, and view the rdd-1 directory inside it. Because the sc.stop() statement has been executed, the SparkContext is closed and the cached data is cleared, so the data Spark cached in memory can no longer be accessed.

3. Shared variables

Usually, when a Spark application runs, the function func in a Spark operator (such as map(func) or filter(func)) is sent to multiple remote Worker nodes for execution. If the operator uses an external variable, that variable is copied to every Task on the Worker nodes, and each Task operates on its own copy independently. When the variable holds a large amount of data (such as a large collection), this increases network transmission and memory overhead. Therefore, Spark provides two kinds of shared variables: broadcast variables and accumulators.

(1) Broadcast variables

A broadcast variable sends a variable to the cache of each Worker node in the form of a broadcast, instead of sending a copy to each Task, and all Tasks can share that variable's data. Therefore, broadcast variables are read-only.

Preparations: create data.txt in the /home directory and upload it to the /park directory of HDFS.

1. Passing a variable by default

Start the Spark shell in cluster mode.
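The exact command used in the original demonstration is not shown here; a typical way to start spark-shell against a standalone cluster, assuming the master runs on the host master with the default port 7077, would be:

spark-shell --master spark://master:7077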

The external variable arr is used in the function passed to the map() operator:

val arr = Array(1, 2, 3, 4, 5)
val lines = sc.textFile("hdfs://master:9000/park/data.txt")
val result = lines.map((_, arr))
result.collect

In the above code, the function (_, arr) passed to the map() operator is sent to the Executors for execution, and the variable arr is sent to every Task on the Worker nodes. Assuming the variable arr holds 100MB of data, each Task must maintain its own 100MB copy; if 3 Tasks are started in one Executor, that Executor consumes 300MB of memory for arr.

2. Passing a variable with a broadcast variable

A broadcast variable is actually a wrapper around an ordinary variable. In a distributed function, the value of the broadcast variable can be accessed through the value method of the Broadcast object:

val arr = Array(1, 2, 3, 4, 5)
val broadcastVar = sc.broadcast(arr) // define the broadcast variable
broadcastVar.value

Use the broadcast variable to pass the array arr to the map() operator:

val arr = Array(1, 2, 3, 4, 5)
val broadcastVar = sc.broadcast(arr) // define the broadcast variable
val lines = sc.textFile("hdfs://master:9000/park/data.txt")
val result = lines.map((_, broadcastVar)) // the operator carries the broadcast variable
result.collect

The above code uses the broadcast() method to send (broadcast) a read-only variable to the cluster. The method sends the variable only once and returns a broadcast variable, broadcastVar, which is an org.apache.spark.broadcast.Broadcast object. Broadcast objects are read-only and are cached on each Worker node of the cluster. All Tasks on a Worker node share the single broadcast variable, which greatly reduces network transmission and memory overhead.

Collect the result and traverse it with foreach to output the elements of arr:

val arr = Array(1, 2, 3, 4, 5)
val broadcastVar = sc.broadcast(arr) // define the broadcast variable
val lines = sc.textFile("hdfs://master:9000/park/data.txt")
val result = lines.map((_, broadcastVar)) // the operator carries the broadcast variable
result.collect.foreach(tuple => {
  print(tuple._1 + ": ")
  tuple._2.value.foreach(x => print(x + " "))
  println()
})

Output the data of result with a nested for loop:

for (tuple <- result.collect) {
  print(tuple._1 + ": ")
  for (x <- tuple._2.value)
    print(x + " ")
  println()
}

(2) Accumulator

1. Accumulator function

An accumulator aggregates values from the Worker nodes to the Driver; it can be used to implement counting and summing.

2. Without using an accumulator

Sum an array of integers:
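A minimal sketch of this approach, assuming an ordinary var sum defined in the Driver and updated inside foreach:

var sum = 0 // ordinary variable defined in the Driver
val rdd = sc.makeRDD(Array(1, 2, 3, 4, 5)) // create an RDD
rdd.foreach(x => sum = sum + x) // each Executor updates its own copy of sum
println("sum = " + sum) // prints 0 in the Driver, not 15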

The above code is incorrect because the sum variable is defined in the Driver, while the accumulation operation sum = sum + x is sent to the Executors for execution; each Task updates only its own copy of sum, the copy in the Driver is never changed, and the printed result stays 0.

3. Using an accumulator

Sum an array of integers:

val myacc = sc.longAccumulator("acc") // declare the accumulator in the Driver
val rdd = sc.makeRDD(Array(1, 2, 3, 4, 5)) // create an RDD
rdd.foreach(x => myacc.add(x)) // add values to the accumulator in the Executors
println("sum = " + myacc.value) // print the accumulated result in the Driver

The above code creates a Long-type accumulator by calling the longAccumulator() method of the SparkContext object; its default initial value is 0. A Double-type accumulator can be created with the doubleAccumulator() method.

An accumulator can only be defined on the Driver side and updated on the Executor side. Its value cannot be read on the Executor side; it must be read through the value property on the Driver side.
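As a supplementary sketch, the same pattern with a Double-type accumulator (the accumulator name "dacc" and the sample values here are arbitrary):

val dacc = sc.doubleAccumulator("dacc") // declare the accumulator in the Driver
sc.makeRDD(Array(1.5, 2.5, 3.0)).foreach(x => dacc.add(x)) // add values in the Executors
println("sum = " + dacc.value) // read the result in the Driver; prints 7.0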
