Spark's data fault tolerance mechanism

Generally speaking, there are two approaches to fault tolerance for distributed datasets: data checkpointing and recording data updates.

  • Checkpoint mechanism (data checkpointing)
  • Update-recording mechanism (the Lineage mechanism in Spark)

Checkpoint mechanism

  • A checkpoint is similar to a snapshot. In a typical Spark job the DAG can be extremely long, and the cluster must complete the entire DAG to obtain the result. If data is lost partway through this long computation, Spark recomputes from the beginning based on the recorded dependencies, which wastes performance. We can of course cache or persist intermediate results to memory or disk, but this does not guarantee the data will never be lost: if the storage memory fails or the disk breaks, Spark still has to recompute the RDD from scratch. Hence the checkpoint mechanism: it stores the more important intermediate results in the DAG in a highly available place, which is usually HDFS.
      
  • Also note the differences between the cache and checkpoint mechanisms:
    1. Checkpointing is completed by a separate job launched after the current job finishes, whereas caching happens while the job is executing.
    2. Checkpointing an RDD truncates its lineage and saves only the RDD you want to keep to HDFS, whereas caching keeps the computed data, with its lineage intact, in memory.
    3. Cleanup also differs. An RDD checkpointed to HDFS must be cleared manually; if it is not, it persists and can be reused by a later driver. Partitions cached in memory or persisted to disk are managed by the BlockManager. Once the driver program ends (that is, the executor's CoarseGrainedExecutorBackend stops), the BlockManager also stops, and RDDs cached on disk are cleared (the local folders used by the BlockManager are deleted).

 

  • Shortcoming of the checkpoint mechanism: it is costly to operate. A huge dataset must be replicated between machines over the data center network, whose bandwidth is often far lower than memory bandwidth, and extra storage resources are consumed as well. Spark therefore focuses on recording updates instead.
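The recovery-cost trade-off described above can be sketched in plain Python (this is a toy simulation, not Spark code; all names here are invented for illustration). A long chain of transformations is replayed from the last checkpoint on failure instead of from the very beginning:

```python
stats = {"steps": 0}  # counts transformation steps replayed, to compare recovery cost

def transform(x):
    """One step in a long DAG of transformations."""
    stats["steps"] += 1
    return x + 1

def run_chain(start, n_steps, checkpoint_at=None):
    """Run the chain once; optionally snapshot an intermediate value,
    standing in for a checkpoint written to HDFS."""
    value, checkpoint = start, None
    for step in range(n_steps):
        value = transform(value)
        if step == checkpoint_at:
            checkpoint = (step, value)  # the durable copy survives failures
    return value, checkpoint

def recover(start, n_steps, checkpoint):
    """After a failure, resume from the checkpoint if one exists,
    otherwise replay the whole lineage from the beginning."""
    if checkpoint is None:
        from_step, value = 0, start
    else:
        from_step, value = checkpoint[0] + 1, checkpoint[1]
    for _ in range(from_step, n_steps):
        value = transform(value)
    return value
```

With a checkpoint at step 90 of a 100-step chain, recovery replays only the last 10 steps instead of all 100; the price is the extra write of the snapshot to stable storage.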

 

Lineage mechanism

  An RDD supports only coarse-grained transformations: it records the single operation performed on a whole block of data, and this series of recorded transformations (each RDD records how it was derived from other RDDs, i.e. how to reconstruct its blocks of data) is what allows a lost partition to be recovered. This is why the RDD fault-tolerance mechanism is called "Lineage". Lineage is essentially similar to a redo log in a database, except that this redo log has a much larger granularity: the same redo is applied to whole blocks of data to recover them.

  Compared with the fine-grained, record-level update backups or log mechanisms of other systems, an RDD's Lineage records coarse-grained transformation operations on the data (such as filter, map, and join). When some partitions of an RDD are lost, Lineage provides enough information to recompute and recover exactly those lost partitions. This coarse-grained data model limits what Spark is suitable for, so Spark does not fit every high-performance scenario, but it also brings performance advantages over fine-grained data models.
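The idea of replaying a recorded coarse-grained operation, rather than logging individual records, can be modeled in a few lines of plain Python (a toy model, not Spark's actual classes; the names are invented for illustration):

```python
class LineageRDD:
    """A toy dataset that records its parent and the operation that
    produced it, so a lost partition can be rebuilt on demand."""

    def __init__(self, partitions, parent=None, op=None):
        self.partitions = partitions  # list of partitions (lists of records)
        self.parent = parent          # parent dataset in the lineage
        self.op = op                  # the recorded coarse-grained transformation

    def map(self, fn):
        # Apply fn to every record; the child remembers fn, not the data deltas.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return LineageRDD(new_parts, parent=self, op=fn)

    def lose_partition(self, i):
        self.partitions[i] = None     # simulate a node failure

    def recover_partition(self, i):
        # Replay the recorded op on the parent's partition i:
        # the "redo log" is the operation itself, applied to a whole block.
        self.partitions[i] = [self.op(x) for x in self.parent.partitions[i]]
```

Only the lost partition is recomputed; the surviving partitions are untouched, and no per-record log ever had to be written.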

  Lineage classifies RDD dependencies into two kinds: narrow dependencies (Narrow Dependencies) and wide dependencies (Wide Dependencies), distinguished by whether a partition of the parent RDD corresponds to one or multiple partitions of the child RDD. In a narrow dependency, each child partition depends on a single parent partition; in a wide dependency, a parent partition feeds multiple child partitions.

 

  • First, with a narrow dependency, a computing node can compute a block of the child RDD directly from the corresponding block of the parent RDD. With a wide dependency, all data of the parent RDD must first be computed, then hash-partitioned and passed to the corresponding nodes, before the child RDD can be computed.
  • Second, when data is lost, a narrow dependency requires recomputing only the lost block to recover; a wide dependency requires recomputing all blocks of the ancestor RDDs. Therefore, when the lineage chain is long, and especially when it contains wide dependencies, checkpoints should be set at appropriate points. These two characteristics also mean that different dependencies call for different task scheduling and fault-recovery mechanisms.
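The two recovery paths above can be sketched as follows (plain Python, not Spark; the partition layout and the modulo "hash partitioner" are invented for illustration):

```python
def narrow_recover(parent_parts, fn, lost):
    """Narrow dependency (e.g. map): child partition i depends only on
    parent partition i, so recovery reads exactly one parent partition."""
    parents_read = [lost]
    child_part = [fn(x) for x in parent_parts[lost]]
    return child_part, parents_read

def wide_recover(parent_parts, n_child, lost):
    """Wide dependency (e.g. a hash-partitioned shuffle): every parent
    partition may hold records destined for the lost child partition,
    so all parent partitions must be read and reprocessed."""
    parents_read = list(range(len(parent_parts)))
    child_part = [x for part in parent_parts for x in part
                  if x % n_child == lost]  # toy hash partitioner
    return child_part, parents_read
```

Recovering one child partition under the narrow dependency touches one parent partition; under the wide dependency it touches every parent partition, which is exactly why long lineages containing wide dependencies are the ones worth checkpointing.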

 

Principle of Fault Tolerance

In the fault-tolerance mechanism, if a node crashes and the operation is a narrow dependency, only the lost parent RDD partition needs to be recomputed, without depending on other nodes. A wide dependency requires all partitions of the parent RDD to be available, so recomputation is very expensive. The cost difference can be understood as follows: with a narrow dependency, when a child partition is lost and the corresponding parent partition is recomputed, all of the data in that parent partition contributes to the lost child partition, so there is no redundant computation. With a wide dependency, recomputing the parent partitions after losing one child partition produces data that is not all destined for the lost partition; some of it belongs to child partitions that were never lost, so redundant computation is incurred. This is why wide dependencies are more expensive to recover from. Therefore, when deciding where to apply the Checkpoint operator, consider not only whether the Lineage is long enough but also whether it contains wide dependencies; placing a checkpoint after a wide dependency gives the best value for money.

 

Origin blog.csdn.net/qq_32445015/article/details/101542086