Big data development-Spark-RDD persistence and caching

1. RDD caching mechanism: cache, persist

One reason Spark is so fast is that RDDs support caching. After an RDD is successfully cached, subsequent operations that use the data set read it directly from the cache. Although the cache can be lost, the dependencies between RDDs mean that if the cached data of a certain partition is lost, only that partition needs to be recomputed.

Operators involved: persist, cache, unpersist. persist and cache behave like transformations: they are lazy and only take effect when an action triggers the actual computation.

Caching writes computed results to a storage medium chosen by a user-defined storage level (the storage level defines the cache medium; memory, off-heap memory, and disk are currently supported).

Through caching, Spark avoids recomputing RDDs and can greatly increase computation speed. RDD persistence (caching) is one of Spark's most important features; it is a key factor in Spark's ability to support iterative algorithms and fast interactive queries.

One of the reasons Spark is very fast is its ability to persist (or cache) a data set in memory. When an RDD is persisted, each node keeps the computed partitions in memory and reuses them in other actions performed on this data set (or data sets derived from it), which makes subsequent actions much faster. The persist() method marks an RDD as persistent. We say "marked" because the RDD is not computed and persisted where the persist() statement appears; nothing happens until the first action triggers the actual computation, at which point the result is persisted. In short, persist() or cache() marks an RDD to be persisted; once an action triggers the computation, the RDD is kept in the memory of the compute nodes and reused.
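For example, a minimal sketch of this lazy behavior (assuming an existing SparkContext sc and a hypothetical input path data/words.txt):

val lines = sc.textFile("data/words.txt")   // hypothetical path, for illustration only
val upper = lines.map(_.toUpperCase)
upper.persist()                             // only marks the RDD for caching; nothing is computed yet
upper.count()                               // first action: computes the RDD and fills the cache
upper.take(5)                               // later action: reads the cached partitions instead of recomputing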

Deciding when to cache data is a trade-off between space and speed. In general, if an RDD will be used by multiple actions and is expensive to compute, it should be cached.

The cache may be lost, or cached data may be evicted when memory runs short. The fault-tolerance mechanism of the RDD cache guarantees that computations still produce correct results even if the cache is lost: the lost data is recomputed through the chain of transformations that produced the RDD. Since each partition of an RDD is relatively independent, only the missing partitions need to be recomputed, not all of them.

1.1 Cache level


Spark supports multiple cache levels:

  • MEMORY_ONLY : The default storage level. Stores the RDD in the JVM as deserialized Java objects. If there is not enough memory, some partitions are simply not cached and are recomputed when they are needed.
  • MEMORY_AND_DISK : Stores the RDD in the JVM as deserialized Java objects. If there is not enough memory, the partitions that do not fit are spilled to disk and read from disk when they are needed.
  • MEMORY_ONLY_SER : Stores the RDD as serialized Java objects (one byte array per partition). This saves more space than storing deserialized objects, but increases the CPU cost when reading. Only available in Java and Scala.
  • MEMORY_AND_DISK_SER : Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed when they are needed. Only available in Java and Scala.
  • DISK_ONLY : Caches the RDD on disk only.
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2 : Same as the corresponding levels above, but each partition is replicated on two cluster nodes.
  • OFF_HEAP : Similar to MEMORY_ONLY_SER, but the data is stored in off-heap memory. This requires off-heap memory to be enabled.

Two parameters need to be configured to enable off-heap memory:

  • spark.memory.offHeap.enabled : Whether to enable off-heap memory, the default value is false, and it needs to be set to true;
  • spark.memory.offHeap.size : The size of the off-heap memory space, the default value is 0, and it needs to be set to a positive value.
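For example, a minimal sketch of setting these two parameters when building the SparkConf (the application name and the 1g size are illustrative values, not requirements):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("offheap-demo")                      // illustrative name
  .set("spark.memory.offHeap.enabled", "true")     // enable off-heap memory
  .set("spark.memory.offHeap.size", "1g")          // illustrative size; must be a positive value
val sc = new SparkContext(conf)

The same properties can also be passed on the command line via spark-submit --conf.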

1.2 Use cache

There are two ways to cache data: persist and cache. cache calls persist internally; it is a specialized form of persist and is equivalent to persist(StorageLevel.MEMORY_ONLY). Examples are as follows:

// All storage levels are defined in the StorageLevel object
import org.apache.spark.storage.StorageLevel

fileRDD.persist(StorageLevel.MEMORY_AND_DISK)
fileRDD.cache()

A cached RDD is shown with a green dot in the DAG visualization.


1.3 Remove cache

Spark automatically monitors cache usage on each node and evicts old data partitions according to a least-recently-used (LRU) policy. You can also call the RDD.unpersist() method to remove the cache manually.
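For example, continuing the fileRDD example above (a minimal sketch):

fileRDD.unpersist()   // removes the cached blocks; the RDD can still be recomputed from its lineage if needed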

2. RDD fault-tolerance mechanism: Checkpoint

2.1 Operator involved: checkpoint; it is also lazy, like a transformation

In addition to persistence, Spark also provides a checkpoint mechanism for saving data. The essence of a checkpoint is to write the RDD to highly reliable disk storage, mainly for fault tolerance. Checkpoints are implemented by writing the data to the HDFS file system.

This is what the RDD checkpoint feature is for: if the lineage is too long, the cost of fault tolerance becomes too high, so it is better to checkpoint at an intermediate stage. If a node fails later and a partition is lost, the lineage is replayed starting from the checkpointed RDD, which reduces the overhead.

2.2 The difference between cache and checkpoint

There is a significant difference between cache and checkpoint. cache computes the RDD and stores it in memory, but the RDD's dependency chain cannot be discarded: when an executor goes down, the cached RDDs on it are lost and must be recomputed by replaying the dependency chain. checkpoint, by contrast, stores the RDD in HDFS, which is reliable multi-replica storage; the dependency chain can therefore be dropped, so the lineage is cut off.
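Because the checkpoint write runs as a separate job that recomputes the RDD, it is common practice to cache an RDD before checkpointing it so the data is not computed twice. A minimal sketch (assuming an existing SparkContext sc):

sc.setCheckpointDir("/tmp/checkpoint")           // checkpoint directory, same as in the example below
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.cache()                                      // keep the computed partitions in memory
rdd.checkpoint()                                 // mark the RDD for checkpointing
rdd.count()                                      // the action triggers the computation; the checkpoint job then reads from the cache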

2.3 Scenarios suitable for checkpoint

The following scenarios are suitable for using the checkpoint mechanism:

1) The lineage in the DAG is too long, and recomputing it on failure would be too expensive.

2) Checkpointing at a wide dependency yields greater benefit.

Similar to cache, checkpoint is also lazy.

val rdd1 = sc.parallelize(1 to 100000)

// Set the checkpoint directory
sc.setCheckpointDir("/tmp/checkpoint")

val rdd2 = rdd1.map(_ * 2)
rdd2.checkpoint()

// checkpoint is a lazy operation
rdd2.isCheckpointed

// RDD dependencies before the checkpoint
rdd2.dependencies(0).rdd
rdd2.dependencies(0).rdd.collect()

// Run an action once to trigger the checkpoint
rdd2.count()
rdd2.isCheckpointed

// Check the RDD dependencies again. After the checkpoint, the RDD's lineage is truncated and now starts from the checkpoint RDD
rdd2.dependencies(0).rdd
rdd2.dependencies(0).rdd.collect()

// View the checkpoint file that the RDD depends on
rdd2.getCheckpointFile

Remarks: The checkpoint files are not deleted after execution.

Origin blog.csdn.net/hu_lichao/article/details/112760214