Spark cache

A very important feature of Spark's data abstraction, the RDD, is that an RDD can be persisted for reuse by later actions. The storage methods and the various cache levels are described below.


Storage methods: an RDD is persisted or cached by calling persist() or cache().

Remarks: 1. The data in an RDD is fault-tolerant; 2. Shuffle operations do not need an explicit cache() or persist(): their intermediate results are automatically persisted, so a node failure or similar problem does not force the RDD to be recomputed from scratch.
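A minimal sketch of the two calls, assuming a spark-shell session (where sc is predefined) and a placeholder input file:

import org.apache.spark.storage.StorageLevel

// `sc` is the SparkContext provided by spark-shell; "input.txt" is a placeholder path.
val lines  = sc.textFile("input.txt")
val errors = lines.filter(_.contains("ERROR"))

errors.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // persist() lets you choose a different storage level

errors.count()   // the first action materializes the cache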

RDDs support coarse-grained transformations (a single operation applied to a whole partition, such as a map: Spark records which RDD the data was derived from and how). If part of the data is lost, there is no need to re-execute everything; it is enough to know which RDD it came from and which transformation produced it, and the data can be regenerated. This is why the approach is called "Lineage" fault tolerance.
For this fault tolerance, RDD dependencies are divided into narrow dependencies and wide dependencies. Recovery under a narrow dependency is efficient, mainly because each child partition depends on only one parent partition (one Transformation), so only that partition needs to be recomputed; under a wide dependency a child partition is built from many parent partitions, all of which need to be recalculated.
My understanding is that most shuffle operations produce wide dependencies, while map-like Transformation operations produce narrow dependencies. A small sketch follows.
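To make the distinction concrete, here is a sketch (again assuming a spark-shell session) that inspects the dependency type of a map result versus a reduceByKey result; the numbers and keys are arbitrary:

import org.apache.spark.{NarrowDependency, ShuffleDependency}

// `sc` is provided by spark-shell.
val nums    = sc.parallelize(1 to 100, numSlices = 4)
val mapped  = nums.map(_ * 2)                                 // map: narrow dependency
val reduced = mapped.map(n => (n % 3, n)).reduceByKey(_ + _)  // reduceByKey: wide (shuffle) dependency

def kind(rdd: org.apache.spark.rdd.RDD[_]): String = rdd.dependencies.head match {
  case _: ShuffleDependency[_, _, _] => "wide (shuffle) dependency"
  case _: NarrowDependency[_]        => "narrow dependency"
}

println(kind(mapped))   // narrow dependency
println(kind(reduced))  // wide (shuffle) dependency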


Every time an action is called, Spark starts from the original input RDD and re-executes all of the transformations. For batch jobs this is not a problem, but for interactive tasks that repeatedly operate on the same data, repeating the same computation over and over is very inefficient.

Spark provides a caching API to solve this problem. Users can call the cache() or persist() method to keep intermediate results in memory or on disk; the next time the same computation is needed, the cached data is read directly, improving efficiency. The difference between persist() and cache() is that persist() accepts a parameter to select among different storage levels, while cache() uses the default level.
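A sketch of the interactive case just described (spark-shell assumed, "access.log" is a placeholder path): the first action pays the full computation cost and later actions reuse the cached result instead of re-reading and re-parsing the file.

// `sc` is provided by spark-shell; "access.log" is a placeholder path.
val parsed = sc.textFile("access.log")
               .map(_.split(" "))
               .filter(_.length > 2)

parsed.persist()                          // no-argument persist() uses the same default level as cache()

parsed.count()                            // action 1: runs the whole lineage and fills the cache
parsed.filter(_.contains("GET")).count()  // action 2: reuses the cached partitions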


MEMORY_ONLY (default): RDD partition data is stored directly in JVM memory as Java objects. If there is not enough memory, some partitions are simply not cached and are recomputed from their lineage information when needed.
MEMORY_AND_DISK: RDD data is stored in JVM memory as Java objects; partitions that do not fit in memory are written to disk and read back from disk when needed.
MEMORY_ONLY_SER: RDD data (Java objects) is serialized and stored in JVM memory (each partition's data becomes a byte array). Compared with MEMORY_ONLY this can save a lot of memory, especially with a fast serialization tool, but costs extra CPU when the data is read back; if memory is insufficient, partitions are handled the same way as with MEMORY_ONLY.
MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, except that when memory is insufficient the serialized data is stored on disk.
DISK_ONLY: RDD data is stored only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: taking MEMORY_ONLY_2 as an example, data is stored in the same way as with MEMORY_ONLY; the difference is that each cached partition is replicated to two different nodes in the cluster. The other "_2" levels are analogous.
OFF_HEAP (experimental): RDD data is serialized and stored in Tachyon. Compared with MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead, makes Spark executors smaller and lighter, and allows memory to be shared; because the data lives in Tachyon, the failure of a Spark node does not cause the cached data to be lost. This makes the option attractive in environments with large memory or many concurrent applications. Note that Tachyon is not bundled with Spark, so an appropriate version must be deployed separately; Tachyon manages its data in blocks, which may be evicted according to its own policy and will not be rebuilt.
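A hedged sketch of selecting one of the levels above explicitly (spark-shell assumed; the path and the choice of MEMORY_AND_DISK_SER are only illustrative):

import org.apache.spark.storage.StorageLevel

// `sc` is provided by spark-shell; "events.log" is a placeholder path.
val events = sc.textFile("events.log").map(_.toUpperCase)

events.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilled to disk when memory is short
// events.persist(StorageLevel.MEMORY_ONLY_2)     // alternative: replicate each cached partition to two nodes
// events.persist(StorageLevel.OFF_HEAP)          // alternative: experimental off-heap storage

events.count()       // materializes the data at the chosen storage level
events.unpersist()   // frees the cached blocks when they are no longer needed

Note that an RDD's storage level can only be set once; the commented lines are alternatives, not calls to chain together.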
