What is Spark's caching mechanism for big data?

What kind of caching mechanism does Spark have for big data? First of all, Spark is open source, so reading its source code is the most direct way to learn how its cache works. In earlier versions of Spark, the CacheManager is the component responsible for caching: when a user caches the data of an RDD partition, that data can be fetched from the cache the next time it is needed instead of being recomputed.
The CacheManager's underlying storage is the BlockManager; the CacheManager itself only maintains the cache's metadata. In Spark's cache, the RDD data is stored in the BlockManager, so the next time the RDD is needed it can be read directly from the cache, which avoids the cost of recomputation.
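The effect of this is easiest to see from the user side. The following is a minimal sketch (the object name and the data are made up for illustration): the first action computes the partitions and stores them as blocks, the second action reads those blocks back from the cache.

import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheReuseExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A small RDD standing in for an expensive computation.
    val expensive = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // cache() registers the RDD at the default storage level (MEMORY_ONLY);
    // nothing is stored until an action runs.
    expensive.cache()

    // The first action computes the partitions and stores them as blocks.
    println(expensive.count())

    // The second action reads the cached blocks instead of recomputing.
    println(expensive.sum())

    sc.stop()
  }
}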

Common questions around the cache include the persistence strategy, the RDD caching (unroll) process, the eviction mechanism, and cache updates. Each of these has a corresponding handling strategy in Spark's caching mechanism.
Take persistence strategies as an example: Spark defines several mechanisms for persisting RDDs, expressed as different StorageLevel values. As Mr. Chen from Shangxuetang points out, rdd.cache() stores the RDD as deserialized Java objects (StorageLevel.MEMORY_ONLY). With this level, when Spark estimates that there is not enough memory to hold a partition, it simply does not store that partition and recomputes it the next time it is needed. StorageLevel.MEMORY_ONLY is suitable when objects need frequent, low-latency access and you want to avoid the cost of serialization. Compared with the other options it takes up more memory, and it also puts more pressure on Java GC.
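A short sketch of the trade-off, assuming an existing SparkContext named sc:

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)

// Deserialized Java objects in memory: fastest to access, but the
// largest footprint and the most pressure on the JVM garbage collector.
val squares = nums.map(x => x.toLong * x).persist(StorageLevel.MEMORY_ONLY)

// Serialized bytes in memory: more compact, but every access pays the
// cost of deserialization.
val pairs = nums.map(x => (x % 10, x)).persist(StorageLevel.MEMORY_ONLY_SER)

squares.count()   // triggers computation and stores the blocks
pairs.count()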

Another example is the RDD caching process itself. Before an RDD is cached, its Records do not occupy contiguous space. After the RDD is cached in storage memory, each Partition becomes a Block, and its Records occupy a contiguous region of on-heap or off-heap storage memory. This process of converting a Partition from non-contiguous to contiguous storage space is the Unroll process.
The storage levels MEMORY_AND_DISK and MEMORY_AND_DISK_SER are analogous to MEMORY_ONLY and MEMORY_ONLY_SER, respectively. With MEMORY_ONLY and MEMORY_ONLY_SER, if a partition does not fit in memory it is simply not cached and will be recomputed when it is needed again. With MEMORY_AND_DISK and MEMORY_AND_DISK_SER, a partition that does not fit in memory is spilled to disk instead, as in the sketch below.
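A sketch of the disk-backed levels, again assuming an existing SparkContext named sc; the data is made up for illustration:

import org.apache.spark.storage.StorageLevel

// With the *_AND_DISK levels, partitions that do not fit in storage
// memory are spilled to local disk instead of being dropped and recomputed.
val big = sc.parallelize(1 to 10000000).map(x => (x, x.toString))

big.persist(StorageLevel.MEMORY_AND_DISK)        // objects in memory, spill to disk
// big.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized variant: smaller, slower to read

big.count()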

If the data is not serialized and storage memory runs out during the Unroll process, the memory that has already been acquired for the partially unrolled data is released again. If the data is serialized, this problem does not arise, because the space required can be computed and requested in one go.
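The incremental acquire-and-release behaviour can be pictured with a simplified sketch. This is not Spark's actual MemoryStore code; reserveMemory, releaseMemory and sizeOf are hypothetical stand-ins for the storage memory pool and the size estimator.

// Simplified sketch of unrolling a partition of deserialized records.
def unrollPartition[T](records: Iterator[T],
                       reserveMemory: Long => Boolean,
                       releaseMemory: Long => Unit,
                       sizeOf: T => Long): Option[Vector[T]] = {
  var reserved = 0L
  val buffer = Vector.newBuilder[T]
  var keepGoing = true
  while (records.hasNext && keepGoing) {
    val record = records.next()
    val needed = sizeOf(record)
    if (reserveMemory(needed)) {       // ask for a bit more memory per record
      reserved += needed
      buffer += record
    } else {
      keepGoing = false                // ran out of storage memory mid-unroll
    }
  }
  if (keepGoing) {
    Some(buffer.result())              // the whole partition fits and becomes a block
  } else {
    releaseMemory(reserved)            // give back what was already reserved
    None                               // caller falls back to recomputing or spilling
  }
}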
What kind of caching mechanism does Spark have for big data? There is of course much more to it, such as the eviction mechanism and cache updates, which are not covered in detail here. In short, how to balance space against speed is always a complex subject for the data caches of big data tools such as Spark, and the trade-off between cost and benefit also has to be weighed against the needs of the actual project.
 
