Spark Core: RDD Caching API [Cache]

1. Overview

One of the most important features of Spark is the ability to persist (or cache) a dataset in memory across operations. When you persist an RDD, each node stores in memory the partitions it computes, and those partitions can be reused by other actions on that RDD (or on RDDs derived from it). This makes subsequent actions much faster (often more than 10x). For iterative algorithms and fast interactive use, caching is a key tool.

2. API

An RDD can be persisted with the persist() or cache() method. The first time the RDD is computed in an action, it is kept in the memory of each node. Spark's cache is fault-tolerant: if any partition of the RDD is lost, it is automatically recomputed from the original transformations that created it.

2.1 cache()

Source code:

 /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()

  • The cache method simply calls the persist method; it is essentially persist.
  • cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY), and the effect is exactly the same (see the sketch below).
  • The default storage level keeps a single deserialized copy in memory only: no disk, no off-heap, 1 replica.
  • Spark offers many other storage levels, all defined in object StorageLevel.
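
To make the equivalence concrete, here is a minimal sketch (assuming a running spark-shell with SparkContext `sc`) showing that cache() and persist(StorageLevel.MEMORY_ONLY) mark an RDD with the same storage level:

import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 100)
val b = sc.parallelize(1 to 100)

a.cache()                               // shorthand
b.persist(StorageLevel.MEMORY_ONLY)     // explicit, same effect

// Both RDDs now report the same level (memory only, deserialized, 1 replica)
println(a.getStorageLevel == b.getStorageLevel)  // true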

A small example:
1) Create an RDD

scala> val rdd=sc.makeRDD(Array("hello,spark"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[204] at makeRDD at <console>:26

2) Map the RDD to append the current timestamp, without caching

scala> val nocache=rdd.map(_.toString+System.currentTimeMillis)
nocache: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[205] at map at <console>:28

3) Collect the results multiple times

scala> nocache.collect
res36: Array[String] = Array(hello,spark1610781859495)

scala> nocache.collect
res37: Array[String] = Array(hello,spark1610781890107) 

scala> nocache.collect
res38: Array[String] = Array(hello,spark1610781949022)      

4) Map the RDD to append the current timestamp, and cache it

scala> val nocache_cache=rdd.map(_.toString+System.currentTimeMillis).cache
nocache_cache: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[206] at map at <console>:28

5) Collect the cached results multiple times

scala> nocache_cache.collect
res39: Array[String] = Array(hello,spark1610782077927)                          

scala> nocache_cache.collect
res40: Array[String] = Array(hello,spark1610782077927)                          

scala> nocache_cache.collect
res41: Array[String] = Array(hello,spark1610782077927)

  • Comparing the results of steps 3 and 5: without caching, every collect recomputes the map and produces a new timestamp; once cached, reuse costs no recomputation time, and the value (with its original timestamp) is fetched directly from memory.

2.2 persist()

Source code:

 /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
  • Usage is the same as cache(), except that persist() can also take an explicit storage level chosen from StorageLevel, as sketched below.
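
For illustration, a minimal sketch of persist() with an explicit storage level (assuming a running spark-shell with SparkContext `sc`; the level name comes from object StorageLevel shown in the next section):

import org.apache.spark.storage.StorageLevel

val lines = sc.parallelize(Seq("hello,spark"))

// Keep partitions in memory, spilling to disk if they do not fit
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()   // first action materializes the cache
lines.count()   // served from the cache (or disk) afterwards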

3. Storage levels

In addition, each persisted RDD can be stored using a different storage level. These levels allow us, for example, to persist the dataset on disk, persist it in memory as serialized Java objects, replicate it across nodes, or store it in Tachyon (a distributed in-memory file system). We set a level by passing a StorageLevel object to the persist() method; the cache() method uses the default level, StorageLevel.MEMORY_ONLY.
The storage levels are defined in object StorageLevel. The source code is as follows:


object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
}


StorageLevel has five attributes:

    private var _useDisk: Boolean,      // useDisk: whether to use disk
    private var _useMemory: Boolean,    // useMemory: whether to use memory
    private var _useOffHeap: Boolean,   // useOffHeap: whether to use off-heap memory, e.g. Tachyon
    private var _deserialized: Boolean, // deserialized: whether data is kept deserialized
    private var _replication: Int = 1)  // replication: number of replicas

The complete set of storage levels, read off the constructor flags above:

  Level                   Disk  Memory  Off-heap  Deserialized  Replicas
  NONE                    no    no      no        no            1
  DISK_ONLY               yes   no      no        no            1
  DISK_ONLY_2             yes   no      no        no            2
  MEMORY_ONLY             no    yes     no        yes           1
  MEMORY_ONLY_2           no    yes     no        yes           2
  MEMORY_ONLY_SER         no    yes     no        no            1
  MEMORY_ONLY_SER_2       no    yes     no        no            2
  MEMORY_AND_DISK         yes   yes     no        yes           1
  MEMORY_AND_DISK_2       yes   yes     no        yes           2
  MEMORY_AND_DISK_SER     yes   yes     no        no            1
  MEMORY_AND_DISK_SER_2   yes   yes     no        no            2
  OFF_HEAP                yes   yes     yes       no            1
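
The five attributes are exposed as public accessors, so a level's behavior can be checked programmatically. A small illustrative snippet (runnable in the spark-shell):

import org.apache.spark.storage.StorageLevel

val lvl = StorageLevel.MEMORY_AND_DISK_SER_2
println(lvl.useDisk)       // true  - spills to disk when memory is full
println(lvl.useMemory)     // true  - keeps partitions in memory
println(lvl.useOffHeap)    // false - stays on the JVM heap
println(lvl.deserialized)  // false - stored as serialized bytes
println(lvl.replication)   // 2     - one extra replica on another node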

4. How to choose a storage level

Spark's multiple storage levels represent different trade-offs between memory usage and CPU efficiency. Choosing how to store data is genuinely a technical decision, with several questions to weigh:

  • Cache on disk? (the safe option)

  • Cache in memory? (the fast option)

  • Use off-heap memory? (memory outside the JVM heap is not governed by the garbage collector; it is managed by Spark itself and is rarely used)

  • Serialize before caching? (depends on the data size; suited to large datasets)

  • Keep a replica? (for high availability, when the data is expensive to recompute)

We recommend selecting an appropriate storage level through the following process:

  • If your RDD fits in memory at the default storage level (MEMORY_ONLY), use the default. It is the most CPU-efficient option and makes operations on the RDD as fast as possible.

  • If the default level does not fit, try MEMORY_ONLY_SER and pick a fast serialization library: the objects become much more space-efficient while remaining reasonably fast to access (see the sketch after this list).

  • Do not spill RDDs to disk unless the functions that compute them are expensive or they filter out a large amount of data. Otherwise, recomputing a partition can be as fast as reading it back from disk.

  • If you want faster failure recovery, use one of the replicated storage levels. All storage levels are fully fault-tolerant, since lost data can be recomputed from lineage, but replicated data lets tasks keep running on the RDD without waiting for that recomputation.
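
As a hedged illustration of the second recommendation, the sketch below persists an RDD as serialized bytes and configures Kryo, a faster serialization library that Spark supports, via spark.serializer. It is written as a standalone application (the serializer must be set before the context is created; the app name and data are made up for the example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("ser-cache-demo")  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 1000000)

// Stored in memory as serialized bytes: more compact, slightly more CPU to read
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count()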

In environments with a large amount of memory or with multiple applications, OFF_HEAP has the following advantages:

  • It lets multiple executors share the same memory pool in Tachyon.

  • It significantly reduces the cost of garbage collection.

  • If a single executor crashes, the cached data will not be lost.

In particular:

  • MEMORY_ONLY: the most efficient; no serialization or deserialization is involved
  • MEMORY_ONLY_SER: requires serialization and deserialization, but saves space

5. Clean up the cache

  • Caching trades space for time, so it occupies extra storage resources. How do we clean it up?
  • Call rdd.unpersist() to clear the cache.
  • Where the cached data lives depends on the storage level, but unpersist removes the cache blocks belonging to the RDD wherever they are and resets its storage level to NONE.

The source code is as follows:

/**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE
    this
  }
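
Per the signature above, blocking defaults to true, so unpersist() waits until all blocks are removed; passing blocking = false makes the removal asynchronous. A minimal usage sketch (`rdd` stands for any cached RDD):

// Block until all cached blocks are removed (the default)
rdd.unpersist()

// Fire-and-forget: request removal and return immediately
rdd.unpersist(blocking = false)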

Continuing the example above, clear its cache:

scala> nocache_cache.unpersist()
res43: nocache_cache.type = MapPartitionsRDD[206] at map at <console>:28

  • Collecting again now recomputes the RDD and takes time; note the new timestamp:

scala> nocache_cache.collect
res44: Array[String] = Array(hello,spark1610783760985)

6. Application scenarios

  • When computation speed and efficiency matter

  • When the cluster has enough resources to hold the cached data (caching takes up memory)

  • When the cached data will trigger multiple actions (the RDD's action operators are called more than once)

  • Filter first, then cache the narrowed dataset in memory (see the sketch after this list)

  • If the same RDD is used several times in an application, it can be cached in the memory of the compute nodes. The RDD is computed from its lineage only the first time; wherever it is used afterwards, it is read directly from the cache instead of being recomputed, which speeds up reuse. Release the cache once the data is no longer needed, otherwise it keeps occupying memory.
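
A hedged sketch of the filter-then-cache pattern from the list above (the input path and filter predicate are made up for illustration):

// Hypothetical input path and filter condition
val logs = sc.textFile("/data/app.log")          // assumed path
val errors = logs.filter(_.contains("ERROR"))    // shrink the data first

errors.cache()          // cache only the narrowed dataset

errors.count()          // first action materializes the cache
errors.take(10)         // later actions read from memory

errors.unpersist()      // release the cache when done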

Origin blog.csdn.net/weixin_45666566/article/details/112707442