Understand the difference between Cache and CheckPoint in Spark in one article

Understand step by step

Contents of wc.txt:

hello java
spark hadoop flume kafka
hbase kafka flume hadoop

See how many times the following code prints "-------------------------" (inside rdd2's flatMap):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")

    // Prints a separator line once per input line, then splits the line into words
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })

    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))

    val rdd4: RDD[Int] = rdd2.map(x => x.size)

    // Two actions => two jobs, and rdd2 is recomputed for each job
    rdd3.collect()
    rdd4.collect()

    // Keep the application alive so the Spark UI (http://localhost:4040) can be inspected
    Thread.sleep(10000000)
  }

}

The correct answer is 6 (wc.txt has three lines of data, so each full pass of the flatMap prints three separator lines), because two collect() actions are executed.
The general process is as follows: because rdd2 is not cached, it is computed twice, once per job.

Problems with the approach above
1. An RDD is reused in multiple jobs

  • Problem: Every time a job is executed, all of the RDD's upstream processing logic is re-executed as well.
  • Benefit of persistence: once the RDD's data is persisted, subsequent jobs can read it directly for their calculations, without re-running the RDD's upstream processing.

2. If a job has a long dependency chain

  • Problem: When the dependency chain is too long, lost data has to be recomputed from the beginning, which wastes a lot of time.
  • Benefit of persistence: the persisted data can be read directly for calculation instead of being recomputed from the full lineage, saving time.

Use Cache or Persist

See how many times the following code prints "-------------------------" (inside rdd2's flatMap) when cache() is used:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")

    // Prints a separator line once per input line, then splits the line into words
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    // Cache rdd2 so it is computed only once
    rdd2.cache()

    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))

    val rdd4: RDD[Int] = rdd2.map(x => x.size)

    // The first job computes and caches rdd2; the second job reads the cached data
    rdd3.collect()
    rdd4.collect()

    // Keep the application alive so the Spark UI can be inspected
    Thread.sleep(10000000)
  }

}

The correct answer is 3, because rdd2 is computed only during the first job; the second job reads the cached data instead of recomputing rdd2.

In the Spark UI's DAG visualization, the cached RDD is now marked with a green dot, and the Storage tab shows that the cached data is kept in memory.
RDD persistence is divided into two forms:
cache

  • Data storage location: memory/local disk of the host where the task runs

  • Data saving timing: data is saved during the execution of the first job that uses the cached RDD

  • Usage: rdd.cache() / rdd.persist() / rdd.persist(StorageLevel.XXXX)

  • The difference between cache and persist

    • cache only stores data in memory (under the hood, cache() simply calls persist(), which defaults to StorageLevel.MEMORY_ONLY)

    • persist lets you specify, via a StorageLevel, whether the data is saved in memory and/or on disk

  • Commonly used storage levels:

    • StorageLevel.MEMORY_ONLY: keep the data in memory only; generally used when the data volume is small
    • StorageLevel.MEMORY_AND_DISK: keep the data in memory and spill to disk when memory is insufficient; generally used when the data volume is large (see the sketch below)
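
As a small illustration of the usage above, here is a minimal sketch (it reuses the wc.txt file and the word-splitting pipeline from the earlier examples; the object name Persist exists only for this sketch) that persists an RDD with an explicit storage level instead of cache():

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object Persist {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(_.split(" "))

    // persist with an explicit storage level; rdd2.cache() would be equivalent to
    // rdd2.persist(StorageLevel.MEMORY_ONLY)
    rdd2.persist(StorageLevel.MEMORY_AND_DISK)

    // the first action computes and persists rdd2; the second one reads the persisted data
    rdd2.map(x => (x, 1)).collect()
    rdd2.map(x => x.size).collect()

    sc.stop()
  }

}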

CheckPoint

See how many times the following code prints "-------------------------" (inside rdd2's flatMap) when checkpoint() is used:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {

  def main(args: Array[String]): Unit = {

    System.setProperty("HADOOP_USER_NAME", "root")
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    // Checkpoint data is written to this HDFS directory
    sc.setCheckpointDir("hdfs://hadoop102:8020/sparkss")

    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")

    // Prints a separator line once per input line, then splits the line into words
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    // After the first job finishes, a separate job recomputes rdd2 and saves it to HDFS
    rdd2.checkpoint()

    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))

    val rdd4: RDD[Int] = rdd2.map(x => x.size)

    rdd3.collect()
    rdd4.collect()
    rdd4.collect()

    // Keep the application alive so the Spark UI can be inspected
    Thread.sleep(10000000)
  }

}

The correct answer is 6. No matter how many action operators follow, it is always 6: after the first job containing the checkpointed RDD finishes, a separate job is triggered to recompute the RDD's data and then save it, so rdd2 is computed twice in total.

Why CheckPoint is used
Caching saves data in the memory/on the disk of the host where the task runs. If that server goes down and the data is lost, the data has to be recomputed from the dependency chain, which takes a lot of time. Checkpointing therefore saves the data to the reliable storage medium HDFS, so that data loss and recomputation are avoided.

  • Data storage location: HDFS
  • Data saving timing: after the first job that uses the checkpointed RDD has finished, a separate job is triggered to recompute the RDD's data and save it
  • Usage
    • 1. Set the directory where the data is saved: sc.setCheckpointDir(path)
    • 2. Save the data: rdd.checkpoint()

Checkpoint triggers a separate job that recomputes the data before saving it, so the data is computed twice. To avoid this, it can be combined with cache: rdd.cache() + rdd.checkpoint() (the code below then prints only 3 separator lines).

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {

  def main(args: Array[String]): Unit = {

    System.setProperty("HADOOP_USER_NAME", "root")
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    sc.setCheckpointDir("hdfs://hadoop102:8020/sparkss")

    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")

    // Prints a separator line once per input line, then splits the line into words
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    // Cache first so the separate checkpoint job reads the cached data instead of recomputing rdd2
    rdd2.cache()
    rdd2.checkpoint()

    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))

    val rdd4: RDD[Int] = rdd2.map(x => x.size)

    rdd3.collect()
    rdd4.collect()
    rdd4.collect()

    // Keep the application alive so the Spark UI can be inspected
    Thread.sleep(10000000)
  }

}

The difference between cache and CheckPoint

1. The data storage location is different

  • Cache saves data in the memory/on the disk of the host where the task runs.
  • Checkpoint saves data to HDFS.

2. Data saving timing is different

  • Cache saves the data during the execution of the first job that uses the RDD.
  • Checkpoint saves the data after the first job that uses the RDD has finished, via a separately triggered job.

3. Whether the dependency relationships (lineage) are kept

  • Cache saves the data in the memory/on the disk of the host where the task runs, so the data can be lost if the server goes down and must then be recomputed from the dependencies; therefore the RDD's dependencies (lineage) cannot be removed.
  • Checkpoint saves the data to HDFS, where it will not be lost, so the RDD's dependencies are no longer needed and are removed (see the sketch below).
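
To observe the lineage difference yourself, the following minimal sketch prints rdd.toDebugString before and after the checkpoint is materialized (it assumes the same local wc.txt; a local checkpoint directory /tmp/spark-checkpoint is used here instead of HDFS just for the sketch):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Lineage {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    // a local checkpoint directory for this sketch; in production this would be an HDFS path
    sc.setCheckpointDir("/tmp/spark-checkpoint")

    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(_.split(" "))

    rdd2.cache()
    rdd2.checkpoint()

    // before any action: the full lineage back to the text file is still visible
    println(rdd2.toDebugString)

    // the first action materializes the cache and triggers the separate checkpoint job
    rdd2.collect()

    // after checkpointing: the lineage now starts from the checkpoint data and the
    // earlier dependencies have been removed
    println(rdd2.toDebugString)

    sc.stop()
  }

}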


Origin blog.csdn.net/qq_46548855/article/details/134436404