Understand step by step
Contents of wc.txt:
hello java
spark hadoop flume kafka
hbase kafka flume hadoop
Question: how many times does the following code print "-------------------------"? (rdd2 is not cached)
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    val rdd4: RDD[Int] = rdd2.map(x => x.size)
    rdd3.collect() // first job: computes rdd2
    rdd4.collect() // second job: recomputes rdd2 from scratch
    Thread.sleep(10000000) // keep the app alive so the web UI can be inspected
  }
}
The correct answer is 6. wc.txt has three lines, so one pass over the file calls the flatMap body three times (one print per line), and two collect() actions are executed. Because rdd2 is not cached, its flatMap is recomputed once per job: 3 prints x 2 jobs = 6.
Problems with the code above
1. An RDD is reused in multiple jobs
- Problem: every job that uses the RDD re-executes the RDD's entire upstream processing chain.
- Benefit of persistence: once the RDD's data is persisted, subsequent jobs read the saved data directly instead of re-running the RDD's upstream processing.
2. A job has a long dependency chain
- Problem: when the dependency chain is long, lost data must be recomputed from the start of the chain, which wastes a lot of time.
- Benefit of persistence: lost or reused data can be read directly from the persisted copy instead of being recomputed, saving time.
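The recomputation problem is not Spark-specific. A minimal plain-Scala sketch (no Spark involved; the object and counter names are invented here) shows the same effect: a non-memoizing lazy view re-runs its upstream work for every consumer, while a materialized collection computes it once.

```scala
// Plain-Scala analogy, NOT Spark: a lazy, non-memoizing pipeline re-runs
// its upstream work for every traversal, just like an uncached RDD.
object RecomputeDemo {
  var evalCount = 0 // how many times the flatMap body has run

  val lines = List("hello java", "spark hadoop flume kafka", "hbase kafka flume hadoop")

  def run(): (Int, Int) = {
    evalCount = 0
    // Like an uncached RDD: a view re-evaluates flatMap on every traversal.
    def words = lines.view.flatMap { l => evalCount += 1; l.split(" ") }
    words.map(w => (w, 1)).toList // "job" 1 forces a full traversal
    words.map(_.length).toList    // "job" 2 forces another full traversal
    val uncached = evalCount      // 6 = 3 lines x 2 traversals

    evalCount = 0
    // Like a cached RDD: materialize once, reuse the result.
    val cached = lines.flatMap { l => evalCount += 1; l.split(" ") }
    cached.map(w => (w, 1))
    cached.map(_.length)
    val withCache = evalCount     // 3 = computed once

    (uncached, withCache)
  }
}
```

Calling RecomputeDemo.run() yields (6, 3), mirroring the 6-vs-3 print counts in the Spark examples below.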
Use Cache or Persist
Question: how many times does the following code print "-------------------------"? (rdd2 is cached)
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    rdd2.cache() // persist rdd2 in memory: computed during the first job, reused afterwards
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    val rdd4: RDD[Int] = rdd2.map(x => x.size)
    rdd3.collect() // first job: computes and caches rdd2 (3 prints)
    rdd4.collect() // second job: reads rdd2 from the cache (no prints)
    Thread.sleep(10000000) // keep the app alive so the web UI can be inspected
  }
}
The correct answer is 3: rdd2 is computed (and its three prints happen) only during the first job; the second job reads the cached data.
In the Spark web UI, the cached RDD appears as a green dot in the job's DAG, and the Storage tab shows that the data is held in memory.
RDD persistence comes in two forms: cache/persist and checkpoint.
cache
- Data storage location: memory/local disk of the host where the task runs
- Data saving timing: the data is saved during the execution of the first job that computes the cached RDD
- Usage: rdd.cache() / rdd.persist() / rdd.persist(StorageLevel.XXXX)
- Difference between cache and persist:
  - cache only stores data in memory (internally, cache() calls persist() with the default storage level)
  - persist lets you specify whether the data is stored in memory and/or on disk
- Commonly used storage levels:
  - StorageLevel.MEMORY_ONLY: keep the data in memory only; generally used when the data volume is small
  - StorageLevel.MEMORY_AND_DISK: keep the data in memory and spill what does not fit to disk; generally used when the data volume is large
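MEMORY_AND_DISK means "memory first, spill the overflow to disk", not two full copies. A deliberately simplified toy sketch of that spill idea (this is not Spark's BlockManager; the class and method names are invented here):

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Toy spill-to-disk cache: up to `memSlots` blocks stay in memory;
// anything beyond that is written to a temp file, like MEMORY_AND_DISK.
class SpillingCache(memSlots: Int) {
  private val mem  = mutable.LinkedHashMap[String, String]()
  private val disk = mutable.Map[String, Path]()

  def put(key: String, value: String): Unit =
    if (mem.size < memSlots) mem(key) = value
    else {
      // Memory is "full": spill this block to disk instead of failing.
      val p = Files.createTempFile("block-", ".bin")
      Files.write(p, value.getBytes("UTF-8"))
      disk(key) = p
    }

  // Reads hit memory first, then fall back to the spilled file.
  def get(key: String): Option[String] =
    mem.get(key).orElse(disk.get(key).map(p => new String(Files.readAllBytes(p), "UTF-8")))

  def inMemory: Set[String] = mem.keySet.toSet
}
```

With memSlots = 1, the first block stays in memory and the second is spilled, yet both remain readable, which is the behavior MEMORY_AND_DISK buys you over MEMORY_ONLY.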
CheckPoint
Question: how many times does the following code print "-------------------------"? (rdd2 is checkpointed)
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    sc.setCheckpointDir("hdfs://hadoop102:8020/sparkss")
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    rdd2.checkpoint() // save rdd2 to HDFS after the first job that uses it finishes
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    val rdd4: RDD[Int] = rdd2.map(x => x.size)
    rdd3.collect() // first job computes rdd2 (3 prints), then a separate checkpoint job recomputes it (3 more)
    rdd4.collect() // reads from the checkpoint: no prints
    rdd4.collect() // reads from the checkpoint: no prints
    Thread.sleep(10000000) // keep the app alive so the web UI can be inspected
  }
}
The correct answer is 6, no matter how many action operators follow: after the first job that uses the checkpointed RDD finishes, Spark triggers one extra job that recomputes the RDD (3 more prints) and saves its data to the checkpoint directory; every later job reads the saved data instead of recomputing.
Why checkpoint is needed
Caching saves data in the memory/disk of the host where the task runs. If that server goes down, the data is lost and must be recomputed from its dependencies, which costs a lot of time. Checkpointing instead saves the data to a reliable storage medium (HDFS), avoiding data loss and recomputation.
- Data storage location: HDFS
- Data saving timing: after the first job containing the checkpointed RDD finishes, a separate job is triggered to recompute the RDD and save its data
- Usage:
  1. Set the directory for saving data: sc.setCheckpointDir(path)
  2. Mark the RDD: rdd.checkpoint()
Because checkpoint triggers a separate job that recomputes the data before saving it, the RDD is computed twice. To avoid this, combine it with cache: rdd.cache() followed by rdd.checkpoint(). The checkpoint job then reads from the cache instead of recomputing, so only 3 lines are printed.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Cache {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))
    sc.setCheckpointDir("hdfs://hadoop102:8020/sparkss")
    val rdd1: RDD[String] = sc.textFile("src/main/resources/wc.txt")
    val rdd2: RDD[String] = rdd1.flatMap(x => {
      println("-------------------------")
      x.split(" ")
    })
    rdd2.cache()      // computed once during the first job, then kept in memory
    rdd2.checkpoint() // the checkpoint job reads the cached data instead of recomputing
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    val rdd4: RDD[Int] = rdd2.map(x => x.size)
    rdd3.collect() // first job: computes and caches rdd2 (3 prints); checkpoint job reuses the cache
    rdd4.collect() // reads cached/checkpointed data: no prints
    rdd4.collect() // reads cached/checkpointed data: no prints
    Thread.sleep(10000000) // keep the app alive so the web UI can be inspected
  }
}
The difference between cache and checkpoint
1. Data storage location
- cache saves data in the memory/disk of the host where the task runs
- checkpoint saves data to HDFS
2. Data saving timing
- cache saves data during the execution of the first job that computes the RDD
- checkpoint saves data after the first job containing the RDD has finished, via a separate job
3. Whether the dependencies (lineage) are kept
- cache keeps the RDD's dependencies: the cached data lives on a single host and can be lost if the server goes down, so Spark must still be able to recompute it from the lineage
- checkpoint removes the RDD's dependencies: the data sits in HDFS and will not be lost, so the lineage is no longer needed and is truncated
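The lineage-truncation point can be sketched without Spark: model a dependency chain as nodes that recompute from their parent, and checkpointing as swapping the chain for a node that just reads saved data. This is a conceptual sketch only, with all names invented here; it is not how Spark implements RDDs.

```scala
// Conceptual sketch of lineage truncation, NOT Spark internals.
object LineageDemo {
  sealed trait Node { def compute(): Seq[String] } // compute = replay the lineage
  case class Source(data: Seq[String]) extends Node { def compute() = data }
  case class Mapped(parent: Node, f: String => String) extends Node {
    def compute() = parent.compute().map(f) // must re-run the whole parent chain
  }
  // After a checkpoint, recovery reads the saved data; there is no parent pointer.
  case class Checkpointed(saved: Seq[String]) extends Node { def compute() = saved }

  def run(): (Seq[String], Seq[String]) = {
    // lineage: Source -> toUpperCase -> append "!"
    val chain = Mapped(Mapped(Source(Seq("spark", "hbase")), _.toUpperCase), _ + "!")
    val ck    = Checkpointed(chain.compute()) // "save to HDFS", then drop the lineage
    (chain.compute(), ck.compute())           // same result, but ck replays nothing
  }
}
```

Both nodes return the same data, but Checkpointed holds no reference to its parents, which is exactly why a checkpointed RDD's dependencies can be removed while a cached RDD's cannot.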