(RDD) Cache Usage Explained

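1) For RDDs, persist() and cache() both store data in memory by default; cache() is simply persist() with the default storage level, MEMORY_ONLY.

2) cache() is lazy, just like a transformation: it only marks the RDD for caching. The data is actually cached when an action is triggered.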

Q: Is cache in Spark lazy or eager?

A: In Spark Core, cache() is lazy. In Spark SQL, caching is eager (the CACHE TABLE statement caches the table immediately by default).
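The lazy behavior described above can be illustrated with a toy model in plain Python. This is not Spark code: LazyDataset, collect and compute_calls are hypothetical names invented for the sketch, and it only mimics the idea that cache() records a request while the first action does the work.

```python
class LazyDataset:
    """Toy stand-in for an RDD: wraps a compute function and counts how often it runs."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_requested = False
        self._cached = None
        self.compute_calls = 0

    def cache(self):
        # Lazy: only record the request; nothing is computed or stored yet.
        self._cache_requested = True
        return self

    def collect(self):
        # An "action": this is the point where work actually happens.
        if self._cached is not None:
            return list(self._cached)
        self.compute_calls += 1
        result = list(self._compute())
        if self._cache_requested:
            self._cached = result
        return list(result)


squares = LazyDataset(lambda: (x * x for x in range(5))).cache()
calls_after_cache = squares.compute_calls  # cache() alone did no work
first = squares.collect()                  # first action computes and fills the cache
second = squares.collect()                 # second action is served from the cache
```

An eager cache, by contrast, would run the computation inside cache() itself.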

3) Storage levels

NONE: no storage level (nothing is cached)

DISK_ONLY: disk only

DISK_ONLY_2: disk only, two replicas

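MEMORY_ONLY: memory, deserialized. The RDD is stored as deserialized objects. If it does not fit in memory, the partitions that do not fit are not cached and are recomputed whenever they are needed; nothing is written to disk. (Worst case, they simply are not stored.)

MEMORY_ONLY_2: memory, deserialized, two replicas

MEMORY_ONLY_SER: memory, serialized. Each partition is stored as a byte array. This is more space-efficient, but reading it costs more CPU.

MEMORY_ONLY_SER_2: memory, serialized, two replicas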

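MEMORY_AND_DISK: memory + disk, deserialized. The RDD is stored in memory as deserialized objects; partitions that do not fit in memory are stored on disk.

MEMORY_AND_DISK_2: memory + disk, deserialized, two replicas

MEMORY_AND_DISK_SER: memory + disk, serialized

MEMORY_AND_DISK_SER_2: memory + disk, serialized, two replicas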

*********** Serialization significantly reduces storage space. The default level is MEMORY_ONLY.
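The levels listed above differ in only a few properties: whether disk is used, whether memory is used, whether data is kept deserialized, and how many replicas exist. The plain-Python table below restates them that way. Level, LEVELS and describe are hypothetical names for this sketch, and the flag values reflect my reading of the list above rather than anything copied from Spark, so check them against Spark's StorageLevel before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    use_disk: bool
    use_memory: bool
    deserialized: bool
    replication: int = 1


# Assumed flag values, one entry per level named in the list above.
LEVELS = {
    "NONE": Level(False, False, False),
    "DISK_ONLY": Level(True, False, False),
    "DISK_ONLY_2": Level(True, False, False, 2),
    "MEMORY_ONLY": Level(False, True, True),
    "MEMORY_ONLY_2": Level(False, True, True, 2),
    "MEMORY_ONLY_SER": Level(False, True, False),
    "MEMORY_ONLY_SER_2": Level(False, True, False, 2),
    "MEMORY_AND_DISK": Level(True, True, True),
    "MEMORY_AND_DISK_2": Level(True, True, True, 2),
    "MEMORY_AND_DISK_SER": Level(True, True, False),
    "MEMORY_AND_DISK_SER_2": Level(True, True, False, 2),
}


def describe(name):
    """Return a short human-readable summary of one storage level."""
    lvl = LEVELS[name]
    places = [p for p, used in (("memory", lvl.use_memory), ("disk", lvl.use_disk)) if used]
    if not places:
        return "not stored"
    form = "deserialized" if lvl.deserialized else "serialized"
    return f"{' + '.join(places)}, {form}, x{lvl.replication}"


summary = {name: describe(name) for name in LEVELS}
```

Seen this way, every _2 level is its base level with replication set to 2, and every _SER level is its base level with the deserialized flag turned off.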

4) How to choose a storage level

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

The choice is a trade-off between CPU and memory.

1.If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

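If your RDDs fit within the default storage level, do not pick anything else. It gives the best performance and is the most efficient option (provided there is enough memory; this is the first choice).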

2.If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

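If MEMORY_ONLY is not enough (that is, memory is short), try MEMORY_ONLY_SER together with a fast serialization library (Kryo), which uses memory more economically. The whole point of serialization is to save space.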

RDDA ==> RDDB ==> RDDC

3.Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

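Do not spill to disk unless the computation that produced the dataset is expensive, or it filters out a large share of the data. Otherwise, recomputing a partition may be just as fast as reading it back from disk.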

4.Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

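Use a replicated storage level if you need faster failure recovery. Every storage level is fault-tolerant because lost data can be recomputed, but the replicated levels let you keep running tasks on the RDD without waiting for the recomputation (the data is fetched from another node).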

************************Prefer option 1; fall back to option 2 if it is not enough. The last two are not recommended.
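The four steps above can be read as a small decision procedure. The sketch below is one possible reading, not an official rule: choose_storage_level and its parameters are hypothetical, and the choice of MEMORY_AND_DISK_SER for the spill case is an assumption of this example.

```python
def choose_storage_level(fits_deserialized, fits_serialized,
                         recompute_is_expensive=False, need_fast_recovery=False):
    """Return the name of a storage level following the four-step guidance."""
    if fits_deserialized:
        # Step 1: the default is the most CPU-efficient choice.
        level = "MEMORY_ONLY"
    elif fits_serialized:
        # Step 2: trade some CPU for a smaller memory footprint.
        level = "MEMORY_ONLY_SER"
    elif recompute_is_expensive:
        # Step 3: spill to disk only when recomputation is costly (assumed level).
        level = "MEMORY_AND_DISK_SER"
    else:
        # Cheap to recompute: keep what fits and recompute the rest.
        level = "MEMORY_ONLY_SER"
    if need_fast_recovery:
        # Step 4: replicated variants avoid waiting for recomputation.
        level += "_2"
    return level


small = choose_storage_level(fits_deserialized=True, fits_serialized=True)
tight = choose_storage_level(fits_deserialized=False, fits_serialized=True)
costly = choose_storage_level(False, False, recompute_is_expensive=True)
serving = choose_storage_level(True, True, need_fast_recovery=True)
```

Whether a dataset "fits" is something to measure on the actual cluster; the function only encodes the order in which the options are tried.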

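5) Removing cached data

Spark automatically monitors cache usage on each node and evicts old data using an LRU (least recently used) policy. To remove cached data manually instead of waiting for automatic eviction, call RDD.unpersist().

unpersist() removes the cached data, and it is eager.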
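The eviction policy above can be illustrated with a toy LRU store in plain Python. This is a simplified model, not Spark's implementation: LruBlockStore and its methods are hypothetical, and capacity is counted in blocks here, whereas a real cache is limited by memory size.

```python
from collections import OrderedDict


class LruBlockStore:
    """Toy cache that evicts the least recently used block when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()  # oldest entry first, most recent last

    def put(self, block_id, data):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)
        self._blocks[block_id] = data
        evicted = []
        while len(self._blocks) > self.capacity:
            old_id, _ = self._blocks.popitem(last=False)  # drop the LRU block
            evicted.append(old_id)
        return evicted

    def get(self, block_id):
        if block_id not in self._blocks:
            return None
        self._blocks.move_to_end(block_id)  # reading counts as a use
        return self._blocks[block_id]

    def unpersist(self, prefix):
        # Manual removal: takes effect immediately, without waiting for eviction.
        doomed = [b for b in self._blocks if b.startswith(prefix)]
        for b in doomed:
            del self._blocks[b]
        return doomed

    def block_ids(self):
        return list(self._blocks)


store = LruBlockStore(capacity=2)
store.put("rddA_0", [1, 2])
store.put("rddB_0", [3, 4])
store.get("rddA_0")                    # touch A, so B is now least recently used
evicted = store.put("rddC_0", [5, 6])  # store is full: B is evicted
removed = store.unpersist("rddA")      # explicit removal of A's blocks
remaining = store.block_ids()
```

The last two calls show the two removal paths side by side: automatic eviction when space runs out, and explicit removal on request.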
