Persistence operators

cache

  • cache() = persist() = persist(StorageLevel.MEMORY_ONLY)

persist lets you specify the storage level manually:

  • persist(StorageLevel.MEMORY_ONLY)
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • Notes:
    • Try to avoid the DISK_ONLY level, since every read goes to disk
    • Try to avoid the "_2" levels, which replicate each partition on two nodes
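
The relationship between cache and persist and the choice of storage level can be sketched as follows. This is a minimal example, not a definitive setup: the local master, the application name, and the sample data are assumptions made only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevels {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[*]").setAppName("persist-levels")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "a", "c"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val cached = words.map(w => (w, 1)).cache()

    // persist() with an explicit level: partitions are stored serialized
    // and spill to disk when they do not fit in memory.
    val persisted = words.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // getStorageLevel reports the level assigned to each RDD.
    println(cached.getStorageLevel == StorageLevel.MEMORY_ONLY)
    println(persisted.getStorageLevel == StorageLevel.MEMORY_AND_DISK_SER)

    sc.stop()
  }
}
```

An RDD's storage level can only be assigned once, which is why the sketch uses two separate RDDs instead of re-persisting the same one at a different level.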

Notes on using cache and persist:

  • cache and persist work at the granularity of a partition. Both are lazy: an action operator is required to trigger execution
  • The result of calling cache or persist on an RDD can be assigned to a variable; the next use of that variable reads the persisted data directly
  • Do not chain an action operator directly after cache or persist, because the variable would then hold the result of the action instead of the persisted RDD
  • Persisted data is cleared when the application finishes
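
The points above can be sketched in a few lines. This is only an illustration under assumed names (the local master, the application name, and the sample data are not from the original), and the final `unpersist()` call is an addition showing how to release the data before the application ends.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheUsage {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[*]").setAppName("cache-usage")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000)

    // cache() is lazy: this line only marks the RDD, nothing is computed yet.
    // Assigning the result to a variable lets later code reuse the cached data.
    val squares = numbers.map(x => x.toLong * x).cache()

    // First action: computes the partitions and stores them in memory.
    val total = squares.reduce(_ + _)

    // Second action: reads the cached partitions instead of recomputing map().
    val count = squares.count()

    println(s"total=$total count=$count")

    // Anti-pattern: chaining an action right after cache().
    // `wrong` is a Long, so there is no RDD variable left to reuse.
    val wrong = numbers.map(x => x.toLong * x).cache().count()
    println(s"wrong is just a number: $wrong")

    // Optional: release the cached data early instead of waiting
    // for the application to finish.
    squares.unpersist()

    sc.stop()
  }
}
```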

checkpoint

  • checkpoint persists data to disk and also cuts the dependencies between RDDs
  • When the lineage is very long and the computation is complex, you can use checkpoint to persist the RDD
  • When the application finishes, the checkpoint data is not cleared
  • checkpoint execution process:
    • After an action has triggered the job and the job has finished, Spark traces the lineage back from the last RDD toward the first
    • During the trace-back, every RDD on which checkpoint was called is marked
    • After the trace-back, the data of each marked RDD is recomputed and the result is written to the specified checkpoint directory
    • The dependencies of the checkpointed RDD are cut
    • Optimization: before calling checkpoint on an RDD, it is a good idea to cache it first, so that the checkpoint step reads the cached data instead of recomputing the RDD
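
The checkpoint process described above, including the cache optimization, might look like this in practice. It is a minimal sketch: the checkpoint directory `/tmp/spark-checkpoint-demo`, the local master, and the sample transformations are hypothetical choices for illustration; on a cluster the directory would normally be on a shared file system such as HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[*]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical directory; it must be set before checkpoint() is called.
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo")

    val base = sc.parallelize(1 to 100)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)

    // Cache first so the checkpoint step can reuse the computed partitions.
    derived.cache()

    // checkpoint() only marks the RDD; nothing is written until an action runs.
    derived.checkpoint()
    println(s"before action: isCheckpointed=${derived.isCheckpointed}")

    // The action runs the job; afterwards the marked RDD is written
    // to the checkpoint directory and its lineage is cut.
    derived.count()
    println(s"after action: isCheckpointed=${derived.isCheckpointed}")

    // The debug string shows the lineage as Spark now sees it.
    println(derived.toDebugString)

    sc.stop()
  }
}
```

Note that `checkpoint()` has to be called before the first action runs on the RDD, which is why it appears ahead of `count()` here.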

Origin blog.csdn.net/qq_43205282/article/details/103987005