Resilient Distributed Datasets: Reading Notes

1. Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs.
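To make the definition concrete, here is a minimal sketch in plain Python (not Spark). `ToyRDD`, `from_storage` and `partitions` are hypothetical names invented for this note, and the model computes eagerly to stay short; it only illustrates "read-only", "partitioned", and the two ways an RDD can come into existence.

```python
class ToyRDD:
    """Hypothetical toy model of an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples of tuples: the records cannot be modified after creation.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_storage(cls, records, num_partitions):
        # (1) Created from a dataset in "stable storage" (here just an iterable).
        records = list(records)
        return cls(records[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # (2) Created from an existing RDD by a deterministic operation.
        # The parent is left untouched; a new ToyRDD is returned.
        return ToyRDD(tuple(fn(x) for x in p) for p in self._partitions)

    @property
    def partitions(self):
        return self._partitions


base = ToyRDD.from_storage(range(6), num_partitions=2)
doubled = base.map(lambda x: x * 2)
```

Because `map` returns a new object instead of changing `base`, the parent dataset stays available for other derivations.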

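2. RDDs are evaluated lazily: transformations only record how a dataset is derived, and nothing is actually computed until an action is triggered.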
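The lazy behaviour can be sketched in plain Python with closures and generators. This is a toy model, not Spark's implementation: `LazyRDD`, `parallelize` and `traced_square` are hypothetical names, with `map`/`filter` standing in for transformations and `collect` for an action.

```python
class LazyRDD:
    """Hypothetical toy model: transformations build a recipe, actions run it."""

    def __init__(self, compute):
        # compute is a zero-argument function returning an iterator of records.
        self._compute = compute

    @staticmethod
    def parallelize(data):
        data = list(data)
        return LazyRDD(lambda: iter(data))

    def map(self, fn):
        # Transformation: wraps the parent's recipe, computes nothing yet.
        return LazyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        # Transformation: also only records what to do.
        return LazyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # Action: this is the point where the recorded recipe actually runs.
        return list(self._compute())


calls = []

def traced_square(x):
    calls.append(x)  # records when the function is really invoked
    return x * x

rdd = LazyRDD.parallelize([1, 2, 3, 4]).map(traced_square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)  # snapshot taken before any action
result = rdd.collect()             # the action triggers the computation
```

In this sketch `traced_square` is not called while the chain is being built; it runs only once `collect()` is invoked.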

 

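3. Dependencies between RDDs are either narrow or wide. In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD; in a wide dependency, several child partitions may depend on the same parent partition. This is easy to see from a diagram.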
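Without a diagram at hand, the distinction can also be expressed as a small check in plain Python. The function `classify_dependency` and the partition maps below are hypothetical illustrations, not part of any Spark API: a dependency is treated as narrow when every parent partition feeds at most one child partition, and as wide otherwise.

```python
from collections import Counter


def classify_dependency(child_to_parents):
    """Classify a dependency given {child partition: [parent partitions]}.

    Narrow: each parent partition is used by at most one child partition.
    Wide:   some parent partition is used by several child partitions.
    """
    usage = Counter(
        parent
        for parents in child_to_parents.values()
        for parent in set(parents)
    )
    return "narrow" if all(n <= 1 for n in usage.values()) else "wide"


# map-like: child partition i reads only parent partition i.
map_like = {0: [0], 1: [1], 2: [2]}

# coalesce-like: parent partitions are merged, but each is read only once.
coalesce_like = {0: [0, 1], 1: [2]}

# groupByKey-like shuffle: every child partition reads every parent partition.
shuffle_like = {0: [0, 1, 2], 1: [0, 1, 2]}

kinds = [classify_dependency(d) for d in (map_like, coalesce_like, shuffle_like)]
```

The practical consequence is that narrow dependencies can be pipelined on one node, while wide dependencies need data from many parent partitions to be shuffled across nodes.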


4. Spark's scheduler turns the program logic and its RDDs into a DAG and executes it in stages.
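A rough sketch of the stage idea, in plain Python: narrow transformations are pipelined into the same stage, and a new stage begins at each wide (shuffle) dependency. This is a simplification under stated assumptions, not Spark's scheduler: the lineage is a linear chain rather than a general DAG, `split_into_stages` is a hypothetical helper, and the operation names are used only as labels.

```python
def split_into_stages(lineage):
    """Group a linear lineage into stages, cutting at wide dependencies.

    lineage is a list of (operation name, dependency kind) pairs, where the
    kind is "source", "narrow" or "wide" (assumed labels for this sketch).
    """
    stages = [[]]
    for name, kind in lineage:
        # A wide dependency needs a shuffle, so it starts a new stage.
        if kind == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


lineage = [
    ("textFile", "source"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("filter", "narrow"),
    ("sortByKey", "wide"),
]

stages = split_into_stages(lineage)
for i, ops in enumerate(stages):
    print(f"stage {i}: {' -> '.join(ops)}")
```

Each stage in this model is a run of operations that can execute without waiting for data from other partitions; the cuts mark where a shuffle would be required.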



 

 


 

Reposted from tcxiang.iteye.com/blog/2098528