The soul of Spark: RDD and DataSet

Spark is built on a single abstraction, the RDD. A job loads the data to be processed into RDDs and then applies a series of RDD operators to obtain the result.
An RDD is a fault-tolerant, parallel data structure. Its data can be kept in memory or on disk, the user can control how the data is partitioned, and a rich API is provided for manipulating the data.

1: Definition and analysis of the five characteristics of RDD
An RDD (Resilient Distributed Dataset) is an abstraction of distributed memory and a highly restricted shared-memory model: a read-only collection of partitioned records that can be computed in parallel across the nodes of the cluster. It is an abstract model of a working set. Every RDD is described by five characteristics:
(1) a list of partitions
(2) a function for computing each partition
(3) a list of dependencies on other RDDs
(4) optionally, a partitioner for key-value RDDs
(5) optionally, a list of preferred locations for computing each partition
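
The five characteristics above can be made concrete with a small single-process sketch. This is a toy model in plain Python, not Spark's implementation: the names ToyRDD and parallelize, and the way data is sliced into partitions, are assumptions made for illustration only. In Spark itself the corresponding members of the RDD class are getPartitions, compute, getDependencies, partitioner and getPreferredLocations.

```python
import math


class ToyRDD:
    """Toy, single-process model of the five RDD characteristics (hypothetical, not Spark's API)."""

    def __init__(self, num_partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.num_partitions = num_partitions         # (1) the partitions, modelled as indices 0..n-1
        self._compute = compute                      # (2) function that computes one partition
        self.dependencies = list(dependencies)       # (3) parent RDDs this RDD depends on
        self.partitioner = partitioner               # (4) optional, only meaningful for key-value RDDs
        self._preferred = preferred_locations or {}  # (5) optional locality hints per partition

    def partitions(self):
        return list(range(self.num_partitions))

    def compute(self, split):
        return self._compute(split)

    def preferred_locations(self, split):
        return self._preferred.get(split, [])

    # Transformations are lazy: they only record how to compute each partition from the parent.
    def map(self, f):
        return ToyRDD(self.num_partitions,
                      lambda i: (f(x) for x in self.compute(i)),
                      dependencies=[self])

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: (x for x in self.compute(i) if pred(x)),
                      dependencies=[self])

    # An action: evaluates every partition and gathers the records into one list.
    def collect(self):
        return [x for i in self.partitions() for x in self.compute(i)]


def parallelize(data, num_partitions):
    """Split a local list into contiguous chunks, one chunk per partition (assumed scheme)."""
    size = math.ceil(len(data) / num_partitions)
    chunks = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
    return ToyRDD(num_partitions, lambda i: iter(chunks[i]))


source = parallelize(list(range(1, 9)), 4)
squares_of_evens = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())
```

Because each derived ToyRDD keeps a reference to its parent and a function for computing one partition, a single partition can be rebuilt by calling compute on it again without touching the others. This is a simplified picture of how the dependency list supports fault tolerance.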
2: Definition of DataSet and analysis of its internal mechanisms

Origin blog.51cto.com/wangyichao/2436090