- A list of partitions
A partition is the basic unit of a data set. Each partition of an RDD is processed by one computing task, so the number of partitions determines the granularity of parallel computation.
- A function for computing each partition
Spark computes an RDD partition by partition, and every RDD implements a compute function for this purpose.
- Dependencies on other RDDs (lineage)
Every transformation of an RDD produces a new RDD, so a dependency relationship forms between RDDs. This relationship is called lineage. If the data of a partition is lost during computation, Spark uses these dependencies to recompute only the lost partition instead of recomputing every partition of the RDD.
- Optionally, a Partitioner for key-value RDDs
An RDD that stores key-value pairs can have a Partitioner. The Partitioner determines the number of partitions of the RDD itself, and also the number of partitions of the shuffle output.
- A list of preferred locations for each partition
Spark's philosophy is that "moving computation is better than moving data". Each RDD therefore stores a list of preferred locations for each of its partitions. When scheduling tasks, Spark tries to assign each computing task to a node that holds the data it will process.
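To make the five properties concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API: `ToyRDD`, `ToyHashPartitioner`, `parallelize`, and the host names `node-a` and `node-b` are all hypothetical names chosen for illustration, and the "shuffle" simply rescans the parent partitions in memory.

```python
class ToyHashPartitioner:
    """Hypothetical partitioner: routes a key to one of num_partitions buckets."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        return hash(key) % self.num_partitions


class ToyRDD:
    """Toy stand-in for an RDD that only holds the five properties."""

    def __init__(self, num_partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.num_partitions = num_partitions    # 1. partitions, identified by index
        self.compute = compute                  # 2. partition index -> iterator of records
        self.dependencies = list(dependencies)  # 3. parent RDDs (lineage)
        self.partitioner = partitioner          # 4. optional, only for key-value data
        # 5. partition index -> list of host names (empty list means no preference)
        self.preferred_locations = preferred_locations or (lambda split: [])

    def map(self, f):
        # A transformation returns a new RDD that depends on this one.
        parent = self
        return ToyRDD(
            num_partitions=parent.num_partitions,
            compute=lambda split: (f(x) for x in parent.compute(split)),
            dependencies=[parent],
            preferred_locations=parent.preferred_locations,
        )

    def partition_by(self, partitioner):
        # The partitioner fixes how many partitions the shuffled output has.
        parent = self

        def compute(split):
            # Toy shuffle read: scan every parent partition and keep
            # only the pairs that the partitioner routes to this split.
            for p in range(parent.num_partitions):
                for key, value in parent.compute(p):
                    if partitioner.get_partition(key) == split:
                        yield key, value

        return ToyRDD(partitioner.num_partitions, compute, [parent], partitioner)

    def collect(self):
        # One "task" per partition, run sequentially here.
        return [x for split in range(self.num_partitions)
                for x in self.compute(split)]


def parallelize(data, num_partitions, hosts):
    """Cut a list into contiguous slices and pin each slice to a hypothetical host."""
    data = list(data)
    n = len(data)
    bounds = [(i * n // num_partitions, (i + 1) * n // num_partitions)
              for i in range(num_partitions)]
    return ToyRDD(
        num_partitions=num_partitions,
        compute=lambda split: iter(data[bounds[split][0]:bounds[split][1]]),
        preferred_locations=lambda split: [hosts[split % len(hosts)]],
    )


base = parallelize(range(1, 7), num_partitions=3, hosts=["node-a", "node-b"])
doubled = base.map(lambda x: x * 2)

# "Losing" partition 1 and recomputing it touches only that partition's lineage.
recovered = list(doubled.compute(1))

# Key each record by parity, then repartition into 2 partitions by key.
by_key = base.map(lambda x: (x % 2, x)).partition_by(ToyHashPartitioner(2))
```

In this sketch `doubled.collect()` yields `[2, 4, 6, 8, 10, 12]`, `recovered` is `[6, 8]`, and `by_key` ends up with two partitions even though its parent has three. In Spark itself, the corresponding members of the Scala `RDD` class are `getPartitions`, `compute`, `getDependencies`, `partitioner`, and `getPreferredLocations`.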
Features of Spark RDD
Origin blog.csdn.net/FlatTiger/article/details/114916492