Features of Spark RDD

  • A list of partitions
            A partition is the basic unit of an RDD's data set. Each partition is processed by one computing task, so the number of partitions determines the granularity of parallel computation.
  • A function for computing each partition
            Computation on a Spark RDD is carried out partition by partition, and each RDD implements a compute function to do this.
  • Dependencies on other RDDs (lineage)
            Each transformation of an RDD produces a new RDD, so a chain of dependencies forms between RDDs. This relationship is called lineage. If the data of a partition is lost during computation, Spark uses these dependencies to recompute only the lost partitions instead of recomputing all partitions of the RDD.
  • Optionally, a partitioner for RDDs that store key-value pairs
            Only RDDs of key-value pairs have a Partitioner; for other RDDs it is None. The Partitioner determines the number of partitions of the RDD itself and also the number of partitions output during the shuffle.

  • A list of preferred locations for each partition
            Spark’s philosophy is that “moving computation is better than moving data”. When scheduling tasks, Spark tries to assign each computing task to a node that holds the data it will process.
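
The five properties above can be sketched as a small toy model. This is plain Python, not the Spark API: the names ToyRDD, SourceRDD, MappedRDD, partition_by and the node-a/node-b/node-c hosts are all hypothetical and exist only for illustration. The sketch assumes one parent per RDD and a simple hash partitioner, and it lets a mapped RDD forward its parent's location hints, which is a simplification of how a real scheduler resolves locality.

```python
# Toy model of the five RDD properties. Hypothetical classes, NOT Spark's API.


class ToyRDD:
    def __init__(self, num_partitions, parent=None, partitioner=None):
        self.num_partitions = num_partitions  # 1. a list of partitions
        self.parent = parent                  # 3. dependency on a parent (lineage)
        self.partitioner = partitioner        # 4. optional, key-value data only

    def compute(self, split):                 # 2. function computing one partition
        raise NotImplementedError

    def preferred_locations(self, split):     # 5. locality hints for one partition
        return []

    def map(self, f):
        # A transformation returns a new RDD that remembers its parent.
        return MappedRDD(self, f)

    def collect(self):
        # One "task" per partition; results are concatenated in partition order.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


class SourceRDD(ToyRDD):
    def __init__(self, blocks, hosts, partitioner=None):
        super().__init__(len(blocks), partitioner=partitioner)
        self.blocks = blocks
        self.hosts = hosts
        self.computed = []  # records which partitions were actually read

    def compute(self, split):
        self.computed.append(split)
        return iter(self.blocks[split])

    def preferred_locations(self, split):
        return [self.hosts[split]]


class MappedRDD(ToyRDD):
    def __init__(self, parent, f):
        super().__init__(parent.num_partitions, parent=parent)
        self.f = f

    def compute(self, split):
        # Rebuilding one partition only touches the same partition of the parent.
        return (self.f(x) for x in self.parent.compute(split))

    def preferred_locations(self, split):
        # Simplification: forward the parent's hint for the same partition.
        return self.parent.preferred_locations(split)


def partition_by(pairs, num_partitions):
    # Hash partitioning: the key alone decides the target partition.
    blocks = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        blocks[hash(key) % num_partitions].append((key, value))
    return blocks


source = SourceRDD([[1, 2], [3, 4], [5]], ["node-a", "node-b", "node-c"])
doubled = source.map(lambda x: x * 2)

# Simulate losing partition 1 of `doubled` and rebuilding it from lineage.
recovered = list(doubled.compute(1))
```

After the last line, source.computed contains only partition 1, which mirrors the point made about lineage: recovery follows the dependency chain for the lost partition and leaves the others alone. In the same spirit, doubled.preferred_locations(1) points at the host of the underlying block, which is the hint a scheduler would use to place the task next to its data.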

Origin blog.csdn.net/FlatTiger/article/details/114916492