Detailed explanation of Spark's RDD concept

Overview

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents a distributed collection and supports distributed operations on it.
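To make the definition concrete, here is a minimal sketch of creating an RDD from a local collection and running a distributed operation on it (the application name and the `local[*]` master, used here for a single-machine test run, are placeholder settings):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // parallelize turns a local Scala collection into a distributed RDD,
    // here split into 4 partitions.
    val nums = sc.parallelize(1 to 100, 4)

    // map and reduce run on the partitions in parallel.
    val sumOfSquares = nums.map(n => n * n).reduce(_ + _)
    println(s"sum of squares = $sumOfSquares")

    sc.stop()
  }
}
```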

Birth background
Before RDDs/Datasets existed, to do a WordCount (a big data computation) you could use:

  1. Native collections: a List in Java/Scala works, but only on a single machine; it does not support distributed execution. To compute in a distributed way you would have to do a great deal of extra work yourself, such as thread/process communication, fault tolerance, and automatic load balancing. That is troublesome, which is why a framework was born to solve these problems.
  2. MapReduce: inefficient, both to run and to develop with, and long since superseded.

So a distributed data abstraction is needed: with it, a distributed collection can be represented, and operations based on that collection make it very convenient to complete a distributed WordCount. (Underneath, the distributed collection should encapsulate the implementation details and expose a simple, easy-to-use API.)

Five attributes

Internally, each RDD is characterized by five main attributes (a sketch showing where each one surfaces in the API follows the list):

  • Partition list: A list of partitions
  • Compute function: A function for computing each split
  • Dependencies: A list of dependencies on other RDDs
  • Partitioner: Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Preferred locations: Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
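As referenced above, here is a minimal sketch of where these attributes surface in the RDD API (the object name and the local master are illustrative assumptions; the compute function has no getter because Spark invokes it internally, once per partition, when an action runs):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object FiveAttributes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("five-attributes").setMaster("local[*]"))

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val reduced = pairs.reduceByKey(new HashPartitioner(2), _ + _)

    // 1. Partition list
    println(s"number of partitions: ${reduced.partitions.length}")
    // 3. Dependencies on parent RDDs (a ShuffleDependency here)
    println(s"dependencies: ${reduced.dependencies}")
    // 4. Optional partitioner for key-value RDDs
    println(s"partitioner: ${reduced.partitioner}")
    // 5. Optional preferred locations per partition
    //    (empty for in-memory data; HDFS-backed RDDs report block locations)
    reduced.partitions.foreach { p =>
      println(s"preferred locations of partition ${p.index}: " +
        reduced.preferredLocations(p))
    }
    // 2. The compute function is invoked for each partition when an
    //    action such as collect() runs.
    println(reduced.collect().toSeq)

    sc.stop()
  }
}
```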

Five attributes of RDD in WordCount

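The original figure is not reproduced here, but the mapping it illustrated can be sketched in code: each step of a WordCount job exercises one or more of the five attributes (the HDFS path and host name below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    // Partition list + preferred locations: textFile creates roughly one
    // partition per HDFS block, preferably computed on the nodes holding it.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    // Compute function + dependencies: flatMap/map record how each partition
    // is derived from its parent partition (narrow, one-to-one dependencies).
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Partitioner: reduceByKey hash-partitions the key-value RDD, adding a
    // wide (shuffle) dependency between pairs and counts.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word\t$n") }
    sc.stop()
  }
}
```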

Origin: blog.csdn.net/zh2475855601/article/details/115029506