Spark RDD

1. Course Objectives

  • 1. Master the principles of RDDs

  • 2. Use RDD operators proficiently to complete computing tasks

  • 3. Master the wide and narrow dependencies of RDDs

  • 4. Master the caching mechanism of RDDs

  • 5. Master how stages are divided

2. Overview of RDDs

  • 1. What is an RDD

    • RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel.

      • Dataset

        • An RDD is a data collection that can hold large amounts of data

      • Distributed

        • The data in an RDD is stored in a distributed manner, which facilitates distributed computing

      • Resilient

        • It means that the data in an RDD can be stored either in memory or on disk

3. Five properties of RDD

  • A list of partitions

         An RDD is made up of a list of partitions

  • A function for computing each split

         A compute function operates on each partition (split)

  • A list of dependencies on other RDDs

          An RDD may depend on several other RDDs; Spark's fault-tolerance mechanism is based on this property

  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

         (Optional) KV-type RDDs have a partitioner (whenever a shuffle occurs), which determines where each record comes from and where it goes

  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

        (Optional) A set of preferred data block locations for each split, used for data locality: computation is moved to where the data is
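
These five properties map one-to-one onto members of Spark's abstract RDD class. The sketch below paraphrases their shapes; the member names follow the Spark source, while bodies and modifiers are simplified.

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified paraphrase of the five core members of Spark's abstract RDD class.
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]                      // 1. a list of partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]   // 2. a function for computing each split
  protected def getDependencies: Seq[Dependency[_]]                  // 3. dependencies on other RDDs
  val partitioner: Option[Partitioner] = None                        // 4. optional partitioner for KV RDDs
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil // 5. optional preferred locations
}
```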

4. Create RDD

* 1. From an existing Scala collection
* val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
* 2. By calling textFile on the SparkContext to read an external data source
* val rdd2 = sc.textFile("/words.txt")
* 3. By applying an operator to an existing RDD to generate a new RDD
* val rdd3 = rdd2.flatMap(_.split(" "))
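
The three approaches combined into one minimal runnable sketch; the local[2] master is an assumption for running locally, and /words.txt is the illustrative path from above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CreateRddDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // 1. From an existing Scala collection
    val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8))

    // 2. From an external data source
    val rdd2 = sc.textFile("/words.txt")

    // 3. From an existing RDD via an operator
    val rdd3 = rdd2.flatMap(_.split(" "))

    println(rdd1.count())
    sc.stop()
  }
}
```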

5. RDD operator classification

* Transformation
* Applying a transformation to an RDD generates a new RDD. Transformations are lazily evaluated: they are not executed immediately, but merely record the series of operations to be applied to the RDD (see the sketch below).
* Action
* An action triggers the actual execution of the whole job.
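
A small sketch of this laziness, assuming an existing SparkContext named sc: the transformations only record the computation, and nothing runs until the final action.

```scala
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)          // transformation: recorded, not executed
val evens = doubled.filter(_ % 4 == 0) // transformation: still nothing has run
val result = evens.collect()           // action: triggers the actual computation
println(result.mkString(", "))
```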

6. RDD dependencies

* Narrow dependency
* Each partition of the parent RDD is used by at most one partition of the child RDD
* Analogy: an only child
* Wide dependency
* Multiple partitions of the child RDD depend on the same partition of the parent RDD
* Analogy: a family with several children
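
A sketch of how the two dependency types appear in practice, assuming an existing SparkContext sc. The dependencies method is part of the RDD API; the class names in the comments are indicative of typical output.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped = pairs.mapValues(_ + 1)    // narrow: each child partition reads one parent partition
val reduced = pairs.reduceByKey(_ + _) // wide: requires a shuffle across parent partitions

println(mapped.dependencies)  // e.g. List(OneToOneDependency) -> narrow
println(reduced.dependencies) // e.g. List(ShuffleDependency)  -> wide
```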

7. Lineage

* The lineage records the metadata of the current RDD and the transformations applied to it. If the data of some partition of an RDD is lost, it can be recovered by replaying the lineage.
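
For example, toDebugString prints the lineage Spark has recorded for an RDD (assuming an existing SparkContext sc and the /words.txt path from above).

```scala
val wordCounts = sc.textFile("/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs and transformations that Spark
// would replay to rebuild a lost partition.
println(wordCounts.toDebugString)
```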

8. RDD caching mechanism

* cache
* Call the cache method on the RDD that needs to be cached. By default the data is cached in memory, but calling cache does not cache anything by itself; a subsequent action operation is required to trigger it. Internally, cache simply calls the persist method.
* persist
* persist offers a rich set of cache levels; you specify one by calling rdd.persist(StorageLevel.MEMORY_AND_DISK) or similar
* You can delete the cached data by calling rdd.unpersist(boolean)
* true means block and wait until the cached data has been deleted
* false means do not block; subsequent operations proceed while the data is being deleted
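
A sketch of the caching API, assuming an existing SparkContext sc; note that persist cannot change the level of an RDD that is already cached, hence the commented-out line.

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/words.txt")

logs.cache()                                  // equivalent to persist(StorageLevel.MEMORY_ONLY)
// logs.persist(StorageLevel.MEMORY_AND_DISK) // persist lets you pick a richer level instead

logs.count()  // first action: computes the RDD and fills the cache
logs.count()  // second action: served from the cache

logs.unpersist(true) // blocking = true: wait until the cached data is removed
```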

9. Checkpoint mechanism

* Checkpointing provides more reliable persistence: data that needs to be persisted can be saved to HDFS
* How to set a checkpoint
* Call sc.setCheckpointDir(<HDFS directory>)
* Call the checkpoint method on the RDD that needs to be persisted
* rdd.checkpoint
* An action operation is still required afterwards to trigger the checkpoint (see the sketch below)
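
A minimal sketch of the workflow, assuming an existing SparkContext sc; the HDFS address is a placeholder. Caching before checkpointing is a common practice so the checkpoint job can reuse the computed data instead of recomputing it.

```scala
sc.setCheckpointDir("hdfs://node01:8020/checkpoint") // placeholder HDFS directory

val data = sc.textFile("/words.txt").flatMap(_.split(" "))
data.cache()      // optional: lets the checkpoint job reuse the cached data
data.checkpoint() // only marks the RDD; nothing is written yet

data.count()      // the action triggers both the job and the checkpoint write
```
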
10. The difference between cache, persist and checkpoint

* 1. All of them can persist data
* 2. cache is essentially a call to the persist method, and by default it caches data in memory
* 3. persist offers a rich set of cache levels, all defined in the StorageLevel object
* 4. After checkpoint executes, a new RDD (checkpointRDD) is generated and the dependencies of the original RDD change: the lineage is truncated, so if the checkpointed data is lost it cannot be recovered by recomputation. With cache and persist the lineage is unchanged, so lost data can still be recomputed
* 5. The order of recovery after data loss
* First look in the cache; if present, use it directly. If no cache was set, look in the checkpoint. If no checkpoint was set either, the data can only be recomputed
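
A small sketch of the lineage change described in point 4, assuming an existing SparkContext sc; the exact toDebugString output varies by Spark version, but after the checkpoint it typically starts from a ReliableCheckpointRDD instead of the original parents.

```scala
val rdd = sc.parallelize(1 to 100).map(_ * 2)
println(rdd.toDebugString) // full lineage back to the parallelized collection

sc.setCheckpointDir("/tmp/ckpt") // placeholder directory
rdd.checkpoint()
rdd.count()                      // action triggers the checkpoint write

println(rdd.toDebugString) // lineage now rooted at the checkpointed data
```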

 
