A compilation of RDD-related notes

What are RDDs?

Official website description

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel

RDD is an acronym for Resilient Distributed Dataset. It is Spark's most basic data abstraction and represents an immutable, partitioned collection whose elements can be operated on in parallel.

  • Dataset means it is a collection that stores many data elements.
  • Distributed means the data is stored in a distributed manner: it is split into pieces kept on different physical machines, which enables distributed parallel computing later on.
  • Resilient means RDD data can be stored either in memory or on disk; this is configurable, and memory is the default.

Five major characteristics of RDD

(1) - A list of partitions
(2) - A function for computing each split
(3) - A list of dependencies on other RDDs
(4) - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
(5) - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
  1. A list of partitions: an RDD is made up of multiple data partitions. When Spark later executes the RDD, each partition becomes one task, and multiple tasks run in parallel.

  2. A function that is applied to each split (partition). The developer writes only a single operator in the code, but at execution time that function runs on every partition of the RDD.

  3. An RDD can depend on multiple other RDDs, and the dependencies between parent and child RDDs form a lineage. Spark's fault-tolerance mechanism is built on this feature.

  4. (Optional) A partitioner exists only for key-value RDDs (the kind that can go through a shuffle). For a non key-value RDD the partitioner is None, meaning there is none.

    Spark ships with two partitioners (see the sketch after this list):

    The first, HashPartitioner, takes the hashCode of the key and computes it modulo the number of partitions to obtain the partition number. [This is the default.]

    The second, RangePartitioner, partitions by ranges of keys: keys that fall in the same range go to the same partition.

  5. (Optional) A list of preferred locations for computing each split, i.e. data locality. Spark prefers to launch computing tasks on the nodes that already hold the data; in other words, the computation is placed where the data is, which reduces network transfer and improves performance.
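These five characteristics map directly onto the public RDD API. Below is a minimal Scala sketch that inspects each of them; the local setup, app name, and sample data are all illustrative:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddFiveTraits {
  def main(args: Array[String]): Unit = {
    // Local setup for illustration only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-five-traits"))

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
    val summed = pairs.reduceByKey(_ + _)   // (2) this function runs on every partition

    println(pairs.partitions.length)   // (1) a list of partitions: 4
    println(summed.dependencies)       // (3) dependencies on other RDDs (a ShuffleDependency here)
    println(summed.partitioner)        // (4) Some(HashPartitioner(4)): key-value RDD after a shuffle
    println(new HashPartitioner(4).getPartition("a"))                      // key.hashCode modulo 4
    summed.partitions.foreach(p => println(summed.preferredLocations(p)))  // (5) empty for in-memory data

    sc.stop()
  }
}
```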

Operator classification of RDD

RDD operators are mainly divided into two types:

  1. transformation
    • It converts one RDD into another RDD. When the program reaches this kind of code, it only builds the dependency (lineage) relationships and does not trigger real computation. (This is somewhat similar in style to hardware-oriented programming and TensorFlow, which first draw a "graph" and then trigger execution.)
    • For example: flatMap, map, reduceByKey.
  2. action
    • It triggers real execution, running the "graph" that was built previously (see the sketch after this list).
    • For example: collect, saveAsTextFile
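A minimal word-count sketch makes the split concrete (assuming an existing SparkContext named sc; the sample data and output path are illustrative). The transformations only extend the lineage; nothing runs until the action is called:

```scala
val lines = sc.parallelize(Seq("a b a", "b c"))

// Transformations: each call only extends the lineage graph; no job runs yet.
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Action: submits a job and actually executes the graph built above.
counts.collect().foreach(println)       // e.g. (a,2), (b,2), (c,1)
// counts.saveAsTextFile("/tmp/counts") // another action
```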

Dependencies between RDDs

Two types of dependencies: wide dependencies and narrow dependencies.

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD (one-to-one or many-to-one). Examples: map, filter, flatMap.

Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD (one-to-many). Examples: reduceByKey, groupByKey, sortByKey.

Narrow dependencies will not cause shuffle, but wide dependencies will.
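The dependency type an operator produces can be checked directly on the RDD. A small sketch, again assuming an existing SparkContext sc (the sample data is illustrative):

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}

val nums   = sc.parallelize(1 to 10, 4)
val mapped = nums.map(n => (n % 2, n))   // narrow: each child partition reads one parent partition
val summed = mapped.reduceByKey(_ + _)   // wide: child partitions read from every parent partition

println(mapped.dependencies.head.isInstanceOf[NarrowDependency[_]])        // true
println(summed.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]) // true
```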


The dependencies between the RDDs in your code form a lineage. If the partition data of some RDD is lost, the lost partitions can be recovered by recomputing along the lineage.

What is shuffle

Simply put, the shuffle process pulls the same key, distributed across multiple nodes in the cluster, onto the same node to perform aggregation or join operations. Operators such as reduceByKey and join trigger shuffle. During a shuffle, each node first writes the same key to its local disk, and then other nodes pull that key from those disk files over the network. When the same keys are pulled to one node for aggregation, a single node may have too many keys to hold in memory, so data spills to disk. A shuffle therefore generates a large amount of disk I/O and network I/O, which is the main reason shuffle performs poorly.
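For example, a join must bring records with the same key together before it can match them, so it triggers a shuffle. A minimal sketch (assuming an existing SparkContext sc; the sample data is illustrative):

```scala
val orders = sc.parallelize(Seq((1, "order-A"), (2, "order-B"), (1, "order-C")))
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

// join is a shuffle operator: records with the same user id, initially spread
// over different partitions and nodes, are pulled to the same partition first.
val joined = orders.join(users)
joined.collect().foreach(println)   // (1,(order-A,alice)), (1,(order-C,alice)), (2,(order-B,bob))
```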

RDD caching mechanism

What is RDD cache?

Saving the RDD data in memory or on disk. When the data is needed later, it can be read directly from the cache, avoiding repeated computation.

How to set up RDD cache?

Refer to principle three of the Meituan Spark tuning guide (second link in the references below).

What is the difference between cache and persist?

cache is essentially a call to the persist method with the default storage level, which puts data in memory. persist can store data in memory or on disk; it accepts different storage levels, which are defined in StorageLevel.

How to clear RDD cache

Method 1: the system clears the cached data automatically; when the application finishes, the cached data disappears.

Method 2: manual clearing, by calling the unpersist(true) method (true makes the call block until all cached blocks are removed).
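A short sketch tying the caching pieces together (assuming an existing SparkContext sc; the file path is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("/tmp/app.log").filter(_.contains("ERROR"))

errors.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill to disk when memory is short
                                                 // (an RDD's storage level can only be set once)

println(errors.count())  // first action computes the RDD and fills the cache
println(errors.count())  // second action reads from the cache, no recomputation

errors.unpersist(true)   // manual cleanup; blocking = true waits until all blocks are removed
```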

Stage division of the DAG (Directed Acyclic Graph)

A DAG (Directed Acyclic Graph) is a graph that is directed but contains no cycles; it is generated from the dependencies between the RDDs in a program.

1. Why should we divide stages?

A job may contain a large number of wide and narrow dependencies; wide dependencies cause a shuffle while narrow dependencies do not. After stages are divided, a single stage contains only narrow dependencies and no wide dependencies, so the tasks within it can run independently in parallel.

2. How to divide stages?

Work backwards from the last RDD. First create a stage, which will be the final stage. When a narrow dependency is encountered, add that RDD to the current stage; when a wide dependency is encountered, cut the stage and create a new one. Continue until the first RDD is reached, at which point stage division is complete.

3. How to execute stage?

A stage is divided into multiple tasks according to the data partitions. When a Spark job runs, the stages execute sequentially according to their dependency order. Within a stage, each task processes the data of its own partition, and these tasks perform the same operation in parallel.
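toDebugString prints an RDD's lineage and indents a new level at each shuffle boundary, which is exactly where stages are cut. A sketch assuming an existing SparkContext sc (the input path and the sample output below are illustrative):

```scala
val counts = sc.textFile("/tmp/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // wide dependency: the stage boundary is cut here

println(counts.toDebugString)
// (2) ShuffledRDD[4] at reduceByKey ...          <- the final stage
//  +-(2) MapPartitionsRDD[3] at map ...          <- the previous stage (narrow deps only)
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  /tmp/words.txt HadoopRDD[0] ...
```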

Reference links

https://www.bilibili.com/video/BV1AJ411R7rb/?p=11&spm_id_from=pageDriver&vd_source=8a9f7d97e5a2fbdbe8a4a83a47d251b9

https://tech.meituan.com/2016/04/29/spark-tuning-basic.html


Original article: blog.csdn.net/yy_diego/article/details/128221373