1. Preface
This is simply copied down for the record; please point out any problems with the translation.
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
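To make the semantics concrete, here is a minimal plain-Python sketch of what reduceByKey computes, ignoring distribution entirely. The data and the helper name reduce_by_key are illustrative, not Spark code:

```python
from functools import reduce
from collections import defaultdict

# Hypothetical (key, value) records spread across three partitions,
# standing in for a pair RDD.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 2), ("c", 5)],
    [("b", 3), ("a", 4)],
]

def reduce_by_key(parts, f):
    """Gather every value for a key, then fold them with f - the same
    per-key result reduceByKey would produce on a real cluster."""
    grouped = defaultdict(list)
    for part in parts:
        for k, v in part:
            grouped[k].append(v)
    return {k: reduce(f, vs) for k, vs in grouped.items()}

print(reduce_by_key(partitions, lambda x, y: x + y))
# {'a': 7, 'b': 4, 'c': 5}
```

Note that the values for "a" start out on three different partitions; the gathering step is exactly what forces the shuffle in a distributed setting.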
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
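The all-to-all step can be sketched in plain Python. Spark's default HashPartitioner routes a key roughly like hash(key) % numPartitions; the data and partition counts below are made up for illustration:

```python
# Every input partition is read, and each record is routed to the
# output partition that owns its key (hash partitioning, as Spark's
# default HashPartitioner does). Illustrative data, not Spark code.
input_partitions = [
    [("a", 1), ("b", 1)],
    [("c", 2), ("a", 3)],
]
num_output = 2

shuffled = [[] for _ in range(num_output)]
for part in input_partitions:
    for k, v in part:
        shuffled[hash(k) % num_output].append((k, v))

# All values for a given key now live in exactly one output partition,
# so a single reduce task can compute that key's final result.
print(shuffled)
```

Because every key lands in exactly one output partition, each reduce task can finish its keys without talking to any other task.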
Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:
- mapPartitions to sort each partition using, for example, .sorted
- repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
- sortBy to make a globally ordered RDD
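The repartition-then-sort-within-partitions idea can be sketched in plain Python. The toy range partitioner and data here are assumptions for illustration, not Spark's actual implementation:

```python
# Sketch of repartitionAndSortWithinPartitions semantics: route records
# to new partitions, then sort each partition by key independently.
records = [("d", 4), ("a", 1), ("c", 3), ("b", 2)]
num_partitions = 2

def partitioner(key):
    # toy range partitioner: keys below "c" to partition 0, the rest to 1
    return 0 if key < "c" else 1

parts = [[] for _ in range(num_partitions)]
for k, v in records:
    parts[partitioner(k)].append((k, v))

# Sorting happens per partition, which is cheaper than a global sortBy
# because no second cross-partition pass is needed.
parts = [sorted(p) for p in parts]
print(parts)  # [[('a', 1), ('b', 2)], [('c', 3), ('d', 4)]]
```

This is why repartitionAndSortWithinPartitions is more efficient than repartitioning and then sorting: the sort is pushed into the shuffle machinery itself, one partition at a time.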
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
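As an example of why the join family shuffles, here is a plain-Python sketch of what cogroup computes - for each key, the values from both datasets must be brought together into a pair of lists. The datasets and helper name are illustrative:

```python
from collections import defaultdict

# cogroup semantics: for each key, collect the values from the left
# and right datasets side by side. In Spark this requires a shuffle,
# since matching keys may start on different partitions and machines.
left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

def cogroup(l, r):
    out = defaultdict(lambda: ([], []))
    for k, v in l:
        out[k][0].append(v)
    for k, v in r:
        out[k][1].append(v)
    return dict(out)

print(cogroup(left, right))
# {'a': ([1, 3], ['x']), 'b': ([2], []), 'c': ([], ['y'])}
```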