Spark shuffle: introduction and operations

1. Preface

     This is simply copied down as a record; please point out any problems with the translation.

 

Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
 
 
 
Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
 
 
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
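As a minimal sketch of the scenario above (the object name and sample data are illustrative, not from the original post), a word-count style reduceByKey forces Spark to pull matching keys together from across partitions:

import org.apache.spark.sql.SparkSession

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reduceByKey-shuffle")
      .master("local[2]")                 // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Pairs with the same key start out on different partitions (3 slices here),
    // so Spark must shuffle them to co-locate all values for each key.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 3)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.collect().foreach(println)     // e.g. (a,3), (b,2), (c,1)
    spark.stop()
  }
}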
 
 

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD
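 
A short Scala sketch of these three options, assuming an existing SparkContext sc (the data and variable names are illustrative):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "d")), 4)

// mapPartitions: sort each partition locally, e.g. by key
val perPartition = pairs.mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator)

// repartitionAndSortWithinPartitions: repartition and sort in a single shuffle
val repartitioned = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// sortBy: produce a globally ordered RDD
val globallyOrdered = pairs.sortBy(_._1)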
 
 
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
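 
One way to check whether a given operation introduced a shuffle is to inspect the RDD lineage with toDebugString. A small sketch, again assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 100, 4).map(n => (n % 10, n))

val narrow = nums.mapValues(_ * 2)   // narrow dependency: no shuffle
val wide   = nums.groupByKey()       // wide dependency: shuffle

println(narrow.toDebugString)        // one stage, no ShuffledRDD in the lineage
println(wide.toDebugString)          // lineage includes a ShuffledRDD

Note that repartition always performs a shuffle, whereas coalesce avoids a full shuffle when merely reducing the partition count (unless shuffle = true is requested).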
 
 
 
 
