Spark Shuffle operations - official website translation

Reference (official website): http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations

Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.


Background

To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.

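To make this concrete, here is a minimal sketch of the reduceByKey case described above, assuming a local Spark installation; the application name, sample data, and variable names are invented for this illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal local setup; "shuffle-demo" and local[2] are illustrative choices.
    val conf = new SparkConf().setAppName("shuffle-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Key/value pairs spread over 4 partitions; values for the same key
    // may sit on different partitions (or different machines on a cluster).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2), ("b", 3), ("a", 5)), numSlices = 4)

    // reduceByKey must co-locate all values for each key, which triggers a shuffle.
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)   // e.g. (a,8), (b,4)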

In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.


Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use one of the options below (illustrated in the sketch after the list):


  • mapPartitions to sort each partition using, for example, .sorted

  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning

  • sortBy to make a globally ordered RDD

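A minimal sketch of the three options above, reusing the pairs RDD from the earlier example (the variable names are illustrative):

    import org.apache.spark.HashPartitioner

    // 1. mapPartitions: sort the elements of each partition locally, e.g. with .sorted
    val sortedWithinPartitions = pairs.mapPartitions(iter => iter.toSeq.sorted.iterator)

    // 2. repartitionAndSortWithinPartitions: repartition and sort by key in one pass
    val repartitionedAndSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

    // 3. sortBy: produce a globally ordered RDD
    val globallySorted = pairs.sortBy(_._1)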

Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

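For reference, a hedged sketch showing one operation from each family named above; the RDDs and values are invented for this illustration and reuse the SparkContext from the first sketch.

    // Two small pair RDDs for demonstration.
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    // Repartition operations
    val spread    = left.repartition(8)                // always shuffles
    val collapsed = left.coalesce(1, shuffle = true)   // shuffles only when shuffle = true

    // 'ByKey operations
    val grouped = left.groupByKey()
    val reduced = left.reduceByKey(_ + _)

    // Join operations
    val joined    = left.join(right)
    val cogrouped = left.cogroup(right)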

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

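One way to see where this map/reduce task structure appears (an illustration, not something the guide prescribes) is to print the lineage of the counts RDD from the first sketch; the exact output format varies by Spark version.

    // The lineage marks the ShuffledRDD that reduceByKey introduces,
    // i.e. the boundary between the map-side and reduce-side task sets.
    println(counts.toDebugString)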

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.


Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and ‘ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

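As one concrete instance of the operations mentioned above, here is a minimal aggregateByKey sketch over the pairs RDD from the first example; the zero value and the two functions are invented for this illustration.

    // aggregateByKey combines values per key on the map side before the shuffle;
    // here it computes a running (sum, count) for each key.
    val sumAndCount = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),        // fold a value into the per-key accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)       // merge accumulators from different partitions
    )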

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.

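A minimal sketch of setting that temporary directory when building the context; the path shown is only a placeholder, and in cluster deployments the cluster manager may override this setting.

    import org.apache.spark.SparkConf

    // spark.local.dir controls where shuffle intermediate files are written.
    // "/mnt/spark-tmp" is a placeholder path, not a recommendation.
    val localDirConf = new SparkConf()
      .setAppName("shuffle-demo")
      .set("spark.local.dir", "/mnt/spark-tmp")
    // Pass localDirConf to the SparkContext constructor when creating the context.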

Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.

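For example, a few shuffle-related parameters can be set on a SparkConf before creating the context; the exact defaults and semantics of the keys below should be checked against the ‘Shuffle Behavior’ section of the configuration guide for your Spark version.

    // Illustrative shuffle tuning; verify each key and value against the
    // Spark Configuration Guide before relying on it.
    val tunedConf = new SparkConf()
      .set("spark.shuffle.compress", "true")         // compress map output files
      .set("spark.shuffle.file.buffer", "64k")       // per-writer in-memory buffer size
      .set("spark.reducer.maxSizeInFlight", "96m")   // map output fetched concurrently per reduce task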

Reposted from: https://blog.csdn.net/soul_code/article/details/78069982


Origin: blog.csdn.net/liweihope/article/details/92072371