Spark Kernel Analysis: Spark Shuffle (6)

1. Spark Shuffle process

1.1 Introduction to the Shuffle process of MapReduce

The literal meaning of shuffle is to mix things up, like shuffling a deck of cards: it turns a set of data with some regularity into a disordered set, and the more random the better. Shuffle in MapReduce is more like the reverse of shuffling cards: it tries to turn a set of irregular data into data organized by certain rules.
Why does the MapReduce computing model need a Shuffle process? The MapReduce model has two important stages: Map, the mapping stage, which filters and distributes data; and Reduce, the reduction stage, which aggregates and merges data. Reduce's input comes from Map's output, and Reduce obtains that data through Shuffle.
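As a concrete illustration, here is a minimal Spark word count in Scala (a sketch with a hypothetical input path): flatMap and map form the map side, and reduceByKey is exactly where a shuffle happens between the two stages.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")        // hypothetical input path
      .flatMap(_.split("\\s+"))                  // map side: split lines into words
      .map(word => (word, 1))                    // emit <word, 1> pairs
      .reduceByKey(_ + _)                        // shuffle, then the reduce side sums per key

    counts.take(10).foreach(println)
    sc.stop()
  }
}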
The entire process from Map output to Reduce input can be broadly called Shuffle. Shuffle spans the Map side and the Reduce side. The Map side includes the Spill process, and the Reduce side includes the copy and sort processes, as shown in the figure:

[figure]

1.1.1 Spill process

The Spill process includes steps such as output, sorting, spilling to disk, and merging, as shown in the figure:
[figure]

1.1.1.1 Collect

Each Map task continuously outputs data in the form of <key, value> pairs into a ring data structure constructed in memory. The purpose of using a ring data structure is to use memory space more efficiently and place as much data as possible in memory.
This data structure is actually a byte array called Kvbuffer, as the name suggests. It holds not only the <key, value> data but also some index data; the area holding the index data is given the alias Kvmeta, and it is accessed through an IntBuffer view (using the platform's native byte order) wrapped over a region of Kvbuffer. The <key, value> data area and the index data area are two adjacent, non-overlapping regions of Kvbuffer, separated by a dividing point. The dividing point is not fixed; it is updated after each Spill. The initial dividing point is 0, the <key, value> data grows upward, and the index data grows downward, as shown in the figure:

[figure]
The storage pointer bufindex of Kvbuffer keeps growing upward. For example, the initial value of bufindex is 0. After an Int type key is written, bufindex grows to 4. After an Int type value is written, bufindex grows to 8.
The index describes where a <key, value> pair sits in Kvbuffer. It is a quadruple containing: the starting position of the value, the starting position of the key, the partition value, and the length of the value, occupying four Int slots. Kvindex, the storage pointer of Kvmeta, jumps down four slots each time and then fills in the quadruple one slot at a time. For example, if Kvindex starts at -4, then after the first <key, value> is written, position (Kvindex+0) stores the starting position of the value, (Kvindex+1) stores the starting position of the key, (Kvindex+2) stores the partition value, and (Kvindex+3) stores the length of the value; Kvindex then jumps to -8. After the second <key, value> and its index are written, Kvindex jumps to -12.
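A minimal Scala sketch of this quadruple layout (simplified; this is not Hadoop's actual MapTask code, and the ring buffer's wrap-around arithmetic is reduced to a modulo over a non-negative kvindex position):

object KvMetaSketch {
  // Offsets of the four fields within one metadata quadruple.
  val VALSTART = 0   // starting position of the value in Kvbuffer
  val KEYSTART = 1   // starting position of the key in Kvbuffer
  val PARTITION = 2  // partition id of the record
  val VALLEN = 3     // length of the value
  val NMETA = 4      // one quadruple = four Int slots

  // kvmeta is an IntBuffer view over part of the byte array Kvbuffer;
  // kvindex is a non-negative position in that IntBuffer.
  // Returns the position of the next quadruple, four slots further "down", wrapping around.
  def writeMeta(kvmeta: java.nio.IntBuffer, kvindex: Int,
                valStart: Int, keyStart: Int, partition: Int, valLen: Int): Int = {
    val cap = kvmeta.capacity()
    kvmeta.put((kvindex + VALSTART) % cap, valStart)
    kvmeta.put((kvindex + KEYSTART) % cap, keyStart)
    kvmeta.put((kvindex + PARTITION) % cap, partition)
    kvmeta.put((kvindex + VALLEN) % cap, valLen)
    ((kvindex - NMETA) % cap + cap) % cap
  }
}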
Although the size of Kvbuffer can be set through parameters, it is only so large in total, while the <key, value> data and the index keep growing, so sooner or later Kvbuffer will run out of room. What then? Flush the data from memory to disk, then keep writing new data into memory. The process of flushing the data in Kvbuffer to disk is called Spill, a fittingly clear name: when the memory is full, the data spills over to the much larger disk space.
The condition that triggers a Spill, that is, how full Kvbuffer should be before a Spill starts, deserves attention. If Kvbuffer is squeezed until there is no free space left before starting a Spill, the Map task has to wait for the Spill to finish and free up space before it can continue writing. If instead the Spill starts when Kvbuffer is only partly full, say 80%, the Map task can keep writing while the Spill runs; if the Spill is fast enough, the Map may never have to worry about free space. Weighing the two, the latter is generally chosen. The buffer size and the spill threshold are both configurable, as sketched below.
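A sketch of the relevant Hadoop settings (property names as in Hadoop 2.x; the values shown are the common defaults):

import org.apache.hadoop.conf.Configuration

// "mapreduce.task.io.sort.mb" sizes the in-memory sort buffer (Kvbuffer);
// "mapreduce.map.sort.spill.percent" is the usage fraction at which spilling starts.
val conf = new Configuration()
conf.setInt("mapreduce.task.io.sort.mb", 100)          // 100 MB sort buffer
conf.set("mapreduce.map.sort.spill.percent", "0.80")   // start spilling at 80% usage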
The important Spill work is carried out by the Spill thread. After receiving its "marching orders" from the Map task, the Spill thread gets to work; the work is called SortAndSpill, so it turns out it is not just Spill: before the Spill there is a somewhat controversial Sort.

1.1.1.2 Sort

First, the data in Kvbuffer is sorted in ascending order using the partition value as the primary key and the key as the secondary key. Only the index data is moved. The result is that the entries in Kvmeta are grouped by partition, and within the same partition they are ordered by key. A simplified sketch of this sort over the index entries follows.
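A simplified Scala sketch (real Hadoop compares serialized keys through a RawComparator; here keys are plain strings and the quadruple is modeled as a case class):

// Only the index entries are reordered; the raw <key, value> bytes stay in place.
case class MetaEntry(partition: Int, key: String, valStart: Int, valLen: Int)

def sortMeta(entries: Array[MetaEntry]): Array[MetaEntry] =
  entries.sortBy(e => (e.partition, e.key))   // primary key: partition, secondary key: key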

1.1.1.3 Spill

The Spill thread creates a disk file for this Spill: it rotates through all the local directories looking for one with enough space and, once found, creates a file there with a name like "spill12.out". The Spill thread then writes the <key, value> data into this file partition by partition, following the sorted Kvmeta: after all the data of one partition has been written, it moves on to the next, until every partition has been traversed. The data corresponding to one partition within the file is also called a segment.

All the partitions' data is placed in this single file. They are stored one after another, but how can the starting position of a given partition inside the file be found directly? The powerful index appears again. A triplet records the index of a partition's data within the file: the starting position, the raw data length, and the compressed data length; one partition corresponds to one triplet. This index information is kept in memory; if it no longer fits in memory, the remaining index information is written to a disk file: rotate through all local directories to find one with enough space and create a file there with a name like "spill12.out.index". This file stores not only the index data but also crc32 checksum data. (spill12.out.index is not necessarily created on disk: if it fits in memory, with a default budget of 1 MB, it stays there. Even when it is created on disk, it is not necessarily in the same directory as spill12.out.)

Each Spill process will generate at least one out file, and sometimes an index file will be generated. The number of Spills is also imprinted in the file name. The corresponding relationship between the index file and the data file is shown in the figure below:
[figure]
While the Spill thread is busy with its SortAndSpill work, the Map task does not stop; it keeps outputting data as before, still writing into Kvbuffer. This raises a question: should <key, value> data keep growing upward from bufindex and the index keep growing downward from Kvindex with the starting positions unchanged, or should another approach be found? If the starting positions stay unchanged, bufindex and Kvindex will soon meet, and after they meet, restarting or moving memory around would be troublesome and undesirable. Instead, Map takes the midpoint of the remaining free space in Kvbuffer as the new dividing point: the bufindex pointer moves to this dividing point, Kvindex moves to a position four Int slots (-16 bytes) below it, and the two can then grow harmoniously along their own tracks. When the Spill completes and space is freed, nothing needs to change; they simply keep going. The shift of the dividing point is shown in the figure below:

[figure]
The Map task always writes the output data to the disk. Even if the output data is very small and can fit in the memory, the data will be flushed to the disk in the end.

1.1.2 Merge

[figure]
If the Map task outputs a large amount of data, Spill may be performed several times, and a large number of out files and Index files will be generated and distributed on different disks. Finally, the merge process of merging these files makes its debut.
How does the Merge process know where the Spill files are? It scans all the local directories for the generated Spill files and stores their paths in an array. How does it know the Spill index information? Right, it also scans all local directories for the index files and stores the index information in a list. Here we hit another puzzling point: why not simply keep this information in memory during the earlier Spill phase, instead of adding this extra scanning step? Especially for the Spill index data: earlier it was written to disk once memory ran over the limit, and now it has to be read back from disk and loaded into memory again. The reason for this apparently redundant step is that by this point Kvbuffer, the big memory consumer, is no longer in use and can be reclaimed, so there is plenty of memory to hold this data. (For the wealthy with lots of memory, keeping it in memory to save these two IO steps is worth considering.)
The Merge process then creates a file named file.out and a file named file.out.index to store the final output and the final index.
Merge produces output one partition at a time. For a given partition, all of the index information for that partition is looked up in the index list, and each corresponding segment is added to a segment list; that is, each partition has a segment list recording, for every Spill file, the file name, starting position, length, and so on of that partition's data.
Then all the segments of this partition are merged, with the goal of producing a single segment. When a partition has many segments, they are merged in batches: the first batch is taken from the segment list and placed into a min-heap keyed by key; the smallest <key, value> is repeatedly taken from the heap and written to a temporary file, so that this batch of segments is merged into one temporary segment, which is added back to the segment list; then the second batch is taken from the list, merged, output to another temporary segment, and added back; this repeats until the remaining segments form a single batch, which is output to the final file. A simplified sketch of this min-heap merge follows.
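A simplified Scala sketch of the k-way merge (each segment is modeled as an iterator of (key, value) pairs already sorted by key; this is an illustration of the idea, not Hadoop's Merger code):

import scala.collection.mutable

def mergeSegments(segments: Seq[Iterator[(String, String)]]): Iterator[(String, String)] = {
  // Scala's PriorityQueue is a max-heap, so reverse the key ordering to get a min-heap.
  implicit val byKey: Ordering[(String, String, Iterator[(String, String)])] =
    Ordering.by[(String, String, Iterator[(String, String)]), String](_._1).reverse
  val heap = mutable.PriorityQueue.empty[(String, String, Iterator[(String, String)])]

  // Seed the heap with the first element of every segment.
  segments.foreach { it => if (it.hasNext) { val (k, v) = it.next(); heap.enqueue((k, v, it)) } }

  new Iterator[(String, String)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (String, String) = {
      val (k, v, it) = heap.dequeue()                  // globally smallest key
      if (it.hasNext) { val (nk, nv) = it.next(); heap.enqueue((nk, nv, it)) }
      (k, v)
    }
  }
}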
The final index data is still output to the Index file.
The Shuffle process on the Map side ends here.

1.1.3 Copy

The Reduce task pulls the data it needs from each Map task over HTTP. Each node runs a resident HTTP server, one of whose services is to respond to Reduce requests for Map data. When an HTTP request for map output arrives, the HTTP server reads the portion of the corresponding Map output file that belongs to that Reduce and streams it to the Reduce over the network.
When the Reduce task pulls the data of a particular Map, it writes the data straight into memory if it fits. Reduce has to pull data from every Map, so each Map's data occupies its own piece of memory. When the memory occupied by this Map data reaches a certain level, an in-memory merge is started and the merged in-memory data is written out to a file on disk.
If a Map's data does not fit in memory, it is written directly to disk: a file is created in a local directory, and the data is read from the HTTP stream and written to disk using a 64 KB buffer. Pulling one Map's data creates one file; when the number of files reaches a certain threshold, a disk file merge is started and these files are merged into a single file.
Some Maps produce little data, which can stay in memory; others produce a lot, which has to go to disk. So of the data pulled by the Reduce task, some ends up in memory and some on disk, and in the end a global merge is performed over all of it.

1.1.4 Merge Sort

The Merge process used here is the same as the Merge process used on the Map side. The output data of Map is already in order, and Merge performs a merge sort. The so-called sort process on the Reduce side is this merge process. Generally, Reduce copies and sorts at the same time, that is, the copy and sort stages overlap rather than being completely separated.
The Shuffle process on the Reduce side ends.

1.2 Introduction to the HashShuffle process

Spark has a richer set of task types. Data flowing between some tasks does not need to go through Shuffle, but some tasks still exchange data through Shuffle, such as groupByKey over a wide dependency.
A Spark Map task whose output must be shuffled creates a bucket for each Reduce. Each record produced by the Map is assigned a bucketId by the configured partitioner and is then placed into the corresponding bucket. Because each Map's output may contain data needed by every Reduce, each Map creates R buckets (R being the number of reducers), and M Maps create M*R buckets in total.
A bucket created by a Map actually corresponds to a file on disk; writing a Map result into a bucket really means writing it into that disk file. This file, also called a blockFile, is created by the Disk Block Manager in a subdirectory of the local directory chosen by hashing the file name. Each Map creates R disk files on its node for its output, and the Map results are written directly to these disk files through Fast Buffered OutputStreams built on 100 KB memory buffers. One problem with this scheme is that it produces too many Shuffle files.

[figure]
1) Each Mapper creates the same number of buckets as the Reducers. The bucket is actually a buffer with a size of spark.shuffle.file.buffer.kb (default 32KB).
2) The results generated by Mapper will be filled into each bucket according to the set partition algorithm, and then written to the disk file.
3) Reducer finds the corresponding file and reads data from the remote or local block manager.
To address the problem of too many files generated by the Shuffle process above, Spark has an improved Shuffle process: consolidation Shuffle, whose goal is to significantly reduce the number of Shuffle files. In consolidation Shuffle, each bucket no longer corresponds to a whole file but to a segment within a file. The first time one of the job's maps runs on a node, it creates an output file per reduce bucket, and these files are organized into a ShuffleFileGroup. When that map finishes, the ShuffleFileGroup is released for reuse; when another map later runs on the same node, it does not create new bucket files but instead takes the files already created in the released ShuffleFileGroup and appends a new segment to each. If a new map starts on the node while the current map is still running and its ShuffleFileGroup has not yet been released, the new map cannot reuse that group; it has to create new bucket files and form a new ShuffleFileGroup for its output.

[figure]
For example, suppose a job has 3 Maps and 2 reducers. (1) If the cluster has 3 nodes with free slots, each with one idle core, the 3 Maps are scheduled onto these 3 nodes; each Map creates 2 Shuffle files, for a total of 6 Shuffle files. (2) If the cluster has 2 nodes with free slots, each with one idle core, 2 of the Maps are scheduled onto these 2 nodes first, each creating 2 Shuffle files; one of the nodes then runs the third Map after its first Map finishes, and this Map does not create new Shuffle files but appends its output to the Shuffle files created by the previous Map, for a total of 4 Shuffle files. (3) If the cluster has 2 nodes with free slots, one with 2 idle cores and one with 1 idle core, then one node schedules 2 Maps and the other schedules 1 Map. On the node running 2 Maps concurrently, after one Map creates its Shuffle files, the other Map still creates new Shuffle files, because the first Map is still writing and the ShuffleFileGroup it created has not been released; in total 6 Shuffle files are created. The file-count arithmetic is sketched below.
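A quick Scala sketch of the arithmetic behind these counts (this is just the M*R versus concurrent-slots*R calculation, not Spark code):

def basicHashShuffleFiles(numMaps: Int, numReducers: Int): Int =
  numMaps * numReducers                      // every map writes one file per reducer

def consolidatedShuffleFiles(concurrentMapSlots: Int, numReducers: Int): Int =
  concurrentMapSlots * numReducers           // later maps append segments to existing files

println(basicHashShuffleFiles(3, 2))         // scenario (1): 3 maps, 2 reducers -> 6 files
println(consolidatedShuffleFiles(2, 2))      // scenario (2): 2 concurrent slots  -> 4 files
println(consolidatedShuffleFiles(3, 2))      // scenario (3): 3 concurrent slots  -> 6 files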
Advantages
1) Fast - no need for sorting, no need to maintain hash table
2) No need for extra space for sorting
3) No need for extra IO - data only needs to be written to the disk once and read only once
Disadvantages
1) When the number of partitions is large, a large number of files (cores * R) are produced and performance starts to degrade.
2) Writing a large number of files turns file system access into random writes, which is about 100 times slower than sequential writes.
3) A large amount of cache space is occupied.
For Reduce to pull Map output data, Spark provides two different data-fetching frameworks: fetching data over a socket connection, and fetching data with the Netty framework.
The Executor on each node creates a BlockManager, which in turn creates a BlockManagerWorker to respond to requests. When a GET_BLOCK request from Reduce arrives, the local file is read and the data for that blockId is returned to Reduce. If the Netty framework is used, the BlockManager creates a ShuffleSender dedicated to sending Shuffle data.
Not all data is read through the network. For the Map data on this node, Reduce directly reads it from the disk without going through the network framework.
How does Reduce store the data it has pulled? Spark's Map output is not sorted, and the data fetched during Spark Shuffle will not be sorted either. Spark takes the view that sorting during Shuffle is not essential: not every kind of data Reduce needs has to be sorted, and forcing a sort only adds to Shuffle's burden. The data pulled by Reduce is placed in a HashMap that also holds <key, value> pairs, where the key is the key output by Map and the value collects all of the values that Map output for that key. Spark processes each <key, value> pair fetched by Shuffle one by one, inserting it into or updating it in the HashMap, and the HashMap lives entirely in memory.
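A simplified Scala sketch of this reduce-side grouping (an illustration of the idea only, not Spark's actual aggregator classes):

import scala.collection.mutable

def aggregate(fetched: Iterator[(String, Int)]): mutable.HashMap[String, mutable.ArrayBuffer[Int]] = {
  val map = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
  fetched.foreach { case (k, v) =>
    // Insert the key if unseen, otherwise append to the existing value list.
    map.getOrElseUpdate(k, mutable.ArrayBuffer.empty[Int]) += v
  }
  map
}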
All the data fetched by Shuffle is kept in memory. For Shuffle data that is small or that has already been combined on the Map side, the memory footprint is not too large; but for operations like groupByKey, where Reduce must collect all the values of a key into one in-memory array, a large data volume requires a lot of memory.
When memory runs out, the choice is either to fail or to fall back on the old trick of moving in-memory data to disk. Spark recognized this weakness when handling data far larger than the available memory and introduced a solution with external sorting: the shuffled data is first kept in memory; once more than 1000 <key, value> pairs are stored and more than 70% of the buffer is used, the buffer is doubled if the node still has enough available memory; if not, the in-memory <key, value> pairs are sorted and written to a disk file. Finally, the data remaining in the memory buffer is sorted and, together with those disk files, organized into a min-heap from which the smallest entry is read each time, much like the merge process in MapReduce.
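A small Scala sketch of that grow-or-spill decision (thresholds taken from the text above; the free-memory check is a hypothetical stand-in for Spark's shuffle memory manager):

def nodeHasFreeMemory(wantedBytes: Long): Boolean =
  Runtime.getRuntime.freeMemory() > 2 * wantedBytes        // hypothetical availability check

def nextAction(numPairs: Long, usedBytes: Long, capacityBytes: Long): String =
  if (numPairs > 1000 && usedBytes > 0.7 * capacityBytes) {
    if (nodeHasFreeMemory(capacityBytes)) "double the in-memory buffer"
    else "sort the in-memory pairs and spill them to a disk file"
  } else "keep inserting into memory"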

1.3 Introduction to the SortShuffle process

Starting from Spark 1.2.0, the default is sort shuffle (spark.shuffle.manager = sort), whose implementation logic is similar to Hadoop MapReduce. Whereas Hash Shuffle generates one file per reducer, Sort Shuffle generates a single file that is sorted by reducer id and indexable, so reading the data for a specific reducer only requires fetching the location of the relevant block in the file and seeking to it. But when the number of reducers is relatively small, Hash Shuffle is clearly faster than Sort Shuffle, so Sort Shuffle has a fallback plan: when the number of reducers is below spark.shuffle.sort.bypassMergeThreshold (200 by default), the fallback hashes the data into separate files and then merges those files into one. The concrete implementation is BypassMergeSortShuffleWriter.
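A sketch of the relevant Spark 1.x settings (shown with their default values):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")                    // default shuffle manager since 1.2.0
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")   // use the hash-style bypass writer
                                                           // when the reducer count is below this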

[figure]
Sorting happens on the map side, and Timsort[1] is applied on the reduce side for merging. Whether the map side is allowed to spill is controlled by spark.shuffle.spill, which defaults to true. If it is set to false and there is not enough memory to hold the map output, an OOM error occurs, so use that setting with caution.
The memory available for storing map output is "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, which by default is "JVM Heap Size" * 0.2 * 0.8 = "JVM Heap Size" * 0.16. If multiple tasks run in the same executor (spark.executor.cores / spark.task.cpus greater than 1), each map task gets "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction / (spark.executor.cores / spark.task.cpus); with the default of 2 cores that comes to 0.08 * "JVM Heap Size".
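A small Scala sketch of this arithmetic (parameter defaults as stated above):

def shuffleMemoryPerTask(jvmHeapBytes: Long,
                         memoryFraction: Double = 0.2,   // spark.shuffle.memoryFraction
                         safetyFraction: Double = 0.8,   // spark.shuffle.safetyFraction
                         executorCores: Int = 2,         // spark.executor.cores
                         taskCpus: Int = 1): Long = {    // spark.task.cpus
  val totalShuffleMemory = jvmHeapBytes * memoryFraction * safetyFraction
  (totalShuffleMemory / (executorCores / taskCpus)).toLong
}

// e.g. a 4 GB heap with the defaults gives 4 GB * 0.16 / 2 = about 327 MB per map task
println(shuffleMemoryPerTask(4L * 1024 * 1024 * 1024))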
Spark stores map output data in an AppendOnlyMap, using the open-source hash function MurmurHash3 and quadratic probing to keep keys and values in the same array. This storage scheme lets Spark perform combine (map-side aggregation). If spill is enabled, the data is sorted before being spilled.
Compared with hash shuffle, each Mapper in sort shuffle only generates one data file and one index file. The data in the data file is sorted according to the reducer, but the data belonging to the same reducer is not sorted. The data generated by Mapper is first put into the AppendOnlyMap data structure. If there is not enough memory, the data will be spilled to disk and finally merged into a file.
Compared with Hash shuffle, the number of shuffle files is reduced and memory usage is more controllable. But sorting affects speed.
Advantages
1) Map creates fewer files
2) A small number of random IO operations, most of which are sequential reading and writing
Disadvantages
1) It is slower than Hash Shuffle, and you need to set the appropriate value through spark.shuffle.sort.bypassMergeThreshold.
2) If you use an SSD disk to store shuffle data, Hash Shuffle may be more suitable.

1.4 Introduction to the TungstenShuffle process

Tungsten-sort is not a brand-new shuffle scheme. It builds on the existing Sort Based Shuffle process for specific scenarios and makes major optimizations in memory, CPU, and cache usage. The efficiency it brings also limits the scenarios in which it can be used; if Tungsten-sort finds that it cannot handle a case, it automatically falls back to Sort Based Shuffle. Tungsten is the metal used in light-bulb filaments; Project Tungsten is a plan proposed by Databricks to optimize Spark's memory and CPU usage. In the early days of the project Spark SQL seemed to benefit the most, but some RDD APIs and Shuffle benefit from it as well.
The optimizations of Tungsten-sort lie mainly in three areas:
1) Sorting is performed directly on serialized binary data rather than on Java objects, which reduces memory overhead and GC overhead.
2) A cache-efficient sorter is provided that uses 8-byte pointers, turning the sort into a sort over an array of pointers.
3) The merge step of spilled files can be completed without deserialization.
These optimizations led to the introduction of a new memory management model similar to OS pages. The corresponding data structure is MemoryBlock, which supports both off-heap and on-heap modes. To locate a record within these MemoryBlocks, the concept of a Pointer is introduced.
Recall PartitionedAppendOnlyMap, the object that stores data in Sort Based Shuffle; it is an ordinary object on the JVM heap. In Tungsten-sort it is replaced with something resembling an operating-system memory page. If a new Page cannot be allocated, a spill operation, that is, a write to disk, is performed; the specific trigger conditions are similar to Sort Based Shuffle.
Spark enables Sort Based Shuffle by default. To enable Tungsten-sort, set spark.shuffle.manager=tungsten-sort. The corresponding implementation class is org.apache.spark.shuffle.unsafe.UnsafeShuffleManager; the "unsafe" in the name comes from its extensive use of the JDK's sun.misc.Unsafe API.
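For example, a minimal configuration sketch (a Spark 1.4/1.5 era setting):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "tungsten-sort")
// Internally this maps to org.apache.spark.shuffle.unsafe.UnsafeShuffleManager.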
The new shuffle approach is used only when all of the following conditions hold:
1) The shuffle dependency has no aggregation, and its output does not need to be sorted.
2) The shuffle serializer is KryoSerializer or one of Spark SQL's custom serializers.
3) The number of shuffle output partitions cannot exceed 16,777,216.
4) During serialization, no single record can be larger than 128 MB.
As you can see, the conditions for using it are quite strict.
Where do these restrictions come from? Look at the following code for the page size:

this.pageSizeBytes = (int) Math.min(PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES, shuffleMemoryManager.pageSizeBytes());

This ensures that the page size does not exceed the value of PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES, which is defined as 128M.
As for the specific design reasons for this limitation, we need to carefully analyze Tungsten’s memory model:

[figure]
This picture is a logical diagram of on-heap memory addressing, in which the #Page part takes 13 bits and the Offset takes 51 bits; you will notice that 2^51 >> 128M. During Shuffle, however, the 51-bit offset is compressed down to 27 bits, as follows:
[24 bit partition number][13 bit memory page number][27 bit offset in page]
The 24 bits are reserved for the partition number, which is used for the later sort. Several of the limitations above are in fact imposed by this pointer:
First, the partition limit: the number 16,777,216 mentioned earlier is the largest partition number representable in 24 bits.
Second, the page number: with 13 bits, a task can have at most 2^13 = 8192 pages.
Third, the offset: 27 bits can express at most 2^27 = 128M. The memory a single task can manage is bounded by this pointer at 2^13 * 128M, roughly 1 TB.
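A small Scala sketch of packing and unpacking such a pointer (modeled on the layout above, not copied from Spark's PackedRecordPointer):

object PointerSketch {
  // Layout: [24-bit partition number][13-bit page number][27-bit offset in page]
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | (offsetInPage & ((1L << 27) - 1))

  def partitionId(pointer: Long): Int   = ((pointer >>> 40) & 0xFFFFFF).toInt
  def pageNumber(pointer: Long): Int    = ((pointer >>> 27) & 0x1FFF).toInt
  def offsetInPage(pointer: Long): Long = pointer & ((1L << 27) - 1)
}

// Sorting the raw Long values orders records by partition number first,
// because the partition bits sit in the most significant positions.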
With this pointer, we can locate and manage memory in off-heap or on-heap. This model is still very beautiful, and the memory management is also very efficient. I remember that it was very difficult to estimate the memory of PartitionedAppendOnlyMap before, but with the current memory management mechanism, it is very fast and accurate.
Regarding the first limitation: the sort in the subsequent Shuffle Write only sorts on the leading 24-bit partition number; the key itself is not encoded into the pointer, so there is no way to order records by key at the same time. And because the whole process strives to avoid deserialization, the data stays serialized, so aggregation cannot be performed either.
Shuffle Write
The core class is org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter. Records are written one by one into the serialized output stream serOutputStream via UnsafeShuffleExternalSorter.insertRecordIntoSorter. The memory consumed here is serBuffer = new MyByteArrayOutputStream(1024 * 1024), 1 MB by default. Where Sort Based Shuffle has ExternalSorter, the counterpart in Tungsten Sort is UnsafeShuffleExternalSorter: once a record has been serialized, it is handed to the sorter through sorter.insertRecord. The sorter is responsible for requesting pages, releasing pages, and deciding whether to spill; all of this is done inside that class, and its code structure is essentially the same as Sort Based Shuffle.

[figure]
(It is also worth noting that the exception bug shown in this picture, caused by performing a spill while checking memory availability, was fixed in version 1.5.1; ignore that path.)
Whether memory is sufficient is still decided by the shuffleMemoryManager; that is, the total Page memory requested by all tasks' shuffles cannot exceed the following value:
ExecutorHeapMemory * 0.2 * 0.8
The above number can be changed through the following two configurations:
spark.shuffle.memoryFraction=0.2
spark.shuffle.safetyFraction=0.8
UnsafeShuffleExternalSorter is responsible for requesting memory and generating the final logical address of each record, namely the Pointer described earlier.
The record then flows on to UnsafeShuffleInMemorySorter, which maintains an array of pointers:
private long[] pointerArray;
The initial size of the array is 4096 entries; whenever it runs out of room, it is doubled. For one million records the array is only about 8 MB, so it is really quite small. Once a spill occurs, the UnsafeShuffleInMemorySorter is set to null and reclaimed.
Now back to spill. The logic is actually very simple: UnsafeShuffleInMemorySorter returns an iterator whose elements are pointers; the real record is then located from each pointer and written to disk. Because records are serialized the moment they enter UnsafeShuffleExternalSorter, this step is purely writing out byte arrays. The resulting layout is the same as Sort Based Shuffle: the data of different partitions within one file is represented by fileSegments, and the corresponding information is stored in an index file.
Writing the files also requires a buffer:
spark.shuffle.file.buffer = 32k
In addition, moving the data from memory into the DiskWriter goes through an intermediate buffer:
final byte[] writeBuffer = new byte[DISK_WRITE_BUFFER_SIZE = 1024 * 1024];
This transfer happens entirely in memory, so it is very fast.
Before the end of the task, we need to do a mergeSpills operation and then form a shuffle file. This is actually quite complicated.
If spark.shuffle.unsafe.fastMergeEnabled=true is set, and either spark.shuffle.compress is not enabled or the compression codec is LZFCompressionCodec, the spill files can be merged very efficiently via transferTo. In any case, no merge path needs to deserialize the data.
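For example, a configuration sketch under which the fast transferTo merge path can be taken (settings as named above; the short name "lzf" selects LZFCompressionCodec):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.unsafe.fastMergeEnabled", "true")
  .set("spark.shuffle.compress", "true")
  .set("spark.io.compression.codec", "lzf")   // LZF allows compressed streams to be concatenated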
Shuffle Read
Shuffle Read completely reuses HashShuffleReader, see Sort-Based Shuffle for details.

1.5 Comparison of MapReduce and Spark processes

The comparison of the Shuffle process of MapReduce and Spark is as follows:

[figure]

Source: blog.csdn.net/qq_44696532/article/details/135391939