Let's talk about Spark's shuffle in plain language

Speaking of Spark's shuffle, we have to mention Hadoop's shuffle first, but I won't walk through that process here. If you're interested, you can read the blog post on MapReduce principles that I posted earlier, which explains how Hadoop's MR shuffle works: https://blog.csdn.net/dudadudadd/article/details/111593379

Broadly speaking, Hadoop still ships essentially the same shuffle design today, although it does let you customize how the shuffle results are grouped.

The default way of deciding where shuffled data goes is hashing. In the beginning, Spark fell in line behind big brother Hadoop: its shuffle imitated the hash idea, spilling records to files according to the hash value of the key.
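
To make the hash idea concrete, here is a minimal sketch in Scala (not Spark's actual source, just an illustration of the principle): a record lands in the partition given by the non-negative hash of its key.

```scala
// Minimal sketch of hash partitioning: the partition is the non-negative
// remainder of the key's hash code divided by the partition count.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw  // keep the result non-negative
}

// With 3 reduce partitions, the same key always maps to the same partition.
val p = hashPartition("user-42", 3)
```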

Knock on the blackboard and pay attention! I need to remind you that the first generation of Spark's shuffle only imitated Hadoop's ideas, it did not copy them!
Some people think the two are exactly the same, but there are differences. Hadoop spills to disk from an in-memory ring buffer (for details, see the MR principles post linked above), while the first generation of Spark's shuffle had no such refinements as buffering or sorting: it spilled the data directly according to the hash of the key.

But after using it for a while, Spark felt something was off: big brother was trading physical space for computed results and marching further and further down that road without ever looking back. Hadoop produces one disk-based result per job, so it can afford that, but Spark chains calls together, one transformation after another; if it squandered physical space the same way and the disk gave out, that would be no joke.

Here comes the first knowledge point of this post: the first generation of Spark's shuffle, hash shuffle. First, a picture of the first generation:

[Figure: first-generation hash shuffle]

You can take in the picture above at a glance. In the first generation, a job spawns multiple map tasks, and by default the number of map tasks is determined by the number of input splits. For example, if a large file is stored as five splits, five map tasks will be launched by default to pull the data. The number of map tasks can be changed, but that goes beyond the shuffle itself, so I won't cover it here; look it up if you're interested.

Back to the subject. In the first generation, the spill files of different map tasks are completely independent. That is to say, however many reduce tasks there are, each map task opens that many spill files, and the number of reduce tasks is determined by the number of partitions. This also means that the more partitions we set, the more physical space gets consumed. For example, with 3 map tasks and 3 reduce tasks there will be 9 spill files. This is the first generation of Spark's shuffle: a huge number of spill files, huge output-stream and memory consumption, and one misstep away from an OOM.
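
A back-of-the-envelope sketch (not Spark code) makes the problem obvious: every map task writes one spill file per reduce partition, so the file count is the product of the two.

```scala
// First-generation hash shuffle: one spill file per (map task, reduce partition) pair.
def hashShuffleFileCount(numMapTasks: Int, numReducePartitions: Int): Int =
  numMapTasks * numReducePartitions

hashShuffleFileCount(3, 3)        // 9 files, matching the example above
hashShuffleFileCount(1000, 1000)  // 1,000,000 files on a large job
```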

At this point Spark got a little nervous and said: big brother, you've really set me up. You can afford to give up that much space just to produce one result and be done, but I'm different. I chain calls and run big multi-step algorithms; if I waste physical space that extravagantly, what happens the day it runs out?

So at this point Spark began strengthening its shuffle mechanism, and what I'll call the shuffle 2.0 era arrived. In 2.0 the general direction was still hash shuffle, but the way files are spilled was optimized: spill files are no longer isolated per map task. Instead, taking the executor as the unit, the data for the same partition spilled by all the map tasks in an executor goes into a single shared file.
[Figure: consolidated hash shuffle]
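
Here is a rough sketch of the consolidated file count, following the description above (one shared file per partition per executor). As a caveat, in the real implementation the sharing unit was the executor's task slots, so the true count could also be cores-per-executor times partitions; treat this as an illustration rather than Spark's exact formula.

```scala
// Consolidated hash shuffle (as described above): spill files are shared
// across all map tasks in an executor, one file per reduce partition.
def consolidatedFileCount(numExecutors: Int, numReducePartitions: Int): Int =
  numExecutors * numReducePartitions

consolidatedFileCount(2, 3)  // 6 files for 2 executors and 3 partitions,
                             // no matter how many map tasks actually ran
```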
However, even this optimized hash shuffle slowly failed to keep up with demand, because the output is written out directly, with no sorting. That can't really be called a defect of Spark itself; it follows naturally from the design. Spark's RDDs are quite flexible and go well beyond MR's key-value data model, so when Spark was first designed, sorting simply wasn't a concern, and the sheer number of spill files was still a real burden for Spark.

So, as these requirements became more and more pressing, Spark promptly upgraded its shuffle again, and the 3.0 era of Spark's shuffle appeared, called sort shuffle.

As soon as sort shuffle was born, well, good grief, it was impressive. Even the per-executor spill files were no longer needed. Each map task writes its output to a single file, and before the data is spilled into that file it is sorted by partition id. The partition id here is simply the identifier of each Spark partition. The approach is a bit like Hadoop's grouping idea, except that Spark uses it for partitioning, and the data within a partition is by default ordered by hash value. As for where each partition's range lies, and what the offsets between one partition and the next are, all of that is recorded in a separate index file. In other words, sort shuffle produces only two files per map task. It can be said to make up for all the drawbacks of hash shuffle, but the index file still has to be written and read, so performance is still not at its peak.
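
To make the layout concrete, here is a toy model in Scala (nothing like Spark's real writer, just the idea described above): one data "file" sorted by partition id plus an index of partition offsets, so a reducer can jump straight to its own slice.

```scala
// Toy model of a sort-shuffle output: records sorted by partition id,
// plus an index of where each partition starts.
case class Record(partitionId: Int, key: String, value: Int)

def buildSortShuffleOutput(records: Seq[Record],
                           numPartitions: Int): (Seq[Record], Seq[Int]) = {
  val sorted = records.sortBy(_.partitionId)  // order by partition id only
  val counts = (0 until numPartitions).map(p => sorted.count(_.partitionId == p))
  // numPartitions + 1 entries: partition p occupies offsets(p) until offsets(p + 1)
  val offsets = counts.scanLeft(0)(_ + _)
  (sorted, offsets)
}
```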

And it doesn't end there; it's fair to say Spark didn't become this popular for no reason.

There are three important versions of Spark:

The first is version 1.5.0. In this version Spark upgraded sort shuffle again: instead of sorting deserialized objects, it sorts the serialized binary data directly, using compact pointers that carry the partition information. This flavor of shuffle is called Unsafe shuffle. Because the sort works on raw bytes and pointers, it doesn't spend extra CPU and memory deserializing records just to order them, so it is noticeably better in resource usage.
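
Here is a simplified take on the "sort the pointers, not the objects" idea (the field widths are illustrative, not Spark's exact layout): each record is represented by a single Long whose high bits hold the partition id and whose low bits hold the record's offset, so sorting the Longs orders records by partition without touching the serialized bytes at all.

```scala
// Pack a partition id and a record offset into one 64-bit pointer;
// sorting the packed values groups records by partition id.
def pack(partitionId: Int, recordOffset: Long): Long =
  (partitionId.toLong << 40) | (recordOffset & ((1L << 40) - 1))

def partitionOf(packed: Long): Int = (packed >>> 40).toInt

val pointers    = Array(pack(2, 1024L), pack(0, 4096L), pack(1, 512L))
val byPartition = pointers.sorted  // now ordered by partition id: 0, 1, 2
```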

The second is version 1.6.0. Unsafe shuffle has fairly strict requirements on when it can be used, so Spark merged the ordinary sort shuffle and Unsafe shuffle into one, and Spark automatically decides which path to take when a task runs.
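
As a hedged sketch of the kind of check the unified manager performs when deciding whether the serialized ("unsafe") path is usable: the exact rules live in Spark's SortShuffleManager and vary by version, so take these conditions as approximate.

```scala
// Roughly: the serialized path needs a serializer that allows reordering raw
// bytes, no map-side aggregation, and a partition count that fits the pointer.
def canUseSerializedPath(serializerSupportsRelocation: Boolean,
                         needsMapSideAggregation: Boolean,
                         numPartitions: Int): Boolean =
  serializerSupportsRelocation &&  // bytes can be reordered without deserializing
    !needsMapSideAggregation &&    // no combine step that needs real objects
    numPartitions <= (1 << 24)     // partition id must fit in the packed pointer
```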

The third is version 2.0.0. In this version Spark eliminated hash shuffle completely, deleting it from the underlying code. From then on, Spark's shuffle has been sort shuffle only, and that remains true to this day.

Original post: https://blog.csdn.net/dudadudadd/article/details/114186779