Spark partitioners: HashPartitioner, RangePartitioner, and the principle of distributed sorting

Most Spark operators that shuffle data use HashPartitioner as the default partitioner. For each key, HashPartitioner computes key.hashCode % numPartitions (adjusted so the result is never negative; a null key goes to partition 0) and places the record in the partition with that index. When the hash codes of the keys are well spread, this distributes the data across partitions fairly evenly.
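The hash rule above can be sketched in plain Python, without Spark. This is only an illustration: java_string_hash, hash_partition and partition_by_hash are hypothetical helper names, and java_string_hash mimics Java's String.hashCode only for text in the Basic Multilingual Plane.

```python
from collections import defaultdict


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode (signed 32-bit) for BMP text."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like a Java int
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key_hash: int, num_partitions: int) -> int:
    """Partition index for a hash code.

    Python's % is already non-negative for a positive divisor. On the JVM
    the remainder can be negative, so it has to be shifted back into range.
    """
    return key_hash % num_partitions


def partition_by_hash(keys, num_partitions):
    """Group keys by the partition index their hash maps to."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash_partition(java_string_hash(key), num_partitions)].append(key)
    return dict(buckets)


buckets = partition_by_hash(["apple", "banana", "cherry", "apple"], 3)
```

The same key always lands in the same partition, which is what operators that group by key rely on. How evenly the buckets fill depends entirely on how the hash codes of the keys are spread.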
RangePartitioner:
It is the partitioner used by sorting operators such as sortByKey, sortBy, and orderBy. The partitioner first samples the keys of the input data to estimate their distribution, then derives range boundaries from the sample in the requested order, so that each partition covers one contiguous key range and the ranges hold roughly the same number of records.
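The sample-then-split idea can be sketched as follows. This is a simplified assumption, not Spark's implementation: Spark samples each input partition separately and weights the samples, while this sketch draws one uniform sample. The names range_bounds and range_partition are hypothetical.

```python
import bisect
import random


def range_bounds(keys, num_partitions, sample_size=100, seed=0):
    """Estimate up to num_partitions - 1 ascending boundaries from a sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    bounds = []
    for i in range(1, num_partitions):
        if not sample:
            break
        candidate = sample[min(i * len(sample) // num_partitions, len(sample) - 1)]
        # Skip duplicates so the boundaries stay strictly increasing.
        if not bounds or candidate > bounds[-1]:
            bounds.append(candidate)
    return bounds


def range_partition(key, bounds):
    """Index of the first boundary >= key; keys above every boundary go last."""
    return bisect.bisect_left(bounds, key)


keys = list(range(1000))
random.Random(1).shuffle(keys)
bounds = range_bounds(keys, num_partitions=4)
sizes = [0] * (len(bounds) + 1)
for k in keys:
    sizes[range_partition(k, bounds)] += 1
```

Because the boundaries come from a sample, the partition sizes in sizes are only approximately equal. A larger sample gives a better estimate of the key distribution at the cost of more work up front.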
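As a commonly used distributed sorting operator in Spark, sortByKey uses RangePartitioner. Splitting by range makes the partitions ordered relative to one another: for ascending order, every key in one partition is less than or equal to every key in the next. Each partition is then sorted internally, so reading the partitions in order yields globally sorted data.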
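A minimal sketch of this two-step principle, again in plain Python rather than Spark: the boundaries are passed in as a given (in Spark they would come from sampling), and distributed_sort is a hypothetical name for illustration.

```python
import bisect


def distributed_sort(keys, bounds):
    """Range-partition the keys, then sort each partition independently."""
    partitions = [[] for _ in range(len(bounds) + 1)]
    for key in keys:
        # Step 1: route each key to the partition that owns its range.
        partitions[bisect.bisect_left(bounds, key)].append(key)
    # Step 2: a local sort per partition; no partition needs data from another.
    return [sorted(p) for p in partitions]


parts = distributed_sort([42, 7, 19, 88, 3, 61, 25, 70], bounds=[20, 60])
# parts == [[3, 7, 19], [25, 42], [61, 70, 88]]

# Concatenating the partitions in index order gives the global order.
flat = [key for part in parts for key in part]
```

The local sorts can run in parallel on different machines, and no merge step is needed afterwards, because the range boundaries already guarantee the order between partitions.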

Origin blog.csdn.net/qq_39719415/article/details/107844410