order by, sort by, distribute by, and cluster by in Hive

In summary:
Broadly speaking, all four clauses handle sorting and grouping in Hive. However, they differ in how they launch the underlying MapReduce (MR) job.

In detail:
order by:

order by sorts all of the data globally, and it "wakes up" only one reducer to do the work. It is rather simple-minded: no matter how much data there is, it starts a single reducer to handle all of it. That is fine for a small amount of data, but once the data grows large, order by becomes very slow and may even "go on strike."
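The single-reducer behavior can be pictured with a small Python model. This is only a sketch of the idea, not how Hive is implemented; the function name order_by, the sample rows, and the table name in the comment are all made up for illustration.

```python
# Toy model of order by: every row from every mapper goes to ONE reducer.
# Roughly corresponds to a query such as:
#   SELECT * FROM scores ORDER BY score;   (hypothetical table and column)

def order_by(map_outputs, key):
    """Merge all mapper outputs into a single reducer and sort them globally."""
    # One list plays the role of the single reducer's input.
    single_reducer = [row for part in map_outputs for row in part]
    return sorted(single_reducer, key=key)

# Two hypothetical mappers, each emitting (name, score) rows.
map_outputs = [
    [("a", 3), ("b", 9)],
    [("c", 1), ("d", 7)],
]

result = order_by(map_outputs, key=lambda row: row[1])
print(result)  # [('c', 1), ('a', 3), ('d', 7), ('b', 9)]
```

Because every row passes through that one list, the sort cannot be spread across machines, which is why large inputs are a problem.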

sort by:

sort by performs a local sort. Where order by is lazy and simple-minded, sort by is the opposite: it is diligent and can also split the work into several copies of itself. sort by starts multiple reducers according to the size of the data, and it produces one sorted file for each reducer before entering the reduce phase. Each file is sorted internally, but the files taken together are not in global order. The advantage is better overall sorting efficiency.
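A minimal Python sketch of this local sort follows. It assumes rows are dealt to reducers round-robin, which is a simplification chosen for the example and not Hive's actual assignment rule; the name sort_by and the sample data are hypothetical.

```python
# Toy model of sort by: several reducers, each sorting only its own rows.
# Roughly corresponds to a query such as:
#   SELECT * FROM scores SORT BY score;   (hypothetical table and column)

def sort_by(rows, num_reducers, key):
    """Split rows across reducers, then sort each reducer's share locally."""
    # Assumption: round-robin split, used here only to keep the example simple.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each inner list stands for one reducer's sorted output file.
    return [sorted(part, key=key) for part in partitions]

rows = [5, 2, 9, 1, 7, 3]
files = sort_by(rows, num_reducers=2, key=lambda x: x)
print(files)  # [[5, 7, 9], [1, 2, 3]]
```

Each output list is ordered, yet reading the lists one after another does not give a globally sorted result.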

distribute by:

distribute by controls how the map output is distributed: rows whose distribute by field has the same value are sent to the same reduce node for processing. In other words, in some cases we need to route particular rows to a specific reducer, usually in preparation for a subsequent aggregation.
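The routing rule can be sketched in Python as below. The modulo on an integer key is a stand-in for whatever hash partitioning the engine really uses, and the function name distribute_by, the sample rows, and the table name in the comment are assumptions made for the example.

```python
# Toy model of distribute by: rows with the same key reach the same reducer.
# Roughly corresponds to a query such as:
#   SELECT * FROM staff DISTRIBUTE BY dept_id;   (hypothetical table and column)
from collections import defaultdict

def distribute_by(rows, num_reducers, key):
    """Route each row to a reducer chosen from its key value."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: key modulo reducer count stands in for the real hash.
        reducers[key(row) % num_reducers].append(row)
    return dict(reducers)

# Hypothetical (dept_id, name) rows.
rows = [(10, "ann"), (20, "bob"), (10, "cat"), (31, "dan"), (20, "eve")]
parts = distribute_by(rows, num_reducers=3, key=lambda r: r[0])
print(parts)
# {1: [(10, 'ann'), (10, 'cat'), (31, 'dan')], 2: [(20, 'bob'), (20, 'eve')]}
```

Note that one reducer may receive several different keys (10 and 31 here); the guarantee is only that rows sharing a key are never split across reducers.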

https://blog.csdn.net/qq_40795214/article/details/82190827

Origin www.cnblogs.com/gouhaiping/p/12652983.html