The difference between order by, sort by, distribute by, cluster by in Hive

1.order by

order by accepts desc/asc to set the sort direction; the default is asc.
order by sorts the input globally, so it uses only one reducer (multiple reducers cannot guarantee a global order). With a single reducer, the computation takes longer when the input is large.

2.sort by

sort by is not a global sort. The data is sorted before it enters each reducer. Therefore, if you sort with sort by and set mapred.reduce.tasks>1, sort by only ensures that the output of each reducer is in order; it does not ensure global order. (To realize a full sort: first sort by and then order by.)
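The difference between a per-reducer sort and a global sort can be sketched with a small Python simulation. This is only a model of the behavior, not Hive code: the input values and the two-reducer split are made up for illustration.

```python
def sort_by(partitions):
    """Sort each reducer's rows independently, as sort by does."""
    return [sorted(rows) for rows in partitions]


def order_by(partitions):
    """Funnel every row through one reducer and sort once, as order by does."""
    all_rows = [row for rows in partitions for row in rows]
    return sorted(all_rows)


# Hypothetical input: rows already split across two reducers.
partitions = [[5, 1, 9], [4, 8, 2]]

per_reducer = sort_by(partitions)
# Reading the reducer outputs one after another is not a global order.
concatenated = [row for rows in per_reducer for row in rows]
global_sorted = order_by(partitions)

print(per_reducer)    # [[1, 5, 9], [2, 4, 8]]
print(concatenated)   # [1, 5, 9, 2, 4, 8]
print(global_sorted)  # [1, 2, 4, 5, 8, 9]
```

Each reducer's output is ordered on its own, but the files taken together are not, which is why a single reducer is needed for a global order.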

3.distribute by (important)

distribute by controls how data is split from the map side to the reduce side. Hive distributes rows according to the column(s) that follow distribute by and the number of reducers, using a hash algorithm. sort by generates a sorted file for each reducer. In some cases, you need to control which reducer a particular row goes to, usually so that a later aggregation can be performed. distribute by does exactly this. Therefore, distribute by is usually used together with sort by.

select * from store distribute by merid sort by money desc;
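What the statement above does can be modeled roughly in Python. The sample rows, the reducer count of 2, and the modulo "hash" are assumptions for illustration only; Hive's actual hash function is not reproduced here.

```python
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer based on a hash of the distribute key."""
    reducers = defaultdict(list)
    for row in rows:
        # Stand-in hash: integer key modulo reducer count (an assumption,
        # not Hive's exact hash function).
        reducers[key(row) % num_reducers].append(row)
    return dict(reducers)


def sort_by(reducers, key, reverse=False):
    """Sort the rows inside each reducer independently."""
    return {r: sorted(rows, key=key, reverse=reverse)
            for r, rows in reducers.items()}


# Hypothetical rows of the store table: (merid, money).
store = [(1, 30), (2, 10), (1, 50), (3, 20), (2, 40), (3, 60)]

distributed = distribute_by(store, key=lambda r: r[0], num_reducers=2)
result = sort_by(distributed, key=lambda r: r[1], reverse=True)

for reducer_id in sorted(result):
    print(reducer_id, result[reducer_id])
# 0 [(2, 40), (2, 10)]
# 1 [(3, 60), (1, 50), (1, 30), (3, 20)]
```

All rows with the same merid land on the same reducer, but one reducer can hold several merid values, and sorting by money alone interleaves them. To keep each merid's rows together, sort by merid first and then by money.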

4.cluster by

cluster by combines the functions of distribute by and sort by on the same column. However, the sort can only be ascending; you cannot specify ASC or DESC.
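As a rough Python model, cluster by on a column behaves like distributing by that column and then sorting each reducer's rows by the same column in ascending order. The sample rows, the reducer count, and the modulo "hash" are again assumptions made for illustration.

```python
def cluster_by(rows, key, num_reducers):
    """Distribute rows by key, then sort each reducer ascending by the same key."""
    reducers = {}
    for row in rows:
        # Stand-in hash: integer key modulo reducer count (an assumption,
        # not Hive's exact hash function).
        reducers.setdefault(key(row) % num_reducers, []).append(row)
    # Ascending only: there is no direction argument, mirroring cluster by.
    return {r: sorted(part, key=key) for r, part in reducers.items()}


# Hypothetical rows of the store table: (merid, money).
store = [(3, 20), (1, 30), (2, 10), (1, 50), (3, 60), (2, 40)]

clustered = cluster_by(store, key=lambda r: r[0], num_reducers=2)

for reducer_id in sorted(clustered):
    print(reducer_id, [merid for merid, _ in clustered[reducer_id]])
# 0 [2, 2]
# 1 [1, 1, 3, 3]
```

Because the distribute key and the sort key are the same, rows with equal merid end up adjacent within a reducer.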

Origin blog.csdn.net/Cxf2018/article/details/109308867