The difference between order by, sort by, distribute by and cluster by in Hive

In summary:

Broadly speaking, all four clauses control how Hive sorts and distributes rows. However, the MapReduce jobs they launch when executed behave differently.

In detail:

order by:

Order by sorts all of the given data globally, and to do so it "wakes up" only one reducer. It is stubborn: no matter how much data arrives, only that one reducer is started to process it. With a small amount of data this is fine, but once the data grows, order by becomes extremely slow and may even go on "strike" (the job fails).
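
As a minimal sketch of the point above, assuming a table records2(year, temperature) like the one queried later in this post, a global sort might look like this. The limit value is arbitrary and only illustrative:

```sql
-- Sketch only: records2(year, temperature) is assumed to exist.
-- order by produces one globally sorted result, so all rows pass through a single reducer.
select year, temperature
from records2
order by year asc, temperature desc
limit 100;  -- arbitrary cap that keeps the single reducer's output small
```

Adding a limit is a common precaution. When Hive runs in strict mode (hive.mapred.mode=strict in older releases), an order by without a limit is rejected for exactly the reason described above.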

sort by:

Sort by is a partial sort. Where order by is lazy and single-minded, sort by is the opposite: it is diligent and can even clone itself. Sort by starts one or more reducers depending on the amount of data, and the rows sent to each reducer are sorted before that reducer processes them, so each reducer produces its own sorted output. The result is ordered within each reducer but not across reducers. The advantage is that the sorting work runs in parallel, which makes a subsequent global sort more efficient.
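
To make the per-reducer behavior visible, the following sketch forces several reducers and then sorts with sort by. The reducer count of 3 is an arbitrary assumption, and records2(year, temperature) is assumed to be the table used in the later examples of this post:

```sql
-- Assumption: request 3 reducers (MapReduce engine) so the partial sort can be observed.
set mapreduce.job.reduces=3;

-- Each reducer sorts only the rows it receives.
-- Temperatures are descending within one reducer's output, but not across all three.
select year, temperature
from records2
sort by temperature desc;
```

With a single reducer, sort by and order by give the same result. The difference only shows up once more than one reducer is running.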

distribute by:

Distribute by controls how the map output is distributed: rows with the same values in the distribute by columns are sent to the same reducer. In other words, in certain situations we need to route particular rows to a particular reducer, and this is usually done to prepare for an aggregation that follows.
Here is a common example:

from records2
select year, temperature
distribute by year
sort by year asc, temperature desc;

When sorting meteorological data by year and temperature, we want all rows for the same year to be handled by the same reducer. Distribute by year guarantees this, and sort by then orders the rows inside each reducer, so each year's readings come out together, from highest to lowest temperature. Note that this is not a global sort: the ordering holds within each reducer's output, not across reducers. Because distribute by is usually used together with sort by, remember that distribute by must be written before sort by. This is not hard to understand: the map output has to be distributed first, and only then can sort by, which is good at partial sorting, do its job. Without distribute by, Hive decides for itself how rows are spread across reducers, so rows for the same year may land in different reducers and sort by can no longer keep each year together.
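
One way to see the effect per reducer is to write the result to a directory, where each reducer leaves its own output file. This is only a sketch: the reducer count and the local path /tmp/records2_by_year are hypothetical choices, not part of the original example.

```sql
-- Assumptions: 3 reducers, and a hypothetical output path that Hive may overwrite.
set mapreduce.job.reduces=3;

insert overwrite local directory '/tmp/records2_by_year'
select year, temperature
from records2
distribute by year                    -- all rows of one year go to the same reducer
sort by year asc, temperature desc;   -- rows are ordered inside each reducer
```

Each file in the output directory should then hold complete years, sorted by year and by descending temperature, while any given year appears in only one file.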

cluster by:

If sort by and distribute by use exactly the same columns, the two clauses can be abbreviated to cluster by, which specifies the columns used by both at the same time:

from records2
select year, temperature
cluster by year;
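
The shorthand can be illustrated by writing both forms side by side. As a sketch against the same assumed records2(year, temperature) table, these two queries distribute and sort the rows in the same way:

```sql
-- Shorthand form: distribute and sort on the same column.
select year, temperature
from records2
cluster by year;

-- Equivalent long form.
select year, temperature
from records2
distribute by year
sort by year;
```

Cluster by always sorts in ascending order and does not accept asc or desc. When a descending sort or an extra sort column such as temperature is needed, the explicit distribute by ... sort by ... form is still required.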

Origin blog.csdn.net/qq_42578036/article/details/110139638