Implementing SELECT TOP N in Hive

TOP N means taking the N largest or N smallest rows.

Hive provides a limit keyword, so combined with sorting, Top N is easy to implement.

But order by in Hive runs with only one reducer, so if the table holds too much data, order by cannot cope.

For example, the SQL: select a from t_test order by a limit 10;

Console output: Number of reduce tasks determined at compile time: 1

This shows that the number of reducers is fixed at compile time. Viewing the SQL execution plan shows that only one Job is started.

If the table is very large and we only want the Top 10, sorting everything in a single reducer is very wasteful.

 

In this case we can use sort by instead, which solves the problem:

select a from t_test sort by a limit 10;

Console output: Number of reduce tasks not specified. Estimated from input data size: 1

This shows that the number of reducers is not decided at compile time, but determined dynamically from the input file size.

sort by can start multiple reducers, and each reducer only sorts its own data locally. For sort by with limit N, that is enough.
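To make "local sort" concrete, here is a minimal Python sketch, not Hive code. The function name local_sort and the round-robin split of rows across reducers are assumptions made purely for illustration; Hive decides the actual distribution itself.

```python
def local_sort(rows, num_reducers):
    """Hypothetical stand-in for sort by: each 'reducer' sorts only its own rows."""
    # Assumption: rows are dealt out round-robin; Hive's real distribution differs.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(p) for p in partitions]

rows = [7, 3, 9, 1, 8, 2, 6, 4]
parts = local_sort(rows, num_reducers=2)

# Each partition is ordered on its own...
print(parts)
# ...but the reducers' outputs placed one after another are not globally ordered.
concatenated = [v for p in parts for v in p]
print(concatenated == sorted(rows))
```

Each reducer's output is sorted, yet the combined output is not, which is why sort by alone does not give a global order and a further step is needed for Top N.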

The execution plan shows that sort by with limit N launches two Jobs. The first Job does a local sort in each reducer and takes that reducer's Top N. The second Job then does a global sort over those rows and outputs the final Top N.

Suppose the first Job starts x reducers. The second Job then globally sorts the x * N rows produced by those x reducers and takes the Top N, which gives the desired result.
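The two-stage idea can be sketched in Python as follows. This is only a simulation under stated assumptions, not how Hive is implemented: the name two_stage_top_n, the round-robin partitioning, and the random test data are all hypothetical.

```python
import random

def two_stage_top_n(rows, num_reducers, n):
    """Simulate sort by + limit n: per-reducer Top N, then a global Top N."""
    # Assumption: round-robin partitioning stands in for Hive's distribution.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]

    # Stage 1 (first Job): each reducer sorts locally and keeps its own Top N.
    candidates = []
    for part in partitions:
        candidates.extend(sorted(part)[:n])

    # Stage 2 (second Job): one global sort over at most num_reducers * n rows.
    return sorted(candidates)[:n], len(candidates)

random.seed(0)
rows = random.sample(range(1_000_000), 10_000)
top, candidate_count = two_stage_top_n(rows, num_reducers=4, n=10)

# The second stage sorts only 4 * 10 = 40 rows instead of all 10,000.
print(candidate_count)
print(top == sorted(rows)[:10])
```

The result matches a full global sort because every row in the true Top N must also be in the Top N of whichever reducer received it.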

This greatly improves the efficiency of the select, because the final single-reducer sort handles only x * N rows rather than the whole table.


Origin www.cnblogs.com/zbw1112/p/12550751.html