Hive sorting: using order by, sort by, distribute by, and cluster by

Overview: Hive provides four sorting-related clauses: order by, sort by, distribute by, and cluster by.
Their specific uses are as follows.

1.order by: The order by clause produces a global ordering. Hive executes queries as MapReduce jobs underneath,
and a global ordering is achieved by sending all rows through a single reducer.

hive (default)> select * from emp order by sal desc;

2.sort by: sort by produces one sorted file per reducer. Rows are sorted within each reducer, but the overall result set is not globally sorted.
On large data sets order by is very inefficient. In many cases a global ordering is not required, and sort by can be used instead.
Put simply: sorting within each partition.

hive (default)> set mapreduce.job.reduces=3;
hive (default)> select * from emp sort by deptno desc;
hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result' select * from emp sort by deptno desc;
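
The difference between order by and sort by can be illustrated with a small Python model. This is only a sketch of the semantics, not Hive code: the sample rows are made up, and the round-robin assignment of rows to reducers is an assumption standing in for whatever spread Hive actually chooses when no distribute by is given.

```python
# Toy model of ORDER BY vs SORT BY; not Hive code.
# Made-up sample rows in the form (empno, deptno).
rows = [(7369, 20), (7499, 30), (7782, 10), (7788, 20), (7839, 10), (7900, 30)]

def order_by_desc(rows, key):
    """One reducer sees every row, so the output is globally sorted."""
    return sorted(rows, key=key, reverse=True)

def sort_by_desc(rows, key, num_reducers):
    """Each reducer sorts only the rows it received."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: rows are spread round-robin across reducers.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket, key=key, reverse=True) for bucket in buckets]

def deptno(row):
    return row[1]

global_result = order_by_desc(rows, deptno)
per_reducer = sort_by_desc(rows, deptno, num_reducers=3)

print([deptno(r) for r in global_result])
for i, bucket in enumerate(per_reducer):
    print(i, [deptno(r) for r in bucket])
```

Each per-reducer list is sorted in descending order, but concatenating the three lists does not give a globally sorted result, which is why the three output files of the sort by query above each look sorted on their own.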

3.distribute by: In some cases we need to control which reducer a particular row goes to, typically for a subsequent aggregation. The distribute by clause does this. It is analogous to the partitioner (custom partitioning) in MapReduce: it partitions the rows, and it is used together with sort by.
Put simply: partition by the given field.
When testing distribute by, be sure to allocate more than one reducer; otherwise the effect of distribute by cannot be seen.
Hands-on example:
first partition by department number, then sort each partition by employee number in descending order.

hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

Note:
1. distribute by assigns rows to partitions by taking the hash code of the partition field modulo the number of reducers; rows with the same remainder go to the same partition.
2. Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
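
The partitioning rule in note 1 can be sketched in Python. This is a toy model, not Hive code: the sample rows are made up, and it assumes that an integer key hashes to itself (as Java's Integer.hashCode does) and that the hash is made non-negative before the remainder is taken.

```python
# Toy model of DISTRIBUTE BY deptno SORT BY empno DESC; not Hive code.
NUM_REDUCERS = 3  # mirrors: set mapreduce.job.reduces=3

def partition_for(deptno, num_reducers=NUM_REDUCERS):
    # Assumption: an integer key hashes to itself.
    # The mask keeps the hash non-negative before taking the remainder.
    return (deptno & 0x7FFFFFFF) % num_reducers

def distribute_and_sort(rows, num_reducers=NUM_REDUCERS):
    partitions = {p: [] for p in range(num_reducers)}
    for empno, deptno in rows:
        partitions[partition_for(deptno, num_reducers)].append((empno, deptno))
    for part in partitions.values():
        # sort by empno desc, applied inside each partition only
        part.sort(key=lambda row: row[0], reverse=True)
    return partitions

# Made-up sample rows in the form (empno, deptno).
rows = [(7369, 20), (7499, 30), (7782, 10), (7788, 20), (7839, 10), (7900, 30)]
result = distribute_and_sort(rows)
for p, part in result.items():
    print(p, part)
```

Under these assumptions department 30 goes to partition 0, department 10 to partition 1, and department 20 to partition 2. With a single reducer every remainder is 0, so all rows share one partition and the effect of distribute by is invisible.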

4.cluster by: When the distribute by field and the sort by field are the same, cluster by can be used instead.
The following two statements are equivalent:
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;

Note:
cluster by combines the function of distribute by with that of sort by.
However, the sort is ascending only; you cannot specify ASC or DESC.
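
As a rough illustration, cluster by can be modelled in Python as distribute by followed by an ascending sort by on the same key. This is a sketch, not Hive code: the sample rows are made up, department 40 is added so that two departments share a partition, and the hash-modulo rule assumes an integer key hashes to itself.

```python
# Toy model of CLUSTER BY deptno; not Hive code.
def distribute_by(rows, key, num_reducers):
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: an integer key hashes to itself.
        parts[(key(row) & 0x7FFFFFFF) % num_reducers].append(row)
    return parts

def sort_by(parts, key):
    # Ascending only, matching the restriction on cluster by.
    return [sorted(part, key=key) for part in parts]

def cluster_by(rows, key, num_reducers):
    return sort_by(distribute_by(rows, key, num_reducers), key)

def deptno(row):
    return row[1]

# Made-up sample rows in the form (empno, deptno).
rows = [(7499, 30), (7844, 40), (7369, 20), (7782, 10), (7521, 40), (7839, 10)]
clustered = cluster_by(rows, deptno, num_reducers=3)
for i, part in enumerate(clustered):
    print(i, part)
```

With three reducers, departments 10 and 40 land in the same partition, and the ascending sort keeps the rows of each department next to each other within that partition.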

Origin blog.csdn.net/weixin_43548518/article/details/104087412