Getting started with Hive: sorting queries (order by, sort by, distribute by, cluster by ...)

HQL queries are largely the same as ordinary SQL queries. This article summarizes the basic sorting queries in Hive.

Data preparation:

SELECT e.ename, d.dname, l.loc_name
FROM   emp e 
JOIN   dept d
ON     d.deptno = e.deptno 
JOIN   location l
ON     d.loc = l.loc;
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10

Load this table and its data into Hive.

1.Order by

Order by performs a global sort, so all rows pass through a single reducer.
ASC (ascend): ascending (default)
DESC (descend): descending

  1. Query employee information in ascending order of salary:
select * from emp order by sal;

  2. Query employee information in descending order of salary:
select * from emp order by sal desc;

  3. Sort by an alias, here twice the employee's salary:
select ename, sal*2 twosal from emp order by twosal;

  4. Sort by multiple columns, ascending by department and then by salary:
select ename, deptno, sal from emp order by deptno, sal ;
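The behaviour of the order by queries above can be modelled in a few lines of Python. This is only an illustrative sketch, not how Hive is implemented: the helper name `order_by` is hypothetical, and the rows are a small subset of the emp data reduced to (ename, deptno, sal).

```python
# A few (ename, deptno, sal) rows taken from the emp data above.
rows = [
    ("SMITH", 20, 800.00),
    ("ALLEN", 30, 1600.00),
    ("CLARK", 10, 2450.00),
    ("KING", 10, 5000.00),
    ("JAMES", 30, 950.00),
    ("FORD", 20, 3000.00),
]


def order_by(rows, key, descending=False):
    """Hypothetical model of ORDER BY: all rows are sorted in one place,
    which is what a single reducer does for a global sort."""
    return sorted(rows, key=key, reverse=descending)


# Models: select ename, deptno, sal from emp order by deptno, sal;
by_dept_then_sal = order_by(rows, key=lambda r: (r[1], r[2]))

# Models: select * from emp order by sal desc;
by_sal_desc = order_by(rows, key=lambda r: r[2], descending=True)

for row in by_dept_then_sal:
    print(row)
```

The tuple key `(deptno, sal)` mirrors the column list in the query: rows are compared on department first, and salary only breaks ties within a department.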

2.Sort by

Sort by: order by is very inefficient on large data sets. In many cases a global sort is not required, and sort by can be used instead.
Sort by produces one sorted file per reducer. Rows are sorted within each reducer, not across the whole result set.
1. Set the number of reducers

set mapreduce.job.reduces=3;

2. View the configured number of reducers

set mapreduce.job.reduces;

3. View employee information sorted by department number in descending order

 select * from emp sort by deptno desc;

4. Write the query results to a local directory (sorted by department number in descending order)

insert overwrite local directory '/home/hive/datas/sortby-result'
 select * from emp sort by deptno desc;
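To see why the output of sort by is sorted per file but not overall, here is a small Python model. It is a sketch under a stated assumption: rows are dealt to the reducers round-robin, which is only a stand-in, since Hive does not promise any particular assignment when distribute by is absent. The function name `sort_by` is hypothetical.

```python
# (ename, deptno) pairs taken from the emp data above.
rows = [
    ("SMITH", 20), ("ALLEN", 30), ("WARD", 30),
    ("JONES", 20), ("MARTIN", 30), ("BLAKE", 30),
    ("CLARK", 10), ("SCOTT", 20), ("KING", 10),
]


def sort_by(rows, num_reducers, key, descending=False):
    """Hypothetical model of SORT BY with several reducers."""
    # Assumption: round-robin assignment of rows to reducers.
    reducers = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each reducer sorts only the rows it received.
    return [sorted(part, key=key, reverse=descending) for part in reducers]


# Models: set mapreduce.job.reduces=3;
#         select * from emp sort by deptno desc;
files = sort_by(rows, num_reducers=3, key=lambda r: r[1], descending=True)

for i, part in enumerate(files):
    print("reducer", i, [deptno for _, deptno in part])
```

Each printed list is in descending order on its own, but reading the three lists one after another does not give a globally descending sequence, which matches what the three output files contain.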

3.Distribute by

Distribute by: In some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation. The distribute by clause does this. It is similar to a custom partitioner in MR, and it is used in combination with sort by.
To test distribute by you must allocate multiple reducers; otherwise its effect is not visible.
Case practice:
(1) First partition by department number, then sort by employee number in descending order.

set mapreduce.job.reduces=3;
insert overwrite local directory '/home/hive/datas/distribute-result'
select * from emp distribute by deptno sort by empno desc;

Note:
1. distribute by assigns rows by taking the hash code of the distribute field modulo the number of reducers; rows with the same remainder go to the same reducer.
2. Hive requires the DISTRIBUTE BY statement to be written before the SORT BY statement.
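The distribution rule in note 1 can be sketched in Python as follows. This is a model, not Hive's code: it assumes that the hash code of an integer key is the integer itself, and the names `reducer_for` and `distribute_by` are hypothetical.

```python
def reducer_for(key, num_reducers):
    """Pick a reducer as hash(key) modulo the number of reducers."""
    # Assumption: an integer key hashes to itself. The mask keeps the
    # value non-negative before the remainder is taken.
    return (key & 0x7FFFFFFF) % num_reducers


def distribute_by(rows, num_reducers, key):
    """Hypothetical model of DISTRIBUTE BY: group rows by target reducer."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        parts[reducer_for(key(row), num_reducers)].append(row)
    return parts


# (empno, deptno) pairs taken from the emp data above.
rows = [(7369, 20), (7499, 30), (7782, 10), (7839, 10), (7900, 30), (7902, 20)]

# Models: distribute by deptno sort by empno desc, with 3 reducers.
parts = distribute_by(rows, num_reducers=3, key=lambda r: r[1])
result = [sorted(p, key=lambda r: r[0], reverse=True) for p in parts]

for i, part in enumerate(result):
    print("reducer", i, part)
```

Under these assumptions, with three reducers each department lands on its own reducer, and within each reducer the rows are ordered by employee number descending. With a different reducer count, several departments can share a reducer.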

4.Cluster by

When the distribute by and sort by fields are the same, cluster by can be used instead.
cluster by combines the functions of distribute by and sort by. However, it sorts only in ascending order; you cannot specify ASC or DESC.
1) The following two statements are equivalent:

 select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

Note: Distributing by department number does not guarantee that each partition holds exactly one department; for example, departments 20 and 30 may end up in the same partition.
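The note can be illustrated with a short Python model of cluster by. It is a sketch under the same kind of assumption as before (an integer key hashes to itself), the function name `cluster_by` is hypothetical, and which departments actually collide depends on the real hash function and the number of reducers.

```python
def cluster_by(rows, num_reducers, key):
    """Hypothetical model of CLUSTER BY: distribute by the key, then sort
    each reducer's rows by the same key in ascending order."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: an integer key hashes to itself.
        parts[(key(row) & 0x7FFFFFFF) % num_reducers].append(row)
    return [sorted(part, key=key) for part in parts]


# (ename, deptno) pairs taken from the emp data above.
rows = [
    ("SMITH", 20), ("ALLEN", 30), ("CLARK", 10),
    ("KING", 10), ("JAMES", 30), ("FORD", 20),
]

# Models: select * from emp cluster by deptno; with 4 reducers.
parts = cluster_by(rows, num_reducers=4, key=lambda r: r[1])

for i, part in enumerate(parts):
    print("reducer", i, [deptno for _, deptno in part])
```

In this model with four reducers, two departments share one reducer while another reducer stays empty. Rows inside the shared reducer are still in ascending department order, so each department's rows stay together.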

Origin blog.csdn.net/thetimelyrain/article/details/104169732