HQL queries are largely the same as ordinary SQL queries. This article summarizes the basic sorting clauses in Hive: order by, sort by, distribute by, and cluster by.
Data preparation:
SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;
7369  SMITH   CLERK      7902  1980-12-17   800.00           20
7499  ALLEN   SALESMAN   7698  1981-2-20   1600.00   300.00  30
7521  WARD    SALESMAN   7698  1981-2-22   1250.00   500.00  30
7566  JONES   MANAGER    7839  1981-4-2    2975.00           20
7654  MARTIN  SALESMAN   7698  1981-9-28   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1    2850.00           30
7782  CLARK   MANAGER    7839  1981-6-9    2450.00           10
7788  SCOTT   ANALYST    7566  1987-4-19   3000.00           20
7839  KING    PRESIDENT        1981-11-17  5000.00           10
7844  TURNER  SALESMAN   7698  1981-9-8    1500.00     0.00  30
7876  ADAMS   CLERK      7788  1987-5-23   1100.00           20
7900  JAMES   CLERK      7698  1981-12-3    950.00           30
7902  FORD    ANALYST    7566  1981-12-3   3000.00           20
7934  MILLER  CLERK      7782  1982-1-23   1300.00           10
Create the emp table in Hive and load the data above into it.
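The table definition is not shown here, so the following is only a minimal sketch of what the setup might look like. The column names empno, ename, sal and deptno come from the queries later in this article; job, mgr, hiredate and comm are assumed from the layout of the sample rows. The file name emp.txt and the tab delimiter are also assumptions, with missing values left as empty fields.

```sql
-- Sketch only: job, mgr, hiredate and comm are assumed column names.
-- hiredate is kept as a string because the sample dates are not zero-padded.
create table if not exists emp (
    empno    int,
    ename    string,
    job      string,
    mgr      int,
    hiredate string,
    sal      double,
    comm     double,
    deptno   int
)
row format delimited fields terminated by '\t';

-- Hypothetical local path; point it at wherever the data file is saved.
load data local inpath '/home/hive/datas/emp.txt' into table emp;
```

Running select * from emp afterwards is a quick way to check that all 14 rows were loaded and that the empty mgr and comm fields came through as NULL.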
1.Order by
Order by performs a global sort over the whole result set, so it runs with only one reducer. The sort direction is given after the column:
ASC (ascend): ascending order (default)
DESC (descend): descending order
- Query employee information in ascending salary order:
select * from emp order by sal;
- Query employee information in descending order of salary:
select * from emp order by sal desc;
- Sort by a column alias, here twice the employee's salary:
select ename, sal*2 twosal from emp order by twosal;
- Sort by multiple columns, ascending by department number and then by salary:
select ename, deptno, sal from emp order by deptno, sal;
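Each column in a multi-column order by can take its own direction, and a limit can be appended to keep only the first rows of the sorted result. A small sketch against the emp table used above (the choice of three rows is arbitrary):

```sql
-- Ascending by department number, then highest salary first within each department.
select ename, deptno, sal from emp order by deptno asc, sal desc;

-- Keep only the three highest-paid employees; the sort is applied before the limit.
select ename, sal from emp order by sal desc limit 3;
```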
2.Sort by
Order by is very inefficient on large data sets, because every row has to pass through a single reducer. In many cases a global sort is not required, and sort by can be used instead.
Sort by produces one sorted file per reducer. Rows are sorted within each reducer, but the result set as a whole is not globally ordered.
1. Set the number of reducers
set mapreduce.job.reduces=3;
2. View the configured number of reducers
set mapreduce.job.reduces;
3. Query employee information, sorted by department number in descending order within each reducer
select * from emp sort by deptno desc;
4. Write the query results to local files (sorted in descending order by department number)
insert overwrite local directory '/home/hive/datas/sortby-result'
select * from emp sort by deptno desc;
3.Distribute by
Distribute by: in some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation. The distribute by clause does this. It is similar to a custom partitioner in MapReduce, and it is normally used together with sort by.
To test distribute by, you must configure more than one reducer; otherwise its effect is not visible.
Example:
(1) First partition by department number, then sort by employee number in descending order within each partition.
set mapreduce.job.reduces=3;
insert overwrite local directory '/home/hive/datas/distribute-result'
select * from emp distribute by deptno sort by empno desc;
Note:
1. Distribute by takes the hash code of the partition field modulo the number of reducers; rows with the same remainder go to the same partition.
2. Hive requires the DISTRIBUTE BY statement to be written before the SORT BY statement.
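The hash-and-remainder rule in note 1 can be approximated with Hive's built-in hash() and pmod() functions. This is only a rough sketch: it assumes 3 reducers, as in the example above, and the hash that Hive applies internally when shuffling rows may differ between versions, so the values it prints are an estimate rather than a guarantee. The alias reducer_idx is made up for illustration.

```sql
-- Approximate which of the 3 partitions each department number maps to.
-- Assumption: the shuffle hash behaves like hash(); this can vary by Hive version.
select distinct deptno, pmod(hash(deptno), 3) as reducer_idx
from emp;
```

Comparing this output with the files written to the result directory shows whether the approximation holds on a given installation.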
4.Cluster by
When the distribute by and sort by fields are the same, cluster by can be used instead.
Cluster by combines the function of distribute by with that of sort by. However, it can only sort in ascending order; you cannot specify ASC or DESC.
1) The following two statements are equivalent:
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;
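Because cluster by always sorts in ascending order, a descending sort on the same column has to be written out with distribute by and sort by. A minimal sketch:

```sql
-- Same partitioning as cluster by deptno, but sorted descending within each reducer.
select * from emp distribute by deptno sort by deptno desc;
```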
Note: partitioning by department number does not guarantee one department per partition; for example, departments 20 and 30 may end up in the same partition.