Hive使用row_number()函数有重复值,顺序固定吗

分组topN。求出每个部门工资最高的员工信息

准备数据

建表语句

create table employee (empid int ,deptid int ,salary decimal(10,2));

插入数据

insert into employee values(1,10,5500.00);
insert into employee values(2,10,4500.00);
insert into employee values(3,20,1900.00);
insert into employee values(4,20,4800.00);
insert into employee values(5,40,6500.00);
insert into employee values(6,40,14500.00);
insert into employee values(7,40,44500.00);
insert into employee values(8,50,6500.00);
insert into employee values(9,50,7500.00);

数据查询

数据显示为:

+-------+--------+----------+
| empid | deptid | salary   |
+-------+--------+----------+
| 3     | 20     | 1900.00  |
| 8     | 50     | 6500.00  |
| 9     | 50     | 7500.00  |
| 1     | 10     | 5500.00  |
| 6     | 40     | 14500.00 |
| 5     | 40     | 6500.00  |
| 4     | 20     | 4800.00  |
| 7     | 40     | 44500.00 |
| 2     | 10     | 4500.00  |
+-------+--------+----------+

需求

需求:根据部门分组,显示每个部门工资最高的员工

根据员工的工资等级来排序,使用排名函数row_number()

SELECT empid
	, deptid
	, salary
	, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee;

结果显示

+-------+--------+----------+------+
| empid | deptid | salary   | rank |
+-------+--------+----------+------+
| 4     | 20     | 4800.00  | 1    |
| 3     | 20     | 1900.00  | 2    |
| 1     | 10     | 5500.00  | 1    |
| 2     | 10     | 4500.00  | 2    |
| 7     | 40     | 44500.00 | 1    |
| 6     | 40     | 14500.00 | 2    |
| 5     | 40     | 6500.00  | 3    |
| 9     | 50     | 7500.00  | 1    |
| 8     | 50     | 6500.00  | 2    |
+-------+--------+----------+------+

怎么显示出分数排名第一的员工的信息了,很容易犯的错误是直接使用where或者having来进行筛选

使用where来直接筛选

SELECT empid
	, deptid
	, salary
	, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee
WHERE rank = 1

结果报错:表中没有rank字段

ERROR: AnalysisException: Could not resolve column/field reference: 'rank'

使用having来进行筛选

FAILED: SemanticException HAVING specified without GROUP BY

结果报错,having语句要接在group by子句的后面

很常见的一个解决方式是,构建出临时表,使用子查询来进行排序筛选或者分页

SELECT * from
(SELECT empid, deptid, salary, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee) t where rank = 1

结果显示

+-------+--------+----------+------+
| empid | deptid | salary   | rank |
+-------+--------+----------+------+
| 4     | 20     | 4800.00  | 1    |
| 1     | 10     | 5500.00  | 1    |
| 7     | 40     | 44500.00 | 1    |
| 9     | 50     | 7500.00  | 1    |
+-------+--------+----------+------+

如果有多个数据都是最大,第一名会怎么取呢?

再插入三条数据

insert into employee values(10,20,4800.00);
insert into employee values(11,10,5500.00);
insert into employee values(12,50,7500.00);

可以看到结果是变化

+-------+--------+----------+------+
| empid | deptid | salary   | rank |
+-------+--------+----------+------+
| 10    | 20     | 4800.00  | 1    |
| 11    | 10     | 5500.00  | 1    |
| 7     | 40     | 44500.00 | 1    |
| 9     | 50     | 7500.00  | 1    |
+-------+--------+----------+------+

可能也有

+-------+--------+----------+------+
| empid | deptid | salary   | rank |
+-------+--------+----------+------+
| 10    | 20     | 4800.00  | 1    |
| 1     | 10     | 5500.00  | 1    |
| 7     | 40     | 44500.00 | 1    |
| 12    | 50     | 7500.00  | 1    |
+-------+--------+----------+------+

造成结果不一致,常用的方法是

  1. 业务属性确保 salary字段不能唯一,则需要针对排序结果再加上 一个唯一字段的排序 比如empid
SELECT * from(SELECT empid, deptid, salary, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC,empid) AS rank
FROM employee) t where rank = 1

这样结果是唯一的

+-------+--------+----------+------+
| empid | deptid | salary   | rank |
+-------+--------+----------+------+
| 1     | 10     | 5500.00  | 1    |
| 7     | 40     | 44500.00 | 1    |
| 9     | 50     | 7500.00  | 1    |
| 4     | 20     | 4800.00  | 1    |
+-------+--------+----------+------+

一定会是这几条数据

  1. 是否可以去掉不必要的order by,这个是由业务逻辑决定的

总结

1)row_number函数: row_number() 按指定的列进行分组生成行序列, 从 1 开始, 如果两行记录的分组列相同, 行序列也是+1。有重复值,数据的顺序是不固定的,可以采用再加上一个字段来达到类似去重复的效果。

2)over 函数:是一个窗口函数.
over(order by salary) 按照salary排序进行累计,order by是个默认的开窗函数.
over(partition by deptid)按照部门分区
over(partition by deptid order by salary desc)按照部分分区,并按着薪水降序排序.

发布了35 篇原创文章 · 获赞 12 · 访问量 6635

猜你喜欢

转载自blog.csdn.net/u012955829/article/details/102874222