09-20 Hive SQL window functions + key practice exercises

Hive SQL window functions:

Core syntax skeleton (the numbers give the logical execution order):

8 - Select
1 - From (left_table)
3 - (join_type) join (right_table)
2 - On
4 - Where
5 - Group by
6 - With
7 - Having
9 - Order by
10 - Limit

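1. sum(), avg(): cumulative (aggregate) window functions
2. row_number(), rank(): ranking window functions
3. ntile(): bucketing window function
4. lag(), lead(): offset analysis window functions

Cumulative window functions:

sum(...) over(...): the over clause specifies the condition (columns) by which the values are accumulated.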

eg1:
-- Total payment for each month of 2018, plus the running total for the year

select a.month,
       a.pay_amount,
       sum(a.pay_amount) over (order by a.month)
from
    (select month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) = 2018
     group by month(dt)) a;
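
The running total in eg1 can be checked on a few rows without a Hive cluster. The sketch below is only a stand-in under stated assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later accepts the same over clause), and the table monthly_pay with its three rows is made up to play the role of the aggregated subquery a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the aggregated subquery "a" in eg1.
conn.execute("create table monthly_pay (month integer, pay_amount integer)")
conn.executemany(
    "insert into monthly_pay values (?, ?)",
    [(1, 100), (2, 50), (3, 70)],
)

# Same window clause as eg1: accumulate pay_amount in month order.
rows = conn.execute("""
    select month,
           pay_amount,
           sum(pay_amount) over (order by month) as running_total
    from monthly_pay
    order by month
""").fetchall()

for row in rows:
    print(row)
```

With these three rows the running_total column comes out as 100, 150 and 220.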

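1. partition by does the grouping (in the example that follows, rows are first grouped by year (2017, 2018) and then sorted by month within each group).
2. order by sets the sort order used for the accumulation: asc is ascending, desc is descending, and ascending is the default.

Eg:

Total payment for each month of 2017-2018, plus the running total within each year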

select a.year,
       a.month,
       a.pay_amount,
       -- partition by groups the rows by year; otherwise the ordering would mix rows such as 2018-01 and 2017-03
       sum(a.pay_amount) over (partition by a.year order by a.month)
from
    (select year(dt) year,
            month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) in (2017, 2018)
     group by year(dt),
              month(dt)) a;

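Common mistakes:
A. No partitioning

sum(pay_amount) over (order by a.month)
The 24 months are sorted together, so the months of 2017 and 2018 get mixed up.

B. Wrong partitioning column(s)
sum(pay_amount) over (partition by a.year, a.month order by a.month)

With this partitioning each of the 24 months is its own group, so nothing is accumulated.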
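
To see the correct clause and mistakes A and B side by side, here is a small sketch. It runs in SQLite through Python's sqlite3 module rather than in Hive (an assumption made so that it can be executed locally), and the table monthly_pay and its four rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated data: two months in each of two years.
conn.execute("create table monthly_pay (year integer, month integer, pay_amount integer)")
conn.executemany(
    "insert into monthly_pay values (?, ?, ?)",
    [(2017, 1, 10), (2017, 2, 20), (2018, 1, 5), (2018, 2, 7)],
)

rows = conn.execute("""
    select year, month, pay_amount,
           -- correct: restart the running total in every year
           sum(pay_amount) over (partition by year order by month) as per_year,
           -- mistake A: no partition, both years are accumulated together
           sum(pay_amount) over (order by month) as no_partition,
           -- mistake B: every (year, month) is its own group
           sum(pay_amount) over (partition by year, month order by month) as over_partitioned
    from monthly_pay
    order by year, month
""").fetchall()

for row in rows:
    print(row)
```

In this run per_year restarts at 5 for 2018-01, no_partition adds January 2017 and January 2018 together (15 for both rows), and over_partitioned simply repeats pay_amount.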

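avg(...) over(...)

Eg1:
Three-month moving average of the payment amount for each month of 2018

Moving average: for measurements (x1, x2, x3, x4, x5, x6, x7), the moving averages are (x1+x2+x3)/3, (x2+x3+x4)/3, (x3+x4+x5)/3, and so on.

Hive implementation: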

select a.month,       -- the month column
       a.pay_amount,  -- the payment amount for each month
       avg(a.pay_amount) over (order by a.month rows between 2 preceding and current row)
       -- the moving average column
from
    (select month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) = 2018
     group by month(dt)) a;

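Summary:
1. sum(a) over (partition by b order by c rows between d1 and d2)
2. avg(a) over (partition by b order by c rows between d1 and d2)

a: the column to be processed
b: the partitioning column
c: the ordering column
d: the range of rows included in the calculation

rows between unbounded preceding and current row
    the current row and all rows before it; can be omitted when order by is present (Hive then defaults to range between unbounded preceding and current row, which gives the same result when the order by values are unique)
rows between current row and unbounded following
    the current row and all rows after it
rows between 3 preceding and current row
    the current row and the three rows before it
rows between 3 preceding and 1 following
    from three rows before to one row after (5 rows in total, including the current row)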
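
The frame clauses listed above are easier to compare on concrete numbers. The following is a minimal sketch, assuming SQLite (via Python's sqlite3 module) as a local substitute for Hive; the table monthly_pay and the amounts 10 to 50 are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical monthly totals: months 1..5 with amounts 10..50.
conn.execute("create table monthly_pay (month integer, pay_amount integer)")
conn.executemany(
    "insert into monthly_pay values (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

rows = conn.execute("""
    select month,
           -- three-month moving average (current row and the two before it)
           avg(pay_amount) over (order by month
                                 rows between 2 preceding and current row) as avg_last_3,
           -- current row and everything after it
           sum(pay_amount) over (order by month
                                 rows between current row and unbounded following) as sum_rest,
           -- up to five rows: three before, the current row, one after
           sum(pay_amount) over (order by month
                                 rows between 3 preceding and 1 following) as sum_5_rows
    from monthly_pay
    order by month
""").fetchall()

for row in rows:
    print(row)
```

Near the edges the frame simply contains fewer rows: for month 1 the moving average is computed over a single row, and for month 2 over two rows.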

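Ranking window functions with partitioning (a common interview topic):

1. row_number() over(...): numbers the rows of the result in sort order; the numbers never repeat.
2. rank() over(...) and dense_rank() over(...): rows with the same value in the ordering column get the same number; rank() then skips numbers after a tie, while dense_rank() does not.

eg1:
Ranking of users by the number of product categories they purchased in January 2018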

select user_name,
       count(distinct goods_category),
       row_number() over (order by count(distinct goods_category)),
       rank() over (order by count(distinct goods_category)),
       dense_rank() over (order by count(distinct goods_category))
from user_trade
where substr(dt, 1, 7) = '2018-01'
group by user_name;

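A simplified view of how the results differ:

goods_category count   row_number()   rank()   dense_rank()
1                      1              1        1
1                      2              1        1
2                      3              3        2

Business scenarios:
row_number(): take exactly the top 2 people
rank(): ranking college entrance exam scores (in the example above, nobody is ranked second)
dense_rank(): medal winners in a competition (equal scores share a place)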
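
The three-row comparison above can be reproduced with a short sketch. It assumes SQLite (through Python's sqlite3 module) in place of Hive, and a hypothetical table user_categories that already holds each user's category count. A user_name tie-breaker is added to row_number() only so that the output is deterministic; the original query leaves the order of tied rows unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-user counts of distinct goods categories.
conn.execute("create table user_categories (user_name text, category_cnt integer)")
conn.executemany(
    "insert into user_categories values (?, ?)",
    [("alice", 1), ("bob", 1), ("carol", 2)],
)

rows = conn.execute("""
    select user_name,
           category_cnt,
           row_number() over (order by category_cnt, user_name) as rn,   -- 1, 2, 3
           rank()       over (order by category_cnt)            as rk,   -- gap after a tie
           dense_rank() over (order by category_cnt)            as drk   -- no gap
    from user_categories
    order by rn
""").fetchall()

for row in rows:
    print(row)
```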

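Eg2:

Select the users whose 2019 payment amount ranks 10th, 20th and 30th

Note: the as keyword may be omitted when aliasing a column computed by a function, as the query here does.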

select a.user_name,
       a.pay_amount,
       a.rank
from
    (select user_name,                                               -- user name
            sum(pay_amount) pay_amount,                              -- total payment per user
            row_number() over (order by sum(pay_amount) desc) rank   -- position by total payment
     from user_trade
     where year(dt) = 2019
     group by user_name) a
where a.rank in (10, 20, 30);

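Bucketing window function:

ntile(n) over (partition by a order by b)
n: the number of slices
a: the partitioning column
b: the ordering column

ntile(n) cuts the ordered rows of each partition into n slices and returns the number of the slice each row falls in.
ntile does not support rows between.
If the rows cannot be divided evenly, the extra rows go to the first slice(s).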
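
The uneven case is worth seeing once. The sketch below is an illustration under assumptions: it runs in SQLite through Python's sqlite3 module instead of Hive, and the table user_pay with seven made-up users is split into three slices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-user totals: seven users, amounts 70 down to 10.
conn.execute("create table user_pay (user_name text, pay_amount integer)")
conn.executemany(
    "insert into user_pay values (?, ?)",
    [(f"u{i}", 80 - 10 * i) for i in range(1, 8)],
)

rows = conn.execute("""
    select user_name,
           pay_amount,
           ntile(3) over (order by pay_amount desc) as bucket
    from user_pay
    order by pay_amount desc
""").fetchall()

# Slice number for each row, highest payer first.
buckets = [bucket for _, _, bucket in rows]
print(buckets)
```

Seven rows do not divide evenly into three slices, so in this run the first slice holds three rows and the other two hold two rows each.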

Eg: split the users who paid in January 2019 into 5 groups by payment amount

select user_name,
       sum(pay_amount) pay_amount,
       ntile(5) over (order by sum(pay_amount) desc) level  -- the slices are cut on the same expression that is used for ordering
from user_trade
where substr(dt, 1, 7) = '2019-01'
group by user_name;

Eg2:
Select the users in the top 10 percent by refund amount in 2019

select a.user_name,
       a.refund_amount,
       a.level
from
    (select user_name,
            sum(refund_amount) refund_amount,
            ntile(10) over (order by sum(refund_amount) desc) level
     from user_refund
     where year(dt) = 2019
     group by user_name) a
where a.level = 1;

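Offset analysis window functions

1. lag(...) over(...): takes the value n rows before the current row in the sorted data (lag always shifts backwards)
2. lead(...) over(...): takes the value n rows after the current row (lead always shifts forwards)
lag(exp_str, offset, defval) over (partition by ... order by ...)
lead(exp_str, offset, defval) over (partition by ... order by ...)

exp_str: the column name
offset: the offset, i.e. the value 1 row or n rows away; the default offset is 1
defval: the default value, used when the offset reaches outside the available rows

eg1:
Various time offsets for alice and alexander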

select user_name,
       dt,
       lag(dt, 1, dt) over (partition by user_name order by dt),
       lag(dt) over (partition by user_name order by dt),
       lag(dt, 2, dt) over (partition by user_name order by dt),
       lag(dt, 2) over (partition by user_name order by dt)
from user_trade
where dt > '0'
  and user_name in ('alice', 'alexander');

Note: lag(dt) offsets dt; offset defaults to 1 and defval defaults to null.
partition by user_name groups the rows by user, so the offsets are taken separately for each user.
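
The effect of offset and defval shows up most clearly on the first rows of each user. Here is a minimal sketch of the four lag calls from eg1, assuming SQLite (via Python's sqlite3 module) as a stand-in for Hive; the cut-down user_trade table and its dates are invented, and bob replaces alexander as a second user with a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down user_trade table with only the columns needed here.
conn.execute("create table user_trade (user_name text, dt text)")
conn.executemany(
    "insert into user_trade values (?, ?)",
    [("alice", "2019-01-01"), ("alice", "2019-02-01"), ("alice", "2019-03-01"),
     ("bob", "2019-01-15")],
)

rows = conn.execute("""
    select user_name,
           dt,
           lag(dt, 1, dt) over (partition by user_name order by dt) as lag1_or_dt,
           lag(dt)        over (partition by user_name order by dt) as lag1,
           lag(dt, 2, dt) over (partition by user_name order by dt) as lag2_or_dt,
           lag(dt, 2)     over (partition by user_name order by dt) as lag2
    from user_trade
    order by user_name, dt
""").fetchall()

for row in rows:
    print(row)
```

On alice's first row the two calls that pass dt as defval return dt itself, while the calls without defval return null (None in Python). bob's single row never sees alice's dates, because the offsets are taken inside each partition.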

eg2:
Users with more than 100 days between two consecutive payments

select count(distinct a.user_name)
from
    (select user_name,
            dt,
            lead(dt, 1) over (partition by user_name order by dt) lead_dt
     from user_trade
     where dt > '0') a
where a.lead_dt is not null
  and datediff(a.lead_dt, a.dt) > 100;
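
The same lead-then-filter pattern can be tried locally. The sketch below rests on several assumptions: SQLite (through Python's sqlite3 module) replaces Hive, the difference of two julianday() values replaces Hive's datediff(), and the three users and their dates are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down user_trade table.
conn.execute("create table user_trade (user_name text, dt text)")
conn.executemany(
    "insert into user_trade values (?, ?)",
    [("alice", "2019-01-01"), ("alice", "2019-06-01"),   # 151 days apart
     ("bob", "2019-01-01"), ("bob", "2019-02-01"),       # 31 days apart
     ("carol", "2019-03-01")],                           # only one payment
)

(user_count,) = conn.execute("""
    select count(distinct a.user_name)
    from
        (select user_name,
                dt,
                lead(dt, 1) over (partition by user_name order by dt) as lead_dt
         from user_trade) a
    where a.lead_dt is not null
      -- SQLite stand-in for Hive's datediff(a.lead_dt, a.dt)
      and julianday(a.lead_dt) - julianday(a.dt) > 100
""").fetchone()

print(user_count)
```

Only alice has a gap longer than 100 days; carol's single row has no following payment, so her lead_dt is null and she is filtered out.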

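Key exercise:

For each city and each gender, the top 3 users by payment amount in 2018

The point is:
                  Shanghai   male    top3
                             female  top3
                  Beijing    male    top3
                             female  top3

My own attempt (incorrect):
First compute each user's total payment for 2018
Then sort by city and gender
Finally filter
select a.user_name,
(select user_name,
sum(pay_amount) over( partition by city and sex order by sum(pay_amount) ) pay_amount
from user_trade
where year(dt)='2018'
group by user_name)a
from user_trade
where a. > 3 and a. =3;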

Correction:

	select  c.user_name,
			c.city,
			c.sex,
			c.pay_amount,
			c.rank
	from
		(select a.user_name,
				b.city,
				b.sex,
				a.pay_amount,
				row_number() over (partition by b.city, b.sex order by a.pay_amount desc) rank
		from
			(select user_name,
				   sum(pay_amount) pay_amount
			from user_trade
			where year(dt) = 2018
			group by user_name) a
			left join user_info b on a.user_name = b.user_name) c

	where c.rank <= 3;
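
For a quick local check of the corrected query, here is a sketch that runs the same three-level structure on a handful of rows. The assumptions: SQLite (via Python's sqlite3 module) stands in for Hive, strftime('%Y', dt) replaces year(dt), the alias rn replaces rank, and all users, cities and amounts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down versions of user_trade and user_info.
conn.execute("create table user_trade (user_name text, dt text, pay_amount integer)")
conn.execute("create table user_info (user_name text, city text, sex text)")
conn.executemany(
    "insert into user_trade values (?, ?, ?)",
    [("u1", "2018-01-05", 30), ("u1", "2018-03-01", 10), ("u1", "2017-01-01", 999),
     ("u2", "2018-02-01", 30), ("u3", "2018-02-02", 20), ("u4", "2018-05-01", 10),
     ("u5", "2018-06-01", 15), ("u6", "2018-07-01", 5), ("u7", "2018-08-01", 8)],
)
conn.executemany(
    "insert into user_info values (?, ?, ?)",
    [("u1", "shanghai", "male"), ("u2", "shanghai", "male"),
     ("u3", "shanghai", "male"), ("u4", "shanghai", "male"),
     ("u5", "shanghai", "female"), ("u6", "shanghai", "female"),
     ("u7", "beijing", "male")],
)

rows = conn.execute("""
    select c.user_name, c.city, c.sex, c.pay_amount, c.rn
    from
        (select a.user_name, b.city, b.sex, a.pay_amount,
                row_number() over (partition by b.city, b.sex
                                   order by a.pay_amount desc) as rn
         from
             (select user_name, sum(pay_amount) as pay_amount
              from user_trade
              where strftime('%Y', dt) = '2018'   -- SQLite stand-in for year(dt) = 2018
              group by user_name) a
             left join user_info b on a.user_name = b.user_name) c
    where c.rn <= 3
    order by c.city, c.sex, c.rn
""").fetchall()

for row in rows:
    print(row)
```

u1's 2017 payment is excluded by the year filter, and u4 is dropped because he is fourth among the Shanghai men; groups with fewer than three users are returned in full.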

Reposted from blog.csdn.net/weixin_46400833/article/details/108695739