Hive Analytic Window Functions

1. SUM, AVG, MAX, MIN
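2. NTILE distributes the rows of an ordered data set as evenly as possible across a specified number of buckets and assigns each row its bucket number. If the rows cannot be divided evenly, the lower-numbered buckets are filled first, and bucket sizes differ by at most 1.
Syntax: ntile(num) over([partition_clause] order_by_clause) as xxx
The bucket number can then be used to select the first or last n-th fraction of the data.
NTILE returns every row and only labels each one with its bucket number; to keep just a given fraction of the data, wrap the query in an outer query that filters on that label.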

For example, to find, for each cookie, the days in the top third by pv:
select *
from (select cookieid,
             createtime,
             pv,
             ntile(3) over(partition by cookieid order by pv desc) as rn
      from t
     ) a
where rn = 1;
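
To make the bucket sizes concrete, here is a minimal sketch that runs the same NTILE query from Python. It assumes SQLite (through the standard sqlite3 module, SQLite 3.25 or later) as a stand-in for Hive, since the window-function syntax used here is the same; the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, createtime TEXT, pv INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("cookie1", "2015-04-10", 1),
        ("cookie1", "2015-04-11", 5),
        ("cookie1", "2015-04-12", 7),
        ("cookie1", "2015-04-13", 3),
        ("cookie1", "2015-04-14", 2),
        ("cookie1", "2015-04-15", 4),
        ("cookie1", "2015-04-16", 6),
    ],
)

# 7 rows over 3 buckets: the lowest-numbered bucket takes the extra row (3, 2, 2).
buckets = conn.execute(
    """
    SELECT pv, ntile(3) OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn
    FROM t
    ORDER BY rn, pv DESC
    """
).fetchall()

# The outer query keeps only bucket 1, i.e. the top third of days by pv.
top_third = conn.execute(
    """
    SELECT createtime, pv
    FROM (SELECT cookieid, createtime, pv,
                 ntile(3) OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn
          FROM t) a
    WHERE rn = 1
    ORDER BY pv DESC
    """
).fetchall()

print(buckets)
print(top_third)
```

With 7 rows and 3 buckets, bucket 1 holds three rows and buckets 2 and 3 hold two each, so "rn = 1" here returns three days rather than exactly one third.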

3. ROW_NUMBER, RANK, DENSE_RANK
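
These three functions differ only in how they number tied rows: ROW_NUMBER is always consecutive, RANK gives ties the same number and then leaves a gap, and DENSE_RANK gives ties the same number without a gap. A minimal sketch, assuming SQLite (3.25 or later, via Python's sqlite3) as a stand-in for Hive and made-up sample rows:

```python
import sqlite3

# Assumption: SQLite stands in for Hive; two days share pv = 5 to produce a tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, createtime TEXT, pv INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("cookie1", "2015-04-10", 7),
        ("cookie1", "2015-04-11", 5),
        ("cookie1", "2015-04-12", 5),
        ("cookie1", "2015-04-13", 3),
    ],
)

ranks = conn.execute(
    """
    SELECT pv,
           row_number() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn,
           rank()       OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rk,
           dense_rank() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS drk
    FROM t
    ORDER BY rn
    """
).fetchall()

# Columns: pv, row_number, rank, dense_rank
for row in ranks:
    print(row)
```

After the tie at pv = 5, RANK jumps to 4 while DENSE_RANK continues with 3.
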
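4. These two sequence functions are used less often:
cume_dist: (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)
For example, the proportion of all employees whose salary is less than or equal to the current salary.
percent_rank: (rank of the current row within the partition - 1) / (total number of rows in the partition - 1)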
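
The two formulas can be checked on a small salary table. This is a sketch under assumptions: SQLite (3.25 or later, via Python's sqlite3) stands in for Hive, and the emp table with its dept, name, and sal columns is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; emp(dept, name, sal) is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        ("d1", "a", 1000),
        ("d1", "b", 2000),
        ("d1", "c", 2000),
        ("d1", "d", 3000),
        ("d1", "e", 4000),
    ],
)

rows = conn.execute(
    """
    SELECT sal,
           cume_dist()    OVER (PARTITION BY dept ORDER BY sal) AS cd,
           percent_rank() OVER (PARTITION BY dept ORDER BY sal) AS pr
    FROM emp
    ORDER BY sal
    """
).fetchall()

# Round only to keep the printed fractions tidy.
dist = [(sal, round(cd, 2), round(pr, 2)) for sal, cd, pr in rows]
for row in dist:
    print(row)
```

For sal = 2000, three of the five rows are less than or equal to it, so cume_dist is 3/5 = 0.6; its rank is 2, so percent_rank is (2 - 1) / (5 - 1) = 0.25.
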
5. LAG(col, n, default) returns the value of col from the n-th row before the current row within the window.

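select cookieid, createtime, url,
       row_number() over(partition by cookieid order by createtime) as rn,
       -- createtime of the previous row; the default value must be a quoted string literal
       lag(createtime, 1, '1970-01-01 00:00:00') over(partition by cookieid order by createtime) as last_1_time,
       -- createtime from two rows back; null when there is no such row
       lag(createtime, 2) over(partition by cookieid order by createtime) as last_2_time
from t;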

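6. LEAD returns the value from the n-th row after the current row within the window.
The first argument is the column name, the second is the offset n (how many rows after the current row), and the third is the default value, used when there is no n-th following row (if no default is specified, the result is null).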
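
LAG and LEAD can be compared side by side on the same window. A minimal sketch, assuming SQLite (3.25 or later, via Python's sqlite3) as a stand-in for Hive; the three page views are made-up data and next_1_time is an illustrative alias.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("cookie1", "2015-04-10 10:00:00", "url1"),
        ("cookie1", "2015-04-10 10:03:04", "url2"),
        ("cookie1", "2015-04-10 10:50:05", "url3"),
    ],
)

shifted = conn.execute(
    """
    SELECT createtime,
           lag(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
           lag(createtime, 2)
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time,
           lead(createtime, 1)
               OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time
    FROM t
    ORDER BY createtime
    """
).fetchall()

# Columns: createtime, last_1_time, last_2_time, next_1_time (None is SQL NULL)
for row in shifted:
    print(row)
```

The first row has no predecessor, so last_1_time falls back to the default; the last row has no successor and no default was given, so next_1_time is NULL.
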
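7. FIRST_VALUE returns the first value in the partition, after sorting, up to the current row.
8. LAST_VALUE returns the last value in the partition, after sorting, up to the current row.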
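
Because both functions only look "up to the current row" by default, LAST_VALUE with an ORDER BY usually just returns the current row's value. The sketch below shows that, plus one way to get the last value of the whole partition by widening the window frame explicitly. It assumes SQLite (3.25 or later, via Python's sqlite3) as a stand-in for Hive, with made-up rows and illustrative aliases.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("cookie1", "2015-04-10 10:00:00", "url1"),
        ("cookie1", "2015-04-10 10:03:04", "url2"),
        ("cookie1", "2015-04-10 10:50:05", "url3"),
    ],
)

values = conn.execute(
    """
    SELECT url,
           first_value(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS first_url,
           last_value(url)  OVER (PARTITION BY cookieid ORDER BY createtime) AS last_so_far,
           last_value(url)  OVER (PARTITION BY cookieid ORDER BY createtime
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS last_url
    FROM t
    ORDER BY createtime
    """
).fetchall()

# Columns: url, first_url, last_so_far (default frame), last_url (full-partition frame)
for row in values:
    print(row)
```
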

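9. GROUPING SETS is a convenient way to write several GROUP BY clauses in a single SQL statement. It is equivalent to a UNION ALL of the GROUP BY result sets for the different dimensions.
GROUPING__ID indicates which grouping set a result row belongs to. Its numeric values depend on the Hive version (the computation changed in Hive 2.3.0).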

select month, day, count(distinct cookieid) as uv, grouping__id
from t
group by month, day
grouping sets(month, day) -- following the order of the grouping sets: 1 stands for month, 2 for day
order by grouping__id;
-- equivalent to
select month, null as day, count(distinct cookieid) as uv, 1 as grouping__id
from t
group by month
union all
select null as month, day, count(distinct cookieid) as uv, 2 as grouping__id
from t
group by day;

Another example:

select month, day, count(distinct cookieid) as uv, grouping__id
from t
group by month, day
grouping sets(month, day, (month, day))
order by grouping__id;
-- equivalent to
select month, null as day, count(distinct cookieid) as uv, 1 as grouping__id from t group by month
union all
select null as month, day, count(distinct cookieid) as uv, 2 as grouping__id from t group by day
union all
select month, day, count(distinct cookieid) as uv, 3 as grouping__id from t group by month, day;
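
The UNION ALL equivalence can be emulated in a few lines of plain Python, which makes it easy to see which rows each grouping set contributes. This is only an illustrative sketch: grouping_sets_uv is a hypothetical helper, the rows are made up, and the last column is simply the 1-based position of the grouping set, not the value Hive's GROUPING__ID would return.

```python
from collections import defaultdict

# Made-up sample rows; the column names mirror the queries above.
rows = [
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "c1"},
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "c2"},
    {"month": "2015-03", "day": "2015-03-12", "cookieid": "c1"},
    {"month": "2015-04", "day": "2015-04-12", "cookieid": "c3"},
]


def grouping_sets_uv(rows, dims, grouping_sets):
    """Emulate GROUP BY dims GROUPING SETS (...) with count(distinct cookieid).

    Returns sorted (dim values..., uv, set_index) tuples. set_index is the
    1-based position of the grouping set, not Hive's real GROUPING__ID.
    """
    seen = defaultdict(set)
    for idx, gs in enumerate(grouping_sets, start=1):
        for row in rows:
            # Dimensions outside the current grouping set are reported as None (NULL).
            key = tuple(row[d] if d in gs else None for d in dims)
            seen[(idx, key)].add(row["cookieid"])
    return sorted(
        (key + (len(cookies), idx) for (idx, key), cookies in seen.items()),
        key=lambda r: (r[-1], [v or "" for v in r[:-2]]),
    )


result = grouping_sets_uv(
    rows, ("month", "day"), [("month",), ("day",), ("month", "day")]
)
for r in result:
    print(r)
```

Each grouping set is processed as its own GROUP BY pass over the same rows, and the passes are concatenated, which is what the UNION ALL form spells out by hand.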

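10. CUBE aggregates over every combination of the GROUP BY dimensions. With two dimensions the combinations are 00, 10, 01, 11, where each digit marks whether the corresponding dimension is grouped on.
11. ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension, i.e. 00, 10, 11.
Syntax: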

select month, day, count(distinct cookieid) as uv, grouping__id
from t
group by month, day
with rollup
order by grouping__id;
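
The difference between the two can be made concrete by listing the grouping sets each one expands to. The sketch below is plain Python with hypothetical helper names (cube_sets, rollup_sets, bits); it only enumerates the dimension combinations and does not run any Hive query.

```python
from itertools import combinations


def cube_sets(dims):
    """Every subset of dims: the grouping sets CUBE aggregates over."""
    return [c for r in range(len(dims) + 1) for c in combinations(dims, r)]


def rollup_sets(dims):
    """Every left-to-right prefix of dims: the grouping sets ROLLUP aggregates over."""
    return [tuple(dims[:i]) for i in range(len(dims) + 1)]


def bits(dims, grouping_set):
    """Render a grouping set as one digit per dimension: 1 = grouped on, 0 = not."""
    return "".join("1" if d in grouping_set else "0" for d in dims)


dims = ("month", "day")
cube = [bits(dims, gs) for gs in cube_sets(dims)]
rollup = [bits(dims, gs) for gs in rollup_sets(dims)]

print(cube)    # ['00', '10', '01', '11']
print(rollup)  # ['00', '10', '11']
```

ROLLUP never produces 01 (day without month) because it only walks down the hierarchy from the leftmost dimension.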

Reposted from blog.csdn.net/AnlaGodness/article/details/106837205