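1. Proportional (stratified) sampling
I ran into this scenario at work, and the solution is my own. There is a user table user_info with fields user_id and city. The operations team wants to send a survey to 100,000 users, and the city distribution of the selected users must match the city distribution of the full user base.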
with city_fenbu as (
select city, user_cnt/ sum(user_cnt) over() as zhanbi
from (
select city,count(user_id) as user_cnt
from user_info
group by city
) t1
)
,user_shuffle as (
select city, user_id,
row_number() over (partition by city order by rand()) as rk -- shuffle the users within each city
from user_info
)
select aa.*, bb.*
from user_shuffle aa join city_fenbu bb on aa.city=bb.city
where round(100000*bb.zhanbi) >= aa.rk;
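To make the quota logic easier to check outside Hive, here is a minimal Python sketch of the same idea. The function name proportional_sample, the seed parameter, and the toy data are illustrative assumptions, not part of the original query.

```python
import random
from collections import defaultdict

def proportional_sample(users, total, seed=0):
    """users: list of (user_id, city). Returns (user_id, city) pairs whose
    city mix tracks the city mix of the full list."""
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for user_id, city in users:
        by_city[city].append(user_id)
    n_all = len(users)
    sample = []
    for city, ids in by_city.items():
        # Same role as round(100000 * zhanbi) in the SQL above.
        quota = round(total * len(ids) / n_all)
        # Same role as row_number() over (partition by city order by rand()).
        rng.shuffle(ids)
        sample.extend((uid, city) for uid in ids[:quota])
    return sample

# Toy population (hypothetical): 60% city A, 30% city B, 10% city C.
users = (
    [(i, "A") for i in range(600)]
    + [(i, "B") for i in range(600, 900)]
    + [(i, "C") for i in range(900, 1000)]
)
picked = proportional_sample(users, total=100)
```

Because each city's quota is rounded independently, the final sample size can be a few rows off the requested total, in the SQL as well as in this sketch. Python's round also rounds exact halves to the nearest even number, which can differ from Hive's round on such values.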
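2. Find users with at least n consecutive active days
A classic scenario that is well worth mastering. There is a user activity log table user_active_log with fields dt (date, yyyy-MM-dd) and user_id. Here is the SQL, with n = 7: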
with uid_dt as (
select dt, user_id
from user_active_log
group by dt,user_id
)
,uid_flg as (
select user_id, dt,
date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, though it has no meaning of its own; rows in the same consecutive-activity streak share the same flg_dt
from uid_dt
)
,uid_active_days as (
select user_id, flg_dt, count(dt) as continue_active_days
from uid_flg
group by user_id, flg_dt
)
select user_id
from uid_active_days
group by user_id
having max(continue_active_days) >=7;
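The date-minus-row-number trick is easier to trust after tracing it by hand. The following is a minimal Python sketch of the same logic; the function name users_with_streak and the sample log are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

def users_with_streak(log, n):
    """log: iterable of (dt, user_id) with dt a datetime.date.
    Returns the set of user_ids with at least n consecutive active days."""
    days_by_user = defaultdict(set)
    for dt, user_id in log:
        # The set de-duplicates, like "group by dt, user_id" in uid_dt.
        days_by_user[user_id].add(dt)
    result = set()
    for user_id, days in days_by_user.items():
        streak_len = defaultdict(int)
        for rn, dt in enumerate(sorted(days), start=1):
            # Mirrors date_sub(dt, row_number()): constant within one streak.
            flg_dt = dt - timedelta(days=rn)
            streak_len[flg_dt] += 1
        if max(streak_len.values()) >= n:
            result.add(user_id)
    return result

# Hypothetical log: u1 is active on Sept 1-3 and 5-6, u2 on Sept 1, 3, 5.
log = [(date(2020, 9, d), "u1") for d in (1, 2, 3, 5, 6)]
log += [(date(2020, 9, d), "u2") for d in (1, 3, 5)]
log.append((date(2020, 9, 2), "u1"))  # duplicate row, ignored after de-duplication

three_plus = users_with_streak(log, 3)
four_plus = users_with_streak(log, 4)
```

For u1 the days Sept 1, 2, 3 all map to the same flg_dt (Aug 31), while Sept 5 and 6 map to Sept 1, so the longest streak is 3 days.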
Follow-up question: find the start and end dates of each user's consecutive-activity streaks.
with uid_dt as (
select dt, user_id
from user_active_log
group by dt,user_id
)
,uid_flg as (
select user_id, dt,
date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, though it has no meaning of its own; rows in the same consecutive-activity streak share the same flg_dt
from uid_dt
)
,uid_start_end_dt as (
select user_id, dt,
first_value(dt) over (partition by user_id, flg_dt order by dt) as start_dt, -- first date of the streak
first_value(dt) over (partition by user_id, flg_dt order by dt desc) as end_dt -- last date of the streak
from uid_flg
)
select user_id,start_dt,end_dt
from uid_start_end_dt
group by user_id,start_dt,end_dt;
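The same grouping key also yields the streak boundaries directly. Here is a minimal Python sketch for a single user's dates; the function name streak_intervals and the sample dates are illustrative assumptions.

```python
from datetime import date, timedelta
from itertools import groupby

def streak_intervals(days):
    """days: iterable of datetime.date for one user.
    Returns [(start_dt, end_dt), ...] ordered by start date."""
    ordered = sorted(set(days))
    # Pair every date with its flg_dt = dt - row_number.
    keyed = [(dt - timedelta(days=rn), dt) for rn, dt in enumerate(ordered, start=1)]
    intervals = []
    # flg_dt never decreases over sorted dates, so equal keys are adjacent
    # and groupby can be used without re-sorting.
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        dts = [dt for _, dt in group]
        intervals.append((dts[0], dts[-1]))
    return intervals

# Hypothetical activity: Sept 1-3, Sept 5-6, and a single day on Sept 9.
days = [date(2020, 9, d) for d in (1, 2, 3, 5, 6, 9)]
intervals = streak_intervals(days)
```

A one-day streak comes back with start_dt equal to end_dt, which matches what the first_value pair in the SQL would return.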
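3. Peak number of concurrent online users
A classic scenario that is well worth mastering. There is a login/logout log table user_login_log with fields user_id, login_time, and logout_time.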
Method 1: for each login event of user A, count the users who are online at that moment (the peak is always reached at some login instant), then take the maximum of those counts.
select max(online_cnt)
from (
select a.user_id, a.login_time, count(distinct b.user_id) as online_cnt
from user_login_log a cross join user_login_log b
where b.login_time <= a.login_time and a.login_time <= b.logout_time -- b is online at the instant a logs in
group by a.user_id, a.login_time
) t;
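The problem with method 1 is obvious: it joins the table with itself. With only a little over 10,000 users, the join already produces more than 100 million rows, and the query will most likely fail to complete.
Method 2: think about how a server would track the live online count. It would typically keep a global counter online_cnt, adding 1 on login and subtracting 1 on logout. The same logic in SQL: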
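with ts_onlinecnt as ( -- number of concurrent online users at each point in time
select t, cnt, sum(cnt) over (order by t) as online_cnt -- ordered by t; running sum from the first row through the current row (rows tied on t are summed together)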
from (
select login_time as t, 1 as cnt
from user_login_log where login_time is not null
union all
select logout_time as t, -1 as cnt
from user_login_log where logout_time is not null
) detail
)
select max(online_cnt)
from ts_onlinecnt;
Some real data helps make this concrete. select * from user_login_log;
user_id | login_time | logout_time |
张七 | 2020-09-09 11:12:20 | 2020-09-09 13:13:30 |
张六 | 2020-09-09 11:12:20 | 2020-09-10 13:13:30 |
张五 | 2020-09-10 11:12:20 | 2020-09-10 13:13:30 |
张四 | 2020-09-10 12:12:20 | 2020-09-10 12:13:30 |
张三 | 2020-09-10 11:12:30 | 2020-09-10 12:12:30 |
The result of ts_onlinecnt is:
t | cnt | online_cnt |
2020-09-09 11:12:20 | 1 | 2 |
2020-09-09 11:12:20 | 1 | 2 |
2020-09-09 13:13:30 | -1 | 1 |
2020-09-10 11:12:20 | 1 | 2 |
2020-09-10 11:12:30 | 1 | 3 |
2020-09-10 12:12:20 | 1 | 4 |
2020-09-10 12:12:30 | -1 | 3 |
2020-09-10 12:13:30 | -1 | 2 |
2020-09-10 13:13:30 | -1 | 0 |
2020-09-10 13:13:30 | -1 | 0 |
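The counter logic of method 2 can be traced in a few lines of Python using the sample rows above. This is only a sketch: the function name max_concurrent is an illustrative assumption, and timestamps are kept as strings because the yyyy-MM-dd HH:mm:ss format sorts correctly as text.

```python
from itertools import accumulate

def max_concurrent(sessions):
    """sessions: iterable of (login_time, logout_time) pairs.
    Returns the peak number of sessions open at the same time."""
    events = []
    for login, logout in sessions:
        events.append((login, 1))    # login: counter +1
        events.append((logout, -1))  # logout: counter -1
    # Sort by time; on a tie, process logins before logouts.
    events.sort(key=lambda e: (e[0], -e[1]))
    running = accumulate(delta for _, delta in events)
    return max(running, default=0)

# The five sample rows of user_login_log shown above.
sessions = [
    ("2020-09-09 11:12:20", "2020-09-09 13:13:30"),
    ("2020-09-09 11:12:20", "2020-09-10 13:13:30"),
    ("2020-09-10 11:12:20", "2020-09-10 13:13:30"),
    ("2020-09-10 12:12:20", "2020-09-10 12:13:30"),
    ("2020-09-10 11:12:30", "2020-09-10 12:12:30"),
]
peak = max_concurrent(sessions)
```

One design choice to be aware of: this sketch counts a user who logs out at the exact second another logs in as still online, whereas the SQL sums all events with the same t together. The sample data has no such tie, so both give a peak of 4.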
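As an aside, for Hive window functions see http://lxw1234.com/archives/tag/hive-window-functions. The author explains them very clearly, and the series is worth bookmarking.
There is one subtlety of window functions to note here. Add one more column to the ts_onlinecnt query: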
select t, cnt, sum(cnt) over (order by t) as online_cnt,
sum(cnt) over (order by t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as online_cnt_1
from (
select login_time as t, 1 as cnt
from user_login_log where login_time is not null
union all
select logout_time as t, -1 as cnt
from user_login_log where logout_time is not null
) detail
t | cnt | online_cnt | online_cnt_1 |
2020-09-09 11:12:20 | 1 | 2 | 2 |
2020-09-09 11:12:20 | 1 | 2 | 1 |
2020-09-09 13:13:30 | -1 | 1 | 1 |
2020-09-10 11:12:20 | 1 | 2 | 2 |
2020-09-10 11:12:30 | 1 | 3 | 3 |
2020-09-10 12:12:20 | 1 | 4 | 4 |
2020-09-10 12:12:30 | -1 | 3 | 3 |
2020-09-10 12:13:30 | -1 | 2 | 2 |
2020-09-10 13:13:30 | -1 | 0 | 0 |
2020-09-10 13:13:30 | -1 | 0 | 1 |
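In the four rows where the ordering column t is tied (the first two rows and the last two rows), online_cnt and online_cnt_1 are not equal. At first this looked to me like a contradiction of the official Hive documentation, but it is in fact what the documentation describes. If I have misread it, corrections in the comments are welcome.
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
In other words, when a window function has an order by but no explicit frame, the default frame is RANGE, not ROWS. A RANGE frame that ends at CURRENT ROW includes every peer row with the same order by value, so tied rows all receive the same running total. A ROWS frame stops at the physical current row, so tied rows receive different totals, and which tied row comes first is not deterministic.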
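The difference between the two frames can be reproduced in a small Python sketch. The function names running_sum_rows and running_sum_range are illustrative assumptions, and the events are the first three rows of the table above with the date part dropped for brevity.

```python
from itertools import accumulate, groupby

def running_sum_rows(events):
    """Emulates ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame grows by exactly one physical row per step."""
    ordered = sorted(events, key=lambda e: e[0])
    return list(accumulate(cnt for _, cnt in ordered))

def running_sum_range(events):
    """Emulates RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame includes all peers that share the current row's t."""
    ordered = sorted(events, key=lambda e: e[0])
    out, total = [], 0
    for _, peers in groupby(ordered, key=lambda e: e[0]):
        peers = list(peers)
        total += sum(cnt for _, cnt in peers)
        out.extend([total] * len(peers))  # every peer sees the same total
    return out

# Two logins at the same second, followed by one logout.
events = [("11:12:20", 1), ("11:12:20", 1), ("13:13:30", -1)]
rows_result = running_sum_rows(events)
range_result = running_sum_range(events)
```

Here the ROWS version gives 1 and 2 for the two tied rows, while the RANGE version gives 2 for both, which is the same pattern as online_cnt_1 versus online_cnt in the table.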