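Hive Interview Questions 1: Complex SQL

1. Proportional sampling

I ran into this scenario at work, and the solution is my own. There is a user table user_info with fields user_id and city. The operations team wants to send a survey to 100,000 users, and the city distribution of the selected users must match the city distribution of the full user base.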

with city_fenbu as (
    select city, user_cnt/ sum(user_cnt) over() as zhanbi 
    from (
        select city,count(user_id) as user_cnt
        from user_info 
        group by city
    ) t1
)
,user_shuffle as (
    select city, user_id, 
    row_number() over (partition by city order by rand()) as rk -- shuffle the users within each city
    from user_info
)
select aa.*, bb.*  
from user_shuffle aa join city_fenbu bb on aa.city=bb.city
where round(100000*bb.zhanbi) >= aa.rk;
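
The same idea can be replayed outside Hive to check the logic. The following is a minimal Python sketch, not the production query: the function name proportional_sample, the seed parameter, and the toy user list are assumptions made for illustration. It computes each city's share, shuffles the users inside each city, and keeps the first round(total * share) of them.

```python
import random
from collections import Counter, defaultdict

def proportional_sample(users, total, seed=0):
    """users: list of (user_id, city). Returns a sample whose city mix mirrors the input."""
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for user_id, city in users:
        by_city[city].append(user_id)
    sample = []
    for city, ids in by_city.items():
        share = len(ids) / len(users)   # zhanbi in the SQL
        quota = round(total * share)    # round(100000 * zhanbi) in the SQL
        rng.shuffle(ids)                # plays the role of order by rand()
        sample.extend((uid, city) for uid in ids[:quota])  # keep rows with rk <= quota
    return sample

# Toy population (hypothetical): 60% city A, 30% city B, 10% city C
users = (
    [(i, "A") for i in range(600)]
    + [(i, "B") for i in range(600, 900)]
    + [(i, "C") for i in range(900, 1000)]
)
sample = proportional_sample(users, total=100)
counts = Counter(city for _, city in sample)
print(dict(counts))
```

Because each city's quota is rounded independently, the final sample size can differ from the target by a few rows when the shares do not divide evenly.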

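2. Users with at least n consecutive active days

A classic scenario that you must master. There is a user activity log table user_active_log with fields dt (date, yyyy-MM-dd) and user_id. Straight to the SQL: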

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, but with no meaning of its own; within one consecutive-active streak, flg_dt is identical
    from uid_dt
)
,uid_active_days as (
    select user_id, flg_dt, count(dt) as continue_active_days
    from uid_flg
    group by user_id, flg_dt
)
select user_id
from uid_active_days
group by user_id 
having max(continue_active_days) >=7; -- n = 7 here
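
To see why subtracting the row number from the date groups consecutive days together, the trick can be replayed in a few lines of Python. This is only a sketch under assumed toy data; the function name longest_streaks and the sample log are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def longest_streaks(log):
    """log: iterable of (dt, user_id). Returns {user_id: longest run of consecutive active days}."""
    days_by_user = {}
    for dt, user_id in set(log):  # dedupe, like group by dt, user_id
        days_by_user.setdefault(user_id, []).append(dt)
    result = {}
    for user_id, days in days_by_user.items():
        days.sort()
        # flg_dt = dt - row_number: stays constant while the days are consecutive
        flags = [d - timedelta(days=rn) for rn, d in enumerate(days, start=1)]
        result[user_id] = max(Counter(flags).values())
    return result

log = [
    (date(2020, 9, 1), "u1"), (date(2020, 9, 2), "u1"), (date(2020, 9, 3), "u1"),
    (date(2020, 9, 5), "u1"), (date(2020, 9, 3), "u1"),  # duplicate row for the same day
    (date(2020, 9, 1), "u2"), (date(2020, 9, 3), "u2"),
]
streaks = longest_streaks(log)
qualified = sorted(u for u, n in streaks.items() if n >= 3)  # n = 3 for this toy data
print(streaks, qualified)
```

For u1 the dates Sep 1, 2, 3 all map to the flag Aug 31, while Sep 5 maps to Sep 1, so the longest group has three rows.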

Follow-up question: find the start and end dates of each user's consecutive-active streaks.

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, but with no meaning of its own; within one consecutive-active streak, flg_dt is identical
    from uid_dt
)
,uid_start_end_dt as (
    select user_id, dt, 
    first_value(dt) over (partition by user_id, flg_dt order by dt) as start_dt, -- start date of the streak
    first_value(dt) over (partition by user_id, flg_dt order by dt desc) as end_dt -- end date of the streak
    from uid_flg
)
select user_id,start_dt,end_dt 
from uid_start_end_dt
group by user_id,start_dt,end_dt;
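
The streak boundaries can be checked the same way for a single user. The sketch below is an illustration only (the function name streak_intervals and the dates are made up): it groups the sorted dates by dt minus row number and reports the first and last date of each group.

```python
from datetime import date, timedelta
from itertools import groupby

def streak_intervals(days):
    """days: active dates of one user. Returns [(start_dt, end_dt), ...] ordered by start_dt."""
    ordered = sorted(set(days))
    numbered = enumerate(ordered, start=1)  # (row_number, dt)
    intervals = []
    # dt - row_number never decreases, so rows sharing a flag are always adjacent
    for _, run in groupby(numbered, key=lambda p: p[1] - timedelta(days=p[0])):
        run_days = [d for _, d in run]
        intervals.append((run_days[0], run_days[-1]))
    return intervals

days = [date(2020, 9, d) for d in (1, 2, 3, 5, 8, 9)]
intervals = streak_intervals(days)
for start_dt, end_dt in intervals:
    print(start_dt, end_dt)
```

A single isolated day comes out as an interval whose start and end dates are equal, which matches what the first_value pair returns in the SQL.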

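3. Peak number of concurrent online users

A classic scenario that you must master. There is a login/logout log table user_login_log with fields user_id, login_time, logout_time.

Method 1: for each user session A, find the users who are online at the same time as A. Then group by user and take the largest online count.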

select max(online_cnt)
from (
    select a.user_id, a.login_time, count(distinct b.user_id) as online_cnt
    from user_login_log a cross join user_login_log b 
    where a.login_time>=b.login_time and a.login_time<=b.logout_time -- b is online at the moment a logs in; the peak always falls on some login instant
    group by a.user_id, a.login_time
) t;

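The problem with Method 1 is obvious: it joins the table with itself, which is effectively a Cartesian product. Even with barely more than 10,000 users, the join produces over 100 million rows (10,000 × 10,000), and the query will most likely fail to finish.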
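
The quadratic cost is easy to see in a brute-force replay. The Python sketch below is only an illustration: the sessions use made-up integer timestamps, and peak_by_self_join is a hypothetical helper. It compares every session with every other session and counts how many cover each login instant.

```python
def peak_by_self_join(sessions):
    """sessions: list of (user_id, login, logout). Returns (peak, pairs_examined)."""
    pairs = 0
    peak = 0
    for _, a_login, _ in sessions:
        online = set()
        for b_user, b_login, b_logout in sessions:
            pairs += 1  # one row of the self-join
            if b_login <= a_login <= b_logout:  # b is online when a logs in
                online.add(b_user)
        peak = max(peak, len(online))
    return peak, pairs

# Hypothetical sessions with integer timestamps
sessions = [("u1", 0, 10), ("u2", 0, 100), ("u3", 50, 100), ("u4", 60, 65), ("u5", 55, 62)]
peak, pairs = peak_by_self_join(sessions)
print(peak, pairs)
print(10_000 ** 2)  # rows produced by the self-join for 10,000 sessions
```

Five sessions already need 25 comparisons; the count grows with the square of the number of rows.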

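Method 2: consider how a server would track the live online count if it had to report it. It would typically keep a global counter online_cnt, adding 1 on each login and subtracting 1 on each logout. The SQL below implements the same logic.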

with ts_onlinecnt as ( -- number of concurrent online users at each moment
    select t, cnt, sum(cnt) over (order by t) as online_cnt -- ordered by t; the sum runs from the first row through the current row
    from (
        select login_time as t, 1 as cnt
        from user_login_log where login_time is not null
        union all 
        select logout_time as t, -1 as cnt 
        from user_login_log where logout_time is not null
    ) detail
)
select max(online_cnt)
from ts_onlinecnt;

Some actual data makes this easier to follow. Output of select * from user_login_log:

user_id	login_time	logout_time
张七	2020-09-09 11:12:20	2020-09-09 13:13:30
张六	2020-09-09 11:12:20	2020-09-10 13:13:30
张五	2020-09-10 11:12:20	2020-09-10 13:13:30
张四	2020-09-10 12:12:20	2020-09-10 12:13:30
张三	2020-09-10 11:12:30	2020-09-10 12:12:30

The ts_onlinecnt result is as follows:

t	cnt	online_cnt
2020-09-09 11:12:20	1	2
2020-09-09 11:12:20	1	2
2020-09-09 13:13:30	-1	1
2020-09-10 11:12:20	1	2
2020-09-10 11:12:30	1	3
2020-09-10 12:12:20	1	4
2020-09-10 12:12:30	-1	3
2020-09-10 12:13:30	-1	2
2020-09-10 13:13:30	-1	0
2020-09-10 13:13:30	-1	0
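
The counter logic can also be replayed in Python on the five sample sessions. This is a sketch for checking the table above, not a replacement for the query; the function name online_counts is an assumption. Events that share a timestamp are netted together first, so each timestamp yields a single running total.

```python
from collections import Counter
from itertools import accumulate

# (login_time, logout_time) of the five sample sessions; ISO timestamps sort correctly as strings
sessions = [
    ("2020-09-09 11:12:20", "2020-09-09 13:13:30"),
    ("2020-09-09 11:12:20", "2020-09-10 13:13:30"),
    ("2020-09-10 11:12:20", "2020-09-10 13:13:30"),
    ("2020-09-10 12:12:20", "2020-09-10 12:13:30"),
    ("2020-09-10 11:12:30", "2020-09-10 12:12:30"),
]

def online_counts(sessions):
    """Returns [(t, online_cnt)] with one entry per distinct timestamp."""
    delta = Counter()
    for login, logout in sessions:
        delta[login] += 1   # login: +1
        delta[logout] -= 1  # logout: -1
    times = sorted(delta)
    running = accumulate(delta[t] for t in times)
    return list(zip(times, running))

counts = online_counts(sessions)
peak = max(cnt for _, cnt in counts)
for t, cnt in counts:
    print(t, cnt)
print("peak:", peak)
```

One sort plus one pass over the events replaces the self-join of Method 1.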

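A quick aside: for an introduction to Hive window functions, see http://lxw1234.com/archives/tag/hive-window-functions. The author explains them very clearly, and the series is worth bookmarking.

One subtlety of window functions deserves attention here. Add one more column to the ts_onlinecnt subquery: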

    select t, cnt, sum(cnt) over (order by t) as online_cnt, 
    sum(cnt) over (order by t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as online_cnt_1
    from (
        select login_time as t, 1 as cnt
        from user_login_log where login_time is not null
        union all 
        select logout_time as t, -1 as cnt 
        from user_login_log where logout_time is not null
    ) detail

t	cnt	online_cnt	online_cnt_1
2020-09-09 11:12:20	1	2	2
2020-09-09 11:12:20	1	2	1
2020-09-09 13:13:30	-1	1	1
2020-09-10 11:12:20	1	2	2
2020-09-10 11:12:30	1	3	3
2020-09-10 12:12:20	1	4	4
2020-09-10 12:12:30	-1	3	3
2020-09-10 12:13:30	-1	2	2
2020-09-10 13:13:30	-1	0	0
2020-09-10 13:13:30	-1	0	1

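In the four rows where the sort column t is tied (the first two and the last two), online_cnt and online_cnt_1 are not equal. At first glance this looks like it contradicts the Hive documentation quoted below, but it is consistent with it: the default frame is RANGE, not ROWS, and a RANGE frame that ends at CURRENT ROW includes every row sharing the current row's ORDER BY value.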

When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

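In a window function, if ORDER BY is given without an explicit frame, the frame defaults to RANGE from the first row through the current row, so all rows tied with the current row on the ORDER BY value are included in the sum.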
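
The difference between the two frames can be reproduced with a small Python sketch. It is an illustration under the assumption that the events are already sorted by t; the function name running_sums is hypothetical. The RANGE-style sum includes every row whose t is less than or equal to the current t, while the ROWS-style sum stops at the current physical row.

```python
def running_sums(events):
    """events: list of (t, cnt) sorted by t. Returns [(t, cnt, range_sum, rows_sum)]."""
    out = []
    rows_sum = 0
    for t, cnt in events:
        rows_sum += cnt  # ROWS frame: physical rows up to and including this one
        # RANGE frame: every row with t <= current t, tied rows included
        range_sum = sum(c for u, c in events if u <= t)
        out.append((t, cnt, range_sum, rows_sum))
    return out

events = [
    ("2020-09-09 11:12:20", 1), ("2020-09-09 11:12:20", 1),
    ("2020-09-09 13:13:30", -1),
    ("2020-09-10 11:12:20", 1), ("2020-09-10 11:12:30", 1),
    ("2020-09-10 12:12:20", 1), ("2020-09-10 12:12:30", -1),
    ("2020-09-10 12:13:30", -1),
    ("2020-09-10 13:13:30", -1), ("2020-09-10 13:13:30", -1),
]
result = running_sums(events)
range_col = [r[2] for r in result]
rows_col = [r[3] for r in result]
for row in result:
    print(*row, sep="\t")
```

The two columns differ only on tied timestamps. The order of tied rows is not deterministic, so the ROWS values within a tie may come out in either order, as in the result table above.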