Hive Interview Questions 1: Complex SQL

1. Proportional sampling

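I ran into this scenario at work, and the solution is my own. Given a user table user_info with fields user_id and city, the operations team wants to pick 100,000 users for a survey, and the city distribution of the sample must match the city distribution of the full user base.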

with city_fenbu as (
    select city, user_cnt/ sum(user_cnt) over() as zhanbi 
    from (
        select city,count(user_id) as user_cnt
        from user_info 
        group by city
    ) t1
)
,user_shuffle as (
    select city, user_id, 
    row_number() over (partition by city order by rand()) as rk -- shuffle the users within each city
    from user_info
)
select aa.*, bb.*  
from user_shuffle aa join city_fenbu bb on aa.city=bb.city
where round(100000*bb.zhanbi) >= aa.rk;
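
To make the per-city quota logic easier to check on small data, here is a minimal Python sketch of the same idea. It is an illustration, not a replacement for the query: the function name proportional_sample and the seed parameter are my own assumptions. Note that because each city's quota is rounded separately, the total can differ slightly from the target.

```python
import random
from collections import defaultdict


def proportional_sample(users, total, seed=0):
    """Hypothetical helper. users: list of (user_id, city) pairs.

    Returns sampled (user_id, city) pairs whose city mix follows the full list.
    """
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for user_id, city in users:
        by_city[city].append(user_id)

    n = len(users)
    sample = []
    for city, ids in by_city.items():
        # Same role as round(100000 * zhanbi). int(x + 0.5) rounds halves up
        # for non-negative x; Python's built-in round() rounds halves to even.
        quota = int(total * len(ids) / n + 0.5)
        rng.shuffle(ids)  # same role as "order by rand()" within the city
        sample.extend((uid, city) for uid in ids[:quota])  # rk <= quota
    return sample


if __name__ == "__main__":
    demo = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 100)]
    print(len(proportional_sample(demo, 10)))
```

With three equally sized cities and a target of 10, each city gets a quota of 3, so only 9 users come back. If the exact total matters, the leftover has to be assigned separately.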

2. Users with at least n consecutive active days

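A classic scenario that you must master. Given a user activity log table user_active_log with fields dt (date, yyyy-MM-dd) and user_id. Straight to the SQL, with n = 7: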

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, but meaningless on its own; every day in the same consecutive streak gets the same flg_dt
    from uid_dt
)
,uid_active_days as (
    select user_id, flg_dt, count(dt) as continue_active_days
    from uid_flg
    group by user_id, flg_dt
)
select user_id
from uid_active_days
group by user_id 
having max(continue_active_days) >=7;
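
The key step is that subtracting the row number from the date gives the same value for every day in one streak. The Python sketch below replays that step on in-memory data so the grouping can be inspected. The function names max_streaks and users_with_streak are hypothetical, and this only illustrates the trick rather than reproducing how Hive executes the query.

```python
from datetime import date, timedelta


def max_streaks(log):
    """Hypothetical helper. log: iterable of (dt, user_id), dt a datetime.date.

    Returns {user_id: length of the longest run of consecutive days}.
    """
    days_by_user = {}
    for dt, user_id in set(log):  # dedupe, like "group by dt, user_id"
        days_by_user.setdefault(user_id, []).append(dt)

    result = {}
    for user_id, days in days_by_user.items():
        counts = {}
        for rn, dt in enumerate(sorted(days), start=1):
            # Same as date_sub(dt, row_number()): constant within one streak.
            flg_dt = dt - timedelta(days=rn)
            counts[flg_dt] = counts.get(flg_dt, 0) + 1
        result[user_id] = max(counts.values())
    return result


def users_with_streak(log, n):
    """Users whose longest streak is at least n days (the HAVING clause)."""
    return sorted(u for u, s in max_streaks(log).items() if s >= n)


if __name__ == "__main__":
    demo = [(date(2020, 9, d), "a") for d in (1, 2, 3, 5, 6)]
    print(max_streaks(demo))
```

For days 1, 2, 3, 5, 6 the row numbers are 1 to 5, so the differences are day 0, 0, 0, 1, 1: two groups, of sizes 3 and 2.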

Follow-up: find the start and end dates of each user's consecutive-activity streaks.

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by user_id order by dt)) as flg_dt -- also a date, but meaningless on its own; every day in the same consecutive streak gets the same flg_dt
    from uid_dt
)
,uid_start_end_dt as (
    select user_id, dt, 
    first_value(dt) over (partition by user_id, flg_dt order by dt) as start_dt, -- first day of the streak
    first_value(dt) over (partition by user_id, flg_dt order by dt desc) as end_dt -- last day of the streak
    from uid_flg
)
select user_id,start_dt,end_dt 
from uid_start_end_dt
group by user_id,start_dt,end_dt;
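
The same grouping key also yields the streak boundaries: within one (user_id, flg_dt) group, the earliest date is the start and the latest is the end. The sketch below shows this in Python, assuming a hypothetical function name streak_ranges. It takes the min and max per group, which gives the same result as the two first_value windows followed by the deduplicating group by.

```python
from datetime import date, timedelta


def streak_ranges(log):
    """Hypothetical helper. log: iterable of (dt, user_id), dt a datetime.date.

    Returns a sorted list of (user_id, start_dt, end_dt), one per streak.
    """
    days_by_user = {}
    for dt, user_id in set(log):  # dedupe, like "group by dt, user_id"
        days_by_user.setdefault(user_id, []).append(dt)

    ranges = {}
    for user_id, days in days_by_user.items():
        for rn, dt in enumerate(sorted(days), start=1):
            key = (user_id, dt - timedelta(days=rn))  # (user_id, flg_dt)
            start, end = ranges.get(key, (dt, dt))
            ranges[key] = (min(start, dt), max(end, dt))
    return sorted((user, start, end) for (user, _), (start, end) in ranges.items())


if __name__ == "__main__":
    demo = [(date(2020, 9, d), "a") for d in (1, 2, 3, 5, 6)]
    for row in streak_ranges(demo):
        print(row)
```

In SQL the equivalent shortcut would be min(dt) and max(dt) with group by user_id, flg_dt on uid_flg, which avoids the extra window pass.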

3. Maximum number of concurrent online users

A classic scenario that you must master. Given a user login/logout log table user_login_log with fields user_id, login_time, logout_time.

Method 1: for each user A, find the users who were online at the same time as A. Then group by user id and take the largest online count.

select max(online_cnt)
from (
    select a.user_id, a.login_time, count(distinct b.user_id) as online_cnt
    from user_login_log a cross join user_login_log b 
    -- keep only the sessions b that were online at the instant a logged in;
    -- the peak is always reached at some login instant, so checking login instants is enough
    where a.login_time>=b.login_time and a.login_time<=b.logout_time
    group by a.user_id, a.login_time
) t;

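The problem with Method 1 is obvious: it pairs every row of the table with every other row. Even with just over 10,000 users, the self-join produces more than 100 million rows, so the query will most likely fail.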

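Method 2: think about how a server would instrument the real-time online count. It would usually maintain a global counter online_cnt, adding 1 on login and subtracting 1 on logout. The same logic in SQL: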

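with ts_onlinecnt as ( -- concurrent online count at each instant
    select t, cnt, sum(cnt) over (order by t) as online_cnt -- ordered by t; the running sum covers the first row through the current row (rows tied on t are summed together)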
    from (
        select login_time as t, 1 as cnt
        from user_login_log where login_time is not null
        union all 
        select logout_time as t, -1 as cnt 
        from user_login_log where logout_time is not null
    ) detail
)
select max(online_cnt)
from ts_onlinecnt;

Some actual data helps here. select * from user_login_log;

user_id login_time logout_time
张七 2020-09-09 11:12:20 2020-09-09 13:13:30
张六 2020-09-09 11:12:20 2020-09-10 13:13:30
张五 2020-09-10 11:12:20 2020-09-10 13:13:30
张四 2020-09-10 12:12:20 2020-09-10 12:13:30
张三 2020-09-10 11:12:30 2020-09-10 12:12:30

The result of ts_onlinecnt:

t cnt online_cnt
2020-09-09 11:12:20 1 2
2020-09-09 11:12:20 1 2
2020-09-09 13:13:30 -1 1
2020-09-10 11:12:20 1 2
2020-09-10 11:12:30 1 3
2020-09-10 12:12:20 1 4
2020-09-10 12:12:30 -1 3
2020-09-10 12:13:30 -1 2
2020-09-10 13:13:30 -1 0
2020-09-10 13:13:30 -1 0
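
The counter logic can also be replayed on the five sample sessions above in a few lines of Python. This is a sketch under my own assumptions: the function name max_concurrent is hypothetical, timestamps are kept as strings (this format sorts correctly as text), and events sharing a timestamp are applied together, which matches the online_cnt column in the table.

```python
def max_concurrent(sessions):
    """Hypothetical helper. sessions: iterable of (login_time, logout_time).

    Either value may be None, mirroring the "is not null" filters in the SQL.
    """
    delta_at = {}
    for login, logout in sessions:
        if login is not None:
            delta_at[login] = delta_at.get(login, 0) + 1    # login: +1
        if logout is not None:
            delta_at[logout] = delta_at.get(logout, 0) - 1  # logout: -1

    online = peak = 0
    for t in sorted(delta_at):
        online += delta_at[t]  # all events with the same t are applied at once
        peak = max(peak, online)
    return peak


if __name__ == "__main__":
    sessions = [
        ("2020-09-09 11:12:20", "2020-09-09 13:13:30"),
        ("2020-09-09 11:12:20", "2020-09-10 13:13:30"),
        ("2020-09-10 11:12:20", "2020-09-10 13:13:30"),
        ("2020-09-10 12:12:20", "2020-09-10 12:13:30"),
        ("2020-09-10 11:12:30", "2020-09-10 12:12:30"),
    ]
    print(max_concurrent(sessions))  # 4
```

Unlike Method 1, this needs only one sort over 2N events, which is why the window-function version scales.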

As an aside, for Hive window functions see http://lxw1234.com/archives/tag/hive-window-functions. The author explains them very clearly, and the series is worth bookmarking.

There is one window-function subtlety to note here. Add a column to the ts_onlinecnt subquery:

    select t, cnt, sum(cnt) over (order by t) as online_cnt, 
    sum(cnt) over (order by t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as online_cnt_1
    from (
        select login_time as t, 1 as cnt
        from user_login_log where login_time is not null
        union all 
        select logout_time as t, -1 as cnt 
        from user_login_log where logout_time is not null
    ) detail

t cnt online_cnt online_cnt_1
2020-09-09 11:12:20 1 2 2
2020-09-09 11:12:20 1 2 1
2020-09-09 13:13:30 -1 1 1
2020-09-10 11:12:20 1 2 2
2020-09-10 11:12:30 1 3 3
2020-09-10 12:12:20 1 4 4
2020-09-10 12:12:30 -1 3 3
2020-09-10 12:13:30 -1 2 2
2020-09-10 13:13:30 -1 0 0
2020-09-10 13:13:30 -1 0 1

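In the four rows where t is tied (the first two and the last two), online_cnt and online_cnt_1 differ. This can look inconsistent with the official Hive documentation, but it follows from it:

When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

That is, when a window function has ORDER BY but no explicit frame, the default frame is RANGE, not ROWS. A RANGE frame ending at CURRENT ROW includes every row tied with the current row on t, so tied rows all get the same sum. A ROWS frame stops at the current physical row, and the order among tied rows is arbitrary.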
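
To see how the two frame types treat tied values of t, the Python sketch below computes both running sums for a list of (t, cnt) events. The function name running_sums is hypothetical, and the tie order here is simply the input order (Python's sort is stable), whereas a SQL engine is free to pick any order among ties.

```python
def running_sums(events):
    """Hypothetical helper. events: list of (t, cnt).

    Returns rows of (t, cnt, range_sum, rows_sum), ordered by t.
    """
    ordered = sorted(events, key=lambda e: e[0])  # stable: ties keep input order

    # ROWS frame: sum from the first row up to the current physical row.
    rows_sums = []
    total = 0
    for _, cnt in ordered:
        total += cnt
        rows_sums.append(total)

    # RANGE frame: the frame extends to the last row tied with the current one,
    # so every peer takes the ROWS sum of the last row sharing its t.
    last_for_t = {t: s for (t, _), s in zip(ordered, rows_sums)}

    return [(t, cnt, last_for_t[t], rs) for (t, cnt), rs in zip(ordered, rows_sums)]


if __name__ == "__main__":
    for row in running_sums([(1, 1), (1, 1), (2, -1), (3, -1)]):
        print(row)
```

For the two events at t = 1 the RANGE sum is 2 on both rows, while the ROWS sum is 1 and then 2. For taking max(online_cnt), either frame gives a valid peak only if ties are handled the way you intend, so it is safer to state the frame explicitly.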


Reposted from blog.csdn.net/mr_cuber/article/details/113100287