HiveSQL一天一个小技巧:如何统计当前时间点状态情况【辅助变量+累计变换思路】

0 需求描述

  1 问题分析

目标:需要求出每个时间段,有客人在住的房间数量。

如果只考虑一人一房,我们可以借助于同时在线人数统计的思路,入住时间加辅助标记记为1,离店时间加辅助标记记为-1,并按照时间进行顺序排序,求当前累计值,具体SQL如下:

参考文章:SQL同时在线问题分析(求最高在线人数以及最高峰时间段)【累加器思想,流式分析】--HiveSql面试题29_sql统计同时在线时间_莫叫石榴姐的博客-CSDN博客

类似于公交车上车下车场景,统计公交车某一时刻车上人数,上车加1,下车减1,最终某一时刻人数通过加1减1这种辅助标记来进行累加,累加反应的是某一时间节点处,截止当前状态情况。

with data as
         (
             select 7 user_id, 2004 room_num, '2021-03-05' in_time, '2021-03-07' out_time
             union all
             select 23 user_id, 2010 room_num, '2021-03-05' in_time, '2021-03-06' out_time
             union all
             select 7 user_id, 1003 room_num, '2021-03-07' in_time, '2021-03-08' out_time
             union all
             select 8 user_id, 2014 room_num, '2021-03-07' in_time, '2021-03-08' out_time
             union all
             select 14 user_id, 3001 room_num, '2021-03-07' in_time, '2021-03-10' out_time
             union all
             select 18 user_id, 3002 room_num, '2021-03-08' in_time, '2021-03-10' out_time
             union all
             select 23 user_id, 3020 room_num, '2021-03-08' in_time, '2021-03-09' out_time
             union all
             select 25 user_id, 2006 room_num, '2021-03-09' in_time, '2021-03-12' out_time
         )

select *
from (select time as start_time, lead(time) over (order by time) as end_time, acc_cnt
      from (select time, sum(num) over (order by time) as acc_cnt
            from (
                     select in_time as time, 1 as num
                     from data
                     union all
                     select out_time as time, -1 as num
                     from data
                 ) t
           ) t
      group by time, acc_cnt
     ) t
where end_time is not null

 

但是这里面有个问题,实际场景是一人一房间吗?如果有多个人共住一房间,今天退了一个人,明天又退了一个人,后天的时候才退完,虽然在退,但房间还是有人住的,这种情况,是不是也算?如果考虑这种情况则需要对累加的状态进行调整,此时需要考虑每个房间中截止当前时间的人数情况。

第一步:先求出每个房间当前时间的人数累计值,作为状态判断辅助条件

增加一行数据:

union all
select 100 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-08' out_time

具体SQL如下:

with data as
(
select 7 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-07' out_time
union all
select 23 user_id,2010 room_num,'2021-03-05' in_time,'2021-03-06' out_time
union all
select 7 user_id,1003 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 8 user_id,2014 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 14 user_id,3001 room_num,'2021-03-07' in_time,'2021-03-10' out_time
union all
select 18 user_id,3002 room_num,'2021-03-08' in_time,'2021-03-10' out_time
union all
select 23 user_id,3020 room_num,'2021-03-08' in_time,'2021-03-09' out_time
union all
select 25 user_id,2006 room_num,'2021-03-09' in_time,'2021-03-12' out_time
union all
select 100 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-08' out_time
)
select time
                            , room_num
                            , sum(user_cnt) over (partition by room_num ORDER BY TIME) user_cnt
                       from (
                            select in_time as time, room_num, count(user_id) user_cnt
                             from data
                             group by in_time, room_num
                             union all
                             select out_time as time, room_num, -1 * count(user_id) user_cnt
                             from data
                             group by out_time, room_num
                            ) t

 

第二步:基于累计的每个房间人数的状态变量进行判断:如果房间有人就标记1,没有人时候就标记为-1。标记思路实现:

方法1:case when 判断当user_cnt > 0时标记1,否则标记-1

case when user_cnt > 0 then 1 else -1 end num1

方法2:判断每个房间截止当前人数的最小值,如果大于0 标记为1,否则标记为-1

case when min(user_cnt) over (partition by room_num order by time) > 0 then 1 else -1 end

具体SQL如下:

with data as
(
select 7 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-07' out_time
union all
select 23 user_id,2010 room_num,'2021-03-05' in_time,'2021-03-06' out_time
union all
select 7 user_id,1003 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 8 user_id,2014 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 14 user_id,3001 room_num,'2021-03-07' in_time,'2021-03-10' out_time
union all
select 18 user_id,3002 room_num,'2021-03-08' in_time,'2021-03-10' out_time
union all
select 23 user_id,3020 room_num,'2021-03-08' in_time,'2021-03-09' out_time
union all
select 25 user_id,2006 room_num,'2021-03-09' in_time,'2021-03-12' out_time
union all
select 100 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-08' out_time
)

select time
          , room_num
          , user_cnt
          , case when user_cnt > 0 then 1 else -1 end num1
          , case when min(user_cnt) over (partition by room_num order by time) > 0 then 1 else -1 end  num2
     from (select time
                , room_num
                , sum(user_cnt) over (partition by room_num ORDER BY TIME) user_cnt
           from (select in_time as time, room_num, count(user_id) user_cnt
                 from data
                 group by in_time, room_num
                 union all
                 select out_time as time, room_num, -1 * count(user_id) user_cnt
                 from data
                 group by out_time, room_num
                ) t
          ) t

第三步:

基于上述结果,累计num值(其实本质就是通过构造辅助条件对辅助标记值进行变换,累加值变了),反应截止当前时间点 有人入住的房间数量,最终SQL如下:

with data as
(
select 7 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-07' out_time
union all
select 23 user_id,2010 room_num,'2021-03-05' in_time,'2021-03-06' out_time
union all
select 7 user_id,1003 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 8 user_id,2014 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 14 user_id,3001 room_num,'2021-03-07' in_time,'2021-03-10' out_time
union all
select 18 user_id,3002 room_num,'2021-03-08' in_time,'2021-03-10' out_time
union all
select 23 user_id,3020 room_num,'2021-03-08' in_time,'2021-03-09' out_time
union all
select 25 user_id,2006 room_num,'2021-03-09' in_time,'2021-03-12' out_time
-- union all
-- select 100 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-08' out_time
)

select *
from
    (select time as start_time, lead(time) over (order by time) as end_time, acc_cnt
     from (select time
          , sum(case when user_cnt > 0 then 1 else -1 end) over (order by time) acc_cnt
                 from (select time
                            , room_num
                            , sum(user_cnt) over (partition by room_num ORDER BY TIME) user_cnt
                       from (select in_time as time, room_num, count(user_id) user_cnt
                             from data
                             group by in_time, room_num
                             union all
                             select out_time as time, room_num, -1 * count(user_id) user_cnt
                             from data
                             group by out_time, room_num
                            ) t
                      ) t
          ) t
     group by time, acc_cnt
    ) t
where end_time is not null

最终结果如下:

 新增测试数据结果如下:

with data as
(
select 7 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-07' out_time
union all
select 23 user_id,2010 room_num,'2021-03-05' in_time,'2021-03-06' out_time
union all
select 7 user_id,1003 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 8 user_id,2014 room_num,'2021-03-07' in_time,'2021-03-08' out_time
union all
select 14 user_id,3001 room_num,'2021-03-07' in_time,'2021-03-10' out_time
union all
select 18 user_id,3002 room_num,'2021-03-08' in_time,'2021-03-10' out_time
union all
select 23 user_id,3020 room_num,'2021-03-08' in_time,'2021-03-09' out_time
union all
select 25 user_id,2006 room_num,'2021-03-09' in_time,'2021-03-12' out_time
union all
select 100 user_id,2004 room_num,'2021-03-05' in_time,'2021-03-08' out_time
)

select *
from
    (select time as start_time, lead(time) over (order by time) as end_time, acc_cnt
     from (select time, sum(case when user_cnt > 0 then 1 else -1 end) over (order by time) acc_cnt
                 from (select time
                            , room_num
                            , sum(user_cnt) over (partition by room_num ORDER BY TIME) user_cnt
                       from (select in_time as time, room_num, count(user_id) user_cnt
                             from data
                             group by in_time, room_num
                             union all
                             select out_time as time, room_num, -1 * count(user_id) user_cnt
                             from data
                             group by out_time, room_num
                            ) t
                      ) t
          ) t
     group by time, acc_cnt
    ) t
where end_time is not null

 2 小结

本文总结了一种当前时间点状态统计的思路和方法,对于此类问题主要采用构造辅助计数变量及累加变换思路进行求解。常见的场景有:直播同时在线人数、服务器实时并发数、公家车当前时间段人数、某个仓库的货物积压数量,某一段时间内的同时处于服务过程中的最大订单量等

猜你喜欢

转载自blog.csdn.net/godlovedaniel/article/details/129881211