A HiveSQL Tip a Day: How to Accurately Compute Cumulative Values over Non-Consecutive Dates [闪电快车 Interview Question]

0 Requirement

Cumulative sum over a sparse field.

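1 Problem analysis

The data transformation shown in the figure fills in the dates that are missing from the data, driven by the field term. term is the number of consecutive dates to generate: a value of 12 means 12 consecutive dates starting from 2018-12-21 (the SQL in this article steps by month, which gives 2018-12-21 through 2019-11-21). Once the dates are filled in, the cumulative value of amount is computed in date order. Note that the filled-in dates have no amount of their own (the value is NULL).

This kind of problem comes up often in business work, especially when computing cumulative values: if the dates are not consecutive, it is easy to lose the cumulative value for some dates and end up with incomplete data. The core of the problem is that the dates are non-consecutive and have to be completed. So how do we fill in consecutive dates? Readers of my SQLBOY1000题 column will have seen similar problems; the related articles are listed here.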

Analyzing overlapping and intersecting intervals in SQL: HiveSQL interview question 30 (莫叫石榴姐's blog, CSDN)

A HiveSQL tip a day: how to construct consecutive dates (莫叫石榴姐's blog, CSDN)

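Step 1: From the dates in the data, fill in the required consecutive dates

The template and core statement for filling in consecutive dates is: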

lateral view posexplode(split(space(term), '(?!$)')) temp as pos,val

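Here space(n) returns a string of n spaces and serves only to fan one row out into n rows; its argument sets how many. The regex (?!$) passed to split() is a zero-width negative lookahead that matches at every position except the end of the string, so the string is cut between characters into exactly n one-space elements. Splitting on a plain ' ' would instead produce one extra element, which we need to avoid.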
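The difference between the two delimiters can be checked directly by counting elements. This is a minimal sketch, assuming a Hive build running on Java 8 or later, where a zero-width match at the start of the string does not produce a leading empty element; the column names n_lookahead and n_plain are only illustrative.

```sql
-- Count the elements produced by each delimiter for a 3-space string.
select size(split(space(3), '(?!$)')) as n_lookahead  -- expected 3: one element per space
     , size(split(space(3), ' '))     as n_plain;     -- expected 4: one extra empty element
```

With term in place of the literal 3, the lookahead version therefore yields exactly term rows after posexplode().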

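posexplode() then emits an index pos, starting from 0, for each element. Adding pos as a step to each contract's start date, min(value_date), produces all the dates. Note that the dates here advance by month, so we use add_months, i.e.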

add_months(value_date, pos)
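As a small illustration of the index-plus-step idea, the sketch below expands a single hard-coded start date into three monthly dates. The literal start date and the term value of 3 are chosen only for the example.

```sql
-- pos runs 0, 1, 2, so the dates are the start date plus 0, 1 and 2 months:
-- 2018-12-21, 2019-01-21, 2019-02-21
select pos
     , add_months('2018-12-21', pos) as value_date
from (select 3 as term) t
lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val;
```

Because pos starts at 0, the first generated date is the start date itself, so no separate row is needed for it.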

The complete statement for generating the consecutive dates is:

with data as
(
 select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)
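-- Expand each contract into `term` consecutive monthly dates.
select contract
     , add_months(value_date, pos) as value_date
     , term
from (
         -- One row per contract: earliest date and number of periods.
         select contract
              , min(value_date) as value_date
              , max(amount)     as amount
              , max(term)       as term
         from data
         group by contract
     ) t1 lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val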
 

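Step 2: Use the completed dates as the main (left) table, join the data table to it, and compute the cumulative value.

Note: the generated consecutive dates must be the main table of the left join against the data table; only then does the cumulative calculation neither miss nor double-count any date. In

sum() over(partition by order by )

the value being summed must come from the data table (the right side of the join), while the partition by and order by columns must come from the main table.

This is the key to computing cumulative values accurately over non-consecutive dates: the generated date dimension table must be the main table, and the data table is joined to it.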
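The rule can be seen on a tiny hand-written case. The sketch below is not the article's final query: dim and d are illustrative inline tables with four monthly dates and two actual records, and raw_amount and running_amount are illustrative column names.

```sql
with dim as
(
 select 'AAAA' as contract, '2018-12-21' as value_date
 union all
 select 'AAAA' as contract, '2019-01-21' as value_date
 union all
 select 'AAAA' as contract, '2019-02-21' as value_date
 union all
 select 'AAAA' as contract, '2019-03-21' as value_date
),
d as
(
 select 'AAAA' as contract, '2018-12-21' as value_date, 9439.30 as amount
 union all
 select 'AAAA' as contract, '2019-03-21' as value_date, 9439.30 as amount
)
select dim.contract
     , dim.value_date
     -- NULL for the two months that have no record in d
     , d.amount as raw_amount
     -- sum() skips NULLs, so the total carries forward through the empty months
     , cast(sum(d.amount) over (partition by dim.contract order by dim.value_date) as decimal(18,2)) as running_amount
from dim
left join d
on dim.contract = d.contract and dim.value_date = d.value_date;
```

The running total should hold at the first amount through 2019-01-21 and 2019-02-21 and then double at 2019-03-21. If the window were partitioned and ordered by d.contract and d.value_date instead, the unmatched rows would have NULL keys and fall outside the intended sequence.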

The final SQL is as follows:

with data as
(
 select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)

select dim.contract
     , dim.value_date
     -- Running total: amount comes from the data table, partition and order keys from the date table.
     , cast(sum(d.amount) over (partition by dim.contract order by dim.value_date) as decimal(18,2)) as amount
     , dim.term
from
    (
     -- Main table: the complete monthly dates per contract (same as step 1).
     select contract
          , add_months(value_date, pos) as value_date
          , term
     from (
              select contract
                   , min(value_date) as value_date
                   , max(amount)     as amount
                   , max(term)       as term
              from data
              group by contract
          ) t1 lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val
    ) dim
left join
    (
     -- Data table: the sparse records actually present.
     select contract
          , value_date
          , amount
     from data
    ) d
on dim.contract = d.contract and dim.value_date = d.value_date

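The result has one row per contract for every month in its term (12 rows for AAAA, 10 for BBBB). amount is the running total, which stays unchanged in months that have no record.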

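Follow-up: how can both the missing dates and the missing values be filled in?

Use the gap between adjacent dates directly: posexplode() expands each record across the months up to the next record, filling in the missing dates together with their values. The SQL is as follows: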

with data as
         (
             select 'AAAA' as contract, '2018-12-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-03-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-06-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-09-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'BBBB' as contract, '2018-12-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-02-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-06-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-09-02' as value_date, 9439.30 as amount, 10 as term
         )
select contract,
       add_months(value_date, pos) value_date,
       amount
from (
         select contract,
                value_date,
                amount,
                term,
                lead(value_date, 1, value_date) over (partition by contract order by value_date) next_value_date
         from data) tmp lateral view posexplode(
        -- One element per month between this record and the next (a single element when the gap is 0).
        split(space(cast(months_between(next_value_date, value_date) as int)), " (?!$)")
) temp as pos, val;

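The result has one row for every month from each contract's first record to its last; each filled-in month carries the amount of the preceding record.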
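To see how many rows each record expands into, it can help to look at the intermediate gap on its own. The following is a minimal sketch on a three-row subset of the sample data; the column name gap_months is only illustrative.

```sql
with data as
(
 select 'AAAA' as contract, '2018-12-21' as value_date, 9439.30 as amount
 union all
 select 'AAAA' as contract, '2019-03-21' as value_date, 9439.30 as amount
 union all
 select 'AAAA' as contract, '2019-06-21' as value_date, 9439.30 as amount
)
select contract
     , value_date
     , next_value_date
     -- whole months until the next record; 0 for the last record of a contract
     , cast(months_between(next_value_date, value_date) as int) as gap_months
from (
         select contract
              , value_date
              -- the third argument makes the last record fall back to its own date
              , lead(value_date, 1, value_date) over (partition by contract order by value_date) as next_value_date
         from data
     ) t;
```

The first two records should show a gap of 3 and the last one a gap of 0. A gap of 0 still produces one row, because space(0) is the empty string and splitting an empty string returns a single element, so the last record of each contract is kept.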

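2 Summary

This article presents a general method for accurately computing cumulative values over non-consecutive dates. It covers:

(1) How to construct consecutive dates

(2) How to accurately compute cumulative values over non-consecutive dates

This class of problem is also known as cumulative sum over a sparse field.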


Reposted from blog.csdn.net/godlovedaniel/article/details/129318378