HiveSQL trick of the day: how to accurately calculate cumulative values over non-consecutive dates [Lightning Express interview question]

0 Requirement

Sparse Field Cumulative Sum Problem

1 Problem Analysis

As the data transformation in the figure shows, the missing dates are filled in according to the field term, which is the number of consecutive dates to generate. For example, term = 12 means 12 consecutive dates starting from 2018-12-21; in the SQL below the dates advance one month at a time. Once the dates are filled in, the cumulative amount is calculated in date order. Note that the amount on a filled-in date is empty (NULL).

This kind of problem comes up often in business, especially when calculating cumulative values: if the dates are not continuous, it is easy to miss the cumulative value for some dates, which leaves the data incomplete. The core difficulty is that the dates in the data are not continuous, so continuous dates have to be constructed first. How do we construct them? Readers of my SQLBOY1000 question column will have seen similar problems; the related posts are listed here.

Analysis of SQL overlapping intersecting interval problems--HiveSQL interview question 30

HiveSQL trick of the day: how to construct consecutive dates

Step 1: Fill in the required consecutive dates based on the dates in the data

To fill in consecutive dates, the core template statement is:

lateral view posexplode(split(space(term), '(?!$)')) temp as pos,val

Here, space(term) returns a string of term spaces. Its only purpose is to give split() something to cut into term pieces, so the number of generated rows is controlled by the argument. The regular expression (?!$) in split() is a negative lookahead that matches every position except the end of the string. Without it, split() would also cut at the end of the string and produce one extra element, which we need to remove.

posexplode() then emits an index pos (starting from 0) for each piece, and every date can be generated as the start date in the data (min(value_date)) plus pos steps. Note that the dates here increase by month, so we use the add_months function, that is:
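To make the row-generation trick easier to picture, here is a rough Python model of it. It is only a sketch: Python's re.split does not behave like the Java split that Hive uses, so the split step is modeled by its outcome (term one-space pieces) rather than by the regular expression itself. The helper names space and posexplode are stand-ins written for this illustration, not Hive APIs.

```python
# Rough Python model of: posexplode(split(space(term), '(?!$)'))
# Assumption: in Hive the split yields exactly `term` one-space pieces.

def space(n):
    """Model of Hive's space(n): a string made of n spaces."""
    return " " * n

def posexplode(items):
    """Model of posexplode(): one (pos, val) row per element, pos starting at 0."""
    return list(enumerate(items))

term = 12
pieces = list(space(term))   # stands in for split(space(term), '(?!$)')
rows = posexplode(pieces)

print(len(rows))             # number of generated rows
print(rows[0], rows[-1])     # first and last (pos, val) pair
```

Only pos is used later on; val is a throwaway single space.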

add_months(value_date, pos)
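As a minimal sketch of what "start date + pos months" produces, the following Python models the expression for contract AAAA. The add_months helper here is a simplified stand-in for Hive's function (it only clamps the day to the end of shorter months, which is an assumption that does not matter for the sample dates), and monthly_dates is a hypothetical name used for illustration.

```python
import calendar
from datetime import date

def add_months(d, n):
    """Simplified stand-in for Hive's add_months: shift d by n months,
    clamping the day to the last day of the target month."""
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def monthly_dates(start, term):
    """One date per pos in 0..term-1, like add_months(value_date, pos)."""
    return [add_months(start, pos) for pos in range(term)]

# Contract AAAA: min(value_date) = 2018-12-21, term = 12
dates = monthly_dates(date(2018, 12, 21), 12)
for d in dates:
    print(d.isoformat())
```

With term = 12 this yields 12 monthly dates, from 2018-12-21 through 2019-11-21.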

The complete statement for generating the consecutive dates is as follows:

with data as
(
 select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)
select contract
          , add_months(value_date, pos) value_date
          ,term
     from (
              select contract
                   , min(value_date) value_date
                   , max(amount)     amount
                   , max(term)       term
              from data
              group by contract
          ) t1 lateral view posexplode(split(space(term), '(?!$)')) temp as pos,val
 

Step 2: Use the completed consecutive dates as the main table, join the data table to it, and calculate the cumulative value.

Note: the generated consecutive dates must be the main (left) table, with the data table joined to it, so that the cumulative calculation neither repeats nor misses any date. In sum(...) over (partition by ... order by ...), the column being summed must come from the data table (the right table), while the partition by and order by columns must come from the main table.

This is also the key to accurately calculating cumulative values over non-consecutive dates: the generated, filled-in date dimension table must be the main table, and the data table is then joined to it.

The final SQL is as follows:
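The following Python sketch models the idea for a single contract (AAAA): the generated dates drive the loop, the sparse data is looked up per date, and a missing date contributes nothing to the running total. It is an illustration of the logic under simplifying assumptions (one contract, a dict lookup in place of the left join, a simplified add_months), not a translation of how Hive executes the query.

```python
import calendar
from datetime import date
from decimal import Decimal

def add_months(d, n):
    """Simplified stand-in for Hive's add_months (day clamped to month end)."""
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))

AMOUNT = Decimal("9439.30")
term = 12
# Sparse data rows for contract AAAA: value_date -> amount
data = {
    date(2018, 12, 21): AMOUNT,
    date(2019, 3, 21): AMOUNT,
    date(2019, 6, 21): AMOUNT,
    date(2019, 9, 21): AMOUNT,
}

# Main table: the filled-in date dimension, starting at min(value_date)
dim = [add_months(min(data), pos) for pos in range(term)]

running = Decimal("0")
result = []
for d in dim:                  # ordering comes from the main table
    amount = data.get(d)       # None plays the role of NULL after the left join
    if amount is not None:     # sum() skips NULLs
        running += amount
    result.append((d, running))

for d, total in result:
    print(d.isoformat(), total)
```

Every generated date gets a cumulative value, including the dates that have no row in the data. Looping over the data rows alone would give only four cumulative values.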

with data as
(
 select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
 union all
 select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
 union all
 select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)

select  dim.contract
       ,dim.value_date
       ,cast(sum(d.amount) over(partition by dim.contract order by dim.value_date)  as decimal(18,2)) amount
       ,dim.term
from
    (select contract
          , add_months(value_date, pos) value_date
          ,term
     from (
              select contract
                   , min(value_date) value_date
                   , max(amount)     amount
                   , max(term)       term
              from data
              group by contract
          ) t1 lateral view posexplode(split(space(term), '(?!$)')) temp as pos,val
    ) dim
left join
(
  select contract
        ,value_date
        ,amount
  from data
) d
on dim.contract = d.contract and dim.value_date = d.value_date

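The result is as follows:

Supplementary question: how can the missing dates and their missing values be filled in completely?

Based on the gap between adjacent dates, use the posexplode() function to expand and fill in the missing dates and data directly. The SQL is as follows: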

with data as
         (
             select 'AAAA' as contract, '2018-12-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-03-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-06-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'AAAA' as contract, '2019-09-21' as value_date, 9439.30 as amount, 12 as term
             union all
             select 'BBBB' as contract, '2018-12-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-02-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-06-02' as value_date, 9439.30 as amount, 10 as term
             union all
             select 'BBBB' as contract, '2019-09-02' as value_date, 9439.30 as amount, 10 as term
         )
select contract,
       add_months(value_date, pos) value_date,
       amount
from (
         select contract,
                value_date,
                amount,
                term,
                lead(value_date, 1, value_date) over (partition by contract order by value_date) next_value_date
         from data) tmp lateral view posexplode(
        split(space(cast(months_between(next_value_date, value_date) as int)), '(?!$)')
) temp as pos, val;
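To check the logic of this query by hand, here is a small Python model for contract BBBB. It is a sketch under assumptions: months_between is simplified to whole months (the sample dates share the same day of month), add_months is a simplified stand-in, and the last row, whose gap is 0, is assumed to still produce one row, which is what splitting an empty string into a single element would give.

```python
import calendar
from datetime import date
from decimal import Decimal

def add_months(d, n):
    """Simplified stand-in for Hive's add_months (day clamped to month end)."""
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))

def months_between(later, earlier):
    """Whole months between two dates that fall on the same day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

AMOUNT = Decimal("9439.30")
# Sparse rows for contract BBBB, ordered by value_date
rows = [
    (date(2018, 12, 2), AMOUNT),
    (date(2019, 2, 2), AMOUNT),
    (date(2019, 6, 2), AMOUNT),
    (date(2019, 9, 2), AMOUNT),
]

filled = []
for i, (d, amount) in enumerate(rows):
    # lead(value_date, 1, value_date): the next date, or the row's own date at the end
    next_d = rows[i + 1][0] if i + 1 < len(rows) else d
    # A gap of 0 (last row) is assumed to still yield one row with pos = 0
    n = max(months_between(next_d, d), 1)
    for pos in range(n):
        filled.append((add_months(d, pos), amount))

for d, amount in filled:
    print(d.isoformat(), amount)
```

Each original row is repeated once per month up to, but not including, the next original row, so the missing months carry the amount of the row that precedes them.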

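The result is as follows:

2 Summary

This article gives a general method for accurately calculating cumulative values over non-consecutive dates. From it you can learn:

(1) How to construct consecutive dates

(2) How to accurately calculate cumulative values over non-consecutive dates

Note that this kind of problem is also known as the sparse field cumulative sum problem.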

Origin blog.csdn.net/godlovedaniel/article/details/129318378