How to keep the group id unchanged when adding new dimensions to grouping sets in Spark SQL

Recently, while using Spark SQL for offline development on a project, I ran into many requirements for deduplication over combinations of dimensions. My first idea was to use grouping sets for the dimension combinations, with grouping_id() as the group id.
Grouping sets in Spark SQL differ from those in Hive mainly in how the group id is calculated:

  1. In Spark SQL the group id is obtained with grouping_id(), while Hive exposes it as grouping__id (two underscores).
  2. In Spark SQL a dimension's bit is 0 when the dimension is selected and 1 when it is not; Hive is the opposite, 1 when selected and 0 when not. The bit order differs as well: Spark SQL puts the first group by field in the highest bit, while Hive puts it in the lowest bit. For
    example:
--spark sql
select grouping_id() group_id
  from temp_tb
group by a, b, c, d
grouping sets (
  (a, b),  --0011=1+2=3
  (a, c),  --0101=1+4=5
  (b, d),  --1010=2+8=10
  (a, d)   --0110=2+4=6
)
;

--hive sql
select grouping__id as group_id
  from temp_tb
group by a   --1
        ,b   --2
        ,c   --4
        ,d   --8
grouping sets (
  (a, b),  --0011=1+2=3
  (a, c),  --0101=1+4=5
  (b, d),  --1010=2+8=10
  (a, d)   --1001=1+8=9
)
;

You will find that only the (a, d) combination id differs; the other three happen to come out the same under both calculations. For details on using Hive grouping sets, please refer to my previous blog post:
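
The two numbering schemes can be checked with a small model. The Python sketch below only illustrates the arithmetic; it is not Spark or Hive code. The function names `spark_grouping_id` and `hive_style_id` are hypothetical, and the Hive rule is taken from the comments in the example above.

```python
def spark_grouping_id(group_by, grouping_set):
    """Model of Spark SQL grouping_id(): the first GROUP BY column is the
    highest bit, and a bit is 1 when the column is NOT in the grouping set."""
    gid = 0
    for col in group_by:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


def hive_style_id(group_by, grouping_set):
    """Model of the Hive numbering described in this post: the first GROUP BY
    column is the lowest bit, and a bit is 1 when the column IS in the set."""
    return sum(1 << i for i, col in enumerate(group_by) if col in grouping_set)


columns = ["a", "b", "c", "d"]
grouping_sets = [("a", "b"), ("a", "c"), ("b", "d"), ("a", "d")]

for gs in grouping_sets:
    print(gs, spark_grouping_id(columns, gs), hive_style_id(columns, gs))
# ('a', 'b') 3 3
# ('a', 'c') 5 5
# ('b', 'd') 10 10
# ('a', 'd') 6 9
```

That three of the four ids agree is a coincidence of these particular sets; the two rules are different in general.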

Enhanced features of group by in Hive: grouping__id, grouping, grouping sets, cube, rollup

From the above example you can see that in Hive, appending a new field after the existing group by fields and adding new dimension combinations never changes the previous combination ids (inserting the new field before or between the existing fields would change them). In Spark SQL, however, any field appended to the group by changes the previous combination ids, because the new field is not selected in the old combinations and therefore contributes a 1 bit to each of them. The filter conditions used by downstream tasks then have to be modified accordingly, which makes adding a dimension very troublesome.
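
The effect of appending a field can be worked through by hand. The following Python sketch is a model of the bit arithmetic only, under the numbering rules stated in this post; `spark_grouping_id` and `hive_style_id` are hypothetical helper names, not Spark or Hive functions.

```python
def spark_grouping_id(group_by, grouping_set):
    # Bit is 1 when the column is NOT in the set; first column is the highest bit.
    return int("".join("0" if c in grouping_set else "1" for c in group_by), 2)


def hive_style_id(group_by, grouping_set):
    # Bit is 1 when the column IS in the set; first column is the lowest bit.
    return sum(1 << i for i, c in enumerate(group_by) if c in grouping_set)


before = ["a", "b", "c", "d"]
after = before + ["e"]  # one new dimension appended to GROUP BY

for gs in [("a", "b"), ("a", "c"), ("b", "d"), ("a", "d")]:
    print(gs,
          "spark:", spark_grouping_id(before, gs), "->", spark_grouping_id(after, gs),
          "hive-style:", hive_style_id(before, gs), "->", hive_style_id(after, gs))
# ('a', 'b') spark: 3 -> 7 hive-style: 3 -> 3
# ('a', 'c') spark: 5 -> 11 hive-style: 5 -> 5
# ('b', 'd') spark: 10 -> 21 hive-style: 10 -> 10
# ('a', 'd') spark: 6 -> 13 hive-style: 9 -> 9
```

In this model every old Spark id becomes 2 * old + 1, since the old bits shift left by one and the unselected new column adds a trailing 1, while the Hive-style ids stay put because the new column occupies a higher bit that is 0.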

So how can the group id be kept unchanged as fields are added?

Idea: assemble the binary code by hand and convert it to a decimal id, building the binary code the same way Hive does.
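
As a rough illustration of this idea, the Python sketch below models one output row of a grouping sets query as a dict, with `None` standing in for SQL NULL (a column that is not part of the current grouping set). The helper name `stable_group_id` is hypothetical; it mirrors a concat of 1/0 flags in reverse group by order followed by a base-2 to base-10 conversion.

```python
def stable_group_id(row, group_by):
    """Build the binary code from the last GROUP BY column to the first
    (1 = column is non-NULL, i.e. selected), then read it as a base-2 number."""
    bits = "".join("1" if row[col] is not None else "0"
                   for col in reversed(group_by))
    return int(bits, 2)


# Output row for grouping set (a, d) with four GROUP BY columns.
row_4 = {"a": "x", "b": None, "c": None, "d": "y"}
print(stable_group_id(row_4, ["a", "b", "c", "d"]))            # 9  (binary 1001)

# Same grouping set after two more columns are appended to GROUP BY.
row_6 = {**row_4, "e": None, "f": None}
print(stable_group_id(row_6, ["a", "b", "c", "d", "e", "f"]))  # 9  (binary 001001)
```

New columns only add leading zeros to the binary code of an old combination, which is why its decimal id does not move.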

Straight to the code:

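-- First, combine four fields as dimensions
select -- To keep earlier group_ids unchanged, put newly added dimension fields at the front of concat; the order is the reverse of the group by order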
       conv(cast(
              concat(if(d is not null, 1, 0), if(c is not null, 1, 0)
                    ,if(b is not null, 1, 0), if(a is not null, 1, 0))
       as int), 2, 10) as group_id
  from 
  ( -- First replace NULLs in the dimension fields with non-NULL values
    select nvl(a, '') as a
          ,nvl(b, '') as b
          ,nvl(c, '') as c
          ,nvl(d, '') as d
      from temp_tb
  ) tt
group by a    --1
        ,b    --2
        ,c    --4
        ,d    --8
grouping sets (
  (a, b),  --1+2=3
  (a, c),  --1+4=5
  (b, d),  --2+8=10
  (a, d)   --1+8=9
)
;
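
The inner `nvl` step matters because the outer query reads NULL as "this dimension is not in the current grouping set". A minimal Python sketch of that failure mode, assuming string-typed dimensions and using hypothetical helpers `nvl` and `stable_group_id` with `None` standing in for SQL NULL:

```python
def nvl(value, default=""):
    # Same role as SQL nvl(x, ''): replace a missing value with a default.
    return default if value is None else value


def stable_group_id(row, group_by):
    # 1 = column is non-NULL; last GROUP BY column is the highest bit.
    bits = "".join("1" if row[col] is not None else "0"
                   for col in reversed(group_by))
    return int(bits, 2)


group_by = ["a", "b", "c", "d"]
source = {"a": "x", "b": None}  # b is genuinely NULL in the source data

# Output row for grouping set (a, b): c and d are NULL because they are not selected.
without_nvl = {"a": source["a"], "b": source["b"], "c": None, "d": None}
with_nvl = {"a": nvl(source["a"]), "b": nvl(source["b"]), "c": None, "d": None}

print(stable_group_id(without_nvl, group_by))  # 1, b is mistaken for "not selected"
print(stable_group_id(with_nvl, group_by))     # 3, the intended id for (a, b)
```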


Now add new fields on top of the original four:

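-- Add new fields on top of the original four
select -- To keep earlier group_ids unchanged, put newly added dimension fields at the front of concat; the order is the reverse of the group by order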
       conv(cast(
              concat(if(f is not null, 1, 0), if(e is not null, 1, 0)
                    ,if(d is not null, 1, 0), if(c is not null, 1, 0)
                    ,if(b is not null, 1, 0), if(a is not null, 1, 0))
       as int), 2, 10) as group_id
  from 
  ( -- First replace NULLs in the dimension fields with non-NULL values, because the binary code above is built from "is not null" checks
    select nvl(a, '') as a
          ,nvl(b, '') as b
          ,nvl(c, '') as c
          ,nvl(d, '') as d
          ,nvl(e, '') as e
          ,nvl(f, '') as f
      from temp_tb
  ) tt
group by a    --1
        ,b    --2
        ,c    --4
        ,d    --8
        ,e    --16
        ,f    --32
grouping sets (
  (a, b),  --1+2=3
  (a, c),  --1+4=5
  (b, d),  --2+8=10
  (a, d),  --1+8=9
  (a, e),  --1+16=17
  (e, f),  --16+32=48
  (d, f)   --8+32=40
)
;

In this way, even if later requirements add more dimensions, downstream tasks do not need to change the filter conditions on the group id, because the previous group ids have not changed.

Origin blog.csdn.net/lz6363/article/details/114950131