OLAP functions in Spark SQL

Table of Contents

GROUPING SETS

GROUPING__ID

ROLLUP

CUBE


Spark SQL supports OLAP functions for multi-dimensional analysis. This article covers the most commonly used ones: grouping sets, grouping__id, rollup, and cube.

Prepare data

Table ldldwd.ec_tm_business_store:

    businessID    businessName    ID
    1             CPD             11
    3             LD              12
    4             PPD             13
    2             ACD             14

GROUPING SETS

Grouping sets run several groupings in a single query: after group by, list the grouping combinations inside grouping sets(...). In other words, when grouping sets is used, the group by clause only declares which fields may be grouped on; the actual grouping is done once for each combination listed in grouping sets.

For example, a plain group by without grouping sets:

    select businessID,businessName 
    from ldldwd.ec_tm_business_store
    group by  businessID,businessName

The result is:

    businessID    businessName
    2                    ACD
    1                    CPD
    4                    PPD
    3                    LD    

Now use grouping sets to split the grouping on businessID and businessName into three sub-groups: group one is (businessID, businessName), group two is businessID, and group three is businessName. The SQL is as follows:

    select businessID,businessName 
    from ldldwd.ec_tm_business_store
    group by  businessID,businessName
    grouping sets((businessID,businessName),businessID,businessName)
    ;

The result is:

    businessID    businessName
    3             NULL
    NULL          PPD
    2             ACD
    1             NULL
    4             PPD
    2             NULL
    1             CPD
    NULL          LD
    NULL          ACD
    3             LD
    NULL          CPD
    4             NULL

This means: group the ec_tm_business_store table by (businessID, businessName), by businessID, and by businessName separately, then combine the results of the three groupings. It is equivalent to:

    -- group one: (businessID, businessName)
    select businessID,businessName
    from ldldwd.ec_tm_business_store
    group by businessID,businessName

    union

    -- group two: businessID only
    select businessID,null as businessName
    from ldldwd.ec_tm_business_store
    group by businessID

    union

    -- group three: businessName only
    select null as businessID,businessName
    from ldldwd.ec_tm_business_store
    group by businessName

Because group two and group three each group by a single field, the other field is NULL in their rows.
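The equivalence above can be checked without a Spark cluster. The following is a minimal Python sketch, not Spark code: `grouping_sets` is a hypothetical helper that emulates the clause for queries without aggregates, and the rows are the sample data from the table at the top of this article.

```python
# Sample rows from ldldwd.ec_tm_business_store: (businessID, businessName, ID)
ROWS = [(1, "CPD", 11), (3, "LD", 12), (4, "PPD", 13), (2, "ACD", 14)]
COLUMNS = ("businessID", "businessName")


def grouping_sets(rows, columns, sets):
    """Emulate GROUP BY <columns> GROUPING SETS (<sets>) with no aggregates.

    For each grouping set, group by the columns in that set and output
    None (SQL NULL) for every column that is not part of the set.
    """
    result = []
    for group in sets:
        seen = set()  # distinct groups within this grouping set
        for row in rows:
            # zip stops at the shorter input, so only the first
            # len(columns) values of each row are used.
            record = dict(zip(columns, row))
            out = tuple(record[c] if c in group else None for c in columns)
            if out not in seen:
                seen.add(out)
                result.append(out)
    return result


result = grouping_sets(
    ROWS,
    COLUMNS,
    [("businessID", "businessName"), ("businessID",), ("businessName",)],
)
for line in result:
    print(line)
```

The sketch yields 12 rows, four per grouping set, which matches the row count of the query result shown earlier (row order in Spark is not guaranteed).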
Note the following:
1. When grouping sets is used, only the combinations listed in grouping sets are grouped; the full field list after group by is not grouped as a whole. For example:

    select businessID,businessName 
    from ldldwd.ec_tm_business_store
    group by  businessID,businessName
    grouping sets(businessID,businessName)
    ;

The result is:

    businessID    businessName
    3             NULL
    NULL          PPD
    1             NULL
    2             NULL
    NULL          LD
    NULL          ACD
    NULL          CPD
    4             NULL

The rows for the combined grouping (businessID, businessName) do not appear, because that combination is not listed in grouping sets.

2. Fields that are not listed after group by cannot appear in grouping sets. For example:

    select businessID,businessName 
    from ldldwd.ec_tm_business_store
    group by  businessID,businessName
    grouping sets(ID,businessID,businessName)
    ;

The grouping sets in the SQL above contain the ID field, but group by does not, so the query fails with an error:

Error in query: id#35 doesn't show up in the GROUP BY list ArrayBuffer(businessID#36 AS businessID#50, businessName#37 AS businessName#51);

GROUPING__ID

grouping__id can be used together with grouping sets to tell which grouping a row belongs to. Its value encodes which group by fields did not take part in the grouping for the current row. For example:

     select businessID,businessName,ID,grouping__id
     from ldldwd.ec_tm_business_store
     group by  businessID,businessName,ID
     grouping sets((businessID,businessName),businessID,businessName,ID)
     order by grouping__id
     ;

The result is:

    businessID    businessName    ID      grouping__id
    3             LD              NULL    1
    2             ACD             NULL    1
    1             CPD             NULL    1
    4             PPD             NULL    1
    2             NULL            NULL    3
    1             NULL            NULL    3
    3             NULL            NULL    3
    4             NULL            NULL    3
    NULL          LD              NULL    5
    NULL          PPD             NULL    5
    NULL          CPD             NULL    5
    NULL          ACD             NULL    5
    NULL          NULL            11      6
    NULL          NULL            12      6
    NULL          NULL            13      6
    NULL          NULL            14      6

At first glance this looks confusing.
In fact, grouping__id assigns each grouping field an id value using a bitmap. The group by fields are read as a binary number: the field closest to the group by keyword (the leftmost) is the highest bit, and the field farthest from it (the rightmost) is the lowest bit. The grouping__id of a row is the sum of the id values of the fields that do not take part in the current sub-group.
In this example, group by is followed by businessID, businessName, ID, so their id values are:

businessID:4
businessName:2
ID:1

Now we can explain where each grouping__id value comes from:

    businessID    businessName    ID      grouping__id
    -- ID did not take part in the grouping, so grouping__id equals the id value of ID, which is 1
    3             LD              NULL    1
    2             ACD             NULL    1
    1             CPD             NULL    1
    4             PPD             NULL    1
    -- ID and businessName did not take part in the grouping, so grouping__id = (id value of ID + id value of businessName) = 1+2 = 3
    2             NULL            NULL    3
    1             NULL            NULL    3
    3             NULL            NULL    3
    4             NULL            NULL    3
    -- ID and businessID did not take part in the grouping, so grouping__id = (id value of ID + id value of businessID) = 1+4 = 5
    NULL          LD              NULL    5
    NULL          PPD             NULL    5
    NULL          CPD             NULL    5
    NULL          ACD             NULL    5
    -- businessID and businessName did not take part in the grouping, so grouping__id = (id value of businessID + id value of businessName) = 4+2 = 6
    NULL          NULL            11      6
    NULL          NULL            12      6
    NULL          NULL            13      6
    NULL          NULL            14      6

By working out the grouping__id value of each grouping in advance, rows from different groupings can be told apart reliably.
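The bitmap rule can be written down in a few lines. This is a minimal Python sketch of the rule as described in this article, not Spark's internal implementation; `grouping_id` is a hypothetical helper name.

```python
def grouping_id(group_by_cols, grouping_set):
    """Return the bitmap value for one grouping set.

    The leftmost GROUP BY column is the highest bit. A bit is 1 when the
    column does NOT take part in the current grouping set.
    """
    gid = 0
    for col in group_by_cols:
        # Shift the bits collected so far and append this column's bit.
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


cols = ["businessID", "businessName", "ID"]
sets = [("businessID", "businessName"), ("businessID",), ("businessName",), ("ID",)]
ids = [grouping_id(cols, gs) for gs in sets]
for gs, gid in zip(sets, ids):
    print(gs, gid)
```

For the four grouping sets of the example query this gives 1, 3, 5, and 6, the same values as in the result table. A query can then filter on a known value, for instance keeping only the rows whose grouping__id is 3 to get the per-businessID grouping.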

ROLLUP

rollup is usually used by appending with rollup directly after group by. It treats the fields after group by as a hierarchy and aggregates upward level by level. For example, group by a,b,c with rollup is equivalent to group by a,b,c grouping sets((a,b,c),(a,b),(a),()).
That is: first group by all the group by fields, then repeatedly drop the rightmost field and group again, ending with the empty grouping set () that aggregates over all rows, and finally union all the results.
For example:

Prepare data:

year   month   day    amount
2020    01     05     10
2020    01     08     10
2020    02     08     10
2020    02     05     20
2020    03     05     30
2020    03     08     30

SQL:

    select year,month,sum(amount) as amount
    from sellout_tm
    group by year,month
    with rollup
    ;

The result is:

    year    month    amount
    2020    01       20
    2020    02       30
    2020    03       60
    2020    null     110
    null    null     110


The above SQL is equivalent to:

    select year,month,sum(amount) as amount
    from sellout_tm
    group by year,month
    grouping sets((year,month),year,())
    ;
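The rollup arithmetic can be checked by hand with a small simulation. The following Python sketch is not Spark code; `rollup_sets` and `rollup_sum` are hypothetical helper names, and the rows are the sample data above. It includes the empty grouping set (), the grand-total level that ROLLUP produces in Spark SQL.

```python
from collections import defaultdict

# Sample rows: (year, month, day, amount)
SALES = [
    ("2020", "01", "05", 10),
    ("2020", "01", "08", 10),
    ("2020", "02", "08", 10),
    ("2020", "02", "05", 20),
    ("2020", "03", "05", 30),
    ("2020", "03", "08", 30),
]


def rollup_sets(cols):
    """(a, b) -> [(a, b), (a,), ()]: drop the rightmost column each step."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def rollup_sum(rows, names, cols, measure):
    """Sum `measure` for every rollup level; None stands for SQL NULL."""
    totals = defaultdict(int)
    for row in rows:
        record = dict(zip(names, row))
        for level in rollup_sets(cols):
            key = tuple(record[c] if c in level else None for c in cols)
            totals[key] += record[measure]
    return dict(totals)


totals = rollup_sum(
    SALES, ("year", "month", "day", "amount"), ("year", "month"), "amount"
)
for key, amount in totals.items():
    print(key, amount)
```

Each input row contributes to one key per level, so the per-month sums, the per-year subtotal, and the grand total are all built in a single pass over the data.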

CUBE

Like a cube in Apache Kylin, cube groups by every combination of the group by fields. For example, group by a,b,c with cube is equivalent to group by a,b,c grouping sets((a,b,c),(a,b),(a,c),(b,c),a,b,c,()). I won't go into details here.
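The set of combinations that cube expands to is simply every subset of the group by fields. A minimal Python sketch, with `cube_sets` as a hypothetical helper name, enumerates them:

```python
from itertools import combinations


def cube_sets(cols):
    """All subsets of cols, largest first, including the empty set ()."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


sets = cube_sets(["a", "b", "c"])
print(len(sets))
for s in sets:
    print(s)
```

For n fields there are 2^n grouping sets (8 for a, b, c, counting the grand-total set ()), so cube on many fields can multiply the output size quickly.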

Origin blog.csdn.net/x950913/article/details/106853667