Spark SQL supports several functions for multi-dimensional (OLAP-style) analysis. This article covers the most commonly used ones: grouping sets, grouping__id, rollup and cube.
Prepare data
Table ldldwd.ec_tm_business_store:
businessID businessName ID
1 CPD 11
3 LD 12
4 PPD 13
2 ACD 14
GROUPING SETS
grouping sets runs several groupings in a single query; the grouping combinations are listed after grouping sets. When grouping sets is present, the group by clause only declares which fields may be used for grouping, and the actual grouping is performed on each combination listed in grouping sets.
For example, a plain group by without grouping sets:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
The result is:
businessID businessName
2 ACD
1 CPD
4 PPD
3 LD
Now use grouping sets to group by three combinations of businessID and businessName: group one is (businessID, businessName), group two is businessID, and group three is businessName. The SQL is as follows:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets((businessID,businessName),businessID,businessName)
;
The result is:
businessID businessName
3 NULL
NULL PPD
2 ACD
1 NULL
4 PPD
2 NULL
1 CPD
NULL LD
NULL ACD
3 LD
NULL CPD
4 NULL
Meaning: the ec_tm_business_store table is grouped by (businessID, businessName), by businessID, and by businessName, and the results of the three groupings are combined. This is equivalent to:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
union
select businessID,null as businessName
from ldldwd.ec_tm_business_store
group by businessID
union
select null as businessID,businessName
from ldldwd.ec_tm_business_store
group by businessName
Because groups two and three each group by only one field, the other field is returned as null.
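The "one grouping per set, then combine" behaviour can be sketched outside Spark. The following is a minimal pure-Python emulation, not how Spark executes the query; the helper name grouping_sets and the ROWS list are hypothetical and exist only for illustration. It covers queries without aggregate functions, like the one above.

```python
# Sample rows copied from the article's ec_tm_business_store table.
ROWS = [
    {"businessID": 1, "businessName": "CPD", "ID": 11},
    {"businessID": 3, "businessName": "LD", "ID": 12},
    {"businessID": 4, "businessName": "PPD", "ID": 13},
    {"businessID": 2, "businessName": "ACD", "ID": 14},
]


def grouping_sets(rows, select_cols, sets):
    """Hypothetical helper: emulate GROUP BY ... GROUPING SETS without aggregates.

    For each grouping set, keep the grouped columns, replace every other
    selected column with None (SQL NULL), and collect the distinct groups.
    The per-set results are then concatenated.
    """
    result = []
    for grouping_set in sets:
        seen = set()
        for row in rows:
            key = tuple(row[c] if c in grouping_set else None for c in select_cols)
            if key not in seen:
                seen.add(key)
                result.append(key)
    return result


result = grouping_sets(
    ROWS,
    ["businessID", "businessName"],
    [("businessID", "businessName"), ("businessID",), ("businessName",)],
)
for row in result:
    print(row)
```

With the four sample rows this yields twelve tuples, four per grouping set, which matches the row count of the Spark result above (row order will differ).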
Two points should be noted:
1. When grouping sets is used, only the combinations listed in grouping sets are grouped; the full field list after group by is not grouped on its own. For example:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets(businessID,businessName)
;
The result is:
businessID businessName
3 NULL
NULL PPD
1 NULL
2 NULL
NULL LD
NULL ACD
NULL CPD
4 NULL
The result of grouping by (businessID, businessName) does not appear.
2. Fields that do not appear after group by cannot be used in grouping sets. For example:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets(ID,businessID,businessName)
;
Here grouping sets contains the ID field but group by does not, so the query fails with an error:
Error in query: id#35 doesn't show up in the GROUP BY list ArrayBuffer(businessID#36 AS businessID#50, businessName#37 AS businessName#51);
GROUPING__ID
grouping__id can be used together with grouping sets to tell which grouping each row belongs to. Its value encodes which fields did not take part in the grouping that produced the current row. For example:
select businessID,businessName,ID,grouping__id
from ldldwd.ec_tm_business_store
group by businessID,businessName,ID
grouping sets((businessID,businessName),businessID,businessName,ID)
order by grouping__id
;
The result is:
businessID businessName ID grouping__id
3 LD NULL 1
2 ACD NULL 1
1 CPD NULL 1
4 PPD NULL 1
2 NULL NULL 3
1 NULL NULL 3
3 NULL NULL 3
4 NULL NULL 3
NULL LD NULL 5
NULL PPD NULL 5
NULL CPD NULL 5
NULL ACD NULL 5
NULL NULL 11 6
NULL NULL 12 6
NULL NULL 13 6
NULL NULL 14 6
At first glance this may look confusing.
In fact, grouping__id assigns each group by field a value according to a bitmap: the field that comes first after group by takes the highest bit, and the field that comes last takes the lowest bit. The grouping__id of a row is the sum of the values of the fields that did not take part in that row's grouping.
In this example, group by is followed by businessID, businessName, ID, then their corresponding id values are:
businessID:4
businessName:2
ID:1
This explains where each grouping__id value comes from:
businessID businessName ID grouping__id
-- ID did not take part in the grouping, so grouping__id equals the value of the ID field, which is 1
3 LD NULL 1
2 ACD NULL 1
1 CPD NULL 1
4 PPD NULL 1
-- ID and businessName did not take part in the grouping, so grouping__id = (value of ID + value of businessName) = 1 + 2 = 3
2 NULL NULL 3
1 NULL NULL 3
3 NULL NULL 3
4 NULL NULL 3
-- ID and businessID did not take part in the grouping, so grouping__id = (value of ID + value of businessID) = 1 + 4 = 5
NULL LD NULL 5
NULL PPD NULL 5
NULL CPD NULL 5
NULL ACD NULL 5
-- businessID and businessName did not take part in the grouping, so grouping__id = (value of businessID + value of businessName) = 4 + 2 = 6
NULL NULL 11 6
NULL NULL 12 6
NULL NULL 13 6
NULL NULL 14 6
By working out the expected grouping__id values in advance, rows from different groupings can be reliably told apart.
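The bitmap rule can be written as a few lines of Python. This is a sketch of the convention shown in the output above (first group by field is the highest bit, and a bit is 1 when the field is not grouped); the function name grouping_id is hypothetical, and other engines or versions may number the bits differently.

```python
def grouping_id(group_by_cols, grouping_set):
    """Hypothetical helper: compute the bitmap value for one grouping set.

    The first GROUP BY column maps to the highest bit and the last column to
    the lowest bit. A bit is 1 when the column does NOT take part in the
    grouping set, and 0 when it does.
    """
    value = 0
    for col in group_by_cols:
        value = (value << 1) | (0 if col in grouping_set else 1)
    return value


cols = ["businessID", "businessName", "ID"]
for gs in [("businessID", "businessName"), ("businessID",), ("businessName",), ("ID",)]:
    print(gs, grouping_id(cols, gs))
```

For the four grouping sets in the example query this gives 1, 3, 5 and 6, the same values as the grouping__id column above.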
ROLLUP
rollup is usually used by appending with rollup directly after group by. It treats the fields after group by as a hierarchy and aggregates upward level by level. For example, group by a,b,c with rollup is equivalent to group by a,b,c grouping sets((a,b,c),(a,b),(a),()).
That is: first group by all the group by fields, then repeatedly drop the rightmost field and group again until no field remains (the grand total), and finally union all the results.
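The "drop the rightmost field" expansion can be sketched in Python. The helper name rollup_sets is hypothetical; the sketch assumes the standard ROLLUP definition, in which the last grouping set is the empty set (the grand total).

```python
def rollup_sets(cols):
    """Hypothetical helper: list the grouping sets that ROLLUP over cols expands to.

    Starts with all columns and drops the rightmost one at each step,
    ending with the empty tuple, which stands for the grand total.
    """
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


print(rollup_sets(["a", "b", "c"]))
```

For three fields this produces four grouping sets, one per level of the hierarchy plus the grand total.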
For example:
Prepare data:
year month day amount
2020 01 05 10
2020 01 08 10
2020 02 08 10
2020 02 05 20
2020 03 05 30
2020 03 08 30
SQL:
SELECT year,month,sum(amount) as amount
from sellout_tm
group by year,month
with rollup
Result:
year month amount
2020 01 20
2020 02 30
2020 03 60
2020 null 110
null null 110
The above SQL is equivalent to:
SELECT year,month,sum(amount) as amount
from sellout_tm
group by year,month
grouping sets((year,month),year,())
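The totals can be checked by hand with a small Python emulation over the sample rows. This is a sketch, not Spark's execution plan; the names SELLOUT and sum_by_grouping_sets are hypothetical, and the empty grouping set () is included on the assumption that rollup follows its standard definition and adds a grand-total row.

```python
from collections import defaultdict

# Sample rows copied from the article: (year, month, day, amount).
SELLOUT = [
    ("2020", "01", "05", 10),
    ("2020", "01", "08", 10),
    ("2020", "02", "08", 10),
    ("2020", "02", "05", 20),
    ("2020", "03", "05", 30),
    ("2020", "03", "08", 30),
]


def sum_by_grouping_sets(rows, sets):
    """Hypothetical helper: sum amount once per grouping set.

    Columns outside the current grouping set are reported as None (SQL NULL).
    """
    out = []
    for grouping_set in sets:
        totals = defaultdict(int)
        for year, month, _day, amount in rows:
            key = (
                year if "year" in grouping_set else None,
                month if "month" in grouping_set else None,
            )
            totals[key] += amount
        out.extend((y, m, total) for (y, m), total in totals.items())
    return out


rollup_result = sum_by_grouping_sets(SELLOUT, [("year", "month"), ("year",), ()])
for row in rollup_result:
    print(row)
```

The per-month sums are 20, 30 and 60, the per-year sum is 110, and the grand total is also 110 because the sample data contains a single year.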
CUBE
Like a cube in Kylin, cube groups by every combination of the group by fields. For example, group by a,b,c with cube is equivalent to group by a,b,c grouping sets((a,b,c),(a,b),(a,c),(b,c),a,b,c,()). The details are not covered here.
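The full list of combinations can be generated with the standard library. In this sketch the helper name cube_sets is hypothetical, and it assumes the standard CUBE definition, which also includes the empty grouping set (the grand total).

```python
from itertools import combinations


def cube_sets(cols):
    """Hypothetical helper: list the grouping sets that CUBE over cols expands to.

    Enumerates every subset of the columns, from the full set down to the
    empty tuple (the grand total), giving 2 ** len(cols) grouping sets.
    """
    sets = []
    for size in range(len(cols), -1, -1):
        sets.extend(combinations(cols, size))
    return sets


for gs in cube_sets(["a", "b", "c"]):
    print(gs)
```

With n group by fields, cube produces 2^n grouping sets, so the result grows quickly as fields are added.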