Reference articles:
1. The GROUPING SETS, CUBE, and ROLLUP functions in Hive
https://blog.csdn.net/weixin_34352449/article/details/92169948
2. The GROUPING function
https://blog.csdn.net/u013053796/article/details/18619315
3. GROUP BY ... WITH ROLLUP: subtotals after grouping
https://blog.csdn.net/u013677636/article/details/52353812
When using Hive, we often perform aggregate statistics. There are many aggregation constructs, the most commonly used being GROUP BY. But we often need statistics across multiple dimensions at once, and for that Hive provides multi-dimensional aggregation functions. Here we explain the meaning and usage of ROLLUP, GROUPING SETS, and CUBE, using a case to illustrate their rules.
Background for the statistics:
We have several companies, each with several departments and employees with salaries, and we need to aggregate salary along multiple dimensions.
First we create the base tables:
use data_warehouse_test;
CREATE TABLE IF NOT EXISTS datacube_salary_org (
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,user_id BIGINT COMMENT 'user id'
,user_name STRING COMMENT 'user name'
,salary DECIMAL(10,2) COMMENT 'salary'
,create_time DATE COMMENT 'creation time'
,update_time DATE COMMENT 'update time'
)
PARTITIONED BY(
pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;
CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,user_id BIGINT COMMENT 'user id'
,salary DECIMAL(10,2) COMMENT 'salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
company_name STRING COMMENT 'company name'
,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
Then we import the base data in two ways:
1. Insert directly through SQL
2. Via file LOAD
1. Insert through SQL
INSERT OVERWRITE TABLE datacube_salary_org PARTITION (pt = '20200407') VALUES
('s.zh','enginer',1,'szh',28000.0,'2020-04-07','2020-04-07'),
('s.zh','enginer',2,'zyq',26000.0,'2020-04-03','2020-04-03'),
('s.zh','tester',3,'gkm',20000.0,'2020-04-07','2020-04-07'),
('x.qx','finance',4,'pip',13400.0,'2020-04-07','2020-04-07'),
('x.qx','finance',5,'kip',24500.0,'2020-04-07','2020-04-07'),
('x.qx','finance',6,'zxxc',13000.0,'2020-04-07','2020-04-07'),
('x.qx','kiccp',7,'xsz',8600.0,'2020-04-07','2020-04-07')
;
2. Via file LOAD
Create a text file, org_data.txt, with the following contents:
s.zh,enginer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,enginer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07
We can load the file in either of two ways.
Method 1:
Create the partition folder, move the file into it, then repair the table:
hdfs dfs -mkdir /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
hdfs dfs -put org_data.txt /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
Execute the following statement on the beeline / hive client
MSCK REPAIR TABLE datacube_salary_org;
Method 2:
Create the partition, then LOAD the file.
The standard syntax of LOAD DATA is:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Create partition
ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200405');
LOAD data
LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt' OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200405');
With the data in place, let's look at the usage of these functions: ROLLUP, GROUPING SETS, and CUBE. We will discuss them in the order GROUPING SETS, ROLLUP, CUBE.
GROUPING SETS
GROUPING SETS is a clause of GROUP BY that lets developers specify multiple grouping options in a single statement. It can be understood simply as several GROUP BY queries whose results are combined with UNION ALL. A few examples will help us understand.
Its general usage is as follows:
SELECT
a
,b
...
,f
FROM test_table
GROUP BY
a
,b
...
,f
GROUPING SETS ((?,...,?),(xxx),(yyy))
Multiple grouping combinations can be listed inside GROUPING SETS.
Each (?, ..., ?) can be any non-repeating subset of the columns a through f.
A concrete example:
SELECT
a
,b
,SUM(c)
FROM test_table
GROUP BY
a
,b
GROUPING SETS ((a),(a,b),())
Equivalent to
SELECT
a
,NULL
,SUM(c)
FROM test_table
GROUP BY
a
UNION ALL
SELECT
a
,b
,SUM(c)
FROM test_table
GROUP BY
a
,b
UNION ALL
SELECT
NULL
,NULL
,SUM(c)
FROM test_table
;
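The equivalence above can be sketched in Python (a toy simulation with illustrative data, not Hive itself): each grouping set is just a GROUP BY that reports the non-grouped columns as NULL, and the branch results are unioned.

```python
from collections import defaultdict

# Toy rows standing in for test_table(a, b, c); values are illustrative only.
rows = [("x", 1, 10), ("x", 2, 20), ("y", 1, 30)]

def group_sum(rows, keep):
    """GROUP BY the column positions in `keep` and SUM column c (index 2).
    Columns not in the grouping set come back as None, like Hive's NULL."""
    acc = defaultdict(int)
    for a, b, c in rows:
        acc[(a if 0 in keep else None, b if 1 in keep else None)] += c
    return {key + (total,) for key, total in acc.items()}

# GROUPING SETS ((a), (a, b), ()) behaves like the UNION ALL of three GROUP BYs.
result = group_sum(rows, {0}) | group_sum(rows, {0, 1}) | group_sum(rows, set())
```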
Back to our case: we want to compute salary totals per company and overall, but in a single statement.
How do we write the SQL?
As follows:
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
GROUPING SETS ((company_name), ())
ORDER BY
grouping__id
;
The output is as follows:
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.13 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.73 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 860 msec
INFO : Completed executing command(queryId=hive_20200408032038_18e04047-b8c0-4d07-a5de-00ccbc7cb4cc); Time taken: 51.459 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+---------------+---------------+-----------+----------+---------------+
You can see that the row with grouping__id 0 holds the overall salary total, and the rows with grouping__id 1 hold the per-company totals.
For how GROUPING__ID and the GROUPING function are calculated, please refer to my other article.
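As a sanity check, the totals in this output can be reproduced from the seven sample rows with a short Python sketch (company/salary pairs copied from the INSERT above):

```python
# (company_name, salary) pairs taken verbatim from the sample INSERT.
rows = [
    ("s.zh", 28000.0), ("s.zh", 26000.0), ("s.zh", 20000.0),
    ("x.qx", 13400.0), ("x.qx", 24500.0), ("x.qx", 13000.0), ("x.qx", 8600.0),
]

# Per-company totals, mirroring GROUPING SETS ((company_name)).
per_company = {}
for company, salary in rows:
    per_company[company] = per_company.get(company, 0.0) + salary

# Grand total, mirroring the empty grouping set ().
overall = sum(s for _, s in rows)
```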
We just said that GROUPING SETS is equivalent to GROUP BY + UNION ALL, but are they really the same? Let's run EXPLAIN to check.
Below are the EXPLAIN results of the two SQL statements, which produce equivalent results.
SQL1
EXPLAIN
SELECT
*
FROM
(
SELECT
0 AS mark
,NULL
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
UNION ALL
SELECT
1 AS mark
,company_name
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
) tmp
ORDER BY mark
;
SQL2
EXPLAIN
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
GROUPING SETS ((company_name), ())
ORDER BY
grouping__id
;
First, here are the EXPLAIN results of the two SQL statements.
UNION ALL SQL
Note: every additional branch in the UNION ALL increases the number of jobs.
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200408035719_94eb75b2-c6fc-4804-bacb-d3555c61e7f3); Time taken: 0.016 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1, Stage-3 |
| Stage-3 is a root stage |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200407') (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: salary (type: decimal(10,2)) |
| outputColumnNames: salary |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: sum(salary) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: decimal(20,2)) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: sum(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: 0 (type: int), null (type: string), _col0 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col4 |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col4 |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col4 |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), null (type: void), null (type: void), VALUE._col3 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200407') (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), salary (type: decimal(10,2)) |
| outputColumnNames: company_name, salary |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: sum(salary) |
| keys: company_name (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: decimal(20,2)) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: sum(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: 1 (type: int), _col0 (type: string), _col1 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col4 |
| Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
GROUPING SETS SQL
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200408035951_8153b57c-9f60-4c2d-bc26-92659fdc8afd); Time taken: 0.007 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200407') (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), salary (type: decimal(10,2)) |
| outputColumnNames: company_name, dep_name, user_id, salary |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: sum(salary) |
| keys: company_name (type: string), dep_name (type: string), user_id (type: bigint), 0 (type: int) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
| sort order: ++++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
| Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col4 (type: decimal(20,2)) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: sum(VALUE._col0) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: int) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col4 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: decimal(20,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint), VALUE._col3 (type: decimal(20,2)) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Comparing the two plans, it is easy to see that the GROUPING SETS version needs fewer stages.
In practice, GROUPING SETS also runs more efficiently than the UNION ALL of GROUP BYs.
Below are the execution times for comparison.
GROUPING SETS
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.62 sec HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 130 msec
INFO : Completed executing command(queryId=hive_20200408045412_4ab9e09f-436e-4433-9a1f-a03d5b32ef3e); Time taken: 49.676 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+---------------+---------------+-----------+----------+---------------+
GROUP BY with UNION ALL
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.7 sec HDFS Read: 10541 HDFS Write: 119 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.34 sec HDFS Read: 10919 HDFS Write: 152 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 5.08 sec HDFS Read: 12932 HDFS Write: 188 SUCCESS
INFO : Total MapReduce CPU Time Spent: 13 seconds 120 msec
INFO : Completed executing command(queryId=hive_20200408045141_2033cbf6-a457-4bdb-aaec-65900b386972); Time taken: 84.365 seconds
INFO : OK
+-----------+----------+----------+----------+-------------------+
| tmp.mark | tmp._c1 | tmp._c2 | tmp._c3 | tmp.total_salary |
+-----------+----------+----------+----------+-------------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
+-----------+----------+----------+----------+-------------------+
As you can see, GROUPING SETS executes more efficiently.
ROLLUP
ROLLUP produces multi-level subtotals, collapsing the grouped columns from right to left, i.e. the aggregation at each level of a hierarchy.
That is:
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
WITH ROLLUP
ORDER BY
grouping__id
;
Equivalent to
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name),())
ORDER BY
grouping__id
;
Equivalent to
SELECT
company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
UNION ALL
SELECT
company_name
,dep_name
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
UNION ALL
SELECT
company_name
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
UNION ALL
SELECT
NULL
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;
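The pattern behind ROLLUP is simply "every prefix of the GROUP BY columns, down to the empty set". A small Python sketch of that rule, using the column names from the query above:

```python
def rollup_sets(cols):
    """WITH ROLLUP over cols is GROUPING SETS of every prefix of cols,
    from the full column list down to the empty (grand-total) set."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

sets_ = rollup_sets(["company_name", "dep_name", "user_id"])
```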
Execution result:
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.21 sec HDFS Read: 11674 HDFS Write: 563 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.91 sec HDFS Read: 7448 HDFS Write: 602 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 120 msec
INFO : Completed executing command(queryId=hive_20200408052638_740f42b9-6f08-49a6-8123-9a77aedc6b19); Time taken: 50.563 seconds
INFO : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
+---------------+---------------+-----------+----------+---------------+
Now let's briefly look at how GROUPING__ID is calculated.
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
WITH ROLLUP
ORDER BY
grouping__id
;
There are three fields in the GROUP BY, in order: company_name, dep_name, user_id.
They can be viewed as a 3-bit binary number, one bit per column:
the low bit corresponds to company_name, and the high bit to user_id.
If a column appears in the current grouping set, its bit is 1; if it has been aggregated away (replaced by NULL in the output), its bit is 0.
GROUPING__ID is then the decimal value of that binary number.
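This rule can be written as a short Python function. Note this is the pre-Hive-2.3 encoding that the output above reflects; since Hive 2.3.0 the encoding follows the SQL standard, where a bit is 1 when the column is aggregated away, so treat this as a sketch for older versions:

```python
def grouping_id(group_by_cols, grouping_set):
    """GROUPING__ID as described above: bit i (low bit = first GROUP BY
    column) is 1 when that column appears in the grouping set, 0 when it
    has been aggregated away. Pre-Hive-2.3 encoding."""
    gid = 0
    for i, col in enumerate(group_by_cols):
        if col in grouping_set:
            gid |= 1 << i
    return gid

cols = ["company_name", "dep_name", "user_id"]
```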
1) ROLLUP's fully grouped subset
SELECT
company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
Corresponding result
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
+---------------+---------------+-----------+----------+---------------+
Because the grouping keeps all three fields, the binary is 111, so GROUPING__ID is 7.
2) The company_name, dep_name grouping subset of ROLLUP
SELECT
company_name
,dep_name
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
Corresponding subset
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
+---------------+---------------+-----------+----------+---------------+
Because this grouping keeps only company_name and dep_name, applying the rules above (low bit = company_name, high bit = user_id), the binary is 011, so GROUPING__ID is 3.
I believe this is clear by now, so let's take a look at the CUBE function.
CUBE
CUBE is short for data cube; it enables querying along any combination of dimensions in Hive. CUBE(a, b, c) aggregates by (a, b, c), then (a, b), (a, c), (b, c), then (a), (b), (c), and finally over the whole table. In other words, it computes the aggregation for every combination of values of the selected columns.
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
WITH CUBE
ORDER BY
grouping__id
;
Equivalent to
(company_name,dep_name,user_id)
SELECT
company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
UNION ALL
SELECT
company_name
,dep_name
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
UNION ALL
SELECT
company_name
,NULL
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,user_id
UNION ALL
SELECT
company_name
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
UNION ALL
SELECT
NULL
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
dep_name
,user_id
UNION ALL
SELECT
NULL
,dep_name
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
dep_name
UNION ALL
SELECT
NULL
,NULL
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
user_id
UNION ALL
SELECT
NULL
,NULL
,NULL
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;
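The eight UNION ALL branches above are exactly the powerset of the three grouped columns. A sketch of that rule:

```python
from itertools import combinations

def cube_sets(cols):
    """WITH CUBE over cols is GROUPING SETS of every subset of cols,
    from the full column list down to the empty (grand-total) set."""
    out = []
    for r in range(len(cols), -1, -1):
        out.extend(combinations(cols, r))
    return out

sets_ = cube_sets(["company_name", "dep_name", "user_id"])
```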
The results are as follows:
+---------------+---------------+-----------+----------+---------------+
| grouping__id | company_name | dep_name | user_id | total_salary |
+---------------+---------------+-----------+----------+---------------+
| 0 | NULL | NULL | NULL | 133500.00 |
| 1 | s.zh | NULL | NULL | 74000.00 |
| 1 | x.qx | NULL | NULL | 59500.00 |
| 2 | NULL | finance | NULL | 50900.00 |
| 2 | NULL | kiccp | NULL | 8600.00 |
| 2 | NULL | tester | NULL | 20000.00 |
| 2 | NULL | enginer | NULL | 54000.00 |
| 3 | s.zh | tester | NULL | 20000.00 |
| 3 | s.zh | enginer | NULL | 54000.00 |
| 3 | x.qx | kiccp | NULL | 8600.00 |
| 3 | x.qx | finance | NULL | 50900.00 |
| 4 | NULL | NULL | 7 | 8600.00 |
| 4 | NULL | NULL | 5 | 24500.00 |
| 4 | NULL | NULL | 4 | 13400.00 |
| 4 | NULL | NULL | 3 | 20000.00 |
| 4 | NULL | NULL | 2 | 26000.00 |
| 4 | NULL | NULL | 1 | 28000.00 |
| 4 | NULL | NULL | 6 | 13000.00 |
| 5 | s.zh | NULL | 2 | 26000.00 |
| 5 | s.zh | NULL | 3 | 20000.00 |
| 5 | x.qx | NULL | 5 | 24500.00 |
| 5 | x.qx | NULL | 6 | 13000.00 |
| 5 | s.zh | NULL | 1 | 28000.00 |
| 5 | x.qx | NULL | 7 | 8600.00 |
| 5 | x.qx | NULL | 4 | 13400.00 |
| 6 | NULL | enginer | 1 | 28000.00 |
| 6 | NULL | finance | 4 | 13400.00 |
| 6 | NULL | tester | 3 | 20000.00 |
| 6 | NULL | finance | 5 | 24500.00 |
| 6 | NULL | kiccp | 7 | 8600.00 |
| 6 | NULL | enginer | 2 | 26000.00 |
| 6 | NULL | finance | 6 | 13000.00 |
| 7 | x.qx | finance | 5 | 24500.00 |
| 7 | x.qx | finance | 4 | 13400.00 |
| 7 | x.qx | kiccp | 7 | 8600.00 |
| 7 | s.zh | tester | 3 | 20000.00 |
| 7 | s.zh | enginer | 2 | 26000.00 |
| 7 | s.zh | enginer | 1 | 28000.00 |
| 7 | x.qx | finance | 6 | 13000.00 |
+---------------+---------------+-----------+----------+---------------+
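A useful invariant of this CUBE output: within every grouping set, the totals sum to the grand total (133500.00), since each set merely partitions the same rows differently. A Python check over the base rows (copied from the sample data):

```python
from collections import defaultdict
from itertools import combinations

# Base rows (company_name, dep_name, user_id, salary) from the sample partition.
rows = [
    ("s.zh", "enginer", 1, 28000.0),
    ("s.zh", "enginer", 2, 26000.0),
    ("s.zh", "tester", 3, 20000.0),
    ("x.qx", "finance", 4, 13400.0),
    ("x.qx", "finance", 5, 24500.0),
    ("x.qx", "finance", 6, 13000.0),
    ("x.qx", "kiccp", 7, 8600.0),
]

def cube_totals(rows):
    """Aggregate salary for every subset of the first three columns,
    mimicking WITH CUBE; returns {grouping_set_indices: {key: total}}."""
    out = {}
    for r in range(4):
        for idxs in combinations(range(3), r):
            acc = defaultdict(float)
            for row in rows:
                acc[tuple(row[i] for i in idxs)] += row[3]
            out[idxs] = dict(acc)
    return out

totals = cube_totals(rows)
```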