Hive ROLLUP, GROUPING SETS, CUBE aggregates and the GROUPING function

Reference articles:

1. GROUPING SETS, CUBE, and ROLLUP in Hive
https://blog.csdn.net/weixin_34352449/article/details/92169948

2. The GROUPING function
https://blog.csdn.net/u013053796/article/details/18619315

3. GROUP BY ... WITH ROLLUP: totals after grouping
https://blog.csdn.net/u013677636/article/details/52353812

 

When using Hive, we often need to compute aggregate statistics. The most common tool for this is plain GROUP BY.

But we frequently need statistics across multiple dimensions at once, and for that Hive provides dedicated aggregation extensions.

Here we explain the meaning and usage of ROLLUP, GROUPING SETS, and CUBE.

Let's walk through the usage rules of these functions with a case study.

 

Background for the statistics:

We have salary records for many employees across multiple companies and departments, and we need to aggregate salary along several dimensions.

 

First we create the base tables

use data_warehouse_test;


CREATE TABLE IF NOT EXISTS datacube_salary_org (
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,user_name STRING COMMENT 'user name'
 ,salary DECIMAL(10,2) COMMENT 'salary'
 ,create_time DATE COMMENT 'creation time'
 ,update_time DATE COMMENT 'update time'
) 
PARTITIONED BY(
 pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;


CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,salary DECIMAL(10,2) COMMENT 'salary'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
 company_name STRING COMMENT 'company name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
 total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
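The *_aggr tables above are not used again in this article, but as a sketch of how two of them could be filled in a single pass over the source data (my own example, using Hive's multi-table INSERT syntax and the partition value from the next section):

-- Populate the department- and company-level aggregates with one scan
-- of the source partition (Hive multi-table INSERT).
FROM datacube_salary_org o
INSERT OVERWRITE TABLE datacube_salary_dep_aggr
 SELECT o.company_name, o.dep_name, SUM(o.salary)
 WHERE o.pt = '20200407'
 GROUP BY o.company_name, o.dep_name
INSERT OVERWRITE TABLE datacube_salary_company_aggr
 SELECT o.company_name, SUM(o.salary)
 WHERE o.pt = '20200407'
 GROUP BY o.company_name;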

 

Then we import the base data in two ways:

1. Insert directly with SQL

2. LOAD from a file

1. Insert with SQL

INSERT OVERWRITE TABLE datacube_salary_org PARTITION (pt = '20200407') VALUES
 ('s.zh','enginer',1,'szh',28000.0,'2020-04-07','2020-04-07'),
 ('s.zh','enginer',2,'zyq',26000.0,'2020-04-03','2020-04-03'),
 ('s.zh','tester',3,'gkm',20000.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',4,'pip',13400.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',5,'kip',24500.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',6,'zxxc',13000.0,'2020-04-07','2020-04-07'),
 ('x.qx','kiccp',7,'xsz',8600.0,'2020-04-07','2020-04-07')
;

 

2. Via file LOAD

Create a txt file (org_data.txt, as used below) and fill it with the following:

s.zh,enginer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,enginer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07

We can load the file in either of the following two ways.

Method 1:

 Create the partition folder

 Move the file into it

 Repair the table

hdfs dfs -mkdir /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
hdfs dfs -put org_data.txt /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406

Execute the following statement on the beeline / hive client

MSCK REPAIR TABLE datacube_salary_org;
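To confirm that the repair registered the new directory as a partition, a quick sanity check (my own addition):

SHOW PARTITIONS datacube_salary_org;
-- pt=20200406 should now appear in the list.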

 

Method 2:

 Create the partition

 LOAD the file

The standard syntax of LOAD is:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]  

  

Create partition

ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200405');

LOAD data

LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt'  OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200405');
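After the LOAD finishes, a quick query (again my own addition) verifies that the rows landed in the new partition:

SELECT company_name, dep_name, user_id, salary
FROM datacube_salary_org
WHERE pt = '20200405'
LIMIT 10;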

 

 

With the data in place, let's look at how these functions are used.

ROLLUP, GROUPING SETS, CUBE

We will discuss them starting from GROUPING SETS, then ROLLUP, then CUBE.

 

GROUPING SETS

GROUPING SETS, as a clause of GROUP BY, allows the developer to specify several grouping combinations in a single statement. It can be understood simply as multiple GROUP BY queries whose results are combined through UNION ALL. A few examples below will help make this concrete.

 

 

First, let's learn GROUPING SETS. Its usage is as follows:

SELECT
 a
 ,b
 ...
 ,f
FROM test_table
GROUP BY
 a
 ,b
 ...
 ,f
GROUPING SETS ((?,...,?),(xxx),(yyy))

Multiple grouping combinations can be listed inside GROUPING SETS.

Each (?, ..., ?) can be any non-repeating subset of the columns a through f.

 

Specific examples are as follows:

SELECT 
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b
GROUPING SETS ((a),(a,b),())

Equivalent to

SELECT 
 a
 ,NULL
 ,SUM(c)
FROM test_table
GROUP BY
 a

UNION ALL

SELECT
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b

UNION ALL

SELECT
 NULL
 ,NULL
 ,SUM(c)
FROM test_table
;

 

In our actual case: we want to total employee salaries per company and overall, but in one statement.

How do we write the SQL?

The SQL is as follows:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

 

The output is as follows:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.13 sec   HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.73 sec   HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 860 msec
INFO  : Completed executing command(queryId=hive_20200408032038_18e04047-b8c0-4d07-a5de-00ccbc7cb4cc); Time taken: 51.459 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
+---------------+---------------+-----------+----------+---------------+

You can see that the row with grouping__id 0 gives the overall salary total, while the rows with grouping__id 1 give the per-company totals.

For the calculation method of GROUPING__ID and the GROUPING function, see my other article; the rules are also summarized later in this post.

 

 

 

We just said that GROUPING SETS is equivalent to GROUP BY + UNION ALL, but is the execution really the same? Let's run EXPLAIN and check.

Below are the EXPLAIN results of the two SQL statements that produce equivalent output.

SQL 1

EXPLAIN

SELECT 
*
FROM
(
SELECT
 0 AS mark
 ,NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'

UNION ALL

SELECT
 1 AS mark
 ,company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
) tmp
ORDER BY mark
;

 

SQL 2

EXPLAIN

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

 

 

First, here are the EXPLAIN results for the two statements.

The UNION ALL SQL

Note: every extra SELECT added to the UNION ALL increases the number of jobs.

INFO  : Starting task [Stage-4:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200408035719_94eb75b2-c6fc-4804-bacb-d3555c61e7f3); Time taken: 0.016 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1, Stage-3      |
|   Stage-3 is a root stage                          |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: salary (type: decimal(10,2)) |
|               outputColumnNames: salary            |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col0 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: 0 (type: int), null (type: string), _col0 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col4 |
|             Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
|                 outputColumnNames: _col0, _col1, _col4 |
|                 Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
|                 outputColumnNames: _col0, _col1, _col4 |
|                 Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), null (type: void), null (type: void), VALUE._col3 (type: decimal(20,2)) |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-3                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: company_name, salary |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 keys: company_name (type: string)  |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: 1 (type: int), _col0 (type: string), _col1 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col4 |
|             Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

 

GROUPING SETS SQL

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200408035951_8153b57c-9f60-4c2d-bc26-92659fdc8afd); Time taken: 0.007 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), salary (type: decimal(10,2)) |
|               outputColumnNames: company_name, dep_name, user_id, salary |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 keys: company_name (type: string), dep_name (type: string), user_id (type: bigint), 0 (type: int) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|                 Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
|                   sort order: ++++                 |
|                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
|                   Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: int) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col4 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: int)   |
|               sort order: +                        |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint), VALUE._col3 (type: decimal(20,2)) |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

 

Comparing the two plans, it is easy to see that the GROUPING SETS version needs fewer stages (two jobs instead of three).

In practice, the GROUPING SETS form also runs faster than the UNION ALL of GROUP BY queries.

Execution times for comparison are below.

GROUPING SETS

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.62 sec   HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.51 sec   HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 130 msec
INFO  : Completed executing command(queryId=hive_20200408045412_4ab9e09f-436e-4433-9a1f-a03d5b32ef3e); Time taken: 49.676 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
+---------------+---------------+-----------+----------+---------------+

 

 

GROUP BY via UNION ALL

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.7 sec   HDFS Read: 10541 HDFS Write: 119 SUCCESS
INFO  : Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 4.34 sec   HDFS Read: 10919 HDFS Write: 152 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 5.08 sec   HDFS Read: 12932 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 13 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20200408045141_2033cbf6-a457-4bdb-aaec-65900b386972); Time taken: 84.365 seconds
INFO  : OK
+-----------+----------+----------+----------+-------------------+
| tmp.mark  | tmp._c1  | tmp._c2  | tmp._c3  | tmp.total_salary  |
+-----------+----------+----------+----------+-------------------+
| 0         | NULL     | NULL     | NULL     | 133500.00         |
| 1         | x.qx     | NULL     | NULL     | 59500.00          |
| 1         | s.zh     | NULL     | NULL     | 74000.00          |
+-----------+----------+----------+----------+-------------------+

 

As the logs show, GROUPING SETS executes more efficiently (about 50 seconds versus 84 seconds, with roughly half the MapReduce CPU time).

 

 

 

ROLLUP

ROLLUP performs multi-level aggregation, rolling the GROUP BY columns up from right to left, which makes it a natural fit for statistics over a hierarchy (here: user within department within company).

That is,

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;

 

Equivalent to

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name),())
ORDER BY
 grouping__id
;

 

Equivalent to

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

UNION ALL

SELECT 
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

UNION ALL

SELECT 
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name

UNION ALL

SELECT 
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;

 

The result:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.21 sec   HDFS Read: 11674 HDFS Write: 563 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.91 sec   HDFS Read: 7448 HDFS Write: 602 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20200408052638_740f42b9-6f08-49a6-8123-9a77aedc6b19); Time taken: 50.563 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+

 

Now let's briefly go over how GROUPING__ID is calculated.

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;
 

There are three fields in the GROUP BY, in order: company_name, dep_name, user_id.

They can be seen as a 3-bit binary number:

(0/1), (0/1), (0/1)

The low bit corresponds to company_name, the high bit to user_id.

For a given result row, a column's bit is 1 when that column participates in the grouping (its value is shown), and 0 when it has been rolled up (shown as NULL); this is the convention all the outputs in this article follow.

Finally, GROUPING__ID is the decimal value of that binary number.
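For completeness, newer Hive releases also expose a per-column GROUPING function. A minimal sketch follows; it assumes Hive 2.3.0 or later, where grouping() was added. Note that 2.3.0 also switched GROUPING__ID to the SQL-standard convention (bit = 1 means the column was aggregated away, and the first GROUP BY column is the highest bit), which is the reverse of the outputs shown in this article.

SELECT
 grouping__id
 ,grouping(company_name) AS g_company
 ,grouping(dep_name) AS g_dep
 ,grouping(user_id) AS g_user
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
 WITH ROLLUP
ORDER BY
 grouping__id
;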

 

 

1) The fully grouped subset of ROLLUP

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

Corresponding result

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+

Because all three fields participate in the grouping, the binary value is 111, so GROUPING__ID is 7.

 

 

2) The (company_name, dep_name) subset of ROLLUP

SELECT 
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

Corresponding result

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
+---------------+---------------+-----------+----------+---------------+

Because the GROUP BY here covers only company_name and dep_name, we apply the rules above:

the company_name bit is 1, the dep_name bit is 1, and the user_id bit (the high bit) is 0.

So GROUPING__ID is 011 in binary, which is 3.

With that explained, let's take a look at the CUBE function.

 

 

 

CUBE

CUBE is short for "data cube": it aggregates over every combination of the selected dimensions. CUBE(a, b, c) groups by (a, b, c), then (a, b), (a, c), (b, c), then (a), (b), (c), and finally the whole table, so it computes the aggregation for all combinations of values in the selected columns.

 

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH CUBE
ORDER BY
 grouping__id
;


 

Equivalent to the UNION ALL of all eight grouping subsets, starting with (company_name, dep_name, user_id):

SELECT
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

UNION ALL 

SELECT
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

UNION ALL 

SELECT
 company_name
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,user_id
 

UNION ALL 

SELECT
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name

UNION ALL 

SELECT
 NULL
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 dep_name
 ,user_id

 
UNION ALL 

SELECT
 NULL
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 dep_name

 
UNION ALL 

SELECT
 NULL
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 user_id

UNION ALL 

SELECT
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;

 

The results are as follows:

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 2             | NULL          | finance   | NULL     | 50900.00      |
| 2             | NULL          | kiccp     | NULL     | 8600.00       |
| 2             | NULL          | tester    | NULL     | 20000.00      |
| 2             | NULL          | enginer   | NULL     | 54000.00      |
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 4             | NULL          | NULL      | 7        | 8600.00       |
| 4             | NULL          | NULL      | 5        | 24500.00      |
| 4             | NULL          | NULL      | 4        | 13400.00      |
| 4             | NULL          | NULL      | 3        | 20000.00      |
| 4             | NULL          | NULL      | 2        | 26000.00      |
| 4             | NULL          | NULL      | 1        | 28000.00      |
| 4             | NULL          | NULL      | 6        | 13000.00      |
| 5             | s.zh          | NULL      | 2        | 26000.00      |
| 5             | s.zh          | NULL      | 3        | 20000.00      |
| 5             | x.qx          | NULL      | 5        | 24500.00      |
| 5             | x.qx          | NULL      | 6        | 13000.00      |
| 5             | s.zh          | NULL      | 1        | 28000.00      |
| 5             | x.qx          | NULL      | 7        | 8600.00       |
| 5             | x.qx          | NULL      | 4        | 13400.00      |
| 6             | NULL          | enginer   | 1        | 28000.00      |
| 6             | NULL          | finance   | 4        | 13400.00      |
| 6             | NULL          | tester    | 3        | 20000.00      |
| 6             | NULL          | finance   | 5        | 24500.00      |
| 6             | NULL          | kiccp     | 7        | 8600.00       |
| 6             | NULL          | enginer   | 2        | 26000.00      |
| 6             | NULL          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | s.zh          | tester    | 3        | 20000.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | x.qx          | finance   | 6        | 13000.00      |
+---------------+---------------+-----------+----------+---------------+
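One last practical note: the NULL placeholders in these results can be replaced with a readable label so that each row is self-describing. A small sketch (the 'ALL' label is my own choice; user_id must be cast since it is a BIGINT):

SELECT
 COALESCE(company_name, 'ALL') AS company_name
 ,COALESCE(dep_name, 'ALL') AS dep_name
 ,COALESCE(CAST(user_id AS STRING), 'ALL') AS user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
 WITH CUBE
;

If the raw data itself may contain NULLs, rely on GROUPING__ID (or the grouping() function) rather than the NULL in the output to tell a rolled-up row from a genuine NULL value.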

 

 
