When using ROLLUP, GROUPING SETS, or CUBE, we also need the GROUPING function and GROUPING__ID.
For the usage of these three clauses, please refer to my article:
https://blog.csdn.net/u010003835/article/details/105353510
GROUPING function
First, let's look at the GROUPING function.
The GROUPING function can be used together with CUBE, ROLLUP, and GROUPING SETS; it tells you how each summary row was produced.
My Hive version seems to have a problem with this function, so I could not run the experiment myself. The explanation below is taken directly from the official documentation.
Original data:
Column1 (key) | Column2 (value) |
---|---|
1 | NULL |
1 | 1 |
2 | 2 |
3 | 3 |
3 | NULL |
4 | 5 |
Grouping function
The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set.
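Going back to our example above, consider a query that selects key, value, GROUPING__ID, grouping(key, value), grouping(value, key), grouping(key), grouping(value), and count(*), grouped by key, value WITH ROLLUP.
This query produces the following results.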
Column 1 (key) | Column 2 (value) | GROUPING__ID | grouping(key, value) | grouping(value, key) | grouping(key) | grouping(value) | count(*) |
---|---|---|---|---|---|---|---|
NULL | NULL | 3 | 3 | 3 | 1 | 1 | 6 |
1 | NULL | 0 | 0 | 0 | 0 | 0 | 2 |
1 | NULL | 1 | 1 | 2 | 0 | 1 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | NULL | 1 | 1 | 2 | 0 | 1 | 1 |
2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | NULL | 0 | 0 | 0 | 0 | 0 | 2 |
3 | NULL | 1 | 1 | 2 | 0 | 1 | 1 |
3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | NULL | 1 | 1 | 2 | 0 | 1 | 1 |
4 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
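The grouping(...) columns in the table above follow directly from which columns are rolled up in each row. Below is a minimal Python sketch of that rule. It assumes the convention quoted from the official documentation (a bit is 1 when the column is not part of the grouping set) and that the first argument is the most significant bit. The helper name `grouping` and its parameters are illustrative only, not a Hive API.

```python
def grouping(args, grouping_set):
    """Mimic grouping(col1, col2, ...) for one output row.

    args         -- column names passed to grouping(), in order
    grouping_set -- columns the row is actually grouped by
    Assumption: a bit is 1 when the column is NOT in the grouping set,
    and the first argument is the most significant bit.
    """
    value = 0
    for col in args:
        bit = 0 if col in grouping_set else 1
        value = (value << 1) | bit
    return value


# The three kinds of rows produced by GROUP BY key, value WITH ROLLUP.
row_kinds = [
    ("detail", {"key", "value"}),  # grouped by key and value
    ("subtotal", {"key"}),         # value rolled up
    ("total", set()),              # both columns rolled up
]

for name, gs in row_kinds:
    print(name,
          grouping(["key", "value"], gs),
          grouping(["value", "key"], gs),
          grouping(["key"], gs),
          grouping(["value"], gs))
```

Under these assumptions the subtotal rows give 1, 2, 0, 1 and the grand-total row gives 3, 3, 1, 1, which agrees with the grouping columns of the table.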
GROUPING__ID calculation method
Next, we explain how GROUPING__ID is calculated.
Background:
We have salary data for multiple companies, multiple departments, and multiple employees, and we need to aggregate salary along several dimensions.
Create the base table
use data_warehouse_test;
CREATE TABLE IF NOT EXISTS datacube_salary_org (
    company_name STRING COMMENT 'company name'
    ,dep_name STRING COMMENT 'department name'
    ,user_id BIGINT COMMENT 'user id'
    ,user_name STRING COMMENT 'user name'
    ,salary DECIMAL(10,2) COMMENT 'salary'
    ,create_time DATE COMMENT 'creation time'
    ,update_time DATE COMMENT 'update time'
)
PARTITIONED BY(
    pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;
Import the data
Create a txt file with the following content:
s.zh,engineer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,engineer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07
Create the partition and load the file
Syntax of LOAD:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Create the partition (its pt value must match the one used in the queries below):
ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200407');
Load the data:
LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt' OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200407');
Next, we run a ROLLUP query and look at how GROUPING__ID is calculated.
SELECT
grouping__id
,company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
WITH ROLLUP
ORDER BY
grouping__id
;
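The GROUP BY clause has three fields, in this order: company_name, dep_name, user_id.
GROUPING__ID can be read as a 3-bit binary number, one bit (0 or 1) per field:
The low-order bit corresponds to company_name.
The high-order bit corresponds to user_id.
In the results below, a bit is 1 when its field is part of the grouping for that row, and 0 when the field has been rolled up and shows NULL. This is the behavior of Hive versions before 2.3.0. The official documentation quoted earlier describes the opposite convention, which the GROUPING function also follows: 0 for a field that is part of the grouping set, 1 for a field that is not.
Finally, GROUPING__ID is the decimal value of that binary number.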
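The bit layout above can be sketched in a few lines of Python. The sketch assumes the convention that matches the ids shown in this article's results (a bit is 1 when its field is grouped in that row, and the first GROUP BY field sits in the lowest bit). `grouping_id` and `as_bits` are hypothetical helper names, not Hive functions.

```python
GROUP_BY_FIELDS = ["company_name", "dep_name", "user_id"]


def grouping_id(grouped_fields, group_by_fields=GROUP_BY_FIELDS):
    """Return GROUPING__ID under the convention used in this article.

    Assumption: the i-th GROUP BY field owns bit i (the first field is the
    lowest bit), and the bit is 1 when the field is grouped in that row.
    """
    gid = 0
    for position, field in enumerate(group_by_fields):
        if field in grouped_fields:
            gid |= 1 << position
    return gid


def as_bits(gid, width=len(GROUP_BY_FIELDS)):
    """Format the id as a fixed-width binary string, high bit first."""
    return format(gid, "0{}b".format(width))


all_fields = grouping_id({"company_name", "dep_name", "user_id"})
two_fields = grouping_id({"company_name", "dep_name"})
print(as_bits(all_fields), all_fields)  # 111 7
print(as_bits(two_fields), two_fields)  # 011 3
```

On a Hive version that follows the convention of the official documentation, the bit values and bit order would have to be flipped in this sketch.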
1) ROLLUP subset grouped by all fields
This is equivalent to the following SQL statement:
SELECT
company_name
,dep_name
,user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
,user_id
Corresponding result:
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | engineer  | 2        | 26000.00      |
| 7             | s.zh          | engineer  | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+
Because all three fields are in the GROUP BY, every bit is 1: the binary value is 111, so GROUPING__ID is 7.
2) ROLLUP subset grouped by company_name, dep_name
This is equivalent to the following SQL statement:
SELECT
company_name
,dep_name
,NULL AS user_id
,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
company_name
,dep_name
Corresponding result:
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | engineer  | NULL     | 54000.00      |
+---------------+---------------+-----------+----------+---------------+
Applying the rules above:
The three GROUP BY fields, company_name, dep_name, user_id, map to a 3-bit binary number, with company_name as the low-order bit and user_id as the high-order bit.
This subset groups by company_name and dep_name only, so the two low-order bits are 1 and the user_id bit is 0.
GROUPING__ID is therefore binary 011, which is 3.
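To cross-check the subtotals and ids outside Hive, the ROLLUP over the sample file can be imitated in plain Python. This is only a sketch under the same assumption as the results above (a bit is 1 when its field is grouped, with the first field in the lowest bit). The function name `rollup` is hypothetical, and the ids 1 and 0 that it yields for the company-level and grand-total rows are what this convention implies, not output copied from Hive.

```python
from collections import defaultdict
from decimal import Decimal

# Sample rows from the data file: (company_name, dep_name, user_id, salary).
ROWS = [
    ("s.zh", "engineer", 1, Decimal("28000.0")),
    ("s.zh", "engineer", 2, Decimal("26000.0")),
    ("s.zh", "tester", 3, Decimal("20000.0")),
    ("x.qx", "finance", 4, Decimal("13400.0")),
    ("x.qx", "finance", 5, Decimal("24500.0")),
    ("x.qx", "finance", 6, Decimal("13000.0")),
    ("x.qx", "kiccp", 7, Decimal("8600.0")),
]
N_FIELDS = 3


def rollup(rows):
    """Return {(grouping_id, company, dep, user): total_salary}.

    Level k keeps the first k GROUP BY fields and replaces the rest with
    None, which is the shape WITH ROLLUP produces. Assumption: the id sets
    bit i for every kept field i, so keeping the first k fields gives
    2**k - 1.
    """
    totals = defaultdict(Decimal)
    for row in rows:
        fields, salary = row[:N_FIELDS], row[N_FIELDS]
        for kept in range(N_FIELDS, -1, -1):
            key = fields[:kept] + (None,) * (N_FIELDS - kept)
            gid = (1 << kept) - 1
            totals[(gid,) + key] += salary
    return dict(totals)


result = rollup(ROWS)
# Sort by id, then by the textual form of the key (keys mix None and str).
for key in sorted(result, key=lambda k: (k[0], str(k[1:]))):
    print(key, result[key])
```

The rows with id 7 and id 3 reproduce the two result tables shown above, for example 54000 for s.zh / engineer and 50900 for x.qx / finance.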