Hive: the GROUPING function and how GROUPING__ID is calculated with ROLLUP, GROUPING SETS, and CUBE

 

When aggregating with ROLLUP, GROUPING SETS, or CUBE, we often need the GROUPING function and GROUPING__ID.

For the usage of ROLLUP, GROUPING SETS, and CUBE themselves, please refer to my earlier article:

https://blog.csdn.net/u010003835/article/details/105353510

 

GROUPING function

 

We start with the GROUPING function.

The GROUPING function can be used with CUBE, ROLLUP, and GROUPING SETS. It helps you understand how each summary value was generated.

My Hive version seems to have a problem with it, so I could not run the experiment myself.

Instead, here is the link to the official documentation:

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Groupingfunction

 

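The following is quoted from the official documentation. Original data:

+---------------+-----------------+
| Column1 (key) | Column2 (value) |
+---------------+-----------------+
| 1             | NULL            |
| 1             | 1               |
| 2             | 2               |
| 3             | 3               |
| 3             | NULL            |
| 4             | 5               |
+---------------+-----------------+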

 

Grouping function

The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set. 

Going back to our example above, consider the following query:

SELECT key, value, GROUPING__ID,
  grouping(key, value), grouping(value, key), grouping(key), grouping(value),
  count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

This query will produce the following results.

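+------+-------+--------------+----------------------+----------------------+---------------+-----------------+----------+
| key  | value | GROUPING__ID | grouping(key, value) | grouping(value, key) | grouping(key) | grouping(value) | count(*) |
+------+-------+--------------+----------------------+----------------------+---------------+-----------------+----------+
| NULL | NULL  | 3            | 3                    | 3                    | 1             | 1               | 6        |
| 1    | NULL  | 0            | 0                    | 0                    | 0             | 0               | 2        |
| 1    | NULL  | 1            | 1                    | 2                    | 0             | 1               | 1        |
| 1    | 1     | 0            | 0                    | 0                    | 0             | 0               | 1        |
| 2    | NULL  | 1            | 1                    | 2                    | 0             | 1               | 1        |
| 2    | 2     | 0            | 0                    | 0                    | 0             | 0               | 1        |
| 3    | NULL  | 0            | 0                    | 0                    | 0             | 0               | 2        |
| 3    | NULL  | 1            | 1                    | 2                    | 0             | 1               | 1        |
| 3    | 3     | 0            | 0                    | 0                    | 0             | 0               | 1        |
| 4    | NULL  | 1            | 1                    | 2                    | 0             | 1               | 1        |
| 4    | 5     | 0            | 0                    | 0                    | 0             | 0               | 1        |
+------+-------+--------------+----------------------+----------------------+---------------+-----------------+----------+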
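The multi-argument columns in the result can be derived from the single-argument ones: the arguments are read left to right as the bits of one integer, so the first argument becomes the most significant bit. The Python sketch below only illustrates that arithmetic. The helper name `grouping` and the set-based representation are assumptions made for the example, not Hive code, and it uses the 0/1 convention from the quoted documentation (1 means the column is not part of the grouping set).

```python
def grouping(args, aggregated):
    """Combine per-column grouping bits into one integer.

    args:       column names in the order passed to grouping(...)
    aggregated: set of columns that are NOT part of the grouping set
                for the current row (each contributes a 1 bit)
    """
    result = 0
    for col in args:
        # Shift the bits collected so far to the left, then append this column's bit.
        result = (result << 1) | (1 if col in aggregated else 0)
    return result


# A row where only `value` was rolled up: grouping(key)=0, grouping(value)=1
rolled_up = {"value"}
print(grouping(["key", "value"], rolled_up))  # 1
print(grouping(["value", "key"], rolled_up))  # 2

# Grand-total row: both columns rolled up
print(grouping(["key", "value"], {"key", "value"}))  # 3
```

This is why grouping(key, value) and grouping(value, key) differ (1 versus 2) on the rows where only value is rolled up, while both are 3 on the grand-total row.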

 

 

 

 

GROUPING__ID calculation method

 

Below we explain the calculation method of GROUPING__ID

 

Data statistics background:

We have salary data for employees across several companies and departments, and we need to total the salaries along several dimensions.

 

 

Build the basic table

use data_warehouse_test;

CREATE TABLE IF NOT EXISTS datacube_salary_org (
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,user_name STRING COMMENT 'user name'
 ,salary DECIMAL(10,2) COMMENT 'salary'
 ,create_time DATE COMMENT 'creation time'
 ,update_time DATE COMMENT 'modification time'
) 
PARTITIONED BY(
 pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

 

Import Data

Create a txt file and fill in the following:

s.zh,engineer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,engineer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07

 

Create the partition and load the file

LOAD syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Create the partition (its value must match the pt filter used in the queries below)

ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200407');

Load the data

LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt' OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200407');

 

Below we use ROLLUP to look at how GROUPING__ID is calculated.

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;

 

 

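The GROUP BY lists three fields, in this order: company_name, dep_name, user_id.

GROUPING__ID can therefore be read as a 3-bit binary number, one bit per field:

That is (0/1), (0/1), (0/1) 

The low-order bit corresponds to company_name

The high-order bit corresponds to user_id 

In the results shown in this article, a bit is 1 when its field takes part in the grouping for that row, and 0 when the field has been rolled up. This is the opposite of the 0/1 meaning of the GROUPING function quoted above. The official documentation describes a change in Hive 2.3.0: from that version on, a bit is 1 when the field has been aggregated away, so newer versions can return different IDs for the same query.

Finally, GROUPING__ID is the decimal value of that binary number.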
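The rule can be sketched as a few lines of Python. This is only an illustration under the convention visible in this article's result tables (a bit is 1 when the field is part of the grouping, and company_name is the low-order bit); the function name `grouping_id_legacy` is made up for the example and is not a Hive API.

```python
def grouping_id_legacy(group_by_cols, grouping_set):
    """GROUPING__ID as it appears in this article's results.

    Assumption: bit i, counted from the low-order end, belongs to the
    i-th GROUP BY column and is 1 when that column is in the grouping set.
    """
    gid = 0
    for i, col in enumerate(group_by_cols):
        if col in grouping_set:
            gid |= 1 << i
    return gid


cols = ["company_name", "dep_name", "user_id"]

# ROLLUP over (a, b, c) groups by the prefixes (a, b, c), (a, b), (a) and ().
rollup_sets = [cols[:n] for n in range(len(cols), -1, -1)]

for grouping_set in rollup_sets:
    gid = grouping_id_legacy(cols, grouping_set)
    print(f"{gid} = {gid:03b}  {grouping_set}")
```

Under these assumptions the four ROLLUP levels map to 7 (111), 3 (011), 1 (001) and 0 (000). Treat the numbers as specific to the output shown here, since a Hive version with the newer GROUPING__ID definition can number the same levels differently.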
 

 

1) The ROLLUP subset grouped by all fields

Equivalent to the following SQL statement

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

Corresponding rows in the ROLLUP result

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | engineer  | 2        | 26000.00      |
| 7             | s.zh          | engineer  | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+

 

Because all three fields take part in the GROUP BY, every bit is 1: the binary value is 111, so GROUPING__ID is 7.

 

 

 

2) The ROLLUP subset grouped by company_name, dep_name

Equivalent to SQL

SELECT 
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

Corresponding rows in the ROLLUP result

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | engineer  | NULL     | 54000.00      |
+---------------+---------------+-----------+----------+---------------+

 

Applying the rule above:

The GROUP BY lists three fields, in this order: company_name, dep_name, user_id, so GROUPING__ID is a 3-bit binary number, (0/1), (0/1), (0/1).

The low-order bit corresponds to company_name

The high-order bit corresponds to user_id 

This subset groups by company_name and dep_name only, so the two low-order bits are 1 and the user_id bit is 0.

GROUPING__ID is therefore 011 in binary, which is 3.
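To check the totals and IDs by hand, the whole ROLLUP can be imitated in Python over the seven sample rows. This is a rough sketch, not how Hive executes the query: the `rollup` helper is hypothetical, and the IDs follow the convention seen in the result tables above (leading GROUP BY columns occupy the low-order bits and are 1 when they are grouped on).

```python
from collections import defaultdict
from decimal import Decimal

# Sample rows from the article: (company_name, dep_name, user_id, salary)
ROWS = [
    ("s.zh", "engineer", 1, Decimal("28000.00")),
    ("s.zh", "engineer", 2, Decimal("26000.00")),
    ("s.zh", "tester", 3, Decimal("20000.00")),
    ("x.qx", "finance", 4, Decimal("13400.00")),
    ("x.qx", "finance", 5, Decimal("24500.00")),
    ("x.qx", "finance", 6, Decimal("13000.00")),
    ("x.qx", "kiccp", 7, Decimal("8600.00")),
]


def rollup(rows, n_cols):
    """Return (grouping_id, key_tuple, total) for every ROLLUP level."""
    out = []
    for keep in range(n_cols, -1, -1):  # group by the first `keep` columns
        totals = defaultdict(Decimal)   # Decimal() is 0
        for row in rows:
            # Rolled-up columns are reported as None (NULL in Hive).
            key = row[:keep] + (None,) * (n_cols - keep)
            totals[key] += row[n_cols]
        # The `keep` low-order bits are 1: 7, 3, 1, 0 for three columns.
        gid = (1 << keep) - 1
        for key, total in sorted(totals.items(), key=lambda kv: str(kv[0])):
            out.append((gid, key, total))
    return out


for gid, key, total in rollup(ROWS, 3):
    print(gid, *key, total)
```

The rows printed with ID 7 and ID 3 match the two result tables above (for example 50900.00 for x.qx / finance). The sketch also prints the two remaining ROLLUP levels, per company and the grand total, which under the same assumption get IDs 1 and 0.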

 


Origin blog.csdn.net/u010003835/article/details/105373889