Hive enhanced aggregation, cube, grouping and rollup




The old rhyme of the 24 solar terms runs from the spring rains and the Awakening of Insects, through Grain in Ear and the summer heat, on to the autumn dew and Frost's Descent, and ends with the winter snows and the Minor and Major Cold. Today is Major Snow (大雪), one of the last solar terms of 2023. After Major Snow, temperatures drop noticeably across the country and the cold air in the north becomes more active, so please keep warm.

Getting to the point: taking Hive as the baseline, this article introduces the enhanced GROUP BY syntax and related clauses in the Hive, Spark, and Presto query engines. Through a multidimensional analysis case, it explains each engine's enhanced multidimensional aggregation syntax in detail, the differences between the three, and some common problems encountered in use.

1. Overview of multidimensional analysis


In multidimensional analysis scenarios we often use higher-order aggregation clauses such as GROUPING SETS, CUBE, and ROLLUP. Engines such as Hive, Spark, and Presto all provide similar higher-order aggregation features for computing aggregate statistics over different combinations of dimensions.

The official Hive documentation files this kind of analysis under the GROUP BY clause as "Enhanced Aggregation, Cube, Grouping and Rollup".

So, what are enhanced aggregation and multidimensional analysis?

Enhanced aggregation refers to using clauses such as GROUPING SETS, CUBE, and ROLLUP in a SQL group-aggregation query. Common query engines such as Hive, Spark, Presto, and FlinkSQL basically all support this syntax. Using enhanced aggregation not only simplifies the SQL, but can also improve the performance of the statement.

Multidimensional analysis refers to analysis over combinations of multiple dimensions, rather than over each dimension separately. In such scenarios, the dimension columns that make up any given combination can be recovered from the result, which supports self-service use of filters in charts.

Multidimensional analysis is mainly used for multidimensional aggregation, that is, combining dimensions in different ways and aggregating the results for each combination.
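As a rough sketch of what aggregating over combinations of dimensions looks like in practice, the queries below use Hive syntax against a hypothetical table `sales(region, channel, amount)`; the table and column names are assumptions made only for illustration. Other engines may spell the clauses differently (Presto, for instance, writes `GROUP BY CUBE (region, channel)`).

```sql
-- Hypothetical table, for illustration only:
--   sales(region STRING, channel STRING, amount INT)

-- GROUPING SETS: list exactly the dimension combinations you want.
SELECT region, channel, SUM(amount) AS total
FROM sales
GROUP BY region, channel
GROUPING SETS ((region, channel), (region), ());

-- ROLLUP: hierarchical subtotals, i.e. (region, channel), (region), ().
SELECT region, channel, SUM(amount) AS total
FROM sales
GROUP BY region, channel WITH ROLLUP;

-- CUBE: every combination, i.e. (region, channel), (region), (channel), ().
SELECT region, channel, SUM(amount) AS total
FROM sales
GROUP BY region, channel WITH CUBE;
```

In each result, a dimension that does not take part in a given combination comes back as NULL in that row.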

2. GROUPING SETS multi-dimensional grouping


Hive's official description of GROUPING SETS is as follows:

The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set. All GROUPING SETS clauses can be logically expressed in terms of several GROUP BY queries connected by UNION.

Hive official documentation: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

Simply put, GROUPING SETS specifies several groups of dimensions as the grouping rules for GROUP BY and then combines the results. The effect is equivalent to running a separate GROUP BY for each group of dimensions and then combining the results with UNION.

For example, here are some GROUPING SETS queries and their equivalent GROUP BY queries:

-- Example 1:
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )
-- Equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

-- Example 2:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)
-- Equivalent to
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

-- Example 3:
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)
-- Equivalent to
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b

-- Example 4:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )
-- Equivalent to
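SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
UNION
SELECT null, null, SUM( c ) FROM tab1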
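
To make the equivalence concrete, here is a small sketch that runs the `GROUPING SETS ( (a,b), a )` query on made-up data. The column types of `tab1` and the three sample rows are assumptions for illustration; they are not from the Hive documentation.

```sql
-- Illustrative data only; tab1's column types are an assumption.
CREATE TABLE tab1 (a STRING, b STRING, c INT);
INSERT INTO tab1 VALUES ('x', 'p', 1), ('x', 'q', 2), ('y', 'p', 3);

SELECT a, b, SUM(c) AS total
FROM tab1
GROUP BY a, b
GROUPING SETS ((a, b), a);

-- Expected rows (row order is not guaranteed):
--   x   p      1     <- from the (a, b) grouping set
--   x   q      2
--   y   p      3
--   x   NULL   3     <- from the a grouping set: 1 + 2
--   y   NULL   3
```

The first three rows are what a plain `GROUP BY a, b` would return, and the last two are what `GROUP BY a` would return with b filled in as NULL.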


Origin blog.csdn.net/weixin_55629186/article/details/134856036