In-depth MaxCompute - Episode 11 - QUALIFY

Introduction: MaxCompute supports the QUALIFY clause for filtering the results of window functions, which makes queries more concise and easier to understand. QUALIFY relates to window functions the way HAVING relates to aggregate functions with GROUP BY.

MaxCompute (formerly ODPS) is an industry-leading distributed big data processing platform developed in-house by Alibaba Cloud. It is widely used within Alibaba Group and supports the core businesses of multiple business units (BUs). Besides continuously optimizing performance, MaxCompute also works to improve the user experience and expressiveness of its SQL language and the productivity of MaxCompute developers.

MaxCompute is built on the new-generation SQL engine of MaxCompute 2.0, which significantly improves the usability of SQL compilation and the expressiveness of the language. This series of in-depth MaxCompute articles covers:

Episode 1 - Make good use of MaxCompute compiler errors and warnings
Episode 2 - New basic data types and built-in functions
Episode 3 - Complex types
Episode 4 - CTE, VALUES, SEMIJOIN
Episode 5 - SELECT TRANSFORM
Episode 6 - User Defined Type
Episode 7 - Grouping Set, Cube and Rollup
Episode 8 - Dynamic type function
Episode 9 - Script mode and parameter view
Episode 10 - IF ELSE branch statement

This article describes MaxCompute's support for the QUALIFY clause. QUALIFY takes a filter condition and applies it to the results of window functions, much as HAVING filters data after aggregate functions and GROUP BY.

Introduction to QUALIFY

Syntax format

QUALIFY [expression]

QUALIFY filters the results of window functions. Its relationship to window functions is comparable to that of HAVING to aggregate functions with GROUP BY.
The clauses of a typical query are executed in the following order:

  1. FROM
  2. WHERE
  3. GROUP BY and aggregate functions
  4. HAVING
  5. WINDOW
  6. QUALIFY
  7. DISTINCT
  8. ORDER BY
  9. LIMIT

QUALIFY is therefore evaluated after the window functions and filters the rows they produce.

Usage scenarios

QUALIFY is useful whenever the results of a window function need to be filtered. Before QUALIFY was supported, this typically required a subquery in the FROM clause whose output was filtered by a WHERE condition, as follows:

SELECT col1, col2
FROM
(
SELECT
t.a as col1,
sum(t.a) over (partition by t.b) as col2
FROM values (1, 2),(2,3),(2,2),(1,3),(4,2) t(a, b)
)
WHERE col2 > 4;

The same query rewritten with QUALIFY:

SELECT 
t.a as col1, 
sum(t.a) over (partition by t.b) as col2 
FROM values (1, 2),(2,3),(2,2),(1,3),(4,2)  t(a, b) 
QUALIFY col2 > 4;

You can also filter on the window function expression directly, without using an alias.

SELECT t.a as col1,
sum(t.a) over (partition by t.b) as col2
FROM values (1, 2),(2,3),(2,2),(1,3),(4,2) t(a, b)
QUALIFY sum(t.a) over (partition by t.b)  > 4;
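
To see which rows the three equivalent queries above keep, the window sum and the filter can be traced in plain Python. This is only an illustrative sketch of the semantics, not MaxCompute code; the names `rows`, `partition_sum`, `with_window` and `kept` are made up for the example, and a real query does not guarantee any particular row order.

```python
from collections import defaultdict

# Same literal rows as the VALUES clause above: (a, b)
rows = [(1, 2), (2, 3), (2, 2), (1, 3), (4, 2)]

# sum(t.a) over (partition by t.b): one total per distinct value of b
partition_sum = defaultdict(int)
for a, b in rows:
    partition_sum[b] += a

# Attach the window result to every row as (col1, col2) ...
with_window = [(a, partition_sum[b]) for a, b in rows]

# ... then apply the QUALIFY-style filter col2 > 4
kept = [(col1, col2) for col1, col2 in with_window if col2 > 4]

print(kept)
```

Only the rows with b = 2 survive, because that partition sums to 7 while the b = 3 partition sums to 3.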

QUALIFY is written the same way as WHERE and HAVING, but it is evaluated at a different point in the query, so its condition can involve window functions. This allows fairly complex conditions, such as:

SELECT *
FROM values (1, 2) t(a, b)
-- t1 is assumed to be an existing table with a column a
QUALIFY sum(t.a) over (partition by t.b) IN (SELECT a FROM t1);

QUALIFY is evaluated after the window functions have been computed. The following more complex example makes the execution order of QUALIFY easier to see:

SELECT a, b, max(c)
FROM values (1, 2, 3),(1, 2, 4),(1, 3, 5),(2, 3, 6),(2, 4, 7),(3, 4, 8) t(a, b, c)
WHERE a < 3
GROUP BY a, b
HAVING max(c) > 5
QUALIFY sum(b) over (partition by a) > 3; 
--+------------+------------+------------+
--| a          | b          | _c2        |
--+------------+------------+------------+
--| 2          | 3          | 6          |
--| 2          | 4          | 7          |
--+------------+------------+------------+
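
The result above can be reproduced by walking through the execution order one stage at a time. The following Python sketch is a hand-written trace under the stated order (WHERE, GROUP BY, HAVING, WINDOW, QUALIFY), not how MaxCompute is implemented; all variable names are invented for illustration.

```python
from collections import defaultdict

# Same literal rows as the VALUES clause above: (a, b, c)
rows = [(1, 2, 3), (1, 2, 4), (1, 3, 5), (2, 3, 6), (2, 4, 7), (3, 4, 8)]

# Steps 1-2. FROM + WHERE a < 3
filtered = [r for r in rows if r[0] < 3]

# Step 3. GROUP BY a, b with max(c)
groups = {}
for a, b, c in filtered:
    groups[(a, b)] = max(c, groups.get((a, b), c))

# Step 4. HAVING max(c) > 5
having = [(a, b, m) for (a, b), m in groups.items() if m > 5]

# Step 5. WINDOW: sum(b) over (partition by a), computed only on the
# rows that survived HAVING
window_sum = defaultdict(int)
for a, b, _ in having:
    window_sum[a] += b

# Step 6. QUALIFY sum(b) over (partition by a) > 3
result = sorted(r for r in having if window_sum[r[0]] > 3)

print(result)
```

The groups with a = 1 are already removed by HAVING, so the window sum is computed over the two remaining a = 2 groups only (3 + 4 = 7), and both pass the QUALIFY condition.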

Example

This example uses the row_number window function. All employees are partitioned by department (deptno), each partition is sorted by salary (sal) in descending order, and every employee receives a sequence number within their own partition. To query the three highest-paid employees in each department, proceed as follows.

  • Data preparation

    create table if not exists emp
     (empno string,
      ename string,
      job string,
      mgr string,
      hiredate string,
      sal string,
      comm string,
      deptno string);
    
    insert into table emp values
    ('7369','SMITH','CLERK','7902','1980-12-17 00:00:00','800','','20')
    ,('7499','ALLEN','SALESMAN','7698','1981-02-20 00:00:00','1600','300','30')
    ,('7521','WARD','SALESMAN','7698','1981-02-22 00:00:00','1250','500','30')
    ,('7566','JONES','MANAGER','7839','1981-04-02 00:00:00','2975','','20')
    ,('7654','MARTIN','SALESMAN','7698','1981-09-28 00:00:00','1250','1400','30')
    ,('7698','BLAKE','MANAGER','7839','1981-05-01 00:00:00','2850','','30')
    ,('7782','CLARK','MANAGER','7839','1981-06-09 00:00:00','2450','','10')
    ,('7788','SCOTT','ANALYST','7566','1987-04-19 00:00:00','3000','','20')
    ,('7839','KING','PRESIDENT','','1981-11-17 00:00:00','5000','','10')
    ,('7844','TURNER','SALESMAN','7698','1981-09-08 00:00:00','1500','0','30')
    ,('7876','ADAMS','CLERK','7788','1987-05-23 00:00:00','1100','','20')
    ,('7900','JAMES','CLERK','7698','1981-12-03 00:00:00','950','','30')
    ,('7902','FORD','ANALYST','7566','1981-12-03 00:00:00','3000','','20')
    ,('7934','MILLER','CLERK','7782','1982-01-23 00:00:00','1300','','10')
    ,('7948','JACCKA','CLERK','7782','1981-04-12 00:00:00','5000','','10')
    ,('7956','WELAN','CLERK','7649','1982-07-20 00:00:00','2450','','10')
    ,('7956','TEBAGE','CLERK','7748','1982-12-30 00:00:00','1300','','10')
    ;
    
  • Using a subquery in the FROM clause and filtering with a WHERE condition:

    SELECT  a.*
    FROM    (
              SELECT  deptno
                      ,ename
                      ,sal
                      -- sal is stored as STRING, so cast it to sort by numeric value
                      ,ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY CAST(sal AS BIGINT) DESC ) AS nums
              FROM    emp
          ) a
    WHERE a.nums<=3
    ;
    
  • The same query written with QUALIFY:

    SELECT  deptno
          ,ename
          ,sal
          -- same numeric cast as in the subquery version
          ,ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY CAST(sal AS BIGINT) DESC ) AS nums
    FROM    emp
    QUALIFY nums <= 3
    ;
    

Both queries return the same result, but the QUALIFY version is more concise and easier to understand.
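
The top-3-per-department logic can also be traced in Python. This is a rough sketch of what ROW_NUMBER() followed by a filter on the row number does; it uses only a subset of the emp rows, chosen so that no two salaries tie within a department (with ties, the order among tied rows is arbitrary), and it stores sal as a number so the sort is by value. The names `emp`, `ordered` and `top3` are hypothetical.

```python
from itertools import groupby

# A few (deptno, ename, sal) rows taken from the emp data above
emp = [
    ("20", "SMITH", 800), ("20", "JONES", 2975), ("20", "SCOTT", 3000),
    ("20", "ADAMS", 1100), ("30", "ALLEN", 1600), ("30", "WARD", 1250),
    ("30", "BLAKE", 2850), ("30", "TURNER", 1500),
]

# PARTITION BY deptno ORDER BY sal DESC
ordered = sorted(emp, key=lambda r: (r[0], -r[2]))

top3 = []
for deptno, members in groupby(ordered, key=lambda r: r[0]):
    # ROW_NUMBER() starts at 1 inside each partition
    for nums, (_, ename, sal) in enumerate(members, start=1):
        if nums <= 3:  # the filter on nums (WHERE or QUALIFY)
            top3.append((deptno, ename, sal, nums))

for row in top3:
    print(row)
```

Each department contributes at most three rows, numbered 1 to 3 in descending salary order.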

Precautions

  • QUALIFY requires at least one window function in the query. Using QUALIFY without one fails with the error: FAILED: ODPS-0130071:[3,1] Semantic analysis exception - use QUALIFY clause without window function. The following query triggers this error.

    SELECT * 
    FROM values (1, 2) t(a, b) 
    QUALIFY a > 1;
    
  • A QUALIFY condition may reference column aliases defined in the SELECT list. For example:

    SELECT 
    sum(t.a) over (partition by t.b) as c1 
    FROM values (1, 2) t(a, b) 
    QUALIFY c1 > 1;
    

Origin blog.csdn.net/weixin_48534929/article/details/132603218