Hive SQL aggregate functions with rows between / range between: a detailed explanation

I. Usage of rows between and range between

1. Analysis of relevant keywords

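unbounded: no boundary
preceding: before the current row
following: after the current row
unbounded preceding: all rows before the current row, i.e. the frame starts at the first row of the partition
n preceding: n rows before the current row
unbounded following: all rows after the current row, i.e. the frame ends at the last row of the partition
n following: n rows after the current row
current row: the current row

Syntax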
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

2. rows between ... and ...

rows: the frame is defined by counting rows relative to the current row, so its bounds are physical row positions.

For example, rows between 1 preceding and 1 following covers the row before the current row, the current row itself, and the row after it.
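As a rough illustration of the positional rule, here is a minimal plain-Python model of a rows frame. It is not Hive code: the function name rows_frame_sum is hypothetical, and the input list is assumed to be a single partition that is already sorted in order by order.

```python
# Plain-Python model of a ROWS frame (hypothetical helper, not a Hive API).
def rows_frame_sum(values, preceding, following):
    """Sum, for each row, the rows from `preceding` before it to `following` after it.

    `values` is assumed to be one partition, already sorted in ORDER BY order.
    """
    sums = []
    for i in range(len(values)):
        start = max(0, i - preceding)                # clip at the first row of the partition
        stop = min(len(values), i + following + 1)   # clip at the last row of the partition
        sums.append(sum(values[start:stop]))
    return sums


scores = [1, 2, 3, 4, 5, 5]
print(rows_frame_sum(scores, 1, 1))  # [3, 6, 9, 12, 14, 10]
```

The frame shrinks at the edges of the partition: the first row has no previous row and the last row has no next row.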

3. range between ... and ...

range: the frame is defined by values rather than positions. Rows are sorted by the order by expression, and the frame contains every row whose sort value lies between the current row's value minus the lower offset and the current row's value plus the upper offset. Its bounds are logical rows.

For example, sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) partitions by id, sorts by score in ascending order, and for each row sums the scores of all rows in the partition whose score falls in [current score - 1, current score + 1].
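The value-based rule can be sketched the same way in plain Python. This is only a model under simplifying assumptions, not Hive code: the name range_frame_sum is hypothetical, the sort is ascending, and the summed column is also the order by column, as in the example above.

```python
# Plain-Python model of a RANGE frame (hypothetical helper, not a Hive API).
def range_frame_sum(values, preceding, following):
    """Sum, for each row, every value in [v - preceding, v + following].

    Assumes one partition, ascending order, and that the summed column
    is the same numeric column used in ORDER BY.
    """
    sums = []
    for v in values:
        low, high = v - preceding, v + following
        # Membership depends on the value, so tied rows always enter together.
        sums.append(sum(x for x in values if low <= x <= high))
    return sums


scores = [1, 2, 3, 4, 5, 5]
print(range_frame_sum(scores, 1, 1))  # [3, 6, 9, 17, 14, 14]
```

Because membership is decided by value, the row with score 4 picks up both rows with score 5, which a positional frame would not do.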

That is a bit of a mouthful, so let's work through an example.

II. Examples

1. Data preparation

Suppose there is a table datadev.t_student with the following data:

id score
stu_1 1
stu_1 2
stu_1 3
stu_1 4
stu_1 5
stu_1 5

2. Test rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4,
sum(score) over (PARTITION by id order by 1) as a5
from datadev.t_student;

The test results are as follows:

Analysis:

  1. sum(score) over (PARTITION by id) as a1: partitions by id and sums every score in the partition. This is the most familiar form.
  2. sum(score) over (PARTITION by id order by score) as a2: sorts by score and sums from the first row through the current row. Unlike the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame in a3, rows with the same score are peers and enter the frame together, so both rows with score 5 get 20. This is similar to how rank gives tied rows the same rank.
  3. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3: sums from the first row through the current row by position. Unlike a2, rows with the same score are treated as separate rows, so a tied row that comes later is not included: the two rows with score 5 get 15 and 20. This is similar to how row_number numbers tied rows separately.
  4. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4: sums from the first row to the last row of the partition, so the result is the same as a1.
  5. sum(score) over (PARTITION by id order by 1) as a5: uses the same default frame as a2, but order by 1 sorts by a constant, so every row ties with every other row and each row gets the sum of the whole partition.
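The difference between a2, a3 and a5 comes down to how tied rows are treated. The following is a minimal plain-Python model of the two running-sum frames, not Hive code. The helper names running_sum_rows and running_sum_range are hypothetical, and the model assumes one partition that is already sorted in ascending order.

```python
# Plain-Python model of the two "running sum" frames (hypothetical helpers).
def running_sum_rows(values):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: purely positional."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def running_sum_range(values, keys):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: includes peers.

    `keys` holds the ORDER BY value of each row; every row whose key is
    less than or equal to the current key is inside the frame.
    """
    return [sum(v for v, k in zip(values, keys) if k <= key) for key in keys]


scores = [1, 2, 3, 4, 5, 5]
a2 = running_sum_range(scores, scores)             # order by score, default frame
a3 = running_sum_rows(scores)                      # explicit ROWS frame
a5 = running_sum_range(scores, [1] * len(scores))  # order by a constant: all rows are peers

print(a2)  # [1, 3, 6, 10, 20, 20]
print(a3)  # [1, 3, 6, 10, 15, 20]
print(a5)  # [20, 20, 20, 20, 20, 20]
```

Only the rows with the tied score 5 differ between a2 and a3, and a5 collapses to the partition total because a constant sort key makes every row a peer of every other row.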

The official documentation explains the default frames behind a2 and a1 as follows:

  • When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  • When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Therefore, a1 and a2 are equivalent to the explicit forms that follow them in this query:

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a1_explicit,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a2_explicit
from datadev.t_student;

The official documentation page is:

LanguageManual WindowingAndAnalytics - Apache Hive - Apache Software Foundation

3. Test range between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2
from datadev.t_student;

The test results are as follows:

Analysis:

  1. sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1: the frame spans the whole partition, so every row gets the partition total, the same as a1. Note that this frame is only equivalent to the default when ORDER BY is also omitted. With order by score and no frame clause, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, as in a2.
  2. sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2: partitions by id and sorts by score in ascending order. The lower bound of the frame is the current row's score minus one and the upper bound is unbounded (effectively infinity). The scores of all rows that fall within this interval are summed.

The computation of b2, row by row, where the frame is [current score - 1, ∞):

id     score  frame    scores summed     b2
stu_1  1      [0, ∞)   1+2+3+4+5+5=20    20
stu_1  2      [1, ∞)   1+2+3+4+5+5=20    20
stu_1  3      [2, ∞)   2+3+4+5+5=19      19
stu_1  4      [3, ∞)   3+4+5+5=17        17
stu_1  5      [4, ∞)   4+5+5=14          14
stu_1  5      [4, ∞)   4+5+5=14          14
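The unbounded case can be modeled in plain Python by letting a missing offset stand for UNBOUNDED. This is a sketch under the same assumptions as the b1 and b2 queries (ascending sort, summed column equal to the order by column). The helper name range_sum and its None convention are hypothetical and are not part of Hive.

```python
# Plain-Python model of RANGE frames with optional unbounded ends
# (hypothetical helper; None stands for UNBOUNDED).
def range_sum(values, preceding=None, following=None):
    out = []
    for v in values:
        low = float("-inf") if preceding is None else v - preceding
        high = float("inf") if following is None else v + following
        out.append(sum(x for x in values if low <= x <= high))
    return out


scores = [1, 2, 3, 4, 5, 5]
b1 = range_sum(scores)               # UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING
b2 = range_sum(scores, preceding=1)  # 1 PRECEDING .. UNBOUNDED FOLLOWING

print(b1)  # [20, 20, 20, 20, 20, 20]
print(b2)  # [20, 20, 19, 17, 14, 14]
```

The values of b2 match the row-by-row computation above, and b1 is the partition total on every row.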

4. Compare range between ... and ... and rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) as a,
sum(score) over (PARTITION by id order by score ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as b
from datadev.t_student;

The test results are as follows:

Analysis:

For range, the frame is the score interval [current score - 1, current score + 1]. For rows, the frame is the previous row, the current row, and the next row.

id     score  range frame  range sum     a (range)  rows sum   b (rows)
stu_1  1      [0, 2]       1+2=3         3          1+2=3      3
stu_1  2      [1, 3]       1+2+3=6       6          1+2+3=6    6
stu_1  3      [2, 4]       2+3+4=9       9          2+3+4=9    9
stu_1  4      [3, 5]       3+4+5+5=17    17         3+4+5=12   12
stu_1  5      [4, 6]       4+5+5=14      14         4+5+5=14   14
stu_1  5      [4, 6]       4+5+5=14      14         5+5=10     10

The two columns differ for the row with score 4 and for the last row: range includes both rows with score 5 whenever 5 lies inside the value interval, while rows takes only the physically adjacent rows.

Reference Documentation: Hive Windows and Analytical Functions


Origin blog.csdn.net/qq_37771475/article/details/121774383