Hive window functions

Notes from learning Hive's window functions, with a short summary along the way:

An ordinary aggregate function aggregates over a group of rows, while a window function aggregates over a window of rows. An ordinary aggregate function therefore returns only one value per group (GROUP BY), whereas a window function returns a value for every row in the window. Put simply, the query result gains an extra column, which can hold either an aggregated value or a ranking value.
Window functions generally fall into two categories: aggregate window functions and sort window functions.
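
As a minimal sketch of that difference, the two queries below compute the same per-subject total, once with GROUP BY and once with a window. They assume the score(name, subject, score) table and the 12 sample rows introduced further down.

```sql
-- GROUP BY collapses each subject to a single row: 3 rows in total
select subject, sum(score) as total
from score
group by subject;

-- The window version keeps all 12 rows and adds the subject total as an extra column
select name, subject, score,
       sum(score) over (partition by subject) as total
from score;
```

With the sample data the totals are 57 for 语文, 46 for 数学 and 52 for 英语; the second query repeats the matching total on every row of the subject.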

Table of contents

Aggregate window function

sum window function

count window function

min window function

max window function

avg windowing function

first_value window function

last_value window function

lag window function

lead window function

NTILE window function

cume_dist window function

Sort window function

RANK, DENSE_RANK, ROW_NUMBER window function

percent_rank window function


 

First, create the table:
 

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";

Data (tab-separated, saved as score.txt; the subjects are 语文 = Chinese, 数学 = math, 英语 = English):

孙悟空	语文	10
孙悟空	数学	12
孙悟空	英语	15
大海	语文	4
大海	数学	7
大海	英语	11
宋宋	语文	22
宋宋	数学	19
宋宋	英语	3
婷婷	语文	21
婷婷	数学	8
婷婷	英语	23

Load the data into the Hive table:

load data local inpath '/home/hive/score.txt' into table score;

Now try out the functions:

Aggregate window function

 

sum window function

-- No conditions: the whole result set is the window
select name ,subject,score,
sum(score) over() as sum1
from score;

 

-- The window is all rows with the same subject
select name ,subject,score,
sum(score) over(partition by subject) as sum
from score;

 

-- Partition by subject, order by score: the window runs from the first row to the current row (inclusive)
select name ,subject,score,
sum(score) over(partition by subject order by score) as sum
from score;

select name ,subject,score,
sum(score) over() as sum1,
sum(score) over(partition by subject) as sum2,
sum(score) over(partition by subject order by score) as sum3,
-- From the start of the partition to the current row; same result as sum3 here
-- (the default frame of sum3 is RANGE-based, so the two can differ when scores tie)
sum(score) over(partition by subject order by score rows between UNBOUNDED PRECEDING and current row) as sum4,
-- The current row and the one row before it
sum(score) over(partition by subject order by score rows between 1 PRECEDING and current row) as sum5,
-- The current row, the one row before it and the one row after it
sum(score) over(partition by subject order by score rows between 1 PRECEDING AND 1 FOLLOWING) as sum6,
-- The current row and all rows after it
sum(score) over(partition by subject order by score rows between current row and UNBOUNDED FOLLOWING) as sum7
from score;

 

 

The ROWS clause must come after the ORDER BY clause. It restricts the ordered rows of the partition to a frame defined by a fixed number of rows relative to the current row.

Summary:

  • OVER(): specifies the data window that the analytic function operates on. The window may change from row to row.
  • CURRENT ROW: the current row
  • n PRECEDING: the n rows before the current row
  • n FOLLOWING: the n rows after the current row
  • UNBOUNDED: no bound. UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition

Note: n must be an integer.
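
To make the frame keywords concrete, here is a small worked sketch restricted to the four 数学 rows of the sample data (scores 7, 8, 12, 19). The column aliases are made up for illustration; the expected values in the trailing comment were worked out by hand from the sample data.

```sql
-- Frames over the four 数学 rows, ordered by score: 7, 8, 12, 19
select name, score,
       sum(score) over (order by score rows between unbounded preceding and current row) as running_total,
       sum(score) over (order by score rows between 1 preceding and current row) as prev_and_current,
       sum(score) over (order by score rows between 1 preceding and 1 following) as three_row_sum,
       sum(score) over (order by score rows between current row and unbounded following) as remaining_total
from score
where subject = '数学'
order by score;

-- score  running_total  prev_and_current  three_row_sum  remaining_total
--   7          7               7               15              46
--   8         15              15               27              39
--  12         27              20               39              31
--  19         46              31               31              19
```

At the edges of the partition a frame simply contains fewer rows: the first row has no preceding row and the last row has no following row.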

count window function

Returns the number of rows in the window; it is used the same way as the sum function

min window function

Returns the minimum value in the window; it is used the same way as the sum function

max window function

Returns the maximum value in the window; it is used the same way as the sum function

avg windowing function

Returns the average value in the window; it is used the same way as the sum function
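
Since count, min, max and avg take the same OVER clause as sum, one query can show all four. This is only a sketch against the score table above; the aliases are illustrative.

```sql
-- Per-subject statistics attached to every row
select name, subject, score,
       count(*)   over (partition by subject) as cnt,
       min(score) over (partition by subject) as min_score,
       max(score) over (partition by subject) as max_score,
       avg(score) over (partition by subject) as avg_score
from score;
```

For the 数学 rows (7, 8, 12, 19) this gives cnt = 4, min_score = 7, max_score = 19 and avg_score = 11.5 on each of the four rows.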

first_value window function

Returns the first value in the window; it is used the same way as the sum function

last_value window function

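Returns the last value in the window; it is used the same way as the sum function. Note that with an order by and no explicit frame, the window ends at the current row, so last_value returns the current row's own value unless the frame is extended to UNBOUNDED FOLLOWING.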
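
A short sketch of first_value and last_value on the score table, showing why the frame matters for last_value. The aliases are illustrative.

```sql
select name, subject, score,
       -- lowest score of the subject, because the window starts at the first row
       first_value(score) over (partition by subject order by score) as lowest,
       -- default frame ends at the current row, so this just repeats the current score
       last_value(score) over (partition by subject order by score) as last_default,
       -- explicit full-partition frame: highest score of the subject
       last_value(score) over (partition by subject order by score
                               rows between unbounded preceding and unbounded following) as highest
from score;
```

For the 数学 rows, lowest is 7 and highest is 19 on every row, while last_default is 7, 8, 12, 19 in turn.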

lag window function

LAG(col, n, default_val): returns the value of col from the nth row before the current row in the window. col is the column name, n is the number of rows to look back (1 if omitted), and default_val is returned when there is no such row (NULL if omitted)


select name ,subject,score,
 -- Second row back within the window; default 60 when there is no such row
lag(score,2,60) over(partition by subject order by score) as lag1,
 -- Second row back within the window; NULL when there is no such row
lag(score,2) over(partition by subject order by score) as lag2
from score ;

lead window function

LEAD(col, n, default_val): returns the value of col from the nth row after the current row in the window. col is the column name, n is the number of rows to look ahead (1 if omitted), and default_val is returned when there is no such row (NULL if omitted)

select name ,subject,score,
 -- Second row ahead within the window; default 60 when there is no such row
LEAD(score,2,60) over(partition by subject order by score) as lead1,
 -- Second row ahead within the window; NULL when there is no such row
LEAD(score,2) over(partition by subject order by score) as lead2
from score ;
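
A common use of lag and lead is comparing a row with its neighbour. The sketch below, against the same score table, computes the gap between each score and the next lower score in the same subject; the alias gap_to_lower is made up for illustration.

```sql
select name, subject, score,
       -- lag(score) looks one row back; the lowest score of each subject has no
       -- previous row, so its gap is NULL
       score - lag(score) over (partition by subject order by score) as gap_to_lower
from score;
```

For the 数学 rows (7, 8, 12, 19) the gaps are NULL, 1, 4 and 7.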

NTILE window function

NTILE(n): splits the rows of an ordered partition into n groups of as equal size as possible. The groups are numbered starting from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be an integer.

-- Order the result by score and split it into 4 groups
select name ,subject,score, 
ntile(4) over(order by score) sorted
from score;

cume_dist window function

cume_dist() computes the cumulative distribution of a value within a window or partition. Assuming ascending order, it is: (number of rows with a value less than or equal to the current value x) / (total number of rows in the window or partition), where x is the value, in the current row, of the column named in the order by clause.

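select name ,subject,score,
-- Fraction of all rows with a score less than or equal to the current score
cume_dist() over(order by score) as cume_dist1,
-- Fraction of rows in the partition with a score less than or equal to the current score
cume_dist() over(partition by subject order by score) as cume_dist3
from score;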

 

Take the row with score 7 (大海, 数学) as an example: there are 12 rows in total and three scores less than or equal to 7 (3, 4 and 7), so cume_dist1 is 3/12 = 0.25. Within the 数学 partition there are 4 rows and only 7 itself is less than or equal to 7, so cume_dist3 is 1/4 = 0.25.
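
The formula can be checked by computing it by hand next to cume_dist(). This is only a sketch: cd_by_hand is an illustrative alias, and the RANGE frame is used so that rows with tied scores are counted together, which is how cume_dist treats ties.

```sql
select name, subject, score,
       cume_dist() over (order by score) as cd,
       -- rows with score <= current score, divided by all rows
       count(*) over (order by score range between unbounded preceding and current row)
         / count(*) over () as cd_by_hand
from score;
```

Both columns should agree on every row; for the score 7 row they are 3/12 = 0.25.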

-- Fraction of all rows with a score greater than or equal to the current score
select name ,subject,score,
cume_dist() over(order by score desc) as cume_dist2
from score;

Sort window function

RANK, DENSE_RANK, ROW_NUMBER window function

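RANK(): tied rows receive the same rank and the ranks after a tie are skipped, so the last rank still equals the total number of rows

DENSE_RANK(): tied rows receive the same rank and no ranks are skipped, so the last rank is smaller than the total number of rows

ROW_NUMBER(): numbers the rows consecutively in sort order, with no repeats even for ties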

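Before running the query, modify the data slightly so that some scores tie (in the result discussed below, two rows of the same subject both have 19 points):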

select name,subject,score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;

Explanation:

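Look at the first 4 rows: two rows have 19 points and tie for first place. In the third row rp is 3, so rank 2 is skipped and the ranks still run up to the row count; drp is 2, so the numbering continues without a gap and the highest rank is smaller than the row count; rmp simply numbers the rows one after another.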
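
A typical use of these functions is picking the top row of each group. The sketch below uses row_number in a subquery, since a window function cannot be filtered directly in the WHERE clause of the same query level; the alias rn and the subquery name t are illustrative.

```sql
-- Highest score per subject
select name, subject, score
from (
  select name, subject, score,
         row_number() over (partition by subject order by score desc) as rn
  from score
) t
where rn = 1;
```

With the original, unmodified sample data this returns 宋宋 with 22 for 语文, 宋宋 with 19 for 数学 and 婷婷 with 23 for 英语. When scores tie, row_number picks one of the tied rows arbitrarily; replacing it with rank() keeps all of them.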

percent_rank window function

Computes the relative rank of a row as a fraction: (rank of the current row - 1) / (total number of rows in the partition - 1). With ascending order, it gives the fraction of the other rows in the partition that rank below the current row.

select name,subject,score,
row_number() over(partition by subject order by score) as row_number,
percent_rank() over(partition by subject order by score) as percent_rank
from score;
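
The formula above can be written out with rank() and count(*) and compared with the built-in function. This is a sketch under the assumption that every partition has more than one row; pr_by_hand is an illustrative alias.

```sql
select name, subject, score,
       percent_rank() over (partition by subject order by score) as pr,
       -- (rank - 1) / (rows in partition - 1)
       (rank() over (partition by subject order by score) - 1)
         / (count(*) over (partition by subject) - 1) as pr_by_hand
from score;
```

With four distinct scores per subject, both columns are 0, 1/3, 2/3 and 1 from the lowest to the highest score. For a partition with a single row the hand-written version divides by zero, which Hive evaluates to NULL, whereas percent_rank() returns 0.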

 

 

 

 

 

 

 


Origin blog.csdn.net/thetimelyrain/article/details/104194872