Window function
1. Hive window function syntax
The over() window function in Hive has the following syntax:
analytic_function over (partition by column_name order by column_name rows between start_position and end_position)
The over() clause accepts three optional sub-clauses: partition by column_name (partitioning), order by column_name (sorting), and rows between start_position and end_position (the window range). They can be used together, individually, or omitted entirely.
If none of the three is used, the window covers all rows returned by the query. If partition by is specified, the window covers the rows of each partition.
Meaning of the three sub-clauses in over():
order by
Sorts the rows within the window.
partition by
Can be understood as grouping, similar to group by. When over(partition by column_name) is used with an analytic function, the function is computed separately over the rows of each group. Unlike group by, every input row is kept in the result.
rows between start_position and end_position
Specifies the window range, for example from the first row to the current row. The range moves with the current row. When over(rows between start_position and end_position) is used with an analytic function, the function is computed over the rows in that range.
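As a minimal sketch of how the three sub-clauses change the window, the following Python script runs the same over() syntax in an in-memory SQLite database (version 3.25 or later), used here only as a stand-in for Hive. The table t and its columns grp and val are hypothetical and exist purely for illustration; that Hive returns the same values is an assumption, not something verified here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table t (grp text, val int)")  # hypothetical table
conn.executemany(
    "insert into t values (?, ?)",
    [("a", 1), ("a", 2), ("b", 10), ("b", 20), ("b", 30)],
)

rows = conn.execute("""
    select grp, val,
           -- no sub-clause: the window is the whole result set
           sum(val) over () as total_all,
           -- partition by: one window per group
           sum(val) over (partition by grp) as total_grp,
           -- partition by + order by + rows between: running total per group
           sum(val) over (partition by grp order by val
                          rows between unbounded preceding and current row)
               as running_grp
    from t
    order by grp, val
""").fetchall()

for row in rows:
    print(row)
```

total_all is identical on every row, total_grp is constant within each group, and running_grp grows row by row inside each group.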
Window range description:
The most commonly used window range is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the starting point to the current row), which is typically used to compute running totals.
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit (combined with PRECEDING or FOLLOWING)
UNBOUNDED PRECEDING: the first row of the window (starting point)
UNBOUNDED FOLLOWING: the last row of the window (end point)
For example:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the starting point to the current row)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING (from 2 rows before to 1 row after the current row)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW (from 2 rows before to the current row)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (from the current row to the end point)
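The four ranges listed above can be compared side by side on a tiny data set. This is a sketch only: it uses an in-memory SQLite database (3.25 or later) as a stand-in for Hive, and the table nums with its single column n is hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table nums (n int)")  # hypothetical table
conn.executemany("insert into nums values (?)", [(i,) for i in range(1, 6)])

rows = conn.execute("""
    select n,
           sum(n) over (order by n
                        rows between unbounded preceding and current row) as running,
           sum(n) over (order by n
                        rows between 2 preceding and 1 following) as around,
           sum(n) over (order by n
                        rows between 2 preceding and current row) as last3,
           sum(n) over (order by n
                        rows between current row and unbounded following) as remaining
    from nums
    order by n
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 3, 1, 15)
# (2, 3, 6, 3, 14)
# (3, 6, 10, 6, 12)
# (4, 10, 14, 9, 9)
# (5, 15, 12, 12, 5)
```

Near the edges of the data a range simply contains fewer rows: for n = 1, "2 PRECEDING AND 1 FOLLOWING" only covers the rows 1 and 2.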
2. Analytic functions commonly used with over():
2.1. Aggregate functions
avg(), sum(), max(), min()
2.2. Ranking functions
row_number(): assigns an incrementing number in sort order, with no repeats (e.g. 1, 2, 3, 4, 5, 6)
rank(): assigns an incrementing number in sort order; equal values share the same rank and gaps follow (e.g. 1, 2, 3, 3, 3, 6)
dense_rank(): assigns an incrementing number in sort order; equal values share the same rank and no gaps follow (e.g. 1, 2, 3, 3, 3, 4)
2.3. Others
lag(column_name, n, [default]): returns the value from the n-th row before the current row. The optional default is returned when no such row exists; if it is not given, null is returned. It can be used, for example, to find a user's previous purchase time.
lead(column_name, n, [default]): returns the value from the n-th row after the current row, with the same rule for the default. It can be used, for example, to find a user's next purchase time.
ntile(n): distributes the rows of an ordered partition into n numbered groups, numbered from 1. For each row, ntile returns the number of the group the row belongs to.
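To make the difference between the three ranking functions in 2.2 concrete, the sketch below reproduces the number sequences quoted there. It runs in an in-memory SQLite database (3.25 or later) as a stand-in for Hive; the table s and its score values are hypothetical, chosen so that three rows tie.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table s (score int)")  # hypothetical table
conn.executemany(
    "insert into s values (?)",
    [(100,), (90,), (80,), (80,), (80,), (70,)],
)

rows = conn.execute("""
    select score,
           row_number() over (order by score desc) as rn,
           rank()       over (order by score desc) as rk,
           dense_rank() over (order by score desc) as dr
    from s
    order by rn
""").fetchall()

for row in rows:
    print(row)
# (100, 1, 1, 1)
# (90, 2, 2, 2)
# (80, 3, 3, 3)
# (80, 4, 3, 3)
# (80, 5, 3, 3)
# (70, 6, 6, 4)
```

The three rows with score 80 get distinct row numbers, share rank 3 with a gap afterwards (the next rank is 6), and share dense rank 3 without a gap (the next dense rank is 4).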
Case 1:
1. Use the over() function to return each user's details together with the total number of rows in the table
2. Return user details together with the total number of users per day
3. Compute the running count of users with a score greater than 80, from the first day to the current row
Field description:
date user ID score
logday uid score
Data:
20201210,10001,84
20201210,10002,83
20201210,10003,86
20201211,10001,87
20201211,10002,65
20201211,10003,98
20201212,10001,67
20201212,10002,28
20201212,10003,89
20201213,10001,99
20201213,10002,55
20201213,10003,57
Create a table and import data:
create table test_window
(logday string,
uid string,
score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load data
load data local inpath '/export/data/hive/test_window.txt' into table test_window;
1. Use the over() function to return each user's details together with the total number of rows in the table
select *, count(uid) over() as total from test_window;
2. Return user details together with the total number of users per day
select *, count(*) over(partition by logday) as day_total from test_window;
3. Compute the running count of users with a score greater than 80, from the first day to the current row
select *, count(*) over(order by logday rows between unbounded preceding and current row) as total from test_window where score > 80;
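The third query can be checked against the sample data with the sketch below, which uses an in-memory SQLite database (3.25 or later) as a stand-in for Hive. It also runs a variant that is not part of the original case: count(*) over(order by logday) without a rows clause. With order by and no explicit range, SQLite uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats all rows of the same day as one step; the Hive documentation describes the same default, but that is not verified here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table test_window (logday text, uid text, score int)")
conn.executemany(
    "insert into test_window values (?, ?, ?)",
    [
        ("20201210", "10001", 84), ("20201210", "10002", 83), ("20201210", "10003", 86),
        ("20201211", "10001", 87), ("20201211", "10002", 65), ("20201211", "10003", 98),
        ("20201212", "10001", 67), ("20201212", "10002", 28), ("20201212", "10003", 89),
        ("20201213", "10001", 99), ("20201213", "10002", 55), ("20201213", "10003", 57),
    ],
)

# Query 3 as written: the count grows by one on every qualifying row.
rows_frame = conn.execute("""
    select logday, uid, score,
           count(*) over (order by logday
                          rows between unbounded preceding and current row) as total
    from test_window
    where score > 80
""").fetchall()

# Variant (not in the original): default range, one value per day.
range_rows = conn.execute("""
    select logday, count(*) over (order by logday) as total
    from test_window
    where score > 80
""").fetchall()
per_day = {logday: total for logday, total in range_rows}

print(sorted(r[3] for r in rows_frame))  # [1, 2, 3, 4, 5, 6, 7]
print(per_day)
# {'20201210': 3, '20201211': 5, '20201212': 6, '20201213': 7}
```

With the rows range, users on the same day receive different running counts, and which user receives which count within a day is not determined by order by logday alone. The range variant gives every row of a day the cumulative count up to and including that day.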
Case 2:
1. Query the customers who purchased in April 2020 and the total number of such customers
2. Query each customer's purchase details and monthly purchase total
3. Query each customer's purchase details and that customer's cumulative purchase amount to date
4. Query each customer's previous purchase time
5. Query the orders in the earliest 20% by order date
Field description:
username order date order amount
name orderdate cost
Data:
jack,2020-01-01,11
tony,2020-01-02,16
jack,2020-02-03,22
tony,2020-01-04,28
jack,2020-01-05,47
jack,2020-04-06,43
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,63
mart,2020-04-09,69
tom,2020-05-10,13
mart,2020-04-11,76
tom,2020-06-12,81
mart,2020-04-13,95
Create a table and import data:
create table business
(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load data
load data local inpath "/export/data/hive/business.txt" into table business;
1. Query the customers who purchased in April 2020 and the total number of such customers
select name, count(*) over() as total from business where substr(orderdate,1,7) = '2020-04' group by name;
2. Query each customer's purchase details and monthly purchase total
select *, sum(cost) over(partition by name, substr(orderdate,1,7)) total_amount
from business;
3. Query the customer's purchase details and the total purchase amount of each customer so far
select *,sum(cost) over(partition by name order by orderdate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) total_amount
from business;
4. Query each customer's previous purchase time
select name, orderdate, cost, lag(orderdate,1) over(partition by name order by orderdate) last_date
from business;
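The behaviour of lag() on the sample data, together with its counterpart lead(), can be sketched as follows. The script uses an in-memory SQLite database (3.25 or later) as a stand-in for Hive; the next_date column is added for illustration and is not part of the original query.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost int)")
conn.executemany(
    "insert into business values (?, ?, ?)",
    [
        ("jack", "2020-01-01", 11), ("tony", "2020-01-02", 16),
        ("jack", "2020-02-03", 22), ("tony", "2020-01-04", 28),
        ("jack", "2020-01-05", 47), ("jack", "2020-04-06", 43),
        ("tony", "2020-01-07", 50), ("jack", "2020-01-08", 55),
        ("mart", "2020-04-08", 63), ("mart", "2020-04-09", 69),
        ("tom", "2020-05-10", 13), ("mart", "2020-04-11", 76),
        ("tom", "2020-06-12", 81), ("mart", "2020-04-13", 95),
    ],
)

rows = conn.execute("""
    select name, orderdate,
           lag(orderdate, 1)  over (partition by name order by orderdate) as last_date,
           lead(orderdate, 1) over (partition by name order by orderdate) as next_date
    from business
    order by name, orderdate
""").fetchall()

# Keep only jack's rows to make the output easy to read.
jack = [(orderdate, last_date, next_date)
        for name, orderdate, last_date, next_date in rows if name == "jack"]
for row in jack:
    print(row)
# ('2020-01-01', None, '2020-01-05')
# ('2020-01-05', '2020-01-01', '2020-01-08')
# ('2020-01-08', '2020-01-05', '2020-02-03')
# ('2020-02-03', '2020-01-08', '2020-04-06')
# ('2020-04-06', '2020-02-03', None)
```

The first order of each customer has no previous row and the last order has no next row, so the result is null (None in Python) because no default value was passed as the third argument.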
5. Query the orders in the earliest 20% by order date
select *
from
(select *,
ntile(5) over(order by orderdate) group_num from business) t
where t.group_num = 1;
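The ntile(5) query can be tried on the sample data with the sketch below, run in an in-memory SQLite database (3.25 or later) as a stand-in for Hive. With 14 rows and 5 groups the rows cannot be split evenly; in SQLite the earlier groups receive the extra rows, and the same distribution in Hive is assumed here rather than verified.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost int)")
conn.executemany(
    "insert into business values (?, ?, ?)",
    [
        ("jack", "2020-01-01", 11), ("tony", "2020-01-02", 16),
        ("jack", "2020-02-03", 22), ("tony", "2020-01-04", 28),
        ("jack", "2020-01-05", 47), ("jack", "2020-04-06", 43),
        ("tony", "2020-01-07", 50), ("jack", "2020-01-08", 55),
        ("mart", "2020-04-08", 63), ("mart", "2020-04-09", 69),
        ("tom", "2020-05-10", 13), ("mart", "2020-04-11", 76),
        ("tom", "2020-06-12", 81), ("mart", "2020-04-13", 95),
    ],
)

# Size of each of the 5 groups.
group_sizes = [size for (size,) in conn.execute("""
    select count(*)
    from (select ntile(5) over (order by orderdate) as group_num from business) t
    group by group_num
    order by group_num
""")]

# The query from the case: orders that fall into group 1.
first_group = conn.execute("""
    select name, orderdate, cost
    from (select *, ntile(5) over (order by orderdate) as group_num
          from business) t
    where t.group_num = 1
    order by orderdate
""").fetchall()

print(group_sizes)  # [3, 3, 3, 3, 2]
print(first_group)
# [('jack', '2020-01-01', 11), ('tony', '2020-01-02', 16), ('tony', '2020-01-04', 28)]
```

Group 1 therefore contains 3 of the 14 orders, which is slightly more than 20%; ntile gives an approximate split whenever the row count is not a multiple of n.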
Case 3:
1. Rank students by score within each subject in three ways (no ties; ties with gaps; ties without gaps)
2. The top 3 students by score in each subject
Field description:
student name subject score
name subject score
Data (tab-separated; the subjects are 语文 = Chinese, 数学 = Math, 英语 = English):
建胜 语文 87
建胜 数学 95
建胜 英语 68
班长 语文 94
班长 数学 56
班长 英语 84
副班长 语文 64
副班长 数学 86
副班长 英语 84
团支书 语文 65
团支书 数学 85
团支书 英语 78
Create a table and import data:
create table window_score
(
name string,
subject string,
score int
) row format delimited fields terminated by "\t";
-- Load data
load data local inpath '/export/data/hive/window_score.txt' into table window_score;
1. Rank students by score within each subject in three ways (no ties; ties with gaps; ties without gaps)
select *,
row_number()over(partition by subject order by score desc) as rn,
rank()over(partition by subject order by score desc) as rk,
dense_rank()over(partition by subject order by score desc) as dr
from window_score;
2. Top 3 students in each subject
select
*
from
(
select
*,
row_number() over(partition by subject order by score desc) rmp
from window_score
) t
where t.rmp<=3;
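The top-3 query can be run against the sample data with the sketch below, using an in-memory SQLite database (3.25 or later) as a stand-in for Hive. It also includes a dense_rank() variant that is not part of the original case, to show how the choice of ranking function matters when scores tie (英语 has two scores of 84).

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table window_score (name text, subject text, score int)")
conn.executemany(
    "insert into window_score values (?, ?, ?)",
    [
        ("建胜", "语文", 87), ("建胜", "数学", 95), ("建胜", "英语", 68),
        ("班长", "语文", 94), ("班长", "数学", 56), ("班长", "英语", 84),
        ("副班长", "语文", 64), ("副班长", "数学", 86), ("副班长", "英语", 84),
        ("团支书", "语文", 65), ("团支书", "数学", 85), ("团支书", "英语", 78),
    ],
)


def top_scores(rank_function):
    """Return {subject: [scores]} for rows whose rank is <= 3."""
    query = f"""
        select subject, score
        from (select *,
                     {rank_function}() over (partition by subject
                                             order by score desc) as rmp
              from window_score) t
        where t.rmp <= 3
        order by subject, score desc
    """
    result = {}
    for subject, score in conn.execute(query):
        result.setdefault(subject, []).append(score)
    return result


by_row_number = top_scores("row_number")   # the query from the case
by_dense_rank = top_scores("dense_rank")   # variant, not in the original

print(by_row_number["英语"])  # [84, 84, 78]
print(by_dense_rank["英语"])  # [84, 84, 78, 68]
```

row_number() always returns exactly three rows per subject. dense_rank() returns every row whose score is among the three highest distinct scores, so a subject with ties can yield more than three rows.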