Hive supplement: window functions

Window functions
1. Hive window function syntax
The syntax of a window function with an over() clause in Hive:

analytic function over (partition by column name order by column name rows between start position and end position)

The over() clause accepts three sub-clauses: partition by column name (partitioning), order by column name (sorting), and rows between start position and end position (the window range). These three sub-clauses can be combined freely, and any or all of them can be omitted.

If none of the three is used, the window covers all rows produced by the query. If a partition is specified, the window covers the rows of each partition.

The meaning of the three sub-clauses in over():
order by
order by sorts the rows within the window.
partition by
partition by can be understood as grouping, similar to group by. When over (partition by column name) is used with an analytic function, the function is computed separately over the rows of each group.
rows between start position and end position
rows between specifies the window range, such as from the first row to the current row. This range moves with the current row. When over (rows between start position and end position) is used with an analytic function, the function is computed over the rows in this range.
Window range description:
The most commonly used window range is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the starting point to the current row), which is typically used to compute running totals.

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (always combined with PRECEDING or FOLLOWING)
UNBOUNDED PRECEDING: the first row of the partition (starting point)
UNBOUNDED FOLLOWING: the last row of the partition (end point)

For example:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the starting point to the current row)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING (from 2 rows before to 1 row after the current row)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW (from 2 rows before to the current row)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (from the current row to the end point)
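
To see how these ranges behave, here is a small sketch that runs the same window syntax through Python's built-in sqlite3 module instead of Hive. That substitution is an assumption made only so the example runs anywhere (it needs SQLite 3.25 or later), and the table t with columns id and val is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, val integer)")  # hypothetical demo table
conn.executemany(
    "insert into t values (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

query = """
select id,
       val,
       -- starting point up to the current row: a running total
       sum(val) over (order by id
                      rows between unbounded preceding and current row) as running_total,
       -- 2 rows before through 1 row after the current row
       sum(val) over (order by id
                      rows between 2 preceding and 1 following) as sliding_sum,
       -- current row through the end point
       sum(val) over (order by id
                      rows between current row and unbounded following) as remaining_sum
from t
order by id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Near the edges of the data the sliding window simply contains fewer rows, which is why sliding_sum for the first row only adds the first two values.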
2. Analytic functions often used with over():
2.1. Aggregate functions
avg(), sum(), max(), min()
2.2. Ranking functions
row_number(): generates an increasing sequence number in sort order, with no repeats (for example: 1, 2, 3, 4, 5, 6).
rank(): generates an increasing sequence number in sort order; equal values get the same number and gaps follow (for example: 1, 2, 3, 3, 3, 6).
dense_rank(): generates an increasing sequence number in sort order; equal values get the same number and no gaps follow (for example: 1, 2, 3, 3, 3, 4).
2.3. Others
lag(column name, number of rows back, [default value]): returns the value from the given number of rows before the current row; if there is no such row, it returns the default value, or null when no default is specified. It can be used to find a user's previous purchase time.
lead(column name, number of rows ahead, [default value]): returns the value from the given number of rows after the current row, with the same default handling. It can be used to find a user's next purchase time.
ntile(n): distributes the rows of an ordered partition into n groups, numbered from 1. For each row, ntile returns the number of the group the row belongs to.
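
The differences between these functions are easiest to see side by side. The sketch below is an illustration only: it uses Python's sqlite3 module as a stand-in for Hive (an assumption, requiring SQLite 3.25 or later) and a hypothetical table s with three tied scores.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); the table "s" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table s (name text, score integer)")
conn.executemany(
    "insert into s values (?, ?)",
    [("a", 90), ("b", 80), ("c", 70), ("d", 70), ("e", 70), ("f", 60)],
)

# Three ranking functions over the same ordering.
ranking = conn.execute("""
select name, score,
       row_number() over (order by score desc) as rn,
       rank()       over (order by score desc) as rk,
       dense_rank() over (order by score desc) as dr
from s
order by score desc, rn
""").fetchall()

# lag / lead look at neighboring rows; ntile splits ordered rows into groups.
neighbors = conn.execute("""
select name,
       lag(score, 1)    over (order by name) as prev_score,
       lag(score, 1, 0) over (order by name) as prev_or_zero,
       lead(score, 1)   over (order by name) as next_score,
       ntile(3)         over (order by name) as bucket
from s
order by name
""").fetchall()

for row in ranking:
    print(row)
for row in neighbors:
    print(row)
```

Which of the three tied rows receives row_number 3, 4, or 5 is not determined by the query; add a tie-breaking column to order by if a stable result is needed.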
Case 1:
1. Use the over() function to list each user's records together with the total number of rows in the table
2. Query user details and count the total number of users per day
3. Calculate the running total of users with a score greater than 80 from the first day to the present

Field description:
date user ID score
logday uid score

Data:

20201210,10001,84
20201210,10002,83
20201210,10003,86
20201211,10001,87
20201211,10002,65
20201211,10003,98
20201212,10001,67
20201212,10002,28
20201212,10003,89
20201213,10001,99
20201213,10002,55
20201213,10003,57

Create a table and import data:

create table test_window
(logday string,
uid string,
score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data

load data local inpath '/export/data/hive/test_window.txt' into table test_window;

1. Use the over() function to list each user's records together with the total number of rows in the table

select *, count(uid) over() as total from test_window;

2. Query user details and count the total number of users per day

select *, count(*) over(partition by logday) as day_total from test_window;

3. Calculate the running total of users with a score greater than 80 from the first day to the present

select *, count(*) over(order by logday rows between unbounded preceding and current row) as total from test_window where score > 80;
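
As a rough check of what these three queries return, the sketch below replays them on the sample data using Python's sqlite3 module in place of Hive (an assumption; it needs SQLite 3.25 or later, and the rows are inserted directly instead of loaded from a file). The extra column total_by_day is not part of the original query; it is added to contrast the rows frame with the default frame that applies when order by is given without rows between.

```python
import sqlite3

DATA = """\
20201210,10001,84
20201210,10002,83
20201210,10003,86
20201211,10001,87
20201211,10002,65
20201211,10003,98
20201212,10001,67
20201212,10002,28
20201212,10003,89
20201213,10001,99
20201213,10002,55
20201213,10003,57
"""

# SQLite stands in for Hive (assumption); rows are inserted instead of "load data".
conn = sqlite3.connect(":memory:")
conn.execute("create table test_window (logday text, uid text, score integer)")
conn.executemany(
    "insert into test_window values (?, ?, ?)",
    [(d, u, int(s)) for d, u, s in (line.split(",") for line in DATA.splitlines())],
)

# 1. Empty over(): the window is the whole result set.
q1 = conn.execute(
    "select *, count(uid) over () as total from test_window"
).fetchall()

# 2. partition by logday: one window per day.
q2 = conn.execute(
    "select *, count(*) over (partition by logday) as day_total from test_window"
).fetchall()

# 3. Running count of rows with score > 80.
q3 = conn.execute("""
select logday, uid, score,
       count(*) over (order by logday
                      rows between unbounded preceding and current row) as total,
       count(*) over (order by logday) as total_by_day  -- default frame, for comparison
from test_window
where score > 80
order by logday, total
""").fetchall()

for row in q3:
    print(row)
```

With the rows frame the count grows by one on every row, so rows from the same day show different values. The default frame treats rows with the same logday as peers, so every row of a day shows that day's cumulative count.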

Case 2:
1. Query the customers who purchased in April 2020 and the total number of such customers
2. Query customers' purchase details and each customer's monthly purchase total
3. Query customers' purchase details and each customer's cumulative purchase amount so far
4. Query each customer's previous purchase time
5. Query the orders in the earliest 20% by order date

Field description:
username order date order amount
name orderdate cost

data:

jack,2020-01-01,11
tony,2020-01-02,16
jack,2020-02-03,22
tony,2020-01-04,28
jack,2020-01-05,47
jack,2020-04-06,43
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,63
mart,2020-04-09,69
tom,2020-05-10,13
mart,2020-04-11,76
tom,2020-06-12,81
mart,2020-04-13,95

Create a table and import data:

create table business
(
name string, 
orderdate string,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data

load  data local inpath "/export/data/hive/business.txt" into table business;

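1. Query the customers who purchased in April 2020 and the total number of such customers

group by name first reduces the result to one row per customer, so count(*) over() counts customers rather than orders.

select name, count(*) over() as total from business where substr(orderdate,1,7) = '2020-04' group by name;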

2. Query customers' purchase details and each customer's monthly purchase total

select *, sum(cost) over(partition by name, substr(orderdate,1,7)) as total_amount
from business;

3. Query customers' purchase details and each customer's cumulative purchase amount so far

select *, sum(cost) over(partition by name order by orderdate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total_amount
from business;
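
To make the cumulative total concrete, this sketch replays query 3 on the sample data. It uses Python's sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is required) and inserts the rows directly instead of using load data. The per_customer dictionary is just a helper for printing.

```python
import sqlite3

DATA = """\
jack,2020-01-01,11
tony,2020-01-02,16
jack,2020-02-03,22
tony,2020-01-04,28
jack,2020-01-05,47
jack,2020-04-06,43
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,63
mart,2020-04-09,69
tom,2020-05-10,13
mart,2020-04-11,76
tom,2020-06-12,81
mart,2020-04-13,95
"""

# SQLite stands in for Hive (assumption); rows are inserted instead of "load data".
conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany(
    "insert into business values (?, ?, ?)",
    [(n, d, int(c)) for n, d, c in (line.split(",") for line in DATA.splitlines())],
)

running = conn.execute("""
select name, orderdate, cost,
       sum(cost) over (partition by name order by orderdate
                       rows between unbounded preceding and current row) as total_amount
from business
order by name, orderdate
""").fetchall()

# Collect each customer's running totals in date order (helper for display only).
per_customer = {}
for name, orderdate, cost, total_amount in running:
    per_customer.setdefault(name, []).append(total_amount)

for name, totals in per_customer.items():
    print(name, totals)
```

The running total restarts for every customer because partition by name gives each customer a separate window.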

4. Query each customer's previous purchase time

select name, orderdate, cost, lag(orderdate,1) over(partition by name order by orderdate) as last_date
from business;

5. Query the orders in the earliest 20% by order date

select *
from
(select *,
ntile(5) over(order by orderdate) as group_num from business) t
where t.group_num = 1;
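
The ntile(5) approach can be checked on the sample data with the sketch below, which again substitutes Python's sqlite3 module for Hive (an assumption; SQLite 3.25 or later is required) and inserts the rows directly. The group_sizes query is an extra added here to show how the 14 rows are split.

```python
import sqlite3

DATA = """\
jack,2020-01-01,11
tony,2020-01-02,16
jack,2020-02-03,22
tony,2020-01-04,28
jack,2020-01-05,47
jack,2020-04-06,43
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,63
mart,2020-04-09,69
tom,2020-05-10,13
mart,2020-04-11,76
tom,2020-06-12,81
mart,2020-04-13,95
"""

# SQLite stands in for Hive (assumption); rows are inserted instead of "load data".
conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany(
    "insert into business values (?, ?, ?)",
    [(n, d, int(c)) for n, d, c in (line.split(",") for line in DATA.splitlines())],
)

# Orders that fall into the first of five date-ordered groups.
first_fifth = conn.execute("""
select name, orderdate, cost
from (select *, ntile(5) over (order by orderdate) as group_num from business) t
where t.group_num = 1
order by orderdate
""").fetchall()

# How many rows each group received.
group_sizes = conn.execute("""
select group_num, count(*)
from (select ntile(5) over (order by orderdate) as group_num from business) t
group by group_num
order by group_num
""").fetchall()

print(first_fifth)
print(group_sizes)
```

Since 14 rows do not divide evenly by 5, the groups are not all the same size, so group 1 is only approximately the earliest 20% of the orders.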

Case 3:
1. Rank students' scores in each subject (three ranking methods: no ties, ties with gaps, and ties without gaps)
2. Find the top 3 students in each subject

Field description:
student name subject score
name subject score
(In the data, 语文 is Chinese, 数学 is math, and 英语 is English.)

data:

建胜	语文	87
建胜	数学	95
建胜	英语	68
班长	语文	94
班长	数学	56
班长	英语	84
副班长	语文	64
副班长	数学	86
副班长	英语	84
团支书	语文	65
团支书	数学	85
团支书	英语	78

Create a table and import data:

create table window_score
(
name string,
subject string,
score int
) row format delimited fields terminated by "\t";

-- Load data

load data local inpath '/export/data/hive/window_score.txt' into table window_score;

1. Rank students' scores in each subject (three ranking methods: no ties, ties with gaps, and ties without gaps)

select *,
row_number() over(partition by subject order by score desc) as rn,
rank() over(partition by subject order by score desc) as rk,
dense_rank() over(partition by subject order by score desc) as dr
from window_score;

2. Find the top 3 students in each subject

select 
*
from 
(
select 
*,
row_number() over(partition by subject order by score desc) rmp
from window_score
) t
where t.rmp<=3;
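
A quick way to check the top-3 query is to replay it on the sample data. The sketch below does so with Python's sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is required), inserting the rows directly. The top3 dictionary is only a helper for collecting the result.

```python
import sqlite3

ROWS = [
    ("建胜", "语文", 87), ("建胜", "数学", 95), ("建胜", "英语", 68),
    ("班长", "语文", 94), ("班长", "数学", 56), ("班长", "英语", 84),
    ("副班长", "语文", 64), ("副班长", "数学", 86), ("副班长", "英语", 84),
    ("团支书", "语文", 65), ("团支书", "数学", 85), ("团支书", "英语", 78),
]

# SQLite stands in for Hive (assumption); rows are inserted instead of "load data".
conn = sqlite3.connect(":memory:")
conn.execute("create table window_score (name text, subject text, score integer)")
conn.executemany("insert into window_score values (?, ?, ?)", ROWS)

# Number the rows of each subject by descending score and keep the first three.
top3 = {}
for subject, name, score, rmp in conn.execute("""
select subject, name, score, rmp
from (select *,
             row_number() over (partition by subject order by score desc) as rmp
      from window_score) t
where t.rmp <= 3
order by subject, rmp
"""):
    top3.setdefault(subject, []).append(score)

for subject, scores in top3.items():
    print(subject, scores)
```

Because row_number() never repeats, exactly three rows come back per subject even when scores tie, and which tied student comes first is arbitrary. If tied students at the cutoff should all be kept, rank() or dense_rank() can be used in the same place.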

Origin blog.csdn.net/xianyu120/article/details/114583183