Hive Window Functions Explained, with Worked Examples

Syntax:

analytic_function over (partition by column_name, column_name order by column_name rows between start_position and end_position)

Common analytic functions:

  • Aggregate functions
    avg(), sum(), max(), min()

  • Ranking functions

ROW_NUMBER() generates a sequential number following the sort order; numbers are never repeated

RANK() generates a sequential number following the sort order; equal values share the same rank, and a gap is left after each tie

DENSE_RANK() generates a sequential number following the sort order; equal values share the same rank, and no gap is left

  • Other functions

LAG(column, n, [default]) returns the value of column from the row n rows before the current row; if no such row exists it returns default, or null when default is not specified

LEAD(column, n, [default]) returns the value of column from the row n rows after the current row; if no such row exists it returns default, or null when default is not specified

NTILE(n) distributes the ordered rows of a partition into n groups numbered from 1; for each row, NTILE returns the number of the group that row belongs to
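
The differences between these functions are easiest to see side by side. The following is a minimal sketch, not Hive itself: it runs the same standard window functions through Python's built-in sqlite3 module (assuming a bundled SQLite of version 3.25 or later, which added window functions). The scores table and its rows are hypothetical illustration data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table scores (name text, score integer)")  # hypothetical table
conn.executemany("insert into scores values (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80), ("d", 70), ("e", 60)])

# Ranking functions: a and b tie on score.
# name is added as a tie-breaker for row_number() so its result is deterministic.
ranking_rows = conn.execute("""
    select name, score,
           row_number() over (order by score desc, name) as rn,
           rank()       over (order by score desc)       as rk,
           dense_rank() over (order by score desc)       as drk
    from scores
    order by score desc, name
""").fetchall()

# Offset and bucketing functions over the same ordering.
offset_rows = conn.execute("""
    select name,
           lag(score, 1)    over (order by score desc, name) as prev_score,
           lag(score, 1, 0) over (order by score desc, name) as prev_or_zero,
           lead(score, 2)   over (order by score desc, name) as two_ahead,
           ntile(2)         over (order by score desc, name) as bucket
    from scores
    order by score desc, name
""").fetchall()

for row in ranking_rows:
    print(row)
for row in offset_rows:
    print(row)
```

In the ranking output, rank() jumps from 1 to 3 after the tie while dense_rank() continues with 2. With five rows, ntile(2) puts three rows in group 1 and two in group 2.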

Key points:

  • Inside over(), partitioning, ordering, and window bounds can be combined or omitted, depending on the business requirement
  • If over() specifies no partition, the window covers all rows returned by the query; if a partition is specified, the window covers the rows of each partition
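
As a rough illustration of the second point, the sketch below compares an empty over() with a partitioned one. It uses Python's sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is needed) and only three rows taken from the sample data used later in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, cost integer)")
conn.executemany("insert into business values (?, ?)",
                 [("jack", 10), ("jack", 46), ("tony", 15)])

window_rows = conn.execute("""
    select name, cost,
           sum(cost) over ()                  as grand_total,  -- window = every row of the query
           sum(cost) over (partition by name) as name_total    -- window = rows with the same name
    from business
    order by name, cost
""").fetchall()

for row in window_rows:
    print(row)
```

Every row keeps its own detail columns; only the size of the window that feeds sum() changes.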

Window range keywords inside over():

current row: the current row

unbounded: no boundary; unbounded preceding means the window starts at the first row of the partition, unbounded following means it ends at the last row of the partition

n preceding: n rows before the current row

n following: n rows after the current row
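
A small sketch of how these keywords combine into a rows between clause. Again this runs on SQLite through Python's sqlite3 module rather than on Hive (an assumption, requiring SQLite 3.25 or later), and the nums table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table nums (id integer, v integer)")  # hypothetical table
conn.executemany("insert into nums values (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

frame_rows = conn.execute("""
    select id,
           -- from the first row up to the current row: a running total
           sum(v) over (order by id rows between unbounded preceding and current row) as running,
           -- the previous row, the current row, and the next row
           sum(v) over (order by id rows between 1 preceding and 1 following)         as neighbors,
           -- from the current row to the last row
           sum(v) over (order by id rows between current row and unbounded following) as remaining
    from nums
    order by id
""").fetchall()

for row in frame_rows:
    print(row)
```

At the edges of the partition the frame simply shrinks: the first row has no preceding neighbor, so its neighbors sum covers two rows instead of three.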

Worked example:

Raw data (customer purchase details)

name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94


Create the table and load the data
vi business.txt

create table business
(
name string, 
orderdate string,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/data/business.txt" into table business;

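Requirements

(1) Query the customers who made purchases in April 2017 and the total number of customers

Analysis: filter by date, then count to get the total (why not use group by for the count? Think about it yourself). Note that count(*) over() counts the rows that pass the filter, so the query below reports the number of purchase records; to count distinct customers, group by name before applying count(*) over().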

select 
name,
orderdate,
cost,
count(*) over() total_people
FROM 
business
where date_format(orderdate,'yyyy-MM')='2017-04';
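
To make the counting behavior concrete, here is a hedged sketch that runs both variants on the sample data. It uses Python's sqlite3 module in place of Hive (an assumption, requiring SQLite 3.25 or later), and substr(orderdate, 1, 7) replaces date_format because SQLite has no date_format function.

```python
import sqlite3

RAW = """jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94"""

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)",
                 [(n, d, int(c)) for n, d, c in
                  (line.split(",") for line in RAW.splitlines())])

april = "substr(orderdate, 1, 7) = '2017-04'"

# Without group by: one output row per purchase, and the window counts purchases.
per_purchase = conn.execute(
    f"select name, count(*) over () from business where {april} order by orderdate"
).fetchall()

# With group by: one output row per customer, and the window counts customers.
per_customer = conn.execute(
    f"select name, count(*) over () from business where {april} group by name order by name"
).fetchall()

print(per_purchase)
print(per_customer)
```

The window function is evaluated after grouping, which is why the second query counts groups rather than raw rows.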

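(2) Query the customers' purchase details and the monthly purchase total

Analysis: partition by customer and sum the purchase amounts. The query below totals per customer; for a per-month total, partition by the month of orderdate instead, for example partition by date_format(orderdate,'yyyy-MM').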

select 
name,
orderdate,
cost,
sum(cost) over(partition by name) total_amount
FROM 
business;
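
Requirement (2) mentions a monthly total, while the query above partitions by name. The sketch below computes both totals on the sample data so they can be compared. It assumes SQLite 3.25 or later via Python's sqlite3 module as a stand-in for Hive, and uses substr(orderdate, 1, 7) to extract the month because SQLite has no date_format.

```python
import sqlite3

RAW = """jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94"""

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)",
                 [(n, d, int(c)) for n, d, c in
                  (line.split(",") for line in RAW.splitlines())])

# Total per customer, as in the query above (distinct collapses the repeated detail rows).
customer_totals = dict(conn.execute("""
    select distinct name, sum(cost) over (partition by name)
    from business
""").fetchall())

# Total per calendar month: only the partition expression changes.
month_totals = dict(conn.execute("""
    select distinct substr(orderdate, 1, 7),
           sum(cost) over (partition by substr(orderdate, 1, 7))
    from business
""").fetchall())

print(customer_totals)
print(month_totals)
```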

(3) For the scenario above, accumulate the cost by date

Analysis: partition by customer, sort by date ascending, and for each row add up the costs of all rows up to and including it within the partition

select 
name,
orderdate,
cost,
sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row) cumulative_amount
FROM 
business;
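
The running total can be checked on the sample data with the sketch below (SQLite via Python's sqlite3 module as a stand-in for Hive, an assumption requiring SQLite 3.25 or later). It also runs the query without the explicit frame: when order by is given and no frame is written, both the Hive documentation and SQLite describe a default of range between unbounded preceding and current row, which gives the same result here because no customer has two orders on the same date.

```python
import sqlite3

RAW = """jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94"""

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)",
                 [(n, d, int(c)) for n, d, c in
                  (line.split(",") for line in RAW.splitlines())])

# Explicit frame, exactly as in the query above.
explicit_frame = conn.execute("""
    select name, orderdate, cost,
           sum(cost) over (partition by name order by orderdate
                           rows between unbounded preceding and current row)
    from business
    order by name, orderdate
""").fetchall()

# Same query relying on the default frame that applies when order by is present.
default_frame = conn.execute("""
    select name, orderdate, cost,
           sum(cost) over (partition by name order by orderdate)
    from business
    order by name, orderdate
""").fetchall()

jack_running = [row[3] for row in explicit_frame if row[0] == "jack"]
print(jack_running)
```

If two rows in a partition shared the same orderdate, the default range frame would include both in each other's total, so writing the rows frame explicitly is the safer habit.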

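(4) Query each customer's previous purchase date

Analysis: return the detail rows together with the purchase date of the previous row (this requires partitioning by customer and sorting by date ascending)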

select 
name,
orderdate,
cost,
lag(orderdate,1) over(partition by name order by orderdate) last_date
FROM 
business;
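
A quick check of what lag returns for a customer's first purchase, sketched with Python's sqlite3 module in place of Hive (an assumption; SQLite 3.25 or later). The third argument 'none' is a hypothetical placeholder default chosen only for illustration.

```python
import sqlite3

RAW = """jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94"""

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)",
                 [(n, d, int(c)) for n, d, c in
                  (line.split(",") for line in RAW.splitlines())])

lag_rows = conn.execute("""
    select name, orderdate,
           lag(orderdate, 1)         over (partition by name order by orderdate) as last_date,
           lag(orderdate, 1, 'none') over (partition by name order by orderdate) as last_date_or_default
    from business
    order by name, orderdate
""").fetchall()

# Keep only tony's rows: (orderdate, last_date, last_date_or_default)
tony_rows = [row[1:] for row in lag_rows if row[0] == "tony"]
for row in tony_rows:
    print(row)
```

The first row of each partition has no predecessor, so lag yields null (None in Python) unless a default is supplied.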

(5) Query the order information for the earliest 20% of orders

Analysis: sort by date ascending and take the first 20% of the rows

select
*
from
(
select 
name,
orderdate,
cost,
ntile(5) over(order by orderdate) sortgroup_num
FROM 
business
) t
where t.sortgroup_num=1;
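
With 14 rows and ntile(5), the groups cannot all be the same size, so "20%" is approximate. The sketch below shows how the rows are distributed, using Python's sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later).

```python
import sqlite3

RAW = """jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94"""

conn = sqlite3.connect(":memory:")
conn.execute("create table business (name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)",
                 [(n, d, int(c)) for n, d, c in
                  (line.split(",") for line in RAW.splitlines())])

bucketed = conn.execute("""
    select name, orderdate, cost,
           ntile(5) over (order by orderdate) as sortgroup_num
    from business
    order by orderdate
""").fetchall()

# Rows that the outer query (sortgroup_num = 1) would keep.
first_bucket = [row[:3] for row in bucketed if row[3] == 1]
# Number of rows assigned to each of the five groups.
bucket_sizes = [sum(1 for row in bucketed if row[3] == b) for b in range(1, 6)]

print(first_bucket)
print(bucket_sizes)
```

The extra rows go to the lower-numbered groups, so group 1 holds three of the 14 orders here.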


Origin www.cnblogs.com/sutao-bigdata/p/11608686.html