Hive window and analytic function application scenarios

Source: http://yugouai.iteye.com/blog/1908121
Window function application scenarios:

(1) Sorting within partitions

(2) Dynamic Group By

(3) Top N

(4) Cumulative calculation

(5) Hierarchical query



1. Analytic functions

Analytic functions are used for ranking, percentiles, n-tiles, and similar calculations.

Function        Description
RANK()          Returns the rank of the data item within its group; ties leave a gap in the ranking
DENSE_RANK()    Returns the rank of the data item within its group; ties leave no gap in the ranking
NTILE()         Returns the number of the part (1 to n) a row falls into when the rows are divided into n parts
ROW_NUMBER()    Returns a sequential number for each record


RANK, DENSE_RANK
RANK() leaves gaps in the ranking when several elements share the same rank; DENSE_RANK() does not.

For example, two products of the same type tie for first place:

RANK(): the first and second are both 1, and the third is 3

DENSE_RANK(): the first and second are both 1, and the third is 2

SELECT
  column_name,
  RANK() OVER (ORDER BY column_name DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY column_name DESC) AS dense_rnk
FROM table_name;
The OVER clause is required; the ORDER BY inside its parentheses determines the order in which ranks are assigned.
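To make the tie behavior concrete, the sketch below runs a RANK()/DENSE_RANK() query over three made-up rows. It uses Python's built-in sqlite3 module purely as a stand-in engine so the query can be executed locally (an assumption: SQLite 3.25 or later, which has window functions); the table product_sales and its data are hypothetical, and Hive is not involved.

```python
import sqlite3

# Hypothetical demo data: products A and B tie for first place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?)",
    [("A", 100), ("B", 100), ("C", 80)],
)

rank_rows = conn.execute(
    """
    SELECT product,
           RANK()       OVER (ORDER BY amount DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rnk
    FROM product_sales
    ORDER BY product
    """
).fetchall()

# Each tuple is (product, rnk, dense_rnk).
for row in rank_rows:
    print(row)
```

Product C gets rank 3 from RANK() because the two tied rows use up positions 1 and 2, while DENSE_RANK() gives it 2.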



Note: with ORDER BY, where NULL values are placed depends on the engine's default. In Oracle, NULLs come first with DESC and last with ASC; Hive documents the opposite default (NULLS FIRST for ASC, NULLS LAST for DESC). Where the syntax is supported, the position can be set explicitly with NULLS LAST and NULLS FIRST:

RANK() OVER (ORDER BY column_name DESC NULLS LAST)

PARTITION BY: ranking within groups

RANK() OVER (PARTITION BY month ORDER BY column_name DESC)
This partitions the rows by month: the rows are first grouped by the value of month and then ranked within each group, and the groups do not affect one another.
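A minimal sketch of ranking within partitions, again executed through Python's sqlite3 module as a stand-in engine (assumption: SQLite 3.25 or later). The table monthly_product_sales and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_product_sales (month INTEGER, product TEXT, amount INTEGER)"
)
# Hypothetical data: two products sold in each of two months.
conn.executemany(
    "INSERT INTO monthly_product_sales VALUES (?, ?, ?)",
    [(1, "A", 50), (1, "B", 70), (2, "A", 90), (2, "B", 60)],
)

partition_rows = conn.execute(
    """
    SELECT month,
           product,
           RANK() OVER (PARTITION BY month ORDER BY amount DESC) AS rnk
    FROM monthly_product_sales
    ORDER BY month, rnk
    """
).fetchall()

# The rank restarts at 1 for every month.
for row in partition_rows:
    print(row)
```

Product B ranks first in month 1 and product A ranks first in month 2, because each month is ranked on its own.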



CUBE, ROLLUP, and GROUPING SETS (see: Hive enhanced aggregation) can also be used together with RANK() to implement specific logic.

NTILE
NTILE supports queries by tier. For example, to list the employees whose total salary for the year is in the top fifth, use NTILE(5) to divide all salaries into 5 parts; the rows in part 1 are the result we want:
SELECT empno, ename, SUM(sal),
       NTILE(5) OVER (ORDER BY SUM(sal) DESC NULLS LAST) AS til
FROM emp
GROUP BY empno, ename;
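The top-fifth filter can be sketched as follows. This is an illustration only: it runs on Python's sqlite3 module as a stand-in engine (assumption: SQLite 3.25 or later), the emp table holds ten made-up employees, and the aggregation is done in a subquery before NTILE is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, sal INTEGER)")
# Hypothetical data: ten employees, employee n earns n * 1000.
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [(n, n * 1000) for n in range(1, 11)],
)

top_fifth = conn.execute(
    """
    SELECT empno, total_sal
    FROM (
        SELECT empno,
               total_sal,
               NTILE(5) OVER (ORDER BY total_sal DESC) AS til
        FROM (
            SELECT empno, SUM(sal) AS total_sal
            FROM emp
            GROUP BY empno
        ) AS totals
    ) AS tiled
    WHERE til = 1
    ORDER BY total_sal DESC
    """
).fetchall()

# Ten rows split into 5 parts gives 2 rows per part; part 1 holds the top earners.
print(top_fifth)
```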
ROW_NUMBER
ROW_NUMBER() returns a sequential number for each record, starting from 1.

SELECT
  ROW_NUMBER() OVER (ORDER BY column_name DESC) AS row_name
FROM table_name;
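ROW_NUMBER() is a common building block for the Top N scenario listed at the start of this article. The sketch below keeps the two highest-paid people per department. It is an illustration under stated assumptions: Python's sqlite3 module stands in for the engine (SQLite 3.25 or later), and the staff table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (dept TEXT, name TEXT, sal INTEGER)")
# Hypothetical data: three people in each of two departments.
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("sales", "ann", 300), ("sales", "bob", 200), ("sales", "cy", 100),
        ("ops", "dee", 250), ("ops", "eli", 150), ("ops", "fay", 50),
    ],
)

top_two = conn.execute(
    """
    SELECT dept, name, rn
    FROM (
        SELECT dept,
               name,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY sal DESC) AS rn
        FROM staff
    ) AS numbered
    WHERE rn <= 2
    ORDER BY dept, rn
    """
).fetchall()

for row in top_two:
    print(row)
```

Unlike RANK(), ROW_NUMBER() never repeats a number, so the filter rn <= 2 returns exactly two rows per department even when salaries tie.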


2. Window functions

Window functions compute values over a range of rows or a range of values, for example cumulative sums over a period of time and moving averages.

They can be used together with aggregate functions such as SUM() and AVG().

They can be combined with FIRST_VALUE() and LAST_VALUE() to return the first and last values of the window.

(1) Calculating a cumulative sum

Example: compute cumulative sales from January to December. The value for January is January's sales, the value for February is the sum of January and February, the value for March is the sum of January through March, and the value for December is the sum of January through December.

SELECT 
month,SUM(amount) month_amount, 
SUM( SUM(amount)) OVER (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_amount 
FROM table_name 
GROUP BY month 
ORDER BY month; 
where:

SUM(SUM(amount)): the inner SUM(amount) is the value to be accumulated; it is the same per-month total that the query outputs as month_amount.

ORDER BY month: sorts the rows by month within the window.

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: defines the start and end of the window. UNBOUNDED PRECEDING is the start, meaning the window begins at the first row. CURRENT ROW is the end; because it is the default end point, the clause is equivalent to:

ROWS UNBOUNDED PRECEDING

N PRECEDING: the N rows before the current row.

N FOLLOWING: the N rows after the current row.
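The running total can be checked on a few rows. In this sketch the per-month totals are computed in a subquery and then accumulated by the window function, which is one way to express the same calculation. Python's sqlite3 module is used only as a stand-in engine (assumption: SQLite 3.25 or later), and the sales table and its data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, amount INTEGER)")
# Hypothetical data: month 1 has two sales records, months 2 and 3 have one each.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (1, 5), (2, 20), (3, 30)],
)

cumulative_rows = conn.execute(
    """
    SELECT month,
           month_amount,
           SUM(month_amount) OVER (
               ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_amount
    FROM (
        SELECT month, SUM(amount) AS month_amount
        FROM sales
        GROUP BY month
    ) AS monthly
    ORDER BY month
    """
).fetchall()

# Each tuple is (month, month_amount, cumulative_amount).
for row in cumulative_rows:
    print(row)
```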



Sum over the current month and the 3 preceding months:

SUM(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS cumulative_amount

which can also be written as

SUM(SUM(amount)) OVER (ORDER BY month ROWS 3 PRECEDING) AS cumulative_amount

Sum over the previous month, the current month, and the following month:

SUM(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS cumulative_amount

First and last values of the window:

FIRST_VALUE(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS xxxx;

LAST_VALUE(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS xxxx;
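The effect of a frame of 1 PRECEDING to 1 FOLLOWING is easiest to see on a few rows. The sketch below applies SUM, FIRST_VALUE, and LAST_VALUE over that frame. It is illustrative only: Python's sqlite3 module stands in for the engine (assumption: SQLite 3.25 or later), and the monthly_sales table, which already holds one total per month, is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month INTEGER, amount INTEGER)")
# Hypothetical data: one pre-aggregated total per month.
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

moving_rows = conn.execute(
    """
    SELECT month,
           SUM(amount) OVER (
               ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_sum,
           FIRST_VALUE(amount) OVER (
               ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS first_in_window,
           LAST_VALUE(amount) OVER (
               ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS last_in_window
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

# Each tuple is (month, moving_sum, first_in_window, last_in_window).
for row in moving_rows:
    print(row)
```

At the edges the frame simply shrinks: month 1 has no preceding row, so its window covers only months 1 and 2.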


3. LAG, LEAD

LAG() and LEAD() return data from the record at a specified offset from the current record.

LAG() looks at preceding rows; LEAD() looks at following rows.

LAG(column_name1, 1) OVER (ORDER BY column_name2)

LEAD(column_name1, 1) OVER (ORDER BY column_name2)
These return the value from the previous row and from the next row, respectively.
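A short sketch of LAG and LEAD with an offset of 1, executed through Python's sqlite3 module as a stand-in engine (assumption: SQLite 3.25 or later). The monthly_sales table and its rows are made up; rows with no neighbor at the requested offset get NULL, which Python shows as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month INTEGER, amount INTEGER)")
# Hypothetical data: one total per month.
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],
)

lag_lead_rows = conn.execute(
    """
    SELECT month,
           amount,
           LAG(amount, 1)  OVER (ORDER BY month) AS prev_amount,
           LEAD(amount, 1) OVER (ORDER BY month) AS next_amount
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

# Each tuple is (month, amount, prev_amount, next_amount).
for row in lag_lead_rows:
    print(row)
```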



4. FIRST and LAST

FIRST and LAST return the first and last values in a sorted group, and can be combined with aggregate functions. Note that the KEEP (DENSE_RANK FIRST/LAST ...) form shown here is Oracle syntax and is not among Hive's documented windowing and analytics functions.

SELECT
  MIN(month) KEEP (DENSE_RANK FIRST ORDER BY SUM(amount)) AS lowest_sales_month,
  MIN(month) KEEP (DENSE_RANK LAST ORDER BY SUM(amount)) AS highest_sales_month
FROM table_name
GROUP BY month
ORDER BY month;
This returns the months with the lowest and highest sales in the year. Because the ordering is ascending by default, FIRST picks the lowest total and LAST the highest.

The output is the month, but the comparison is made on SUM(amount).
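The same result can be obtained with window functions alone. The sketch below is one possible formulation, not the article's original method: it uses FIRST_VALUE over two opposite orderings of the per-month totals, with month as a tie-breaker to mimic MIN(month). Python's sqlite3 module stands in for the engine (assumption: SQLite 3.25 or later), and the sales table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, amount INTEGER)")
# Hypothetical data: totals are 15 for month 1, 40 for month 2, 30 for month 3.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (1, 5), (2, 40), (3, 30)],
)

extreme_months = conn.execute(
    """
    SELECT DISTINCT
           FIRST_VALUE(month) OVER (ORDER BY month_amount DESC, month)
               AS highest_sales_month,
           FIRST_VALUE(month) OVER (ORDER BY month_amount ASC, month)
               AS lowest_sales_month
    FROM (
        SELECT month, SUM(amount) AS month_amount
        FROM sales
        GROUP BY month
    ) AS monthly
    """
).fetchall()

# Every row carries the same pair of values, so DISTINCT collapses them to one row.
print(extreme_months)
```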
