HiveSql trick of the day: using the distribution function percent_rank() to compute the average salary without the maximum and minimum values

0 Problem description

Reference link

HiveSql Interview Question 12--How to analyze the average salary without the maximum and minimum values (ByteDance) - Mo Ming Pomegranate Sister's Blog - CSDN Blog

That article gives three solutions. Here we use the same question to study how the percent_rank() function can solve it and simplify the approach.

1 Using the percent_rank() function

The percent_rank() function is a distribution function that returns the percentage rank of a value within a sorted data set. Its result lies in the interval [0,1] and describes the relative position of the value in the data set.

Calculation formula: (rank of the current row - 1) / (number of rows in the group - 1). Subtracting 1 excludes the row itself from the count, so the result shows what share of the other rows rank ahead of it, which has real analytical value in practice.

Usage scenario: when you care about how many people are ahead of you.

For example, with class grades ranked from highest to lowest, a returned value of 60% means that 60% of the other students rank ahead of that score.

For example, when standing in line, I often care about how many people are in front of me. Take the following set of scores, ranked from highest to lowest:

score   ranking   Percentage rank (percent_rank)
100     1         0%
100     1         0%
80      3         33%
40      4         50%
40      4         50%
20      6         83%
10      7         100%

For a person with a score of 20, there are 5 people in front of him and 6 people in total besides himself, so his percentage rank is 5/6, about 83%.

For a score of 10, there are 6 people in front of him, so everyone else in the group scores higher than he does and the percentage rank is 100%.
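The formula can be checked against the table with a few lines of code. The following is a minimal Python sketch, not Hive code: the helper name percent_rank_table and the descending flag are made up for illustration, and ties are assumed to share the lowest rank of their group, as in the table.

```python
def percent_rank_table(values, descending=True):
    """Return (value, rank, percent_rank) tuples; tied values share the lowest rank."""
    ordered = sorted(values, reverse=descending)
    n = len(ordered)
    rows = []
    for v in ordered:
        rank = ordered.index(v) + 1  # first position of v is the rank shared by its ties
        pr = (rank - 1) / (n - 1) if n > 1 else 0.0  # (rank - 1) / (rows - 1)
        rows.append((v, rank, pr))
    return rows


scores = [100, 100, 80, 40, 40, 20, 10]
table = percent_rank_table(scores)
for value, rank, pr in table:
    print(f"{value:>5} {rank:>4} {pr:>6.0%}")
```

The score 20 gets rank 6 and (6 - 1) / (7 - 1) = 5/6, and the score 10 gets rank 7 and 6/6 = 100%, matching the table.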

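Points to note: (1) How percent_rank() handles duplicate values: tied rows share the same rank and therefore the same percentage rank, as the two rows with a score of 100 and the two rows with a score of 40 show.

(2) How percent_rank() handles NULL values: NULL rows take part in the ordering too, so check where they sort before relying on the result.

Features: the first row always gets 0, and the last row gets 1 as long as its value is not tied with another row. Tied rows share the lowest rank of their group, so a tie in last place yields a value below 1.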

cume_dist(): cumulative percentage

Similar to percent_rank(); the difference is that cume_dist() counts the current row itself, while percent_rank() excludes it.

Meaning:

Sorted in ascending order: the percentage of rows whose value is less than or equal to the current value

Sorted in descending order: the percentage of rows whose value is greater than or equal to the current value
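To make the difference concrete, here is a small Python sketch that computes both functions in ascending order for the scores used in section 1. The helper names percent_rank and cume_dist mirror the SQL functions but are plain Python written for illustration, not a Hive API.

```python
def percent_rank(values, value):
    """Rows strictly before `value` in ascending order, divided by (rows - 1)."""
    n = len(values)
    before = sum(1 for v in values if v < value)
    return before / (n - 1) if n > 1 else 0.0


def cume_dist(values, value):
    """Rows less than or equal to `value` (the row itself included), divided by rows."""
    return sum(1 for v in values if v <= value) / len(values)


scores = [100, 100, 80, 40, 40, 20, 10]
for s in sorted(set(scores)):
    print(s, round(percent_rank(scores, s), 2), round(cume_dist(scores, s), 2))
```

In ascending order the two rows with a score of 100 tie for last place: cume_dist gives them 7/7 = 1, while percent_rank gives them 5/6, because they share the lower rank of the tie.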

2 Problem analysis

The question asks for the average after removing the maximum and minimum values, so the hard part is how to remove them. As analyzed above, percent_rank() returns the relative position of the current row after ranking by a sort key, and its value lies in the [0,1] interval. With salaries sorted in ascending order, the value 0 marks the minimum and the value 1 marks the maximum (provided the maximum is unique within the department). One call to this function therefore flags both extremes, whereas row_number() or dense_rank() cannot identify both the maximum and the minimum with a single sort, and this simplifies the solution. The specific SQL is as follows:

with salary as (
    -- sample data; salary is numeric so that ORDER BY compares numbers, not strings
    select '10001' emp_num, '1' dep_num, 60117 salary
    union all
    select '10002' emp_num, '2' dep_num, 92102 salary
    union all
    select '10003' emp_num, '2' dep_num, 86074 salary
    union all
    select '10004' emp_num, '1' dep_num, 66596 salary
    union all
    select '10005' emp_num, '1' dep_num, 66961 salary
    union all
    select '10006' emp_num, '2' dep_num, 81046 salary
    union all
    select '10007' emp_num, '2' dep_num, 94333 salary
    union all
    select '10008' emp_num, '1' dep_num, 75286 salary
    union all
    select '10009' emp_num, '2' dep_num, 85994 salary
    union all
    select '10010' emp_num, '1' dep_num, 76884 salary
)
select dep_num
     , cast(avg(salary) as decimal(18,0)) as avg_salary
from (
    select emp_num
         , dep_num
         , salary
           -- within each department: 0 marks the lowest salary, 1 the highest
         , percent_rank() over (partition by dep_num order by salary) as rate
    from salary
) t
where rate != 0 and rate != 1   -- drop the minimum and maximum rows
group by dep_num;
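As a cross-check of what the query should return, the same logic can be sketched in plain Python on the same sample data. This is an illustration under stated assumptions rather than a substitute for the SQL: the helper trimmed_avg is hypothetical, and the cast to decimal(18,0) is assumed to round to the nearest integer.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (emp_num, dep_num, salary): the same sample rows as in the query
rows = [
    ("10001", "1", 60117), ("10002", "2", 92102), ("10003", "2", 86074),
    ("10004", "1", 66596), ("10005", "1", 66961), ("10006", "2", 81046),
    ("10007", "2", 94333), ("10008", "1", 75286), ("10009", "2", 85994),
    ("10010", "1", 76884),
]

by_dep = defaultdict(list)
for _, dep, salary in rows:
    by_dep[dep].append(salary)


def trimmed_avg(salaries):
    """Average the rows whose ascending percent_rank is neither 0 nor 1."""
    n = len(salaries)
    kept = []
    for s in salaries:
        # percent_rank = (rows strictly below s) / (rows - 1)
        rate = sum(1 for v in salaries if v < s) / (n - 1) if n > 1 else 0.0
        if rate != 0 and rate != 1:
            kept.append(s)
    if not kept:
        return None  # nothing left after dropping the extremes
    avg = Decimal(sum(kept)) / len(kept)
    # assumption: the decimal(18,0) cast rounds to the nearest integer
    return avg.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


result = {dep: trimmed_avg(s) for dep, s in sorted(by_dep.items())}
print(result)
```

With these data, department 1 averages 66596, 66961 and 75286 to 69614, and department 2 averages 85994, 86074 and 92102 to 88057. Note that if several rows tie for the highest salary, none of them reaches a rate of 1, so they all stay in the average: trimmed_avg([1, 2, 3, 3]) keeps 2, 3 and 3.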

3 Summary

This article shows how to use percent_rank() to find the average salary without the maximum and minimum values. The method is concise, since a single window function and a single sort flag both extremes, and it is worth learning. The key points to take away from this article are as follows:

  • What are the function, significance and usage scenarios of the percent_rank() function?

  • How is the result of the percent_rank() function calculated?

  • What is the difference between the percent_rank() and cume_dist() functions?

  • How can the characteristics of the percent_rank() function be used to quickly mark the maximum and minimum values?

Origin blog.csdn.net/godlovedaniel/article/details/129000969