HiveSQL Interview Question 12: How to compute the average salary excluding the maximum and minimum values (ByteDance)

Table of Contents

0 Problem description

1 Data preparation

2 Data analysis

3 Summary


0 Problem description

  • The salary table holds basic employee salary information: employee number, department number, and salary.
  • The first row indicates that employee 10001 is in department 1 and earns 60117 yuan.
  • The second row indicates that employee 10002 is in department 2 and earns 92102 yuan.
  • ...
  • Row 10 indicates that employee 10010 is in department 1 and earns 76884 yuan.

Question: For each department, compute the average salary after removing the highest and the lowest salary, rounded to a whole number.

1 Data preparation

(1) Data

Basic data
emp_num dep_num salary
10001 1 60117
10002 2 92102
10003 2 86074
10004 1 66596
10005 1 66961
10006 2 81046
10007 2 94333
10008 1 75286
10009 2 85994
10010 1 76884

 

(2) Create table SQL

drop table if exists dan_test.salary;

CREATE TABLE dan_test.salary (
  emp_num string,
  dep_num string,
  salary  int    -- numeric type, so that order by / max / min compare numbers rather than strings
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

(3) Load data

load data local inpath "/home/centos/dan_test/salary.txt" into table dan_test.salary;
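The table is declared with FIELDS TERMINATED BY ",", so the file being loaded needs one employee per line with comma-separated fields, unlike the space-separated listing shown in (1). A minimal sketch of producing such a file follows; the function name write_salary_file and the temporary output location are assumptions made for illustration, not part of the original setup.

```python
import csv
import os
import tempfile

# Sample rows from section 1: (emp_num, dep_num, salary).
ROWS = [
    ("10001", "1", 60117), ("10002", "2", 92102), ("10003", "2", 86074),
    ("10004", "1", 66596), ("10005", "1", 66961), ("10006", "2", 81046),
    ("10007", "2", 94333), ("10008", "1", 75286), ("10009", "2", 85994),
    ("10010", "1", 76884),
]


def write_salary_file(path):
    """Write the sample rows as comma-separated lines, one employee per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(ROWS)


# Hypothetical output location; copy the file to the path used by "load data".
out_path = os.path.join(tempfile.gettempdir(), "salary.txt")
write_salary_file(out_path)

with open(out_path) as f:
    print(f.readline().strip())  # 10001,1,60117
```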

 (4) Query data

 

2 Data analysis

Goal: for each department, compute the average salary after removing the highest and the lowest salary, rounded to a whole number.

The requirement breaks down into three parts:

  • (1) Per department: group (or partition) by department.
  • (2) Exclude the highest and lowest salary: find them first, then filter them out. This is the crux of the question.
  • (3) Average what remains after (1) and (2), and round to an integer.

Idea 1: Find and filter out the highest and lowest salary in each department. Use the window function row_number() to number the rows within each department twice: once ordered by salary ascending and once ordered by salary descending.

The SQL is as follows:

select emp_num
      ,dep_num
      ,salary
      ,row_number() over(partition by dep_num order by salary) as rn1      -- rn1 = 1 marks the lowest salary
      ,row_number() over(partition by dep_num order by salary desc) as rn2 -- rn2 = 1 marks the highest salary
from salary

The query results are as follows:

To keep only the rows in between, require both rn1 > 1 and rn2 > 1. A note on the output: when a query contains several window functions, the rows typically come back in the order produced by the last window function, so here the salary column appears sorted according to the last window (descending). This affects only the display order, not the values; add an explicit order by if a particular order is required. The final SQL is as follows:

select *
from(
select emp_num
      ,dep_num
      ,salary
      ,row_number() over(partition by dep_num order by salary) as rn1      -- rn1 = 1 marks the lowest salary
      ,row_number() over(partition by dep_num order by salary desc) as rn2 -- rn2 = 1 marks the highest salary
from salary
) t
where rn1 > 1 and rn2 > 1
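To see which rows survive the rn1 > 1 and rn2 > 1 filter, the query can be tried on the ten sample rows. The sketch below runs it in an in-memory SQLite database, which is only a stand-in for Hive (an assumption: it needs SQLite 3.25 or later for window functions, and the two dialects differ in places). The function name middle_rows is made up for illustration.

```python
import sqlite3

# Sample rows from section 1: (emp_num, dep_num, salary).
ROWS = [
    ("10001", "1", 60117), ("10002", "2", 92102), ("10003", "2", 86074),
    ("10004", "1", 66596), ("10005", "1", 66961), ("10006", "2", 81046),
    ("10007", "2", 94333), ("10008", "1", 75286), ("10009", "2", 85994),
    ("10010", "1", 76884),
]


def middle_rows():
    """Return (dep_num, emp_num, salary) rows that are neither lowest nor highest."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("CREATE TABLE salary (emp_num TEXT, dep_num TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO salary VALUES (?, ?, ?)", ROWS)
    sql = """
        SELECT dep_num, emp_num, salary
        FROM (
            SELECT emp_num, dep_num, salary,
                   row_number() OVER (PARTITION BY dep_num ORDER BY salary)      AS rn1,
                   row_number() OVER (PARTITION BY dep_num ORDER BY salary DESC) AS rn2
            FROM salary
        ) t
        WHERE rn1 > 1 AND rn2 > 1
        ORDER BY dep_num, salary  -- explicit order so the output is deterministic
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


for row in middle_rows():
    print(row)
```

Three rows remain per department: 66596, 66961, 75286 for department 1 and 85994, 86074, 92102 for department 2.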

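Idea 2: Sort only once. First compute the number of rows in each department (cn), then number the rows by salary in ascending order (rn); the rows we need satisfy 1 < rn < cn. Because rn = 1 and rn = cn always refer to two different rows when cn >= 2, this removes exactly one lowest and one highest row per department even when salaries tie. The specific SQL is as follows: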

select emp_num
      ,dep_num
      ,salary
      ,count(1) over(partition by dep_num) as cn                      -- total number of rows in the department
      ,row_number() over(partition by dep_num order by salary) as rn  -- row number by salary, ascending
from salary

 

The query with the filter applied is as follows:

select *
from(
select emp_num
      ,dep_num
      ,salary
      ,count(1) over(partition by dep_num) as cn                      -- total number of rows in the department
      ,row_number() over(partition by dep_num order by salary) as rn  -- row number by salary, ascending
from salary
) t
where rn > 1 and rn < cn

The SQL to compute the final average is as follows:

-- Method 1
select t.dep_num,round(avg(t.salary),0) as avg_salary
from
  (
  select *,
  row_number() over (partition by dep_num order by salary desc) as rn1,
  row_number() over (partition by dep_num order by salary) as rn2
  from salary
  ) t
where t.rn1 > 1 and t.rn2 > 1
group by t.dep_num;

-- Method 2
select t.dep_num,round(avg(t.salary),0) as avg_salary
from
  (
  select emp_num
        ,dep_num
        ,salary
        ,count(1) over(partition by dep_num) as cn                      -- total number of rows in the department
        ,row_number() over(partition by dep_num order by salary) as rn  -- row number by salary, ascending
  from salary
  ) t
where t.rn > 1 and t.rn < t.cn
group by t.dep_num;

The calculation results are as follows:
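As a rough cross-check, the Method 2 query can be run against the ten sample rows in an in-memory SQLite database. SQLite is only a stand-in for Hive here (an assumption; it needs version 3.25 or later for window functions), and the function name trimmed_avg_by_dept is hypothetical.

```python
import sqlite3

# Sample rows from section 1: (emp_num, dep_num, salary).
ROWS = [
    ("10001", "1", 60117), ("10002", "2", 92102), ("10003", "2", 86074),
    ("10004", "1", 66596), ("10005", "1", 66961), ("10006", "2", 81046),
    ("10007", "2", 94333), ("10008", "1", 75286), ("10009", "2", 85994),
    ("10010", "1", 76884),
]


def trimmed_avg_by_dept():
    """Average salary per department after dropping one lowest and one highest row."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("CREATE TABLE salary (emp_num TEXT, dep_num TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO salary VALUES (?, ?, ?)", ROWS)
    sql = """
        SELECT t.dep_num, round(avg(t.salary), 0) AS avg_salary
        FROM (
            SELECT dep_num, salary,
                   count(1)     OVER (PARTITION BY dep_num)                 AS cn,
                   row_number() OVER (PARTITION BY dep_num ORDER BY salary) AS rn
            FROM salary
        ) t
        WHERE t.rn > 1 AND t.rn < t.cn
        GROUP BY t.dep_num
        ORDER BY t.dep_num
    """
    # round() returns a floating-point value, so convert to int for display.
    result = {dep: int(avg) for dep, avg in conn.execute(sql)}
    conn.close()
    return result


print(trimmed_avg_by_dept())  # {'1': 69614, '2': 88057}
```

By hand: department 1 keeps 66596, 66961, 75286 (sum 208843, mean about 69614.33), and department 2 keeps 85994, 86074, 92102 (sum 264170, mean about 88056.67).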

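Method 3: Formula method. Follow the ordinary way of computing an average, grouped by department: (sum(salary) - max(salary) - min(salary)) / (count(salary) - 2). The parentheses around count(salary) - 2 matter, and a department needs at least three employees for the result to be meaningful.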
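The arithmetic behind the formula can be checked on the department 1 salaries. The sketch below is plain Python rather than SQL, and the helper name trimmed_mean is made up for illustration; it also shows what happens when the denominator is not parenthesized.

```python
# Department 1 and department 2 salaries from the sample data.
dept1 = [60117, 66596, 66961, 75286, 76884]
dept2 = [92102, 86074, 81046, 94333, 85994]


def trimmed_mean(values):
    """(sum - max - min) / (count - 2); requires at least three values."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    return (sum(values) - max(values) - min(values)) / (len(values) - 2)


correct = trimmed_mean(dept1)  # 208843 / 3
# Without parentheses around (count - 2), the division happens first
# and 2 is subtracted from the quotient: 208843 / 5 - 2.
wrong = (sum(dept1) - max(dept1) - min(dept1)) / len(dept1) - 2

print(round(correct), round(wrong))  # 69614 41767
print(round(trimmed_mean(dept2)))    # 88057
```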

(1) First find the maximum salary, the minimum salary, the salary total, and the number of salaries for each department. The specific SQL is as follows:

select 
       dep_num
      ,count(salary) as cn         -- number of salaries
      ,max(salary) as max_salary   -- highest salary
      ,min(salary) as min_salary   -- lowest salary
      ,sum(salary) as sum_salary   -- salary total
from salary
group by dep_num

The query result is as follows:

(2) Compute each department's average after removing the maximum and minimum values (formula method).

After grouping by department again in the outer query, any column that is not in the group by clause can only be referenced inside an aggregate function, so we wrap the remaining fields in max(). Does this change the result? No: the inner query has already aggregated the data, so each department has exactly one row and each field has exactly one value, and max(), min(), and sum() all return that same value. Strictly speaking, since the inner query already returns one row per department, the expression could also be computed without the outer group by; it is used here to illustrate the technique.

This is a handy trick: when a group is known to contain a single value, max() or min() can be used to pull that value out. Aggregate functions also skip NULL values, which is often convenient for outer queries.
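The claim that max(), min(), and sum() agree when a group holds a single value, and that aggregates skip NULL, can be tried on a tiny table. This sketch uses an in-memory SQLite database as a stand-in for Hive (an assumption). The table name stats is hypothetical, the two totals are the department salary sums from the sample data, and the extra NULL row for department 2 is invented purely to show the NULL behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute("CREATE TABLE stats (dep_num TEXT, sum_salary INTEGER)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?)",
    [
        ("1", 345844),  # salary total of department 1
        ("2", 439549),  # salary total of department 2
        ("2", None),    # hypothetical NULL row, ignored by the aggregates
    ],
)

# With one non-NULL value per group, every aggregate returns that value.
extracted = conn.execute(
    """
    SELECT dep_num, max(sum_salary), min(sum_salary), sum(sum_salary)
    FROM stats
    GROUP BY dep_num
    ORDER BY dep_num
    """
).fetchall()
conn.close()

for row in extracted:
    print(row)
```

Each output row repeats the same number three times, which is why the choice of max() over min() or sum() does not matter in the final query.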

The final SQL is as follows:

select dep_num
    ,round((max(sum_salary) - max(max_salary) - max(min_salary)) / (max(cn) - 2), 0) as avg_salary -- max() extracts the columns that are not in the group by, for the outer calculation
from (
select 
       dep_num
      ,count(salary) as cn         -- number of salaries
      ,max(salary) as max_salary   -- highest salary
      ,min(salary) as min_salary   -- lowest salary
      ,sum(salary) as sum_salary   -- salary total
from salary
group by dep_num
) t
group by dep_num

The final result is as follows:

3 Summary

Summary of knowledge points used in this article:

  • (1) The ranking function row_number(): what it does and how to use it.
  • (2) The round() function.
  • (3) Techniques for excluding the maximum and minimum values.
  • (4) How to extract the value of a column that is not in the group by clause.


Origin blog.csdn.net/godlovedaniel/article/details/112372060