Acerca de la función de ventana de Hive

1. Gramática básica

    Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
    [<window_expression>])
Function (arg1,..., argn) 可以是下面的函数:
    Aggregate Functions: 聚合函数,比如:sum(...)、 max(...)、min(...)、avg(...)等.
    Sort Functions: 数据排序函数, 比如 :rank(...)、row_number(...)等.
    Analytics Functions: 统计和比较函数, 比如:lead(...)、lag(...)、 first_value(...)等.

2. Preparación preliminar:

2.1. Declaración de construcción de tablas:

CREATE TABLE IF NOT EXISTS temp.test (
                `name` string COMMENT '姓名',
                `dept_num` int COMMENT '编号',
                `employee_id` int COMMENT 'id',
                `salary` int COMMENT '工资',
                `type` string COMMENT '岗位类型',
                `start_date` date COMMENT '入职时间'
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED as TEXTFILE;

2.2. Cree un archivo de prueba localmente:

name    dept_num    employee_id salary  type    start_date
Michael 1000        100         5000    full    2014-01-29
Will    1000        101         4000    full    2013-10-02
Wendy   1000        101         4000    part    2014-10-02
Steven  1000        102         6400    part    2012-11-03
Lucy    1000        103         5500    full    2010-01-03
Lily    1001        104         5000    part    2014-11-29
Jess    1001        105         6000    part    2014-12-02
Mike    1001        106         6400    part    2013-11-03
Wei     1002        107         7000    part    2010-04-03
Yun     1002        108         5500    full    2014-01-29
Richard 1002        109         8000    full    2013-09-01

2.3. Cargue el archivo local creado a la biblioteca de la colmena:

load data local inpath '/root/test' into table temp.test;

3. Función de agregación de ventanas

3.1. Consultar nombre, número de departamento, salario y número de personas del departamento

select `name`,`dept_num`,`salary`,
count(*) over (partition by dept_num) as cnt 
from employee;

3.1.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
name    dept_num    salary    cnt
Lucy    1000        5500      5
Steven  1000        6400      5
Wendy   1000        4000      5
Will    1000        4000      5
Michael 1000        5000      5
Mike    1001        6400      3
Jess    1001        6000      3
Lily    1001        5000      3
Richard 1002        8000      3
Yun     1002        5500      3
Wei     1002        7000      3
Time taken: 22.624 seconds, Fetched: 12 row(s)

3.2. Consultar el nombre, número de departamento, salario y el salario total de cada departamento. El salario total del departamento se emite en orden descendente

select name,dept_num,salary,
sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary 
from employee order by sum_dept_salary desc;

3.2.1. El resultado de la salida es:

Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
name    dept_num    salary    sum_dept_salary
Michael 1000        5000      24900
Will    1000        4000      24900
Wendy   1000        4000      24900
Steven  1000        6400      24900
Lucy    1000        5500      24900
Wei     1002        7000      20500
Yun     1002        5500      20500
Richard 1002        8000      20500
Lily    1001        5000      17400
Jess    1001        6000      17400
Mike    1001        6400      17400
Time taken: 47.313 seconds, Fetched: 12 row(s)

4. Función de clasificación de ventanas

4.1. Nombre de la consulta, número de departamento, salario, número de clasificación (clasificación por salario)

select `name`,`dept_num`,`salary`,
row_number() over (order by salary desc ) rnum 
from employee;

4.1.1. El resultado de la salida es:

Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
name    dept_num    salary    rnum
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      4
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      7
Lily    1001        5000      8
Michael 1000        5000      9
Wendy   1000        4000      10
Will    1000        4000      11
Time taken: 22.453 seconds, Fetched: 12 row(s)

4.2. Consultar la información de las dos personas con mayor salario en cada departamento (nombre, departamento, salario)

select name,dept_num,salary 
from (
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee) t1 where rnum <= 2;

4.2.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
name    dept_num    salary
Steven  1000        6400
Lucy    1000        5500
Mike    1001        6400
Jess    1001        6000
Richard 1002        8000
Wei     1002        7000
Time taken: 24.083 seconds, Fetched: 7 row(s)

4.3. Consultar información de clasificación salarial de los empleados de cada departamento

select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee;

4.3.1. El resultado de la salida es:

Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
name    dept_num    salary    rnum
Steven  1000        6400      1
Lucy    1000        5500      2
Michael 1000        5000      3
Wendy   1000        4000      4
Will    1000        4000      5
Mike    1001        6400      1
Jess    1001        6000      2
Lily    1001        5000      3
Richard 1002        8000      1
Wei     1002        7000      2
Yun     1002        5500      3
Time taken: 23.202 seconds, Fetched: 12 row(s)

4.4, use la función de rango para clasificar

select `name`,`dept_num`,`salary`,
rank() over (order by salary desc) rank
from employee;

4.4.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      6
Lily    1001        5000      8
Michael 1000        5000      8
Wendy   1000        4000      10
Will    1000        4000      10
Time taken: 21.547 seconds, Fetched: 12 row(s)

4.5, use dense_rank para clasificar

select `name`,`dept_num`,`salary`,
dense_rank() over (order by salary desc) rank
from employee;

4.5.1 El resultado de la salida es:

Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      4
Yun     1002        5500      5
Lucy    1000        5500      5
Lily    1001        5000      6
Michael 1000        5000      6
Wendy   1000        4000      7
Will    1000        4000      7
Time taken: 21.879 seconds, Fetched: 12 row(s)

4.6, use percent_rank () para la clasificación

select name,dept_num,salary,
percent_rank() over (order by salary desc) rank
from employee;

4.6.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    dept_num    salary  rank
Richard 1002        8000    0.0
Wei     1002        7000    0.09090909090909091
Mike    1001        6400    0.18181818181818182
Steven  1000        6400    0.18181818181818182
Jess    1001        6000    0.36363636363636365
Yun     1002        5500    0.45454545454545453
Lucy    1000        5500    0.45454545454545453
Lily    1001        5000    0.6363636363636364
Michael 1000        5000    0.6363636363636364
Wendy   1000        4000    0.8181818181818182
Will    1000        4000    0.8181818181818182
Time taken: 22.401 seconds, Fetched: 12 row(s)

4.7. Utilice ntile para clasificar segmentos de datos

SELECT name,dept_num as deptno,salary,
ntile(4) OVER(ORDER BY salary desc) as ntile
FROM employee;

4.7.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    deptno  salary  ntile
Richard 1002    8000    1
Wei     1002    7000    1
Mike    1001    6400    1
Steven  1000    6400    2
Jess    1001    6000    2
Yun     1002    5500    2
Lucy    1000    5500    3
Lily    1001    5000    3
Michael 1000    5000    3
Wendy   1000    4000    4
Will    1000    4000    4
Time taken: 28.829 seconds, Fetched: 12 row(s)

5. Función de análisis de ventana

5.1. Contar la proporción del número de personas menor o igual al salario actual respecto al número total de personas

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary) as cume
FROM employee;

5.1.1. El resultado de la salida es:

name    deptno  salary  cume
Wendy   1000    4000    0.25
Will    1000    4000    0.25
Lily    1001    5000    0.4166666666666667
Michael 1000    5000    0.4166666666666667
Yun     1002    5500    0.5833333333333334
Lucy    1000    5500    0.5833333333333334
Jess    1001    6000    0.6666666666666666
Mike    1001    6400    0.8333333333333334
Steven  1000    6400    0.8333333333333334
Wei     1002    7000    0.9166666666666666
Richard 1002    8000    1.0
Time taken: 20.869 seconds, Fetched: 12 row(s)

5.2. Contar la proporción del número de personas menor o igual al salario actual con el número total de personas

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary desc) as cume
FROM employee;

5.2.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
name    dept_num    salary  cume
Richard 1002        8000    0.08333333333333333
Wei     1002        7000    0.16666666666666666
Mike    1001        6400    0.3333333333333333
Steven  1000        6400    0.3333333333333333
Jess    1001        6000    0.4166666666666667
Yun     1002        5500    0.5833333333333334
Lucy    1000        5500    0.5833333333333334
Lily    1001        5000    0.75
Michael 1000        5000    0.75
Wendy   1000        4000    0.9166666666666666
Will    1000        4000    0.9166666666666666
Time taken: 21.672 seconds, Fetched: 12 row(s)

5.3. Contar la proporción del número de personas menor o igual al salario actual con el número total de personas

SELECT name,dept_num,salary,
cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
FROM employee;

5.3.1. El resultado de la salida es:

Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
name    dept_num    salary  cume
Wendy   1000        4000    0.4
Will    1000        4000    0.4
Michael 1000        5000    0.6
Lucy    1000        5500    0.8
Steven  1000        6400    1.0
Lily    1001        5000    0.3333333333333333
Jess    1001        6000    0.6666666666666666
Mike    1001        6400    1.0
Yun     1002        5500    0.3333333333333333
Wei     1002        7000    0.6666666666666666
Richard 1002        8000    1.0
Time taken: 22.055 seconds, Fetched: 12 row(s)

5.4. Contar la proporción del número de personas menor o igual al salario actual con el número total de personas

SELECT name,dept_num,salary,
lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.4.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    4000
Will    1000        4000    5000
Michael 1000        5000    5500
Lucy    1000        5500    6400
Steven  1000        6400    NULL
Lily    1001        5000    6000
Jess    1001        6000    6400
Mike    1001        6400    NULL
Yun     1002        5500    7000
Wei     1002        7000    8000
Richard 1002        8000    NULL
Time taken: 21.57 seconds, Fetched: 12 row(s)

5.5. Calcular la proporción del número de personas menor o igual al salario actual con respecto al número total

SELECT name,dept_num,salary,
lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.5.1 El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    NULL
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    5000
Steven  1000        6400    5500
Lily    1001        5000    NULL
Jess    1001        6000    5000
Mike    1001        6400    6000
Yun     1002        5500    NULL
Wei     1002        7000    5500
Richard 1002        8000    7000
Time taken: 21.423 seconds, Fetched: 12 row(s)

5.6. Contar la proporción del número de personas menor o igual al salario actual con el número total de personas

SELECT name,dept_num,salary,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
FROM employee;

5.6.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
name    dept_num    salary  fval
Wendy   1000        4000    4000
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    4000
Steven  1000        6400    4000
Lily    1001        5000    5000
Jess    1001        6000    5000
Mike    1001        6400    5000
Yun     1002        5500    5500
Wei     1002        7000    5500
Richard 1002        8000    5500
Time taken: 20.379 seconds, Fetched: 12 row(s)

5.7. Contar la proporción del número de personas menor o igual al salario actual con el número total de personas

SELECT name,dept_num,salary,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
FROM employee;

5.7.1. El resultado de salida es:

Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK
name    dept_num    salary  lval
Wendy   1000        4000    6400
Will    1000        4000    6400
Michael 1000        5000    6400
Lucy    1000        5500    6400
Steven  1000        6400    6400
Lily    1001        5000    6400
Jess    1001        6000    6400
Mike    1001        6400    6400
Yun     1002        5500    8000
Wei     1002        7000    8000
Richard 1002        8000    8000
Time taken: 21.649 seconds, Fetched: 12 row(s)

Supongo que te gusta

Origin blog.51cto.com/15080921/2595867
Recomendado
Clasificación