Select first row in each GROUP BY group?

stackflow上面的一个问题。用窗口函数比较简单，但是那些没有窗口函数的数据库怎么办？

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS ranks
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.ranks = 1

所以给出通用方法：

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer, 
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                             AND y.max_total = x.total
GROUP BY x.customer, x.total

PS：原博还提到了一种Postresql中特有的解法：DISTINCT ON ()

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated.

--如果total可以为空，则最可能希望具有最大非空值的行。最后添加空值。具体可参照：

PostgreSQL sort by datetime asc, null first?

其实有点不明白distinct on，看前辈的博客点击打开链接。还用了IN 子查询

　DISTINCT ON ( expression [, …] )把记录根据[, …]的值进行分组，分组之后仅返回每一组的第一行。需要注意的是，如果你不指定ORDER BY子句，返回的第一条的不确定的。如果你使用了ORDER BY 子句，那么[, …]里面的值必须靠近ORDER BY子句的最左边。

1. 当没用指定ORDER BY子句的时候返回的记录是不确定的。

postgres=# select distinct on(course)id,name,course,score from student;
 id |  name  | course | score 
----+--------+--------+-------
 10 | 周星驰 | 化学   |    83
  8 | 周星驰 | 外语   |    88
  2 | 周润发 | 数学   |    99
 14 | 黎明   | 物理   |    90
  6 | 周星驰 | 语文   |    91
(5 rows)

2. 获取每门课程的最高分

postgres=# select distinct on(course)id,name,course,score from student order by course,score desc;
 id |  name  | course | score 
----+--------+--------+-------
  5 | 周润发 | 化学   |    87
 13 | 黎明   | 外语   |    95
  2 | 周润发 | 数学   |    99
 14 | 黎明   | 物理   |    90
  6 | 周星驰 | 语文   |    91
(5 rows)

3. 如果指定ORDER BY 必须把分组的字段放在最左边

postgres=# select distinct on(course)id,name,course,score from student order by score desc;
ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on(course)id,name,course,score from student ...

4. 获取每门课程的最高分同样可以使用IN子句来实现

postgres=# select * from student  where(course,score) in(select course,max(score) from student group by course);
 id |  name  | course | score 
----+--------+--------+-------
  2 | 周润发 | 数学   |    99
  5 | 周润发 | 化学   |    87
  6 | 周星驰 | 语文   |    91
 13 | 黎明   | 外语   |    95
 14 | 黎明   | 物理   |    90
(5 rows)

原文还提到在 row_number() over(), distinct on和in子句之间有一个小区别，主要是因为前两个方法是用行号，且行号唯一。解决办法就是用rank()窗口函数，让同成绩的行号出现重复。

下面是一位大神提供的6种方法，有些需要在PG中实现。而且这个大神的建表语句也让我学习了。

Queries

1. `row_number()` in CTE, (see other answer) 公用表达式

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. `row_number()` in subquery (my optimization) 子查询

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. `DISTINCT ON` (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with `LATERAL` subquery (see here) 递归和LATERAL

第一个查询取customer_id最小，且该id中total最大的。

在FROM 或者JOIN子句的子查询里面可以关联查询FROM子句或者JOIN子句的另一边的子句或者表.

见这一篇→点击打开链接

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. `customer` table with `LATERAL` (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. `array_agg()` with `ORDER BY` (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

挺有意思的，数组中按照每组total降序，取第一个（MAX）。第一次见到分组字段不是放在第一个的。最后用MAX函数取出最大的值。但是这里出现了customer_id的自动升序，看了些帖子。

Results（性能）

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per `customer_id`

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per `customer_id`

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

参考资料：https://www.oschina.net/translate/postgresqls-powerful-new-join-type-lateral

https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group?noredirect=1&lq=1#

PostreSQL取出每组第一条（最高）记录（6种方法）

Select first row in each GROUP BY group?

PS：原博还提到了一种Postresql中特有的解法：DISTINCT ON ()

Queries

1. `row_number()` in CTE, (see other answer) 公用表达式

2. `row_number()` in subquery (my optimization) 子查询

3. `DISTINCT ON` (see other answer)

4. rCTE with `LATERAL` subquery (see here) 递归和LATERAL

5. `customer` table with `LATERAL` (see here)

6. `array_agg()` with `ORDER BY` (see other answer)

Results（性能）

A. Postgres 9.4 with 200k rows and ~ 20 per `customer_id`

B. The same with Postgres 9.5

C. Same as B., but with ~ 2.3 rows per `customer_id`

猜你喜欢

PostreSQL取出每组第一条（最高）记录（6种方法 ）

Select first row in each GROUP BY group?

PS：原博还提到了一种Postresql中特有的解法：DISTINCT ON ()

Queries

1. row_number() in CTE, (see other answer) 公用表达式

2. row_number() in subquery (my optimization) 子查询

3. DISTINCT ON (see other answer)

4. rCTE with LATERAL subquery (see here) 递归和LATERAL

5. customer table with LATERAL (see here)

6. array_agg() with ORDER BY (see other answer)

Results（性能）

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

B. The same with Postgres 9.5

C. Same as B., but with ~ 2.3 rows per customer_id

猜你喜欢

PostreSQL取出每组第一条（最高）记录（6种方法）

1. `row_number()` in CTE, (see other answer) 公用表达式

2. `row_number()` in subquery (my optimization) 子查询

3. `DISTINCT ON` (see other answer)

4. rCTE with `LATERAL` subquery (see here) 递归和LATERAL

5. `customer` table with `LATERAL` (see here)

6. `array_agg()` with `ORDER BY` (see other answer)

A. Postgres 9.4 with 200k rows and ~ 20 per `customer_id`

C. Same as B., but with ~ 2.3 rows per `customer_id`