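For SQL beginners, there is a somewhat esoteric syntax named PARTITION BY, which appears all over the place in SQL. It always has a similar meaning, though in quite different contexts. That meaning is similar to the one of GROUP BY: to group or partition a data set by some grouping/partitioning criteria.
For example, when querying the Sakila database: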
SELECT actor_id, film_id
FROM film_actor
We might get something like this:
|actor_id|film_id|
|--------|-------|
|1 |1 |
|2 |3 |
|10 |1 |
|20 |1 |
|1 |23 |
|1 |25 |
|30 |1 |
|19 |2 |
|40 |1 |
|3 |17 |
|53 |1 |
|19 |3 |
|2 |31 |
For the ACTOR_ID = 1 partition, we can divide the data like this:
                      |actor_id|film_id|
                      |--------|-------|
                 +--> |1       |1      |
All ACTOR_ID = 1 |    |2       |3      |
                 |    |10      |1      |
                 |    |20      |1      |
                 +--> |1       |23     |
                 +--> |1       |25     |
                      |30      |1      |
                      |19      |2      |
                      |40      |1      |
                      |3       |17     |
                      |53      |1      |
                      |19      |3      |
                      |2       |31     |
And for the ACTOR_ID = 2 partition:
                      |actor_id|film_id|
                      |--------|-------|
                      |1       |1      |
All ACTOR_ID = 2 +--> |2       |3      |
                 |    |10      |1      |
                 |    |20      |1      |
                 |    |1       |23     |
                 |    |1       |25     |
                 |    |30      |1      |
                 |    |19      |2      |
                 |    |40      |1      |
                 |    |3       |17     |
                 |    |53      |1      |
                 |    |19      |3      |
                 +--> |2       |31     |
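How do we use these partitions in SQL, specifically? What do they mean? In short:
A partition separates a data set into subsets, which don't overlap.
Window partitions
The first thing we can do is use the window PARTITION clause when calculating window functions. For example, we might count: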
SELECT
actor_id,
film_id,
COUNT(*) OVER (PARTITION BY actor_id)
FROM film_actor
If we assume that what we saw above is the entire data set (the actual table has more rows), then the following result is displayed:
|actor_id|film_id|count|
|--------|-------|-----|
|1 |1 |3 |
|2 |3 |2 |
|10 |1 |1 |
|20 |1 |1 |
|1 |23 |3 |
|1 |25 |3 |
|30 |1 |1 |
|19 |2 |2 |
|40 |1 |1 |
|3 |17 |1 |
|53 |1 |1 |
|19 |3 |2 |
|2 |31 |2 |
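In other words, we are "counting rows over the partition". This works almost like GROUP BY, where we count rows from the group, although the GROUP BY clause transforms the result set and the projectable columns, making non-grouped columns unavailable: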
SELECT actor_id, COUNT(*)
FROM film_actor
GROUP BY actor_id
The result is:
|actor_id|count|
|--------|-----|
|1 |3 |
|2 |2 |
|10 |1 |
|20 |1 |
|30 |1 |
|19 |2 |
|40 |1 |
|3 |1 |
|53 |1 |
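If you will, the partition contents are now collapsed, so that each partition key / group key appears only once in the result set. This difference makes window functions much more powerful than ordinary aggregate functions and grouping.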
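Because a window partition is not collapsed, a window function can sit right next to the individual rows. As a small sketch (the column aliases film_no and films_per_actor are made up here for illustration), the following numbers each actor's films while also showing the per-actor total on every row:

```sql
SELECT
  actor_id,
  film_id,
  -- Position of the film within its actor's partition
  ROW_NUMBER() OVER (PARTITION BY actor_id ORDER BY film_id) AS film_no,
  -- Size of the partition, repeated on each of its rows
  COUNT(*) OVER (PARTITION BY actor_id) AS films_per_actor
FROM film_actor
ORDER BY actor_id, film_id;
```

A GROUP BY query could only produce the second value, and only once per actor.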
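MATCH_RECOGNIZE partitions
MATCH_RECOGNIZE is part of the SQL standard, invented by Oracle, and the envy of all other RDBMS (although some have started adopting it). It combines the power of regular expressions, pattern matching, data generation, and SQL. It might even be sentient, who knows.
For example, let's look for customers who made several small payments within a short amount of time: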
SELECT
customer_id,
payment_date,
payment_id,
amount
FROM payment
MATCH_RECOGNIZE (
-- Partition the data set by customer_id
PARTITION BY customer_id
-- Order each partition by payment_date
ORDER BY payment_date
-- Return all the matched rows
ALL ROWS PER MATCH
-- Match rows with 3 occurrences of event "A" in a row
PATTERN (A {3})
-- Define the event "A" as...
DEFINE A AS
-- Being a payment whose amount is less than 1
A.amount < 1
-- And whose payment date is less than 1 day after
-- the previous payment
AND A.payment_date - prev(A.payment_date) < 1
)
ORDER BY customer_id, payment_date
Whew! This uses so many fancy keywords that this cheap blog's syntax highlighter can't even keep up here!
The result is:
|CUSTOMER_ID|PAYMENT_DATE |PAYMENT_ID|AMOUNT|
|-----------|-----------------------|----------|------|
|72 |2005-08-18 10:59:04.000|1961 |0.99 |
|72 |2005-08-18 16:17:54.000|1962 |0.99 |
|72 |2005-08-19 12:53:53.000|1963 |0.99 |
|152 |2005-08-20 01:16:52.000|4152 |0.99 |
|152 |2005-08-20 19:13:23.000|4153 |0.99 |
|152 |2005-08-21 03:01:01.000|4154 |0.99 |
|207 |2005-07-08 17:14:14.000|5607 |0.99 |
|207 |2005-07-09 01:26:22.000|5608 |0.99 |
|207 |2005-07-09 13:56:56.000|5609 |0.99 |
|244 |2005-08-20 11:54:01.000|6615 |0.99 |
|244 |2005-08-20 17:12:28.000|6616 |0.99 |
|244 |2005-08-21 09:31:44.000|6617 |0.99 |
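So we can confirm that for each of these groups of 3 payments:
- The amount is less than 1.
- Consecutive dates are less than 1 day apart.
- The groups are formed per customer, which is the partition.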
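To make the role of the partition in the DEFINE condition a bit more tangible, here is a rough sketch that only flags which rows would qualify as event "A", using the LAG window function over the same partition and ordering. It does not look for runs of 3 such rows. The alias is_event_a is made up for illustration, and the sketch assumes, as the query above does, that subtracting two payment_date values yields a number of days (as with Oracle's DATE type):

```sql
SELECT
  customer_id,
  payment_date,
  payment_id,
  amount,
  CASE
    WHEN amount < 1
    -- LAG looks at the previous payment of the same customer only,
    -- because the window is partitioned by customer_id
    AND payment_date - LAG(payment_date) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date
    ) < 1
    THEN 1
    ELSE 0
  END AS is_event_a
FROM payment
ORDER BY customer_id, payment_date;
```

The first payment of each customer has no previous row, so the comparison yields NULL and the row is flagged 0.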
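Want to learn more about MATCH_RECOGNIZE? I think this article explains it better than anything else on the web. You can play around with it for free using Oracle XE 21c, for example the one Gerald Venzl provides on Docker.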
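MODEL partitions
Even more arcane than MATCH_RECOGNIZE is the Oracle-specific MODEL or SPREADSHEET clause. Every sophisticated application should have at least one MODEL query, just to keep your co-workers wondering. An example can be found in our previous article. In short, you can do anything you could do in spreadsheet software such as MS Excel. I'll give another example here, without going into detail on how it works: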
SELECT
customer_id,
payment_date,
payment_id,
amount
FROM (
SELECT *
FROM (
SELECT p.*, 0 AS s, 0 AS n
FROM payment p
)
MODEL
-- We again partition our data set by customer_id
PARTITION BY (customer_id)
-- The "spreadsheet dimension" is the row number ordered
-- by payment date, within a partition
DIMENSION BY (
row_number () OVER (
PARTITION BY customer_id
ORDER BY payment_date
) AS rn
)
-- Measures is what we want to project, including
-- o Table columns
-- o Additional calculated values
MEASURES (payment_date, payment_id, amount, s, n)
-- These rules are the spreadsheet formulae
RULES (
-- S is the sum of previous amounts that are smaller than 1
-- and whose payment dates are less than 1 day apart
s[any] = CASE
WHEN amount[cv(rn)] < 1
AND payment_date[cv(rn)] - payment_date[cv(rn) - 1] < 1
THEN coalesce(s[cv(rn) - 1], 0) + amount[cv(rn)]
ELSE 0
END,
-- N is the number of consecutive amounts with these properties
n[any] = CASE
WHEN amount[cv(rn)] < 1
AND payment_date[cv(rn)] - payment_date[cv(rn) - 1] < 1
THEN coalesce(n[cv(rn) - 1], 0) + 1
ELSE 0
END
)
) t
-- Keep only those rows where we had at least 3
-- consecutive events
WHERE n >= 3
ORDER BY customer_id, rn
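Drop one of these into your production code base on the Friday before a deployment, and you'll be everyone's darling. Guaranteed.
Anyway, I think MATCH_RECOGNIZE was the better option. The result is: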
|CUSTOMER_ID|PAYMENT_DATE |PAYMENT_ID|AMOUNT|
|-----------|-----------------------|----------|------|
|72 |2005-08-19 12:53:53.000|1963 |0.99 |
|152 |2005-08-21 03:01:01.000|4154 |0.99 |
|207 |2005-07-09 13:56:56.000|5609 |0.99 |
|244 |2005-08-21 09:31:44.000|6617 |0.99 |
|244 |2005-08-21 19:39:43.000|6618 |0.99 |
|252 |2005-07-28 02:44:25.000|6800 |0.99 |
|377 |2005-07-07 12:24:37.000|10211 |0.99 |
|425 |2005-08-01 12:37:46.000|11499 |0.99 |
|511 |2005-07-11 18:50:55.000|13769 |0.99 |
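If you're looking for a thrill, try modifying my query to return the usual groups of three rows, just like in the MATCH_RECOGNIZE example, and leave your solution in the comments. It's definitely doable!
Table partitions
At least Oracle and PostgreSQL support table partitioning on a storage level, and probably others do, too. The feature helps tame your storage troubles by separating data into distinct physical tables while transparently pretending that you have a single logical table in your application, at the price of introducing other kinds of trouble.
A typical example is partitioning a data set by date range, for example as documented for PostgreSQL: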
CREATE TABLE payment (
customer_id int not null,
amount numeric not null,
payment_date date not null
)
PARTITION BY RANGE (payment_date);
Now, we can't use this table yet, because it exists only logically. It doesn't know yet how to store the data physically:
INSERT INTO payment (customer_id, amount, payment_date)
VALUES (1, 10, DATE '2000-01-01');
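This produces:
SQL Error [23514]: ERROR: no partition of relation "payment" found for row
Detail: Partition key of the failing row contains (payment_date) = (2000-01-01).
So, let's create some physical storage for a certain date range, for example: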
CREATE TABLE payment_2000
PARTITION OF payment
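-- The upper bound of a range partition is exclusive in PostgreSQL,
-- so this range covers all of the year 2000
FOR VALUES FROM (DATE '2000-01-01') TO (DATE '2001-01-01');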
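Now, the insert succeeds. This interpretation of PARTITION again matches that of window functions: we partition our data set into subsets that are clearly separated and don't overlap.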
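To see the non-overlapping subsets at work, one could attach a second partition and ask PostgreSQL which physical table each row landed in. This is only a sketch: the payment_2001 table and the inserted row are made up for illustration, and tableoid is the PostgreSQL system column identifying the table a row is physically stored in.

```sql
-- Hypothetical second partition for the year 2001
-- (range upper bounds are exclusive in PostgreSQL)
CREATE TABLE payment_2001
PARTITION OF payment
FOR VALUES FROM (DATE '2001-01-01') TO (DATE '2002-01-01');

-- Illustrative row, routed to payment_2001 by its payment_date
INSERT INTO payment (customer_id, amount, payment_date)
VALUES (1, 20, DATE '2001-06-15');

-- tableoid::regclass displays the name of the physical table of each row
SELECT
  tableoid::regclass AS partition_name,
  customer_id,
  amount,
  payment_date
FROM payment
ORDER BY payment_date;
```

Every row belongs to exactly one partition, which is the same property that window partitions have.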
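The odd one: OUTER JOIN partitions
The next partitioning feature is part of the SQL standard, but so far I've only seen it implemented in Oracle, which has had it forever: partitioned outer joins. They're not trivial to explain, and regrettably, their partitions have nothing to do with window partitions. They're more like CROSS JOIN syntax sugar (or vinegar, depending on your taste).
Think of it this way: you can use partitioned outer joins to fill the gaps in otherwise sparse data. Let's look at an example: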
SELECT
f.film_id,
f.title,
c.category_id,
c.name,
count(*) OVER ()
FROM film f
LEFT OUTER JOIN film_category fc
ON f.film_id = fc.film_id
LEFT OUTER JOIN category c
ON fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id
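This query produces the categories of each film. If a category doesn't appear with a film, there is no record for that combination in the result: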
|FILM_ID|TITLE |CATEGORY_ID|NAME |COUNT(*)OVER()|
|-------|----------------|-----------|-----------|--------------|
|1 |ACADEMY DINOSAUR|6 |Documentary|1000 |
|2 |ACE GOLDFINGER |11 |Horror |1000 |
|3 |ADAPTATION HOLES|6 |Documentary|1000 |
|4 |AFFAIR PREJUDICE|11 |Horror |1000 |
|5 |AFRICAN EGG |8 |Family |1000 |
|6 |AGENT TRUMAN |9 |Foreign |1000 |
|7 |AIRPLANE SIERRA |5 |Comedy |1000 |
|8 |AIRPORT POLLOCK |11 |Horror |1000 |
|9 |ALABAMA DEVIL |11 |Horror |1000 |
|10 |ALADDIN CALENDAR|15 |Sports |1000 |
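As you can see, we have 1000 films, and since the Sakila database is rather boring, each film has only one category, even though the many-to-many relationship would allow more than one assignment.
What happens if we add a PARTITION BY clause to one of the outer joins?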
SELECT
f.film_id,
f.title,
c.category_id,
c.name,
count(*) OVER ()
FROM film f
LEFT OUTER JOIN film_category fc
ON f.film_id = fc.film_id
LEFT OUTER JOIN category c
PARTITION BY (c.category_id) -- Magic here
ON fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id
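I won't show the whole result, but as the window function's result shows, we now have 16000 rows in total instead of 1000. That's because we get 1000 films x 16 categories: a cross product, if you will, with blank category names (but not blank category IDs) wherever there is no match: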
|FILM_ID|TITLE |CATEGORY_ID|NAME |COUNT(*)OVER()|
|-------|----------------|-----------|-----------|--------------|
|1 |ACADEMY DINOSAUR|1 | |16000 |
|1 |ACADEMY DINOSAUR|2 | |16000 |
|1 |ACADEMY DINOSAUR|3 | |16000 |
|1 |ACADEMY DINOSAUR|4 | |16000 |
|1 |ACADEMY DINOSAUR|5 | |16000 |
|1 |ACADEMY DINOSAUR|6 |Documentary|16000 |
|1 |ACADEMY DINOSAUR|7 | |16000 |
|1 |ACADEMY DINOSAUR|8 | |16000 |
|1 |ACADEMY DINOSAUR|9 | |16000 |
|1 |ACADEMY DINOSAUR|10 | |16000 |
|1 |ACADEMY DINOSAUR|11 | |16000 |
|1 |ACADEMY DINOSAUR|12 | |16000 |
|1 |ACADEMY DINOSAUR|13 | |16000 |
|1 |ACADEMY DINOSAUR|14 | |16000 |
|1 |ACADEMY DINOSAUR|15 | |16000 |
|1 |ACADEMY DINOSAUR|16 | |16000 |
|2 |ACE GOLDFINGER |1 | |16000 |
|2 |ACE GOLDFINGER |2 | |16000 |
|2 |ACE GOLDFINGER |3 | |16000 |
|2 |ACE GOLDFINGER |4 | |16000 |
|2 |ACE GOLDFINGER |5 | |16000 |
|2 |ACE GOLDFINGER |6 | |16000 |
|2 |ACE GOLDFINGER |7 | |16000 |
|2 |ACE GOLDFINGER |8 | |16000 |
|2 |ACE GOLDFINGER |9 | |16000 |
|2 |ACE GOLDFINGER |10 | |16000 |
|2 |ACE GOLDFINGER |11 |Horror |16000 |
|2 |ACE GOLDFINGER |12 | |16000 |
|2 |ACE GOLDFINGER |13 | |16000 |
|2 |ACE GOLDFINGER |14 | |16000 |
|2 |ACE GOLDFINGER |15 | |16000 |
|2 |ACE GOLDFINGER |16 | |16000 |
In a way, this is useful whenever you want to create a report from sparse data and generate records for the gaps. A similar query without PARTITION BY would use CROSS JOIN:
SELECT
f.film_id,
f.title,
c.category_id,
NVL2(fc.category_id, c.name, NULL) AS name,
count(*) OVER ()
FROM film f
CROSS JOIN category c
LEFT JOIN film_category fc
ON fc.film_id = f.film_id
AND fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id;
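NVL2 is Oracle-specific. For other RDBMS, a more portable variant of the same CROSS JOIN query might replace it with a standard CASE expression. This is just a sketch, under the assumption that the rest of the query is supported as is:

```sql
SELECT
  f.film_id,
  f.title,
  c.category_id,
  -- Show the category name only if the film actually has that category
  CASE WHEN fc.category_id IS NOT NULL THEN c.name END AS name,
  count(*) OVER ()
FROM film f
CROSS JOIN category c
LEFT JOIN film_category fc
  ON fc.film_id = f.film_id
  AND fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id;
```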
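I must say that I haven't found these partitioned outer joins very useful or understandable in the past, and I'm not convinced that other RDBMS are really missing an important feature here, even though this is standard SQL.
So far, jOOQ doesn't emulate this feature in other RDBMS.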