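Optimizing not in anti-joins in PostgreSQL

An anti-join relates two tables and returns only the rows of the main table that have no match in the other table. In SQL it is usually written with not in or not exists.

In practice we often need to find the rows of one table whose value in some column does not appear in another table. Which optimizations does PostgreSQL offer for this case?

Example:
-- Create the tables and insert data:
Note: table a holds 1,000 rows and table b holds 1,000,000 rows. We want the rows of a whose id does not appear in b.aid.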

create table a(id int primary key, info text);  
create table b(id int primary key, aid int, crt_time timestamp);  
create index b_aid on b(aid);  

insert into a select generate_series(1,1000), md5(random()::text); 
insert into b select generate_series(1,1000000), generate_series(1,100), clock_timestamp();  
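
A caveat about the second insert, offered as a hedged sketch rather than as the author's setup: on PostgreSQL 10 and later, two set-returning functions in the same select list advance in lockstep and the shorter one is padded with NULLs, so only the first 100 rows of b get a non-null aid (which matches the rows=101 index scans in the plans later in this article). If the intent is for aid to cycle through 1..100 across all rows, one possible alternative is:

```sql
-- Assumption: run this in place of the "insert into b" statement above.
truncate b;

insert into b
select g,                  -- id: 1 .. 1000000
       (g - 1) % 100 + 1,  -- aid: cycles through 1 .. 100, never NULL
       clock_timestamp()
from generate_series(1, 1000000) as g;
```

With this data set the plans and timings will differ from the ones shown in this article.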

Common approach:

The most common way to write this is a plain not in query:

select * from a where id not in (select aid from b);   

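But this performs very poorly. In the plan below, the subquery result is materialized and rescanned for each of the 1,000 rows of a (loops=1000), and the query takes more than 96 seconds: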

bill=# explain analyze select * from a where id not in (select aid from b);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..13406521.50 rows=500 width=37) (actual time=96598.351..96598.352 rows=0 loops=1)
   Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 1000
   SubPlan 1
     ->  Materialize  (cost=0.00..24313.00 rows=1000000 width=4) (actual time=0.002..56.810 rows=900005 loops=1000)
           ->  Seq Scan on b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..84.128 rows=1000000 loops=1)
 Planning Time: 0.430 ms
 Execution Time: 96599.874 ms
(8 rows)

Time: 96600.994 ms (01:36.601)

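This is clearly unacceptable. Note also that the plan reports rows=0: b.aid contains NULLs here, and not in never returns a row when its subquery produces a NULL, whereas the rewrites below return the 900 unmatched rows. So what better options are there?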
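
Before turning to the rewrites, here is a minimal illustration of how not in and not exists differ once a NULL is involved. It uses only inline values, so it does not depend on the tables above; the column alias x and the label 'kept' are made up for the example.

```sql
-- 1 not in (2, NULL) evaluates to NULL rather than true,
-- so the where clause rejects the row and nothing is returned.
select 'kept' as result
where 1 not in (select x from (values (2), (null::int)) as v(x));

-- not exists only asks whether a matching row exists.
-- NULL = 1 is not a match, so this returns one row.
select 'kept' as result
where not exists (
  select 1 from (values (2), (null::int)) as v(x) where v.x = 1
);
```

This is why a not in query and its not exists rewrite are interchangeable only when the subquery column cannot be NULL.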

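Optimization 1: not exists

-- SQL:

select * from a where not exists (select aid from b where b.aid = a.id);

-- Execution plan:
With not exists, the search for a given row of a can stop as soon as one match is found. Here the planner turns the query into a Merge Anti Join, and the runtime drops from about 96 seconds to under a millisecond.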

bill=# explain analyze select * from a where not exists (select aid from b where b.aid = a.id);
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.70..38.76 rows=833 width=37) (actual time=0.173..0.780 rows=900 loops=1)
   Merge Cond: (a.id = b.aid)
   ->  Index Scan using a_pkey on a  (cost=0.28..31.07 rows=1000 width=37) (actual time=0.044..0.415 rows=1000 loops=1)
   ->  Index Only Scan using b_aid on b  (cost=0.42..16027.42 rows=1000000 width=4) (actual time=0.010..0.042 rows=101 loops=1)
         Heap Fetches: 0
 Planning Time: 0.373 ms
 Execution Time: 0.900 ms
(7 rows)

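Optimization 2: left join

-- SQL:
An anti-join keeps the rows of a for which the left join found no partner in b, so the filter has to test for null on the b side:

select a.* from a left join b on (a.id = b.aid) where b.aid is null;

-- Execution plan:
Note that the plan below was captured with the predicate b.* is not null, which keeps the 100 matched rows and filters out the 900 unmatched ones (rows=100, Rows Removed by Filter: 900). With is null the query returns the 900 unmatched rows instead, and the planner may choose a different plan.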

bill=# explain analyze select * from a left join b on(a.id = b.aid) where b.* is not null;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.70..39.67 rows=995 width=53) (actual time=0.087..0.775 rows=100 loops=1)
   Merge Cond: (a.id = b.aid)
   Filter: (b.* IS NOT NULL)
   Rows Removed by Filter: 900
   ->  Index Scan using a_pkey on a  (cost=0.28..31.07 rows=1000 width=37) (actual time=0.057..0.458 rows=1000 loops=1)
   ->  Index Scan using b_aid on b  (cost=0.42..21433.72 rows=1000000 width=56) (actual time=0.018..0.103 rows=101 loops=1)
 Planning Time: 0.472 ms
 Execution Time: 0.827 ms
(8 rows)

Time: 1.960 ms
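
One way to check that the not exists form and the left join ... is null form really select the same rows is to compare them with except in both directions. This is only a sketch and assumes the tables a and b from the example above.

```sql
-- ids returned by not exists but missing from the left join form
(select a.id from a where not exists (select 1 from b where b.aid = a.id)
 except
 select a.id from a left join b on (a.id = b.aid) where b.aid is null)
union all
-- ids returned by the left join form but missing from not exists
(select a.id from a left join b on (a.id = b.aid) where b.aid is null
 except
 select a.id from a where not exists (select 1 from b where b.aid = a.id));
```

An empty result means the two queries agree on the set of ids.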

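Optimization 3: sub query

-- SQL:
Table a has only 1,000 rows, so the sub query probes b's index once per row of a and, thanks to limit 1, fetches at most one row per probe. The outer is null test then keeps the rows of a whose id does not appear in b.aid.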

select * from 
(
  select 
    a.* ,  
    (select aid from b where b.aid=a.id limit 1) as aid 
  from a  
) as t 
where t.aid is null;  

-- Execution plan:

bill=# explain analyze
bill-# select * from
bill-# (
bill(#   select
bill(#     a.* ,
bill(#     (select aid from b where b.aid=a.id limit 1) as aid   
bill(#   from a   
bill(# ) as t
bill-# where t.aid is null;
                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
------
 Seq Scan on a  (cost=0.00..1770.21 rows=5 width=41) (actual time=0.160..1.879 rows=900 loops=1)
   Filter: ((SubPlan 2) IS NULL)
   Rows Removed by Filter: 100
   SubPlan 1
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=900)
           ->  Index Only Scan using b_aid on b  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=900)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
   SubPlan 2
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=1000)
           ->  Index Only Scan using b_aid on b b_1  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=1000)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
 Planning Time: 0.142 ms
 Execution Time: 1.925 ms
(15 rows)

Time: 2.402 ms

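Optimization 4: sub query (short form)

-- SQL:
The sub query from the previous method can be written more compactly by moving it into the where clause: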

select * from a where (select aid from b where b.aid=a.id limit 1) is null;

-- Execution plan:

bill=# explain analyze select * from a where (select aid from b where b.aid=a.id limit 1) is null;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..1761.50 rows=5 width=37) (actual time=0.502..3.470 rows=900 loops=1)
   Filter: ((SubPlan 1) IS NULL)
   Rows Removed by Filter: 100
   SubPlan 1
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=1000)
           ->  Index Only Scan using b_aid on b  (cost=0.42..1.74 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1000)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
 Planning Time: 0.262 ms
 Execution Time: 3.651 ms
(10 rows)

Time: 4.786 ms

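Optimization 5: recursive with

-- SQL:
This method uses pg's recursive with syntax.
Table a is still scanned in full once, as in the sub query methods. The difference lies in how often b's index is probed: the sub query methods probe it once per row of a, while the recursive with probes it once per distinct aid value in b.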

select * from a where id not in   
(  
with recursive skip as (    
  (    
    select min(aid) aid from b where aid is not null    
  )    
  union all    
  (    
    select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)     
      from skip s where s.aid is not null    
  )    
)     
select aid from skip where aid is not null  
);      

-- Execution plan:

bill=# explain analyze select * from a where id not in
bill-# (
bill(# with recursive skip as (
bill(#   (
bill(#     select min(aid) aid from b where aid is not null
bill(#   )
bill(#   union all
bill(#   (
bill(#     select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)
bill(#       from skip s where s.aid is not null
bill(#   )
bill(# )
bill(# select aid from skip where aid is not null
bill(# );
                                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
--------------------------------
 Seq Scan on a  (cost=54.57..76.07 rows=500 width=37) (actual time=1.332..1.763 rows=900 loops=1)
   Filter: (NOT (hashed SubPlan 5))
   Rows Removed by Filter: 100
   SubPlan 5
     ->  CTE Scan on skip  (cost=52.30..54.32 rows=100 width=4) (actual time=0.087..1.146 rows=100 loops=1)
           Filter: (aid IS NOT NULL)
           Rows Removed by Filter: 1
           CTE skip
             ->  Recursive Union  (cost=0.45..52.30 rows=101 width=4) (actual time=0.083..1.087 rows=101 loops=1)
                   ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.082..0.083 rows=1 loops=1)
                         InitPlan 3 (returns $1)
                           ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.077..0.078 rows=1 loops=1)
                                 ->  Index Only Scan using b_aid on b b_1  (cost=0.42..4.65 rows=167 width=4) (actual time=0.075..0.076 rows=1 loops=1)
                                       Index Cond: (aid IS NOT NULL)
                                       Heap Fetches: 0
                   ->  WorkTable Scan on skip s  (cost=0.00..4.98 rows=10 width=4) (actual time=0.009..0.009 rows=1 loops=101)
                         Filter: (aid IS NOT NULL)
                         Rows Removed by Filter: 0
                         SubPlan 2
                           ->  Result  (cost=0.47..0.48 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=100)
                                 InitPlan 1 (returns $3)
                                   ->  Limit  (cost=0.42..0.47 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=100)
                                         ->  Index Only Scan using b_aid on b  (cost=0.42..2.84 rows=56 width=4) (actual time=0.006..0.006 rows=1 loops=100)
                                               Index Cond: ((aid > s.aid) AND (aid IS NOT NULL))
                                               Heap Fetches: 0
 Planning Time: 0.548 ms
 Execution Time: 2.027 ms
(27 rows)

Time: 4.036 ms
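
The plan above evaluates not in with a hashed SubPlan, since the CTE yields only about 100 values. As a rough sketch of the same idea without recursion, one could filter out NULLs and deduplicate inside the subquery. Whether the planner then hashes the subquery depends on its estimates and on work_mem, so this is an assumption to verify with explain, not a guaranteed plan; the work_mem value is purely illustrative. The sketch assumes the tables a and b from the example above.

```sql
set work_mem = '64MB';  -- illustrative value, applies to this session only

explain analyze
select *
from a
where id not in (
  select distinct aid    -- far fewer values to hash than 1,000,000 rows
  from b
  where aid is not null  -- a NULL here would make not in return no rows
);

reset work_mem;
```

Unlike the recursive query, this still has to read every aid in b once to compute the distinct values.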

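Summary

A plain not in query can be rewritten in several quite different ways in PostgreSQL, which shows how flexible pg's SQL is. Do you know an even better way to write it?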

Reposted from blog.csdn.net/weixin_39540651/article/details/114531734