PostgreSQL中in、exists、= any性能对比

在SQL中我们可以使用in、exists语句来判断对象是否存在某个子查询中，在pg中除了支持这两种语法，还支持= any的语法，例如下面三个SQL的意思就是一样的：

select * from tbl where id in (select id from t);  
  
select * from tbl where exists (select 1 from t where t.id=tbl.id);  
  
select * from tbl where id = any (array( select id from t ));

但是不同的写法，性能还是有一定的差距。可能对in、exists这两种写法比较熟悉的人都知道这么一个结论：IN适合于外表大而内表小的情况；EXISTS适合于外表小而内表大的情况。

那么在pg中是不是这样呢？这两种写法和= any的写法又有什么区别呢？

例子：
1、创建测试表

bill@bill=> create table t(id int);  
CREATE TABLE  
bill@bill=> insert into t select generate_series(1,100*10000);  
INSERT 0 1000000

2、三者执行计划对比
可以发现，= any是InitPlan，而另外两种写法是SubPlan。

bill@bill=>explain select n = any(array(select id from test)) from generate_series(1,10000) as n;  
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=14425.00..14650.00 rows=10000 width=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)

bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;   
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=0.00..129160125.00 rows=10000 width=1)
   SubPlan 1
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)
           ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)

bill@bill=>explain select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n;  
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=0.00..169250100.00 rows=10000 width=1)
   SubPlan 1
     ->  Seq Scan on test  (cost=0.00..16925.00 rows=1 width=0)
           Filter: (id = n.n)
   SubPlan 2
     ->  Seq Scan on test test_1  (cost=0.00..14425.00 rows=1000000 width=4)
(6 rows)

3、对于IN的写法，work_mem参数会直接影响性能，work_mem的大小决定了subquery是否要装载到hash table。

bill@bill=>set work_mem ='1MB';  
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;   
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=0.00..129160125.00 rows=10000 width=1)
   SubPlan 1
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)
           ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)

bill@bill=>set work_mem ='100MB';  
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;   
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=16925.00..17050.00 rows=10000 width=1)
   SubPlan 1
     ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)

对于三种写法优化器实现的差异，在代码里面有相关介绍，详见：backend/optimizer/plan/subselect.c

190          * For an EXISTS subplan, tell lower-level planner to expect that only the
 191          * first tuple will be retrieved.  For ALL and ANY subplans, we will be
 192          * able to stop evaluating if the test condition fails or matches, so very
 193          * often not all the tuples will be retrieved; for lack of a better idea,
 194          * specify 50% retrieval.  For EXPR, MULTIEXPR, and ROWCOMPARE subplans,
 195          * use default behavior (we're only expecting one row out, anyway).
 196          *
 197          * NOTE: if you change these numbers, also change cost_subplan() in
 198          * path/costsize.c.
 199          *
 200          * XXX If an ANY subplan is uncorrelated, build_subplan may decide to hash
 201          * its output.  In that case it would've been better to specify full
 202          * retrieval.  At present, however, we can only check hashability after
 203          * we've made the subplan :-(.  (Determining whether it'll fit in work_mem
 204          * is the really hard part.)  Therefore, we don't want to be too
 205          * optimistic about the percentage of tuples retrieved, for fear of
 206          * selecting a plan that's bad for the materialization case.
 207          */

in、exists、= any性能对比
1、in, work_mem装不下subquery

bill@bill=>set work_mem ='1MB';  
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;   
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=0.00..129160125.00 rows=10000 width=1)
   SubPlan 1
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)
           ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)

2、in, work_mem装的下subquery

bill@bill=>set work_mem ='100MB';  
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;   
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Function Scan on generate_series n  (cost=16925.00..17050.00 rows=10000 width=1)
   SubPlan 1
     ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)

3、= any
这种写法work_mem很小也无所谓，因为不涉及hashtable

bill@bill=>set work_mem ='64kB'; 
SET
bill@bill=>explain (analyze,verbose,timing,costs,buffers) select n = any(array(select id from test)) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series n  (cost=14425.00..14650.00 rows=10000 width=1) (actual time=188.203..529.799 rows=10000 loops=1)
   Output: (n.n = ANY ($0))
   Function Call: generate_series(1, 10000)
   Buffers: shared hit=4425, temp read=18 written=18
   InitPlan 1 (returns $0)
     ->  Seq Scan on bill.test  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..97.379 rows=1000000 loops=1)
           Output: test.id
           Buffers: shared hit=4425
 Planning Time: 0.068 ms
 Execution Time: 531.720 ms
(10 rows)

4、exists
work_mem需求量较少(exists由于优化器在匹配到1条后即刻返回，所以会选择使用索引，性能就非常好。)

bill@bill=>create index idx_test1 on test(id);
CREATE INDEX
bill@bill=>set work_mem ='64KB'; 
bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n;  
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series n  (cost=0.00..30525.00 rows=10000 width=1) (actual time=2.840..35.937 rows=10000 loops=1)
   Output: (SubPlan 1)
   Function Call: generate_series(1, 10000)
   Buffers: shared hit=39970 read=30, temp read=18 written=18
   SubPlan 1
     ->  Index Only Scan using idx_test1 on bill.test  (cost=0.42..3.04 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=10000)
           Index Cond: (test.id = n.n)
           Heap Fetches: 10000
           Buffers: shared hit=39970 read=30
 Planning Time: 0.170 ms
 Execution Time: 37.032 ms
(11 rows)

bill@bill=>set work_mem ='64MB';  
SET
bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n;  
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series n  (cost=0.00..30525.00 rows=10000 width=1) (actual time=430.974..434.543 rows=10000 loops=1)
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)
   Function Call: generate_series(1, 10000)
   Buffers: shared hit=4425
   SubPlan 1
     ->  Index Only Scan using idx_test1 on bill.test  (cost=0.42..3.04 rows=1 width=0) (never executed)
           Index Cond: (test.id = n.n)
           Heap Fetches: 0
   SubPlan 2
     ->  Seq Scan on bill.test test_1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.023..99.501 rows=1000000 loops=1)
           Output: test_1.id
           Buffers: shared hit=4425
 Planning Time: 0.115 ms
 Execution Time: 436.932 ms
(14 rows)

5、exists没有索引
如果删掉索引，exists性能就会下降了，同时性能也和是否使用哈希表有关。

bill@bill=>drop index idx_test1 ;
DROP INDEX
bill@bill=>set work_mem ='64kB';  
SET
bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series n  (cost=0.00..169250100.00 rows=10000 width=1) (actual time=1.140..3917.927 rows=10000 loops=1)
   Output: (SubPlan 1)
   Function Call: generate_series(1, 10000)
   Buffers: shared hit=226260, temp read=18 written=18
   SubPlan 1
     ->  Seq Scan on bill.test  (cost=0.00..16925.00 rows=1 width=0) (actual time=0.391..0.391 rows=1 loops=10000)
           Filter: (test.id = n.n)
           Rows Removed by Filter: 5000
           Buffers: shared hit=226260
 Planning Time: 0.127 ms
 Execution Time: 3919.068 ms
(11 rows)

bill@bill=>set work_mem ='64MB';  
SET
bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on pg_catalog.generate_series n  (cost=0.00..169250100.00 rows=10000 width=1) (actual time=430.988..434.781 rows=10000 loops=1)
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)
   Function Call: generate_series(1, 10000)
   Buffers: shared hit=4425
   SubPlan 1
     ->  Seq Scan on bill.test  (cost=0.00..16925.00 rows=1 width=0) (never executed)
           Filter: (test.id = n.n)
   SubPlan 2
     ->  Seq Scan on bill.test test_1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.022..100.982 rows=1000000 loops=1)
           Output: test_1.id
           Buffers: shared hit=4425
 Planning Time: 0.086 ms
 Execution Time: 437.278 ms
(13 rows)

总结：
1、= any这种写法，不会走subplan，因此不涉及hash table的问题。和work_mem设置大小无关。因此在pg中十分推荐使用这种写法。

2、exists，由于优化器会默认它只需要搜索到1条命中目标就不搜了，所以优化器评估是否使用hash table时，需要的内存相对较少，即使较小的work_mem也可能使用hashtable。

3、in ，当出现在subquery中时，优化器评估这个subquery是否要构建哈希TABLE，直接和subquery的大小相关，所以需要较大的work_mem才会选择使用hashtable。

foucus、

发布了155 篇原创文章 · 获赞 88 · 访问量 2万+

私信关注

PostgreSQL中in、exists、= any性能对比

猜你喜欢