MySQL subqueries and their optimization

DBAs or developers who have used oracle or other relational databases have such experience. They all think that the database has been optimized in terms of sub-queries, and can choose to drive table execution well, and then transplant this experience to the mysql database. But unfortunately, mysql may disappoint you in the processing of subqueries. We have encountered some cases on our production system, such as:

 

SELECT i_id,
       sum(i_sell) AS i_sell
FROM table_data
WHERE i_id IN
    (SELECT i_id
     FROM table_data
     WHERE Gmt_create >= '2011-10-07 00:00:00')
GROUP BY i_id;

(Note: The business logic of sql can be used as an analogy: first query the 100 newly sold books on 10-07, and then query the sales of the newly sold 100 books in the whole year).

The performance problem of this sql lies in the weakness of the mysql optimizer in processing subqueries. When the mysql optimizer processes subqueries, it will rewrite the subqueries. Usually, we want to complete the results of the sub-query from the inside to the outside, and then use the sub-query to drive the table of the outer query to complete the query; but mysql processing will scan all the data in the outer table first, each entry The data will be sent to the sub-query and associated with the sub-query. If the appearance is large, there will be performance problems;
for the above query, because the data in the table_data table has 70W of data, and the data in the sub-query There are many, many of which are repeated, so it needs to be associated nearly 70W times. A large number of associations cause this sql to be executed for several hours and not completed, so we need to rewrite the sql:

SELECT t2.i_id,
       SUM(t2.i_sell) AS sold
FROM
  (SELECT DISTINCT i_id
   FROM table_data
   WHERE gmt_create >= '2011-10-07 00:00:00') t1,
                                              table_data t2
WHERE t1.i_id = t2.i_id
GROUP BY t2.i_id;

We changed the subquery to association, and added distinct to the subquery to reduce the number of times t1 is associated with t2; 
after the transformation, the execution time of sql is reduced to less than 100ms. 
The optimization of mysql sub-queries has not been very friendly, and has been criticized by the industry. It is also one of the most problems I have encountered in sql optimization. When mysql is processing sub-queries, it will rewrite sub-queries. Usually, Next, we hope to go from the inside to the outside, that is, to complete the results of the sub-query first, and then use the sub-query to drive the table of the outer query to complete the query, but on the contrary, the sub-query will not be executed first; today I hope to introduce some Practical cases to deepen the understanding of mysql subqueries. The following will introduce a complete case and its analysis and tuning process and ideas. 

1. Case:

Users reported that the database response was slow, and many business updates were stuck; log in to the database to observe, and found long-running sql;

| 10437 | usr0321t9m9 | 10.242.232.50:51201 | oms | Execute | 1179 | Sending

Sql为:

SELECT tradedto0_.*
FROM a1 tradedto0_
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid IN
         (SELECT orderdto1_.tradeoid
          FROM a2 orderdto1_
          WHERE orderdto1_.proname LIKE '%??%'
            OR orderdto1_.procode LIKE '%??%'))
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;

2. Phenomenon: The update of other tables is blocked

 

UPDATE a1
SET tradesign='DAB67634-795C-4EAC-B4A0-78F0D531D62F',
              markColor=' #CD5555',
                        memotime='2012-09- 22',
                                 markPerson='??'
WHERE tradeoid IN ('gy2012092204495100032') ;

In order to restore the application as soon as possible, after killing the long-running sql, the application returns to normal; 

3. Analyze the execution plan:

 

db@3306 :explain
SELECT tradedto0_.*
FROM a1 tradedto0_
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid IN
         (SELECT orderdto1_.tradeoid
          FROM a2 orderdto1_
          WHERE orderdto1_.proname LIKE '%??%'
            OR orderdto1_.procode LIKE '%??%'))
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;

+----+--------------------+------------+------+---------------+------+---------+------+-------+-----
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+-------+-----
| 1 | PRIMARY | tradedto0_ | ALL | NULL | NULL | NULL | NULL | 27454 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | orderdto1_ | ALL | NULL | NULL | NULL | NULL | 40998 | Using where |
+----+--------------------+------------+------+---------------+------+---------+------+-------+-----

From the execution plan, we start to optimize step by step: 
first, let's look at the second line of the execution plan, which is the part of the subquery, orderdto1_ has performed a full table scan, and let's see if we can add appropriate index: 

A. Use a covering index:

db@3306:alter table a2 add index ind_a2(proname,procode,tradeoid);
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes

Adding a composite index exceeds the maximum key length limit:

b. Check out the field definitions for this table:

db@3306 :DESC  a2 ;
+---------------------+---------------+------+-----+---------+-------+
| FIELD               | TYPE          | NULL | KEY | DEFAULT | Extra |
+---------------------+---------------+------+-----+---------+-------+
| OID                 | VARCHAR(50)   | NO   | PRI | NULL    |       |
| TRADEOID            | VARCHAR(50)   | YES  |     | NULL    |       |
| PROCODE             | VARCHAR(50)   | YES  |     | NULL    |       |
| PRONAME             | VARCHAR(1000) | YES  |     | NULL    |       |
| SPCTNCODE           | VARCHAR(200)  | YES  |     | NULL    |       |

c. See the average length of table fields:

 

db@3306 :SELECT MAX(LENGTH(PRONAME)),avg(LENGTH(PRONAME)) FROM a2;
+----------------------+----------------------+
| MAX(LENGTH(PRONAME)) | avg(LENGTH(PRONAME)) |
+----------------------+----------------------+
|    95              |       24.5588 |

D. Reduce field length

 

ALTER TABLE MODIFY COLUMN PRONAME VARCHAR(156);

Then carry out the execution plan analysis:

 

db@3306 :explain
SELECT tradedto0_.*
FROM a1 tradedto0_
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid IN
         (SELECT orderdto1_.tradeoid
          FROM a2 orderdto1_
          WHERE orderdto1_.proname LIKE '%??%'
            OR orderdto1_.procode LIKE '%??%'))
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;


+----+--------------------+------------+-------+-----------------+----------------------+---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+-----------------+----------------------+---------+
| 1 | PRIMARY | tradedto0_ | ref | ind_tradestatus | ind_tradestatus | 345 | const,const,const,const | 8962 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | orderdto1_ | index | NULL | ind_a2 | 777 | NULL | 41005 | Using where; Using index |
+----+--------------------+------------+-------+-----------------+----------------------+---------+

It is found that the performance is still not improving. The key is that the number of rows scanned by the two tables has not decreased (8962*41005). The index added above has no great effect. Now check the execution result of the t table: 

 

db@3306 :
SELECT orderdto1_.tradeoid
FROM t orderdto1_
WHERE orderdto1_.proname LIKE '%??%'
  OR orderdto1_.procode LIKE '%??%';

 Empty
SET (0.05 sec)

The result set is empty, so the result set of the t table needs to be used as the driving table; 

4. Rewrite the subquery:

Through the above test and verification, the performance of ordinary mysql sub-query writing is very poor. Because of the natural weakness of mysql sub-queries, it is necessary to rewrite sql into an associated writing method:

SELECT tradedto0_.*
FROM a1 tradedto0_ ,
  (SELECT orderdto1_.tradeoid
   FROM a2 orderdto1_
   WHERE orderdto1_.proname LIKE '%??%'
     OR orderdto1_.procode LIKE '%??%')t2
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid=t2.tradeoid)
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;

5. View the execution plan:

 

db@3306 :explain
SELECT tradedto0_.*
FROM a1 tradedto0_ ,
  (SELECT orderdto1_.tradeoid
   FROM a2 orderdto1_
   WHERE orderdto1_.proname LIKE '%??%'
     OR orderdto1_.procode LIKE '%??%')t2
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid=t2.tradeoid)
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;

+----+-------------+------------+-------+---------------+----------------------+---------+------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+----------------------+---------+------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | orderdto1_ | index | NULL | ind_a2 | 777 | NULL | 41005 | Using where; Using index |
+----+-------------+------------+-------+---------------+----------------------+---------+------+

6. Execution time:

 

db@3306 :
SELECT tradedto0_.*
FROM a1 tradedto0_ ,
  (SELECT orderdto1_.tradeoid
   FROM a2 orderdto1_
   WHERE orderdto1_.proname LIKE '%??%'
     OR orderdto1_.procode LIKE '%??%')t2
WHERE tradedto0_.tradestatus='1'
  AND (tradedto0_.tradeoid=t2.tradeoid)
  AND tradedto0_.undefine4='1'
  AND tradedto0_.invoicetype='1'
  AND tradedto0_.tradestep='0'
  AND (tradedto0_.orderCompany LIKE '0002%')
ORDER BY tradedto0_.tradesign ASC,
         tradedto0_.makertime DESC LIMIT 15;

 Empty
SET (0.03 sec)

shortened to milliseconds; 

When a query is a condition of another query, it is called a subquery . Subqueries can construct powerful compound commands using a few simple commands.

Subqueries are most often used in WHERE clauses. It is also used in the SELECT and FROM clauses, and the following are examples.

 

1.  Subqueries use the WHERE clause.

Example: Display the information of employees whose positions are CLERK and SALESMAN in the emp table

SQL> SELECT * FROM emp WHERE job in('CLERK','SALESMAN');

EMPNO ENAME      JOB         MGR HIREDATE          SAL      COMM DEPTNO

----- ---------- --------- ----- ----------- --------- --------- ------

 7369 SMITH      CLERK      7902 1980/12/17     800.00               20

 7499 ALLEN      SALESMAN   7698 1981/2/20     1600.00    300.00     30

 7521 WARD       SALESMAN   7698 1981/2/22     1250.00    500.00     30

 7654 MARTIN     SALESMAN   7698 1981/9/28     1250.00   1400.00     30

 7844 TURNER     SALESMAN   7698 1981/9/8      1500.00      0.00     30

 7876 ADAMS      CLERK      7788 1987/5/23     1100.00               20

 7900 JAMES      CLERK      7698 1981/12/3      950.00               30

 7934 MILLER     CLERK      7782 1982/1/23     1300.00               10

8 rows selected

 

2. Use the from clause for subqueries.

Example: Display 5-10 records in the emp table. 

SQL> SELECT empno,ename,job,hiredate,sal,comm,deptno

  2  FROM (SELECT ROWNUM  r,emp.* FROM emp ) T

  3  WHERE T.r>=5 AND T.r<10;

EMPNO ENAME      JOB       HIREDATE          SAL      COMM DEPTNO

----- ---------- --------- ----------- --------- --------- ------

 7654 MARTIN     SALESMAN  1981/9/28     1250.00   1400.00     30

 7698 BLAKE      MANAGER   1981/5/1      2850.00               30

 7782 CLARK      MANAGER   1981/6/9      2450.00               10

 7788 SCOTT      ANALYST   1987/4/19     3000.00               20

 7839 KING       PRESIDENT 1981/11/17    5000.00               10

5 rows selected

 

3. Subquery with select clause

Example:  Display the employee information and department name in the emp table.

SQL> SELECT e.*,

  2    (SELECT dname FROM dept WHERE deptno=e.deptno) as dname

  3  FROM EMP e;

EMPNO ENAME      JOB         MGR HIREDATE          SAL      COMM DEPTNO dname

----- ---------- --------- ----- ----------- --------- --------- ------ -----

 7369 SMITH      CLERK      7902 1980/12/17     800.00               20 RESEARCH

 7499 ALLEN      SALESMAN   7698 1981/2/20     1600.00    300.00     30 SALES

 7521 WARD       SALESMAN   7698 1981/2/22     1250.00    500.00     30 SALES

 7566 JONES      MANAGER    7839 1981/4/2      2975.00               20 RESEARCH

 7654 MARTIN     SALESMAN   7698 1981/9/28     1250.00   1400.00     30 SALES

 7698 BLAKE      MANAGER    7839 1981/5/1      2850.00               30 SALES

……

 14 rows selected

写在前面的话:

  1. 在慢查优化12里都反复强调过 explain 的重要性,但有时候肉眼看不出 explain 结果如何指导优化,这时候还需要有一些其他基础知识的佐助,甚至需要了解 MySQL 实现原理,如子查询慢查优化
  2. 看到 SQL 执行计划中 select_type 字段中出现“DEPENDENT SUBQUERY”时,要打起精神了!

——MySQL 的子查询为什么有时候很糟糕——

引子:这样的子查询为什么这么慢?

下面的例子是一个慢查,线上执行时间相当夸张。为什么呢?

SELECT gid,COUNT(id) as count 

FROM shop_goods g1

WHERE status =0 and gid IN ( 

SELECT gid FROM shop_goods g2 WHERE sid IN  (1519066,1466114,1466110,1466102,1466071,1453929)

)

GROUP BY gid;

它的执行计划如下,请注意看关键词“DEPENDENT SUBQUERY”:

    id  select_type         table   type            possible_keys                           key           key_len  ref       rows  Extra      
------  ------------------  ------  --------------  --------------------------------------  ------------  -------  ------  ------  -----------
     1  PRIMARY             g1      index           (NULL)                                  idx_gid  5        (NULL)  850672 Using where
     2  DEPENDENT SUBQUERY  g2      index_subquery  id_shop_goods,idx_sid,idx_gid  idx_gid  5        func         1  Using where

 

基础知识:Dependent Subquery意味着什么

官方含义为:

SUBQUERY:子查询中的第一个SELECT;

DEPENDENT SUBQUERY:子查询中的第一个SELECT,取决于外面的查询 。

换句话说,就是 子查询对 g2 的查询方式依赖于外层 g1 的查询

什么意思呢?它意味着两步:

第一步,MySQL 根据 select gid,count(id) from shop_goods where status=0 group by gid; 得到一个大结果集 t1,其数据量就是上图中的 rows=850672 了。

第二步,上面的大结果集 t1 中的每一条记录,都将与子查询 SQL 组成新的查询语句:select gid from shop_goods where sid in (15...blabla..29) and gid=%t1.gid%。等于说,子查询要执行85万次……即使这两步查询都用到了索引,但不慢才怪。

如此一来,子查询的执行效率居然受制于外层查询的记录数,那还不如拆成两个独立查询顺序执行呢

 

优化策略1:

你不想拆成两个独立查询的话,也可以与临时表联表查询,如下所示:

SELECT g1.gid,count(1)

FROM shop_goods g1,(select gid from shop_goods WHERE sid in (1519066,1466114,1466110,1466102,1466071,1453929)) g2

where g1.status=0 and g1.gid=g2.gid

GROUP BY g1.gid;

也能得到同样的结果,且是毫秒级。

它的执行计划为:

    id  select_type  table           type    possible_keys              key            key_len  ref            rows  Extra                          
------  -----------  --------------  ------  -------------------------  -------------  -------  -----------  ------  -------------------------------
     1  PRIMARY      <derived2>      ALL     (NULL)                     (NULL)         (NULL)   (NULL)           30  Using temporary; Using filesort
     1  PRIMARY      g1              ref     idx_gid               idx_gid   5        g2.gid       1  Using where                    
     2  DERIVED      shop_goods  range   id_shop_goods,idx_sid  id_shop_goods  5        (NULL)           30  Using where; Using index      

DERIVED 的官方含义为:

DERIVED:用于 from 子句里有子查询的情况。MySQL 会递归执行这些子查询,把结果放在临时表里。

 

DBA观点引用:MySQL 子查询的弱点

hidba 论述道(参考资源3):

mysql 在处理子查询时,会改写子查询。

通常情况下,我们希望由内到外,先完成子查询的结果,然后再用子查询来驱动外查询的表,完成查询。

例如:

select * from test where tid in(select fk_tid from sub_test where gid=10)

通常我们会感性地认为该 sql 的执行顺序是:

sub_test 表中根据 gid 取得 fk_tid(2,3,4,5,6)记录,

然后再到 test 中,带入 tid=2,3,4,5,6,取得查询数据。

但是实际mysql的处理方式为:

select * from test where exists (

select * from sub_test where gid=10 and sub_test.fk_tid=test.tid

)

mysql 将会扫描 test 中所有数据,每条数据都将会传到子查询中与 sub_test 关联,子查询不会先被执行,所以如果 test 表很大的话,那么性能上将会出现问题。

 

《高性能MySQL》一书的观点引用

《高性能MySQL》的第4.4节“MySQL查询优化器的限制(Limitations of the MySQL Query Optimizer)”之第4.4.1小节“关联子查询(Correlated Subqueries)”也有类似的论述:

MySQL有时优化子查询很糟,特别是在WHERE从句中的IN()子查询。……

比如在sakila数据库sakila.film表中找出所有的film,这些film的actoress包括Penelope Guiness(actor_id = 1)。可以这样写:

mysql> SELECT * FROM sakila.film

-> WHERE film_id IN(

-> SELECT film_id FROM sakila.film_actor WHERE actor_id = 1);

mysql> EXPLAIN SELECT * FROM sakila.film ...;

+----+--------------------+------------+--------+------------------------+

| id | select_type        | table      | type   | possible_keys          |

+----+--------------------+------------+--------+------------------------+

| 1  | PRIMARY            | film       | ALL    | NULL                   |

| 2  | DEPENDENT SUBQUERY | film_actor | eq_ref | PRIMARY,idx_fk_film_id |

+----+--------------------+------------+--------+------------------------+

根据EXPLAIN的输出,MySQL将全表扫描film表,对找到的每行执行子查询,这是很不好的性能。幸运的是,很容易改写为一个join查询:

mysql> SELECT film.* FROM sakila.film

-> INNER JOIN sakila.film_actor USING(film_id)

-> WHERE actor_id = 1;

另外一个方法是通过使用GROUP_CONCAT()执行子查询作为一个单独的查询,手工产生IN()列表。有时候比join还快。(注:你不妨在我们的库上试试看 SELECT goods_id,GROUP_CONCAT(cast(id as char))

FROM bee_shop_goods

WHERE shop_id IN (1519066,1466114,1466110,1466102,1466071,1453929)

GROUP BY goods_id;)

MySQL已经因为这种特定类型的子查询执行计划而被批评。

 

何时子查询是好的

MySQL并不总是把子查询优化得很糟。有时候还是很优化的。下面是个例子:

mysql> EXPLAIN SELECT film_id, language_id FROM sakila.film

-> WHERE NOT EXISTS(

-> SELECT * FROM sakila.film_actor

-> WHERE film_actor.film_id = film.film_id

-> )G

……(注:具体文字还是请阅读《高性能MySQL》吧)

是的,子查询并不是总是被优化得很糟糕,具体问题具体分析,但别忘了 explain 。

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=325172822&siteId=291194637