MySQL Topic V: JOIN Optimization and Subquery Optimization

JOIN means the same as the English word "join": it joins two tables. Joins are roughly divided into inner joins, outer joins, left joins, right joins, and natural joins. First, here is a diagram you have probably seen everywhere, followed by the test data.
[Figure: diagram of the JOIN types]
[Figure: test data inserted into tables t1 and t2]

First, JOIN syntax

Inner join
An inner join can be written in three ways:
select * from t1 join t2 on t1.a = t2.a;
select * from t1 inner join t2 on t1.a = t2.a;
select * from t1 cross join t2 on t1.a = t2.a;

Left join:
select * from t1 left join t2 on t1.a = t2.a;

Right join:
select * from t1 right join t2 on t1.a = t2.a;

Second, how joins work

Whether it is an inner join or a left or right join, a join needs a driving table and a driven table. For an inner join it does not matter which table is chosen as the driving table. For outer joins the driving table is fixed: in a left join the driving table is the left table, and in a right join it is the right table.
The general procedure for a join is:

1) Choose the driving table, apply the filter conditions that involve it, and run the single-table query against it using the lowest-cost access method.
2) For each record in the driving table's result set from the previous step, look up the matching records in the driven table.
The corresponding pseudocode:

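for each row in t1 {   // iterate over every record in the result set of the single-table query on t1
    for each row in t2 {   // for a given t1 record, iterate over every record in the result set of the single-table query on t2
        // check whether the pair satisfies the join condition
    }
}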

(A) Nested-Loop Join

The process above works like a nested loop: the driving table is accessed only once, while the driven table may be accessed many times. The number of accesses depends on how many records are in the result set of the single-table query on the driving table. This way of executing a join is called a nested-loop join (Nested-Loop Join), and it is the simplest and clumsiest join algorithm.
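The loop structure can be sketched in a few lines of Python. This is only an illustration: the rows are hypothetical (the article's real test data is in a figure), and the function name nested_loop_join is made up for this sketch. It counts how many times the driven table is scanned.

```python
# Hypothetical sample rows; not the article's actual test data.
t1 = [{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 5, "b": 1}, {"a": 7, "b": 9}]
t2 = [{"a": 1, "c": "x"}, {"a": 2, "c": "y"}, {"a": 2, "c": "z"}, {"a": 8, "c": "w"}]


def nested_loop_join(driving, driven, driving_filter, join_cond):
    """Return the joined row pairs and the number of driven-table scans."""
    result, driven_scans = [], 0
    for outer in driving:              # the driving table is read once
        if not driving_filter(outer):
            continue
        driven_scans += 1              # one full pass over the driven table
        for inner in driven:
            if join_cond(outer, inner):
                result.append((outer, inner))
    return result, driven_scans


rows, scans = nested_loop_join(
    t1,
    t2,
    driving_filter=lambda r: r["b"] in (1, 2),
    join_cond=lambda o, i: o["a"] == i["a"],
)
print(scans, [(o["a"], i["c"]) for o, i in rows])
```

Three driving-table rows pass the filter, so the driven table is scanned three times.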

For example, take the following SQL:
select * from t1 join t2 on t1.a = t2.a where t1.b in (1,2);

MySQL first executes the single-table query on the driving table t1 with the filter t1.b in (1,2):
[Figure: the query on t1 and its result]
This returns three records.
It then executes one query per record:
select * from t2 where t2.a = 1;
select * from t2 where t2.a = 2;
select * from t2 where t2.a = 5;

Each of these steps is really a query against a single table, so an index can be used to speed it up.
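To see why an index helps, here is a minimal sketch in which a Python dict stands in for an index on t2.a. The data is hypothetical, and a dict is a hash lookup rather than the B+ tree that an InnoDB index uses; the point is only that each driving-table row triggers a lookup instead of a full scan.

```python
from collections import defaultdict

# Hypothetical sample rows; not the article's actual test data.
t1 = [{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 5, "b": 1}, {"a": 7, "b": 9}]
t2 = [{"a": 1, "c": "x"}, {"a": 2, "c": "y"}, {"a": 2, "c": "z"}, {"a": 8, "c": "w"}]

# Stand-in for an index on t2.a: key value -> rows holding that value.
index_t2_a = defaultdict(list)
for row in t2:
    index_t2_a[row["a"]].append(row)

result = []
for outer in t1:
    if outer["b"] not in (1, 2):       # filter on the driving table
        continue
    # Plays the role of: select * from t2 where t2.a = <value of t1.a>
    for inner in index_t2_a.get(outer["a"], []):
        result.append((outer["a"], inner["c"]))

print(result)
```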

(B) Block Nested-Loop Join

Scanning a table really means loading it from disk into memory and then checking in memory whether the match conditions are satisfied. Real tables are not like t1 and t2 with only a few records; they may hold thousands upon thousands of rows. Memory may not be able to hold every record of a table, so while the front of the table is being scanned the later records may still be on disk, and by the time the scan reaches them memory may be running short, so the earlier records have to be released. As described above, a two-table nested-loop join accesses the driven table many times. If the driven table holds a lot of data and cannot be accessed through an index, this amounts to reading the table from disk many times, and the I/O cost is very high. So we need a way to minimize the number of accesses to the driven table.

When the driven table holds a lot of data, every access loads its records into memory, where each record is matched against just one record of the driving table's result set and is then evicted. The next record is then taken from the driving table's result set and the driven table is loaded into memory all over again. However many records the driving table's result set has, that is how many times the driven table is loaded from disk. So could we match the driven table's records against several driving-table records at once while they are in memory? That would cut the cost of repeatedly loading the driven table from disk.
Join buffer
MySQL has a concept called the join buffer: a fixed-size block of memory allocated before a join query is executed. Several records from the driving table's result set are first placed in the join buffer, and then the driven table is scanned. Each driven-table record is matched against all the driving-table records in the join buffer in one pass. Because the matching is done in memory, this can significantly reduce the I/O cost on the driven table.
In the best case, the join buffer is large enough to hold the entire result set of the driving table, so the join completes with a single pass over the driven table. The nested-loop join with a join buffer added is called the block nested-loop join (Block Nested-Loop Join) algorithm.
The size of the join buffer is set with the startup option or system variable join_buffer_size. The default is 262144 bytes (256 KB), and the minimum is 128 bytes. Of course, the best way to optimize queries on the driven table is to give it an efficient index. If an index cannot be used and the machine has plenty of memory, try increasing join_buffer_size to optimize join queries.
Also note that not all columns of a driving-table record are placed in the join buffer, only the columns in the select list and in the filter conditions. This is one more reminder not to use * as the select list: list only the columns we care about, so that more records fit in the join buffer.
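The effect of the join buffer on the number of driven-table scans can be sketched as follows. This is a simplified model, not MySQL's implementation: the buffer is measured in rows (buffer_rows is a made-up parameter) rather than in bytes, and the tables are hypothetical.

```python
def block_nested_loop_join(driving, driven, buffer_rows, join_cond):
    """Join in blocks; return the row pairs and the number of driven-table scans."""
    result, driven_scans = [], 0
    for start in range(0, len(driving), buffer_rows):
        join_buffer = driving[start:start + buffer_rows]  # fill the join buffer
        driven_scans += 1                                 # one scan per buffer load
        for inner in driven:
            for outer in join_buffer:                     # matching happens in memory
                if join_cond(outer, inner):
                    result.append((outer, inner))
    return result, driven_scans


def same_a(outer, inner):
    return outer["a"] == inner["a"]


driving = [{"a": n} for n in range(10)]        # hypothetical driving-table result set
driven = [{"a": n} for n in range(0, 20, 2)]   # hypothetical driven table

_, scans_small = block_nested_loop_join(driving, driven, 1, same_a)
_, scans_mid = block_nested_loop_join(driving, driven, 4, same_a)
rows_big, scans_big = block_nested_loop_join(driving, driven, 100, same_a)
print(scans_small, scans_mid, scans_big, len(rows_big))
```

With a one-row buffer the sketch degenerates into the plain nested-loop join. Once the buffer holds the whole driving result set, the driven table is scanned only once.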

Third, outer join elimination

In an inner join the driving table and the driven table are interchangeable, whereas in left and right joins they are fixed. As a result, the optimizer can lower the overall cost of an inner join by choosing the join order of the tables, but it cannot optimize the join order of an outer join.
The essential difference between outer and inner joins is this. In an outer join, if no record in the driven table matches a driving-table record under the filter conditions of the ON clause, the driving-table record is still added to the result set, with the driven table's fields filled with NULL. In an inner join, if no record in the driven table matches a driving-table record under the ON clause, the driving-table record is discarded.

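Case: in the figure below, the driving table and the driven table have swapped, because the query was actually optimized into an inner join. A WHERE condition that rejects NULL in a driven-table column (is not null) discards exactly the NULL-padded rows, so the outer join returns the same result as an inner join, and the query optimizer is then free to choose the optimal join order.
[Figure: example in which the driving and driven tables have swapped after is not null was added]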
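The equivalence can be checked on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (it is not MySQL, and the rows are hypothetical); it only shows that the result sets agree, not how MySQL plans the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (a int, b int);
    create table t2 (a int, b int);
    insert into t1 values (1, 1), (2, 2), (5, 1);
    insert into t2 values (1, 10), (2, 20), (8, 80);
""")

# Plain left join: the unmatched t1 row is padded with NULL.
left_all = conn.execute(
    "select t1.a, t2.b from t1 left join t2 on t1.a = t2.a order by t1.a"
).fetchall()

# Left join plus a NULL-rejecting condition on the driven table.
left_not_null = conn.execute(
    "select t1.a, t2.b from t1 left join t2 on t1.a = t2.a "
    "where t2.a is not null order by t1.a"
).fetchall()

inner = conn.execute(
    "select t1.a, t2.b from t1 inner join t2 on t1.a = t2.a order by t1.a"
).fetchall()

print(left_all, left_not_null, inner)
```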

Fourth, subquery optimization

The following SQL statements contain subqueries:

select * from t1 where a in (select a from t2);
select * from (select * from t1) as t;

(A) Classifying subqueries by the result set they return

  • 1. Scalar subquery
    A subquery that returns only a single value is called a scalar subquery. For example:
    select * from t1 where a in (select max(a) from t2);
  • 2. Row subquery
    The subquery returns a single record, and that record contains multiple columns. For example:
    select * from t1 where (a, b) = (select a, b from t2 limit 1);
  • 3. Column subquery
    The subquery returns a single column of data containing multiple records. For example:
    select * from t1 where a in (select a from t2);
  • 4. Table subquery
    The result of the subquery contains multiple records and multiple columns. For example:
    select * from t1 where (a, b) in (select a, b from t2);

(B) Classifying subqueries by their relationship to the outer query

  • 1. Correlated subquery
    If executing the subquery depends on a value from the outer query, it is called a correlated subquery. For example:
    select * from t1 where a in (select a from t2 where t1.a = t2.a);
  • 2. Uncorrelated subquery
    If the subquery can be run on its own to produce a result, without depending on any value from the outer query, it is called an uncorrelated subquery. All the subqueries shown earlier are uncorrelated subqueries.

(C) How MySQL executes subqueries

  • 1. Uncorrelated scalar subquery or row subquery
    For example: select * from t1 where a = (select a from t2 limit 1);
    It is executed as follows:
    1) Execute the subquery select a from t2 limit 1.
    2) Pass the result of the subquery to the outer query as a parameter, then execute the outer query select * from t1 where a = ... ;

  • 2. Correlated scalar subquery or row subquery
    For example: select * from t1 where b = (select b from t2 where t1.a = t2.a limit 1);
    It is executed as follows:
    1) Fetch one record from the outer query; in this example, fetch one record from table t1.
    2) From that record, take the values the subquery refers to; in this example, take the value of column t1.a from the t1 record. Then execute the subquery.
    3) Use the subquery's result to check whether the outer query's WHERE condition holds. If it does, add the outer record to the result set; otherwise discard it.
    4) Go back to step 1 and fetch the next record from the outer query, and so on.
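The four steps for the correlated case can be sketched like this. The rows are hypothetical and the helper name subquery is made up; limit 1 is modelled as "first matching row in list order", which is an assumption because SQL does not define an order without ORDER BY.

```python
# Hypothetical sample rows.
t1 = [{"a": 1, "b": 10}, {"a": 2, "b": 99}, {"a": 3, "b": 30}]
t2 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]


def subquery(outer_a):
    """select b from t2 where t2.a = outer_a limit 1; None means no row."""
    for row in t2:
        if row["a"] == outer_a:
            return row["b"]
    return None


result = []
for outer in t1:                        # step 1: fetch one outer record
    sub_value = subquery(outer["a"])    # step 2: run the subquery with t1.a
    # step 3: test the outer WHERE condition b = (subquery result)
    if sub_value is not None and outer["b"] == sub_value:
        result.append(outer)
    # step 4: the loop moves on to the next outer record

print(result)
```

The subquery runs once per outer record, which is why correlated subqueries get expensive on large outer tables.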

  • 3. IN subquery optimization
    MySQL optimizes IN subqueries.
    For example: select * from t1 where a in (select a from t2);
    For an uncorrelated IN subquery, if the subquery's result set is small, treating the subquery and the outer query as two separate single-table queries is still quite efficient. But if the subquery on its own produces a very large result set, these problems arise:
    • The result set is too large and may not fit in memory.
    • For the outer query, a very large subquery result set means a very large number of IN clause parameters, which leads to:
      • The index cannot be used effectively, so the outer query falls back to a full table scan.
      • During the full table scan, checking each record against that many IN clause parameters takes too long.

MySQL does not pass the result set of an uncorrelated subquery directly to the outer query as parameters. Instead it writes the result set into a temporary table, as follows:
1) The columns of the temporary table are the columns of the subquery's result set.
2) Records written to the temporary table are deduplicated. An IN predicate checks whether an operand is in a set, and duplicate values in the set do not change the outcome, so deduplicating the result set while writing it makes the temporary table smaller. A temporary table is still a table, so creating a primary key or unique index over all of its columns is enough to deduplicate it.
3) A subquery's result set is usually not very large, so MySQL creates an in-memory temporary table with the MEMORY storage engine and builds a hash index on it. Since an IN predicate is essentially a set-membership check, a hash index on the set's data makes the matching very fast.
4) If the subquery's result set is very large and exceeds the system variable tmp_table_size or max_heap_table_size, the temporary table switches to an on-disk storage engine, and the index type changes to a B+ tree index accordingly.

The process of saving the records of a subquery's result set into a temporary table is called materialization (Materialize), and the temporary table that stores the result set is called a materialized table. Because the records in a materialized table are indexed (a hash index for an in-memory materialized table, a B+ tree index for an on-disk one), checking through the index whether an operand is in the subquery's result set becomes very fast, which improves the performance of the subquery statement.
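As a rough analogy, a Python set behaves like a small in-memory materialized table with a hash index: it deduplicates on insert and answers membership checks quickly. The values below are hypothetical, and the on-disk B+ tree case is not modelled.

```python
# Hypothetical column values.
t1_a = [1, 2, 2, 5, 7]       # column a of t1
t2_a = [2, 2, 2, 5, 5, 9]    # column a of t2, with duplicates

# "Materialize" the subquery select a from t2: deduplicated, hash-based lookup.
materialized_table = set(t2_a)

# Outer query: select * from t1 where a in (<materialized table>)
result = [a for a in t1_a if a in materialized_table]

print(sorted(materialized_table), result)
```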

Consider again the SQL above: select * from t1 where a in (select a from t2);

After the subquery is materialized, suppose the materialized table is named materialized_table and the column that stores the subquery's result set is named m_val. The query can then be viewed from two angles:

• From the point of view of table t1, the query means: for each record in t1, if the value of its column a is in the materialized table, add the record to the final result set.

• From the point of view of the materialized table, the query means: for each value in the materialized table, if records whose column a equals that value can be found in t1, add those records to the final result set.

In other words, the query above is equivalent to an inner join between table t1 and the materialized table materialized_table:
select * from t1 inner join materialized_table on t1.a = m_val;

Once the query is turned into a join, the query optimizer can evaluate the cost of the different join orders and execute the cheapest one.
Although materializing the subquery first incurs the cost of creating a temporary table, converting the subquery into a join is still somewhat more efficient. Can we skip materialization and convert the subquery directly into a join?
Compare the following two SQL statements:
select * from t1 where a in (select a from t2);
select t1.* from t1 inner join t2 on t1.a = t2.a;

The results of these two queries are very similar, but the result set of the second one is not deduplicated, so an IN subquery and a join of the two tables are not exactly equivalent. Still, converting a subquery into a join really does let the optimizer do its job, so MySQL introduced a new concept: the semi-join. A semi-join of table t1 with table t2 means that for a record of t1 we only care whether a matching record exists in t2, not how many records match, and the final result set keeps only records from t1. The semi-join is just a way MySQL executes subqueries internally; MySQL does not provide a user-facing semi-join syntax.
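The difference between the IN subquery and the plain inner join shows up as soon as t2 contains duplicate values. A small check, again using sqlite3 as a stand-in engine with hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (a int);
    create table t2 (a int);
    insert into t1 values (1), (2), (3);
    insert into t2 values (2), (2), (3), (4);
""")

in_rows = conn.execute(
    "select a from t1 where a in (select a from t2) order by a"
).fetchall()

join_rows = conn.execute(
    "select t1.a from t1 inner join t2 on t1.a = t2.a order by t1.a"
).fetchall()

print(in_rows)    # each t1 row appears at most once
print(join_rows)  # the t1 row with a = 2 appears once per matching t2 row
```

A semi-join has to produce the first result while still letting the optimizer treat the query like a join.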

So how is a semi-join implemented?

(1) Table pullout (pulling the subquery's table up)
When the select list of the subquery contains only a primary key or unique index column, the subquery's table can be pulled up directly into the FROM clause of the outer query, and the subquery's search conditions merged into the outer query's search conditions.
For example: select * from t1 where a in (select a from t2 where t2.b = 1); -- a is the primary key

We can pull table t2 straight up into the FROM clause of the outer query and merge the subquery's search conditions into the outer query's, so the query after the pullout looks like this:
select * from t1 inner join t2 on t1.a = t2.a where t2.b = 1; -- a is the primary key
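Because t2.a is unique, a t1 row can match at most one t2 row, so the join cannot produce duplicates. The check below uses sqlite3 as a stand-in engine with hypothetical rows and selects only t1's columns so the two result sets are directly comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (a int, b int);
    create table t2 (a int primary key, b int);
    insert into t1 values (1, 0), (2, 0), (2, 1), (3, 0);
    insert into t2 values (1, 1), (2, 1), (3, 5);
""")

in_rows = conn.execute(
    "select t1.a, t1.b from t1 "
    "where a in (select a from t2 where t2.b = 1) "
    "order by t1.a, t1.b"
).fetchall()

pulled_out = conn.execute(
    "select t1.a, t1.b from t1 inner join t2 on t1.a = t2.a "
    "where t2.b = 1 order by t1.a, t1.b"
).fetchall()

print(in_rows, pulled_out)
```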

(2) DuplicateWeedout execution strategy (eliminating duplicates)
Consider this query:
select * from t1 where a in (select e from t2 where t2.b = 1); -- e is just an ordinary column

After the conversion to a semi-join, one record of t1 may match several records of t2, so the record could be added to the final result set several times. To eliminate the duplicates we can build a temporary table, say one that looks like this:
create table tmp (
    id int primary key  -- the column type is an assumption; the column holds t1's primary key
);

While the join is being executed, whenever a t1 record is about to be added to the result set, its primary key is first inserted into the temporary table. If the insert succeeds, this t1 record has not been added to the final result set before, so it is added now. If the insert fails, this t1 record is already in the final result set, so it is simply discarded. This way of using a temporary table to eliminate duplicate values from the semi-join result set is called DuplicateWeedout.
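A minimal sketch of the idea, with a Python set playing the role of the tmp table's primary key. The rows are hypothetical, and t1 is given an id column to act as its primary key.

```python
# Hypothetical sample rows; id is t1's primary key.
t1 = [{"id": 1, "a": 10}, {"id": 2, "a": 20}, {"id": 3, "a": 30}]
t2 = [{"e": 10, "b": 1}, {"e": 10, "b": 1}, {"e": 20, "b": 2}, {"e": 30, "b": 1}]

tmp = set()     # stands in for the tmp table with a primary key on id
result = []
for outer in t1:
    for inner in t2:
        if inner["b"] == 1 and outer["a"] == inner["e"]:
            if outer["id"] in tmp:   # the insert would fail: row already emitted
                continue
            tmp.add(outer["id"])     # the insert succeeds: first time we see this row
            result.append(outer)

print([row["id"] for row in result])
```

The t1 row with id 1 matches two t2 rows but appears only once in the result.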

(3) FirstMatch execution strategy (first match)
FirstMatch is the most primitive way to execute a semi-join, and it is the idea we started with. Take a record from the outer query, then look in the subquery's table for records that match. If one is found, add the outer record to the final result set and stop looking for further matches. If none is found, discard the outer record. Then take the next record from the outer query and repeat the process.
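In code, FirstMatch is a nested loop with an early break. The sketch below uses hypothetical rows and a made-up function name, and counts comparisons to show where the scan stops early.

```python
def first_match(outer_rows, inner_rows, matches):
    """Keep each outer row that has at least one match; stop at the first match."""
    result, comparisons = [], 0
    for outer in outer_rows:
        for inner in inner_rows:
            comparisons += 1
            if matches(outer, inner):
                result.append(outer)
                break                # first match found: skip the remaining inner rows
    return result, comparisons


# Hypothetical sample rows.
t1 = [{"a": 10}, {"a": 20}, {"a": 30}]
t2 = [{"e": 10}, {"e": 10}, {"e": 30}]

result, comparisons = first_match(t1, t2, lambda o, i: o["a"] == i["e"])
print([row["a"] for row in result], comparisons)
```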

(4) LooseScan (loose index scan)
The subquery scans a non-unique index. Because the index is not unique it may contain equal values, and since equal values sit next to each other in the index, the index itself can be used to deduplicate them.
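A rough sketch of the idea: walk the subquery table's index in key order, take only the first entry of each run of equal keys, and use that key to find matching outer rows. The data is hypothetical, and a sorted list stands in for the index on t2.a.

```python
# Hypothetical data. The sorted list stands in for a non-unique index on t2.a.
t2_index_a = sorted([3, 1, 3, 3, 2, 1])    # index order: 1, 1, 2, 3, 3, 3
t1 = [{"a": 1}, {"a": 3}, {"a": 3}, {"a": 4}]

result = []
previous = None
for value in t2_index_a:
    if value == previous:     # same key as the previous index entry: skip it
        continue
    previous = value
    # Only the first entry of each key group is used to look up outer rows.
    result.extend(row for row in t1 if row["a"] == value)

print(result)
```

Each distinct key is used once, so an outer row is never emitted twice for the same key.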
Some correlated subqueries that use IN can also be converted, such as this query:
select * from t1 where a in (select b from t2 where t1.b = t2.b);
It can be converted to a semi-join:
select * from t1 semi join t2 on t1.a = t2.b and t1.b = t2.b;

Several situations cannot be converted to a semi-join:
• The outer query's WHERE condition has other search conditions connected to the IN subquery's Boolean expression with OR
• NOT IN is used instead of IN
• The subquery contains GROUP BY, HAVING, or aggregate functions
• The subquery contains UNION

For subqueries that cannot be converted to a semi-join, there are other optimizations:
• An uncorrelated subquery can be materialized first and then take part in the query
For example, this SQL using NOT IN:
select * from t1 where a not in (select a from t2 where t2.a = 1);

Note that after this subquery is materialized it cannot be converted into a join with the outer query's table. The only option is to scan table t1 and, for each t1 record, check whether its value of a is in the materialized table.
• Whether the subquery is correlated or not, the IN subquery can be converted to an EXISTS subquery
In fact any IN subquery can be converted to an EXISTS subquery. The general form is:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
which can be converted to:
EXISTS (SELECT inner_expr FROM ... WHERE subquery_where AND outer_expr = inner_expr)

The benefit of this conversion is that an index that could not be used before the conversion may become usable after it. For example:
select * from t1 where a in (select a from t2 where t2.e = t1.e);
The subquery in this SQL cannot use an index. After the conversion it becomes:
select * from t1 where exists (select 1 from t2 where t2.e = t1.e and t1.a = t2.a)

After the conversion, the index on column a of table t2 can be used.
So if an IN subquery does not meet the conditions for conversion to a semi-join, and it cannot be materialized or materializing it costs too much, it is converted to an EXISTS query.
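The IN-to-EXISTS rewrite can be checked on a small data set. The sketch uses sqlite3 as a stand-in engine with hypothetical rows that contain no NULL values; it compares result sets only and says nothing about which index MySQL would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (a int, e int);
    create table t2 (a int, e int);
    insert into t1 values (1, 1), (2, 1), (3, 2);
    insert into t2 values (1, 1), (2, 9), (3, 2), (3, 2);
""")

in_rows = conn.execute(
    "select a, e from t1 "
    "where a in (select a from t2 where t2.e = t1.e) order by a"
).fetchall()

exists_rows = conn.execute(
    "select a, e from t1 "
    "where exists (select 1 from t2 where t2.e = t1.e and t1.a = t2.a) "
    "order by a"
).fetchall()

print(in_rows, exists_rows)
```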

Fifth, derived table optimization

select * from (select a, b from t1) as t;
In the SQL above the subquery is placed after FROM. Its result acts as a derived table named t, with two columns a and b.
There are two ways to execute a query that contains a derived table:

(A) Materializing the derived table

We can write the derived table's result set into a temporary table and then treat this materialized table as an ordinary table in the query. When materializing a derived table, MySQL uses a strategy called delayed materialization: it materializes the derived table only when the query actually needs it, rather than before the query starts executing. For example:
select * from (select * from t1 where a = 1) as derived1 inner join t2 on derived1.a = t2.a where t2.a = 10;

If this query is executed by materializing the derived table, MySQL first looks in table t2 for records that satisfy t2.a = 10. If there are none, the set of t2 records taking part in the join is empty, so the result set of the whole query is empty and there is no need to materialize the derived table.

(B) Merging the derived table into the outer query, that is, rewriting the query so that no derived table is needed

For example, this SQL:
select * from (select * from t1 where a = 1) as t;
is equivalent to this one:
select * from t1 where a = 1;

A slightly more complex SQL:
select * from (select * from t1 where a = 1) as t inner join t2 on t.a = t2.a where t2.b = 1;
We can merge the derived table into the outer query and fold the derived table's search conditions into the outer query's search conditions, like this:
select * from t1 inner join t2 on t1.a = t2.a where t1.a = 1 and t2.b = 1;
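The derived-table form and the merged form can be compared on a small data set. As before, sqlite3 is only a stand-in engine and the rows are hypothetical; explicit column lists are used so the two result sets line up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (a int, b int);
    create table t2 (a int, b int);
    insert into t1 values (1, 10), (1, 11), (2, 20);
    insert into t2 values (1, 1), (1, 2), (2, 1);
""")

derived = conn.execute(
    "select t.a, t.b, t2.b "
    "from (select * from t1 where a = 1) as t "
    "inner join t2 on t.a = t2.a "
    "where t2.b = 1 order by t.b"
).fetchall()

merged = conn.execute(
    "select t1.a, t1.b, t2.b "
    "from t1 inner join t2 on t1.a = t2.a "
    "where t1.a = 1 and t2.b = 1 order by t1.b"
).fetchall()

print(derived, merged)
```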

Merging the derived table into the outer query in this way eliminates the derived table, which means we no longer pay the cost of creating and accessing a temporary table. But not every query with a derived table can be merged with the outer query. A derived table that contains any of the following cannot be merged:

Aggregate functions, such as MAX(), MIN(), SUM()
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Another subquery in the SELECT clause of the subquery that defines the derived table

So when executing a query with a derived table, MySQL first tries to merge the derived table into the outer query. If that is not possible, it materializes the derived table and then executes the query.

The above is based on study materials from Luban School.


Origin blog.csdn.net/weixin_36586564/article/details/104008195