MySQL 8.0 plan optimization 源码阅读笔记

以下基于社区版8.0代码

预备知识：

MySQL JOIN syntax: https://dev.mysql.com/doc/refman/8.0/en/join.html
Straight join: is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. STRAIGHT_JOIN有两种用法：一种是加在JOIN处作为INNER JOIN的一种特殊类型hint该join的顺序；另一种是加在SELECT处使该select下的所有JOIN都强制为用户table的join顺序，从优化代码上看该用法与semi-join不可同时存在（Optimize_table_order::optimize_straight_join: DBUG_ASSERT(join->select_lex->sj_nests.is_empty())）。
join order hint: Join-order hints affect the order in which the optimizer joins tables, including JOIN_FIXED_ORDER, JOIN_ORDER, JOIN_PREFIX, JOIN_SUFFIX
各种JOIN类型：INNER JOIN, OUTER JOIN, SEMI JOIN, LEFT/RIGHT JOIN, etc.
Materialization（物化）: Usually happens in subquery (sometimes known as semi-join). Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory.
Statistics (统计信息)：从存储获取表的rowcount、min/max/sum/avg/keyrange等元信息，用于辅助plan优化。
table dependencies: A LEFT JOIN B : B depends on A and A's own dependencies。(待确认 DEPEND JOIN语义是否也是由table dependencies关系表示)
table access path: An access path may use either an index scan, a table scan, a range scan or ref access, known as join type in explain.
- index scan: 一般index scan指的是二级索引scan (MySQL主键索引会带着data存放)
- table scan: 直接扫表
- range scan: 对于索引列的一些可能转化为范围查询的条件，MySQL会试图将其转化为range scan来减少范围外无用的scan。单个范围的range query类似带range条件下推的index scan或table scan；range query支持抽取出多个范围查询。
- ref: join field是索引，但不是pk或unique not null 索引
- eq_ref: join field是索引且是pk或unique not null索引，意味着对于每个record最多只会join到右表的一行。

MySQL源码中JOIN对象的tables的存放layout（参考注释，单看变量名有歧义）：

 /**
    Before plan has been created, "tables" denote number of input tables in the
    query block and "primary_tables" is equal to "tables".
    After plan has been created (after JOIN::get_best_combination()),
    the JOIN_TAB objects are enumerated as follows:
    - "tables" gives the total number of allocated JOIN_TAB objects
    - "primary_tables" gives the number of input tables, including
      materialized temporary tables from semi-join operation.
    - "const_tables" are those tables among primary_tables that are detected
      to be constant.
    - "tmp_tables" is 0, 1 or 2 (more if windows) and counts the maximum
      possible number of intermediate tables in post-processing (ie sorting and
      duplicate removal).
      Later, tmp_tables will be adjusted to the correct number of
      intermediate tables, @see JOIN::make_tmp_tables_info.
    - The remaining tables (ie. tables - primary_tables - tmp_tables) are
      input tables to materialized semi-join operations.
    The tables are ordered as follows in the join_tab array:
     1. const primary table
     2. non-const primary tables
     3. intermediate sort/group tables
     4. possible holes in array
     5. semi-joined tables used with materialization strategy
  */
  uint tables;          ///< Total number of tables in query block
  uint primary_tables;  ///< Number of primary input tables in query block
  uint const_tables;    ///< Number of primary tables deemed constant
  uint tmp_tables;      ///< Number of temporary tables used by query

源码剖析

Join表示一个query的join plan，同时也作为plan的context流转（因此在para query等一些优化实现中，并行查询里除了最上层的父查询有实际优化的价值外，Join起的作用更像一个context）。

best_positions存放最终优化的table order结果。
best_read 存放最终cost
best_ref 存放输入的table序列，the optimizer optimizes best_ref

make_join_plan 在JOIN::optimize里被调用，计算最佳的join order并构建join plan。 Steps:

Here is an overview of the logic of this function:

- Initialize JOIN data structures and setup basic dependencies between tables.

- Update dependencies based on join information. 对于存在outer join或recursive的tables进行关系传递propagate_dependencies()(用传递闭包算法)，构建出完整的依赖关系。(recursive这里具体指代未确定，nested？WITH RECURSIVE语法？)

- Make key descriptions (update_ref_and_keys()). 这一步骤较为烦杂，本意是想从conditions中找出join连接的condition，并识别出join condition相关的key(key指的就是索引)，为后续决定join_type到底是ref/ref_or_null/index等做好准备。但MySQL在这一步又加了不少特殊判断，比如对key is null的特殊处理等。

- Pull out semi-join tables based on table dependencies.

- Extract tables with zero or one row as const tables. 从这步开始的四个步骤都是const table优化，核心就是先把const table算出来，将变量替换成常量。这里是依靠获取采样判断const table。

- Read contents of const tables, substitute columns from these tables with
  actual data. Also keep track of empty tables vs. one-row tables.

- After const table extraction based on row count, more tables may
  have become functionally dependent. Extract these as const tables.

- Add new sargable predicates based on retrieved const values.

- Calculate number of rows to be retrieved from each table. 获取采样结果的步骤。

- Calculate cost of potential semi-join materializations.

- Calculate best possible join order based on available statistics. 即下文的Optimize_table_order::choose_table_order

- Fill in remaining information for the generated join order.

Statistics

核心对象ha_statistics。最主要的是records表示table rowcount。

class ha_statistics {
  ulonglong data_file_length;     /* Length off data file */
  ulonglong max_data_file_length; /* Length off data file */
  ulonglong index_file_length;
  ulonglong max_index_file_length;
  ulonglong delete_length; /* Free bytes */
  ulonglong auto_increment_value;
  /*
    The number of records in the table.
      0    - means the table has exactly 0 rows
    other  - if (table_flags() & HA_STATS_RECORDS_IS_EXACT)
               the value is the exact number of records in the table
             else
               it is an estimate
  */
  ha_rows records;
  ha_rows deleted;       /* Deleted records */
  ulong mean_rec_length; /* physical reclength */
  /* TODO: create_time should be retrieved from the new DD. Remove this. */
  time_t create_time; /* When table was created */
  ulong check_time;
  ulong update_time;
  uint block_size; /* index block size */

  /*
    number of buffer bytes that native mrr implementation needs,
  */
  uint mrr_length_per_rec;
}

myrocks是在handler::info中更新stats的。而info在除了insert的写和部分查询场景会被调用以更新采样信息（调用处多达十余处）。

/**
    General method to gather info from handler

    ::info() is used to return information to the optimizer.
    SHOW also makes use of this data Another note, if your handler
    doesn't proved exact record count, you will probably want to
    have the following in your code:
    if (records < 2)
      records = 2;
    The reason is that the server will optimize for cases of only a single
    record. If in a table scan you don't know the number of records
    it will probably be better to set records to two so you can return
    as many records as you need.

    Along with records a few more variables you may wish to set are:
      records
      deleted
      data_file_length
      index_file_length
      delete_length
      check_time
    Take a look at the public variables in handler.h for more information.
    See also my_base.h for a full description.

    @param   flag          Specifies what info is requested
  */
  virtual int info(uint flag) = 0;

// 以下为可能的flag对应bit取值。 CONST除了初始化较少用；大部分情况下用VARIABLE，因为VARIABLE涉及的变量确实是较频繁更新的；ERRKEY在正常路径不会用到，用来报错查信息；AUTO专门针对自增值，自增值可从内存里table级别对象拿到。

/*
  Recalculate loads of constant variables. MyISAM also sets things
  directly on the table share object.

  Check whether this should be fixed since handlers should not
  change things directly on the table object.

  Monty comment: This should NOT be changed!  It's the handlers
  responsibility to correct table->s->keys_xxxx information if keys
  have been disabled.

  The most important parameters set here is records per key on
  all indexes. block_size and primar key ref_length.

  For each index there is an array of rec_per_key.
  As an example if we have an index with three attributes a,b and c
  we will have an array of 3 rec_per_key.
  rec_per_key[0] is an estimate of number of records divided by
  number of unique values of the field a.
  rec_per_key[1] is an estimate of the number of records divided
  by the number of unique combinations of the fields a and b.
  rec_per_key[2] is an estimate of the number of records divided
  by the number of unique combinations of the fields a,b and c.

  Many handlers only set the value of rec_per_key when all fields
  are bound (rec_per_key[2] in the example above).

  If the handler doesn't support statistics, it should set all of the
  above to 0.

  update the 'constant' part of the info:
  handler::max_data_file_length, max_index_file_length, create_time
  sortkey, ref_length, block_size, data_file_name, index_file_name.
  handler::table->s->keys_in_use, keys_for_keyread, rec_per_key
*/
#define HA_STATUS_CONST 8
/*
  update the 'variable' part of the info:
  handler::records, deleted, data_file_length, index_file_length,
  check_time, mean_rec_length
*/
#define HA_STATUS_VARIABLE 16
/*
  This flag is used to get index number of the unique index that
  reported duplicate key.
  update handler::errkey and handler::dupp_ref
  see handler::get_dup_key()
*/
#define HA_STATUS_ERRKEY 32
/*
  update handler::auto_increment_value
*/
#define HA_STATUS_AUTO 64

Join reorder

Optimize_table_order类负责实际的join reorder操作，入口方法为其惟一的public方法 choose_table_order，在make_join_plan中被调用。Optimize_table_order依赖三个前提：

tables的依赖关系已经排好序
access paths 排好序
statistics 采样已经完成

choose_table_order Steps：

初始化const_tables的cost，如果全是const_tables则可以直接短路返回
如果是在一个sjm(semi-join materialization) plan优化过程中，则做一次排序将semi-join(即子查询的query提前预计算，可根据需求物化)

否则，非STRAIGHT_JOIN且depend无关的tables是按照row_count从小到大排序的

if (SELECT_STRAIGHT_JOIN option is set)
  reorder tables so dependent tables come after tables they depend
  on, otherwise keep tables in the order they were specified in the query
else
  Apply heuristic: pre-sort all access plans with respect to the number of
  records accessed.

Sort algo is merge-sort (tbl >= 5) or insert-sort (tbl < 5)

如果有where_cond，需要把where_cond涉及的列遍历设置到table->cond_set的bitmap中。
STRAIGHT_JOIN的tables优化optimize_straight_join。STRAIGHT_JOIN相当于用户限定了JOIN的顺序，所以此处的优化工作如其注释所说：Select the best ways to access the tables in a query without reordering them.
非STRAIGHT_JOIN则使用启发式贪心算法greedy_search 进行join reorder。

optimize_straight_join :

只支持straight_join，DBUG_ASSERT(join->select_lex->sj_nests.is_empty());与semi-join不兼容，只关注primary tables。
对每个JOIN_TABLE，best_access_path计算其最优的access path，best_access_path通俗的思路概括可参见上面table access pathexplain文档中关于join types的介绍。

set_prefix_join_cost计算当前表基于对应access path下的cost，并计入总的cost model。Cost计算如下：

m_row_evaluate_cost = 0.1 // default value

/*
Cost of accessing the table in course of the entire complete join
    execution, i.e. cost of one access method use (e.g. 'range' or
    'ref' scan ) multiplied by estimated number of rows from tables
    earlier in the join sequence.
*/
read_cost = get_read_cost(table)

void set_prefix_join_cost(uint idx, const Cost_model_server *cm) {
  if (idx == 0) {
    prefix_rowcount = rows_fetched;
    prefix_cost = read_cost + prefix_rowcount * m_row_evaluate_cost;
  } else {
    // this - 1 means last table
    prefix_rowcount = (this - 1)->prefix_rowcount * rows_fetched;
    prefix_cost = (this - 1)->prefix_cost + read_cost + prefix_rowcount * m_row_evaluate_cost;
  }
  // float filter_effect [0,1] means cond filters in executor may reduce rows. 1 means nothing filtered, 0 means all rows filtered and no rows left. It is used to calculate how many row combinations will be joined with the next table
  prefix_rowcount *= filter_effect;
}

greedy_search：

bool Optimize_table_order::best_extension_by_limited_search( table_map remaining_tables, uint idx, uint current_search_depth);
  
procedure greedy_search
    input: remaining_tables
    output: partial_plan;
    {
      partial_plan = <>;
      do {
        (table, a) = best_extension_by_limited_search(partial_plan, remaining_tables, limit_search_depth);
        partial_plan = concat(partial_plan, (table, a));
        remaining_tables = remaining_tables - table;
      } while (remaining_tables != {})
      return pplan;
    }

// 简单理解就是每一步找拓展出去join路径最佳的table，按顺序加进plan里面。
// 这种方案会很受选取的第一个表影响(因为第一个表没有join关系，只能依靠筛选之后的cardinality，一般都是小表)，选小表作第一个表不一定是最优选择。一般Greedy的优化方案会把每个表都当第一个表去评估一次cost，然后从N个cost里选最小的作为最终plan。MySQL里只是返回找到的第一个完整的plan。

best_extension_by_limited_search是一个启发式的搜索过程，search_depth即最大可搜索的深度。best_extension_by_limited_search前半部分逻辑和optimize_straight_join类似：

计算best_access_path 并计算cost。
如果此时的cost已经大于best_read，则直接剪枝，无需继续搜索。
如果prune_level=PRUNE_BY_TIME_OR_ROWS开启，则判断如果best_row_count和best_cost已经大于当前的rows和cost（注意新版本是and关系），且该表没有被后续的其他表依赖 (可以理解成该表是这个图路径上的最后一个节点，所以可以直接prune；但不一定是整个plan的最后一个节点)，则将best_row_count和best_cost设为当前的。写的很绕的代码，结合整个循环看，大致就是每次找基于(rowcount, cost)二维最优的表，所谓的剪枝实际变成了类似加强贪心。
对eq_ref做优先的选择，遇到第一个eq_ref后便递归将所有的eq_ref join找出来。(原作者认为eq_ref是个1:1的mapping，所以基本可以认为cost是恒定的，单独将这个eq_ref的序列提前生成，在后续优化时可以看作一整块放在任何一个顺序位置。当然前提是eq_ref是可以连续的。)
如果还有remaining_tables，则递归继续处理直至remaining 为空。