MySQL · Engine Features · Notes on the InnoDB Transaction Subsystem (Part 1)

1 Preface

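This post is based on the mysql.taobao monthly report "MySQL · 引擎特性 · InnoDB 事务子系统介绍" (MySQL · Engine Features · Introduction to the InnoDB Transaction Subsystem). The material touches on many topics; this post covers the first half: MDL, starting a transaction, committing a transaction, and savepoints.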

2 Starting a Transaction

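InnoDB offers several ways to start a transaction. The simplest is a BEGIN statement; START TRANSACTION works too, and you can choose whether the transaction is read-only or read-write. Every explicit start implicitly commits the previous transaction.
All explicit starts go through the entry function trans_begin. The common ways of starting a transaction are listed below.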

2.1 BEGIN

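When a transaction is started with BEGIN, the server first checks whether an active transaction is still uncommitted. If so, it calls ha_commit_trans to commit that transaction and releases the MDL locks it held. Executing BEGIN does not actually start a transaction in the engine layer; it only sets a flag on the current thread marking the transaction as explicitly started.
"BEGIN WORK" and "START TRANSACTION" are equivalent to BEGIN.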

2.2 START TRANSACTION READ ONLY

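This option starts a read-only transaction and sets thd->tx_read_only to true for the current thread. Any data-modifying SQL that reaches the Server layer is then rejected outright with the error code ER_CANT_EXECUTE_IN_READ_ONLY_TRANSACTION, without ever entering the engine layer.
The option strictly constrains a transaction to be read-only, and a read-only transaction can take an optimized path in the engine layer that costs less than a read-write transaction: for example, no transaction id is allocated, no rollback segment is assigned, and the transaction is not kept on the global transaction list.
This way of starting a transaction was introduced in 5.6. MySQL 5.6 brought an important optimization to the transaction module: the global transaction list was split into two lists, one for read-only transactions and one for read-write transactions, so building a consistent view only requires walking the read-write list. In 5.6, however, InnoDB could not convert a transaction from read-only to read-write automatically, so users had to start read-only transactions explicitly in one of two ways:
execute START TRANSACTION READ ONLY,
or set the variable tx_read_only to true.
5.7 introduced automatic mode conversion, but the syntax was kept.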

2.3 START TRANSACTION READ WRITE

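The opposite of the above: this statement starts a read-write transaction, which is also the default transaction mode. One difference is that if read_only is enabled on the instance and the current connection is not a super account, explicitly starting a read-write transaction raises an error.
Likewise, the transaction state TX_READ_WRITE is added to the Session Tracker. For all of the explicit starts described above, the TX_EXPLICIT flag is added to the session tracker as well.
Being declared read-write does not mean the engine layer treats the transaction as read-write. In 5.7, InnoDB always assumes a transaction is read-only when it starts. A simple example: if the first SQL statement of your transaction is a read-only query, the transaction's state inside InnoDB is read-only; if the second statement is an update, the transaction is converted to read-write mode.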
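The lazy upgrade described above can be sketched in a few lines. This is a toy model, not InnoDB code: the names TrxSys, Trx, on_write and rw_trx_count are hypothetical, and the only behavior it shows is that a transaction gets an id and joins the read-write list on its first write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical model, not InnoDB's real structures.
struct TrxSys {
  std::uint64_t next_id = 1;        // next transaction id to hand out
  std::set<std::uint64_t> rw_list;  // ids of active read-write transactions
};

struct Trx {
  std::uint64_t id = 0;     // 0 means "no id assigned yet" (read-only so far)
  bool read_write = false;  // every transaction starts out read-only
};

// The first write upgrades the transaction; later writes change nothing here.
inline void on_write(TrxSys& sys, Trx& trx) {
  if (trx.read_write) return;
  assert(trx.id == 0);  // a read-only transaction never received an id
  trx.id = sys.next_id++;
  trx.read_write = true;
  sys.rw_list.insert(trx.id);
}

// Building a consistent view only needs to walk the read-write list.
inline std::size_t rw_trx_count(const TrxSys& sys) {
  return sys.rw_list.size();
}
```

A transaction that only reads never calls on_write, so it never consumes an id and never shows up in rw_list.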

2.4 START TRANSACTION WITH CONSISTENT SNAPSHOT

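Unlike the methods above, this one also creates a view (Read View) when the transaction starts. In InnoDB a view describes what a transaction is allowed to see, and it is a key part of the multi-version feature.
This path enters the InnoDB layer and calls the function innobase_start_trx_and_assign_read_view. Note that a Read View is created explicitly only when the isolation level is REPEATABLE READ; otherwise a warning is raised.
A transaction started this way is already in the ACTIVE state.
The state variable TX_WITH_SNAPSHOT is added to the Session Tracker.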

2.5 AUTOCOMMIT = 0

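With autocommit set to 0 there is no need to start a transaction explicitly. If you run several SQL statements without explicitly calling COMMIT (or running a statement that causes an implicit commit), the transaction stays open. Setting this variable to 0 is generally discouraged, because program logic or usage habits can easily leave a transaction uncommitted for a long time. Long-running uncommitted transactions are a nightmare in MySQL and cause all sorts of strange problems. A typical scenario: you run a query but do not commit, so a later DDL on that table blocks, and then every subsequent SQL statement on that table blocks behind it, with disastrous results.
Another case: if a transaction that has already built a Read View stays uncommitted for a long time, the purge thread cannot clean up undo logs produced by transactions that have since committed, so the undo space bloats. The visible symptom is an ibdata file that grows wildly.
TIPS: Fortunately, since 5.7 undo logs can be truncated online, provided that separate undo tablespaces are enabled and enough undo rollback segments are configured (128 by default; at least 35 are required). The truncate mechanism is fairly simple: when the purge thread finds an undo file larger than a configured threshold and no active transaction references that file, it marks the file as unavailable for allocation and physically truncates it.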

3 Transaction Commit

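A transaction can be committed in two ways: implicitly or explicitly.
Explicitly starting a new transaction, or executing a DDL statement on a non-temporary table, implicitly commits the previous transaction. The other way is to execute a "COMMIT" statement explicitly.
What MySQL does at commit time, however, differs by scenario, mainly because MySQL has a server-layer/engine-layer architecture with two logging systems: the binary log and the engine's transaction log. MySQL supports two kinds of XA transactions, implicit XA and explicit XA; of course, if binlog is off and only one transactional engine is in use, there is no XA to speak of.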

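If binlog is on and a transactional engine is used, the XA coordinator object is mysql_bin_log;
If binlog is off and more than one transactional engine exists, the XA coordinator object is tc_log_mmap;
Otherwise tc_log_dummy is used; in this scenario there is no XA to speak of, and no coordinator is needed.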
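The three cases above amount to a small decision function. The sketch below is only an illustration of that rule: pick_tc_log and its parameters are hypothetical names, and the function returns the coordinator's name as a string rather than touching any real server object.

```cpp
#include <cassert>
#include <string>

// Returns the name of the coordinator object the text associates with each
// case. Hypothetical helper, not a MySQL function.
inline std::string pick_tc_log(bool binlog_open, int transactional_engines) {
  assert(transactional_engines >= 0);
  if (binlog_open && transactional_engines >= 1) return "mysql_bin_log";
  if (!binlog_open && transactional_engines > 1) return "tc_log_mmap";
  return "tc_log_dummy";  // no XA coordination needed
}
```

For example, pick_tc_log(false, 1) falls through to tc_log_dummy, which is the single-engine case discussed later under Engine Commit.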

3.1 Binlog/Engine XA

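With binlog enabled, MySQL uses this implicit XA mode by default. In 5.7 the commit flow consists of:
Binlog Prepare: sets thd->durability_property= HA_IGNORE_DURABILITY, meaning the redo log is not flushed during InnoDB prepare.
InnoDB Prepare (entry function innobase_xa_prepare --> trx_prepare): updates the InnoDB undo rollback segment and sets it to the Prepare state (TRX_UNDO_PREPARED).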

Group COMMIT

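Flush Stage: the transactions form a queue, and the leader writes the binlog file on behalf of the other threads one by one. Before writing the binlog, the leader calls the ha_flush_logs interface to flush the storage engine's log up to the latest LSN, and only then writes the binlog to the file. The purpose is to make group commit more efficient.

Background: in versions before MySQL 5.6, because of Binlog/InnoDB XA, writing the binlog and committing in InnoDB were fully serialized. The sequence was roughly:

InnoDB prepare (holding prepare_commit_mutex);
write/sync Binlog;
InnoDB commit (releases prepare_commit_mutex after writing the COMMIT mark).

With sync_binlog=1, the second step clearly becomes the bottleneck.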
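Sync Stage: if the sync_binlog counter exceeds the configured value, the file is fsynced once. Note that sync_binlog does not mean one fsync per that many transactions, but one fsync per that many transaction group queues.
Commit Stage: the transactions in the queue perform InnoDB commit one after another, changing the undo header state to one of TRX_UNDO_CACHED, TRX_UNDO_TO_FREE, or TRX_UNDO_TO_PURGE, then releasing transaction locks and cleaning up the read-write transaction list, the read view, and so on. During the commit stage each transaction also updates the binlog position stored in the transaction system page.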
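The point about sync_binlog counting groups rather than transactions is easy to miss, so here is a toy counter that makes it concrete. BinlogSyncModel and its members are hypothetical names; the real counting happens inside MYSQL_BIN_LOG, and treating a period of 0 as "never fsync from here" is an assumption of this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical counter: `sync_period` plays the role of sync_binlog.
struct BinlogSyncModel {
  unsigned sync_period;       // fsync once every `sync_period` groups
  unsigned sync_counter = 0;  // groups written since the last fsync
  unsigned fsyncs = 0;        // number of fsync calls issued so far

  explicit BinlogSyncModel(unsigned period) : sync_period(period) {}

  // Called once per commit group, however many transactions it contains.
  void commit_group(const std::vector<int>& trx_ids) {
    assert(!trx_ids.empty());
    if (sync_period == 0) return;  // assumption: 0 means no fsync is issued
    if (++sync_counter >= sync_period) {
      ++fsyncs;
      sync_counter = 0;
    }
  }
};
```

With a period of 2, four groups holding ten transactions in total trigger two fsyncs in this model, not five.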

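Alibaba's version has one additional stage around the commit stage: the semi-sync stage. Semisync Stage (RDS MySQL only): if we wait for the replica's ACK before the transaction commits (AFTER_SYNC mode), the user thread releases the lock of the previous stage and then waits for the ACK. This means that while waiting for the ACK we do not block binlog writes in the previous stage, which can improve throughput somewhat.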

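With binlog enabled, MySQL uses MYSQL_BIN_LOG as the XA coordinator by default. During crash recovery the flow is roughly:
1. Scan the last binlog file and extract the xids in it;
2. InnoDB maintains a list of transactions in the Prepare state. Compare the xids of these transactions with the xids recorded in the binlog: commit a transaction if its xid is in the binlog, and roll it back otherwise.
This keeps the transaction state in InnoDB and in the binlog consistent. Clearly, as long as a transaction has completed Prepare in InnoDB and has been written to the binlog, it can be recovered after a crash, which means there is no need to explicitly write/fsync the redo log at InnoDB commit.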
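The two recovery steps above reduce to a set-membership test per prepared transaction. The following sketch shows only that decision; resolve_prepared and Decision are hypothetical names, and the xids are plain integers rather than real XID structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class Decision { Commit, Rollback };

// For every transaction the engine reports as prepared, commit it if its xid
// was found in the last binlog file, otherwise roll it back.
inline std::map<std::uint64_t, Decision> resolve_prepared(
    const std::set<std::uint64_t>& binlog_xids,
    const std::vector<std::uint64_t>& prepared_xids) {
  std::map<std::uint64_t, Decision> out;
  for (std::uint64_t xid : prepared_xids) {
    out[xid] = binlog_xids.count(xid) ? Decision::Commit : Decision::Rollback;
  }
  assert(out.size() <= prepared_xids.size());
  return out;
}
```

An xid that appears in the binlog but not in the prepared list needs no action in this model: that transaction is no longer in the Prepare state in the engine.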

The source is in /blob/master/sql/binlog.cc; the entry function is MYSQL_BIN_LOG::ordered_commit

int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit)
{
  DBUG_ENTER("MYSQL_BIN_LOG::ordered_commit");
  int flush_error= 0, sync_error= 0;
  my_off_t total_bytes= 0;
  bool do_rotate= false;

  /*
    These values are used while flushing a transaction, so clear
    everything.
    Notes:
    - It would be good if we could keep transaction coordinator
      log-specific data out of the THD structure, but that is not the
      case right now.
    - Everything in the transaction structure is reset when calling
      ha_commit_low since that calls Transaction_ctx::cleanup.
  */
  thd->get_transaction()->m_flags.pending= true;
  thd->commit_error= THD::CE_NONE;
  thd->next_to_commit= NULL;
  thd->durability_property= HA_IGNORE_DURABILITY;
  thd->get_transaction()->m_flags.real_commit= all;
  thd->get_transaction()->m_flags.xid_written= false;
  thd->get_transaction()->m_flags.commit_low= !skip_commit;
  thd->get_transaction()->m_flags.run_hooks= !skip_commit;
#ifndef DBUG_OFF
  /*
     The group commit Leader may have to wait for follower whose transaction
     is not ready to be preempted. Initially the status is pessimistic.
     Preemption guarding logics is necessary only when !DBUG_OFF is set.
     It won't be required for the dbug-off case as long as the follower won't
     execute any thread-specific write access code in this method, which is
     the case as of current.
  */
  thd->get_transaction()->m_flags.ready_preempt= 0;
#endif

  DBUG_PRINT("enter", ("flags.pending: %s, commit_error: %d, thread_id: %u",
                       YESNO(thd->get_transaction()->m_flags.pending),
                       thd->commit_error, thd->thread_id()));

  DEBUG_SYNC(thd, "bgc_before_flush_stage");

  /*
    Stage #1: flushing transactions to binary log
    While flushing, we allow new threads to enter and will process
    them in due time. Once the queue was empty, we cannot reap
    anything more since it is possible that a thread entered and
    appointed itself leader for the flush phase.
  */

#ifdef HAVE_REPLICATION
  if (has_commit_order_manager(thd))
  {
    Slave_worker *worker= dynamic_cast<Slave_worker *>(thd->rli_slave);
    Commit_order_manager *mngr= worker->get_commit_order_manager();

    if (mngr->wait_for_its_turn(worker, all))
    {
      thd->commit_error= THD::CE_COMMIT_ERROR;
      DBUG_RETURN(thd->commit_error);
    }

    if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))
      DBUG_RETURN(finish_commit(thd));
  }
  else
#endif
  if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))
  {
    DBUG_PRINT("return", ("Thread ID: %u, commit_error: %d",
                          thd->thread_id(), thd->commit_error));
    DBUG_RETURN(finish_commit(thd));
  }

  THD *wait_queue= NULL, *final_queue= NULL;
  mysql_mutex_t *leave_mutex_before_commit_stage= NULL;
  my_off_t flush_end_pos= 0;
  bool update_binlog_end_pos_after_sync;
  if (unlikely(!is_open()))
  {
    final_queue= stage_manager.fetch_queue_for(Stage_manager::FLUSH_STAGE);
    leave_mutex_before_commit_stage= &LOCK_log;
    /*
      binary log is closed, flush stage and sync stage should be
      ignored. Binlog cache should be cleared, but instead of doing
      it here, do that work in 'finish_commit' function so that
      leader and followers thread caches will be cleared.
    */
    goto commit_stage;
  }
  DEBUG_SYNC(thd, "waiting_in_the_middle_of_flush_stage");
  flush_error= process_flush_stage_queue(&total_bytes, &do_rotate,
                                                 &wait_queue);

  if (flush_error == 0 && total_bytes > 0)
    flush_error= flush_cache_to_file(&flush_end_pos);
  DBUG_EXECUTE_IF("crash_after_flush_binlog", DBUG_SUICIDE(););

  update_binlog_end_pos_after_sync= (get_sync_period() == 1);

  /*
    If the flush finished successfully, we can call the after_flush
    hook. Being invoked here, we have the guarantee that the hook is
    executed before the before/after_send_hooks on the dump thread
    preventing race conditions among these plug-ins.
  */
  if (flush_error == 0)
  {
    const char *file_name_ptr= log_file_name + dirname_length(log_file_name);
    DBUG_ASSERT(flush_end_pos != 0);
    if (RUN_HOOK(binlog_storage, after_flush,
                 (thd, file_name_ptr, flush_end_pos)))
    {
      sql_print_error("Failed to run 'after_flush' hooks");
      flush_error= ER_ERROR_ON_WRITE;
    }

    if (!update_binlog_end_pos_after_sync)
      update_binlog_end_pos();
    DBUG_EXECUTE_IF("crash_commit_after_log", DBUG_SUICIDE(););
  }

  if (flush_error)
  {
    /*
      Handle flush error (if any) after leader finishes it's flush stage.
    */
    handle_binlog_flush_or_sync_error(thd, false /* need_lock_log */);
  }

  DEBUG_SYNC(thd, "bgc_after_flush_stage_before_sync_stage");

  /*
    Stage #2: Syncing binary log file to disk
  */

  if (change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue, &LOCK_log, &LOCK_sync))
  {
    DBUG_PRINT("return", ("Thread ID: %u, commit_error: %d",
                          thd->thread_id(), thd->commit_error));
    DBUG_RETURN(finish_commit(thd));
  }

  /*
    Shall introduce a delay only if it is going to do sync
    in this ongoing SYNC stage. The "+1" used below in the
    if condition is to count the ongoing sync stage.
  */
  if (!flush_error && (sync_counter + 1 >= get_sync_period()))
    stage_manager.wait_count_or_timeout(opt_binlog_group_commit_sync_no_delay_count,
                                        opt_binlog_group_commit_sync_delay,
                                        Stage_manager::SYNC_STAGE);

  final_queue= stage_manager.fetch_queue_for(Stage_manager::SYNC_STAGE);

  if (flush_error == 0 && total_bytes > 0)
  {
    DEBUG_SYNC(thd, "before_sync_binlog_file");
    std::pair<bool, bool> result= sync_binlog_file(false);
    sync_error= result.first;
  }

  if (update_binlog_end_pos_after_sync)
  {
    THD *tmp_thd= final_queue;

    while (tmp_thd->next_to_commit != NULL)
      tmp_thd= tmp_thd->next_to_commit;
    if (flush_error == 0 && sync_error == 0)
      update_binlog_end_pos(tmp_thd->get_trans_pos());
  }

  DEBUG_SYNC(thd, "bgc_after_sync_stage_before_commit_stage");

  leave_mutex_before_commit_stage= &LOCK_sync;
  /*
    Stage #3: Commit all transactions in order.
    This stage is skipped if we do not need to order the commits and
    each thread have to execute the handlerton commit instead.
    Howver, since we are keeping the lock from the previous stage, we
    need to unlock it if we skip the stage.
    We must also step commit_clock before the ha_commit_low() is called
    either in ordered fashion(by the leader of this stage) or by the tread
    themselves.
    We are delaying the handling of sync error until
    all locks are released but we should not enter into
    commit stage if binlog_error_action is ABORT_SERVER.
  */
commit_stage:
  if (opt_binlog_order_commits &&
      (sync_error == 0 || binlog_error_action != ABORT_SERVER))
  {
    if (change_stage(thd, Stage_manager::COMMIT_STAGE,
                     final_queue, leave_mutex_before_commit_stage,
                     &LOCK_commit))
    {
      DBUG_PRINT("return", ("Thread ID: %u, commit_error: %d",
                            thd->thread_id(), thd->commit_error));
      DBUG_RETURN(finish_commit(thd));
    }
    THD *commit_queue= stage_manager.fetch_queue_for(Stage_manager::COMMIT_STAGE);
    DBUG_EXECUTE_IF("semi_sync_3-way_deadlock",
                    DEBUG_SYNC(thd, "before_process_commit_stage_queue"););

    if (flush_error == 0 && sync_error == 0)
      sync_error= call_after_sync_hook(commit_queue);

    /*
      process_commit_stage_queue will call update_on_commit or
      update_on_rollback for the GTID owned by each thd in the queue.
      This will be done this way to guarantee that GTIDs are added to
      gtid_executed in order, to avoid creating unnecessary temporary
      gaps and keep gtid_executed as a single interval at all times.
      If we allow each thread to call update_on_commit only when they
      are at finish_commit, the GTID order cannot be guaranteed and
      temporary gaps may appear in gtid_executed. When this happen,
      the server would have to add and remove intervals from the
      Gtid_set, and adding and removing intervals requires a mutex,
      which would reduce performance.
    */
    process_commit_stage_queue(thd, commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    /*
      Process after_commit after LOCK_commit is released for avoiding
      3-way deadlock among user thread, rotate thread and dump thread.
    */
    process_after_commit_stage_queue(thd, commit_queue);
    final_queue= commit_queue;
  }
  else
  {
    if (leave_mutex_before_commit_stage)
      mysql_mutex_unlock(leave_mutex_before_commit_stage);
    if (flush_error == 0 && sync_error == 0)
      sync_error= call_after_sync_hook(final_queue);
  }

  /*
    Handle sync error after we release all locks in order to avoid deadlocks
  */
  if (sync_error)
    handle_binlog_flush_or_sync_error(thd, true /* need_lock_log */);

  /* Commit done so signal all waiting threads */
  stage_manager.signal_done(final_queue);

  /*
    Finish the commit before executing a rotate, or run the risk of a
    deadlock. We don't need the return value here since it is in
    thd->commit_error, which is returned below.
  */
  (void) finish_commit(thd);

  /*
    If we need to rotate, we do it without commit error.
    Otherwise the thd->commit_error will be possibly reset.
   */
  if (DBUG_EVALUATE_IF("force_rotate", 1, 0) ||
      (do_rotate && thd->commit_error == THD::CE_NONE))
  {
    /*
      Do not force the rotate as several consecutive groups may
      request unnecessary rotations.
      NOTE: Run purge_logs wo/ holding LOCK_log because it does not
      need the mutex. Otherwise causes various deadlocks.
    */

    DEBUG_SYNC(thd, "ready_to_do_rotation");
    bool check_purge= false;
    mysql_mutex_lock(&LOCK_log);
    /*
      If rotate fails then depends on binlog_error_action variable
      appropriate action will be taken inside rotate call.
    */
    int error= rotate(false, &check_purge);
    mysql_mutex_unlock(&LOCK_log);

    if (error)
      thd->commit_error= THD::CE_COMMIT_ERROR;
    else if (check_purge)
      purge();
  }
  /*
    flush or sync errors are handled above (using binlog_error_action).
    Hence treat only COMMIT_ERRORs as errors.
  */
  DBUG_RETURN(thd->commit_error == THD::CE_COMMIT_ERROR);
}

3.2 Engine/Engine XA

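With binlog off, a transaction that spans engines can run XA between the transactional engines, typically InnoDB and TokuDB (RDS MySQL supports both of these transactional engines). When more than one transactional engine is supported and binlog is off, the TC LOG MMAP logic is used, and the XA coordinator object is tc_log_mmap.
Because transaction information must be persisted for recovery after a restart, in this scenario the tc_log_mmap module creates a file named tc.log, initially 24KB in size, and maps it into memory with mmap.
During transaction execution, for example at the first data-modifying SQL statement, a uniquely identifying XID is registered (in practice the query_id of the current query serves as the unique identifier), and this XID does not change until the transaction commits. The transactional engine itself must tag its undo with this XID.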

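In the Prepare phase, if the transaction involves multiple engines, each engine first prepares the transaction on its own.
Then comes the commit phase, in which the XID is recorded in tc.log (as shown in the figure above). A fairly involved page-selection process takes place here, which is not described in detail; see the function TC_LOG_MMAP::commit.
Once the XID has been recorded in tc.log, each engine commits the transaction. Even if an engine's commit fails, crash recovery can read the xids recorded in tc.log and direct the engine layer to commit the transactions that match.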

3.3 Engine Commit

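With binlog off and a transaction that uses only one transactional engine, XA is not needed, and the commit flow differs accordingly.
First, the transaction does not need to enter the Prepare state, because XA makes no sense for a single-engine transaction.
Second, without the protection of the Prepare state, the transaction must persist the transaction log at commit. Only then can every change made by a transaction that returned successfully be fully restored during crash recovery.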

3.4 Explicit XA

First, the official documentation: MySQL supports explicitly starting a named XA transaction.

13.3.7.2 XA Transaction States

An XA transaction progresses through the following states:

Use XA START to start an XA transaction and put it in the ACTIVE state.
For an ACTIVE XA transaction, issue the SQL statements that make up the transaction, and then issue an XA END statement. XA END puts the transaction in the IDLE state.
For an IDLE XA transaction, you can issue either an XA PREPARE statement or an XA COMMIT ... ONE PHASE statement:
- XA PREPARE puts the transaction in the PREPARED state. An XA RECOVER statement at this point will include the transaction's xid value in its output, because XA RECOVER lists all XA transactions that are in the PREPARED state.
- XA COMMIT ... ONE PHASE prepares and commits the transaction. The xid value will not be listed by XA RECOVER because the transaction terminates.
For a PREPARED XA transaction, you can issue an XA COMMIT statement to commit and terminate the transaction, or XA ROLLBACK to roll back and terminate the transaction.

Here is a simple XA transaction that inserts a row into a table as part of a global transaction:

mysql> XA START 'xatest';
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO mytable (i) VALUES(10);
Query OK, 1 row affected (0.04 sec)

mysql> XA END 'xatest';
Query OK, 0 rows affected (0.00 sec)

mysql> XA PREPARE 'xatest';
Query OK, 0 rows affected (0.00 sec)

mysql> XA COMMIT 'xatest';
Query OK, 0 rows affected (0.00 sec)

Within the context of a given client connection, XA transactions and local (non-XA) transactions are mutually exclusive. For example, if XA START has been issued to begin an XA transaction, a local transaction cannot be started until the XA transaction has been committed or rolled back. Conversely, if a local transaction has been started with START TRANSACTION, no XA statements can be used until the transaction has been committed or rolled back.

If an XA transaction is in the ACTIVE state, you cannot issue any statements that cause an implicit commit. That would violate the XA contract because you could not roll back the XA transaction. You will receive the following error if you try to execute such a statement:

ERROR 1399 (XAE07): XAER_RMFAIL: The command cannot be executed
when global transaction is in the ACTIVE state

Statements to which the preceding remark applies are listed at Section 13.3.3, “Statements That Cause an Implicit Commit”.
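The state transitions quoted above can be written down as a small state machine. This is a sketch under stated assumptions: XaState and xa_apply are hypothetical names, commands are matched as plain strings without the xid, and only the transitions listed in the quoted text are modeled.

```cpp
#include <cassert>
#include <string>

enum class XaState { None, Active, Idle, Prepared };

// Applies one XA command. Returns false and leaves the state unchanged if the
// command is not one of the transitions listed in the quoted text.
inline bool xa_apply(XaState& s, const std::string& cmd) {
  assert(!cmd.empty());
  if (cmd == "XA START" && s == XaState::None) { s = XaState::Active; return true; }
  if (cmd == "XA END" && s == XaState::Active) { s = XaState::Idle; return true; }
  if (cmd == "XA PREPARE" && s == XaState::Idle) { s = XaState::Prepared; return true; }
  if (cmd == "XA COMMIT ONE PHASE" && s == XaState::Idle) { s = XaState::None; return true; }
  if ((cmd == "XA COMMIT" || cmd == "XA ROLLBACK") && s == XaState::Prepared) {
    s = XaState::None;  // the transaction terminates
    return true;
  }
  return false;
}
```

Feeding it the sequence from the example session (XA START, XA END, XA PREPARE, XA COMMIT) walks None, Active, Idle, Prepared and back to None, while XA COMMIT issued in the ACTIVE state is refused.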

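In versions before 5.7, if the session was killed after XA PREPARE had completed, the transaction was lost instead of being recoverable the way it is after a crash. The main reason is that MySQL handles a killed session by rolling the transaction back directly.
To fix this, MySQL 5.7 made substantial changes and records both XA phases in the binlog. The state is thus persisted, and after a clean shutdown the state of XA transactions can be recovered by scanning the binlog. For XA transactions lost to a killed session the logic is simpler: an in-memory transaction_cache tracks all XA transactions, and when a connection is closed and THD::cleanup is called, the transaction is not rolled back; only a flag is set on it.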

4 Transaction Rollback

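When a transaction has to be rolled back for whatever reason (a deadlock, or an explicit ROLLBACK, for example), the handler interface ha_rollback_low is called, which in turn calls the InnoDB function trx_rollback_for_mysql to roll the transaction back. Rollback works by fetching the undo log records and applying the inverse operations.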

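InnoDB's undo is stored separately in a tablespace and is essentially the same as ordinary data pages. If the undo pages have already been evicted from memory when the transaction rolls back, the rollback (especially of a large transaction) may involve heavy disk I/O, so InnoDB rollback can be very inefficient. Some database management systems, PostgreSQL for example, implement multi-versioning by keeping redundant data on the data pages to form a version chain. Rollback is then very cheap and only requires setting a flag, at the cost of extra overhead for cleaning up dead data.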
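To make "fetch the undo log and apply the inverse operations" concrete, here is a minimal in-memory sketch. It is not how InnoDB stores undo: UndoRec and MiniTrx are hypothetical names, the "table" is a std::map, and each undo record simply keeps the before-image of one key.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Before-image of one key, recorded before the key is changed.
struct UndoRec {
  std::string key;
  bool existed;  // false: the key was absent before the change
  int before;    // previous value, meaningful only when existed is true
};

struct MiniTrx {
  std::map<std::string, int>& table;
  std::vector<UndoRec> undo;  // oldest record first

  explicit MiniTrx(std::map<std::string, int>& t) : table(t) {}

  void put(const std::string& key, int value) {
    std::map<std::string, int>::iterator it = table.find(key);
    UndoRec rec = {key, it != table.end(), it != table.end() ? it->second : 0};
    undo.push_back(rec);
    table[key] = value;
  }

  void erase(const std::string& key) {
    std::map<std::string, int>::iterator it = table.find(key);
    if (it == table.end()) return;  // nothing changed, nothing to log
    UndoRec rec = {key, true, it->second};
    undo.push_back(rec);
    table.erase(it);
  }

  // Apply the inverse operations, newest record first.
  void rollback() {
    while (!undo.empty()) {
      const UndoRec& r = undo.back();
      if (r.existed) table[r.key] = r.before; else table.erase(r.key);
      undo.pop_back();
    }
    assert(undo.empty());
  }
};
```

Replaying newest-first matters: a key that was updated twice must pass back through its intermediate value before reaching the original one.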

5 Savepoint

During a transaction you can set a SAVEPOINT to manage how the transaction proceeds. Here is an example.

mysql>  create table test(aa varchar(9));
Query OK, 0 rows affected

mysql> begin;
Query OK, 0 rows affected

mysql>  insert into test values(1);
Query OK, 1 row affected

mysql> savepoint a1;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
+----+
1 row in set

mysql> insert into test values(2);
Query OK, 1 row affected

mysql> savepoint a2;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 2  |
+----+
2 rows in set

mysql>  insert into test values(3);
Query OK, 1 row affected

mysql> savepoint a3;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 2  |
| 3  |
+----+
3 rows in set

mysql> rollback to a2;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 2  |
+----+
2 rows in set

mysql> release savepoint a2;
Query OK, 0 rows affected

mysql> commit;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 2  |
+----+
2 rows in set
mysql> begin;
Query OK, 0 rows affected

mysql> delete from test where aa=2;
Query OK, 1 row affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
+----+
1 row in set

mysql> savepoint b1;
Query OK, 0 rows affected

mysql> insert into test values(4);
Query OK, 1 row affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 4  |
+----+
2 rows in set

mysql> release savepoint b1;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 4  |
+----+
2 rows in set

mysql> rollback to b1;
1305 - SAVEPOINT b1 does not exist
mysql> savepoint b2;
Query OK, 0 rows affected

mysql> insert into test values(5);
Query OK, 1 row affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 4  |
| 5  |
+----+
3 rows in set

mysql> rollback to b2;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 4  |
+----+
2 rows in set

mysql> commit;
Query OK, 0 rows affected

mysql> select * from test;
+----+
| aa |
+----+
| 1  |
| 4  |
+----+
2 rows in set

mysql>

The source for these functions is in blob/master/sql/transaction.cc
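The behavior seen in the session above (ROLLBACK TO keeps the savepoint, RELEASE SAVEPOINT makes it disappear) can be mimicked with a toy model. This is only a sketch: SavepointTrx is a hypothetical class, the table is append-only so deletes are not modeled, and dropping savepoints set after a released one is an assumption of the sketch rather than something the session demonstrates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical append-only "table": a savepoint just remembers how many rows
// existed when it was set.
class SavepointTrx {
 public:
  void insert(int v) { rows_.push_back(v); }

  void savepoint(const std::string& name) {
    savepoints_.push_back(std::make_pair(name, rows_.size()));
  }

  // Undo everything after the savepoint. The savepoint itself survives;
  // savepoints set after it are dropped. Returns false if it does not exist.
  bool rollback_to(const std::string& name) {
    std::size_t i = 0;
    if (!find(name, &i)) return false;
    assert(savepoints_[i].second <= rows_.size());
    rows_.resize(savepoints_[i].second);
    savepoints_.resize(i + 1);
    return true;
  }

  // Forget the savepoint (and, by assumption, later ones) without touching rows.
  bool release(const std::string& name) {
    std::size_t i = 0;
    if (!find(name, &i)) return false;
    savepoints_.resize(i);
    return true;
  }

  const std::vector<int>& rows() const { return rows_; }

 private:
  // Search from the newest savepoint backwards.
  bool find(const std::string& name, std::size_t* idx) const {
    for (std::size_t i = savepoints_.size(); i-- > 0;) {
      if (savepoints_[i].first == name) { *idx = i; return true; }
    }
    return false;
  }

  std::vector<int> rows_;
  std::vector<std::pair<std::string, std::size_t> > savepoints_;
};
```

Replaying the second half of the session against this model gives the same shape of result: after release("b1"), rollback_to("b1") returns false, which corresponds to error 1305 in the session, while rollback_to("b2") removes only the row inserted after b2.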

***********************************************************************

The road of learning is painful, but keep at it and it will pay off.

References:

http://mysql.taobao.org/monthly/2015/12/01/

https://dev.mysql.com/doc/refman/5.7/en/metadata-locking.html

https://dev.mysql.com/doc/refman/5.7/en/xa-states.html