MySQL · Source Code Analysis · How InnoDB's Repeatable Read Isolation Level Differs

1 Introduction

   The ANSI SQL standard defines four isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE), along with concrete rules that specify which changes made inside and outside a transaction are visible and which are not. Lower isolation levels generally allow higher concurrency and incur lower system overhead.

  • Read Uncommitted: at this level every transaction can see the results of other transactions that have not yet committed. It is rarely used in practice, because its performance is not much better than that of the other levels. Reading uncommitted data is known as a dirty read.
  • Read Committed: a transaction can only see changes made by transactions that have already committed. This level allows the so-called non-repeatable read: other transactions may commit while the current transaction is running, so the same SELECT may return different results at different times.
  • Repeatable Read: this is MySQL's default transaction isolation level. It guarantees that reads repeated within the same transaction see the same rows, even while other sessions modify data concurrently. In theory, however, it still leaves a tricky problem: the phantom read. Simply put, a phantom read occurs when a transaction reads a range of rows, another transaction inserts new rows into that range, and a subsequent read of the same range returns the new "phantom" rows. The InnoDB storage engine addresses this problem with its multi-version concurrency control (MVCC) mechanism.
  • Serializable: the highest isolation level. It avoids phantom reads by forcing transactions to execute in an order in which they cannot conflict with one another; in short, a shared lock is placed on every row that is read. At this level a large number of timeouts and heavy lock contention can occur.

These four isolation levels are implemented with different kinds of locking. When concurrent transactions read the same table, the following anomalies can appear:

    Dirty read: transaction T1 modifies a row, and transaction T2 reads that row before T1 has committed.

    Non-repeatable read: transaction T1 reads a row; transaction T2 then modifies or deletes that row, so when T1 reads the same row again it sees the modified data or finds that the row has been deleted.

    Phantom read: transaction T1 reads the set of rows that satisfy some condition; transaction T2 inserts one or more rows that also satisfy that condition, so when T1 reads again with the same condition it gets more rows than it did the first time.

   MySQL/InnoDB supports all four isolation levels defined by the ANSI SQL standard (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE). This article focuses on how MySQL's REPEATABLE READ isolation level differs from other databases in the way it is implemented.

Let us look at how MySQL behaves under the REPEATABLE READ isolation level. In the example this article is based on, session 1 starts a transaction and runs a range SELECT; session 2 then inserts a row into that range and commits; session 1 runs an UPDATE over the same range and finally runs the SELECT again (the two-session sequence is sketched in the C API example below).

When MySQL executes the UPDATE it can see the row that session 2 inserted; in other words, a phantom read has occurred. Note that the ANSI SQL standard does allow phantom reads at the REPEATABLE READ isolation level.
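The original two-session listing is not reproduced here, so the sketch below reconstructs it with the MySQL C API. Everything specific in it is an assumption made for illustration: the table t(a, b), the connection credentials, and the helpers open_session()/run() are hypothetical, and error handling is minimal. Under REPEATABLE READ, session 1's second SELECT still returns the original two rows (a consistent, non-locking read), yet its UPDATE also modifies the row inserted by session 2 (a current read), and its final SELECT then returns three rows.

#include <mysql.h>

#include <cstdio>
#include <cstdlib>

/* Open one client session; host/user/password/db are placeholders. */
static MYSQL* open_session()
{
	MYSQL*	conn = mysql_init(nullptr);

	if (!mysql_real_connect(conn, "127.0.0.1", "root", "", "test",
				3306, nullptr, 0)) {
		std::fprintf(stderr, "connect: %s\n", mysql_error(conn));
		std::exit(1);
	}
	return(conn);
}

/* Run one statement and consume its result set, if any. */
static void run(MYSQL* conn, const char* sql)
{
	if (mysql_query(conn, sql) != 0) {
		std::fprintf(stderr, "'%s': %s\n", sql, mysql_error(conn));
		std::exit(1);
	}
	if (MYSQL_RES* res = mysql_store_result(conn)) {
		std::printf("%s -> %llu row(s)\n", sql,
			    (unsigned long long) mysql_num_rows(res));
		mysql_free_result(res);
	}
}

int main()
{
	MYSQL*	s1 = open_session();	/* session 1 */
	MYSQL*	s2 = open_session();	/* session 2 */

	run(s1, "DROP TABLE IF EXISTS t");
	run(s1, "CREATE TABLE t (a INT PRIMARY KEY, b INT) ENGINE=InnoDB");
	run(s1, "INSERT INTO t VALUES (1, 0), (2, 0)");

	run(s1, "SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ");
	run(s1, "BEGIN");
	run(s1, "SELECT * FROM t");		/* read view created: 2 rows */

	run(s2, "INSERT INTO t VALUES (3, 0)");	/* committed by autocommit */

	run(s1, "SELECT * FROM t");		/* consistent read: still 2 rows */
	run(s1, "UPDATE t SET b = 1");		/* current read: updates 3 rows */
	run(s1, "SELECT * FROM t");		/* own changes visible: 3 rows */
	run(s1, "COMMIT");

	mysql_close(s1);
	mysql_close(s2);
	return(0);
}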

2 Source Code Analysis

   Next, let us analyze from the source code how InnoDB executes under the REPEATABLE READ isolation level (only the important parts of the code are shown), using the example above as the basis. For the first SELECT statement, InnoDB calls row_search_for_mysql to return the scanned rows. The relevant code of row_search_for_mysql is as follows (MySQL 5.7.18 is used as the reference version):

row_search_for_mysql(
	byte*			buf,
	page_cur_mode_t		mode,
	row_prebuilt_t*		prebuilt,
	ulint			match_mode,
	ulint			direction)
{
	if (!dict_table_is_intrinsic(prebuilt->table)) {
		return(row_search_mvcc(
			buf, mode, prebuilt, match_mode, direction));
	} else {
		return(row_search_no_mvcc(
			buf, mode, prebuilt, match_mode, direction));
	}
}
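Before looking inside the two functions, it helps to keep in mind how the server layer drives this entry point: it is called once with direction == 0 to open and position the cursor, then repeatedly with ROW_SEL_NEXT (or ROW_SEL_PREV) to fetch one row per call, in the get_next(), get_next() fashion mentioned in the comments further down. The following stand-alone sketch models only that calling pattern; dberr_t, ROW_SEL_NEXT and fake_row_search() are stand-ins defined here, not the real InnoDB symbols.

#include <cstdio>

/* Stand-ins for the real InnoDB types and constants (illustrative only). */
enum dberr_t { DB_SUCCESS, DB_RECORD_NOT_FOUND, DB_END_OF_INDEX };
enum { ROW_SEL_OPEN = 0, ROW_SEL_NEXT = 1 };

/* Stand-in for row_search_for_mysql(): pretends to return three rows. */
static dberr_t fake_row_search(unsigned char* buf, int direction)
{
	static int	rows_left = 3;
	(void) direction;

	if (rows_left == 0) {
		return(DB_END_OF_INDEX);
	}
	buf[0] = static_cast<unsigned char>(rows_left--);	/* "store" a row */
	return(DB_SUCCESS);
}

int main()
{
	unsigned char	buf[16];

	/* First call: direction == 0 opens and positions the cursor. */
	dberr_t	err = fake_row_search(buf, ROW_SEL_OPEN);

	/* Following calls: ROW_SEL_NEXT advances the cursor one record at a
	time, until DB_END_OF_INDEX or DB_RECORD_NOT_FOUND ends the scan. */
	while (err == DB_SUCCESS) {
		std::printf("fetched a row (tag %d)\n", buf[0]);
		err = fake_row_search(buf, ROW_SEL_NEXT);
	}
	return(0);
}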

Let us first look at row_search_no_mvcc. As its comment explains, it is the branch taken for intrinsic temporary tables that are private to a single connection; ordinary tables go through row_search_mvcc, which we analyse afterwards.

/** Searches for rows in the database using cursor.
Function is for temporary tables that are not shared accross connections
and so lot of complexity is reduced especially locking and transaction related.
The cursor is an iterator over the table/index.
@param[out]	buf		buffer for the fetched row in MySQL format
@param[in]	mode		search mode PAGE_CUR_L
@param[in,out]	prebuilt	prebuilt struct for the table handler;
				this contains the info to search_tuple,
				index; if search tuple contains 0 field then
				we position the cursor at start or the end of
				index, depending on 'mode'
@param[in]	match_mode	0 or ROW_SEL_EXACT or ROW_SEL_EXACT_PREFIX
@param[in]	direction	0 or ROW_SEL_NEXT or ROW_SEL_PREV;
				Note: if this is != 0, then prebuilt must has a
				pcur with stored position! In opening of a
				cursor 'direction' should be 0.
@return DB_SUCCESS or error code */
dberr_t
row_search_no_mvcc(
	byte*		buf, // buffer to hold the fetched record
	page_cur_mode_t	mode, /* InnoDB page scan mode */
	row_prebuilt_t*	prebuilt, /* everything the scan needs -- table, index, etc. -- lives in this struct */
	ulint		match_mode, /* index match mode: exact match or prefix match */
	ulint		direction)  /* scan direction: forward or backward */
{
	dict_index_t*	index		= prebuilt->index;
	const dtuple_t*	search_tuple	= prebuilt->search_tuple;
	btr_pcur_t*	pcur		= prebuilt->pcur;

	const rec_t*	result_rec	= NULL;
	const rec_t*	clust_rec	= NULL;

	dberr_t		err		= DB_SUCCESS;

	mem_heap_t*	heap		= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	rec_offs_init(offsets_);
	ut_ad(index && pcur && search_tuple);

	/* Step-0: Re-use the cached mtr. */
	mtr_t*		mtr = &index->last_sel_cur->mtr;
	dict_index_t*	clust_index = dict_table_get_first_index(index->table);

	/* Step-1: Build the select graph. */
	if (direction == 0 && prebuilt->sel_graph == NULL) {
		row_prebuild_sel_graph(prebuilt);
	}

	que_thr_t*	thr = que_fork_get_first_thr(prebuilt->sel_graph);

	bool		moves_up;

	if (direction == 0) {

		if (mode == PAGE_CUR_GE || mode == PAGE_CUR_G) {
			moves_up = true;
		} else {
			moves_up = false;
		}

	} else if (direction == ROW_SEL_NEXT) {
		moves_up = true;
	} else {
		moves_up = false;
	}

	/* Step-2: Open or Restore the cursor.
	If search key is specified, cursor is open using the key else
	cursor is open to return all the records. */
	if (direction != 0) {
		if (index->last_sel_cur->invalid) {

			/* Index tree has changed and so active cached cursor
			is no more valid. Re-set it based on the last selected
			position. */
			index->last_sel_cur->release();

			mtr_start(mtr);
			dict_disable_redo_if_temporary(index->table, mtr);

			mem_heap_t*	heap = mem_heap_create(256);
			dtuple_t*	tuple;

			tuple = dict_index_build_data_tuple(
				index, pcur->old_rec,
				pcur->old_n_fields, heap);

			btr_pcur_open_with_no_init(
				index, tuple, pcur->search_mode,
				BTR_SEARCH_LEAF, pcur, 0, mtr);

			mem_heap_free(heap);
		} else {
			/* Restore the cursor for reading next record from cache
			information. */
			ut_ad(index->last_sel_cur->rec != NULL);

			pcur->btr_cur.page_cur.rec = index->last_sel_cur->rec;
			pcur->btr_cur.page_cur.block =
				index->last_sel_cur->block;

			err = row_search_traverse(
				moves_up, match_mode, pcur, mtr);
			if (err != DB_SUCCESS) {
				return(err);
			}
		}
	} else {
		/* There could be previous uncommitted transaction if SELECT
		is operation as part of SELECT (IF NOT FOUND) INSERT
		(IF DUPLICATE) UPDATE plan. */
		index->last_sel_cur->release();

		/* Capture table snapshot in form of trx-id. */
		index->trx_id = dict_table_get_curr_table_sess_trx_id(
			index->table);

		/* Fresh search commences. */
		mtr_start(mtr);
		dict_disable_redo_if_temporary(index->table, mtr);

		if (dtuple_get_n_fields(search_tuple) > 0) {

			btr_pcur_open_with_no_init(
				index, search_tuple, mode, BTR_SEARCH_LEAF,
				pcur, 0, mtr);

		} else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {

			btr_pcur_open_at_index_side(
				mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
				pcur, false, 0, mtr);

		}
	}

	/* Step-3: Traverse the records filtering non-qualifiying records. */
	for (/* No op */;
	     err == DB_SUCCESS;
	     err = row_search_traverse(moves_up, match_mode, pcur, mtr)) {

		const rec_t*	rec = btr_pcur_get_rec(pcur);

		if (page_rec_is_infimum(rec)
		    || page_rec_is_supremum(rec)
		    || rec_get_deleted_flag(
			rec, dict_table_is_comp(index->table))) {

			/* The infimum record on a page cannot be in the
			result set, and neither can a record lock be placed on
			it: we skip such a record. */
			continue;
		}

		offsets = rec_get_offsets(
			rec, index, offsets, ULINT_UNDEFINED, &heap);

		/* Note that we cannot trust the up_match value in the cursor
		at this place because we can arrive here after moving the
		cursor! Thus we have to recompare rec and search_tuple to
		determine if they match enough. */
		if (match_mode == ROW_SEL_EXACT) {
			/* Test if the index record matches completely to
			search_tuple in prebuilt: if not, then we return with
			DB_RECORD_NOT_FOUND */
			if (0 != cmp_dtuple_rec(search_tuple, rec, offsets)) {
				err = DB_RECORD_NOT_FOUND;
				break;
			}
		} else if (match_mode == ROW_SEL_EXACT_PREFIX) {
			if (!cmp_dtuple_is_prefix_of_rec(
				search_tuple, rec, offsets)) {
				err = DB_RECORD_NOT_FOUND;
				break;
			}
		}

		/* Get the clustered index. We always need clustered index
		record for snapshort verification. */
		if (index != clust_index) {

			err = row_sel_get_clust_rec_for_mysql(
				prebuilt, index, rec, thr, &clust_rec,
				&offsets, &heap, NULL, mtr);

			if (err != DB_SUCCESS) {
				break;
			}

			if (rec_get_deleted_flag(
				clust_rec, dict_table_is_comp(index->table))) {

				/* The record is delete marked in clustered
				index. We can skip this record. */
				continue;
			}

			result_rec = clust_rec;
		} else {
			result_rec = rec;
		}

		/* Step-4: Check if row is part of the consistent view that was
		captured while SELECT statement started execution. */
		{
			trx_id_t	trx_id;

			ulint		len;
			ulint		trx_id_off = rec_get_nth_field_offs(
				offsets, clust_index->n_uniq, &len);

			ut_ad(len == DATA_TRX_ID_LEN);

			trx_id = trx_read_trx_id(result_rec + trx_id_off);

			if (trx_id > index->trx_id) {
				/* This row was recently added skip it from
				SELECT view. */
				continue;
			}
		}

		/* Step-5: Cache the row-id of selected row to prebuilt cache.*/
		if (prebuilt->clust_index_was_generated) {
			row_sel_store_row_id_to_prebuilt(
				prebuilt, result_rec, clust_index, offsets);
		}

		/* Step-6: Convert selected record to MySQL format and
		store it. */
		if (prebuilt->template_type == ROW_MYSQL_DUMMY_TEMPLATE) {

			const rec_t*	ret_rec =
				(index != clust_index
				 && prebuilt->need_to_access_clustered)
				? result_rec : rec;

			offsets = rec_get_offsets(ret_rec, index, offsets,
						  ULINT_UNDEFINED, &heap);

			memcpy(buf + 4, ret_rec - rec_offs_extra_size(offsets),
			rec_offs_size(offsets));

			mach_write_to_4(buf, rec_offs_extra_size(offsets) + 4);

		} else if (!row_sel_store_mysql_rec(
				buf, prebuilt, result_rec, NULL, TRUE,
				clust_index, offsets, false)) {
			err = DB_ERROR;
			break;
		}

		/* Step-7: Store cursor position to fetch next record.
		MySQL calls this function iteratively get_next(), get_next()
		fashion. */
		ut_ad(err == DB_SUCCESS);
		index->last_sel_cur->rec = btr_pcur_get_rec(pcur);
		index->last_sel_cur->block = btr_pcur_get_block(pcur);

		/* This is needed in order to restore the cursor if index
		structure changes while SELECT is still active. */
		pcur->old_rec = dict_index_copy_rec_order_prefix(
			index, rec, &pcur->old_n_fields,
			&pcur->old_rec_buf, &pcur->buf_size);

		break;
	}

	if (err != DB_SUCCESS) {
		index->last_sel_cur->release();
	}

	if (heap != NULL) {
		mem_heap_free(heap);
	}
	return(err);
}
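Step-4 above deserves a closer look: for these intrinsic temporary tables no full read view is consulted at all. The "snapshot" is simply the transaction id captured in index->trx_id when the scan started, and a record is skipped whenever its DB_TRX_ID is newer. A minimal restatement of that rule, with illustrative names:

#include <cstdint>

using trx_id_t = std::uint64_t;	/* stand-in for InnoDB's trx_id_t */

/* Visibility rule of row_search_no_mvcc(), Step-4: a record belongs to the
result set only if it was written by a transaction no newer than the one
captured when the scan began (index->trx_id). */
inline bool no_mvcc_row_visible(trx_id_t rec_trx_id, trx_id_t scan_trx_id)
{
	return(rec_trx_id <= scan_trx_id);	/* trx_id > index->trx_id => skip */
}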

/** Extract virtual column data from a virtual index record and fill a dtuple
@param[in]	rec		the virtual (secondary) index record
@param[in]	index		the virtual index
@param[in,out]	vrow		the dtuple where data extract to
@param[in]	heap		memory heap to allocate memory
*/
static
void
row_sel_fill_vrow(
	const rec_t*		rec,
	dict_index_t*		index,
	const dtuple_t**	vrow,
	mem_heap_t*		heap)
{
	ulint           offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets	= offsets_;
	rec_offs_init(offsets_);

	ut_ad(!(*vrow));

	offsets = rec_get_offsets(rec, index, offsets,
				  ULINT_UNDEFINED, &heap);

	*vrow = dtuple_create_with_vcol(
		heap, 0, dict_table_get_n_v_cols(index->table));

	/* Initialize all virtual row's mtype to DATA_MISSING */
	dtuple_init_v_fld(*vrow);

	for (ulint i = 0; i < dict_index_get_n_fields(index); i++) {
		const dict_field_t*     field;
                const dict_col_t*       col;

		field = dict_index_get_nth_field(index, i);
		col = dict_field_get_col(field);

		if (dict_col_is_virtual(col)) {
			const byte*     data;
		        ulint           len;

			data = rec_get_nth_field(rec, offsets, i, &len);

                        const dict_v_col_t*     vcol = reinterpret_cast<
				const dict_v_col_t*>(col);

			dfield_t* dfield = dtuple_get_nth_v_field(
				*vrow, vcol->v_pos);
			dfield_set_data(dfield, data, len);
			dict_col_copy_type(col, dfield_get_type(dfield));
		}
	}
}

/** Searches for rows in the database using cursor.
Function is mainly used for tables that are shared accorss connection and
so it employs technique that can help re-construct the rows that
transaction is suppose to see.
It also has optimization such as pre-caching the rows, using AHI, etc.
@param[out]	buf		buffer for the fetched row in MySQL format
@param[in]	mode		search mode PAGE_CUR_L
@param[in,out]	prebuilt	prebuilt struct for the table handler;
				this contains the info to search_tuple,
				index; if search tuple contains 0 field then
				we position the cursor at start or the end of
				index, depending on 'mode'
@param[in]	match_mode	0 or ROW_SEL_EXACT or ROW_SEL_EXACT_PREFIX
@param[in]	direction	0 or ROW_SEL_NEXT or ROW_SEL_PREV;
				Note: if this is != 0, then prebuilt must has a
				pcur with stored position! In opening of a
				cursor 'direction' should be 0.
@return DB_SUCCESS or error code */
dberr_t
row_search_mvcc(
	byte*		buf,
	page_cur_mode_t	mode,
	row_prebuilt_t*	prebuilt,
	ulint		match_mode,
	ulint		direction)
{
	DBUG_ENTER("row_search_mvcc");

	dict_index_t*	index		= prebuilt->index;
	ibool		comp		= dict_table_is_comp(index->table);
	const dtuple_t*	search_tuple	= prebuilt->search_tuple;
	btr_pcur_t*	pcur		= prebuilt->pcur;
	trx_t*		trx		= prebuilt->trx;
	dict_index_t*	clust_index;
	que_thr_t*	thr;
	const rec_t*	prev_rec = NULL;
	const rec_t*	rec = NULL;
	byte*		end_range_cache = NULL;
	const dtuple_t*	prev_vrow = NULL;
	const dtuple_t*	vrow = NULL;
	const rec_t*	result_rec = NULL;
	const rec_t*	clust_rec;
	dberr_t		err				= DB_SUCCESS;
	ibool		unique_search			= FALSE;
	ibool		mtr_has_extra_clust_latch	= FALSE;
	ibool		moves_up			= FALSE;
	ibool		set_also_gap_locks		= TRUE;
	/* if the query is a plain locking SELECT, and the isolation level
	is <= TRX_ISO_READ_COMMITTED, then this is set to FALSE */
	ibool		did_semi_consistent_read	= FALSE;
	/* if the returned record was locked and we did a semi-consistent
	read (fetch the newest committed version), then this is set to
	TRUE */
	ulint		next_offs;
	ibool		same_user_rec;
	mtr_t		mtr;
	mem_heap_t*	heap				= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets				= offsets_;
	ibool		table_lock_waited		= FALSE;
	byte*		next_buf			= 0;
	bool		spatial_search			= false;
	ulint		end_loop			= 0;

	rec_offs_init(offsets_);

	ut_ad(index && pcur && search_tuple);
	ut_a(prebuilt->magic_n == ROW_PREBUILT_ALLOCATED);
	ut_a(prebuilt->magic_n2 == ROW_PREBUILT_ALLOCATED);

	/* We don't support FTS queries from the HANDLER interfaces, because
	we implemented FTS as reversed inverted index with auxiliary tables.
	So anything related to traditional index query would not apply to
	it. */
	if (prebuilt->index->type & DICT_FTS) {
		DBUG_RETURN(DB_END_OF_INDEX);
	}

#ifdef UNIV_DEBUG
	{
		btrsea_sync_check	check(trx->has_search_latch);
		ut_ad(!sync_check_iterate(check));
	}
#endif /* UNIV_DEBUG */

	if (dict_table_is_discarded(prebuilt->table)) {

		DBUG_RETURN(DB_TABLESPACE_DELETED);

	} else if (prebuilt->table->ibd_file_missing) {

		DBUG_RETURN(DB_TABLESPACE_NOT_FOUND);

	} else if (!prebuilt->index_usable) {

		DBUG_RETURN(DB_MISSING_HISTORY);

	} else if (dict_index_is_corrupted(prebuilt->index)) {

		DBUG_RETURN(DB_CORRUPTION);
	}

	/* We need to get the virtual column values stored in secondary
	index key, if this is covered index scan or virtual key read is
	requested. */
	bool    need_vrow = dict_index_has_virtual(prebuilt->index)
		&& (prebuilt->read_just_key
		    || prebuilt->m_read_virtual_key);

	/*-------------------------------------------------------------*/
	/* PHASE 0: Release a possible s-latch we are holding on the
	adaptive hash index latch if there is someone waiting behind */

	if (trx->has_search_latch
#ifndef INNODB_RW_LOCKS_USE_ATOMICS
	    && rw_lock_get_writer(
		btr_get_search_latch(index)) != RW_LOCK_NOT_LOCKED
#endif /* !INNODB_RW_LOCKS_USE_ATOMICS */
	    ) {

		/* There is an x-latch request on the adaptive hash index:
		release the s-latch to reduce starvation and wait for
		BTR_SEA_TIMEOUT rounds before trying to keep it again over
		calls from MySQL */

		trx_search_latch_release_if_reserved(trx);
	}

	/* Reset the new record lock info if srv_locks_unsafe_for_binlog
	is set or session is using a READ COMMITED isolation level. Then
	we are able to remove the record locks set here on an individual
	row. */
	prebuilt->new_rec_locks = 0;

	/*-------------------------------------------------------------*/
	/* PHASE 1: Try to pop the row from the prefetch cache */

	if (UNIV_UNLIKELY(direction == 0)) {
		trx->op_info = "starting index read";

		prebuilt->n_rows_fetched = 0;
		prebuilt->n_fetch_cached = 0;
		prebuilt->fetch_cache_first = 0;

		if (prebuilt->sel_graph == NULL) {
			/* Build a dummy select query graph */
			row_prebuild_sel_graph(prebuilt);
		}
	} else {
		trx->op_info = "fetching rows";

		if (prebuilt->n_rows_fetched == 0) {
			prebuilt->fetch_direction = direction;
		}

		if (UNIV_UNLIKELY(direction != prebuilt->fetch_direction)) {
			if (UNIV_UNLIKELY(prebuilt->n_fetch_cached > 0)) {
				ut_error;
				/* TODO: scrollable cursor: restore cursor to
				the place of the latest returned row,
				or better: prevent caching for a scroll
				cursor! */
			}

			prebuilt->n_rows_fetched = 0;
			prebuilt->n_fetch_cached = 0;
			prebuilt->fetch_cache_first = 0;

		} else if (UNIV_LIKELY(prebuilt->n_fetch_cached > 0)) {
			row_sel_dequeue_cached_row_for_mysql(buf, prebuilt);

			prebuilt->n_rows_fetched++;

			err = DB_SUCCESS;
			goto func_exit;
		} else if (prebuilt->m_end_range == true) {

			prebuilt->m_end_range = false;

			err = DB_RECORD_NOT_FOUND;
			goto func_exit;
		}

		if (prebuilt->fetch_cache_first > 0
		    && prebuilt->fetch_cache_first < MYSQL_FETCH_CACHE_SIZE) {

			/* The previous returned row was popped from the fetch
			cache, but the cache was not full at the time of the
			popping: no more rows can exist in the result set */

			err = DB_RECORD_NOT_FOUND;
			goto func_exit;
		}

		prebuilt->n_rows_fetched++;

		if (prebuilt->n_rows_fetched > 1000000000) {
			/* Prevent wrap-over */
			prebuilt->n_rows_fetched = 500000000;
		}

		mode = pcur->search_mode;
	}

	/* In a search where at most one record in the index may match, we
	can use a LOCK_REC_NOT_GAP type record lock when locking a
	non-delete-marked matching record.
	Note that in a unique secondary index there may be different
	delete-marked versions of a record where only the primary key
	values differ: thus in a secondary index we must use next-key
	locks when locking delete-marked records. */

	if (match_mode == ROW_SEL_EXACT
	    && dict_index_is_unique(index)
	    && dtuple_get_n_fields(search_tuple)
	    == dict_index_get_n_unique(index)
	    && (dict_index_is_clust(index)
		|| !dtuple_contains_null(search_tuple))) {

		/* Note above that a UNIQUE secondary index can contain many
		rows with the same key value if one of the columns is the SQL
		null. A clustered index under MySQL can never contain null
		columns because we demand that all the columns in primary key
		are non-null. */

		unique_search = TRUE;

		/* Even if the condition is unique, MySQL seems to try to
		retrieve also a second row if a primary key contains more than
		1 column. Return immediately if this is not a HANDLER
		command. */

		if (UNIV_UNLIKELY(direction != 0
				  && !prebuilt->used_in_HANDLER)) {

			err = DB_RECORD_NOT_FOUND;
			goto func_exit;
		}
	}

	/* We don't support sequencial scan for Rtree index, because it
	is no meaning to do so. */
	if (dict_index_is_spatial(index)
		&& !RTREE_SEARCH_MODE(mode)) {
		err = DB_END_OF_INDEX;
		goto func_exit;
	}

	mtr_start(&mtr);

	/*-------------------------------------------------------------*/
	/* PHASE 2: Try fast adaptive hash index search if possible */

	/* Next test if this is the special case where we can use the fast
	adaptive hash index to try the search. Since we must release the
	search system latch when we retrieve an externally stored field, we
	cannot use the adaptive hash index in a search in the case the row
	may be long and there may be externally stored fields */

	if (UNIV_UNLIKELY(direction == 0)
	    && unique_search
	    && btr_search_enabled
	    && dict_index_is_clust(index)
	    && !prebuilt->templ_contains_blob
	    && !prebuilt->used_in_HANDLER
	    && (prebuilt->mysql_row_len < UNIV_PAGE_SIZE / 8)
	    && !prebuilt->innodb_api) {

		mode = PAGE_CUR_GE;

		if (trx->mysql_n_tables_locked == 0
		    && !prebuilt->ins_sel_stmt
		    && prebuilt->select_lock_type == LOCK_NONE
		    && trx->isolation_level > TRX_ISO_READ_UNCOMMITTED
		    && MVCC::is_view_active(trx->read_view)) {

			/* This is a SELECT query done as a consistent read,
			and the read view has already been allocated:
			let us try a search shortcut through the hash
			index.
			NOTE that we must also test that
			mysql_n_tables_locked == 0, because this might
			also be INSERT INTO ... SELECT ... or
			CREATE TABLE ... SELECT ... . Our algorithm is
			NOT prepared to inserts interleaved with the SELECT,
			and if we try that, we can deadlock on the adaptive
			hash index semaphore! */

			ut_a(!trx->has_search_latch);
			rw_lock_s_lock(btr_get_search_latch(index));
			trx->has_search_latch = true;

			switch (row_sel_try_search_shortcut_for_mysql(
					&rec, prebuilt, &offsets, &heap,
					&mtr)) {
			case SEL_FOUND:
				/* At this point, rec is protected by
				a page latch that was acquired by
				row_sel_try_search_shortcut_for_mysql().
				The latch will not be released until
				mtr_commit(&mtr). */
				ut_ad(!rec_get_deleted_flag(rec, comp));

				if (prebuilt->idx_cond) {
					switch (row_search_idx_cond_check(
							buf, prebuilt,
							rec, offsets)) {
					case ICP_NO_MATCH:
					case ICP_OUT_OF_RANGE:
						goto shortcut_mismatch;
					case ICP_MATCH:
						goto shortcut_match;
					}
				}

				if (!row_sel_store_mysql_rec(
					    buf, prebuilt,
					    rec, NULL, FALSE, index,
					    offsets, false)) {
					/* Only fresh inserts may contain
					incomplete externally stored
					columns. Pretend that such
					records do not exist. Such
					records may only be accessed
					at the READ UNCOMMITTED
					isolation level or when
					rolling back a recovered
					transaction. Rollback happens
					at a lower level, not here. */

					/* Proceed as in case SEL_RETRY. */
					break;
				}

			shortcut_match:
				mtr_commit(&mtr);

				/* NOTE that we do NOT store the cursor
				position */

				err = DB_SUCCESS;

				rw_lock_s_unlock(btr_get_search_latch(index));
				trx->has_search_latch = false;

				goto func_exit;

			case SEL_EXHAUSTED:
			shortcut_mismatch:
				mtr_commit(&mtr);

				err = DB_RECORD_NOT_FOUND;

				rw_lock_s_unlock(btr_get_search_latch(index));
				trx->has_search_latch = false;

				/* NOTE that we do NOT store the cursor
				position */

				goto func_exit;

			case SEL_RETRY:
				break;

			default:
				ut_ad(0);
			}

			mtr_commit(&mtr);
			mtr_start(&mtr);

                        rw_lock_s_unlock(btr_get_search_latch(index));
                        trx->has_search_latch = false;
		}
	}

	/*-------------------------------------------------------------*/
	/* PHASE 3: Open or restore index cursor position */

	trx_search_latch_release_if_reserved(trx);

	spatial_search = dict_index_is_spatial(index)
			 && mode >= PAGE_CUR_CONTAIN;

	/* The state of a running trx can only be changed by the
	thread that is currently serving the transaction. Because we
	are that thread, we can read trx->state without holding any
	mutex. */
	ut_ad(prebuilt->sql_stat_start || trx->state == TRX_STATE_ACTIVE);

	ut_ad(!trx_is_started(trx) || trx->state == TRX_STATE_ACTIVE);

	ut_ad(prebuilt->sql_stat_start
	      || prebuilt->select_lock_type != LOCK_NONE
	      || MVCC::is_view_active(trx->read_view)
	      || srv_read_only_mode);
	/* Note: a new transaction is not started at the BEGIN statement itself;
	it is started by (and gets its transaction id from) the first statement
	executed after BEGIN. */
	trx_start_if_not_started(trx, false);

	if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
	    && prebuilt->select_lock_type != LOCK_NONE
	    && trx->mysql_thd != NULL
	    && thd_is_select(trx->mysql_thd)) {
		/* It is a plain locking SELECT and the isolation
		level is low: do not lock gaps */

		set_also_gap_locks = FALSE;
	}

	/* Note that if the search mode was GE or G, then the cursor
	naturally moves upward (in fetch next) in alphabetical order,
	otherwise downward */

	if (direction == 0) {

		if (mode == PAGE_CUR_GE
		    || mode == PAGE_CUR_G
		    || mode >= PAGE_CUR_CONTAIN) {

			moves_up = TRUE;
		}

	} else if (direction == ROW_SEL_NEXT) {

		moves_up = TRUE;
	}

	thr = que_fork_get_first_thr(prebuilt->sel_graph);

	que_thr_move_to_run_state_for_mysql(thr, trx);

	clust_index = dict_table_get_first_index(index->table);

	/* Do some start-of-statement preparations */

	if (!prebuilt->sql_stat_start) {
		// sql_stat_start is already cleared: the statement has begun executing
		// and we are in the record-scanning loop of this SQL statement
		/* No need to set an intention lock or assign a read view */

		if (!MVCC::is_view_active(trx->read_view)
		    && !srv_read_only_mode
		    && prebuilt->select_lock_type == LOCK_NONE) {

			ib::error() << "MySQL is trying to perform a"
				" consistent read but the read view is not"
				" assigned!";
			trx_print(stderr, trx, 600);
			fputc('\n', stderr);
			ut_error;
		}
	} else if (prebuilt->select_lock_type == LOCK_NONE) {
		// First execution of the SELECT: build the read view, which is used to judge record visibility
		/* This is a consistent read */
		/* Assign a read view for the query */

		if (!srv_read_only_mode) {
			trx_assign_read_view(trx);
		}

		prebuilt->sql_stat_start = FALSE;
	} else {
		// From here on this is DML other than INSERT (a locking read): DML places exclusive record locks on the rows it touches.
wait_table_again:
	  // take an intention lock (IS or IX) on the table
		err = lock_table(0, index->table,
				 prebuilt->select_lock_type == LOCK_S
				 ? LOCK_IS : LOCK_IX, thr);

		if (err != DB_SUCCESS) {

			table_lock_waited = TRUE;
			goto lock_table_wait;
		}
		prebuilt->sql_stat_start = FALSE;
	}

	/* Open or restore index cursor position */

	if (UNIV_LIKELY(direction != 0)) {
		if (spatial_search) {
			/* R-Tree access does not need to do
			cursor position and resposition */
			goto next_rec;
		}

		ibool	need_to_process = sel_restore_position_for_mysql(
			&same_user_rec, BTR_SEARCH_LEAF,
			pcur, moves_up, &mtr);

		if (UNIV_UNLIKELY(need_to_process)) {
			if (UNIV_UNLIKELY(prebuilt->row_read_type
					  == ROW_READ_DID_SEMI_CONSISTENT)) {
				/* We did a semi-consistent read,
				but the record was removed in
				the meantime. */
				prebuilt->row_read_type
					= ROW_READ_TRY_SEMI_CONSISTENT;
			}
		} else if (UNIV_LIKELY(prebuilt->row_read_type
				       != ROW_READ_DID_SEMI_CONSISTENT)) {

			/* The cursor was positioned on the record
			that we returned previously.  If we need
			to repeat a semi-consistent read as a
			pessimistic locking read, the record
			cannot be skipped. */

			goto next_rec;
		}

	} else if (dtuple_get_n_fields(search_tuple) > 0) {
		pcur->btr_cur.thr = thr;

		if (dict_index_is_spatial(index)) {
			bool	need_pred_lock;

			need_pred_lock = (set_also_gap_locks
					  && !(srv_locks_unsafe_for_binlog
					      || trx->isolation_level
						 <= TRX_ISO_READ_COMMITTED)
					  && prebuilt->select_lock_type
						 != LOCK_NONE);

			if (!prebuilt->rtr_info) {
				prebuilt->rtr_info = rtr_create_rtr_info(
					need_pred_lock, true,
					btr_pcur_get_btr_cur(pcur), index);
				prebuilt->rtr_info->search_tuple = search_tuple;
				prebuilt->rtr_info->search_mode = mode;
				rtr_info_update_btr(btr_pcur_get_btr_cur(pcur),
						    prebuilt->rtr_info);
			} else {
				rtr_info_reinit_in_cursor(
					btr_pcur_get_btr_cur(pcur),
					index, need_pred_lock);
				prebuilt->rtr_info->search_tuple = search_tuple;
				prebuilt->rtr_info->search_mode = mode;
			}
		}

		btr_pcur_open_with_no_init(index, search_tuple, mode,
					   BTR_SEARCH_LEAF,
					   pcur, 0, &mtr);

		pcur->trx_if_known = trx;

		rec = btr_pcur_get_rec(pcur);

		if (!moves_up
		    && !page_rec_is_supremum(rec)
		    && set_also_gap_locks
		    && !(srv_locks_unsafe_for_binlog
			 || trx->isolation_level <= TRX_ISO_READ_COMMITTED)
		    && prebuilt->select_lock_type != LOCK_NONE
		    && !dict_index_is_spatial(index)) {

			/* Try to place a gap lock on the next index record
			to prevent phantoms in ORDER BY ... DESC queries */
			const rec_t*	next_rec = page_rec_get_next_const(rec);

			offsets = rec_get_offsets(next_rec, index, offsets,
						  ULINT_UNDEFINED, &heap);
			err = sel_set_rec_lock(pcur,
					       next_rec, index, offsets,
					       prebuilt->select_lock_type,
					       LOCK_GAP, thr, &mtr);

			switch (err) {
			case DB_SUCCESS_LOCKED_REC:
				err = DB_SUCCESS;
			case DB_SUCCESS:
				break;
			default:
				goto lock_wait_or_error;
			}
		}
	} else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
		btr_pcur_open_at_index_side(
			mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
			pcur, false, 0, &mtr);
	}

rec_loop:
	DEBUG_SYNC_C("row_search_rec_loop");
	if (trx_is_interrupted(trx)) {
		if (!spatial_search) {
			btr_pcur_store_position(pcur, &mtr);
		}
		err = DB_INTERRUPTED;
		goto normal_return;
	}

	/*-------------------------------------------------------------*/
	/* PHASE 4: Look for matching records in a loop */

	rec = btr_pcur_get_rec(pcur);

	ut_ad(!!page_rec_is_comp(rec) == comp);

	if (page_rec_is_infimum(rec)) {

		/* The infimum record on a page cannot be in the result set,
		and neither can a record lock be placed on it: we skip such
		a record. */

		prev_rec = NULL;
		goto next_rec;
	}

	if (page_rec_is_supremum(rec)) {

		/** Compare the last record of the page with end range
		passed to InnoDB when there is no ICP and number of
		loops in row_search_mvcc for rows found but not
		reporting due to search views etc. */
		if (prev_rec != NULL
		    && prebuilt->m_mysql_handler->end_range != NULL
		    && prebuilt->idx_cond == false && end_loop >= 100) {

			dict_index_t*	key_index = prebuilt->index;
			bool		clust_templ_for_sec = false;

			if (end_range_cache == NULL) {
				end_range_cache = static_cast<byte*>(
					ut_malloc_nokey(prebuilt->mysql_row_len));
			}

			if (index != clust_index
			    && prebuilt->need_to_access_clustered) {
				/** Secondary index record but the template
				based on PK. */
				key_index = clust_index;
				clust_templ_for_sec = true;
			}

			/** Create offsets based on prebuilt index. */
			offsets = rec_get_offsets(prev_rec, prebuilt->index,
					offsets, ULINT_UNDEFINED, &heap);

			if (row_sel_store_mysql_rec(
				end_range_cache, prebuilt, prev_rec, prev_vrow,
				clust_templ_for_sec, key_index, offsets,
				clust_templ_for_sec)) {

				if (row_search_end_range_check(
					end_range_cache,
					prebuilt->m_mysql_handler)) {

					/** In case of prebuilt->fetch,
					set the error in prebuilt->end_range. */
					if (prebuilt->n_fetch_cached > 0) {
						prebuilt->m_end_range = true;
					}

					err = DB_RECORD_NOT_FOUND;
					goto normal_return;
				}
			}
		}

		if (set_also_gap_locks
		    && !(srv_locks_unsafe_for_binlog
			 || trx->isolation_level <= TRX_ISO_READ_COMMITTED)
		    && prebuilt->select_lock_type != LOCK_NONE
		    && !dict_index_is_spatial(index)) {

			/* Try to place a lock on the index record */

			/* If innodb_locks_unsafe_for_binlog option is used
			or this session is using a READ COMMITTED isolation
			level we do not lock gaps. Supremum record is really
			a gap and therefore we do not set locks there. */

			offsets = rec_get_offsets(rec, index, offsets,
						  ULINT_UNDEFINED, &heap);
			err = sel_set_rec_lock(pcur,
					       rec, index, offsets,
					       prebuilt->select_lock_type,
					       LOCK_ORDINARY, thr, &mtr);

			switch (err) {
			case DB_SUCCESS_LOCKED_REC:
				err = DB_SUCCESS;
			case DB_SUCCESS:
				break;
			default:
				goto lock_wait_or_error;
			}
		}

		/* A page supremum record cannot be in the result set: skip
		it now that we have placed a possible lock on it */

		prev_rec = NULL;
		goto next_rec;
	}

	/*-------------------------------------------------------------*/
	/* Do sanity checks in case our cursor has bumped into page
	corruption */

	if (comp) {
		next_offs = rec_get_next_offs(rec, TRUE);
		if (UNIV_UNLIKELY(next_offs < PAGE_NEW_SUPREMUM)) {

			goto wrong_offs;
		}
	} else {
		next_offs = rec_get_next_offs(rec, FALSE);
		if (UNIV_UNLIKELY(next_offs < PAGE_OLD_SUPREMUM)) {

			goto wrong_offs;
		}
	}

	if (UNIV_UNLIKELY(next_offs >= UNIV_PAGE_SIZE - PAGE_DIR)) {

wrong_offs:
		if (srv_force_recovery == 0 || moves_up == FALSE) {
			ib::error() << "Rec address "
				<< static_cast<const void*>(rec)
				<< ", buf block fix count "
				<< btr_cur_get_block(
					btr_pcur_get_btr_cur(pcur))->page
					.buf_fix_count;

			ib::error() << "Index corruption: rec offs "
				<< page_offset(rec) << " next offs "
				<< next_offs << ", page no "
				<< page_get_page_no(page_align(rec))
				<< ", index " << index->name
				<< " of table " << index->table->name
				<< ". Run CHECK TABLE. You may need to"
				" restore from a backup, or dump + drop +"
				" reimport the table.";
			ut_ad(0);
			err = DB_CORRUPTION;

			goto lock_wait_or_error;
		} else {
			/* The user may be dumping a corrupt table. Jump
			over the corruption to recover as much as possible. */

			ib::info() << "Index corruption: rec offs "
				<< page_offset(rec) << " next offs "
				<< next_offs << ", page no "
				<< page_get_page_no(page_align(rec))
				<< ", index " << index->name
				<< " of table " << index->table->name
				<< ". We try to skip the rest of the page.";

			btr_pcur_move_to_last_on_page(pcur, &mtr);

			prev_rec = NULL;
			goto next_rec;
		}
	}
	/*-------------------------------------------------------------*/

	/* Calculate the 'offsets' associated with 'rec' */

	ut_ad(fil_page_index_page_check(btr_pcur_get_page(pcur)));
	ut_ad(btr_page_get_index_id(btr_pcur_get_page(pcur)) == index->id);

	offsets = rec_get_offsets(rec, index, offsets, ULINT_UNDEFINED, &heap);

	if (UNIV_UNLIKELY(srv_force_recovery > 0)) {
		if (!rec_validate(rec, offsets)
		    || !btr_index_rec_validate(rec, index, FALSE)) {

			ib::info() << "Index corruption: rec offs "
				<< page_offset(rec) << " next offs "
				<< next_offs << ", page no "
				<< page_get_page_no(page_align(rec))
				<< ", index " << index->name
				<< " of table " << index->table->name
				<< ". We try to skip the record.";

			prev_rec = NULL;
			goto next_rec;
		}
	}

	prev_rec = rec;

	/* Note that we cannot trust the up_match value in the cursor at this
	place because we can arrive here after moving the cursor! Thus
	we have to recompare rec and search_tuple to determine if they
	match enough. */

	if (match_mode == ROW_SEL_EXACT) {
		/* Test if the index record matches completely to search_tuple
		in prebuilt: if not, then we return with DB_RECORD_NOT_FOUND */

		/* fputs("Comparing rec and search tuple\n", stderr); */

		if (0 != cmp_dtuple_rec(search_tuple, rec, offsets)) {

			if (set_also_gap_locks
			    && !(srv_locks_unsafe_for_binlog
				 || trx->isolation_level
				 <= TRX_ISO_READ_COMMITTED)
			    && prebuilt->select_lock_type != LOCK_NONE
			    && !dict_index_is_spatial(index)) {

				/* Try to place a gap lock on the index
				record only if innodb_locks_unsafe_for_binlog
				option is not set or this session is not
				using a READ COMMITTED isolation level. */

				err = sel_set_rec_lock(
					pcur,
					rec, index, offsets,
					prebuilt->select_lock_type, LOCK_GAP,
					thr, &mtr);

				switch (err) {
				case DB_SUCCESS_LOCKED_REC:
				case DB_SUCCESS:
					break;
				default:
					goto lock_wait_or_error;
				}
			}

			btr_pcur_store_position(pcur, &mtr);

			/* The found record was not a match, but may be used
			as NEXT record (index_next). Set the relative position
			to BTR_PCUR_BEFORE, to reflect that the position of
			the persistent cursor is before the found/stored row
			(pcur->old_rec). */
			ut_ad(pcur->rel_pos == BTR_PCUR_ON);
			pcur->rel_pos = BTR_PCUR_BEFORE;

			err = DB_RECORD_NOT_FOUND;
			goto normal_return;
		}

	} else if (match_mode == ROW_SEL_EXACT_PREFIX) {

		if (!cmp_dtuple_is_prefix_of_rec(search_tuple, rec, offsets)) {

			if (set_also_gap_locks
			    && !(srv_locks_unsafe_for_binlog
				 || trx->isolation_level
				 <= TRX_ISO_READ_COMMITTED)
			    && prebuilt->select_lock_type != LOCK_NONE
			    && !dict_index_is_spatial(index)) {

				/* Try to place a gap lock on the index
				record only if innodb_locks_unsafe_for_binlog
				option is not set or this session is not
				using a READ COMMITTED isolation level. */

				err = sel_set_rec_lock(
					pcur,
					rec, index, offsets,
					prebuilt->select_lock_type, LOCK_GAP,
					thr, &mtr);

				switch (err) {
				case DB_SUCCESS_LOCKED_REC:
				case DB_SUCCESS:
					break;
				default:
					goto lock_wait_or_error;
				}
			}

			btr_pcur_store_position(pcur, &mtr);

			/* The found record was not a match, but may be used
			as NEXT record (index_next). Set the relative position
			to BTR_PCUR_BEFORE, to reflect that the position of
			the persistent cursor is before the found/stored row
			(pcur->old_rec). */
			ut_ad(pcur->rel_pos == BTR_PCUR_ON);
			pcur->rel_pos = BTR_PCUR_BEFORE;

			err = DB_RECORD_NOT_FOUND;
			goto normal_return;
		}
	}

	/* We are ready to look at a possible new index entry in the result
	set: the cursor is now placed on a user record */
	/* From here we can see how InnoDB fetches a new record. In the example
	above, the first statement of session 1 is a plain SELECT; under
	REPEATABLE READ InnoDB does not take locks for a plain SELECT, so
	prebuilt->select_lock_type is LOCK_NONE while that SELECT runs.
	Let us look directly at how InnoDB scans rows when
	prebuilt->select_lock_type is LOCK_NONE. */
	if (prebuilt->select_lock_type != LOCK_NONE) {
		/* Try to place a lock on the index record; note that delete
		marked records are a special case in a unique search. If there
		is a non-delete marked record, then it is enough to lock its
		existence with LOCK_REC_NOT_GAP. */

		/* If innodb_locks_unsafe_for_binlog option is used
		or this session is using a READ COMMITED isolation
		level we lock only the record, i.e., next-key locking is
		not used. */

		ulint	lock_type;

		if (!set_also_gap_locks
		    || srv_locks_unsafe_for_binlog
		    || trx->isolation_level <= TRX_ISO_READ_COMMITTED
		    || (unique_search && !rec_get_deleted_flag(rec, comp))
		    || dict_index_is_spatial(index)) {
				 // For READ UNCOMMITTED / READ COMMITTED, or for a unique-key search, no gap lock is needed
			goto no_gap_lock;
		} else {
			lock_type = LOCK_ORDINARY;
		}

		/* If we are doing a 'greater or equal than a primary key
		value' search from a clustered index, and we find a record
		that has that exact primary key value, then there is no need
		to lock the gap before the record, because no insert in the
		gap can be in our search range. That is, no phantom row can
		appear that way.
		An example: if col1 is the primary key, the search is WHERE
		col1 >= 100, and we find a record where col1 = 100, then no
		need to lock the gap before that record. */

		if (index == clust_index
		    && mode == PAGE_CUR_GE
		    && direction == 0
		    && dtuple_get_n_fields_cmp(search_tuple)
		    == dict_index_get_n_unique(index)
		    && 0 == cmp_dtuple_rec(search_tuple, rec, offsets)) {
no_gap_lock:
			lock_type = LOCK_REC_NOT_GAP;
		}

		err = sel_set_rec_lock(pcur,
				       rec, index, offsets,
				       prebuilt->select_lock_type,
				       lock_type, thr, &mtr);

		switch (err) {
			const rec_t*	old_vers;
		case DB_SUCCESS_LOCKED_REC:
			if (srv_locks_unsafe_for_binlog
			    || trx->isolation_level
			    <= TRX_ISO_READ_COMMITTED) {
				/* Note that a record of
				prebuilt->index was locked. */
				prebuilt->new_rec_locks = 1;
			}
			err = DB_SUCCESS;
		case DB_SUCCESS:
			 // Once the lock is granted the record is treated as visible; unlike a plain SELECT there is no
			 // visibility check against the read view, so for DML every committed transaction is visible.
			break;
		case DB_LOCK_WAIT:
			// Lock conflict: another transaction is updating the same row
			/* Lock wait for R-tree should already
			be handled in sel_set_rtr_rec_lock() */
			ut_ad(!dict_index_is_spatial(index));
			/* Never unlock rows that were part of a conflict. */
			prebuilt->new_rec_locks = 0;

			if (UNIV_LIKELY(prebuilt->row_read_type
					!= ROW_READ_TRY_SEMI_CONSISTENT)
			    || unique_search
			    || index != clust_index) {

				goto lock_wait_or_error;
			}

			/* The following call returns 'offsets'
			associated with 'old_vers' */
			 // Check whether another transaction has committed, so the latest committed version of the record can be built
			row_sel_build_committed_vers_for_mysql(
				clust_index, prebuilt, rec,
				&offsets, &heap, &old_vers, need_vrow ? &vrow : NULL,
			        &mtr);

			/* Check whether it was a deadlock or not, if not
			a deadlock and the transaction had to wait then
			release the lock it is waiting on. */

			err = lock_trx_handle_wait(trx);

			switch (err) {
			case DB_SUCCESS:
				/* The lock was granted while we were
				searching for the last committed version.
				Do a normal locking read. */

				offsets = rec_get_offsets(
					rec, index, offsets, ULINT_UNDEFINED,
					&heap);
				goto locks_ok;
			case DB_DEADLOCK:
				goto lock_wait_or_error;
			case DB_LOCK_WAIT:
				ut_ad(!dict_index_is_spatial(index));
				err = DB_SUCCESS;
				break;
			default:
				ut_error;
			}

			if (old_vers == NULL) {
				/* The row was not yet committed */

				goto next_rec;
			}

			did_semi_consistent_read = TRUE;
			rec = old_vers;
			prev_rec = rec;
			break;
		case DB_RECORD_NOT_FOUND:
			if (dict_index_is_spatial(index)) {
				goto next_rec;
			} else {
				goto lock_wait_or_error;
			}

		default:

			goto lock_wait_or_error;
		}
	} else {
		/* This is a non-locking consistent read: if necessary, fetch
		a previous version of the record */

		if (trx->isolation_level == TRX_ISO_READ_UNCOMMITTED) {
      /* For READ UNCOMMITTED nothing needs to be done: just read the latest version of the record */
			/* Do nothing: we let a non-locking SELECT read the
			latest version of the record */

		} else if (index == clust_index) {

			/* Fetch a previous version of the row if the current
			one is not visible in the snapshot; if we have a very
			high force recovery level set, we try to avoid crashes
			by skipping this lookup */
			 // Full-table or clustered-index (primary key) scan: check whether the record is visible to this transaction
			if (srv_force_recovery < 5
			    && !lock_clust_rec_cons_read_sees(
				    rec, index, offsets,
				    trx_get_read_view(trx))) {

				rec_t*	old_vers;
				/* The following call returns 'offsets'
				associated with 'old_vers' */
				// Not visible: look up an older version via the undo log
				err = row_sel_build_prev_vers_for_mysql(
					trx->read_view, clust_index,
					prebuilt, rec, &offsets, &heap,
					&old_vers, need_vrow ? &vrow : NULL,
					&mtr);

				if (err != DB_SUCCESS) {

					goto lock_wait_or_error;
				}

				if (old_vers == NULL) {
					/* The row did not exist yet in
					the read view */
	       // Not visible to this transaction and no older version exists: move on to the next record
					goto next_rec;
				}

				rec = old_vers;
				prev_rec = rec;
			}
		} else {
			/* We are looking into a non-clustered index,
			and to get the right version of the record we
			have to look also into the clustered index: this
			is necessary, because we can only get the undo
			information via the clustered index record. */

			ut_ad(!dict_index_is_clust(index));
			// This handles a scan over a secondary index
			if (!srv_read_only_mode
			    && !lock_sec_rec_cons_read_sees(
					rec, index, trx->read_view)) {
				/* We should look at the clustered index.
				However, as this is a non-locking read,
				we can skip the clustered index lookup if
				the condition does not match the secondary
				index entry. */
				// InnoDB optimizes here: if the record fails the ICP condition, go straight to the next record;
				// if it matches, continue to the clustered index to look for an older visible version
				switch (row_search_idx_cond_check(
						buf, prebuilt, rec, offsets)) {
				case ICP_NO_MATCH:
					goto next_rec;
				case ICP_OUT_OF_RANGE:
					err = DB_RECORD_NOT_FOUND;
					goto idx_cond_failed;
				case ICP_MATCH:
					goto requires_clust_rec;
				}

				ut_error;
			}
		}
	}

locks_ok:
	/* NOTE that at this point rec can be an old version of a clustered
	index record built for a consistent read. We cannot assume after this
	point that rec is on a buffer pool page. Functions like
	page_rec_is_comp() cannot be used! */

	if (rec_get_deleted_flag(rec, comp)) {

		/* The record is delete-marked: we can skip it */

		if ((srv_locks_unsafe_for_binlog
		     || trx->isolation_level <= TRX_ISO_READ_COMMITTED)
		    && prebuilt->select_lock_type != LOCK_NONE
		    && !did_semi_consistent_read) {

			/* No need to keep a lock on a delete-marked record
			if we do not want to use next-key locking. */

			row_unlock_for_mysql(prebuilt, TRUE);
		}

		/* This is an optimization to skip setting the next key lock
		on the record that follows this delete-marked record. This
		optimization works because of the unique search criteria
		which precludes the presence of a range lock between this
		delete marked record and the record following it.
		For now this is applicable only to clustered indexes while
		doing a unique search except for HANDLER queries because
		HANDLER allows NEXT and PREV even in unique search on
		clustered index. There is scope for further optimization
		applicable to unique secondary indexes. Current behaviour is
		to widen the scope of a lock on an already delete marked record
		if the same record is deleted twice by the same transaction */
		if (index == clust_index && unique_search
		    && !prebuilt->used_in_HANDLER) {

			err = DB_RECORD_NOT_FOUND;

			goto normal_return;
		}

		goto next_rec;
	}

	/* Check if the record matches the index condition. */
	switch (row_search_idx_cond_check(buf, prebuilt, rec, offsets)) {
	case ICP_NO_MATCH:
		if (did_semi_consistent_read) {
			row_unlock_for_mysql(prebuilt, TRUE);
		}
		goto next_rec;
	case ICP_OUT_OF_RANGE:
		err = DB_RECORD_NOT_FOUND;
		goto idx_cond_failed;
	case ICP_MATCH:
		break;
	}

	/* Get the clustered index record if needed, if we did not do the
	search using the clustered index. */

	if (index != clust_index && prebuilt->need_to_access_clustered) {

requires_clust_rec:
		ut_ad(index != clust_index);
		/* We use a 'goto' to the preceding label if a consistent
		read of a secondary index record requires us to look up old
		versions of the associated clustered index record. */

		ut_ad(rec_offs_validate(rec, index, offsets));

		/* It was a non-clustered index and we must fetch also the
		clustered index record */

		mtr_has_extra_clust_latch = TRUE;

		ut_ad(!vrow);
		/* The following call returns 'offsets' associated with
		'clust_rec'. Note that 'clust_rec' can be an old version
		built for a consistent read. */

		err = row_sel_get_clust_rec_for_mysql(prebuilt, index, rec,
						      thr, &clust_rec,
						      &offsets, &heap,
						      need_vrow ? &vrow : NULL,
						      &mtr);
		switch (err) {
		case DB_SUCCESS:
			if (clust_rec == NULL) {
				/* The record did not exist in the read view */
				ut_ad(prebuilt->select_lock_type == LOCK_NONE
				      || dict_index_is_spatial(index));

				goto next_rec;
			}
			break;
		case DB_SUCCESS_LOCKED_REC:
			ut_a(clust_rec != NULL);
			if (srv_locks_unsafe_for_binlog
			     || trx->isolation_level
			    <= TRX_ISO_READ_COMMITTED) {
				/* Note that the clustered index record
				was locked. */
				prebuilt->new_rec_locks = 2;
			}
			err = DB_SUCCESS;
			break;
		default:
			vrow = NULL;
			goto lock_wait_or_error;
		}

		if (rec_get_deleted_flag(clust_rec, comp)) {

			/* The record is delete marked: we can skip it */

			if ((srv_locks_unsafe_for_binlog
			     || trx->isolation_level <= TRX_ISO_READ_COMMITTED)
			    && prebuilt->select_lock_type != LOCK_NONE) {

				/* No need to keep a lock on a delete-marked
				record if we do not want to use next-key
				locking. */

				row_unlock_for_mysql(prebuilt, TRUE);
			}

			goto next_rec;
		}

		if (need_vrow && !vrow) {
			if (!heap) {
				heap = mem_heap_create(100);
			}
			row_sel_fill_vrow(rec, index, &vrow, heap);
		}

		result_rec = clust_rec;
		ut_ad(rec_offs_validate(result_rec, clust_index, offsets));

		if (prebuilt->idx_cond) {
			/* Convert the record to MySQL format. We were
			unable to do this in row_search_idx_cond_check(),
			because the condition is on the secondary index
			and the requested column is in the clustered index.
			We convert all fields, including those that
			may have been used in ICP, because the
			secondary index may contain a column prefix
			rather than the full column. Also, as noted
			in Bug #56680, the column in the secondary
			index may be in the wrong case, and the
			authoritative case is in result_rec, the
			appropriate version of the clustered index record. */
			if (!row_sel_store_mysql_rec(
				    buf, prebuilt, result_rec, vrow,
				    TRUE, clust_index, offsets, false)) {
				goto next_rec;
			}
		}
	} else {
		result_rec = rec;
	}

	/* We found a qualifying record 'result_rec'. At this point,
	'offsets' are associated with 'result_rec'. */

	ut_ad(rec_offs_validate(result_rec,
				result_rec != rec ? clust_index : index,
				offsets));
	ut_ad(!rec_get_deleted_flag(result_rec, comp));

	/* Decide whether to prefetch extra rows.
	At this point, the clustered index record is protected
	by a page latch that was acquired when pcur was positioned.
	The latch will not be released until mtr_commit(&mtr). */

	if ((match_mode == ROW_SEL_EXACT
	     || prebuilt->n_rows_fetched >= MYSQL_FETCH_CACHE_THRESHOLD)
	    && prebuilt->select_lock_type == LOCK_NONE
	    && !prebuilt->m_no_prefetch
	    && !prebuilt->templ_contains_blob
	    && !prebuilt->templ_contains_fixed_point
	    && !prebuilt->clust_index_was_generated
	    && !prebuilt->used_in_HANDLER
	    && !prebuilt->innodb_api
	    && prebuilt->template_type != ROW_MYSQL_DUMMY_TEMPLATE
	    && !prebuilt->in_fts_query) {

		/* Inside an update, for example, we do not cache rows,
		since we may use the cursor position to do the actual
		update, that is why we require ...lock_type == LOCK_NONE.
		Since we keep space in prebuilt only for the BLOBs of
		a single row, we cannot cache rows in the case there
		are BLOBs in the fields to be fetched. In HANDLER we do
		not cache rows because there the cursor is a scrollable
		cursor. */

		ut_a(prebuilt->n_fetch_cached < MYSQL_FETCH_CACHE_SIZE);

		/* We only convert from InnoDB row format to MySQL row
		format when ICP is disabled. */

		if (!prebuilt->idx_cond) {

			/* We use next_buf to track the allocation of buffers
			where we store and enqueue the buffers for our
			pre-fetch optimisation.
			If next_buf == 0 then we store the converted record
			directly into the MySQL record buffer (buf). If it is
			!= 0 then we allocate a pre-fetch buffer and store the
			converted record there.
			If the conversion fails and the MySQL record buffer
			was not written to then we reset next_buf so that
			we can re-use the MySQL record buffer in the next
			iteration. */

			next_buf = next_buf
				 ? row_sel_fetch_last_buf(prebuilt) : buf;

			if (!row_sel_store_mysql_rec(
				next_buf, prebuilt, result_rec, vrow,
				result_rec != rec,
				result_rec != rec ? clust_index : index,
				offsets, false)) {

				if (next_buf == buf) {
					ut_a(prebuilt->n_fetch_cached == 0);
					next_buf = 0;
				}

				/* Only fresh inserts may contain incomplete
				externally stored columns. Pretend that such
				records do not exist. Such records may only be
				accessed at the READ UNCOMMITTED isolation
				level or when rolling back a recovered
				transaction. Rollback happens at a lower
				level, not here. */
				goto next_rec;
			}

			if (next_buf != buf) {
				row_sel_enqueue_cache_row_for_mysql(
					next_buf, prebuilt);
			}
		} else {
			row_sel_enqueue_cache_row_for_mysql(buf, prebuilt);
		}

		if (prebuilt->n_fetch_cached < MYSQL_FETCH_CACHE_SIZE) {
			goto next_rec;
		}

	} else {
		if (UNIV_UNLIKELY
		    (prebuilt->template_type == ROW_MYSQL_DUMMY_TEMPLATE)) {
			/* CHECK TABLE: fetch the row */

			if (result_rec != rec
			    && !prebuilt->need_to_access_clustered) {
				/* We used 'offsets' for the clust
				rec, recalculate them for 'rec' */
				offsets = rec_get_offsets(rec, index, offsets,
							  ULINT_UNDEFINED,
							  &heap);
				result_rec = rec;
			}

			memcpy(buf + 4, result_rec
			       - rec_offs_extra_size(offsets),
			       rec_offs_size(offsets));
			mach_write_to_4(buf,
					rec_offs_extra_size(offsets) + 4);
		} else if (!prebuilt->idx_cond && !prebuilt->innodb_api) {
			/* The record was not yet converted to MySQL format. */
			if (!row_sel_store_mysql_rec(
				    buf, prebuilt, result_rec, vrow,
				    result_rec != rec,
				    result_rec != rec ? clust_index : index,
				    offsets, false)) {
				/* Only fresh inserts may contain
				incomplete externally stored
				columns. Pretend that such records do
				not exist. Such records may only be
				accessed at the READ UNCOMMITTED
				isolation level or when rolling back a
				recovered transaction. Rollback
				happens at a lower level, not here. */
				goto next_rec;
			}
		}

		if (prebuilt->clust_index_was_generated) {
			row_sel_store_row_id_to_prebuilt(
				prebuilt, result_rec,
				result_rec == rec ? index : clust_index,
				offsets);
		}
	}

	/* From this point on, 'offsets' are invalid. */

	/* We have an optimization to save CPU time: if this is a consistent
	read on a unique condition on the clustered index, then we do not
	store the pcur position, because any fetch next or prev will anyway
	return 'end of file'. Exceptions are locking reads and the MySQL
	HANDLER command where the user can move the cursor with PREV or NEXT
	even after a unique search. */

	err = DB_SUCCESS;

idx_cond_failed:
	if (!unique_search
	    || !dict_index_is_clust(index)
	    || direction != 0
	    || prebuilt->select_lock_type != LOCK_NONE
	    || prebuilt->used_in_HANDLER
	    || prebuilt->innodb_api) {

		/* Inside an update always store the cursor position */

		if (!spatial_search) {
			btr_pcur_store_position(pcur, &mtr);
		}

		if (prebuilt->innodb_api) {
			prebuilt->innodb_api_rec = result_rec;
		}
	}

	goto normal_return;

next_rec:

	if (end_loop >= 99
	    && need_vrow && vrow == NULL && prev_rec != NULL) {

		if (!heap) {
			heap = mem_heap_create(100);
		}

		prev_vrow = NULL;
		row_sel_fill_vrow(prev_rec, index, &prev_vrow, heap);
	} else {
		prev_vrow = vrow;
	}

	end_loop++;

	/* Reset the old and new "did semi-consistent read" flags. */
	if (UNIV_UNLIKELY(prebuilt->row_read_type
			  == ROW_READ_DID_SEMI_CONSISTENT)) {
		prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
	}
	did_semi_consistent_read = FALSE;
	prebuilt->new_rec_locks = 0;
	vrow = NULL;

	/*-------------------------------------------------------------*/
	/* PHASE 5: Move the cursor to the next index record */

	/* NOTE: For moves_up==FALSE, the mini-transaction will be
	committed and restarted every time when switching b-tree
	pages. For moves_up==TRUE in index condition pushdown, we can
	scan an entire secondary index tree within a single
	mini-transaction. As long as the prebuilt->idx_cond does not
	match, we do not need to consult the clustered index or
	return records to MySQL, and thus we can avoid repositioning
	the cursor. What prevents us from buffer-fixing all leaf pages
	within the mini-transaction is the btr_leaf_page_release()
	call in btr_pcur_move_to_next_page(). Only the leaf page where
	the cursor is positioned will remain buffer-fixed.
	For R-tree spatial search, we also commit the mini-transaction
	each time  */

	if (mtr_has_extra_clust_latch || spatial_search) {
		/* If we have extra cluster latch, we must commit
		mtr if we are moving to the next non-clustered
		index record, because we could break the latching
		order if we would access a different clustered
		index page right away without releasing the previous. */

		/* No need to do store restore for R-tree */
		if (!spatial_search) {
			btr_pcur_store_position(pcur, &mtr);
		}

		mtr_commit(&mtr);
		mtr_has_extra_clust_latch = FALSE;

		mtr_start(&mtr);

		if (!spatial_search
		    && sel_restore_position_for_mysql(&same_user_rec,
						   BTR_SEARCH_LEAF,
						   pcur, moves_up, &mtr)) {
			goto rec_loop;
		}
	}

	if (moves_up) {
		bool	 move;

		if (spatial_search) {
			move = rtr_pcur_move_to_next(
				search_tuple, mode, pcur, 0, &mtr);
		} else {
			move = btr_pcur_move_to_next(pcur, &mtr);
		}

		if (!move) {
not_moved:
			if (!spatial_search) {
				btr_pcur_store_position(pcur, &mtr);
			}

			if (match_mode != 0) {
				err = DB_RECORD_NOT_FOUND;
			} else {
				err = DB_END_OF_INDEX;
			}

			goto normal_return;
		}
	} else {
		if (UNIV_UNLIKELY(!btr_pcur_move_to_prev(pcur, &mtr))) {
			goto not_moved;
		}
	}

	goto rec_loop;

lock_wait_or_error:
	/* Reset the old and new "did semi-consistent read" flags. */
	if (UNIV_UNLIKELY(prebuilt->row_read_type
			  == ROW_READ_DID_SEMI_CONSISTENT)) {
		prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
	}
	did_semi_consistent_read = FALSE;

	/*-------------------------------------------------------------*/
	if (!dict_index_is_spatial(index)) {
		btr_pcur_store_position(pcur, &mtr);
	}

lock_table_wait:
	mtr_commit(&mtr);
	mtr_has_extra_clust_latch = FALSE;

	trx->error_state = err;

	/* The following is a patch for MySQL */

	if (thr->is_active) {
		que_thr_stop_for_mysql(thr);
	}

	thr->lock_state = QUE_THR_LOCK_ROW;

	if (row_mysql_handle_errors(&err, trx, thr, NULL)) {
		/* It was a lock wait, and it ended */

		thr->lock_state = QUE_THR_LOCK_NOLOCK;
		mtr_start(&mtr);

		/* Table lock waited, go try to obtain table lock
		again */
		if (table_lock_waited) {
			table_lock_waited = FALSE;

			goto wait_table_again;
		}

		if (!dict_index_is_spatial(index)) {
			sel_restore_position_for_mysql(
				&same_user_rec, BTR_SEARCH_LEAF, pcur,
				moves_up, &mtr);
		}

		if ((srv_locks_unsafe_for_binlog
		     || trx->isolation_level <= TRX_ISO_READ_COMMITTED)
		    && !same_user_rec) {

			/* Since we were not able to restore the cursor
			on the same user record, we cannot use
			row_unlock_for_mysql() to unlock any records, and
			we must thus reset the new rec lock info. Since
			in lock0lock.cc we have blocked the inheriting of gap
			X-locks, we actually do not have any new record locks
			set in this case.
			Note that if we were able to restore on the 'same'
			user record, it is still possible that we were actually
			waiting on a delete-marked record, and meanwhile
			it was removed by purge and inserted again by some
			other user. But that is no problem, because in
			rec_loop we will again try to set a lock, and
			new_rec_lock_info in trx will be right at the end. */

			prebuilt->new_rec_locks = 0;
		}

		mode = pcur->search_mode;

		goto rec_loop;
	}

	thr->lock_state = QUE_THR_LOCK_NOLOCK;

	goto func_exit;

normal_return:
	/*-------------------------------------------------------------*/
	que_thr_stop_for_mysql_no_error(thr, trx);

	mtr_commit(&mtr);

	/* Rollback blocking transactions from hit list for high priority
	transaction, if any. We should not be holding latches here as
	we are going to rollback the blocking transactions. */
	if (!trx->hit_list.empty()) {

		ut_ad(trx_is_high_priority(trx));
		trx_kill_blocking(trx);
	}

	DEBUG_SYNC_C("row_search_for_mysql_before_return");

	if (prebuilt->idx_cond != 0) {

		/* When ICP is active we don't write to the MySQL buffer
		directly, only to buffers that are enqueued in the pre-fetch
		queue. We need to dequeue the first buffer and copy the contents
		to the record buffer that was passed in by MySQL. */

		if (prebuilt->n_fetch_cached > 0) {
			row_sel_dequeue_cached_row_for_mysql(buf, prebuilt);
			err = DB_SUCCESS;
		}

	} else if (next_buf != 0) {

		/* We may or may not have enqueued some buffers to the
		pre-fetch queue, but we definitely wrote to the record
		buffer passed to use by MySQL. */

		DEBUG_SYNC_C("row_search_cached_row");
		err = DB_SUCCESS;
	}

#ifdef UNIV_DEBUG
	if (dict_index_is_spatial(index) && err != DB_SUCCESS
	    && err != DB_END_OF_INDEX && err != DB_INTERRUPTED) {
		rtr_node_path_t*	path = pcur->btr_cur.rtr_info->path;

		ut_ad(path->empty());
	}
#endif

func_exit:
	trx->op_info = "";

	if (end_range_cache != NULL) {
		ut_free(end_range_cache);
	}

	if (heap != NULL) {
		mem_heap_free(heap);
	}

	/* Set or reset the "did semi-consistent read" flag on return.
	The flag did_semi_consistent_read is set if and only if
	the record being returned was fetched with a semi-consistent read. */
	ut_ad(prebuilt->row_read_type != ROW_READ_WITH_LOCKS
	      || !did_semi_consistent_read);

	if (prebuilt->row_read_type != ROW_READ_WITH_LOCKS) {
		if (did_semi_consistent_read) {
			prebuilt->row_read_type = ROW_READ_DID_SEMI_CONSISTENT;
		} else {
			prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
		}
	}

#ifdef UNIV_DEBUG
	{
		btrsea_sync_check	check(trx->has_search_latch);

		ut_ad(!sync_check_iterate(check));
	}
#endif /* UNIV_DEBUG */

	DEBUG_SYNC_C("innodb_row_search_for_mysql_exit");

	DBUG_RETURN(err);
}
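The non-locking branch above for the clustered index (lock_clust_rec_cons_read_sees followed, when needed, by row_sel_build_prev_vers_for_mysql) is the heart of MVCC under REPEATABLE READ: if the newest version of a row is not visible in the transaction's read view, InnoDB walks back through the undo log until it reaches a version that is visible, or concludes that the row did not yet exist for this view. The self-contained model below captures only that idea; RowVersion, consistent_read() and the newest-to-oldest vector are inventions for illustration, not InnoDB structures.

#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

using trx_id_t = std::uint64_t;

/* One committed version of a row; chains below are ordered newest first. */
struct RowVersion {
	trx_id_t	trx_id;	/* writer of this version (DB_TRX_ID) */
	std::string	data;	/* row contents of this version */
};

/* Model of a consistent read: walk from the newest version towards older
ones (the undo chain) and return the first version whose writer is visible
in the read view; no such version means the row did not exist yet. */
std::optional<std::string> consistent_read(
	const std::vector<RowVersion>&		chain,
	const std::function<bool(trx_id_t)>&	visible_in_view)
{
	for (const RowVersion& v : chain) {
		if (visible_in_view(v.trx_id)) {
			return(v.data);		/* the 'old_vers' that is used */
		}
	}
	return(std::nullopt);			/* "goto next_rec" in the code above */
}

If, say, the newest version was written by a transaction that committed after the read view was created, consistent_read() skips it and returns the older version, mirroring the rec = old_vers assignment in the listing; when even the oldest version is invisible, the row is simply dropped from the result set.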

A simplified description of the UPDATE execution flow is:
1. Scan one record matching the WHERE condition (row_search_for_mysql).
2. Update the record just fetched (ha_innobase::update_row).
3. Write the updated record back into the InnoDB storage engine (row_upd_step).

From the code above we can see that the rows an UPDATE modifies include the latest committed rows read via the phantom read. Why, then, does the SELECT issued after the UPDATE see all of the rows the UPDATE touched? The reason is that after the UPDATE executes, the transaction information stored on each modified record is replaced with the current transaction's id, and any change made by the current transaction is visible to every SELECT in that same transaction. The final result therefore contains all of the rows updated by the UPDATE.
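That last point, that a transaction always sees its own modifications, is built into the read-view visibility test itself: the creator's own transaction id is always visible, ids below the view's low-water mark are visible, ids at or above its high-water mark are not, and ids in between are visible only if their transaction had already committed when the view was created. The sketch below is a simplified model of those rules (the field names are chosen for readability and are not guaranteed to match the members of InnoDB's ReadView class):

#include <cstdint>
#include <set>

using trx_id_t = std::uint64_t;

/* Simplified model of an MVCC read view (illustrative names). */
struct ReadViewModel {
	trx_id_t		creator_trx_id;		/* owner of this view */
	trx_id_t		low_water_mark;		/* ids below: committed before the view */
	trx_id_t		high_water_mark;	/* ids at/above: started after the view */
	std::set<trx_id_t>	active_ids;		/* transactions active at view creation */

	bool changes_visible(trx_id_t id) const
	{
		if (id == creator_trx_id) {
			return(true);	/* our own changes are always visible */
		}
		if (id < low_water_mark) {
			return(true);	/* committed before the view was created */
		}
		if (id >= high_water_mark) {
			return(false);	/* started after the view was created */
		}
		return(active_ids.count(id) == 0);	/* in between: visible iff committed */
	}
};

Once the UPDATE has stamped the phantom row's DB_TRX_ID with the current transaction's id, the first branch makes that row visible to every later SELECT inside the same transaction, which is exactly the behaviour described above.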

Original article: http://mysql.taobao.org/monthly/2017/06/07/. The original was presented in two parts and appears to be based on MySQL 5.6; the material here has been merged and reorganized for MySQL 5.7.
