MySQL — InnoDB engine, MySQL architecture, transaction principle, MVCC

InnoDB engine

We briefly introduced the InnoDB storage engine in this article

MySQL - storage engine and index application

The following content is mainly for understanding

1. Logical storage architecture

Section 1.3.1.2 (InnoDB logical storage structure) of the following article also covers this:

MySQL - storage engine and index application

InnoDB storage structure diagram

image-20230529105408275

  • TableSpace : tablespace

    The tablespace is the outermost logical structure; it stores indexes and data.

    A MySQL instance can have multiple tablespaces (.ibd files), which store records, indexes, and other data.

xxx.ibd : xxx is the table name. Each InnoDB table has its own tablespace file, which stores the table's data and indexes along with its structure (kept in a separate .frm file in early versions, and as SDI inside the .ibd file in newer versions).

For example, if the account table uses the InnoDB storage engine, it corresponds to a disk file named account.ibd.

image-20230523144946160

A tablespace contains many segments.

  • Segment : segment

A tablespace is made up of segments, which are divided into data segments, index segments, rollback segments, and so on.

InnoDB tables are index-organized: the data segment holds the leaf nodes of the B+ tree, and the index segment holds the non-leaf nodes.

InnoDB manages segments itself; no manual control is required.

A segment contains many extents.

  • Extent : extent

The extent is the unit structure of the tablespace; each extent is 1 MB.

With the default InnoDB page size of 16 KB, an extent holds 64 consecutive pages.

  • Page : page

Pages hold the rows that we store.

A page is the smallest unit of an extent and also the smallest unit of disk management in InnoDB. The default page size is 16 KB. To keep pages contiguous, InnoDB requests 4 to 5 extents from the disk at a time.
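The size relationship between pages and extents is simple arithmetic. The sketch below uses the default sizes quoted in the text; the helper names are hypothetical, and the page-to-offset mapping assumes pages are laid out back to back in the tablespace file.

```python
# Default sizes quoted in the text: 16 KB pages, 1 MB extents.
PAGE_SIZE = 16 * 1024
EXTENT_SIZE = 1024 * 1024


def pages_per_extent(page_size=PAGE_SIZE, extent_size=EXTENT_SIZE):
    """Number of consecutive pages that fit in one extent."""
    return extent_size // page_size


def page_location(page_no):
    """Map a page number to (extent index, byte offset in the file).

    Assumes pages are stored contiguously from offset 0.
    """
    return page_no // pages_per_extent(), page_no * PAGE_SIZE


print(pages_per_extent())    # 64
print(page_location(130))    # (2, 2129920)
```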

  • Row : row

    InnoDB stores data row by row; each row holds the actual field values.

In addition to the columns specified when the table is defined, each row contains two hidden fields:

  • Trx_id : each time a record is changed, the ID of the modifying transaction is written to the hidden trx_id column.
  • Roll_pointer : each time a record is modified, the old version is written to the undo log; this hidden column acts as a pointer through which the pre-modification version of the record can be found.

2. Structure

Since MySQL 5.5, InnoDB has been the default storage engine. It is good at transaction processing and supports crash recovery.

(Memory structure on the left, disk structure on the right)

A large part of the memory structure on the left is the buffer pool.

The disk structure on the right includes the tablespaces and the doublewrite buffer, among others.

image-20230529131453716

2.1 Memory structure

The memory structure has four main areas:

Buffer Pool: buffer pool

Change Buffer: change buffer

Log Buffer: log buffer

Adaptive Hash Index: adaptive hash index

image-20230529131738386

2.1.1 Buffer Pool

The buffer pool is an area of main memory that caches frequently used data from disk. Inserts, updates, deletes, and queries operate on the data in the buffer pool first (if the data is not there, it is loaded from disk and cached), and modified data is flushed to disk later at a certain frequency, which reduces disk I/O and speeds up processing.

Without a buffer pool, every operation would have to go to disk, causing a large amount of disk I/O.

The buffer pool is organized in pages, and the pages are managed with linked lists. By state, pages fall into three types:

  • free page: not in use
  • clean page: in use, data not modified
  • dirty page: in use, data modified, so the page in memory differs from the page on disk

As shown in the figure below, each block is actually a page.

image-20230529132643370

Reading data

InnoDB first checks whether the data page is already in the Buffer Pool. If so, it is returned directly; otherwise the page is read from disk and stored in the Buffer Pool.

If the Change Buffer holds cached update operations for that page, InnoDB loads the page into the Buffer Pool and then applies the cached updates to it.
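The read path and the clean/dirty distinction can be sketched with a toy page cache. This is only an illustration: the class and its plain LRU eviction are assumptions made for the example, not InnoDB's actual data structures, and unused capacity stands in for free pages.

```python
from collections import OrderedDict


class BufferPool:
    """Toy page cache: cached pages are clean or dirty; spare capacity acts as free pages."""

    def __init__(self, disk, capacity=2):
        self.disk = disk               # page_no -> content (stand-in for the data file)
        self.capacity = capacity
        self.pages = OrderedDict()     # page_no -> [content, dirty], least recently used first
        self.disk_reads = 0

    def _load(self, page_no):
        if page_no in self.pages:      # hit: served from memory
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        if len(self.pages) >= self.capacity:
            victim, (content, dirty) = self.pages.popitem(last=False)
            if dirty:                  # a dirty page is written back before its frame is reused
                self.disk[victim] = content
        self.disk_reads += 1           # miss: read the page from disk
        self.pages[page_no] = [self.disk[page_no], False]
        return self.pages[page_no]

    def read(self, page_no):
        return self._load(page_no)[0]

    def write(self, page_no, content):
        frame = self._load(page_no)
        frame[0], frame[1] = content, True   # changed in memory only: now a dirty page

    def flush(self):
        for page_no, frame in self.pages.items():
            if frame[1]:
                self.disk[page_no] = frame[0]
                frame[1] = False


disk = {1: "a", 2: "b", 3: "c"}
pool = BufferPool(disk)
pool.read(1)
pool.read(1)                       # second read is a hit
pool.write(2, "B")                 # page 2 is dirty; disk still holds "b"
print(pool.disk_reads, disk[2])    # 2 b
pool.flush()
print(disk[2])                     # B
```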

2.1.2 Change Buffer

The change buffer applies to non-unique secondary index pages. When a DML statement modifies such a page and the page is not in the Buffer Pool, InnoDB does not touch the disk immediately; it records the change in the Change Buffer. When the page is later read, the buffered changes are merged into the page in the Buffer Pool, and the merged page is then flushed to disk.

Why the Change Buffer matters

Unlike the clustered index, secondary indexes are usually non-unique, and entries are inserted into them in a relatively random order. Deletes and updates may likewise touch secondary index pages that are not adjacent in the index tree. Going to disk for every such operation would cause a lot of disk I/O. With the Change Buffer, the changes can be merged in the buffer pool, which reduces disk I/O.

Inserts, updates, and deletes can go to the Change Buffer first; its contents are then merged into the Buffer Pool at a certain frequency and flushed to disk afterwards.
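A minimal sketch of this buffering and merge-on-read behaviour, assuming a toy model in which each secondary index page is just a set of entries. The class and method names are hypothetical.

```python
class ChangeBufferDemo:
    """Toy model: secondary index pages on disk, a buffer pool, and a change buffer."""

    def __init__(self, disk_pages):
        self.disk = disk_pages       # page_no -> set of index entries on disk
        self.buffer_pool = {}        # page_no -> set of entries cached in memory
        self.change_buffer = {}      # page_no -> list of pending (op, entry)
        self.disk_reads = 0

    @staticmethod
    def _apply(page, op, entry):
        if op == "insert":
            page.add(entry)
        else:                        # "delete"
            page.discard(entry)

    def modify(self, page_no, op, entry):
        if page_no in self.buffer_pool:      # page is cached: change it in place
            self._apply(self.buffer_pool[page_no], op, entry)
        else:                                # not cached: buffer the change, no disk I/O
            self.change_buffer.setdefault(page_no, []).append((op, entry))

    def read(self, page_no):
        if page_no not in self.buffer_pool:
            self.disk_reads += 1
            page = set(self.disk[page_no])
            for op, entry in self.change_buffer.pop(page_no, []):   # merge on load
                self._apply(page, op, entry)
            self.buffer_pool[page_no] = page
        return self.buffer_pool[page_no]


demo = ChangeBufferDemo({7: {"alice", "bob"}})
demo.modify(7, "insert", "carol")
demo.modify(7, "delete", "bob")
print(demo.disk_reads)           # 0
print(sorted(demo.read(7)))      # ['alice', 'carol']
print(demo.disk_reads)           # 1
```

Two modifications cost no disk reads; the single read that eventually loads the page also applies both pending changes.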

image-20230529134657562

2.1.3 Log Buffer

The log buffer holds log data (redo log, undo log) that is waiting to be written to disk. The default size is 16 MB. Its contents are flushed to disk periodically.

For transactions that update, insert, or delete many rows, increasing the log buffer size can save disk I/O.

Parameters

innodb_log_buffer_size : size of the log buffer

Either of the following statements shows its value:

show variables  like 'innodb_log_buffer_size';

show variables  like '%log_buffer_size%';

image-20230529143512083

innodb_flush_log_at_trx_commit : controls when the log is flushed to disk. It takes one of three values:

  • 1: the log is written and flushed to disk at each transaction commit (the default).
  • 0: the log is written and flushed to disk once per second.
  • 2: the log is written at each transaction commit and flushed to disk once per second.

show variables  like 'innodb_flush_log_at_trx_commit';
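The difference between the three settings is where a commit's log records sit when the commit returns. The sketch below models the three stages (log buffer, operating system file cache, disk) with plain lists; the class is hypothetical and only mirrors the behaviour described in the list above.

```python
class RedoLogWriter:
    """Tracks where log records are: log buffer -> OS cache (written) -> disk (flushed)."""

    def __init__(self, flush_log_at_trx_commit):
        self.mode = flush_log_at_trx_commit
        self.log_buffer, self.os_cache, self.disk = [], [], []

    def _write(self):                  # log buffer -> OS file cache
        self.os_cache += self.log_buffer
        self.log_buffer = []

    def _flush(self):                  # OS file cache -> durable storage
        self.disk += self.os_cache
        self.os_cache = []

    def commit(self, record):
        self.log_buffer.append(record)
        if self.mode in (1, 2):        # 1 and 2 write at every commit
            self._write()
        if self.mode == 1:             # only 1 also flushes at every commit
            self._flush()

    def every_second(self):            # periodic background write + flush (modes 0 and 2)
        self._write()
        self._flush()


for mode in (0, 1, 2):
    w = RedoLogWriter(mode)
    w.commit("trx-1")
    print(mode, len(w.log_buffer), len(w.os_cache), len(w.disk))
# 0 1 0 0   still in memory: a server crash can lose about a second of commits
# 1 0 0 1   on disk when the commit returns
# 2 0 1 0   in the OS cache: survives a server crash, not an OS crash or power loss
```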

image-20230529143652767

2.1.4 Adaptive Hash Index

Among MySQL storage engines, hash indexes are a feature of the Memory engine, but InnoDB has an adaptive hash capability: under certain conditions the engine automatically builds a hash index on top of the B+ tree.

For equality lookups a hash index is generally faster than a B+ tree, because it usually needs a single lookup while the B+ tree may need several comparisons on the way down the tree. However, hash indexes are not suitable for range queries, fuzzy matching, and similar operations.

The adaptive hash index optimizes queries on Buffer Pool data.

InnoDB monitors lookups on the index pages of each table. If it observes that a hash index would speed things up, it builds one. This is called the adaptive hash index.

The adaptive hash index needs no manual intervention; the system builds it automatically as needed.

Parameter: innodb_adaptive_hash_index , the switch for the adaptive hash index

Use a pattern match to check whether it is enabled:

show variables  like '%hash_index%';

image-20230529142147111

2.2 Disk structure

image-20230529143857569

2.2.1 System Tablespace

The system tablespace is the storage area for the Change Buffer. It may also contain table and index data if a table is created in the system tablespace rather than in a file-per-table or general tablespace.

(In MySQL 5.x it also contains the InnoDB data dictionary, undo logs, and so on.)

image-20230529144743336

parameter

innodb_data_file_path

show variables  like 'innodb_data_file_path';

The default file name of the system tablespace is ibdata1.

image-20230529144653523

2.2.2 File-Per-Table Tablespaces

Each table gets its own tablespace; its data is not stored in the system tablespace.

If the innodb_file_per_table switch is on, each file-per-table tablespace contains the data and indexes of a single InnoDB table and is stored in a single data file on the file system.

image-20230529144835794

Switch parameter, enabled by default

innodb_file_per_table

show variables  like 'innodb_file_per_table';

image-20230529145303043

We saw tablespace files (ending in .ibd) earlier. Each file below is a tablespace file that stores the structure, data, and indexes of one table.

image-20230523144946160

2.2.3 General Tablespaces

A general tablespace must be created with the CREATE TABLESPACE syntax; it can then be specified when creating a table.

image-20230529145711778

  • Create the tablespace

ADD DATAFILE specifies the data file associated with the tablespace

ENGINE specifies the storage engine

CREATE TABLESPACE tablespace_name ADD DATAFILE 'file_name' ENGINE = engine_name;

Create the tablespace as shown below

create tablespace ts_itcast add datafile 'myitcast.ibd' engine = innodb;

  • Specify the tablespace when creating the table

CREATE TABLE xxx ... TABLESPACE ts_name;

Create a table and specify its tablespace

create table a(
    id int primary key auto_increment,
    name varchar(10)
) engine = innodb tablespace ts_itcast;

The corresponding data file now exists, and table a is stored in this general tablespace, as shown below

image-20230529155605307

2.2.4 Undo Tablespaces

When a MySQL instance is initialized, it automatically creates two default undo tablespaces (initial size 16 MB) to store undo logs.

The two files are named undo_001 and undo_002 by default and are located in the data directory.

image-20230529160618237

2.2.5 Temporary Tablespaces

InnoDB uses session temporary tablespaces and a global temporary tablespace. They store data such as temporary tables created by users.

2.2.6 Doublewrite Buffer Files

Before InnoDB flushes data pages from the Buffer Pool to their final location on disk, it first writes them to the doublewrite buffer files, so that the data can be recovered if the system fails partway through the write.

The following are the doublewrite buffer files:

image-20230529160909017
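To see why writing a page twice helps, consider a crash in the middle of a page write, which leaves a torn (half-written) page. The sketch below is a simplified model, not InnoDB's on-disk format: the page layout (length + CRC32 + payload) and all function names are assumptions made for the example.

```python
import zlib


def make_page(payload):
    """Hypothetical page layout: 4-byte length, 4-byte CRC32, then the payload."""
    return len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big") + payload


def is_intact(page):
    """A page is intact if its stored length and checksum match its body."""
    if len(page) < 8:
        return False
    body = page[8:]
    return (len(body) == int.from_bytes(page[:4], "big")
            and zlib.crc32(body).to_bytes(4, "big") == page[4:8])


def flush_page(page, doublewrite, datafile, page_no, crash_midway=False):
    doublewrite[page_no] = page                      # step 1: full copy to the doublewrite area
    if crash_midway:                                 # simulate power loss during step 2
        datafile[page_no] = page[: len(page) // 2]   # only half the page reached the data file
        return
    datafile[page_no] = page                         # step 2: write to the real location


def recover(doublewrite, datafile):
    for page_no, copy in doublewrite.items():
        if not is_intact(datafile.get(page_no, b"")) and is_intact(copy):
            datafile[page_no] = copy                 # restore the torn page from the good copy


dw, data = {}, {}
flush_page(make_page(b"row data v1"), dw, data, 3)
flush_page(make_page(b"row data v2"), dw, data, 3, crash_midway=True)
print(is_intact(data[3]))    # False
recover(dw, data)
print(data[3][8:])           # b'row data v2'
```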

2.2.7 Redo Log

The redo log is used to make transactions durable.

When a transaction commits, all of its modifications are recorded in this log, which is used for recovery if an error occurs while dirty pages are being flushed to disk.

The redo log consists of two parts:

  • Redo log buffer

    in memory

  • Redo log files

    on disk

The log is not kept forever; redo records that are no longer needed are cleaned up periodically.

Once the dirty pages covered by a redo record have been flushed to disk, that record is no longer needed, because the redo log exists only to make recovery possible after a failure.

image-20230529161608668

2.3 Background threads

How is data in memory flushed to disk?

This is the job of a set of background threads.

Function : flush the data in the InnoDB buffer pool to disk files at the right time

image-20230529162757437

  • Master Thread

The core background thread. It schedules the other threads and asynchronously flushes data from the buffer pool to disk to keep the data consistent, including flushing dirty pages, merging the insert buffer, and recycling undo pages.

  • IO Thread

InnoDB makes heavy use of AIO (asynchronous, non-blocking I/O) to handle I/O requests, which greatly improves database performance.

The IO threads are mainly responsible for handling the callbacks of these I/O requests.

| Thread type | Default number | Responsibility |
| --- | --- | --- |
| Read thread | 4 | Read operations |
| Write thread | 4 | Write operations |
| Log thread | 1 | Flushing the log buffer to disk |
| Insert buffer thread | 1 | Flushing the contents of the insert (change) buffer to disk |

The InnoDB status output includes I/O information:

show engine innodb status;

All of these threads use AIO.

image-20230529171550004

At the moment, the read and write threads are idle, waiting for requests.

image-20230529171640843

  • Purge Thread

It mainly reclaims the undo logs of committed transactions. After a transaction commits, its undo log may no longer be needed, so this thread reclaims it.

  • Page Cleaner Thread

It helps the Master Thread flush dirty pages to disk, which reduces the Master Thread's workload and reduces blocking.

3. Transaction principles

Transaction basics: MySQL basics - multi-table query and transaction management

Atomicity, consistency, and durability are guaranteed by two logs inside the InnoDB storage engine (the redo log and the undo log).

Isolation is implemented by InnoDB's locking mechanism together with MVCC (multi-version concurrency control).

image-20230529185536522

The redo log and undo log were mentioned earlier in 2.1.3 (log buffer).

3.1 redo log

Durability is guaranteed by the redo log.

The redo log records the physical modifications made to data pages and is used to make transactions durable.

It consists of two parts: the redo log buffer, which is in memory, and the redo log files, which are on disk.

When a transaction commits, all of its modifications are stored in the log files, which are used for recovery if an error occurs while dirty pages are being flushed to disk.

In InnoDB's memory structure, the main storage area is the buffer pool, which caches many data pages.

When a transaction performs several inserts, updates, and deletes, InnoDB first operates on the data in the buffer pool. If the data is not in the buffer pool, a background thread loads it from disk into the buffer pool, and the data is then modified there. The modified pages are called dirty pages.

Dirty page: a page whose content in memory differs from its content on disk.

Dirty pages are flushed to disk by a background thread at some later point, which brings the buffer pool and the disk back in sync.

Dirty pages are not flushed in real time; they are written to disk after some delay. If an error occurs during that flush, the user has already been told that the transaction committed successfully, yet the data was never persisted. That breaks transaction durability (once a transaction is committed or rolled back, its changes to the database are permanent).

image-20230529190948018

How is this problem solved?

InnoDB provides the redo log. When data in the buffer pool is inserted, updated, or deleted, the changes to the affected data pages are first recorded in the redo log buffer.

When the transaction commits, the contents of the redo log buffer are flushed to the redo log file on disk.

If an error later occurs while dirty pages are being flushed to disk, the redo log can be used to recover the data, which guarantees the durability of the transaction.

Once the dirty pages have been flushed successfully, so that the data involved is safely on disk, the corresponding redo log is no longer needed and can be discarded. This is why the two redo log files are written in a circular fashion.
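The sequence above (log the change, make the log durable at commit, flush pages lazily, replay the log after a crash) can be sketched as a toy engine. Everything here is a simplification chosen for the example: the class, the log sequence number counter, and the checkpoint that flushes every page at once are assumptions, not InnoDB's implementation.

```python
class MiniEngine:
    """Toy write-ahead logging: changes go to the buffer pool and the redo log first."""

    def __init__(self):
        self.disk_pages = {}        # durable data pages
        self.redo_file = []         # durable redo records: (lsn, page_no, new_value)
        self.buffer_pool = {}       # volatile
        self.redo_buffer = []       # volatile
        self.lsn = 0                # log sequence number
        self.checkpoint_lsn = 0     # redo up to here is already reflected in disk_pages

    def update(self, page_no, value):
        self.lsn += 1
        self.buffer_pool[page_no] = value                    # the page becomes dirty
        self.redo_buffer.append((self.lsn, page_no, value))

    def commit(self):
        self.redo_file += self.redo_buffer                   # sequential append to the log file
        self.redo_buffer = []

    def checkpoint(self):
        self.disk_pages.update(self.buffer_pool)             # flush dirty pages
        self.checkpoint_lsn = self.lsn
        # Records at or below the checkpoint are no longer needed; their space can be reused.
        self.redo_file = [r for r in self.redo_file if r[0] > self.checkpoint_lsn]

    def crash_and_recover(self):
        self.buffer_pool, self.redo_buffer = {}, []          # memory contents are lost
        for lsn, page_no, value in self.redo_file:           # replay committed changes
            if lsn > self.checkpoint_lsn:
                self.disk_pages[page_no] = value


db = MiniEngine()
db.update(1, "balance=100")
db.commit()
db.checkpoint()
db.update(1, "balance=80")
db.commit()                     # committed, but the dirty page has not been flushed
db.crash_and_recover()
print(db.disk_pages[1])         # balance=80
print(len(db.redo_file))        # 1
```

The committed change survives the crash even though its data page never reached disk, and the record for the first update was discarded once the checkpoint made it redundant.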

image-20230529191607407

Why flush the redo log to disk at every commit instead of flushing the dirty pages in the buffer pool directly?

Flushing the dirty pages directly would cause serious performance problems.

A typical workload touches pages scattered across the data files, so flushing them means random disk writes. The redo log, being a log file, is append-only and therefore written sequentially, and sequential writes are much more efficient than random writes. Writing the log first in this way is called WAL (Write-Ahead Logging).

3.2 undo log

The undo log provides atomicity (a transaction is an indivisible unit of work: its operations either all succeed or all fail).

The undo log records information about data before it is modified. It serves two purposes: rollback (which guarantees atomicity) and MVCC (multi-version concurrency control).

For example, when we execute an update statement, the state of the row before the update is recorded in the undo log.

The undo log differs from the redo log: the undo log is a logical log, while the redo log is a physical log.

Physical log: records what the content of the data looks like.

Logical log: records which operation was performed at each step.

Understanding the logical log:

When a record is deleted, you can think of the undo log as recording a corresponding insert, and vice versa.

When a record is updated, the undo log records a corresponding reverse update.

When a rollback is executed, the logical records are read from the undo log and applied to undo the changes.

Undo log destruction : the undo log is generated while the transaction executes. It is not deleted immediately when the transaction commits, because it may still be needed for MVCC.

Undo log storage : the undo log is managed and recorded in segments. It is stored in the rollback segment introduced earlier, which contains 1024 undo log segments.
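The "inverse operation" view of the logical undo log described above can be sketched as follows. This is a toy table with hypothetical names; in particular, clearing the undo log at commit is a simplification, since the text notes that real undo records may be kept after commit for MVCC.

```python
class UndoDemo:
    """Toy table keyed by id; every change records its inverse in the undo log."""

    def __init__(self):
        self.rows = {}
        self.undo_log = []      # stack of inverse operations for the open transaction

    def insert(self, row_id, row):
        self.rows[row_id] = row
        self.undo_log.append(("delete", row_id, None))     # inverse of an insert

    def update(self, row_id, **changes):
        old = dict(self.rows[row_id])
        self.rows[row_id].update(changes)
        self.undo_log.append(("update", row_id, old))      # reverse update: restore old values

    def delete(self, row_id):
        old = self.rows.pop(row_id)
        self.undo_log.append(("insert", row_id, old))      # inverse of a delete

    def rollback(self):
        while self.undo_log:                               # undo the newest change first
            op, row_id, old = self.undo_log.pop()
            if op == "delete":
                del self.rows[row_id]
            else:                                          # "update" or "insert"
                self.rows[row_id] = old

    def commit(self):
        self.undo_log.clear()   # simplification: see the note about MVCC above


t = UndoDemo()
t.insert(1, {"name": "A", "age": 30})
t.commit()
t.update(1, age=31)
t.delete(1)
t.insert(2, {"name": "B", "age": 5})
t.rollback()
print(t.rows)    # {1: {'name': 'A', 'age': 30}}
```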

4. MVCC

The implementation of MVCC relies on three hidden fields in each record, the undo log, and the ReadView.

4.1 Basic concepts

4.1.1 Current read

A current read returns the latest version of a record. While reading, it must ensure that other concurrent transactions cannot modify the record, so the record being read is locked.

Examples: select ... lock in share mode (shared lock) and select ... for update, update, insert, delete (exclusive lock) are all current reads.

Both clients start a transaction. Client B modifies the row whose id is 1. Normally client A cannot read client B's modification, because the transactions are isolated from each other.

Even if client B's transaction commits at this point, client A still cannot see the change, because the current isolation level is Repeatable Read (the default).

Transaction isolation level: MySQL basics - multi-table query and transaction management

image-20230529201443829

The select statement in the figure above is not a current read. To turn it into a current read, change it to select ... lock in share mode (shared lock) or select ... for update.

As shown below: client A has not committed its transaction and client B has committed, yet client A can now read the data committed by client B.

In other words, a current read returns the latest version of the record.

image-20230529201954561

4.1.2 Snapshot read

A plain select (without locking) is a snapshot read. It reads the visible version of the record, which may be historical data, takes no locks, and is a non-blocking read.

For example, in the figure below client B has committed its transaction, but client A still cannot read the change, because the select in the figure is a snapshot read and the data it reads is historical data.

image-20230529201443829

Read Committed : every select generates a new snapshot.

Repeatable Read (default) : the first select after the transaction starts is where the snapshot is taken.

For example, if select * from stu is the first select executed, it is a snapshot read and generates a snapshot. Later executions of select * from stu in the same transaction read the snapshot generated earlier (historical data), which guarantees repeatable reads.

Serializable : snapshot reads degrade to current reads, and every read locks the data.

4.1.3 MVCC multi-version concurrency control

MVCC means maintaining multiple versions of a piece of data so that read and write operations do not conflict. Snapshot reads are the non-blocking reads that MVCC provides in MySQL.

Its implementation relies on three hidden fields in each record, the undo log, and the ReadView.

4.2 Hidden fields

When we create the following table, in addition to the three columns defined explicitly, InnoDB automatically adds three hidden fields.

image-20230529204139773

| Hidden field | Meaning |
| --- | --- |
| DB_TRX_ID | ID of the transaction that inserted or last modified the record. |
| DB_ROLL_PTR | Rollback pointer; together with the undo log, it points to the previous version of this record. |
| DB_ROW_ID | Hidden primary key; generated only if the table does not define a primary key. |

4.3 undo log

The undo log is generated by insert, update, and delete operations so that data can be rolled back.

For an insert, the undo log is needed only for rollback and can be deleted as soon as the transaction commits.

For an update or delete, the undo log is needed not only for rollback but also for snapshot reads, so it is not deleted immediately.

4.3.1 undo log version chain

Insert a new record. Because it is newly inserted, it has no previous version, so its rollback pointer is null.

image-20230529210027290

Then, there are four concurrent transactions accessing this table at the same time.

image-20230529210148953

  • Transaction 2 executes first

As shown in the figure below, transaction 2 first updates the record with id 30, setting age to 3.

image-20230529210148953

Before modifying the record, InnoDB writes an undo log entry that captures the record as it was before the change. It then updates the record and stores the ID of the modifying transaction and the rollback pointer in the hidden fields. The rollback pointer identifies the version to return to if a rollback occurs, as follows:

image-20230529210655565

The transaction commits after performing the operations above.

  • Transaction 3 executes next

It modifies the record with id 30, changing name to A3.

As before, the original data is recorded in the undo log first, and then the record is updated.

After execution, the result looks like this:

image-20230529211330069

The undo log is not deleted after the update, because other transactions may still be using it.

The transaction commits.

  • Transaction 4 executes

    The process is the same.

image-20230529211607109

In the end, when different transactions (or the same transaction) modify the same record, the undo log entries for that record form a linked list of record versions. The head of the list is the most recent old version, and the tail is the oldest.
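The version chain can be sketched as a linked list in which each update creates a new current row whose rollback pointer refers to the previous version. The initial values (age 30, name A30, inserted by transaction 1) and transaction 4's change (age 10) are not given in the text; they are made up for illustration, as are the class and function names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    id: int
    age: int
    name: str
    db_trx_id: int                               # transaction that produced this version
    db_roll_ptr: Optional["RowVersion"] = None   # previous version, kept in the undo log


def update(current, trx_id, **changes):
    """Return the new current row; the old row becomes the head of the undo chain."""
    values = {"id": current.id, "age": current.age, "name": current.name, **changes}
    return RowVersion(**values, db_trx_id=trx_id, db_roll_ptr=current)


def chain(row):
    """Yield the current row, then each older version, newest first."""
    while row is not None:
        yield row
        row = row.db_roll_ptr


row = RowVersion(30, 30, "A30", db_trx_id=1)   # freshly inserted: no rollback pointer
row = update(row, 2, age=3)                    # transaction 2
row = update(row, 3, name="A3")                # transaction 3
row = update(row, 4, age=10)                   # transaction 4 (hypothetical change)
print([(v.db_trx_id, v.age, v.name) for v in chain(row)])
# [(4, 10, 'A3'), (3, 3, 'A3'), (2, 3, 'A30'), (1, 30, 'A30')]
```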

When we run a query, which version is actually returned?

The version chain does not decide this. Which version is returned depends on the third component of the MVCC implementation: the ReadView.

4.4 readView

A ReadView (read view) is the basis on which MVCC selects data when a snapshot-read SQL statement executes. It records and maintains the IDs of the transactions that are currently active (uncommitted) in the system.

A snapshot read does not necessarily return the latest record; it may well return a historical one. The versions kept in the undo log, as shown above, are all historical records.

Which historical record does a snapshot read return?

That is determined by the ReadView.

A ReadView contains four core fields, which MVCC uses to choose a version during a snapshot read:

| Field | Meaning |
| --- | --- |
| m_ids | Set of currently active transaction IDs |
| min_trx_id | Smallest active transaction ID |
| max_trx_id | Pre-allocated transaction ID: the current largest transaction ID + 1 (transaction IDs auto-increment) |
| creator_trx_id | ID of the transaction that created the ReadView |

Access rules for version chain data

trx_id is the transaction ID (DB_TRX_ID) of the version in the undo log version chain that is currently being examined.

image-20230529214634264
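The rules themselves appear only in the figure. As commonly described for InnoDB, a version is checked against four conditions in order, and the sketch below encodes them under that assumption; the class and helper names are hypothetical. Some tutorials write the third condition as trx_id > max_trx_id; since max_trx_id is the next ID to be assigned, >= is used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadView:
    m_ids: frozenset        # IDs of active (uncommitted) transactions
    min_trx_id: int         # smallest active transaction ID
    max_trx_id: int         # next transaction ID to be assigned
    creator_trx_id: int     # transaction that created this ReadView

    def is_visible(self, trx_id):
        if trx_id == self.creator_trx_id:   # rule 1: changed by the current transaction
            return True
        if trx_id < self.min_trx_id:        # rule 2: committed before the view was created
            return True
        if trx_id >= self.max_trx_id:       # rule 3: started after the view was created
            return False
        return trx_id not in self.m_ids     # rule 4: in range, visible only if committed


def first_visible(chain_trx_ids, view):
    """Walk the version chain newest to oldest; return the first visible trx_id."""
    for trx_id in chain_trx_ids:
        if view.is_visible(trx_id):
            return trx_id
    return None


view = ReadView(frozenset({3, 4, 5}), min_trx_id=3, max_trx_id=6, creator_trx_id=5)
print(first_visible([4, 3, 2, 1], view))    # 2
```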

Different isolation levels have different timings for generating ReadView :

  • READ COMMITTED: Generate ReadView every time a snapshot read is executed in a transaction
  • REPEATABLE READ: A ReadView is generated only when the snapshot read is executed for the first time in a transaction, and the ReadView is subsequently reused.

4.5 Analysis of MVCC principle

MVCC is implemented through the hidden fields of InnoDB tables (mainly the transaction ID and the rollback pointer), the undo log version chain, and the ReadView.

MVCC together with locking provides transaction isolation.

Consistency is guaranteed by the redo log and the undo log together.

image-20230530135025147

The walkthroughs below apply the version chain access rules from 4.4.

Under the RC isolation level, ReadView is generated every time a snapshot read is executed in a transaction

Under the RR isolation level, a ReadView is only generated when the snapshot read is executed for the first time in a transaction, and the ReadView is subsequently reused

4.5.1 Principle of RC isolation level extraction

Under the RC isolation level, ReadView is generated every time a snapshot read is executed in a transaction

Let's analyze the ReadViews generated in transaction 5 under the RC isolation level.

image-20230530123836380

First query

m_ids is {3,4,5} when the record with id 30 is read, because transaction 2 has already committed by this point.

The minimum active transaction ID, min_trx_id, is 3.

The pre-allocated transaction ID, max_trx_id, is 6 (the current largest transaction ID + 1).

The creator's transaction ID, creator_trx_id, is 5.

Second query

m_ids is {4,5} when the record with id 30 is read, because transactions 2 and 3 have already committed by this point.

The minimum active transaction ID, min_trx_id, is 4.

The pre-allocated transaction ID, max_trx_id, is 6 (the current largest transaction ID + 1).

The creator's transaction ID, creator_trx_id, is 5.

image-20230530123708881

Which version does transaction 5's first select read?

Compare each version's DB_TRX_ID against the rules in the table on the right, walking the chain from the newest version:

When trx_id = 4, none of the four rules grants access.

When trx_id = 3, none of the four rules grants access.

When trx_id = 2, the second rule is satisfied (2 is smaller than min_trx_id = 3), so this version can be accessed.

image-20230530124736059

The snapshot read returns the following version, which is exactly the one committed by transaction 2.

image-20230530124946132

Which version does transaction 5's second select read?

The process is the same as above.

image-20230530125149680

The version finally read is the one committed by transaction 3.

image-20230530125223239

4.5.2 Principle of RR isolation level extraction

Under the RR isolation level, a ReadView is only generated when the snapshot read is executed for the first time in a transaction, and the ReadView is subsequently reused

When the first select executes, a ReadView is generated for the snapshot read. It records:

m_ids is {3,4,5}, because transaction 2 has already committed by this point.

The minimum active transaction ID, min_trx_id, is 3.

The pre-allocated transaction ID, max_trx_id, is 6 (the current largest transaction ID + 1).

The creator's transaction ID, creator_trx_id, is 5.

When the second select executes, no new ReadView is created; the one from the first select is reused.
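The difference between the two isolation levels comes down to when the ReadView is created. The sketch below replays transaction 5's two selects using the m_ids values from the walkthrough and the version chain 4, 3, 2, 1. The visibility check assumes the commonly described InnoDB rules, and all function names are hypothetical.

```python
def visible(trx_id, m_ids, min_trx_id, max_trx_id, creator_trx_id):
    """Commonly described ReadView check, written as a single expression."""
    return (trx_id == creator_trx_id
            or trx_id < min_trx_id
            or (trx_id < max_trx_id and trx_id not in m_ids))


def make_view(active, creator=5, next_id=6):
    """Build a ReadView for transaction 5 from the set of active transaction IDs."""
    return dict(m_ids=set(active), min_trx_id=min(active),
                max_trx_id=next_id, creator_trx_id=creator)


def snapshot_read(chain, view):
    """Return the DB_TRX_ID of the first visible version, newest first."""
    return next(t for t in chain if visible(t, **view))


chain = [4, 3, 2, 1]                        # version chain of the record with id 30
active_at_select = [{3, 4, 5}, {4, 5}]      # active transactions at each select


def run(isolation):
    results, view = [], None
    for active in active_at_select:
        # RC builds a new ReadView for every snapshot read; RR only for the first one.
        if isolation == "RC" or view is None:
            view = make_view(active)
        results.append(snapshot_read(chain, view))
    return results


print(run("RC"))    # [2, 3]
print(run("RR"))    # [2, 2]
```

Under RC the second select sees the version committed by transaction 3; under RR it keeps returning the version committed by transaction 2, which is what makes the read repeatable.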

image-20230530134456570


Origin blog.csdn.net/weixin_51351637/article/details/130963982