[MySQL] Detailed explanation of InnoDB storage engine

The InnoDB engine has been MySQL's default storage engine since version 5.5.

logical storage structure

First comes the tablespace (the ibd file): one MySQL instance can correspond to multiple tablespaces, which store records, indexes, and other data.

Records, indexes, and other data are stored in segments. Segments are divided into:

  1. Data segment (Leaf node segment)
  2. Index segment (Non-leaf node segment)
  3. Rollback segment

InnoDB uses index-organized tables: the data segment stores the leaf nodes of the B+ tree, and the index segment stores the non-leaf nodes.

A segment manages multiple extents (Extent).

Extent: the unit structure of the tablespace. Each extent is 1MB. With the default InnoDB page size of 16KB, one extent holds 64 pages.

Page: the smallest unit of disk management in the InnoDB storage engine; each page is 16KB by default. To ensure page continuity, the InnoDB storage engine requests 4-5 extents from disk at a time.
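
You can verify the page size on a running instance; a quick check, assuming default settings:

# Check the InnoDB page size (16KB = 16384 bytes by default)
show variables like 'innodb_page_size';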

Rows are stored in pages, and the InnoDB storage engine stores data row by row.

Each row, for example, contains hidden fields:
Trx_id: every time a record is changed, the ID of the modifying transaction is assigned to the trx_id hidden column.
Roll_pointer: every time a record is changed, the old version is written to the undo log; this hidden column then acts as a pointer that, together with the undo log, locates the record's contents before the modification.

Architecture

The InnoDB engine has been MySQL's default storage engine since version 5.5. It is good at transaction processing, has crash-recovery features, and is widely used in everyday development.

First, let's look at InnoDB's memory structure.

InnoDB memory structure

Buffer Pool:

The buffer pool is an area of main memory that caches the data that is frequently read from and written to on disk. When performing CRUD operations, the data in the buffer pool is operated on first (if absent, it is loaded from disk and cached), and dirty pages are then flushed to disk at a certain frequency, reducing disk IO and speeding up processing.

The reason is that if every operation went to disk, there would be a lot of disk IO, and random IO at that, which is very costly.

The buffer pool is managed in units of pages; the underlying layer uses a linked-list data structure to manage pages. Pages are divided into three types by status:

  1. free page: a free page, not yet used
  2. clean page: a used page whose data has not been modified
  3. dirty page: a used page whose data has been modified and is inconsistent with the data on disk
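
As a quick illustration (not from the original text), the buffer pool size and page counts can be inspected like this; the information_schema table assumes MySQL 5.6+:

# Total buffer pool size in bytes (128MB by default)
show variables like 'innodb_buffer_pool_size';

# Total, free, and dirty (modified) page counts per buffer pool instance
select pool_size, free_buffers, database_pages, modified_database_pages
from information_schema.innodb_buffer_pool_stats;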

Change Buffer:

The change buffer (for non-unique secondary index pages). In version 5.x it was called the Insert Buffer; the name Change Buffer appeared in 8.0.

When a DML statement is executed, if the affected data pages are not in the Buffer Pool, the disk is not operated on directly; instead, the data changes are stored in the Change Buffer. When the data is read later, the changes are merged and restored into the Buffer Pool, and the merged data is then flushed to disk.

The significance of the Change Buffer:
Unlike clustered indexes, secondary indexes are usually non-unique, and inserts into them happen in relatively random order. Likewise, deletes and updates may touch secondary index pages that are not adjacent in the index tree. Operating on the disk for each change would cause a lot of disk IO; with the Change Buffer, changes are merged in the buffer pool, reducing IO.
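
For reference, a sketch of how change buffering can be observed and tuned; innodb_change_buffering accepts values such as all, inserts, deletes, changes, purges, or none:

# See which operations are currently buffered ('all' by default)
show variables like 'innodb_change_buffering';

# Example: buffer inserts only (a tuning choice, not a recommendation)
set global innodb_change_buffering = 'inserts';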

Adaptive Hash Index:

The adaptive hash index is used to optimize queries against Buffer Pool data. The InnoDB storage engine monitors queries against each index on a table; if it observes that a hash index would improve speed, it builds one, hence "adaptive" hash index.

The adaptive hash index requires no manual intervention; the system builds it automatically based on observed usage.
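
Although it needs no intervention, the feature itself can be toggled; for example:

# Check whether the adaptive hash index is enabled (ON by default)
show variables like 'innodb_adaptive_hash_index';

# It can be disabled at runtime if it does not help the workload
set global innodb_adaptive_hash_index = off;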

Log Buffer

The log buffer holds the log data (redo log, undo log) to be written to disk; its default size is 16MB. The log buffer is flushed to disk periodically. For transactions that update, insert, or delete many rows, increasing the log buffer size saves disk IO.
Both the buffer size and the timing of flushing logs to disk are configurable.
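
The two settings referred to here are innodb_log_buffer_size and innodb_flush_log_at_trx_commit; a quick look:

# Log buffer size (16MB by default)
show variables like 'innodb_log_buffer_size';

# When logs are flushed to disk: 1 = at every commit (default),
# 0 = once per second, 2 = write at commit, flush once per second
show variables like 'innodb_flush_log_at_trx_commit';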

disk structure

InnoDB disk structure

System Tablespace:

The system tablespace is the storage area for the change buffer. If tables are created in the system tablespace rather than in file-per-table or general tablespaces, it may also contain their table and index data. (In version 5.x it also contained the InnoDB data dictionary, undo logs, etc.)

File-Per-Table Tablespaces:

A file-per-table tablespace contains the data and indexes of a single InnoDB table and is stored in a single data file on the file system.
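
This behavior is controlled by the innodb_file_per_table parameter (ON by default since MySQL 5.6):

# When ON, each newly created table gets its own .ibd file
show variables like 'innodb_file_per_table';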

General Tablespace:

A general tablespace is created manually rather than by the system, using the CREATE TABLESPACE syntax. When creating a table, you can specify this tablespace, for example:

# Create a general tablespace ts_xx whose disk file is xxx.ibd
create tablespace ts_xx add datafile 'xxx.ibd' engine = innodb;

# Create a table xxx whose tablespace is the ts_xx created above
create table xxx(id bigint primary key auto_increment, name varchar(20)) engine = innodb tablespace ts_xx;

Undo Tablespaces

Undo tablespaces. During initialization, a MySQL instance automatically creates two default undo tablespaces (initial size 16MB) to store undo logs.
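
The undo-related settings can be listed with a wildcard, and from MySQL 8.0.14 additional undo tablespaces can be created manually, e.g.:

# List undo-related parameters (directory, tablespace count, etc.)
show variables like 'innodb_undo%';

# MySQL 8.0.14+: add an extra undo tablespace (the file must use the .ibu extension)
create undo tablespace undo_003 add datafile 'undo_003.ibu';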

Temporary Tablespaces

InnoDB uses session temporary tablespaces and a global temporary tablespace to store data such as temporary tables created by users.

Doublewrite Buffer Files:

Doublewrite buffer. Before the InnoDB engine flushes a data page from the Buffer Pool to disk, it first writes the page to the doublewrite buffer files, making data recovery possible if the system crashes.

Redo Log:

The redo log is used to achieve transaction durability. It consists of two parts: the redo log buffer and the redo log file; the former is in memory, the latter on disk. After a transaction commits, all modification information is stored in this log, which is used to recover the data if an error occurs while flushing dirty pages to disk. Transaction durability depends on this log.

After introducing the memory and disk structures separately, we need to connect the two, and the background threads are the bridge between them.

background thread

The job of the background threads is to flush data from the InnoDB buffer pool to disk files at the appropriate time.

There are four types of background threads

1. Master Thread:

The core background thread, responsible for scheduling other threads; it also asynchronously flushes buffer pool data to disk to keep data consistent, including flushing dirty pages, merging the insert buffer, and recycling undo pages.

2. IO Thread:

AIO (asynchronous, non-blocking IO) is used extensively in the InnoDB storage engine to handle IO requests, which greatly improves database performance; the IO Threads are mainly responsible for the callbacks of these IO requests.

IO Threads come in four types:

  1. Read Thread: responsible for read operations; 4 by default
  2. Write Thread: responsible for write operations; 4 by default
  3. Log Thread: responsible for flushing the log buffer to disk; 1 by default
  4. Insert Buffer Thread: responsible for flushing the insert buffer contents to disk; 1 by default

Here's how to check the status

show engine innodb status;

Then look under the FILE I/O section, where you can see the IO threads.
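
The thread counts mentioned above map to server parameters (both require a restart to change); for example:

# Number of read and write IO threads (4 each by default)
show variables like 'innodb_%io_threads';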

3. Purge Thread

It is mainly used to recycle the undo logs of committed transactions. After a transaction commits, its undo log may no longer be needed, so this thread recycles it.

4. Page Cleaner Thread

Assists the Master Thread in flushing dirty pages to disk, which reduces the Master Thread's workload and reduces blocking.

How InnoDB transactions work

A very important feature of the InnoDB engine is that it supports transactions

Transaction: a set of operations forming an indivisible unit of work. A transaction submits or revokes its operation requests to the system as a whole; the operations either all succeed or all fail.
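
A minimal sketch of the classic transfer example, assuming a hypothetical account table:

# Both updates take effect together, or neither does
start transaction;
update account set balance = balance - 100 where name = 'A';
update account set balance = balance + 100 where name = 'B';
commit;   # or: rollback; to revoke both changes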

Four characteristics of transactions: ACID

Atomicity: The smallest indivisible unit of a transaction, which succeeds or fails at the same time

Consistency: When a transaction is completed, all data must be in a consistent state

Isolation: The isolation mechanism provided by the database system ensures that transactions run in an independent environment that is not affected by external concurrent operations.

Durability: Once a transaction is committed or rolled back, its changes to the data in the database are permanent

Isolation also involves isolation levels: read uncommitted, read committed, repeatable read, serializable; the default is repeatable read.
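
The isolation level can be checked and changed per session, for example:

# Current isolation level (@@tx_isolation before MySQL 8.0)
select @@transaction_isolation;

# Switch this session to read committed
set session transaction isolation level read committed;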

Transaction principles

For atomicity, consistency, and durability, InnoDB is guaranteed by redo log and undo log.
Isolation is guaranteed by lock mechanism and MVCC (Multiple Version Concurrency Control).

redo log ensures durability

redo log: records the physical modifications made to data pages when a transaction commits; it is the log used to achieve transaction durability.
It consists of two parts: the redo log buffer and the redo log file; the former is in memory, the latter on disk.
After a transaction commits, all modification information is stored in the log file; when an error occurs while flushing dirty pages to disk, it can be used for data recovery.

Process:

When a transaction request reaches the database, InnoDB first checks whether the corresponding data is in the buffer pool; if not, it reads it from disk and loads it into the cache. When page data in the buffer pool is changed, dirty pages are formed; they are written back to disk at the next flush.
(figure: dirty page data write-back)

However, if an error occurs while flushing dirty-page data to disk, the memory data has not been persisted even though the transaction has already been committed. The disk data then differs from the transaction's data, so durability is not guaranteed.

To guarantee durability, the redo log comes in. Unlike the direct write just described, page changes are first recorded in the redo log buffer in memory; when the client commits the transaction, the redo log buffer writes the page changes to the redo log file, persisting them on disk.
(figure: record data page changes in the redo log buffer, then flush them to the disk log file)

After this, if a dirty page later fails to be flushed to the disk file, the data can be recovered through the redo log.

Flushing the .ibd file directly at commit time and writing the redo log may look like the same amount of work, but there is a big difference: within a transaction, most operations touch individual data pages at random, which means a lot of random disk IO and very poor performance, whereas the log is written sequentially (sequential disk IO). The performance gap between the two is large. This mechanism is called WAL (Write-Ahead Logging).

Because dirty pages are written back normally sooner or later anyway, the log accumulates entries that are no longer useful. It therefore needs periodic cleaning, and the two log files are written to in a circular fashion.
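
The size and number of these circularly written redo log files are configurable; which parameter applies depends on the MySQL version:

# Before MySQL 8.0.30: size and number of redo log files
show variables like 'innodb_log_file%';

# MySQL 8.0.30 and later: a single capacity setting
show variables like 'innodb_redo_log_capacity';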

Undo log ensures transaction atomicity

undo log: the rollback log, used to record information about data before it is modified. It serves two purposes: providing rollback and MVCC.

Unlike the redo log, which records physical logs, the undo log records logical logs:
for example, if I delete the row with id 1, the undo log records the opposite: an insert of the row with id 1 together with its deleted contents.
Likewise, if I run an update that changes the name of the person with id 2 from Li Si to Zhang San, the undo log records the reverse update, changing the name of the person with id 2 back to Li Si.

When a rollback is executed, the corresponding undo log entries are applied directly, rolling the content back.
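
A minimal sketch of this rollback behavior, using a hypothetical table t:

create table t (id int primary key, name varchar(20)) engine = innodb;
insert into t values (1, 'Zhang San');

start transaction;
delete from t where id = 1;   # the undo log records the reverse: an insert of this row
rollback;                     # the undo log is replayed

select * from t;              # the row (1, 'Zhang San') is back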

Undo log destruction: an undo log is generated while a transaction executes. When the transaction commits, the undo log is not deleted immediately, because it may still be needed for MVCC.
Undo log storage: undo logs are managed and recorded in segments and are stored in rollback segments; a rollback segment contains 1024 undo log segments.

MVCC

Current read: reads the latest version of the record, and while reading ensures that other concurrent transactions cannot modify the current record, so the record being read is locked.
For example: select ... lock in share mode, select ... for update, insert, update, delete

Snapshot read: reads a visible version of the record's data, which may be historical data; it takes no locks and is a non-blocking read.
  1. Read Committed: a snapshot is generated at every select.
  2. Repeatable Read: the first select statement after the transaction starts is where the snapshot read happens.
  3. Serializable: snapshot reads degenerate into current reads.

MVCC:
The full name is Multi-Version Concurrency Control. It means maintaining multiple versions of a piece of data so that read and write operations do not conflict. Snapshot reads provide the non-blocking read capability MySQL uses to implement MVCC.

The concrete implementation of MVCC relies on three hidden fields in the record, the undo log, and the readView.

MVCC implementation principle

Hidden fields in the record:
DB_TRX_ID: the last-modification transaction ID; records the ID of the transaction that inserted this record or last modified it.
DB_ROLL_PTR: the rollback pointer; points to the previous version of this record and works together with the undo log to locate that version.
DB_ROW_ID: the hidden primary key; generated only if the table structure does not specify a primary key.

To view the table structure (including the hidden fields):

	ibd2sdi xx.ibd

Of course, you can also view it directly with Navicat.

undo log version chain:

When different transactions (or the same transaction) modify the same row, the record's undo log forms a version linked list: the head of the list is the newest old record, and the tail is the oldest.

Which specific version is returned is determined by a readview.
readview:
The read view is the basis on which MVCC extracts data when snapshot-read SQL executes; it records and maintains the IDs of the transactions currently active in the system (i.e., uncommitted transactions).

A readview has four core fields:
m_ids: the set of currently active transaction IDs
min_trx_id: the minimum active transaction ID
max_trx_id: the pre-allocated transaction ID, i.e., the current maximum transaction ID + 1
creator_trx_id: the transaction ID of the readview's creator

Version chain access rules:
Let trx_id be the DB_TRX_ID recorded in the version being examined.

  1. trx_id == creator_trx_id ? This version can be accessed -> if true, the data was changed by the current transaction.
  2. trx_id < min_trx_id ? This version can be accessed -> if true, the transaction that wrote it has already committed.
  3. trx_id > max_trx_id ? This version cannot be accessed -> if true, that transaction started after the readview was generated.
  4. min_trx_id <= trx_id < max_trx_id ? If trx_id is not in m_ids, this version can be accessed -> if true, the data has been committed.

The timing of readview generation differs by isolation level:

Read Committed: a readview is generated at every snapshot read within the transaction.

Repeatable Read: a readview is generated only at the first snapshot read in the transaction and is reused afterward.
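
A sketch of the difference, using two sessions and a hypothetical table t; the comments mark where the readview is (or is not) regenerated:

# Session A
set session transaction isolation level repeatable read;
start transaction;
select name from t where id = 2;   # first snapshot read: readview created here

# Session B (another connection) now runs and commits:
#   update t set name = 'Li Si' where id = 2; commit;

# Session A again
select name from t where id = 2;   # repeatable read: readview reused, old value returned
                                   # under read committed a new readview would show 'Li Si'
commit;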

The four core fields of the readview are plugged into the rules above to determine whether a version can be accessed; the undo log version chain is then searched from head to tail, and the first version that satisfies the rules is the one returned.

This is MVCC, which decides which version a snapshot read uses.
MVCC together with the locking mechanism guarantees isolation.

Consistency is guaranteed by the redo log and undo log together.
