MySQL's InnoDB storage engine (1)

Table of contents

1 Logical storage structure

1.1 Tablespaces

1.2 Segments

1.3 Extents

1.4 Pages

1.5 Rows

2 Architecture

2.1 Memory structure

2.1.1 Buffer Pool

2.1.2 Change Buffer

2.1.3 Adaptive Hash Index

2.1.4 Log Buffer

2.2 Disk architecture

2.2.1 System Tablespace

2.2.2 File-Per-Table Tablespace

2.2.3 General Tablespace

2.2.4 Undo Tablespaces

2.2.5 Temporary Tablespaces

2.2.6 Doublewrite Buffer Files

2.2.7 Redo Log

2.3 Background threads

2.3.1 Master Thread

2.3.2 IO Thread

2.3.3 Purge Thread

2.3.4 Page Cleaner Thread


1 Logical storage structure

The logical storage structure of InnoDB is a hierarchy: tablespace → segment → extent → page → row.

1.1 Tablespaces

The tablespace is the highest level of the InnoDB logical structure. If the parameter `innodb_file_per_table` is enabled (the default in 8.0), each table gets its own tablespace file (xxx.ibd), so one MySQL instance can correspond to multiple tablespaces. A tablespace stores data such as records and indexes.

1.2 Segments

Segments are divided into the data segment (leaf node segment), the index segment (non-leaf node segment), and the rollback segment. Because InnoDB tables are index-organized, the data segment stores the leaf nodes of the B+ tree and the index segment stores its non-leaf nodes. Segments are used to manage multiple extents.

Under normal circumstances, retrieval walks the doubly linked list of leaf nodes, so it pays to separate the extents used by leaf and non-leaf nodes: mixing them would make scans far less effective. For each index B+ tree, InnoDB therefore treats leaf node extents and non-leaf node extents differently, calling the set of extents that store leaf nodes one segment and the set that stores non-leaf nodes another. An index thus generates two segments: a leaf node segment and a non-leaf node segment.

1.3 Extents

An extent is the unit structure of the tablespace; each extent is 1MB. With the default InnoDB page size of 16KB, one extent contains 64 consecutive pages.

Pages alone already form a complete mechanism: when querying, we can find data by following the doubly linked list, but the physical locations of consecutive pages may not be contiguous. If two linked pages are far apart, moving from one to the other forces the disk head to reposition, producing random IO and hurting performance. Hence the concept of the extent: an extent is 64 pages at physically continuous locations, and each extent belongs to a segment (small segments may initially share fragment extents).
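The extent/page arithmetic above can be checked directly; a quick sketch using the InnoDB defaults named in the text:

```python
# Quick check of the sizes stated above (InnoDB defaults):
# an extent is 1MB of contiguous space, a page is 16KB,
# so one extent holds 64 consecutive pages.
EXTENT_SIZE = 1 * 1024 * 1024
PAGE_SIZE = 16 * 1024

pages_per_extent = EXTENT_SIZE // PAGE_SIZE
print(pages_per_extent)  # 64
```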

1.4 Pages

A page is the smallest unit of InnoDB disk management, 16KB by default. To ensure the continuity of pages, the InnoDB storage engine requests 4 to 5 extents from the disk at a time.

1.5 Rows

Row: the InnoDB storage engine stores data by rows. Each row carries two hidden fields by default:

  • Trx_id: every time a record is modified, the ID of the modifying transaction is written into the hidden Trx_id column.

  • Roll_pointer: every time a record is changed, the old version is recorded in the undo log; this column acts as a pointer that can be followed to find the record's pre-modification versions.
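As a toy illustration of how Roll_pointer chains versions together (a conceptual sketch only, not InnoDB's actual record format; the `Version` class and `update` helper are invented for illustration):

```python
# Toy sketch: each record carries trx_id and roll_pointer; updating
# keeps the old version reachable (as the undo log would), so earlier
# versions can be found by following roll_pointer back in time.
class Version:
    def __init__(self, value, trx_id, roll_pointer=None):
        self.value = value
        self.trx_id = trx_id
        self.roll_pointer = roll_pointer  # points at the previous version

def update(record, new_value, trx_id):
    # Link the new version to the old one, mimicking the undo chain.
    return Version(new_value, trx_id, roll_pointer=record)

row = Version("Alice", trx_id=10)
row = update(row, "Bob", trx_id=11)
row = update(row, "Carol", trx_id=12)

# Walk the chain back to the original record.
versions = []
v = row
while v is not None:
    versions.append((v.value, v.trx_id))
    v = v.roll_pointer
print(versions)  # [('Carol', 12), ('Bob', 11), ('Alice', 10)]
```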

2 Architecture

MySQL has used the InnoDB storage engine by default since version 5.5. InnoDB excels at transaction processing and provides crash recovery, so it is widely used in day-to-day development. In the InnoDB architecture diagram, the memory structures are on the left and the disk structures on the right.

2.1 Memory structure

The InnoDB memory structure is divided into four main parts: Buffer Pool, Change Buffer, Adaptive Hash Index, and Log Buffer.

2.1.1 Buffer Pool

Layered application systems put the most frequently accessed data in a cache to speed up access and avoid hitting the database on every request.

Operating systems likewise use a buffer pool mechanism so that not every access has to touch the disk.

As a storage system, MySQL also has a buffer pool mechanism to avoid disk IO on every data query.

The InnoDB storage engine is based on disk files, and the gap between CPU speed and disk speed is enormous. To narrow that gap, InnoDB introduces the buffer pool: simply put, a memory region that uses the speed of RAM to make up for the slowness of the disk, so that not every access hits the disk. The InnoDB buffer pool caches not only index pages and data pages but also undo pages, the insert buffer, the adaptive hash index, and InnoDB's lock information.

The basic principle of buffer pool

"Read operation":

To read a page, InnoDB first checks whether the page is already in the buffer pool.

If it is, the page is a cache hit and is read directly; otherwise the page is read from disk and stored in the buffer pool, so the next read of the same page can be served from memory.

"Write operation":

For page modifications, InnoDB first modifies the page in the buffer pool and then flushes it to disk at a certain frequency. A page is not flushed back on every change; instead, the checkpoint mechanism decides when dirty pages are written back to disk.

It can be seen that whether it is a read operation or a write operation, it operates on the buffer pool instead of directly operating on the disk.
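The read and write paths described above can be sketched as a tiny read-through, write-back cache (a conceptual model only; `pool`, `disk`, and `flush` are invented names, and real buffer pool management is far more involved):

```python
# Reads check the pool first and only fall back to "disk" on a miss;
# writes modify the cached page and mark it dirty, deferring the disk
# write to a later flush (the checkpoint in InnoDB terms).
disk = {1: "page-1 data", 2: "page-2 data"}
pool, dirty = {}, set()
disk_reads = 0

def read_page(page_no):
    global disk_reads
    if page_no not in pool:          # miss: load from disk into the pool
        disk_reads += 1
        pool[page_no] = disk[page_no]
    return pool[page_no]

def write_page(page_no, data):
    read_page(page_no)               # page must be in the pool to modify it
    pool[page_no] = data
    dirty.add(page_no)               # changed in memory, not yet on disk

def flush():                         # checkpoint: push dirty pages to disk
    for page_no in dirty:
        disk[page_no] = pool[page_no]
    dirty.clear()

read_page(1); read_page(1)           # second read hits the pool
write_page(2, "new data")
assert disk[2] == "page-2 data"      # disk unchanged until flush
flush()
assert disk[2] == "new data"
print(disk_reads)  # 2  (one miss per distinct page)
```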

Buffer pool structure

The buffer pool is a contiguous region of memory that the InnoDB storage engine manages in units of pages.

Memory access is fast, so why not put all of the data in the buffer pool?

Everything has two sides. Setting aside data volatility, the flip side of fast access is small capacity:

(1) The cache access is fast, but the capacity is small. The database stores 200G data, and the cache capacity may only be 64G;

(2) The memory access is fast, but the capacity is small. If you buy a notebook disk with 2T, the memory may only be 16G;

Therefore, only the "hottest" data can be placed in the "nearest" place to "maximum" reduce disk access.

How to manage and eliminate the buffer pool to maximize performance?

Read-ahead: disk reads and writes are not performed on demand byte by byte but page by page, at least one page (usually 16KB) at a time. If data needed later is already in a loaded page, subsequent disk IO is saved, improving efficiency. Data access usually exhibits locality: when some data is used, nearby data is very likely to be used soon (the "locality principle"), which is why loading pages ahead of time is effective and genuinely reduces disk IO.

The unit of the buffer pool is the page, and the underlying implementation manages pages with linked lists. By state, pages fall into three types:

    • Free page: an idle page that has not been used.
    • Clean page: a page in use whose data has not been modified.
    • Dirty page: a page in use whose data has been modified, so the data in the page is inconsistent with the data on disk.
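A minimal sketch of the three page states (invented structures, purely illustrative; real InnoDB tracks these on free/LRU/flush lists):

```python
# Toy model: pages start free, a read turns a free page clean, and a
# modification makes the page dirty.
POOL_PAGES = 4
pages = [{"state": "free", "data": None} for _ in range(POOL_PAGES)]

def load(data):
    for p in pages:
        if p["state"] == "free":
            p.update(state="clean", data=data)
            return p
    raise RuntimeError("no free page: eviction would be needed")

def modify(page, data):
    page.update(state="dirty", data=data)

a = load("rows of table A")
b = load("rows of table B")
modify(b, "rows of table B (changed)")
states = [p["state"] for p in pages]
print(states)  # ['clean', 'dirty', 'free', 'free']
```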


2.1.2 Change Buffer

Change Buffer: a change buffer for non-unique secondary index pages. When a DML statement executes, if the target data page is in memory it is updated directly; otherwise, provided data consistency is not affected, InnoDB caches the operation in the change buffer so the page need not be read from disk. When a later query needs the page, the page is read into memory, the buffered operations for that page are applied, and the query result is returned.

Merge: applying change buffer operations to the data page is called a merge. Besides being triggered when the page is accessed, merges are also performed periodically by background threads and during a normal database shutdown. Recording update operations in the change buffer reduces disk reads and improves execution efficiency; it also avoids filling the buffer pool with pages read only for updates, improving memory utilization.

Conditions of Use

For a unique index, every update must check the uniqueness constraint, so the data page must be read into memory and updated there directly; the change buffer cannot be used.

For ordinary (non-unique) indexes, when the data page is in memory the update is applied directly; when it is not, the update can be written straight into the change buffer.
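These two rules can be sketched as follows (a toy model; `update_index` and its bookkeeping are invented for illustration):

```python
# Updates to a unique secondary index must read the page (to check
# uniqueness), while updates to an ordinary index can be parked in the
# change buffer when the page is not cached.
change_buffer = []
page_reads = 0

def update_index(key, unique, page_in_memory):
    global page_reads
    if unique or page_in_memory:
        if not page_in_memory:
            page_reads += 1          # must load the page from disk
        return "applied to page"
    change_buffer.append(key)        # defer until the page is read later
    return "buffered"

assert update_index("k1", unique=True,  page_in_memory=False) == "applied to page"
assert update_index("k2", unique=False, page_in_memory=False) == "buffered"
assert update_index("k3", unique=False, page_in_memory=True)  == "applied to page"
print(page_reads, change_buffer)  # 1 ['k2']
```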

Effect

  1. Indexes are tightly coupled to their table: inserting, updating, or deleting a record also triggers operations on the table's indexes. Treat them as one transaction: if any index operation fails, the operation on the table's data must fail too.
  2. Maintaining a table's indexes inevitably slows operations on the data in the source table. DML affects performance related to data reading and transaction isolation, causing a chain reaction: when inserts, updates, and deletes on the table are slow, the table's SELECT performance inevitably suffers as well.
  3. When inserting, updating, or deleting rows, the values of secondary (non-clustered) index columns are usually unordered, which would require a lot of I/O to update the secondary indexes. The change buffer caches these index changes when the relevant page is not in the buffer pool, avoiding expensive I/O by not reading the page from disk immediately. When the page is later loaded into the buffer pool, the cached changes are merged and the updated page is eventually flushed to disk.

Unlike clustered indexes, secondary indexes are typically non-unique, and inserts into them arrive in relatively random order; deletes and updates may likewise touch non-adjacent secondary index pages. Hitting the disk for every such operation would cause heavy disk IO. With the change buffer, the changes can instead be merged in the buffer pool, reducing disk IO. If MySQL handles a large volume of DML, the change buffer is essential: its purpose is to minimize I/O consumption by merging changes in memory, collapsing many operations into as few I/O operations as possible. The drawback is that the change buffer consumes part of the buffer pool's memory.

2.1.3 Adaptive Hash Index

The adaptive hash index optimizes queries against buffer pool data. Although the InnoDB engine does not directly support user-defined hash indexes, it provides the adaptive hash index feature. For equality comparisons a hash index outperforms a B+ tree, because a hash lookup needs only a single access. The InnoDB storage engine therefore monitors queries against each index page of a table, and if it observes that under certain conditions a hash index would speed things up, it builds one automatically; this is called the adaptive hash index. It requires no human intervention and is maintained automatically by the system.
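The idea can be sketched as a hash cache layered over an ordered lookup (illustrative only: the promotion threshold, the `bisect` stand-in for a B+ tree descent, and all names are assumptions, not InnoDB internals):

```python
# Equality lookups that repeat often get promoted into a hash table, so
# later lookups skip the more expensive ordered-structure traversal.
import bisect

keys = sorted(range(0, 1000, 7))      # pretend these are indexed keys
hash_index, hits = {}, {}

def lookup(key):
    if key in hash_index:             # O(1) hash probe
        return hash_index[key]
    i = bisect.bisect_left(keys, key) # stand-in for a B+ tree descent
    found = i < len(keys) and keys[i] == key
    hits[key] = hits.get(key, 0) + 1
    if found and hits[key] >= 3:      # "observed often enough": promote
        hash_index[key] = found
    return found

for _ in range(5):
    lookup(21)
assert 21 in hash_index               # promoted after repeated lookups
assert lookup(22) is False            # 22 is not an indexed key
```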


2.1.4 Log Buffer

Log Buffer: the log buffer holds log data (redo log, undo log) that is waiting to be written to disk; its default size is 16MB. The buffer is flushed to disk periodically. For transactions that update, insert, or delete many rows, increasing the log buffer size can save disk IO.

Parameters:

innodb_log_buffer_size: size of the log buffer.

innodb_flush_log_at_trx_commit: when the log is flushed to disk; the main values are:

1: write and flush the log to disk at each transaction commit (the default).

0: write and flush the log to disk once per second.

2: write the log at each transaction commit and flush it to disk once per second.
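A small model of what each setting means for crash safety (illustrative; the functions are invented, but the survival rules follow the three values described above):

```python
# What survives a crash depends on whether commit wrote the log to the
# OS buffer and/or fsync'd it to disk.
def commit(setting):
    # returns (in_log_buffer_only, in_os_buffer, on_disk) right after commit
    if setting == 1:
        return (False, False, True)   # write + fsync at every commit
    if setting == 2:
        return (False, True, False)   # write at commit, fsync once per second
    return (True, False, False)       # setting 0: both deferred to the timer

def survives(setting, mysql_crash, os_crash):
    in_buf, in_os, on_disk = commit(setting)
    if on_disk:
        return True                   # fsync'd data survives any crash
    if in_os:
        return not os_crash           # OS buffer survives only a mysqld crash
    return False                      # log buffer contents are lost either way

assert survives(1, mysql_crash=True, os_crash=True)
assert survives(2, mysql_crash=True, os_crash=False)
assert not survives(2, mysql_crash=False, os_crash=True)
assert not survives(0, mysql_crash=True, os_crash=False)
```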

2.2 Disk architecture

2.2.1 System Tablespace

The system tablespace is the storage area for the change buffer, and it may also contain table and index data if tables are created in the system tablespace rather than in file-per-table or general tablespaces. In versions before MySQL 8.0 it also held the InnoDB data dictionary (metadata of InnoDB objects), and the doublewrite buffer and undo logs were stored there as well. Since the system tablespace can store multiple tables, it is a shared tablespace. It consists of one or more data files; by default it contains a single system data file called ibdata1, located in the MySQL data directory (datadir). The location, size, and number of the system tablespace data files are controlled by the innodb_data_home_dir and innodb_data_file_path startup options.

show variables like 'innodb_data_file_path'

System table space, the default file name is ibdata1.

references:

MySQL :: MySQL 8.0 Reference Manual :: 15.6.3.1 The System Tablespace

2.2.2 File-Per-Table Tablespace

If the `innodb_file_per_table` option is enabled, each file-per-table tablespace contains the data and indexes of a single InnoDB table and is stored in a single .ibd data file on the file system.

show variables like 'innodb_file_per_table'

In other words, every time we create a table, a table space file will be generated

# Check where data files are stored 
show global variables like '%datadir%'


2.2.3 General Tablespace

General tablespace: you must create a general tablespace with the `CREATE TABLESPACE` syntax; the tablespace can then be specified when creating a table.

# Create a tablespace 
CREATE TABLESPACE ts_name ADD DATAFILE 'file_name' ENGINE = engine_name; 

# Specify the tablespace when creating a table 
CREATE TABLE xxx ... TABLESPACE ts_name;

The general tablespace feature provides the following capabilities:

  • Like the system tablespace, general tablespaces are shared tablespaces that can store data for multiple tables.
  • General tablespaces have a potential memory advantage over file-per-table tablespaces. The server keeps tablespace metadata in memory for the lifetime of the tablespace, so multiple tables in fewer general tablespaces consume less memory for tablespace metadata than the same number of tables in individual file-per-table tablespaces.
  • General tablespace data files can be placed in a directory relative to, or independent of, the MySQL data directory, which gives you the same storage management flexibility as file-per-table tablespaces. As with file-per-table tablespaces, placing data files outside the MySQL data directory lets you manage the performance of critical tables individually, set up RAID or DRBD for specific tables, or bind tables to specific disks.
  • General tablespaces support both the Antelope and Barracuda file formats, and therefore all table row formats and related features. Support for the two formats does not depend on the innodb_file_format or innodb_file_per_table settings, nor do those variables affect general tablespaces.
  • The TABLESPACE option can be used with CREATE TABLE to create a table in a general tablespace, a file-per-table tablespace, or the system tablespace.
  • The TABLESPACE option can be used with ALTER TABLE to move tables between general, file-per-table, and system tablespaces. Previously it was not possible to move a table from a file-per-table tablespace into the system tablespace; with general tablespace functionality this is now possible.


2.2.4 Undo Tablespaces

Undo tablespaces: during initialization, a MySQL instance automatically creates two default undo tablespaces (initial size 16MB) to store undo logs.

MySQL 5.6 introduced support for separating the undo log into its own tablespace in a separate file directory. This makes it convenient to place files according to their IO type: under a concurrent write load, the undo files can be deployed on a dedicated high-speed SSD device.

references:

MySQL :: MySQL 8.0 Reference Manual :: 15.6.3.4 Undo Tablespaces

2.2.5 Temporary Tablespaces

InnoDB uses session temporary tablespaces and a global temporary tablespace to store data such as temporary tables created by users.

2.2.6 Doublewrite Buffer Files

Doublewrite buffer: before the InnoDB engine flushes data pages from the buffer pool to disk, it first writes them into the doublewrite buffer files, which makes it possible to recover the data if the system fails mid-write.

What is Double Write

Double Write is an area of 100 consecutive pages created in the InnoDB tablespace file; note that it lives in the tablespace file on disk. When MySQL flushes data to the data files, it first writes the data, followed by fsync(), into the Double Write space; then, at a certain moment, the data is written from the Double Write space to the pages that actually need it.

Why do we need Double Write

Double Write exists because of the partial write problem. A partial write means that when MySQL writes data to a data file, only part of a page makes it to disk while the rest does not, for example because of a power failure or a MySQL crash. The more fundamental cause is that MySQL's page size differs from the file system's page size, so the system does not write an entire buffer pool page to disk atomically; if power is lost halfway through, a partial write occurs. Why worry about this when the redo log has already been written by the time the buffer page is flushed? Because during recovery MySQL decides whether a page needs restoring by checking the page's checksum, which reflects the last transaction applied to the page; a torn page fails that check, and the redo log, which records changes rather than whole pages, cannot repair a page whose base image is corrupt.

What are the advantages of Double Write?

Double Write solves the partial write problem: even if a partial write occurs in the Double Write area itself, the page can still be recovered. Another claimed advantage is that Double Write can reduce the amount of redo log, since the redo log then only has to record the changes, roughly comparable to the binary log; tests have shown that with Double Write turned off, the redo log grows larger than the binary logs.

What are the disadvantages of double write?

Although MySQL calls Double Write a buffer, it is actually a buffer opened on a physical file, i.e. a file, so it causes extra fsync operations, and fsync on a hard disk is slow, which lowers MySQL's overall performance. Performance does not drop to 50% of the original, however, mainly because: (1) the Double Write area is contiguous storage, so writes to it are sequential rather than random, which performs better; and (2) when data is written from the Double Write buffer to the real segments, the system automatically merges writes to adjacent space, so multiple pages can be flushed at a time.

How does Double Write work during recovery?

If there's a partial page write to the doublewrite buffer itself, the original page will still be on disk in its real location. When InnoDB recovers, it will use the original page instead of the corrupted copy in the doublewrite buffer. However, if the doublewrite buffer succeeds and the write to the page's real location fails, InnoDB will use the copy in the doublewrite buffer during recovery. InnoDB knows when a page is corrupt because each page has a checksum at the end; the checksum is the last thing to be written, so if the page's contents don't match the checksum, the page is corrupt. Upon recovery, therefore, InnoDB just reads each page in the doublewrite buffer and verifies the checksums. If a page's checksum is incorrect, it reads the page from its original location.

If the write to the doublewrite buffer itself fails, the data has not reached the doublewrite area, but the page at its real location is still intact. InnoDB loads the original page from disk and reconstructs the correct data by applying the InnoDB transaction log, then rewrites it.

If the doublewrite buffer was written successfully but the write to the real location failed, InnoDB does not need to recompute anything from the transaction log; it simply writes the doublewrite copy again. During recovery, InnoDB compares page checksums directly: if a checksum is wrong, the original data is loaded from disk and the correct data derived from the transaction log, which is why InnoDB recovery usually takes a relatively long time.
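The checksum-based recovery decision can be modeled in a few lines (a toy sketch using CRC32 as the checksum; real InnoDB page checksums and recovery are more involved):

```python
# A page is trusted only if its checksum matches; recovery prefers an
# intact real page and falls back to the doublewrite copy when the real
# page is torn.
import zlib

def page(data):
    return {"data": data, "checksum": zlib.crc32(data)}

def torn(p):                       # simulate a partial (torn) write
    return {"data": p["data"][: len(p["data"]) // 2],
            "checksum": p["checksum"]}

def intact(p):
    return zlib.crc32(p["data"]) == p["checksum"]

def recover(real_page, dblwr_copy):
    if intact(real_page):
        return real_page           # real location is fine; ignore the copy
    if intact(dblwr_copy):
        return dblwr_copy          # torn real page: restore from doublewrite
    raise RuntimeError("both copies corrupt; must replay the redo log")

good = page(b"page contents v2")
assert recover(good, torn(good))["data"] == b"page contents v2"
assert recover(torn(good), good)["data"] == b"page contents v2"
```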

Is double write necessary?

In some cases, the doublewrite buffer really isn’t necessary—for example, you might want to disable it on slaves. Also, some filesystems (such as ZFS) do the same thing themselves, so it is redundant for InnoDB to do it. You can disable the doublewrite buffer by setting innodb_doublewrite to 0.

Whether Double Write is enabled is controlled by the innodb_doublewrite parameter:

innodb_doublewrite=1 means Double Write is enabled.

# Check Double Write usage statistics 
show status like 'innodb_dblwr%'

2.2.7 Redo Log

Redo logs are used to implement transaction durability. The log consists of two parts: the redo log buffer in memory and the redo log files on disk. After a transaction commits, all modification information is stored in the log, to be used for data recovery if an error occurs while flushing dirty pages to disk. The redo log files are written in a circular fashion across two files (by default ib_logfile0 and ib_logfile1).

One of the four properties of transactions is durability: once a transaction commits successfully, its changes to the database are saved permanently and cannot be lost for any reason. How does MySQL guarantee durability? The simplest approach would be to flush all data pages involved in a transaction to disk at every commit, but that causes serious performance problems, mainly in two ways:

Because InnoDB interacts with the disk in units of pages, and a transaction may only modify a few bytes in a data page, it is a waste of resources to flush the complete data page to the disk at this time!

A transaction may involve modifying multiple data pages, and these data pages are not physically continuous, and the performance of using random IO to write is too poor!

Therefore, mysql designed the redo log. Specifically, it only records the changes made to the data page by the transaction, so that the performance problem can be perfectly solved (relatively speaking, the file is smaller and it is sequential IO).

Basic concept

The redo log is generated in InnoDB, that is, the storage engine layer, and is a physical log.

Redo log consists of two parts: one is the log buffer (redo log buffer) in memory, and the other is the log file (redo log file) on disk. Every time mysql executes a DML statement, it first writes the record to the redo log buffer, and then writes multiple operation records to the redo log file at a certain point in time. This technology of writing logs first and then writing to disk is the WAL (Write-Ahead Logging) technology often mentioned in MySQL.

In a computer operating system, buffered data in user space generally cannot be written to disk directly; it must pass through the kernel space buffer (OS buffer). Writing the redo log buffer to the redo log file therefore means writing to the OS buffer first and then flushing it to the redo log file via the fsync() system call.

MySQL supports three timings for writing the redo log buffer to the redo log file, configurable with the innodb_flush_log_at_trx_commit parameter.

The basic storage structure of Mysql is a page (records are stored in the page), so MySQL first finds the page where this record is located, then loads the page into memory, and modifies the corresponding record

Redo log record format

The redo log records changes to data pages, and not all such change records need to be kept forever. The redo log is therefore implemented as a fixed-size, circularly written log: when writing reaches the end, it wraps around to the beginning and continues.

In InnoDB, not only the redo log needs to be flushed, but also the data pages also need to be flushed. The significance of the existence of the redo log is mainly to reduce the requirements for data page flushing. In the figure above, write pos represents the LSN (logical sequence number) position of the current record of the redo log, and check point represents the LSN (logical sequence number) position of the corresponding redo log after the data page change record is flushed.

The part between write pos and check point is the empty part of the redo log, which is used to record new records; the part between check point and write pos is the data page change record of the redo log to be placed on the disk. When the write pos catches up with the check point, it will first push the check point forward to make room for a new log.
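The write pos / checkpoint interaction can be sketched as a ring counter (an invented, simplified model; positions advance in whole records rather than bytes/LSNs):

```python
# write_pos advances as records are appended; when it would catch up
# with the checkpoint, dirty pages must be flushed so the checkpoint can
# move forward and free log space.
LOG_SIZE = 8
write_pos, checkpoint = 0, 0

def free_space():
    return LOG_SIZE - (write_pos - checkpoint)

def append(n_records):
    global write_pos, checkpoint
    while free_space() < n_records:
        checkpoint += 1            # flush oldest dirty page, advance checkpoint
    write_pos += n_records

append(5)
assert (write_pos, checkpoint) == (5, 0)
append(5)                          # forces the checkpoint to advance by 2
assert (write_pos, checkpoint) == (10, 2)
assert write_pos % LOG_SIZE == 2   # physically wrapped around the ring
```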

When starting InnoDB, regardless of whether it was shut down normally or abnormally last time, the recovery operation will always be performed. Because the redo log records the physical changes of the data page, the recovery speed is much faster than the logical log (such as binlog).

When InnoDB restarts, it first checks the LSN of the data pages on disk. If a page's LSN is smaller than the LSN in the log, recovery starts from the checkpoint.

In another case, before the downtime, the checkpoint is in the disk flushing process, and the disk flushing progress of the data page exceeds the disk flushing progress of the log page. At this time, the LSN recorded in the data page is greater than the LSN in the log. The part that exceeds the progress of the log will not be redone, because this itself means that something has already been done and there is no need to redo it.

2.3 Background threads

InnoDB's background threads fall into four categories: Master Thread, IO Thread, Purge Thread, and Page Cleaner Thread.

2.3.1 Master Thread

The core background thread. It schedules the other threads and asynchronously flushes buffer pool data to disk to maintain data consistency, including flushing dirty pages, merging the insert buffer, and recycling undo pages.

The work in the Master Thread thread is mainly to refresh the log, clear the LRU cache, merge the insert buffer, ensure that the log space is sufficient, and set a new checkpoint. When the state of the database is different, the amount of these tasks is different.

However, the actual cleaning of dirty pages is not performed in the Master Thread, but in the Page Cleaner Thread alone, which greatly reduces the workload of the Master Thread.


2.3.2 IO Thread

The InnoDB storage engine makes extensive use of AIO to process IO requests, which greatly improves database performance; the IO threads are mainly responsible for the callbacks of these IO requests.

| Thread type | Default number | Responsibility |
| --- | --- | --- |
| Read thread | 4 | Responsible for read operations |
| Write thread | 4 | Responsible for write operations |
| Log thread | 1 | Responsible for flushing the log buffer to disk |
| Insert buffer thread | 1 | Responsible for flushing the contents of the write buffer to disk |

We can view the status information of InnoDB through the following command, which includes IO Thread information

show engine innodb status\G

2.3.3 Purge Thread

The Purge Thread mainly recycles undo logs that transactions have already committed. After a transaction commits, its undo log may no longer be needed, so this thread reclaims it.

A DELETE in InnoDB only marks records as deleted; it does not physically remove them, because the MVCC mechanism must keep previous versions available for concurrent transactions. The purge thread decides when the final, physical deletion actually happens.


2.3.4 Page Cleaner Thread

A thread that assists the Master Thread in flushing dirty pages to disk, which reduces the Master Thread's workload and the blocking it can cause.



Origin blog.csdn.net/weixin_46058921/article/details/127841282