Introduction to MySQL Storage Engines and the Structure of the InnoDB Engine

Storage Engine Overview

The storage engine works with the disk files where data is actually stored. The layer above it (the service layer) passes the processed SQL instructions (the results of SQL parsing and analysis) to the storage engine, and the storage engine writes data to disk files or reads data from them according to those instructions. The storage engine is therefore the core of the entire DBMS and provides its most essential functions: reading data and storing data.

A storage engine can also be thought of as the type of a table (storage engines were originally called table handlers), because different tables can use different storage engines, and different storage engines differ in how they process data, in table structure, in file format, and so on.

MySQL's storage engines are pluggable, that is, they can be swapped freely. A storage engine applies at the level of a table, so a suitable engine can be chosen for each table.
The default storage engine since MySQL 5.5 is InnoDB (previously it was MyISAM). It has many advantages over the other engines (discussed in detail below), so except in very special scenarios it is usually recommended that every table use the default InnoDB engine.

The general execution flow of SQL statements
Before operating on a database, we must first connect to the database service, whether through a graphical client or the command line. To make the connection possible, the connection layer provides connection handling, authentication, authorization, and related services.
Once connected, we can enter SQL. The service layer checks the syntax and semantics of the statement, builds a parse tree, and optimizes it automatically. (In versions before 8.0 the query cache was consulted first, and a cached result was returned directly if one existed. Because its hit rate was too low, the query cache was removed in MySQL 8.0.) The processed result is then handed to the engine layer.
The engine layer stores or retrieves data according to what the upper layer passes down.
The storage layer actually holds the data and logs, and manages them together with the engine layer.

Introduction to each storage engine

You can view the available engines with show engines;.
In the author's output, MySQL lists 8 storage engines, with InnoDB as the default. InnoDB's most notable features are support for transactions and distributed transactions, row-level locks, and foreign keys.
Each engine has its own characteristics. Because MySQL is open source, large companies can customize an engine for their specific business.
The most important engines are InnoDB and MyISAM. Each has its own advantages and disadvantages, so one is not simply a replacement for the other: InnoDB is the default in current versions, and MyISAM was the default in older versions.


InnoDB first appeared in MySQL 3.23 and has been the default storage engine since MySQL 5.5.

Advantages of InnoDB

  1. InnoDB supports foreign keys, although foreign keys can cause performance problems.
  2. InnoDB supports transactions (very important) and introduces the commit and rollback operations. For important data, transactions guarantee ACID (atomicity, consistency, isolation, and durability). After a crash, InnoDB can recover on its own and roll back uncommitted data.
  3. InnoDB supports row-level locking. Concurrent threads need locks to stay safe, and other engines usually use table-level locks. InnoDB can lock only the rows being operated on, so concurrent operations on other rows of the table are unaffected and concurrency improves. This advantage is most pronounced with large data volumes and high concurrency.

Features of InnoDB

  • Before MySQL 8, a table's files fell into two categories: the table structure in .frm, and the table data and indexes in .ibd. Since MySQL 8, the table structure, data, and indexes are all stored in a single .ibd file.
  • InnoDB's primary key index stores the entire row in its leaf nodes.

Disadvantages of InnoDB compared to MyISAM

  • Bulk query performance is not as good as MyISAM's.
  • High memory usage: because InnoDB stores data and indexes together, loading a leaf page also loads the row data on that page. MyISAM leaf pages hold only the addresses of the rows, which take far less space than the rows themselves, so InnoDB places higher demands on memory.

InnoDB does not let you create hash indexes, but the engine automatically builds adaptive hash indexes internally based on the queries it sees. Users cannot decide which data gets a hash index.
When loading a large amount of data, you can drop all indexes on the table first, insert the data in batches, and then recreate the indexes. This avoids maintaining the indexes on every insert and improves efficiency.


Advantages of MyISAM

  • Faster access
  • count(*) without a WHERE clause is O(1), because MyISAM keeps a stored row count, whereas InnoDB has to count rows one by one, which is O(n)

Disadvantages of MyISAM compared to InnoDB:

  • No support for foreign keys, transactions, or row-level locks (MyISAM uses table-level locks)

MyISAM suits tables that are small, see low concurrency (table locks limit concurrency), have no data integrity requirements (no transactions, but high performance), and are queried and inserted into far more often than they are updated or deleted from. In those cases it saves memory and improves access efficiency. (InnoDB is still recommended for most scenarios.)


The Memory engine has the following characteristics:

  • The table structure is stored on disk and the table data is stored in memory. Being in memory makes it fast, but the data size is limited by available memory, so it cannot hold large amounts of data.
  • The default index structure is a hash index, which is very fast at locating a single row but not as good as a B+ tree index for range searches.
  • Because it depends on physical factors such as memory size and power loss, it can only be used for temporary data or as a cache. Those roles have largely been taken over by other server software such as Redis, so it is rarely used.
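The trade-off between a hash index and an ordered index can be illustrated with a small Python sketch. This is only an analogy, not how any MySQL engine is implemented: a dict stands in for the hash index, a sorted list searched with bisect stands in for the ordered leaf level of a B+ tree, and the sample rows are made up.

```python
import bisect

# Made-up rows keyed by id.
rows = {17: "q", 3: "a", 42: "z", 8: "c", 25: "m"}

hash_index = dict(rows)      # unordered: ideal for point lookups
sorted_keys = sorted(rows)   # ordered: stands in for B+ tree leaf order


def range_scan_sorted(lo, hi):
    """Find the range boundaries by binary search, then slice."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]


def range_scan_hash(lo, hi):
    """A hash index has no order, so every key must be checked."""
    return sorted(k for k in hash_index if lo <= k <= hi)


print(hash_index[42])            # single probe for a point lookup
print(range_scan_sorted(5, 30))  # touches only the matching range
print(range_scan_hash(5, 30))    # same answer, but scans all keys
```

Both range scans return the same keys; the difference is that the hash version has to visit every entry to find them.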

Comparison of the three engines

| Feature | InnoDB | MyISAM | Memory |
|---|---|---|---|
| Transactions | Supported | Not supported | Not supported |
| Foreign keys | Supported | Not supported | Not supported |
| Lock level | Row-level | Table-level | Table-level |
| B+ tree index | Supported | Supported | Supported |
| Hash index | Not supported (created internally as an optimization, not user-controllable) | Not supported | Supported (default) |
| Insertion speed | Slow | Medium | Fast (memory speed) |
| Memory usage | Medium (index structure) | Low | High (data is stored in memory) |
| Use cases | Large data volumes, concurrent writes, transaction requirements | Small data volumes, low resource consumption, simple business logic, no transaction requirements | Small data volumes, data safety not a concern, high performance requirements |

The main reason to use a relational database is that its transactional features make data safer. If the data does not need transactional guarantees, NoSQL can be considered instead.

Other engines

  • Archive: well suited to storing and retrieving large amounts of rarely referenced historical, archived, or security audit information. It does not support indexes or updates, but it does support row-level locks. It fits data that is inserted once, never modified, and seldom queried.
  • CSV: lets a CSV file be handled as a MySQL table. The storage format is an ordinary CSV file.
  • Other engines are rarely used.

Engine-related SQL statements

  • View the current default storage engine: show variables like '%storage_engine%';
  • Change the default storage engine: SET DEFAULT_STORAGE_ENGINE=engine_name; This setting reverts to the original value after the server restarts. To make it permanent, set it in the my.cnf configuration file.
  • Specify the storage engine when creating a table (the default engine is used if none is specified): CREATE TABLE table_name ( column_definitions ) ENGINE = engine_name;
  • Change the storage engine of an existing table: ALTER TABLE table_name ENGINE = engine_name;

InnoDB engine


Logical storage structure

InnoDB's logical storage has five levels. From largest to smallest they are tablespace, segment, extent, page, and row.

Tablespace
The tablespace is the highest level of InnoDB's logical storage. Tablespaces are divided into the system tablespace, independent (file-per-table) tablespaces, general tablespaces, temporary tablespaces, and undo tablespaces.
By default in MySQL 8, each table lives in its own independent tablespace, stored on disk as a separate file xx.ibd that holds the table's data and indexes (before MySQL 8, the table structure metadata was kept in the system tablespace). This behavior is controlled by innodb_file_per_table; if it is set to 0, the data of every table is placed in the system tablespace ibdata1.

Segment
Segments are divided into data segments, index segments, rollback segments, and so on. A table's data and indexes are placed in one tablespace, and within that tablespace the indexes and data are stored in different segments. Following the structure of the B+ tree index, the data segment holds the leaf nodes of the index tree and the index segment holds the non-leaf nodes.
A segment grows with the table: the larger the table, the larger the segment.

Extent
A segment contains at least one extent, and the extent is the smallest unit by which a segment grows. The default extent size is 1M, that is, 64 contiguous pages. When allocating extents, InnoDB requests 4~5 extents from disk at a time to keep pages contiguous and avoid random IO as far as possible.

Page
The default page size is 16K, exactly 4 times the operating system's 4K page, so the operating system reads or writes an InnoDB page in 4 operations. Each node of the index tree is a page, and a page stores at least 2 records. If you insert sequentially and the current data page is full, a new page is allocated from the extent. If a record is inserted into the middle of the index tree and the target page is full, a page split occurs.
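The size relationships above can be checked with a few lines of arithmetic. The sketch assumes the default 16 KB InnoDB page and the 4 KB operating system page mentioned later in this article; the variable names are only for illustration.

```python
# Default sizes as described in this article.
PAGE_SIZE_KB = 16        # InnoDB page
PAGES_PER_EXTENT = 64    # contiguous pages in one extent
OS_PAGE_SIZE_KB = 4      # typical Linux page size

extent_size_kb = PAGE_SIZE_KB * PAGES_PER_EXTENT
os_writes_per_page = PAGE_SIZE_KB // OS_PAGE_SIZE_KB

print(f"extent size: {extent_size_kb} KB = {extent_size_kb // 1024} MB")
print(f"OS pages per InnoDB page: {os_writes_per_page}")
```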


  • Page split
    Row data in leaf pages must be stored in key order. If a row is inserted out of order and its target page is already full, a page split occurs. Why not simply put the row in a new page? Because the data must stay in order. How can page splits be avoided? Do not insert into full pages in the middle of the index, that is, insert sequentially.
    The full page is split in the middle into two pages, which leaves free space in both, and the new row is then inserted into the appropriate page in order.
    Finally, the new page is linked into the page list in order. Note that InnoDB has optimized the B+ tree so that leaf pages are doubly linked.
    Page splits leave many partially filled pages, which wastes space, and the split itself costs performance, so insert data in order whenever possible.
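The difference between sequential and out-of-order inserts can be seen in a toy model. The sketch below is a simplification under stated assumptions, not InnoDB's actual algorithm: a page is a sorted Python list, PAGE_CAPACITY is a hypothetical tiny limit chosen so splits are easy to see, and insert_key is an illustrative helper.

```python
PAGE_CAPACITY = 4  # hypothetical tiny capacity so splits are easy to see


def insert_key(pages, key):
    """Insert key into an ordered list of sorted pages, splitting a full page."""
    # Target the first page whose largest key is >= key, else the last page.
    idx = next((i for i, p in enumerate(pages) if p and key <= p[-1]),
               len(pages) - 1)
    page = pages[idx]
    if len(page) < PAGE_CAPACITY:
        page.append(key)
        page.sort()
        return "insert"
    if idx == len(pages) - 1 and key > page[-1]:
        # Sequential insert past the end: allocate a fresh page, no split.
        pages.append([key])
        return "new page"
    # Out-of-order insert into a full page: split it in the middle.
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    (left if key <= left[-1] else right).append(key)
    left.sort()
    right.sort()
    pages[idx:idx + 1] = [left, right]
    return "split"


sequential = [[]]
print([insert_key(sequential, k) for k in range(1, 6)], sequential)

pages = [[10, 20, 30, 40], [50, 60, 70, 80]]
print(insert_key(pages, 25), pages)
# split [[10, 20], [25, 30, 40], [50, 60, 70, 80]]
```

After the split, two pages are only partly full, which is the wasted space the text describes; the sequential run never splits.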

  • Page merging
    When a row is deleted, InnoDB only marks it as deleted. When the deleted records in a page reach 50% of the page, InnoDB checks the previous and next pages to see whether they can be merged, and merges two pages that are not full.

Row
The rows of a table are stored as row records. The storage size of each row depends on the table's column design and the data in the row. InnoDB stores data in units of pages, and a page contains multiple rows.


The InnoDB engine consists of three parts: memory pool, background thread, and disk file.

Memory part

The memory is mainly composed of 4 parts: the Buffer Pool, Change Buffer, Log Buffer, and Adaptive Hash Index.

Buffer Pool:
First, a note: the Buffer Pool and the query cache are not the same thing (I confused the two when I started).
Query cache: when the service layer receives a query, it first looks in the in-memory query cache for a cached result of that SQL. If none is found, the statement goes through syntax and semantic analysis and is finally handed to the engine layer for processing, whereas the Buffer Pool lives at the engine layer. The query cache is also not very smart: even for SQL with the same meaning, a slight change (such as an extra space) means the earlier cached result is not found, so the hit rate is very low. It was removed in MySQL 8.
Why introduce the Buffer Pool:
All data and indexes are stored in tablespaces in the form of pages, and a tablespace is an abstraction over files on disk. In other words, the data lives on disk.
However, disk is much slower than memory. When a piece of data is accessed, the entire page containing it is loaded into memory, which greatly improves efficiency. After the access, the page is not flushed back to disk immediately. First, the random IO from such frequent flushing would hurt performance badly. Second, the data in memory may be used again, which saves reading it a second time.
Introduction to the Buffer Pool:
The default size of the Buffer Pool is 128M. The larger it is, the more data it holds, the lower the chance of going to disk, and the higher the performance, so make it as large as the machine allows. There can be multiple Buffer Pool instances; the default is one, and the number can be changed.
The Buffer Pool is organized in pages, and each page is 16K, the same as on disk. By function, pages are divided into index pages, data pages, undo pages, insert buffer pages, adaptive hash index pages, lock information, and so on.
By state, pages are further divided into:

  • free page: an unused page in memory
  • clean page: a page loaded from disk that has not been modified and is consistent with disk
  • dirty page: a page whose data has been modified but not yet flushed to disk, so memory and disk are inconsistent

Usually more than 80% of the memory used by MySQL is given to the Buffer Pool, and its size directly affects overall performance. When memory runs short, InnoDB uses the LRU (least recently used) page replacement algorithm to evict the pages that have gone unused the longest, flushing them to disk if needed. InnoDB also optimizes the LRU algorithm. For example, select * performs a full table scan, which would turn rarely used data into the most recently used data; InnoDB's LRU is tuned for this case.
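One way to picture the LRU optimization is a list split into a "young" and an "old" sublist, where newly read pages enter at the head of the old sublist rather than at the head of the whole list. The sketch below is a simplified model written for this article, not InnoDB's code: MidpointLRU and its methods are hypothetical names, the default old_pct of 37 mirrors the default of innodb_old_blocks_pct, and the time-based promotion delay that InnoDB also applies is left out.

```python
class MidpointLRU:
    """Toy LRU list with a young and an old sublist (index 0 = hottest)."""

    def __init__(self, capacity, old_pct=37):
        self.capacity = capacity
        self.old_size = max(1, capacity * old_pct // 100)
        self.young = []
        self.old = []

    def access(self, page_id):
        if page_id in self.young:
            # Already hot: move to the head of the young sublist.
            self.young.remove(page_id)
            self.young.insert(0, page_id)
        elif page_id in self.old:
            # Touched again after being loaded: promote to young.
            self.old.remove(page_id)
            self.young.insert(0, page_id)
        else:
            # First read: enter at the midpoint (head of the old sublist).
            self.old.insert(0, page_id)
        self._rebalance()

    def _rebalance(self):
        young_limit = self.capacity - self.old_size
        # Demote the coldest young pages to the head of the old sublist.
        while len(self.young) > young_limit:
            self.old.insert(0, self.young.pop())
        # Evict from the tail of the old sublist when over capacity.
        while len(self.young) + len(self.old) > self.capacity:
            self.old.pop()


cache = MidpointLRU(capacity=8)
hot_pages = ["h1", "h2", "h3", "h4", "h5"]
for _ in range(2):                 # hot pages are read twice
    for page in hot_pages:
        cache.access(page)
for i in range(100):               # a full table scan reads each page once
    cache.access(f"scan{i}")

print(cache.young)  # the hot pages are still cached
print(cache.old)    # only the most recent scan pages remain
```

In a plain LRU list the 100 scanned pages would have pushed every hot page out; here the scan only churns the old sublist.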

Change Buffer: also called the insert buffer. It is an optimization for inserts into non-primary-key (secondary) indexes. The primary key is usually inserted in order, so maintaining the primary key index tree is cheap (leaf nodes are simply appended as needed). But when a table has secondary indexes, each insert must go into the secondary index trees as well as the primary key index tree. Secondary index keys usually arrive out of order, so secondary index pages are accessed in a scattered way, and the resulting random IO degrades performance. With the Change Buffer, an insert into a secondary index first checks whether the target index page is in memory. If it is, the insert is applied directly. If it is not, the change is recorded in the Change Buffer instead of reading the page from disk right away. When the page is later read into memory, the buffered changes are merged into it in the Buffer Pool, and the Buffer Pool is flushed to disk at a certain frequency.
For example, suppose we insert a row with id=2 and name='zs', where id is the primary key and name has an ordinary index. The id values are ordered, so the row goes straight into the primary key index page. The name values are usually unordered, so the index entry is first placed in the Change Buffer and flushed to disk later together with other changes. This improves insert performance for non-primary-key indexes.
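The idea of deferring secondary index changes can be modelled in a few lines. This is a hedged sketch, not InnoDB's implementation: SecondaryIndex, its attributes, and the page ids are hypothetical, and pages are plain Python lists. It shows inserts into uncached pages being buffered without any disk read, then merged when the page is finally read.

```python
class SecondaryIndex:
    """Toy secondary index: pages on 'disk', some cached in the buffer pool."""

    def __init__(self, disk_pages):
        self.disk = {pid: list(keys) for pid, keys in disk_pages.items()}
        self.buffer_pool = {}    # page_id -> keys for pages in memory
        self.change_buffer = {}  # page_id -> pending inserts for uncached pages
        self.disk_reads = 0

    def insert(self, page_id, key):
        if page_id in self.buffer_pool:
            # Page is cached: apply the change directly.
            page = self.buffer_pool[page_id]
            page.append(key)
            page.sort()
        else:
            # Page is not cached: buffer the change instead of reading disk.
            self.change_buffer.setdefault(page_id, []).append(key)

    def read_page(self, page_id):
        if page_id not in self.buffer_pool:
            self.disk_reads += 1
            keys = list(self.disk[page_id])
            # Merge any buffered changes as the page enters the buffer pool.
            keys.extend(self.change_buffer.pop(page_id, []))
            keys.sort()
            self.buffer_pool[page_id] = keys
        return self.buffer_pool[page_id]


index = SecondaryIndex({1: ["aa", "cc"], 2: ["mm"]})
index.insert(1, "bb")
index.insert(2, "zs")
index.insert(1, "ab")
print(index.disk_reads)      # no disk reads were needed for the inserts
print(index.read_page(1))    # buffered entries are merged on first read
```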

Adaptive Hash Index (AHI): the engine automatically builds hash indexes for hot pages. Without hash collisions, a hash index reaches the data in a single lookup, whereas a B+ tree primary key index usually needs 1~3. However, a hash index does not support range queries, so the engine has to judge the workload and build hash indexes only for suitable data. No manual intervention is required.

Log Buffer: the in-memory part of the redo log. The default size is 16M. Its contents are flushed to disk according to one of three policies (selected with innodb_flush_log_at_trx_commit). With the first, the log is written and flushed to disk once every 1s. With the second, the log is flushed to disk each time a transaction commits, in addition to the once-per-second flush. With the third, the log is handed to the operating system's page cache when a transaction commits and the operating system flushes it to disk later, again in addition to the once-per-second flush.
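The practical difference between the three flush policies is what survives a crash. The following sketch models them with three Python lists; the RedoLog class and its method names are hypothetical, and the policy numbers 0, 1, and 2 are assumed to correspond to the values of innodb_flush_log_at_trx_commit in the order the three policies are listed above.

```python
class RedoLog:
    """Toy model of where redo records sit under each flush policy."""

    def __init__(self, policy):
        self.policy = policy   # 0, 1, or 2
        self.log_buffer = []   # inside the MySQL process
        self.os_cache = []     # operating system page cache
        self.disk = []         # durable storage

    def commit(self, record):
        self.log_buffer.append(record)
        if self.policy in (1, 2):
            self._write_to_os_cache()
        if self.policy == 1:
            self._fsync()

    def tick(self):
        """The once-per-second background flush used by all three policies."""
        self._write_to_os_cache()
        self._fsync()

    def _write_to_os_cache(self):
        self.os_cache.extend(self.log_buffer)
        self.log_buffer.clear()

    def _fsync(self):
        self.disk.extend(self.os_cache)
        self.os_cache.clear()

    def survives_process_crash(self):
        # The OS cache outlives the MySQL process.
        return self.disk + self.os_cache

    def survives_os_crash(self):
        return list(self.disk)


for policy in (0, 1, 2):
    log = RedoLog(policy)
    log.commit("trx1")   # crash happens before the next one-second tick
    print(policy, log.survives_process_crash(), log.survives_os_crash())
```

Only the commit-time flush keeps the record through both kinds of crash; the other two trade up to about a second of committed work for fewer synchronous writes.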

Disk part

The disk structure mainly has the following parts:

  • System Tablespace
    The system tablespace can have one or more data files. By default, a single system tablespace data file named ibdata1 is created.
    The system tablespace holds the data dictionary (including metadata), the doublewrite buffer (which no longer has its storage area here after version 8), the change buffer, undo logs, and the data and indexes of tables created in the system tablespace (tables not created in the system tablespace are stored in separate tablespaces).

  • File-Per-Table Tablespaces (independent tablespaces)
    By default each table uses its own tablespace to store its data and indexes, corresponding to a single .ibd file on disk.

  • General Tablespaces
    A general tablespace is a shared tablespace that can store multiple tables.

  • Redo log
    The redo log is written in a loop across two files, ib_logfile0 and ib_logfile1.

Background threads

These threads work between memory and disk, schedule the necessary operations, and provide data services. They are mainly divided into the Master Thread, IO Thread, Purge Thread, and Page Cleaner Thread.

Master Thread
The core background thread. It schedules the other threads and is also responsible for merging the Change Buffer into the Buffer Pool, flushing dirty pages, reclaiming undo pages, and so on.
IO Thread
Handles asynchronous IO requests between memory and disk.
Purge Thread
Mainly reclaims undo logs after transactions commit, reducing the load on the Master Thread.
Page Cleaner Thread
Flushes dirty pages from memory to disk, reducing the load on the Master Thread.

Three major features of InnoDB

The three major features of InnoDB are doublewrite, the Buffer Pool, and the adaptive hash index.
The latter two have been introduced above, so only the doublewrite mechanism is covered here.
Why is it called double writing? Because a page in memory is first written sequentially to the doublewrite buffer on disk, and only after that write is confirmed successful is the same page written to its scattered location in the corresponding tablespace.

Why is there a doublewrite mechanism?
First, its drawback: the same data is written to disk twice, which costs an extra IO. Even though that extra write is sequential, it reduces overall performance by 5-10%.
So why was doublewrite introduced? The fundamental reason is to solve page corruption while flushing memory data to disk, that is, the partial write problem.
The page corruption problem: InnoDB works in pages of 16K. The operating system (Linux) also works in pages, but its pages are 4K. When data in memory is flushed to disk, it is first handed to operating system pages, which are then written to disk, so flushing one InnoDB page takes four operating system writes. If a crash or similar accident happens during those 4 writes, an incomplete page ends up on disk and the page is corrupted (the partial write problem). Durability must still not be violated, so how can data that was not correctly written because of the crash be recovered?
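A partial write is easy to reproduce in miniature. The sketch below assumes a hypothetical page layout (a 4-byte CRC32 followed by the body) purely for illustration; real InnoDB pages have their own header and trailer format. It writes a 16 KB page as four 4 KB chunks, stops after two of them, and shows that the result matches neither the old nor the new page.

```python
import zlib

INNODB_PAGE = 16 * 1024  # bytes
OS_PAGE = 4 * 1024       # bytes


def make_page(fill: bytes) -> bytes:
    """Build a 16 KB page: 4-byte CRC32 of the body, then the body."""
    body = fill * (INNODB_PAGE - 4)
    return zlib.crc32(body).to_bytes(4, "big") + body


def is_intact(page: bytes) -> bool:
    """A page is intact when its stored checksum matches its body."""
    return zlib.crc32(page[4:]).to_bytes(4, "big") == page[:4]


def write_page(disk: bytearray, new_page: bytes, crash_after=None) -> bool:
    """Write in 4 KB chunks; stop after `crash_after` chunks if it is given."""
    for i in range(INNODB_PAGE // OS_PAGE):
        if crash_after == i:
            return False  # simulated crash: remaining chunks never reach disk
        disk[i * OS_PAGE:(i + 1) * OS_PAGE] = new_page[i * OS_PAGE:(i + 1) * OS_PAGE]
    return True


old_page = make_page(b"A")
new_page = make_page(b"B")
disk = bytearray(old_page)

completed = write_page(disk, new_page, crash_after=2)
print(completed)                 # False: the write was interrupted
print(is_intact(bytes(disk)))    # False: half new, half old
```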
The first idea is the redo log, which exists to guarantee that memory data reaches disk. But what does the redo log record? It is a physical log of the form "on page xxx, at offset xxx, the data is xxxx". The page it refers to was not written completely and is already corrupted, so the redo record for that page address cannot be applied. Redo alone therefore cannot help.
What about the binlog? The binlog serves data backup and master-slave replication. It knows nothing about the state of memory and disk, so it cannot tell which changes have reached disk and which are still in dirty pages.
A full restore would work, but the cost is too high to be realistic.
This is where doublewrite comes in: the page that the redo log refers to is not overwritten first, so that an accident during the write cannot corrupt it and invalidate the redo records.
So where is the data written first? It must be somewhere on disk, otherwise it is pointless, because what we are building is a backup for the transfer between memory and disk.
InnoDB's answer is to write sequentially to the system tablespace first and, once that succeeds, to write to the scattered locations in the corresponding independent tablespaces. InnoDB also optimizes the doublewrite speed: because the extra write is sequential, the cost of the additional IO is kept to a minimum.
That is why the doublewrite mechanism was introduced.

Doublewrite files
A mechanism like this needs corresponding storage.
The doublewrite buffer has two parts, one in memory and one on disk, each fixed at 2M.
On disk, the doublewrite buffer is 128 pages in the tablespace, that is, two extents of 1M each, 2M in total. When Buffer Pool pages are flushed, they are first copied to the in-memory part of the doublewrite buffer. The in-memory part is then written to the on-disk doublewrite area in two sequential writes of 1M each. After those writes succeed, the same pages are written to their scattered locations in the corresponding independent tablespaces.

When the checkpoint mechanism is triggered in memory, InnoDB starts writing some pages to disk. The process is as follows:

  1. The dirty pages to be written are copied into the in-memory part of the doublewrite buffer
  2. The in-memory part is written sequentially to the system tablespace in two writes, one extent (1M) at a time
  3. After those writes are confirmed successful, the in-memory part of the doublewrite buffer writes the same pages to their scattered locations in the corresponding independent tablespaces

What if an accident occurs during the doublewrite process?
An accident during the first write: the first write copies the memory data (the in-memory doublewrite buffer) into the system tablespace file as a backup.
If an accident happens here, the page that the redo log refers to has not been damaged, so redo can be used to recover directly.
An accident during the second write: the second write copies the memory data (the in-memory doublewrite buffer) into the corresponding independent tablespace.
An accident while writing to disk here damages the page, and the redo records for that page can no longer be applied. But the copy in the system tablespace still exists. On restart, the system checks whether the pages are intact, and any damaged page is restored from the system tablespace.
The redo log itself cannot be caught out in this way, because a transaction is considered complete only after its redo records have been written successfully.
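The second-write failure case and its recovery can be walked through with a small model. This is an illustrative sketch under simplifying assumptions, not InnoDB's recovery code: a page is a list of four chunks, each stamped with a version number, and a page counts as intact when all four stamps agree. The function names, the dictionaries standing in for the doublewrite area and the tablespace, and the page id 7 are all hypothetical.

```python
CHUNKS = 4  # a 16 KB page written as four 4 KB operating system pages


def make_page(version, payload):
    """A page is four chunks, each stamped with the page version."""
    return [(version, payload)] * CHUNKS


def intact(page):
    """Intact means every chunk carries the same version stamp."""
    return len({version for version, _ in page}) == 1


def write_page(target, page_id, new_page, crash_after=None):
    """Overwrite chunk by chunk; stop part-way to mimic a crash."""
    current = list(target.get(page_id, make_page(0, "")))
    for i in range(CHUNKS):
        if crash_after == i:
            target[page_id] = current
            return False
        current[i] = new_page[i]
    target[page_id] = current
    return True


def flush(dirty, doublewrite, tablespace, crash_page=None):
    # First write: copy every dirty page into the doublewrite area.
    for page_id, page in dirty.items():
        write_page(doublewrite, page_id, page)
    # Second write: put each page in its final place in the tablespace.
    for page_id, page in dirty.items():
        crash_after = 2 if page_id == crash_page else None
        if not write_page(tablespace, page_id, page, crash_after):
            return False
    return True


def recover(doublewrite, tablespace):
    """On restart, replace torn pages with their doublewrite copies."""
    repaired = []
    for page_id, page in tablespace.items():
        copy = doublewrite.get(page_id)
        if not intact(page) and copy is not None and intact(copy):
            tablespace[page_id] = copy
            repaired.append(page_id)
    return repaired


tablespace = {7: make_page(1, "age=9")}
doublewrite = {}
dirty = {7: make_page(2, "age=10")}

print(flush(dirty, doublewrite, tablespace, crash_page=7))  # False: crashed
print(intact(tablespace[7]))                                # False: torn page
print(recover(doublewrite, tablespace))                     # [7]
print(intact(tablespace[7]))                                # True
```

If the crash had hit the first write instead, the tablespace page would still be the intact old version, which is the case where redo can be applied directly.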

Example: there is a user table where id is the primary key, name has an ordinary index, and age is an ordinary column. Suppose I connect to MySQL, open a transaction, update the age of the user whose name is 'zs' to 10, and commit. Roughly, MySQL does the following internally:
The connection layer establishes the connection with the client, and the service layer checks the syntax and semantics of the SQL statement and hands it to the engine layer. The table's engine is InnoDB. InnoDB looks for the user whose name is 'zs' in the Buffer Pool. If the row is not there, it uses the non-clustered index on name to find the id of the row, then uses the id in the clustered index to find the page containing the row (in the xxx.ibd file), loads that page into the Buffer Pool, and takes a write lock on the row along with an MDL lock and so on (reads of this row during this process go through the MVCC mechanism). At the same time, the undo log pointer in the row is pointed at the undo log record from before the transaction started, and a new undo log record is opened. The age is then changed to 10 in the Buffer Pool. At this point redo records the change to the data at its disk address, undo records how to restore the original row, and the binlog records the user's statement. When the transaction commits, the redo and binlog buffers are flushed to disk immediately, and undo is handed to the Purge Thread for cleanup. The modified data in memory, however, may not yet have been flushed to disk. If other reads arrive, the data in the Buffer Pool is returned directly. When a checkpoint is triggered for this dirty page, the page is copied from the Buffer Pool into the in-memory doublewrite buffer, written from there to the system tablespace in two sequential writes, and then written to its own page in the tablespace. The corresponding redo log records are then no longer needed and will be overwritten by new records.
