MySQL 8.0: Lock-free scalable WAL design

This article is compiled from the official MySQL documentation and introduces the changes MySQL 8.0 makes to the write-ahead log implementation. The key points are summarized as follows:

Before 8.0, in order to guarantee the ordering of the flush list, writes to the redo log buffer had to be serialized under a lock, so they could not proceed in parallel. In a high-concurrency environment, many mini-transactions (mtrs) need to copy their data into the log buffer at the same time; a mutually exclusive lock on that path is without doubt an obvious performance bottleneck.

For this reason, starting from MySQL 8.0, a lock-free log-writing mechanism has been designed. The core idea is to introduce recent_written, a structure that allows different mtrs to write to different positions in the log buffer concurrently.

1. The performance bottleneck in writing the redo log

The write-ahead log (WAL) is one of the most important components of a database. All changes to data files are recorded in the WAL (called the redo log in InnoDB), which allows the flushing of modified pages to disk to be postponed while still preventing data loss.

When writing the redo log, thread synchronization significantly limits the performance of data-intensive write workloads. This is especially visible when testing performance on servers with many CPU cores and fast storage devices such as modern SSDs.

We needed a new design to solve the problems our customers and users face now and in the future. With the new design, we wanted to ensure it would work with existing APIs and, most importantly, not break the contracts the rest of InnoDB relies on, which was a challenging task given these constraints.

The redo log can be thought of as a persistent producer/consumer queue. User threads performing updates are the producers, and when InnoDB has to recover from a crash, the recovery threads are the consumers. During normal operation, InnoDB never needs to read the redo log.

2. Ordering redo log writes from multiple threads

Implementing a scalable log-writing model with multiple producers is only part of the problem. Some InnoDB-specific details also come into play. The biggest challenge is keeping the dirty pages in each buffer pool's flush list ordered by increasing LSN.

First, each buffer pool instance maintains a flush list, and an mtr (mini-transaction) applies its modifications to physical pages atomically; the mtr is therefore the smallest unit in which InnoDB operates on physical files. The redo records are generated by the mtr and are first written into the mtr's private cache. When the mtr commits, the records in that cache are copied into the log buffer (the shared buffer), and the globally maintained log sequence number (LSN) is advanced at the same time.
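As a concrete sketch of how mtr commits could append to a shared log buffer without a global mutex, the following simplified C++ model (our illustration, not InnoDB's actual code; the names are hypothetical) reserves a disjoint LSN range per mtr with a single atomic fetch-add, after which each mtr can copy its records into its own range concurrently:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified model: the global LSN is a single atomic counter counting
// bytes of redo. Each committing mtr reserves a contiguous range
// [start_lsn, end_lsn) with one fetch_add and needs no lock to do so.
struct LogSys {
    std::atomic<uint64_t> sn{0};  // global log sequence number (bytes)
};

struct ReservedRange {
    uint64_t start_lsn;
    uint64_t end_lsn;
};

// Reserve `len` bytes of redo for one mtr. Concurrent callers obtain
// disjoint ranges; the copy into the log buffer then proceeds in parallel.
inline ReservedRange reserve_range(LogSys &log, uint64_t len) {
    uint64_t start = log.sn.fetch_add(len, std::memory_order_relaxed);
    return {start, start + len};
}
```

Because the reservation and the copy are decoupled, ranges may finish copying out of order, which is exactly the discontinuity problem that recent_written solves.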

The mtr is then responsible for adding the modified dirty pages (a list of them) to the flush list, where dirty pages are kept in ascending LSN order.

In the implementation before 8.0, two internal locks, log_sys_t::mutex and log_sys_t::flush_order_mutex, ensured that pages were added to the flush list in the order of the LSNs obtained when writing the log buffer.

The pre-8.0 scheme therefore worked as follows: while an mtr was inserting its dirty pages into a flush list, it held flush_order_mutex. During that time, even a thread A that only needed to add dirty pages to the flush list of a different bp (buffer pool) had to wait; and while waiting, thread A kept holding log_sys_t::mutex, which in turn blocked other threads from writing the log buffer. Simply removing these two locks would leave the LSN-increasing constraint on the flush list unenforced.

3. Solving the problem that LSNs in the log buffer may be discontinuous

The second problem we still face is that since transactions copy their redo records into the log buffer concurrently, there may be holes in the LSN sequence inside the log buffer, so the log buffer cannot simply be flushed in its entirety in one pass.

We solve the second problem by keeping track of which log buffer write operations have completed.

In terms of design, we introduce a new lock-free data structure whose elements correspond one-to-one to positions in the log buffer.


The data structure is a fixed-length array whose elements (slots) are updated atomically, and whose freed space is reused cyclically, so it is effectively a circular array. A dedicated thread traverses the array and reclaims slots, suspending itself when it reaches an empty slot. This thread therefore also maintains the maximum reachable LSN of the structure, a value we call M.

We introduce two instances of this data structure: recent_written and recent_closed. recent_written tracks the completion of log buffer writes. The largest LSN maintained in recent_written, M, means that the redo records for all LSNs smaller than M have been written into the log buffer. This M is also the cut-off point for crash recovery (should a crash occur), and the starting point of the next operation that writes the log buffer to disk.

Flushing the log buffer to disk and traversing recent_written are done by a single thread, so the ordering of memory reads and writes on the log buffer is guaranteed by the barriers formed by sequentially accessing the elements (slots) of recent_written.

Now suppose one more log-buffer write completes, updating the state of the log buffer.

This state update triggers a dedicated thread (log_writer) to scan recent_written onward from its current position.

The thread then updates the maximum reachable LSN it maintains (below which there are guaranteed to be no holes) and writes it into the variable buf_ready_for_write_lsn, whose name says exactly what it means.
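The log_writer scan can be sketched as follows; this is our simplified model, not InnoDB's code, and it represents the completed ranges as a map from start LSN to end LSN rather than as the ring above. Starting at buf_ready_for_write_lsn, it consumes consecutive completed ranges and stops at the first hole:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// `completed` maps start_lsn -> end_lsn for finished log-buffer copies.
// Returns the new buf_ready_for_write_lsn: the log buffer is hole-free
// below it and may be written to disk up to that point.
inline uint64_t advance_buf_ready_for_write(
        std::map<uint64_t, uint64_t> &completed,
        uint64_t buf_ready_for_write_lsn) {
    for (;;) {
        auto it = completed.find(buf_ready_for_write_lsn);
        if (it == completed.end()) break;  // hole: next range not finished
        buf_ready_for_write_lsn = it->second;
        completed.erase(it);               // reclaim the bookkeeping entry
    }
    return buf_ready_for_write_lsn;
}
```

Any range that starts beyond a hole stays recorded until the gap before it is filled, which is exactly why a crash can only be recovered up to M.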

We also introduce a second lock-free structure, recent_closed, which takes over the job of the former log_sys_t::flush_order_mutex: maintaining the LSN-increasing property of the flush lists. Before going into the implementation, we still need to explain some details in order to make clear how a lock-free structure can maintain the overall LSN monotonicity of the flush lists across buffer pools.

4. Using the checkpoint to ensure the flush list works correctly

First of all, the flush list in each bp is protected by its own internal lock. But since we removed log_sys_t::flush_order_mutex, the LSN-increasing property of concurrent flush-list insertions is no longer guaranteed.

Even so, the flush list must satisfy the following two native constraints for it to work correctly:

  1. Checkpoint - constraint on advancing the checkpoint: given a dirty page P1 with LSN = L1 and a dirty page P2 with LSN = L2, if L2 > L1 and P1 has not yet been flushed to disk, the checkpoint must not be advanced to L2.
  2. Flushing - constraint on the strategy for flushing dirty pages from the flush list: each flush must start from the oldest page (the page whose LSN is smallest). This ensures that the earliest modifications are written from the flush list to disk first, allowing checkpoint_lsn to be advanced.
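The two constraints above can be summarized in a small sketch (our simplification, with hypothetical names). The checkpoint may advance at most to the smallest oldest-modification LSN across all flush lists; and because, with recent_closed, a page whose LSN lies slightly below M may still be on its way into a flush list, the checkpoint is additionally capped by M minus the allowed reordering window L:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// flush_lists: per-buffer-pool lists of dirty-page LSNs, oldest first.
// recent_closed_m: the maximum reachable LSN M of recent_closed.
// reorder_window_l: the bound L on how far insertions may be reordered.
inline uint64_t checkpoint_limit(
        const std::vector<std::deque<uint64_t>> &flush_lists,
        uint64_t recent_closed_m,
        uint64_t reorder_window_l) {
    uint64_t limit = recent_closed_m > reorder_window_l
                         ? recent_closed_m - reorder_window_l
                         : 0;
    for (const auto &fl : flush_lists)
        if (!fl.empty()) limit = std::min(limit, fl.front());
    return limit;
}
```

Flushing oldest-first (constraint 2) is what makes fl.front() the right bound for each list in this sketch.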

The newly introduced lock-free structure recent_closed tracks the state of concurrent flush-list insertions and likewise maintains a maximum LSN, which we mark as M, satisfying the property that all dirty pages with LSNs smaller than M have already been added to a flush list.

Only when a thread's LSN is close enough to the current M may it add its dirty pages to the flush list. After the thread has added the dirty pages to the flush list, it updates the state information in recent_closed.
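The admission check can be sketched as a simple predicate plus a spin (again a simplification with names of our choosing): an mtr whose redo starts at start_lsn waits until M has come within the reordering window L of it, adds its pages, and then reports its range to recent_closed:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// True if an mtr starting at start_lsn may add its dirty pages now,
// i.e. its insertion is within the allowed reordering window L of M.
inline bool may_add_to_flush_list(uint64_t m, uint64_t start_lsn,
                                  uint64_t window_l) {
    return start_lsn < m + window_l;
}

// Busy-wait until the condition holds; real code would back off or
// sleep on an event instead of spinning.
inline void wait_for_recent_closed(const std::atomic<uint64_t> &m,
                                   uint64_t start_lsn, uint64_t window_l) {
    while (!may_add_to_flush_list(m.load(std::memory_order_acquire),
                                  start_lsn, window_l)) {
        // spin
    }
}
```

This bounded reordering is what replaces flush_order_mutex: flush lists are no longer strictly LSN-sorted, but any inversion is smaller than L, which the checkpoint logic accounts for.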


Origin blog.csdn.net/leread/article/details/129991878