The Adventures of a Piece of Data

Editor's note: The database storage engine is an old technology; after decades of development there are many excellent, mature products. The Alibaba X-Engine team wrote the paper "X-Engine: An Optimized Storage Engine for Large-scale E-Commerce Transaction Processing", a detailed account of the team's original work on a database storage engine, which was accepted earlier this year to the SIGMOD'19 Industrial Track (SIGMOD is one of the most important and influential conferences in the database field). This article walks through the paper.

Background

X-Engine is an OLTP database storage engine developed in-house by Alibaba's database products division. As the storage engine of the self-developed database POLARDB X, it is already widely used in many business systems inside Alibaba Group, including core applications such as the transaction history database and the DingTalk message history database, significantly reducing costs for those businesses. It is also one of the key database technologies behind the Double Eleven shopping festival, having survived traffic peaks hundreds of times higher than usual.

The database storage engine is an old technology; after decades of development there are many excellent, mature products. Storage engines of every kind have done meticulous work on index organization, cache management, transaction processing, and query optimization. Even so, the field keeps evolving, and many new techniques appear every year.

In recent years the LSM (Log-Structured Merge-Tree) structure has attracted more and more attention. The technique itself has been around for many years and is nothing new, but it was previously applied mostly in key-value storage systems; only in recent years has it begun to appear in database storage engines, RocksDB being a typical example.

LSM has become fashionable first because of its simplicity and second because of its distinct characteristics. The write model is append-only: existing data is never updated in place, and data is organized in a simple logical order. The result is strong write performance but weaker reads, and read-only persisted data that is easy to compress. Most database workloads, however, actually read more than they write, so using an LSM structure directly may not be appropriate; other means are needed to patch the weaknesses while keeping the advantages.

Architecture

X-Engine uses LSM as its base structure. The goal is a low-cost, high-performance, general-purpose storage engine with more balanced read and write performance, so it improves on LSM in many ways, mainly along the following directions:

  1. Exploit LSM's natural write advantage and keep optimizing write performance.
  2. Optimize compaction to reduce its impact on the system, so that performance stays stable.
  3. Exploit the read-only nature of the persistent data layers to get the most out of compression and reduce cost.
  4. Exploit the natural tiered structure, combined with tiered hardware, to place cold data cheaply and reduce cost.
  5. Use fine-grained access paths and caching to make up for the read weaknesses.

The overall architecture of X-Engine is shown below. The persistent data layers are tiered by the hotness of the data itself rather than by LSM's fixed placement. Hot data and updates live in the in-memory data layer, which borrows heavily from main-memory database techniques (lock-free index structures, append-only writes) to improve transaction-processing performance; transaction processing itself uses a pipelined mechanism that parallelizes the stages of a commit to raise throughput. Data whose access frequency drops (cools down) is gradually evicted or merged into the persistent storage hierarchy and placed on today's rich hierarchy of storage devices (NVM/SSD/HDD).

The optimization with the greatest impact on performance concerns the compaction process. The main ideas are to split the granularity at which data is stored and, exploiting the fact that hot updates are concentrated, to reuse data during merges; this gives fine control over the shape of the LSM, reduces the I/O and computation cost of merging, and greatly reduces the space amplification caused by merges. Finer-grained access paths and caching are also used to optimize read performance.

[Figure: X-Engine overall architecture]

Since X-Engine is built on an LSM architecture, everything starts from LSM itself.

Basic LSM logic

A piece of data's journey through an LSM structure begins with the WAL (Write-Ahead Log) and then enters the MemTable, the first stop of its life cycle. Later, a flush operation engraves it onto a more stable medium, and compaction operations carry it further and further down, or discard it along the way, depending on how quickly its successors arrive.

The essence of LSM is that writes never update data in place; instead, data is appended in memory. After a certain amount has been written, that portion is frozen as a level and written to persistent storage. All written rows are stored sorted by primary key, whether in memory or in persistent storage: in memory, in a sorted in-memory data structure (skiplist, B-tree, and the like); in persistent storage, as read-only, fully sorted persistent structures.
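
To make the append-only, frozen-level model concrete, here is a minimal single-threaded sketch of the idea (illustrative only, not X-Engine code; the class and names are made up): writes go into a memtable, a full memtable is frozen into a read-only sorted run, and a point read searches from the newest data to the oldest.

```python
import bisect


class TinyLSM:
    """Illustrative append-only LSM: one memtable plus frozen, sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # newest data, mutable
        self.runs = []               # frozen read-only runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never touch persisted data; they only go to the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a read-only run sorted by key.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Search newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None


db = TinyLSM()
for i in range(10):
    db.put(f"k{i:02d}", f"v{i}")
print(db.get("k03"))   # found in a frozen run
print(db.get("k09"))   # found in the memtable
```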

For an ordinary storage system to support transaction processing, in particular atomicity, consistency, and isolation, a time dimension has to be added so that each transaction gets a view undisturbed by concurrent transactions. The storage engine gives every transaction a monotonically increasing version number (SN) from a global sequencer; each record written by a transaction carries its SN, and the SNs determine visibility between transactions, which is how isolation is achieved.
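
The visibility rule can be sketched as follows (a simplified illustration of the idea, not X-Engine's actual code): a transaction whose snapshot was taken at `snapshot_sn` sees, for each key, the newest version whose SN does not exceed that snapshot.

```python
def visible_version(versions, snapshot_sn):
    """versions: list of (sn, value) pairs for one key, in any order.
    Returns the value of the newest version committed at or before snapshot_sn."""
    candidates = [(sn, value) for sn, value in versions if sn <= snapshot_sn]
    if not candidates:
        return None
    return max(candidates)[1]   # the highest SN that is still visible


# Versions of one key written by transactions with SN 5, 9 and 12.
history = [(5, "a"), (9, "b"), (12, "c")]
print(visible_version(history, snapshot_sn=10))  # -> "b": SN 12 is not yet visible
print(visible_version(history, snapshot_sn=4))   # -> None: nothing visible yet
```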

If data kept being written into an LSM structure and nothing else were ever done, the final structure would look like this:

Note that the SN range attached to each layer records the order in which transactions were written; persisted data can no longer be modified. The data within each layer is sorted by key, and the key ranges of different layers overlap.

[Figure: LSM layers accumulated under continuous writes]

This structure is very write-friendly: a write is complete as soon as it is appended to the newest in-memory list. For crash recovery, only the WAL (redo log) needs to be recorded, and because new data never overwrites old versions, appending naturally yields a multi-version structure.

One can imagine that as the frozen, persisted levels pile up, queries suffer: the commit records of different transactions on the same key are scattered across levels, so a read such as a sequential scan has to search and merge every layer to produce the final result.

LSM introduces the compaction operation to solve this problem. Compaction continually merges the data of adjacent levels and writes the result into the lower level. The merge reads out the data of two (or more) adjacent layers, sorts it by key, and for keys with multiple versions keeps only the versions newer than the smallest version number among transactions still running, throws the older versions away, and writes the result into a new layer. One can imagine that this operation is very resource-intensive.
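
The merge step can be sketched like this (an illustrative simplification, not X-Engine's code): records from the input layers are sorted by key and version; for each key, everything newer than the oldest active snapshot is kept, plus the single newest version at or below that watermark, since a running transaction may still need it; older versions are dropped.

```python
def compact(levels, oldest_active_sn):
    """levels: list of layers, each a list of (key, sn, value) records.
    Returns one merged, sorted layer keeping only versions still needed."""
    merged = sorted((rec for level in levels for rec in level),
                    key=lambda r: (r[0], -r[1]))      # by key, newest version first
    out, kept_watermark_version = [], set()
    for key, sn, value in merged:
        if sn > oldest_active_sn:
            out.append((key, sn, value))              # may be visible to some transaction
        elif key not in kept_watermark_version:
            out.append((key, sn, value))              # newest version at or below the watermark
            kept_watermark_version.add(key)
        # all older versions of this key are discarded
    return out


level1 = [("a", 12, "a3"), ("b", 7, "b1")]
level2 = [("a", 9, "a2"), ("a", 5, "a1"), ("c", 3, "c1")]
print(compact([level1, level2], oldest_active_sn=10))
# [('a', 12, 'a3'), ('a', 9, 'a2'), ('b', 7, 'b1'), ('c', 3, 'c1')]  -- ('a', 5, 'a1') is dropped
```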

The compaction operation in LSM serves two purposes: one is to discard old versions of data that are no longer used, the other is to control the shape of the LSM. The shape is generally such that the lower the level, the larger (by some multiple) its volume of data; the main purpose of this arrangement is to improve read performance.

[Figure: compaction merges adjacent levels]

In general, access to any data storage system shows locality: a large share of accesses concentrate on a small share of the data, which is the basic premise that lets caching work. In an LSM structure, if we keep frequently accessed data in the higher levels, small enough in volume to live on fast storage devices (such as NVM or DRAM), and put infrequently accessed data in the lower levels on cheap, slow storage devices, we get the hot/cold tiering idea on which X-Engine is based.

To achieve this, the core problem is how to choose the right data to merge down to the lower levels; this is the first problem a compaction scheduling strategy has to solve. By the logic of hot/cold tiering, cold data (data with relatively low access frequency) is merged down first.

There are many ways to identify cold data, and they are not the same for every workload. For many stream-like workloads (for example, transaction log systems), newly written data has a higher probability of being read, so hot and cold can be told apart simply by write time. In many other applications, access frequency has no necessary relationship with write time, and cold data has to be identified from the actual access frequency of the data.

Besides hotness, there are other dimensions to weigh when picking data to merge, and they also affect read performance. Update frequency, for example: a large number of versions of the same data wastes extra I/O and CPU at query time, so records with many versions should be merged with priority. X-Engine combines several such strategies to form its own compaction scheduling mechanism.
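
Purely as an illustration of how such strategies might be combined (this is a made-up scoring rule, not X-Engine's actual policy), a scheduler could score each candidate extent by how cold it is and how many redundant versions it holds, and merge the highest-scoring candidates first:

```python
import time


def compaction_score(extent, now=None):
    """extent: dict with 'last_access' (unix time), 'records' and 'versions' counts.
    Higher score = better merge candidate: colder data with more redundant versions."""
    now = now if now is not None else time.time()
    coldness = now - extent["last_access"]                        # seconds since last access
    redundancy = extent["versions"] / max(extent["records"], 1)   # average versions per record
    return coldness * redundancy


extents = [
    {"id": 1, "last_access": 1_000, "records": 100, "versions": 100},  # warmer, no extra versions
    {"id": 2, "last_access": 10,    "records": 100, "versions": 400},  # cold, heavily multi-versioned
]
best = max(extents, key=lambda e: compaction_score(e, now=2_000))
print(best["id"])   # -> 2: the cold, version-heavy extent is merged first
```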

Refined LSM

The above is the macroscopic logical structure of LSM. To discuss concretely how reads, writes, and compaction are carried out, we need to look at how each level organizes its data, and this is where LSM variants differ.

For the memtable, X-Engine uses a lock-free skiplist. The requirement is simple: both concurrent read and write performance must be high. There are of course more efficient data structures, and multiple indexes could be used; X-Engine has not optimized this part much, because the transaction logic is more complex and writing the memory table has not become a bottleneck.

How the persistence layers can be organized more efficiently requires a look at their fine structure.

Data Organization

Briefly, X-Engine divides each layer into fixed-size extents, each storing a contiguous key-range segment of that level's data. To locate extents quickly, a set of indexes (a Meta Index over the extents of each level) is built; all of these indexes, together with all of the memory tables (active/immutable), form a metadata tree whose root node is a "Metadata Snapshot". The tree is structurally similar to a B-tree, though it differs in the details.

[Figure: extents, Meta Index, and Metadata Snapshot]

Note that in X-Engine, apart from the active memtable currently being written, every structure is read-only and is never modified. Given a point in time, say LSN = 1000, "Metadata Snapshot 1" in the figure above references a structure that is a snapshot of all data as of that moment (LSN = 1000), which is why the structure is called a snapshot.

Even the metadata structure itself is never modified once created. All reads take a "Snapshot" structure as their entry point; this is the basis on which X-Engine implements the SI (snapshot isolation) level. As mentioned earlier, as data keeps being written and accumulating, memtables have to be frozen and flushed, and the layers have to be compacted. These operations modify the storage structure of the layers, and all of them are implemented with copy-on-write: every modification (switch/flush/compaction) writes its results into newly allocated extents, then generates new "Meta Index" structures and a new "Metadata Snapshot". Taking a compaction operation as an example:

[Figure: copy-on-write update of the metadata tree during a compaction]

See "Metadata Snapshot 2" with respect to "Metadata Snapshot 1" and there is not much change, only modify some of the leaf nodes and index nodes are changed. This technique is quite similar to the "B-Trees, Shadowing, and Clones" ( https://liw.fi/larch/ohad-btrees-shadowing-clones.pdf ), if you read the papers, you would understand this process help.


Transaction Processing

Thanks to LSM's lightweight write mechanism, writes are obviously its strong point, but a transaction is much more than writing updated data into the system: guaranteeing ACID involves a complex sequence of steps. X-Engine splits the handling of a transaction into two phases: a read/write phase and a commit phase.

The read/write phase has to check for write-write and read-write conflicts and decide whether the transaction can proceed, must be rolled back, or has to wait on a lock and retry. If the transaction passes all conflict checks, its modifications are written into a "Transaction Buffer". The commit phase covers writing the WAL, writing the memory table, committing, and returning the result to the user; it contains both I/O work (writing the log, returning the response) and CPU work (copying the log, writing the memory table).

To raise throughput, the system handles a large number of transactions concurrently. A single I/O operation is relatively expensive, so most storage engines tend to commit a batch of transactions together, known as "group commit", which consolidates their I/O. But within a group commit there is still a lot of waiting: while the log is being written to disk, for example, nothing is done except waiting for the write to finish.

To push transaction throughput further, X-Engine uses pipelining: the commit phase is split into four finer, independent stages: copying into the log buffer (Log Buffer), flushing the log to disk (Log Flush), writing the memory table (Write MemTable), and committing and returning (Commit). A transaction-processing thread that reaches the commit phase is free to execute any stage of the pipeline, so the stages run in parallel; as long as the tasks of each stage are sized properly, the stages can be fully parallelized and the pipeline stays close to full.

In addition, transaction-processing threads rather than background threads are used: each thread either picks a pipeline stage and works on it, or, finding nothing to do, simply goes back to accept more requests. There is no waiting and no thread switching, so the capacity of every thread is fully used.

[Figure: pipelined transaction commit]
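
A simplified sketch of the scheme (an illustration of the idea, not X-Engine's implementation): each commit stage has its own queue, and any worker thread grabs work from whichever stage has pending tasks, so log copying, log flushing, memtable writes, and commit/return proceed in parallel for different transactions.

```python
import queue
import threading

STAGES = ["copy_log_buffer", "flush_log", "write_memtable", "commit_return"]
queues = {s: queue.Queue() for s in STAGES}
done = []


def run_stage(stage, txn):
    # Placeholder for the real work of each stage (log copy, disk write, ...).
    txn["trace"].append(stage)


def worker():
    while True:
        for i, stage in enumerate(STAGES):
            try:
                txn = queues[stage].get_nowait()   # pick any stage with pending work
            except queue.Empty:
                continue
            run_stage(stage, txn)
            if i + 1 < len(STAGES):
                queues[STAGES[i + 1]].put(txn)     # hand over to the next stage
            else:
                done.append(txn)
            break
        else:
            return                                 # nothing to do anywhere: go fetch new requests


for t in range(8):                                 # eight transactions enter the pipeline
    queues[STAGES[0]].put({"id": t, "trace": []})

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(len(done), done[0]["trace"])
# 8 ['copy_log_buffer', 'flush_log', 'write_memtable', 'commit_return']
```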

Reads

The way LSM handles multiple versions is that new versions of a record are appended after its older versions. Physically, different versions of a record may sit in different layers, so a read has to find the appropriate version (according to the visibility rules defined by the transaction isolation level in force at query time). Most queries look for the latest data, so the search always runs from the newest level (the most recently written) toward the older ones.

A point lookup can stop as soon as the record is found. If the record is still in an upper level, say the memtable, it is returned quickly; if it has unfortunately sunk to a very low level (quite possible for very random reads), the lookup has to make the long journey down, level by level. Bloom filters can skip some levels and shorten the journey, but extra I/O operations remain.

For point lookups X-Engine introduces a Row Cache, which caches data across all the persistent levels: a point query that misses in the memtables is caught by the Row Cache. The Row Cache must always hold the latest persisted version of every record it caches, and that version can change: for example, every time a read-only memtable is flushed into the persistent levels, the corresponding Row Cache entries have to be updated accordingly. This operation is rather delicate and requires careful design.
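
A simplified sketch of that lookup order (illustrative only; the class and names here are invented): a point read checks the memtable, then the Row Cache, and only on a miss walks the persistent levels and fills the cache; a memtable flush refreshes the affected cache entries so the cache keeps only the latest persisted version.

```python
class PointReadPath:
    """Illustrative point-lookup path: memtable -> row cache -> persistent levels."""

    def __init__(self, levels):
        self.memtable = {}
        self.row_cache = {}          # key -> latest persisted value
        self.levels = levels         # list of dicts, newest level first

    def get(self, key):
        if key in self.memtable:     # freshest data wins
            return self.memtable[key]
        if key in self.row_cache:    # cached latest persisted version
            return self.row_cache[key]
        for level in self.levels:    # slow path: walk the persistent levels
            if key in level:
                self.row_cache[key] = level[key]
                return level[key]
        return None

    def flush_memtable(self):
        # The flushed data becomes the latest persisted version, so the
        # corresponding row-cache entries must be refreshed, not left stale.
        self.levels.insert(0, dict(self.memtable))
        for key, value in self.memtable.items():
            if key in self.row_cache:
                self.row_cache[key] = value
        self.memtable = {}


db = PointReadPath(levels=[{"k1": "old"}])
print(db.get("k1"))          # "old": read from a level, now cached
db.memtable["k1"] = "new"
db.flush_memtable()
print(db.get("k1"))          # "new": the flush refreshed the cache entry
```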

Range scans are not so lucky: there is no way to know in advance which levels hold the data of a given key range, possibly every one of them, so the only option is to scan all levels and merge them to produce the final result. X-Engine uses a number of techniques here, such as SuRF (a SIGMOD'18 best paper) to provide a range-scan filter that reduces the number of levels to scan, and asynchronous I/O with prefetching, which markedly improves large range scans.



The core of the read path is cache design. The Row Cache handles single-row queries; the Block Cache catches what slips past the Row Cache and also serves scans. But LSM compaction updates data blocks in large batches, invalidating large amounts of Block Cache content in a short time and causing sharp performance jitter.

X-Engine applies several treatments here:

  1. Reduce the granularity of compaction jobs;
  2. Reduce the amount of data modified during compaction (see the later section);
  3. Update already-cached data in place during compaction, which keeps cache invalidation, and the jitter it causes, to a minimum.

X-Engine has several kinds of cache, and the memtable can arguably be counted as one more. With limited memory, how to apportion it among the caches so as to maximize its value is a problem without a settled answer yet; X-Engine is still exploring it.

Of course, LSM is not all bad news for reads: since every structure other than the memtable is read-only, the rest of the read path can be completely lock-free (and the memtable, too, can be designed for lock-free reads).

Compaction

Compaction is a relatively heavy operation: the data in the overlapping key ranges of adjacent levels has to be read out, merged, and written to new locations. This is the price paid for the simple write path described earlier. X-Engine redesigned its storage structure to optimize this operation.

As described above, X-Engine divides each layer's data into fixed-size "extents". An extent is like a small, complete SSTable that stores a contiguous key-range segment of its level; it is further divided into smaller contiguous fragments, "data blocks", the equivalent of a traditional database page, except read-only and variable-length.

[Figure: extents and data blocks]

Looking back at the earlier comparison of how a merge changes the metadata, the difference between "Metadata Snapshot 2" and "Metadata Snapshot 1", the design intent of extents becomes clear: each modification does not adjust the entire structure, only the extents holding the small portion of overlapping data that actually changed, plus the "Meta Index" nodes above them.

Two "Metadata Snapshot" structure is actually a large number of shared data structures. This is known as data multiplexing (Data Reuse), and Extent size is the impact of key data reuse rate, Extent as a complete physical structure is multiplexed need as small as possible, so that data with other cross Extent points will become less, but not very small, or too much to index, manage costs too much.

X-Engine's data reuse during compaction is thorough. Suppose the compaction merges the extents of two adjacent levels (Level 1 and Level 2) whose key ranges intersect. Using a progressive-scan merge algorithm, whenever a "physical structure" (an extent, or a data block inside one) does not overlap data in the other level, it can be reused. The difference is that reusing an extent only requires a change to the Meta Index, whereas reusing a data block still means copying it; even so, a great deal of CPU is saved.
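
The reuse decision can be sketched as follows (a simplified illustration of the idea, not X-Engine's algorithm): walk the extents of both levels in key order; an extent whose key range overlaps nothing in the other level is reused by reference (only the Meta Index would point at it), while overlapping extents are actually merged and rewritten.

```python
def merge_with_reuse(level1, level2):
    """level1/level2: lists of extents, each a dict with 'id' and a key range 'lo'..'hi'.
    Returns (plan, reused ids): non-overlapping extents are reused by reference,
    the rest would be read, merged and rewritten."""
    def overlaps(e, others):
        return any(e["lo"] <= o["hi"] and o["lo"] <= e["hi"] for o in others)

    plan, reused = [], []
    for e in sorted(level1 + level2, key=lambda x: x["lo"]):
        other = level2 if e in level1 else level1
        if overlaps(e, other):
            plan.append(("rewrite", e["id"]))
        else:
            plan.append(("reuse", e["id"]))
            reused.append(e["id"])
    return plan, reused


L1 = [{"id": "L1-a", "lo": 0,  "hi": 9},  {"id": "L1-b", "lo": 20, "hi": 29}]
L2 = [{"id": "L2-a", "lo": 5,  "hi": 14}, {"id": "L2-b", "lo": 40, "hi": 49}]
plan, reused = merge_with_reuse(L1, L2)
print(plan)     # L1-a and L2-a overlap and are rewritten; L1-b and L2-b are reused
print(reused)   # ['L1-b', 'L2-b']
```

In the real engine the same idea is applied one level further down: data blocks inside the rewritten extents that do not overlap the other side can still be copied over wholesale instead of being re-merged row by row.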

A typical case of data reuse during compaction is illustrated in the figure below:

[Figure: data reuse during compaction]

As can be seen, data reuse happens during the progressive iteration of the merge. Such fine-grained reuse, however, has a side effect: data fragmentation. In practice a compromise has to be struck based on the actual situation.

Data reuse benefits not only the compaction operation itself, by reducing I/O and CPU consumption during the merge, but also overall system performance. For example, because data is not completely rewritten during compaction, write amplification is greatly reduced; and because most of the data stays in place, cached data is not invalidated by the merge, which removes a source of read-performance jitter.

In fact, optimizing the compaction process itself is only part of the work on X-Engine. More important is optimizing the compaction scheduling strategy: which extents to pick, how large to make a compaction task, and with what priority to run it all affect overall system performance, and no perfect strategy exists. X-Engine has accumulated some experience and defined a number of rules; exploring reasonable scheduling strategies is an important direction for future work.

Postscript

X-Engine is one of the important core database technologies of the database products division of the Alibaba Cloud Intelligence Business Group.

As the storage engine of the MySQL-compatible database POLARDB X, it has been gradually polished to maturity serving Alibaba Group's internal businesses. In the second half of this year we will launch RDS MySQL (X-Engine) on the Alibaba Cloud public cloud platform, bringing Alibaba Cloud's public-cloud customers a low-cost, high-performance database service.


Source: https://yq.aliyun.com/articles/706292