Analysis of In-Memory Database and Comparison of Mainstream Products (2)

Author: Chen / Big Data Laboratory

In the previous article, "Analysis of In-Memory Database and Comparison of Mainstream Products (1)", we covered disk-based database management systems and briefly traced the technical development of in-memory databases. This article introduces the characteristics of in-memory databases from the perspectives of data organization and indexing, and describes the actual technical implementations of several products.

— Data organization in database management systems —

Fixed-length blocks vs. variable-length blocks

When an in-memory database manages data in memory, it no longer needs to organize the data as Slotted Pages, but it still cannot allocate address space arbitrarily; the data must still be organized into blocks (Block/Page). A traditional disk-based DBMS uses Slotted Pages for read/write performance, because the disk interface reads and writes in units of Blocks/Pages. An in-memory database uses blocks to simplify addressing and management. Data blocks are usually divided into fixed-length data blocks (Fixed-Length Data Blocks) and variable-length data blocks (Variable-Length Data Blocks).

Assuming a data set has been fully loaded into memory, the in-memory database organizes it by taking all the fixed-length attributes of a record and placing them in a fixed-length data block, while all variable-length attributes are stored in a variable-length data block. For example, all attributes of 8 bytes or less in a table are typically placed in fixed-length data blocks, while variable-length attributes and attributes longer than 8 bytes are placed separately in variable-length data blocks, with a pointer to their address stored in the fixed-length block. The advantages of managing data with fixed-length blocks are fast addressing, since a record's position within a block is determined by the record length and sequence number; small address pointers, so that an index or any other structure stores record memory addresses in the most compact form; and more accurate CPU prefetching (Pre-Fetch).
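As a concrete illustration, here is a minimal C++ sketch of this layout; the schema and field names are hypothetical, not any product's actual format:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical record layout: fixed-size attributes live inline in the
    // fixed-length block; the variable-length attribute lives elsewhere and
    // is reached through an 8-byte pointer.
    struct FixedPart {               // one slot in a fixed-length data block
        int64_t     id;              // 8-byte attribute, stored inline
        double      amount;          // 8-byte attribute, stored inline
        const char* name;            // pointer into a variable-length block
    };

    // Addressing inside a fixed-length block: a record's slot is computed
    // from the record length and its sequence number, with no per-record
    // lookup structure.
    inline FixedPart* slot(void* block, std::size_t record_no) {
        return reinterpret_cast<FixedPart*>(
            static_cast<char*>(block) + record_no * sizeof(FixedPart));
    }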

In a traditional disk-based DBMS, the record address stored in an index leaf node is Page ID + Offset, and the Page Table maps the Page ID to a Buffer frame; in an in-memory database, the index leaf node stores the record's direct memory address. Furthermore, when a disk-based DBMS accesses a Page in the Buffer, it must latch, unlatch, and modify latches on the Page, and since a real system may have many kinds of latches, a thread accessing a Page often has to acquire several of them. An in-memory database has no Buffer, so this latch overhead disappears and performance improves greatly.

Data organization: partitioning, multi-versioning, and row/column storage

In multi-core or multi-CPU shared-memory systems, conflicts from concurrent access to data always exist. Current in-memory database systems can be divided into Partition Systems and Non-Partition Systems. A Partition System divides all data into disjoint Partitions, each assigned to one core (or to one node in a distributed system); all operations execute serially with no concurrent data access, which ideally yields the best performance. The shortcomings of this kind of system are equally obvious, such as how to divide the Partitions and how to handle cross-partition transactions. In a Non-Partition System, all cores and all threads can access all data, so concurrent access conflicts are inevitable and data structures that support concurrent access must be adopted. At present, general-purpose databases mostly follow the Non-Partition design, mainly because data is hard to partition effectively in general scenarios; where no good partitioning exists, a Partition database cannot be used.

In a Non-Partition System, two threads accessing the same data item will conflict, and a Multi-Version design can be considered in that case. The advantage of Multi-Version is a higher degree of concurrency: with multiple versions of the data, read operations never block write operations, improving the performance of the whole system. For read-heavy workloads Multi-Version performs well; for Write-Heavy workloads, the performance is less ideal.

Another consideration in data organization is row versus column layout. Traditional database systems maintaining data on disk are divided into row stores and column stores: as the names imply, a row store keeps data by rows and a column store keeps data by columns. For operations that touch all the attributes of a few records, row storage is more appropriate; for reads of only a few columns across many records, column storage performs better. For example, suppose a record has 100 attributes and a read operation needs just one of them: with row storage, entire blocks are read in and the column is filtered out afterwards; with column storage, only the blocks holding that column are read, so performance is better, which suits statistical analysis. In-memory databases do not really have this problem: with all data in memory, the access costs of row and column layouts are similar, so row storage and column storage can be exchanged or chosen freely. Of course, TP applications mostly use row storage, since all attributes of a record can be fetched at once; but even with column storage, performance is not as poor as in a disk-based system. SAP HANA, for example, is a hybrid row-column store: the front-end transaction engine uses row storage, which is consolidated and converted to column storage in the back end.
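The difference between the two layouts can be shown in a short C++ sketch (hypothetical schema, purely illustrative):

    #include <cstdint>
    #include <vector>

    // Row storage: one contiguous struct per record. Reading a whole record
    // touches one place; scanning a single column touches every record.
    struct Row { int64_t id; double amount; int32_t city_code; };
    using RowTable = std::vector<Row>;

    // Column storage: one contiguous array per attribute. Scanning a single
    // column (e.g. for SUM(amount)) reads only that array.
    struct ColumnTable {
        std::vector<int64_t> id;
        std::vector<double>  amount;
        std::vector<int32_t> city_code;
    };

    double sum_amount(const ColumnTable& t) {
        double s = 0;
        for (double a : t.amount) s += a;   // only the 'amount' column is read
        return s;
    }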

— Comparison of in-memory database systems —

Next, from the perspective of data organization, we briefly introduce four representative systems: SQL Server's in-memory engine Hekaton, the Technical University of Munich's in-memory database HyPer, SAP's HANA, and Turing Award winner Michael Stonebraker's H-Store/VoltDB.
Hekaton

Hekaton is a Non-Partition system: all threads can access any data. Hekaton's concurrency control does not use a lock-based protocol but a multi-version mechanism, in which every version of every record carries a begin timestamp and an end timestamp that determine the version's visibility range.

Each table in Hekaton can have up to 8 indexes, which may be Hash or Range indexes. Record versions need not be stored contiguously in memory (No-Clustering); different versions of the same record are linked together through pointers (Pointer Link).
[Figure: Hekaton multi-version records with a Hash index on name and a B-Tree index on city]
As shown in the figure above, a table contains name, city, and amount fields, with a Hash index on name and a B-Tree index on city. The black arrows are the pointers of the name index: the first record for the name John points to the next record whose name begins with the same letter. Each record contains a begin and an end timestamp; red indicates a transaction currently updating the record, and the end timestamp is filled in once that transaction commits. The B-Tree index works the same way, with the blue arrows chaining records by city value.
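A minimal C++ sketch of this version layout (field names illustrative, not Hekaton's actual definitions):

    #include <atomic>
    #include <cstdint>

    // Each version carries a visibility range [begin_ts, end_ts) in
    // transaction timestamps; an "infinity" end marks the current version.
    constexpr uint64_t TS_INFINITY = ~0ULL;

    struct RecordVersion {
        std::atomic<uint64_t> begin_ts;   // commit time of creating txn
        std::atomic<uint64_t> end_ts;     // commit time of replacing txn
        RecordVersion*        hash_next;  // pointer link in the Hash chain
        // ... payload: name, city, amount ...
    };

    // A version is visible to a transaction reading at time ts iff
    // begin_ts <= ts < end_ts.
    bool visible(const RecordVersion& v, uint64_t ts) {
        return v.begin_ts.load() <= ts && ts < v.end_ts.load();
    }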

H-Store/VoltDB

H-Store/VoltDB is a Partition System: each Partition is deployed on one node, and tasks on each node execute serially. H-Store/VoltDB has no fine-grained concurrency control, only simple lock control: each Partition corresponds to one lock, and a transaction must acquire a Partition's lock before executing on it. To handle cross-partition execution, H-Store/VoltDB requires a transaction to acquire the locks of all relevant Partitions simultaneously before it starts, which is equivalent to locking every Partition the transaction touches at once.
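In outline, the rule amounts to taking every relevant partition lock up front, as in this simplified C++ sketch (real H-Store schedules transactions through per-partition queues rather than raw mutexes):

    #include <algorithm>
    #include <functional>
    #include <mutex>
    #include <vector>

    struct Partition { std::mutex lock; /* data, indexes ... */ };

    // A transaction acquires every partition it touches before starting;
    // locking in a fixed (address-sorted) order avoids deadlock between
    // concurrent cross-partition transactions.
    void run_transaction(std::vector<Partition*> parts,
                         const std::function<void()>& body) {
        std::sort(parts.begin(), parts.end());
        for (auto* p : parts) p->lock.lock();
        body();                                 // runs serially per partition
        for (auto* p : parts) p->lock.unlock();
    }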

H-Store/VoltDB adopts a two-tier architecture: the upper layer is the Transaction Coordinator, which determines whether a transaction needs to execute across Partitions; the lower layer is the execution engine, responsible for data storage, indexing, and transaction execution, using a single-version row-storage structure.

The data blocks in H-Store/VoltDB fall into fixed-length and variable-length categories. Every record in a fixed-length data block has the same length, and an 8-byte address is used in the index to point to each record's position within the block; variable-length attributes are stored in variable-length data blocks, and the fixed-length record holds a pointer (Non-Inline Data) to their location there. With this organization, a compressed Block Look-Up Table can be used to address data records.
[Figure: H-Store/VoltDB fixed-length and variable-length data blocks addressed through a Block Look-Up Table]
HyPer

HyPer is a multi-version Non-Partition System: every transaction can access any data. HyPer is also a hybrid TP/AP system built for HTAP workloads, and it realizes mixed TP and AP processing through a Copy on Write mechanism. Suppose the system is running transaction processing on a data set and an AP request arrives: HyPer takes a snapshot of the data set through the operating system's fork() function and runs the analysis on that snapshot. Copy on Write does not copy all the data in memory; only when data is changed by the OLTP workload does the snapshot actually copy the original pages, while unchanged data is mapped to the same physical memory through virtual addresses.
[Figure: HyPer snapshotting for OLAP via fork() and Copy on Write]
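The pattern can be illustrated with a short POSIX/C++ sketch; this shows the general fork-based snapshot technique the paper describes, not HyPer's actual code:

    #include <sys/wait.h>
    #include <unistd.h>

    // The OLTP process owns the in-memory data. For an analytical query,
    // fork() duplicates only the page tables; physical pages stay shared
    // until the parent writes to one, at which point the OS copies just
    // that page (copy-on-write).
    void run_analytics_on_snapshot(void (*analyze)()) {
        pid_t pid = fork();
        if (pid == 0) {      // child: sees a consistent snapshot of memory
            analyze();       // long-running OLAP work on the frozen state
            _exit(0);
        }
        // parent: would normally continue OLTP immediately; here we simply
        // reap the child when it finishes
        waitpid(pid, nullptr, 0);
    }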
In addition, HyPer uses multi-version control: all updates are applied to the original record in place, each record maintains an Undo Buffer that stores incremental update data, and a Version Vector indicates the current latest version. A transaction can thus find the modified record, and earlier versions can be reconstructed by applying the incremental data in reverse. Versions can also be merged or reclaimed periodically.
[Figure: HyPer in-place updates with Undo Buffers and Version Vectors]
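In outline, reading an older version means walking the undo chain backwards from the in-place record; a simplified C++ sketch (not HyPer's actual structures):

    #include <cstddef>
    #include <cstdint>

    // Updates happen in place; each overwritten value is saved as a delta
    // in an Undo Buffer, forming a newest-first chain of prior versions.
    struct UndoEntry {
        uint64_t    version;     // timestamp of the overwriting transaction
        std::size_t attr;        // which attribute was changed
        int64_t     old_value;   // prior value, applied in reverse to rebuild
        UndoEntry*  next;        // next-older delta
    };

    struct Record {
        int64_t    attrs[4];     // the latest version, stored in place
        UndoEntry* undo_chain;   // incremental deltas, newest first
    };

    // Reconstruct attrs[i] as of timestamp ts by undoing newer changes.
    int64_t read_as_of(const Record& r, std::size_t i, uint64_t ts) {
        int64_t v = r.attrs[i];
        for (UndoEntry* u = r.undo_chain; u && u->version > ts; u = u->next)
            if (u->attr == i) v = u->old_value;
        return v;
    }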
SAP HANA

SAP HANA is a Non-Partition hybrid-storage system. A physical record passes through three stages in the storage medium: 1. records from transaction processing are stored in the L1-Delta (row storage); 2. records are then converted to columnar form and stored in the L2-Delta (column storage, unsorted dictionary encoding); 3. finally they reach HANA's Main store (column storage, highly compressed, with sorted dictionary encoding). Each record is mapped and merged from row storage into column storage, which amounts to a multi-version design.
[Figure: SAP HANA record lifetime across L1-Delta, L2-Delta, and Main storage]

— Index technology in database management systems —

In the field of in-memory databases, index design mainly considers two aspects: cache-aware indexing (Cache-Awareness) and parallel processing across multiple cores and CPUs (Multi-Core and Multi-Socket Parallelism).

Since in-memory databases are no longer constrained by disk I/O, the purpose of indexing becomes accelerating access between the CPU and memory. Although memory is cheap now, memory speed still grows far more slowly than CPU frequency. For an in-memory database, indexing therefore aims to supply the CPU with data in time, moving the required data into the CPU cache as fast as possible.

As for multi-core and multi-CPU parallelism, researchers in the 1980s began to consider how to construct indexes sensibly when both the data structures and the data live in memory. Among this work, the MM-DBMS project at the University of Wisconsin proposed the T-Tree in 1986: a self-balancing binary search tree index in which each node stores the records within a value range and holds two pointers to its two child nodes. The T-Tree's memory overhead is small; because memory was expensive in the 1980s, the chief criterion was not optimal performance but minimal memory footprint. The T-Tree's disadvantage is performance: it needs periodic rebalancing, and scanning plus pointer chasing also hurt it. Early commercial systems such as TimesTen used the T-Tree structure.
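A T-Tree node can be sketched as follows in C++ (capacity and fields illustrative):

    #include <cstdint>

    // A T-Tree node is a binary-tree node holding an ordered array of
    // record pointers covering a key range, rather than a single key,
    // which keeps the per-key memory overhead low.
    struct TTreeNode {
        static constexpr int CAP = 32;
        void*      records[CAP];       // pointers to records, sorted by key
        int        count;              // records currently in this node
        int64_t    min_key, max_key;   // bounding range used to steer descent
        TTreeNode* left;               // subtree with keys below min_key
        TTreeNode* right;              // subtree with keys above max_key
    };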

So why must index design consider Cache-Awareness? In 1999, a study found that Cache Stalls / Cache Misses during memory access were the main performance bottleneck of memory-resident systems. The study evaluated four systems (A/B/C/D), measuring the percentage of time each spent on: Computation, Memory Stalls, Branch Mispredictions, and Resource Stalls. Computation is the time actually spent computing; Memory Stalls is the time spent waiting for memory access; Branch Mispredictions is the cost of failed CPU branch predictions; Resource Stalls is the time spent waiting for other resources, such as the network or disk.
[Figure: Breakdown of execution time into Computation, Memory Stalls, Branch Mispredictions, and Resource Stalls across the four tested systems]
As the figure shows, Memory Stalls account for a large share of the overhead in the different test scenarios. For in-memory index structures, then, the main purpose of cache-aware index design is to reduce Memory Stalls.

CSB+-Tree
Below we introduce several typical in-memory index structures. The first is the CSB+-Tree, which is logically still a B+-Tree but with some changes. First, the size of each node is a multiple of the Cache Line length. Second, the CSB+-Tree organizes all the children of a node into a Children Group, and a parent points to its Children Group through a single pointer, which reduces the number of pointers in the data structure. Because CSB+-Tree nodes match the Cache Line length, sequential reads obtain good prefetching performance. When the tree splits, the CSB+-Tree reallocates the Group in memory; since CSB+-Tree nodes need not be contiguous in memory, it simply lays them out and creates new pointer links.
[Figure: CSB+-Tree nodes with contiguous Children Groups]
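The key idea, one child pointer per node rather than one per child, can be sketched in C++ (fanout illustrative; the paper tunes node size to the cache line):

    #include <cstdint>

    // All children of a node sit contiguously in one Children Group, so
    // the parent needs a single child pointer plus its keys; child i is
    // found by arithmetic rather than by storing one pointer per child.
    struct CSBNode {
        static constexpr int FANOUT = 14;  // sized so a node fills cache lines
        int      nkeys;
        int64_t  keys[FANOUT];
        CSBNode* first_child;              // base address of the child group
    };

    inline CSBNode* child(const CSBNode* n, int i) { return n->first_child + i; }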
PB+-Trees

Another example is PB+-Trees (Pre-fetching B+-Trees). This is not a new structure, but a B+-Tree implemented in memory with each node's size equal to a multiple of the Cache Line length. The special feature of PB+-Trees is the introduction of pre-fetching throughout the implementation, adding extra information to help the system prefetch data.

PB+-Trees tend to organize data in flatter, wider trees. The paper reports Search and Scan performance: Search improves by a factor of 1.5 and Scan by a factor of 6. Compared with the CSB+-Tree, PB+-Trees spend a smaller share of time in Data Cache Stalls when processing a Search.

Another performance comparison: without prefetching, reading a three-level index whose node size equals two Cache Lines takes 900 clock cycles, while with prefetching only 480 cycles are needed. PB+-Trees additionally add a Jump Pointer Array to each node, used during a Scan to decide how many Cache Lines to skip when prefetching the next value.
[Figure: PB+-Tree prefetching and the Jump Pointer Array]
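The prefetching idea can be sketched with the GCC/Clang __builtin_prefetch intrinsic; this is a sketch of the technique, not the paper's implementation:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t CACHE_LINE = 64;   // typical x86 cache-line size

    // A PB+-Tree node spans several cache lines. Before searching inside
    // it, prefetches are issued for all of its lines so the memory fetches
    // overlap instead of stalling one line at a time.
    struct PBNode {
        int64_t keys[30];
        void*   children[31];
        int     nkeys;
    };

    inline void prefetch_node(const PBNode* n) {
        const char* p = reinterpret_cast<const char*>(n);
        for (std::size_t off = 0; off < sizeof(PBNode); off += CACHE_LINE)
            __builtin_prefetch(p + off);     // GCC/Clang prefetch hint
    }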
Bw-Tree

The Bw-Tree is the index used in the Hekaton system. Its basic building block is the Compare-and-Swap (CAS) instruction-level atomic operation, which compares a memory value against an expected value and updates it to a new value only if the two are equal; otherwise the update fails. For example, if a thread reads 30 at a memory address and tries to swap it to 40, the operation succeeds only if the address still holds 30 at that instant; if another thread changed it in the meantime, the CAS does not succeed. Because the comparison and the update happen as one atomic step, CAS can implement uninterrupted data exchange in multi-threaded programming.
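In C++ the same primitive is exposed through std::atomic's compare_exchange operations; a minimal illustration:

    #include <atomic>
    #include <cassert>

    int main() {
        std::atomic<int> value{30};

        int expected = 30;
        // Atomically: if value still equals 30, set it to 40, return true.
        bool ok = value.compare_exchange_strong(expected, 40);
        assert(ok && value.load() == 40);

        expected = 30;                   // a stale expectation
        ok = value.compare_exchange_strong(expected, 50);
        assert(!ok && expected == 40);   // fails and reports the actual value
        return 0;
    }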

The Bw-Tree has a Mapping Table in which every node has a slot, and the slot stores the node's address in memory. In the Bw-Tree, the pointer from a parent node to a child node is not a physical pointer but a logical one: what is recorded is a position in the Mapping Table, not a real memory address.
[Figure: Bw-Tree Mapping Table translating logical node IDs to physical addresses]
The Bw-Tree updates a node not by modifying it in place but by adding a Delta Record (incremental record) that holds the modified content and pointing the node's Mapping Table slot at that Delta Record; a further update prepends yet another Delta Record. Reading a node's content actually merges all of its Delta Records. Because Bw-Tree updates are performed through an atomic operation, only one of several competing changes can succeed, so the structure is Latch-Free: contention is resolved relying only on Compare-and-Swap, with no need for a lock mechanism.
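The delta-install step can be sketched as a CAS on the Mapping Table slot (simplified; the real Bw-Tree also handles consolidation and structure-modification deltas):

    #include <atomic>
    #include <cstddef>

    // A physical entry is either a base page or a delta prepended to one.
    struct Node {
        bool  is_delta;
        Node* next;     // for a delta: the previous head it was prepended to
        // ... page contents, or the single change a delta describes ...
    };

    // Mapping Table: logical node ID -> current physical head (delta/base).
    std::atomic<Node*> mapping_table[1 << 16];

    // Prepend a delta describing an update to logical node id. Of several
    // competing writers, exactly one CAS succeeds; losers reload the head
    // and retry, so no latch is ever taken.
    void install_delta(std::size_t id, Node* delta) {
        Node* head = mapping_table[id].load();
        do {
            delta->next = head;          // chain onto the observed head
        } while (!mapping_table[id].compare_exchange_weak(head, delta));
    }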


Adaptive Radix Tree

HyPer's index tree is the Adaptive Radix Tree. A traditional Radix Tree is a prefix tree whose advantage is that the tree's depth does not depend on the number of indexed values, only on the length of the Search Key. Its disadvantage is that each node must reserve child slots for every possible value, giving each node a large storage overhead.

The Adaptive Radix Tree instead provides node formats of different sizes, holding 4, 16, 48, or 256 children respectively. Node4 is the smallest type, storing up to 4 child pointers; its keys record the stored byte values, and each pointer can point to a leaf or to the next inner node. Node16 is structurally identical to Node4 but holds 16 key bytes (unsigned chars) and 16 pointers; when a 17th key byte must be stored, the node grows into a Node48. Node48 is structured differently from Node4/16: it has 256 index slots and 48 pointers. The 256 slots correspond to the values 0-255 of an unsigned char; each slot's value, between 1 and 48, gives the position of the matching pointer, and a slot holds 0 if that byte is absent. When a 49th key byte must be stored, the node grows into a Node256, whose structure is simple: it directly stores 256 pointers, one for each unsigned char value 0-255.
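The four node formats can be sketched in C++ as follows (field sizes follow the scheme above; growth logic omitted):

    #include <cstdint>

    struct ARTNode { uint8_t type; };    // common header: which format below

    struct Node4 : ARTNode {             // up to 4 children
        uint8_t  keys[4];                // sorted key bytes
        ARTNode* children[4];            // children[i] matches keys[i]
    };

    struct Node16 : ARTNode {            // same layout, up to 16 children
        uint8_t  keys[16];
        ARTNode* children[16];
    };

    struct Node48 : ARTNode {            // indexed by the key byte itself
        uint8_t  child_index[256];       // 0 = absent, 1..48 = slot number
        ARTNode* children[48];
    };

    struct Node256 : ARTNode {           // one direct slot per byte value
        ARTNode* children[256];          // nullptr = absent
    };

    // Looking up one key byte in a Node48:
    inline ARTNode* find_child48(const Node48* n, uint8_t byte) {
        uint8_t slot = n->child_index[byte];
        return slot ? n->children[slot - 1] : nullptr;
    }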

For example, to index the integer +218237439: its 32-bit binary representation splits into 4 bytes, which in decimal are 13, 2, 9, and 255; this is the Search Key. In the index tree the first level is a Node4, and 13 is stored there; the remaining bytes descend to the lower levels. The second level is a Node48 storing 2, and each subsequent byte is stored one level further down. Since this integer is represented in 4 bytes, there are 4 levels in total. Each node's structure may differ: the key is stored byte by byte, in order, and the node type at each level depends on how many distinct values currently occur there; when the current type is too small, the node grows. The number of keys a node can hold thus changes dynamically, which saves space and improves cache locality.
[Figure: Adaptive Radix Tree with Node4/Node16/Node48/Node256 levels indexing +218237439]
In addition, the Adaptive Radix Tree uses a path-compression mechanism: if a node on a path has only a single child, the two are merged. The ART adopts this structure because each node is about the size of a Cache Line, so operations can be implemented on a per-Cache-Line basis.

OLFIT on B+-Trees

OLFIT on B+-Trees (Optimistic Latch Free Index Access Protocol) is the indexing technique adopted by HANA/P*Time; it guarantees CPU Cache Coherence for the index on multi-core machines. In a multi-processor architecture, the caches of several CPUs hold data from the same memory. In an in-memory database, data is read into the corresponding cache before processing; if the memory data changes while the cached copy is being processed, the cached data becomes invalid because it no longer matches memory. Cache Coherence addresses this synchronization problem.

Consider the scenario in the figure below: a tree data structure in memory is processed by 4 CPUs, each with its own cache. Suppose CPU-P1 first reads n1, n2, and n4, so they sit in its cache, and CPU-P2 later reads n1, n2, and n5. Assume the data structure is not Latch-Free: if modification is allowed during reads, a reader must take a latch before reading and release it afterwards. Because the latch is stored together with the data in an in-memory database, acquiring it changes memory even though the data itself has not changed, and the hardware treats that memory as modified. Thus, when multiple CPUs read the same data, only the last reader's cached copy stays valid and earlier reads are invalidated; this happens even for purely read-only workloads, simply because the latch state changes and the CPU caches are invalidated. OLFIT therefore designed a mechanism in which write operations take latches but read operations do not. OLFIT maintains read/write consistency through a version number: before reading, a CPU copies the node's version into a local register, and after reading the data it checks whether the version still matches the one read before. If it matches, execution continues normally; if not, the read was inconsistent and is retried. In this way, read requests no longer invalidate other CPUs' caches.
[Figure: Four CPUs with private caches accessing a shared in-memory tree]
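The read side can be sketched as a version-validated optimistic read (simplified; the published protocol also checks a latch bit and handles node splits):

    #include <atomic>
    #include <cstdint>

    struct OLFITNode {
        std::atomic<uint64_t> version;  // bumped by writers on every update
        int64_t keys[32];               // node content (illustrative)
        int     nkeys;
    };

    // Writers latch the node and bump the version; readers take no latch.
    // A reader snapshots the version, reads, then re-checks: a changed
    // version means a writer intervened and the read is retried. Reads
    // never write to the latch word, so they never invalidate the cache
    // lines of other CPUs.
    template <typename F>
    auto optimistic_read(const OLFITNode& n, F read_body) {
        for (;;) {
            uint64_t before = n.version.load(std::memory_order_acquire);
            auto result = read_body(n);   // read the node's content
            uint64_t after = n.version.load(std::memory_order_acquire);
            if (before == after) return result;  // consistent snapshot: done
        }
    }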
This example shows that in-memory databases face different design considerations from disk-based databases: once disk I/O is removed as a factor, other performance limits must be taken into account.

Skiplists

Skiplists are the technique used by MemSQL's data processing engine. The bottom layer is an ordered linked list, and each higher level samples elements from the level below with a certain probability (P-value) to form another list. A search over a large list starts at the topmost level and steps down level by level, similar to binary search, and the granularity can be tuned to the situation. The reason for this design is that every insertion into a linked list can be completed with a Compare-and-Swap atomic operation, eliminating the overhead of locking.
[Figure: Skiplist with probabilistically sampled upper-level lists]
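A sketch of the two ingredients, probabilistic level assignment and CAS-based insertion, shown for a single level (a full lock-free skiplist needs considerably more care):

    #include <atomic>
    #include <cstdint>
    #include <random>

    // Tower height: each additional level is kept with probability P = 1/2.
    int random_level(std::mt19937& rng, int max_level) {
        int lvl = 1;
        while (lvl < max_level && (rng() & 1)) ++lvl;
        return lvl;
    }

    struct SkipNode {
        int64_t                key;
        std::atomic<SkipNode*> next;   // bottom-level link (one level shown)
    };

    // Lock-free insert after pred: the CAS succeeds only if pred->next is
    // still succ, i.e. no concurrent insert slipped in between; on failure
    // the caller re-traverses and retries.
    bool insert_after(SkipNode* pred, SkipNode* node) {
        SkipNode* succ = pred->next.load();
        node->next.store(succ);
        return pred->next.compare_exchange_strong(succ, node);
    }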
— Summary of this article —

This article first introduced the data organization of in-memory databases, covering data partitioning, the difference between Partition and Non-Partition systems, and storage formats, and compared the actual implementations of four products. It then introduced six indexing techniques for in-memory database systems, briefly illustrating how index queries work through examples. The next article will continue the analysis of in-memory databases, discussing how they optimize for query performance and availability from the perspectives of concurrency control, persistent storage, and compiled queries.

Note: This article draws on the following materials:

  1. Pavlo, Andrew & Curino, Carlo & Zdonik, Stan. (2012). Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. Proceedings of the ACM SIGMOD International Conference on Management of Data. DOI: 10.1145/2213836.2213844.

  2. Kemper, Alfons & Neumann, Thomas. (2011). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. Proceedings - International Conference on Data Engineering. 195-206. DOI: 10.1109/ICDE.2011.5767867.

  3. Faerber, Franz & Kemper, Alfons & Larson, Per-Åke & Levandoski, Justin & Neumann, Thomas & Pavlo, Andrew. (2017). Main Memory Database Systems. Foundations and Trends in Databases. 8. 1-130. DOI: 10.1561/1900000058.

  4. Sikka, Vishal & Färber, Franz & Lehner, Wolfgang & Cha, Sang & Peh, Thomas & Bornhövd, Christof. (2012). Efficient Transaction Processing in SAP HANA Database –The End of a Column Store Myth. DOI: 10.1145/2213836.2213946.

  5. Diaconu, Cristian & Freedman, Craig & Ismert, Erik & Larson, Per-Åke & Mittal, Pravin & Stonecipher, Ryan & Verma, Nitin & Zwilling, Mike. (2013). Hekaton: SQL server's memory-optimized OLTP engine. 1243-1254. DOI: 10.1145/2463676.2463710.
