Combining software and hardware: storage architecture optimization practice of the distributed database ZNBase

ZNBase is the first distributed database project under the OpenAtom Open Source Foundation; it was open-sourced and donated by the Inspur big data team. This article introduces the storage architecture of ZNBase and the optimizations the ZNBase team has made to its KV storage engine.

ZNBase overall storage architecture

The Yunxi database ZNBase adopts a layered architecture divided into a computing layer and a storage layer. Its overall architecture is shown in the figure below:

In OLTP scenarios, when an application sends SQL statements to the cluster, the data is ultimately read from and written to the storage layer as key-value (KV) pairs. Each ZNBase node starts with at least one store. The storage layer uses the RocksDB KV storage engine by default. Each store may contain multiple Ranges; a Range is the lowest-level unit of KV data and is replicated across the cluster using the Raft consensus protocol.
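To make this mapping concrete, the sketch below shows, in simplified form, how a single row insert can become one key-value pair that is then routed to the Range owning that key. The key format, IDs, and function name are illustrative assumptions for this article, not ZNBase's actual encoding:

```cpp
// Hypothetical illustration only: ZNBase's real key encoding is not shown in
// this article. The sketch just shows the idea of a SQL row becoming a KV pair
// that lands in the Range whose key span contains the key.
#include <cstdint>
#include <iostream>
#include <string>

// Encode a primary-key lookup key of the (made-up) form /Table/<tableID>/<indexID>/<pk>.
std::string EncodeRowKey(uint64_t table_id, uint64_t index_id, uint64_t pk) {
    return "/Table/" + std::to_string(table_id) + "/" +
           std::to_string(index_id) + "/" + std::to_string(pk);
}

int main() {
    // INSERT INTO t(id, name) VALUES (42, 'alice') conceptually becomes:
    std::string key = EncodeRowKey(52, 1, 42);   // 52 and 1 are made-up IDs
    std::string value = "alice";                 // remaining columns, encoded
    std::cout << key << " => " << value << "\n";
    // The key falls inside exactly one Range's [start_key, end_key) span;
    // that Range's Raft group replicates the write across the cluster.
}
```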

To support HTAP scenarios, ZNBase's row-store data is synchronized to a columnar storage engine through Raft Learner replicas, and the storage layer also exposes a vectorized interface to meet analytical (AP) requirements through columnar storage, computation pushdown, and multi-node parallel computing. In addition, the storage layer provides a time-series interface for managing time-series data.

Next, we will focus on the KV storage cluster in the ZNBase storage layer architecture.

RocksDB engine

The storage layer of ZNBase uses RocksDB to store data by default. RocksDB is a high-performance KV storage engine open-sourced by Facebook; keys and values can be arbitrary byte streams. Its architecture, an implementation of an LSM-Tree, is shown in the figure:

The RocksDB write process is as follows:

1. A write is first appended sequentially, in batch form, to the WAL (write-ahead log) for failure recovery.

2. The data is then written to the in-memory Memtable, which is implemented as a SkipList by default.

3. When the Memtable reaches a certain size (64 MB by default), it is converted into an immutable Memtable (Immutable).

4. An asynchronous thread flushes the Immutable Memtable to the L0 level on disk as an SST file.

5. SST files on disk are organized into multiple levels; SST files in upper levels are pushed down to lower levels through compaction, during which garbage collection cleans up obsolete data.

As can be seen, the most recently written data in RocksDB resides in memory or in the upper-level SST files, so a read first checks the Memtable and the Immutable Memtable, and then the SST files level by level. Hot data that has just been written can therefore be read quickly. In addition, RocksDB provides a Block Cache for read caching.
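The write and read path described above can be exercised directly through RocksDB's public C++ API. The following is a minimal sketch assuming a local test database; the 128 MB cache size and the path are arbitrary illustration values rather than ZNBase's configuration:

```cpp
// Minimal sketch of the RocksDB write/read path: writes go through the WAL and
// a 64 MB Memtable; reads hit the Memtable / Block Cache before the SST levels.
#include <cassert>
#include <string>

#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/table.h"

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.write_buffer_size = 64 << 20;  // 64 MB Memtable before it becomes Immutable

    // Block Cache for read caching of SST data blocks.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.block_cache = rocksdb::NewLRUCache(128 << 20);  // 128 MB, arbitrary
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/znbase_demo", &db);
    assert(s.ok());

    // Write: appended to the WAL, then inserted into the Memtable (SkipList).
    s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
    assert(s.ok());

    // Read: Memtable / Immutable first, then SST files level by level.
    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "key1", &value);
    assert(s.ok() && value == "value1");

    delete db;
}
```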

RocksDB itself is a high-speed KV storage engine with an LSM-Tree architecture, and both reads and writes use memory as much as possible. RocksDB provides atomic batch writes, snapshots, and other features that make it easy for the upper layer to implement transaction control. On top of RocksDB, ZNBase also supports encrypting on-disk data with the AES algorithm, so that even if a disk is stolen, the data on it cannot be read.
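As a small illustration of the atomic batch write and snapshot features just mentioned, again using RocksDB's public API rather than ZNBase-specific code (the keys and values are made up):

```cpp
// Atomic batch writes and snapshots: the two RocksDB features that help the
// upper layer build transaction control. Assumes `db` is an open rocksdb::DB*
// as in the previous example.
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

void BatchAndSnapshotDemo(rocksdb::DB* db) {
    // Atomic batch: either all of these mutations become visible, or none do.
    rocksdb::WriteBatch batch;
    batch.Put("account/alice", "90");
    batch.Put("account/bob", "110");
    batch.Delete("pending/transfer-1");
    db->Write(rocksdb::WriteOptions(), &batch);

    // Snapshot: a consistent, read-only view of the database at this moment.
    const rocksdb::Snapshot* snap = db->GetSnapshot();
    rocksdb::ReadOptions read_options;
    read_options.snapshot = snap;

    std::string value;
    db->Get(read_options, "account/alice", &value);  // reads against the snapshot view

    db->ReleaseSnapshot(snap);
}
```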

Because RocksDB adopts the LSM-Tree architecture, it also has certain problems: with large volumes of data it suffers from relatively high read amplification, write amplification, and space amplification, and its performance degrades noticeably. To improve ZNBase's performance on massive data, the ZNBase team leveraged Inspur's hardware strengths in two ways: it developed a software-hardware co-designed storage engine that uses SPDK drivers on dedicated ZNS SSD hardware to increase read and write speed, and it also developed a quasi-memory engine for large-memory scenarios.

SPDK + ZNS SSD

Solid-state drives (SSDs) are rapidly expanding their presence in data centers, and the new flash media offers advantages over traditional storage media in performance, power consumption, rack space, and more.

Inspur's self-developed ZNS SSD achieves leaps in capacity, lifespan, cost, ease of use, and performance. A ZNS (Zoned Namespace) SSD exposes the FTL (Flash Translation Layer) to the user so that the SSD's performance can be fully exploited. ZNS technology is used in cloud scenarios and is mainly suited to large-capacity data such as high-definition video and images. ZNBase integrates with ZNS SSDs: through intelligent data placement it achieves better space utilization and cuts write amplification by more than half; files can be separated into zones according to their life cycles to minimize garbage collection; SST files can be separated into hot and cold data by LSM level to improve access efficiency; and the minimized write amplification improves read performance while reducing compaction cost and garbage-collection overhead. In theory, a ZNS SSD can deliver a 15% performance improvement and a 10% cost saving.
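The following conceptual sketch illustrates the idea of separating data by lifetime: SST files are assigned to zone groups according to their LSM level, so that data that dies together can be erased together. The zone grouping and function below are illustrative assumptions, not ZNBase's actual placement policy:

```cpp
// Conceptual sketch of lifetime-based zone placement on a ZNS SSD. Files with
// similar lifetimes share zones, so whole zones can be reset without copying
// live data, which minimizes garbage collection and write amplification.
#include <cstdint>

enum class ZoneGroup : uint8_t {
    kHotShortLived,   // short-lived data: WAL and L0 files, rewritten frequently
    kWarmMidLSM,      // medium-lived data: L1-L2 SST files
    kColdDeepLSM      // long-lived data: bottom-level SST files, rarely rewritten
};

// Pick a zone group for a new SST file based on the LSM level it belongs to,
// separating hot and cold data by level as described above.
ZoneGroup PickZoneGroup(int lsm_level) {
    if (lsm_level <= 0) return ZoneGroup::kHotShortLived;
    if (lsm_level <= 2) return ZoneGroup::kWarmMidLSM;
    return ZoneGroup::kColdDeepLSM;
}
```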

 

Due to the physical characteristics of SSDs, data access is very fast compared with traditional media, and the performance bottleneck lies in the interface and protocol between the host and the device. As an intuitive example, consider flying from Beijing to the United States: at current flight speeds the plane spends 13 hours in the air, while security checks, customs clearance, and waiting add up to about 3 hours, which does not seem long against the total of 13 + 3 = 16 hours. Now imagine that flight speed increases a hundredfold and the flight time shrinks from 13 hours to 10 minutes; the 3-hour ground process suddenly looks very inefficient. In other words, as storage hardware performance advances rapidly, the performance and efficiency of the storage software stack play an increasingly important role in the overall storage system.

SPDK (Storage Performance Development Kit), initiated by Intel, provides a set of tools and libraries for writing high-performance, scalable, user-space storage applications. The foundation of SPDK is a user-space, polled-mode, asynchronous, lock-free NVMe driver that provides zero-copy, highly parallel access to NVMe devices from user-space applications.

NVMe stands for Non-Volatile Memory Express, the non-volatile memory host controller interface specification. It is a storage-device interface specification that defines the hardware interface and transport protocol, and it delivers higher read and write performance than the older AHCI standard. We can think of NVMe as an example of hardware advances driving the need for software innovation, a push that will become even more urgent as faster storage media come to market.

The SPDK project is likewise a product of hardware advances driving software innovation. Its goal is to fully exploit the latest performance advances in computing, networking, and storage hardware. So how does SPDK work? Its ultra-high performance comes from two core techniques: the first is user-mode operation, and the second is the polled-mode driver.

First, running device driver code in user mode is in contrast to running it in kernel mode. Moving the device driver out of kernel space avoids kernel context switches and interrupt handling, which saves a large amount of CPU overhead and leaves more instruction cycles for actually processing stored data. Whether the storage algorithm is complex or simple, and whether it performs deduplication, encryption, compression, or plain block reads and writes, fewer wasted instruction cycles mean better overall performance.

Second, the traditional interrupt-driven I/O model works passively: when there is I/O to process, an interrupt is raised, and after receiving the interrupt the CPU schedules resources to handle it. Using taxis as an analogy, the I/O tasks of traditional disk devices are like taxi passengers, and the CPU resources scheduled to handle I/O interrupts are like taxis. When the disk is much slower than the CPU, interrupt-handling resources are plentiful and the interrupt mechanism copes with these I/O tasks easily. This is like off-peak hours: taxi supply exceeds demand, empty cars are always cruising, and passengers can hail one at any time. During peak hours, however, such as hailing a car downtown on a Friday evening (without ride-hailing apps), it is common to see a car approach only to find passengers already in the back seat, and how long you must wait is unpredictable; you have surely seen people stranded at the roadside waving at cars. Similarly, when disk speed increases by a factor of thousands, a huge number of I/O interrupts are generated, and the interrupt-driven I/O processing of the Linux kernel becomes inefficient.

With a polled-mode driver, packets and blocks are dispatched immediately and waiting is minimized, resulting in lower latency, more consistent latency (less jitter), and better throughput.
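The difference can be illustrated with a small, self-contained sketch that simulates a device completion queue with an atomic counter and reaps completions by busy-polling. This is a conceptual illustration of polled-mode processing only, not the actual SPDK API:

```cpp
// Conceptual polled-mode I/O completion (not the SPDK API). A "device" thread
// posts completions to an atomic counter; the I/O thread busy-polls it instead
// of sleeping until an interrupt, trading CPU cycles for low, predictable latency.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> completions{0};  // stands in for the device completion queue

void FakeDevice(int io_count) {
    for (int i = 0; i < io_count; ++i) {
        std::this_thread::sleep_for(std::chrono::microseconds(10));  // "device latency"
        completions.fetch_add(1, std::memory_order_release);         // completion posted
    }
}

int main() {
    constexpr int kIoCount = 1000;
    std::thread device(FakeDevice, kIoCount);

    uint64_t reaped = 0;
    while (reaped < kIoCount) {
        // Poll: check for new completions and process them immediately,
        // with no interrupt, no context switch, and no wakeup latency.
        uint64_t done = completions.load(std::memory_order_acquire);
        while (reaped < done) {
            ++reaped;  // process one completed I/O
        }
    }
    device.join();
    std::printf("reaped %llu completions by polling\n",
                static_cast<unsigned long long>(reaped));
}
```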

Of course, a polled-mode driver (PMD) is not the most efficient way to handle I/O in every case. For low-speed SATA HDDs, the PMD mechanism brings little improvement to I/O performance while wasting CPU resources.

Quasi-memory engine

In addition to the storage engine that combines SPDK and ZNS SSD hardware and software, ZNBase also developed a quasi-memory engine for large-memory scenarios, in order to use memory more effectively and achieve faster read and write speeds.

The motivation for developing the quasi-memory engine stems from several limitations of RocksDB:

1. The single-threaded WAL write model cannot fully exploit the performance of high-speed devices.

2. A read may need to search multiple level-0 files as well as files at other levels, which causes significant read amplification. Especially after purely random writes, a read may have to examine almost every level-0 file, making read operations inefficient.

3. To mitigate the second point, RocksDB throttles foreground writes and triggers background compaction based on the number of level-0 files in order to balance read and write performance; this in turn causes performance jitter and prevents high-speed media from delivering their full performance (see the options sketch after this list).

4. The compaction process is difficult to control and can easily cause performance jitter and write amplification.
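For reference, the level-0 throttling mentioned in item 3 corresponds to standard RocksDB options. The values shown below are RocksDB's usual defaults; the article does not give ZNBase's actual settings:

```cpp
// Level-0 write flow control in RocksDB: foreground writes are slowed and then
// stopped as level-0 files accumulate faster than compaction can drain them.
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
    rocksdb::Options options;
    // Start a compaction once L0 accumulates this many files.
    options.level0_file_num_compaction_trigger = 4;
    // Slow down foreground writes when L0 reaches this many files...
    options.level0_slowdown_writes_trigger = 20;
    // ...and stop them entirely at this point, until compaction catches up.
    options.level0_stop_writes_trigger = 36;
    return options;
}
```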

In recent years, with growing DRAM capacity and falling unit prices, it has become feasible to store and process large amounts of data in memory. Reading and writing data in memory is several orders of magnitude faster than on disk, so keeping data in memory can greatly improve application performance.

Based on this idea, the design principles of the quasi-memory engine are as follows:

1. Store data in memory as much as possible to make full use of memory read and write performance.

2. Following the idea of separating indexes from data, indexes always reside in memory, while data may be partially stored on persistent storage devices depending on memory capacity.

3. Improve the CPU cache hit rate: through data-block reconstruction (splitting, expansion, and shrinking), KV pairs with adjacent keys are kept adjacent in memory to speed up key-value traversal.

4. Use an asynchronous flush mechanism to keep the I/O rate smooth and eliminate I/O bottlenecks.

A retrieval mechanism based on an ART index guarantees data access speed. The ART index uses a synchronization mechanism based on optimistic locking: read operations never block, and write operations use CAS (compare-and-swap) atomic operations on version information together with a retry mechanism.
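A minimal sketch of this optimistic scheme is shown below: readers validate a version counter instead of taking locks, and writers acquire the version word with CAS and retry on conflict. This is a generic illustration of the technique, not ZNBase's ART implementation:

```cpp
// Optimistic, version-based synchronization: readers never block, writers use
// CAS on the version word and retry on conflict.
#include <atomic>
#include <cstdint>

struct Node {
    // Low bit = "locked by a writer", upper bits = version counter.
    std::atomic<uint64_t> version{0};
    uint64_t payload = 0;  // stands in for the node's keys / child pointers
};

// Reader: never blocks; re-reads if a writer changed the node meanwhile.
uint64_t OptimisticRead(Node& n) {
    while (true) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                        // a writer holds the lock, retry
        uint64_t value = n.payload;                  // speculative read
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) return value;                  // nothing changed: read is valid
        // otherwise retry
    }
}

// Writer: acquires the "lock" with CAS, bumps the version on release.
void Write(Node& n, uint64_t new_payload) {
    while (true) {
        uint64_t v = n.version.load(std::memory_order_acquire);
        if (v & 1) continue;                         // another writer active, retry
        if (n.version.compare_exchange_weak(v, v + 1,
                                            std::memory_order_acquire)) {
            n.payload = new_payload;                 // exclusive access
            n.version.store(v + 2, std::memory_order_release);  // unlock + new version
            return;
        }
    }
}
```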

The WAL write path can be eliminated, because data recovery is already provided by the Raft log.

 

The overall processing flow of the quasi-memory engine is as follows:

1. Newly inserted data is stored in the Memtable and Immutable Memtable in a temporary memory area.

2. An asynchronous Mem-Flush thread moves the data in the Immutable Memtable into the memory storage area.

3. The memory storage area occupies as much memory as possible for storing data, and this data is indexed using the ART algorithm.

4. When the memory storage area is about to run out of space, an L1-Flush background thread is started to clean up the data in the memory storage area.

5. The L1-Flush background thread flushes all KV data and range-deletion data in the memory storage area to disk. Because the volume of data is large and the data is already sorted, it is written directly to the L1 level. Before the flush, it must be ensured that L1 contains no files; after the flush completes, a task is created to compact the L1 files downward.

The quasi-memory engine uses the ART algorithm as its main index, which supports fast range queries, and also uses a Hash index to speed up Get queries. The ART algorithm was chosen because its read and write performance is better than that of B+ trees, red-black trees, and binary trees.

See the paper: The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases.
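For reference, the adaptive inner-node layout from that paper looks roughly as follows; this simplified sketch omits leaves, path compression, and the grow/shrink logic:

```cpp
// Adaptive node types from the ART paper: each inner node is sized to its
// fan-out and is upgraded to the next size class only when it fills up, which
// keeps the tree both cache-friendly and space-efficient.
#include <cstdint>

struct ArtNode;  // base type for all inner nodes

struct Node4 {    // up to 4 children: keys searched linearly
    uint8_t  keys[4];
    ArtNode* children[4];
};

struct Node16 {   // up to 16 children: keys searched with a SIMD comparison
    uint8_t  keys[16];
    ArtNode* children[16];
};

struct Node48 {   // up to 48 children: a 256-entry index maps a key byte to a slot
    uint8_t  child_index[256];
    ArtNode* children[48];
};

struct Node256 { // up to 256 children: the key byte indexes the array directly
    ArtNode* children[256];
};
```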

Summary

This article has introduced ZNBase's RocksDB-based storage engine architecture and the limitations of RocksDB identified by the ZNBase team. By developing a software-hardware combined SPDK + ZNS SSD storage engine and a quasi-memory engine for large-memory scenarios, the team has greatly improved the storage engine's read and write performance.


More details about ZNBase can be found at:

Official code repository: https://gitee.com/ZNBase/zn-kvs

ZNBase official website: http://www.znbase.com/ 

If you have any questions about the related technologies or products, please submit an issue or leave a message in the community for discussion. Developers interested in distributed databases are also welcome to join in building the ZNBase project.

Contact email: [email protected]
