VLDB 2023 | CDSBen: A Performance Benchmark Model for the ByteDance veDB Database Storage System

Background

With business growing explosively and cloud-native technology maturing, a large number of cloud-native distributed database products have emerged. Some distributed databases that target OLTP scenarios emphasize the elasticity gained from an architecture that separates computing from storage. For the various databases in the industry built on this architecture, using real end-to-end database workloads to benchmark the underlying storage system has long faced the following problems:

  • For database-specific storage systems, there is no "de facto standard" benchmark model comparable to fio.

  • Database-specific storage systems are implemented quite differently from classic storage. If we still design benchmarks as though they were classic storage, without considering the characteristics of the database, the benchmark ends up disconnected from end-to-end database behavior.

To solve these problems, we set out to design a benchmark model for the dedicated storage system underlying ByteDance veDB, one that fits the characteristics of the database. The result is the CDSBen model, which uses machine learning to predict the IO pattern of the storage layer from the real end-to-end transaction pattern of the database, making it possible to benchmark the storage system realistically and accurately. The related paper, "CDSBen: Benchmarking the Performance of Storage Services in Cloud-native Database System at ByteDance", was published in VLDB 2023.

Introduction to veDB

veDB is a cloud-native distributed database built by ByteDance on a computing-storage separation architecture to serve OLTP scenarios. Its goals are:

  • High elasticity: The computing layer and the storage layer are decoupled, so each can scale out or in independently on demand, which removes the scalability limits of a single machine.

  • High cost performance: A series of in-depth optimizations to the computing and storage cores push performance as far as possible and reduce costs.

  • High ease of use: Fully compatible with open source database engines such as MySQL and PostgreSQL (PG), reducing the cost of learning and using it.

  • High reliability and high availability: Computing and storage support multiple replicas, cross-data-center deployment, fast PITR, and other capabilities that improve system availability and reliability.

The system architecture of veDB is as follows:

[Figure: veDB system architecture]

As can be seen from the above figure, veDB is divided into three layers:

  • Access layer: Provides authentication, flow control, read and write routing and other functions.

  • Computing layer: Fully compatible with open source database engines such as MySQL and PostgreSQL (PG); supports DML, DDL, and transactions.

  • Storage layer: A dedicated distributed storage system designed for databases that can support different database engines in the form of plug-ins.

Problems and challenges in benchmarking the veDB storage layer

The underlying storage of veDB is a dedicated distributed storage system designed for databases, called veDB DBStore. veDB DBStore is divided into a log storage module, which persists the WAL, and a PageStore module, which manages multi-version data pages. Under this architecture, we previously had two common ways to benchmark veDB DBStore:

  • Use an end-to-end database benchmark model (such as the TPC series or the sysbench series) and adjust various parameters for stress testing.

  • Use a modified YCSB-like or fio-like tool and adjust various parameters for stress testing.

However, both methods pose challenges for veDB DBStore:

  • The end-to-end benchmark models (TPC/sysbench) essentially execute SQL statements concurrently, and the storage layer cannot process SQL directly. TPC/sysbench therefore cannot bypass the computing layer to stress test the storage layer alone.

  • The modified YCSB-like or fio-like tools can stress test the storage layer alone and work for any classic storage system, but the IO pattern of classic storage has little to do with the transaction scenarios of a database, so the results are "out of touch" with the database.

Based on these issues and challenges, we designed the CDSBen model for PageStore, the module that manages multi-version data pages in veDB DBStore. We try to map the end-to-end transaction execution pattern of the database to the IO pattern of the storage system, so that we can benchmark PageStore with a realistic, end-to-end pattern without the computing layer.

CDSBen scheme

Learning-based model

CDSBen includes two learning models. One is an IOPS sequence prediction model based on a recurrent neural network. The other is a joint distribution prediction model based on a random forest, which predicts the joint distribution of the target address of read and write requests (PageStore segment ID) and the amount of data written.

The workflow of CDSBen is as follows:

  • Select one or more real business scenarios, extract workload features from the running logs of the veDB computing layer and storage layer, and use the extracted features to train the CDSBen model.

  • The user inputs the workload characteristics of the computing layer into CDSBen, and CDSBen predicts the workload characteristics corresponding to the storage layer.

  • CDSBen uses a modified YCSB to generate specific read and write requests and run them directly on the veDB DBStore for benchmark testing.

Next, we describe CDSBen in detail. It consists of three parts: feature extraction, model design, and load generation.

Feature extraction means extracting features from the running logs of the computing layer and the storage layer. The computing-layer features serve as the input, and the storage-layer features as the output, of both the IOPS prediction model and the joint distribution prediction model.

  • The workload of the computing layer is a mixture of multiple transaction types. We use the TPS of each transaction type as the feature vector of the computing-layer workload. For example, the veDB_OSS business includes four single-statement transactions: SELECT, INSERT, UPDATE, and DELETE. TPC-C contains five types of long transactions: New-Order, Payment, Order-Status, Delivery, and Stock-Level.

  • The workload of the storage layer consists of read and write requests to PageStore segments. For read requests, we look at the timestamp and target address (segment ID) of the request; for write requests, we look at the timestamp, the target address, and the amount of data written. We obtain the details of each read and write request from the storage engine log. From the timestamps of the requests we derive the IOPS sequence of the storage-layer workload, which describes both the intensity of the workload and how much it fluctuates. We also look at the joint distribution of the target address of read and write requests and the amount of data written.
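The two kinds of features described above can be illustrated with a minimal sketch. The log formats, the function names, and the one-second bucket width are assumptions made for illustration; they are not the formats used by veDB or CDSBen.

```python
import math
from collections import Counter


def tps_feature_vector(txn_log, txn_types, duration_s):
    """Per-type transactions per second, in the order given by txn_types."""
    counts = Counter(txn_type for _, txn_type in txn_log)
    return [counts[t] / duration_s for t in txn_types]


def iops_sequence(request_timestamps, start_s, duration_s):
    """Number of storage requests that fall into each one-second bucket."""
    buckets = [0] * duration_s
    for ts in request_timestamps:
        idx = math.floor(ts - start_s)
        if 0 <= idx < duration_s:  # ignore requests outside the window
            buckets[idx] += 1
    return buckets


# Hypothetical computing-layer log: (timestamp in seconds, transaction type).
txn_log = [(0.1, "SELECT"), (0.4, "SELECT"), (0.9, "INSERT"),
           (1.2, "UPDATE"), (1.5, "SELECT"), (1.8, "DELETE")]
txn_types = ["SELECT", "INSERT", "UPDATE", "DELETE"]
features = tps_feature_vector(txn_log, txn_types, duration_s=2)

# Hypothetical storage-layer request timestamps (seconds).
request_ts = [0.05, 0.2, 0.7, 1.1, 1.3, 1.35, 1.9]
iops = iops_sequence(request_ts, start_s=0, duration_s=2)

print(features)  # [1.5, 0.5, 0.5, 0.5]
print(iops)      # [3, 4]
```

Here `features` plays the role of the computing-layer feature vector and `iops` the role of the storage-layer IOPS sequence for the same two-second window.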

To simplify data processing, we treat every read request as a write request that writes 0 bytes. We use a two-dimensional array to represent the distribution of read and write requests over target address and amount of data written. For example, the entry in row i, column j of the array is the proportion of read and write requests whose target address is i and whose amount of data written is j.
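A minimal sketch of that two-dimensional array follows. It assumes write sizes are grouped into a few buckets; the bucket bounds, the request tuples, and the function name are hypothetical, and the original does not say how CDSBen discretizes write sizes.

```python
from bisect import bisect_left


def joint_distribution(requests, num_segments, size_buckets):
    """Return dist[i][j]: share of requests that target segment i and
    whose write size falls into bucket j.

    requests     -- list of (segment_id, bytes_written); reads carry 0 bytes
    size_buckets -- sorted inclusive upper bounds; the first is 0 (reads)
    """
    if not requests:
        raise ValueError("need at least one request")
    counts = [[0] * len(size_buckets) for _ in range(num_segments)]
    for segment_id, bytes_written in requests:
        # Sizes beyond the last bound are clamped into the last bucket.
        col = min(bisect_left(size_buckets, bytes_written),
                  len(size_buckets) - 1)
        counts[segment_id][col] += 1
    total = len(requests)
    return [[c / total for c in row] for row in counts]


# Hypothetical buckets: reads (0 bytes), writes up to 4 KiB, up to 16 KiB.
size_buckets = [0, 4096, 16384]
# Hypothetical storage-engine log entries: (segment ID, bytes written).
requests = [(0, 0), (0, 4096), (1, 0), (1, 8192),
            (1, 8192), (2, 0), (0, 0), (2, 4096)]

dist = joint_distribution(requests, num_segments=3, size_buckets=size_buckets)
for row in dist:
    print(row)
# [0.25, 0.125, 0.0]
# [0.125, 0.0, 0.25]
# [0.125, 0.125, 0.0]
```

All entries sum to 1, and column 0 holds the read requests because reads are recorded as writes of 0 bytes.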

After the feature extraction is completed, we train the two models using the collected data. The input of these two models is the feature vector of the computing layer workload, and the output is the IOPS sequence and joint distribution respectively. After the training is completed, we input the feature vector of the computing layer workload we want to simulate, and CDSBen will predict the feature vector of the corresponding storage layer workload. Then we use YCSB to randomly generate read and write requests based on the predicted feature vector of the storage layer workload and run them on the storage layer for performance testing.
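The load generation step can be sketched as sampling requests from the two predicted features. This is a simplified stand-in for the modified YCSB generator, not its actual implementation; the predicted values, the size buckets, and the function name are hypothetical.

```python
import random


def generate_requests(iops_seq, joint_dist, size_buckets, seed=0):
    """Draw iops_seq[t] requests for each second t from joint_dist.

    joint_dist[i][j] is the probability of a request to segment i whose
    write size is size_buckets[j]; a size of 0 means a read.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    cells = [(seg, col)
             for seg, row in enumerate(joint_dist)
             for col in range(len(row))]
    weights = [joint_dist[seg][col] for seg, col in cells]
    generated = []
    for second, count in enumerate(iops_seq):
        for seg, col in rng.choices(cells, weights=weights, k=count):
            size = size_buckets[col]
            op = "read" if size == 0 else "write"
            generated.append((second, op, seg, size))
    return generated


# Hypothetical model outputs for a three-second window.
predicted_iops = [3, 0, 2]
size_buckets = [0, 4096, 16384]
joint_dist = [[0.25, 0.125, 0.0],
              [0.125, 0.0, 0.25],
              [0.125, 0.125, 0.0]]

requests = generate_requests(predicted_iops, joint_dist, size_buckets)
for req in requests:
    print(req)  # (second, operation, segment ID, bytes written)
```

The IOPS sequence controls how many requests are issued in each second, while the joint distribution controls where each request goes and how much it writes.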

Note that because CDSBen looks only at the TPS of the computing layer during feature extraction, and not at the workload content itself (which SQL statements were run), the CDSBen model must be retrained for each workload. In practice, however, we found that the training overhead is small (it depends on the complexity of the model itself) and acceptable.

The main advantages of CDSBen are accuracy, flexibility, and ease of use. For accuracy, refer to the experimental results in the paper or to the results in the Model effect section below. Flexibility and ease of use come from the fact that CDSBen runs directly on the storage layer, like YCSB, and does not require deploying the computing layer, as TPC/sysbench do. At the same time, given only a simple input (the feature vector of the computing-layer load to simulate), CDSBen generates read and write requests close to the real ones, and it can answer what-if questions (changes in TPS and in the transaction mix).

In our experiments, the performance measured with read and write requests generated by CDSBen was significantly closer to the performance under real business traffic than the performance measured with YCSB.

Model effect

[Figure: IOPS curve predicted by CDSBen compared with the real online IOPS curve of the SYNC business]

The figure above compares the IOPS curve generated by the model with the IOPS curve of a real online business. For verification, we sampled an online business codenamed SYNC. The black line is the IOPS curve that the real business produces on veDB DBStore, and the blue line is the IOPS curve predicted by the CDSBen model. The two curves overlap closely, and their summary statistics match well (the interval average is 1046 for the real IOPS and 999 for the predicted IOPS).
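To put the two interval averages in proportion, the relative gap between them can be computed directly. The two numbers come from the text above; the relative-error metric is our own choice for illustration and is not necessarily the one used in the paper.

```python
real_mean_iops = 1046       # interval average of the real IOPS
predicted_mean_iops = 999   # interval average of the predicted IOPS

# Gap between the two averages, relative to the real value.
relative_error = abs(real_mean_iops - predicted_mean_iops) / real_mean_iops
print(f"{relative_error:.1%}")  # 4.5%
```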

Summary

Before CDSBen, if we wanted to benchmark the underlying storage system (veDB DBStore), we had no way to stress test the storage layer with real business workloads independently of the computing layer.

CDSBen builds a bridge between the transaction pattern (SQL statements) and the storage IO pattern (segment reads and writes), allowing us to test veDB DBStore with more realistic business workloads without the computing layer. Its benefits are:

  • It helps database R&D engineers accurately tune the underlying storage system during development, which keeps end-to-end performance stable with each veDB release and lets us continue to provide users with high-performance database services.

  • Accurately benchmarking the (peak) performance of the storage system under real workloads is a prerequisite for accurate flow control in the distributed database storage layer. Whatever the business scenario, it lets us serve the real business traffic of a large number of veDB instances smoothly and stably.
