Hybrid Blockchain Database Systems: Design and Performance (VLDB 2022)



Abstract

With the emergence of hybrid blockchain database systems, we aim to deeply analyze the performance and trade-offs among some representative systems. To achieve this, we implemented Veritas and BlockchainDB from the ground up. For Veritas, we provide two flavors targeting Crash Fault Tolerance (CFT) and Byzantine Fault Tolerance (BFT) application scenarios. Specifically, we use Apache Kafka to implement Veritas for CFT application scenarios, and Tendermint to implement Veritas for BFT application scenarios. We compare these three systems with the existing open-source implementation of BigchainDB. BigchainDB uses Tendermint for consensus and offers two flavors: a default implementation with blockchain pipelining, and an optimized version that adds parallel transaction validation on top of pipelining. Our experimental analysis confirms that the CFT designs commonly used by distributed databases exhibit higher performance than the blockchain-specific BFT designs. Beyond that, our extensive analysis highlights the design choices developers face and the trade-offs that must be made when designing a hybrid blockchain database system.



2 Background and Related Work

Over the past few years, several works integrating blockchain and databases have been published at database conferences [22, 30, 31, 37, 42, 46]. Such a system allows companies to conduct business with confidence because every database operation is tracked by the distributed ledger. However, different business scenarios have different requirements, so many hybrid blockchain database systems have been built for different application scenarios. In this section, we describe existing hybrid blockchain database systems and briefly mention some similar systems that can be classified as ledger databases.

2.1 Hybrid Blockchain Database Systems

Veritas [22] is a shared database design that integrates an underlying blockchain (ledger) to maintain auditable and verifiable proofs. The interaction between the database and the blockchain is done through validators. Validators fetch transaction logs from the database, check their correctness, and send their decisions to the other validators. The final decision agreed upon by all validators is recorded on the blockchain. Therefore, the correctness of the data can be verified against the historical records stored in the blockchain.
The alternative design of Veritas [22] that we chose to implement in this paper uses a shared verifiable table as its key-value store. In this design, each node has a full copy of the shared table and a tamper-proof log in the form of a distributed ledger, as shown in Figure 1. The ledger stores a log of updates (writes) to the shared table. Each node sends its local logs and receives remote logs via a broadcast service. Veritas' verifiable table design uses timestamp-based concurrency control: the timestamp of a transaction serves as the sequence number of its log entry, and each node keeps a watermark of the committed log sequence. When a transaction request is sent to a node, the node first executes the transaction locally and caches the result in memory. It then sends the transaction log to the other nodes via the broadcast service. A node flushes the transaction buffer and updates the committed log watermark as soon as it receives approval from all the other nodes.
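To make the commit path concrete, here is a minimal Go sketch of the ack-and-watermark bookkeeping described above. The types are our own, and the simplification that the watermark jumps to the highest fully acknowledged sequence number (rather than advancing contiguously) is our assumption, not Veritas' actual implementation:

package main

import "fmt"

// pendingTx is a locally executed transaction buffered in memory until
// every remote node acknowledges its log entry (hypothetical structure).
type pendingTx struct {
	seq    uint64 // transaction timestamp used as log sequence number
	acks   int    // approvals received from remote nodes
	result string // buffered execution result
}

type node struct {
	peers     int    // number of remote nodes
	watermark uint64 // highest committed log sequence
	buffer    map[uint64]*pendingTx
}

// onAck records a remote approval; once every peer has approved, the
// buffered result is flushed and the commit watermark is advanced.
func (n *node) onAck(seq uint64) {
	tx, ok := n.buffer[seq]
	if !ok {
		return
	}
	tx.acks++
	if tx.acks == n.peers {
		delete(n.buffer, seq) // flush the transaction buffer
		if seq > n.watermark {
			n.watermark = seq
		}
		fmt.Printf("committed tx %d (%s), watermark now %d\n", seq, tx.result, n.watermark)
	}
}

func main() {
	n := &node{peers: 2, buffer: map[uint64]*pendingTx{1: {seq: 1, result: "v1"}}}
	n.onAck(1) // first remote approval
	n.onAck(1) // second approval triggers the commit
}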

BigchainDB [42] uses MongoDB [3] as its storage engine. That is, each node maintains its own local MongoDB database, as shown in Figure 2. MongoDB is used because of its support for assets, the main data abstraction in BigchainDB. Tendermint [8] is used for consensus among the BigchainDB nodes. Tendermint is a BFT consensus protocol, which ensures that even if a node is controlled by a malicious hacker, the MongoDB databases on the other nodes are not affected. When a node receives a user's update request, it first generates the result locally and creates a transaction proposal, which is sent to the other nodes through Tendermint. Once a majority of the nodes in BigchainDB reaches consensus on the transaction, the nodes commit the buffered results and respond to the user client.
BlockchainDB [30] adopts the design of building a shared database on top of the blockchain. It differs from other systems because it divides the database into several shards, as shown in Figure 3, thereby reducing the overall storage overhead. While saving some storage space, this design results in higher latency, as data requests may require additional lookups to locate the appropriate shard. In BlockchainDB, each peer integrates a shard manager to locate the shard where a particular key resides. In terms of verification, it provides synchronous (online) and asynchronous (offline) verification, which is done in batches.
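A shard manager of this kind is, at its core, a deterministic key-to-shard mapping. The Go sketch below shows one common way to implement it, hashing the key and taking the result modulo the shard count; BlockchainDB's actual placement logic may differ:

package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of n shards by hashing it, so every peer
// computes the same location without coordination.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	for _, k := range []string{"alice", "bob", "carol"} {
		fmt.Printf("key %q -> shard %d\n", k, shardFor(k, 4))
	}
}

The extra lookup mentioned above is visible here: a read must first compute (or ask the shard manager for) the shard before contacting the peers that hold it.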
FalconDB [46] is a system that provides auditability and verifiability by requiring both server nodes and clients to persist data summaries. FalconDB's server nodes hold the shared database and the blockchain that records the shared database's update log. Client nodes save only the block headers of that blockchain. Using these headers, a client is able to verify the correctness of the data obtained from the server nodes. The client nodes act as intermediaries between users and the actual database.
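The header-based check boils down to recomputing a Merkle root from a server-supplied proof and comparing it with the root stored in the client's block header. The following Go sketch illustrates that verification; the proof layout and names are our assumptions for illustration, not FalconDB's actual format:

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// proofStep is one sibling hash on the path from a leaf to the root.
type proofStep struct {
	hash []byte
	left bool // true if the sibling is the left child
}

// verify recomputes the Merkle root from a leaf and its audit path and
// compares it with the root the client keeps in a block header.
func verify(leaf []byte, path []proofStep, root []byte) bool {
	h := sha256.Sum256(leaf)
	cur := h[:]
	for _, s := range path {
		var combined []byte
		if s.left {
			combined = append(append([]byte{}, s.hash...), cur...)
		} else {
			combined = append(append([]byte{}, cur...), s.hash...)
		}
		next := sha256.Sum256(combined)
		cur = next[:]
	}
	return bytes.Equal(cur, root)
}

func main() {
	// Two-leaf tree: root = H(H("a") || H("b")).
	a, b := sha256.Sum256([]byte("a")), sha256.Sum256([]byte("b"))
	root := sha256.Sum256(append(a[:], b[:]...))
	fmt.Println(verify([]byte("a"), []proofStep{{hash: b[:], left: false}}, root[:])) // true
}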
ChainifyDB [37] proposes a new transaction processing model called “Whatever-Ledger Consensus” (WLC). Unlike other processing models, WLC makes no assumptions about the behavior of the local database. The main principle of WLC is to seek consensus on the effect of transactions, not the order of transactions. When a ChainifyDB server receives a transaction from a client, it requests help from the protocol server to verify the transaction, and then sends the transaction proposal to the ordering server. The ordering server batches proposals into a block in FIFO order and distributes the block via Kafka [20]. When the transaction is approved by the consensus server, it will be executed sequentially in the underlying database of the execution server.
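The WLC principle of agreeing on the effect of a transaction rather than its order can be pictured as nodes comparing digests of the state they reached after execution. The Go sketch below is our simplified illustration of that idea, not ChainifyDB's actual protocol:

package main

import (
	"crypto/sha256"
	"fmt"
)

// effectDigest hashes the state a node reached after executing a
// transaction; under WLC, nodes seek agreement on this effect.
func effectDigest(state string) [32]byte {
	return sha256.Sum256([]byte(state))
}

// agreeOnEffect reports whether all nodes produced the same effect,
// regardless of how each local database arrived at it.
func agreeOnEffect(states []string) bool {
	first := effectDigest(states[0])
	for _, s := range states[1:] {
		if effectDigest(s) != first {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agreeOnEffect([]string{"balance=90", "balance=90", "balance=90"})) // true
	fmt.Println(agreeOnEffect([]string{"balance=90", "balance=90", "balance=85"})) // false
}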

At a high level, the blockchain relational database (BRD) [31] is very similar to Veritas [22]. However, in BRD, consensus is used to order blocks of transactions rather than to serialize transactions within a single block. Transactions in a BRD block are executed concurrently under Serializable Snapshot Isolation (SSI) on each node, and then validated and committed serially. PostgreSQL [5], which supports Serializable Snapshot Isolation, is used as the underlying storage engine in BRD. Transactions are executed independently on all the "untrusted" databases, but they are committed in the same serializable order determined by the ordering service.

2.2 Ledger Databases

Unlike hybrid blockchain database systems, ledger databases [6, 12, 26, 44, 48] are centralized in that the ledger is kept by a single organization. In this paper, we briefly describe some existing ledger databases. However, we did not evaluate and analyze their performance.

Amazon Quantum Ledger Database [6] (QLDB) contains an immutable log that records every data change in an exact and sequential manner. The log consists of append-only blocks arranged in a hash chain. This means that data can only be appended to the log, it cannot be overwritten or deleted. The entire log is designed as a Merkle Tree, allowing users to track and check the integrity of data changes.
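The journal's structure can be sketched in a few lines of Go: each block's hash covers its predecessor's hash, so rewriting any historical entry invalidates every later block. Field names here are illustrative, not QLDB's schema:

package main

import (
	"crypto/sha256"
	"fmt"
)

// block is an append-only log entry linked to its predecessor by hash.
type block struct {
	data     string
	prevHash [32]byte
	hash     [32]byte
}

// appendBlock adds a new entry to the chain; existing blocks are never
// overwritten or deleted.
func appendBlock(chain []block, data string) []block {
	var prev [32]byte
	if len(chain) > 0 {
		prev = chain[len(chain)-1].hash
	}
	b := block{data: data, prevHash: prev}
	b.hash = sha256.Sum256(append(prev[:], []byte(data)...))
	return append(chain, b)
}

func main() {
	chain := appendBlock(nil, "INSERT k1=v1")
	chain = appendBlock(chain, "UPDATE k1=v2")
	fmt.Printf("head hash: %x\n", chain[1].hash)
}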

Immudb [12] is a lightweight, high-speed immutable database with built-in cryptographic proof and verification. Immudb is written in pure Go and uses BadgerDB [17] as its storage engine. Badger is a fast key-value database implemented in pure Go using LSM tree structure. Immudb guarantees immutability by using a Merkle Tree structure internally, where data is hashed using SHA-256. Additionally, immudb builds a consistency checker to periodically check the correctness of the data.
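The integrity guarantee rests on a Merkle tree over SHA-256 hashes, which can be built as in the Go sketch below; this is the textbook construction, not necessarily immudb's exact tree layout:

package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot folds SHA-256 leaf hashes pairwise into a single root;
// tampering with any leaf changes the root.
func merkleRoot(leaves [][]byte) [32]byte {
	level := make([][32]byte, len(leaves))
	for i, l := range leaves {
		level[i] = sha256.Sum256(l)
	}
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			j := i + 1
			if j == len(level) {
				j = i // duplicate the last node on odd-sized levels
			}
			next = append(next, sha256.Sum256(append(level[i][:], level[j][:]...)))
		}
		level = next
	}
	return level[0]
}

func main() {
	root := merkleRoot([][]byte{[]byte("k1=v1"), []byte("k2=v2"), []byte("k3=v3")})
	fmt.Printf("root: %x\n", root)
}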

Spitz [48] is a ledger database that supports tamper-proof and immutable transaction logs. It uses Forkbase [43] as its underlying storage, providing git-like multi-version control for data. Spitz provides cell storage for storing data and an append-only ledger for storing transaction logs. Additionally, it builds a Merkle tree on top of the ledger to provide verifiability.

LedgerDB [44] is the centralized ledger database of Alibaba Group. It uses TSA time notarization anchors to provide auditability; these anchors are generated by a two-way peg protocol [41]. Compared with previous ledger databases, LedgerDB differs in that it supports not only create, update, and query methods, but also methods for purging and hiding verifiable data. With these operations, LedgerDB is designed to meet real-world needs, although they may break immutability while providing strong verifiability. As for the underlying storage, LedgerDB supports file systems such as HDFS [39], key-value stores such as RocksDB [2], the Merkle Patricia Tree [47], and L-Stream [44], a linearly structured append-only file system designed specifically for LedgerDB.

2.3 Summary

To summarize, hybrid blockchain databases are decentralized systems with three key components: (1) a shared database using a storage engine such as MySQL, MongoDB, or Redis; (2) a shared ledger replicated through a consensus mechanism such as Kafka (CFT) or Tendermint (BFT); and (3) a simple key-value (KV) data model with a user interface (API) for put and get operations. In Table 1, we summarize the techniques used by existing hybrid blockchain database systems for these three key components.
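As a minimal illustration of component (3), the Go sketch below defines the put/get interface such systems expose, backed here by a plain in-memory map; in a real system, Put would be routed through the shared ledger's consensus before being acknowledged. The names are illustrative, not taken from any of the surveyed systems:

package main

import "fmt"

// KVStore is the simple put/get API that hybrid blockchain databases
// typically expose to clients.
type KVStore interface {
	Put(key, value string) error
	Get(key string) (string, error)
}

// memStore is a stand-in backend used only for this illustration.
type memStore struct{ m map[string]string }

func (s *memStore) Put(key, value string) error {
	s.m[key] = value // a real system would first reach consensus on this write
	return nil
}

func (s *memStore) Get(key string) (string, error) {
	v, ok := s.m[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

func main() {
	var db KVStore = &memStore{m: map[string]string{}}
	db.Put("k1", "v1")
	v, _ := db.Get("k1")
	fmt.Println(v) // v1
}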

4 Performance Analysis

In this section, we analyze the performance of the five systems described in the previous section, namely Veritas (Kafka), Veritas (TM), BigchainDB, BigchainDB (PV) and BlockchainDB. Next, we describe the experimental setup.

4.1 Experimental Setup

All experiments were performed on a local machine running the Ubuntu 18.04 operating system (OS). The machine has 256 physical CPU cores, 8 TB of RAM, and 3 TB of hard disk (HDD) storage. Using the iostat tool, we estimated the machine's IOPS at 5.35. All server and client nodes of the systems under test run in Docker containers pinned to different CPU cores of this machine.

To evaluate the performance of the five systems under test, we sent 100,000 transactions to each system. We send multiple transactions in parallel based on a concurrency parameter representing the number of clients, and transactions are evenly distributed across the server nodes of the system. To compute throughput, we record the start time of the first transaction and the completion time of the last transaction. We then record the number of successfully committed transactions and divide it by this time interval. Note that we only count successful transactions when computing throughput; there may be failed transactions as well. We repeated each experiment three times and report the mean. Before executing the transactions, we loaded 100,000 key-value records of 1,000 bytes each into every system.
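In code, the measurement reduces to the small Go sketch below; the figures in main are placeholders, not measured results:

package main

import (
	"fmt"
	"time"
)

// throughput divides the number of successfully committed transactions
// by the span from the first transaction's start to the last completion.
func throughput(committed int, start, end time.Time) float64 {
	return float64(committed) / end.Sub(start).Seconds()
}

func main() {
	start := time.Now()
	end := start.Add(50 * time.Second) // placeholder interval
	fmt.Printf("%.0f tx/s\n", throughput(100000, start, end)) // 2000 tx/s
}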

We use the Yahoo! Cloud Serving Benchmark [13] (YCSB) dataset, which is widely used to benchmark databases and blockchains [19, 35]. YCSB supports common database operations such as writes (insert, modify, and delete) and reads. Although YCSB defines six workloads, in our experiments we chose three of them: (i) Workload A, consisting of 50% update operations and 50% read operations; (ii) Workload B, consisting of 5% update operations and 95% read operations; and (iii) Workload C, consisting of 100% read operations. Furthermore, we used three key distributions, namely (i) a uniform distribution that operates on all keys uniformly, (ii) a zipfian distribution that operates frequently on only a subset of the keys, and (iii) a latest distribution that operates mostly on the most recently used keys. We default to Workload A and the uniform distribution unless otherwise stated.
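For reference, both the uniform and zipfian distributions can be sampled directly with Go's standard library, as in the sketch below; the zipf parameters (s = 1.1, v = 1) are our illustrative choices, not YCSB's exact constants:

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const keys = 100000
	r := rand.New(rand.NewSource(42))
	zipf := rand.NewZipf(r, 1.1, 1, keys-1)

	uniformKey := r.Intn(keys) // every key is equally likely
	zipfKey := zipf.Uint64()   // a small subset of keys dominates
	fmt.Println(uniformKey, zipfKey)
}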
We developed the benchmarking tool in Go, using goroutines [14] and a channel [14] as a concurrency-safe request queue. Channels allow goroutines to synchronize without explicit locks or condition variables. Each benchmark client is represented by a goroutine that fetches a new request from the channel once it completes the current one, as sketched below.
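A minimal version of that design follows; sendTx is a hypothetical stand-in for issuing one transaction to the system under test:

package main

import (
	"fmt"
	"sync"
)

// sendTx stands in for submitting one transaction and waiting for the reply.
func sendTx(id int) { _ = id }

func main() {
	const clients, txs = 8, 1000

	// The channel is the concurrency-safe request queue.
	queue := make(chan int, txs)
	for i := 0; i < txs; i++ {
		queue <- i
	}
	close(queue)

	// Each goroutine models one benchmark client.
	var wg sync.WaitGroup
	for c := 0; c < clients; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for tx := range queue { // fetch the next request once the current one completes
				sendTx(tx)
			}
		}()
	}
	wg.Wait()
	fmt.Println("all transactions sent")
}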

