Xline persistent storage design and implementation

01. Introduction

In the early prototype phase of Xline, we used memory-based storage to hold data. Although this simplified the design of the Xline prototype and sped up development and iteration, the impact was also significant: since all data lived in memory, once a process crashed, the node could only recover by pulling the full data set from other healthy nodes, which made recovery slow.

Based on this consideration, Xline introduced a Persistent Storage Layer in the latest version, v0.3.0, to persist data to disk while shielding upper-level callers from irrelevant low-level details.

02. Storage engine selection

At present, mainstream storage engines in the industry can broadly be divided into B+ Tree-based engines and LSM Tree-based engines, each with its own advantages and disadvantages.

B+ Tree read and write amplification analysis

When a B+ Tree reads data, it starts from the root node and indexes downward level by level until it reaches the leaf node at the bottom; each level it visits corresponds to one disk I/O. When writing data, it likewise searches down from the root node and writes the data once it finds the corresponding leaf node.

To simplify the analysis, we make the following assumptions: the block size of the B+ Tree is B, so each internal node contains O(B) child nodes and each leaf node holds O(B) records. Assuming the data set size is N, the height of the B+ Tree is O(log_B(N/B)).

Write amplification: every insert into a B+ Tree writes data at a leaf node, and regardless of the actual size of the record, a whole block of size B has to be written, so the write amplification is O(B).

Read amplification: a query has to search all the way from the root node down to a specific leaf node, so it needs as many I/Os as there are levels, i.e. O(log_B(N/B)); the read amplification is therefore O(log_B(N/B)).
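
As a rough worked example (the numbers are assumed purely for illustration and are not taken from Xline): take B = 100 records per block and N = 10^8 records in total.

height = log_B(N/B) = log_100(10^6) = 3 levels
read amplification ≈ 3 disk reads per point query
write amplification ≈ B = 100 (a whole block of 100 records is rewritten for a single-record insert)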

LSM Tree read and write amplification analysis

When an LSM Tree writes data, it first appends the record to an in-memory file, the memtable (Level 0). When the memtable reaches a fixed size, it is converted into an immutable memtable and merged into the next level. To read data, the memtable is searched first; if the key is not found there, the search proceeds level by level downward until the element is found. LSM Trees often use a Bloom Filter to optimize reads by filtering out keys that do not exist in the database.

Suppose the size of the data set is N, the amplification factor is k, the size of a single file in the smallest level is B, and every level uses the same single-file size B but holds a different number of files.

Write amplification: after a record is written, it is compacted into the next level once the current level has been written k times, so the average write amplification of a single level is O(k). There are O(log_k(N/B)) levels in total, so the overall write amplification is O(k · log_k(N/B)).

Read amplification: in the worst case the data has been compacted down to the last level, and a binary search has to be performed on each level in turn until the record is finally found at the last level.

For the last (largest) level, the data size is O(N), and a binary search over it requires O(log(N/B)) disk reads.

For the level above it, the data size is O(N/k), and the binary search requires O(log(N/B) − log k) disk reads.

For the level above that, the data size is O(N/k²), and the binary search requires O(log(N/B) − 2·log k) disk reads.

……

By analogy, summing over all O(log_k(N/B)) levels, the final read amplification is R = O(log(N/B) · log_k(N/B)).
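
Continuing the same assumed numbers (B = 100, N = 10^8) with an amplification factor of k = 10, again purely for illustration:

levels: L = log_k(N/B) = log_10(10^6) = 6
write amplification ≈ k · L = 10 × 6 = 60 (vs. ~100 for the B+ Tree example above)
read amplification: on the order of log_2(N/B) · L ≈ 20 × 6 ≈ 120 disk reads in the worst case, before Bloom filters and caching help (vs. 3 for the B+ Tree example above)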

Summary

From this read/write amplification analysis, a B+ Tree-based storage engine is better suited to read-heavy, write-light scenarios, while an LSM Tree-based storage engine is better suited to write-heavy, read-light scenarios.

Xline, as an open-source distributed KV storage system written in Rust, needs to weigh the following considerations when choosing a persistent storage engine:

  1. In terms of performance: the storage engine can easily become one of the system's performance bottlenecks, so a high-performance engine must be selected. A high-performance storage engine should be written in a high-performance language, with asynchronous implementations preferred; Rust implementations take priority, followed by C/C++.
  2. From the perspective of development: give priority to a Rust implementation, which reduces extra development work at the current stage.
  3. From a maintenance point of view:
    1. Consider who stands behind the engine: prefer engines backed by large commercial companies or active open-source communities
    2. It should be widely used in the industry, so that more experience can be drawn on during later debugging and tuning
    3. Its visibility and popularity (GitHub stars) should be high, to attract good contributors
  4. From a functional point of view: the storage engine must provide transaction semantics, support basic KV operations, support batch operations, and so on.

Requirements are prioritized as: Functionality > Maintenance >= Performance > Development

We mainly researched several open-source embedded databases: Sled, ForestDB, RocksDB, bbolt, and badger. Among them, only RocksDB satisfies all four of the requirements above. RocksDB was developed and open-sourced by Facebook; it has a solid track record in production, still maintains a steady release cadence, and fully covers our functional needs.

Xline mainly serves consistent metadata management across cloud data centers, where the workload is read-heavy and write-light. Some readers may wonder: isn't RocksDB an LSM Tree-based storage engine? An LSM Tree-based engine should be better suited to write-heavy, read-light workloads, so why choose RocksDB?

Indeed, in theory the most suitable choice would be a B+ Tree-based engine. However, B+ Tree-based embedded databases such as Sled and ForestDB lack large-scale production experience, and their version maintenance has stalled. After weighing the trade-offs, we chose RocksDB as Xline's storage backend. At the same time, considering that a more suitable storage engine may appear in the future, we carefully separated and encapsulated the interfaces in the design of the Persistent Storage Layer, so that the cost of replacing the storage engine later is minimized.

03. Design and implementation of persistent storage layer

Before we start discussing the design and implementation of the persistent storage layer, we need to clarify our requirements and expectations for persistent storage:

  1. As mentioned above, after weighing the trade-offs we adopted RocksDB as Xline's back-end storage engine, but we cannot rule out replacing it in the future. The design of StorageEngine must therefore follow the OCP (Open-Closed) principle and remain configurable and easy to replace.
  2. We need to provide a basic KV interface to upper-level users.
  3. We need a complete recovery (Recover) mechanism.

Overall architecture and write process

Let's take a look at the current overall architecture of Xline, as shown in the following figure:

From top to bottom, the overall architecture of Xline can be divided into the access layer, the consensus module, the business logic module, the storage API layer, and the storage engine layer. The storage API layer is responsible for providing business-relevant StorageApi traits to the business module and the consensus module respectively, while hiding the implementation details of the underlying engine. The storage engine layer performs the actual data storage operations.

Let's take a PUT request as an example to trace the write path. When a client sends a Put request to the Xline Server, the following happens:

  1. After KvServer receives the PutRequest from the user, it first validates the request. Once the check passes, it issues a Propose RPC to the Curp Server through its own CurpClient.
  2. After the Curp Server receives the Propose request, it first enters the fast-path flow: it saves the cmd in the request into the speculative execution pool (aka spec_pool) and checks whether it conflicts with any command already in the spec_pool. If a conflict is found, it returns ProposeError::KeyConflict and waits for the slow path to finish; otherwise it continues along the fast path (a simplified sketch of this conflict check follows this list).
  3. In the fast path, if a command neither conflicts nor repeats, the server notifies the background cmd_worker through a dedicated channel to execute it. Once cmd_worker starts executing, it records the command in the CommandBoard to track its progress.
  4. When the nodes in the cluster reach consensus, the state-machine log is committed, persisted to the CurpStore, and finally applied. During apply, the corresponding CommandExecutor, i.e. the store module of each server in the business module, is invoked, and the data is persisted to the backend database through DB.
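
To make step 2 more concrete, here is a highly simplified sketch of a key-conflict check against a speculative pool. It is illustrative only: the type names (SpecPool, Cmd) and the conflict rule are assumptions made for this article, not Xline's actual CURP implementation.

use std::collections::HashMap;
use std::sync::Mutex;

/// A simplified command: just an id and the keys it touches (illustrative only).
#[derive(Debug, Clone)]
struct Cmd {
    id: u64,
    keys: Vec<Vec<u8>>,
}

/// A minimal stand-in for the speculative execution pool (spec_pool).
#[derive(Debug, Default)]
struct SpecPool {
    cmds: Mutex<HashMap<u64, Cmd>>,
}

impl SpecPool {
    /// Try to insert a command. Returns false if it conflicts with a pooled
    /// command, in which case the proposal would fall back to the slow path.
    fn try_insert(&self, cmd: Cmd) -> bool {
        let mut pool = self.cmds.lock().expect("spec pool lock poisoned");
        let conflicts = pool
            .values()
            .any(|pooled| pooled.keys.iter().any(|k| cmd.keys.contains(k)));
        if conflicts {
            return false;
        }
        let _ = pool.insert(cmd.id, cmd);
        true
    }
}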

Interface design

The figure below shows the relationship between the two traits, StorageApi and StorageEngine, and the corresponding data structures:

Storage Engine Layer

The Storage Engine Layer mainly defines the StorageEngine trait and related errors.

StorageEngine trait definition (engine/src/engine_api.rs):

/// Write operation
#[non_exhaustive]
#[derive(Debug)]
pub enum WriteOperation<'a> {
    /// `Put` operation
    Put {  table: &'a str, key: Vec<u8>, value: Vec<u8> },
    /// `Delete` operation
    Delete { table: &'a str, key: &'a [u8] },
    /// Delete range operation, it will remove the database entries in the range [from, to)
    DeleteRange { table: &'a str, from: &'a [u8], to: &'a [u8] },
}

/// The `StorageEngine` trait
pub trait StorageEngine: Send + Sync + 'static + std::fmt::Debug {
    /// Get the value associated with a key value and the given table
    ///
    /// # Errors
    /// Return `EngineError::TableNotFound` if the given table does not exist
    /// Return `EngineError` if met some errors
    fn get(&self, table: &str, key: impl AsRef<[u8]>) -> Result<Option<Vec<u8>>, EngineError>;

    /// Get the values associated with the given keys
    ///
    /// # Errors
    /// Return `EngineError::TableNotFound` if the given table does not exist
    /// Return `EngineError` if met some errors
    fn get_multi(
        &self,
        table: &str,
        keys: &[impl AsRef<[u8]>],
    ) -> Result<Vec<Option<Vec<u8>>>, EngineError>;

    /// Get all the values of the given table
    /// # Errors
    /// Return `EngineError::TableNotFound` if the given table does not exist
    /// Return `EngineError` if met some errors
    #[allow(clippy::type_complexity)] // it's clear that (Vec<u8>, Vec<u8>) is a key-value pair
    fn get_all(&self, table: &str) -> Result<Vec<(Vec<u8>, Vec<u8>)>, EngineError>;

    /// Commit a batch of write operations
    /// If sync is true, the write will be flushed from the operating system
    /// buffer cache before the write is considered complete. If this
    /// flag is true, writes will be slower.
    ///
    /// # Errors
    /// Return `EngineError::TableNotFound` if the given table does not exist
    /// Return `EngineError` if met some errors
    fn write_batch(&self, wr_ops: Vec<WriteOperation<'_>>, sync: bool) -> Result<(), EngineError>;
}
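
As a usage sketch of this trait (the helper function, the table name "kv" and the keys here are illustrative; engine can be any type implementing StorageEngine):

fn save_kv_pair<E: StorageEngine>(engine: &E) -> Result<(), EngineError> {
    // Put one key-value pair and delete an obsolete key in a single atomic batch.
    let ops = vec![
        WriteOperation::Put {
            table: "kv",
            key: b"hello".to_vec(),
            value: b"world".to_vec(),
        },
        WriteOperation::Delete {
            table: "kv",
            key: b"obsolete".as_slice(),
        },
    ];
    // sync = true flushes the write out of the OS buffer cache before returning.
    engine.write_batch(ops, true)?;

    // Read the value back through the same trait.
    let value = engine.get("kv", b"hello")?;
    assert_eq!(value, Some(b"world".to_vec()));
    Ok(())
}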

Related Error Definitions

#[non_exhaustive]
#[derive(Error, Debug)]
pub enum EngineError {
    /// Met I/O Error during persisting data
    #[error("I/O Error: {0}")]
    IoError(#[from] std::io::Error),
    /// Table Not Found
    #[error("Table {0} Not Found")]
    TableNotFound(String),
    /// DB File Corrupted
    #[error("DB File {0} Corrupted")]
    Corruption(String),
    /// Invalid Argument Error
    #[error("Invalid Argument: {0}")]
    InvalidArgument(String),
    /// The Underlying Database Error
    #[error("The Underlying Database Error: {0}")]
    UnderlyingError(String),
}

MemoryEngine (engine/src/memory_engine.rs) and RocksEngine (engine/src/rocksdb_engine.rs) both implement the StorageEngine trait. MemoryEngine is mainly used for testing; RocksEngine is defined as follows:

/// `RocksDB` Storage Engine
#[derive(Debug, Clone)]
pub struct RocksEngine {
    /// The inner storage engine of `RocksDB`
    inner: Arc<rocksdb::DB>,
}

/// Translate a `RocksError` into an `EngineError`
impl From<RocksError> for EngineError {
    #[inline]
    fn from(err: RocksError) -> Self {
        let err = err.into_string();
        if let Some((err_kind, err_msg)) = err.split_once(':') {
            match err_kind {
                "Corruption" => EngineError::Corruption(err_msg.to_owned()),
                "Invalid argument" => {
                    if let Some(table_name) = err_msg.strip_prefix(" Column family not found: ") {
                        EngineError::TableNotFound(table_name.to_owned())
                    } else {
                        EngineError::InvalidArgument(err_msg.to_owned())
                    }
                }
                "IO error" => EngineError::IoError(IoError::new(Other, err_msg)),
                _ => EngineError::UnderlyingError(err_msg.to_owned()),
            }
        } else {
            EngineError::UnderlyingError(err)
        }
    }
}

impl StorageEngine for RocksEngine {
    /// omit some code
}
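
As a reference for what hides behind "omit some code", here is a simplified sketch of the get method inside the impl above, based on the rust-rocksdb crate API (the actual Xline implementation may differ in details):

fn get(&self, table: &str, key: impl AsRef<[u8]>) -> Result<Option<Vec<u8>>, EngineError> {
    // Each logical table is backed by a RocksDB column family.
    let cf = self
        .inner
        .cf_handle(table)
        .ok_or_else(|| EngineError::TableNotFound(table.to_owned()))?;
    // Delegate the lookup to RocksDB and translate its error via the From impl above.
    self.inner.get_cf(cf, key).map_err(EngineError::from)
}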

StorageApi Layer

Business module

StorageApi definition of the business module:

/// The Stable Storage Api
pub trait StorageApi: Send + Sync + 'static + std::fmt::Debug {
    /// Get values by keys from storage
    fn get_values<K>(&self, table: &'static str, keys: &[K]) -> Result<Vec<Option<Vec<u8>>>, ExecuteError>
    where
        K: AsRef<[u8]> + std::fmt::Debug;

    /// Get the value associated with a key from storage
    fn get_value<K>(&self, table: &'static str, key: K) -> Result<Option<Vec<u8>>, ExecuteError>
    where
        K: AsRef<[u8]> + std::fmt::Debug;

    /// Get all values of the given table from the storage
    fn get_all(&self, table: &'static str) -> Result<Vec<(Vec<u8>, Vec<u8>)>, ExecuteError>;

    /// Reset the storage
    fn reset(&self) -> Result<(), ExecuteError>;

    /// Flush the operations to storage
    fn flush_ops(&self, ops: Vec<WriteOp>) -> Result<(), ExecuteError>;
}


In the business module, DB (xline/src/storage/db.rs) is responsible for adapting the StorageEngine into the StorageApi for upper-level calls. It is defined as follows:

/// Database to store revision to kv mapping
#[derive(Debug)]
pub struct DB<S: StorageEngine> {
    /// internal storage of `DB`
    engine: Arc<S>,
}

impl<S> StorageApi for DB<S>
where
    S: StorageEngine
{
    /// omit some code 
}
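
A simplified sketch of how DB bridges the two traits, taking get_value as an example (the error conversion shown here, ExecuteError::DbError, is an assumption made for illustration; the real variant and message may differ):

fn get_value<K>(&self, table: &'static str, key: K) -> Result<Option<Vec<u8>>, ExecuteError>
where
    K: AsRef<[u8]> + std::fmt::Debug,
{
    // Forward the read to the underlying engine and convert the engine error
    // into the business-level error type (the variant used here is illustrative).
    self.engine
        .get(table, key)
        .map_err(|e| ExecuteError::DbError(format!("failed to get value from {table}: {e}")))
}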

Each server in the business module has its own Store backend, and their core data structure is the DB in the StorageApi layer.

Consensus module

StorageApi definition of the Curp module (curp/src/server/storage/mod.rs):

/// Curp storage api
#[async_trait]
pub(super) trait StorageApi: Send + Sync {
    /// Command
    type Command: Command;

    /// Put `voted_for` in storage, must be flushed on disk before returning
    async fn flush_voted_for(&self, term: u64, voted_for: ServerId) -> Result<(), StorageError>;

    /// Put log entries in the storage
    async fn put_log_entry(&self, entry: LogEntry<Self::Command>) -> Result<(), StorageError>;

    /// Recover from persisted storage
    /// Return `voted_for` and all log entries
    async fn recover(
        &self,
    ) -> Result<(Option<(u64, ServerId)>, Vec<LogEntry<Self::Command>>), StorageError>;
}



RocksDBStorage (curp/src/server/storage/rocksdb.rs) is the CurpStore mentioned in the architecture diagram above; it is responsible for translating StorageApi calls into operations on the underlying RocksEngine.

/// `RocksDB` storage implementation
pub(in crate::server) struct RocksDBStorage<C> {
    /// DB handle
    db: RocksEngine,
    /// Phantom
    phantom: PhantomData<C>,
}

#[async_trait]
impl<C: 'static + Command> StorageApi for RocksDBStorage<C> {
    /// Command
    type Command = C;
    /// omit some code
}
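
To illustrate how RocksDBStorage maps the async StorageApi onto the synchronous RocksEngine, here is a simplified sketch of flush_voted_for inside the #[async_trait] impl above (the key layout and the error conversions are assumptions; the actual code lives in curp/src/server/storage/rocksdb.rs):

async fn flush_voted_for(&self, term: u64, voted_for: ServerId) -> Result<(), StorageError> {
    // Serialize (term, voted_for) and write it under a fixed key in the curp table,
    // with sync = true so it is durable on disk before this call returns.
    let bytes = bincode::serialize(&(term, voted_for))?;
    let op = WriteOperation::Put {
        table: CF,              // the "curp" table
        key: VOTE_FOR.to_vec(), // the fixed key used for the vote record
        value: bytes,
    };
    self.db.write_batch(vec![op], true)?;
    Ok(())
}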

Implementation details

Data view

After introducing the Persistent Storage Layer, Xline partitions its data into different namespaces through logical tables, which currently correspond to Column Families in the underlying RocksDB (a minimal sketch of this mapping follows the list below).

Currently there are the following tables:

  1. curp: stores curp-related persistent information, including log entries, voted_for and the corresponding term
  2. lease: stores granted lease information
  3. kv: stores key-value information
  4. auth: stores whether auth is currently enabled in Xline and the corresponding enable revision
  5. user: stores the user information added to Xline
  6. role: stores the role information added to Xline
  7. meta: stores the currently applied log index
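
A minimal sketch of how these logical tables map onto RocksDB column families when the engine is opened, based on the rust-rocksdb crate API (the constant name XLINE_TABLES and the helper function are illustrative, not Xline's actual code):

use rocksdb::{Options, DB as RocksDB};

/// The logical tables listed above, used as column family names (illustrative constant).
const XLINE_TABLES: [&str; 7] = ["curp", "lease", "kv", "auth", "user", "role", "meta"];

fn open_engine(data_dir: &str) -> Result<RocksDB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Create any column family (logical table) that does not exist yet.
    opts.create_missing_column_families(true);
    RocksDB::open_cf(&opts, data_dir, XLINE_TABLES)
}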

Scalability

The reason why Xline splits storage-related operations into two different traits, StorageEngine and StorageApi, placed at two different levels, is to isolate changes. The StorageEngine trait provides the mechanism, while StorageApi is defined by the upper-level modules: different modules can define their own versions to implement specific storage strategies. The CurpStore and DB in the StorageApi layer are responsible for translating between these two traits. Since upper-level callers do not depend directly on the underlying storage engine, replacing the engine later will not require large changes to the upper-level modules' code.
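
For example, assuming the engines and the business-layer DB expose simple constructors (the signatures below are illustrative, not the actual ones; XLINE_TABLES reuses the illustrative constant from the sketch above), switching the storage engine only changes the construction site, while every upper-level module keeps programming against StorageApi:

use std::sync::Arc;

// Illustrative only: constructor signatures are assumed, not Xline's actual API.
fn build_stores() -> Result<(), EngineError> {
    // Production: back the business-layer DB with the RocksDB engine.
    let rocks_engine = Arc::new(RocksEngine::new("/var/lib/xline", &XLINE_TABLES)?);
    let db = DB::new(rocks_engine);

    // Tests: the same upper-level code runs on the in-memory engine instead.
    let mem_engine = Arc::new(MemoryEngine::new(&XLINE_TABLES));
    let test_db = DB::new(mem_engine);

    let _ = (db, test_db);
    Ok(())
}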

Recovery process

For the recovery process, two questions matter: what data to recover, and when to recover. Let's first look at the data each module needs to recover.

Consensus module

In the consensus module, since RocksDBStorage is used exclusively by the Curp Server, recover can be added directly to the corresponding StorageApi trait. The implementation is as follows:

#[async_trait]
impl<C: 'static + Command> StorageApi for RocksDBStorage<C> {
    /// Command
    type Command = C;
    /// omit some code
    async fn recover(
        &self,
    ) -> Result<(Option<(u64, ServerId)>, Vec<LogEntry<Self::Command>>), StorageError> {
        let voted_for = self
            .db
            .get(CF, VOTE_FOR)?
            .map(|bytes| bincode::deserialize::<(u64, ServerId)>(&bytes))
            .transpose()?;

        let mut entries = vec![];
        let mut prev_index = 0;
        for (k, v) in self.db.get_all(CF)? {
            // we can identify whether a kv is a state or entry by the key length
            if k.len() == VOTE_FOR.len() {
                continue;
            }
            let entry: LogEntry<C> = bincode::deserialize(&v)?;
            #[allow(clippy::integer_arithmetic)] // won't overflow
            if entry.index != prev_index + 1 {
                // break when logs are no longer consistent
                break;
            }
            prev_index = entry.index;
            entries.push(entry);
        }

        Ok((voted_for, entries))
    }
}



During recovery, the consensus module first loads voted_for and the corresponding term from the underlying db. This is a safety guarantee of the consensus algorithm: it prevents a node from voting twice in the same term. The log entries are then loaded.

Business module

In the business module, different servers have different Stores, and they all rely on the mechanism provided by the underlying DB. Therefore, recovery is not defined in the StorageApi trait; instead, each of LeaseStore (xline/src/storage/lease_store/mod.rs), AuthStore (xline/src/storage/auth_store/store.rs) and KvStore (xline/src/storage/kv_store.rs) has its own recovery method.

/// Lease store
#[derive(Debug)]
pub(crate) struct LeaseStore<DB>
where
    DB: StorageApi,
{
    /// Lease store Backend
    inner: Arc<LeaseStoreBackend<DB>>,
}

impl<DB> LeaseStoreBackend<DB>
where
    DB: StorageApi,
{
    /// omit some code
    /// Recover data from persistent storage
    fn recover_from_current_db(&self) -> Result<(), ExecuteError> {
        let leases = self.get_all()?;
        for lease in leases {
            let _ignore = self
                .lease_collection
                .write()
                .grant(lease.id, lease.ttl, false);
        }
        Ok(())
    }
}

impl<S> AuthStore<S>
where
    S: StorageApi,
{
    /// Recover data from persistent storage
    pub(crate) fn recover(&self) -> Result<(), ExecuteError> {
        let enabled = self.backend.get_enable()?;
        if enabled {
            self.enabled.store(true, AtomicOrdering::Relaxed);
        }
        let revision = self.backend.get_revision()?;
        self.revision.set(revision);
        self.create_permission_cache()?;
        Ok(())
    }
}


The recovery logic of LeaseStore and AuthStore is relatively simple, so we will not dwell on it here. The recovery process of KvStore is more involved; its flow chart is shown below.

When to Recover

Xline performs recovery mainly during system startup: the business modules recover first, followed by the consensus module. Since the recovery of KvStore depends on the recovery of LeaseStore, LeaseStore must be recovered before KvStore. The corresponding code (xline/src/server/xline_server.rs) is as follows:

impl<S> XlineServer<S>
where
    S: StorageApi,
{
    /// Start `XlineServer`
    #[inline]
    pub async fn start(&self, addr: SocketAddr) -> Result<()> {
        // lease storage must recover before kv storage
        self.lease_storage.recover()?;
        self.kv_storage.recover().await?;
        self.auth_storage.recover()?;
        let (kv_server, lock_server, lease_server, auth_server, watch_server, curp_server) =
            self.init_servers().await;
        Ok(Server::builder()
            .add_service(RpcLockServer::new(lock_server))
            .add_service(RpcKvServer::new(kv_server))
            .add_service(RpcLeaseServer::from_arc(lease_server))
            .add_service(RpcAuthServer::new(auth_server))
            .add_service(RpcWatchServer::new(watch_server))
            .add_service(ProtocolServer::new(curp_server))
            .serve(addr)
            .await?)
    }
}

The recover process of the consensus module (curp/src/server/curp_node.rs) is as follows, and its function call chain is: XlineServer::start -> XlineServer::init_servers -> CurpServer::new -> CurpNode::new

// utils
impl<C: 'static + Command> CurpNode<C> {
    /// Create a new server instance
    #[inline]
    pub(super) async fn new<CE: CommandExecutor<C> + 'static>(
        id: ServerId,
        is_leader: bool,
        others: HashMap<ServerId, String>,
        cmd_executor: CE,
        curp_cfg: Arc<CurpConfig>,
        tx_filter: Option<Box<dyn TxFilter>>,
    ) -> Result<Self, CurpError> {
        // omit some code
        // create curp state machine
        let (voted_for, entries) = storage.recover().await?;
        let curp = if voted_for.is_none() && entries.is_empty() {
            Arc::new(RawCurp::new(
                id,
                others.keys().cloned().collect(),
                is_leader,
                Arc::clone(&cmd_board),
                Arc::clone(&spec_pool),
                uncommitted_pool,
                curp_cfg,
                Box::new(exe_tx),
                sync_tx,
                calibrate_tx,
                log_tx,
            ))
        } else {
            info!(
                "{} recovered voted_for({voted_for:?}), entries from {:?} to {:?}",
                id,
                entries.first(),
                entries.last()
            );
            Arc::new(RawCurp::recover_from(
                id,
                others.keys().cloned().collect(),
                is_leader,
                Arc::clone(&cmd_board),
                Arc::clone(&spec_pool),
                uncommitted_pool,
                curp_cfg,
                Box::new(exe_tx),
                sync_tx,
                calibrate_tx,
                log_tx,
                voted_for,
                entries,
                last_applied.numeric_cast(),
            ))
        };   
        // omit some code
        Ok(Self {
            curp,
            spec_pool,
            cmd_board,
            shutdown_trigger,
            storage,
        })
    }
}



04. Performance evaluation

In the new version v0.3.0, in addition to introducing the Persistent Storage Layer, we also carried out some major refactoring of parts of CURP. The refactored code and the new features recently passed the validation and integration tests, and the performance test results have been released with Xline v0.4.0.

For the performance report, please refer to the link:

https://github.com/datenlord/Xline/blob/master/img/xline-key-perf.png


Xline is a distributed KV store for metadata management. The Xline project is written in Rust, and everyone is welcome to participate in our open source project!

GitHub link: https://github.com/datenlord/Xline

Xline official website: www.xline.cloud

Xline Discord: https://discord.gg/XyFXGpSfvb
