RocksDB implementation principle

Introduction

  • RocksDB started as a Facebook project aiming to build an efficient database that can fully exploit the performance of high-speed storage hardware under server workloads. It is a C++ library that stores binary KV data of arbitrary length and supports atomic reads and writes.
  • RocksDB offers a large number of flexible configuration options so it can be tuned for different production environments, whether the data lives in memory, on flash, on hard disks, or on HDFS. It supports multiple compression algorithms and ships with a complete set of tools for production and debugging.
  • RocksDB reuses a large amount of leveldb code and borrows many design ideas from HBase. The original code was forked from leveldb 1.5, and RocksDB also incorporates concepts and code that Facebook had developed earlier.
  • RocksDB is used in a wide range of scenarios. For example, pika, a database that speaks the Redis protocol, uses RocksDB to persist the data structures that Redis supports; MySQL allows pluggable storage engines, and the MySQL branch maintained by Facebook supports RocksDB.

Compile and install RocksDB

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
# build in debug mode
make
# build the release-mode static library
make static_lib
# alternatively, build with cmake
mkdir build
cd build
cmake ..
make -j"$(nproc)"

Compression library

Ubuntu

# rocksdb supports several compression libraries
# gflags
sudo apt-get install libgflags-dev
# snappy
sudo apt-get install libsnappy-dev
# zlib
sudo apt-get install zlib1g-dev
# bzip2
sudo apt-get install libbz2-dev
# lz4
sudo apt-get install liblz4-dev
# zstandard
sudo apt-get install libzstd-dev

CentOS

# gflags
git clone https://github.com/gflags/gflags.git
cd gflags
git checkout v2.0
./configure && make && sudo make install
# snappy
sudo yum install snappy snappy-devel
# zlib
sudo yum install zlib zlib-devel
# bzip2
sudo yum install bzip2 bzip2-devel
# lz4
sudo yum install lz4-devel
# ASAN (optional for debugging)
sudo yum install libasan
# zstandard
sudo yum install libzstd-devel

Basic interface

Status Open(const Options& options, const std::string& dbname, DB** dbptr);
Status Get(const ReadOptions& options, const Slice& key, std::string* value);
Status Get(const ReadOptions& options,  ColumnFamilyHandle* column_family, const Slice& key, std::string* value);
Status Put(const WriteOptions& options, const Slice& key, const Slice& value);
Status Put(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key, const Slice& value);

// read-modify-write: wraps read, modify, and write into a single operation
Status Merge(const WriteOptions& options, const Slice& key, const Slice& value);
Status Merge(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key, const Slice& value);

// marks the key as deleted; the actual removal happens during compaction
Status Delete(const WriteOptions& options, const Slice& key);
Status Delete(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key, const Slice& ts);

// for keys that already exist and are never overwritten; faster than Delete in that case
Status SingleDelete(const WriteOptions& options, const Slice& key);
Status SingleDelete(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key);

// an open iterator prevents compaction from reclaiming obsolete data, so release it when done
Iterator* NewIterator(const ReadOptions& options);
Iterator* NewIterator(const ReadOptions& options, ColumnFamilyHandle* column_family);
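
A minimal usage sketch of these interfaces; the database path and keys below are placeholders chosen for illustration:

#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;  // create the database if it does not exist

  // open a database at an example path
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // write, read and delete a key
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);   // value == "value1"
  s = db->Delete(rocksdb::WriteOptions(), "key1");        // tombstone; removed during compaction

  delete db;  // close the database
  return 0;
}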

Highly layered architecture

  • RocksDB is an embeddable storage engine that stores arbitrary binary KV data.
  • RocksDB keeps all data in sorted order. The common operations are Get(key), NewIterator(), Put(key, value), Delete(key) and SingleDelete(key).
    RocksDB has three basic data structures: memtable, sstfile and logfile. The memtable is an in-memory structure: all write requests go to the memtable first and, optionally, to the logfile, a file written sequentially on storage. When a memtable fills up, it is flushed to an sstfile, and the related logfile can then be safely deleted. The data in an sstfile is sorted so that lookups by key are fast.
  • RocksDB is implemented on top of an LSM-Tree (Log-Structured Merge-Tree).

LSM-Tree

  • The core idea of the LSM-Tree is to improve write performance through sequential writes. The LSM-Tree is not a tree data structure but a storage layout. It is a design aimed at write-intensive scenarios such as logging systems, massive data storage and data analysis.
  • In L0, data may be duplicated; there is no ordering between files, but each file is sorted internally.
  • In L1 ~ LN, there is no duplicate data within a layer, though data may be duplicated across layers; the files within a layer are sorted and non-overlapping.

About access speed

  • Disk access time = seek time + rotational latency + transfer time;
    seek time: 8 ms ~ 12 ms;
    rotational latency: at 7200 rpm, about 4 ms for half a rotation;
    transfer time: at roughly 50 MB/s, about 0.3 ms;
  • In ascending order of speed: disk random IO, disk sequential IO, memory random IO, memory sequential IO.

MemTable

  • The MemTable is an in-memory data structure that holds data before it is written to an SST file. It serves both reads and writes: new writes always insert data into the memtable, and reads always query the memtable before the SST files, because the data in the memtable is newer. Once a memtable is full, it becomes immutable and is replaced by a new memtable. A background thread writes the contents of the immutable memtable to an SST file, after which the memtable can be destroyed.
  • The default memtable implementation is based on skiplist.
  • Options that affect memtable size:
    write_buffer_size: the size of a memtable;
    db_write_buffer_size: the total size of memtables across all column families; this can be used to cap the total memory used by memtables;
    max_write_buffer_number: The maximum number of memtables that can be stored in the memory before flushing to SST files;
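
A sketch of how these options might be set; the sizes are arbitrary example values, not recommendations:

rocksdb::Options options;
options.write_buffer_size = 64 << 20;       // 64 MB per memtable
options.max_write_buffer_number = 3;        // keep at most 3 memtables in memory per column family
options.db_write_buffer_size = 256 << 20;   // 256 MB cap on memtables across all column families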

Flush strategy

  • A write causes the memtable size to exceed write_buffer_size.
  • The total memtable size across all column families exceeds db_write_buffer_size, or the write_buffer_manager requests a flush. In this case, the largest memtable is flushed.
  • The total size of WAL files exceeds max_total_wal_size. In this case, the memtable holding the oldest data is flushed, so that the WAL file carrying that memtable's data can be deleted.

WAL

Every update operation in RocksDB is written to two places:

  1. An in-memory data structure called the memtable (flushed to an SST file later).
  2. A WAL (write-ahead log) on disk. In the event of a crash, the WAL can be used to completely rebuild the data in the memtable, ensuring that the database is restored to its previous state. In the default configuration, RocksDB guarantees consistency by calling fflush on the WAL file after every write operation.

WAL creation time:

  1. When db is opened;
  2. When a column family is flushed (a new WAL is created; the old one is deleted later, once it is no longer needed);

Important parameters

  • DBOptions::max_total_wal_size: if you wish to limit the size of the WALs, RocksDB uses DBOptions::max_total_wal_size as a column-family flush trigger. Once the total WAL size exceeds this value, RocksDB forces a flush of the column families that still have data in the oldest WAL, so that the oldest WAL file can be deleted. This configuration is useful when column families are updated at very different rates. Without such a limit, users may have to keep very old WAL files around if a rarely updated column family still has unflushed data in them.
  • DBOptions::WAL_ttl_seconds, DBOptions::WAL_size_limit_MB: these two options control when WAL files are deleted. Non-zero values are thresholds of time and disk space; once a threshold is exceeded, archived WAL files are deleted.
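
For illustration, these parameters could be set as follows; the values are arbitrary examples:

rocksdb::Options options;
options.max_total_wal_size = 512 << 20;  // once total WAL size exceeds 512 MB, force flushes so the oldest WAL can be deleted
options.WAL_ttl_seconds = 3600;          // archived WAL files older than one hour may be deleted
options.WAL_size_limit_MB = 1024;        // archived WAL files may be deleted once their total size exceeds 1 GB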

Immutable MemTable

  • The Immutable MemTable is also data stored in memory, but it is read-only;
  • When a MemTable is full, it is switched to a read-only state and becomes an Immutable MemTable. A new MemTable is then created to serve writes, and the Immutable MemTable is asynchronously flushed to an SST file;

SST

An SST (Sorted String Table) is a collection of sorted key-value pairs; it is the on-disk data structure of the LSM-Tree. Lookups can be accelerated by building a key index and a Bloom filter for each file. The LSM-Tree keeps all DML operation records in memory and then writes them to disk sequentially in batches. This is very different from a B+ Tree: a B+ Tree update must find the page holding the original data and modify the value in place, whereas the LSM-Tree simply appends to disk, and redundant or stale data is removed later by compaction.

BlockCache

  • The block cache is where RocksDB caches data in memory for reads. Users can pass a Cache object with the desired capacity to the RocksDB instance. A Cache object can be shared by multiple RocksDB instances in the same process, which lets the user control the overall cache size. The block cache stores uncompressed blocks; users can optionally set up a second block cache for compressed blocks. A read checks the uncompressed block cache first and then the compressed block cache. When direct IO is enabled, the compressed block cache can take the place of the OS page cache.
  • RocksDB provides two implementations, LRUCache and ClockCache. Both use sharding to mitigate lock contention: capacity is divided evenly among the shards, and shards do not share space. By default a cache is split into 64 shards, each with at least 512 KB of capacity.
  • Users can choose to cache index and filter blocks in the BlockCache; by default, index and filter blocks are kept outside the BlockCache.

LRU cache

By default, RocksDB uses an LRU block cache with a capacity of 8 MB. Each cache shard maintains its own LRU list and its own hash table for lookups. Concurrency is handled with a per-shard mutex, which must be acquired for both lookups and inserts. Users can create an LRU cache by calling NewLRUCache.
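
A sketch of setting up a shared LRU block cache; the 128 MB capacity is an arbitrary example:

#include "rocksdb/cache.h"
#include "rocksdb/table.h"

std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(128 << 20);  // 128 MB, sharded internally

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;                   // uncompressed block cache
table_options.cache_index_and_filter_blocks = true;  // optionally keep index/filter blocks in the cache too

rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
// the same cache object can be passed to other DB instances to share one memory budget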

Clock cache

  • ClockCache implements the CLOCK algorithm. Each clock cache shard maintains a circular list of cache entries. A clock pointer walks the ring looking for an unpinned entry to evict; an entry that has been used since the previous scan is given a second chance to stay in the cache. tbb::concurrent_hash_map is used for lookups.
  • Compared with the LRU cache, the clock cache has finer lock granularity. With the LRU cache, each shard's mutex must be taken even on reads, because the LRU list has to be updated. Looking up data in the clock cache does not take the shard mutex; it only has to search the concurrent hash table, so lock granularity is better. The per-shard lock is only needed on inserts. As a result, read performance can improve under certain workloads.

Write process

    1. The write is first appended to the WAL (Write Ahead Log) on disk.
    2. The write is then applied to the memtable.
    3. When the memtable reaches a size threshold, it is frozen and becomes immutable; subsequent writes go to a new memtable and a new WAL.
    4. A background thread flushes the immutable memtable into an L0 SSTable; once the flush succeeds, the old WAL can be released.
    5. If inserting the new SSTable makes the total file size of level Li exceed its threshold, a file is selected from Li and merged with the overlapping files in Li+1, and this repeats until every level is below its threshold. The merge guarantees that from L1 onward the key ranges of the SSTables within a level do not overlap.
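
The WAL step above can be controlled per write through WriteOptions. A small sketch, assuming db is an already opened rocksdb::DB* and the key is a placeholder:

rocksdb::WriteOptions wopts;
wopts.sync = true;         // flush the WAL to stable storage before acknowledging the write (safer, slower)
// wopts.disableWAL = true;  // skip step 1 entirely; data still in the memtable is lost on a crash
rocksdb::Status s = db->Put(wopts, "key", "value");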

Reading process


    1. FindFiles. Search the SST files. In L0 every file must be checked, because L0 does not guarantee non-overlapping keys; in deeper levels keys do not overlap, so at most one SST file per level needs to be read. From L1 onward, each level can keep an ordered interval index of its SST files in memory and binary-search that index.
    2. LoadIB+FB. IB and FB are short for index block and filter block. The index block indexes the data blocks inside the SST; the filter block is a Bloom filter that can quickly rule out a missing key, so these two structures are loaded first.
    3. SearchIB. Binary-search the index block to locate the candidate data block.
    4. SearchFB. Check the Bloom filter; if the key is definitely absent, return.
    5. LoadDB. Load the candidate block into memory.
    6. SearchDB. Binary-search within the block.
    7. ReadValue. Once the key is found, read its value. With WiscKey-style KV separation, the value must additionally be read from the vLog.
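
A sketch of a range read with an iterator; it must be released when done because it pins data against compaction. The "user:" prefix and the process() helper are hypothetical, and db is assumed to be an open rocksdb::DB*:

rocksdb::ReadOptions ropts;
rocksdb::Iterator* it = db->NewIterator(ropts);
for (it->Seek("user:"); it->Valid() && it->key().starts_with("user:"); it->Next()) {
  // it->key() and it->value() are valid until the next call on the iterator
  process(it->key().ToString(), it->value().ToString());  // process() is a placeholder
}
assert(it->status().ok());  // check for errors encountered during the scan
delete it;                  // release the iterator so compaction can reclaim obsolete data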

Three major problems with LSM-Tree

read amplification

The ratio of the number of bytes the database must physically read to the number of bytes returned. Data is stored in levels, and a read has to search level by level until the key is found; this may involve multiple IOs, and the impact is even larger for range queries.

space amplification

Used to describe the ratio of the number of data bytes stored on the disk compared to the number of logical bytes contained in the database; all write operations are sequential writes, rather than in-place updates, and invalid data will not be cleared immediately;

write amplification

The ratio of the amount of data actually written to disk to the amount of data written by the application. To reduce read amplification and space amplification, RocksDB merges data with background threads (compaction), and these merges cause the same data to be written to disk multiple times.

column family

Each key-value pair in RocksDB is associated with exactly one column family. If no column family is specified, the key-value pair goes to the "default" column family.
Column families provide a way to logically shard a database. Some of their interesting characteristics include:

  • Supports cross-column family atomic writes. It means you can execute Write({cf1,key1,value1},{cf2,key2,value2}) atomically.
  • Consistent views across column families.
  • Allows different configurations for different column families
  • Add/remove column families on the fly. Both operations are very fast.
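
A sketch of creating and using a column family; the name "new_cf" is an example and db is assumed to be an open rocksdb::DB*:

rocksdb::ColumnFamilyHandle* cf;
rocksdb::Status s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "new_cf", &cf);

// reads and writes take the column family handle explicitly
s = db->Put(rocksdb::WriteOptions(), cf, "key1", "value1");
std::string value;
s = db->Get(rocksdb::ReadOptions(), cf, "key1", &value);

s = db->DropColumnFamily(cf);            // remove the column family and its data
s = db->DestroyColumnFamilyHandle(cf);   // release the handle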

Implementation:

  • The main implementation idea is that column families share a single WAL but do not share memtables or table files. Sharing the WAL makes atomic writes across column families possible; keeping memtables and table files separate allows each column family to be configured independently and to be dropped quickly.
  • Whenever an individual column family is flushed, a new WAL file is created, and all new writes of all column families go to the new WAL. However, the old WAL cannot be deleted yet, because it may still hold data that other column families have not flushed; it can only be deleted once every column family has flushed the data it had in that WAL. This leads to some interesting implementation details and tuning requirements: make sure all your column families are flushed regularly. Also take a look at Options::max_total_wal_size; with it configured, column families holding stale data are flushed automatically.

Transactions

RocksDB supports transactions when opened as a TransactionDB or an OptimisticTransactionDB. Transactions come with a simple BEGIN/COMMIT/ROLLBACK API and allow applications to modify data concurrently, while RocksDB handles the conflict checking. RocksDB supports both pessimistic and optimistic concurrency control.
Note that RocksDB already provides atomicity when writing multiple keys through a WriteBatch. Transactions go further and guarantee that the batch is committed only if there is no conflict. Like a WriteBatch, the modifications become visible to other threads only when the transaction commits (read committed).
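
An atomic multi-key write without a transaction can be expressed with a WriteBatch. A sketch, where cf1 and cf2 stand for previously created column family handles and db is an open rocksdb::DB*:

#include "rocksdb/write_batch.h"

rocksdb::WriteBatch batch;
batch.Put(cf1, "key1", "value1");   // cross-column-family writes in one batch
batch.Put(cf2, "key2", "value2");
batch.Delete("obsolete_key");       // default column family
// the whole batch is applied atomically, or not at all
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);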

Pessimistic transaction (TransactionDB)

  • When using a TransactionDB, every key being modified inside a transaction is locked, which allows RocksDB to detect conflicts. If a lock conflict occurs on a key, the operation returns an error. When a transaction commits, the database guarantees that it can be written.
  • A TransactionDB performs better than an OptimisticTransactionDB under heavy concurrent workloads. However, because of its fairly aggressive locking strategy, using a TransactionDB carries some performance cost.
  • A TransactionDB performs conflict checking on all write operations, including writes made outside of transactions.
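
A pessimistic transaction sketch: locks are taken as keys are touched, and conflicting operations fail immediately. The path and keys are placeholders:

#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

rocksdb::Options options;
options.create_if_missing = true;
rocksdb::TransactionDBOptions txn_db_options;
rocksdb::TransactionDB* txn_db;
rocksdb::Status s = rocksdb::TransactionDB::Open(options, txn_db_options, "/tmp/txn_db", &txn_db);

rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
std::string value;
s = txn->GetForUpdate(rocksdb::ReadOptions(), "counter", &value);  // read and lock the key
s = txn->Put("counter", "42");                                     // the lock is already held
s = txn->Commit();                                                 // succeeds if no earlier call failed
delete txn;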

Optimistic TransactionDB

  • Optimistic transactions provide lightweight optimistic concurrency control for work scenarios where there is no high competition or interference between multiple transactions.
  • Optimistic transactions take no locks when writing. Instead, the check is deferred to commit time, when RocksDB verifies whether any other writer has modified the keys touched by the transaction. If there is a conflict with another writer (or the check cannot be decided), the commit returns an error and none of the keys are written.
  • Optimistic concurrency control is very effective in handling those occasional write conflicts. However, it is not a good idea for scenarios where a large number of transactions write to the same key, causing write conflicts to occur frequently. For these scenarios, using TransactionDB is a better choice. OptimisticTransactionDB performs better than TransactionDB in scenarios where there are a large number of non-transaction writes and a small number of transaction writes.
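
An optimistic transaction sketch: no locks are taken, and conflicts surface only at commit time, so the caller should be prepared to retry. The path and key are placeholders:

#include "rocksdb/utilities/optimistic_transaction_db.h"
#include "rocksdb/utilities/transaction.h"

rocksdb::Options options;
options.create_if_missing = true;
rocksdb::OptimisticTransactionDB* otxn_db;
rocksdb::Status s = rocksdb::OptimisticTransactionDB::Open(options, "/tmp/otxn_db", &otxn_db);

rocksdb::Transaction* txn = otxn_db->BeginTransaction(rocksdb::WriteOptions());
txn->Put("key1", "value1");   // buffered in the transaction; no lock is taken
s = txn->Commit();            // conflict detection happens here; a busy status means retry
delete txn;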


Origin blog.csdn.net/m0_68678128/article/details/134892199