RocksDB Explained


 

1. Introduction

The RocksDB project originated as a Facebook experiment to build an efficient database engine for server workloads on fast storage devices (especially flash), while fully exploiting the potential of that class of hardware. RocksDB is a C++ library for storing key-value data and provides atomic reads and writes. It is highly configurable and runs in a wide range of production environments, including pure memory, flash, HDD, and HDFS. It supports a variety of compression algorithms and ships with tools for production support and debugging. RocksDB borrows a great deal of code from LevelDB and many ideas from Apache HBase; it was initially developed from LevelDB 1.5.

RocksDB is an embedded key-value store in which keys and values are arbitrary byte streams. All data is stored in sorted order, and the engine supports Get(key), Put(key), Delete(key) and NewIterator(). The basic components of RocksDB are the memtable, the sstfile and the logfile. The memtable is an in-memory data structure: write requests go to the memtable first and are then optionally written to the logfile, a sequentially written file (the WAL). When the memtable fills up, its data is flushed to an sstfile and the corresponding logfile can then be safely deleted. Data inside an sstfile is also stored in sorted order to make lookups easy.

Keys and values in RocksDB are plain byte streams, with no restriction on key or value size. The Get interface lets the user look up the value for a single key in the DB; MultiGet provides batched lookups. All data in the DB is stored sorted by key, and the key comparison method can be user-defined. The Iterator API provides RangeScan functionality: first seek to a particular key, then traverse from that point onward. The Iterator can also perform a RangeScan in reverse order. While an Iterator is in use, the user sees a consistent view of the database as of a single point in time.
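For illustration, a minimal C++ sketch of these interfaces (the /tmp/rocksdb_demo path and the assert-based error handling are just placeholders):

```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;  // create the DB if it does not exist

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // Put / Get / Delete on arbitrary byte-stream keys and values.
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
  assert(s.ok() && value == "value1");

  s = db->Delete(rocksdb::WriteOptions(), "key1");
  assert(s.ok());

  // RangeScan with an iterator: seek to a key, then traverse forward.
  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  for (it->Seek("a"); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << " -> " << it->value().ToString() << "\n";
  }
  assert(it->status().ok());
  delete it;

  delete db;
  return 0;
}
```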

Fault Tolerance

RocksDB detects on-disk data corruption with checksums. Each data block (4 KB to 128 KB) in an sst file has a corresponding checksum. Once written, the contents of a data block are never modified in place.

Multi-Threaded Compactions     

When a user writes the same key repeatedly, multiple versions of that key accumulate in the DB; compaction removes this redundant data. When a key is deleted, compaction is also what actually removes the underlying data. With the appropriate configuration, compaction can run on multiple threads. Data in the DB is stored in sstfiles; when an in-memory table fills up, its data is written out to an L0 file. Periodically, small files are merged into larger files, and this is compaction. The write throughput of an LSM engine depends directly on compaction performance, especially when the data resides in RAM or on SSD. RocksDB supports multi-threaded parallel compaction. Dedicated background threads flush memtables to storage; if all background threads were busy with compaction, a sudden burst of writes would quickly fill the memtables and stall writes. A smaller number of threads can therefore be reserved specifically for flushing memtables.
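As a hedged sketch of the related knobs (option names as in include/rocksdb/options.h; the value 4 is only an example), raising the background job budget lets flushes run alongside compactions:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeBackgroundThreadOptions() {
  rocksdb::Options options;
  // Shared budget for background flush and compaction threads.
  options.max_background_jobs = 4;
  // Alternatively, a convenience helper that sizes the thread pools for
  // roughly this many cores and enables related parallelism settings:
  // options.IncreaseParallelism(4);
  return options;
}
```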

Block Cache -- Compressed and Uncompressed Data     

RocksDB serves block reads from an LRU cache. The block cache is partitioned into two individual caches: one caches uncompressed data in RAM, the other caches compressed data in RAM. If the compressed cache is enabled, users will generally also enable direct I/O so that the OS page cache does not cache the same compressed data a second time.

Available Configurations

Whether options are given as an option string or an option map, an option name is the variable name in the target class: DBOptions, ColumnFamilyOptions, BlockBasedTableOptions, or PlainTableOptions. Variable information for DBOptions and ColumnFamilyOptions can be found in options.h, and for BlockBasedTableOptions and PlainTableOptions in table.h. Note that although most configuration items are supported by both the option string and the option map, there are exceptions. All configuration items supported by RocksDB can be found in db_options_type_info, cf_options_type_info and block_based_table_type_info in the source file util/options_helper.h.
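For illustration, a small sketch using RocksDB's convenience API to parse such an option string (function and option names as declared in rocksdb/convenience.h and options.h; older and newer releases expose slightly different overloads):

```cpp
#include <cassert>
#include <string>

#include "rocksdb/convenience.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::ColumnFamilyOptions base_cf, new_cf;
  // Option names in the string are the variable names from ColumnFamilyOptions.
  rocksdb::Status s = rocksdb::GetColumnFamilyOptionsFromString(
      base_cf, "write_buffer_size=67108864;max_write_buffer_number=3", &new_cf);
  assert(s.ok());
  return 0;
}
```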

LSM-Tree

RocksDB is based on the LSM-Tree, which works roughly as follows.

First of all, any write is first written to the WAL and then to the memory table (memtable). For performance, the WAL can be skipped, at the risk of losing data if the process crashes. The memtable is usually a skiplist that supports concurrent writes, but RocksDB also supports several other memtable implementations, so users can choose according to their actual workload.

Once a memtable fills up, it becomes an immutable memtable. A background flush thread in RocksDB flushes this memtable to disk, producing a Sorted String Table (SST) file at Level 0. When the number of SST files at Level 0 exceeds a threshold, they are compacted into Level 1 according to the compaction policy, and so on down the levels.

The key here is compaction. Without compaction, writes would be very fast, but read performance would degrade and space amplification would become a serious problem. To balance writes, reads and space, RocksDB executes compaction in the background, merging SST files across levels. Compaction is not free, however: it consumes I/O, and so it inevitably affects foreground reads and writes.

RocksDB has three compaction strategies: the default Leveled Compaction; Universal Compaction, which is what is usually called Size-Tiered Compaction; and FIFO Compaction. FIFO's strategy is very simple: all SSTs stay in Level 0, and once the size threshold is exceeded, the oldest SSTs are deleted first. As you can see, this mechanism is very well suited to storing time-series data.

In practice, RocksDB actually uses a hybrid strategy: Level 0 is effectively size-tiered, while the other levels are leveled.
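A sketch of selecting a compaction style through Options (enum and field names as in rocksdb/advanced_options.h; the 1 GB FIFO limit is only an example):

```cpp
#include "rocksdb/options.h"

rocksdb::Options FifoOptionsForTimeSeries() {
  rocksdb::Options options;
  // The default is leveled compaction: rocksdb::kCompactionStyleLevel.
  // Universal (size-tiered style) would be rocksdb::kCompactionStyleUniversal.
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  // FIFO keeps everything in L0 and drops the oldest SSTs once the
  // total size exceeds this limit -- a good fit for time-series data.
  options.compaction_options_fifo.max_table_files_size = 1ull << 30;  // ~1 GB
  return options;
}
```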

Now a few amplification factors. For an LSM tree we need to consider write amplification, read amplification and space amplification. Read amplification is RA = disk reads per query; for example, if the user wants to read one page but three pages are actually read underneath, the read amplification is 3. Write amplification is WA = data written to disk / data written to the database; for example, if the user writes 10 bytes but 100 bytes end up being written to disk, the write amplification is 10. Space amplification is SA = size of the database files on disk / logical size of the data; that is, if the data is 100 MB but it actually occupies 200 MB on disk, the space amplification is 2.

 


 

2. Compaction

An LSM-Tree converts discrete random write requests into batched sequential writes (WAL + compaction) in order to improve write performance. But it also brings some problems:

  • Read amplification. A read in an LSM-Tree must search from the newest to the oldest level (top to bottom), layer by layer, until the desired data is found. This may take more than one I/O, and the impact is especially noticeable for range queries.
  • Space amplification. Because all writes are sequential (append-only) rather than in-place updates, stale data is not cleaned up immediately.

RocksDB and LevelDB use background compaction to reduce read amplification (fewer SST files to search) and space amplification (stale data is cleaned up), but this introduces write amplification.

  • Write amplification: the ratio of the amount of data actually written to the HDD/SSD to the amount of data the program asked to write. Normally, the data written to the HDD/SSD is larger than what the upper-layer program wrote.

In the era when HDDs were the mainstream storage, the write amplification caused by RocksDB's compaction was not a big problem, because:

  • HDD sequential read/write performance is far superior to its random read/write performance, more than enough to offset the cost of write amplification.
  • The amount written to an HDD has essentially no effect on its lifetime.

Now that SSDs are becoming the mainstream storage, the write amplification caused by compaction is becoming more and more serious:

  • SSD sequential read/write performance is better than its random read/write performance, but the gap is not as large as on HDDs. So the benefit of sequential writes over random writes may not offset the cost of write amplification; this is a problem.
  • SSD lifetime is tied to the amount written, so severe write amplification greatly shortens an SSD's life. An SSD cannot overwrite in place: a block must be erased before it can be rewritten, and each SSD block (the basic unit of an erase operation) supports only a limited number of erases.

So on SSDs, the write amplification of an LSM-Tree is a very interesting problem. Write amplification, read amplification and space amplification must be traded off against each other, much like the three properties in the CAP theorem.

An analysis of RocksDB's write amplification:

+1 - write to the redo log (WAL)

+1 - immutable memtable flushed to an L0 file

+2 - L0 and L1 compaction (the key ranges of L0 SST files overlap, so for performance L0 and L1 are generally kept about the same size, and each compaction takes the entire contents of both L0 and L1)

+11 - compaction from Ln-1 into Ln (n >= 2; by default Ln is 10 times the size of Ln-1, see max_bytes_for_level_multiplier, so each byte from Ln-1 is rewritten along with roughly 10 bytes of overlapping Ln data)

Therefore, the total write amplification is 4 + 11 * (n - 1) = 11 * n - 7. The key question is the value of n.

Assuming max_bytes_for_level_multiplier keeps its default value of 10, the value of n depends on the size of L1 and the total size of the LSM-Tree.

The size of L1 is determined by max_bytes_for_level_base, which defaults to 256 MB.

By default, L0 is kept about as large as L1, i.e. 256 MB. But L0 is special: when the number of SST files in L0 reaches level0_file_num_compaction_trigger, the L0 -> L1 compaction is triggered. So the maximum size of L0 is write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger.

write_buffer_size defaults to 64 MB

min_write_buffer_number_to_merge defaults to 1

level0_file_num_compaction_trigger defaults to 4

So by default L0 is at most 64 MB * 1 * 4 = 256 MB.

Therefore, the default maximum size of each level in RocksDB is as follows:

L0 - 256 MB

L1 - 256 MB

L2 - 2.5 GB

L3 - 25 GB

L4 - 250 GB

L5 - 2500 GB
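These sizes follow directly from a handful of options; a sketch of setting them explicitly to their documented defaults (names as in options.h):

```cpp
#include "rocksdb/options.h"

rocksdb::Options DefaultLevelSizing() {
  rocksdb::Options options;
  options.write_buffer_size = 64 << 20;            // 64 MB per memtable
  options.min_write_buffer_number_to_merge = 1;
  options.level0_file_num_compaction_trigger = 4;  // L0 max ~= 64 MB * 1 * 4 = 256 MB
  options.max_bytes_for_level_base = 256 << 20;    // L1 target size: 256 MB
  options.max_bytes_for_level_multiplier = 10;     // each level 10x the previous one
  return options;
}
```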

Tiered Compaction vs Leveled Compaction

As we all know, in an LSM a write first goes into a memtable and is then flushed to disk in the background, forming an SST file, so writes are quite cheap. But a read may have to traverse all of the SST files, and that overhead is very large. At the same time, an LSM uses multi-versioning: a key may be updated frequently, so multiple versions of it remain inside the LSM and take up space.

To solve these two problems, the LSM performs compaction in the background, i.e., it rearranges SST files to improve read performance and to free the space held by obsolete versions. There are generally two styles of LSM compaction: one is tiered, the other is leveled.

The two styles differ as follows. Once Level 0 has been flushed into Level 1 and the SST files at Level 1 reach a set threshold, compaction is needed. With tiered compaction, all the Level 1 SST files are merged into a single SST at Level 2. In other words, for tiered compaction, a compaction merges all the small SSTs of an upper level into one larger SST at the next level.

With leveled compaction, the SSTs within each level are all the same size. Some Level 1 SSTs are merged with the overlapping SSTs of Level 2, eventually producing a run of sorted, non-overlapping SSTs at Level 2.

The above is only a brief introduction; see the two ScyllaDB articles, Write Amplification in Leveled Compaction and Space Amplification in Size-Tiered Compaction, which explain the difference between these two compaction styles in detail.

 


 

3. Block Cache

Block cache is how RocksDB caches data in memory to improve read performance. A developer creates a cache object with a given capacity and passes it into the engine. The same cache object can be shared by multiple DB instances in one process, so the developer can control total cache usage through a single configuration. The block cache stores uncompressed block contents; optionally, a second block cache can be configured to store compressed blocks. Reads look in the uncompressed block cache first and then in the compressed block cache. When direct I/O is enabled, the compressed cache can serve as a replacement for the OS page cache. RocksDB has two cache implementations, LRUCache and ClockCache. Both are sharded to reduce lock contention, and the user-specified capacity is divided evenly among the shards. By default, each cache is split into 64 shards, each no smaller than 512 KB.
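A sketch of this setup (APIs from rocksdb/cache.h and rocksdb/table.h; the capacities are only examples, and block_cache_compressed has been removed in very recent releases):

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeCachedOptions() {
  // One 1 GB uncompressed block cache; the same object can be shared by
  // every DB instance whose options reference it.
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(1ull << 30);

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = cache;
  // Optional second cache for compressed blocks.
  table_options.block_cache_compressed = rocksdb::NewLRUCache(256ull << 20);

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  // With a compressed cache it is common to bypass the OS page cache.
  options.use_direct_reads = true;
  return options;
}
```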

LRU Cache

By default, RocksDB uses an LRU cache with a default size of 8 MB. Each shard of the cache has its own LRU list and hash table for lookups, and its own mutex for concurrent access; a thread must take the shard's lock both to look up and to insert data. Developers can call NewLRUCache() to create an LRU cache. The function offers several useful parameters for configuring the cache:

capacity - the total size of the cache

num_shard_bits - how many bits of the cache key are used to select the shard id; the cache is sharded into 2^num_shard_bits shards

strict_capacity_limit - in rare cases the block cache can grow past its capacity, for example when ongoing read or iteration requests pin blocks whose total size exceeds the capacity. If further reads insert blocks into the cache while strict_capacity_limit=false (the default), the cache ignores the capacity limit and allows the insert; if the host does not have enough memory, this can cause the DB instance to OOM. Setting this option to true makes the cache refuse to insert more data and fail those reads or iterations instead. The option is enforced per shard, so one shard can be at capacity and refusing inserts while another shard still has extra unpinned space.

high_pri_pool_ratio - the fraction of the capacity reserved for high-priority blocks
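A sketch of passing these parameters to NewLRUCache (parameter order as declared in rocksdb/cache.h; the 512 MB capacity is only an example):

```cpp
#include <memory>

#include "rocksdb/cache.h"

// 512 MB capacity, 2^6 = 64 shards, refuse inserts beyond capacity,
// and reserve half of the capacity for high-priority blocks.
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(
    512 << 20 /* capacity */,
    6 /* num_shard_bits */,
    true /* strict_capacity_limit */,
    0.5 /* high_pri_pool_ratio */);
```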

Clock Cache

  ClockCache implements the CLOCK algorithm. Each shard of a clock cache maintains a circular list of cache entries. The algorithm walks the circular list looking for unpinned entries to evict, but an entry that has been used since the last scan gets another chance to stay in the cache. A tbb::concurrent_hash_map is used to look up entries.

  One benefit of ClockCache over LRUCache is finer-grained locking. In LRUCache even a lookup must take the shard lock, because it may modify the LRU list. In ClockCache a lookup does not need the shard lock; it only consults the hash map, and the shard lock is required only for inserts. Compared with the LRU cache, the clock cache therefore offers somewhat higher write throughput.

When creating a clock cache, several parameters can also be configured:

capacity - same as LRUCache

num_shard_bits - same as LRUCache

strict_capacity_limit - same as LRUCache

Simulated Cache

SimCache is a way to predict the cache hit rate if the cache capacity or the number of shards were changed. It wraps the real cache object and runs a shadow LRU cache that mimics a cache with the given capacity and shard count, measuring cache hits and misses. The tool is useful in situations like this: a developer has a DB instance running with a 4 GB cache and wants to know what the cache hit rate would be if the cache size were raised to 64 GB.

The basic idea of SimCache is to wrap the normal block cache with a simulated cache of the target capacity that stores only keys, not values. When data is inserted, the key goes into both caches, but the value goes only into the normal cache. The value's size is charged against both caches, but because SimCache holds only keys it does not actually consume that much memory, yet it can still simulate the behavior of a block cache of that size.
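A sketch of wrapping a real cache in a SimCache (the NewSimCache factory is declared in rocksdb/utilities/sim_cache.h; its exact parameters may differ by version, and the capacities are only examples):

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/utilities/sim_cache.h"

rocksdb::Options MakeSimCachedOptions() {
  // Real cache: 4 GB. The SimCache additionally simulates a 64 GB cache
  // (keys only) to estimate what the hit rate would be at that size.
  std::shared_ptr<rocksdb::Cache> real_cache = rocksdb::NewLRUCache(4ull << 30);
  std::shared_ptr<rocksdb::SimCache> sim_cache = rocksdb::NewSimCache(
      real_cache, 64ull << 30 /* sim_capacity */, 6 /* num_shard_bits */);

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = sim_cache;  // use the wrapper as the block cache

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  // After running some load, the SimCache object reports the simulated
  // hit and miss counters.
  return options;
}
```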

 


 

4. MemTable

A MemTable is an in-memory data structure that holds data until, at an appropriate time, it is flushed to an SST file. The memtable serves both reads and writes: writes always go to the memtable first, and reads query the memtable before querying SST files, because the memtable always holds the newest data. Once a memtable is full, it becomes immutable and read-only, and a new memtable is created to accept new writes. A background thread flushes the memtable's contents to an SST file, after which the memtable is destroyed.

Important options:

memtable_factory: the factory object for memtables. Through it, the user can change the underlying memtable implementation and supply implementation-specific options.

write_buffer_size: the size limit of a single memtable.

db_write_buffer_size: the total memtable size across all column families. This option caps total memtable memory usage.

write_buffer_manager: rather than a plain total size limit, this option lets the user supply their own write buffer manager to control overall memtable memory usage. It overrides db_write_buffer_size.

max_write_buffer_number: the maximum number of memtables.

The default memtable implementation is a skiplist. Besides the default, users can choose other implementations such as HashLinkList, HashSkipList or Vector to speed up certain queries.

Skiplist MemTable

A skiplist-based memtable gives good performance for reads, writes, random access and sequential scans. It also supports features the other implementations do not, such as concurrent insert and insert with hint.

HashSkiplist MemTable

As the name suggests, HashSkipList organizes data in a hash table in which every bucket is a skip list, while HashLinkList organizes data in a hash table in which every bucket is a sorted singly linked list. Both structures aim to reduce the number of comparisons during queries. A typical use is to combine such a memtable with the PlainTable SST format and store the data in RAMFS. When looking up or inserting a key, the key's prefix is extracted with Options.prefix_extractor to locate the hash bucket; inside the bucket, comparisons use the whole key. The biggest limitation of the hash-based memtables is that scanning across multiple key prefixes requires copying and sorting, which is very slow and memory-hungry.
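A sketch of this combination (factories from rocksdb/memtablerep.h, rocksdb/table.h and rocksdb/slice_transform.h; the 3-byte prefix length is only an example):

```cpp
#include "rocksdb/memtablerep.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

rocksdb::Options MakeHashMemtableOptions() {
  rocksdb::Options options;
  // Group keys into hash buckets by their first 3 bytes.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(3));
  // Memtable: a hash table whose buckets are skip lists.
  options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
  // PlainTable SST format, typically kept on a RAM file system.
  options.table_factory.reset(rocksdb::NewPlainTableFactory());
  options.allow_mmap_reads = true;  // PlainTable reads via mmap
  return options;
}
```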

Flush

A memtable flush is triggered in any of the following three cases:

The memtable size exceeds write_buffer_size.

The total memtable size across all column families exceeds db_write_buffer_size, or write_buffer_manager signals a flush. In this case the largest memtable is chosen and flushed.

The total WAL file size exceeds max_total_wal_size. In this case the memtable holding the oldest data is chosen and flushed, so that the WAL file containing that data can be recycled.

As a result, a memtable can be flushed before it is full. This is one reason a generated SST file is smaller than the corresponding memtable; compression is another (data in the memtable is uncompressed, while the SST file is compressed).
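A sketch of the options these triggers refer to, plus a manual flush (names as in options.h; the values are only examples):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Options MakeFlushOptions() {
  rocksdb::Options options;
  options.write_buffer_size = 64 << 20;      // per-memtable limit
  options.db_write_buffer_size = 512 << 20;  // total across all column families
  options.max_total_wal_size = 1ull << 30;   // past this, flush the oldest memtable
  return options;
}

// A flush can also be requested explicitly on an open DB:
// db->Flush(rocksdb::FlushOptions());
```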

Concurrent Insert

Without support for concurrent insert into the memtable, concurrent writes from multiple threads are applied to the memtable sequentially. Concurrent memtable insert is enabled by default and can be turned off via allow_concurrent_memtable_write.
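A sketch of the relevant switch (field names as in options.h; enable_write_thread_adaptive_yield is an additional, related knob not mentioned above):

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeConcurrentWriteOptions() {
  rocksdb::Options options;
  // Enabled by default; only the (default) skiplist memtable supports it.
  options.allow_concurrent_memtable_write = true;
  // Commonly enabled alongside concurrent writes to reduce context switching.
  options.enable_write_thread_adaptive_yield = true;
  return options;
}
```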

 


 

5. Write Ahead Log

Every update to RocksDB is written to two places: (1) the memtable (an in-memory data structure, later flushed to an SST file) and (2) the write ahead log (WAL) on disk. If a failure occurs, the WAL can be used to recover the data that was in the memtable. By default, RocksDB keeps the two consistent by flushing (fflush) the WAL file on every user write.
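A sketch of the per-write WAL controls in WriteOptions (field names as in rocksdb/options.h; keys and values are placeholders):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

void WriteDurabilityExamples(rocksdb::DB* db) {
  // Default: the write goes to the memtable and is appended to the WAL.
  rocksdb::WriteOptions defaults;
  db->Put(defaults, "k1", "v1");

  // Skip the WAL entirely: faster, but recent writes are lost on a crash.
  rocksdb::WriteOptions no_wal;
  no_wal.disableWAL = true;
  db->Put(no_wal, "k2", "v2");

  // fsync the WAL for this write: slowest, most durable.
  rocksdb::WriteOptions synced;
  synced.sync = true;
  db->Put(synced, "k3", "v3");
}
```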

 


 

6. Write Buffer Manager

The write buffer manager helps developers manage the memtable memory usage of column families or DB instances. It can be used to:

  • Cap the total memory used by memtables under a threshold
  • Charge memtable memory usage to the block cache

  The write buffer manager is similar to rate_limiter and sst_file_manager: the user creates a write buffer manager object and passes it into the options of the column families or DBs that should share it. See the comments in write_buffer_manager.h to learn how to use it.

Limit total memory of memtables

  The memory limit threshold is fixed when the write buffer manager object is created, and RocksDB manages overall memtable memory usage against that threshold.

  In version 5.6 and later, if total memtable usage exceeds 90% of the threshold, a flush is triggered on one of the column families that is currently taking writes. If the actual memory usage of the DB instance exceeds the threshold, a more aggressive flush is triggered even when total memtable usage is below 90%. Before 5.6, a flush was triggered only when total memtable memory exceeded the threshold.

  In version 5.6 and later, memory is counted as the total memory allocated by the arena, even if that memory is not used by the memtable. Before 5.6, only the memory actually used by the memtables was counted.

Cost memory used in memtable to block cache

  Since version 5.6, users can charge memtable memory usage to the block cache. This works whether or not the memtable memory limit is enabled.

  In most cases, the blocks actually in use are far fewer than the blocks held in the block cache, so when this feature is enabled the block cache capacity can cover both the block cache's own usage and the memtables' memory usage. If the user also enables cache_index_and_filter_blocks, all three kinds of memory usage are accounted for in the block cache.

  The implementation works as follows: for every 1 MB of memory allocated for memtables, the WriteBufferManager puts a dummy 1 MB entry into the block cache, so the block cache can track its usage correctly and evict blocks when needed to free memory. If memtable memory usage drops, the WriteBufferManager does not remove the dummy blocks immediately; it releases them gradually later, because memtable memory usage naturally goes up and down and RocksDB does not need to react to every fluctuation.

  • Pass the block cache that is in use to the WriteBufferManager.
  • Pass the maximum memory RocksDB may use for memtables as the WriteBufferManager's limit.
  • Set the block cache capacity to the sum of the data block and memtable memory budgets.
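A sketch that follows these three steps (the WriteBufferManager constructor taking a cache is declared in rocksdb/write_buffer_manager.h; the sizes are only examples):

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/write_buffer_manager.h"

rocksdb::Options MakeManagedOptions() {
  // Step 3: block cache sized for data blocks plus memtables (e.g. 3 GB + 1 GB).
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(4ull << 30);

  // Steps 1 and 2: cap memtable memory at 1 GB and charge it to that cache.
  auto write_buffer_manager =
      std::make_shared<rocksdb::WriteBufferManager>(1ull << 30, cache);

  rocksdb::Options options;
  options.write_buffer_manager = write_buffer_manager;

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = cache;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```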

 

 

 


 


 
