Designing DIA note 15 -- LSM-tree, B-tree

3.1.2.1 Constructing and maintaining SSTables

Maintaining a sorted structure in memory is easy: we can use a balanced tree such as a red-black tree or an AVL tree.
To make our storage engine work:

  • For writes -- add to an in-memory balanced tree ==> memtable
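  • When the memtable grows beyond a few MB, write it out to disk as an SSTable file, which becomes the most recent segment
  • While the SSTable is being written to disk, writes can continue to a new memtable instance
  • For reads -- look for the key first in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc
  • From time to time, run a merging/compaction process in the background to combine segment files and discard overwritten or deleted values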
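
The steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `TinyLSM` and `memtable_limit` are hypothetical names, a dict stands in for the balanced tree, segments stay in memory instead of being written to files, the limit counts keys rather than bytes, and flushing is synchronous.

```python
import bisect


class TinyLSM:
    """Toy storage engine: one memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}      # stand-in for a balanced tree
        self.segments = []      # stand-ins for SSTable files, oldest first
        self.memtable_limit = memtable_limit  # hypothetical threshold, in keys

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted segment, then start a new memtable.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest segment first so the most recent value wins.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Because each segment is sorted, a lookup inside it is a binary search rather than a scan. Compaction is deliberately left out of this sketch.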

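Problem : if the DB crashes, the most recent writes (which are in the memtable but not yet on disk) are lost.
Solution : keep a separate append-only log on disk and append every write to it immediately. The log does not need to be sorted, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.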
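
A rough sketch of the recovery log idea, assuming one JSON-encoded `[key, value]` pair per line. The class name `LoggedMemtable` and its methods are hypothetical, and a real engine would also have to cope with a half-written last record, which this sketch ignores.

```python
import json
import os


class LoggedMemtable:
    """Memtable (a dict here) backed by an unsorted append-only log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # rebuild the memtable from any log left by a crash
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                key, value = json.loads(line)
                self.memtable[key] = value

    def put(self, key, value):
        # Append to the log and force it to disk before updating memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.memtable[key] = value

    def flushed_to_sstable(self):
        # Call once the memtable is safely on disk as an SSTable:
        # the old log is no longer needed.
        self.log.close()
        os.remove(self.log_path)
        self.memtable = {}
        self.log = open(self.log_path, "a", encoding="utf-8")

    def close(self):
        self.log.close()
```

Replaying the log in order reproduces the memtable, because later records for the same key overwrite earlier ones.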

3.1.2.2 Making an LSM-tree out of SSTables

Storage engines that are based on this principle of merging and compacting sorted files are often called LSM (Log-Structured Merge) storage engines.
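
The merge step can be sketched as a k-way merge over sorted segments, much like the merge phase of mergesort. The function name `compact` and the in-memory list-of-pairs representation are assumptions made for illustration. When the same key appears in several segments, the value from the most recent segment is kept.

```python
import heapq


def compact(segments):
    """Merge sorted (key, value) segments, given oldest first.

    Returns one sorted segment holding only the newest value per key.
    """
    def tagged(segment, age):
        for key, value in segment:
            yield key, age, value

    # age 0 is the newest segment, so for equal keys it sorts first.
    streams = [tagged(seg, age) for age, seg in enumerate(reversed(segments))]
    merged = []
    for key, _age, value in heapq.merge(*streams, key=lambda t: (t[0], t[1])):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence is the newest one
    return merged
```

`heapq.merge` reads its inputs lazily, so the same approach works when segments are streamed from files that are larger than memory.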

Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary.

Idea of a full-text index -- given a word in a search query, find all the documents that mention the word. This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
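
A minimal sketch of the term-to-postings-list mapping. The function name `build_index` and the lowercase-and-split tokenization are assumptions for illustration only; Lucene's actual analysis and storage are far more elaborate.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer (assumption)
            postings[term].add(doc_id)
    # Keep the terms in sorted order, like the keys of an SSTable.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "red apple", 2: "green apple", 3: "red car"}
index = build_index(docs)
```

Answering a one-word query is then a single key lookup, for example `index["apple"]`.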

3.1.2.3 Performance optimizations

Problem 1
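The LSM-tree algorithm can be slow when looking up keys that do not exist in the DB: you have to check the memtable and then every segment back to the oldest one (possibly reading from disk each time) before you can be sure the key is absent.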

Solution
Use Bloom filters -- a Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you when a key definitely does not appear in the DB, which saves many unnecessary disk reads for nonexistent keys.
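
A small Bloom filter sketch, to show why the answer "not present" can be trusted while "present" cannot. The class name, the default sizes and the use of SHA-256 with a per-hash prefix are all illustrative assumptions, not what any particular engine does.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by prefixing the key with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A read path would consult the filter first and skip the disk lookup whenever `might_contain` returns False. A True result can be a false positive, so the segment still has to be checked in that case.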

Problem 2
How to determine the order and timing of how SSTables are compacted and merged?

Solution
size-tiered compaction -- newer & smaller SSTables are successively merged into older & larger ones (HBase; Cassandra supports both strategies)
leveled compaction -- the key range is split up into smaller SSTables and older data is moved into separate "levels" ==> compaction can proceed more incrementally and use less disk space (LevelDB, RocksDB; Cassandra supports both strategies)


  • Basic idea of LSM-tree -- keeping a cascade of SSTables that are merged in the background
  • Even when the dataset is much bigger than the available memory, an LSM-tree continues to work well
    • READ -- you can efficiently perform range queries since data is stored in sorted order
    • WRITE -- LSM-tree can support remarkably high write throughput because the disk writes are sequential
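
Why sorted order helps range queries can be seen in a short sketch: two binary searches locate the ends of the range, and everything in between is one contiguous slice. The function name `range_scan` and the in-memory list of pairs are assumptions for illustration.

```python
import bisect


def range_scan(segment, start, end):
    """Return the (key, value) pairs with start <= key < end.

    `segment` must be a list of pairs sorted by key.
    """
    lo = bisect.bisect_left(segment, (start,))
    hi = bisect.bisect_left(segment, (end,))
    return segment[lo:hi]


segment = [("a", 1), ("c", 3), ("e", 5), ("g", 7)]
```

With an unsorted log, the same query would have to examine every record.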

3.1.3 B-Trees

B-tree -- the most widely used indexing structure, the standard index implementation in almost all relational DBs and many nonrelational DBs.

Index structure | Similarity | Difference
B-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the DB down into fixed-size blocks/pages (traditionally 4 KB), reads/writes 1 page at a time
LSM-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the DB down into variable-size segments (typically several MB), always written sequentially

Each page can be identified using an address or location, which allows 1 page to refer to another (similar to a pointer, but on disk instead of in memory).


Figure 3-7. Looking up a key using a B-tree index
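
The lookup shown in Figure 3-7 can be sketched as follows. Here Python object references stand in for on-disk page addresses, and the `Page` class and `lookup` function are hypothetical names. In an interior page the keys act as boundaries between the ranges covered by its children.

```python
import bisect


class Page:
    """Interior pages hold keys and children; leaf pages hold keys and values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted
        self.children = children  # interior page: len(keys) + 1 child pages
        self.values = values      # leaf page: one value per key


def lookup(page, key):
    # Descend from the root: child i covers keys below keys[i],
    # child i + 1 covers keys from keys[i] upward.
    while page.children is not None:
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


leaf1 = Page([100, 150], values=["a", "b"])
leaf2 = Page([200, 250], values=["c", "d"])
leaf3 = Page([300], values=["e"])
root = Page([200, 300], children=[leaf1, leaf2, leaf3])
```

For example, `lookup(root, 250)` follows the reference between the boundaries 200 and 300 and finds the key in the second leaf.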

branching factor -- the number of references to child pages in 1 page of the B-tree (typically several hundred in practice)

  • To look up a key in the index, you start from the root page and go down to leaf pages.
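  • To update the value for an existing key, you search for the leaf page containing that key, change the value in that page and write the page back to disk
  • To add a new key, you find the page whose range encompasses the new key and add it there. If the page doesn't have enough free space, split it into two half-full pages and update the parent page to account for the new subdivision of key ranges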


    Figure 3-8. Growing a B-tree by splitting a page
  • this algorithm ensures that the tree remains balanced -- a B-tree with n keys always has a depth of O(log n)
  • most DBs can fit into a B-tree that is 3 or 4 levels deep
  • a 4-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB
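
The 256 TB figure can be checked with a quick calculation. The sketch below takes one reading that reproduces it: four levels of references, each fanning out 500 ways, with every page at the bottom holding 4 KB (4096 bytes). The function name is hypothetical.

```python
def btree_capacity_bytes(levels, branching_factor, page_size_bytes):
    """Upper bound on stored bytes: branching_factor ** levels pages of data."""
    return branching_factor ** levels * page_size_bytes


capacity = btree_capacity_bytes(levels=4, branching_factor=500, page_size_bytes=4096)
print(capacity / 10**12)  # prints 256.0, i.e. 256 TB
```

The number of reachable pages is multiplied by 500 with every extra level, which is why 3 or 4 levels are enough for most databases.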

Reference
Designing Data-Intensive Applications by Martin Kleppmann

Reposted from: https://www.jianshu.com/p/3fc426e4fa9c
