Basic principles and applications of the LSM-tree

The LSM-tree is very common in NoSQL systems and has basically become a standard component. Today we introduce the main idea of the LSM-tree and then look at LevelDB as an example.

LSM-tree

The LSM-tree originated in the 1996 paper "The Log-Structured Merge-Tree (LSM-Tree)". The paper is 32 pages long and I have not read it; my basic knowledge of the LSM-tree comes from the background sections of top conference papers and from the documentation of open-source systems. Today's content and figures come mainly from the FAST'16 paper "WiscKey: Separating Keys from Values in SSD-conscious Storage".

Look at the name first. "Log-structured" means structured like a log. A log is what a software system writes out as it runs, much like a person keeping a diary, page after page. The system does not write a log entry wrongly, so nothing ever needs to be changed; new entries are simply appended at the end. The write-ahead logs of the various databases are likewise written once and never modified, so "log-structured" basically means append-only. Note also that it is a "merge-tree": several pieces are merged into one.

Enough preamble; on to the main text.

The LSM-tree is designed for key-value storage systems. Such a system has two main operations: put(k, v), which writes a (k, v) pair, and get(k), which finds the v for a given k.

The LSM-tree's biggest feature is fast writes. Its main advantage is that it writes to disk sequentially, which beats the B-tree, which requires random writes. For sequential versus random writes on disk, see "The concepts of hard disk".

The figure shows the components of an LSM-tree. It is a multi-level structure shaped like a tree, small at the top and large at the bottom. C0, the first level, lives in memory and holds all of the most recently written (k, v) pairs. This in-memory structure is sorted, can be updated in place at any time, and can be queried at any time. The remaining levels, C1 to Ck, are on disk, and each level is a structure sorted by key.
[Figure: LSM-tree structure, with C0 in memory and C1 to Ck on disk]
Write path: a put(k, v) operation is first appended to the write-ahead log (WAL, meaning the log is written before the actual record), and is then applied to C0. When the data in C0 reaches a certain size, C0 and C1 are merged, much like merge sort; this process is called compaction. The merged new-C1 is written to disk sequentially and replaces old-C1. When C1 reaches a certain size, it is in turn merged with the level below. After a merge, all the old files can be deleted, leaving only the new ones.

Note that the same key may be written more than once, and the new version must override the old one. What counts as the new version? If I write (a = 1) and then write (a = 233), 233 is the new version. If an old version of a has already reached Ck and a new version now arrives in C0, there is no need to touch the lower-level file that holds the old version; old versions are cleaned up at merge time.
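
The merge step can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not code from any real system: `merge_runs` is a hypothetical name, each level is modelled as a list of (key, value) pairs sorted by key, and the newer run wins whenever both runs contain the same key.

```python
def merge_runs(newer, older):
    """Merge two key-sorted runs of (key, value) pairs.

    When the same key appears in both runs, the entry from `newer` is kept
    and the stale entry from `older` is dropped.
    """
    merged = []
    i = j = 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            merged.append(newer[i])
            i += 1
        elif newer[i][0] > older[j][0]:
            merged.append(older[j])
            j += 1
        else:
            # Same key in both runs: keep the new version only.
            merged.append(newer[i])
            i += 1
            j += 1
    # At most one of the two runs still has entries left.
    merged.extend(newer[i:])
    merged.extend(older[j:])
    return merged


c0 = [("a", 233), ("c", 3)]       # newest data
old_c1 = [("a", 1), ("b", 2)]     # holds a stale version of "a"
new_c1 = merge_runs(c0, old_c1)
print(new_c1)
```

Both inputs are read front to back and the output is produced in key order, which is why the result can be written to disk sequentially.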

The write path basically touches only the in-memory structure. Compaction can run asynchronously in the background and does not block writes.

Read path: as the write path shows, the newest data is in C0 and the oldest data is in Ck. A query therefore checks C0 first; if the key k is not found there, it checks C1, and so on, level by level.
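
A level-by-level lookup might look like the sketch below. It assumes, purely for illustration, that each level is a Python dict and that `levels` is ordered from C0 (newest) to Ck (oldest); `lsm_get` is a hypothetical name.

```python
def lsm_get(key, levels):
    """Search the levels from newest to oldest and return the first hit, or None."""
    for level in levels:
        if key in level:
            return level[key]
    return None


levels = [
    {"a": 233},           # C0: in memory, newest data
    {"b": 2},             # C1
    {"a": 1, "z": 26},    # C2: still holds a stale version of "a"
]
print(lsm_get("a", levels), lsm_get("z", levels), lsm_get("q", levels))
```

Because the search stops at the first level that contains the key, the stale copy of "a" in the lowest level is never returned.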

One query may need several single-point lookups, so reads are somewhat slower. The LSM-tree therefore mainly targets write-intensive workloads with relatively few queries.

The LSM-tree is used in many key-value databases, such as LevelDB and RocksDB. The distributed row store Cassandra also uses an LSM-tree storage architecture.

LevelDB

In fact, the model above still has a few problems. For example, while C0 is being merged with C1, what happens to new writes? Also, merging all of C0 with all of C1 every time makes the follow-up work very cumbersome. Here we take LevelDB as an example to see how a real system applies the LSM-tree idea.

The figure below shows the LevelDB architecture. First, the LSM-tree is divided into three kinds of files. The first kind is the two memtables in memory: the memtable, which accepts normal write requests, and the immutable memtable, which can no longer be modified.
[Figure: LevelDB architecture]
Another part is the SSTable (Sorted String Table) on disk: a sorted table of strings, where the strings are the keys of the data. SSTables are organized into seven levels (L0 to L6). The total size limit of each level is 10 times that of the level above it.

Write path: a write operation is first appended to the write-ahead log, and the data is then written to the memtable. When the memtable is full, it is switched to an immutable memtable that can no longer be changed, and a new memtable is opened to accept write requests. The immutable memtable can then be flushed to disk. It is flushed directly as an SSTable file in L0; it is not merged with the files already in L0.
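
The write path described above can be modelled with a small toy class. This is a sketch under loud assumptions, not LevelDB's implementation: `TinyLSM` and `memtable_limit` are hypothetical names, the WAL is a plain in-memory list standing in for an on-disk log, "full" means a fixed number of entries, and the flush happens immediately instead of in a background thread.

```python
class TinyLSM:
    """Toy model of the write path: WAL append, memtable, immutable memtable, L0 flush."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # entries before switching (assumption)
        self.wal = []           # stand-in for the on-disk write-ahead log
        self.memtable = {}      # accepts incoming writes
        self.immutable = None   # frozen memtable waiting to be flushed
        self.l0_files = []      # one sorted file per flush, oldest first

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to the log first
        self.memtable[key] = value      # 2. then apply the write in memory
        if len(self.memtable) >= self.memtable_limit:
            self._switch_memtable()

    def _switch_memtable(self):
        self.immutable = self.memtable  # freeze the full memtable
        self.memtable = {}              # open a new memtable for new writes
        self.flush()                    # done inline here; a real system does this in the background

    def flush(self):
        # Write the immutable memtable out as a new sorted L0 file.
        # It is appended as-is and not merged with existing L0 files.
        if self.immutable is not None:
            self.l0_files.append(sorted(self.immutable.items()))
            self.immutable = None


db = TinyLSM(memtable_limit=2)
for k, v in [("b", 1), ("a", 2), ("a", 3)]:
    db.put(k, v)
print(db.l0_files, db.memtable)
```

After the second put the memtable is full, so it is frozen and written out as one sorted L0 file; the third put lands in the fresh memtable.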

The total file size of each level is limited, and each level's limit is ten times that of the level above. Once the total size of a level exceeds its threshold, one file is selected from it and merged with files in the next level. It is like playing 2048: each merge may trigger another merge. That is the most satisfying part of 2048, but in a storage system it is troublesome, because a lot of data has to be moved. It is not entirely a bad thing, though, because it speeds up queries.
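
The tenfold size rule and the threshold check can be illustrated as below. The function names, the base limit of 10 and the size units are all hypothetical, chosen only to make the arithmetic visible; real systems use their own limits and trigger rules.

```python
def level_limits(base_limit, num_levels, ratio=10):
    """Size limit per level: each level may hold `ratio` times more than the one above."""
    return [base_limit * ratio ** i for i in range(num_levels)]


def first_level_over_limit(sizes, limits):
    """Return the index of the first level whose total size exceeds its limit, or None."""
    for i, (size, limit) in enumerate(zip(sizes, limits)):
        if size > limit:
            return i
    return None


limits = level_limits(base_limit=10, num_levels=4)
print(limits)
print(first_level_over_limit([5, 120, 300, 0], limits))
```

In this example the second level holds 120 units against a limit of 100, so it is the one that would be picked for compaction into the level below.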

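Note that every file in the next level that is affected takes part in the compaction. After the merge, the data in each of the levels L1 to L6 is guaranteed to be globally sorted by key. Files in L0, however, may overlap.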
[Figure: example of a compaction between L0 and L1]
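The figure above is an example. After an immutable memtable is flushed to L0, a merge of L0 and L1 is triggered. Suppose the yellow files are the ones involved in this merge. After the merge, the L0 file is deleted and the L1 files are updated. L1 is still globally sorted: the data across its three files is in the order abcdef.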

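Although the files in L0 sit in the same level, they still have an order among themselves, and later data for a key overrides earlier data for the same key. How are they told apart? By attaching a version number to every key-value pair. During compaction, then, only the newest version should be kept.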
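
Keeping only the newest version can be sketched as follows. The (key, version, value) tuple layout and the name `compact_versions` are assumptions made for this illustration; a larger version number is taken to mean a later write.

```python
def compact_versions(entries):
    """Keep only the highest-versioned entry for each key; return the result sorted by key."""
    latest = {}
    for key, version, value in entries:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return [(key, version, value) for key, (version, value) in sorted(latest.items())]


# Entries gathered from several overlapping L0 files, in no particular order.
l0_entries = [("a", 1, "old"), ("b", 2, "x"), ("a", 3, "new")]
result = compact_versions(l0_entries)
print(result)
```

The version number, not the position of the entry in the input, decides which value survives, so the files can be read in any order.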

Read path: check the memtable first, then the immutable memtable, then all the files in L0, and finally go down level by level.
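
The lookup order can be sketched like this. Everything here is a simplified model rather than LevelDB's code: `leveldb_style_get` and `search_sorted_file` are hypothetical names, files are lists of (key, value) pairs sorted by key, `l0_files` is ordered oldest first, and stored values are assumed never to be None.

```python
import bisect


def search_sorted_file(file, key):
    """Binary-search one key-sorted file of (key, value) pairs."""
    keys = [k for k, _ in file]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return file[i][1]
    return None


def leveldb_style_get(key, memtable, immutable, l0_files, lower_levels):
    # 1 and 2: the two in-memory tables, newest first.
    for table in (memtable, immutable):
        if table is not None and key in table:
            return table[key]
    # 3: L0 files may overlap, so every file is checked, newest first.
    for file in reversed(l0_files):
        value = search_sorted_file(file, key)
        if value is not None:
            return value
    # 4: in L1 and below, files within a level do not overlap,
    #    so at most one file per level can contain the key.
    for level in lower_levels:
        for file in level:
            if file and file[0][0] <= key <= file[-1][0]:
                value = search_sorted_file(file, key)
                if value is not None:
                    return value
                break
    return None


memtable = {"a": 3}
immutable = None
l0_files = [[("a", 1), ("b", 1)], [("b", 2)]]        # oldest file first
lower_levels = [[[("c", 7), ("d", 8)], [("e", 9)]]]  # a single lower level with two files
print(leveldb_style_get("b", memtable, immutable, l0_files, lower_levels))
```

The difference between step 3 and step 4 is the point of the sketch: overlapping L0 files all have to be searched, while a globally sorted level needs at most one file per lookup.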

LSM-tree read and write amplification

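Read and write amplification is the main problem of the LSM-tree. It is defined as: read/write amplification = amount of data actually read from or written to disk / amount of data the user needs. Note that only data exchanged with the disk counts; how many times the data is processed in memory does not matter. For example, if the user wants to write 1KB of data and you compute in memory for an hour and finally write 10KB to disk, the write amplification is 10. Reads work the same way.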

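Write amplification: take RocksDB's Level Style Compaction as an example. Each time, this compaction scheme merges all the files of the upper level with the next level, and the next level is r times the size of the upper level. The write amplification of a single merge is therefore r. Whether it is r or r+1 depends on the specific implementation. Let us take an example.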

Suppose there are three levels with file sizes 9, 90 and 900, and r = 10. Now we write 1, which triggers a chain of merges: 1 + 9 = 10, 10 + 90 = 100, 100 + 900 = 1000. In total, 10 + 100 + 1000 is written. Logically, the write amplification should be 1110/1, but that is not how the papers put it. The papers take the right side of each equals sign over the term to the left of the plus sign, which gives 10/1 + 100/10 + 1000/100 = 30 = r * level (r times the number of levels). My personal feeling is that write amplification is a process, and measuring it with a single number is not very accurate; moreover, this is only the worst case.
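
The two ways of counting in this example can be checked with a few lines of Python. The function name is hypothetical, and the sketch assumes a full cascade in which every merge output is carried into the next level, as in the example.

```python
def cascade_write_amplification(incoming, level_sizes):
    """Return (total amount written, sum of per-merge amplification) for a full merge cascade."""
    total_written = 0
    per_merge_sum = 0.0
    carried = incoming
    for size in level_sizes:
        merged = carried + size            # output of merging into this level
        total_written += merged
        per_merge_sum += merged / carried  # right side of "=" over the left term of "+"
        carried = merged                   # the merged run cascades into the next level
    return total_written, per_merge_sum


total, amp = cascade_write_amplification(1, [9, 90, 900])
print(total, amp)
```

With the sizes from the text, the total written is 1110 while the per-merge sum is 30, matching r times the number of levels for r = 10 and three levels.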

Read amplification: suppose we query 1KB of data. In the worst case we need to read 8 files in L0 and then one file in each of L1 to L6, 14 files in total. Inside each file we need to read a 16KB index, a 4KB Bloom filter and a 4KB data block (it does not matter if the details are unclear; it is enough to know that looking up a key in one SSTable requires reading this much). The total is 24 * 14 / 1 = 336 times. The smaller the key-value pair, the larger the read amplification.
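
The worst-case figure can be reproduced directly from the numbers in the text. The function name and parameter names are hypothetical; the sizes (16KB index, 4KB Bloom filter, 4KB data block, 8 L0 files, 6 lower levels) are the ones quoted above.

```python
def worst_case_read_amplification(l0_files, lower_levels, per_file_kb, value_kb):
    """Data read from disk divided by data requested, when every candidate file is read."""
    files_read = l0_files + lower_levels  # every L0 file plus one file per lower level
    return files_read * per_file_kb / value_kb


per_file_kb = 16 + 4 + 4  # index + Bloom filter + data block
amp_1kb = worst_case_read_amplification(8, 6, per_file_kb, value_kb=1)
amp_half_kb = worst_case_read_amplification(8, 6, per_file_kb, value_kb=0.5)
print(amp_1kb, amp_half_kb)
```

Halving the requested value size doubles the ratio, which is the sense in which smaller key-value pairs suffer more read amplification.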

Summary

That concludes the introduction to the LSM-tree and the design ideas of LevelDB. There are three main parts: the write-ahead log (WAL), the memtable and the SSTable. Data is merged level by level and looked up level by level. The main drawback of the LSM-tree is read and write amplification, which can be reduced by various other strategies.

Reprinted from:
https://cloud.tencent.com/developer/news/340271
