LSM tree (Log-Structured Merge Tree) storage engine

Representative databases: nessDB, LevelDB, HBase, etc.

The core idea is to give up part of the read performance in exchange for maximized write throughput; that is what the name Log-Structured Merge Tree refers to. The idea itself is quite simple: assume memory is large enough, so there is no need to write data to disk on every update. Instead, the latest data stays resident in memory, and once enough of it has accumulated, the in-memory data is merged and appended to the end of the on-disk data with a merge sort (since all of the runs being merged are already sorted, a merge sort can combine them quickly).
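As a minimal sketch of this buffering-and-merging idea (the class name TinyLSM and its fields are illustrative, not any real engine's API):

```python
class TinyLSM:
    """Toy store: buffer the latest writes in memory, then flush
    them as one sorted run appended to 'disk'. Illustrative only."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # latest writes, key -> value
        self.memtable_limit = memtable_limit
        self.disk_runs = []      # append-only list of sorted runs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort the buffered writes and append them as one sequential
        # run; nothing already on disk is rewritten in place.
        self.disk_runs.append(sorted(self.memtable.items()))
        self.memtable = {}
```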

The log-structured merge tree (LSM-tree) is a disk-based data structure. Compared with a B-tree, it can significantly reduce disk-arm overhead and sustain high-speed insertions (and deletions) over a long period. However, the LSM-tree performs poorly in some cases, especially when queries require fast responses; it is usually suited to applications where index insertion is more frequent than retrieval. When Bigtable provides Tablet service, it uses GFS to store logs and SSTables, and GFS was designed to modify files by appending new data rather than rewriting old data. The LSM-tree, correspondingly, delays and batches index updates through rolling merges and multi-page blocks, makes full use of memory to hold recent or frequently used data to reduce lookup cost, and uses disk to hold infrequently used data to reduce storage cost.

Disk characteristics: the usage pattern that best exploits a disk is to read or write a fixed-size block of data in a single operation, while reducing the number of random seek operations as much as possible.


The difference between the LSM-tree and the B-tree comes down to weighing read performance against write performance: whatever one side sacrifices, other techniques are sought to compensate for it.

1. LSM writes are batched and deferred. When the ratio of writes to reads is high (more writes than reads), an LSM tree performs better than a B-tree, because as insertions accumulate the B-tree must split nodes to maintain its structure, the probability of random disk reads and writes rises, and performance gradually degrades. In an LSM tree, many single-page random writes become a single multi-page write, which reuses one disk seek across the whole batch and greatly improves efficiency.

2. The B-tree write path is an in-place update and consists of two main steps: first find the block that the record belongs to, then find the physical disk location of that block and write the data there. Of course, when memory is plentiful, part of the B-tree can be cached in memory, so there is some probability that the block lookup completes in memory; but to keep the analysis clear, assume memory is only large enough to hold about one B-tree block of data. Under that model, completing one write requires two random seeks (one for the search, one for the in-place write), which is still very expensive (see the cost estimate after this list).

3. The LSM Tree gives up some disk read performance in exchange for sequential writes. Reads might seem like the capability most systems should guarantee above all, so trading reads for writes does not look like a good deal. But hold that thought and consider the following analysis.

a. Memory is far faster than disk, by a factor of more than 1000. Read performance therefore depends mainly on the memory hit rate rather than on the number of disk reads (see the estimate after this list).

b. Writes that do not occupy disk IO leave the disk available to reads for more of the time, which also improves read efficiency. For example, although LevelDB's SSTables reduce raw read performance, as long as the cache hit rate for the data is maintained, reads get more disk IO opportunities, so read performance is essentially not reduced and may even improve, while write performance improves dramatically, roughly 5 to 10 times.
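To make the cost in point 2 concrete, here is a back-of-the-envelope estimate; the 10 ms seek time, 0.1 ms page transfer time, and batch size of 100 are illustrative assumptions, not measurements:

```python
# Illustrative cost model: random seeks dominate disk access time.
SEEK_MS = 10.0          # assumed average seek + rotational latency
PAGE_TRANSFER_MS = 0.1  # assumed sequential transfer time per page

# B-tree in-place write: one seek to find the block, one to write it.
btree_write_ms = 2 * SEEK_MS

# LSM-tree: buffer 100 updates in memory, then flush them as one
# sequential multi-page write (one seek amortized over the batch).
BATCH = 100
lsm_write_ms = (SEEK_MS + BATCH * PAGE_TRANSFER_MS) / BATCH

print(f"B-tree: {btree_write_ms:.2f} ms per write")  # 20.00 ms
print(f"LSM:    {lsm_write_ms:.2f} ms per write")    # 0.20 ms
```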
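Point a can be quantified the same way: with cache hit rate h, the expected read cost is roughly h * t_mem + (1 - h) * t_disk, so the hit rate, not the raw number of disk reads, dominates (the latencies below are assumed figures):

```python
T_MEM_MS = 0.01   # assumed memory access cost
T_DISK_MS = 10.0  # assumed disk access cost (~1000x slower)

def expected_read_ms(hit_rate):
    # Hit in memory with probability hit_rate; otherwise pay disk cost.
    return hit_rate * T_MEM_MS + (1 - hit_rate) * T_DISK_MS

for h in (0.90, 0.99, 0.999):
    print(f"hit rate {h}: {expected_read_ms(h):.3f} ms")
```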

Here's a detailed example:

The LSM Tree maintains many small sorted structures: for example, every m records (say 100) are sorted once in memory, then the next batch is sorted, and so on. Repeating this yields N/m small sorted structures.

At query time, since there is no telling which structure holds the data, a binary search is run on the newest small sorted structure first; if the key is found it is returned, otherwise the search moves on to the next small sorted structure, and so on until the key is found.


It is easy to see that under this scheme the read time complexity is (N/m) * log2(m), so read efficiency degrades.
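A sketch of that lookup path over the small sorted runs (reusing the illustrative TinyLSM structure from earlier):

```python
import bisect

def lookup(disk_runs, key):
    """Binary-search each small sorted run, newest first.
    Each run is a sorted list of (key, value) pairs."""
    for run in reversed(disk_runs):          # newest structure first
        i = bisect.bisect_left(run, (key,))  # O(log2 m) per run
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None  # checked all N/m runs: O((N/m) * log2 m) in total
```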

This is the LSM tree in its most original form. Performance in that form is still rather slow, so something more must be done to speed it up. What works?

LSM Tree optimizations:

a. Bloom filter: a probabilistic bitmap that can quickly tell you whether a given small sorted structure might contain a specified piece of data. Instead of running a binary search, a few simple hash computations reveal whether the data can be in a given small set. Efficiency improves, at the cost of extra space (see the sketch after this list).

b. Compaction, merging small trees into a big tree: since the small trees hurt read performance, a background process keeps merging the small trees into the big tree. Most queries for old data can then be answered directly against the big tree in log2(N) time, without paying the (N/m) * log2(m) multi-structure search.
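A minimal sketch of optimization (a); the bitmap size and the salted-hash scheme here are illustrative assumptions:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe
    present' (false positives possible, false negatives not)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as the bitmap

    def _positions(self, key):
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Any unset bit proves the key was never added, so the
        # corresponding sorted run can be skipped without a search.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```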

  • Hash storage engine: the persistent implementation of a hash table. It supports insert, delete, update, and random reads, but not sequential scans; the corresponding storage system is a key-value store. For key-value insertion and lookup, the hash table's complexity is O(1), clearly faster than a tree's O(log n). If you do not need to traverse the data in order, the hash table is your Mr. Right.
  • B-tree storage engine: the persistent implementation of a B-tree (an earlier post covers the B-tree's origin, data structure, and use cases). It supports not only insert, delete, read, and update of single records, but also sequential scans (via the pointers between B+ tree leaf nodes); the corresponding storage systems are relational databases (MySQL and the like).
  • LSM tree (Log-Structured Merge Tree) storage engine: like the B-tree engine, it supports insert, delete, read, update, and sequential scan, and it sidesteps the random disk write problem through batched writes. Everything has its pros and cons, of course: compared with the B+ tree, the LSM tree sacrifices part of its read performance in order to greatly improve write performance.

From the analysis above, the origin of the LSM tree should now be clear. Its design idea is very plain: keep incremental modifications to the data in memory, and once they reach a specified size limit, write the batch of modifications to disk in one pass. Reads become slightly more troublesome, because they must merge the historical data on disk with the recent modifications in memory: a read first checks for a hit in memory, and otherwise may need to visit a fair number of disk files. The net effect is that write performance improves enormously; to put it in extreme terms, HBase, which is built on an LSM tree, has write performance an order of magnitude higher than MySQL and read performance an order of magnitude lower.
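In code, that read path might look like the following, reusing the illustrative TinyLSM and lookup sketches above:

```python
def get(store, key):
    # 1. Recent modifications live in memory; check the memtable first.
    if key in store.memtable:
        return store.memtable[key]
    # 2. On a miss, fall back to the on-disk sorted runs, newest
    #    first, which may touch several disk files.
    return lookup(store.disk_runs, key)
```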

The LSM tree principle is to split one big tree into N small trees. Data is first written into memory, and as the in-memory small trees grow they are flushed to disk; the small trees on disk are periodically merged into one big tree to optimize read performance.
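A sketch of that periodic merge, assuming sorted runs of (key, value) pairs ordered oldest to newest (illustrative, not any real engine's compaction code):

```python
import heapq

def compact(runs):
    """Merge sorted runs into one big sorted run; for duplicate
    keys the value from the newest (last) run wins."""
    # Tag each pair with its run's age, then k-way merge by key.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    merged = {}
    for key, age, value in heapq.merge(*tagged):
        # Equal keys arrive oldest first, so newer values overwrite.
        merged[key] = value
    return sorted(merged.items())

# Example: the newest run's value for key "a" survives the merge.
old = [("a", 1), ("b", 2)]
new = [("a", 9), ("c", 3)]
print(compact([old, new]))  # [('a', 9), ('b', 2), ('c', 3)]
```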


The points above are roughly the main design ideas behind HBase's storage. Here is how each maps onto HBase:

  • Because the small trees are written to memory first, the write must also be persisted to disk at the same time to guard against loss of the in-memory data; this corresponds to HBase's MemStore and HLog (see the sketch after this list).
  • After the tree in the MemStore reaches a certain size, it is flushed to the HRegion's disk (usually a Hadoop DataNode), so the MemStore becomes an on-disk StoreFile on the DataNode, and the HRegionServer periodically merges the DataNode's data to completely reclaim invalid space. Each merge is the moment when many small trees are combined into a big tree to improve read performance.
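A sketch of that write path under the stated assumptions; the file names and the WriteAheadLogStore class are illustrative, not HBase's actual interfaces:

```python
class WriteAheadLogStore:
    """Toy write path: append each write to a log for durability,
    then apply it to the in-memory table."""

    def __init__(self, log_path="wal.log", memtable_limit=1000):
        self.log = open(log_path, "a")   # stands in for the HLog
        self.memtable = {}               # stands in for the MemStore
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # 1. Durability: sequential append that survives a crash.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        # 2. Visibility: update the in-memory small tree.
        self.memtable[key] = value
        # 3. When full, flush to a StoreFile-like sorted run.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        with open("storefile.txt", "w") as f:  # illustrative name
            for k, v in sorted(self.memtable.items()):
                f.write(f"{k}\t{v}\n")
        self.memtable = {}
```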


As for the LSM Tree itself: for the simplest two-component LSM Tree, the merge operation between the in-memory data and the on-disk data works as shown below.

[Figure from the LSM-tree paper.]

In theory, the LSM tree can merge part of the in-memory tree with the first-level tree on disk. A direct update to the tree on disk may destroy the continuity of its physical blocks, but in practice an LSM tree generally has multiple levels, and when the small trees on disk are merged into a big tree, the data can be rewritten in order so that the blocks become contiguous again, optimizing read performance.

In HBase's implementation, once the whole in-memory store exceeds a certain threshold, it is flushed to disk to form a file, and that file is itself stored as a small B+ tree. Because HBase is generally deployed on HDFS, and HDFS does not support updating a file in place, HBase flushes memory as a whole rather than merge-updating it into the small trees already on disk; that design choice also makes sense. The small trees flushed to disk are periodically merged into a big tree. On the whole, HBase follows the idea of the LSM tree.

