Leveldb study notes: exploration of the use and principle of leveldb

foreword

Autumn recruitment is almost over, and the company I will join is basically settled. With no more written tests, interviews, or internship duties, I have had some free time recently.

I plan to work on some open source projects with friends, and the tentative direction is data storage. So for the past two weeks I have been studying data-storage-related topics: SSDs, underlying file systems, the lsm-tree, and so on.

I have always had the habit of recording what I learn, though recently it has gone into WPS as knowledge mind maps. Since I have not published a technical post in a long time, I decided to organize what I have learned recently into these notes, both for my own future review and for sharing.

1. What is leveldb?

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

Leveldb is a persistent KV storage system — what we usually call an underlying storage engine, or a database core. For example, RocksDB, a storage engine widely used by many databases, evolved from leveldb.

2. The use of leveldb

A tip for reading source code: when you pick up an unfamiliar project, first look at how it is used. Starting from usage as the entry point makes reading the key code much easier.

PS: the doc directory of the leveldb project contains a document, index.md, that introduces its usage. Most of what follows is translated from it.

1. Open a database

The leveldb database has a name, which corresponds to a directory on the file system, and all the contents of the database are stored in this directory. The following example shows how to open a database and create it if necessary:

#include <cassert>
#include "leveldb/db.h"

leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());
...

If you want the open to fail with an error when the database already exists, add the following line before the call to leveldb::DB::Open:

options.error_if_exists = true;

2. Status

You may have noticed the leveldb::Status return type above. Most methods in leveldb return a value of this type when an error may occur. You can check whether it is ok, and print the associated error message otherwise:

leveldb::Status s = ...;
if (!s.ok()) std::cerr << s.ToString() << std::endl;

3. Close a database

When the database is no longer in use, just delete the database object directly as follows:

... open the db as described above ...
... do something with db ...
delete db;

4. Database reading and writing

leveldb provides the Put, Delete, and Get methods to modify/query the database. The following code moves the value stored under key1 to key2.

std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);

5. Atomic updates

Note that, in the previous section, if the process crashes after the Put of key2 but before the Delete of key1, the same value ends up stored under multiple keys. Such problems can be avoided by using WriteBatch to apply a set of operations atomically.

#include "leveldb/write_batch.h"
...
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) {
  leveldb::WriteBatch batch;
  batch.Delete(key1);
  batch.Put(key2, value);
  s = db->Write(leveldb::WriteOptions(), &batch);
}

WriteBatch holds a series of operations that will be applied to the database, and these operations will be executed in the order they were added. Note that we execute Delete first and then Put, so that we will not lose data by mistake if key1 and key2 are the same.

In addition to atomicity, WriteBatch can also speed up the update process, because a large number of independent operations can be added to the same batch and executed at once.

6. Synchronous write

By default, each leveldb write is asynchronous: the call returns as soon as the write has been pushed from the process to the operatingating system; the transfer from OS memory to the underlying persistent storage happens asynchronously.

You can turn on the sync flag for a particular write operation (write_options.sync = true) to make it wait until the data is actually recorded to persistent storage before returning. (On POSIX systems this is implemented by calling fsync(...), fdatasync(...), or msync(..., MS_SYNC) before the write returns.)

leveldb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);

Asynchronous writes are typically 1000 times faster than synchronous writes. The disadvantage of asynchronous writes is that the last few update operations may be lost if the machine crashes. Note that just the crash of the writing process (rather than restarting the machine) will not cause any loss, because even if the sync flag is false, the write operation has been pushed from the process memory to the operating system before the process exits.

Asynchronous writes are usually safe to use. For example, when loading a large amount of data into the database, you can simply redo the whole load if the last few updates are lost in a crash. If the amount of data is very large, an optimization is a hybrid scheme: perform one synchronous write for every N asynchronous writes. If a crash occurs, restart from the last successful synchronous write. (The synchronous write can also update a marker recording where to restart after a crash.)
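The "sync every N-th write" hybrid above can be sketched as a small policy object. This is an assumption of these notes, not a leveldb API — the caller simply asks it before each write whether that write should be synchronous.

```cpp
#include <cassert>

// Hypothetical helper: mark every n-th write as synchronous.
class PeriodicSyncPolicy {
 public:
  explicit PeriodicSyncPolicy(int n) : n_(n) {}
  // Returns true when the current write should use write_options.sync = true.
  bool ShouldSync() { return ++count_ % n_ == 0; }

 private:
  int n_;
  int count_ = 0;
};
```

Before each db->Put(...), one would then set write_options.sync = policy.ShouldSync();.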

WriteBatch also offers an alternative to asynchronous writes: multiple updates can be placed in one WriteBatch and applied together with a single synchronous write (i.e. with write_options.sync = true).

7. Concurrency

A database can only be opened by one process at a time. leveldb acquires a lock from the operating system to prevent multiple processes from opening the same database concurrently. Within a single process, the same leveldb::DB object can be safely used by multiple concurrent threads; different threads can write, obtain iterators, or call Get without external locking (the leveldb implementation provides the required synchronization). Other objects, however, such as Iterator or WriteBatch, need external synchronization: if two threads share such an object, they must use their own lock for mutually exclusive access. See the corresponding header files for details.

8. Iterators

The following use case shows how to print all (key, value) pairs in the database.

leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
}
assert(it->status().ok());  // Check for any errors found during the scan
delete it;

The following use case shows how to print data in the range [start, limit):

for (it->Seek(start);
     it->Valid() && it->key().ToString() < limit;
     it->Next()) {
  ...
}

Of course, you can also traverse in reverse (note that reverse traversal may be slower than forward traversal, see the previous read performance benchmark for details):

for (it->SeekToLast(); it->Valid(); it->Prev()) {
  ...
}

9. Snapshots

Snapshots provide consistent read-only views of the entire KV store. If ReadOptions::snapshot is not null, it means that the read operation should act on a specific version of DB; if it is null, the read operation will act on an implicit snapshot of the current version.

Snapshots are created by calling the DB::GetSnapshot() method:

leveldb::ReadOptions options;
options.snapshot = db->GetSnapshot();
... apply some updates to db ...
leveldb::Iterator* iter = db->NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db->ReleaseSnapshot(options.snapshot);

Note that when a snapshot is no longer used, it should be released through the DB::ReleaseSnapshot interface.

10.Slice

The values returned by the it->key() and it->value() calls are of type leveldb::Slice (similar to a slice in the Go language). A Slice is a simple structure containing a length and a pointer to an external byte array. Returning a Slice is more efficient than returning a std::string because large keys and values do not need to be copied implicitly. Also, leveldb methods do not return \0-terminated C-style strings, since leveldb keys and values are allowed to contain \0 bytes.

C++-style strings and C-style null-terminated strings are easily converted to a Slice:

leveldb::Slice s1 = "hello";

std::string str("world");
leveldb::Slice s2 = str;

A Slice is also easily converted back to a C++-style string:

std::string str = s1.ToString();
assert(str == std::string("hello"));

Be careful when using Slice, it is up to the caller to ensure that the external byte array pointed to by Slice is valid. For example, the following code has a bug:

leveldb::Slice slice;
if (...) {
    
    
  std::string str = ...;
  slice = str;
}
Use(slice);

When the if block ends, str is destroyed and the storage the Slice points to disappears with it, so using slice afterwards causes problems.

11. Comparator

The previous examples used the default comparison function, i.e. byte-wise lexicographic comparison. You can supply a custom comparison function when opening the database: simply subclass leveldb::Comparator and define the related logic. For example:

class TwoPartComparator : public leveldb::Comparator {
 public:
  // Three-way comparison function:
  //   if a < b: negative result
  //   if a > b: positive result
  //   else: zero result
  int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const {
    int a1, a2, b1, b2;
    ParseKey(a, &a1, &a2);
    ParseKey(b, &b1, &b2);
    if (a1 < b1) return -1;
    if (a1 > b1) return +1;
    if (a2 < b2) return -1;
    if (a2 > b2) return +1;
    return 0;
  }

  // Ignore the following methods for now:
  const char* Name() const { return "TwoPartComparator"; }
  void FindShortestSeparator(std::string*, const leveldb::Slice&) const {}
  void FindShortSuccessor(std::string*) const {}
};

Now create a database with this custom comparator:

TwoPartComparator cmp;
leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
options.comparator = &cmp;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
...

12. Backward Compatibility

The result of the comparator's Name() method is bound to the database when it is created and checked on every subsequent open. If the name changes, the call to leveldb::DB::Open will fail. Therefore, change the comparator name if and only if the new key format and comparison function are incompatible with existing databases and the existing data is no longer needed. In short, a database corresponds to exactly one comparator, and the comparator is identified by its name; if the name or the comparison logic is modified, the database's ordering logic breaks — leveldb is, after all, an ordered KV store.

What if you do have to change the comparison logic? You can evolve your key format gradually according to a plan made in advance (such pre-planning matters). For example, you can store a version number at the end of each key (one byte is enough in most scenarios). When you want to switch to a new key format (such as the two-part keys handled by TwoPartComparator above), you need to:

  • Keep the comparator name the same
  • Increment the version number for new keys
  • Modify the comparator so that it uses the version number to decide how to sort
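A minimal sketch of that idea (the key format and the "new" ordering are hypothetical, not leveldb code): the last byte of each key is a version number, and the comparison dispatches on the newest version seen.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical versioned keys: body bytes followed by one version byte.
// Version 1 keys keep the old byte-wise order; version 2 keys (the
// imagined new format) sort shorter bodies first.
int CompareVersioned(const std::string& a, const std::string& b) {
  unsigned char v = std::max(static_cast<unsigned char>(a.back()),
                             static_cast<unsigned char>(b.back()));
  std::string pa = a.substr(0, a.size() - 1);
  std::string pb = b.substr(0, b.size() - 1);
  if (v >= 2 && pa.size() != pb.size())
    return pa.size() < pb.size() ? -1 : 1;  // new rule, only for v2 keys
  return pa.compare(pb);                    // old byte-wise order
}
```

The comparator name stays the same; only the parsing logic grows a new branch.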

13. Performance tuning

The performance of leveldb can be tuned by changing the default values of the types defined in include/leveldb/options.h.

Block size

leveldb groups adjacent keys into the same block, and a block is the basic unit of data transfer between memory and persistent storage. The default uncompressed block size is about 4KB. Applications that frequently scan large amounts of data may want to increase this value, while applications that mostly do point reads may want a smaller one. However, there is no evidence that values smaller than 1KB or larger than a few MB help performance. Note also that compression works more effectively with larger block sizes.
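For instance, a scan-heavy application might raise the block size when opening the database; 64KB here is only an illustrative value, not a recommendation from the leveldb documentation:

```cpp
#include "leveldb/db.h"

leveldb::Options options;
options.block_size = 64 * 1024;  // larger blocks for bulk scans (example value)
// ... pass options to leveldb::DB::Open(...) as usual ...
```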

Compression

Each block is compressed individually before being written to persistent storage. Compression is on by default because the default compression algorithm is very fast and automatically disables itself for incompressible data. There are very few scenarios in which users would want to disable compression entirely, unless benchmarks show that doing so significantly improves performance. Compression can be turned off as follows:

leveldb::Options options;
options.compression = leveldb::kNoCompression;
... leveldb::DB::Open(options, name, ...) ....

Cache

The content of the database is stored in a set of files in the file system. Each file stores a series of compressed blocks. If options.block_cache is non-NULL, it is used to cache frequently used decompressed block content.

#include "leveldb/cache.h"

leveldb::Options options;
options.block_cache = leveldb::NewLRUCache(100 * 1048576);  // 100MB cache
leveldb::DB* db;
leveldb::DB::Open(options, name, &db);
... use the db ...
delete db;
delete options.block_cache;

Note that the cache stores uncompressed data, so its size should be set according to the size of the data required by the application. (The buffer cache of the compressed data is left to the buffer cache of the operating system or the user-defined Env implementation.)

When performing a bulk read, the application may want to disable caching so that the data scanned does not evict most of the current cache contents. A per-iterator option achieves this:

leveldb::ReadOptions options;
options.fill_cache = false;
leveldb::Iterator* it = db->NewIterator(options);
for (it->SeekToFirst(); it->Valid(); it->Next()) {
    
    
  ...
}

Key layout

Note that the unit of disk transfer and caching is a block, and adjacent (sorted) keys always land in the same block. The application can therefore place keys that are accessed together near each other, and move rarely used keys into a separate region of the key space, to improve performance.

As an example, suppose we are implementing a simple file system based on leveldb. The types of data we intend to store to this file system are as follows:

filename -> permission-bits, length, list of file_block_ids
file_block_id -> data

We can prefix the filename keys with one character, say '/', and the file_block_id keys with a different one, say '0'. Keys with different purposes then live in their own regions of the key space, and a scan over the metadata does not have to read and cache large chunks of file content.
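This layout can be sketched with two small key-builder helpers (the '/' and '0' prefixes are the ones assumed above). Because '/' (0x2F) sorts before '0' (0x30) byte-wise, every metadata key groups together ahead of every block key:

```cpp
#include <cassert>
#include <string>

// Hypothetical key builders for the toy file system described above.
std::string MetadataKey(const std::string& filename) {
  return "/" + filename;  // '/' prefix: file metadata namespace
}
std::string BlockKey(const std::string& file_block_id) {
  return "0" + file_block_id;  // '0' prefix: file content namespace
}
```

A metadata scan can then stop as soon as the iterator's key no longer starts with '/'.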

Filters

Because of the way leveldb organizes its data on disk, a single Get() call may involve multiple disk reads. The configurable FilterPolicy mechanism can greatly reduce their number.

leveldb::Options options;
// Enable a Bloom-filter-based filter policy
options.filter_policy = NewBloomFilterPolicy(10);
leveldb::DB* db;
// Open the database with this setting
leveldb::DB::Open(options, "/tmp/testdb", &db);
... use the database ...
delete db;
delete options.filter_policy;

The code above associates a Bloom-filter-based filter policy with the database. This kind of filtering relies on keeping some bits of each key in memory (10 bits per key in the example, since that is the argument we passed to NewBloomFilterPolicy). The filter reduces the number of unnecessary disk reads in Get() calls by a factor of roughly 100. Increasing the bits per key reduces disk reads further, at the cost of more memory. We recommend setting a filter policy for applications whose data set does not fit in memory and which perform many random reads.

If you are using a custom comparator, make sure the filter policy you use is compatible with it. For example, if a comparator ignores trailing spaces when comparing keys, NewBloomFilterPolicy must not be used with that comparator. Instead, the application should provide a custom filter policy that also ignores trailing spaces, as in the following example:

class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use the builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(trimmed.data(), n, dst);
  }
};

Of course, you can also provide your own filter policy that is not based on Bloom filters; see leveldb/filter_policy.h for details.

14. Checksums

leveldb associates checksums with all data it stores in the filesystem, and there are two separate controls for these checksums:

ReadOptions::verify_checksums can be set to true to force verification of all data read from the filesystem. The default is false, i.e. no such validation will be performed.

If Options::paranoid_checks is set to true before the database is opened, the database reports an error as soon as it detects internal corruption. Depending on which part of the database is corrupted, the error may be reported at open time or later, during some operation. This option is off by default, so a database whose persistent storage is partially corrupted can still be used.

If the database is corrupted (perhaps it cannot even be opened when Options::paranoid_checks is on), the leveldb::RepairDB() function can be used to recover as much data as possible.

15. Approximate space size

The GetApproximateSizes method returns the approximate size, in bytes, of the file-system space used by one or more key ranges:

leveldb::Range ranges[2];
ranges[0] = leveldb::Range("a", "c");
ranges[1] = leveldb::Range("x", "z");
uint64_t sizes[2];
db->GetApproximateSizes(ranges, 2, sizes);

After the call above, sizes[0] holds the approximate number of bytes of file-system space used by the key range [a..c), and sizes[1] the approximate number used by [x..z).

16. Environment

All file operations and other operating system calls initiated by leveldb are ultimately routed to a leveldb::Env object. Users can also provide their own Env implementations for finer control. For example, if an application wants to introduce an artificial delay for leveldb's file IO to limit the impact of leveldb on other applications on the same system:

// Customized Env
class SlowEnv : public leveldb::Env {
  ... implementation of the Env interface ...
};

SlowEnv env;
leveldb::Options options;
// Open the database with the custom Env
options.env = &env;
Status s = leveldb::DB::Open(options, ...);

17. Portability

leveldb can be ported to a particular platform if it provides implementations of types/methods/functions exported by leveldb/port/port.h, see leveldb/port/port_example.h for more details.

Additionally, new platforms may also require a new default leveldb::Env implementation. For details, please refer to leveldb/util/env_posix.h for implementation.

3. Principle exploration

1. Start with lsm-tree

The design of leveldb starts from the lsm-tree storage strategy.
The lsm-tree is essentially a storage strategy — a way of organizing data that optimizes write performance.

It turns the random writes involved in the usual add/delete/update operations on KV pairs into sequential writes, thereby improving write performance.

The core idea of the lsm-tree is to maintain an ordered data structure in memory (a skip list, a sorted tree, etc.). A write operation only modifies this in-memory structure. To avoid losing data in special circumstances (such as power failure or a process crash), each modification is also converted into a log record and appended sequentially to a log file.

When the in-memory structure grows beyond a certain threshold, it is persisted to the hard disk; this write is also sequential, so good performance comes naturally.

This is not without drawbacks. If the key being queried is not in memory, it must be looked up on the hard disk, where the data is spread across several small files; in extreme cases, several reads may be needed to find the corresponding key-value pair.

2. Architecture design and core design ideas

2.1 Architecture Design

The architecture of leveldb is sketched below. (The original post includes an architecture diagram here.)
There are several roles in the above figure:

  • Log: the on-disk log file
  • memtable / immutable memtable: ordered skip lists in memory; the memtable accepts writes, while the immutable memtable is frozen, read-only, and waiting to be persisted to disk
  • SSTable files: sorted string table (SSTable) files on disk, organized into seven levels

The core idea is to use lsm-tree storage

2.2 Read and write process

Next, let's briefly explain the above architecture diagram through the operation in the process of reading and writing.

data writing process

  • Convert this insertion into a log and insert it into the log file
  • Apply the data inserted this time to the memtable in memory

Usually these two operations complete a write. Compared with traditional random writes, this approach needs only one sequential write plus one memory write, so it is naturally much faster.
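The write path above can be sketched in a few lines. This is a toy model, not leveldb's implementation: each Put appends a record to an in-memory "log" and then updates an ordered map standing in for the memtable.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy write path: sequential log append, then in-memory update.
class TinyKV {
 public:
  void Put(const std::string& k, const std::string& v) {
    log_.push_back("put " + k + " " + v);  // 1. append to the log
    memtable_[k] = v;                      // 2. apply to the memtable
  }
  bool Get(const std::string& k, std::string* v) const {
    auto it = memtable_.find(k);
    if (it == memtable_.end()) return false;
    *v = it->second;
    return true;
  }
  size_t LogRecords() const { return log_.size(); }

 private:
  std::vector<std::string> log_;                 // stands in for the WAL file
  std::map<std::string, std::string> memtable_;  // stands in for the skip list
};
```

In the real system the log lives on disk and the memtable is a skip list, but the two-step shape of a write is the same.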

Data deletion process

Data deletion does not actually erase data from the disk, but takes the form of marking.

  • When the data is still in memory and has not been persisted to disk, it is enough to remove it from the in-memory data structure
  • When the data has already been persisted to disk, a deletion marker is written for the key to indicate that it has been deleted; the actual removal waits until a later compaction
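A minimal sketch of this mark-then-compact deletion (an illustration, not leveldb's code): Delete writes a tombstone value, Get treats tombstoned keys as absent, and a later compaction physically erases them.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

const std::string kTombstone = "\x01__tombstone__";  // hypothetical marker

class MarkedStore {
 public:
  void Put(const std::string& k, const std::string& v) { data_[k] = v; }
  void Delete(const std::string& k) { data_[k] = kTombstone; }  // mark only
  bool Get(const std::string& k, std::string* v) const {
    auto it = data_.find(k);
    if (it == data_.end() || it->second == kTombstone) return false;
    *v = it->second;
    return true;
  }
  // Compaction: physically erase tombstoned entries.
  void Compact() {
    for (auto it = data_.begin(); it != data_.end();)
      it = (it->second == kTombstone) ? data_.erase(it) : std::next(it);
  }
  size_t EntryCount() const { return data_.size(); }

 private:
  std::map<std::string, std::string> data_;
};
```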

Data modification process

The modification here can be understood as deleting first and then inserting. The old value will become invalid, but it is not really deleted.

Data query process

  • Search the memtable and the immutable memtable in memory
  • If the key is not found in memory, search the disk: the levels are checked in cascade starting from L0, until the required data is found in some level

On first contact with leveldb, you may wonder what L0-L6 are about.
This is in fact leveldb's innovation and the origin of its name. Didn't we say earlier that an lsm-tree lookup may have to examine many files?

Of course, you could search the data files one by one, but remember that data access is often localized: hot data gets accessed far more often.

Accordingly, leveldb designs seven level files. Its idea is as follows:

  • leveldb's on-disk storage is divided into 7 levels (L0-L6); the higher the level, the more files it holds and the larger its capacity (levels differ by an order of magnitude)
  • Freshly persisted data lands at level L0; when a file reaches a certain size, a new file is opened for writing
  • When the number of files in a level reaches a threshold, several files in that level are compacted into files of the size specified for the next level and moved down to it

The effect is that recently modified data — usually a small fraction of the total — is reached first (because its level is low), while older data has lower access priority.
This matches the principle of temporal locality, so in theory data access becomes faster.

2.3 Compaction

As more and more data is written, the number of files in a level grows.
File descriptors are a limited resource in the file system, and too many files also reduce lookup efficiency.
Moreover, as mentioned above, a leveldb delete is not a real delete but a marker, which inevitably causes redundant storage.

So when a level reaches a threshold, compaction is triggered, and several files of that level are merged down into the next level. Compaction here means merging several files into one or more new files; it can be pictured as merging several small ordered trees into one large ordered tree, and entries marked as deleted are dropped in the process.
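The merge step can be sketched as follows (a simplification of real compaction): entries from the newer run shadow the older run, and tombstoned entries are dropped from the output.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Merge an older and a newer sorted run; newer entries win, and
// entries carrying the tombstone marker are removed. (Sketch only.)
std::map<std::string, std::string> CompactMerge(
    const std::map<std::string, std::string>& older,
    const std::map<std::string, std::string>& newer,
    const std::string& tombstone) {
  std::map<std::string, std::string> out = older;
  for (const auto& kv : newer) out[kv.first] = kv.second;  // newer shadows older
  for (auto it = out.begin(); it != out.end();)
    it = (it->second == tombstone) ? out.erase(it) : std::next(it);
  return out;
}
```

Because both inputs and the output stay sorted by key, writing the merged result back to disk remains a sequential write.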

Another very important reason: the need for compaction stems from the design requirements of the LSM approach, namely sequential disk access and keeping the data on disk sorted.

3. File composition

For the study of data storage, it is necessary to study its persistent file composition.

Log files

Every update is appended to this file. When the log reaches a threshold size (about 4MB by default), its contents are converted into a sorted table and a new log file is created for subsequent updates.

Sorted tables

  • The file that actually stores the data
  • The higher the level, the greater the quantity level and the greater the storage capacity

Manifest

Contents:

  • The file name of the sorted table for each level
  • key range
  • some other metadata

This file is created whenever the database is reopened

CURRENT

A text file containing the name of the latest MANIFEST file

Info logs

Informational messages are printed to files named LOG and LOG.old

Other

Other files used for other purposes may also exist (LOCK, *.dbtmp)

4. Data recovery

Recovery replays the previously written log file after a crash, redoing the data operations in order. The steps are as follows:

  • Read CURRENT to find the name of the most recently committed MANIFEST
  • Read the named MANIFEST file
  • Clean up old files
  • Convert log chunks to new level-0 sstables
  • Start directing new writes to a new log file with recovery sequence
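The replay part of these steps can be sketched like this (a toy record format, not leveldb's actual log encoding): each surviving log record is re-applied in order to rebuild the memtable state.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Rebuild state by re-applying "put <key> <value>" / "del <key>" records.
std::map<std::string, std::string> ReplayLog(
    const std::vector<std::string>& records) {
  std::map<std::string, std::string> state;
  for (const auto& rec : records) {
    std::istringstream in(rec);
    std::string op, key, value;
    in >> op >> key;
    if (op == "put" && (in >> value)) state[key] = value;
    else if (op == "del") state.erase(key);
  }
  return state;
}
```

Because records are applied in write order, the rebuilt map ends in the same state the memtable had before the crash.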

Summary

leveldb is a storage engine designed around the lsm-tree. It greatly improves the efficiency of write operations and suits scenarios with more writes than reads.


Origin blog.csdn.net/qq_46101869/article/details/127805978