Implement a kv storage engine from scratch


The purpose of this article is to help more people understand rosedb. In it, I will implement a simple kv storage engine from scratch, supporting PUT, GET, and DELETE operations.


You can think of it as a simplified version of rosedb; let's call it minidb (a mini rosedb).


Whether you are a Go beginner, want to level up your Go skills, or are simply interested in kv storage, you can try implementing it yourself; I believe it will be of great help to you.


At its heart, storage has one core problem to solve: how to store and retrieve data. In the computer world, this problem takes many forms.


A computer has both memory and disk. Memory is volatile: everything stored in it is lost after a power failure. So if the system is to keep working after a crash and restart, data must be stored on non-volatile media, the most common of which is the disk.


Therefore, for a standalone kv store, we need to design how data is organized in memory and how it is stored on disk.


Of course, many excellent predecessors have explored this problem, and the classic summary divides data storage models into two main categories: the B+ tree and the LSM tree.


These two models are not the focus of this article, so I will only introduce them briefly.


B+ tree



The B+ tree evolved from the binary search tree. By increasing the number of children per node, it reduces the height of the tree, aligns nodes with disk pages, and minimizes disk IO operations.


B+ tree query performance is relatively stable. A write or update first locates the target position on disk and then modifies the data in place. Note that this is random IO, and heavy insertion or deletion may trigger page splits and merges, so write performance is mediocre. The B+ tree therefore suits read-heavy, write-light workloads.


LSM tree



The LSM tree (Log-Structured Merge Tree) is not actually a concrete tree data structure but a data storage model. Its core idea builds on the fact that sequential IO is much faster than random IO.


Unlike the B+ tree, in an LSM tree every insertion, update, and deletion is recorded as a log entry and appended to the disk file, so all operations are sequential IO. The LSM tree is therefore better suited to write-heavy, read-light workloads.


Having seen these two basic storage models, you should now have a rough sense of how data is accessed. minidb is based on an even simpler storage structure that broadly resembles the LSM approach.


Rather than explaining the model abstractly, I will walk through the PUT, GET, and DELETE flows in minidb with a simple example, so you can understand this storage model concretely.


PUT


Suppose we need to store a piece of data: a key and a value. First, to prevent data loss, we encapsulate the key and value into a record (called an Entry here) and append it to the disk file. An Entry roughly contains the key, the value, the key size, the value size, and the write timestamp.



The structure of the disk file is therefore very simple: it is just a sequence of Entry records.
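To make the Entry layout concrete, here is a minimal sketch of how such a record might be serialized and parsed. The 18-byte header layout, the field names, and the PUT/DEL constants are illustrative assumptions for this article, not necessarily minidb's actual on-disk format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Assumed header layout: key size (4) | value size (4) | timestamp (8) | mark (2).
const entryHeaderSize = 18

const (
	PUT uint16 = iota
	DEL
)

// Entry is one record appended to the data file.
type Entry struct {
	Key       []byte
	Value     []byte
	Timestamp uint64 // write time
	Mark      uint16 // operation type: PUT or DEL
}

// Encode serializes the Entry as: header | key | value.
func (e *Entry) Encode() []byte {
	buf := make([]byte, entryHeaderSize+len(e.Key)+len(e.Value))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(e.Key)))
	binary.BigEndian.PutUint32(buf[4:8], uint32(len(e.Value)))
	binary.BigEndian.PutUint64(buf[8:16], e.Timestamp)
	binary.BigEndian.PutUint16(buf[16:18], e.Mark)
	copy(buf[entryHeaderSize:], e.Key)
	copy(buf[entryHeaderSize+len(e.Key):], e.Value)
	return buf
}

// Decode parses one Entry back out of its encoded bytes.
func Decode(buf []byte) *Entry {
	ks := binary.BigEndian.Uint32(buf[0:4])
	vs := binary.BigEndian.Uint32(buf[4:8])
	e := &Entry{
		Timestamp: binary.BigEndian.Uint64(buf[8:16]),
		Mark:      binary.BigEndian.Uint16(buf[16:18]),
	}
	e.Key = buf[entryHeaderSize : entryHeaderSize+ks]
	e.Value = buf[entryHeaderSize+ks : entryHeaderSize+ks+vs]
	return e
}

func main() {
	e := &Entry{Key: []byte("name"), Value: []byte("minidb"), Mark: PUT}
	d := Decode(e.Encode())
	fmt.Printf("%s=%s\n", d.Key, d.Value)
}
```

Because the key size and value size are stored in the header, a reader can always tell where one Entry ends and the next begins, which is what makes a flat sequence of entries workable as a file format.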



After the disk is updated, we update memory. In memory we can choose a simple data structure, such as a hash table, whose value for each key is the location of that key's Entry on disk, making it easy to find when searching.


With that, the storage flow in minidb is complete, and it takes only two steps: append a record to the disk file, then update the index in memory.


GET


Now let's look at GET. First, find the index information for the key in the in-memory hash table, which records where the value is stored in the disk file, then read the value directly from disk at that location.


DELETE


Next comes the delete operation. It does not locate and remove the original record; instead, the delete itself is encapsulated as an Entry and appended to the disk file, with the Entry's type marked as delete.


Then the key's index entry is removed from the in-memory hash table, and the delete is complete.

As you can see, insertion and deletion each take only two steps: one record append to the disk file and one index update in memory, while a query is just an index lookup plus one disk read. Therefore, regardless of data volume, minidb's write performance is very stable.
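The whole two-step model can be sketched as an in-memory toy, where a slice stands in for the append-only disk file and a map is the index. The names here (miniStore, record) are invented for illustration and are not minidb's actual API:

```go
package main

import "fmt"

// record is a simplified Entry: key, value, and a delete marker.
type record struct {
	key, value string
	deleted    bool
}

// miniStore models minidb's two structures: an append-only log
// (standing in for the disk file) and a key-to-offset hash index.
type miniStore struct {
	log     []record       // append-only, like the data file
	indexes map[string]int // key -> position of its latest record in the log
}

func newMiniStore() *miniStore {
	return &miniStore{indexes: make(map[string]int)}
}

// Put appends a record to the log, then updates the in-memory index.
func (s *miniStore) Put(key, value string) {
	s.log = append(s.log, record{key: key, value: value})
	s.indexes[key] = len(s.log) - 1
}

// Get looks up the offset in the index and reads the record from the log.
func (s *miniStore) Get(key string) (string, bool) {
	off, ok := s.indexes[key]
	if !ok {
		return "", false
	}
	return s.log[off].value, true
}

// Del appends a delete-marked record, then removes the index entry.
func (s *miniStore) Del(key string) {
	if _, ok := s.indexes[key]; !ok {
		return
	}
	s.log = append(s.log, record{key: key, deleted: true})
	delete(s.indexes, key)
}

func main() {
	s := newMiniStore()
	s.Put("a", "10")
	s.Put("a", "20") // the old record stays in the log; only the index moves
	v, _ := s.Get("a")
	fmt.Println(v, len(s.log)) // latest value wins, but the log keeps both records
}
```

Note that an update never touches the old record: it only appends a new one and repoints the index, which is exactly why the file keeps accumulating stale entries, as the next section discusses.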


Merge


Finally, let's look at one more important operation. As mentioned earlier, records are only ever appended to the disk file, so the file keeps growing. Moreover, the same key may have multiple entries in the file (recall that updating or deleting a key also appends a record), so the data file actually contains redundant Entry data.


Take a simple example: for key A, set its value to 10, then 20, then 30. There are now three records in the disk file:



At this point the latest value of A is 30, so the first two records are already invalid.


To deal with this, we need to periodically merge the data file and clean up invalid Entry data. This process is generally called merge.


The idea behind merging is also simple: read out all the Entry records in the original data file, rewrite the valid ones into a new temporary file, then delete the original data file. The temporary file becomes the new data file.
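The filtering rule, namely keep an Entry only if the in-memory index still points at its offset, can be sketched like this (the entry type and merge function are illustrative, not minidb's actual code):

```go
package main

import "fmt"

// entry is a simplified record: key, value, and a delete marker.
type entry struct {
	key, value string
	deleted    bool
}

// merge rewrites only the valid entries into a new log. An entry is
// valid exactly when the index, which always reflects the latest state,
// still points at its offset; all older duplicates are dropped.
func merge(log []entry, indexes map[string]int) ([]entry, map[string]int) {
	var newLog []entry
	newIdx := make(map[string]int)
	for off, e := range log {
		if latest, ok := indexes[e.key]; ok && latest == off {
			newIdx[e.key] = len(newLog) // index now points into the new log
			newLog = append(newLog, e)
		}
	}
	return newLog, newIdx
}

func main() {
	// Key A was set to 10, 20, 30 in turn, as in the example above.
	log := []entry{{"A", "10", false}, {"A", "20", false}, {"A", "30", false}}
	indexes := map[string]int{"A": 2} // the index points at the latest record
	newLog, newIdx := merge(log, indexes)
	fmt.Println(len(newLog), newLog[newIdx["A"]].value)
}
```

After the merge only one record for A survives, and the index is updated to point at its new offset, which is exactly what the real Merge method below does against the disk file.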



This is minidb's underlying data storage model; its name is bitcask, and rosedb adopts this model as well. It is essentially an LSM-like model whose core idea is to use sequential IO to improve write performance, though it is much simpler to implement than a full LSM tree.


With the storage model covered, we can move on to the implementation. The complete code is on my GitHub:


https://github.com/roseduan/minidb


Only the key pieces of code are excerpted in this article.


The first step is opening the database: load the data file, read out the Entry records in it, and restore the index state. The key part of the code is as follows:


func Open(dirPath string) (*MiniDB, error) {
   // create the database directory if it does not exist
   if _, err := os.Stat(dirPath); os.IsNotExist(err) {
      if err := os.MkdirAll(dirPath, os.ModePerm); err != nil {
         return nil, err
      }
   }

   // load the data file
   dbFile, err := NewDBFile(dirPath)
   if err != nil {
      return nil, err
   }

   db := &MiniDB{
      dbFile: dbFile,
      indexes: make(map[string]int64),
      dirPath: dirPath,
   }

   // load the indexes from the data file
   db.loadIndexesFromFile(dbFile)
   return db, nil
}
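The excerpt calls db.loadIndexesFromFile, which is not shown here. Its job is to replay every record in file order and rebuild the key-to-offset index; a minimal sketch over an in-memory stand-in (the entry type and loadIndexes function are illustrative, not minidb's actual code):

```go
package main

import "fmt"

// entry is a simplified record; its "offset" here is just its
// position in the slice that stands in for the data file.
type entry struct {
	key     string
	deleted bool
}

// loadIndexes replays every entry in file order and rebuilds the
// key-to-offset index: later entries overwrite earlier ones, and
// delete marks remove keys, so the final map reflects the latest state.
func loadIndexes(file []entry) map[string]int {
	indexes := make(map[string]int)
	for off, e := range file {
		if e.deleted {
			delete(indexes, e.key)
			continue
		}
		indexes[e.key] = off
	}
	return indexes
}

func main() {
	file := []entry{
		{key: "a"},                // PUT a
		{key: "b"},                // PUT b
		{key: "a"},                // PUT a again: the index moves to offset 2
		{key: "b", deleted: true}, // DEL b: the index entry is removed
	}
	fmt.Println(loadIndexes(file))
}
```

This replay is what makes the append-only design crash-safe: after a restart, the in-memory index can always be reconstructed from the file alone.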


Next, the Put method. The flow matches the description above: first write a record to disk, then update memory:


func (db *MiniDB) Put(key []byte, value []byte) (err error) {
   offset := db.dbFile.Offset
   // encapsulate the key and value into an Entry
   entry := NewEntry(key, value, PUT)
   // append the Entry to the data file
   err = db.dbFile.Write(entry)
   if err != nil {
      return
   }

   // update the in-memory index
   db.indexes[string(key)] = offset
   return
}


The Get method first fetches the index information from memory; if the key does not exist it returns immediately, otherwise it reads the data from disk:


func (db *MiniDB) Get(key []byte) (val []byte, err error) {
   // fetch the index information from memory
   offset, ok := db.indexes[string(key)]
   // the key does not exist
   if !ok {
      return
   }

   // read the data from disk
   var e *Entry
   e, err = db.dbFile.Read(offset)
   if err != nil && err != io.EOF {
      return
   }
   if e != nil {
      val = e.Value
   }
   return
}


The Del method is similar to Put, except the operation is encapsulated as an Entry marked DEL and appended to the file:


func (db *MiniDB) Del(key []byte) (err error) {
   // fetch the index information from memory
   _, ok := db.indexes[string(key)]
   // the key does not exist, nothing to do
   if !ok {
      return
   }

   // encapsulate the delete as an Entry and write it
   e := NewEntry(key, nil, DEL)
   err = db.dbFile.Write(e)
   if err != nil {
      return
   }

   // remove the key from the in-memory index
   delete(db.indexes, string(key))
   return
}


Finally, the important merge operation on the data file. The flow matches the description above; the key code is as follows:


func (db *MiniDB) Merge() error {
   var (
      validEntries []*Entry
      offset       int64
   )

   // read every Entry in the original data file
   for {
      e, err := db.dbFile.Read(offset)
      if err != nil {
         if err == io.EOF {
            break
         }
         return err
      }
      // the in-memory index is always up to date: an Entry is valid
      // only if the index still points at its offset
      if off, ok := db.indexes[string(e.Key)]; ok && off == offset {
         validEntries = append(validEntries, e)
      }
      offset += e.GetSize()
   }

   if len(validEntries) > 0 {
      // create a temporary merge file
      mergeDBFile, err := NewMergeDBFile(db.dirPath)
      if err != nil {
         return err
      }
      defer os.Remove(mergeDBFile.File.Name())

      // rewrite the valid entries
      for _, entry := range validEntries {
         writeOff := mergeDBFile.Offset
         err := mergeDBFile.Write(entry)
         if err != nil {
            return err
         }

         // update the index
         db.indexes[string(entry.Key)] = writeOff
      }

      // delete the old data file
      os.Remove(db.dbFile.File.Name())
      // the temporary file becomes the new data file
      os.Rename(mergeDBFile.File.Name(), db.dirPath+string(os.PathSeparator)+FileName)

      db.dbFile = mergeDBFile
   }
   return nil
}


Excluding tests, the core code of minidb is only about 300 lines. Small as it is, it is complete: it already embodies the main ideas of the bitcask storage model and forms the underlying foundation of rosedb.


Once you understand minidb, you have essentially grasped the bitcask storage model, and with a little more time I believe you can navigate rosedb with ease.


Going further, if kv storage interests you, there is much more related knowledge to study in depth. Although bitcask is simple and easy to understand, it also has plenty of problems; rosedb has optimized it through practice, but many problems still remain.


Some people may wonder: since the bitcask model is so simple, is it just a toy, or is it actually usable in production environments? The answer is that it is.


bitcask originated as the underlying storage model of the Riak project, and Riak is a distributed kv store that ranks among the best-known NoSQL databases.



Douban's distributed kv storage is also based on the bitcask model, with many optimizations. There are not many kv stores built purely on bitcask today, so take a look at the rosedb code and share your opinions and suggestions to improve the project together.


Finally, the relevant project addresses:

minidb: https://github.com/roseduan/minidb

rosedb: https://github.com/roseduan/rosedb


References:

https://riak.com/assets/bitcask-intro.pdf

https://medium.com/@arpitbhayani/bitcask-a-log-structured-fast-kv-store-c6c728a9536b







This article is shared from the WeChat public account - HHFCodeRv (hhfcodearts).
