Implement Your Own Database, Part Four

I. Introduction

The previous article explained some principles of the B+ tree and pointed out a weakness in the way we currently persist data: rows are appended to the file on their own, with no metadata. Appending makes insertion very fast, but, as mentioned last time, queries and deletions are very slow.

[Figure: data structure performance comparison chart]

Until now we have used an unsorted array of rows, which stores the data and nothing else. Its insertion performance is the best, but deletion and search take O(n) time. A sorted array is very fast to search, because binary search takes O(log(n)) time, but insertion and deletion take O(n). With a B+ tree that also stores metadata and the primary key, performance is balanced across search, insertion, and deletion: all three take O(log(n)) time.
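To make the O(n) versus O(log(n)) comparison concrete, here is a minimal sketch of the two lookups on a plain array of keys. The function names are hypothetical and are not part of the database code; they only illustrate the difference in search cost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Linear scan over an unsorted array: O(n). Returns the index or -1. */
long linear_find(const uint32_t* keys, size_t n, uint32_t target) {
  assert(keys != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (keys[i] == target) return (long)i;
  }
  return -1;
}

/* Binary search over a sorted array: O(log n). Returns the index or -1. */
long binary_find(const uint32_t* keys, size_t n, uint32_t target) {
  assert(keys != NULL || n == 0);
  size_t lo = 0, hi = n; /* search the half-open range [lo, hi) */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] == target) return (long)mid;
    if (keys[mid] < target) {
      lo = mid + 1; /* target can only be in the upper half */
    } else {
      hi = mid; /* target can only be in the lower half */
    }
  }
  return -1;
}
```

The binary search halves the range on every step, but it only works if the array is kept sorted, which is exactly what makes insertion into a sorted array cost O(n).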

II. Refactoring

2.1 Metadata information

When the data is stored as a tree, recording only the raw rows is no longer enough: each node must also record information such as pointers to its child nodes. To make it easier to traverse to sibling nodes, each node also stores a pointer to its parent node. (A pointer here is similar to a pointer in C; how it is stored on disk depends on the implementation, and it may be a page number.)

Similarly, to distinguish internal nodes from leaf nodes, we need to store the node type and whether the node is the root.

Another advantage of storing the data in a sorted tree is that it is more convenient to traverse.

Node type:

typedef enum { NODE_INTERNAL, NODE_LEAF } NodeType;

Common node header metadata:

/*
 * Common Node Header Layout
 */
// Size of the node type field. A single bit would be enough to tell a leaf
// node from an internal node, so a whole byte is slightly wasteful.
const uint32_t NODE_TYPE_SIZE = sizeof(uint8_t);
// Offset of the node type field: it sits at the very start of the page.
const uint32_t NODE_TYPE_OFFSET = 0;
// Size of the is_root flag.
const uint32_t IS_ROOT_SIZE = sizeof(uint8_t);
// Offset of the is_root flag.
const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE;
// Size of the pointer to the parent node.
const uint32_t PARENT_POINTER_SIZE = sizeof(uint32_t);
// Offset of the pointer to the parent node.
const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE;
// Total size of the header shared by every node.
const uint8_t COMMON_NODE_HEADER_SIZE =
  NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE;

Nodes other than internal nodes are leaf nodes. A leaf node stores the complete row data, so its contents differ from those of an internal node and it needs header fields of its own.

/*
 * Leaf Node Header Layout
 */
// Size of the field holding the number of cells in the leaf node. A cell is a
// key plus a value, i.e. a key followed by a serialized row.
const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t);
// Offset of the cell count within the leaf node.
const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE;
// Total size of the leaf node header.
const uint32_t LEAF_NODE_HEADER_SIZE =
  COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE;

The leaf node header records the number of cells the node holds. A cell consists of a key and a value, and can be regarded as a key followed by a serialized row.

[Figure: schematic diagram of the leaf node layout]

The schematic diagram shows the layout clearly: the first byte is node_type, the next byte is is_root, the next four bytes are the pointer to the parent node, and the four bytes after that are the number of cells (the diagram contains an error here and labels them as two). The rest of the page holds the cells, each a key followed by its value, laid out one after another. Whatever space is left over when the cells do not fill the page exactly is wasted.
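The arithmetic behind this layout can be sketched as follows. This is only an illustration under stated assumptions: the LAYOUT_* names and helper functions are hypothetical, the key is assumed to be a 4-byte integer, and the row size is passed in as a parameter because it is defined in an earlier part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Field sizes in bytes, following the layout described above. */
enum {
  LAYOUT_PAGE_SIZE = 4096,
  LAYOUT_NODE_TYPE_SIZE = 1,
  LAYOUT_IS_ROOT_SIZE = 1,
  LAYOUT_PARENT_POINTER_SIZE = 4,
  LAYOUT_NUM_CELLS_SIZE = 4,
  LAYOUT_LEAF_HEADER_SIZE = LAYOUT_NODE_TYPE_SIZE + LAYOUT_IS_ROOT_SIZE +
                            LAYOUT_PARENT_POINTER_SIZE + LAYOUT_NUM_CELLS_SIZE,
  LAYOUT_KEY_SIZE = 4 /* assumption: the key is a uint32_t */
};

/* A cell is a key followed by one serialized row. */
uint32_t layout_cell_size(uint32_t row_size) {
  return LAYOUT_KEY_SIZE + row_size;
}

/* Number of whole cells that fit after the leaf header. */
uint32_t layout_max_cells(uint32_t row_size) {
  uint32_t space = LAYOUT_PAGE_SIZE - LAYOUT_LEAF_HEADER_SIZE;
  return space / layout_cell_size(row_size);
}

/* Bytes at the end of the page that no cell can use. */
uint32_t layout_wasted_bytes(uint32_t row_size) {
  uint32_t space = LAYOUT_PAGE_SIZE - LAYOUT_LEAF_HEADER_SIZE;
  uint32_t used = layout_max_cells(row_size) * layout_cell_size(row_size);
  assert(used <= space);
  return space - used;
}
```

For example, with a hypothetical 293-byte row, each cell takes 297 bytes, a page holds 13 cells, and 225 bytes at the end of the page go unused.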

Functions for accessing the fields of a leaf node:

// Cast to char* before offsetting: arithmetic on void* is a compiler
// extension rather than standard C.
// Address of the cell count in a leaf node.
uint32_t* leaf_node_num_cells(void* node) {
  return (uint32_t*)((char*)node + LEAF_NODE_NUM_CELLS_OFFSET);
}
// Address of the cell_num-th cell in a leaf node.
void* leaf_node_cell(void* node, uint32_t cell_num) {
  return (char*)node + LEAF_NODE_HEADER_SIZE + cell_num * LEAF_NODE_CELL_SIZE;
}
// Address of the cell_num-th key; the key sits at the start of its cell.
uint32_t* leaf_node_key(void* node, uint32_t cell_num) {
  return leaf_node_cell(node, cell_num);
}
// Address of the value stored in the cell_num-th cell.
void* leaf_node_value(void* node, uint32_t cell_num) {
  return (char*)leaf_node_cell(node, cell_num) + LEAF_NODE_KEY_SIZE;
}
// Initialize a leaf node by setting its cell count to 0.
void initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; }

2.2 Changes to Table and Pager

The overall design is kept deliberately simple: partial pages cannot be read, and every read fetches an entire page.

const uint32_t PAGE_SIZE = 4096;
 const uint32_t TABLE_MAX_PAGES = 100;
 
 
 typedef struct {
   int file_descriptor;
   uint32_t file_length;
+  uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
 } Pager;
 
 typedef struct {
   Pager* pager;
-  uint32_t num_rows;
+  uint32_t root_page_num;
 } Table;

In the new definition, the maximum number of pages per table stays fixed, and the Table no longer tracks the number of rows. Instead, the Pager records the number of pages, and the Table stores the page number of the root page, so that the root page can easily be found through the table.
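Because only whole pages are ever read or written, the page count follows directly from the file length. A minimal sketch of that relationship, using hypothetical helper names and the same 4096-byte page size and 100-page limit as above:

```c
#include <assert.h>
#include <stdint.h>

enum { PAGER_SKETCH_PAGE_SIZE = 4096, PAGER_SKETCH_MAX_PAGES = 100 };

/* Returns the number of pages in a file of the given length, or -1 if the
 * file ends in a partial page (which this design treats as corruption). */
int64_t pager_sketch_num_pages(uint32_t file_length) {
  if (file_length % PAGER_SKETCH_PAGE_SIZE != 0) {
    return -1;
  }
  return file_length / PAGER_SKETCH_PAGE_SIZE;
}

/* Byte offset in the file at which the given page starts. */
uint64_t pager_sketch_page_offset(uint32_t page_num) {
  assert(page_num < PAGER_SKETCH_MAX_PAGES);
  return (uint64_t)page_num * PAGER_SKETCH_PAGE_SIZE;
}
```

An empty file has zero pages, which is how a brand-new database can be detected when it is opened.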

Here are the key changed lines (excerpts; the unchanged parts of each function are omitted):

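void* get_page(Pager* pager, uint32_t page_num) {
  // ... (unchanged code that allocates `page` and reads it from the file
  //      on a cache miss is omitted)
    pager->pages[page_num] = page;
    // Keep num_pages in step with the highest page number handed out.
    if (page_num >= pager->num_pages) {
      pager->num_pages = page_num + 1;
    }
  // ...
  return pager->pages[page_num];
}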

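Pager* pager_open(const char* filename) {
  // ... (unchanged code that opens the file as `fd` and measures
  //      `file_length` is omitted)
  Pager* pager = malloc(sizeof(Pager));
  pager->file_descriptor = fd;
  pager->file_length = file_length;
  pager->num_pages = (file_length / PAGE_SIZE);

  // The file must consist of whole pages only.
  if (file_length % PAGE_SIZE != 0) {
    printf("Db file is not a whole number of pages. Corrupt file.\n");
    exit(EXIT_FAILURE);
  }
  // ... (the rest of the function is unchanged)
}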

2.3 Cursor changes

The cursor identifies a position in the table. It used to be addressed by row number; now it is addressed by a page number and a cell number within that page.

typedef struct {
   Table* table;
-  uint32_t row_num;
+  uint32_t page_num;
+  uint32_t cell_num;
   bool end_of_table;  // Indicates a position one past the last element
 } Cursor;

Cursor creation:

Cursor* table_start(Table* table) {
   Cursor* cursor = malloc(sizeof(Cursor));
   cursor->table = table;
-  cursor->row_num = 0;
-  cursor->end_of_table = (table->num_rows == 0);
+  cursor->page_num = table->root_page_num;
+  cursor->cell_num = 0;
+
+  void* root_node = get_page(table->pager, table->root_page_num);
+  uint32_t num_cells = *leaf_node_num_cells(root_node);
+  cursor->end_of_table = (num_cells == 0);
 
   return cursor;
 }

This cursor points at the beginning of the table: its page is the table's root page, and leaf_node_num_cells gives the number of cells in the root node, which replaces the old row count when deciding whether the table is empty.

Cursor* table_end(Table* table) {
   Cursor* cursor = malloc(sizeof(Cursor));
   cursor->table = table;
-  cursor->row_num = table->num_rows;
+  cursor->page_num = table->root_page_num;
+
+  void* root_node = get_page(table->pager, table->root_page_num);
+  uint32_t num_cells = *leaf_node_num_cells(root_node);
+  cursor->cell_num = num_cells;
   cursor->end_of_table = true;
 
   return cursor;
 }

This creates a cursor positioned one past the last cell of the table, with the end_of_table flag set.

void* cursor_value(Cursor* cursor) {
-  uint32_t row_num = cursor->row_num;
-  uint32_t page_num = row_num / ROWS_PER_PAGE;
+  uint32_t page_num = cursor->page_num;
   void* page = get_page(cursor->table->pager, page_num);
-  uint32_t row_offset = row_num % ROWS_PER_PAGE;
-  uint32_t byte_offset = row_offset * ROW_SIZE;
-  return page + byte_offset;
+  return leaf_node_value(page, cursor->cell_num);
 }

cursor_value returns the value stored in the cell at cursor->cell_num. Since every cell has the same size, the lookup works much as it did before. Advancing the cursor:

void cursor_advance(Cursor* cursor) {
-  cursor->row_num += 1;
-  if (cursor->row_num >= cursor->table->num_rows) {
+  uint32_t page_num = cursor->page_num;
+  void* node = get_page(cursor->table->pager, page_num);
+
+  cursor->cell_num += 1;
+  if (cursor->cell_num >= (*leaf_node_num_cells(node))) {
     cursor->end_of_table = true;
   }
 }

Each advance only increments the cell number. You may ask what happens once the cursor moves past the last cell: at that point the end_of_table flag is set and the scanning loop exits.
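The scanning loop mentioned here can be sketched with a small in-memory stand-in for a single leaf page. The Scan* types and scan_* functions are hypothetical simplifications of Cursor, table_start, and cursor_advance, reduced to what the loop needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SCAN_MAX_CELLS = 13 };

/* Hypothetical stand-in for one leaf page: a cell count plus the keys. */
typedef struct {
  uint32_t num_cells;
  uint32_t keys[SCAN_MAX_CELLS];
} ScanLeaf;

typedef struct {
  const ScanLeaf* leaf;
  uint32_t cell_num;
  bool end_of_table; /* true once the cursor is one past the last cell */
} ScanCursor;

/* Cursor at the first cell; an empty leaf is already at the end. */
ScanCursor scan_start(const ScanLeaf* leaf) {
  ScanCursor cursor = {leaf, 0, leaf->num_cells == 0};
  return cursor;
}

/* Move to the next cell and flag the end once every cell was visited. */
void scan_advance(ScanCursor* cursor) {
  cursor->cell_num += 1;
  if (cursor->cell_num >= cursor->leaf->num_cells) {
    cursor->end_of_table = true;
  }
}

/* Visit every cell the way a full-table scan would, summing the keys. */
uint64_t scan_sum_keys(const ScanLeaf* leaf) {
  uint64_t sum = 0;
  for (ScanCursor c = scan_start(leaf); !c.end_of_table; scan_advance(&c)) {
    assert(c.cell_num < leaf->num_cells);
    sum += leaf->keys[c.cell_num];
  }
  return sum;
}
```

The loop never reads past the last cell, because the flag is set in the same step that moves the cursor beyond it.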

Opening the database: if the file is new, page 0 is initialized as a leaf node.

Table* db_open(const char* filename) {
   Pager* pager = pager_open(filename);
   Table* table = malloc(sizeof(Table));
   table->pager = pager;
  table->root_page_num = 0;

  if (pager->num_pages == 0) {
    // New database file. Initialize page 0 as leaf node.
    void* root_node = get_page(pager, 0);
    initialize_leaf_node(root_node);
  }
 
   return table;
 }

Next comes the key code. First, inserting data into a leaf node:

void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) {
  void* node = get_page(cursor->table->pager, cursor->page_num);

  uint32_t num_cells = *leaf_node_num_cells(node);

  if (num_cells >= LEAF_NODE_MAX_CELLS) {
    // The node is full.
    printf("Need to implement splitting a leaf node.\n");
    exit(EXIT_FAILURE);
  }

  if (cursor->cell_num < num_cells) {
    // Make room for the new cell by shifting later cells one slot right.
    for (uint32_t i = num_cells; i > cursor->cell_num; i--) {
      memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
             LEAF_NODE_CELL_SIZE);
    }
  }
  // Increment the cell count, then write the key and the serialized row.
  *(leaf_node_num_cells(node)) += 1;
  *(leaf_node_key(node, cursor->cell_num)) = key;
  serialize_row(value, leaf_node_value(node, cursor->cell_num));
}

This function assumes that the tree has only one page; the current version does not support multiple pages. The insert operation:
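The shifting step in leaf_node_insert can be tried in isolation with a simplified in-memory leaf. The Demo* types below are hypothetical: the value is reduced to a single integer, the capacity is set to four cells to keep the example small, and a full node returns an error code instead of exiting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_MAX_CELLS = 4 };

typedef struct {
  uint32_t key;
  uint32_t value; /* stands in for the serialized row */
} DemoCell;

typedef struct {
  uint32_t num_cells;
  DemoCell cells[DEMO_MAX_CELLS];
} DemoLeaf;

/* Insert a cell at index pos, shifting later cells one slot to the right.
 * Returns 0 on success, or -1 if the leaf is full (splitting is not
 * implemented, matching the current state of the article's code). */
int demo_leaf_insert(DemoLeaf* leaf, uint32_t pos, uint32_t key,
                     uint32_t value) {
  if (leaf->num_cells >= DEMO_MAX_CELLS) {
    return -1;
  }
  assert(pos <= leaf->num_cells);
  /* memmove handles the overlapping source and destination ranges. */
  memmove(&leaf->cells[pos + 1], &leaf->cells[pos],
          (leaf->num_cells - pos) * sizeof(DemoCell));
  leaf->cells[pos].key = key;
  leaf->cells[pos].value = value;
  leaf->num_cells += 1;
  return 0;
}
```

Appending at the end, as execute_insert does through table_end, is the case where pos equals num_cells and nothing has to move.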

ExecuteResult execute_insert(Statement* statement, Table* table) {
  void* node = get_page(table->pager, table->root_page_num);
  if ((*leaf_node_num_cells(node) >= LEAF_NODE_MAX_CELLS)) {
    return EXECUTE_TABLE_FULL;
  }

  Row* row_to_insert = &(statement->row_to_insert);
  // Rows are still appended: the cursor points one past the last cell.
  Cursor* cursor = table_end(table);
  leaf_node_insert(cursor, row_to_insert->id, row_to_insert);
  free(cursor);

  // Report success, as in the earlier versions of this function.
  return EXECUTE_SUCCESS;
}

III. Print Commands

Add a command that prints the metadata constants.

void print_constants() {
  printf("ROW_SIZE: %d\n", ROW_SIZE);
  printf("COMMON_NODE_HEADER_SIZE: %d\n", COMMON_NODE_HEADER_SIZE);
  printf("LEAF_NODE_HEADER_SIZE: %d\n", LEAF_NODE_HEADER_SIZE);
  printf("LEAF_NODE_CELL_SIZE: %d\n", LEAF_NODE_CELL_SIZE);
  printf("LEAF_NODE_SPACE_FOR_CELLS: %d\n", LEAF_NODE_SPACE_FOR_CELLS);
  printf("LEAF_NODE_MAX_CELLS: %d\n", LEAF_NODE_MAX_CELLS);
}

MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) {
  if (strcmp(input_buffer->buffer, ".exit") == 0) {
    db_close(table);
    exit(EXIT_SUCCESS);
  } else if (strcmp(input_buffer->buffer, ".constants") == 0) {
    // New command: print the layout constants.
    printf("Constants:\n");
    print_constants();
    return META_COMMAND_SUCCESS;
  } else {
    return META_COMMAND_UNRECOGNIZED_COMMAND;
  }
}

There is not much to say here: it simply adds a command that prints the constants.

IV. Tree Visualization

To help with debugging, add the ability to print the tree:

void print_leaf_node(void* node) {
  uint32_t num_cells = *leaf_node_num_cells(node);
  printf("leaf (size %d)\n", num_cells);
  for (uint32_t i = 0; i < num_cells; i++) {
    uint32_t key = *leaf_node_key(node, i);
   printf("  - %d : %d\n", i, key);
  }
}

print_leaf_node walks the node and prints the index and key of each cell. The new meta command:

MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) {
  if (strcmp(input_buffer->buffer, ".exit") == 0) {
    db_close(table);
    exit(EXIT_SUCCESS);
  } else if (strcmp(input_buffer->buffer, ".btree") == 0) {
    // New command: print the single leaf node stored in page 0.
    printf("Tree:\n");
    print_leaf_node(get_page(table->pager, 0));
    return META_COMMAND_SUCCESS;
  } else if (strcmp(input_buffer->buffer, ".constants") == 0) {
    printf("Constants:\n");
    print_constants();
    return META_COMMAND_SUCCESS;
  } else {
    return META_COMMAND_UNRECOGNIZED_COMMAND;
  }
}

The biggest change this time is the file's storage structure, which now follows the B+ tree layout. The cells in the file are not yet sorted by key, and only a single page is supported, but it is still a major improvement. One step at a time.

Origin blog.csdn.net/mseaspring/article/details/128810882