The working principle of relational database - data structure (3)

This article is translated from the Coding-Geek article: How does a relational database work. 
Original link: http://coding-geek.com/how-databases-work/#Buffer-Replacement_strategies 
This installment translates the chapters on the core data structures: arrays, trees, and hash tables.

1. Array, Tree and Hash table

In the previous chapters we covered time complexity and merge sort. Next I will introduce three data structures. They are so important that they are the cornerstones of modern database systems. Along the way I will also introduce the concept of a database index.

2. Array

A two-dimensional array is the simplest data structure, and a database table can be regarded as a two-dimensional array. For example:

(Figure: a table of people represented as a two-dimensional array)


This two-dimensional array represents a table structure with rows and columns:

  1. Each row represents an object
  2. The data of all columns in each row represents all the properties of an object
  3. Each column stores a certain type of data (such as integer, string, date...).

Two-dimensional arrays are fine for storing table data, but when you need to query the array on some condition, the performance is unacceptable.

For example: to find everyone who works in the UK, you have to iterate through every row and check whether its country column is the UK. This costs N steps (where N is the number of rows in the table). That doesn't sound too bad, but is there a faster way?

There is, and this is where trees come in.

Note: modern databases use more advanced structures to store table data, such as heap-organized tables or index-organized tables. But none of them solves the problem of quickly filtering rows by a condition on some columns.
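To make this concrete, here is a minimal sketch of a table stored as a two-dimensional array, with the full-scan query described above. The names and rows are invented for illustration:

```python
# A table as a two-dimensional array: each row is one person,
# columns are (first_name, last_name, country).
people = [
    ("John",  "Smith",    "UK"),
    ("Marie", "Curie",    "France"),
    ("Alan",  "Turing",   "UK"),
    ("Ada",   "Lovelace", "UK"),
]

def find_by_country(rows, country):
    """Full scan: inspects every row, so the cost is O(N)."""
    return [row for row in rows if row[2] == country]

print(find_by_country(people, "UK"))
```

Every query walks the whole array, no matter how selective the condition is; that is the problem the tree solves below.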

3. Tree and database index

(Figure: a binary search tree with 15 nodes; every key in a node's left subtree is smaller than the node's key, and every key in its right subtree is larger)

This tree has 15 nodes. Let's see how to find the element 208 from it:

  1. The search starts at the root node, 136. Since 136 < 208, we search the right subtree of 136.
  2. 398 > 208, so we search the left subtree of 398.
  3. 250 > 208, so we search the left subtree of 250.
  4. 200 < 208, so we would search the right subtree of 200. But 200 has no right subtree: 208 is not in the tree (if it were, it would have to be in the right subtree of 200).

Let's see how to find the element 40:

  1. The search again starts at the root node 136. Since 136 > 40, we search the left subtree of 136.
  2. 80 > 40, so we search the left subtree of 80.
  3. 40 = 40, element found! We extract the array row index stored in this node.
  4. With that row index, fetching the row's data is immediate (array subscript access).

Either way, the number of steps is the height of the tree. If you read the merge sort chapter carefully, you know that the height of a balanced tree is log(N), so the time complexity of the search is O(log(N)). Not bad.
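The two walkthroughs above can be sketched as code. The fragment of the tree below uses the keys from the walkthroughs, but the row indexes attached to each node are invented:

```python
class Node:
    def __init__(self, key, row_index, left=None, right=None):
        self.key = key              # the indexed column value
        self.row_index = row_index  # position of the row in the table array
        self.left = left
        self.right = right

def bst_search(node, key):
    """Walk down the tree; each step descends one level, so the cost
    is the height of the tree: O(log N) for a balanced tree."""
    while node is not None:
        if key == node.key:
            return node.row_index   # found: return the stored row index
        node = node.left if key < node.key else node.right
    return None                     # key is not in the tree

# Part of the tree from the figure: 136 at the root, 80 and 398 below it, ...
root = Node(136, 0,
            left=Node(80, 1, left=Node(40, 2)),
            right=Node(398, 3, left=Node(250, 4, left=Node(200, 5))))

print(bst_search(root, 40))    # found: 136 -> 80 -> 40
print(bst_search(root, 208))   # 136 -> 398 -> 250 -> 200 -> not found
```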

(1) Back to our problem

That was abstract, so let's get back to our problem. Instead of simple integers, consider strings: say, the string that represents a person's country in the preceding table. Suppose you build a tree that contains the "country" column of that table:

  1. You want to know who works in the UK.
  2. You search the tree and find the UK node.
  3. Inside the UK node, you find the array row indexes of everyone working in the UK.

This lookup costs only log(N) steps, instead of the N steps a direct array scan requires. Can you now guess what a database index is?

You can build an index on any combination of columns (one string column, one integer column, two string columns, an integer plus a string, a date column, etc.), as long as you have a function to compare the keys, so that an order can be established among them (databases already provide comparison functions for all basic types).
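As a rough illustration of such an index, here is a sketch that uses a sorted key list with binary search in place of a real tree (the ordered-key idea is the same); the table rows and names are invented:

```python
import bisect

# Hypothetical table rows: (name, country). The index maps each distinct
# country to the list of row numbers where it appears.
rows = [("John", "UK"), ("Marie", "France"), ("Alan", "UK"), ("Ada", "UK")]

# Build the index: row lists per key, plus the keys in sorted order
# (this sorted order plays the role of the tree's ordering).
index = {}
for i, (_, country) in enumerate(rows):
    index.setdefault(country, []).append(i)
sorted_keys = sorted(index)

def lookup(country):
    """Binary search over the sorted keys: O(log N) instead of a full scan."""
    pos = bisect.bisect_left(sorted_keys, country)
    if pos < len(sorted_keys) and sorted_keys[pos] == country:
        return index[sorted_keys[pos]]   # all row numbers for this country
    return []

print(lookup("UK"))   # row numbers of everyone working in the UK
```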

(2) B+ tree index

The binary tree above works well for looking up one specific value, but its performance is very poor when you want all values in a range: it costs N steps, because every node has to be compared against the range. It is also expensive in I/O, because the whole index tree may have to be read. We need an efficient way to run range queries. To solve this problem, modern databases use the B+ tree, an evolution of the binary search tree above. Inside a B+ tree:

  1. Only the leaf nodes store information: the pointers to the rows of the associated table.
  2. The purpose of the other nodes is only to route the search to the correct leaf node.

    (Figure: a B+ tree whose leaf nodes are linked together in ascending order)

    As the figure shows, a B+ tree stores roughly twice as many nodes. The extra internal "decision nodes" only help you route to the correct leaf nodes (the nodes that store the table row pointers). The search complexity is still O(log(N)); the tree just has more levels. The key difference is that each leaf node stores a pointer to its successor leaf.

With this B+ tree, if you look for all values between 40 and 100:

  1. You search for the node with value 40 (or the closest node above 40, if 40 does not exist), exactly as in the binary tree above.
  2. Then you collect 40's successors by following the leaf pointers, until you encounter a value greater than 100.

Suppose the tree has N nodes and the range contains M matches. Finding the starting node (40) costs log(N), the same as the binary tree search; then you follow the leaf links for M steps to collect the M successors. So a B+ tree range query costs O(M + log(N)), far better than the O(N) of the plain binary tree, and the advantage grows with the amount of data. You also don't need to read the whole tree, which means fewer disk I/Os.
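Here is a sketch of that linked-leaf walk. The inner routing nodes are omitted and the log(N) descent from the root is simulated by a scan; the keys and row pointers are invented:

```python
class Leaf:
    """A B+ tree leaf: a key, a pointer to the table row, and a link to
    the next leaf, so that the leaves form an ordered linked list."""
    def __init__(self, key, row_pointer):
        self.key = key
        self.row_pointer = row_pointer
        self.next = None

# Build a chain of leaves in ascending key order.
keys = [25, 40, 61, 78, 100, 113]
leaves = [Leaf(k, f"row#{k}") for k in keys]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

def range_query(first_leaf, low, high):
    """Find the first leaf >= low (in a real B+ tree this is the log(N)
    descent; simulated here by a scan), then follow the leaf links
    collecting the M matches: O(M + log N) overall."""
    node = first_leaf
    while node is not None and node.key < low:    # stand-in for tree descent
        node = node.next
    result = []
    while node is not None and node.key <= high:  # walk the linked leaves
        result.append(node.row_pointer)
        node = node.next
    return result

print(range_query(leaves[0], 40, 100))
```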

However, this also introduces new problems (again, problems). If you add or delete a row from the database, you also need to update the data in the B+ tree:

  1. You need to keep the order of nodes in the B+ tree, otherwise you can't find nodes in a messy tree.
  2. You must keep the leaf nodes linked in ascending order, otherwise a range query degenerates from O(M + log(N)) to O(N).

In other words, the B+ tree must be able to keep itself balanced and ordered. Thankfully, smart insertion and deletion algorithms let it maintain these properties. But this comes at a cost: inserting or deleting in a B+ tree is O(log(N)). This is why you often hear that too many indexes are a bad idea: they slow down insert/update/delete operations, because the database must update every index of the table, at O(log(N)) per index.

Translator's note: everything has two sides; every benefit comes with a cost. Which data structure you choose depends on your application scenario.

In addition, indexes also increase the workload of the transaction manager (the transaction manager is covered in a later chapter).

For more details, you can look up B+ trees on Wikipedia. For a real-world B+ tree implementation, read this article ( https://blog.jcole.us/2013/01/07/the-physical-structure-of-innodb-index-pages/ ) by a core MySQL developer; it details how InnoDB (a MySQL storage engine) implements indexes.

4. Hash table

The last important data structure is the hash table. Hash tables are very useful when you need to look up individual items quickly. Understanding them will also help us later with a common database join technique, the hash join. Hash tables are also used to store some of a database's internal data, such as the lock table and the buffer pool; both concepts are discussed later.

A hash table is a data structure that can quickly find an element from its key. To build one, you need to define:

1) A key for your elements.
2) A hash function for the keys. The hash of a key gives the location where the element is stored (the location is called a bucket).
3) A comparison function for the keys. Once the right bucket is located, the comparison function is used to find the element inside the bucket.

(1) A simple example

(Figure: a hash table with 10 buckets)

This hash table has 10 buckets. Only 5 of them are drawn in the figure; imagine the other 5 yourself. The hash function is the value modulo 10 (the remainder after dividing by 10); in other words, the last digit of a value determines its bucket:
  1. If the last digit of the number is 0, store the value in bucket 0
  2. If the last digit of the number is 1, store the value in bucket 1
  3. If the last digit of the number is 2, store the value in bucket 2
  4. …..

The comparison function simply tests whether two integers are equal.

Let's see how to find element 78 in the hash table:

  1. Calculate the hash value 8 through the hash function
  2. Find elements in bucket 8, the first element is 78
  3. return element 78
  4. The whole query takes only two steps: the first computes the hash and locates the bucket; the second checks the elements inside the bucket.

Let's take another look at how to find element 59:

  1. Calculate the hash value, get 9
  2. Look in bucket 9. The first element is 99; 99 != 59, so it is not the element we want.
  3. In the same way, check the second element (9), the third (79), ... and the last (29).
  4. The element 59 does not exist
  5. This query performs 7 steps
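Both lookups can be reproduced with a small sketch. The step counting below is the code's own (one step for the hash plus one per comparison), so the totals differ slightly from the text's counting:

```python
def bucket_of(value):
    """The hash function from the example: the last digit of the number."""
    return value % 10

# Ten buckets; elements are placed by their last digit, as in the figure.
buckets = [[] for _ in range(10)]
for value in [78, 99, 9, 79, 29, 40, 3]:
    buckets[bucket_of(value)].append(value)

def lookup(value):
    """Hash to the right bucket, then compare each element inside it."""
    steps = 1                      # step 1: compute the hash
    for element in buckets[bucket_of(value)]:
        steps += 1                 # one comparison per element in the bucket
        if element == value:
            return True, steps
    return False, steps

print(lookup(78))   # bucket 8 holds only 78: found after one comparison
print(lookup(59))   # bucket 9 holds 99, 9, 79, 29: four comparisons, not found
```

Notice how the cost of `lookup(59)` is dominated by the size of bucket 9, which is exactly the point of the next section.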

(2) A good hash function

As you can see, the cost of a query depends on the value you look for.

If you change the hash function to modulo 1,000,000 (i.e., use the last 6 digits as the bucket id), the second lookup above costs only 1 step, because bucket 000059 is empty. The real challenge is to find a hash function that keeps the number of elements in each bucket as small as possible.

Finding a good hash function in the example above is easy, but that's because the keys are simple integers. It is much harder when the key is: 
1. A string (e.g. a person's last name) 
2. Two strings (e.g. a person's last name and first name) 
3. Two strings and a date (e.g. a person's last name, first name, and date of birth)

With a good hash function, hash table lookups cost O(1).
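For string keys, one classic choice is the base-31 polynomial hash (the same scheme Java's `String.hashCode` uses): it mixes every character, so similar names still spread across buckets. A sketch; the composite-key helper `person_hash` is a hypothetical illustration of hashing two strings together:

```python
def string_hash(s, num_buckets):
    """Base-31 polynomial hash over the characters of s, reduced to a bucket."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF   # keep the value in 32 bits
    return h % num_buckets

def person_hash(last, first, num_buckets):
    """Hash a composite key (last name, first name) by joining the parts
    with a separator that cannot appear in a name."""
    return string_hash(last + "\x00" + first, num_buckets)

print(string_hash("Smith", 10))
print(person_hash("Smith", "John", 10))
```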

5. Array vs. Hash table

Why not use an array? Good question.

  1. A hash table can be partially loaded into memory, with the remaining buckets staying on disk. There is no need to load the whole structure into memory, which saves memory space.
  2. An array must occupy contiguous memory. If you load a large table into memory, it is hard to find a large enough contiguous block, and the risk of allocation failure is high.
  3. Hash table supports selecting any field you want as a key (for example: the country of the person, plus the person's name. Any combination).

For more information, you can read the article on how HashMap is implemented in Java, an example of an efficient hash map implementation. You don't need to know Java to understand the concepts in that article.

 

https://blog.csdn.net/ylforever/article/details/51278954
