Database indexes and their data structures

Source: http://blog.linezing.com/?p=798

Reprinted: http://blog.csdn.net/kennyrose/article/details/7532032

 

To put it bluntly, the indexing problem is a search problem.

 

A database index is a sorted data structure in a database management system that helps to quickly query and update data in database tables. Indexes are usually implemented with the B-tree or its variant, the B+ tree.

In addition to the data itself, a database system maintains data structures that suit particular lookup algorithms. These structures reference (point to) the data in some way, so that efficient lookup algorithms can run on them. Such a data structure is an index.

Indexing a table comes at a cost: it increases the storage space the database needs, and it makes inserting and modifying data slower (because the index must be updated as well).

 

The figure above shows one possible way of indexing. On the left is the data table, with two columns and seven records; the leftmost value is the physical address of each data record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up searches on Col2, a binary search tree like the one on the right can be maintained. Each node contains an index key value and a pointer to the physical address of the corresponding data record, so that binary search can locate the corresponding data in O(log2 n) time.
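
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the addresses and column values below are made up (they are not the values from the figure), and a sorted list searched with the standard bisect module stands in for the binary search tree, since both give O(log2 n) lookups.

```python
from bisect import bisect_left

# Hypothetical table: physical address -> (Col1, Col2). All values are invented.
table = {
    0x07: (1, 34),
    0x56: (2, 77),
    0x6A: (3, 5),
    0xF3: (4, 91),
    0x90: (5, 22),
    0x77: (6, 89),
    0xD1: (7, 23),
}

# The index: (Col2 value, address) pairs kept sorted by Col2.
index = sorted((row[1], addr) for addr, row in table.items())
keys = [key for key, _ in index]


def lookup(col2_value):
    """Binary-search the index, then follow the stored address to the record."""
    i = bisect_left(keys, col2_value)
    if i < len(keys) and keys[i] == col2_value:
        return table[index[i][1]]
    return None  # no record with this Col2 value


print(lookup(89))
print(lookup(40))
```

Without the index, finding a Col2 value would mean checking every record in turn; with it, only about log2 n comparisons are needed before one record is fetched by address.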

 

Creating indexes can greatly improve system performance.

First, by creating a unique index, the uniqueness of each row of data in the database table can be guaranteed.

Second, it can greatly speed up data retrieval, which is the main reason for creating indexes.

Third, it can speed up joins between tables, which is especially useful when enforcing referential integrity.

Fourth, when a query uses grouping and sorting clauses, an index can significantly reduce the time spent grouping and sorting.

Fifth, with indexes available, the query optimizer can choose better execution plans, which improves the performance of the system.

 

Some people may ask: if indexes have so many advantages, why not create one for every column in the table? Because adding indexes also has downsides.

First, creating and maintaining indexes takes time, and that time grows with the amount of data.

Second, indexes occupy physical space. In addition to the space the table data occupies, each index takes up a certain amount of physical space, and a clustered index requires even more.

Third, whenever data in the table is added, deleted, or modified, the indexes must be maintained as well, which slows down data maintenance.

 

Indexes are built on particular columns of a database table, so when creating an index you should consider which columns are worth indexing and which are not. In general, indexes should be created on the following columns: columns that are searched frequently, to speed up searches; primary key columns, to enforce uniqueness and to organize the arrangement of the data in the table; columns often used in joins, which are mainly foreign keys, to speed up the joins; columns that are often searched by range, because the index is sorted and the specified range is therefore contiguous; columns that are often sorted on, because a query can reuse the index order and cut sorting time; and columns often used in WHERE clauses, to speed up the evaluation of conditions.
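
As a small illustration, the following sketch uses Python's built-in sqlite3 module to index a column that appears in a WHERE clause and then asks SQLite how it would run two queries. The table, column, and index names are assumptions made up for this example, and other database systems report their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, lname TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee (lname, dept) VALUES (?, ?)",
    [("Smith", "ops"), ("Jones", "dev"), ("Brown", "dev")],
)

# Index the column that is searched frequently (hypothetical index name).
conn.execute("CREATE INDEX idx_employee_lname ON employee (lname)")


def query_plan(sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)  # the fourth column is the detail text


plan_indexed = query_plan("SELECT * FROM employee WHERE lname = ?", ("Jones",))
plan_unindexed = query_plan("SELECT * FROM employee WHERE dept = ?", ("dev",))
print(plan_indexed)
print(plan_unindexed)
```

The plan for the lname query names the index, while the plan for the dept query, which has no index to use, falls back to scanning the table.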

 

Also, some columns should not be indexed. In general, these columns that should not be indexed have the following characteristics:

First, indexes should not be created on columns that are rarely used or referenced in queries. Because these columns are rarely used, an index does little to improve query speed. On the contrary, the extra index slows system maintenance and increases the space requirement.

Second, indexes should not be added to columns with only a few distinct values, such as the gender column of a personnel table. Because such a column has so few values, the rows in a query's result set make up a large proportion of the rows in the table, that is, a large share of the table has to be read anyway. Adding an index does not significantly speed up retrieval.

Third, indexes should not be added to columns defined with the text, image, or bit data types, because such columns either hold very large amounts of data or have very few distinct values.

Fourth, indexes should not be created when modification performance matters far more than retrieval performance. The two are in tension: adding indexes improves retrieval performance but reduces modification performance, while removing indexes improves modification performance but reduces retrieval performance.

 

Depending on the capabilities of the database, three types of indexes can be created in the database designer: unique indexes, primary key indexes, and clustered indexes.

 

Unique index

 

A unique index is one that does not allow any two rows to have the same index value.

 

Most databases do not allow a newly created unique index to be saved with a table when the existing data contains duplicate key values. The database may also reject new data that would create duplicate key values in the table. For example, if a unique index is created on the employee's last name (lname) in the employee table, no two employees can have the same last name.

 

Primary key index

 

Database tables often have a column or combination of columns whose values uniquely identify each row in the table. This column (or combination of columns) is called the primary key of the table.

 

Defining a primary key for a table in a database diagram automatically creates a primary key index, which is a specific type of unique index. This index requires every value of the primary key to be unique. It also allows fast access to data when the primary key is used in queries.

 

Clustered index

 

In a clustered index, the physical order of the rows in the table is the same as the logical (index) order of the key values. A table can contain only one clustered index.

 

If an index is not a clustered index, the physical order of the rows in the table does not match the logical order of the key values. Clustered indexes generally provide faster data access than nonclustered indexes.

 

 

 

The principle of locality and disk read-ahead

 

Because of the characteristics of the storage medium, disk access is much slower than main memory access. Once the cost of mechanical movement is included, the access speed of a disk is often a hundredth of that of main memory or less. Therefore, to improve efficiency, the number of disk I/Os must be reduced as much as possible. To this end, the disk usually does not read strictly on demand but reads ahead every time: even if only one byte is required, the disk starts from that position and sequentially reads a certain length of the data that follows into memory. The rationale is the well-known principle of locality in computer science: when one piece of data is used, nearby data is usually used soon afterward, and the data a program needs while running is usually concentrated.

Because sequential disk reads are very efficient (no seek time and very little rotational latency), read-ahead can improve I/O efficiency for programs that exhibit locality.

The read-ahead length is generally an integer multiple of the page size. A page is the logical block in which a computer manages storage: hardware and operating systems often divide main memory and disk storage into consecutive blocks of equal size, each called a page (in many operating systems a page is usually 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read request to the disk, which finds the starting position of the data, reads one or more consecutive pages, and loads them into memory. The fault handler then returns and the program continues to run.
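
To make the page arithmetic concrete, here is a small sketch that works out which whole pages a read touches. The function name and the readahead_pages parameter are hypothetical, and the 4 KB page size is just the common value mentioned above; real operating systems choose read-ahead amounts with their own heuristics.

```python
PAGE_SIZE = 4096  # 4 KB, assumed here as the page size


def pages_to_read(offset, length, readahead_pages=0):
    """Return (first_page, page_count) for a read of `length` bytes at `offset`.

    Transfers happen in whole pages, so even a 1-byte request costs a full
    page, plus any extra pages fetched as read-ahead.
    """
    first = offset // PAGE_SIZE
    last = (offset + max(length, 1) - 1) // PAGE_SIZE
    return first, last - first + 1 + readahead_pages


print(pages_to_read(5000, 1))                     # one byte inside page 1
print(pages_to_read(4090, 10))                    # a read that straddles pages 0 and 1
print(pages_to_read(5000, 1, readahead_pages=3))  # same byte, with read-ahead
```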

Performance Analysis of B-/+Tree Index

At this point, we can finally analyze the performance of the B-/+Tree index.

As mentioned above, the number of disk I/Os is generally used to evaluate the pros and cons of an index structure. Starting with the B-Tree: by its definition, a retrieval needs to visit at most h nodes. The designers of database systems cleverly exploit disk read-ahead by setting the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, an actual B-Tree implementation uses the following technique:

Each time a new node is created, a full page of space is requested for it, which ensures that a node is physically stored within one page. Since storage allocation is page-aligned, reading a node requires only one I/O.

A retrieval in a B-Tree requires at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h) = O(log_d N). In practice the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).

In a red-black tree, by contrast, h is obviously much larger. Because logically close nodes (parent and child) may be physically far apart, locality cannot be exploited. So although the I/O asymptotic complexity of a red-black tree is also O(h), it is significantly less efficient than a B-Tree.
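
A rough calculation shows how quickly the height shrinks as the out-degree grows. The sketch below is only an estimate under a simplifying assumption (every node is full, so a tree of height h with out-degree d reaches about d^h keys); the function name and the figure of one million keys are made up for illustration.

```python
def min_height(n_keys, fanout):
    """Smallest h with fanout**h >= n_keys, using integer arithmetic only."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        h += 1
    return h


n = 1_000_000  # hypothetical number of keys
print(min_height(n, 100))  # out-degree 100, as in a typical B-Tree node
print(min_height(n, 2))    # out-degree 2, as in a binary tree
```

For a million keys the first call gives 3 and the second gives 20, which is the gap between a handful of page reads and dozens of scattered ones.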

 

To sum up, it is very efficient to use B-Tree as an index structure.

 

 

It is worth spending time learning the B-tree and B+ tree data structures

=============================================================================================================

 

1) B-tree

Each node in a B-tree contains keys together with pointers to the storage addresses of the data objects those keys correspond to, so a successful search does not need to reach a leaf node of the tree.

A successful search consists of searches within nodes and a search along a path through the tree. The time for a successful search depends on the level at which the key is located and on the number of keys in each node.

 

To find a given key in a B-tree, first take the root node and search for the key among the keys K1, ..., Kj contained in it (sequential search or binary search can be used). If a key equal to the given value is found, the search succeeds. Otherwise, the key being sought must lie between some Ki and Ki+1, so follow the pointer Pi to the index node block on the next level and continue the search there, until the key is found or the search fails because the pointer Pi is empty.
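
The search procedure can be sketched as follows. This is a minimal in-memory illustration, not a disk-based implementation: the class BTreeNode, its field names, and the tiny hand-built tree are all assumptions made for the example, and Python's 0-based lists replace the 1-based K1..Kj numbering.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, addrs, children=None):
        self.keys = keys                # sorted keys of this node
        self.addrs = addrs              # addrs[i] is the data address for keys[i]
        self.children = children or []  # one more pointer than keys; empty for a leaf


def btree_search(node, key):
    """Return (address, level) if the key is found, else None."""
    level = 1
    while True:
        i = bisect_left(node.keys, key)  # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return node.addrs[i], level  # found, possibly above the leaf level
        if not node.children:
            return None                  # pointer is empty: the search fails
        node = node.children[i]          # key lies between keys[i-1] and keys[i]
        level += 1


# Hand-built two-level tree with invented keys and addresses.
root = BTreeNode(
    [20, 40], ["a20", "a40"],
    [
        BTreeNode([5, 10], ["a5", "a10"]),
        BTreeNode([25, 30], ["a25", "a30"]),
        BTreeNode([45, 50], ["a45", "a50"]),
    ],
)

print(btree_search(root, 20))  # found in the root
print(btree_search(root, 30))  # found one level down
print(btree_search(root, 33))  # not present
```

Key 20 is found at level 1 without touching any leaf, which is exactly the property that distinguishes the B-tree from the B+ tree described next.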

 

 

2) B+ tree

 

The keys stored in the non-leaf nodes of a B+ tree carry no address pointers to data objects; the non-leaf nodes are purely the index part. All leaf nodes are on the same level and contain all the keys together with the storage address pointers of the corresponding data objects, and the leaf nodes are linked together in ascending key order. If the actual data objects are stored in insertion order rather than in key order, the leaf-level index must be a dense index. If the actual data is stored in key order, the leaf-level index can be a sparse index.

 

The B+ tree has two head pointers: one points to the root node of the tree, and the other to the leaf node with the smallest key.

So the B+ tree has two search methods:

One is to search sequentially along the linked list formed by the leaf nodes.

The other is to start from the root node, as in a B-tree. However, if a key in a non-leaf node equals the given value, the search does not stop but continues along the right pointer until the key is found in a leaf node. So whether or not the search succeeds, every level of the tree is traversed.
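
Both search methods can be sketched on a small hand-built tree. Everything named here (the Leaf and Internal classes, find, scan, and the sample keys and addresses) is an assumption for illustration; a real B+ tree would also handle insertion, deletion, and node splitting.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys, addrs):
        self.keys = keys    # sorted keys
        self.addrs = addrs  # addrs[i] is the data address for keys[i]
        self.next = None    # link to the next leaf in ascending key order


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys only, no data addresses
        self.children = children  # one more child than keys


def find(root, key):
    """Search from the root; the walk always ends in a leaf."""
    node = root
    while isinstance(node, Internal):
        # bisect_right sends a key equal to a separator down the right pointer.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.addrs[i]
    return None


def scan(first_leaf):
    """Sequential search: walk the linked list of leaves in key order."""
    leaf = first_leaf
    while leaf is not None:
        yield from zip(leaf.keys, leaf.addrs)
        leaf = leaf.next


# Two head pointers: the root, and the leaf holding the smallest key.
leaves = [Leaf([5, 10], ["a5", "a10"]), Leaf([15, 20], ["a15", "a20"]), Leaf([25, 30], ["a25", "a30"])]
leaves[0].next, leaves[1].next = leaves[1], leaves[2]
root = Internal([15, 25], leaves)
first_leaf = leaves[0]

print(find(root, 15))  # equal to a separator key, still resolved in a leaf
print(find(root, 12))  # absent key
print([key for key, _ in scan(first_leaf)])
```

Note that looking up 15 does not stop at the root even though 15 appears there as a separator; the address is only available in the leaf.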

In the B+ tree, the insertion and deletion of data objects are performed only on leaf nodes.

 

 

The differences between these two index data structures:
a. A given key value does not appear more than once in a B-tree, and it may sit in either a leaf node or a non-leaf node. In a B+ tree, every key must appear in a leaf node and may appear again in non-leaf nodes to maintain the balance of the tree.
b. Because the position of a key in a B-tree is not fixed and each key appears only once in the whole tree, storage space is saved but insertion and deletion become significantly more complex. The B+ tree is a better compromise by comparison.
c. The query efficiency of a B-tree depends on the position of the key in the tree: the worst case has the same time complexity as a B+ tree (key in a leaf node), and the best case is 1 (key in the root node). For a B+ tree, the complexity is fixed once the tree is built.
