How database indexes are implemented

A database index is a sorted data structure in a database management system that helps to query and update the data in database tables quickly. Indexes are usually implemented with the B-tree or its variant, the B+ tree.

In addition to the data itself, a database system maintains data structures that suit specific lookup algorithms. These structures reference (point to) the data in some way, so that advanced lookup algorithms can run on them. Such a data structure is an index.

An index on a table comes at a cost: it increases the storage space the database needs, and it makes inserting and modifying data slower (because the index has to change accordingly).

 

The figure above shows one possible way of indexing. On the left is the data table, with two columns and seven records; the leftmost column is the physical address of each data record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up searches on Col2, a binary search tree like the one on the right can be maintained. Each node holds an index key and a pointer to the physical address of the corresponding data record, so a binary search can locate the data in O(log2 n) time.
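The idea can be sketched in a few lines of Python. This is only an illustration: the names IndexNode, insert and lookup, as well as the key values and addresses, are made up and are not taken from the figure. Note that the O(log2 n) bound holds only while the tree stays balanced, which a plain binary search tree like this one does not guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexNode:
    key: int                           # value of the indexed column (Col2 in the text)
    address: int                       # physical address of the data record (made up)
    left: Optional["IndexNode"] = None
    right: Optional["IndexNode"] = None


def insert(root, key, address):
    """Insert a (key, address) pair and return the root of the tree."""
    if root is None:
        return IndexNode(key, address)
    if key < root.key:
        root.left = insert(root.left, key, address)
    else:
        root.right = insert(root.right, key, address)
    return root


def lookup(root, key):
    """Return the record address stored for key, or None if the key is absent."""
    node = root
    while node is not None:
        if key == node.key:
            return node.address
        node = node.left if key < node.key else node.right
    return None


# Seven hypothetical (Col2 value, address) pairs, inserted in an order
# that happens to produce a balanced tree.
rows = [(34, 0x07), (5, 0x56), (77, 0x6A), (3, 0xF3),
        (22, 0x90), (51, 0x77), (89, 0xD1)]
index_root = None
for col2, addr in rows:
    index_root = insert(index_root, col2, addr)
```

A call such as lookup(index_root, 22) walks at most three nodes here and returns the address to read, instead of scanning all seven records.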

 

Creating indexes can greatly improve system performance.

First, by creating a unique index, the uniqueness of each row of data in the database table can be guaranteed.

Second, it can greatly speed up data retrieval, which is the main reason for creating indexes.

Third, it can speed up joins between tables, which is especially useful for enforcing referential integrity.

Fourth, when grouping and sorting clauses are used in a query, the time spent on grouping and sorting can also be significantly reduced.

Fifth, with indexes available, the query optimizer can choose better execution plans during query processing, which improves system performance.

 

Some people may ask: if indexes have so many advantages, why not create an index on every column in the table? Because adding indexes also has real downsides.

First, creating and maintaining indexes takes time, and that time grows with the amount of data.

Second, indexes occupy physical space. In addition to the space taken by the table itself, each index takes a certain amount of physical space, and a clustered index requires even more.

Third, when data in the table is added, deleted, or modified, the indexes must be maintained dynamically as well, which slows down data maintenance.

 

Indexes are built on particular columns of a table, so when creating one you should consider which columns are good candidates and which are not. In general, indexes should be created on the following columns: columns that are searched frequently, to speed up the search; columns that serve as the primary key, to enforce uniqueness and to organize the arrangement of the data in the table; columns that are often used in joins, which are mostly foreign keys, to speed up the join; columns that are often searched by range, because the index is sorted and the specified range is therefore contiguous; columns that often need to be sorted, because the query can use the order of the index to reduce sorting time; and columns that are often used in WHERE clauses, to speed up the evaluation of conditions.

 

Also, some columns should not be indexed. In general, these columns that should not be indexed have the following characteristics:

First, indexes should not be created on columns that are rarely used or referenced in queries. Because these columns are rarely used, an index on them does little to improve query speed. On the contrary, the extra index slows down maintenance and increases space requirements.

Second, indexes should not be added to columns with only a few distinct values, such as the gender column of a personnel table. Because such a column has so few values, the rows in a query result make up a large proportion of the rows in the table; that is, a large share of the table has to be read anyway. Adding an index does not significantly speed up retrieval.
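A rough way to see this is to compute the fraction of rows that an equality condition returns. The sketch below uses made-up data, and the names match_fraction, gender and employee_id are hypothetical.

```python
from collections import Counter


def match_fraction(column_values, wanted):
    """Fraction of rows that an equality condition on `wanted` would return."""
    counts = Counter(column_values)
    return counts[wanted] / len(column_values)


gender = ["F", "M"] * 500          # made-up column with only two distinct values
employee_id = list(range(1000))    # made-up column with 1000 distinct values
```

With this data, a condition on gender returns half of the table, so the index saves very little work, while a condition on employee_id returns a single row out of a thousand.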

Third, indexes should not be added to columns defined with the text, image, or bit data types, because the data in such columns is either very large or takes very few distinct values.

Fourth, indexes should not be created when modification performance matters far more than retrieval performance. The two are in tension: adding indexes improves retrieval performance but reduces modification performance, and removing indexes does the opposite.

 

Depending on the capabilities of the database, three types of indexes can be created in the database designer: unique indexes, primary key indexes, and clustered indexes.

 

Unique index

 

A unique index is one that does not allow any two rows to have the same index value.

 

Most databases do not allow a newly created unique index to be saved with a table if the existing data contains duplicate key values. The database may also reject new data that would create duplicate key values in the table. For example, if a unique index is created on the employee's last name (lname) in the employee table, no two employees can have the same last name.
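Both behaviours can be illustrated with a toy in-memory structure. This is a minimal sketch, not how any particular database implements it; the class UniqueIndex and its methods are hypothetical names.

```python
class UniqueIndex:
    """Toy unique index that maps each key to exactly one row id."""

    def __init__(self):
        self._entries = {}

    def add(self, key, row_id):
        # Reject new data that would create a duplicate key value.
        if key in self._entries:
            raise ValueError(f"duplicate key {key!r} violates unique index")
        self._entries[key] = row_id

    @classmethod
    def build(cls, rows):
        """Build the index from existing (key, row_id) pairs.

        Fails if the existing data already contains duplicate keys.
        """
        index = cls()
        for key, row_id in rows:
            index.add(key, row_id)
        return index


# Hypothetical employee rows as (lname, row id) pairs.
lname_index = UniqueIndex.build([("Smith", 1), ("Jones", 2)])
```

Calling lname_index.add("Smith", 3) raises ValueError, and so does UniqueIndex.build on data that already holds two rows with the same lname.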

 

Primary key index

 

Database tables often have a column or combination of columns whose values uniquely identify each row in the table. This column (or combination of columns) is called the primary key of the table.

 

Defining a primary key for a table in a database diagram automatically creates a primary key index, which is a specific type of unique index. This index requires that every value in the primary key be unique. It also allows fast access to data when primary key indexes are used in queries.

 

Clustered index

 

In a clustered index, the physical order of the rows in the table is the same as the logical (index) order of the key values. A table can contain only one clustered index.

 

If an index is not a clustered index, the physical order of the rows in the table does not match the logical order of the key values. Clustered indexes generally provide faster data access than nonclustered indexes.

 

 

 

The principle of locality and disk read-ahead

 

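Because of the characteristics of the storage medium, disk access is inherently much slower than main-memory access. Once the cost of mechanical movement is added, disk access is often hundreds of times slower than main memory, so for efficiency, disk I/O should be kept to a minimum. To that end, a disk usually does not read strictly on demand but reads ahead every time: even if only one byte is needed, the disk starts at that position and sequentially reads a certain length of data into memory. The theoretical basis for this is the well-known principle of locality in computer science: when a piece of data is used, the data near it is usually used soon afterwards. The data a program needs while running tends to be clustered together.

Because sequential disk reads are very efficient (no seek time is needed and only a little rotational time), read-ahead can improve I/O efficiency for programs that exhibit locality.

The read-ahead length is generally an integer multiple of a page. A page is the logical block in which a computer manages storage: hardware and operating systems usually divide main memory and disk storage into contiguous blocks of equal size, each called a page (in many operating systems the page size is typically 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read request to the disk, the disk locates the start of the data and reads one or more consecutive pages into memory, after which the fault handler returns and the program continues running.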
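The page arithmetic behind this can be sketched as follows. The function pages_to_read and its readahead_pages parameter are hypothetical, and the 4 KB page size is the typical value mentioned above rather than a fixed rule.

```python
PAGE_SIZE = 4096  # 4 KB, assumed page size


def pages_to_read(offset, length, readahead_pages=0):
    """Page numbers fetched for a read of `length` bytes starting at `offset`.

    The request is rounded out to whole pages, and `readahead_pages`
    extra pages after the last one are fetched as read-ahead.
    """
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return range(first, last + 1 + readahead_pages)
```

For example, a one-byte read at offset 5000 still fetches all of page 1, and with readahead_pages=2 it fetches pages 1, 2 and 3.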

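Performance analysis of B-/+Tree indexes

At this point we can finally analyze the performance of B-/+Tree indexes.

As mentioned earlier, the number of disk I/O operations is generally used to judge the quality of an index structure. Starting with the B-Tree: by the definition of a B-Tree, a single lookup visits at most h nodes. Database designers cleverly exploit disk read-ahead by making the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, a practical B-Tree implementation also uses the following technique:

Each time a new node is created, the space of one full page is requested directly. This guarantees that a node is also physically stored within a single page, and because storage allocation is page-aligned, one node needs only one I/O.

A lookup in a B-Tree needs at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h) = O(log_d N). In practice the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).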
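To get a feel for these numbers, the following sketch estimates how many levels are needed to address N keys with out-degree d. It is a rough estimate based only on h being about log_d N and ignores the exact occupancy rules of a B-Tree; the function name levels_needed is hypothetical.

```python
def levels_needed(num_keys, fanout):
    """Smallest h such that fanout**h >= num_keys.

    Uses integer arithmetic to avoid floating-point rounding in log().
    """
    h, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        h += 1
    return h
```

With one million keys, levels_needed(1_000_000, 100) gives 3, whereas a binary tree, levels_needed(1_000_000, 2), needs 20 levels even when perfectly balanced.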

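With a structure such as a red-black tree, h is clearly much greater. Because nodes that are logically close (parent and child) may be physically far apart, locality cannot be exploited, so although the I/O asymptotic complexity of a red-black tree is also O(h), its efficiency is clearly far worse than that of a B-Tree.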

 

In summary, using a B-Tree as the index structure is very efficient.

 

 

It is worth taking the time to study the B-tree and B+ tree data structures

=============================================================================================================

 

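1) B-tree

Each node in a B-tree contains keys together with pointers to the storage addresses of the corresponding data objects, so a successful search does not have to reach a leaf node.

A successful search consists of searching within nodes and searching along a path; the time it takes depends on the level where the key resides and on the number of keys in each node.

 

To find a given key in a B-tree: first fetch the root node and look for the key among the keys K1, ..., Kj it contains (using sequential or binary search). If a key equal to the given value is found, the search succeeds. Otherwise the key being sought must lie between some Ki and Ki+1, so follow the pointer Pi to the index node block at the next level and continue searching there, until the key is found or the pointer Pi is null, in which case the search fails.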
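The search procedure can be sketched as below. This is a simplified in-memory illustration that covers lookup only; the class BTreeNode, the function btree_search and the "addr..." strings standing in for record addresses are all hypothetical.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, addresses, children=None):
        self.keys = keys                  # sorted keys K1..Kj
        self.addresses = addresses        # record address for each key
        self.children = children or []    # j+1 child pointers; empty for a leaf


def btree_search(node, key):
    """Return the record address for key, or None if the key is not in the tree."""
    while node is not None:
        i = bisect_left(node.keys, key)           # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return node.addresses[i]              # found; may be a non-leaf node
        if not node.children:                     # no pointer to follow: search fails
            return None
        node = node.children[i]                   # descend between Ki and Ki+1
    return None


# A tiny two-level tree with made-up keys.
leaves = [
    BTreeNode([5, 10], ["addr5", "addr10"]),
    BTreeNode([25, 30], ["addr25", "addr30"]),
    BTreeNode([45, 50], ["addr45", "addr50"]),
]
btree_root = BTreeNode([20, 40], ["addr20", "addr40"], leaves)
```

Searching for 40 stops at the root without touching a leaf, searching for 25 descends one level, and searching for 33 fails at a leaf.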

 

 

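2) B+ tree

 

The keys stored in the non-leaf nodes of a B+ tree carry no pointers to the addresses of data objects; non-leaf nodes are purely the index part. All leaf nodes are on the same level, contain all the keys together with pointers to the storage addresses of the corresponding data objects, and are linked in ascending key order. If the actual data objects are stored in insertion order rather than key order, the leaf-level index must be a dense index; if the actual data is stored in key order, the leaf-level index can be a sparse index.

 

A B+ tree has two head pointers: one to the root of the tree and one to the leaf node with the smallest key.

So a B+ tree supports two search methods: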

One is to search sequentially along the linked list formed by the leaf nodes themselves.

The other is to start from the root node, as in a B-tree, except that when a key in a non-leaf node equals the given value the search does not stop but continues along the pointer to its right until it reaches the key in a leaf node. So whether or not the search succeeds, every level of the tree is traversed.
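Both search methods can be sketched on a small in-memory B+ tree. The classes Leaf and Internal, the helper functions and the "addr..." strings are hypothetical, and the sketch assumes that a separator key equal to the search key sends the search to the right child.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, addresses):
        self.keys = keys
        self.addresses = addresses
        self.next = None              # link to the next leaf in key order


class Internal:
    def __init__(self, keys, children):
        self.keys = keys              # separator keys only, no record addresses
        self.children = children


def find_leaf(node, key):
    # An equal separator key does not end the search: take the pointer to its right.
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def point_lookup(root, key):
    """Search from the root; always ends at a leaf, whether or not the key exists."""
    leaf = find_leaf(root, key)
    if key in leaf.keys:
        return leaf.addresses[leaf.keys.index(key)]
    return None


def scan_all(first_leaf):
    """Sequential search along the leaf chain, starting at the smallest key."""
    leaf = first_leaf
    while leaf is not None:
        yield from zip(leaf.keys, leaf.addresses)
        leaf = leaf.next


def range_scan(root, low, high):
    """Yield (key, address) for low <= key <= high by walking the leaf chain."""
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, a in zip(leaf.keys, leaf.addresses):
            if k > high:
                return
            if k >= low:
                yield k, a
        leaf = leaf.next


# A tiny two-level tree with made-up keys; 20 and 40 appear both as
# separators in the root and as real entries in the leaves.
leaves = [Leaf([5, 10], ["addr5", "addr10"]),
          Leaf([20, 30], ["addr20", "addr30"]),
          Leaf([40, 50], ["addr40", "addr50"])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([20, 40], leaves)     # first head pointer: the root
first_leaf = leaves[0]                # second head pointer: the smallest-key leaf
```

The leaf chain is also what makes range queries cheap: range_scan descends the tree once and then only follows next links.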

In the B+ tree, the insertion and deletion of data objects are performed only on leaf nodes.

 

 

The differences between the two data structures when used for indexes:
a. In a B-tree, a key never appears more than once, and it may be in a leaf node or a non-leaf node. In a B+ tree, every key must appear in a leaf node and may also be repeated in non-leaf nodes to maintain the balance of the tree.
b. Because the position of a key in a B-tree is not fixed and the key appears only once in the whole tree, storage space is saved but insertion and deletion become significantly more complex. The B+ tree is a better compromise by comparison.
c. The query efficiency of a B-tree depends on where the key sits in the tree. The worst case costs the same as in a B+ tree (key in a leaf node), and the best case costs a single node visit (key in the root node). In a B+ tree, the cost is the same for every key once the tree is built.


Origin http://43.154.161.224:23101/article/api/json?id=326551187&siteId=291194637