The database index module

The index module is one of the most important modules in a database, and also one of the topics most often raised in interviews. Common questions about it include:

Why use an index
What kind of information can be indexed
Index data structures
The difference between dense and sparse indexes

Why use an index:

The smallest unit of storage in a database is typically a block or page, and each block holds multiple rows of data. A query that cannot use an index usually requires a full table scan: the database loads the blocks one by one and walks through the rows in each until it finds the data it is looking for. With a large amount of data this is slow, so full table scans should generally be avoided. Database designers anticipated this and introduced a more efficient lookup mechanism, the index. The idea comes from dictionaries: a dictionary records key information such as radicals and Pinyin, and that key information lets us quickly find the page containing a given character. An index works the same way: from the key information it records, the database can quickly locate where the target data is stored and so avoid a full table scan. In short, the purpose of an index is to make queries more efficient.
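
The contrast between a full table scan and an index lookup can be sketched in a few lines of Python. This is only an illustration: the block size, the `rows` data, and the use of a plain dict as the index are assumptions made for the example, not how any real database lays out its storage.

```python
# Hypothetical layout: a "table" stored as fixed-size blocks of (id, name) rows.
BLOCK_SIZE = 4
rows = [(i, f"user{i}") for i in range(1, 21)]
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]


def full_scan(target_id):
    """Load blocks one by one until the row is found; return (row, blocks_read)."""
    for blocks_read, block in enumerate(blocks, start=1):
        for row in block:
            if row[0] == target_id:
                return row, blocks_read
    return None, len(blocks)


# The index records, for every key, which block holds the row.
# A dict stands in for the real index structure here.
index = {row[0]: block_no for block_no, block in enumerate(blocks) for row in block}


def index_lookup(target_id):
    """Use the index to jump straight to the right block; return (row, blocks_read)."""
    block_no = index.get(target_id)
    if block_no is None:
        return None, 0
    for row in blocks[block_no]:
        if row[0] == target_id:
            return row, 1
    return None, 1
```

With this layout, `full_scan(18)` has to read every block up to the last one, while `index_lookup(18)` reads exactly one block.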

What kind of information can be indexed:

The primary key, unique columns, and columns that frequently appear in query conditions. If several columns are frequently queried together, a composite index can be built over them.

Index data structures:

Usually a B+ tree or a hash; a handful of databases also support BitMap.

Binary search tree

This section briefly covers the data structures behind indexes. The most commonly used one is the B+ tree, but before introducing it we need to understand the binary search tree and the B-tree, and why neither of them is used as the index data structure.

The purpose of an index is to help us quickly locate where the target data resides. If we were to design an index ourselves, the first structure that comes to mind for fast lookup would probably be a tree such as the binary search tree. So we start with the binary search tree and work step by step toward why, among the many tree structures, the B+ tree is the one used for indexes.

A binary search tree is a common tree data structure. Each node has at most two children, which are the roots of its left and right subtrees. All keys in the left subtree are smaller than the node's key, and all keys in the right subtree are greater. The node at the top is called the root, and searching the tree is a form of binary search. A balanced binary tree is one in which the heights of the left and right subtrees of every node differ by at most 1.

Because each node of a binary search tree holds a single key and has at most two children, the tree grows deep as the data grows, and every level visited during a search may cost one disk read that brings in only one node. The structure is not optimized for disk IO, so it cannot deliver good query speed and is not suitable as an index data structure.
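
To make the depth problem concrete, here is a minimal binary search tree sketch in Python that counts how many nodes a search visits. The class and function names are illustrative assumptions, and treating each visited node as one disk read is a simplification.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the (unbalanced) binary search tree and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def search(root, key):
    """Return (found, nodes_visited); each visited node stands for one read."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return True, visited
        node = node.left if key < node.key else node.right
    return False, visited


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


# Inserting keys level by level yields a balanced tree of 15 keys.
balanced = None
for k in [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]:
    balanced = insert(balanced, k)

# Inserting the same keys in sorted order degenerates into a linked list.
chain = None
for k in range(1, 16):
    chain = insert(chain, k)
```

Even in the balanced case the height grows with the logarithm base 2 of the number of keys, and in the degenerate case a search for the largest key visits every node.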

B-tree

Since a binary tree offers only two search paths per node, its depth keeps increasing as the amount of data grows. What we need is a multiway tree in which each node can have many children, and the B-tree meets that need. A B-tree is a balanced multiway search tree.

The maximum number of children a node may have is called the order m. A B-tree of order m (a balanced tree of order m) is a balanced m-way search tree. It is either an empty tree or a tree that satisfies the following properties:

The root node has at least two children (unless the root is itself a leaf)
Each node has at most m children (m >= 2)
Every node other than the root and the leaves has at least ceil(m/2) children
All leaf nodes are on the same level
Each non-leaf node contains n keys, where:
- Ki (i = 1...n) are the keys, sorted in ascending order so that K(i-1) < Ki
- The number of keys n satisfies ceil(m/2) - 1 <= n <= m - 1 (the lower bound does not apply to the root). That is, the maximum number of keys in a node is one less than its maximum number of subtrees, and every non-leaf node has exactly one fewer key than child pointers
- The child pointers of a non-leaf node are P[1], P[2], ..., P[M]. P[1] points to the subtree whose keys are smaller than K[1] ①, P[M] points to the subtree whose keys are greater than K[M - 1] ②, and every other P[i] points to the subtree whose keys lie in the interval (K[i - 1], K[i]) ③

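①: The keys in a node's leftmost child are all smaller than the node's leftmost key
②: The keys in a node's rightmost child are all greater than every key in the node
③: The keys in every child other than the leftmost and rightmost lie between the two keys nearest to that child's pointer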
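
The pointer rules above translate directly into a search routine. The following Python sketch searches a small, hand-built B-tree of order 4; the `BTreeNode` class and the example keys are assumptions made for illustration, and insertion and node splitting are deliberately left out.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # K[1..n], in ascending order
        self.children = children or []  # P[1..n+1]; empty for a leaf


def btree_search(node, key):
    """Return (found, nodes_visited)."""
    visited = 0
    while node is not None:
        visited += 1
        # Index of the first key >= the search key within this node.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:
            return False, visited
        # children[i] holds the keys between keys[i - 1] and keys[i].
        node = node.children[i]
    return False, visited


# A hand-built B-tree of order 4: two keys in the root, three leaves on one level.
root = BTreeNode(
    [10, 20],
    [BTreeNode([3, 7]), BTreeNode([12, 15, 18]), BTreeNode([25, 30])],
)
```

Because each node holds several keys, nine keys fit in two levels here, so any search visits at most two nodes.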

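B+ tree

Although the B-tree already meets the bar for use as an index data structure, there is a better alternative: the B+ tree. As the name suggests, the B+ tree is a variant of the B-tree. Its definition is basically the same as the B-tree's, except that:

A non-leaf node has the same number of subtree pointers as keys
The subtree pointer P[i] of a non-leaf node points to the subtree whose keys fall in [K[i], K[i + 1])
Non-leaf nodes are used only for indexing; all data is stored in the leaf nodes
Every leaf node has a link pointer to the next leaf node, and the chain formed by the leaves is sorted by key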

(Figure: B+ tree structure diagram)

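Compared with the B-tree and other tree structures, the B+ tree is better suited to storing indexes, for the following reasons:

B+ tree disk reads and writes are cheaper. Non-leaf nodes of a B+ tree store only index keys, so they are smaller than B-tree nodes. If all the keys of one internal node are stored in the same disk block, that block can hold more keys, more keys are read into memory at once, and the number of IO operations goes down accordingly
B+ tree query performance is more stable. Because the actual data lives in the leaf nodes, every query has to travel from the root to a leaf, so all lookups follow paths of the same length and cost almost the same
B+ trees are better for scanning the database. The B-tree improves disk IO but does not solve the problem of inefficient traversal, whereas a B+ tree can scan all keys just by walking the leaf nodes. For the range queries that databases use so frequently, the B+ tree is therefore more efficient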
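
The range-scan advantage can be sketched with a tiny two-level B+ tree in Python. The `Leaf` and `Internal` classes, the keys, and the `row…` values are hypothetical; the internal node follows the rule that P[i] covers the keys in [K[i], K[i + 1]), and insertion is not implemented.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys      # sorted keys
        self.values = values  # row data lives only in the leaves
        self.next = None      # link pointer to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # keys[i] is the smallest key under children[i]
        self.children = children  # same number of pointers as keys


def find_leaf(node, key):
    """Walk from the root down to the leaf whose key range covers key."""
    while isinstance(node, Internal):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_scan(root, low, high):
    """Return the values of all keys with low <= key <= high."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return result
            if k >= low:
                result.append(v)
        leaf = leaf.next  # follow the leaf chain instead of going back up the tree
    return result


leaves = [Leaf(ks, [f"row{k}" for k in ks]) for ks in ([1, 3, 5], [7, 9, 11], [13, 15, 17])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([1, 7, 13], leaves)
```

A call such as `range_scan(root, 4, 13)` descends the tree once to find the starting leaf and then only follows `next` pointers.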

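Hash and BitMap

Besides the B+ tree index structure introduced in the previous section, the hash index is another commonly used structure. A hash index is somewhat simpler: it computes a hash of the index key and uses the result to locate where the data is stored. In certain scenarios, such as equality lookups, a hash index is therefore more efficient than a B+ tree index.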

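If a hash index is in theory more efficient than a B+ tree index, why has it not become the mainstream index structure? Because hash indexes have the following drawbacks:

Because of how hashing works, they only serve "=" and "IN" conditions and cannot be used for range queries
They cannot be used to avoid sorting the data
They cannot serve queries on part of the index key. With a composite index, the hash is computed over all the indexed columns combined rather than over each column separately, so a query that does not supply every column of the composite index cannot use the hash index
They cannot always avoid scanning table rows. With a large amount of data, many rows can end up sharing the same hash value, and all rows with that hash value then have to be fetched and compared to find the one wanted, so in general the more data there is, the less efficient a hash index becomes
When many hash values collide, performance is not necessarily better than a B+ tree index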
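
Some of these limitations are easy to see in a small Python sketch. The rows, the function names, and the use of Python's built-in `hash` over a `(name, age)` tuple are assumptions for illustration only; a real hash index uses its own hash function and bucket layout.

```python
from collections import defaultdict

rows = [
    {"id": 1, "name": "Ann", "age": 20},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Ann", "age": 30},
]

# Hash index over the composite key (name, age): one hash per combined key.
hash_index = defaultdict(list)
for row in rows:
    hash_index[hash((row["name"], row["age"]))].append(row)


def lookup_eq(name, age):
    """Equality on the full key: one hash computation, then compare the bucket."""
    bucket = hash_index.get(hash((name, age)), [])
    # Different keys can share a hash value, so the real values are re-checked.
    return [r["id"] for r in bucket if r["name"] == name and r["age"] == age]


def lookup_by_name_only(name):
    """hash(name) says nothing about hash((name, age)), so every row is checked."""
    return [r["id"] for r in rows if r["name"] == name]


def lookup_age_range(low, high):
    """Hashing destroys key order, so a range condition also checks every row."""
    return [r["id"] for r in rows if low <= r["age"] <= high]
```

Only `lookup_eq` touches the hash index; the other two functions fall back to scanning all rows, which mirrors the partial-key and range-query drawbacks listed above.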

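BitMap:

Besides B+ tree and hash indexes there is one more index structure, the BitMap, or bitmap index. Only a few databases support it, so it is mentioned only briefly here. When a column has only a handful of distinct values, for example a column storing gender, a BitMap index is usually the best choice for that column.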

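A major drawback of BitMap, however, is that its lock granularity is very coarse: when a row is inserted or updated, other rows that share the same bitmap are locked as well.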
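
The basic idea of a bitmap index, one bit string per distinct column value with one bit per row, can be sketched as follows. The `genders` column and the use of Python integers as bit strings are assumptions for the example, not the on-disk format of any particular database.

```python
# Hypothetical low-cardinality column: one entry per row, in row order.
genders = ["M", "F", "F", "M", "F"]

# One bitmap per distinct value; bit i is set when row i has that value.
bitmaps = {}
for row_no, value in enumerate(genders):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_no)


def matching_rows(bitmap):
    """Turn a bitmap back into the list of row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
```

A column with two distinct values needs only two bitmaps no matter how many rows the table has, which is why the structure suits columns with few distinct values.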

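The difference between dense and sparse indexes

The two differ as follows:

In a dense index file, every search key value has a corresponding index entry
In a sparse index file, index entries are created only for some of the search key values
In discussions of MySQL storage engines the two terms are often used loosely for what the MySQL documentation calls clustered and secondary indexes: the leaf nodes of the former hold the complete row data, while the leaf nodes of the latter hold a pointer to the data. The rest of this section uses the terms in that sense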

(Figure: the difference between dense and sparse indexes)

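Dense index: a leaf node stores not just the key value but also the other columns of the same row. Because the dense index determines the physical order of the table, and a table can have only one physical order, a table can have only one dense index

Sparse index: a leaf node stores only the key and either the address of the row or its primary key, so the row has to be located in a further step through that address or primary key.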

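Here is how this applies to MySQL's mainstream storage engines:

MyISAM: primary key indexes, unique indexes, and ordinary indexes are all sparse indexes, so MyISAM has only sparse indexes and no dense index. In MyISAM, indexes and data are stored separately
InnoDB: a table has exactly one dense index, and all other indexes are sparse. In InnoDB, the index and the data are stored in the same file
- If a primary key is defined, it serves as the dense index
- If no primary key is defined, the table's first unique NOT NULL index serves as the dense index
- If neither condition is met, InnoDB internally generates a hidden primary key to serve as the dense index; this hidden key is a 6-byte auto-incrementing column
- A non-primary-key index stores its own key values together with the corresponding primary key value, so a query through it involves two lookups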
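
The two lookup paths can be modeled roughly in Python. Everything here is a stand-in chosen for illustration: dicts replace the B+ trees, a list replaces the MyISAM data file, names are assumed to be unique, and the function names are hypothetical.

```python
ROWS = [
    {"id": 1, "name": "Ann", "age": 20},
    {"id": 2, "name": "Bob", "age": 25},
]

# InnoDB style: the primary key index holds the full rows,
# and the index on name holds only the primary key value.
innodb_clustered = {row["id"]: row for row in ROWS}
innodb_by_name = {row["name"]: row["id"] for row in ROWS}


def innodb_find_by_name(name):
    """Return (row, index_lookups)."""
    pk = innodb_by_name.get(name)   # lookup 1: index on name gives the primary key
    if pk is None:
        return None, 1
    return innodb_clustered[pk], 2  # lookup 2: primary key index gives the row


# MyISAM style: rows live in a separate data file,
# and every index maps its key to the row's address in that file.
myisam_data = list(ROWS)
myisam_by_id = {row["id"]: addr for addr, row in enumerate(myisam_data)}
myisam_by_name = {row["name"]: addr for addr, row in enumerate(myisam_data)}


def myisam_find_by_name(name):
    """One index lookup gives the address, then the row is read from the data file."""
    addr = myisam_by_name.get(name)
    return None if addr is None else myisam_data[addr]
```

In the InnoDB-style model a query by name goes through two indexes, while a query by primary key would need only `innodb_clustered`.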

(Figure: comparison of the lookup flow in the InnoDB and MyISAM engines)

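Additional index question: why the leftmost matching principle exists for composite indexes

Suppose we build a composite index (A, B) on two columns A and B, so that A is on the left and B is on the right. A query with where A = '' and B = '' uses the (A, B) index, and so does where A = '', but where B = '' does not. This is the so-called leftmost matching principle

The leftmost matching principle is commonly described as follows:

The leftmost prefix matching principle is a very important one. MySQL keeps matching columns to the right until it meets a range condition (>, <, between, like) and then stops. For example, given a = 1 and b = 2 and c > 3 and d = 4 with an index in the order (a,b,c,d), d cannot use the index; with an index on (a,b,d,c) all four columns can use it, and a, b, d may appear in any order.

= and in conditions can appear in any order. For example, a = 1 and b = 2 and c = 3 can use an (a,b,c) index however the conditions are ordered, because MySQL's query optimizer rewrites them into a form the index can recognize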

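Let's run an experiment to verify the leftmost matching principle. The table is created with the following SQL and has one composite index: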

CREATE TABLE `student` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL,
  `age` int(11) NOT NULL,
  `sex` varchar(20) NOT NULL,
  `address` varchar(100) NOT NULL,
  `cid` int(11) NOT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_name_age` (`name`,`age`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=utf8;

When the where clause contains the name column, the query uses the index.

When the where clause does not contain the name column, the query does not use the index.

When the where clause contains the name column, the index is used even if the conditions are written out of order, because the MySQL optimizer automatically reorders them to satisfy the index.
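
A similar experiment can be approximated without a MySQL server by using Python's built-in `sqlite3` module. Note the assumptions: SQLite is only a stand-in here, its planner is not MySQL's, the column types are adapted to SQLite, and the exact wording of the plan text varies between SQLite versions, so the sketch only checks whether the index name appears in the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        age INTEGER NOT NULL,
        sex TEXT NOT NULL,
        address TEXT NOT NULL,
        cid INTEGER NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_name_age ON student (name, age)")


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


def uses_index(sql):
    return "idx_name_age" in plan(sql)
```

Calling `uses_index` with a condition on `name`, on `age` alone, and on `age` followed by `name` reproduces the three cases above in SQLite; `plan` shows the full plan text for closer inspection.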

Now let's answer the question of what causes the leftmost matching principle:

When MySQL creates a composite index, it sorts the entries by the leftmost column first and then, within equal values of that column, by the next column, much like order by column1, column2. The leftmost column is therefore fully ordered, while the second column is ordered only within each value of the first and is unordered overall. A query that filters only on a column other than the leftmost one therefore cannot use the index, and that is the origin of the leftmost matching principle
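
The ordering argument can be checked with a sorted list of tuples in Python. The `(name, age)` entries are made-up data, and a sorted list searched with `bisect` is only a stand-in for the sorted leaf level of a real composite index.

```python
from bisect import bisect_left, bisect_right

# Entries of a composite index on (name, age): sorted by name, then by age.
entries = sorted([("Bob", 25), ("Ann", 30), ("Cid", 20), ("Ann", 20), ("Bob", 18)])

names = [name for name, _ in entries]  # leftmost column: globally sorted
ages = [age for _, age in entries]     # second column: sorted only within a name


def find_by_name(name):
    """The leftmost column is sorted, so binary search finds the matching run."""
    return entries[bisect_left(names, name):bisect_right(names, name)]


def find_by_age(age):
    """Ages are not globally sorted, so every entry has to be checked."""
    return [entry for entry in entries if entry[1] == age]
```

Here `names` comes out in sorted order while `ages` does not, which is why `find_by_name` can use binary search and `find_by_age` has to scan every entry.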

Additional index question: is it true that the more indexes, the better?

The answer is no; too much of a good thing backfires:

Tables with a small amount of data do not need indexes, since an index adds maintenance overhead
Every data change has to be reflected in the indexes, so more indexes mean higher maintenance costs
More indexes also mean more storage space