Talk about database indexing (reproduced)

Today, let's talk about database indexes. There are plenty of articles on the Internet that start from how an index works and then cover index classification, physical organization and storage, how the matching records are located, how to build composite indexes, and so on. Another article like that would be boring, and those topics may not be what everyone (especially developers) really cares about. So today I plan to look at indexes from a different angle, focusing on the B+tree index. I hope you find it helpful.

For a SQL statement, what do developers care about most? I think it is not how the SQL is executed inside the database, but whether it returns its result as quickly as possible. As we mentioned earlier when we talked about connection pools, every stage in the life cycle of a SQL statement has plenty of room for optimization. But have we ever thought about the essence of SQL optimization, its ultimate goal? Optimization is essentially about reducing SQL's consumption of, and dependence on, resources. Just as the ultimate trick of database optimization is "do nothing in the database", the ultimate goal of SQL optimization is "consume no resources".

Resources have two characteristics. First, they are limited and everyone competes for them, so bottlenecks arise; a SQL bottleneck may simply be caused by a resource shortage. Second, resources have a price, and the prices differ. For example, memory latency is around 100ns, SSD around 100us, a SAS disk around 10ms, and the network is higher still. Accessing the CPU L1/L2/L3 caches costs less than accessing memory, and accessing memory costs less than accessing the hard disk, so a SQL bottleneck may also be caused by accessing a more expensive resource.

In a modern computer system there are only a few kinds of coarse-grained resources on a machine: CPU, memory, hard disk, and network. So let's look at which resources SQL consumes. Comparison, sorting, SQL parsing, and functions or logical operations need CPU. Access to cached data and storage of temporary data need memory. Reading cold data, and sorts or joins that spill to disk, need the hard disk. Sending the SQL request and returning the result set need the network. Our approach to optimizing SQL in the database therefore follows naturally: reduce SQL parsing, reduce complex operations, reduce the scale of data processing, reduce dependence on physical IO, and reduce network round trips between the server and the client. If this article explains clearly how indexes help with that, it has achieved its purpose.
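To get a feel for how far apart these prices are, here is a rough sketch that multiplies the latencies quoted above by a number of accesses. It assumes every access is serial and hits exactly one tier, which real systems never do, so treat the output as orders of magnitude only. The function name and tier labels are made up for this illustration.

```python
# Latencies quoted in the text, in nanoseconds (order-of-magnitude figures).
LATENCY_NS = {
    "memory": 100,            # 100 ns
    "ssd": 100_000,           # 100 us
    "sas_disk": 10_000_000,   # 10 ms
}


def serial_cost_ns(accesses: int, tier: str) -> int:
    """Total time if every access hits `tier` and nothing overlaps."""
    return accesses * LATENCY_NS[tier]


for tier in LATENCY_NS:
    seconds = serial_cost_ns(1_000_000, tier) / 1e9
    print(f"{tier:>8}: {seconds:g} s for 1,000,000 serial accesses")
```

Under this crude model, a million accesses cost a fraction of a second from memory but hours from a SAS disk, which is why both the number of rows touched and where they live matter.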

But let's not rush into that. First, let's turn everyone into an indexing master. Haha, you read that right: becoming an indexing master is that simple. It takes just three quick moves, and I don't know any more than that. Once you have practiced them, the question above answers itself. So let's start practicing with the following query.

SELECT CNO, FNAME
FROM CUST
WHERE LNAME = :LNAME AND CITY = :CITY
ORDER BY FNAME

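The first move is to build a one-star index: build the index from the equality or range conditions in the WHERE clause, that is, index(LNAME,CITY). Textbooks usually say that an index exists so you can locate the data you want as fast as possible, trading space for time. That is certainly true, but have you considered that quickly locating the data you want also means filtering out the data you don't need? The core of the one-star index is to use the index to filter out as much unnecessary data as possible and so reduce the scale of data processing, which is critical for an RDBMS.

For example, suppose the CUST table has 1,000,000 rows, CITY has a selectivity of 10%, and LNAME has a selectivity of 0.1%. Without an index, you have to read and process all one million rows. With the one-star index, the amount of data to process shrinks enormously: you only need to find the range of matching index leaf entries and read 0.1% * 10% * 1,000,000 = 100 rows. Even if we optimistically assume that all of this is logical IO rather than physical IO, the difference for a single execution is already obvious, let alone when the statement runs very frequently. Many bad SQL statements in our production systems have hurt the DB; when we see logical reads in the millions on a machine, we can basically conclude that an index is missing or badly designed. Once you understand this, you will not fall for the misconception that indexes may no longer be necessary now that hardware keeps getting better and latency keeps getting lower.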
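Here is a small sketch of the first move, using Python's built-in sqlite3 module as a stand-in engine. SQLite is an assumption made only so the example runs anywhere; MySQL and Oracle print their plans differently. The column types, the index name CUST_1STAR, and the bind values are also made up for illustration. The plan should switch from a full scan to an index search once index(LNAME,CITY) exists.

```python
import sqlite3

QUERY = """
SELECT CNO, FNAME
FROM CUST
WHERE LNAME = :LNAME AND CITY = :CITY
ORDER BY FNAME
"""
PARAMS = {"LNAME": "Smith", "CITY": "London"}  # hypothetical bind values


def query_plan(conn):
    """Join the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Assumed column types; CNO is deliberately not the primary key here.
conn.execute("CREATE TABLE CUST (CNO INTEGER, FNAME TEXT, LNAME TEXT, CITY TEXT)")

plan_without_index = query_plan(conn)  # expected: a scan of the whole table
conn.execute("CREATE INDEX CUST_1STAR ON CUST (LNAME, CITY)")
plan_with_index = query_plan(conn)     # expected: a search through CUST_1STAR

# Row estimate from the text: 0.1% of 10% of 1,000,000 rows.
total_rows = 1_000_000
expected_rows = total_rows // 1000 // 10

print(plan_without_index)
print(plan_with_index)
print(expected_rows)
```

The exact plan wording varies between SQLite versions, so look for the scan versus search distinction rather than a literal string match.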

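The second move is to build a two-star index. For the case above, we build index(LNAME,CITY,FNAME). The basic idea is to use the ordering of the index to eliminate operations that need a sort, such as ORDER BY or GROUP BY. Because LNAME and CITY are fixed by equality conditions, the matching index entries are already ordered by FNAME. Everyone knows that sorting consumes a lot of CPU, and heavy sorting pushes user CPU very high. Even if the CPU can take it, when the data volume is large and the rows to sort do not fit in the in-memory sort buffer, they have to be swapped in and out of external storage, and performance drops dramatically. This is where the advantage of using an index to avoid sorting shows clearly.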
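The second move can be checked the same way. This sketch again assumes SQLite (via Python's sqlite3) as a stand-in engine, with made-up column types, index name, and bind values. It compares the plan for the one-star and two-star index definitions: with the one-star index SQLite should report a temporary B-tree for the ORDER BY, and with the two-star index that step should disappear.

```python
import sqlite3

QUERY = """
SELECT CNO, FNAME
FROM CUST
WHERE LNAME = :LNAME AND CITY = :CITY
ORDER BY FNAME
"""
PARAMS = {"LNAME": "Smith", "CITY": "London"}  # hypothetical bind values


def plan_for(index_columns):
    """Create an empty CUST table with one index and return the query plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CUST (CNO INTEGER, FNAME TEXT, LNAME TEXT, CITY TEXT)")
    conn.execute(f"CREATE INDEX CUST_IDX ON CUST ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    conn.close()
    return " | ".join(row[3] for row in rows)


one_star_plan = plan_for("LNAME, CITY")         # still needs a sort step
two_star_plan = plan_for("LNAME, CITY, FNAME")  # index order satisfies ORDER BY

print(one_star_plan)
print(two_star_plan)
```

In SQLite the sort shows up as a "USE TEMP B-TREE FOR ORDER BY" line; other engines flag it differently (MySQL, for example, reports "Using filesort" in its EXPLAIN output).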

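You have probably worked out the third move before being taught it. That's right, the third move is to build a three-star index, that is, index(LNAME,CITY,FNAME,CNO). The difference from the two-star index is that the queried column CNO is added to the index. This is what is called index covering: everything the query needs can be read from the index leaf nodes, with no need to go back to the table. With memory as plentiful as it is today, quite often not just the root and branch nodes but the entire index fits in memory, which will very likely avoid, or at least reduce, physical IO.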
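And a sketch of the third move, under the same assumptions as before (SQLite via Python's sqlite3 as a stand-in engine; column types, index name, and bind values are made up). CNO is declared as an ordinary column on purpose, so that the two-star index does not already contain it. Once CNO is added to the index, SQLite should report a covering index.

```python
import sqlite3

QUERY = """
SELECT CNO, FNAME
FROM CUST
WHERE LNAME = :LNAME AND CITY = :CITY
ORDER BY FNAME
"""
PARAMS = {"LNAME": "Smith", "CITY": "London"}  # hypothetical bind values


def plan_for(index_columns):
    """Create an empty CUST table with one index and return the query plan."""
    conn = sqlite3.connect(":memory:")
    # CNO is a plain column here, not the primary key (an assumption).
    conn.execute("CREATE TABLE CUST (CNO INTEGER, FNAME TEXT, LNAME TEXT, CITY TEXT)")
    conn.execute(f"CREATE INDEX CUST_IDX ON CUST ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    conn.close()
    return " | ".join(row[3] for row in rows)


two_star_plan = plan_for("LNAME, CITY, FNAME")         # must visit the table for CNO
three_star_plan = plan_for("LNAME, CITY, FNAME, CNO")  # answered from the index alone

print(two_star_plan)
print(three_star_plan)
```

If CNO were the primary key, engines that store the primary key in every secondary index entry would already cover this query with the two-star index, so the result depends on the table definition.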

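Maybe you will say that these moves describe the ideal case. Real SQL varies endlessly: there are all kinds of odd conditions, lots of dynamic SQL, and multi-table joins, so you can't force these three amateur moves onto everything. That's right. In practice you do have to weigh many factors, and we can't make every index a three-star index. We can only build the best index for the actual situation, not the ideal one. But the fundamentals don't change: once you understand the principles behind the three moves, you can handle each case as it comes. For odd conditions, pick the ones with the best selectivity. For dynamic SQL, focus on the main statement patterns. For a two-table join in MySQL, where (at the time of writing) nested loop is the only join method, let the small table drive the large table, and build the best possible index on the join columns of each.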
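The "small table drives large table" advice can be seen in a toy model of an indexed nested-loop join. This is only a sketch: a sorted Python list probed with bisect stands in for the index on the join column, the table sizes are made up, and it counts index probes rather than measuring real IO.

```python
import bisect


def index_nested_loop_join(outer_keys, inner_sorted_keys):
    """Join two key lists; the outer list drives and the inner one is probed.

    inner_sorted_keys stands in for an index on the join column.
    Returns (matches, probes), where probes counts index lookups.
    """
    matches, probes = 0, 0
    for key in outer_keys:
        probes += 1
        lo = bisect.bisect_left(inner_sorted_keys, key)
        hi = bisect.bisect_right(inner_sorted_keys, key)
        matches += hi - lo  # number of inner rows with this join key
    return matches, probes


small = list(range(0, 1000, 10))  # 100 rows, hypothetical small table
large = list(range(10_000))       # 10,000 rows, hypothetical large table

small_drives = index_nested_loop_join(small, large)
large_drives = index_nested_loop_join(large, small)

print("small drives large:", small_drives)
print("large drives small:", large_drives)
```

Both orders find the same matches, but the number of index probes equals the size of the driving table, so driving from the small side does far fewer lookups.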

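As mentioned earlier, an index is a trade-off, the art of trading space for time, so going to either extreme is a mistake. The space consumed by too many indexes, and the burden they place on DML, cannot be ignored in some extreme scenarios.

To reduce the DML cost, the first step is to create only the indexes you need. Some NoSQL systems implement secondary indexes that are maintained asynchronously, outside the transaction, which sacrifices strong consistency for performance, but an RDBMS cannot do that yet. On InnoDB, we also recommend using an auto-increment column with no business meaning as the primary key. This improves sequential insert performance and avoids excessive index splits; in MySQL versions current at the time of writing, an index split locks the whole tree, which is very expensive.

There are also tricks for reducing the space cost. Taking InnoDB as the example again, an important reason we recommend a numeric primary key rather than a large column is that a large primary key greatly increases the space taken by secondary indexes, because every secondary index leaf entry contains the primary key it points to. On Oracle, we also rebuild indexes periodically to reclaim index space.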
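The space argument about wide primary keys is simple arithmetic, sketched below. All sizes are hypothetical (an 8-byte numeric key versus a 36-byte string key, a 20-byte secondary key, three secondary indexes), and the estimate ignores page headers, fill factor, and per-record overhead, so it shows the trend rather than real on-disk sizes.

```python
def secondary_index_bytes(rows, key_bytes, pk_bytes, index_count=1):
    """Rough leaf-level size: each entry stores the indexed key plus the primary key."""
    return rows * (key_bytes + pk_bytes) * index_count


rows = 1_000_000  # same table size as the CUST example
key_bytes = 20    # hypothetical average secondary key size

numeric_pk_total = secondary_index_bytes(rows, key_bytes, pk_bytes=8, index_count=3)
wide_pk_total = secondary_index_bytes(rows, key_bytes, pk_bytes=36, index_count=3)

print(f"numeric primary key: {numeric_pk_total / 1e6:.0f} MB")
print(f"wide primary key:    {wide_pk_total / 1e6:.0f} MB")
```

With these made-up numbers, the wide key doubles the space taken by the secondary indexes, and the gap grows with every extra index.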

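At the same time, the B+tree index, as a data structure designed for disks and SSDs, offers fairly balanced query and write performance: the time complexity of both reads and writes is O(log2 n). Writes are update-in-place, so each write first needs a random lookup to find the position to write to, and write performance is not that great. You can of course choose an implementation such as lsm_tree (including the Btree that OB implements itself), which gives up some read performance in exchange for better write performance. Whether a more perfect data structure will appear, one that supports both reads and writes more efficiently, is something to look forward to.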
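To make the logarithmic cost concrete, here is a small sketch that estimates how many levels a tree needs for a given number of rows. The fanout of 100 keys per node is a made-up figure (real fanout depends on page and key size), and the model assumes completely full nodes.

```python
def btree_height(rows: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


print(btree_height(1_000_000, 2))        # binary tree over the CUST table size
print(btree_height(1_000_000, 100))      # same rows, hypothetical fanout of 100
print(btree_height(1_000_000_000, 100))  # a thousand times more rows
```

A wide fanout only changes the base of the logarithm, but in practice that is the difference between about twenty node visits and three, which is why a lookup touches so few pages.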

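Having said all this, let me summarize. Setting aside optimization at the business level, I think the index is the most effective prescription, and other optimization methods are folk remedies by comparison. And because the B-tree is such a widely adopted data structure, this knowledge applies to many relational databases. I remember that when I moved from Oracle to MySQL, what I knew about using indexes carried over smoothly. So I hope everyone gets to know these indexing basics; they will help you write better, more reasonable SQL in your day-to-day work.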
