Getting Started with SQL Server Query Optimization

This article is also published on Zhihu and on my personal website. Feel free to follow me there.

I call this an "introduction" because I am not a SQL Server expert. I have, however, accumulated some experience in performance optimization recently. It is not deep, but it is enough to handle the performance problems in the SQL statements we deal with day to day, so I am sharing it for your reference. Selfishly, it also means that if I do not use this skill for a long time, I can quickly pick it back up by rereading my own article.

I limit it to "query statements" because SQL Server also has room for optimization in memory usage, query compilation, deadlocks, and so on, none of which this article covers.

The other sense of "getting started" is that I will focus on explaining principles. I have found that neither front-end performance optimization nor SQL performance optimization offers the what-you-see-is-what-you-get feedback of ordinary coding. Perhaps the tools provide limited information, or perhaps the bottleneck sits in a pile of inherited legacy code; either way, most of the time you need to improvise. Sometimes you can trace a problem to its root cause step by step, and sometimes all you can do is ease it a little by rewriting the code drastically. Whichever route you take, you need some understanding of the tools behind it. I have tried to make this article understandable even if all you can write is a simple SELECT, UPDATE, or DELETE.

This article covers two topics: indexes and execution plans. Indexes can solve more than 90% of our performance problems, but we still need to know when and where to add them, and the execution plan is where we look for those hints.

To illustrate, the article uses the official sample database AdventureWorks and three of its tables: Person.Person, Person.PersonPhone, and Person.EmailAddress. All three tables have a BusinessEntityID column, which lets us join the information belonging to the same person.

Beware of Scans

It is no exaggeration to compare a database to a book. Imagine having to find a line of text in a book without a table of contents: all you could do is search page by page. A database works the same way. For a table without any indexes, it can only find matching data by **scanning** the entire table.

For example, after dropping all indexes on the PersonPhone table, I look up a specific phone number:

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';

The execution plan shows us the following process:

001_table_scan.png

We will cover execution plans later; for now, you can think of an execution plan as the steps taken to execute a SQL statement. The Table Scan above tells us that the whole table was scanned, and that this step consumed the most resources in the entire execution: Cost: 100%. The cost is an abstract unit. It does not represent consumption along a single dimension such as CPU or I/O; it is an estimate that combines several kinds of resource usage.

In fact, the 100% above does not by itself mean that the scan is inefficient. Because the query touches a single table, the same simple query against an indexed table also shows Cost: 100%. For example, I query the Person table, which has the index [PK_Person_BusinessEntityID]:
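If you would rather not drop the indexes on the sample table, one possible way to reproduce the Table Scan is to query an index-free copy instead. This is only a sketch: it assumes SQL Server 2016 or later (for DROP TABLE IF EXISTS), and dbo.PersonPhoneCopy is a hypothetical scratch table name, not part of AdventureWorks.

```sql
-- Hypothetical scratch table: remove any previous copy first.
DROP TABLE IF EXISTS dbo.PersonPhoneCopy;
GO

-- SELECT ... INTO copies the rows but none of the source table's indexes,
-- so the new table is a heap.
SELECT *
INTO dbo.PersonPhoneCopy
FROM Person.PersonPhone;
GO

-- With no index to use, the plan for this query should show a Table Scan.
SELECT *
FROM dbo.PersonPhoneCopy
WHERE PhoneNumber = '156-555-0199';
GO
```

Working on a copy keeps the original Person.PersonPhone intact, so other examples that rely on its indexes still behave as described.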

  SELECT *
  FROM Person.Person
  WHERE BusinessEntityID = 10;

The resulting execution process is as follows:

002_person_query.png

The Clustered Index Seek, which is not a scan (I will explain it later; for now, think of it as an operation that beats a scan), also costs 100%.

But if we join the PersonPhone and Person tables in one query, the difference in efficiency shows immediately:

  SELECT *
  FROM Person.PersonPhone AS PersonPhone
  JOIN Person.Person AS Person ON PersonPhone.BusinessEntityID = Person.BusinessEntityID
  WHERE PhoneNumber = '156-555-0199';

003_compare_scan_seek.png

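The scan accounts for 91% of the total cost.

So a scan is an optimization opportunity we can recognize. When you find that a table lacks an index, or you see a scan operation in the execution plan, try adding an index to fix the performance problem.

Logical Reads: A Key Metric

When SQL Server reads data, it first looks in the in-memory cache (the buffer cache) and goes to disk only if the data is not found there. The former is called a logical read and the latter a physical read. Since reading from memory is far faster than reading from disk, we naturally want to avoid physical reads as much as possible.

What exactly does a logical read read? Pages. A page is the smallest unit in which the database organizes data, and that is as deep as we need to go; how pages are organized and what their internal structure looks like does not matter here. The number of logical reads should therefore be as small as possible. By default, you will not see this metric in the output; turn it on with SET STATISTICS IO ON. For example, to query the PersonPhone table without any indexes, we run: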
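To check which indexes a table currently has before deciding whether to add one, a query against the sys.indexes catalog view is one option. This is a minimal sketch; the column aliases are my own choice.

```sql
-- List the indexes defined on Person.PersonPhone.
-- A row with type_desc = 'HEAP' means the table has no clustered index.
SELECT i.name      AS index_name,
       i.type_desc AS index_type,
       i.is_primary_key,
       i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'Person.PersonPhone');
```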

SET STATISTICS IO ON
GO

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';

SET STATISTICS IO OFF
GO

The logical read information we get is:

(1 row affected) Table 'PersonPhone'. Scan count 1, logical reads 158, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Once I add a clustered index keyed on PhoneNumber to PersonPhone (if you know nothing about indexes, just think of it as an optimization technique for now), the same statement produces:

(1 row affected) Table 'PersonPhone'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
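The before-and-after comparison can be reproduced along the following lines. This is a sketch under assumptions: it works on a hypothetical scratch copy named dbo.PersonPhoneCopy rather than on the original table, the index name is illustrative, and DROP TABLE IF EXISTS requires SQL Server 2016 or later. The exact read counts on your machine may differ from the figures quoted above.

```sql
-- Hypothetical scratch table: an index-free copy of Person.PersonPhone.
DROP TABLE IF EXISTS dbo.PersonPhoneCopy;
GO
SELECT * INTO dbo.PersonPhoneCopy FROM Person.PersonPhone;
GO

SET STATISTICS IO ON;
GO

-- Before: the copy is a heap, so every page is read.
SELECT * FROM dbo.PersonPhoneCopy WHERE PhoneNumber = '156-555-0199';
GO

-- Add a clustered index keyed on PhoneNumber (the index name is illustrative).
CREATE CLUSTERED INDEX CIX_PersonPhoneCopy_PhoneNumber
    ON dbo.PersonPhoneCopy (PhoneNumber);
GO

-- After: the same query can seek on the clustered index.
SELECT * FROM dbo.PersonPhoneCopy WHERE PhoneNumber = '156-555-0199';
GO

SET STATISTICS IO OFF;
GO
```

Compare the "logical reads" figure in the Messages output of the two SELECT statements.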

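A high number of logical reads may (but does not necessarily) hint at some problems:

  • A missing index is causing many rows to be scanned
  • The higher the value, the more pressure the query may be putting on the disk
  • Even a query may lock data (depending on the transaction isolation level), and a long-running query can block subsequent reads and writes, causing a chain reaction.

Another advantage of logical reads as a performance metric is that they fluctuate much less than, say, duration or CPU time.

However, logical reads are less informative than the execution plan. For one thing, they are one-directional: you can get the number of logical reads from a SQL statement, but you cannot work backward from the number to the problem, so the execution plan is better suited to troubleshooting. For another, they do not always reflect a problem accurately: if you remove the WHERE clause from the query above, you will find that the logical reads before and after adding the index barely change.

In any case, logical reads can serve as one of our reference metrics.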
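To see the difference in stability for yourself, you can turn on SET STATISTICS TIME next to SET STATISTICS IO and run the same query several times. The following is a small sketch; the timings you see depend on your hardware and on whatever else the server is doing.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

-- Run this batch a few times and compare the Messages output:
-- CPU time and elapsed time usually vary between runs, while logical reads
-- should stay the same as long as the data and the plan do not change.
SELECT *
FROM Person.PersonPhone
WHERE PhoneNumber = '156-555-0199';
GO

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO
```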

Indexes

Clustered Index & Nonclustered Index

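At last we get to the main topic: indexes. How an index works is simple. If we compare the database to a book, the index is the book's table of contents, which helps you locate data quickly.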

004_mysql_no_index.png

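In the table above, if we want to find the rows for a particular company, we have to check every row to see whether it matches the value we are looking for. This is a full table scan, and it is inefficient. If the table is large and only a few rows match the search condition, the scan becomes extremely inefficient.

We can add an index to this table: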

005_mysql_index.png

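The index contains one entry for each row in the ad table, and the entries are sorted by company_num. Now, instead of searching the whole table row by row for matches, we can use the index. Suppose we want all rows for company number 13. We start scanning the index and find 3 values belonging to that company. We then reach the index value for company 14, which is larger than the value we are looking for. Because the index values are sorted, once we read the entry containing 14 we know there can be no more matches for 13, so we can stop searching. One way an index improves efficiency, then, is that we know where the matching rows end and can skip the rest. Another is that a positioning algorithm can find the first match directly, without a linear scan from the start of the index (binary search, for example, is much faster than scanning). That lets us locate the first matching value quickly and saves a great deal of search time.

In SQL Server, however, indexes come in several kinds. The clustered index is the most commonly used: the table's data is physically sorted by the clustered index key. Because only one physical order is possible, a table can have only one clustered index. When you add a primary key constraint to a table, the database by default creates a clustered index on the primary key for you (unless the table already has a clustered index or you specify NONCLUSTERED).

We can add a clustered index keyed on PhoneNumber to PersonPhone and then run the PhoneNumber query above again: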
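The primary key behavior can be checked with two throwaway tables. This is a sketch: dbo.PkDemoClustered and dbo.PkDemoNonclustered are hypothetical names made up for the illustration, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical demo tables, not part of AdventureWorks.
DROP TABLE IF EXISTS dbo.PkDemoClustered, dbo.PkDemoNonclustered;
GO

CREATE TABLE dbo.PkDemoClustered (
    Id   INT          NOT NULL PRIMARY KEY,              -- clustered by default
    Name NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.PkDemoNonclustered (
    Id   INT          NOT NULL PRIMARY KEY NONCLUSTERED, -- table stays a heap
    Name NVARCHAR(50) NOT NULL
);
GO

-- Inspect which kind of index each primary key produced.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       i.type_desc
FROM sys.indexes AS i
WHERE i.object_id IN (OBJECT_ID(N'dbo.PkDemoClustered'),
                      OBJECT_ID(N'dbo.PkDemoNonclustered'))
ORDER BY table_name, i.index_id;
```

The first table should report a single CLUSTERED index, while the second should report a HEAP row plus a NONCLUSTERED index for the primary key.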

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';

You can see that the execution plan has changed to the Clustered Index Seek shown below:

006_add_phone_number_clustered_index.png

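A seek is the most efficient operation, and we should get our queries to perform seeks whenever possible. Unlike a scan, it does not go through the rows one by one; like a book's table of contents, it goes straight to the destination and fetches the data it needs.

But a clustered index seek does not kick in under all circumstances. For example, with the PhoneNumber index above in place, query by BusinessEntityID instead: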

  SELECT *
  FROM Person.PersonPhone
  WHERE BusinessEntityID = 4511

You will find that the execution plan is a Clustered Index Scan:

007_query_entity_id_by_phone_number_index.png

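An index scan means the database reads all the rows through the index and then filters them. If you compare an index scan with a table scan, their logical reads are about the same.

Setting up a nonclustered index is no different from setting up a clustered index, and when it is used you will see a Non-Clustered Index Seek in the execution plan. The obvious difference is that it does not affect the order of the underlying table. Although the two look alike, they are closely connected behind the scenes, and understanding those connections helps us judge when to add which kind of index.

How Indexes Work

Imagine a single column of 27 rows. Because the page size is limited, the rows are split across 9 pages: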

008_random_row_pages.png

After you add a clustered index to them, the index has the following data structure:

009_b_tree_layout.png

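When you look for the value 5, the search starts at the top node. Because 5 lies between 1 and 10, the search continues to the next node down the left branch. Because 5 falls between 4 and 7, the search moves to the node on the next level that starts with 4. Finally, 5 is found on a leaf node.

In fact, we have left out some details. A clustered index is structured as follows: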

010_clustered_index_arch.gif

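As the figure shows, the nodes at each level form a doubly linked list, and the leaf nodes store the table's actual data.

The storage structure of a nonclustered index is slightly different: its leaf nodes consist of index pages rather than data pages. A nonclustered index needs a row locator (think of it as a pointer) to reach the corresponding data row. For a heap table (a table without a clustered index), the row locator is the RID (row identifier) of each row; for a non-heap table, the row locator is the clustered index key.

Space is limited, but based on the points above we can summarize when to use which kind of index:

  • Create the clustered index before you create any nonclustered indexes
  • If the data you query always needs to be sorted by a certain column, consider putting the clustered index on that column
  • Do not put the clustered index on a frequently updated column. Doing so causes the row locators of all related nonclustered indexes to be updated frequently as well, which may lead to deadlocks.
  • Conversely, you can put a nonclustered index on a frequently updated column, because the updates only affect that nonclustered index
  • Nonclustered indexes are a poor fit for queries that return large amounts of data, because they may incur extra lookup operations. In that case, turn the index into a covering index.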
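If you want to see the levels of a real B-tree, the sys.dm_db_index_physical_stats dynamic management function reports one row per level in DETAILED mode. The sketch below looks at the clustered index of Person.Person; it assumes you have permission to view database state, and the numbers depend on your copy of AdventureWorks.

```sql
-- One row per level of the clustered index (index_id = 1) on Person.Person.
SELECT ips.index_type_desc,
       ips.index_depth,   -- total number of levels in the B-tree
       ips.index_level,   -- 0 is the leaf level; the highest value is the root
       ips.page_count,    -- pages at this level
       ips.record_count   -- entries at this level
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'Person.Person'), 1, NULL, 'DETAILED') AS ips
WHERE ips.alloc_unit_type_desc = 'IN_ROW_DATA'  -- skip LOB and row-overflow pages
ORDER BY ips.index_level DESC;
```

The root level should consist of a single page, and the page count should grow as index_level decreases toward the leaf level.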

Covering Index

After removing all indexes from PersonPhone and adding a nonclustered index on PhoneNumber, run the original query again:

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';

You will get the following execution plan:

012_lookup.png

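Besides the Index Seek, the lookup operation at the lower right accounts for the largest share of the cost. What triggers a lookup is simple: the database decides to use a nonclustered index for the query, but a column the query needs is not in that index (it is neither an index key nor in the INCLUDE list). A lookup follows the index's row locator (the clustered index key for a non-heap table, the RID for a heap table) to the corresponding row data and reads the requested columns from it. Besides the logical reads spent on index pages, the whole process costs additional logical reads on data pages. It follows that a query that uses the clustered index never needs a lookup, because the leaf nodes of a clustered index are the data pages.

If the index can supply every column the query needs, access to the data pages can be skipped. An index of this kind is called a covering index.

We can put the columns that the query needs, other than the index key, into the INCLUDE list, which solves the lookup problem above: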

013_add_columns_to_include_list.png
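The same setup can be written in T-SQL instead of through the dialog. This is a sketch under assumptions: it uses a hypothetical scratch copy named dbo.PersonPhoneCopy so the original table is left alone, the index name is illustrative, and DROP TABLE IF EXISTS requires SQL Server 2016 or later.

```sql
-- Hypothetical scratch table: an index-free copy of Person.PersonPhone.
DROP TABLE IF EXISTS dbo.PersonPhoneCopy;
GO
SELECT * INTO dbo.PersonPhoneCopy FROM Person.PersonPhone;
GO

-- PhoneNumber is the index key; the remaining columns of the table go into
-- the INCLUDE list, so the index can answer SELECT * on its own.
CREATE NONCLUSTERED INDEX IX_PersonPhoneCopy_PhoneNumber
    ON dbo.PersonPhoneCopy (PhoneNumber)
    INCLUDE (BusinessEntityID, PhoneNumberTypeID, ModifiedDate);
GO

-- The plan for this query should now show an Index Seek with no lookup.
SELECT *
FROM dbo.PersonPhoneCopy
WHERE PhoneNumber = '156-555-0199';
GO
```

Included columns are stored only at the leaf level of the index, so they make the index larger without becoming part of the key. Including every column, as done here to cover SELECT *, is rarely worthwhile in practice; selecting only the columns you need keeps the covering index small.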

Summary

Originally I also wanted to write about join efficiency (comparing hash join, nested loops, and merge join), but which kind of join the database uses is largely outside our control. In fact, whether the database uses our index at all is also outside our control: the execution plan is computed by the internal optimizer. Harsher still, the plan may differ from one execution to the next depending on resources, data, and index state. Indexes, however, are far more controllable, and most performance problems can be solved with them.

Origin juejin.im/post/7080181829415731237