《Pro SQL Server Internals》翻译之索引

本文选自《Pro SQL Server Internals》

作者: Dmitri Korotkevitch

出版社: Apress

出版年: 2016-12-29

页数: 804

作者简介：Dmitri Korotkevitchis是微软SQL Server MVP和微软认证大师。作为应用程序和数据库开发人员、数据库管理员和数据库架构师，他具有多年使用SQL Server的经验。他专门从事OLTP系统在高负载下的设计、开发和性能调优。Dmitri经常在各种Microsoft和SQL PASS活动上发言，他为世界各地的客户提供SQL Server培训。

原文链接：http://www.doc88.com/p-4042504089228.html

CHAPTER 2 ■ TABLES AND INDEXES: INTERNAL STRUCTURE AND ACCESS METHODS

第2章■表格和索引：内部结构和访问方法

Figure 2-4. Forwarding pointers and I/O: Reading data when forwarding pointers exist

图2-4。转发指针和I / O：存在转发指针时读取数据

As you can see, the large number of forwarding pointers leads to extra I/O operations and significantly reduces the performance of the queries accessing the data. Companion materials for this book include the script that demonstrates this problem in a large scope with a table that includes a large amount of data.

如您所见，大量转发指针会导致额外的I / O操作，并显着降低访问数据的查询的性能。本书的伴随材料包括在大范围内使用包含大量数据的表来演示此问题的脚本。

When the size of the forwarded row is reduced by another update and the data page with forwarding pointer has enough space to accommodate the updated version of the row, SQL Server may move it back to its original data page and remove the forwarding pointer row. Nevertheless, the only reliable way to get rid of all of the forwarding pointers is by rebuilding the heap table. You can do that by using an ALTER TABLE REBUILD statement.

当通过另一次更新减少转发行的大小并且具有转发指针的数据页具有足够的空间来容纳该行的更新版本时，SQL Server可以将其移回其原始数据页并移除转发指针行。然而，摆脱所有转发指针的唯一可靠方法是重建堆表。您可以使用ALTER TABLE REBUILD语句执行此操作。

Heap tables can be useful in staging environments, where you want to import a large amount of data into the system as fast as possible. Inserting data into heap tables can often be faster than inserting it into tables with clustered indexes. Nevertheless, during a regular workload, tables with clustered indexes usually outperform heap tables, which have suboptimal space control and extra I/O operations introduced by forwarding pointers.

堆栈表在暂存环境中非常有用，您希望尽可能快地将大量数据导入系统。将数据插入堆表通常比将其插入具有聚簇索引的表更快。然而，在常规工作负载期间，具有聚簇索引的表通常优于堆表，堆表具有次优的空间控制和转发指针引入的额外I / O操作。

Clustered Indexes

聚集索引

A clustered index dictates the physical order of the data in a table, which is sorted according to the clustered index key. The table can have only one clustered index defined.

聚簇索引指示表中数据的物理顺序，该表根据聚簇索引键进行排序。该表只能定义一个聚簇索引。

Let’s assume that you want to create a clustered index on the heap table with the data. As a first step, which is shown in Figure 2-5 , SQL Server creates another copy of the data that is then sorted based on the value of the clustered key. The data pages are linked in a double-linked list where every page contains pointers to the next and previous pages in the chain. This list is called the leaf level of the index, and it contains the actual table data.

假设您要在堆表上使用数据创建聚簇索引。作为第一步，如图2-5所示，SQL Server会创建另一个数据副本，然后根据群集密钥的值对其进行排序。数据页链接在双链表中，其中每个页面都包含指向链中下一页和上一页的指针。此列表称为索引的叶级，它包含实际的表数据。

Figure 2-5. Clustered index structure: Leaf level

图2-5。聚集的索引结构：叶级

■ Note The sort order on the page is controlled by a slot array. Actual data on the page is unsorted.

注意页面上的排序顺序由插槽阵列控制。页面上的实际数据未排序。

When the leaf level consists of multiple pages, SQL Server starts to build an intermediate level of the index, as shown in Figure 2-6 .

当叶级别由多个页面组成时，SQL Server开始构建索引的中间级别，如图2-6所示。

Figure 2-6. Clustered index structure: Intermediate and leaf levels

图2-6。聚集的索引结构：中级和叶级

The intermediate level stores one row per leaf-level page. It stores two pieces of information: the physical address and the minimum value of the index key from the page it references. The only exception is the very first row on the first page, where SQL Server stores NULL rather than the minimum index key value. With such optimization, SQL Server does not need to update non-leaf-level rows when you insert the row with the lowest key value in the table.

中间级别为每个叶级页面存储一行。它存储两条信息：它引用的页面中的索引键的物理地址和最小值。唯一的例外是第一页上的第一行，其中SQL Server存储NULL而不是最小索引键值。通过这种优化，当您在表中插入具有最低键值的行时，SQL Server不需要更新非叶级行。

The pages on the intermediate levels are also linked to the double-linked list. SQL Server adds more and more intermediate levels until there is a level that includes just the single page. This level is called the root level , and it becomes the entry point to the index, as shown in Figure 2-7 .

中间级别的页面也链接到双链表。 SQL Server添加了越来越多的中间级别，直到只包含单个页面的级别。此级别称为根级别，它将成为索引的入口点，如图2-7所示。

Figure 2-7. Clustered index structure: Root level

As you can see, the index always has one leaf level, one root level, and zero or more intermediate levels. The only exception is when the index data fits into a single page. In that case, SQL Server does not create the separate root-level page, and the index consists of just the single leaf-level page.

如您所见，索引始终具有一个叶级别，一个根级别和零个或多个中间级别。唯一的例外是索引数据适合单个页面。在这种情况下，SQL Server不会创建单独的根级页面，索引只包含单个叶级页面。

The number of levels in the index largely depends on the row and index key sizes. For example, the index on the 4-byte integer column will require 13 bytes per row on the intermediate and root levels. Those 13 bytes consist of a 2-byte slot-array entry, a 4-byte index-key value, a 6-byte page pointer, and a 1-byte row overhead, which is adequate because the index key does not contain variable-length and NULL columns.

索引中的级别数很大程度上取决于行和索引键的大小。例如，4字节整数列上的索引在中间和根级别上每行需要13个字节。这13个字节由一个2字节的插槽数组条目，一个4字节的索引键值，一个6字节的页面指针和一个1字节的行开销组成，这是足够的，因为索引键不包含变量 - length和NULL列。

As a result, you can accommodate 8,060 bytes / 13 bytes per row = 620 rows per page. This means that, with the one intermediate level, you can store information about up to 620 * 620 = 384,400 leaf-level pages. If your data row size is 200 bytes, you can store 40 rows per leaf-level page and up to 15,376,000 rows in the index with just three levels. Adding another intermediate level to the index would essentially cover all possible integer values.

因此，每行可容纳8,060字节/ 13字节=每页620行。这意味着，使用一个中间级别，您可以存储最多620 * 620 = 384,400个叶级页面的信息。如果数据行大小为200字节，则每个叶级页面可存储40行，索引中最多可存储15,376,000行，只有三个级别。向索引添加另一个中间级别将基本上涵盖所有可能的整数值。

Note In real life, index fragmentation would reduce those numbers. We will talk about index fragmentation in Chapter 6 .

注意在现实生活中，索引碎片会减少这些数字。我们将在第6章讨论索引碎片。

There are three different ways in which SQL Server can read data from the index. The first one is by an ordered scan. Let’s assume that we want to run the SELECT Name FROM dbo.Customers ORDER BY CustomerId query. The data on the leaf level of the index is already sorted based on the CustomerId column value. As a result, SQL Server can scan the leaf level of the index from the first to the last page and return the rows in the order in which they were stored.

SQL Server可以通过三种不同的方式从索引中读取数据。第一个是有序扫描。假设我们想要运行SELECT Name FROM dbo.Customers ORDER BY CustomerId查询。索引的叶级别上的数据已根据CustomerId列值进行排序。因此，SQL Server可以从第一页到最后一页扫描索引的叶级，并按存储顺序返回行。

SQL Server starts with the root page of the index and reads the first row from there. That row references the intermediate page with the minimum key value from the table. SQL Server reads that page and repeats the process until it finds the first page on the leaf level. Then, SQL Server starts to read rows one by one, moving through the linked list of the pages until all rows have been read. Figure 2-8 illustrates this process.

SQL Server从索引的根页开始，从那里读取第一行。该行使用表中的最小键值引用中间页面。 SQL Server读取该页面并重复该过程，直到找到叶级别的第一页。然后，SQL Server开始逐个读取行，遍历页面的链接列表，直到读取了所有行。图2-8说明了这个过程。

Figure 2-8. Ordered index scan

图2-8。有序索引扫描

The execution plan for the preceding query shows the Clustered Index Scan operator with the Orderedproperty set to true, as shown in Figure 2-9 .

上述查询的执行计划显示了“集群索引扫描”操作符，其中Orderedproperty设置为true，如图2-9所示。

Figure 2-9. Ordered index scan execution plan

图2-9。有序索引扫描执行计划

It is worth mentioning that the order by clause is not required for an ordered scan to be triggered. An ordered scan just means that SQL Server reads the data based on the order of the index key.

值得一提的是，触发有序扫描不需要order by子句。有序扫描只意味着SQL Server根据索引键的顺序读取数据。

SQL Server can navigate through indexes in both directions, forward and backward. However, there is one important aspect that you must keep in mind: SQL Server does not use parallelism during backward index scans.

SQL Server可以向前和向后两个方向导航索引。但是，您必须记住一个重要方面：SQL Server在向后索引扫描期间不使用并行性。

Tip： You can check scan direction by examining the INDEX SCAN or INDEX SEEK operator properties in the execution plan. Keep in mind, however, that Management Studio does not display these properties in the graphical representation of the execution plan. You need to open the Properties window to see it by selecting the operator in the execution plan and choosing the View/Properties Window menu item or by pressing the F4 key

提示：您可以通过检查执行计划中的INDEX SCAN或INDEX SEEK运算符属性来检查扫描方向。但请记住，Management Studio不会在执行计划的图形表示中显示这些属性。您需要打开“属性”窗口以通过在执行计划中选择运算符并选择“视图/属性窗口”菜单项或按F4键来查看它。

The Enterprise Edition of SQL Server has an optimization feature called merry-go-round scan that allows multiple tasks to share the same index scan. Let’s assume that you have session S1, which is scanning the index. At some point in the middle of the scan, another session, S2, runs a query that needs to scan the same index. With a merry-go-round scan, S2 joins S1 at its current scan location. SQL Server reads each page only once, passing rows to both sessions.

SQL Server企业版具有称为旋转木马扫描的优化功能，允许多个任务共享相同的索引扫描。假设您有会话S1，它正在扫描索引。在扫描中间的某个时刻，另一个会话S2运行需要扫描相同索引的查询。通过旋转木马扫描，S2将S1连接到当前扫描位置。 SQL Server只读取每个页面一次，将行传递给两个会话。

When the S1 scan reaches the end of the index, S2 starts scanning data from the beginning of the index until the point where the S2 scan started. A merry-go-round scan is another example of why you cannot rely on the order of the index keys and why you should always specify an ORDER BY clause when it matters.

当S1扫描到达索引的末尾时，S2开始从索引的开头扫描数据，直到S2扫描开始的点。旋转木马扫描是另一个例子，说明为什么不能依赖索引键的顺序以及为什么在重要时应始终指定ORDER BY子句。

The next access method after the ordered scan is called an allocation order scan . SQL Server accesses the table data through the IAM pages, similar to how it does so with heap tables. The SELECT Name FROM dbo.Customers WITH (NOLOCK) query and Figure 2-10 illustrate this method. Figure 2-11 shows the query execution plan.

有序扫描之后的下一个访问方法称为分配顺序扫描。 SQL Server通过IAM页面访问表数据，类似于使用堆表的方式。 SELECT名称FROM dbo.Customers WITH（NOLOCK）查询和图2-10说明了这种方法。图2-11显示了查询执行计划。

Figure 2-10. Allocation order scan

图2-10。分配订单扫描

Figure 2-11. Allocation order scan execution plan

图2-11。分配订单扫描执行计划

Unfortunately, it is not easy to detect when SQL Server uses an allocation order scan. Even though the Ordered property in the execution plan shows false , it indicates that SQL Server does not care whether the rows were read in the order of the index key, not that an allocation order scan was used.

不幸的是，当SQL Server使用分配顺序扫描时，检测起来并不容易。即使执行计划中的Ordered属性显示为false，也表示SQL Server不关心是否按索引键的顺序读取行，而不是使用分配顺序扫描。

An allocation order scan can be faster for scanning large tables, although it has a higher startup cost. SQL Server does not use this access method when the table is small. Another important consideration is data consistency. SQL Server does not use forwarding pointers in tables that have a clustered index, and an allocation order scan can produce inconsistent results. Rows can be skipped or read multiple times due to the data movement caused by page splits. As a result, SQL Server usually avoids using allocation order scans unless it reads the data in READ UNCOMMITTED or SERIALIZABLE transaction-isolation levels.

尽管扫描大型表的启动成本较高，但分配顺序扫描可以更快地扫描大型表。当表很小时，SQL Server不使用此访问方法。另一个重要的考虑是数据一致性 SQL Server不在具有聚簇索引的表中使用转发指针，并且分配顺序扫描可能会产生不一致的结果。由于页面拆分导致的数据移动，可以多次跳过或读取行。因此，SQL Server通常会避免使用分配顺序扫描，除非它读取READ UNCOMMITTED或SERIALIZABLE事务隔离级别中的数据。

Note We will talk about page splits and fragmentation in Chapter 6 , “Index Fragmentation,” and discuss locking and data consistency in Part III, “Locking, Blocking, and Concurrency.”

注意我们将在第6章“索引碎片”中讨论页面拆分和碎片，并讨论第三部分“锁定，阻塞和并发”中的锁定和数据一致性。

The last index access method is called index seek . The SELECT Name FROM dbo.Customers WHERE CustomerId BETWEEN 4 AND 7 query and Figure 2-12 illustrate the operation.

最后一个索引访问方法称为索引查找。 SELECT名称来自dbo.Customers WHERE CustomerId BETWEEN 4和7查询以及图2-12说明了操作。

Figure 2-12. Index seek

图2-12。索引寻求

In order to read the range of rows from the table, SQL Server needs to find the row with the minimum value of the key from the range, which is 4. SQL Server starts with the root page, where the second row references the page with the minimum key value of 350. It is greater than the key value that we are looking for (4), and SQL Server reads the intermediate-level data page (1:170) referenced by the first row on the root page.

为了从表中读取行的范围，SQL Server需要从该范围中找到具有最小键值的行，即4. SQL Server以根页面开始，其中第二行引用该页面的页面。最小键值350.它大于我们要查找的键值（4），SQL Server读取根页面第一行引用的中间级数据页（1：170）。

Similarly, the intermediate page leads SQL Server to the first leaf-level page (1:176). SQL Server reads that page, then it reads the rows with CustomerIds equal to 4 and 5, and, finally, it reads the two remaining rows from the second page.

同样，中间页面将SQL Server引导到第一个叶级页面（1：176）。 SQL Server读取该页面，然后它读取CustomerIds等于4和5的行，最后，它从第二页读取剩余的两行。

The execution plan is shown in Figure 2-13 .

执行计划如图2-13所示。

Figure 2-13. Index seek execution plan

图2-13。索引寻求执行计划

As you can guess, index seek is more efficient than index scan, because SQL Server processes just the subset of rows and data pages rather than scanning the entire table.

您可以猜测，索引搜索比索引扫描更有效，因为SQL Server只处理行和数据页的子集，而不是扫描整个表。

Technically speaking, there are two kinds of index seek operations. The first is called a singleton lookup , or sometimes point-lookup , where SQL Server seeks and returns a single row. You can think about WHERE CustomerId = 2 predicate as an example. The other type of index seek operation is called a range scan , and it requires SQL Server to find the lowest or highest value of the key and scan (either forward or backward) the set of rows until it reaches the end of scan range. The predicate WHERE CustomerId BETWEEN 4 AND 7 leads to the range scan. Both cases are shown as INDEX SEEK operations in the execution plans.

从技术上讲，索引搜索操作有两种。第一种称为单例查找，有时称为点查找，其中SQL Server寻找并返回单行。您可以考虑将WHERE CustomerId = 2谓词作为示例。另一种类型的索引查找操作称为范围扫描，它要求SQL Server查找键的最低值或最高值，并扫描（向前或向后）行集，直到达到扫描范围的末尾。 CustomerId BETWEEN 4和7之间的谓词WHERE导致范围扫描。这两种情况都在执行计划中显示为INDEX SEEK操作。

As you can guess, it is entirely possible for range scans to force SQL Server to process a large number or even all data pages from the index. For example, if you changed the query to use a WHERE CustomerId > 0predicate, SQL Server would read all rows/pages, even though you would have an Index Seek operator displayed in the execution plan. You must keep this behavior in mind and always analyze the efficiency of range scans during query performance tuning.

您可以猜到，范围扫描完全可以强制SQL Server处理索引中的大量甚至所有数据页。例如，如果您将查询更改为使用WHERE CustomerId> 0predicate，则SQL Server将读取所有行/页，即使您在执行计划中显示了Index Seek运算符。您必须牢记此行为，并始终在查询性能调整期间分析范围扫描的效率。

There is a concept in relational databases called SARGable predicates , which stands for S earch Arg ument able . The predicate is SARGable if SQL Server can utilize an index seek operation, if an index exists. In a nutshell, predicates are SARGable when SQL Server can isolate the single value or range of index key values to process, thus limiting the search during predicate evaluation. Obviously, it is beneficial to write queries using SARGable predicates and utilize index seek whenever possible.

在关系数据库中有一个名为SARGable谓词的概念，它代表了S earch Arg ement的能力。如果索引存在，如果SQL Server可以使用索引查找操作，则谓词是SARGable。简而言之，当SQL Server可以隔离要处理的索引键值的单个值或范围时，谓词是SARGable，因此在谓词评估期间限制搜索。显然，使用SARGable谓词编写查询并尽可能利用索引查找是有益的。

SARGable predicates include the following operators: = , > , >= , < , <= , IN , BETWEEN , and LIKE (in case of prefix matching). Non-SARGable operators include NOT , <> , LIKE (in case of non-prefix matching), and NOT IN . Another circumstance for making predicates non-SARGable is using functions or mathematical calculations against the table columns. SQL Server has to call the function or perform the calculation for every row it processes. Fortunately, in some of cases you can refactor the queries to make such predicates SARGable. Table 2-1 shows a few examples of this.

SARGable谓词包括以下运算符：=，>，> =，<，<=，IN，BETWEEN和LIKE（在前缀匹配的情况下）。非SARGable运算符包括NOT，<>，LIKE（在非前缀匹配的情况下）和NOT IN。使谓词非SARGable的另一种情况是对表列使用函数或数学计算。 SQL Server必须为其处理的每一行调用该函数或执行计算。幸运的是，在某些情况下，您可以重构查询以生成此类谓词优化搜索。表2-1列出了一些例子。

Another important factor that you must keep in mind is type conversion . In some cases, you can make predicates non-SARGable by using incorrect data types. Let’s create a table with a varchar column and populate it with some data, as shown in Listing 2-6 .

您必须牢记的另一个重要因素是类型转换。在某些情况下，您可以使用不正确的数据类型使谓词非SARGable。让我们创建一个带有varchar列的表，并用一些数据填充它，如清单2-6所示。

Listing 2-6. SARG predicates and data types: Test table creation

清单2-6 SARG谓词和数据类型：测试表创建

create table dbo.Data

(

VarcharKey varchar(10) not null,

Placeholder char(200)

);

create unique clustered index IDX_Data_VarcharKey

on dbo.Data(VarcharKey);

;with N1(C) as (select 0 union all select 0) -- 2 rows

,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows

,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows

,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows

,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows

,IDs(ID) as (select row_number() over (order by (select null)) from N5)

insert into dbo.Data(VarcharKey)

select convert(varchar(10),ID) from IDs;

The clustered index key column is defined as varchar, even though it stores integer values. Now, let’s run two selects, as shown in Listing 2-7 , and look at the execution plans.

聚簇索引键列定义为varchar，即使它存储整数值。现在，让我们运行两个选择，如清单2-7所示，并查看执行计划。

Listing 2-7. SARG predicates and data types: Select with integer parameter

清单2-7 SARG谓词和数据类型：使用整数参数选择

declare

@IntParam int = '200'

select * from dbo.Data where VarcharKey = @IntParam;

select * from dbo.Data where VarcharKey = convert(varchar(10),@IntParam);

As you can see in Figure 2-14 , in the case of the integer parameter, SQL Server scans the clustered index, converting varchar to an integer for every row. In the second case, SQL Server converts the integer parameter to a varchar at the beginning and utilizes a much more efficient clustered index seek operation.

如图2-14所示，对于整数参数，SQL Server扫描聚簇索引，将varchar转换为每行的整数。在第二种情况下，SQL Server在开始时将整数参数转换为varchar，并使用更高效的聚簇索引查找操作。

Figure 2-14. SARG predicates and data types: Execution plans with integer parameter

图2-14。 SARG谓词和数据类型：带整数参数的执行计划

Tip Pay attention to the column data types in the join predicates. Implicit or explicit data type conversions can significantly decrease the performance of the queries.

提示：请注意连接谓词中的列数据类型。隐式或显式数据类型转换可能会显着降低查询的性能。

You will observe very similar behavior in the case of unicode string parameters. Let’s run the queries shown in Listing 2-8 . Figure 2-15 shows the execution plans for the statements.

在unicode字符串参数的情况下，您将观察到非常类似的行为。让我们运行清单2-8中所示的查询。图2-15显示了语句的执行计划。

Listing 2-8. SARG predicates and data types: Select with string parameter

清单2-8 SARG谓词和数据类型：使用字符串参数选择

select * from dbo.Data where VarcharKey = '200';

select * from dbo.Data where VarcharKey = N'200'; -- unicode parameter

Figure 2-15. SARG predicates and data types: Execution plans with string parameter

图2-15。 SARG谓词和数据类型：带字符串参数的执行计划

As you can see, a unicode string parameter is non-SARGable for varchar columns. This is a much bigger issue than it appears to be. While you rarely write queries in this way, as shown in Listing 2-8 , most application development environments nowadays treat strings as unicode. As a result, SQL Server client libraries generate unicode ( nvarchar ) parameters for string objects unless the parameter data type is explicitly specified as varchar . This makes the predicates non-SARGable, and it can lead to major performance hits due to unnecessary scans, even when varchar columns are indexed.

如您所见，对于varchar列，unicode字符串参数是非SARGable。这是一个比看起来更大的问题。虽然您很少以这种方式编写查询，如清单2-8所示，但现在大多数应用程序开发环境都将字符串视为unicode。因此，除非将参数数据类型显式指定为varchar，否则SQL Server客户端库会为字符串对象生成unicode（nvarchar）参数。这使得谓词不具有SARG，并且由于不必要的扫描，它可能导致主要的性能命中，即使对varchar列进行索引也是如此。

■ Important Always specify parameter data types in client applications. For example, in ADO.Net, use

Parameters.Add("@ParamName",SqlDbType.Varchar, <Size>).Value = stringVariable instead of

Parameters.Add("@ParamName").Value = stringVariable overload. Use mapping in ORM frameworks to

explicitly specify non-unicode attributes in the classes.

It is also worth mentioning that varchar parameters are SARGable for nvarchar unicode data columns.

值得一提的是，对于nvarchar unicode数据列，varchar参数是SARGable。

Composite

Indexes Indexes with multiple key columns are called composite (or compound) indexes . The data in the composite indexes is sorted on a per-column basis from leftmost to rightmost columns. Figure 2-16 shows the structure of a composite index.

综合

索引具有多个键列的索引称为复合（或复合）索引。复合索引中的数据按从最左列到最右列的每列进行排序。图2-16显示了复合索引的结构。