Clustered Indexes

Excerpted from: Pro SQL Server Internals, 2nd edition, Chapter 2, Tables and Indexes

Author: Dmitri Korotkevitch

A clustered index dictates the physical order of the data in a table, which is sorted according to the clustered index key. The table can have only one clustered index defined. Let's assume that you want to create a clustered index on a heap table that already contains data. As a first step, which is shown in Figure 2-5, SQL Server creates another copy of the data that is then sorted based on the value of the clustered key. The data pages are linked in a double-linked list where every page contains pointers to the next and previous pages in the chain. This list is called the leaf level of the index, and it contains the actual table data.

Figure 2-5. Clustered index structure: Leaf level


■ Note The sort order on the page is controlled by a slot array. Actual data on the page is unsorted.


When the leaf level consists of multiple pages, SQL Server starts to build an intermediate level of the index, as shown in Figure 2-6.

Figure 2-6. Clustered index structure: Intermediate and leaf levels

The intermediate level stores one row per leaf-level page. Each row stores two pieces of information: the physical address and the minimum value of the index key from the page it references. The only exception is the very first row on the first page, where SQL Server stores NULL rather than the minimum index key value. With such optimization, SQL Server does not need to update non-leaf-level rows when you insert the row with the lowest key value in the table. The pages on the intermediate levels are also linked into a double-linked list. SQL Server adds more and more intermediate levels until there is a level that includes just a single page. This level is called the root level, and it becomes the entry point to the index, as shown in Figure 2-7.

Figure 2-7. Clustered index structure: Root level
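One way to look at these levels on a real index is the sys.dm_db_index_physical_stats function, which returns one row per index level when it runs in DETAILED mode. The following is a minimal sketch rather than a listing from the book. It assumes that a dbo.Customers table with a clustered index exists in the current database.

```sql
-- Assumption: dbo.Customers exists in the current database and has
-- a clustered index (index_id = 1).
select
    index_level,    -- 0 = leaf level; the highest value is the root level
    page_count,     -- number of pages on that level
    record_count    -- number of rows stored on that level
from sys.dm_db_index_physical_stats
(
    db_id(),                        -- current database
    object_id(N'dbo.Customers'),    -- table to inspect
    1,                              -- index_id 1 = clustered index
    null,                           -- all partitions
    'DETAILED'                      -- needed to get one row per level
)
where alloc_unit_type_desc = N'IN_ROW_DATA'
order by index_level desc;
```

With an index that has a separate root page, the first row of the output should report a page_count of 1.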

As you can see, the index always has one leaf level, one root level, and zero or more intermediate levels. The only exception is when the index data fits into a single page. In that case, SQL Server does not create the separate root-level page, and the index consists of just the single leaf-level page. The number of levels in the index largely depends on the row and index key sizes. For example, the index on the 4-byte integer column will require 13 bytes per row on the intermediate and root levels. Those 13 bytes consist of a 2-byte slot-array entry, a 4-byte index-key value, a 6-byte page pointer, and a 1-byte row overhead, which is adequate because the index key does not contain variable-length and NULL columns. As a result, you can accommodate 8,060 bytes / 13 bytes per row = 620 rows per page. This means that, with the one intermediate level, you can store information about up to 620 * 620 = 384,400 leaf-level pages. If your data row size is 200 bytes, you can store 40 rows per leaf-level page and up to 15,376,000 rows in the index with just three levels. Adding another intermediate level to the index would essentially cover all possible integer values.
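The arithmetic above can be reproduced with a short script. This is only a sketch of the calculation from the text; the variable names are made up for illustration.

```sql
declare
    @PageBytes int = 8060,          -- bytes available per page, as used in the text
    @IndexRowBytes int = 13,        -- 2 (slot array) + 4 (key) + 6 (page pointer) + 1 (overhead)
    @DataRowsPerPage int = 40;      -- 200-byte data rows per leaf-level page

select
    @PageBytes / @IndexRowBytes as RowsPerNonLeafPage,                      -- 620 (integer division)
    power(@PageBytes / @IndexRowBytes, 2) as LeafPagesWithThreeLevels,      -- 620 * 620 = 384,400
    power(convert(bigint, @PageBytes / @IndexRowBytes), 2)
        * @DataRowsPerPage as RowsWithThreeLevels;                          -- 15,376,000
```

The three levels here are the root page, one intermediate level, and the leaf level.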


■ Note In real life, index fragmentation would reduce those numbers. We will talk about index fragmentation in Chapter 6.

There are three different ways in which SQL Server can read data from the index. The first one is by an ordered scan. Let’s assume that we want to run the SELECT Name FROM dbo.Customers ORDER BY CustomerId query. The data on the leaf level of the index is already sorted based on the CustomerId column value. As a result, SQL Server can scan the leaf level of the index from the first to the last page and return the rows in the order in which they were stored.


SQL Server starts with the root page of the index and reads the first row from there. That row references the intermediate page with the minimum key value from the table. SQL Server reads that page and repeats the process until it finds the first page on the leaf level. Then, SQL Server starts to read rows one by one, moving through the linked list of the pages until all rows have been read. Figure 2-8 illustrates this process.


Figure 2-8. Ordered index scan


The execution plan for the preceding query shows the Clustered Index Scan operator with the Ordered property set to true, as shown in Figure 2-9.

Figure 2-9. Ordered index scan execution plan


It is worth mentioning that the ORDER BY clause is not required for an ordered scan to be triggered. An ordered scan just means that SQL Server reads the data based on the order of the index key. SQL Server can navigate through indexes in both directions, forward and backward. However, there is one important aspect that you must keep in mind: SQL Server does not use parallelism during backward index scans.
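As an illustration, a query that asks for rows in descending key order is a typical candidate for a backward scan. The sketch below assumes that dbo.Customers has a clustered index on CustomerId; whether SQL Server actually chooses a backward scan depends on the plan it selects.

```sql
-- Assumption: dbo.Customers has a clustered index on CustomerId.
-- Reading the leaf level from the last page backward returns rows in
-- descending key order without a separate sort.
select top (10) CustomerId, Name
from dbo.Customers
order by CustomerId desc;
```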

■ Tip You can check scan direction by examining the INDEX SCAN or INDEX SEEK operator properties in the execution plan. Keep in mind, however, that Management Studio does not display these properties in the graphical representation of the execution plan. You need to open the Properties window to see them by selecting the operator in the execution plan and choosing the View/Properties Window menu item or by pressing the F4 key.

The Enterprise Edition of SQL Server has an optimization feature called merry-go-round scan that allows multiple tasks to share the same index scan. Let’s assume that you have session S1, which is scanning the index. At some point in the middle of the scan, another session, S2, runs a query that needs to scan the same index. With a merry-go-round scan, S2 joins S1 at its current scan location. SQL Server reads each page only once, passing rows to both sessions.


When the S1 scan reaches the end of the index, S2 starts scanning data from the beginning of the index until the point where the S2 scan started. A merry-go-round scan is another example of why you cannot rely on the order of the index keys and why you should always specify an ORDER BY clause when it matters.

The next access method after the ordered scan is called an allocation order scan. SQL Server accesses the table data through the IAM pages, similar to how it does so with heap tables. The SELECT Name FROM dbo.Customers WITH (NOLOCK) query and Figure 2-10 illustrate this method. Figure 2-11 shows the query execution plan.

Figure 2-10. Allocation order scan


Figure 2-11. Allocation order scan execution plan

Unfortunately, it is not easy to detect when SQL Server uses an allocation order scan. Even when the Ordered property in the execution plan shows false, that only indicates that SQL Server does not care whether the rows were read in the order of the index key, not that an allocation order scan was used.

An allocation order scan can be faster for scanning large tables, although it has a higher startup cost. SQL Server does not use this access method when the table is small. Another important consideration is data consistency. SQL Server does not use forwarding pointers in tables that have a clustered index, and an allocation order scan can produce inconsistent results. Rows can be skipped or read multiple times due to the data movement caused by page splits. As a result, SQL Server usually avoids using allocation order scans unless it reads the data in READ UNCOMMITTED or SERIALIZABLE transaction-isolation levels.


■ Note We will talk about page splits and fragmentation in Chapter 6, “Index Fragmentation,” and discuss locking and data consistency in Part III, “Locking, Blocking, and Concurrency.”

The last index access method is called index seek. The SELECT Name FROM dbo.Customers WHERE CustomerId BETWEEN 4 AND 7 query and Figure 2-12 illustrate the operation.

Figure 2-12. Index seek


In order to read the range of rows from the table, SQL Server needs to find the row with the minimum value of the key from the range, which is 4. SQL Server starts with the root page, where the second row references the page with the minimum key value of 350. It is greater than the key value that we are looking for (4), and SQL Server reads the intermediate-level data page (1:170) referenced by the first row on the root page.


Similarly, the intermediate page leads SQL Server to the first leaf-level page (1:176). SQL Server reads that page, then it reads the rows with CustomerIds equal to 4 and 5, and, finally, it reads the two remaining rows from the second page.

The execution plan is shown in Figure 2-13.

Figure 2-13. Index seek execution plan


As you can guess, index seek is more efficient than index scan, because SQL Server processes just the subset of rows and data pages rather than scanning the entire table.


Technically speaking, there are two kinds of index seek operations. The first is called a singleton lookup, or sometimes point-lookup, where SQL Server seeks and returns a single row. You can think about the WHERE CustomerId = 2 predicate as an example. The other type of index seek operation is called a range scan, and it requires SQL Server to find the lowest or highest value of the key and scan (either forward or backward) the set of rows until it reaches the end of the scan range. The predicate WHERE CustomerId BETWEEN 4 AND 7 leads to the range scan. Both cases are shown as INDEX SEEK operations in the execution plans.

As you can guess, it is entirely possible for range scans to force SQL Server to process a large number or even all data pages from the index. For example, if you changed the query to use a WHERE CustomerId > 0 predicate, SQL Server would read all rows/pages, even though you would have an Index Seek operator displayed in the execution plan. You must keep this behavior in mind and always analyze the efficiency of range scans during query performance tuning.
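One simple way to compare the cost of these seeks is to look at the number of logical reads each statement produces. The following sketch assumes the dbo.Customers table used in the earlier queries, with a clustered index on CustomerId. All three statements are expected to show an Index Seek operator, yet the last one can touch every page of the index.

```sql
set statistics io on;   -- report logical reads for each statement

-- Singleton lookup: at most one row
select Name from dbo.Customers where CustomerId = 2;

-- Range scan: a small subset of rows
select Name from dbo.Customers where CustomerId between 4 and 7;

-- Range scan that covers every row when all keys are positive
select Name from dbo.Customers where CustomerId > 0;

set statistics io off;
```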


There is a concept in relational databases called SARGable predicates, which stands for Search Argumentable. A predicate is SARGable if SQL Server can utilize an index seek operation when an index exists. In a nutshell, predicates are SARGable when SQL Server can isolate the single value or range of index key values to process, thus limiting the search during predicate evaluation. Obviously, it is beneficial to write queries using SARGable predicates and utilize index seek whenever possible. SARGable predicates include the following operators: =, >, >=, <, <=, IN, BETWEEN, and LIKE (in the case of prefix matching). Non-SARGable operators include NOT, <>, LIKE (in the case of non-prefix matching), and NOT IN. Another circumstance that makes predicates non-SARGable is using functions or mathematical calculations against the table columns. SQL Server has to call the function or perform the calculation for every row it processes. Fortunately, in some cases you can refactor the queries to make such predicates SARGable. Table 2-1 shows a few examples of this.
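To make the idea concrete, here are a few refactorings of this kind written out as queries. They are illustrative sketches and are not taken from Table 2-1; the dbo.Orders table, its columns, and its indexes are hypothetical.

```sql
-- Hypothetical table and indexes, for illustration only.
create table dbo.Orders
(
    OrderId int not null primary key clustered,
    OrderDate datetime not null,
    CustomerName varchar(64) not null
);
create index IDX_Orders_OrderDate on dbo.Orders(OrderDate);
create index IDX_Orders_CustomerName on dbo.Orders(CustomerName);

declare @Id int = 250;

-- Calculation against the column: non-SARGable
select OrderId from dbo.Orders where OrderId - 1 = @Id;
-- Calculation moved to the parameter side: SARGable
select OrderId from dbo.Orders where OrderId = @Id + 1;

-- Function against the column: non-SARGable
select OrderId from dbo.Orders where datepart(year, OrderDate) = 2016;
-- Same rows expressed as a half-open date range: SARGable
select OrderId from dbo.Orders
where OrderDate >= '20160101' and OrderDate < '20170101';

-- Function against the column: non-SARGable
select OrderId from dbo.Orders where left(CustomerName, 3) = 'Smi';
-- Prefix match: SARGable
select OrderId from dbo.Orders where CustomerName like 'Smi%';
```

In each pair, the second form leaves the indexed column untouched on one side of the comparison, which is what allows SQL Server to isolate a key value or a key range.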

Table 2-1. Examples of Refactoring Non-SARGable Predicates into SARGable Ones


Another important factor that you must keep in mind is type conversion. In some cases, you can make predicates non-SARGable by using incorrect data types. Let's create a table with a varchar column and populate it with some data, as shown in Listing 2-6.

Listing 2-6. SARG predicates and data types: Test table creation

create table dbo.Data
(
    VarcharKey varchar(10) not null,
    Placeholder char(200)
);

create unique clustered index IDX_Data_VarcharKey
on dbo.Data(VarcharKey);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.Data(VarcharKey)
    select convert(varchar(10),ID) from IDs;

The clustered index key column is defined as varchar, even though it stores integer values. Now, let's run two selects, as shown in Listing 2-7, and look at the execution plans.

Listing 2-7. SARG predicates and data types: Select with integer parameter

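declare
    @IntParam int = 200;

-- Integer parameter compared with the varchar column
select * from dbo.Data where VarcharKey = @IntParam;

-- Parameter explicitly converted to the column's data type
select * from dbo.Data where VarcharKey = convert(varchar(10),@IntParam);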

As you can see in Figure 2-14, in the case of the integer parameter, SQL Server scans the clustered index, converting varchar to an integer for every row. In the second case, SQL Server converts the integer parameter to a varchar at the beginning and utilizes a much more efficient clustered index seek operation.

Figure 2-14. SARG predicates and data types: Execution plans with integer parameter

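The reason the column, and not the parameter, gets converted is data type precedence: int ranks higher than varchar, so the varchar side of the comparison is converted to int. The short sketch below illustrates why that rules out a seek. Several different strings convert to the same integer, so SQL Server cannot look for one specific string value in the index.

```sql
-- The string literal is converted to int before the comparison,
-- so all three expressions return 'match'.
select
    case when '200'  = 200 then 'match' else 'no match' end as Plain,
    case when '0200' = 200 then 'match' else 'no match' end as LeadingZero,
    case when ' 200' = 200 then 'match' else 'no match' end as LeadingSpace;
```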

■ Tip Pay attention to the column data types in the join predicates. Implicit or explicit data type conversions can significantly decrease the performance of the queries.


You will observe very similar behavior in the case of unicode string parameters. Let's run the queries shown in Listing 2-8. Figure 2-15 shows the execution plans for the statements.

Listing 2-8. SARG predicates and data types: Select with string parameter

select * from dbo.Data where VarcharKey = '200';

select * from dbo.Data where VarcharKey = N'200'; -- unicode parameter

As you can see, a unicode string parameter is non-SARGable for varchar columns. This is a much bigger issue than it appears to be. While you rarely write queries in the way shown in Listing 2-8, most application development environments nowadays treat strings as unicode. As a result, SQL Server client libraries generate unicode (nvarchar) parameters for string objects unless the parameter data type is explicitly specified as varchar. This makes the predicates non-SARGable, and it can lead to major performance hits due to unnecessary scans, even when varchar columns are indexed.


Figure 2-15. SARG predicates and data types: Execution plans with string parameter


■ Important Always specify parameter data types in client applications. For example, in ADO.Net, use Parameters.Add("@ParamName", SqlDbType.VarChar, <Size>).Value = stringVariable instead of the Parameters.Add("@ParamName").Value = stringVariable overload. Use mapping in ORM frameworks to explicitly specify non-unicode attributes in the classes.

It is also worth mentioning that varchar parameters are SARGable for nvarchar unicode data columns.
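You can approximate what a client application sends by running parameterized statements through sp_executesql. The sketch below reuses the dbo.Data table from Listing 2-6. The first call mimics a client library that declares the string parameter as nvarchar, which is an assumption about the client; the second declares it as varchar to match the column. Comparing the two execution plans shows the difference, although the exact plan shape can also depend on the column's collation.

```sql
-- nvarchar parameter, as a client library may send it when no type is specified
exec sp_executesql
    N'select * from dbo.Data where VarcharKey = @Key',
    N'@Key nvarchar(3)',
    @Key = N'200';

-- varchar parameter that matches the data type of the column
exec sp_executesql
    N'select * from dbo.Data where VarcharKey = @Key',
    N'@Key varchar(10)',
    @Key = '200';
```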
