On the Selection of SQL Server Clustered Index Keys from the Perspective of Performance

For details, see: http://www.verydemo.com/demo_c155_i1328.html


 

The clustered index column or combination of columns should preferably be unique

    The reason follows from how data is stored. SQL Server does not store data in units of rows but in units of pages, so the smallest unit it reads during a lookup is a page. Even if you look up a single small row, SQL Server reads the entire page containing it into the buffer pool.

    Each page is 8 KB in size and has a physical address that SQL Server writes as file number:page number (understanding file numbers requires some knowledge of files and filegroups). For example, page 50 of the first file has the address 1:50. When a table has no clustered index, its data pages are stored as a heap. Within a page, SQL Server identifies each row by an additional row number, which gives the well-known RID (row identifier). A RID is written as file number:page number:row number. If a row is the fifth row of the page mentioned above, its RID is 1:50:5, as shown in Figure 1.

   

    Figure 1. Example of RID
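
    The file:page:row notation can be illustrated with a minimal Python sketch. The `RID` type and the `parse_rid` helper are hypothetical names introduced only for this illustration; they are not part of SQL Server or of any library.

```python
from typing import NamedTuple


class RID(NamedTuple):
    """A row identifier in the file:page:row form used in the text."""
    file_id: int
    page_id: int
    row_no: int

    def __str__(self) -> str:
        return f"{self.file_id}:{self.page_id}:{self.row_no}"


def parse_rid(text: str) -> RID:
    """Split 'file:page:row' into its three integer parts."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected file:page:row, got {text!r}")
    file_id, page_id, row_no = (int(p) for p in parts)
    return RID(file_id, page_id, row_no)


# The example from the text: the fifth row on page 50 of file 1.
rid = parse_rid("1:50:5")
print(rid.file_id, rid.page_id, rid.row_no)
print(rid)
```

    The point of the sketch is only that a RID encodes a location: given the three numbers, the row can be found without any search.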

  

    As the format shows, a RID is not only what SQL Server uses to identify each row uniquely but also the storage location of the row. When pages are organized as a heap, they are rarely moved.

    When a clustered index is created on a table, the table's pages are organized as a B-tree. SQL Server then no longer locates rows by RID; it searches by key, and the key is the column or columns of the clustered index. Suppose that in the table in Figure 1 we make the DepartmentID column the clustered index column. Each row in a non-leaf node of the B-tree then contains only the DepartmentID and a bookmark (pointer) to the node at the next level.

    When the values in the clustered index column are not unique, SQL Server cannot identify a row by the clustered index key alone. To keep every row distinguishable, it generates an additional piece of identifying information for rows that share the same key value, the so-called uniquifier. Using a uniquifier affects performance in two ways:

  •     On every insert or update, SQL Server must check whether the new key value duplicates an existing one. If it does, a uniquifier has to be generated, which is an extra cost.
  •     The key becomes larger, because the uniquifier is added to distinguish rows with the same key value. As a result, both leaf and non-leaf nodes need more pages. Nonclustered indexes are affected as well: their bookmark column grows, so they also need more pages.
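
    The two costs listed above can be sketched with a toy model in Python. This is not how SQL Server implements uniquifiers internally; the function names are hypothetical, and the model assumes that the first row with a given key carries no uniquifier while each later duplicate receives a 4-byte one.

```python
from collections import defaultdict

UNIQUIFIER_BYTES = 4  # assumption: size of the hidden uniquifier when present


def assign_uniquifiers(keys):
    """Return (key, uniquifier) pairs; the first row with a key gets None."""
    seen = defaultdict(int)
    result = []
    for key in keys:
        count = seen[key]  # the duplicate check that every insert has to pay for
        result.append((key, None if count == 0 else count))
        seen[key] += 1
    return result


def extra_key_bytes(rows):
    """Bytes added to the keys by uniquifiers in this toy model."""
    return sum(UNIQUIFIER_BYTES for _, u in rows if u is not None)


rows = assign_uniquifiers([10, 10, 20, 30, 30, 30])
print(rows)
print(extra_key_bytes(rows))
```

    With unique keys the second element of every pair stays `None` and no extra bytes are added, which is the case the heading recommends.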

    Let's test this. Create a test table with a clustered index and insert 100,000 rows of test data in which every key value occurs twice, as shown in Figure 2.

   

    Figure 2. Test code to insert data

    

    Now check the number of pages the table occupies, as shown in Figure 3.

   

    Figure 3. 100,000 rows occupy 359 pages when the keys contain duplicates

 

    Next, repeat the test with 100,000 rows whose keys are all unique, as shown in Figure 4.

   

    Figure 4. Test code to insert 100,000 rows with unique keys

 

    This time the number of pages occupied drops to 335, as shown in Figure 5.

   

    Figure 5. Page count drops to 335 with unique keys
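
    The difference between the two measurements can be put into numbers with a short Python calculation. The page counts and the 8 KB page size come from the text; treating half of the 100,000 rows as the "repeated" ones is an assumption based on every key value occurring twice.

```python
PAGE_BYTES = 8 * 1024  # page size stated in the text (8 KB)

pages_duplicate_keys = 359  # Figure 3
pages_unique_keys = 335     # Figure 5
total_rows = 100_000
# Assumption: each key occurs twice, so one row of every pair is a repeat.
rows_with_repeated_key = total_rows // 2

extra_pages = pages_duplicate_keys - pages_unique_keys
extra_percent = 100 * extra_pages / pages_unique_keys
extra_bytes_per_repeat = extra_pages * PAGE_BYTES / rows_with_repeated_key

print(extra_pages)                       # 24
print(round(extra_percent, 1))           # 7.2
print(round(extra_bytes_per_repeat, 2))  # 3.93
```

    Under these assumptions the duplicate keys cost about 7% more pages, or a few extra bytes per repeated row. This is a rough estimate, since the reported page counts may include pages other than data pages.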

 

    Therefore, it is recommended that the column or columns of the clustered index be unique.

 

It is better to use a narrow column or combination of narrow columns as a clustered index column

    The reasoning is the same as for reducing pages above: narrow columns make the key smaller. The clustered index then needs fewer non-leaf pages, and because the bookmarks in nonclustered indexes shrink as well, those indexes need fewer leaf pages. Fewer pages to read ultimately means better performance.
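
    The effect of key width on nonclustered indexes can be estimated with a simplified model in Python. The `leaf_pages` function, the 9-byte per-row overhead, and the column sizes are hypothetical values chosen for illustration; only the 8 KB page size comes from the text, and the 96-byte page header is an assumption about the page layout.

```python
import math

PAGE_BYTES = 8 * 1024    # page size stated in the text
PAGE_HEADER_BYTES = 96   # assumption: fixed page header size
ROW_OVERHEAD_BYTES = 9   # hypothetical per-row overhead, for illustration only


def leaf_pages(row_count, index_key_bytes, clustered_key_bytes):
    """Estimate leaf pages of a nonclustered index in this simplified model.

    Each leaf row holds the nonclustered key plus the clustered key (the bookmark).
    """
    row_bytes = index_key_bytes + clustered_key_bytes + ROW_OVERHEAD_BYTES
    rows_per_page = (PAGE_BYTES - PAGE_HEADER_BYTES) // row_bytes
    return math.ceil(row_count / rows_per_page)


rows = 1_000_000
narrow = leaf_pages(rows, index_key_bytes=8, clustered_key_bytes=4)   # e.g. a 4-byte integer key
wide = leaf_pages(rows, index_key_bytes=8, clustered_key_bytes=16)    # e.g. a 16-byte GUID key
print(narrow, wide)
```

    In this model the wider clustered key makes the same nonclustered index noticeably larger, and the cost is paid again by every additional nonclustered index on the table.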

 

Use a column or combination of columns whose values rarely change as a clustered index column

    As explained earlier, once a table has a clustered index, SQL Server finds rows by key. Because the data is kept in key order in the B-tree, changing a clustered index key value means changing not only the value itself but also the position of the row that holds it, so the row may have to move from one page to another to preserve the order. This leads to the following problems:

  •     Moving a row from one page to another has a cost of its own. The move may also affect other rows, which then have to be moved too, and this can cause page splits
  •     Moving rows between pages creates index fragmentation
  •     The key change also affects nonclustered indexes, whose bookmarks must be updated as well, which is additional overhead
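
    The problems listed above can be seen in a small Python toy model. Two sorted lists stand in for the clustered index and one nonclustered index; the table contents and the `update_clustering_key` function are hypothetical and only illustrate the bookkeeping, not SQL Server's actual page handling.

```python
import bisect

# Clustered index: rows kept sorted by the clustering key.
clustered = [(10, "Ann"), (20, "Bob"), (30, "Cid")]
# Nonclustered index on name: each entry carries the clustering key as its bookmark.
nonclustered = [("Ann", 10), ("Bob", 20), ("Cid", 30)]


def update_clustering_key(old_key, new_key):
    """Change a row's clustering key; return its (old, new) position in key order."""
    old_pos = next(i for i, (k, _) in enumerate(clustered) if k == old_key)
    _, name = clustered.pop(old_pos)
    new_pos = bisect.bisect_left(clustered, (new_key, name))
    clustered.insert(new_pos, (new_key, name))   # the row moves to keep the order
    entry = nonclustered.index((name, old_key))
    nonclustered[entry] = (name, new_key)        # the bookmark must change too
    return old_pos, new_pos


print(update_clustering_key(10, 35))
print(clustered)
print(nonclustered)
```

    A single key change touches both structures: the row is relocated in the clustered order, and every nonclustered entry that points at it has to be rewritten.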

    This is why many tables use a column unrelated to the data itself as the primary key. The Person.Address table in the AdventureWorks database, for example, uses the AddressID column, which carries no business meaning, as its clustered index column, as shown in Figure 6. If AddressLine1 were used as the primary key instead, every change of an employee's address could cause the problems listed above.

   

    Figure 6. A column independent of the data itself used as the clustered index column

 

It is best to use auto-incrementing columns as clustered index columns

    This recommendation again amounts to creating a column unrelated to the data itself, here an auto-incrementing one, as the clustered index column. When a new row's clustered index key falls in the middle of the existing ordered B-tree, other rows have to be moved to make room for it, so page splits and index fragmentation may occur, along with the extra burden of modifying nonclustered indexes. With an auto-incrementing column, new rows are always added at the end, which greatly reduces page splits and fragmentation.

    I ran into exactly this recently. Every few months a table became extremely slow, and a first look pointed to heavy index fragmentation. Rebuilding the index every few months was wearing me out. Eventually I found that the original designers had built the clustered index on a GUID column. GUIDs are generated randomly, so new rows could land anywhere in the table, which greatly increased fragmentation and caused the situation above.
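
    The contrast between ever-increasing keys and random keys can be reproduced with a small simulation in Python. It is a toy model, not SQL Server's storage engine: `PAGE_CAPACITY` and `simulate_inserts` are hypothetical, a full page that receives a key in its middle is split in half, and appending past the end of the last page is assumed to start a fresh page without moving any rows.

```python
import bisect
import random

PAGE_CAPACITY = 100  # hypothetical number of rows that fit on one page


def simulate_inserts(keys):
    """Insert unique keys into sorted fixed-capacity pages; return (splits, average fill)."""
    pages = [[]]  # pages in key order, each a sorted list of keys
    lows = []     # lows[j] is the lowest key of pages[j + 1]
    splits = 0
    total = 0
    for key in keys:
        total += 1
        i = bisect.bisect_right(lows, key)  # index of the page that should hold key
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
            continue
        if i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the last full page: new page, no rows move.
            pages.append([key])
            lows.append(key)
            continue
        # Otherwise split the full page and move its upper half to a new page.
        splits += 1
        half = PAGE_CAPACITY // 2
        new_page = page[half:]
        del page[half:]
        pages.insert(i + 1, new_page)
        lows.insert(i, new_page[0])
        bisect.insort(new_page if key >= new_page[0] else page, key)
    return splits, total / (len(pages) * PAGE_CAPACITY)


n = 10_000
sequential = simulate_inserts(range(n))                              # auto-increment style keys
scattered = simulate_inserts(random.Random(0).sample(range(10**9), n))  # GUID-like random keys
print(sequential)
print(scattered)
```

    In this model the sequential keys cause no splits and leave every page full, while the random keys trigger many splits and leave the pages only partly filled, which is the fragmentation pattern described in the anecdote.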

 

Summary

    This article briefly introduced how SQL Server stores data and several clustered index choices that should be avoided. It considers the choice of clustered index only from the perspective of performance; in practice the decision requires weighing other factors as well.
