clustered index, non-clustered index

Reposted from: https://www.cnblogs.com/aspnethot/articles/1504082.html

clustered index

  An index in which the logical order of the key values determines the physical order of the corresponding rows in a table.
  A clustered index determines the physical order of data in a table. A clustered index is similar to a telephone book, which arranges data by last name. Because the clustered index dictates the physical storage order of the data, a table can contain only one clustered index. The index can, however, comprise multiple columns (a composite index), just as a telephone book is organized by last name and then first name.
     
     Clustered indexes are especially effective for columns that are frequently searched for a range of values. Once the row containing the first value has been found through the clustered index, the rows containing the subsequent index values are guaranteed to be physically adjacent. For example, if an application frequently retrieves records for a date range, the clustered index can quickly find the row containing the start date and then read all adjacent rows until the end date is reached. This improves the performance of such queries. Likewise, if a column is frequently used to sort the data retrieved from a table, clustering the table on that column (physically sorting it) saves the cost of sorting every time the column is queried.
     

     Using a clustered index to find a specific row is also efficient when the index values are unique. For example, the fastest way to look up a specific employee by the unique employee ID column emp_id is to create a clustered index or PRIMARY KEY constraint on the emp_id column.
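The range-scan behaviour described above can be imitated in a few lines of Python. This is only a sketch under stated assumptions: a sorted Python list stands in for a table clustered on a date column, and the names rows, keys and range_scan are made up for illustration. SQL Server's real storage engine works on B-tree pages, not lists.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical table, kept physically sorted by its date key
# (this ordering is what a clustered index on the date column provides).
rows = [
    (date(2004, 1, 5), "doc A"),
    (date(2004, 2, 9), "doc B"),
    (date(2004, 3, 1), "doc C"),
    (date(2004, 3, 1), "doc D"),
    (date(2004, 6, 20), "doc E"),
    (date(2004, 11, 2), "doc F"),
]
keys = [r[0] for r in rows]  # the key column, in storage order


def range_scan(start, end):
    """Return rows with start <= key <= end as one contiguous slice."""
    lo = bisect_left(keys, start)   # binary search for the first match
    hi = bisect_right(keys, end)    # position just past the last match
    return rows[lo:hi]              # adjacent rows, no further searching


result = range_scan(date(2004, 2, 1), date(2004, 6, 30))
print([title for _, title in result])  # ['doc B', 'doc C', 'doc D', 'doc E']
```

Only the two boundary positions are searched for; everything between them is read in storage order, which is the property the text attributes to a clustered index.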

 

 

 

nonclustered index

  An index in which the logical order of the index differs from the physical storage order of the rows on disk.

 

 

Indexes are stored as B-trees (balanced trees). One way to understand the difference: in a clustered index, the leaf nodes are the data rows themselves. In a nonclustered index, the leaf nodes are still index nodes, each holding a pointer to the corresponding data. As shown below:

 

 

 

(Figure: nonclustered index)

(Figure: clustered index)

 

 

    1. Understanding the index structure in simple terms

      You can think of an index as a special kind of directory. Microsoft SQL Server provides two types of index: the clustered index and the nonclustered index. The following example illustrates the difference between them.
      The body of a Chinese dictionary is itself a clustered index. Suppose we want to look up the character "安". We naturally turn to the first few pages of the dictionary, because the pinyin of "安" is "an", and a dictionary that sorts characters by pinyin begins with the letter "a" and ends with "z", so "安" sits near the front. If you search the whole "a" section and still cannot find the character, it is not in your dictionary. Likewise, to look up "张" you turn to the back of the dictionary, because its pinyin is "zhang". In other words, the body of the dictionary is itself a directory, and you do not need to consult any other directory to find what you want. Content that is itself arranged according to a rule in this way is called a "clustered index".
      If you know how a character is pronounced, you can look it up quickly by its pinyin. But you may also meet a character you do not recognize and cannot pronounce. Then the method above does not work: you have to find the character through the radical index, and then turn directly to the page number printed next to it. The order of the characters you see in the radical index and lookup table is not the order of the dictionary body. For example, when we look up "张" ("zhang"), the lookup table gives its page number as 672. The character listed above it, "chi", is on page 63, and the character listed below it, "crossbow", is on page 390. Clearly these characters do not really sit immediately before and after "张" in the body. The consecutive run "chi, zhang, crossbow" that you see is their order in the nonclustered index, which maps onto entries in the dictionary body. We can find the character we need this way, but it takes two steps: first find the entry in the directory, then turn to the page it gives. A directory that is purely a directory, kept separate from a body that is purely body text, is called a "nonclustered index".
      These examples show what a "clustered index" and a "nonclustered index" are. Taking the idea one step further, it is easy to see why each table can have only one clustered index: the body can be sorted in only one way.
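The two-step lookup in the dictionary analogy can be sketched in Python. This is an illustrative toy, not how a database stores anything: the names body, radical_table and lookup_via_radical_table are invented here, and only the three page numbers quoted in the example (63, 390, 672) are taken from the text.

```python
# Dictionary body: entries stored in page order, so the body itself
# plays the role of the "clustered index". Maps page -> character.
body = {63: "chi", 390: "crossbow", 672: "zhang"}

# Radical lookup table: its own ordering (chi, zhang, crossbow) differs
# from the body's page order; each entry only holds a page number.
radical_table = [("chi", 63), ("zhang", 672), ("crossbow", 390)]


def lookup_via_radical_table(character):
    """Two steps: find the page number in the table, then turn to the page."""
    for entry, page in radical_table:      # step 1: search the directory
        if entry == character:
            return page, body[page]        # step 2: follow the pointer
    return None


print(lookup_via_radical_table("zhang"))   # (672, 'zhang')
```

The lookup table's neighbours ("chi" and "crossbow") are far apart in the body, which is why a nonclustered index says nothing about physical adjacency.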

    2. When to use a clustered or nonclustered index
 

The following table summarizes when to use clustered or nonclustered indexes (important):

Action                                        Clustered index   Nonclustered index
Columns often used for grouping and sorting   should            should
Returning data within a range                 should            should not
One or very few distinct values               should not        should not
Small number of distinct values               should            should not
Large number of distinct values               should not        should
Frequently updated columns                    should not        should
Foreign key columns                           should            should
Primary key columns                           should            should
Frequently modified index columns             should not        should



      In fact, the table above follows from the earlier definitions of clustered and nonclustered indexes. Take returning data within a range. Suppose one of your tables has a date column and the clustered index is built on that column. A query for all data between January 1, 2004 and October 1, 2004 will then be very fast: because the body of your "dictionary" is sorted by date, the clustered index only has to locate the first and last rows of the range. With a nonclustered index, by contrast, you must first look up the page number of every item in the directory and then fetch each item's content from that page.
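The difference between the two access paths can be made concrete by counting pages. The Python sketch below is an illustration under assumed numbers, not a measurement of SQL Server: PAGE_SIZE, N_ROWS and the function pages_touched are all invented here. It stores the same rows once in arrival order (as when the table is clustered on an auto-increment ID) and once in date order (as when it is clustered on the date), then counts how many pages contain rows from one date range.

```python
import random

PAGE_SIZE = 100          # assumed rows per page, purely illustrative
N_ROWS = 10_000
random.seed(0)           # fixed seed so the run is repeatable

# Each row gets a random "day number" 0..999 as its date column.
days = [random.randrange(1000) for _ in range(N_ROWS)]


def pages_touched(storage_order, lo, hi):
    """Count distinct pages holding rows whose day is in [lo, hi)."""
    pages = set()
    for position, day in enumerate(storage_order):
        if lo <= day < hi:
            pages.add(position // PAGE_SIZE)
    return len(pages)


heap_order = days                    # rows stored in arrival (ID) order
clustered_order = sorted(days)       # rows stored in date order

heap_pages = pages_touched(heap_order, 900, 1000)
clustered_pages = pages_touched(clustered_order, 900, 1000)
print(heap_pages, clustered_pages)
```

With the rows in date order the matching rows sit on a handful of neighbouring pages. In arrival order they are scattered across almost every page, so following one pointer per row ends up touching nearly the whole table.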

    3. Misconceptions about index use, drawn from practice

      The point of theory is to apply it. Although we have just listed when clustered or nonclustered indexes should be used, in practice these rules are easily overlooked, or are not weighed properly against the actual situation. Below we discuss common misconceptions about index use, based on problems encountered in practice, so that you can master how to design indexes.

    1. The primary key is the clustered index
      I think this idea is badly mistaken and wastes the clustered index, even though SQL Server creates a clustered index on the primary key by default.
      Usually we create an ID column in each table to tell rows apart. The column is auto-incrementing, generally with a step of 1. The Gid column in our office automation example is such a column. If we make this column the primary key, SQL Server makes it the clustered index by default. The benefit is that your data is physically sorted by ID in the database, but I do not think that is worth much.
      The advantages of a clustered index are obvious, and the rule that each table can have only one makes it all the more precious.
      From the definition of the clustered index given earlier, its biggest advantage is that it quickly narrows the range a query has to examine and so avoids a full table scan. In practice, because ID numbers are generated automatically, we do not know the ID of any given record, so we seldom query by ID. Using the ID primary key as the clustered index therefore wastes a resource. Second, building the clustered index on a column in which every value is different breaks the rule that "columns with a large number of distinct values should not have a clustered index". Admittedly, this only has a negative effect when users frequently modify record contents, especially the indexed column; it does not affect query speed.
      In the office automation system, whether it is the home page listing documents and meetings awaiting the user's signature, or a user's document search, every data query depends on the "date" column and the user's own "username".
      The home page usually shows the documents or meetings that each user has not yet signed for. Our where clause could restrict the query only to items the current user has not signed for, but if the system has been running for a long time and holds a lot of data, scanning the full table every time a user opens the home page makes little sense: most users have already read the documents from more than a month ago, so this only adds load to the database. Instead, when a user opens the home page, the database can query only the documents the user has not read from the past 3 months, using the "date" column to limit the scan and speed up the query. If your office automation system has been running for 2 years, the home page will theoretically display 8 times faster than before (3 months is one eighth of 24 months), or even faster.
      I say "theoretically" because if your clustered index is still blindly built on the ID primary key, your queries will not be that fast, even if you create a (nonclustered) index on the "date" column. Let's look at how various queries perform on 10 million rows (250,000 of them within the last 3 months):

    (1) Clustered index on the primary key only, no date restriction:

    Select gid, fariqi, neibuyonghu, title from tgongwen

    Time: 128470 milliseconds (about 128 seconds)

    (2) Clustered index on the primary key and a nonclustered index on fariqi:

    select gid, fariqi, neibuyonghu, title from Tgongwen
    where fariqi > dateadd(day,-90,getdate())

    time: 53763 milliseconds (54 seconds)

    (3) Clustered index on the date column (fariqi):

    select gid, fariqi, neibuyonghu, title from Tgongwen
    where fariqi > dateadd(day, -90, getdate())

    time: 2423 milliseconds (2 seconds)

      Although each statement extracts 250,000 rows, the differences between the cases are huge, especially once the clustered index is built on the date column. In fact, if your database really holds 10 million rows and the primary key (with its clustered index) is on the ID column, as in cases (1) and (2) above, the web page simply times out and displays nothing. This is one of the most important reasons I abandoned the ID column as the clustered index. The timings above were obtained by adding the following before each select statement:

    declare @d datetime
    set @d=getdate()

    and add after the select statement:

    select [statement execution time (milliseconds)]=datediff(ms,@d,getdate())

    2. Any index will significantly improve query speed
      In the example above, statements (2) and (3) are exactly the same, and so is the indexed column. The only difference is that the former has a nonclustered index on fariqi while the latter has a clustered index on it, yet the query speeds are worlds apart. So simply building an index on some column does not necessarily improve query speed.
      From the table creation statement, we can see that the fariqi column in this 10-million-row table holds 5003 distinct values. A clustered index on this column fits perfectly. In reality we send out several documents every day, all with the same publication date, which matches the requirement for a clustered index exactly: the values are "neither mostly identical nor almost all distinct". Seen this way, building an "appropriate" clustered index is very important for improving query speed.

    3. Adding every column that needs faster queries to the clustered index will improve query speed
      As mentioned above, the columns that data queries cannot do without are the "date" and the user's own "username". Since both columns are so important, we can combine them into a composite index.
      Many people believe that adding any column to the clustered index improves query speed. Others wonder: if the columns of a composite clustered index are queried separately, will the query be slower? With this question in mind, let's look at the following query speeds (every result set is 250,000 rows; the date column fariqi is the leading column of the composite clustered index and the username column neibuyonghu is second):

    (1) select gid, fariqi, neibuyonghu,

    Query speed: 2513 milliseconds

    (2) select gid, fariqi, neibuyonghu, title from Tgongwen
                where fariqi > '2004-5-5' and neibuyonghu = 'office'

    Query speed: 2516 milliseconds

    (3) select gid, fariqi, neibuyonghu, title from Tgongwen where neibuyonghu = 'office'

    Query speed: 60280 milliseconds

      These experiments show that a query using only the leading column of the composite clustered index is about as fast as one using all of its columns, and even slightly faster (for result sets of the same size). A query that uses only a non-leading column of the composite clustered index gets no benefit from the index at all. Statements (1) and (2) run at the same speed because they return the same number of rows. If all the columns of the composite index are used and the result set is small, the index "covers" the query, and performance is at its best. Also remember: whether or not you often use the other columns of the clustered index, its leading column must be the most frequently used column.
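Why the leading column matters can be seen from how a composite key sorts. The following Python sketch is a simplified model rather than SQL Server's implementation: the sample entries and the functions by_leading_column and by_second_column are hypothetical, and "entries examined" is only a stand-in for real I/O cost.

```python
from bisect import bisect_right

# Hypothetical composite index on (fariqi, neibuyonghu): entries are kept
# sorted by date first, and by user name only within the same date.
index = sorted([
    ("2004-03-01", "office", 1),
    ("2004-05-04", "sales", 2),
    ("2004-05-06", "office", 3),
    ("2004-05-06", "sales", 4),
    ("2004-07-19", "hr", 5),
    ("2004-09-16", "office", 6),
])
dates = [entry[0] for entry in index]   # leading column, in index order


def by_leading_column(after_date):
    """fariqi > after_date: binary search, then read the tail of the index."""
    start = bisect_right(dates, after_date)
    return index[start:], len(index) - start      # rows, entries examined


def by_second_column(user):
    """neibuyonghu = user: the ordering gives no help, so scan everything."""
    hits = [entry for entry in index if entry[1] == user]
    return hits, len(index)                       # rows, entries examined


print(by_leading_column("2004-05-05")[1])   # 4
print(by_second_column("office")[1])        # 6
```

Matches on the leading column form one contiguous run, while matches on the second column are spread through the whole index, so a filter on the second column alone cannot skip anything.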

    4. Lessons about index use that are not found in other books

    1. Using the clustered index is faster than using a primary key that is not the clustered index






    select gid, fariqi, neibuyonghu, reader, title from Tgongwen where gid<=250000

    Usage time: 4470 milliseconds

    Here, using the clustered index is nearly 1/4 faster than using a primary key that is not the clustered index.

    2. Using the clustered index column in order by is faster than using an ordinary primary key, especially with small amounts of data

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen order by fariqi

    Time: 12936 milliseconds

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen order by gid

    Time: 18843 milliseconds

      Here, ordering by the clustered index column is about 3/10 faster than ordering by an ordinary primary key. In fact, with a small amount of data, sorting on the clustered index column is significantly faster than sorting on a nonclustered index column; with a large amount of data, say more than 100,000 rows, the difference between the two is not obvious.

    3. When a date range is queried through the clustered index, search time falls in proportion to the share of the table that is returned, regardless of how many clustered index columns are used:

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi > '2004-1-1'

    Time: 6343 milliseconds (1 million rows extracted)

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi > '2004-6-6'

    Time: 3170 milliseconds (500,000 rows extracted)

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen where fariqi = '2004-9-16'

    Time: 3326 milliseconds (practically the same as the previous result: when the number of rows returned is the same, > and = perform alike)

    select gid, fariqi, neibuyonghu, reader, title from Tgongwen
                where fariqi > '2004-1-1' and fariqi < '2004-6-6'

    Time: 3280 milliseconds
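The proportionality claim can be checked against the two measurements above that state both a row count and a time. The short Python calculation below uses only those quoted figures. The variable names are arbitrary, and two data points are of course a sanity check rather than proof.

```python
# Row counts and timings quoted in the first two measurements above.
measurements = [
    (1_000_000, 6343),   # fariqi > '2004-1-1': rows returned, milliseconds
    (500_000, 3170),     # fariqi > '2004-6-6'
]

for rows, ms in measurements:
    print(f"{rows:>9} rows: {ms / rows * 1000:.2f} ms per 1000 rows")

row_ratio = measurements[0][0] / measurements[1][0]
time_ratio = measurements[0][1] / measurements[1][1]
print(f"rows x{row_ratio:.2f}, time x{time_ratio:.2f}")
```

Halving the number of rows returned halves the elapsed time almost exactly, at roughly 6.34 ms per thousand rows in both cases.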

    4. A date column does not slow queries down because it stores minutes and seconds
      The following example has 1 million rows in total. The 500,000 rows dated after January 1, 2004 contain only two distinct dates, accurate to the day. The 500,000 rows dated before it contain 5,000 distinct dates, accurate to the second.

    select gid, fariqi, neibuyonghu, reader,




    select gid, fariqi, neibuyonghu, reader, title from Tgongwen
                where fariqi < '2004-1-1' order by fariqi

    Elapsed time: 6453 milliseconds

    5. Other precautions

      "Water can carry a boat, and it can also capsize it." The same goes for indexes. Indexes help improve retrieval performance, but too many indexes, or inappropriate ones, make the system inefficient, because every index added to a table gives the database more work to do. Too many indexes can even lead to index fragmentation.
      So we need to build an "appropriate" set of indexes, and take particular care over the clustered index, so that the database performs well.
      Of course, in practice, as a conscientious database administrator you should test several options to find out which is the most efficient.
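The extra work that each index adds on writes can be illustrated with a toy model in Python. Everything in it is an assumption made for the sketch: the Table class, its index_updates counter and the sample column values are invented, and a sorted list is a crude substitute for a real B-tree.

```python
from bisect import insort


class Table:
    """Toy table: a list of rows plus one sorted list per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.index_updates = 0   # counts extra work caused by indexes

    def insert(self, row):
        self.rows.append(row)
        position = len(self.rows) - 1
        for col, entries in self.indexes.items():
            insort(entries, (row[col], position))  # keep index sorted
            self.index_updates += 1


lean = Table(["fariqi"])
heavy = Table(["fariqi", "neibuyonghu", "title", "reader"])
for i in range(1000):
    row = {"fariqi": i % 90, "neibuyonghu": f"user{i % 7}",
           "title": f"doc {i}", "reader": f"r{i % 13}"}
    lean.insert(row)
    heavy.insert(row)

print(lean.index_updates, heavy.index_updates)   # 1000 4000
```

Every insert has to update every index on the table, so four indexes cost four times the maintenance of one. That is the trade-off to weigh against faster reads when testing candidate index designs.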


Origin blog.csdn.net/qq_21514303/article/details/89215244