MySQL Condensed Notes (2)-Index Chapter

Main sources for these notes:

Axiu’s study notes (interviewguide.cn)

Xiaolin Coding (xiaolincoding.com)

How indexes improve query speed

An index turns unordered data into relatively ordered data (just like looking something up in a table of contents).

Why use indexes?

  • Creating a unique index guarantees the uniqueness of each row in a table.
  • An index can greatly speed up data retrieval, which is the main reason for creating one.
  • It helps the server avoid sorting and temporary tables.
  • It turns random I/O into sequential I/O.
  • It speeds up joins between tables, which is particularly valuable for enforcing referential integrity.

Why does InnoDB use an auto-incrementing id as the primary key?

If the table uses an auto-increment primary key, each new record is appended right after the last record of the current index page, and when a page fills up, a new page is opened automatically. If the primary key is not auto-incrementing (for example an ID card number or a student number), the inserted key values are roughly random, so each new record has to go somewhere in the middle of an existing index page. The resulting frequent record moves and page splits cause heavy fragmentation and leave the index structure less compact than it could be. OPTIMIZE TABLE then has to be run to rebuild the table and repack the pages.
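The effect of key order on page splits can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a "page" is a sorted Python list holding at most `PAGE_CAPACITY` keys (a made-up number, far smaller than a real 16KB page), and `insert_keys`, `fill_factor`, and `shuffled` are hypothetical helper names, not anything InnoDB exposes.

```python
import bisect
import random

PAGE_CAPACITY = 4  # assumption: tiny pages so that splits are easy to observe


def insert_keys(keys, capacity=PAGE_CAPACITY):
    """Insert keys into ordered, fixed-size pages; return (pages, split_count)."""
    pages = [[]]   # pages are kept in key order; each page is a sorted list
    splits = 0
    for key in keys:
        # Target page: the last page whose first key is <= key (page 0 otherwise).
        firsts = [p[0] for p in pages if p]
        idx = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[idx]
        if len(page) == capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Append pattern: the full page stays untouched, a new page is opened.
                pages.append([key])
                continue
            # Insert into the middle of a full page: split it and move half the records.
            mid = capacity // 2
            pages.insert(idx + 1, page[mid:])
            del page[mid:]
            splits += 1
            if key >= pages[idx + 1][0]:
                page = pages[idx + 1]
        bisect.insort(page, key)
    return pages, splits


def fill_factor(pages, capacity=PAGE_CAPACITY):
    """Fraction of page slots that actually hold a record."""
    return sum(len(p) for p in pages) / (len(pages) * capacity)


def shuffled(n, seed=0):
    """Keys 1..n in a reproducible pseudo-random order."""
    keys = list(range(1, n + 1))
    random.Random(seed).shuffle(keys)
    return keys
```

Calling `insert_keys(range(1, 201))` models an auto-increment key and `insert_keys(shuffled(200))` models a random-looking business key; comparing the split counts and `fill_factor` of the two results shows the fragmentation described above.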

Index classification

We can classify indexes from four perspectives.

  • By "data structure": B+tree index, hash index, full-text index.

  • By "physical storage": clustered index (primary key index), secondary index (auxiliary index).

  • By "field characteristics": primary key index, unique index, ordinary index, prefix index.

  • By "number of fields": single-column index, joint (composite) index.

What indexes are there in MySQL? What are the characteristics?

  • Normal index: only speeds up queries.
  • Unique index: speeds up queries + column values must be unique (NULL is allowed).
  • Primary key index: speeds up queries + column values must be unique (NULL is not allowed) + only one per table.
  • Combined index: several columns form one index, intended for searches on that combination of columns; it is more efficient than index merging.
  • Full-text index: segments text content into words and searches on them.
  • Index merging: a combined search served by several single-column indexes.
  • Covering index: the selected columns can be obtained from the index alone, without reading the data rows. In other words, the queried columns must be covered by the index that was built.
  • Clustered index: table data is stored together with the primary key. The leaf nodes of the primary key index store the row data (including the primary key value), while the leaf nodes of a secondary index store the row's primary key value. A B+ tree is the storage structure: non-leaf nodes hold only index keys, with neither the record contents nor the record address attached, and leaf nodes hold the primary key together with the full record.

There are four index types in MySQL. Can you briefly explain them?

  • FULLTEXT: a full-text index. Before MySQL 5.6 only the MyISAM engine supported it; InnoDB has supported it since MySQL 5.6. It can be declared in CREATE TABLE, ALTER TABLE, and CREATE INDEX, but only on CHAR, VARCHAR, and TEXT columns.
  • HASH: hash values are almost always unique and map naturally to key-value pairs, which makes them well suited to indexing. A hash index locates an entry in one step instead of descending a tree level by level, so it is extremely efficient. That efficiency is conditional, though: it holds only for "=" and "IN" conditions, not for range queries, sorting, or partial matches on a combined index.
  • BTREE: a BTREE index stores index values in a tree-shaped data structure. It is the default and most commonly used index type in MySQL.
  • RTREE: rarely used in MySQL and limited to geometry data types. Geometry columns are supported by MyISAM, InnoDB, NDB, and ARCHIVE (and formerly BDB), while SPATIAL (R-tree) indexes are available on MyISAM and, since MySQL 5.7, on InnoDB. Compared with BTREE, the advantage of RTREE lies in range search.

What are the two main data structures used by MySQL indexes?

  • Hash index. The underlying data structure is a hash table, so when most of the workload is single-record lookups, a hash index gives the fastest queries; in most other scenarios a BTree index is recommended.

  • BTree index. MySQL's BTree index uses the B+Tree variant of the B-tree. It stores index values in a balanced multi-way tree (not a binary tree), and every query starts at the root of the tree and walks down node by node until it reaches a leaf.

    The two main storage engines (MyISAM and InnoDB) implement it differently, however.
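The root-to-leaf walk of a BTree index lookup can be sketched with a tiny hand-built B+ tree. This is an illustrative model only: the `Leaf` and `Internal` classes, the `search` function, and the demo keys are all assumptions made for the example, and real engine pages are considerably more involved.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    values: list
    next: Optional["Leaf"] = None   # leaves are chained for ordered scans


@dataclass
class Internal:
    keys: list       # separator keys only, no row data
    children: list   # len(children) == len(keys) + 1


def search(node, key):
    """Walk from the root to a leaf; return (value or None, nodes visited)."""
    visited = 1
    while isinstance(node, Internal):
        # children[i] holds keys below keys[i]; the last child holds the rest.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v, visited
    return None, visited


def build_demo_tree():
    """A two-level tree: one root and three chained leaves."""
    l3 = Leaf([13, 16], ["row-13", "row-16"])
    l2 = Leaf([7, 10], ["row-7", "row-10"], next=l3)
    l1 = Leaf([1, 4], ["row-1", "row-4"], next=l2)
    return Internal([7, 13], [l1, l2, l3])
```

Every call to `search(build_demo_tree(), key)` visits the same number of nodes whether or not the key exists, because the data lives only in the leaves.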

What is a clustered index?

When the main text itself is arranged according to a rule and so serves as its own directory, that arrangement is called a "clustered index". The index and the data rows are stored together, and the leaf nodes are the data nodes.

A clustered index is like looking a character up by pinyin.

In fact, the main text of a Chinese dictionary is itself a clustered index. For example, to look up the character "安", we naturally open the first few pages of the dictionary, because its pinyin is "an" and a dictionary sorted by pinyin runs from the letter "a" to the letter "z", so "安" sits near the front. If you have gone through every entry starting with "a" and still cannot find the character, it is not in your dictionary. Similarly, to look up "张" you turn to the last part of the dictionary, because its pinyin is "zhang". In other words, the main text of the dictionary is itself a directory, and you do not need to consult any other directory to find what you are looking for.

What is a non-clustered index?

An arrangement in which the directory is purely a directory and the main text is purely main text is called a "non-clustered index". If the leaf nodes do not store data rows, the index is non-clustered.

A non-clustered index is like looking a character up by its radical.

If you know a character, you can quickly look it up by its pronunciation. But you may also meet a character you do not know and cannot pronounce. The previous method then fails, and you have to find the character by its radical and turn directly to the page number printed next to it. The order of the characters you see when combining the "radical directory" with the "character list" is not the real order of the main text, however. For example, when you look up "张" (zhang), the character list reached through the radical shows it on page 672. The entry above it in the character list is "驰" (chi), on page 63, and the entry below it is "弩" (nu), on page 390. Obviously these characters are not really located just before and after "张" in the main text. The three consecutive characters "驰, 张, 弩" are in their non-clustered index order, which is a mapping onto the characters of the main text. You can find the character you need this way, but it takes two steps: first find the entry in the directory, then turn to the page number it gives.

What is the difference between clustered index and non-clustered index?

With a clustered index, the lookup lands directly on the data you need. With a non-clustered index, the lookup yields the primary key value of the record, and that value is then used to find the data through the clustered index. The fundamental difference between the two is whether the order of the table records matches the order of the index.

The leaf nodes of a clustered index (InnoDB) are data nodes, while the leaf nodes of a non-clustered index (MyISAM) are still index nodes that hold a pointer to the corresponding data block.

Index disadvantages (since indexes have so many advantages, why not create an index on every column in the table?):

  • Indexes occupy physical space; the more indexes there are, the more space they take;

  • Creating and maintaining indexes takes time, and that time grows with the amount of data;

  • Indexes slow down inserts, deletes, and updates, because every such change requires the B+ tree to be maintained dynamically to keep the index in order.

When do you need to create a database index?

  • Fields with a uniqueness constraint, such as product codes;

  • Fields that are often used as WHERE query conditions, which speeds up queries on the whole table. If the condition involves more than one field, a joint index can be created.

  • Fields often used in GROUP BY and ORDER BY. The query then needs no separate sort step, because the records in the B+Tree are already sorted once the index exists.

When is there no need to create a database index?

  • Fields that are not used in WHERE, GROUP BY, or ORDER BY. The value of an index is fast lookup; if a field is never used to locate rows, there is usually no need to index it, because the index only takes up physical space.

  • Fields with a large amount of duplicate data. For example, a gender field only holds male and female. If the two values are evenly distributed in the table, searching for either value returns about half of the rows. An index is better left out in such cases, because the MySQL query optimizer will generally ignore the index and perform a full table scan when it finds that a value appears in a high percentage of the rows.

  • Tables with very little data;

  • Fields that are updated frequently. For example, do not index the user balance in an e-commerce project: the indexed values change all the time, and because the B+Tree must stay ordered, the index has to be maintained on every change, which hurts database performance.

Index coverage and back-to-table lookups

Index coverage: an index contains (covers) the values of all the fields the query needs.

If a query uses a secondary index but asks for data other than the primary key value, then after finding the primary key value in the secondary index it must fetch the data row from the clustered index. This process is called "returning to the table" (a back-to-table lookup), which means two B+ trees are searched to find the data. When the queried data is the primary key value, however, the secondary index alone can answer the query and the clustered index is not consulted. This is called "index coverage": only one B+ tree has to be searched to find the data.
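The difference between a covered query and a back-to-table lookup (returning to the table) can be sketched as follows. The model is deliberately minimal and its names are assumptions: plain dicts stand in for the two B+ trees, `clustered` maps a primary key to the full row, `idx_name` is a hypothetical secondary index on `name`, and `select_by_name` counts how many index trees a query touches.

```python
# Clustered index: primary key -> full row (stands in for the primary B+ tree).
clustered = {
    1: {"id": 1, "name": "apple", "price": 3},
    2: {"id": 2, "name": "pear", "price": 5},
}

# Secondary index on `name`: indexed value -> primary key value.
idx_name = {"apple": 1, "pear": 2}


def select_by_name(name, columns):
    """Return (result row or None, number of index trees searched)."""
    trees = 1                                  # the secondary index is always searched
    pk = idx_name.get(name)
    if pk is None:
        return None, trees
    entry = {"name": name, "id": pk}           # what the secondary index entry holds
    if set(columns) <= entry.keys():
        # Index coverage: every requested column is already in the index entry.
        return {c: entry[c] for c in columns}, trees
    row = clustered[pk]                        # back to the table: second tree search
    trees += 1
    return {c: row[c] for c in columns}, trees
```

Asking only for `id` or `name` is answered from one tree, while asking for `price` forces the second search in the clustered index.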

Index optimization (what should you pay attention to when creating an index?)

  • Prefix index optimization: build the index on the first few characters of a string field, which reduces the size of the index entries and effectively improves index query speed.

  • Covering index optimization: when the records found in the secondary index already contain everything the query needs, the clustered index is not consulted, which avoids returning to the table. Suppose we only need to query the name and price of a product and create a joint index on "product ID, name, price". If the requested data exists in that index, the query does not search the primary key index again, so the back-to-table lookup is avoided.

  • Prefer an auto-incrementing primary key: inserting a new record is then an append and does not require moving existing data, so inserts are very efficient.

  • Prefer NOT NULL index columns: NULL in an index column makes the optimizer's index selection more complicated and harder to optimize; a NULL value carries no meaning, yet it still occupies physical space.

  • Prevent index failure: avoid left or both-sides fuzzy matching; avoid calculations, functions, and type conversions on index columns; follow the leftmost matching principle when using joint indexes; and note that in a WHERE clause, if the column before OR is indexed and the column after OR is not, the index will not be used.
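For prefix index optimization, one common way to pick the prefix length is to compare the selectivity of each candidate prefix with the selectivity of the full column. The sketch below assumes the column values are available as a list of strings; `prefix_selectivity`, `shortest_good_prefix`, and the 0.95 threshold are hypothetical choices for illustration, not MySQL features.

```python
def prefix_selectivity(values, length):
    """Distinct prefixes divided by total rows (1.0 means every prefix is unique)."""
    return len({v[:length] for v in values}) / len(values)


def shortest_good_prefix(values, threshold=0.95):
    """Shortest prefix whose selectivity reaches `threshold` x the full-column selectivity."""
    full = len(set(values)) / len(values)
    longest = max(len(v) for v in values)
    for length in range(1, longest + 1):
        if prefix_selectivity(values, length) >= threshold * full:
            return length
    return longest


# Made-up sample data for the example.
emails = [
    "alice@example.com",
    "alina@example.com",
    "bob@example.com",
    "bobby@example.com",
]
```

With this sample, a three-character prefix distinguishes only half of the rows, while four characters already separate all of them, so a longer prefix would add index size without adding selectivity.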

Precautions for using MySQL indexes (preventing index failure)

  • Left or both-sides fuzzy matching, that is, either like %xx or like %xx%, causes index failure;

  • Using a function on an index column in the query condition causes the index to fail.

  • Performing an expression calculation on an index column in the query condition prevents the index from being used.

  • When MySQL compares a string with a number, it automatically converts the string into a number and then compares. If the index column is a string and the parameter in the condition is a number, the implicit type conversion is applied to the index column. Since implicit conversion is equivalent to applying the CAST function to the index column, it amounts to using a function on that column, so the index fails.

  • A joint index must follow the leftmost matching principle, that is, the index is matched starting from its leftmost column; otherwise the index fails.

  • In the WHERE clause, if the condition column before OR is an index column and the condition column after OR is not, the index fails.
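The leftmost matching principle for joint indexes can be expressed in a few lines. This is a simplified sketch: `usable_prefix` is a hypothetical helper that only considers equality conditions and ignores range predicates and any optimizer-specific techniques.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading columns of a joint index that equality conditions can use."""
    used = []
    for col in index_columns:
        if col not in equality_columns:
            break            # the first gap ends the usable part of the index
        used.append(col)
    return used


# Example joint index on (a, b, c); column names are placeholders.
joint_index = ["a", "b", "c"]
```

A condition on `a` and `c` can use only the `a` part of the index, and a condition on `b` and `c` alone cannot use the index at all, because the leftmost column is missing.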

Things to note when using indexes

  • Index columns that are searched frequently to speed up those searches;

  • Create indexes on columns that often appear in WHERE clauses to speed up condition checks.

  • Declare indexed columns NOT NULL where possible; NULL values complicate the optimizer's index selection and can, in some cases, lead it to skip the index and scan the full table.

  • Create indexes on columns that often need sorting; the index is already sorted, so queries can use that order to cut sorting time.

  • Avoid applying functions to fields in the WHERE clause, which prevents the index from being hit.

  • Indexes are very effective on medium to large tables, but on extra-large tables their maintenance cost becomes very high, so ordinary indexes are a poor fit there and logical indexes are suggested instead.

  • Index columns that are frequently used in joins, which are mainly foreign keys, to speed up the joins.

  • With InnoDB, use an auto-incrementing column as the primary key, that is, a logical primary key rather than a business primary key.

  • Delete indexes that have gone unused for a long time; unused indexes cause unnecessary performance loss.

  • When a LIMIT ... OFFSET query is slow, an index can be used to improve its performance.

What is the difference between the way MyISAM and InnoDB implement B-tree indexes?

  • MyISAM: index files and data files are separate. The data field of a B+Tree leaf node stores the address of the data record. During retrieval, the index is searched with the B+Tree search algorithm; if the key exists, the value of its data field is taken out and used as the address from which the data record is read. This is called a "non-clustered index".

  • InnoDB: the data file itself is the index file. The table data file is an index structure organized as a B+Tree, and the data field of each leaf node holds the complete record. The key of this index is the primary key of the table, so the InnoDB table data file is itself the primary index. This is called a "clustered index", and all other indexes are auxiliary indexes. The data field of an auxiliary index stores the primary key value of the corresponding record rather than its address, which is another difference from MyISAM.

    A search on the primary index finds the node holding the key and retrieves the data directly; a search on an auxiliary index first retrieves the primary key value and then goes through the primary index. For this reason it is not recommended to use an overly long field as the primary key when designing a table, nor a non-monotonic field, since that causes frequent page splits in the primary index.
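The two storage layouts can be contrasted with a toy model. Everything here is an assumption made for illustration: dicts stand in for B+ trees, a Python list stands in for the MyISAM data file, a list position stands in for a record address, and the sample rows and function names are made up.

```python
rows = [
    {"id": 15, "name": "bob"},
    {"id": 3, "name": "alice"},
    {"id": 42, "name": "carol"},
]

# MyISAM-like layout: rows sit in a separate data file in insertion order,
# and every index (primary or secondary) stores the record's address.
myisam_data = list(rows)
myisam_primary = {row["id"]: addr for addr, row in enumerate(myisam_data)}
myisam_by_name = {row["name"]: addr for addr, row in enumerate(myisam_data)}

# InnoDB-like layout: the primary index holds the full rows,
# and a secondary index stores the primary key value instead of an address.
innodb_primary = {row["id"]: row for row in rows}
innodb_by_name = {row["name"]: row["id"] for row in rows}


def myisam_find_by_name(name):
    """Secondary index -> record address -> data file."""
    return myisam_data[myisam_by_name[name]]


def innodb_find_by_name(name):
    """Secondary index -> primary key value -> primary (clustered) index."""
    return innodb_primary[innodb_by_name[name]]
```

Both lookups return the same row, but the intermediate value differs: a position in the data file for the MyISAM-like layout and a primary key value for the InnoDB-like one.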

Why do file indexes and database indexes use B+ trees?

An index exists to locate data quickly, so its structure should minimize the number of disk I/O accesses during a search, and here the B+ tree is a better fit than the B-tree. The database system exploits the principle of locality and disk read-ahead by making each node exactly one page in size, so a node can be loaded completely in a single I/O. A red-black tree, by contrast, is much deeper, and because logically adjacent nodes (parent and child) may be physically far apart, it cannot exploit locality.

Convenient scans: a B-tree needs an in-order traversal to scan the data in key order, while a B+ tree simply walks its leaf nodes. This makes range searches very convenient on a B+ tree and awkward on a B-tree, which is the main reason databases choose the B+ tree.

Lower disk read/write cost: only the leaf nodes of a B+ tree store data. Since internal nodes hold nothing but keys, more keys of the same internal node fit in one disk block, more keys are read into memory per read, and the number of I/O operations drops accordingly;

More stable query efficiency: because only leaf nodes store data, every key lookup follows a path from the root to a leaf, and all such paths have the same length, so every lookup costs about the same. A B-tree may find the data at an intermediate node, so its lookup cost varies more.

Increasing the fan-out (the number of branches per node) of a B+ tree reduces its height. So can the fan-out be increased without limit to reach optimal search efficiency?

No. Taken to the extreme, the tree degenerates into a single ordered array. File-system and database indexes live on disk, and when the data is large an ordered array cannot be loaded into memory in one go. This is where the multi-way structure of the B+ tree pays off: one node can be loaded at a time, and the search proceeds downward step by step.
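The relationship between fan-out and tree height can be checked with some back-of-envelope arithmetic. The figures below are assumptions chosen for illustration, not measured values: a 16KB page, an 8-byte key plus a 6-byte child pointer per internal entry, and about 16 rows per leaf page (roughly 1KB per row). `fanout` and `tree_height` are hypothetical helper names.

```python
import math

PAGE_SIZE = 16 * 1024  # one node is one page


def fanout(key_bytes=8, pointer_bytes=6, page_size=PAGE_SIZE):
    """How many (key, child pointer) entries fit in one internal page."""
    return page_size // (key_bytes + pointer_bytes)


def tree_height(total_rows, rows_per_leaf, node_fanout):
    """Number of levels needed so that the leaves can hold total_rows."""
    leaf_pages = math.ceil(total_rows / rows_per_leaf)
    height, reachable = 1, 1   # a single leaf page is a tree of height 1
    while reachable < leaf_pages:
        reachable *= node_fanout   # each extra level multiplies reachable leaves
        height += 1
    return height
```

Under these assumptions a page-sized node has a fan-out of about 1170, and `tree_height(20_000_000, 16, fanout())` comes out at three levels, whereas the same data with a fan-out of 2 needs more than twenty. The fan-out cannot grow past what fits in one page, which is why it is bounded rather than unlimited.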

Why does MySQL index use B+ tree instead of hash table and B tree?

  • Using a hash requires loading all the data into memory, which consumes a great deal of memory when the data is large. A B+ tree is loaded piece by piece, one node at a time, which reduces memory consumption.
  • It depends on the business scenario. For a unique lookup (finding one value), a hash is indeed faster, but database queries often fetch multiple rows. Because B+ tree data is ordered and the leaf nodes are connected by a linked list, such queries run much faster than with a hash.
  • The non-leaf nodes of a B+ tree do not store data, only the boundary value (maximum or minimum) of each subtree. For nodes of the same size, a B+ tree can therefore have more branches than a B-tree, which makes the tree shorter and wider and reduces the number of I/O operations per query.
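The difference between a hash structure and an ordered structure on multi-row queries can be seen in a short sketch. Here a Python dict stands in for a hash index and a sorted list stands in for the ordered leaf level of a B+ tree; both stand-ins and the function names are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


def hash_range(hash_index, low, high):
    """A hash table keeps no key order, so a range query must test every key.

    Returns (matching keys, number of keys examined).
    """
    matches = sorted(k for k in hash_index if low <= k <= high)
    return matches, len(hash_index)


def ordered_range(sorted_keys, low, high):
    """On ordered keys, locate both ends by binary search and read the slice.

    Returns (matching keys, number of keys read).
    """
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return sorted_keys[start:end], end - start


# Made-up sample data: 200 keys, 0, 5, 10, ..., 995.
sorted_keys = list(range(0, 1000, 5))
hash_index = {k: f"row-{k}" for k in sorted_keys}
```

For the range 10 to 30, both functions return the same five keys, but the hash version examines all 200 keys while the ordered version reads only the five that match.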

How InnoDB stores data

InnoDB reads and writes data in units of "data pages". The default size of an InnoDB data page is 16KB. Each data page contains a page directory, which serves as an index over the records in the page.

The process of creating a page directory is as follows:

  1. Divide all records into several groups. These records include the minimum record and the maximum record, but exclude records marked as "deleted";
  2. The last record of each group is the largest record in the group, and its header stores the total number of records in the group in the n_owned field (shown in pink in the original figure);
  3. The page directory stores the address offset of the last record of each group, and these offsets are stored in order. The address offset of each group is also called a slot, and each slot acts as a pointer to the last record of a different group.

As the original figure shows, the page directory is made up of multiple slots, and the slots act as an index over the grouped records. Because the records are sorted from small to large by primary key value, a binary search over the slots quickly identifies which slot (which record group) the wanted record is in. Once the slot is located, only the records in that group are traversed to find the record, without walking the record list of the whole page from the smallest record.
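The slot lookup described above can be sketched as a binary search over slots followed by a short scan inside one group. This is a simplified model under stated assumptions: records are a sorted Python list of primary keys, a slot stores a list position rather than a byte offset, groups have a fixed size of four, and the minimum and maximum pseudo-records are left out. `build_page_directory` and `find_in_page` are hypothetical names.

```python
from bisect import bisect_left


def build_page_directory(sorted_keys, group_size=4):
    """Group the records; each slot stores the position of a group's last record."""
    if not sorted_keys:
        return []
    slots = list(range(group_size - 1, len(sorted_keys), group_size))
    if not slots or slots[-1] != len(sorted_keys) - 1:
        slots.append(len(sorted_keys) - 1)   # the final, possibly smaller, group
    return slots


def find_in_page(sorted_keys, slots, key):
    """Binary-search the slots, then scan only the matching group.

    Returns the position of the record, or None if the key is not in the page.
    """
    slot_max_keys = [sorted_keys[s] for s in slots]   # largest key of each group
    i = bisect_left(slot_max_keys, key)               # first group whose max >= key
    if i == len(slots):
        return None
    start = slots[i - 1] + 1 if i > 0 else 0
    for pos in range(start, slots[i] + 1):
        if sorted_keys[pos] == key:
            return pos
    return None


# Made-up page contents for the example.
page_keys = [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
page_slots = build_page_directory(page_keys)
```

With ten records and groups of four, a lookup compares against at most three slot keys and then scans at most four records, instead of walking all ten from the smallest one.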

 

Origin blog.csdn.net/shisniend/article/details/131869861