Database Index Principle and Optimization

1. Abstract

This article takes the MySQL database as its subject and discusses several topics related to database indexing. MySQL supports many storage engines, and each storage engine supports indexes differently, so MySQL offers multiple index types, such as BTree indexes, hash indexes, and full-text indexes. To avoid confusion, this article focuses only on the BTree index, because it is the index most commonly dealt with when using MySQL. Hash indexes and full-text indexes are not discussed.

2. Common query algorithms and data structures

Why talk about query algorithms and data structures here? Because building an index really means building a data structure on which an efficient query algorithm can run, which ultimately improves the query speed of the data.

2.1 The nature of indexes

MySQL's official definition of an index is: an index is a data structure that helps MySQL obtain data efficiently. Taking the core of that sentence gives the essence of an index: an index is a data structure.

2.2 Common query algorithms

Querying is one of the most important functions of a database. We all want queries to be as fast as possible, so the designers of database systems optimize from the perspective of query algorithms. Which query algorithms make queries faster?

2.2.1 Sequential search

The most basic query algorithm is of course sequential (linear) search, which compares each element in turn. This algorithm is extremely inefficient when the amount of data is large. 
Data Structure: ordered or unordered list 
Complexity: O(n) 
Example Code:

// Sequential search: returns the index of value in a[0..n-1], or -1
int SequenceSearch(int a[], int value, int n)
{
    int i;
    for(i=0; i<n; i++)
        if(a[i]==value)
            return i;
    return -1;
}

2.2.2 Binary search

A faster method than sequential search is binary search. The search starts from the middle element of the array. If the middle element is the one being searched for, the search ends. If the target is greater or less than the middle element, the search continues in the half of the array that is greater or less than the middle element, again starting from the middle of that half. If the remaining range is empty at some step, the target was not found. 
Data Structure: sorted array 
Complexity: O(logn) 
Example Code:

// Binary search, recursive version
// Searches a[low..high] (inclusive) and returns the index of value, or -1.
int BinarySearch2(int a[], int value, int low, int high)
{
    if(low>high)
        return -1;    // empty range: value is not present
    int mid = low+(high-low)/2;
    if(a[mid]==value)
        return mid;
    if(a[mid]>value)
        return BinarySearch2(a, value, low, mid-1);
    return BinarySearch2(a, value, mid+1, high);
}

2.2.3 Binary Sort Tree Search

The characteristics of a binary sorted tree are:

  1. If its left subtree is not empty, the value of all nodes on the left subtree is less than the value of its root node;
  2. If its right subtree is not empty, the value of all nodes on the right subtree is greater than the value of its root node;
  3. Its left and right subtrees are also binary sorted trees, respectively.

The principle of search:

  1. If b is an empty tree, the search fails, otherwise:
  2. If x is equal to the value of the data field of the root node of b, the search succeeds; otherwise:
  3. If x is less than the value of the data field of the root node of b, search the left subtree; otherwise:
  4. Search the right subtree.

Data Structure: Binary Sort Tree 
Time Complexity: O(log2N) when the tree is reasonably balanced; it degrades to O(N) if the tree degenerates into a chain
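
The four search steps above can be sketched in C roughly as follows. This is only a minimal illustration: the `BstNode` type and the `bst_insert` helper are names assumed here for the example, not something taken from the original text.

```c
#include <stdlib.h>

/* Hypothetical node type for a binary sort tree (illustrative only). */
typedef struct BstNode {
    int key;
    struct BstNode *left;
    struct BstNode *right;
} BstNode;

/* Insert a key and return the (possibly new) root; duplicates are ignored. */
BstNode *bst_insert(BstNode *root, int key)
{
    if (root == NULL) {
        BstNode *node = malloc(sizeof *node);
        if (node == NULL)
            return NULL;               /* allocation failed */
        node->key = key;
        node->left = node->right = NULL;
        return node;
    }
    if (key < root->key)
        root->left = bst_insert(root->left, key);
    else if (key > root->key)
        root->right = bst_insert(root->right, key);
    return root;
}

/* Search steps: an empty tree fails, an equal key succeeds,
   a smaller key goes left, otherwise go right. */
const BstNode *bst_search(const BstNode *b, int x)
{
    while (b != NULL) {
        if (x == b->key)
            return b;
        b = (x < b->key) ? b->left : b->right;
    }
    return NULL;
}
```

Each loop iteration descends one level, so the cost is proportional to the depth of the tree, which is why the shape of the tree matters so much.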

2.2.4 Hashing (Hash Table)

The principle is to first build a hash table from the key values using a hash function. A lookup then applies the hash function to the key to locate the position of the data element.

Data structure: Hash table 
Time complexity: almost O(1), depends on how many collisions are generated.
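
As a rough sketch of this idea, the C code below implements a tiny fixed-size hash table with linear probing. The table size, the multiplicative hash function, and all names (`HashTable`, `ht_put`, `ht_get`) are assumptions chosen for illustration. It also shows why the cost depends on collisions: each collision adds one more probe.

```c
#include <stddef.h>

#define TABLE_SIZE 16   /* hypothetical fixed capacity, a power of two */

typedef struct {
    int used;
    int key;
    int value;
} Slot;

typedef struct {
    Slot slots[TABLE_SIZE];
} HashTable;

/* Simple multiplicative hash, reduced to a slot number. */
static unsigned hash_int(int key)
{
    return ((unsigned)key * 2654435761u) & (TABLE_SIZE - 1);
}

/* Store key -> value. Returns 0 on success, -1 if the table is full. */
int ht_put(HashTable *t, int key, int value)
{
    unsigned start = hash_int(key);
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        Slot *s = &t->slots[(start + i) & (TABLE_SIZE - 1)];
        if (!s->used || s->key == key) {
            s->used = 1;
            s->key = key;
            s->value = value;
            return 0;
        }
    }
    return -1;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
const int *ht_get(const HashTable *t, int key)
{
    unsigned start = hash_int(key);
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        const Slot *s = &t->slots[(start + i) & (TABLE_SIZE - 1)];
        if (!s->used)
            return NULL;       /* empty slot ends the probe sequence */
        if (s->key == key)
            return &s->value;
    }
    return NULL;
}
```

The sketch has no delete operation, which keeps the probing rule simple: an empty slot always means the key was never inserted.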

2.2.5 Block search

Block search, also known as indexed sequential search, is an improved form of sequential search. The idea is to divide n data elements into m blocks (m ≤ n) that are "ordered by block". The elements within each block need not be ordered, but the blocks themselves must be: the key of any element in the first block must be smaller than the key of any element in the second block, any element in the second block must be smaller than any element in the third block, and so on. 
   
Algorithm flow:

  1. First take the largest key in each block to form an index table;
  2. The search then has two parts: first, perform a binary search or a sequential search on the index table to determine which block the record belongs to; then, search sequentially within that block.
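
The two-part flow above could look like the following C sketch. The `BlockIndex` layout (largest key, start position, length) is an assumption made for this example. The index table is searched with a binary search and the chosen block with a sequential scan.

```c
#include <stddef.h>

/* One index-table entry: the largest key in a block and the block's extent.
   All names here are hypothetical, for illustration only. */
typedef struct {
    int max_key;
    size_t start;   /* position of the block's first element in data[] */
    size_t len;     /* number of elements in the block */
} BlockIndex;

/* Returns the position of value in data, or -1 if it is not present. */
long block_search(const int data[], const BlockIndex index[], size_t m, int value)
{
    /* Part 1: binary search the index table for the first block whose
       largest key is >= value. */
    size_t lo = 0, hi = m;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid].max_key < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == m)
        return -1;   /* value is larger than every key */

    /* Part 2: sequential search inside that block (it may be unordered). */
    for (size_t i = 0; i < index[lo].len; i++) {
        size_t pos = index[lo].start + i;
        if (data[pos] == value)
            return (long)pos;
    }
    return -1;
}
```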

Algorithms such as binary search and binary tree search cut the search range in half with every comparison, so their query speed is greatly improved and their complexity is O(log2N).

A little analysis shows, however, that each search algorithm can only be applied to a specific data structure. For example, binary search requires the data to be sorted, and binary tree search can only be applied to a binary search tree. The organization of the data itself cannot satisfy all of these structures at once (for example, it is impossible to keep the rows physically ordered by two different columns at the same time). So, in addition to the data, the database system maintains data structures that satisfy specific search algorithms. These structures reference (point to) the data in some way, so that advanced search algorithms can be implemented on them. Such a data structure is an index.

2.3 Balanced multi-way search tree (B-Tree)

The binary tree mentioned above has a search time complexity of O(log2N), so its search efficiency depends on the depth of the tree. To improve query speed, the depth of the tree must be reduced. A natural way to do that is to use a multi-way tree. Combined with the idea of a balanced binary tree, we can build a balanced multi-way tree and run a balanced multi-way search algorithm on it, which improves search efficiency on large amounts of data.

2.3.1 B Tree

The B-Tree (Balance Tree) is sometimes written "B- tree" (the "B-" simply comes from the hyphen in the English name "B-tree", so "B- tree" and "B tree" are the same concept). It is a balanced multi-way search tree. The following figure shows a typical B-tree: 
[Figure: a typical B-Tree]

From the figure above we can roughly see some characteristics of the B-tree. To describe it more precisely, we define a record as a two-tuple [key, data], where key is the key value of the record and data is the rest of the record (the figure shows only keys; data is not drawn). The detailed definition of a B-tree is as follows:

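1. There is one root node; the root either holds a single record with two children, or is empty;
2. In each node, keys and pointers alternate, and the pointers point to child nodes;
3. d denotes the width of the tree. Apart from leaf nodes, every node holds [d/2, d-1] records, whose keys are arranged in ascending order from left to right, and has [d/2+1, d] children;
4. Within a node, all keys in the n-th subtree are smaller than the n-th key of that node and larger than its (n-1)-th key. For example, in the figure above, all keys in E, the second child of node B, are smaller than B's second key 9 and larger than its first key 3;
5. All leaf nodes must be on the same level, that is, they have the same depth;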

Given these properties, the algorithm for retrieving data by key in a B-Tree is very intuitive: starting at the root node, search the keys within the node (a binary search works, since they are sorted). If the key is found, return the data of the corresponding record. Otherwise, recursively search the node referenced by the pointer for the interval in which the key falls, until the key is found or a null pointer is reached. The former means success and the latter failure. The pseudocode of the search algorithm on B-Tree is as follows:

BTree_Search(node, key) {
     if(node == null) return null;
     // keys in a node are sorted; a linear scan is shown for simplicity
     foreach(i in 0 .. node.key_count-1){
          if(node.key[i] == key) return node.data[i];
          if(node.key[i] > key) return BTree_Search(node.point[i], key);
      }
     // key is larger than every key in this node: follow the last pointer
     return BTree_Search(node.point[node.key_count], key);
  }
data = BTree_Search(root, my_key);

B-Tree has a number of interesting properties. For example, for a B-Tree of degree d that indexes N keys, the upper limit of its tree height h is logd((N+1)/2). To retrieve a key, the asymptotic complexity of the number of nodes visited is O(logdN). This shows that B-Tree is a very efficient index data structure.

In addition, since inserting and deleting data records can break the properties of a B-Tree, inserts and deletes must split, merge, or transfer nodes to maintain those properties. This article does not intend to cover those operations in full, because many materials already describe the mathematical properties of B-Tree and its insertion and deletion algorithms in detail. Interested readers can consult that literature.

2.3.2 B+Tree

In fact, there are many variants of B-Tree, the most common of which is B+Tree. For example, MySQL generally uses B+Tree to implement its index structure. Compared with B-Tree, B+Tree has the following differences:

  • The upper limit on the number of pointers per node is 2d instead of 2d+1;
  • Inner nodes do not store data, only keys;
  • Leaf nodes do not store pointers;

The following is a simple B+Tree schematic. 
[Figure: a simple B+Tree]

Since not all nodes have the same fields, leaf nodes and inner nodes in a B+Tree are generally of different sizes. This differs from B-Tree: although different nodes in a B-Tree may store different numbers of keys and pointers, the fields and upper limits of every node are the same, so implementations often allocate the same amount of space for each B-Tree node. Generally speaking, B+Tree is better suited than B-Tree for implementing an index on external storage. The specific reasons relate to how external storage and computer memory access work, and are discussed below.

2.3.3 B+Tree with sequential access pointers

The B+Tree structures generally used in database systems or file systems are optimized versions of the classic B+Tree, with sequential access pointers added. 
[Figure: a B+Tree with sequential access pointers]

As shown in the figure, adding to each leaf node of the B+Tree a pointer to the adjacent leaf node forms a B+Tree with sequential access pointers. The purpose of this optimization is to improve the performance of range access. For example, in Figure 4, to query all data records with keys from 18 to 49, once 18 is found you only need to follow the nodes and pointers in order to reach all the data nodes in a single pass, which greatly improves the efficiency of range queries.
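
A range scan over linked leaves can be sketched in C as below. The `Leaf` layout and its tiny capacity are assumptions for illustration, and the initial tree descent that finds the starting leaf is deliberately left out. The point is that once the first leaf is known, the scan only follows `next` pointers.

```c
#include <stddef.h>

#define LEAF_CAPACITY 4   /* hypothetical; real nodes hold many more keys */

typedef struct Leaf {
    int keys[LEAF_CAPACITY];
    size_t count;
    struct Leaf *next;    /* sequential access pointer to the right sibling */
} Leaf;

/* Starting at the leaf that contains the first key >= lo (found by a normal
   tree descent, not shown), copy every key in [lo, hi] into out.
   Returns how many keys were written (at most out_cap). */
size_t leaf_range_scan(const Leaf *leaf, int lo, int hi, int out[], size_t out_cap)
{
    size_t n = 0;
    for (; leaf != NULL; leaf = leaf->next) {
        for (size_t i = 0; i < leaf->count; i++) {
            if (leaf->keys[i] < lo)
                continue;                 /* before the range */
            if (leaf->keys[i] > hi || n == out_cap)
                return n;                 /* past the range, or buffer full */
            out[n++] = leaf->keys[i];
        }
    }
    return n;
}
```

Without the sibling pointers, the scan would have to climb back up to a parent node each time it ran off the end of a leaf.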

This section gives a brief introduction to B-Tree and B+Tree. The next section combines the memory access principle to introduce why B+Tree is currently the preferred data structure for database systems to implement indexes.

3. Computer principles related to index data structure design

As mentioned above, data structures such as binary trees and red-black trees can also be used to implement indexes, but file systems and database systems generally use B-/+Tree as the index structure. This section discusses the theoretical basis for using B-/+Tree as an index.

3.1 Two types of storage

A computer system generally includes two types of storage: main memory (RAM) and external storage (such as hard disks, CDs, SSDs, etc.). When designing indexing algorithms and storage structures, we must take the characteristics of both into account. Main memory is fast to read; by comparison, reading from an external disk is several orders of magnitude slower. The difference between them is described in detail later. All the query algorithms mentioned above assume that the data is stored in main memory. Main memory is generally small, however, and the data of a real database is stored in external storage.

In general, the index itself is also very large and cannot be held entirely in memory, so it is often stored on disk in the form of index files. In that case, searching the index incurs disk I/O, which costs several orders of magnitude more than memory access. Therefore, the most important measure of a data structure's quality as an index is the asymptotic complexity of the number of disk I/O operations during a lookup. In other words, the structure of the index should minimize the number of disk I/O accesses during a lookup. The following sections describe the principles of memory and disk access in detail, and then use these principles to analyze the efficiency of B-/+Tree as an index.

3.2 The principle of main memory access

At present, the main memory used by computers is basically random access memory (RAM). The structure and access principle of modern RAM are relatively complex, so this article sets aside the specific differences and uses a very simple abstract access model to illustrate how RAM works. 
[Figure: a simplified 4 x 4 main memory model]

From an abstract point of view, main memory is a matrix of a series of memory cells, each of which stores a fixed size of data. Each storage unit has a unique address. The addressing rules of modern main memory are more complicated. Here, it is simplified to a two-dimensional address: a row address and a column address can uniquely locate a storage unit. The image above shows a 4 x 4 main memory model.

The main memory access process is as follows:

When the system needs to read main memory, it puts the address signal on the address bus and sends it to main memory. Main memory reads the address signal, parses it, locates the specified storage unit, and then puts the data of that unit on the data bus for other components to read. Writing to main memory is similar: the system places the unit address and the data to be written on the address bus and data bus respectively, and main memory reads the contents of the two buses and performs the write.

It can be seen that main memory access time depends only linearly on the number of accesses. Because there is no mechanical operation, the "distance" between two accessed locations has no effect on the time. For example, fetching A0 and then A1 takes the same time as fetching A0 and then D3.

3.3 The principle of disk access

As mentioned above, indexes are generally stored on disk in the form of files, and index retrieval requires disk I/O operations. Unlike main memory, disk I/O involves mechanical movement, so it is very time-consuming.

A disk reads data by mechanical movement. When data needs to be read, the system passes the logical address of the data to the disk, and the disk's control circuitry translates it into a physical address, that is, which track and which sector the data is in. To read the data in that sector, the head must be placed above it. To do so, the head first moves to line up with the corresponding track. This process is called seeking, and the time it takes is called seek time. The disk then rotates until the target sector passes under the head. The time spent on this is called rotational delay. Finally the data is transferred. The time spent on each read can therefore be divided into three parts: seek time, rotational delay, and transfer time, where:

  • The seek time is the time required for the arm to move to the specified track. On mainstream disks it is generally under 5 ms.
  • The rotational delay depends on the rotational speed of the disk that we often hear about. For example, a 7200 rpm disk rotates 7200 times per minute, that is, 120 times per second, so the average rotational delay (half a revolution) is 1/120/2 = 4.17 ms.
  • The transfer time is the time to read data from or write data to the disk. It is generally a few tenths of a millisecond and is negligible relative to the first two.

So the time for one disk access, that is, one disk I/O, is about 5 + 4.17 ≈ 9 ms. That may sound acceptable, but a 500-MIPS machine executes 500 million instructions per second, because instructions rely only on electrical signals. In other words, about 4.5 million instructions can be executed in the time of a single I/O. Databases routinely hold hundreds of thousands, millions, or even tens of millions of rows, so paying 9 ms per access would obviously be a disaster.
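
The arithmetic behind these figures can be checked with a couple of small C helpers. The function names are made up for this illustration. With a 7200 rpm disk the average rotational delay comes out at about 4.17 ms, and with roughly 9 ms per I/O a 500-MIPS processor could execute about 4.5 million instructions while waiting.

```c
/* Average rotational delay in milliseconds: the time for half a revolution. */
double rotational_delay_ms(double rpm)
{
    return 60.0 * 1000.0 / rpm / 2.0;
}

/* Number of instructions a CPU rated at the given MIPS could execute
   during one disk I/O that takes io_ms milliseconds. */
double instructions_per_io(double mips, double io_ms)
{
    return mips * 1e6 * (io_ms / 1000.0);
}
```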

3.4 The principle of locality and disk read-ahead

Because of the characteristics of the storage medium, disk access is much slower than main memory access. Adding the cost of mechanical movement, the access speed of a disk is often around one-hundredth that of main memory. To improve efficiency, disk I/O must therefore be reduced as much as possible. For this reason the disk is usually not read strictly on demand but reads ahead every time: even if only one byte is required, the disk starts from that position and sequentially reads a certain length of data into memory. The rationale is the well-known principle of locality in computer science: when a piece of data is used, nearby data is usually used soon afterwards, and the data a program needs while running tends to be concentrated.

Since sequential disk reads are very efficient (no seek time and very little rotational delay), read-ahead improves I/O efficiency for programs with locality. The read-ahead length is generally an integer multiple of the page size. A page is the logical block in which the computer manages memory. Hardware and operating systems often divide main memory and disk storage into consecutive blocks of equal size, each called a page (in many operating systems a page is usually 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read request to the disk, which locates the starting position of the data, reads one or several consecutive pages, and loads them into memory. The page fault handler then returns and the program continues to run.
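
To make the page-granularity point concrete, here is a small C sketch that works out which page a byte offset falls in and how many whole pages a request touches. The 4 KB page size and the helper names are assumptions for illustration; real systems decide the read-ahead window themselves.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size of 4 KB */

/* Offset of the first byte of the page that contains the given offset. */
size_t page_start(size_t offset)
{
    return offset - offset % SKETCH_PAGE_SIZE;
}

/* Number of whole pages that must be read to cover len bytes at offset. */
size_t pages_spanned(size_t offset, size_t len)
{
    if (len == 0)
        return 0;
    return (offset + len - 1) / SKETCH_PAGE_SIZE - offset / SKETCH_PAGE_SIZE + 1;
}
```

Even a one-byte request costs a full page, and a request that straddles a page boundary costs two.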

4. The B-/+Tree data structure used in database indexes and its performance analysis

At this point we can finally analyze why database indexes adopt the B-/+Tree storage structure. As mentioned above, the database index is stored on disk, and we generally use the number of disk I/Os to evaluate an index structure. Start with B-Tree: by its definition, a single lookup needs to access at most h-1 nodes (the root node stays resident in memory). The designers of database systems cleverly exploited disk read-ahead by making the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, an actual B-Tree implementation also uses the following technique: each time a new node is created, it requests exactly one page of space, which guarantees that a node is physically stored within one page. Since computer storage allocation is page-aligned, reading a node takes only one I/O.

A lookup in a B-Tree requires at most h-1 I/Os (the root node is resident in memory), and the asymptotic complexity is O(h)=O(logdN). In practice the out-degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).

To sum up, if we use the B-Tree storage structure, the number of I/Os during search will generally not exceed 3 times, so the efficiency of using B-Tree as the index structure is very high.
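
The relationship between out-degree and height can be illustrated with a short C function. It uses the simplifying assumption that a tree of h levels with out-degree d reaches about d^h keys, which is an approximation made for this sketch rather than the exact B-Tree bound. Under that assumption, d = 100 reaches one million keys in three levels, whereas a binary tree needs twenty.

```c
#include <limits.h>

/* Smallest number of levels h such that d^h >= n, i.e. how deep a tree
   with out-degree d must be to reach n keys (rough approximation). */
unsigned levels_needed(unsigned long long n, unsigned long long d)
{
    unsigned h = 0;
    unsigned long long reach = 1;

    if (d < 2)
        return 0;                  /* out-degree below 2 is not a tree */
    while (reach < n) {
        h++;
        if (reach > ULLONG_MAX / d)
            break;                 /* next product would overflow, so d^h >= n */
        reach *= d;
    }
    return h;
}
```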

4.1 B+ tree performance analysis

From the above we know that the search complexity of the B-tree is O(h)=O(logdN), so the larger the out-degree d of the tree, the smaller the depth h and the fewer the I/Os. B+Tree is able to increase the out-degree d. Because each node is the size of one page, the upper limit of the out-degree depends on the size of the key and data in the node:

dmax=floor(pagesize/(keysize+datasize+pointsize)) // floor means rounding down

Since the data field is removed from the nodes in the B+Tree, it can have a larger out-degree and thus have better performance.
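
The formula can be tried out with a few lines of C. The byte sizes used in the usage note below are made-up illustrative values, not measurements from MySQL. Setting the data size to zero models an inner B+Tree node that stores only keys and pointers.

```c
#include <stddef.h>

/* dmax = floor(pagesize / (keysize + datasize + pointsize)).
   Unsigned integer division already rounds down. */
size_t max_out_degree(size_t page_size, size_t key_size,
                      size_t data_size, size_t pointer_size)
{
    size_t entry = key_size + data_size + pointer_size;
    if (entry == 0)
        return 0;   /* avoid dividing by zero on meaningless input */
    return page_size / entry;
}
```

For example, with a 4096-byte page, 8-byte keys and 8-byte pointers, an entry that also carries 100 bytes of data gives an out-degree of 35, while dropping the data field raises it to 256.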

4.2 B+ tree search process

[Figure: B+ tree search example, showing disk blocks and their pointers] 
The search process of B-tree and B+ tree is basically the same. As shown in the figure above, to find data item 29, disk block 1 is first loaded from disk into memory, which costs one I/O. A binary search in memory determines that 29 lies between 17 and 35, which selects the P2 pointer of disk block 1. The time spent in memory is negligible because it is very short compared with a disk I/O. Disk block 3 is then loaded into memory via the disk address in the P2 pointer of disk block 1, which is the second I/O. 29 lies between 26 and 30, which selects the P2 pointer of disk block 3, and disk block 8 is loaded into memory through that pointer, which is the third I/O. A binary search in memory then finds 29 and the query ends, after three I/Os in total. In practice a 3-level B+ tree can hold millions of records. If looking up one record among millions takes only three I/Os, the performance improvement is huge. Without an index, each data item may cost one I/O, so millions of I/Os would be needed in total, which is obviously very expensive.

This chapter discussed the data structures and algorithms related to indexes from a theoretical point of view. The next chapter discusses how B+Tree is implemented as an index in MySQL, and uses the MyISAM and InnoDB storage engines to introduce two different forms of index implementation: non-clustered indexes and clustered indexes.

5. MySQL index implementation

In MySQL, indexes belong to the concept of storage engine level. Different storage engines implement indexes in different ways. This article mainly discusses the index implementation methods of MyISAM and InnoDB storage engines.

5.1 MyISAM Index Implementation

The MyISAM engine uses B+Tree as the index structure, and the data field of a leaf node stores the address of the data record. The following figure is a schematic diagram of the MyISAM index: 
[Figure: MyISAM primary index]

The table here has three columns. Assuming Col1 is the primary key, the figure above shows the primary index of a MyISAM table. It can be seen that the index file of MyISAM only stores the addresses of data records. In MyISAM there is no structural difference between the primary index and a secondary index (secondary key), except that the primary index requires keys to be unique while the keys of a secondary index may repeat. If we build a secondary index on Col2, its structure is shown in the following figure: 
[Figure: MyISAM secondary index on Col2]

It is also a B+Tree whose data field stores the address of the data record. Index retrieval in MyISAM therefore first searches the index with the B+Tree search algorithm. If the specified key exists, it takes the value of the data field and uses it as an address to read the corresponding data record. 
The MyISAM indexing method is also called "non-clustered", a name that distinguishes it from the clustered index of InnoDB.

5.2 InnoDB Index Implementation

Although InnoDB also uses B+Tree as the index structure, the specific implementation is completely different from MyISAM.

The first major difference is that InnoDB's data file is itself an index file. As described above, the MyISAM index file and data file are separate, and the index file only stores the addresses of data records. In InnoDB, the table data file itself is an index structure organized as a B+Tree, and the data field of each leaf node of this tree holds a complete data record. The key of this index is the primary key of the table, so the InnoDB table data file itself is the primary index. 
[Figure: InnoDB primary (clustered) index]

The figure above is a schematic diagram of the InnoDB primary index (which is also the data file). You can see that the leaf nodes contain complete data records. Such an index is called a clustered index. Because InnoDB data files are clustered by the primary key, InnoDB requires every table to have a primary key (MyISAM tables may have none). If none is specified explicitly, MySQL automatically selects a column that can uniquely identify the data records as the primary key. If no such column exists, MySQL automatically generates an implicit field as the primary key of the InnoDB table. This field is 6 bytes long and of a long integer type.

The second difference from MyISAM indexes is that the data field of an InnoDB secondary index stores the value of the corresponding record's primary key instead of its address. In other words, all secondary indexes in InnoDB use the primary key as their data field. For example, the following figure shows a secondary index defined on Col3: 
[Figure: InnoDB secondary index on Col3]

Here, the ASCII code of English characters is used as the comparison criterion. The implementation of the clustered index makes the search by the primary key very efficient, but the secondary index search needs to retrieve the index twice: first, the secondary index is retrieved to obtain the primary key, and then the primary key is used to retrieve the records in the primary index.
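
The two-step lookup through a secondary index can be sketched in C as follows. To keep it short, sorted arrays searched with the standard `bsearch` function stand in for the two B+Trees. The `Row` and `NameEntry` layouts and the sample values in the usage note are assumptions made purely for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical row: primary key plus two other columns. The array of rows,
   sorted by id, stands in for the clustered (primary) index. */
typedef struct { int id; const char *name; int age; } Row;

/* Secondary index entry: the indexed value plus the primary key it refers to. */
typedef struct { const char *name; int id; } NameEntry;

static int cmp_row_id(const void *key, const void *elem)
{
    int k = *(const int *)key;
    int id = ((const Row *)elem)->id;
    return (k > id) - (k < id);
}

static int cmp_entry_name(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const NameEntry *)elem)->name);
}

/* Lookup 1: search the secondary index to obtain the primary key.
   Lookup 2: search the clustered index with that key to obtain the row. */
const Row *find_by_name(const Row *clustered, size_t n_rows,
                        const NameEntry *by_name, size_t n_entries,
                        const char *name)
{
    const NameEntry *e = bsearch(name, by_name, n_entries,
                                 sizeof *by_name, cmp_entry_name);
    if (e == NULL)
        return NULL;
    return bsearch(&e->id, clustered, n_rows, sizeof *clustered, cmp_row_id);
}
```

With rows {15, "Bob", 34}, {18, "Alice", 77}, {20, "Jim", 5} and name entries sorted as Alice, Bob, Jim, looking up "Alice" first yields the primary key 18 and then the full row. A MyISAM-style index would instead store the record address in the entry and skip the second search.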

Knowing how indexes are implemented in different storage engines is very helpful for using and optimizing them correctly. For example, once you know how InnoDB implements indexes, it is easy to understand why an overly long field is not recommended as the primary key: every secondary index refers to the primary index, so a long primary key makes every secondary index larger. As another example, using a non-monotonic field as the primary key in InnoDB is not a good idea. The InnoDB data file is itself a B+Tree, and a non-monotonic primary key forces the data file to split and rebalance frequently to preserve the B+Tree properties when new records are inserted, which is very inefficient. An auto-increment field is a good choice of primary key.

These index-related optimization strategies are discussed in detail in the next chapter.

6. Index usage strategy and optimization

MySQL optimization is mainly divided into structural optimization (schema optimization) and query optimization. The high-performance indexing strategies discussed in this chapter fall primarily into the category of structural optimization. The content of this chapter is based entirely on the theory above. In fact, once the mechanism behind indexes is understood, choosing a high-performance strategy becomes a matter of reasoning, and the logic behind each strategy is easy to follow.

6.1 Joint Index and Leftmost Prefix Principle

Joint index (composite index)

First, let's talk about joint indexes. A joint index is actually very simple. Whereas an ordinary index covers only one field, a joint index is built on multiple fields. Its principle is also simple. For example, if we create a joint index on the fields (a,b,c), the index records are sorted first by a, then by b, and then by c. A joint index therefore has the following characteristics:

  • The first field is always in order
  • Among records where the first field is equal, the second field is in order. For example, where A=2 in the following table, all the values of B are in order. Likewise, among records with the same B value, the values of C are in order

    | A | B | C | 
    | 1 | 1 | 4 | 
    | 1 | 2 | 3 | 
    | 1 | 4 | 2 | 
    | 2 | 3 | 5 | 
    | 2 | 4 | 4 | 
    | 2 | 4 | 6 | 
    | 2 | 5 | 5 |

Searching a joint index is just like looking up a word in a dictionary. You search by the first letter and then by the second letter, or you search by the first letter only, but you cannot skip the first letter and start searching from the second. This is called the leftmost prefix principle.
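
The ordering rule, and the reason the leading column cannot be skipped, can be sketched in C as below. The `Key3` type and function names are assumptions for this example, and a sorted array again stands in for the index. The comparator orders entries by a, then b, then c. A binary search on a alone works on the sorted array, but the same search on b alone would not, because b values are only ordered within each group of equal a.

```c
#include <stdlib.h>

/* Hypothetical entry of a joint index on (a, b, c). */
typedef struct { int a, b, c; } Key3;

/* Compare by a, then b, then c: the order a joint index on (a,b,c) keeps. */
int cmp_key3(const void *x, const void *y)
{
    const Key3 *p = x, *q = y;
    if (p->a != q->a) return (p->a > q->a) - (p->a < q->a);
    if (p->b != q->b) return (p->b > q->b) - (p->b < q->b);
    return (p->c > q->c) - (p->c < q->c);
}

/* Position of the first entry with the given a (or n if a is larger than
   every entry). Valid only because the array is ordered by a first. */
size_t first_with_a(const Key3 *keys, size_t n, int a)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid].a < a)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Sorting an array of `Key3` values with `qsort(keys, n, sizeof keys[0], cmp_key3)` produces exactly the "a first, then b, then c" order described above.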

Leftmost prefix principle

Let's look at queries on a joint index in detail. Continuing the example above, we built a joint index on the fields (a,b,c), so the index is sorted first by a, then by b, and then by c. Therefore:

The following query methods can use the index

select * from table where a=1;
select * from table where a=1 and b=2;
select * from table where a=1 and b=2 and c=3;

The three queries above use the index through (a), (a,b), and (a,b,c) respectively, in order. This is leftmost prefix matching.

If the query statement is:

select * from table where a=1 and c=3;

then only the a part of the index is used.

If the query statement is:

select * from table where b=2 and c=3;

then, because the leftmost prefix a is not used, the query cannot use the index.

What if the leftmost prefix is used, but the conditions are written in a different order from the index?

For example:

select * from table where b=2 and a=1;
select * from table where b=2 and a=1 and c=3;

If the leftmost prefix is used but the order is reversed, the index can still be used, because the MySQL query optimizer works out the most efficient way to execute the SQL statement before generating the actual execution plan. It is still good practice to write the conditions in index order, so that the optimizer does not have to reorder them.

Prefix index

Besides joint indexes, MySQL also has prefix indexes. A prefix index uses a prefix of the column, instead of the entire column, as the index key. When the prefix length is chosen well, the selectivity of the prefix index can be close to that of a full-column index, while the shorter index key reduces the size of the index file and its maintenance overhead.

In general, prefix indexes can be used in the following situations:

  • String columns (varchar, char, text, etc.) that are queried by full-field match or prefix match, that is, ='xxx' or like 'xxx%'
  • The strings themselves are fairly long and already differ within the first few characters. For example, a prefix index is pointless for Chinese names, because they are very short. It is also not very practical for recipient addresses: such addresses generally start with "XX province", so the first few characters are similar, and addresses are usually searched with like '%xxx%', which cannot use prefix matching. Conversely, a prefix index can work for foreign names, which are longer and more selective in the first few characters. Email is another field that suits a prefix index.
  • The selectivity of the leading characters is already close to the selectivity of the full field. If the field is 20 characters long with a full-field selectivity of 0.9, and a prefix index on the first 10 characters has a selectivity of only 0.5, the prefix would have to be lengthened further. At that point the advantage of the prefix index is no longer significant, and there is no need to build one.
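
One way to judge a candidate prefix length is to measure its selectivity, that is, the number of distinct prefixes divided by the number of rows. The C sketch below does this for an in-memory array of strings. The function names and the sample addresses mentioned in the usage note are hypothetical. The sketch sorts the array in place and uses a file-level variable to pass the prefix length to the comparator, which keeps it short but makes it unsuitable for concurrent use.

```c
#include <stdlib.h>
#include <string.h>

static size_t g_prefix_len;   /* prefix length used by the comparator */

static int cmp_prefix(const void *x, const void *y)
{
    const char *const *p = x;
    const char *const *q = y;
    return strncmp(*p, *q, g_prefix_len);
}

/* Selectivity of the first prefix_len characters:
   (number of distinct prefixes) / (number of rows).
   Sorts the values array in place. */
double prefix_selectivity(const char **values, size_t n, size_t prefix_len)
{
    size_t distinct = 1;

    if (n == 0)
        return 0.0;
    g_prefix_len = prefix_len;
    qsort(values, n, sizeof *values, cmp_prefix);
    for (size_t i = 1; i < n; i++)
        if (strncmp(values[i - 1], values[i], prefix_len) != 0)
            distinct++;
    return (double)distinct / (double)n;
}
```

For four made-up addresses such as alice@, alina@, bob@ and bobby@example.com, a 2-character prefix gives a selectivity of 0.5, while a 5-character prefix already distinguishes all four rows.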

Some articles also mention:

The MySQL prefix index can effectively reduce the size of the index file and improve the speed of the index. But prefix indexes also have their downsides: MySQL cannot use prefix indexes in ORDER BY or GROUP BY, nor can they be used as covering indexes.

6.2 Index Optimization Strategy

    • Follow the leftmost prefix matching principle described above
    • Primary keys and foreign keys must be indexed
    • Index the columns that appear in where, on, group by, and order by
    • Prefer columns with high selectivity as index columns. Selectivity is computed as count(distinct col)/count(*), the proportion of values that are not repeated. The larger the proportion, the fewer records have to be scanned. The selectivity of a unique key is 1, while some status or gender fields may have a selectivity close to 0 on large data sets
    • Use indexes on smaller data columns, which will make the index file smaller and more index keys can be loaded in memory
    • Index columns must not take part in calculations; keep the column "clean". For example, with from_unixtime(create_time) = '2014-05-29' the index cannot be used. The reason is simple: the B+ tree stores the raw field values from the table, so to evaluate the condition the function would have to be applied to every stored value before comparing, which is obviously too expensive. The statement should instead be written as create_time = unix_timestamp('2014-05-29');
    • Use prefix indexes for longer strings
    • Prefer extending an existing index to creating a new one. For example, if the table already has an index on a and you now need an index on (a, b), you only need to modify the original index
    • Do not create too many indexes. Weigh the number of indexes against DML, that is, the operations that insert, update, and delete data. There is a trade-off here: indexes exist to improve query efficiency, but too many of them slow down inserts and deletes, because the indexes on the modified table data must also be adjusted and rebuilt.
    • For like queries, do not put "%" at the front. 
      SELECT * FROM houdunwang WHERE uname LIKE '后盾%'; -- uses the index 
      SELECT * FROM houdunwang WHERE uname LIKE '%后盾%'; -- does not use the index
    • If the data type in the where condition does not match the column type, the index cannot be used 
      Comparing a string column with a number does not use the index: 
      CREATE TABLE a (a char(10), KEY (a));
      EXPLAIN SELECT * FROM a WHERE a = "1"; -- uses the index 
      EXPLAIN SELECT * FROM a WHERE a = 1; -- does not use the index 
    • Regular expressions do not use the index. This is easy to understand, and it is why the regexp keyword is rarely seen in SQL.
