High Performance MySQL (study notes) - index 3

clustered index

A clustered index is not a separate index type but a way of storing data. The exact details depend on the implementation, but InnoDB's clustered index stores the B-Tree index and the data rows together in the same structure. "Clustered" means that rows with adjacent key values are stored close to each other.

The storage layout works as follows: the leaf pages contain all of the row's data, while the node (non-leaf) pages contain only the indexed column, which here is an integer value. InnoDB clusters the data by the primary key, i.e. the indexed column is the primary key column. If no primary key is defined, InnoDB picks a unique non-null index instead; if there is no such index either, InnoDB implicitly defines a primary key to serve as the clustered index. InnoDB only clusters records within the same page; pages containing adjacent key values may be far apart from each other.
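A minimal sketch of how InnoDB picks the clustered index, using assumed table and column names:

-- Explicit primary key: id becomes the clustered index key.
CREATE TABLE users (
    id    INT NOT NULL AUTO_INCREMENT,
    email VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- No primary key, but a unique non-null index: InnoDB clusters on email instead.
CREATE TABLE users_no_pk (
    email VARCHAR(100) NOT NULL,
    name  VARCHAR(50),
    UNIQUE KEY uk_email (email)
) ENGINE=InnoDB;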


Clustered data has some important advantages:

1. Related data is stored together. For example, when implementing a mailbox, the data can be clustered by user ID, so that all of a user's messages can be fetched by reading only a few data pages from disk. Without a clustered index, each message could require its own disk I/O (see the sketch after this list).

2. Data access is faster. A clustered index keeps the index and the data in the same B-Tree, so fetching rows from a clustered index is usually faster than from a non-clustered index.

3. Queries that use a covering index scan can use the primary key values directly from the leaf nodes.
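A minimal sketch of the mailbox example from point 1, with assumed table and column names:

-- Messages of the same user are stored physically adjacent in the clustered index.
CREATE TABLE mailbox (
    user_id    INT NOT NULL,
    message_id BIGINT NOT NULL,
    subject    VARCHAR(200),
    body       TEXT,
    PRIMARY KEY (user_id, message_id)
) ENGINE=InnoDB;

-- Reads a few adjacent pages instead of one random page per message.
SELECT message_id, subject FROM mailbox WHERE user_id = 42;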

Disadvantages of clustered indexes:

1. Clustering gives the biggest boost to I/O-intensive workloads; if all the data fits in memory, the access order no longer matters and a clustered index offers little advantage.

2. Insert speed depends heavily on the insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. If the data was not loaded in primary key order, it is worth running OPTIMIZE TABLE to reorganize the table after loading.

3. Updating clustered index columns is expensive because InnoDB is forced to move each updated row to a new location.

4. When a new row is inserted into a clustered index, or when a primary key update forces a row to be moved, the table may suffer a "page split". When a row's key value requires it to be placed in a page that is already full, the storage engine splits that page into two pages to accommodate the row; this is a page split operation. Page splits cause the table to use more disk space.

5. Clustered indexes may slow down full table scans, especially when the rows are sparse or are stored non-contiguously because of page splits.

6. Secondary (non-clustered) indexes may be larger than expected, because the leaf nodes of a secondary index contain the primary key columns of the referenced rows.

7. Secondary index access requires two index lookups instead of one. The reason lies in the nature of the "row pointer" stored in the secondary index: its leaf nodes do not hold the physical location of the row, but the row's primary key value. So to find a row through a secondary index, the storage engine first walks the secondary index to its leaf node to obtain the primary key value, and then looks that value up in the clustered index to find the row. The work is doubled: two B-Tree lookups instead of one. For InnoDB, the adaptive hash index can reduce this overhead (see the sketch after this list).
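A minimal sketch of the double lookup described in point 7, with assumed names:

-- idx_email is a secondary index; its leaf nodes store (email, id).
CREATE TABLE customers (
    id    INT NOT NULL AUTO_INCREMENT,
    email VARCHAR(100) NOT NULL,
    name  VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_email (email)
) ENGINE=InnoDB;

-- Step 1: walk idx_email to find the entry and read the primary key value (id).
-- Step 2: walk the clustered index with that id to fetch name.
SELECT name FROM customers WHERE email = 'a@example.com';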

Data distribution in InnoDB and MyISAM

Consider a simple two-column table. The way the data is stored on disk is already optimal, but the rows themselves were inserted in random order. The second column's values are random numbers from 1 to 100, so there are many duplicate values.
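A minimal sketch of such a table (the book uses a similar two-column example; the names here are assumed):

CREATE TABLE layout_test (
    col1 INT NOT NULL,      -- primary key values, inserted in random order
    col2 INT NOT NULL,      -- random values between 1 and 100, many duplicates
    PRIMARY KEY (col1),
    KEY idx_col2 (col2)
);  -- examined below first as a MyISAM table, then as an InnoDB table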

MyISAM data distribution

MyISAM's layout is very simple: rows are stored in insertion order, with a row number (starting from 0) alongside each row. Because the rows are fixed length, MyISAM can find any row by skipping the required number of bytes from the beginning of the table. This layout makes it very easy to build and maintain indexes.

InnoDB data distribution

Because InnoDB supports clustered indexes, it stores the same data in a very different way. The clustered index is the table, so unlike MyISAM there is no separate row storage. Each leaf node of the clustered index contains the primary key value, the transaction ID, the rollback pointer used for transactions and MVCC, and all of the remaining columns. If the primary key is a column prefix index, InnoDB also stores the full column value together with the rest of the columns.

InnoDB's secondary indexes differ from its clustered index in that their leaf nodes store not a "row pointer" but the primary key value, which serves as the pointer to the row. This strategy reduces the maintenance work on secondary indexes when rows are moved or data pages are split. Using the primary key value as the pointer makes secondary indexes take up more space; in exchange, InnoDB does not need to update the secondary indexes when it moves a row.

Insert rows into an InnoDB table in primary key order

Why is a primary key usually set even when it is not strictly needed? When the table has no data that must be clustered, you can define a surrogate AUTO_INCREMENT column that is unrelated to the application data. This guarantees that rows are written in order, and joins on the primary key will perform better. It is best to avoid random (discontinuous and widely distributed) values for the clustered index, especially for I/O-intensive applications. For example, using a UUID as the clustered index is bad for performance: it makes inserts into the clustered index completely random, so the data loses any clustering property. With a monotonically increasing primary key, each record is stored right after the previous one; when the page reaches its maximum fill factor, the next record goes into a new page. Loaded this way, the primary key pages end up nearly full of records. With a UUID clustered index, the primary key value of a new row is not necessarily larger than the previously inserted values, so InnoDB cannot simply append the new row at the end of the index; it has to find the appropriate position and allocate space for it.

The disadvantages of such random inserts are as follows:

1. The target page may already have been flushed to disk and removed from the cache, or may never have been loaded into the cache at all. InnoDB then has to find the target page and read it from disk into memory before inserting, which causes a lot of random I/O.

2. Because the writes are out of order, InnoDB has to do frequent page splits in order to allocate space for new rows. This also causes a lot of random I/O.

3. Because of the frequent page splits, pages become sparse and irregularly filled, and the index data eventually becomes fragmented.

  So: with InnoDB, insert data in primary key order whenever possible, and use a monotonically increasing clustered key value for each new row, as sketched below.
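A minimal sketch contrasting the two choices of clustered key, with assumed table names:

-- Monotonic surrogate key: each new row is appended after the last one.
CREATE TABLE events_auto (
    id      BIGINT NOT NULL AUTO_INCREMENT,
    payload VARCHAR(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Random UUID key: each insert lands on an arbitrary page, causing random I/O and page splits.
CREATE TABLE events_uuid (
    id      CHAR(36) NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO events_auto (payload) VALUES ('ordered insert');
INSERT INTO events_uuid (id, payload) VALUES (UUID(), 'random insert');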

covering index

Usually an index is created to match the WHERE conditions of a query, but that is only one aspect of indexing. A good index should be designed for the whole query, not just the WHERE clause. MySQL can use an index to return a column's values directly, without reading the data rows at all. If an index contains the values of all the fields a query needs, we call it a "covering index". Covering indexes are a very useful tool that can greatly improve performance. The benefits are as follows:

1. Index entries are usually much smaller than full data rows, so if MySQL only needs to read the index, it greatly reduces the amount of data it has to access.

2. Indexes are stored in order of their values, so I/O-intensive range queries need far less I/O than reading each row from a random location on disk. For some storage engines, the OPTIMIZE TABLE command can even be used to get fully sorted indexes.

3. Some storage engines, such as MyISAM, cache only the index in memory and rely on the operating system to cache the data, so accessing the data requires a system call. This can cause severe performance problems, especially in scenarios where system calls are the largest part of the data-access cost.

4. Because of InnoDB's clustered index, covering indexes are particularly useful for InnoDB tables. InnoDB's secondary indexes store the row's primary key value in their leaf nodes, so if a secondary index can cover the query, the second lookup in the primary key (clustered) index can be avoided.

Not every index can be a covering index. A covering index must store the values of the indexed columns, while hash indexes, spatial indexes, and full-text indexes do not store those values, so MySQL can only use B-Tree indexes as covering indexes.
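A minimal sketch of a covering index for the products example used below (column types and the index name are assumed):

CREATE TABLE products (
    prod_id INT NOT NULL AUTO_INCREMENT,
    actor   VARCHAR(50) NOT NULL,
    title   VARCHAR(100) NOT NULL,
    price   DECIMAL(10,2),
    PRIMARY KEY (prod_id),
    KEY idx_actor_title (actor, title)
) ENGINE=InnoDB;

-- (actor, title) plus the implicit primary key covers this query;
-- EXPLAIN should show "Using index" in the Extra column.
EXPLAIN SELECT prod_id, actor, title FROM products WHERE actor = 'SEAN CARREY';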

By contrast, a query like EXPLAIN SELECT * FROM products WHERE actor='SEAN CARREY' AND title LIKE '%apple%' cannot be covered by that index, for two reasons:

1. No index covers this query, because it selects all columns from the table and no index contains all of the columns.

2. MySQL cannot perform the LIKE operation inside the index; it can only do simple comparisons there (equal, not equal, greater than, and so on). MySQL can do a leftmost-prefix LIKE comparison in the index, but when the LIKE pattern starts with a wildcard the storage engine cannot do the comparison, so the MySQL server has to fetch the row's value rather than the index value and compare it itself. There is a way around this by rewriting the query and designing the index cleverly:

SELECT * FROM products
JOIN (
    SELECT prod_id FROM products
    WHERE actor = 'SEAN CARREY' AND title LIKE '%apple%'
) AS t1 ON (t1.prod_id = products.prod_id);

This is called a deferred (delayed) join, because access to the columns is deferred. In the first stage of the query, MySQL can use a covering index to find the matching prod_id values in the subquery in the FROM clause; the outer query then uses those prod_id values to fetch all the columns it needs. But when the WHERE clause matches a large number of rows, it is hard to get an improvement, because most of the time is spent reading and sending data.

Sorting using an index scan

MySQL can produce ordered results in two ways: by a sort operation, or by scanning in index order. If the type column in the EXPLAIN output is "index", MySQL is using an index scan to sort (do not confuse this with "Using index" in the Extra column).

Scanning the index itself is fast, because it only requires moving from one index record to the next. But if the index does not cover all the columns the query needs, every index record forces a lookup back to the table to read the corresponding row. MySQL can use an index to sort the results only when the index's column order is exactly the same as the ORDER BY clause and all columns are sorted in the same direction. If the query joins multiple tables, the index can be used for sorting only when every field referenced in the ORDER BY clause belongs to the first table. The ORDER BY clause has the same restriction as a lookup query: it must satisfy the leftmost prefix of the index; otherwise MySQL has to perform a sort operation and cannot use the index for sorting.

For example, suppose the rental table has an index on (rental_date, inventory_id, customer_id), plus the following keys:

KEY idx_fk_inventory_id (inventory_id),

KEY idx_fk_customer_id (customer_id),

KEY idx_fk_staff_id (staff_id)

The following queries can use the index for sorting:

... WHERE rental_date = '2018-04-22' ORDER BY inventory_id DESC

... WHERE rental_date > '2018-04-22' ORDER BY rental_date, inventory_id

The first query works because the WHERE clause fixes the first index column to a constant and the ORDER BY uses the second column; together, the two columns form the leftmost prefix of the index. The second query works because its ORDER BY columns are themselves a leftmost prefix of the index.

Here are some queries that cannot use the index for sorting:

... rental_date = '2018-04-22' ORDER BY inventory_id DESC, customer_id ASC; sorts in two different directions, while the index columns are all sorted in ascending order

rental_date = '2018-04-22' ORDER BY inventory_id, staff_id; staff_id is not a column of this index

rental_date = '2018-04-22' ORDER BY customer_id; cannot be combined into a leftmost prefix of the index (inventory_id is skipped)

rental_date > '2018-04-22' ORDER BY inventory_id, customer_id; the first index column is used in a range condition, so the index cannot be used for sorting; ORDER BY rental_date instead would be fine

rental_date = '2018-04-22' AND inventory_id IN (1, 2) ORDER BY customer_id; the IN() on inventory_id is effectively multiple range conditions, so the index cannot be used for sorting
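A minimal sketch of checking this with EXPLAIN, assuming the rental table and indexes above (rental_id is assumed to be its primary key):

-- Index-ordered result: Extra should not contain "Using filesort".
EXPLAIN SELECT rental_id FROM rental
WHERE rental_date = '2018-04-22'
ORDER BY inventory_id DESC;

-- Cannot use the index for sorting: Extra shows "Using filesort".
EXPLAIN SELECT rental_id FROM rental
WHERE rental_date = '2018-04-22'
ORDER BY customer_id;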

Prefix-compressed indexes

  MyISAM compresses each index block by storing the first value in the block in full and then storing each other value as the number of bytes it shares with the first value (the common prefix length) plus the differing suffix. Compressed blocks use much less space; the cost is that some operations may be slower.
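A minimal sketch of controlling this for a MyISAM table; PACK_KEYS is the table option that governs key compression (the table name is assumed):

-- PACK_KEYS=1 asks MyISAM to compress (pack) the keys in this table's indexes.
CREATE TABLE keywords_myisam (
    word VARCHAR(64) NOT NULL,
    KEY idx_word (word)
) ENGINE=MyISAM PACK_KEYS=1;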

Redundant and duplicate indexes

1. MySQL implements unique constraints and primary key constraints with indexes, so there is no need to create an additional index on a column that already has a unique constraint or a primary key.

2. Adding indexes slows down INSERT, UPDATE, and DELETE.
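A minimal sketch of the duplicate-index mistake from point 1 (names assumed):

-- All three definitions below index the same column; two of them are useless duplicates.
CREATE TABLE test_dup (
    id INT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uk_id (id),   -- duplicate: the primary key already enforces uniqueness via an index
    KEY idx_id (id)          -- duplicate: yet another index on id
) ENGINE=InnoDB;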

Indexes and locks

Indexes allow queries to lock fewer rows. If a query never accesses rows it does not need, it locks fewer rows, which is good for performance for two reasons. First, although InnoDB row locks are efficient and use little memory, locking rows still has some overhead. Second, locking more rows than needed increases lock contention and reduces concurrency.

InnoDB only locks rows when it accesses them, and an index can reduce the number of rows InnoDB accesses, thereby reducing the number of locks. But this only works when InnoDB can filter out all the unneeded rows at the storage engine layer. If the index cannot filter out the unneeded rows, InnoDB retrieves them and returns them to the server layer, where the MySQL server applies the WHERE clause; at that point it is too late to avoid locking those rows.

Extra: Using where in the EXPLAIN output indicates that the MySQL server applies the WHERE filter after the storage engine has returned the rows.

For example: SELECT actor_id FROM actor WHERE actor_id < 5 AND actor_id <> 1 FOR UPDATE;

Here MySQL will still lock the row with actor_id = 1; this is exactly the Using where case: the server filters that row out only after the storage engine has returned (and locked) it.

Executing SELECT actor_id FROM actor WHERE actor_id = 1 FOR UPDATE; from another connection would now block: even though the first query used an index, InnoDB locked data it did not actually need (the row with actor_id = 1). The problem can be far worse when no index can be used to find and lock the rows: MySQL then does a full table scan and locks all the rows, whether they are needed or not!
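A minimal two-session sketch of the locking behavior described above (the session layout is assumed for illustration):

-- Session 1: the range scan locks rows with actor_id 1..4 at the engine layer;
-- the server filters out actor_id = 1 afterwards (Extra: Using where).
BEGIN;
SELECT actor_id FROM actor WHERE actor_id < 5 AND actor_id <> 1 FOR UPDATE;

-- Session 2: blocks until session 1 commits, showing that row 1 was locked anyway.
SELECT actor_id FROM actor WHERE actor_id = 1 FOR UPDATE;

-- Session 1: release the locks.
COMMIT;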

