7-1 Index principles and slow query optimization (1)

One: Introduction

Why have an index?

In typical applications the ratio of reads to writes is about 10:1, and inserts and ordinary updates rarely cause performance problems. In production, the problems we meet most often, and the ones that cause the most trouble, are complex queries, so optimizing query statements is clearly the top priority. And when it comes to speeding up queries, indexes have to be mentioned.

What is an index?

In MySQL an index is also called a "key". It is a data structure that the storage engine uses to find records quickly. Indexes are critical for good performance, and the more data a table holds, the more an index matters.
Index optimization is probably the most effective way to optimize query performance: a good index can easily improve query performance by several orders of magnitude.
An index is like the phonetic lookup table of a dictionary. To look up a word without it, you would have to page through hundreds of pages one by one.

                      30

        10                          40

   5         15               35          66

1    6    11   19          21   39     55    100

Do you have misconceptions about indexes?

Indexing is an important aspect of application design and development. With too many indexes, application performance may suffer; with too few, query performance suffers. Finding the balance is critical to the application's performance. Some developers always add indexes as an afterthought, and I have always thought this stems from a wrong development model. If you know how the data will be used, you should add indexes where they are needed from the start. Developers often use the database only at the application level, writing SQL statements, stored procedures and the like; they may not even know indexes exist, or they assume the DBA can add them afterwards. The DBA, however, often does not understand the business data flow well enough, and adding indexes after the fact means monitoring a large number of SQL statements to find the problems. That takes far longer than adding the indexes at the outset, and some indexes may still be missed. Of course, more indexes are not always better. I once hit this problem: iostat on a MySQL server showed disk utilization stuck at 100%, and analysis showed that the developers had added too many indexes. After some unnecessary indexes were dropped, disk utilization immediately fell to 20%. Clearly, adding indexes takes real technical skill.

Two: Index principles

One: The principle of indexing

The purpose of an index is to improve query efficiency, in the same way as the table of contents we use for a book: first locate the chapter, then the section within that chapter, then the page. Similar examples include looking up a word in a dictionary, a train service, or a flight.

In essence, an index gets to the final result by repeatedly narrowing the range of data to be examined, and it turns random events into sequential ones. In other words, with an index we can always use the same lookup method to locate data.

A database works the same way, but is obviously much more complex, because it faces not only equality queries but also range queries (>, <, between, in), fuzzy queries (like), union queries (or), and so on. How should a database handle all of these? Recall the dictionary example: can we split the data into segments and then query segment by segment? In the simplest case, with 1000 rows, rows 1 to 100 form the first segment, 101 to 200 the second, 201 to 300 the third, and so on. To find row 250 we only need to look in the third segment, which eliminates 90% of the irrelevant data in one step. But what about 10 million rows? How many segments would be best? Anyone with a little algorithms background will think of a search tree, whose average complexity is O(log N) and which has good query performance. But this overlooks a critical issue: a complexity model assumes every operation costs the same. A database implementation is more complicated. On the one hand the data is stored on disk; on the other hand, to improve performance, part of the data can be read into memory for each computation. Since a disk access costs roughly 100,000 times as much as a memory access, a simple search tree can hardly meet the needs of complex application scenarios.
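The segmenting idea can be sketched in a few lines of Python. This is only an illustration of narrowing the search range, not how a database stores data; the helper name `find_in_segments` is hypothetical, and the segment size of 100 comes from the 1000-row example.

```python
SEGMENT_SIZE = 100  # rows per segment, as in the 1000-row example


def find_in_segments(rows, target, segment_size=SEGMENT_SIZE):
    """Locate `target` among sorted ids 1..N by scanning only one segment.

    Returns (found, rows_examined).
    """
    # Ids start at 1, so id 250 falls in segment index 2 (the third segment).
    segment = (target - 1) // segment_size
    start = segment * segment_size
    chunk = rows[start:start + segment_size]
    return target in chunk, len(chunk)


rows = list(range(1, 1001))          # 1000 rows with ids 1..1000
found, examined = find_in_segments(rows, 250)
skipped = 1 - examined / len(rows)   # fraction of rows never looked at
print(found, examined, f"{skipped:.0%}")
```

Only one segment of 100 rows is examined, so 90% of the rows are never touched.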

Two: Disk I/O and read-ahead

Disk access was mentioned above, so here is a brief introduction to disk I/O and read-ahead. Reading data from a disk relies on mechanical movement, and the time for each read can be divided into three parts: seek time, rotational latency, and transfer time. Seek time is the time the arm needs to move to the specified track; on mainstream disks it is generally under 5 ms. Rotational latency depends on the rotational speed we usually hear quoted: a 7200 rpm disk makes 7200 revolutions per minute, that is 120 revolutions per second, so the average rotational latency is 1/120/2 = 4.17 ms. Transfer time is the time to read data from, or write data to, the disk; it is usually a few tenths of a millisecond and is negligible next to the other two. So the time for one disk access, that is one disk I/O, is about 5 + 4.17 = 9 ms. That sounds pretty good, but a 500-MIPS (Million Instructions Per Second) machine can execute 500 million instructions per second, because instructions depend only on electrical properties. In other words, the time of one I/O is enough to execute about 4.5 million instructions. A database often holds hundreds of thousands, millions, or even tens of millions of rows, and 9 ms for each access would obviously be a disaster. (Figure: comparison of computer hardware latencies, for reference.)

Because disk I/O is such an expensive operation, the operating system applies some optimizations. On each I/O it reads into the memory buffer not only the data at the requested disk address but also the adjacent data, because the principle of locality tells us that when a computer accesses the data at one address, the adjacent data will soon be accessed as well. The data read by one I/O is called a page. The page size depends on the operating system and is generally 4 KB or 8 KB, which means that reading any data within one page costs only one I/O. This fact is very helpful for designing the index data structure.
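The disk timing arithmetic can be checked with a short Python sketch. It assumes the figures quoted in the text (5 ms seek, 7200 rpm, 500 MIPS) and ignores transfer time; the function names are hypothetical.

```python
def disk_io_ms(seek_ms=5.0, rpm=7200):
    """Approximate time of one disk I/O in milliseconds (transfer time ignored)."""
    revolutions_per_second = rpm / 60
    # On average the platter has to turn half a revolution.
    rotational_latency_ms = 1000 / revolutions_per_second / 2
    return seek_ms + rotational_latency_ms


def instructions_per_io(mips=500, io_ms=None):
    """How many instructions a CPU of `mips` MIPS executes during one disk I/O."""
    if io_ms is None:
        io_ms = disk_io_ms()
    return mips * 1_000_000 * io_ms / 1000


print(f"one I/O: about {disk_io_ms():.2f} ms")
print(f"instructions per I/O: about {instructions_per_io():,.0f}")
```

With these defaults one I/O takes a little over 9 ms, during which the CPU could have executed several million instructions.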

Three: Index data structures

So far we have covered the basic principle of indexing, the complexity of databases, and some operating system background. The point is to show that no data structure appears out of thin air; each has its background and usage scenarios. Now let us summarize what we need this data structure to do. It is actually simple: keep the number of disk I/Os for each lookup very small, ideally a constant. Could a multiway search tree with a tightly controlled height meet that need? This is how the B+ tree came about (the B+ tree evolved from the binary search tree, then the balanced binary tree, then the B-tree).

The figure shows a B+ tree. The formal definition of a B+ tree is not repeated here; only the key points are covered. The light blue blocks are called disk blocks. Each disk block contains several data items (dark blue) and pointers (yellow). For example, disk block 1 contains the data items 17 and 35 and the pointers P1, P2 and P3: P1 points to the disk block holding values less than 17, P2 to the block holding values between 17 and 35, and P3 to the block holding values greater than 35. The real data lives in the leaf nodes, namely 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes do not store real data; they store only data items that guide the search. For example, 17 and 35 do not actually exist in the table.

### B+ tree search process

As shown in the figure, to find data item 29, disk block 1 is first loaded from disk into memory, which is the first I/O. A binary search in memory determines that 29 lies between 17 and 35, which selects pointer P2 of disk block 1. The in-memory time is so short compared with a disk I/O that it can be ignored. Using the disk address in pointer P2 of disk block 1, disk block 3 is loaded from disk into memory, which is the second I/O. 29 lies between 26 and 30, which selects pointer P2 of disk block 3, and through that pointer disk block 8 is loaded into memory, which is the third I/O. A binary search in memory then finds 29 and the query ends, after three I/Os in total. In practice a 3-level B+ tree can represent millions of rows. If a lookup among millions of rows needs only three I/Os, the performance gain is enormous. Without an index, each data item would cost one I/O, for millions of I/Os in total, which is obviously very expensive.
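The walkthrough for data item 29 can be replayed with a minimal Python sketch that counts one I/O per loaded block. Blocks 1, 3 and 8 follow the text. The separator keys of blocks 2 and 4 (8, 12 and 65, 87), the numbering of the other leaf blocks, and the `search` helper are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Each entry models one disk block. Internal blocks hold separator keys and
# child block ids; leaf blocks hold the real data items.
# Blocks 1, 3 and 8 follow the walkthrough. The separator keys of blocks 2
# and 4 and the numbering of the remaining leaves are assumed.
BLOCKS = {
    1: {"keys": [17, 35], "children": [2, 3, 4]},
    2: {"keys": [8, 12], "children": [5, 6, 7]},
    3: {"keys": [26, 30], "children": [None, 8, None]},
    4: {"keys": [65, 87], "children": [9, 10, 11]},
    5: {"items": [3, 5]},
    6: {"items": [9, 10]},
    7: {"items": [13, 15]},
    8: {"items": [28, 29]},
    9: {"items": [36, 60]},
    10: {"items": [75, 79]},
    11: {"items": [90, 99]},
}


def search(target, root=1):
    """Return (found, loaded_blocks); every loaded block models one disk I/O."""
    loaded = []
    block_id = root
    while block_id is not None:
        block = BLOCKS[block_id]          # "load" the block from disk
        loaded.append(block_id)
        if "items" in block:              # leaf: binary search among the real data
            items = block["items"]
            i = bisect_left(items, target)
            return i < len(items) and items[i] == target, loaded
        # Internal block: binary search picks the pointer to follow.
        block_id = block["children"][bisect_right(block["keys"], target)]
    return False, loaded


print(search(29))
```

Searching for 29 loads blocks 1, 3 and 8, that is three I/Os. Searching for 17 finds nothing, because 17 is only a separator key and not real data.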

### B+ tree properties

1. Keep index fields as small as possible. From the analysis above we know that the number of I/Os depends on the height h of the B+ tree. Suppose the table currently holds N rows and each disk block holds m data items; then $h = \log_{m+1} N$. For a fixed N, the larger m is, the smaller h is. And m = disk block size / data item size. The disk block size is the size of one data page and is fixed, so the less space a data item takes, the more data items fit in a block and the lower the tree. This is why each data item, that is the index field, should be as small as possible: for example, int takes 4 bytes, half of bigint's 8 bytes. It is also why a B+ tree keeps the real data in the leaf nodes rather than in the inner nodes: if real data were stored in inner nodes, the number of data items per disk block would drop sharply and the tree would grow taller. In the extreme case where a block holds only one data item, the tree degenerates into a linear list.

2. The leftmost-prefix matching property of indexes. When the data items of a B+ tree are composite, for example (name, age, sex), the B+ tree builds the search tree by comparing the fields from left to right. When a lookup such as (Zhang, 20, F) arrives, the B+ tree first compares name to decide the next search direction; if the names are equal it compares age and then sex in turn, and finally reaches the data. But when a lookup with no name arrives, such as (20, F), the B+ tree does not know which node to visit next, because name was the first comparison factor when the tree was built, and it must search by name first to know where to go. When a lookup such as (Zhang, F) arrives, the B+ tree can use name to choose the search direction, but the next field, age, is missing, so it can only find all the entries whose name equals Zhang and then filter those for sex equal to F. This is a very important property: the leftmost-prefix matching of indexes.
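The leftmost-prefix behaviour can be imitated in Python with a sorted list of tuples standing in for the ordered entries of a composite (name, age, sex) index. The sample rows and the `prefix_range` helper are hypothetical; a real B+ tree walks nodes rather than a flat list, but the ordering argument is the same.

```python
from bisect import bisect_left

# Entries of a composite index on (name, age, sex), kept in sorted order.
index = sorted([
    ("Li", 25, "M"),
    ("Wang", 20, "F"),
    ("Zhang", 20, "F"),
    ("Zhang", 20, "M"),
    ("Zhang", 31, "F"),
])


def prefix_range(entries, prefix):
    """Binary search for the entries whose leftmost columns equal `prefix`."""
    lo = bisect_left(entries, prefix)
    hi = lo
    while hi < len(entries) and entries[hi][:len(prefix)] == prefix:
        hi += 1
    return entries[lo:hi]


# (Zhang, 20): the leftmost columns are given, so binary search applies.
print(prefix_range(index, ("Zhang", 20)))

# (Zhang, F): only name can narrow the search; sex is filtered afterwards.
print([e for e in prefix_range(index, ("Zhang",)) if e[2] == "F"])

# (20, F): no name, so the sort order is useless and every entry is checked.
print([e for e in index if e[1] == 20 and e[2] == "F"])
```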

Four: Clustered indexes and secondary indexes

In a database, a B+ tree is generally 2 to 4 levels high, which means that finding the row for a given key takes at most 2 to 4 I/Os. That is quite good: an ordinary mechanical hard disk can perform at least 100 I/Os per second, so 2 to 4 I/Os mean a lookup takes only 0.02 to 0.04 seconds.
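To make the height estimate concrete, here is a small Python sketch of the simplified model used in this article: each disk block holds m data items, the tree has fan-out m + 1, and every level costs one I/O at 100 I/Os per second. The function names and the sample values of m are assumptions for illustration.

```python
def btree_height(n_rows, items_per_block):
    """Smallest h with (m + 1) ** h >= n_rows, i.e. h = ceil(log_{m+1} N)."""
    fanout = items_per_block + 1
    height, capacity = 1, fanout
    while capacity < n_rows:
        capacity *= fanout
        height += 1
    return height


def lookup_seconds(height, ios_per_second=100):
    """Lookup time if every level of the tree costs one disk I/O."""
    return height / ios_per_second


for m in (99, 999):
    h = btree_height(1_000_000, m)
    print(f"m={m}: height {h}, about {lookup_seconds(h):.2f} s per lookup")
```

Under this model a million rows need three levels when a block holds 99 items and two levels when it holds 999, which is why smaller index fields pay off.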

B+ tree indexes in a database can be divided into clustered indexes and secondary indexes.

What clustered and secondary indexes have in common: internally both are B+ trees, that is, both are height-balanced, and the leaf nodes hold all of the data.

How clustered and secondary indexes differ: whether the leaf nodes store the entire row.

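1. Clustered index

#InnoDB tables are index-organized tables: the data in a table is stored in primary key order. A clustered index is a B+ tree built on the table's primary key whose leaf nodes hold the row records of the whole table; the leaf nodes of the clustered index are therefore also called data pages. This property means that in an index-organized table the data is itself part of the index. As in the B+ tree data structure, the data pages are linked to one another by a doubly linked list.

#If no primary key is defined, MySQL takes the first unique index (UNIQUE) that contains only non-null (NOT NULL) columns as the primary key, and InnoDB uses it as the clustered index.

#If there is no such index, InnoDB generates a hidden 6-byte ID value itself and uses it as the clustered index.

#Because the actual data pages can be sorted by only one B+ tree, each table can have only one clustered index. In most cases the query optimizer prefers the clustered index, because it can find the data directly in the leaf nodes of the B+ tree. In addition, because the clustered index defines the logical order of the data, it can serve range queries especially quickly.

First benefit of the clustered index: sorted lookups and range lookups on the primary key are very fast, because the leaf nodes hold exactly the data the user wants. For example, to query the last 10 users in a user table, the doubly linked leaf level of the B+ tree lets us find the last data page quickly and take 10 records from it.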

#Create table s1 by following the preparation stage of section Six (testing indexes)
mysql> desc s1; #no primary key at first
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id     | int(11)     | NO   |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| gender | char(6)     | YES  |     | NULL    |       |
| email  | varchar(50) | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> explain select * from s1 order by id desc limit 10; #Using filesort: an extra sort pass is needed
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra          |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+----------------+
|  1 | SIMPLE      | s1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 2633472 |   100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+----------------+
1 row in set, 1 warning (0.11 sec)

mysql> alter table s1 add primary key(id); #add a primary key
Query OK, 0 rows affected (13.37 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain select * from s1 order by id desc limit 10; #the clustered index on the primary key is already sorted once it is built, so no extra sort is needed
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------+
|  1 | SIMPLE      | s1    | NULL       | index | NULL          | PRIMARY | 4       | NULL |   10 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.04 sec)

Second benefit of the clustered index: range queries. To find the rows whose primary key lies within a certain range, the intermediate nodes above the leaf level give the range of pages, and the data pages can then be read directly.

mysql> alter table s1 drop primary key;
Query OK, 2699998 rows affected (24.23 sec)
Records: 2699998  Duplicates: 0  Warnings: 0

mysql> desc s1;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id     | int(11)     | NO   |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| gender | char(6)     | YES  |     | NULL    |       |
| email  | varchar(50) | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
4 rows in set (0.12 sec)

mysql> explain select * from s1 where id > 1 and id < 1000000; #without a clustered index, the estimated number of rows to examine is as follows
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | s1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 2690100 |    11.11 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> alter table s1 add primary key(id);
Query OK, 0 rows affected (16.25 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain select * from s1 where id > 1 and id < 1000000; #with a clustered index, the estimated number of rows to examine is as follows
+----+-------------+-------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | s1    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 1343355 |   100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.09 sec)

2. Secondary indexes

Every index on a table other than the clustered index is a secondary index (also called an auxiliary or non-clustered index). The difference from the clustered index is that the leaf nodes of a secondary index do not contain all the data of the row.

Besides the key value, each index row in a leaf node also contains a bookmark. The bookmark tells the InnoDB storage engine where to find the row data that corresponds to the index entry.

Because InnoDB tables are index-organized, the bookmark in an InnoDB secondary index is the clustered index key of the corresponding row. This is shown in the figure below.

A secondary index does not affect how the data is organized in the clustered index, so a table can have multiple secondary indexes but only one clustered index. When looking for data through a secondary index, InnoDB traverses the secondary index, obtains from its leaf level the primary key that points into the primary key index, and then uses the primary key index to find the complete row record.

For example, if the secondary index tree is 3 levels high, finding the specified primary key takes 3 lookups in that tree; if the clustered index tree is also 3 levels high, it takes another 3 lookups in the clustered index tree to find the page holding the complete row. So it takes 6 logical I/Os in total to reach the final data page.
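The two-step lookup can be sketched in Python with plain dictionaries standing in for the two trees. This only shows the data flow (secondary key, then primary key, then row) and the I/O tally from the example; the sample rows, the names, and the `find_by_name` helper are hypothetical.

```python
# Clustered index: primary key -> full row (the leaf level holds whole rows).
clustered_index = {
    1: {"id": 1, "name": "egon", "email": "egon1@oldboy"},
    2: {"id": 2, "name": "alex", "email": "alex2@oldboy"},
}

# Secondary index on name: key value -> bookmark, which in InnoDB is the
# primary key of the row, not the row itself.
secondary_index_on_name = {"egon": 1, "alex": 2}


def find_by_name(name, secondary_height=3, clustered_height=3):
    """Return (row, logical_ios) for a lookup that goes through the secondary index."""
    primary_key = secondary_index_on_name[name]   # walk the secondary tree
    row = clustered_index[primary_key]            # then walk the clustered tree
    return row, secondary_height + clustered_height


row, ios = find_by_name("alex")
print(row, ios)
```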

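Five: MySQL index management

One: Function

#1. The function of an index is to speed up lookups
#2. In MySQL, primary key, unique, and composite unique are also indexes; besides speeding up lookups, these indexes also enforce constraints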

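Two: Commonly used MySQL indexes

Ordinary index INDEX: speeds up lookups

Unique indexes:
    - Primary key index PRIMARY KEY: speeds up lookups + constraint (not null, no duplicates)
    - Unique index UNIQUE: speeds up lookups + constraint (no duplicates)

Composite indexes:
    - PRIMARY KEY(id,name): composite primary key index
    - UNIQUE(id,name): composite unique index
    - INDEX(id,name): composite ordinary index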

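Use cases for each index

For example, suppose you are building a membership card system for a shopping mall.

The system has a member table
with the following columns:
Member ID INT
Member name VARCHAR(10)
Member national ID number VARCHAR(18)
Member phone VARCHAR(10)
Member address VARCHAR(50)
Member remarks TEXT

The member ID serves as the primary key, so use PRIMARY
For the member name, if an index is needed, use an ordinary INDEX
For the member national ID number, if an index is needed, UNIQUE is a good choice (unique, no duplicates allowed)

#Besides these there is the full-text index, FULLTEXT
For the member remarks, if an index is needed, full-text search can be chosen.
It works best when searching very long text.
For shorter text, just a line or two, an ordinary INDEX also works.
In practice, though, we do not use MySQL's built-in full-text index for full-text search; we choose third-party software such as Sphinx, which is dedicated to full-text search.

#Others, such as the spatial index SPATIAL, only need to be known about; they are almost never used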

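Three: The two index types, hash and btree

#When creating the indexes above, we can specify an index type for them; there are two types
hash index: fast for single-row lookups, slow for range queries
btree index: a B+ tree; the more levels, the more data it can index, growing exponentially (this is the one we use, because InnoDB supports it by default)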
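The difference between the two types can be illustrated with a rough Python analogy: a dict stands in for a hash index and a sorted list for the ordered leaf level of a B+ tree. This is an assumption-laden sketch, not MySQL's implementation, and all names in it are hypothetical.

```python
from bisect import bisect_left, bisect_right

ids = list(range(1, 1001))
hash_index = {i: f"row-{i}" for i in ids}   # stands in for a hash index
sorted_keys = sorted(ids)                   # stands in for the ordered leaf level of a B+ tree


def hash_lookup(key):
    """Equality lookup: one hash probe."""
    return hash_index.get(key)


def hash_range(lo, hi):
    """A hash index keeps no order, so a range query must test every key."""
    return sorted(k for k in hash_index if lo <= k <= hi)


def btree_range(lo, hi):
    """An ordered index finds both ends by binary search and reads the slice between."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]


print(hash_lookup(500))
print(btree_range(10, 15))
```

Both range functions return the same keys, but `hash_range` inspects all 1000 keys while `btree_range` only locates the two endpoints.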

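#Different storage engines support different index types
InnoDB: supports transactions, row-level locking, and B-tree, Full-text and other indexes; does not support Hash indexes;
MyISAM: does not support transactions; supports table-level locking and B-tree, Full-text and other indexes; does not support Hash indexes;
Memory: does not support transactions; supports table-level locking and B-tree, Hash and other indexes; does not support Full-text indexes;
NDB: supports transactions, row-level locking, and Hash indexes; does not support B-tree, Full-text and other indexes;
Archive: does not support transactions; supports table-level locking; does not support B-tree, Hash, Full-text or other indexes;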

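Four: Syntax for creating and deleting indexes

#Method 1: when creating the table
      CREATE TABLE table_name (
                column_name1  data_type [integrity constraints…],
                column_name2  data_type [integrity constraints…],
                [UNIQUE | FULLTEXT | SPATIAL ]   INDEX | KEY
                [index_name]  (column_name[(length)]  [ASC |DESC]) 
                );


#Method 2: CREATE an index on an existing table
        CREATE  [UNIQUE | FULLTEXT | SPATIAL ]  INDEX  index_name 
                     ON table_name (column_name[(length)]  [ASC |DESC]) ;


#Method 3: ALTER TABLE to create an index on an existing table
        ALTER TABLE table_name ADD  [UNIQUE | FULLTEXT | SPATIAL ] INDEX
                             index_name (column_name[(length)]  [ASC |DESC]) ;

#Deleting an index: DROP INDEX index_name ON table_name;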

Demonstration

#Method 1
create table t1(
    id int,
    name char,
    age int,
    sex enum('male','female'),
    unique key uni_id(id),
    index ix_name(name) #written as index alone, without key
);


#Method 2
create index ix_age on t1(age);

#Method 3
alter table t1 add index ix_sex(sex);

#View the result
mysql> show create table t1;
| t1    | CREATE TABLE `t1` (
  `id` int(11) DEFAULT NULL,
  `name` char(1) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `sex` enum('male','female') DEFAULT NULL,
  UNIQUE KEY `uni_id` (`id`),
  KEY `ix_name` (`name`),
  KEY `ix_age` (`age`),
  KEY `ix_sex` (`sex`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Six: Testing indexes

One: Preparation

#1. Prepare the table
create table s1(
id int,
name varchar(20),
gender char(6),
email varchar(50)
);

#2. Create a stored procedure that inserts records in bulk
#Declare $$ as the statement terminator while the procedure is defined
delimiter $$
create procedure auto_insert1()
BEGIN
    declare i int default 1;
    while(i<3000000)do
        insert into s1 values(i,'egon','male',concat('egon',i,'@oldboy'));
        set i=i+1;
    end while;
END$$
#Restore the semicolon as the statement terminator
delimiter ;

#3. View the stored procedure
show create procedure auto_insert1\G 

#4. Call the stored procedure
call auto_insert1();

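Two: Testing query speed without an index

#No index: MySQL has no way of knowing whether a record with id equal to 333333333 exists, so it has to scan the table from start to finish; there are as many I/O operations as there are disk blocks, so the query is slow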
mysql> select * from s1 where id=333333333;
Empty set (0.33 sec)

Three: When the table already holds a large amount of data, building an index on a column is slow

Four: Once the index is built, queries that filter on that column are significantly faster

PS:

  1. MySQL first searches the index using the B+ tree search principle and quickly finds that no record with id equal to 333333333 exists; far fewer I/Os are needed, so the query is significantly faster

  2. In the MySQL data directory we can find the table's files and see that they take up more disk space than before

  3. Points to note are shown in the figure.
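The before-and-after effect can be reproduced without a MySQL server using Python's built-in sqlite3 module. Note the swap: this uses SQLite, not MySQL, so the plan text differs from MySQL's EXPLAIN output, and the table is much smaller than the one built above. The index name `idx_id` and the `query_plan` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table s1(id int, name varchar(20), gender char(6), email varchar(50))"
)
conn.executemany(
    "insert into s1 values(?, 'egon', 'male', ?)",
    ((i, f"egon{i}@oldboy") for i in range(1, 10001)),
)


def query_plan(sql):
    """Return the detail text of SQLite's query plan for `sql`."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


sql = "select * from s1 where id = 333333333"
plan_without_index = query_plan(sql)   # expected to report a full table scan
conn.execute("create index idx_id on s1(id)")
plan_with_index = query_plan(sql)      # expected to mention idx_id

print(plan_without_index)
print(plan_with_index)
```

Before the index exists the plan is a scan of the whole table; afterwards it is a search that uses `idx_id`, which mirrors the switch from type ALL to an index lookup in MySQL.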

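Five: Summary

#1. Always create the index on the column used in the search condition; for example, select * from s1 where id = 333; needs an index on id

#2. When the table already holds a large amount of data, building an index is slow and takes up disk space; once it is built, queries are faster
For example, create index idx on s1(id); scans all the data in the table, builds an index structure with id as the data item, and stores it in the table's files on disk.
After it is built, queries become fast.

#3. Note that the indexes of an InnoDB table are stored in the s1.ibd file, while a MyISAM table has a separate index file, table1.MYI

In MyISAM the index file and the data file are separate, and the index file stores only the addresses of the data records. In InnoDB the table data file is itself an index structure organized as a B+Tree (BTree meaning Balance Tree), and the data field of the tree's leaf nodes holds the complete data records. The key of this index is the table's primary key, so the InnoDB table data file is itself the primary index.
Because InnoDB's data file must be clustered by primary key, InnoDB requires every table to have a primary key (MyISAM tables can do without one). If none is defined explicitly, MySQL automatically chooses a column that can uniquely identify the rows as the primary key; if no such column exists, MySQL automatically generates a hidden field for the InnoDB table as the primary key. This field is 6 bytes long and its type is long integer.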

 


Origin www.cnblogs.com/shibojie/p/11665149.html