MySQL advanced - index design principles

Index design principles

Data preparation

Create the database and tables

CREATE DATABASE atguigudb1;
USE atguigudb1;
# 1. Create the student table and the course table
CREATE TABLE `student_info` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `student_id` INT NOT NULL ,
    `name` VARCHAR(20) DEFAULT NULL,
    `course_id` INT NOT NULL ,
    `class_id` INT(11) DEFAULT NULL,
    `create_time` DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
CREATE TABLE `course` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `course_id` INT NOT NULL ,
    `course_name` VARCHAR(40) DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Create the stored functions necessary to simulate data

# Function 1: generate a random string of n letters

DELIMITER //
CREATE FUNCTION rand_string ( n INT ) RETURNS VARCHAR ( 255 ) # returns a string
BEGIN
	DECLARE chars_str VARCHAR ( 100 ) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
	DECLARE return_str VARCHAR ( 255 ) DEFAULT '';
	DECLARE i INT DEFAULT 0;
	WHILE i < n DO
		# Append one character taken from a random position between 1 and 52
		SET return_str = CONCAT( return_str, SUBSTRING( chars_str, FLOOR( 1 + RAND() * 52 ), 1 ));
		SET i = i + 1;
	END WHILE;
	RETURN return_str;
END //
DELIMITER ;


# Function 2: generate a random integer between from_num and to_num (inclusive)

DELIMITER //
CREATE FUNCTION rand_num ( from_num INT, to_num INT ) RETURNS INT ( 11 )
BEGIN
	DECLARE i INT DEFAULT 0;
	SET i = FLOOR( from_num + RAND() * ( to_num - from_num + 1 ));
	RETURN i;
END //
DELIMITER ;

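In master-slave replication, the master records write operations in the binary log (bin-log), and the slave reads that log and replays the statements to synchronize its data. If a statement calls a function whose result can differ between executions (for example, one based on the current time or on RAND()), the master and the slave can end up with inconsistent data. Therefore, when binary logging is enabled, MySQL by default refuses to create a stored function unless it is declared DETERMINISTIC, NO SQL, or READS SQL DATA.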

Check whether mysql allows creating functions:

show variables like 'log_bin_trust_function_creators';

Enable function creation with the following command:

set global log_bin_trust_function_creators=1; # This variable has global scope only, so the GLOBAL keyword is required.

The setting is lost when mysqld restarts. To make it permanent:

On Windows, add the following under the [mysqld] section of my.ini:

log_bin_trust_function_creators=1

On Linux, add the following under the [mysqld] section of /etc/my.cnf:

log_bin_trust_function_creators=1

Create a stored procedure that inserts mock data

# Stored procedure 1: insert rows into the course table

DELIMITER //
CREATE PROCEDURE insert_course ( max_num INT )
BEGIN
	DECLARE i INT DEFAULT 0;
	SET autocommit = 0; # commit manually, once, after the loop
	REPEAT # loop until max_num rows have been inserted
		SET i = i + 1;
		INSERT INTO course ( course_id, course_name )
		VALUES ( rand_num ( 10000, 10100 ), rand_string ( 6 ));
	UNTIL i = max_num
	END REPEAT;
	COMMIT; # commit the transaction
END //
DELIMITER ;

# Stored procedure 2: insert rows into the student_info table

DELIMITER //
CREATE PROCEDURE insert_stu ( max_num INT )
BEGIN
	DECLARE i INT DEFAULT 0;
	SET autocommit = 0; # commit manually, once, after the loop
	REPEAT # loop until max_num rows have been inserted
		SET i = i + 1;
		INSERT INTO student_info ( course_id, class_id, student_id, NAME )
		VALUES ( rand_num ( 10000, 10100 ), rand_num ( 10000, 10200 ), rand_num ( 1, 200000 ), rand_string ( 6 ));
	UNTIL i = max_num
	END REPEAT;
	COMMIT; # commit the transaction
END //
DELIMITER ;

Call the stored procedures

CALL insert_course(100);
CALL insert_stu(1000000);

When is it appropriate to create an index?

1. Fields whose values must be unique

A field that is unique from a business perspective, even a combination of fields, must have a unique index. (Source: Alibaba)
Note: Do not assume that a unique index hurts insert speed. The loss is negligible, while the improvement in lookup speed is significant.
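As a rough illustration of this rule, the sketch below uses a hypothetical user_account table that is not part of this article's sample data. It assumes that phone must be unique on its own and that the pair (org_id, employee_no) must be unique together; every table, column, and index name here is an assumption.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE user_account (
    id          INT NOT NULL AUTO_INCREMENT,
    phone       CHAR(11) NOT NULL,
    org_id      INT NOT NULL,
    employee_no INT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uk_phone (phone),                      -- single-field uniqueness
    UNIQUE KEY uk_org_employee (org_id, employee_no)  -- combined-field uniqueness
);

-- The first insert succeeds.
INSERT INTO user_account (phone, org_id, employee_no) VALUES ('13800000000', 1, 1001);

-- The second insert reuses the same phone, so it is rejected with a duplicate-key error.
INSERT INTO user_account (phone, org_id, employee_no) VALUES ('13800000000', 2, 2001);
```

Besides speeding up lookups, the unique index lets the database itself enforce the business rule.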

2. Fields frequently used as WHERE query conditions

If a field is frequently used in the WHERE condition of SELECT statements, create an index on it. Especially when the amount of data is large, an ordinary index can greatly improve query efficiency.

For example, in the student_info data table (containing 1 million pieces of data), suppose we want to query the user information of student_id=123110.
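The comparison described here could be run roughly as follows. This is a minimal sketch against the student_info table created earlier; the index name idx_sid is an assumption.

```sql
-- Before: with no index on student_id, EXPLAIN should report a full table scan.
EXPLAIN SELECT course_id, class_id, name, create_time, student_id
FROM student_info
WHERE student_id = 123110;

-- idx_sid is an illustrative index name.
ALTER TABLE student_info ADD INDEX idx_sid (student_id);

-- After: the optimizer can locate the matching rows through idx_sid.
EXPLAIN SELECT course_id, class_id, name, create_time, student_id
FROM student_info
WHERE student_id = 123110;
```

Comparing the type, key, and rows columns of the two EXPLAIN outputs shows whether the index is actually used.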

3. Frequent GROUP BY and ORDER BY columns

An index stores data, and lets it be retrieved, in a certain order. So when we group data with GROUP BY or sort it with ORDER BY, we should index the grouping or sorting fields. If several columns are involved, create a composite index on those columns.
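A minimal sketch of a composite index that serves grouping and sorting on student_info follows. The index name idx_sid_ctime and the choice of columns are assumptions made for illustration.

```sql
-- idx_sid_ctime is an illustrative name; the column order matches the ORDER BY list below.
ALTER TABLE student_info ADD INDEX idx_sid_ctime (student_id, create_time);

-- Grouping on the leading column can read the index in order.
EXPLAIN SELECT student_id, COUNT(*) AS num
FROM student_info
GROUP BY student_id
LIMIT 100;

-- Sorting on both columns, in index order, can avoid a separate sort step.
EXPLAIN SELECT student_id, create_time
FROM student_info
ORDER BY student_id, create_time
LIMIT 100;
```

If the Extra column of EXPLAIN no longer shows "Using temporary" or "Using filesort", the index is doing the ordering work.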

4. WHERE condition column of UPDATE and DELETE

When rows are located by a condition and then updated or deleted, an index on the WHERE field can greatly improve efficiency. The reason is that the record must first be retrieved using the WHERE condition column and only then updated or deleted. If the updated fields are not indexed fields, the improvement is even more noticeable, because the update does not require any index maintenance.
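As a sketch of this case, the statements below filter on name and change only class_id, which carries no index in the sample schema. The index name idx_name and the value 'abcdef' are placeholders.

```sql
-- Skip this statement if name is already indexed. idx_name is an illustrative name.
ALTER TABLE student_info ADD INDEX idx_name (name);

-- The row is found through idx_name; class_id has no index, so no index entry has to be rewritten.
UPDATE student_info SET class_id = 10001 WHERE name = 'abcdef';

-- DELETE benefits in the same way when locating the rows to remove.
DELETE FROM student_info WHERE name = 'abcdef';
```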

5. The DISTINCT field needs to be indexed

Sometimes we need to deduplicate a field with DISTINCT. Creating an index on that field also improves query efficiency.
For example, suppose we want to query the distinct student_id values in the student_info table. If we do not create an index on student_id, we execute the
SQL statement:

SELECT DISTINCT(student_id) FROM `student_info`;

If we create an index on student_id and then execute the same SQL statement, the query becomes faster and the student_id values are returned in increasing order. This is because the index keeps the data sorted, so deduplication is much faster.
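The second run might look like the sketch below. The index name idx_sid is an assumption.

```sql
-- Skip this statement if student_id is already indexed. idx_sid is an illustrative name.
ALTER TABLE student_info ADD INDEX idx_sid (student_id);

-- Same query as before; it can now be answered by walking the index in order.
SELECT DISTINCT student_id FROM student_info;

-- Check the plan: the key column should show the index being used.
EXPLAIN SELECT DISTINCT student_id FROM student_info;
```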

6. Things to note when creating indexes during multi-table JOIN connection operations

First, join as few tables as possible, preferably no more than 3, because each additional table is equivalent to adding one more nested loop. The cost grows by an order of magnitude each time and seriously affects query efficiency.
Second, create an index on the WHERE condition, because WHERE is what filters the data. With a very large amount of data, a join with no WHERE filtering is extremely expensive.
Finally, create an index on the field used for the join, and make sure the field has the same type in every table. For example, course_id is of type INT in both the student_info table and the course table; it must not be INT in one table and VARCHAR in the other, because a type mismatch forces an implicit conversion that can prevent the index from being used.
For example, if we only create an index on student_id, we execute the SQL statement:

SELECT student_info.course_id, name, student_info.student_id, course_name
FROM student_info JOIN course
ON student_info.course_id = course.course_id
WHERE name = '462eed7ac6e791292a79';

If we then create an index on name and execute the above SQL statement again, the running time is 0.002s.
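The indexes discussed in this item could be created as in the sketch below. The index names idx_name and idx_course_id are assumptions; course_id is qualified with its table name because both tables contain a column of that name.

```sql
-- Index for the WHERE filter. Skip if name is already indexed.
ALTER TABLE student_info ADD INDEX idx_name (name);

-- Index for the join column on the table being joined to.
ALTER TABLE course ADD INDEX idx_course_id (course_id);

-- Inspect how the join is executed with both indexes in place.
EXPLAIN SELECT student_info.course_id, name, student_info.student_id, course_name
FROM student_info JOIN course
ON student_info.course_id = course.course_id
WHERE name = '462eed7ac6e791292a79';
```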

7. Create indexes on columns with small data types

8. Create index using string prefix

Create a merchant table. Because the address field is relatively long, create a prefix index on the address field.

create table shop(address varchar(120) not null);
alter table shop add index(address(12));

The question is, how many characters should the prefix keep? If it keeps too many, the index saves little storage space; if it keeps too few, too many prefixes are identical and the selectivity of the index drops. How can we calculate the selectivity for different lengths?

Let’s first look at the field’s selectivity in all data:

select count(distinct address) / count(*) from shop;

Calculate through different lengths and compare with the selectivity of the entire table:

Formula:

count(distinct left(column_name, index_length)) / count(*)

for example

select count(distinct left(address,10)) / count(*) as sub10, -- selectivity of the first 10 characters
count(distinct left(address,15)) / count(*) as sub15, -- selectivity of the first 15 characters
count(distinct left(address,20)) / count(*) as sub20, -- selectivity of the first 20 characters
count(distinct left(address,25)) / count(*) as sub25 -- selectivity of the first 25 characters
from shop;

This leads to another question: the impact of a prefix index on sorting. Because a prefix index stores only the first characters of each value, it cannot provide the order of the full column, so it cannot be used to satisfy ORDER BY on that column.
Extension: Alibaba "Java Development Manual"
[Mandatory] When creating an index on a varchar field, an index length must be specified. There is no need to index the entire field; choose the index length based on the actual selectivity of the text.
Note: Index length and selectivity pull in opposite directions. Generally, for string data, an index with a length of 20 has a selectivity above 90%. You can measure it with count(distinct left(column name, index length)) / count(*).
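The effect of a prefix index on sorting can be checked with a query like the sketch below, run against the shop table defined above. It assumes the only index on address is the prefix index address(12).

```sql
-- The prefix index holds only the first 12 characters of address,
-- so it cannot supply the order of the full column.
EXPLAIN SELECT address
FROM shop
ORDER BY address
LIMIT 12;
```

Under that assumption, the Extra column would be expected to report "Using filesort", meaning the rows are sorted separately rather than read in index order.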

9. Columns with high selectivity (many distinct values) are suitable for indexes
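The same selectivity formula used for prefix lengths can be used to compare whole columns. The sketch below applies it to three columns of the student_info sample table; the column aliases are illustrative.

```sql
-- Values close to 1 mean almost every row has a different value (high selectivity);
-- values close to 0 mean heavy duplication.
SELECT COUNT(DISTINCT student_id) / COUNT(*) AS sel_student_id,
       COUNT(DISTINCT course_id)  / COUNT(*) AS sel_course_id,
       COUNT(DISTINCT class_id)   / COUNT(*) AS sel_class_id
FROM student_info;
```

Among candidate columns, the one with the highest ratio is usually the better choice for an index.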

10. Place the most frequently used columns on the left side of the joint index

This also means fewer indexes need to be created. At the same time, because of the "leftmost prefix principle", more queries are able to use the joint index.
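A minimal sketch of the leftmost prefix principle on student_info follows. It assumes course_id is the more frequently filtered column; the index name idx_cid_clid and the literal values are placeholders.

```sql
-- course_id is placed first because it is assumed to be filtered on most often.
ALTER TABLE student_info ADD INDEX idx_cid_clid (course_id, class_id);

-- Can use the index: the condition covers the leftmost column.
EXPLAIN SELECT * FROM student_info WHERE course_id = 10050;

-- Can use the index: the conditions cover both columns, starting from the left.
EXPLAIN SELECT * FROM student_info WHERE course_id = 10050 AND class_id = 10100;

-- Generally cannot use the index for a direct lookup: the leftmost column is missing.
EXPLAIN SELECT * FROM student_info WHERE class_id = 10100;
```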

11. When multiple fields need to be indexed, a joint index is better than several single-column indexes

In what situations is it not suitable to create an index?

1. Do not create indexes on fields that are not used in WHERE

2. It is best not to create indexes on tables with small data volumes

# Table 1: no index on b
CREATE TABLE t_without_index ( a INT PRIMARY KEY AUTO_INCREMENT, b INT );

# Stored procedure that inserts 900 rows
DELIMITER //
CREATE PROCEDURE t_wout_insert ()
BEGIN
	DECLARE i INT DEFAULT 1;
	WHILE i <= 900 DO
		INSERT INTO t_without_index ( b ) SELECT RAND() * 10000;
		SET i = i + 1;
	END WHILE;
	COMMIT;
END //
DELIMITER ;

# Call it
CALL t_wout_insert ();

# Table 2: index idx_b on b
CREATE TABLE t_with_index ( a INT PRIMARY KEY AUTO_INCREMENT, b INT, INDEX idx_b ( b ) );

# Stored procedure that inserts 900 rows
DELIMITER //
CREATE PROCEDURE t_with_insert ()
BEGIN
	DECLARE i INT DEFAULT 1;
	WHILE i <= 900 DO
		INSERT INTO t_with_index ( b ) SELECT RAND() * 10000;
		SET i = i + 1;
	END WHILE;
	COMMIT;
END //
DELIMITER ;

# Call it
CALL t_with_insert ();

Query comparison:

mysql> select * from t_without_index where b = 9879;
+------+------+
| a    | b    |
+------+------+
| 1242 | 9879 |
+------+------+
1 row in set (0.00 sec)
mysql> select * from t_with_index where b = 9879;
+-----+------+
| a   | b    |
+-----+------+
| 112 | 9879 |
+-----+------+
1 row in set (0.00 sec)

You can see that both queries take the same time: when the amount of data is small, the index brings no measurable benefit.

Conclusion: When a table has relatively few rows, for example fewer than 1,000, there is no need to create an index.

3. Do not create indexes on columns with a large amount of duplicate data

Example 1: To find 500,000 rows among 1 million rows (for example, all rows where gender is male), a query that goes through the index has to access the index 500,000 times and then access the data table 500,000 times. The total overhead may be greater than not using the index at all.

Example 2: Suppose a student table holds 1 million students, of whom only 10 are male, that is, 1/100,000 of the total. In the student_gender table, the student_gender field takes the value 0 or 1, with 0 representing female and 1 representing male.
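Before deciding whether to index such a column, it can help to look at how its values are actually distributed. The sketch below assumes only the student_gender table and student_gender field named above.

```sql
-- Count how many rows carry each value of the column.
SELECT student_gender, COUNT(*) AS cnt
FROM student_gender
GROUP BY student_gender;
```

A value that matches a large share of the rows, as in Example 1, gains little from an index lookup.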

4. Avoid creating too many indexes on frequently updated tables

5. It is not recommended to use unordered values as indexes

For example: ID card numbers, UUIDs (which need to be converted to ASCII during index comparison and may cause page splits on insertion), MD5 values, hashes, and unordered long strings.

6. Delete indexes that are no longer used or rarely used
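One way to find candidates for removal is the sys schema that ships with MySQL 5.7 and later. The sketch below assumes the sys schema is installed and the Performance Schema is enabled; idx_example is a hypothetical index name.

```sql
-- Indexes in the sample database that have not been used since the server started.
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = 'atguigudb1';

-- After confirming an index is really unnecessary, drop it.
-- idx_example is a hypothetical name, not an index defined in this article.
ALTER TABLE student_info DROP INDEX idx_example;
```

The usage statistics are collected since the last restart, so an index that is only needed by an occasional report may still appear in the list.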

7. Do not define redundant or duplicate indexes

① Redundant index
Example: The table creation statement is as follows

CREATE TABLE person_info(
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    birthday DATE NOT NULL,
    phone_number CHAR(11) NOT NULL,
    country varchar(100) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_name_birthday_phone_number (name(10), birthday, phone_number),
    KEY idx_name (name(10))
);

We know that the name column can already be searched quickly through the idx_name_birthday_phone_number index, so an index created specifically for the name column is redundant. Maintaining it only adds maintenance cost and brings no benefit to searches.
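Removing the redundant index from the person_info table above takes one statement, sketched here with a check before and after.

```sql
-- List the indexes currently defined on the table.
SHOW INDEX FROM person_info;

-- idx_name is covered by the leftmost column of idx_name_birthday_phone_number.
ALTER TABLE person_info DROP INDEX idx_name;

-- Searches on name still use the remaining joint index.
EXPLAIN SELECT * FROM person_info WHERE name = 'abc';
```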

② Duplicate index

In another case, we may index the same column more than once, for example:

CREATE TABLE repeat_index_demo (
    col1 INT PRIMARY KEY,
    col2 INT,
    UNIQUE uk_idx_c1 (col1),
    INDEX idx_c1 (col1)
);

We see that col1 is not only the primary key but is also given a unique index and an ordinary index. However, the primary key itself already produces the clustered index, so the unique index and the ordinary index are duplicates. This situation should be avoided.
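For the repeat_index_demo table above, the duplicates can be removed as in the following sketch, leaving only the primary key on col1.

```sql
-- Drop both duplicate indexes in a single statement.
ALTER TABLE repeat_index_demo
    DROP INDEX uk_idx_c1,
    DROP INDEX idx_c1;

-- Verify that only PRIMARY remains.
SHOW INDEX FROM repeat_index_demo;
```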