MySQL high-frequency interview questions

What is MySQL

MySQL is a relational database that stores data in tables, similar to an Excel spreadsheet. Since data is stored as tables, there is a table structure (rows and columns). A row represents one record, and each column holds one value of that record. Each column has a data type, such as integer, string, or date.

Three paradigms of database

First Normal Form 1NF

Ensure the atomicity of database table fields.

For example, the field userInfo: '广东省 10086' must be split into two fields according to the first normal form: userInfo: '广东省' and userTel: '10086'.

Second Normal Form 2NF

First, the first normal form must be satisfied. In addition, there are two requirements: the table must have a primary key, and non-primary-key columns must depend on the entire primary key, not just part of it.

For example, assume the course selection table is student_course(student_no, student_name, age, course_name, grade, credit), with primary key (student_no, course_name). Credit depends entirely on course_name, and student_name and age depend entirely on student_no, so the table does not conform to the second normal form. This leads to data redundancy (if a student takes n courses, name and age are repeated in n records) and insertion anomalies (a newly created course cannot be saved because there is no student_no yet).

It should be split into three tables: student(student_no, student_name, age); course(course_name, credit); and the course selection relation student_course_relation(student_no, course_name, grade).
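
As a minimal sketch of this decomposition (column types are assumed for illustration):

create table student (
    student_no   int primary key,
    student_name varchar(64),
    age          int
);

create table course (
    course_name varchar(64) primary key,
    credit      int
);

-- grade depends on the whole key (student_no, course_name)
create table student_course_relation (
    student_no  int,
    course_name varchar(64),
    grade       int,
    primary key (student_no, course_name)
);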

Third Normal Form 3NF

First of all, the second normal form must be satisfied. In addition, non-primary key columns must directly depend on the primary key, and there cannot be transitive dependencies. That is, it cannot exist: non-primary key column A depends on non-primary key column B, and non-primary key column B depends on the primary key.

Assume the student table is Student(student_no, student_name, age, academy_id, academy_telephone), with student_no as the primary key. Here academy_id depends on student_no, and academy_telephone depends on academy_id. This is a transitive dependency, which violates the third normal form.

The student table can be split into the following two tables: student(student_no, student_name, age, academy_id) and academy(academy_id, academy_telephone).

The difference between 2NF and 3NF?

  • 2NF is about whether non-primary-key columns depend on the whole primary key or only on part of it.
  • 3NF is about whether non-primary-key columns depend on the primary key directly, or indirectly through another non-primary-key column.


Four characteristics of transactions?

Transactions have four ACID characteristics: Atomicity, Consistency, Isolation, and Durability.

  • Atomicity means that all operations contained in a transaction either all succeed, or all fail and are rolled back.
  • Consistency means that a transaction must leave the data in a consistent state before and after execution. For example, if accounts a and b hold 1,000 yuan in total, then after a transfer between them, whether it succeeds or fails, the sum of the two accounts is still 1,000.
  • Isolation is related to the isolation level; for example, under read committed, a transaction can only read changes that have been committed.
  • Durability means that once a transaction is committed, its changes to the data are permanent; even if the database system fails, committed changes are not lost.

What are the transaction isolation levels?

First understand the following concepts: dirty reads, non-repeatable reads, and phantom reads.

  • Dirty read refers to reading, during one transaction, data that another transaction has not yet committed.
  • Non-repeatable read means that for a given row, multiple queries within one transaction return different values, because another transaction modified the data and committed it between the queries.
  • Phantom read occurs when a transaction is reading a range of records while another transaction inserts new records into that range. A correct way to understand a phantom read is that the conclusion drawn by a read within a transaction cannot support the subsequent operation. Suppose a transaction wants to insert a record with primary key id = xxx; a select executed before the insert finds no record with that id, yet the insert fails with a primary-key conflict. That is a phantom read: the record cannot be read, but the conflict shows it has in fact been inserted by another transaction that is invisible to the current one.

The difference between non-repeatable reads and dirty reads is that dirty reads mean that a transaction reads dirty data that has not been committed by another transaction, while non-repeatable reads mean that data committed by a previous transaction is read.
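
A minimal two-session sketch of a non-repeatable read under read committed (the user table and its data are assumed for illustration):

-- session A
set session transaction isolation level read committed;
begin;
select age from user where id = 1;  -- returns 10

-- session B
begin;
update user set age = 20 where id = 1;
commit;

-- session A, same transaction
select age from user where id = 1;  -- now returns 20: a non-repeatable read
commit;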

Transaction isolation is to solve the problems of dirty reads, non-repeatable reads, and phantom reads mentioned above.

The MySQL database provides us with four isolation levels:

  • Serializable : Solve the phantom read problem by forcing transactions to be ordered so that they cannot conflict with each other.
  • Repeatable read (repeatable read): MySQL's default transaction isolation level, which ensures that multiple instances of the same transaction will see the same data row when reading data concurrently, which solves the problem of non-repeatable read.
  • Read committed (read committed): A transaction can only see the changes made by the committed transaction. Dirty reads can be avoided.
  • Read uncommitted (read uncommitted): All transactions can see the execution results of other uncommitted transactions.

Check the isolation level:

select @@transaction_isolation; -- on MySQL versions before 5.7.20, use @@tx_isolation

Set the isolation level:

set session transaction isolation level read uncommitted;

What isolation level is generally used for production environment databases?

Most production environments use RC. Why not RR?

Repeatable Read (Repeatable Read), referred to as RR
Read Committed (Read Committed), referred to as RC

Reason 1: Under the RR isolation level there are gap locks, so the probability of deadlock is much higher than under RC.
Reason 2: Under the RR isolation level, if the condition column misses the index, the whole table is locked, while under RC only the rows are locked.

That is to say, the concurrency of RC is higher than that of RR.

And in most scenarios, the non-repeatable read problem is acceptable. After all, the data has already been submitted, and there is no big problem in reading it out!


The relationship between encoding and character set

We usually type Chinese and English characters into an editor, but those are for people to read, not for computers. In fact, a computer stores and transmits data in binary 0101 format.

So there must be a rule for converting characters into binary. For example, the letter d corresponds to 0x64, which is 01100100 in binary. This one-to-one correspondence between characters and numbers is the ASCII encoding format.

It uses one byte, that is, 8 bits, to identify a character, with 128 basic symbols and 128 extended symbols. It can only represent English letters and numbers.

This is clearly not enough. Therefore, to represent Chinese, the GB2312 encoding format appeared; to represent Greek, the greek encoding format appeared; and to represent Russian, the cp866 encoding format appeared.

To unify them, the Unicode encoding format appeared. It uses 2 to 4 bytes to represent a character, so in theory all symbols can be included, and it is fully compatible with ASCII: the same letter d that is represented by 0x64 in ASCII is still 0x64 in Unicode.

But the difference is that ASCII represents it with 1 byte, while Unicode here uses two bytes.

The same letter d costs one more byte in Unicode than in ASCII:

d   ASCII:           01100100
d Unicode:  00000000 01100100

As you can see, the leading byte of the Unicode encoding above is all zeros, yet it still occupies a byte, which is wasteful. If the leading zeros could be omitted when they are not needed, a lot of space could be saved. Following this idea, UTF-8 encoding was created.

To sum up: a rule that maps symbols to binary codes according to certain conventions is an encoding; a collection of many such encoded characters is what we usually call a character set.

For example, the utf-8 character set is a collection of all characters in the utf-8 encoding format.

To see which character sets MySQL supports, execute show charset;.

The difference between utf8 and utf8mb4

As mentioned above, utf-8 is an optimization based on Unicode. Since Unicode has a way to represent all characters, utf-8 can represent them too. To avoid confusion, I will call it the big utf8 below.

The character sets supported by mysql include utf8 and utf8mb4.

Let's talk about the utf8mb4 encoding first. mb4 means most bytes 4. As the Maxlen column in the show charset output indicates, it uses at most 4 bytes per character, and it can represent almost all currently known characters.

Now for the utf8 character set in MySQL. Note that this utf8 is not the utf8 above; call it the small utf8 character set. Its Maxlen shows that it uses at most 3 bytes per character, so by utf8mb4's naming convention it should really be called utf8mb3.

MySQL's utf8 is like a cut-down version of utf8mb4 that supports only some characters. For example, it does not support emoji.

In the character set supported by mysql, the third column, collation , refers to the comparison rules of the character set .

For example, "debug" and "Debug" are the same word, but they are capitalized differently. Should they be judged as the same word?

At this time, collation is needed.

Execute SHOW COLLATION WHERE Charset = 'utf8mb4'; to see which comparison rules utf8mb4 supports.

If collation = utf8mb4_general_ci, it means that, under the utf8mb4 character set, characters are compared one by one using general rules, case-insensitively (_ci, case insensitive).

In this case, "debug" and "Debug" are the same word.

If the collation is changed to utf8mb4_bin, it means that the binary values of the characters are compared bit by bit.

So "debug" and "Debug" are not the same word.

Does utf8mb4 have any disadvantages compared to utf8?

We know that if a column's type is char(2), the 2 refers to the number of characters; that is, no matter which character set the table uses, the column can hold 2 characters.

And char is fixed-length. To be able to hold 2 utf8mb4 characters, char reserves 2 * 4 (maxlen = 4) = 8 bytes by default.

With utf8mb3 it reserves 2 * 3 (maxlen = 3) = 6 bytes. In other words, utf8mb4 uses somewhat more space than utf8mb3 in this case.

Index

What is an index?

An index is a data structure used by the storage engine to improve the access speed of the database table . It can be compared to the directory of a dictionary, which can help you quickly find the corresponding records.

Indexes are generally stored in files on disk, which take up physical space.

Advantages and disadvantages of indexing?

Advantages:

  • Speed up data lookups
  • Adding indexes to fields used for sorting or grouping speeds up grouping and sorting
  • Speed up table-to-table joins

Disadvantages:

  • Indexing requires physical space
  • It reduces the efficiency of insert, delete, and update operations, because every time a record is added, deleted, or modified, the index must be maintained dynamically, which takes extra time

The role of the index?

The data is stored on the disk. When querying the data, if there is no index, all the data will be loaded into the memory and retrieved sequentially. The number of disk reads is large. With the index, there is no need to load all the data, because the height of the B+ tree is generally 2-4 layers, and it only needs to read the disk 2-4 times at most, and the query speed is greatly improved.

Under what circumstances do you need to build an index?

  1. Fields frequently used for queries
  2. The fields that are often used for connection are indexed, which can speed up the connection
  3. Fields that often need to be sorted are indexed, because the index is already sorted, which can speed up the sorting query

Under what circumstances is the index not built?

  1. Fields not used in where conditions are not suitable for indexing
  2. The table has fewer records. For example, if there are only a few hundred pieces of data, there is no need to add an index.
  3. Need to add and delete frequently. Need to evaluate whether it is suitable for indexing
  4. Columns participating in column calculations are not suitable for indexing
  5. Fields with low discrimination are not suitable for indexing, such as gender, which only has three values: male/female/unknown. Adding an index will not improve query efficiency.

Index data structures

The data structure of the index mainly includes B+ tree and hash table, and the corresponding indexes are B+ tree index and hash index respectively. The index types of the InnoDB engine include B+ tree index and hash index, and the default index type is B+ tree index.

B+ tree index

The B+ tree is implemented based on the B tree plus sequential access pointers at the leaf nodes. It keeps the balance of the B tree and improves the performance of range queries through the sequential access pointers.

In a B+ tree, keys are arranged in increasing order from left to right. If the left and right neighbors of a pointer are key i and key i+1 respectively, then the pointer points to all nodes whose keys are greater than or equal to key i and less than or equal to key i+1.

(Figure: B+ tree index)

When performing a search, a binary search is first done on the root node to find the pointer covering the key, then the search recurses into the node that the pointer references, until a leaf node is reached. A binary search on the leaf node then finds the data item corresponding to the key.

The most commonly used index type in MySQL database is BTREEindex, and the underlying layer is implemented based on the B+ tree data structure.

mysql> show index from blog\G;
*************************** 1. row ***************************
        Table: blog
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: blog_id
    Collation: A
  Cardinality: 4
     Sub_part: NULL
       Packed: NULL
         Null:
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL

Hash index

A hash index is implemented on top of a hash table. For each row, the storage engine computes a hash code over the index columns; the hash algorithm should make the hash codes of different column values differ as much as possible. The hash code is used as the key of the hash table, and a pointer to the data row is used as the value. Lookup time complexity is therefore O(1), so hash indexes are generally used for exact-match lookups.
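
As a sketch, the MEMORY engine supports declaring a hash index explicitly (the table and columns here are made up for illustration):

create table user_session (
    session_id varchar(64) not null,
    user_id    int,
    key idx_session (session_id) using hash  -- hash index: O(1) exact match
) engine = memory;

select * from user_session where session_id = 'abc';  -- can use the hash index
select * from user_session where session_id > 'abc';  -- cannot: no range support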

What is the difference between Hash index and B+ tree index?

  • Hash indexes do not support sorting because hash tables are unordered.
  • Hash indexes do not support range lookups .
  • Hash indexes do not support fuzzy queries and leftmost prefix matching of multi-column indexes.
  • Because there will be hash conflicts in the hash table , the performance of the hash index is unstable, while the performance of the B+ tree index is relatively stable, and each query is from the root node to the leaf node.

Why is B+ tree more suitable for implementing database index than B tree?

  • Since B+ tree data is stored in the leaf nodes and the branch nodes hold only index entries, scanning the database is convenient: the leaf nodes only need to be scanned once. A B tree also stores data in its branch nodes, so retrieving a range of data requires an in-order traversal. B+ trees are therefore better suited to range queries, and since range queries are very frequent in databases, B+ trees are usually used for database indexes.

  • B+ tree internal nodes store only the index key values; the actual data lives in the leaf nodes. This allows each index page to hold more entries, reducing I/O overhead.

  • The query efficiency of the B+ tree is more stable, and any keyword search must take a path from the root node to the leaf node. All keyword queries have the same path length, resulting in the same query efficiency for each data.

What are the categories of indexes?

1. Primary key index: the unique index on the primary key, named PRIMARY, which does not allow null values.

2. Unique index : The value in the index column must be unique, but null values ​​are allowed. The difference between a unique index and a primary key index is that the unique index field can be null and there can be multiple null values, while the primary key index field cannot be null. The purpose of the unique index: to uniquely identify each record in the database table, mainly to prevent repeated insertion of data. The SQL statement to create a unique index is as follows:

ALTER TABLE table_name
ADD CONSTRAINT constraint_name UNIQUE KEY(column_1,column_2,...);

3. Composite index : The index created on the combination of multiple fields in the table will only be used when the left field of these fields is used in the query condition. When using the composite index, the principle of the leftmost prefix must be followed.

4. Full-text index: a full-text index can only be created on CHAR, VARCHAR and TEXT type fields.

5. Ordinary index : Ordinary index is the most basic index, it has no restrictions, and the value can be empty.
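
A minimal sketch showing how each category can be declared in one table (the table and columns are assumed; InnoDB supports FULLTEXT since MySQL 5.6):

create table article (
    id        int not null auto_increment,
    uuid      char(36) not null,
    author_id int,
    category  int,
    title     varchar(128),
    body      text,
    primary key (id),                          -- primary key index
    unique key uk_uuid (uuid),                 -- unique index
    key idx_author_cat (author_id, category),  -- composite index
    fulltext key ft_body (body),               -- full-text index
    key idx_title (title)                      -- ordinary index
) engine = innodb;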

What is the leftmost matching principle?

If a SQL statement uses the leftmost field(s) of a composite index, it can use that index for matching. When a range query (>, <, between, like) is encountered, matching stops, and the fields after it will not use the index.

For an (a,b,c) index, query conditions on a, a+b, or a+b+c will use the index; conditions on only b and c will not.

For an (a,b,c,d) index with the query condition a = 1 and b = 2 and c > 3 and d = 4, the three fields a, b and c can use the index, but d cannot, because a range query was encountered at c.
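
A sketch of verifying this with explain (the table t and its index are assumed):

create table t (
    id int primary key auto_increment,
    a int, b int, c int, d int,
    key idx_abcd (a, b, c, d)
);

-- key_len in the explain output shows only a, b and c are used for the index seek;
-- d is filtered afterwards because of the range condition on c
explain select * from t where a = 1 and b = 2 and c > 3 and d = 4;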

As shown in the figure below, with an index on (a, b), a is globally ordered in the index tree, while b is globally unordered but locally ordered (when a is equal, entries are sorted by b). Directly executing a query condition such as b = 2 therefore cannot use the index.

(Figure: leftmost prefix)

When the value of a is fixed, b is ordered. For example, when a = 1, the b values 1, 2 are in order; when a = 2, the b values 1, 4 are also in order. When executing a = 1 and b = 2, both the a and b fields can use the index. When executing a > 1 and b = 2, only the a field can use the index: a takes a range of values, and within that range the values of b are not ordered, so the b field cannot use the index.

What is a clustered index?

InnoDB uses the primary key of the table to construct the primary key index tree, and the record data of the entire table is stored in the leaf nodes. The storage of the leaf nodes of the clustered index is logically continuous, connected by a doubly linked list, and the leaf nodes are sorted according to the order of the primary key, so the sorting search and range search for the primary key are faster.

The leaf nodes of the clustered index are the row records of the entire table. InnoDB primary keys use clustered indexes. Clustered indexes are much more efficient than non-clustered index queries.

For InnoDB, the clustered index is normally the primary key index. If a table does not explicitly specify a primary key, InnoDB selects the first unique index that does not allow NULL. If there is no primary key and no suitable unique index, InnoDB internally generates a hidden primary key as the clustered index; the hidden primary key is 6 bytes long and its value auto-increments as data is inserted.

What is a covering index?

A covering index means the selected data columns can be obtained from the index alone, with no need to go back to the table for a second lookup; in other words, the queried columns are covered by the index being used. For an InnoDB secondary index, if the index covers the queried columns, the second lookup on the primary key index is avoided.

Not every type of index can be a covering index. A covering index must store the values of the index columns, but hash and full-text indexes do not store them, so MySQL uses B+ tree indexes as covering indexes.

For a query that uses a covering index, prefix the query with explain and the Extra column of the output will show using index.

For example, in the user_like table (user likes), the composite index is (user_id, blog_id), and user_id and blog_id are both not null.
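
For reference, a minimal sketch of what such a table might look like (the schema is assumed):

create table user_like (
    id      int not null auto_increment,
    user_id int not null,
    blog_id int not null,
    primary key (id),
    key idx_user_blog (user_id, blog_id)  -- composite secondary index
) engine = innodb;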

explain select blog_id from user_like where user_id = 13;

The Extra column of the explain result is Using index: the queried column is covered by the index, the where filter follows the leftmost-prefix principle, and the qualifying data can be found directly by an index seek, without going back to the table.

explain select user_id from user_like where blog_id = 1;

The Extra column of the explain result is Using where; Using index: the queried column is covered by the index, but the where filter does not follow the leftmost prefix, so the qualifying data cannot be found by an index seek; it is found by scanning the index, still without going back to the table.

Index design principles?

  • For fields that are often used as query conditions, indexes should be built to improve query speed
  • Index fields that frequently require sorting, grouping, and union operations
  • The higher the degree of discrimination of the index column , the better the effect of the index. For example, using a column with a low degree of discrimination such as gender as an index will have a poor effect.
  • Avoid indexing "large fields". Try to index fields with small values: MySQL keeps the field values in the index, so large values make the index occupy more space and make comparisons during sorting take longer.
  • Try to use short indexes . When indexing longer strings, you should specify a shorter prefix length, because smaller indexes involve less disk I/O and the query speed is faster.
  • The more indexes the better, each index requires additional physical space, and maintenance also takes time.
  • Do not create indexes for fields that are frequently added, deleted, or modified. Assuming that a field is frequently modified, it means that the index needs to be rebuilt frequently, which will inevitably affect the performance of MySQL
  • Use the leftmost prefix principle .

When will the index become invalid?

Circumstances that lead to index failure:

  • For composite indexes, if the leftmost field of the composite index is not used, the index will not be used
  • For like queries starting with %, such as like '%abc', the index cannot be used; like queries not starting with %, such as like 'abc%', are equivalent to range queries and will use the index
  • The column type in the query condition is a string, without quotation marks, and implicit conversion may occur due to different types, making the index invalid
  • When judging whether the index column is not equal to a certain value
  • Operate on indexed columns
  • Query conditions joined with or can also cause the index to fail
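
A few sketches of these failure cases, assuming a user table with an index on the name column (a string type):

select * from user where name like '%abc';     -- leading %: index cannot be used
select * from user where name like 'abc%';     -- no leading %: range query, index used
select * from user where name = 123;           -- missing quotes: implicit conversion, index fails
select * from user where upper(name) = 'ABC';  -- operating on the index column: index fails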

What is a prefix index?

Sometimes it is necessary to create an index on a very long character column, which can cause the index to be very large and slow. Using a prefix index avoids this problem.

A prefix index refers to indexing the first few characters of a text or string, so that the length of the index is shorter and the query speed is faster.

The key to creating a prefix index is to choose a long enough prefix to ensure high index selectivity . The higher the selectivity of the index, the higher the query efficiency, because the index with high selectivity allows MySQL to filter out more data rows when searching.

How to create a prefix index:

-- create a prefix index on the email column
ALTER TABLE table_name ADD KEY(column_name(prefix_length));
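
One common way to pick prefix_length is to compare the selectivity of candidate prefixes against that of the full column; a sketch, assuming a user table with an email column:

-- selectivity of the full column: the upper bound
select count(distinct email) / count(*) from user;

-- selectivity of candidate prefixes; pick the shortest length close to the upper bound
select count(distinct left(email, 5))  / count(*) as sel5,
       count(distinct left(email, 10)) / count(*) as sel10
from user;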

Index pushdown

Refer to my other article: Graphical Index Push Down!

What are the common storage engines?

The four commonly used storage engines in MySQL are: MyISAM , InnoDB , MEMORY , ARCHIVE . The default storage engine after MySQL 5.5 is InnoDB.

InnoDB storage engine

InnoDB is MySQL's default transactional storage engine , the most widely used, based on clustered indexes. InnoDB has done a lot of optimizations internally, such as being able to automatically create adaptive hash indexes in memory to speed up read operations.

Advantages : Supports transactions and crash recovery capabilities; introduces row-level locks and foreign key constraints.

Disadvantages : The data space occupied is relatively large.

Applicable scenarios : Transaction support is required, and there is a high frequency of concurrent reads and writes.

MyISAM storage engine

Data is stored in a compact format. For read-only data, or for tables that are relatively small and can tolerate repair operations, the MyISAM engine can be used. MyISAM stores a table in two files: the data file .MYD and the index file .MYI.

Pros : Fast access.

Disadvantages : MyISAM does not support transactions and row-level locks, does not support safe recovery after a crash, and does not support foreign keys.

Applicable scenarios : There is no requirement for transaction integrity; the data in the table will be read-only.

MEMORY storage engine

The MEMORY engine puts all the data in the memory, and the access speed is faster, but once the system crashes, the data will be lost.

The MEMORY engine uses a hash index by default, and stores the hash value of the key and the pointer to the data row in the hash index.

Pros : faster access.

Disadvantages :

  1. Hash index data is not stored in index-value order and cannot be used for sorting.
  2. Partial index match lookups are not supported because hash indexes use the entire contents of the indexed columns to compute the hash value.
  3. Only equality comparison is supported, range query is not supported.
  4. When a hash conflict occurs, the storage engine needs to traverse all the row pointers in the linked list and compare row by row until a row that meets the conditions is found.

ARCHIVE storage engine

The ARCHIVE storage engine is ideal for storing large amounts of independent, historical data. ARCHIVE provides a compression function and has an efficient insertion speed, but this engine does not support indexes, so the query performance is poor.

What is the difference between MyISAM and InnoDB?

  1. Difference in storage structure. Each MyISAM table is stored as three files on disk. The file names begin with the table name, and the extension indicates the file type: .frm files store the table definition, data files have the extension .MYD (MYData), and index files have the extension .MYI (MYIndex). All InnoDB tables are stored in the same data file (or multiple files, or independent tablespace files), and the size of an InnoDB table is limited only by the operating system's file size, generally 2GB.
  2. Difference in storage space. MyISAM supports three different storage formats: static tables (the default; note that trailing spaces in the data are removed), dynamic tables, and compressed tables. A table that will not be modified after creation and data import can use the compressed format to greatly reduce disk usage. InnoDB needs more memory and storage; it builds a dedicated buffer pool in main memory for caching data and indexes.
  3. Portability, Backup and Recovery . MyISAM data is stored in the form of files, so it will be very convenient in cross-platform data transfer. A table can be operated independently during backup and recovery. For InnoDB, the feasible solution is to copy data files, backup binlog, or use mysqldump, which is relatively troublesome when the amount of data reaches tens of gigabytes.
  4. Whether to support row-level locks . MyISAM only supports table-level locks. When a user operates a MyISAM table, the select, update, delete, and insert statements will automatically lock the table. If the locked table meets the concurrency of insert, a new one can be inserted at the end of the table. data. InnoDB supports row-level locks and table-level locks, and the default is row-level locks. Row locks greatly improve the performance of multi-user concurrent operations.
  5. Whether to support transactions and safe recovery after a crash . MyISAM does not provide transaction support. InnoDB provides transaction support, with transaction, rollback and crash recovery capabilities.
  6. Whether to support foreign keys . MyISAM does not support it, but InnoDB does.
  7. Whether to support MVCC . MyISAM does not support it, but InnoDB supports it. To deal with high concurrent transactions, MVCC is more efficient than simple locking.
  8. Whether clustered indexes are supported . MyISAM does not support clustered indexes, and InnoDB supports clustered indexes.
  9. Full-text indexing. MyISAM supports FULLTEXT full-text indexes. InnoDB did not support FULLTEXT indexes before MySQL 5.6 (newer versions do); full-text search can also be provided through the sphinx plug-in, with better results.
  10. Table primary key . MyISAM allows tables without any indexes and primary keys to exist, and the indexes are all to save the address of the row. For InnoDB, if no primary key or non-empty unique index is set, a 6-byte primary key (not visible to the user) will be automatically generated.
  11. The number of rows in the table . MyISAM saves the total number of rows in the table, if select count(*) from table; will take out the value directly. InnoDB does not save the total number of rows in the table. If you use select count(*) from table; it will traverse the entire table, which consumes a lot. However, after adding the where condition, MyISAM and InnoDB handle it in the same way.

What locks does MySQL have?

Classified by lock granularity , there are row-level locks, table-level locks, and page-level locks.

  1. Row-level locks are the most fine-grained locks in MySQL. Indicates that only the row currently being operated is locked. Row-level locks can greatly reduce conflicts in database operations. The locking granularity is the smallest, but the locking overhead is also the largest. There are three main types of row-level locks:
    • Record Lock, record lock, that is, only lock a record;
    • Gap Lock, gap lock, locks a range, but does not contain the record itself;
    • Next-Key Lock: A combination of Record Lock + Gap Lock, which locks a range and locks the record itself.
  2. Table-level lock is a kind of lock with the largest locking granularity in mysql, which means to lock the entire table currently being operated. It is simple to implement, consumes less resources, and is supported by most mysql engines. The most commonly used MyISAM and InnoDB both support table-level locking.
  3. Page-level lock is a kind of lock in MySQL whose locking granularity is between row-level lock and table-level lock. Table-level locks are fast, but have more conflicts, and row-level locks have fewer conflicts, but are slower. Therefore, a compromised page-level lock is adopted to lock a group of adjacent records at a time.

Classified by lock level , there are shared locks, exclusive locks, and intent locks.

  1. A shared lock, also known as a read lock, is a lock created by a read operation. Other users can read the data concurrently, but no transaction can modify the data (obtain an exclusive lock on the data) until all shared locks have been released.
  2. Exclusive locks are also called write locks and exclusive locks. If transaction T adds an exclusive lock to data A, other transactions cannot add any type of lock to A. Transactions that are granted exclusive locks can both read and modify data.
  3. Intention locks are table-level locks designed mainly to indicate what type of row lock will be requested next in a transaction. The two intention locks in InnoDB:

Intent shared lock (IS): Indicates that the transaction is going to add a shared lock to the data row, that is to say, the IS lock of the table must be obtained before adding a shared lock to a data row;

Intent exclusive lock (IX): Similar to the above, it means that the transaction is going to add an exclusive lock to the data row, indicating that the transaction must first obtain the IX lock of the table before adding an exclusive lock to a data row.

Intention locks are automatically added by InnoDB without user intervention.

For INSERT, UPDATE, and DELETE, InnoDB will automatically add exclusive locks to the data involved; for general SELECT statements, InnoDB will not add any locks, and transactions can explicitly add shared or exclusive locks through the following statements.

Shared lock: SELECT … LOCK IN SHARE MODE;

Exclusive lock: SELECT … FOR UPDATE;

MVCC implementation principle?

MVCC (Multiversion Concurrency Control) achieves concurrency control by keeping multiple versions of the same data. When querying, the read view is used to find the matching version of the data in the version chain.

Function: Improve concurrency performance. For high concurrency scenarios, MVCC has less overhead than row-level locks.

The principle of MVCC implementation is as follows:

The implementation of MVCC depends on the version chain, which is realized through three hidden fields of the table.

  • DB_TRX_ID: The current transaction id, the time sequence of the transaction is judged by the size of the transaction id.
  • DB_ROLL_PTR: rollback pointer, pointing to the previous version of the current row record; these pointers link the versions of the data together into an undo log version chain.
  • DB_ROW_ID: Primary key, if the data table does not have a primary key, InnoDB will automatically generate a primary key.

Each table record therefore carries these three hidden fields.

When using transactions to update row records, a version chain will be generated, and the execution process is as follows:

  1. Lock the row with an exclusive lock;
  2. Copy the original value of the row to the undo log as the old version, for rollback;
  3. Modify the value of the current row, generate a new version, update the transaction id, and make the rollback pointer point to the record of the old version, thus forming a version chain.

Let's give an example to facilitate your understanding.

1. The initial data is as follows, where DB_ROW_ID and DB_ROLL_PTR are empty.

2. Transaction A modifies the row and changes age to 12. The effect is as follows:

3. Transaction B then also modifies the row and changes age to 8. The effect is as follows:

4. At this time, there are two lines of records in the undo log, and they are linked together by the rollback pointer.

Next, understand the concept of read view.

A read view can be understood as a "photograph" of the state of the data at a given moment. When data is read at time t, it is taken from the photo shot at time t.

A read view internally maintains an active transaction list: the transactions that were still active when the read view was generated. The list contains transactions that had not yet committed when the read view was created, and does not contain transactions that had already committed before its creation.

Different isolation levels create the read view at different times.

  • read committed: Every time select is executed, a new read_view is created to ensure that the modifications committed by other transactions can be read.

  • Repeatable read: within a transaction, the read_view is created at the first select and is not updated afterwards; all subsequent selects reuse it. This guarantees that the same content is read every time within the transaction, i.e. repeatable reads.

Record filtering method of read view

Premise: DATA_TRX_ID is the latest transaction id on each data row; up_limit_id is the id of the earliest transaction in the current snapshot; low_limit_id is the id of the latest transaction in the current snapshot.

  • If DATA_TRX_ID < up_limit_id: the transaction that modified this row version had already committed when the read view was created, so this version of the record is visible to the current transaction.
  • If DATA_TRX_ID >= low_limit_id: this row version was generated by a transaction that started after the read view was created, so it is not visible to the current transaction. Follow the version chain to the previous version and re-evaluate its visibility.
  • If up_limit_id <= DATA_TRX_ID < low_limit_id:
    1. Check whether the active transaction list contains a transaction whose id equals DATA_TRX_ID.
    2. If it does, the record is not visible, because transactions in the active list are uncommitted. Follow the version chain to the previous version and re-evaluate its visibility.
    3. If it does not, the transaction DATA_TRX_ID has committed, and this row version is visible.

Summary: InnoDB implements MVCC through the read view and the version chain. The version chain keeps the historical versions of a record. Through the read view, InnoDB judges whether the current version of the data is visible; if not, it follows the version chain to the previous version and checks again, until a visible version is found.

Snapshot read and current read

Table records can be read in two ways.

  • Snapshot read: reads a snapshot version. The ordinary SELECT is a snapshot read; concurrency control is done through mvcc, without locking.

  • Current read: reads the latest version. UPDATE, DELETE, INSERT, SELECT … LOCK IN SHARE MODE and SELECT … FOR UPDATE are current reads.

Under snapshot reads, InnoDB avoids phantom reads through the mvcc mechanism. However, mvcc cannot avoid the phantom reads that occur under current reads: a current read always reads the latest data, so if another transaction inserts rows between two queries, a phantom read occurs.

Here is an example to illustrate:

1. First, the user table has only two records, as follows:

2. Transaction a and transaction b start a transaction at the same time with start transaction;

3. Transaction a inserts a row and commits;

insert into user(user_name, user_password, user_mail, user_state) values('tyson', 'a', 'a', 0);

4. Transaction b executes the update of the whole table;

update user set user_name = 'a';

5. Transaction b then executes the query and sees the row inserted by transaction a. (In the figure below, transaction b is on the left and transaction a on the right. Before the transactions start there are only two records; after transaction a inserts one row, transaction b's query returns three rows.)

(Figure: phantom read example)

The above is the phantom reading phenomenon that occurs in the current reading.

So how does MySQL avoid phantom reading?

  • Under snapshot reads, MySQL uses mvcc to avoid phantom reads.
  • Under current reads, MySQL uses next-key locks to avoid phantom reads (implemented with row locks plus gap locks).

A next-key lock consists of two parts: a row lock and a gap lock. Row locks are placed on index records, and gap locks are placed in the gaps between index records.

The Serializable isolation level can also avoid phantom reads, but it locks the entire table and concurrency is extremely low, so it is rarely used.
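
A sketch of how a current read blocks phantoms under repeatable read (the user table and ids are assumed):

-- session A, repeatable read: the current read takes next-key locks on the range
begin;
select * from user where id > 5 for update;

-- session B: blocked until session A commits, so no phantom row can appear
insert into user(id, user_name) values (6, 'tyson');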

shared and exclusive locks

There are two main types of locking reads for SELECT: shared locks and exclusive locks.

select * from table where id<6 lock in share mode; -- shared lock
select * from table where id<6 for update; -- exclusive lock

The main difference between the two is that LOCK IN SHARE MODE easily causes deadlock when multiple transactions update the same rows at the same time.

The premise for acquiring an exclusive lock is that no other thread holds an exclusive or shared lock on any row of the result set; otherwise the request blocks. During the transaction, MySQL adds an exclusive lock to every row of the query result set, and other threads are blocked from changing or deleting this data (they can only read it) until the statement's transaction ends with commit or rollback.

Notes on using SELECT … FOR UPDATE:

  1. for update only applies to InnoDB and only takes effect inside a transaction.
  2. When querying by primary key, a like or not-equal condition on the primary key field produces a table lock.
  3. Queries on non-indexed fields produce a table lock.

bin log/redo log/undo log

MySQL logs mainly include query logs, slow query logs, transaction logs, error logs, and binary logs. The most important are the bin log (binary log), the redo log, and the undo log (rollback log).

bin log

The bin log is a MySQL server-level log file. It records all operations that modify the MySQL database (select and show statements are not recorded) and is mainly used to restore and replicate databases.

redo log

The redo log is an InnoDB engine-level log; it records the transaction log of the InnoDB storage engine. It is written whether or not the transaction has committed, so it can be used for data recovery: when the database fails, the InnoDB storage engine uses the redo log to recover to the moment before the failure, ensuring data integrity. If the parameter innodb_flush_log_at_trx_commit is set to 1, the redo log is flushed to disk synchronously at commit.

undo log

Besides the redo log, an undo log is also written when data is modified. The undo log is used to undo changes: it keeps the content of the record before modification. With the undo log, transactions can be rolled back, and MVCC can trace a record back to a specific version.

What is the difference between bin log and redo log?

  1. The bin log records everything, including the logs of storage engines such as InnoDB and MyISAM; the redo log records only InnoDB's own transaction log.
  2. The bin log is written to disk only once, just before the transaction commits; the redo log is written to disk continuously while the transaction is in progress.
  3. The bin log is a logical log that records the original logic of the SQL statements; the redo log is a physical log that records what changes were made on which data page.

Tell me about the MySQL architecture?

MySQL is mainly divided into the Server layer and the storage engine layer:

  • Server layer: mainly includes the connector, query cache, analyzer, optimizer, executor, etc. All cross-storage-engine functionality is implemented in this layer, such as stored procedures, triggers, views and functions, as well as a general log module, the binlog module.
  • Storage engine : Mainly responsible for data storage and reading. The server layer communicates with the storage engine through the API.

Server Layer Basic Components

  • Connector: When the client connects to MySQL, the server layer will perform identity authentication and permission verification on it.
  • Query cache: When executing a query statement, it will first query the cache and check whether the sql has been executed. If the sql is cached, it will be returned to the client directly. If there is no hit, subsequent operations will be performed.
  • Analyzer: If there is no cache hit, the SQL statement will pass through the analyzer, which is mainly divided into two steps, lexical analysis and syntax analysis, first look at what the SQL statement will do, and then check whether the syntax of the SQL statement is correct.
  • Optimizer: The optimizer optimizes the query, including rewriting the query, determining the read and write order of the table, and selecting the appropriate index, etc., to generate an execution plan.
  • Executor: First, it will check whether the user has permission before execution. If there is no permission, an error message will be returned. If there is permission, it will call the interface of the engine according to the execution plan and return the result.

Splitting databases and tables (sharding)

When the amount of data in a single table reaches 10 million rows or 100GB, optimizing indexes and adding replicas may no longer noticeably improve performance, and it is time to consider splitting the data. The purpose of splitting is to reduce the burden on the database and shorten query time.

Data segmentation can be divided into two ways: vertical division and horizontal division.

vertical division

Vertical database splitting is done by business. For example, in a shopping scenario, the commodity, order, and user tables can be split into separate databases, improving performance by reducing the size of each database. Vertical table splitting likewise splits a large table into sub-tables by function, such as product basic information and product description: the basic information is shown in the product list while the description is shown on the product detail page, so they can be split into two tables.

(Figure: vertical division)

Advantages : The row records become smaller, the data page can store more records, and the number of I/Os is reduced during query.

Disadvantages :

  • The primary key is redundant, and redundant columns need to be managed;
  • It causes JOIN operations across tables; pressure on the database can be reduced by joining on the application server instead;
  • There is still the problem that the amount of data in a single table is too large.

horizontal division

Horizontal division splits data according to certain rules, such as time or ranges of id values. For example, different databases can be split by year. Each database has the same structure, and splitting the data improves performance.

(Figure: horizontal division)

Advantages : The amount of data in a single database (table) can be reduced, improving performance; the split table structure is the same, and the program changes less.

Disadvantages :

  • Shard transaction consistency is difficult to solve
  • Poor cross-node joinperformance and complex logic
  • Data fragmentation needs to be migrated during expansion

What is a partition table?

Partitioning divides a table's data into N blocks. A partitioned table is an independent logical table, but its underlying storage consists of multiple physical sub-tables.

When the data of the query condition is distributed in a certain partition, the query engine will only query a certain partition instead of traversing the entire table. At the management level, if you need to delete the data of a certain partition, you only need to delete the corresponding partition.

Partitions generally live on a single machine, and time-range partitions are the most common, since they are convenient for archiving. The difference is that splitting databases and tables requires code in the application, while partitioning is implemented inside MySQL. The two do not conflict and can be combined.

partition table type

Range partition: partition by range, for example by time range:

CREATE TABLE test_range_partition(
       id INT auto_increment,
       createdate DATETIME,
       primary key (id,createdate)
   ) 
   PARTITION BY RANGE (TO_DAYS(createdate) ) (
      PARTITION p201801 VALUES LESS THAN ( TO_DAYS('20180201') ),
      PARTITION p201802 VALUES LESS THAN ( TO_DAYS('20180301') ),
      PARTITION p201803 VALUES LESS THAN ( TO_DAYS('20180401') ),
      PARTITION p201804 VALUES LESS THAN ( TO_DAYS('20180501') ),
      PARTITION p201805 VALUES LESS THAN ( TO_DAYS('20180601') ),
      PARTITION p201806 VALUES LESS THAN ( TO_DAYS('20180701') ),
      PARTITION p201807 VALUES LESS THAN ( TO_DAYS('20180801') ),
      PARTITION p201808 VALUES LESS THAN ( TO_DAYS('20180901') ),
      PARTITION p201809 VALUES LESS THAN ( TO_DAYS('20181001') ),
      PARTITION p201810 VALUES LESS THAN ( TO_DAYS('20181101') ),
      PARTITION p201811 VALUES LESS THAN ( TO_DAYS('20181201') ),
      PARTITION p201812 VALUES LESS THAN ( TO_DAYS('20190101') )
   );

The corresponding data files can be found under /var/lib/mysql/data/; each partition has its own data file, named with # separating the table name from the partition name:

   -rw-r----- 1 MySQL MySQL    65 Mar 14 21:47 db.opt
   -rw-r----- 1 MySQL MySQL  8598 Mar 14 21:50 test_range_partition.frm
   -rw-r----- 1 MySQL MySQL 98304 Mar 14 21:50 test_range_partition#P#p201801.ibd
   -rw-r----- 1 MySQL MySQL 98304 Mar 14 21:50 test_range_partition#P#p201802.ibd
   -rw-r----- 1 MySQL MySQL 98304 Mar 14 21:50 test_range_partition#P#p201803.ibd
...

list partition

The list partition is similar to the range partition; the main difference is that list partitions on a set of enumerated values, while range partitions on continuous intervals. For list partitioning, the partition field values must be known in advance; inserting a value that is not in the enumerated list fails.

create table test_list_partition
   (
       id int auto_increment,
       data_type tinyint,
       primary key(id,data_type)
   )partition by list(data_type)
   (
       partition p0 values in (0,1,2,3,4,5,6),
       partition p1 values in (7,8,9,10,11,12),
       partition p2 values in (13,14,15,16,17)
   );

hash partition

Data can be evenly distributed into pre-defined partitions.

create table test_hash_partition
   (
       id int auto_increment,
       create_date datetime,
       primary key(id,create_date)
   )partition by hash(year(create_date)) partitions 10;

Problems with partitioning?

  1. Opening and locking all underlying tables can be costly. When a query accesses a partitioned table, MySQL needs to open and lock all the underlying tables. This happens before partition pruning, so pruning cannot reduce this overhead, and it affects query speed. The overhead can be reduced with batch operations, such as bulk inserts, LOAD DATA INFILE, and deleting multiple rows at a time.
  2. Maintaining partitions can be expensive. For example, to reorganize a partition, a temporary partition will be created first, then the data will be copied to it, and finally the original partition will be deleted.
  3. All partitions must use the same storage engine.

Query statement execution flow?

The execution process of the query statement is as follows: permission verification, query cache, analyzer, optimizer, permission verification, executor, and engine.

For example, the query statement is as follows:

select * from user where id > 1 and name = '大彬';
  1. Check the permission first, and return an error if there is no permission;
  2. Before MySQL 8.0, the query cache is checked; if it hits, the cached result is returned directly, otherwise execution moves to the next step;
  3. Lexical analysis and syntax analysis. Extract table names, query conditions, and check for syntax errors;
  4. There are two execution plans: check id > 1 first, or check name = '大彬' first. The optimizer chooses the plan with the best execution efficiency according to its own optimization algorithm;
  5. Check the authority, call the database engine interface if you have the authority, and return the execution result of the engine.

Update statement execution process?

The update statement execution flow is as follows: analyzer, permission check, executor, engine, redo log (prepare state), binlog, redo log (commit state).

For example, the update statement is as follows:

update user set name = '大彬' where id = 1;
  1. The record with id 1 is queried first, and the cache will be used if there is a cache.
  2. Get the query result, update the name to 大彬, then call the engine interface to write the updated row. The InnoDB engine keeps the data in memory and records the redo log at the same time; the redo log then enters the prepare state, and the executor is notified.
  3. After receiving the notification, the executor records the binlog, then calls the engine interface to mark the redo log as commit state.
  4. update completed.

Why does the redo log first enter the prepare state instead of being committed directly after it is written?

Suppose the redo log were committed directly, and the binlog written afterwards. If the machine crashes after the redo log is written but before the binlog is written, then after restart the engine recovers the data from the redo log, yet the binlog has no record of it. When the binlog is later used for backups or master-slave synchronization, this piece of data will be lost.

The difference between exist and in?

exists is used to filter the rows of the outer table. It traverses the outer table and substitutes each row into the inner query for judgment. When the inner statement of exists returns at least one row, the condition is true and the current outer row is kept; conversely, when it returns no rows, the condition is false and the current outer row is discarded.

select a.* from A a where exists(select 1 from B b where a.id = b.id)

in first evaluates the subquery and puts its result into a temporary table, then traverses the temporary table, substituting each of its rows into the outer query.

select * from A where id in (select id from B)

When the subquery table is relatively large, using exists can effectively reduce the total number of loops and improve speed; when the outer query table is relatively large, using in can effectively reduce traversals of the outer table and improve speed.

What is the difference between int(10) and char(10) in MySQL?

The 10 in int(10) is only the display width; it does not affect the value range or the 4-byte storage size. The 10 in char(10) is the number of characters the column stores.
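
A minimal sketch that makes the difference visible, using a hypothetical width_demo table (note that integer display width is deprecated in newer MySQL versions):

create table width_demo (
    a int(10) zerofill,  -- still a normal 4-byte INT; only padded to 10 digits for display
    b char(10)           -- stores up to 10 characters
);
insert into width_demo values (42, 'abc');
select a, b from width_demo;  -- a is displayed as 0000000042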

What is the difference between truncate, delete and drop?

Same point:

  1. truncate, delete without a where clause, and drop all remove the data in the table.

  2. drop and truncate are DDL statements (data definition language); they are committed automatically after execution.

difference:

  1. truncate and delete only remove the data, not the table structure; drop removes the table structure together with the constraints, triggers, and indexes that depend on it;
  2. In general, execution speed: drop > truncate > delete.

What is the difference between having and where?

  • They act on different objects. The where clause acts on tables and views, while having acts on groups.
  • where filters before the data is grouped; having filters after the data is grouped.

Why do master-slave synchronization?

  1. Read-write separation enables the database to support greater concurrency.
  2. Generate real-time data on the master server, and analyze this data on the slave server, thereby improving the performance of the master server.
  3. Data backup to ensure data security.

What is MySQL master-slave synchronization?

Master-slave synchronization enables data to be replicated from one database server to other servers: one server acts as the master (master) and the rest act as slaves (slave) when replicating data.

Because the replication is asynchronous, the slave server does not need to be connected to the master server all the time, and the slave server can even connect to the master server intermittently through dial-up. Through the configuration file, you can specify to replicate all databases, a certain database, or even a certain table on a certain database.
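
A minimal slave-side setup sketch, with hypothetical host, credentials, and binlog coordinates (CHANGE MASTER TO is the classic, pre-MySQL-8.0.23 syntax):

CHANGE MASTER TO
    MASTER_HOST = '192.168.0.10',       -- hypothetical master address
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '***',
    MASTER_LOG_FILE = 'binlog.000001',  -- hypothetical binlog coordinates
    MASTER_LOG_POS = 4;
START SLAVE;
-- SHOW SLAVE STATUS\G then reports whether the IO and SQL threads are running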

What are optimistic locks and pessimistic locks?

Concurrency control in databases ensures that transaction isolation and database consistency are not broken when multiple transactions access the same data at the same time. Optimistic locking and pessimistic locking are the main techniques used for concurrency control.

  • Pessimistic locking: assumes concurrency conflicts will occur, so it locks the data being operated on and does not release the lock until the transaction commits; only then can other transactions modify the data. Implementation: the database's own lock mechanism.
  • Optimistic locking: assumes no concurrency conflicts will occur, and only checks whether the data has been modified when the operation is committed. Add a version field to the table; before submitting the modification, check whether version still equals the value read earlier. If it does, the data has not been modified and can be updated; otherwise the data is stale and cannot be updated. Implementation: usually a version-number mechanism or a CAS algorithm (see the sketch after this list).
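
A minimal optimistic-locking sketch in SQL, assuming a hypothetical account table that carries a version column:

-- 1. read the row together with its current version
select balance, version from account where id = 1;  -- suppose this returns version = 5
-- 2. update only if the version is unchanged, bumping it in the same statement
update account
set balance = balance - 100, version = version + 1
where id = 1 and version = 5;
-- if the affected row count is 0, another transaction modified the row first: retry or abort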

Have you used processlist?

show processlist or show full processlist can be used to check whether the current MySQL instance is under pressure, which SQL is running, and whether any SQL is running slowly. The returned columns are as follows:

  1. id : thread ID; can be used with kill id to kill the thread
  2. db : database name
  3. user : database user
  4. host : the IP of the database instance
  5. command : the currently executed command, such as Sleep, Query, Connect etc.
  6. time : consumption time, in seconds
  7. state : Execution state, mainly has the following states:
    • Sleep, the thread is waiting for the client to send a new request
    • Locked, the thread is waiting for the lock
    • Sending data, the SELECT query is processing records and sending results to the client at the same time
    • Kill, a kill statement is executing, killing the specified thread
    • Connect, a slave node connects to the master node
    • Quit, the thread is exiting
    • Sorting for group, sorting for GROUP BY
    • Sorting for order, sorting for ORDER BY
  8. info : the SQL statement being executed
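
An illustrative session (the thread id 1234 is hypothetical):

show full processlist;
-- suppose a thread with Id = 1234 has been stuck in "Sending data" for a long time
kill 1234;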

Is MySQL query limit 1000,10 as fast as limit 10?

These correspond to two query forms: limit offset, size and limit size.

In fact, limit size is equivalent to limit 0, size, that is, take size rows starting from offset 0.

In other words, the difference between the two methods lies in whether the offset is 0.

Let's first look at the internal execution logic of limit sql.

MySQL is internally divided into a server layer and a storage engine layer . In general, the storage engine uses innodb.

There are many modules in the server layer, among which the executor is a component used to deal with the storage engine.

The executor can retrieve a row of data by calling the interface provided by the storage engine. When the data fully meets the requirements (such as meeting other where conditions), it will be placed in the result set and finally returned to the client calling mysql .

Take the limit execution process of the primary key index as an example:

Execute select * from xxx order by id limit 0, 10;. Since select is followed by an asterisk, all field information of each row is required.

The server layer will call the innodb interface, get the 0th to 10th complete row data in the primary key index in innodb , return them to the server layer in turn, put them in the result set of the server layer, and return them to the client.

Make the offset bigger, for example: select * from xxx order by id limit 500000, 10;

The server layer calls the InnoDB interface. Since offset = 500000 this time, InnoDB fetches the 0th to the (500000 + 10)th complete rows from the primary key index and returns them to the server layer one by one; the server layer discards rows according to the offset, keeps only the final size rows (10 rows), puts them in its result set, and returns them to the client.

It can be seen that when the offset is not 0, the server layer will obtain a lot of useless data from the engine layer , and obtaining these useless data is time-consuming.

Therefore, limit 1000,10 will be slower than limit 10 in mysql query. The reason is that limit 1000,10 will take out 1000+10 pieces of data and discard the first 1000 pieces, which takes more time.

How much data can a B+ tree with a height of 3 store?

The InnoDB storage engine has its own smallest storage unit - Page.

The command to query InnoDB page size is as follows:

mysql> show global status like 'innodb_page_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_page_size | 16384 |
+------------------+-------+

It can be seen that the default InnoDB page size is 16384 B = 16384 / 1024 = 16 KB.

In MySQL, it is most appropriate to set the size of a node of the B+ tree to a page or a multiple of a page. Because if the size of a node is < 1 page, then when reading this node, it actually reads one page, which causes a waste of resources.

Non-leaf nodes in the B+ tree store keys + pointers; leaf nodes store data rows.

For leaf nodes, if the size of a row of data is 1k, then 16 pieces of data can be stored in one page.

For non-leaf nodes, if the key uses bigint, it is 8 bytes, and the pointer is 6 bytes in MySQL, a total of 14 bytes, then 16k can store 16 * 1024 / 14 = 1170 index pointers.

So it can be calculated: for a B+ tree of height 2, the root node stores index pointers to 1170 leaf nodes, and each leaf node stores 16 rows, for a total of 1170 x 16 = 18720 rows. For a B+ tree of height 3, the capacity is 1170 x 1170 x 16 = 21902400 rows (more than 20 million). In other words, more than 20 million rows need only a B+ tree of height 3, and a primary-key lookup finds the row with just 3 IO operations.

Therefore, a B+ tree of height 3 in InnoDB can generally satisfy storage of tens of millions of rows.
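
The figures above can be sanity-checked with a plain SQL calculation:

select
    floor(16 * 1024 / 14)                                as pointers_per_page,  -- 1170
    floor(16 * 1024 / 14) * 16                           as rows_at_height_2,   -- 18720
    floor(16 * 1024 / 14) * floor(16 * 1024 / 14) * 16   as rows_at_height_3;   -- 21902400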

How to optimize deep paging?

Still using the SQL above as an example: select * from xxx order by id limit 500000, 10;

Method one :

From the above analysis, when the offset is very large, the server layer obtains a lot of useless data from the engine layer, and when select is followed by *, the complete row must be copied each time, which is more time-consuming than copying just one or two columns.

Since the first offset rows are ultimately thrown away, there is no need to copy their complete fields, so the SQL can be rewritten as:

select * from xxx  where id >=(select id from xxx order by id limit 500000, 1) order by id limit 10;

Execute the subquery first: select id from xxx order by id limit 500000, 1. This still fetches 500000 + 1 rows from the primary key index in InnoDB; the server layer then discards the first 500,000 rows and keeps only the id of the last one.

The difference is that, on the way back to the server layer, only the id column of each row is copied, not all the columns. When the data volume is large, the time saved by this is quite noticeable.

After getting the above id, assuming that the id is exactly equal to 500000, then the sql becomes

select * from xxx  where id >=500000 order by id limit 10;

In this way, innodb walks the primary key index again, and quickly locates the row data with id=500000 through the B+ tree. The time complexity is lg(n), and then fetches 10 pieces of data backward.

Method Two:

Sort the data by the id primary key and fetch it in batches, using the maximum id of the current batch as the filter condition for the next query.

select * from xxx where id > start_id order by id limit 10;

Through the primary key index, locate the position of start_id each time, and then traverse 10 data backwards, so that no matter how large the data is, the query performance is relatively stable.
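
An illustrative batch walk (the ids shown are hypothetical):

select * from xxx where id > 0 order by id limit 10;     -- first batch
-- suppose the last row returned has id = 1234
select * from xxx where id > 1234 order by id limit 10;  -- next batch, and so on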

How to optimize the slow query of large tables?

A table has tens of millions of data, and the query is slow. How to optimize it?

When the number of records in a single MySQL table is too large, the performance of the database will drop significantly. Some common optimization measures are as follows:

  • Build indexes properly. Create indexes on appropriate fields, such as the columns involved in WHERE and ORDER BY clauses, and use EXPLAIN to check whether an index is used or a full table scan occurs
  • Index optimization, SQL optimization. The leftmost-prefix matching principle, etc.; see: https://topjavaer.cn/database/mysql.html#%E4%BB%80%E4%B9%88%E6%98%AF%E8%A6%86%E7%9B%96%E7%B4%A2%E5%BC%95
  • Create partitions. Create horizontal partitions for key fields, such as the time field. If the query conditions are often queried through the time range, it can improve a lot of performance
  • Take advantage of caching. Use Redis to cache hot data to improve query efficiency
  • Limit the scope of the data. For example: when users query historical information, they can control it within a month
  • Read and write separation. In the classic database splitting scheme, the main library is responsible for writing, and the slave library is responsible for reading
  • Optimize by sub-database and sub-table, mainly including vertical split and horizontal split
  • Make the data heterogeneous into Elasticsearch
  • Separate hot and cold data. Put rarely used data from months ago into cold storage and keep the latest data in hot storage
  • Upgrade the database type to a MySQL-compatible database (OceanBase, TiDB)

How big is a MySQL single table for sub-database sub-table?

There are currently two mainstream theories:

  1. If the amount of data in a single MySQL table is greater than 20 million rows, the performance will drop significantly. Consider splitting databases and tables.
  2. Alibaba's "Java Development Manual" recommends splitting databases and tables when the number of rows in a single table exceeds 5 million or the single-table capacity exceeds 2GB.

In fact, this threshold has little to do with the absolute number of records and much to do with MySQL's configuration and the machine's hardware. To improve performance, MySQL loads a table's index into memory. As long as the InnoDB buffer pool can hold it entirely, queries are not a problem. But once a single table grows past a certain magnitude, the index no longer fits in memory, subsequent SQL queries generate disk IO, and performance degrades. This is also related to the specific table-structure design; in the end the problem is a memory limit.

Therefore, splitting databases and tables should be driven by actual needs and not over-designed. Do not adopt a sharding design at the start of a project; as the business grows, consider splitting databases and tables to improve performance only when further optimization is no longer possible. On this point, Alibaba's "Java Development Manual" adds: if the estimated data volume will not reach this level within three years, do not split databases and tables when creating them.

As for the size of the MySQL single table for sub-database and sub-table, it should be evaluated according to the machine resources.

Talk about the difference between count(1), count(*) and count(field name)

Well, first talk about the difference between count(1) and count(field name).

The main difference between the two is

  1. count(1) will count all the records in the table, including the records whose field is null.
  2. count (field name) will count the number of times the field appears in the table, ignoring the case that the field is null. That is, records whose fields are null are not counted.

Next, let's look at the differences between the three.

In terms of execution effect:

  • count(*) includes all columns and is equivalent to the number of rows; NULL column values are not ignored when counting
  • count(1) ignores all columns and uses 1 to represent each row; NULL column values are likewise not ignored when counting
  • count(field name) only counts that column; rows where the value is NULL (NULL here means null, not an empty string or 0) are ignored, i.e., a row whose field is NULL is not counted
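
A minimal illustration of the NULL behaviour, assuming a hypothetical table t with a nullable column c:

create table t (id int primary key, c varchar(10));
insert into t values (1, 'a'), (2, null), (3, 'b');
select count(*) from t;  -- 3: counts every row
select count(1) from t;  -- 3: also counts every row
select count(c) from t;  -- 2: the row where c is NULL is ignored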

In terms of execution efficiency:

  • If the column is the primary key, count(field name) is faster than count(1)
  • If the column is not the primary key, count(1) is faster than count(field name)
  • If the table has multiple columns and no primary key, count(1) performs better than count(*)
  • If there is a primary key, select count(primary key) has the best execution efficiency
  • If the table has only one field, select count(*) is optimal

What is the difference between DATETIME and TIMESTAMP in MySQL?

Well, both TIMESTAMP and DATETIME can be used to store time. Their main differences are as follows:

1. Representable range

  • DATETIME: 1000-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999
  • TIMESTAMP: '1970-01-01 00:00:01.000000' UTC to '2038-01-19 03:14:07.999999' UTC

TIMESTAMP supports a smaller range than DATETIME, so out-of-range values are easy to hit.

2. Space occupation

  • TIMESTAMP : 4 bytes
  • DATETIME: Before MySQL 5.6.4, it occupies 8 bytes, and in later versions, it occupies 5 bytes

3. Whether the stored time is automatically converted

By default, when inserting or updating data, a TIMESTAMP column is automatically filled/updated with the current time (CURRENT_TIMESTAMP). DATETIME does no conversion and does not look at the time zone: whatever data you give it is what it stores.

4. TIMESTAMP comparison is affected by the server's time zone, MySQL version, and SQL mode. Because TIMESTAMP stores a UTC timestamp, the time read back differs across time zones.

5. When NULL is stored, the values the two types actually store are different.

  • TIMESTAMP: The current time now() will be automatically stored.
  • DATETIME: The current time will not be stored automatically, and the NULL value will be stored directly.
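
A sketch of the automatic-fill behaviour, using a hypothetical log_demo table (assuming the classic default where explicit_defaults_for_timestamp is OFF):

create table log_demo (
    id int primary key,
    created_at timestamp default current_timestamp,
    updated_at timestamp default current_timestamp on update current_timestamp,
    note datetime
);
insert into log_demo (id, note) values (1, null);
-- created_at and updated_at are filled with the current time; note stays NULL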

Tell me why it is not recommended to use foreign keys?

A foreign key is a constraint whose existence ensures that the data relationships between tables stay complete. Foreign keys are not entirely without advantages.

The foreign key can ensure the integrity and consistency of the data, and the cascading operation is convenient. Moreover, the use of foreign keys can entrust the judgment of data integrity to the database, reducing the amount of code in the program.

Although foreign keys can ensure the integrity of data, they will bring many defects to the system.

1. Concurrency issues. In the case of using foreign keys, every time you modify the data, you need to go to another table to check the data, and you need to acquire additional locks. In high-concurrency and high-traffic transaction scenarios, using foreign keys is more likely to cause deadlocks.

2. Scalability issues. For example, when migrating from MySQL to Oracle: foreign-key behavior depends on the characteristics of the database itself, which can make migration inconvenient.

3. It is not conducive to sub-database sub-table. In the case of horizontal splitting and sub-databases, foreign keys cannot take effect. Putting the maintenance of the relationship between data into the application program will save a lot of trouble for future sub-database and sub-table.

What are the benefits of using an auto-increment primary key?

The auto-increment primary key allows the primary key index to be inserted in increasing order as much as possible, avoiding page splitting, so the index is more compact, and the query efficiency is higher.

Why can't InnoDB's self-increment value be recycled?

Mainly to improve the efficiency and parallelism of inserting data.

Assume two transactions are executing in parallel. When they apply for auto-increment ids, a lock must be taken to prevent them from obtaining the same id, so they apply in sequence.

Assuming that transaction A has applied for id=2, and transaction B has applied for id=3, then the self-increment value of table t is 4 at this time, and then continue to execute.

Transaction B commits correctly, but transaction A has a unique key violation.

If transaction A is allowed to roll back the auto-increment id, that is, change the current auto-increment value of table t back to 2, then there will be such a situation: there is already a row with id=3 in the table, and the current auto-increment id value is 2.

Next, other transactions that continue to execute will apply for id=2, and then apply for id=3. At this time, the insert statement will report an error "primary key conflict".

In order to solve this primary key conflict, there are two ways:

  • Before each application for an id, first determine whether the id already exists in the table. If it exists, skip this id. However, this method is costly. Because originally applying for an id is a quick operation, and now it is necessary to go to the primary key index tree to determine whether the id exists.
  • To expand the lock range of the auto-increment id, you must wait until a transaction is executed and submitted before the next transaction can apply for the auto-increment id. The problem with this method is that the granularity of the lock is too large, and the concurrency capability of the system is greatly reduced.

It can be seen that both methods will cause performance problems.

Therefore, InnoDB gave up the design of "allowing auto-increment id fallback", and the statement execution failure does not fall back to auto-increment id.

Where is the auto-increment primary key stored?

Different engines have different storage strategies for auto-increment:

  • The self-increment value of the MyISAM engine is stored in the data file.
  • Before MySQL 8.0, the InnoDB engine kept the auto-increment value in memory, so it was lost after a restart: the first time the table is opened after a restart, InnoDB finds the maximum value max(id) and uses max(id) + 1 as the table's auto-increment value. MySQL 8.0 records auto-increment changes in the redo log and relies on the redo log to restore them on restart.
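
As an illustration, the current auto-increment value of a table can be inspected via information_schema (the schema and table names here are hypothetical):

select AUTO_INCREMENT
from information_schema.TABLES
where TABLE_SCHEMA = 'test' and TABLE_NAME = 'user';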

Must auto-increment primary keys be consecutive?

Not necessarily, there are several situations that will cause the auto-increment primary key to be discontinuous.

1. Unique-key conflicts make the auto-increment primary key non-consecutive. When inserting into an InnoDB table with an auto-increment primary key, if the insert violates a unique constraint defined on the table, the insert fails, but the auto-increment counter has already advanced by 1. The next insert cannot reuse the value consumed by the failed insert and must use a newly generated one (see the sketch after this list).

2. Transaction rollback makes the auto-increment primary key non-consecutive. When inserting into an InnoDB table with an auto-increment primary key, if a transaction is explicitly opened and later rolled back for some reason, the auto-increment values it consumed are not rolled back; newly inserted data cannot reuse them and must apply for new values.

3. Batch insertion results in discontinuous self-increment. MySQL has a strategy for auto-incrementing ids for batch applications:

  • During the execution of the statement, apply for the auto-increment id for the first time, and allocate 1 auto-increment id
  • After 1 is used up, the second application will allocate 2 auto-increment ids
  • After 2 are used up, the third application will allocate 4 auto-increment ids
  • And so on, each application is twice as much as the previous one (the last application may not all be used)

The next transaction then applies for ids starting after the values reserved by the previous transaction; since the last batch may not be fully used, gaps appear in the auto-increment sequence.

4. If the auto-increment step size is not 1, it will also cause the auto-increment primary key to be discontinuous.
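
A minimal sketch of case 1, the unique-key conflict (table t2 is hypothetical):

create table t2 (id int auto_increment primary key, c int unique);
insert into t2 (c) values (1);  -- succeeds, id = 1
insert into t2 (c) values (1);  -- fails on unique key c, but id 2 has already been consumed
insert into t2 (c) values (2);  -- succeeds with id = 3, leaving a gap at 2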

How to synchronize MySQL data to Redis cache?

Reference: https://cloud.tencent.com/developer/article/1805755

There are two options:

1. Refresh Redis synchronously and automatically through MySQL, implemented with MySQL triggers + a UDF function.

The process is roughly as follows:

  1. In MySQL, set the trigger Trigger for the data to be operated, and monitor the operation
  2. When the client writes data to MySQL, the trigger will be triggered, and the MySQL UDF function will be called after the trigger
  3. UDF function can write data into Redis, so as to achieve the effect of synchronization

2. Analyze the binlog of MySQL to synchronize the data in the database to Redis. Can be achieved through canal. Canal is an open source project under Alibaba. It provides incremental data subscription and consumption based on database incremental log analysis.

The principle of canal is as follows:

  1. Canal simulates the MySQL slave's interaction protocol, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master
  2. The MySQL master receives the dump request and starts pushing the binary log to canal
  3. Canal parses the binary log (originally a byte stream) and writes the data to Redis synchronously

Why does the Ali Java manual prohibit the use of stored procedures?

First look at what is a stored procedure.

A stored procedure is a set of SQL statements that accomplishes a specific function in a large database system. It is stored in the database and remains permanently valid after being compiled once; users execute it by specifying its name and supplying parameters (if it has any).

Stored procedures mainly have the following disadvantages.

  1. Stored procedures are difficult to debug . The development of stored procedures has always lacked an effective IDE environment. The SQL itself is often very long, and it is very troublesome to disassemble the sentences and execute them separately in the debugging mode.
  2. Poor portability . The porting of stored procedures is difficult, and general business systems will inevitably use the unique features and syntax of the database. When the database is replaced, this part of the code needs to be rewritten, and the cost is high.
  3. Difficult to manage . The directory of stored procedures is flat, not a tree structure like the file system. It is easy to handle when there are few scripts, but once there are too many, the directory will become chaotic.
  4. The stored procedure is only optimized once . Sometimes with the increase of the data volume or the change of the data structure, the execution plan originally selected by the stored procedure may not be optimal, so manual intervention or recompilation is required at this time.

Finally, I would like to share a Github warehouse with you. It contains more than 300 classic computer book PDFs compiled by Dabin, covering C, C++, Java, Python, front-end, databases, operating systems, computer networks, data structures and algorithms, machine learning, the programming life, and more. You can star it and search there directly the next time you need a book; the warehouse is continuously updated~

Github address
