[MySQL interview questions (66 questions)]


Basics

As a SQLBoy, there shouldn't be anyone who doesn't know the basics. Interviews don't ask many questions on this part, so readers with a solid foundation can skip it. You may, however, be asked to write some SQL statements on the spot; SQL can be practiced on websites such as Niuke, LeetCode, and LintCode.

  • DDL: Data Definition Language

  • DML: Data Manipulation Language

  • DQL: Data Query Language

  • DCL: Data Control Language

1. What are inner joins, outer joins, cross joins, and Cartesian products?

  • Inner join: obtains the records from two tables that satisfy the join condition.
  • Outer join: obtains not only the records that satisfy the join condition, but also the records in one table (or both tables) that have no match.
  • Cross join: pairs every record of one table with every record of the other, with no join condition to filter on; it is the SQL implementation of the Cartesian product. If table A has m rows and table B has n rows, the result of A cross join B has m × n rows.
  • Cartesian product: a concept from mathematics. For example, for set A = {a, b} and set B = {0, 1, 2}, A × B = {<a, 0>, <a, 1>, <a, 2>, <b, 0>, <b, 1>, <b, 2>}.

2. What are the differences between MySQL's inner join, left join and right join?

MySQL joins are mainly divided into inner joins and outer joins; the commonly used outer joins are the left join and the right join.

(figure: MySQL join types, source: 菜鸟教程 / runoob tutorial)

  • Inner join: when two tables are joined, only the rows that match in both tables are kept.
  • Left join: returns all rows of the left table, even when there is no matching record in the right table.
  • Right join: returns all rows of the right table, even when there is no matching record in the left table.
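To make this concrete, here is a minimal sketch assuming two hypothetical tables emp(id, name, dept_id) and dept(id, dept_name):

-- inner join: only employees that have a matching department
select e.name, d.dept_name from emp e inner join dept d on e.dept_id = d.id;
-- left join: all employees, dept_name is NULL where there is no match
select e.name, d.dept_name from emp e left join dept d on e.dept_id = d.id;
-- right join: all departments, name is NULL where no employee belongs to them
select e.name, d.dept_name from emp e right join dept d on e.dept_id = d.id;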

3. Talk about the three major paradigms of database?

  • First normal form: every column (field) in a table must be atomic and cannot be split further. For example, if the user table stores an address that still needs to be broken into country, province, and city, it should be split into those columns to conform to 1NF.
  • Second normal form: on the basis of 1NF, every non-key column must depend on the whole primary key, not just part of it. For example, in an order-detail table whose joint primary key is (order ID, product ID), product information such as price and type depends only on the product ID, which violates 2NF; it belongs in a separate product table.
  • Third normal form: on the basis of 2NF, non-key columns depend only on the primary key and not on other non-key columns. For example, the order table should not store user information (name, address).


The purpose of the three normal forms is to control redundancy in the database and save space. In practice, though, the designs of most Internet companies are anti-normal-form: by keeping some data redundant they avoid cross-table and cross-database joins, trading space for time to improve performance.

4.What is the difference between varchar and char?


char:

  • char is a fixed-length string type;
  • if the inserted data is shorter than the fixed length, it is padded with spaces;
  • because the length is fixed, access is much faster than varchar, reportedly up to 50% faster, but the fixed length wastes space: it trades space for time;
  • char can store at most 255 characters, regardless of the encoding.

varchar

  • varchar is a variable-length string type;
  • the inserted data is stored according to its actual length;
  • varchar is the opposite of char in access: it is slower because the length is not fixed, but for the same reason it occupies no extra space, trading time for space;
  • varchar can store at most 65,532 bytes of data (65,535 bytes per row minus overhead for the length prefix and NULL flag); the maximum number of characters depends on the character set.

In daily design, char works for strings of relatively fixed length, while varchar is more appropriate for strings of uncertain length.
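As a minimal sketch of this rule of thumb (table and column names are hypothetical):

create table users (
  id       bigint primary key,
  country  char(2),      -- fixed-length code such as 'CN', 'US'
  pwd_md5  char(32),     -- an MD5 hash is always 32 hex characters
  nickname varchar(64)   -- uncertain length, stored by actual length
);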

5.What is the difference between blob and text?

  • blob is used to store binary data, while text is used to store large character strings.
  • blob has no character set; text has a character set, and its values are sorted and compared according to the collation of that character set.

6.What are the similarities and differences between DATETIME and TIMESTAMP?

Similarities:

  1. Both types store time in the same format, YYYY-MM-DD HH:MM:SS.
  2. Both types contain a date part and a time part.
  3. Both types can store fractional seconds up to microseconds (6 digits after the seconds).

Differences:

  1. Date range: DATETIME ranges from 1000-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999; TIMESTAMP ranges from 1970-01-01 00:00:01.000000 UTC to 2038-01-19 03:14:07.999999 UTC.
  2. Storage space: DATETIME takes 8 bytes; TIMESTAMP takes 4 bytes.
  3. Time zone: DATETIME is stored independent of time zone; TIMESTAMP is stored relative to the time zone, and the displayed value also depends on the time zone.
  4. Default value: DATETIME defaults to null; a TIMESTAMP column is not null by default, with the current time (CURRENT_TIMESTAMP) as the default value.
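A minimal sketch showing the two types side by side (table and column names are hypothetical):

create table t_log (
  id         bigint primary key,
  happened   datetime(6),                          -- stored independent of time zone
  created_at timestamp default current_timestamp   -- stored in UTC, displayed per session time zone
);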

7.What is the difference between in and exists in MySQL?

MySQL's in statement performs a hash join between the outer table and the inner table, while exists loops over the outer table and queries the inner table on each iteration. We might assume that exists is always more efficient than in, but that is inaccurate; it depends on the scenario:

  1. If the two tables are of similar size, there is little difference between in and exists.
  2. If one table is small and the other is large, use exists when the subquery runs against the larger table, and in when the subquery runs against the smaller table.
  3. not in vs not exists: if a query uses not in, both the inner and outer tables are scanned in full without using indexes, whereas the subquery of not exists can still use the index on its table. So no matter which table is bigger, not exists is faster than not in.
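The two forms, sketched against hypothetical tables A and B:

-- in: the subquery result is materialized and joined with the outer table
select * from A where A.id in (select id from B);
-- exists: the outer table is scanned and B is probed once per outer row
select * from A where exists (select 1 from B where B.id = A.id);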

8.What field type is better to use to record currency in MySQL?

Currency is commonly represented in MySQL by the DECIMAL and NUMERIC types, which MySQL implements as the same type. They are suitable for storing currency-related data.

For example, with salary DECIMAL(9, 2), 9 (the precision) is the total number of digits stored for the value and 2 (the scale) is the number of digits after the decimal point, so the salary column can store values from -9999999.99 to 9999999.99.

DECIMAL and NUMERIC values are stored as strings rather than as binary floating-point numbers, in order to preserve their decimal precision.

The reason float and double are not used: float and double are stored in binary, so representing decimal fractions involves some error.
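A minimal sketch of the problem (table and column names are hypothetical; the exact behavior may vary by version):

create table t_money (f float, d decimal(10, 2));
insert into t_money values (0.65, 0.65);
select * from t_money where f = 0.65;  -- may return nothing: 0.65 has no exact binary representation
select * from t_money where d = 0.65;  -- returns the row: decimal stores the value exactly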

9.How does MySQL store emoji expressions?

MySQL can directly use strings to store emoji.

However, note that utf8 encoding will not work: MySQL's utf8 is a crippled version of UTF-8 that uses at most 3 bytes per character, so it cannot store 4-byte emoji. What to do, then? Use the utf8mb4 encoding.

alter table blogs modify content text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci not null;

10.What is the difference between drop, delete and truncate?

All three delete data, but there are some differences:

  • drop (DDL) removes the entire table, structure and data, and cannot be rolled back;
  • delete (DML) removes rows one by one, can carry a where condition, and can be rolled back inside a transaction;
  • truncate (DDL) quickly removes all rows but keeps the table structure, resets auto_increment, and cannot be rolled back.

Therefore, when a table is no longer needed, use drop; when you want to delete some rows, use delete; when you want to keep the table but delete all its data, use truncate.
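In statement form:

delete from t where id > 100;  -- DML, deletes row by row, can be rolled back
truncate table t;              -- DDL, empties the table and resets auto_increment
drop table t;                  -- DDL, removes the table itself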

11.What is the difference between UNION and UNION ALL?

  • With UNION, duplicate rows are filtered out after the results are combined.
  • With UNION ALL, duplicate rows are not merged.
  • In terms of efficiency, UNION ALL is much faster than UNION, so if you do not deliberately need to remove duplicate rows, use UNION ALL.
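For example, against a hypothetical table t:

select name from t where type = 1
union       -- deduplicates, at extra cost
select name from t where type = 2;

select name from t where type = 1
union all   -- keeps duplicates, faster
select name from t where type = 2;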

12.What is the difference between count(1), count(*) and count(column name)?

Execution effect:

  • count(*) covers all columns and is equivalent to the number of rows; rows containing NULL values are not ignored.
  • count(1) ignores all columns and uses the constant 1 to represent each row; rows containing NULL values are likewise not ignored.
  • count(column) covers only the named column; rows whose value in that column is NULL (NULL here, not an empty string or 0) are not counted.

Execution speed:

  • If the column is the primary key, count(column) is faster than count(1).
  • If the column is not the primary key, count(1) is faster than count(column).
  • If the table has multiple columns and no primary key, count(1) performs better than count(*).
  • If there is a primary key, select count(primary key) performs best.
  • If the table has only one field, select count(*) performs best.
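The three variants side by side, assuming a hypothetical table user with a nullable column age:

select count(*)   from user;  -- counts all rows
select count(1)   from user;  -- counts all rows
select count(age) from user;  -- counts only rows where age is not null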

13. What is the execution order of a SQL query statement?


  1. FROM : perform a Cartesian product of the left table <left_table> and the right table <right_table> in the FROM clause, generating virtual table VT1
  2. ON : apply the ON filter to virtual table VT1; only rows that satisfy <join_condition> are inserted into virtual table VT2
  3. JOIN : if an OUTER JOIN (such as LEFT OUTER JOIN or RIGHT OUTER JOIN) is specified, unmatched rows of the preserved table are added to VT2 as outer rows, producing virtual table VT3. If the FROM clause contains more than two tables, steps 1) to 3) are repeated between the result of the previous join and the next table until all tables are processed.
  4. WHERE : apply the WHERE filter to virtual table VT3; only records that satisfy <where_condition> are inserted into virtual table VT4.
  5. GROUP BY : group the records in VT4 by the columns in the GROUP BY clause, generating VT5
  6. CUBE|ROLLUP : perform the CUBE or ROLLUP operation on table VT5, generating table VT6
  7. HAVING : apply the HAVING filter to virtual table VT6; only records that satisfy <having_condition> are inserted into virtual table VT7.
  8. SELECT : perform the SELECT projection, picking the specified columns into virtual table VT8
  9. DISTINCT : remove duplicate rows, generating virtual table VT9
  10. ORDER BY : sort the records in virtual table VT9 by <order_by_list>, generating virtual table VT10
  11. LIMIT : take out the specified rows, generating virtual table VT11, which is returned to the user
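Mapped onto a concrete query (the logical order from the list above is shown in the comments; the tables are the hypothetical emp and dept again):

select distinct e.dept_id, count(*) as cnt         -- 8 SELECT, 9 DISTINCT
from emp e left join dept d on e.dept_id = d.id    -- 1 FROM, 2 ON, 3 JOIN
where e.age > 20                                   -- 4 WHERE
group by e.dept_id                                 -- 5 GROUP BY
having count(*) > 1                                -- 7 HAVING
order by cnt desc                                  -- 10 ORDER BY
limit 10;                                          -- 11 LIMIT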

Database architecture

14. Tell me about the infrastructure of MySQL?

The MySQL logical architecture diagram is mainly divided into three layers:

  • Client: the top layer is not unique to MySQL; most network-based client/server tools or services have a similar layer, handling connection management, authorization and authentication, security, and so on.
  • Server layer: most of MySQL's core service functionality lives here, including query parsing, analysis, optimization, caching, and all built-in functions (such as date, time, math, and encryption functions). All cross-storage-engine functionality is implemented in this layer: stored procedures, triggers, views, etc.
  • Storage engine layer: the third layer contains the storage engines, which are responsible for storing and retrieving MySQL's data . The server layer communicates with the storage engines through APIs; these interfaces shield the differences between engines, making them transparent to the upper query process.

15. How is a SQL query statement executed in MySQL?

  • First check whether the statement has permission. If not, an error is returned directly; if yes, the query cache is checked first (before MySQL 8.0).
  • If there is no cache hit, the parser performs lexical and syntax analysis, extracting key elements such as select, and then checks whether the SQL statement has syntax errors, e.g. whether the keywords are correct.
  • After parsing, the MySQL server optimizes the query statement and decides on an execution plan.
  • After query optimization is complete, the server calls the storage engine interfaces according to the generated execution plan and returns the results.

What is the execution process of an SQL update statement in MySQL?

(see question 21 below for the detailed walkthrough)

storage engine

16.What are the common storage engines for MySQL?

InnoDB, MyISAM, MEMORY

The main storage engines and functions are as follows:

  • InnoDB: supports transactions, row-level locks, foreign keys, and crash recovery; the default engine since MySQL 5.5.
  • MyISAM: no transaction support, table-level locks; fast for inserts and queries; stores the table's row count.
  • MEMORY: keeps data in memory, so it is very fast, but the data is lost on restart; MySQL also uses it for temporary tables holding intermediate query results.

Before MySQL 5.5, the default storage engine was MyISAM, and after 5.5 it became InnoDB.

The hash index InnoDB supports is adaptive: InnoDB automatically builds hash indexes for a table based on how it is used; you cannot manually create a hash index on an InnoDB table.

Starting with MySQL 5.6, InnoDB supports full-text indexes.

17. How should I choose a storage engine?

Generally speaking, you can choose:

  • In most cases, using the default InnoDB is sufficient. If you want to provide transaction security (ACID compatibility) capabilities for commit, rollback, and recovery, and require concurrency control, InnoDB is the first choice.
  • If the data table is mainly used to insert and query records, the MyISAM engine provides higher processing efficiency.
  • If the data is only temporarily stored, the amount of data is not large, and high data security is not required, you can choose to save the data in the MEMORY engine in the memory. MySQL uses this engine as a temporary table to store the intermediate results of the query.

Which engine to use can be chosen flexibly according to need. Because storage engines are set per table, multiple tables in one database can use different engines to meet various performance and practical requirements. Using the appropriate storage engine improves the performance of the whole database.

18. What are the main differences between InnoDB and MyISAM?

PS: MySQL 8.0 is gradually becoming mainstream; unless you are preparing for interviews, you do not need to know much about MyISAM.

  1. Storage structure : each MyISAM table is stored on disk as three files ; InnoDB tables are stored in a shared data file (possibly several files, or independent tablespace files), and the size of an InnoDB table is limited only by the operating system's file size , generally 2GB.
  2. Transaction support : MyISAM does not provide transaction support ; InnoDB does, with transaction safety features such as commit, rollback, and crash recovery capabilities.
  3. Minimum lock granularity : MyISAM supports only table-level locks ; an update locks the whole table, blocking other queries and updates. InnoDB supports row-level locks .
  4. Index type : MyISAM's indexes are non-clustered, with leaf nodes storing pointers to the data rows; InnoDB's primary key index is a clustered index, with leaf nodes storing the rows themselves. Both are organized as B+ trees.
  5. Whether a primary key is required : MyISAM allows tables with no index or primary key at all ; if no primary key or non-null unique index is set, InnoDB automatically generates a hidden 6-byte primary key (invisible to the user). In InnoDB the data is part of the primary key index, and secondary indexes store the primary key value.
  6. Exact row count : MyISAM stores the total row count of the table, so select count(*) from table; reads the value directly; InnoDB does not store the row count, so select count(*) from table; traverses the whole table. With a where condition, MyISAM and InnoDB handle the count the same way.
  7. Foreign key support : MyISAM does not support foreign keys; InnoDB supports foreign keys.

log

19.What are the MySQL log files? Introduce the functions respectively?

There are many MySQL log files, including:

  • Error log : The error log file records the startup, running, and shutdown processes of MySQL and can help locate MySQL problems.
  • Slow query log (slow query log) : The slow query log is used to record query statements whose execution time exceeds the length defined by the long_query_time variable. Through the slow query log, you can find out which query statements have low execution efficiency for optimization.
  • General query log (general log) : The general query log records all information requested on the MySQL database, regardless of whether the request was executed correctly.
  • Binary log (binlog) : the binary log records all DDL and DML statements executed by the database (excluding query statements such as select and show), recorded as events and saved in binary format.

There are also two log files specific to the InnoDB storage engine:

  • Redo log : the redo log records the transaction log of the InnoDB storage engine and is crucial for crash recovery.
  • Rollback log (undo log) : the rollback log is also provided by the InnoDB engine; as the name suggests, it is used to roll data back. When a transaction modifies the database, InnoDB records not only the redo log but also the corresponding undo log; if the transaction fails or rollback is called, the information in the undo log can be used to roll the data back to what it looked like before the modification.

20.What is the difference between binlog and redo log?

  • The binlog records log entries for all storage engines , including InnoDB and MyISAM, while the redo log records only the logs of the InnoDB storage engine.
  • The recorded content differs: the binlog records the logical operations of a transaction (a logical log), while the redo log records the physical changes to each page (a physical log).
  • The write timing differs: the binlog is written once, when the transaction commits ; redo log entries are written continuously while the transaction is in progress .
  • The write method differs: the redo log is written circularly, overwriting old entries , while the binlog is append-only and never overwrites already-written files.

21. Do you understand how to execute an update statement?

(figure: execution flow of an update statement)

The execution of the update statement is completed by the cooperation of the server layer and the engine layer. In addition to writing the data to the table, the corresponding logs must also be recorded.

  1. The executor first looks for the engine to get the line ID=2. The ID is the primary key and the storage engine retrieves the data to find this row. If the data page where the row with ID=2 is located is already in the memory, it will be returned directly to the executor; otherwise, it needs to be read into the memory from the disk first and then returned.
  2. The executor gets the row data given by the engine, adds 1 to this value, for example, it used to be N, but now it is N+1, gets a new row of data, and then calls the engine interface to write this new row of data.
  3. The engine updates this new row of data into the memory and records the update operation into the redo log. At this time, the redo log is in the prepare state. Then inform the executor that the execution is completed and the transaction can be submitted at any time.
  4. The executor generates a binlog of this operation and writes the binlog to disk.
  5. The executor calls the engine's commit transaction interface, and the engine changes the redo log just written to the commit state, and the update is completed.

As the figure shows, when MySQL executes an update statement, statement parsing and execution happen in the server layer and data extraction and storage in the engine layer; at the same time, the binlog is written in the server layer and the redo log is written in InnoDB.

Moreover, the redo log is written in two stages when committing: a write in the prepare state before the binlog is written, and a write in the commit state after the binlog is written.

22. Then why is there a two-phase commit?

Why a two-phase commit? Can't we just commit directly?

Suppose we do not use two-phase commit but a "single-phase" commit: either write the redo log first and then the binlog, or write the binlog first and then the redo log. Either way, the state of the original database can end up inconsistent with the state of a database restored from the binlog.

Write the redo log first and then the binlog:

Once the redo log is written the data is crash-safe, so after a crash the local data recovers to include the update. But if the system crashes after the redo log is completed and before the binlog is written, the binlog does not contain the update statement; when the binlog is later used to back up or restore a database, the update statement is missing and the row with id = 2 is not updated there.

Write to binlog first, then redo log:

After the binlog is written, all statements are saved, so a database copied or restored from the binlog does have the row with id = 2 updated. But if the system crashes before the redo log is written, the transaction recorded in the redo log is invalid and the row with id = 2 in the actual database is not updated, so the two states are again inconsistent.

Simply put, both the redo log and the binlog can be used to represent a transaction's commit state, and two-phase commit keeps the two states logically consistent.

23. Do you know how the redo log is flushed to disk?

Redo log writes do not go straight to disk: InnoDB first sets aside a contiguous area of memory, the redo log buffer (redo 日志缓冲区), and writes redo log entries there.

When will the disk be flushed?

In the following situations, log buffer data will be flushed to disk:

  • When the log buffer space is insufficient

The log buffer has a limited size; if redo logs keep pouring in, it soon fills up. When the redo logs written into the log buffer occupy about half of its total capacity, they need to be flushed to disk.

  • When transaction is committed

When a transaction is committed, all logs in the log buffer will be flushed to disk in order to ensure durability. Note that at this time, in addition to the logs of this transaction, the logs of other transactions may also be flushed.

  • Background thread flushing

A background thread flushes the redo log from the log buffer to disk approximately once per second.

  • When shutting down the server normally

  • Trigger checkpoint rules

The redo log cache and redo log files are organized in blocks , called redo log blocks , with a fixed size of 512 bytes. The redo log files have a fixed total size and can be regarded as a logical log group consisting of a certain number of log blocks.


It is written from beginning to end; when the end is reached, writing wraps back to the beginning, in a loop.

There are two marked locations:

write pos is the current write position; it moves backward as logs are written, and after reaching the end of file 3 it wraps back to the start of file 0. checkpoint is the current erase position; it too moves forward in a circle. Before a record is erased, its changes must be flushed to disk.

(figure: write pos and checkpoint moving circularly through the redo log files)

When write pos catches up with checkpoint , the redo log is full; no more data can be written until the checkpoint rules run and free up writable space.

The so-called checkpoint rule means that after the checkpoint is triggered, all the log pages in the buffer will be flushed to the disk.

SQL optimization

24. How to locate slow SQL?

Slow SQL is monitored mainly through two ways:

  • Slow query log : Enable MySQL's slow query log, and then use some tools such as mysqldumpslow to analyze the corresponding slow query log. Of course, most cloud vendors now provide visual platforms.
  • Service monitoring : slow-SQL monitoring can also be built into the service infrastructure; common approaches include bytecode instrumentation, connection pool extensions, and ORM framework hooks that monitor and alert on slow SQL while the service runs.
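For example, the slow query log can be switched on at runtime (the threshold here is illustrative):

set global slow_query_log = on;         -- enable the slow query log
set global long_query_time = 1;         -- log statements that take longer than 1 second
show variables like 'slow_query_log%';  -- check whether it is on and where the file is
-- then analyze it, e.g.: mysqldumpslow -s t -t 10 /path/to/slow.log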

25.What are some ways to optimize slow SQL?

The optimization of slow SQL mainly considers two aspects, the optimization of the SQL statement itself , and the optimization of the database design .


Avoid unnecessary columns

This is a cliché, but it still happens frequently. A SQL query should fetch only the columns it needs and no extra ones; writing select * should be avoided as much as possible.

Pagination optimization

When the amount of data is relatively large and the paging is relatively deep, paging optimization needs to be considered.
For example:

select * from table where type = 2 and level = 9 order by id asc limit 190289,10;

Optimization:

  • Delayed association
    First filter out the primary keys through the where condition, then join back to the original table and fetch the data rows by primary key id, instead of fetching them through the secondary index. For example:
select a.* from table a,
    (select id from table where type = 2 and level = 9 order by id asc limit 190289, 10) b
    where a.id = b.id
  • Bookmark method
    The bookmark method finds the primary key value corresponding to the first limit offset, then filters and limits starting from that primary key value. For example:
select * from table where type = 2 and level = 9 and id >=
	(select id from table where type = 2 and level = 9 order by id asc limit 190289, 1)
	order by id asc limit 10

Index optimization

Properly designing and using indexes is a powerful tool for optimizing slow SQL.

  • Use covering indexes
    When InnoDB queries through a non-primary-key index, it has to go back to the table. But if the leaf nodes of the index already contain the fields being queried, there is no need to go back to the table; this is called a covering index. For example, for the following query:
select name from test where city='上海'

We add the queried field into the joint index so that the query result can be obtained directly from the index:

alter table test add index idx_city_name (city, name);
  • Avoid or queries on low versions
    In versions before MySQL 5.0, try to avoid or in query conditions and use union or a subquery instead, because or could cause the index to fail in early versions. Higher versions introduced index merge, which solves this problem.
  • Avoid the != or <> operators
    In SQL, the not-equal operator causes the query engine to abandon the index and perform a full table scan, even if the compared field is indexed. Workaround: rewrite the not-equal condition as two ranges joined with or, which can use the index.

For example, rewrite column <> 'aaa' as column > 'aaa' or column < 'aaa', and the index can be used.

  • Use prefix indexes appropriately

    Appropriate use of prefix indexes reduces the space an index occupies and improves its query efficiency.
    For example, email addresses all end with a fixed suffix such as "@xxx.com"; since the distinguishing part of such a field is at the front, it is well suited to a prefix index.

alter table test add index index2(email(6));

PS: Note that prefix indexes also have drawbacks: MySQL cannot use a prefix index for order by or group by operations, nor as a covering index.

  • Avoid function operations on columns
    Avoid performing arithmetic or other expression operations on column fields; otherwise the storage engine may not be able to use the index correctly, which hurts query efficiency.
select * from test where id + 1 = 50;            -- the expression defeats the index on id
select * from test where month(updateTime) = 7;  -- the function defeats the index on updateTime
  • Correct use of joint indexes
    When using joint indexes, pay attention to the leftmost matching principle.

JOIN optimization

  • Optimize subqueries
    Try to replace subqueries with join statements: a subquery is a nested query that creates a temporary table, and creating and destroying temporary tables costs system resources and time; for subqueries that return large result sets, the impact on query performance is even greater.
  • Small table drives large table
    In a join query, the small table should drive the large table, because MySQL internally traverses the driving table and then probes the driven table row by row.
    For example, in a left join the left table is the driving table; if table A is smaller than table B, fewer probes are needed and the query is faster.
select A.name from A left join B on A.id = B.id;  -- join condition added for illustration
  • Appropriately add redundant fields
    Adding redundant fields can reduce a large number of join queries, because multi-table join queries perform poorly; this is an optimization strategy of trading space for time.
  • Avoid using JOIN to associate too many tables
    The Alibaba Java Development Manual stipulates joining no more than three tables: first, too many joins slow the query down; second, the join buffer occupies more memory.
    If joining many tables is unavoidable, consider replicating the data into a heterogeneous store such as ES and querying there.

Sorting optimization

  • Use index scans for sorting
    MySQL has two ways to produce ordered results: sort the result set, or scan in index order so that the results come out naturally ordered.

    However, if the index does not cover the columns the query needs, each scanned record requires one lookup back to the table. That read is random IO, usually slower than a sequential full table scan. So when designing the index, try to have the same index satisfy both the sorting and the row lookup.

    For example:

-- create the index (date, staff_id, customer_id)
select staff_id, customer_id from test where date = '2010-01-01' order by staff_id,customer_id;

Only when the column order of the index is completely consistent with the order of the ORDER BY clause, and the sorting direction of all columns is the same, can the index be used to sort the results.

UNION optimization

  • Conditional push down

    MySQL's strategy for handling union is to create a temporary table first, fill each query's result into it, and then query again. Many optimization strategies fail in union queries because the temporary table cannot use indexes.

    It is best to manually push clauses such as where and limit down into each subquery of the union, so that the optimizer can make full use of these conditions.

    In addition, unless you really need the server to deduplicate, always use union all; without the all keyword, MySQL adds the distinct option to the temporary table , which triggers a uniqueness check over the whole temporary table and is very costly.

26. How to read the execution plan (explain) and understand the meaning of each field in it?

Explain is a powerful tool for SQL optimization. Besides optimizing slow SQL, you should also run explain first whenever you write SQL and check the execution plan for room for improvement.

Adding the explain keyword directly before the select statement will return execution plan information.

The main fields of the explain output are:

  • id: the sequence number of the select within the query.
  • select_type: the query type (SIMPLE, PRIMARY, SUBQUERY, DERIVED, UNION, ...).
  • table: the table this row of the plan refers to.
  • type: the access type, roughly from best to worst: system > const > eq_ref > ref > range > index > ALL.
  • possible_keys: the indexes that might be used.
  • key: the index actually chosen.
  • key_len: the number of index bytes used.
  • ref: the columns or constants compared against the index.
  • rows: the estimated number of rows to be examined.
  • Extra: additional information, such as Using index (covering index), Using filesort, or Using temporary.
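For example (the table is hypothetical):

explain select * from t_user where name = '张三';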

index

Indexes are arguably the top priority in MySQL interviews and must be mastered thoroughly.

27. Can you briefly talk about the classification of indexes?

Indexes are classified along three different dimensions:

(figure: index classification)

For example, from the perspective of basic usage:

  • Primary key index: the primary key is InnoDB's default index; the column allows no duplicates and no NULLs, and a table can have only one primary key.
  • Unique index: no duplicate values are allowed, NULL values are allowed, and a table may create unique indexes on multiple columns.
  • Ordinary index: the basic index type, with no uniqueness restriction; NULL values are allowed.
  • Combined index: an index over multiple column values for combined searches, more efficient than index merging.

28. Why does using an index speed up queries?

The traditional lookup traverses the table in order: no matter how few rows are wanted, MySQL scans the table data from beginning to end.

After we add an index, MySQL generates an index file, generally organized with the BTREE algorithm. When querying, it searches within the comparatively small index data and then maps to the corresponding rows, which greatly improves lookup efficiency.

It is the same as finding content through a book's table of contents.


29.What are the points to note when creating an index?

Although indexes are a powerful tool for optimizing SQL performance, index maintenance also requires costs. Therefore, when creating indexes, you should also pay attention to:

  1. Indexes should be built on fields frequently used in queries : create indexes on fields used in where conditions, order by sorting, and join on clauses.

  2. The number of indexes should be appropriate : indexes take space and must be maintained on every update.

  3. Do not index fields with low selectivity, such as gender : an index on a field with too little dispersion does little to reduce the number of rows scanned.

  4. Frequently updated values should not be used as primary keys or indexes: maintaining the index files has a cost, and it also causes page splits and more IO.

  5. In a combined index, put the values with high selectivity (high distinction) first, to exploit the leftmost prefix matching principle.

  6. Prefer extending a composite index over adding single-column indexes : MySQL can basically use only one index per table access, so a composite index suits frequent multi-condition queries better.

  7. Use prefix indexes for overly long fields : when values are long, a full index consumes a lot of space and is slow to search; index only the leading part of the field instead, which is called a prefix index.

  8. Unordered values (such as ID card numbers and UUIDs) are not recommended as indexes : when the primary key is not monotonically increasing, leaf nodes split frequently and disk storage becomes fragmented.

30. Under what circumstances will the index fail?

  • The query condition contains or, which may cause the index to fail.
  • If the field type is a string, the value in where must be quoted; otherwise implicit type conversion invalidates the index.
  • A like wildcard at the start of the pattern (leading %) may cause the index to fail.
  • In a joint index, if the condition column is not the first column of the index, the index fails.
  • Using a MySQL built-in function on an indexed column invalidates the index.
  • Performing operations on an indexed column (such as +, -, *, /) invalidates the index.
  • Using != or <> or not in on an indexed field may cause the index to fail.
  • Using is null or is not null on an indexed field may cause the index to fail.
  • If the fields joined in a left join or right join have different character encodings, the index may fail.
  • The MySQL optimizer estimates that a full table scan is faster than using the index, so it skips the index.
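Two of the most common cases, sketched against a hypothetical table with indexes on phone (a varchar column) and created_at:

select * from t where phone = 13800000000;        -- implicit type conversion, the index fails
select * from t where phone = '13800000000';      -- quoted string, the index works
select * from t where month(created_at) = 7;      -- function on the column, the index fails
select * from t where created_at >= '2023-07-01'
                  and created_at < '2023-08-01';  -- rewritten as a range, the index works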

31. Which scenarios are not suitable for indexing?

  • Tables with relatively small amounts of data are not suitable for indexing

  • Fields that are updated frequently are not suitable for indexing

  • Fields with low discreteness are not suitable for indexing (such as gender)

32. Is it better to build more indexes?

Of course not.

  • Indexes take up disk space.
  • Although indexes improve query efficiency, they reduce the efficiency of updating the table: on every insert, delete, or update, MySQL must not only save the data but also save or update the corresponding index files.

33.Do you know what data structure MySQL index uses?

MySQL's default storage engine is InnoDB, which uses a B+ tree structure index.

  • B+ tree: Only leaf nodes store data, and non-leaf nodes only store key values. Leaf nodes are connected using bidirectional pointers, and the lowest leaf nodes form a bidirectional ordered linked list.

(figure: a three-level B+ tree index; each node is a disk block holding data items and pointers)

In this structure, there are several important points:

  • Each square in the figure is called a disk block . Each disk block contains several data items and pointers; for example, the root disk block contains data items 17 and 35 and pointers P1, P2, P3.
  • P1 points to disk blocks with values less than 17, P2 to disk blocks with values between 17 and 35, and P3 to disk blocks with values greater than 35 .
  • The real data lives only in the leaf nodes, namely 3, 4, 5 ... 65. Non-leaf nodes do not store real data, only data items that guide the search; for example, 17 and 35 do not actually exist in the data table .
  • Leaf nodes are connected with bidirectional pointers, and the lowest level forms a bidirectional ordered linked list, which supports range queries.

34. How many pieces of data can a B+ tree store?


  • Assume the indexed field is of type bigint, 8 bytes long, and a pointer is 6 bytes in the InnoDB source code, 14 bytes in total. A non-leaf node (one 16KB page) can hold 16384 / 14 ≈ 1170 such units (key value + pointer), i.e. about 1170 pointers.
  • With a tree of depth 2 (two pointer levels above the leaves), there are 1170 × 1170 leaf pages; at 16 rows per leaf page, the tree can store 1170 × 1170 × 16 = 21,902,400 rows.
  • Each page lookup is one IO, so in a table of about 20 million rows, finding a row costs at most three IOs.
    Therefore a B+ tree in InnoDB is generally 1 to 3 levels deep, which is enough for tens of millions of rows .

35. Why use B+ tree instead of ordinary binary tree?

You can look at this problem from several dimensions, including whether the query is fast enough, whether the efficiency is stable, how much data is stored, and how many times the disk is searched.

Why not use ordinary binary trees?

An ordinary binary tree can degenerate; if it degenerates into a linked list, a lookup is equivalent to a full table scan. Compared with a plain binary search tree, a balanced binary tree gives more stable search efficiency and faster overall search speed.

Why not balance a binary tree?

Data is read from disk into memory; with a tree structure as the index, each node visited means reading one disk block. A balanced binary tree stores only one key value and one data entry per node , while a B+ tree node can store many more entries, so the tree height is lower , fewer disk reads are needed, and queries are faster.

36. Why use B+ tree instead of B tree?

Compared with B-tree, B+ has the following advantages:

  • It is a variant of the B tree and can solve every problem the B tree can solve.

Two improvements over the B tree: each node stores more keys, and there are more branch paths.

  • Stronger ability to scan the whole table

To scan the whole table we only need to traverse the leaf nodes; there is no need to traverse the entire B+ tree to get all the data.

  • Stronger disk read/write ability than the B tree, with fewer IOs

The root and branch nodes do not hold data areas, so one node can store more keys, one disk read loads more keys, and the number of IOs drops.

  • Better sorting ability

Each leaf node has a pointer to the next data area, so the data forms a linked list.

  • More stable efficiency
    A B+ tree always fetches data at the leaf nodes, so the number of IOs is stable.

37.What is the difference between Hash index and B+ tree index?

  • B+ tree indexes support range queries; Hash indexes do not.
  • B+ tree indexes support the leftmost-prefix rule of joint indexes; Hash indexes do not.
  • B+ tree indexes support order by sorting; Hash indexes do not.
  • Hash indexes are more efficient than B+ trees for equality queries.
  • A B+ tree index can still help a like fuzzy query when the wildcard is at the end (a prefix match); a Hash index cannot do fuzzy queries at all.

38. What is the difference between clustered index and non-clustered index?

First, understand that a clustered index is not a new kind of index but a way of storing data .

Clustered means that data rows are stored compactly together with adjacent key values. Of the two storage engines we know best, MyISAM uses non-clustered indexes and InnoDB uses a clustered index.

You can say that:

The data structure of an index is a tree. In a clustered index, the index and the data are stored in one tree , whose leaf nodes are the data itself; in a non-clustered index, the index and the data are not in one tree .

(figure: clustered index vs non-clustered index)

  • A table can only have one clustered index, while a table can have multiple non-clustered indexes.
  • In a clustered index, the logical order of the key values ​​in the index determines the physical order of the corresponding rows in the table; in a non-clustered index, the logical order of the indexes in the index is different from the physical storage order of the rows on the disk.
  • Clustered index: physical storage is sorted according to the index; non-clustered index: physical storage is not sorted according to the index;

39. Do you understand going back to the table (回表)?

In the InnoDB storage engine, when a query goes through a secondary (auxiliary) index, InnoDB first finds the primary key value in the secondary index , and then looks the row up in the primary key index . Compared with a query that uses the primary key index directly, it scans one extra index tree; this process is called going back to the table.

For example: select * from user where name = '张三';


40.Do you understand covering index?

In a secondary index, whether single-column or joint, if all the selected columns can be obtained from the secondary index itself without looking up the primary key index, the index used is called a covering index , which avoids going back to the table.

For example: select name from user where name = '张三';


41.What is the leftmost prefix principle/leftmost matching principle?

Note: The leftmost prefix principle, the leftmost matching principle, and the leftmost prefix matching principle are all the same concept.

Leftmost matching principle: in an InnoDB joint index, a query can match a later column only after the earlier (left) columns have been matched.

According to the leftmost matching principle, creating a combined index such as (a1, a2, a3) is equivalent to creating the three indexes (a1), (a1, a2), and (a1, a2, a3).

Why can't we match if we don't search from the far left?

For example, there is a user table, and we create a combined index for name and age.

ALTER TABLE user add INDEX comidx_name_phone (name,age);

The combined index is a composite structure in the B+ tree, which builds the search tree comparing from left to right (name first, then age).

(figure: B+ tree of the combined index (name, age))

As the structure shows, name is ordered while age is unordered; only when the names are equal are the ages ordered.

When we query with where name = '张三' and age = '20', the B+ tree first compares name to decide which direction to search next, and compares age when the names are equal. But if the query condition does not include name, the tree does not know which node to examine next , because name was the first comparison factor when the search tree was built, so the index cannot be used.
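Against the comidx_name_phone (name, age) index above, as a sketch:

select * from user where name = '张三' and age = 20;  -- uses both index columns
select * from user where name = '张三';               -- uses the name column
select * from user where age = 20;                    -- cannot use the index at all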

42.What is index pushdown optimization?

Index Condition Pushdown (ICP) is added in MySQL 5.6 and is used to optimize data queries.

  • Without index condition pushdown, the storage engine retrieves rows through the index and returns them to the MySQL server, which then applies the filter conditions.
  • With index condition pushdown, when part of the where condition involves indexed columns, the MySQL server pushes that part down to the storage engine ; the engine checks the condition against the index entries and only reads and returns a row when the index satisfies the pushed-down condition .

For example, a joint index (name, age) is built on a table and the query is select * from t_user where name like '张%' and age = 10; since name uses a range (prefix) match, by the leftmost matching principle only name can be matched within the index.

Without ICP, the engine layer finds every row whose name matches '张%', goes back to the table for each of them, and the server layer then filters on age = 10; the age column already present in the joint index is wasted.

(figure: query processing without index condition pushdown)

With index condition pushdown, the age = 10 condition is executed inside the engine layer: index entries are filtered directly on name like '张%' and age = 10, which reduces the number of back-to-table lookups.

(figure: query processing with index condition pushdown)

Index condition pushdown reduces the number of times the storage engine accesses the base table , and also the number of times the MySQL server requests records from the storage engine .
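ICP is on by default since MySQL 5.6 and can be toggled through the optimizer_switch variable; when it applies, the Extra column of the explain output shows Using index condition:

set optimizer_switch = 'index_condition_pushdown=off';  -- disable ICP
set optimizer_switch = 'index_condition_pushdown=on';   -- enable ICP (the default)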

Lock

43.What kinds of locks are there in MySQL? List them?

If divided by lock granularity, there are the following three types:

  • Table lock: low overhead, fast locking; large locking granularity, the highest probability of lock conflicts, the lowest concurrency; no deadlocks.
  • Row lock: high overhead, slow locking; deadlocks can occur; small locking granularity, low probability of lock conflicts, high concurrency.
  • Page lock: overhead and locking speed between table locks and row locks; deadlocks can occur; locking granularity between table locks and row locks, average concurrency.

In terms of compatibility, there are two types:

  • Shared lock (S Lock), also called read lock: shared locks do not block each other.
  • Exclusive lock (X Lock), also called write lock: exclusive and blocking; while it is held, only the holder can write, and other requests to read or write the locked data are blocked.

44. Talk about the row lock implementation in InnoDB?

We use a user table to illustrate row-level locks. Four rows are inserted, with primary key values 1, 6, 8, and 12; we simplify its clustered index structure and keep only the data records.

(figure: clustered index records with primary keys 1, 6, 8, 12)

The main implementation of InnoDB's row lock is as follows:

  • Record Lock (记录锁)
    A record lock locks a single row directly. When we do an equality query through a unique index (including unique and clustered indexes) and exactly match one record, that record is locked. For example, select * from t where id = 6 for update; locks the record with id = 6.


  • Gap Lock (间隙锁)
    The gap in gap locks refers to the logical range between two records that has not yet been filled with data; it is an interval open on both ends.


A gap lock locks such gap intervals. When an equality query or a range query hits no record, the corresponding gap interval is locked. For example, select * from t where id = 3 for update; or select * from t where id > 1 and id < 6 for update; locks the gap (1, 6).

  • Next-key Lock (临键锁)
    A next-key lock covers the left-open, right-closed interval formed by a gap plus the record to its right, such as the (1, 6] and (6, 8] above.


A next-key lock is the combination of a record lock (Record Lock) and a gap lock (Gap Lock): besides locking the record itself, it also locks the gap before it. When a range query hits some records, the corresponding next-key intervals are locked. Note that the locked range also includes the next-key interval to the right of the last matched record: for example, select * from t where id > 5 and id <= 7 for update; locks (1, 6] and (6, 8]. MySQL's default row lock type is the next-key lock. When a unique index is used and an equality query matches a record, the next-key lock degenerates into a record lock; when no record is matched, it degenerates into a gap lock.

Gap locks and next-key locks both exist to solve the phantom read problem.

Under the READ COMMITTED isolation level, gap locks and next-key locks are not used!

The above are three implementation algorithms for row locks. In addition, there are also insertion intention locks on rows.

  • Insert Intention Lock (插入意向锁)

When a transaction inserts a record, it needs to determine whether the insertion position is locked by another transaction. If so, the insertion operation needs to wait until the transaction holding the gap lock commits. However, while the transaction is waiting, it also needs to generate a lock structure in the memory, indicating that a transaction wants to insert a new record in a certain gap , but is now waiting. This type of lock is named Insert Intention Locks, which means insert intention lock.

Suppose transaction T1 holds a gap lock on the range (1, 6). A transaction T2 that wants to insert a row with id = 4 acquires an insert intention lock for (1, 6) and waits; another transaction T3 that wants to insert id = 3 likewise acquires an insert intention lock for (1, 6). The two insert intention locks do not block each other.


45.Do you know what intention lock is?

  • An intention lock is a table-level lock , not to be confused with the insert intention lock.
  • Intention locks exist to support InnoDB's multi-granularity locking; they solve the problem of table locks and row locks coexisting .
  • When we want to lock a whole table, we must know whether any data row in it is already locked, to decide whether the table lock can be granted.
  • Without intention locks, we would have to traverse all rows of the table to check for row locks; with an intention lock, itself a table-level lock, one check tells us whether any row in the table is locked .
  • With intention locks, before transaction A acquires a row lock (write lock), the database automatically grants A the table's intention exclusive lock. When transaction B later requests the table's exclusive lock, it is blocked because the table already carries an intention exclusive lock.

46.Do you understand MySQL’s optimistic locking and pessimistic locking?

  • Pessimistic Concurrency Control

A pessimistic lock assumes the data it protects is extremely unsafe and may be changed at any moment; once a transaction acquires a pessimistic lock, no other transaction can modify the data and must wait for the lock to be released before proceeding.

Row locks, table locks, read locks, and write locks in the database are all pessimistic locks.

  • Optimistic Concurrency Control

Optimistic locking believes that data changes will not be too frequent.

Optimistic locking is usually implemented by adding a version (version) or timestamp (timestamp) to the table, of which version is the most commonly used.

When a transaction reads data from the database, it also reads the data's version (v1). When it has finished its changes and wants to write them back, it compares v1 with the latest version v2 in the table. If v1 = v2, no other transaction modified the data in the meantime, so the transaction is allowed to update the table, incrementing the version by 1 to indicate that the data has changed.

If v1 does not equal v2, the data was modified by another transaction during the change, so the update is rejected ; the usual approach is to notify the user and let them retry. Unlike pessimistic locks, optimistic locking is usually implemented by developers.
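A minimal sketch of version-based optimistic locking (table and columns are hypothetical):

-- read the row together with its version
select balance, version from account where id = 1;  -- suppose this returns version = 5
-- write back only if nobody changed the row in the meantime
update account
set balance = balance - 10, version = version + 1
where id = 1 and version = 5;
-- 0 affected rows means another transaction got there first: retry or report failure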

47.Has MySQL ever encountered a deadlock problem? How did you solve it?

The general steps for troubleshooting deadlocks are as follows:

(1) Check the deadlock log show engine innodb status;

(2) Find the SQL statements involved in the deadlock

(3) Analyze how the SQL acquires locks

(4) Reproduce the deadlock scenario

(5) Analyze the deadlock log

(6) Analyze the deadlock result

Of course, this is just a simple process description. In fact, deadlocks in production are all kinds of strange, and it is not that simple to troubleshoot and solve.

Transactions

48. What are the four major characteristics of MySQL transactions?

ACID (Atomicity, Consistency, Isolation, Durability)

  • Atomicity : a transaction executes as a whole; the operations it contains on the database either all execute or none do.
  • Consistency : the data is not corrupted from before the transaction starts to after it ends. If account A transfers 10 yuan to account B, the total of A and B stays the same whether the transfer succeeds or fails.
  • Isolation : when multiple transactions access concurrently, they are isolated from one another; one transaction does not affect the effects of the others. In short, transactions do not interfere with each other.
  • Durability : after a transaction completes, the changes it made to the database are saved permanently.

49. So what guarantee does ACID rely on?

  • Transaction isolation is achieved through the database's locking mechanism.
  • Transaction consistency is guaranteed by the undo log: the undo log is a logical log that records a transaction's insert, update, and delete operations; during rollback, the opposite delete, update, and insert operations are applied to restore the data.
  • Transaction atomicity and durability are guaranteed by the redo log: the redo log is a physical log. When a transaction commits, all of its logs must first be written to the redo log and persisted before the commit is considered complete.

50. What are the isolation levels of transactions? What is MySQL's default isolation level?


Four isolation levels of transactions

  • Read Uncommitted
  • Read Committed
  • Repeatable Read
  • Serializable

MySQL's default transaction isolation level is Repeatable Read.
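The current level can be checked and changed with standard MySQL statements (the variable is transaction_isolation on MySQL 8.0; older 5.7 releases use tx_isolation):

```sql
SELECT @@transaction_isolation;                          -- REPEATABLE-READ by default
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- change it for this session only
```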

51. What are phantom reads, dirty reads, and non-repeatable reads?

  • Transactions A and B execute alternately, and transaction A reads transaction B's uncommitted data. This is a dirty read.
  • Within one transaction, two identical queries against the same record return different data. This is a non-repeatable read.
  • Transaction A queries the result set of a range, a concurrent transaction B inserts into or deletes from that range and commits, and when transaction A queries the same range again the two reads return different result sets. This is a phantom read.
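For example, a minimal two-session sketch of a non-repeatable read under Read Committed (the account table is hypothetical):

```sql
-- Session A
START TRANSACTION;
SELECT balance FROM account WHERE id = 1;  -- returns 100

-- Session B
START TRANSACTION;
UPDATE account SET balance = 200 WHERE id = 1;
COMMIT;

-- Session A, still inside the same transaction
SELECT balance FROM account WHERE id = 1;  -- now returns 200: a non-repeatable read
COMMIT;
```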

Problems that may occur under concurrent transactions at each isolation level:

  • Read Uncommitted: dirty reads, non-repeatable reads, and phantom reads are all possible.
  • Read Committed: prevents dirty reads; non-repeatable reads and phantom reads are possible.
  • Repeatable Read: prevents dirty reads and non-repeatable reads; phantom reads are possible.
  • Serializable: prevents all three.

For a detailed explanation, see: A detailed explanation of phantom reads, dirty reads and non-repeatable reads — weixin_45483322's blog, CSDN.

52. How are various isolation levels of transactions implemented?

Read Uncommitted

Needless to say, Read Uncommitted follows the principle of reading without locking.

  • Transaction reads do not lock and do not block other transactions' reads or writes;
  • Transaction writes block other transactions' writes, but do not block their reads.

Read Committed & Repeatable Read

The Read Committed and Repeatable Read levels rely on MVCC and the ReadView: each transaction can only read the versions that its ReadView makes visible to it.

  • READ COMMITTED: Generate a ReadView every time before reading data
  • REPEATABLE READ: Generate a ReadView when reading data for the first time

Serializable

The Serializable level follows the principle of locking on both reads and writes.

Under Serializable, for the same row, a write acquires a write lock and a read acquires a read lock. When a read-write lock conflict occurs, the transaction that arrives later must wait for the earlier transaction to complete before it can continue.

53. Do you understand MVCC? How is it implemented?

MVCC (Multi-Version Concurrency Control) means multi-version concurrency control. Simply put, it solves the read-consistency problem under concurrent access by maintaining historical versions of the data.

Regarding its implementation, we must grasp several key points, including implicit fields, undo logs, version chains, snapshot reading & current reading, and Read View .

Version chain

For the InnoDB storage engine, each row of records has two hidden columns, DB_TRX_ID and DB_ROLL_PTR:

  • DB_TRX_ID, the transaction ID: each time the row is modified, the modifying transaction's ID is written into DB_TRX_ID;
  • DB_ROLL_PTR, the rollback pointer: it points to the row's undo log in the rollback segment.

image-20230820102610404

Suppose there is a user table with only one row, inserted by a transaction with ID 80; a snapshot of the record at that moment is shown above.

Next, two transactions with DB_TRX_ID 100 and 200 respectively update this record. The whole process is as follows:

image-20230820102659482

Each change first writes an undo log record, and DB_ROLL_PTR points to the address of that undo log. The modification records of a row are therefore chained together to form a version chain, and the head node of the version chain is the latest value of the record, as follows:

image-20230820102750297

ReadView

Under both the Read Committed and Repeatable Read isolation levels, a transaction must read only records modified by committed transactions. In other words, if a version in the version chain was written by a transaction that has not yet committed, that version must not be read. So, under these two levels, we need a way to determine which version in the version chain is visible to the current transaction, and the concept of the ReadView was introduced to solve exactly this problem.

A ReadView is the read view generated when a transaction performs a snapshot read; it is equivalent to a snapshot of the system taken at a certain moment. From this snapshot we can obtain:

image-20230820102936896

  • m_ids: Represents the transaction ID list of active read and write transactions in the current system when the ReadView is generated.
  • min_trx_id: Indicates the smallest transaction id among the active read and write transactions in the current system when the ReadView is generated, which is the minimum value in m_ids.
  • max_trx_id: Indicates the id value that should be assigned to the next transaction in the system when generating ReadView.
  • creator_trx_id: Indicates the transaction ID of the transaction that generated the ReadView.

With this ReadView, when accessing a record, you only need to follow the steps below to determine whether a certain version of the record is visible:

  • If the DB_TRX_ID attribute value of the accessed version is the same as the creator_trx_id value in ReadView , it means that the current transaction is accessing its own modified records, so this version can be accessed by the current transaction.
  • If the DB_TRX_ID attribute value of the accessed version is less than the min_trx_id value in ReadView , it indicates that the transaction that generated this version has been committed before the current transaction generates ReadView, so this version can be accessed by the current transaction.
  • If the DB_TRX_ID attribute value of the accessed version is greater than the max_trx_id value in ReadView , it means that the transaction that generated this version was opened after the current transaction generated ReadView, so this version cannot be accessed by the current transaction.
  • If the DB_TRX_ID attribute value of the accessed version is between ReadView's min_trx_id and max_trx_id, check whether the DB_TRX_ID is in the m_ids list. If it is, the transaction that generated this version was still active when the ReadView was created, so this version cannot be accessed; if it is not, that transaction had already committed by the time the ReadView was created, so this version can be accessed.

If a certain version of the data is not visible to the current transaction, follow the version chain to the next version and apply the same visibility check, and so on, down to the last version in the chain. If even the last version is not visible, the record is completely invisible to this transaction and is excluded from the query result.
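The four visibility rules can be mimicked with a toy SQL CASE expression. This is purely illustrative: row_versions, readview, and readview_mids are hypothetical tables standing in for InnoDB's internal structures, not real system tables.

```sql
-- Toy visibility check: which versions of a row can the current transaction see?
SELECT v.db_trx_id,
  CASE
    WHEN v.db_trx_id = r.creator_trx_id THEN 'visible (own change)'
    WHEN v.db_trx_id < r.min_trx_id     THEN 'visible (committed before ReadView)'
    -- max_trx_id is the next id to be assigned, so >= means started after the ReadView
    WHEN v.db_trx_id >= r.max_trx_id    THEN 'invisible (started after ReadView)'
    WHEN v.db_trx_id IN (SELECT trx_id FROM readview_mids)
                                        THEN 'invisible (still active)'
    ELSE 'visible (committed)'
  END AS visibility
FROM row_versions v CROSS JOIN readview r;
```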

In MySQL, a very big difference between READ COMMITTED and REPEATABLE READ isolation levels is that they generate ReadView at different times.

  • READ COMMITTED generates a ReadView every time before reading data , so that you can ensure that you can read the data submitted by other transactions every time;

  • REPEATABLE READ generates a ReadView when reading data for the first time , thus ensuring that the results of subsequent reads are completely consistent.

High availability/performance

54. Do you understand database read-write separation?

The basic principle of read-write separation is to distribute database read and write operations across different nodes.

The basic implementation of read-write separation is:

  • The database server builds a master-slave cluster, either one master and one slave, or one master and multiple slaves.
  • The database host is responsible for read and write operations, and the slave is only responsible for read operations.
  • The database host synchronizes data to the slave machine through replication, and each database server stores all business data .
  • The business server sends write operations to the database host and read operations to the database slave.

55. How is the allocation of reads and writes implemented?

There are generally two ways to separate read and write operations and route them to different database servers: program code encapsulation and middleware encapsulation.

  1. Program code encapsulation

Program code encapsulation means abstracting a data access layer in the code (some articles call this "middle-layer encapsulation") that implements the separation of read and write operations and the management of database server connections. For example, a simple encapsulation on top of Hibernate can achieve read-write separation.

Among the current open source implementation solutions, Taobao's TDDL (Taobao Distributed Data Layer, nickname: Big Head) is relatively famous.

  2. Middleware encapsulation

Middleware encapsulation refers to an independent system that realizes the separation of read and write operations and the management of database server connections. The middleware provides a SQL-compatible protocol to the business server, and the business server does not need to separate reading and writing by itself .

For the business server, there is no difference between accessing the middleware and accessing the database. In fact, from the perspective of the business server, the middleware is a database server.

Its basic structure is:

image-20230820104205071

56. Do you understand the principle of master-slave replication?

  • The master writes the data and updates the binlog.
  • The master creates a dump thread to push the binlog to the slave.
  • When the slave connects to the master, it creates an I/O thread to receive the binlog and record it in the relay log.
  • The slave then starts a SQL thread to read and replay the relay log events, completing the synchronization; the slave also records its own binlog.

image-20230820104346097

57. How to deal with master-slave synchronization delay?

Reasons for master-slave synchronization delay:

The master serves N client connections, so it can process a large number of concurrent update operations, while the slave has only one thread to replay the binlog. When some SQL takes slightly longer to execute on the slave, or some SQL has to lock a table, a large backlog of SQL accumulates on the master that has not yet been synchronized to the slave. This produces master-slave inconsistency, i.e. master-slave delay.

Solution to master-slave synchronization delay:

There are several common ways to solve master-slave replication delays:

  1. Reads that follow a write are routed to the database master

For example, after an account is registered, the read that loads the account at login is also sent to the master. This approach is strongly coupled to the business and quite intrusive: if a new programmer does not know that such reads must be written this way, bugs will follow.

  2. Read the master again after a read from the slave fails.

This is what is usually called a "secondary read". Secondary reads are not coupled to the business; they only require wrapping the underlying database access API, so the implementation cost is small. The drawback is that if secondary reads are frequent, they greatly increase the read pressure on the master. For example, if a hacker brute-forces accounts, the resulting flood of secondary reads may exceed the master's read capacity and bring it down.

  3. All reads and writes of key businesses go to the master, and read-write separation is applied only to non-critical businesses.

For example, in a user management system the registration and login business sends all reads and writes to the master, while the user's introduction, hobbies, level, and similar data use read-write separation: even if a user changes their self-introduction and still sees the old one when querying right afterwards, the business impact is far smaller than being unable to log in, and is tolerable.

58. How do you usually divide the database?

  • Vertical database splitting : Based on tables, different tables are split into different databases according to different business affiliations.

  • Horizontal database splitting : Based on fields, data in one database is split into multiple databases according to certain strategies (hash, range, etc.).

59. So how do you divide the tables?

  • Horizontal table splitting: Split the data in one table into multiple tables based on fields and certain strategies (hash, range, etc.).
  • Vertical table splitting: Based on the fields and according to the activity of the fields, the fields in the table are split into different tables (main table and extended table).

60. What are the routing methods for horizontal table sharding?

What is routing? It is deciding which sub-table a row of data should be placed in.

There are three main routing methods for horizontal table sharding:

  • Range routing : Select ordered data columns (for example, integer, timestamp, etc.) as routing conditions, and different segments are dispersed into different database tables.

We can observe some payment systems and find that we can only check payment records within a year. This may be because the payment company has divided the records according to time.

image-20230820105809461

The complexity of range routing design is mainly reflected in the selection of segment size. If a segment is too small, it will lead to too many sub-tables after segmentation, which increases maintenance complexity; if a segment is too large, there may still be performance problems in a single table. It is generally recommended that the segment size is between 1 million and 20 million. The appropriate segment size needs to be selected based on the specific business.

The advantage of range routing is that new tables can be added smoothly as data grows. For example, if the current user count is 1 million and grows to 10 million, you only need to add a new table; the existing data does not move. A relatively hidden disadvantage of range routing is uneven distribution: if tables are split at 10 million rows each, one segment may actually store only 1,000 rows while another stores 9 million.

  • Hash routing : Select the value of a certain column (or a combination of certain columns) for Hash operation, and then distribute it to different database tables based on the Hash result.

Taking the order ID as an example, if we plan for 4 sub-tables from the start, the routing algorithm can simply use id % 4 as the sub-table number: the order with id 12 goes to sub-table 0 (12 % 4 = 0), and the order with id 13 goes to sub-table 1 (13 % 4 = 1).

image-20230820105918225

The design complexity of hash routing lies mainly in choosing the initial number of tables: too many tables are troublesome to maintain, too few may cause single-table performance problems. Once hash routing is in use, increasing the number of sub-tables is very troublesome, because all the data must be redistributed. The pros and cons of hash routing are roughly the opposite of range routing: its advantage is that data is distributed fairly evenly across tables; its disadvantage is that expanding with new tables is painful, since all data must be redistributed.
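A sketch of the id % 4 routing above (the sub-table names t_order_0 … t_order_3 are hypothetical):

```sql
SELECT 12 % 4;  -- 0: order 12 lives in t_order_0
SELECT 13 % 4;  -- 1: order 13 lives in t_order_1

-- The application computes the sub-table number first, then queries that table:
SELECT * FROM t_order_0 WHERE id = 12;
```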

  • Configured routing: routing information is kept in a routing table, i.e. an independent table records which sub-table each row lives in. Taking the order id as an example, we add a new order_router table with two columns, order_id and table_id, and the corresponding table_id can be looked up by order_id.

Configured routing is simple in design and very flexible to use, especially when expanding tables: you only need to migrate the specified data and then update the routing table.
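A sketch of such a routing table (all names are hypothetical):

```sql
CREATE TABLE order_router (
  order_id BIGINT PRIMARY KEY,
  table_id INT NOT NULL
);

-- Look up which sub-table holds the order, then query that sub-table:
SELECT table_id FROM order_router WHERE order_id = 12;
```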

image-20230820110145704

The disadvantage of configured routing is that every query requires an extra lookup, which affects overall performance; and if the routing table itself grows huge (say, hundreds of millions of rows), it may itself become the bottleneck. If we then shard the routing table in turn, we run into an endless regress of routing-algorithm choices.

61. How to achieve capacity expansion without downtime?

In fact, expansion without downtime is a very troublesome and risky operation in practice. Of course, the interview is much simpler to answer.

  • The first stage: online double writing, query the old database

    1. Create the new databases and table structures. When data is written to the old database, it is also written to the new, split databases.
    2. Migrate data: use a data migration program to move the historical data in the old database to the new one.
    3. Use scheduled tasks to compare the data in the old and new databases and patch up the differences.

  • The second stage: online double writing, querying the new database

  1. Historical data synchronization and verification are complete
  2. Switch data reads over to the new database

  • The third stage: the old database is offline
    1. The old database will no longer write new data
    2. After a period of time, after confirming that there are no requests from the old database, the old database can be offline.

62. What are the commonly used sub-database and sub-table middleware?

  • sharding-jdbc
  • Mycat

63. So what problems do you think the sub-database and sub-table will bring?

From the perspective of sub-library:

  • Transaction issues

    • A big advantage of using a relational database is that it guarantees transactional integrity.
    • After splitting into multiple databases, single-machine transactions no longer apply, and distributed transactions must be used instead.
  • Cross-database JOIN problem

    • When we are in one database, we can also use JOIN to query connected tables, but after crossing databases, we cannot use JOIN.
    • The solution at this time is to perform correlation in the business code , that is, first check the data of one table, then check another table through the obtained results, and then use the code to correlate to get the final result. This method is slightly more complicated to implement, but it is acceptable.
    • Some fields can also be appropriately made redundant. For example, a table previously stored only a related ID, but the business often needs the corresponding Name or other fields. These fields can be redundantly added to the current table so that the query no longer needs the association.
    • Another way is data heterogeneity . Through binlog synchronization and other methods, the data that needs cross-database join is heterogeneous into a storage structure such as ES, and is queried through ES.

From a sub-table perspective:

  • Cross-node count, order by, group by and aggregate function issues

    • It can only be implemented by business code or by using middleware to summarize, sort, page and return the data in each table.
  • Data migration, capacity planning, expansion and other issues

    • Data migration, how to plan capacity, whether capacity expansion may be needed again in the future, etc., are all issues that need to be considered.
  • ID question

    • After the database table is split, it can no longer rely on the database's own primary key generation mechanism, so some means are needed to ensure that the global primary key is unique.
    1. Keep auto-increment, but set the auto-increment step. For example, with three tables, set the step to 3 and the initial ID values to 1, 2, and 3; the first table then generates 1, 4, 7, the second 2, 5, 8, and the third 3, 6, 9, so IDs never collide (see the sketch after this list).
    2. UUID. This is the simplest option, but non-sequential primary key inserts cause severe page splits and poor performance.
    3. Distributed IDs. The best known is Twitter's open-source Snowflake algorithm.
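Option 1 can be configured with MySQL's built-in system variables auto_increment_increment and auto_increment_offset (a sketch; in production these are usually set per server in my.cnf):

```sql
-- On server/table 1 (generates 1, 4, 7, ...):
SET @@auto_increment_increment = 3;  -- step size
SET @@auto_increment_offset = 1;     -- starting value
-- Server 2 uses offset 2 (2, 5, 8, ...), server 3 uses offset 3 (3, 6, 9, ...).
```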

Operation and maintenance

64. How do you delete data at the million-row level and above?

About indexes: an index carries extra maintenance costs, because the index file is a separate file. Every insert, update, or delete also triggers extra operations on the index files, and these operations consume additional IO, lowering the efficiency of inserts, updates, and deletes.

So when we want to delete millions of rows from the database, the MySQL official manual tells us that deletion speed is directly proportional to the number of indexes on the table.

  1. So when we want to delete millions of rows, we can drop the indexes first
  2. Then delete the useless data
  3. Re-create the indexes after the deletion is complete; index creation is also very fast (see the sketch below)
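A sketch of the three steps, assuming a hypothetical t_log table with an index idx_created:

```sql
ALTER TABLE t_log DROP INDEX idx_created;           -- 1. drop the index first
DELETE FROM t_log WHERE created < '2022-01-01';     -- 2. delete the useless data
ALTER TABLE t_log ADD INDEX idx_created (created);  -- 3. re-create the index
```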

65. How to add fields to a large table with millions of levels?

When the amount of online database data reaches millions or tens of millions, adding a field is not that simple because the table may be locked for a long time.

Adding fields to a large table usually involves the following methods:

Convert through the intermediate table:

  • Create a new temporary table that fully copies the old table's structure, add the field to it, copy the data over from the old table, drop the old table, and rename the new table to the old table's name (see the sketch below). Note that this method may lose data written during the copy.
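A sketch of the intermediate-table conversion, assuming a hypothetical t_user table gaining an age column:

```sql
CREATE TABLE t_user_new LIKE t_user;          -- copy the old table's structure
ALTER TABLE t_user_new ADD COLUMN age INT;    -- add the new field
INSERT INTO t_user_new (id, name)             -- copy the data (columns are illustrative)
  SELECT id, name FROM t_user;
RENAME TABLE t_user TO t_user_old,            -- atomically swap the names
             t_user_new TO t_user;
DROP TABLE t_user_old;
-- Writes arriving between the copy and the rename are lost, hence the caveat above.
```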

Use pt-online-schema-change:

  • pt-online-schema-change is a tool developed by Percona Company. It can modify the table structure online. Its principle is also through intermediate tables.

Add the field on the slave first, then switch master and slave:

  • If a table has a large amount of data and is a hot table (reading and writing are particularly frequent), you can consider adding it to the slave database first, then performing a master-slave switch, and then adding fields to several other nodes after the switch.

66. What should be done if the MySQL database CPU surges?

Investigation process:

(1) Use the top command to determine whether the CPU is being consumed by mysqld or by something else.

(2) If it is mysqld, run SHOW PROCESSLIST to check the session states and determine whether any resource-intensive SQL is running.

(3) Find the high-cost SQL statements and check whether the execution plan is accurate, whether an index is missing, and whether the data volume is simply too large.
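For steps (2) and (3), the standard commands look like this (the t_order table and the thread id are hypothetical):

```sql
SHOW FULL PROCESSLIST;                            -- find long-running, resource-heavy sessions
EXPLAIN SELECT * FROM t_order WHERE user_id = 42; -- check a suspect query's execution plan
KILL 12345;                                       -- kill the offending thread by its Id
```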

Handling:

(1) Kill these threads (and watch whether CPU usage drops);

(2) Make the corresponding adjustments (such as adding indexes, rewriting the SQL, or tuning memory parameters);

(3) Re-run the SQL.

Other cases:

It is also possible that no single SQL statement consumes many resources, but a large number of sessions suddenly connect, causing the CPU to spike. In that case, work with the application team to analyze why the connection count surged, and then adjust accordingly, for example by limiting the number of connections.

Source: "Counterattack of the Interview Underdog: MySQL Sixty-Six Questions, 20,000 Words + Fifty Pictures, Explained in Detail!"

Origin blog.csdn.net/weixin_45483322/article/details/132390048