MySQL Interview Questions (The Most Complete Collection)

1. What are the three normal forms of database design?

What is a normal form?
A normal form is a set of rules followed in database design; different levels of normalization impose different requirements.

The three most commonly used normal forms

  • First Normal Form (1NF): attributes are indivisible; that is, each attribute is an atomic item that cannot be divided any further. (An attribute of the entity corresponds to a column in the table.)

  • Second Normal Form (2NF): it satisfies the first normal form, and there is no partial dependency; that is, every non-key attribute must fully depend on the primary key. (Full dependence matters for composite primary keys: a non-key column must not depend on only part of the primary key.)

  • Third Normal Form (3NF): it satisfies the second normal form, and there is no transitive dependency; that is, non-key attributes cannot depend on other non-key attributes, and every non-key attribute must depend on the primary key directly rather than indirectly. (If A -> B and B -> C, then A -> C is a transitive dependency.)

Examples of the three normal forms:
1NF
Attributes cannot be divided any further; that is, each column in the table cannot be split into smaller columns.

The following student information form (student):

id, name, sex_code (gender code), sex_desc (gender description), contact (contact information)

primary key (id)
If the student's phone number is frequently queried on its own, the contact column should be split into two columns, phone and address, so that the table conforms to the first normal form.

After modifying the table to satisfy 1NF:


Whether a table conforms to the first normal form, i.e. whether a column can be divided further, depends on the requirements. If the phone number and address must be queried separately, the earlier design does not meet 1NF; if storing them concatenated in one field still satisfies the query and storage requirements, then that design already meets 1NF.


2NF: on the premise of satisfying 1NF, there is no partial dependency in the table; every non-key column must fully depend on the primary key. (This mainly matters for composite primary keys: a non-key column must not depend on only part of the key.)

The following student score table (score):

stu_id (student id), kc_id (course id), score (score), kc_name (course name)

primary key (stu_id, kc_id)

In this table, kc_name depends only on kc_id, which is just part of the composite primary key, so there is a partial dependency and the table violates 2NF.

Splitting the original score table into a score table (score: stu_id, kc_id, score) and a course table (kc: kc_id, kc_name, with primary key kc_id) removes the partial dependency, and both tables conform to 2NF.
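
A minimal DDL sketch of this 2NF split, with assumed (hypothetical) column types:

    -- Score table: the remaining non-key column (score) depends on the whole composite key
    CREATE TABLE score (
        stu_id INT NOT NULL,            -- student id
        kc_id  INT NOT NULL,            -- course id
        score  DECIMAL(5,2),
        PRIMARY KEY (stu_id, kc_id)
    );

    -- Course table: kc_name now depends only on its own primary key
    CREATE TABLE kc (
        kc_id   INT NOT NULL,
        kc_name VARCHAR(50),
        PRIMARY KEY (kc_id)
    );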

3NF:
Under the premise of satisfying 2NF, there is no transitive dependency. (A -> B, B -> C, A -> C)

The following student table (student), primary key (id):

id, name, sex_code, sex_desc, phone, address

Here sex_desc depends on sex_code, and sex_code depends on id (the primary key), so sex_desc depends on id only transitively: it depends on the primary key through a non-key column rather than directly. This is a transitive dependency and does not conform to 3NF.

After modifying the table to satisfy 3NF:

Student table (student), primary key (id): id, name, sex_code, phone, address

Gender code table (sexcode), primary key (sex_code): sex_code, sex_desc

After splitting the original student table, both tables satisfy 3NF.
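
A minimal DDL sketch of the 3NF split, again with assumed column types:

    -- Student table: every non-key column depends directly on id
    CREATE TABLE student (
        id       INT NOT NULL,
        name     VARCHAR(50),
        sex_code TINYINT,
        phone    VARCHAR(20),
        address  VARCHAR(100),
        PRIMARY KEY (id)
    );

    -- Gender code table: sex_desc depends only on sex_code
    CREATE TABLE sexcode (
        sex_code TINYINT NOT NULL,
        sex_desc VARCHAR(10),
        PRIMARY KEY (sex_code)
    );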

What kind of table conforms to 3NF more easily?
Tables with fewer non-key columns. (1NF is about whether columns can be divided further; 2NF and 3NF are about the relationship between the non-key columns and the key columns.)

For example, in the gender code table (sexcode) the only non-key column is sex_desc;

Or, if the primary key of the student table were defined as primary key (id, name, sex_code, phone), the only non-key column would be address, which makes 3NF easier to satisfy.

ps:

In addition to the three normal forms there are also BCNF (Boyce-Codd Normal Form) and Fourth Normal Form (4NF), but they are so strict that they are rarely applied in production.

2. What are normalization and denormalization, and what are their advantages and disadvantages?

A normal form is a set of relational schemas that meet a given level of requirements; building a database has to follow certain rules, and in a relational database those rules are the normal forms.
Normalizing reduces data redundancy and avoids insert, update, and delete anomalies, but queries tend to require more joins; denormalizing adds controlled redundancy to cut down on joins and speed up reads, at the cost of extra storage and the risk of inconsistency when data is updated.
Therefore, in day-to-day work we usually combine normalization and denormalization.

3. Index

1. How many types or categories of indexes are there?

  • In terms of physical structure, indexes can be divided into clustered and non-clustered indexes:

    • A clustered index is one in which the logical order of the index key values matches the physical order of the corresponding rows in the table; each table can therefore have only one clustered index, which is what we usually call the primary key index;
    • In a non-clustered index, the logical order of the index does not match the physical order of the data rows.
  • In terms of application, indexes can be divided into the following categories (creation examples follow this list):

    • Ordinary index: the basic index type in MySQL, with no restrictions; the indexed column may contain duplicate values and NULLs, and its only purpose is to speed up queries. Created with ALTER TABLE table_name ADD INDEX index_name (column);
    • Unique index: the values in the indexed column must be unique, but NULLs are allowed. Created with ALTER TABLE table_name ADD UNIQUE index_name (column);
    • Primary key index: a special unique index that is also the clustered index; it does not allow NULLs and is created by the database automatically;
    • Composite index: an index built on multiple columns of a table; lookups follow the leftmost-prefix matching rule;
    • Full-text index: historically only available on the MyISAM engine (InnoDB supports it since MySQL 5.6), and only on CHAR, VARCHAR, and TEXT columns.
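
For reference, creation statements for each category on a hypothetical table t (table and column names are assumptions):

    -- Ordinary index
    ALTER TABLE t ADD INDEX idx_name (name);
    -- Unique index
    ALTER TABLE t ADD UNIQUE idx_email (email);
    -- Primary key index
    ALTER TABLE t ADD PRIMARY KEY (id);
    -- Composite index: the leftmost-prefix rule applies to (province, city)
    ALTER TABLE t ADD INDEX idx_province_city (province, city);
    -- Full-text index on a CHAR/VARCHAR/TEXT column
    ALTER TABLE t ADD FULLTEXT INDEX ft_content (content);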

2. What are the advantages and disadvantages of indexing?

Let me talk about **advantages first:** Creating an index can greatly improve the performance of the system.

  • By creating a unique index, the uniqueness of each row of data in the database table can be guaranteed.

  • It can greatly speed up the retrieval of data, which is the main reason for creating indexes.

  • It can speed up joins between tables, which is particularly meaningful for enforcing referential integrity between them.

  • When using grouping and sorting clauses for data retrieval, it can also significantly reduce the time for grouping and sorting in queries.

  • Indexes allow the query optimizer to choose better execution plans during query processing, improving overall system performance.

Since there are so many advantages to adding an index, why not create an index for every column in the table? This is because indexes also have disadvantages :

  • Creating and maintaining indexes takes time, and this time increases as the amount of data increases, which reduces the speed of data maintenance.

  • Indexes need to occupy physical space. In addition to data tables occupying data space, each index also occupies a certain amount of physical space. If you want to build a clustered index, the space required will be even larger.

3. Principles of index design

  • Select a unique index;

    • The value of the unique index is unique, and a certain record can be determined more quickly through the index.
  • Create indexes for fields that are often used as query conditions;

    • If a field is often used as a query condition, the query speed of this field will affect the query speed of the entire table. Therefore, indexing such fields can improve the query speed of the entire table.
  • Build indexes for fields that often require sorting, grouping, and union operations;

    • For fields that often require operations such as ORDER BY, GROUP BY, DISTINCT, and UNION, the sorting operation will waste a lot of time. If you index it, you can effectively avoid sorting operations.
  • Limit the number of indexes;

    • Each index takes disk space: the more indexes, the more disk space is needed, and rebuilding and updating indexes when the table is modified becomes troublesome.
  • Indexing is not recommended for small tables (for example, tables within about a million rows);

    • With so little data, a query may take less time scanning the table than traversing the index, so the index may bring no benefit.
  • Prefer indexes on short values;

    • If the indexed values are very long, query speed suffers; in that case consider a prefix index (see the sketch after this list).
  • Drop indexes that are no longer used or are rarely used.
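
A small sketch of a prefix index on a hypothetical long varchar column: only the first 10 characters are indexed, which keeps the index small while staying selective enough.

    -- Check how selective a 10-character prefix would be (closer to 1 is better)
    SELECT COUNT(DISTINCT LEFT(email, 10)) / COUNT(*) AS prefix_selectivity FROM t;

    -- Index only the prefix instead of the whole column
    ALTER TABLE t ADD INDEX idx_email_prefix (email(10));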

4. What is the data structure of the index?

The data structure of the index is related to the implementation of the specific storage engine. Hash and B+ tree indexes are commonly used in MySQL .

  • A hash index is backed by a hash table: a query applies the hash function to the key to locate the corresponding entry (bucket address), and then goes back to the table to fetch the actual row.
  • A B+ tree index is implemented as a multi-way balanced search tree. Every lookup starts from the root node and must descend to a leaf node to obtain the key value, and then the query decides whether it also needs to go back to the table.

The difference between Hash and B+ tree index

  • Hash

    • Hash is faster for equality queries, but it cannot do range queries: after the hash function is applied, the order of the index no longer matches the original key order, so range queries are not supported, and neither is sorting by the index.
    • Hash does not support fuzzy queries or leftmost-prefix matching on multi-column indexes, because the hash value is unpredictable: for example, the hashes of AA and AB have no relation to each other.
    • A hash index always requires a lookup back to the table for the actual data.
    • Although equality lookups are fast, performance is not stable: when a key value has many duplicates, hash collisions occur and query efficiency may drop.
  • B+ Tree

    • B+ tree is essentially a search tree, which naturally supports range query and sorting.
    • When certain conditions (clustered index, covering index, etc.) are met, the query can be completed only through the index without returning to the table.
    • Query efficiency is relatively stable, because every query descends from the root node to a leaf node, so its cost is bounded by the height of the tree.

5. Why use B+ tree instead of B tree as index?

5.1. Let's first understand the difference between B+ tree and B tree:

  • In a B-tree, both internal (non-leaf) and leaf nodes store data, so a lookup can finish early: the best case is O(1) and the worst case O(log n). In a B+ tree, data is stored only in leaf nodes and the non-leaf nodes store only keys (a key may appear again at lower levels), so every lookup costs O(log n).

  • The leaf nodes of a B+ tree are chained together in a linked list, so a full traversal only needs to scan the leaf level, whereas a B-tree must be traversed in order over the whole tree.

5.2. Why is B+ tree more suitable for database index than B tree?

  • The B+ tree reduces the number of IOs.

    • Index files are large and live on disk. Because the non-leaf nodes of a B+ tree store only keys and no data, a single page can hold more keys, which means more of the keys needed for a search are read into memory at once and fewer random disk I/Os are required.
  • B+ tree query efficiency is more stable

    • Since the data only exists on the leaf nodes, the search efficiency is fixed at O(log n), so the query efficiency of the B+ tree is more stable than that of the B tree.
  • B+ trees are better suited for range lookups

    • The leaf nodes of a B+ tree are linked together in key order, so scanning all of the data only requires one pass over the leaf level, which is convenient for full scans and range queries; since a B-tree also stores data in non-leaf nodes, it can only be scanned in order by an in-order traversal. In short, B+ trees are more efficient for range queries and ordered traversal.

6. What is a covering index?

A covering index means that a query can be answered entirely from the index, without reading rows from the data table; in other words, the index covers the query. If an index contains (covers) all of the columns needed by the query's select list and conditions, it is a covering index for that query. When a query hits a covering index, SQL returns the required data directly from the index, avoiding the lookup back to the table after the index search, which reduces I/O and improves efficiency.
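
A hedged example on a hypothetical user table: a composite index on (age, name) covers a query that filters on age and selects only name, so EXPLAIN typically reports Using index in the Extra column.

    ALTER TABLE user ADD INDEX idx_age_name (age, name);

    -- Both the WHERE column and the selected column are contained in the index,
    -- so the query is answered from the index alone, with no lookup back to the table.
    EXPLAIN SELECT name FROM user WHERE age = 20;
    -- Extra: Using index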

7. What is index pushdown?

Index condition pushdown (ICP) is a query-optimization technique introduced in MySQL 5.6.

In the case of not using index pushdown, when using a non-primary key index for query, the storage engine retrieves the data through the index, and then returns it to the MySQL server, and the server judges whether the data meets the conditions.

With index condition pushdown, if some of the filter conditions refer to indexed columns, the MySQL server passes those conditions down to the storage engine; the storage engine evaluates them against the index entries and only fetches a row and returns it to the server when the conditions are satisfied.

Index condition push-down optimization can reduce the number of times the storage engine queries the underlying table, and can also reduce the number of times the MySQL server receives data from the storage engine.
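
A hedged illustration on a hypothetical people table with a composite index on (zipcode, lastname): the LIKE condition cannot be used to navigate the index, but with ICP the storage engine evaluates it against the index entries before fetching any row, and EXPLAIN typically shows Using index condition.

    ALTER TABLE people ADD INDEX idx_zip_lastname (zipcode, lastname);

    EXPLAIN SELECT * FROM people
    WHERE zipcode = '95054' AND lastname LIKE '%son%';
    -- Extra: Using index condition

    -- ICP is on by default and can be toggled through the optimizer switch:
    SET optimizer_switch = 'index_condition_pushdown=off';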

4. Storage

4.1. What common storage engines are there?

  1. MyISAM: the engine MySQL provided earliest. It comes in static, dynamic, and compressed variants. Whatever the variant, MyISAM tables do not support transactions, row-level locks, or foreign key constraints.
  2. MyISAM Merge engine: a variant of MyISAM. A Merge table merges several identical MyISAM tables into one virtual table. Often used for logs and data warehousing.
  3. InnoDB: can be regarded as a more capable successor to MyISAM; it provides transactions, row-level locking, and foreign key constraints, and is currently MySQL's default storage engine.
  4. Memory (heap): tables of this type exist only in memory. It uses hash indexes, so data access is very fast; because it lives in memory, it is often used for temporary tables.
  5. Archive: only supports SELECT and INSERT statements and does not support indexes. Often used for logging and aggregation analysis.

4.2. What is the difference between MyISAM and InnoDB?

1) InnoDB supports transactions, but MyISAM does not.

2) InnoDB supports foreign keys, but MyISAM does not. Therefore converting an InnoDB table with foreign keys to a MyISAM table will fail.

3) Both InnoDB and MyISAM support the index of B+ Tree data structure. But InnoDB is a clustered index, while MyISAM is a non-clustered index.

4) InnoDB does not save the number of data rows in the table, and a full table scan is required when executing select count(*) from table. However, MyISAM uses a variable to record the number of rows in the entire table, and the speed is quite fast (note that there cannot be a WHERE clause).

**Then why doesn't InnoDB keep such a variable?** Because of InnoDB's transactional nature, different transactions can see different row counts for the same table at the same moment.

5) InnoDB supports table and row (default) level locks, while MyISAM supports table level locks.

InnoDB's row locks are implemented based on indexes, not on physical row records. That is, if the access does not hit the index, the row lock cannot be used, and it will degenerate into a table lock.

6) InnoDB requires a clustering key: if no primary key or suitable unique index is specified, it automatically generates a hidden row_id column to serve as the default primary key, whereas MyISAM tables can have no primary key at all.

4.3. Four characteristics of InnoDB engine

  • Insert buffer (change buffer)
  • Double write
  • Adaptive hash index (AHI)
  • Read ahead

4.4. Why does InnoDB recommend using auto-increment primary keys?

An auto-increment ID guarantees that every insert extends the B+ tree on the right-hand side, so compared with custom IDs such as UUIDs it avoids frequent page splits and merges. With a string or otherwise random primary key, rows are inserted at random positions in the tree and efficiency is noticeably worse.
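
A minimal sketch of declaring an auto-increment primary key (table and columns are hypothetical):

    CREATE TABLE orders (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- new rows always append at the right of the B+ tree
        order_no   VARCHAR(32),
        created_at DATETIME,
        PRIMARY KEY (id)
    ) ENGINE = InnoDB;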

4.5. Storage structure

4.5.1. What are InnoDB pages, regions, and segments?

  • Page

    • First, InnoDB divides the physical disk into pages. The default size of each page is 16 KB. A page is the smallest storage unit. Pages are divided into many formats according to the needs of upper-layer applications, such as indexes and logs. We mainly talk about data pages, that is, pages that store actual data.
  • Area (Extent)

    • If pages were the only unit, there would be an enormous number of them, and allocating and reclaiming storage space would be cumbersome because the status of so many individual pages would have to be tracked.
    • Therefore InnoDB introduces the concept of an extent. By default an extent consists of 64 consecutive pages, i.e. 1 MB. Allocating and reclaiming storage space by extent is much easier.
  • Segment

    • Why introduce segments? It starts with indexes: the purpose of an index is to speed up lookups, a classic trade of space for time.
    • The leaf nodes of the B+ tree store the actual data, and the non-leaf nodes are index pages. The B+ tree therefore splits the data into two parts, the leaf-node part and the non-leaf-node part, and that is what segments are for: each index in InnoDB creates two segments to store these two parts of the data.
    • Segment is a logical organization, and its hierarchical structure is Segment, Extent, and Page from top to bottom.

5. Transactions

5.1. What are the four characteristics of transactions (ACID)?

  • Atomicity: all operations in a transaction either complete in full or do not happen at all; the transaction never stops halfway. If an error occurs during execution, the database is rolled back to the state it was in before the transaction started, as if the transaction had never run.
  • Consistency: the integrity of the database is preserved both before the transaction begins and after it ends. All written data must satisfy every predefined rule, including accuracy and constraints, so that the database can continue its subsequent work correctly.
  • Isolation: the database allows multiple transactions to read and write its data concurrently. Isolation prevents the data inconsistencies that would arise if concurrent transactions interleaved their operations. Isolation comes in levels: read uncommitted, read committed, repeatable read, and serializable.
  • Durability: once a transaction has finished, its modifications to the data are permanent and will not be lost even if the system fails.

5.2. Transaction concurrency issues?

Dirty reads, phantom reads, and non-repeatable reads.

5.3. What are dirty reads, phantom reads and non-repeatable reads

  • Dirty read : One transaction reads data that has not been committed by another transaction. Transaction A reads the data updated by transaction B, and then B rolls back the operation, then the data read by A is dirty data.

  • Non-repeatable read : The content of the data read twice in a transaction is inconsistent. Transaction A reads the same data multiple times, and transaction B updates and submits the data during the multiple reads of transaction A, resulting in inconsistent results when transaction A reads the same data multiple times.

  • Phantom read: the number of rows read twice within one transaction differs. System administrator A changes all students' grades in the database from numeric scores to ABCDE grades, but administrator B inserts a new record with a numeric score at the same time. When administrator A finishes, they find one record still unchanged, as if they had been hallucinating; this is called a phantom read.

Non-repeatable reads and phantom reads are easy to confuse: a non-repeatable read is about modification, while a phantom read is about insertion or deletion. To prevent non-repeatable reads it is enough to lock the rows that match the condition; to prevent phantom reads the whole range has to be locked (for example with gap/next-key locks, or by locking the table).

5.4. What are the transaction isolation levels?

MySQL defines four isolation levels; from lowest to highest they differ in which read anomalies they allow:

  • Read uncommitted: dirty reads, non-repeatable reads, and phantom reads are all possible.
  • Read committed: prevents dirty reads; non-repeatable reads and phantom reads are still possible.
  • Repeatable read: prevents dirty reads and non-repeatable reads; phantom reads are still possible.
  • Serializable: prevents dirty reads, non-repeatable reads, and phantom reads.

Serializable is the highest isolation level and read uncommitted the lowest; the higher the level, the lower the execution efficiency, so the level should be chosen according to the actual situation.

MySQL supports all four isolation levels above, and the default is Repeatable Read; Oracle supports only Serializable and Read Committed, with Read Committed as the default.
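
For reference, the current level can be inspected and changed per session (the system variable is transaction_isolation in MySQL 8.0, tx_isolation in older versions):

    -- MySQL 8.0 (use @@tx_isolation on 5.7 and earlier)
    SELECT @@transaction_isolation;

    -- Change the level for the current session only
    SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;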

6. Locks

6.1. What are the functions of database locks and what kind of locks are there?

When the database has concurrent transactions, data inconsistencies may occur. At this time, some mechanisms are needed to ensure the order of access. The lock mechanism is such a mechanism. That is, the role of the lock is to solve the concurrency problem.

  • From the granularity of locks , locks can be divided into table locks, row locks, and page locks.

    • Row-level lock : It is a kind of lock with the finest locking granularity, which means that only the row currently being operated is locked. Row-level locks can greatly reduce conflicts in database operations. Its locking granularity is the smallest, but the locking overhead is also the largest.
      Row-level locks are expensive, slow to lock, and deadlocks may occur. But the locking granularity is the smallest, the probability of lock conflicts is the lowest, and the concurrency is the highest.

    • Table-level lock : It is a kind of lock with the largest granularity, which means to lock the entire table currently being operated. It is simple to implement, consumes less resources, and is supported by most MySQL engines.

    • Page-level lock : It is a lock with a granularity between row-level locks and table-level locks. Table-level locks are fast, but have more conflicts, and row-level locks have fewer conflicts, but are slower. Therefore, a compromised page level is taken to lock a group of adjacent records at a time. The overhead and locking time are between table locks and row locks, and deadlocks will occur. The locking granularity is between table locks and row locks, and the concurrency is average.

  • From the nature of use , it can be divided into shared locks, exclusive locks and update locks.

    • Shared lock (Share Lock): S lock , also known as read lock , is used for all read-only data operations.
      S locks are not exclusive, and multiple concurrent transactions are allowed to lock the same resource, but X locks are not allowed while adding S locks, that is, resources cannot be modified. The S lock is usually released immediately after the end of the read, without waiting for the end of the transaction.

    • Exclusive Lock : X lock , also known as write lock , means to write data.
      The X lock only allows one transaction to lock the same resource, and it will not be released until the end of the transaction. Any other transaction must wait until the X lock is released to access the page. Use the select * from table_name for update; statement to generate an X lock.

    • Update lock: U lock, used to reserve an X lock on a resource; other transactions may still read the resource, but may not place another U lock or an X lock on it.
      When the page that was read is about to be updated, the U lock is upgraded to an X lock; a U lock is not released until the end of the transaction. U locks therefore help avoid the deadlocks that can arise from upgrading shared locks.

  • Subjectively divided, it can be divided into optimistic lock and pessimistic lock.

    • Optimistic Lock: as the name suggests, it assumes the resource will not be modified concurrently, so data is read without locking; only at update time is a version-number mechanism used to check whether the resource was modified in the meantime (a version-column sketch follows this list).
      Optimistic locking suits read-heavy applications and can improve system throughput.

    • Pessimistic Lock: as the name suggests, it is strongly exclusive; every read assumes the data may be modified by other transactions, so every operation takes a lock first.
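
A common version-column sketch of optimistic locking (table and column names are hypothetical):

    -- Read the row together with its current version, without taking any lock
    SELECT balance, version FROM account WHERE id = 1;   -- suppose this returns version = 5

    -- Update only if nobody changed the row in the meantime; if another transaction
    -- bumped the version first, 0 rows are affected and the application retries or reports a conflict
    UPDATE account
    SET balance = balance - 100, version = version + 1
    WHERE id = 1 AND version = 5;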

6.2. What is the relationship between isolation level and lock?

1) At the Read Uncommitted level, reading data does not require a shared lock, so that it will not conflict with the exclusive lock on the modified data;

2) At the Read Committed level, the read operation requires a shared lock, but the shared lock is released after the statement is executed;

3) At the Repeatable Read level, the read operation needs to add a shared lock, but the shared lock is not released before the transaction is committed, that is, the shared lock must be released after the transaction is completed;

4) At the SERIALIZABLE level, it is the most restrictive, because this level locks the entire range of keys and holds the lock until the transaction completes.

6.3. Locking algorithm in InnoDB?

  • Record lock : a lock on a single row record
  • Gap lock : Gap lock, locking a range, excluding the record itself
  • Next-key lock : record + gap locks a range, including the record itself

6.4. What are snapshot read and current read?

Snapshot read is to read the snapshot data, and the simple Select without lock belongs to the snapshot read.

SELECT * FROM player WHERE ...

The current read is to read the latest data, not the historical data. Locked SELECT, or adding, deleting, and modifying data will all be read currently.

SELECT * FROM player LOCK IN SHARE MODE;
SELECT * FROM player FOR UPDATE;
INSERT INTO player values ...
DELETE FROM player WHERE ...
UPDATE player SET ...

6.5. What is MVCC and its implementation?

The full English name of MVCC is Multiversion Concurrency Control, which means multi-version concurrency control in Chinese, which can prevent reading and writing from blocking each other, and is mainly used to improve concurrency efficiency when solving non-repeatable reading and phantom reading problems.

The principle is to realize the concurrency control of the database through the management of multiple versions of the data row. Simply put, it is to save the historical version of the data. You can determine whether the data is displayed by comparing the version number. There is no need to lock when reading data to ensure the isolation effect of the transaction.

7. View

7.1. Why use views? What is a view?

  • To improve the reusability of complex SQL statements and the safety of table operations, MySQL provides views. A view is essentially a virtual table that does not exist physically; its content looks like a real table, with named columns and rows of data, but that data is not stored in the database as part of the view. The rows and columns come from the base tables referenced by the view's defining query and are generated dynamically whenever the view is referenced.
  • Views let developers focus only on the specific data and tasks they are responsible for: users can see only the data defined by the view rather than everything in the underlying tables, which improves data security in the database.
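
A minimal sketch over the hypothetical tables used earlier: the view exposes only the columns of interest and is queried like an ordinary table.

    CREATE VIEW v_student_score AS
    SELECT s.name, k.kc_name, sc.score
    FROM student s
    JOIN score sc ON sc.stu_id = s.id
    JOIN kc k ON k.kc_id = sc.kc_id;

    -- Rows are generated from the base tables at query time
    SELECT * FROM v_student_score WHERE score >= 60;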

7.2. What are the characteristics of views?

View features are as follows:

  • The columns of the view can come from different tables, which are the abstraction of the table and the new relationship established in the logical sense.
  • A view is a table (virtual table) generated from a basic table (real table).
  • The establishment and deletion of views does not affect the base table.
  • Updates (additions, deletions and modifications) to the contents of a view directly affect the underlying tables.
  • Adding and deleting data is not allowed when the view is from more than one base table.

View operations include creating a view, viewing a view, deleting a view, and modifying a view.

7.3. What are the usage scenarios of views?

The fundamental purpose of views is to simplify SQL queries and improve development efficiency; a secondary use is to stay compatible with old table structures.

The following are common usage scenarios for views:

  • Reuse SQL statements;
  • Simplify complex SQL operations. After writing a query, it can be easily reused without knowing its basic query details;
  • Use parts of a table rather than the entire table;
  • Protect data. Users can be granted access to specific parts of a table rather than to the entire table;
  • Change data format and presentation. Views can return data in a different representation and format than the underlying table.

7.4. Advantages of Views

  1. Simplified queries: views simplify user operations.
  2. Data security: views let users see the same data from different perspectives and provide a layer of protection for confidential data.
  3. Logical data independence: views provide a degree of logical independence when the database is restructured.

7.5. Disadvantages of Views

  1. Performance. The database must translate a query against the view into a query against the base tables. If the view is defined by a complex multi-table query, then even a simple query on the view becomes a complex combined query and takes extra time.

  2. Update restrictions. When a user tries to modify rows of a view, the database must translate that into modifications of rows in the base tables; the same applies to inserts and deletes through a view. This is straightforward for simple views, but more complex views may not be updatable.

    Such non-updatable views typically have these characteristics: 1. views with set operators such as UNION; 2. views with a GROUP BY clause; 3. views with aggregate functions such as AVG, SUM, or MAX; 4. views using the DISTINCT keyword; 5. views over joined tables (with some exceptions).

7.6. What is a cursor?

  • The cursor is a data buffer created by the system for the user to store the execution result of the SQL statement, and each cursor area has a name. Users can obtain records one by one through the cursor and assign them to the main variable, which will be further processed by the main language.

8. Stored procedures and functions

What are stored procedures?

  • A stored procedure is a precompiled collection of SQL statements. Its advantage is modular design: it only needs to be created once and can then be called repeatedly from programs. When an operation needs to execute multiple SQL statements, calling a stored procedure is faster than sending the individual SQL statements.
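
A minimal MySQL stored-procedure sketch (procedure, table, and parameter names are hypothetical):

    DELIMITER //
    CREATE PROCEDURE get_scores_by_student(IN p_stu_id INT)
    BEGIN
        SELECT kc_id, score FROM score WHERE stu_id = p_stu_id;
    END //
    DELIMITER ;

    -- Created once, then called any number of times:
    CALL get_scores_by_student(1);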

What are the pros and cons?

Advantages

  1. The stored procedure is pre-compiled and has high execution efficiency.
  2. The code of the stored procedure is directly stored in the database and called directly through the stored procedure name to reduce network communication.
  3. High security, users with certain permissions are required to execute stored procedures.
  4. Stored procedures can be reused, reducing the workload of database developers.

Disadvantages

  1. Debugging is troublesome, although tools such as PL/SQL Developer make debugging much easier and largely offset this drawback.
  2. Portability: stored-procedure code is inevitably tied to the database it was written for, but for a project that stays on one database this is rarely a real problem.
  3. Recompilation: because the server-side code is compiled before it runs, when an object it references changes, the affected stored procedures and packages need to be recompiled (although this can also be set to happen automatically at run time).
  4. If a system uses a large number of stored procedures, then after delivery the data structures keep changing as user requirements grow, followed by system-level problems; in the end, maintaining such a system is very difficult and the cost is unusually high.

9. Triggers

What is a trigger? What are the usage scenarios of triggers?

  • A trigger is a special event-driven stored procedure defined by a user on a relational table. A trigger is a piece of code that is automatically executed when an event is triggered.

Usage scenarios

  • Cascading changes can be achieved through related tables in the database.
  • Real-time monitoring of changes in a field in a table requires corresponding processing.
  • For example, some business numbers can be generated.
  • Be careful not to abuse it, otherwise it will cause difficulties in maintaining the database and applications.
  • You need to keep in mind the above basic knowledge points, the key point is to understand the difference between the data type CHAR and VARCHAR, and the difference between the table storage engine InnoDB and MyISAM.

What are the triggers in MySQL?

There are six types of triggers in the MySQL database:

  • Before Insert
  • After Insert
  • Before Update
  • After Update
  • Before Delete
  • After Delete
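
A minimal sketch of one of them, an AFTER INSERT trigger that writes an audit row (table names are hypothetical):

    DELIMITER //
    CREATE TRIGGER trg_score_after_insert
    AFTER INSERT ON score
    FOR EACH ROW
    BEGIN
        -- NEW refers to the row that was just inserted
        INSERT INTO score_log (stu_id, kc_id, score, logged_at)
        VALUES (NEW.stu_id, NEW.kc_id, NEW.score, NOW());
    END //
    DELIMITER ;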

10. Commonly used SQL statements

What are the main types of SQL statements?

  • Data Definition Language DDL (Data Definition Language): CREATE, DROP, ALTER

    These statements operate on logical structures, including table definitions, views, and indexes.

  • Data Query Language DQL (Data Query Language) SELECT

    This is easier to understand, that is, the query operation, with the select keyword. Various simple queries, join queries, etc. all belong to DQL.

  • Data Manipulation Language DML (Data Manipulation Language): INSERT, UPDATE, DELETE

    These statements operate on the data itself. Together with the query statements above, DQL and DML make up the add, delete, modify, and query operations most programmers use every day; queries are simply split out as their own category, DQL.

  • Data Control Language DCL (Data Control Language): GRANT, REVOKE, COMMIT, ROLLBACK

    These statements deal with database security and integrity and can be simply understood as permission control.

The syntax order of SQL statements:

  1. SELECT

  2. FROM

  3. JOIN

  4. ON

  5. WHERE

  6. GROUP BY

  7. HAVING

  8. UNION : Merge multiple query results (duplicate records are removed by default)

  9. ORDER BY

  10. LIMIT


What are super keys, candidate keys, primary keys, and foreign keys?

  • Superkey: The set of attributes that can uniquely identify a tuple in a relationship is called the superkey of the relational schema. An attribute can be used as a super key, and a combination of multiple attributes can also be used as a super key. Superkeys include candidate keys and primary keys.
  • Candidate key: It is the smallest superkey, that is, a superkey without redundant elements.
  • Primary key: A combination of data columns or attributes that uniquely and completely identify stored data objects in a database table. A data column can only have one primary key, and the value of the primary key cannot be missing, that is, it cannot be a null value (Null).
  • Foreign key: The primary key of another table that exists in one table is called the foreign key of this table.

What are the types of SQL constraints?

  • NOT NULL: ensures that the column cannot contain NULL values.
  • UNIQUE: ensures that values in the column are not repeated; a table may have multiple UNIQUE constraints.
  • PRIMARY KEY: also ensures the column's values are not repeated (and not NULL), but a table may have only one primary key.
  • FOREIGN KEY: preserves the links between tables and prevents illegal data from being inserted into the foreign key column, because the value must be one of the values in the referenced table.
  • CHECK: restricts the range of values allowed in the column.
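
A hedged sketch combining these constraints in one hypothetical table definition (note that MySQL only enforces CHECK constraints since 8.0.16):

    CREATE TABLE employee (
        id      INT          NOT NULL AUTO_INCREMENT,
        name    VARCHAR(50)  NOT NULL,                    -- NOT NULL
        email   VARCHAR(100) UNIQUE,                      -- UNIQUE
        age     INT CHECK (age BETWEEN 18 AND 65),        -- CHECK
        dept_id INT,
        PRIMARY KEY (id),                                 -- PRIMARY KEY
        FOREIGN KEY (dept_id) REFERENCES dept (id)        -- FOREIGN KEY (dept is assumed to exist)
    );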

Six kinds of join queries

  • Cross join (CROSS JOIN)

  • Inner join (INNER JOIN)

  • Outer join (LEFT JOIN/RIGHT JOIN)

  • Union query (UNION and UNION ALL)

  • Full join (FULL JOIN)

  • Cross join (CROSS JOIN)

    SELECT * FROM A,B(,C)
    -- or: SELECT * FROM A CROSS JOIN B (CROSS JOIN C)
    -- There is no join condition, so the result is the Cartesian product; the result set is huge and meaningless, and cross joins are rarely used.

  • Inner join (INNER JOIN)

    SELECT * FROM A,B WHERE A.id=B.id
    -- or: SELECT * FROM A INNER JOIN B ON A.id=B.id
    -- Returns the rows from the tables that satisfy the join condition; INNER JOIN can be abbreviated to JOIN.


There are three types of inner joins

  • Equi-join: ON A.id=B.id
  • Non-equi join: ON A.id > B.id
  • Self join: SELECT * FROM A T1 INNER JOIN A T2 ON T1.id=T2.pid

Outer join (LEFT JOIN/RIGHT JOIN)

  • Left outer join: LEFT OUTER JOIN, mainly based on the left table, first query the left table, match the right table according to the association conditions after ON, and fill the unmatched ones with NULL, which can be abbreviated as LEFT JOIN
  • Right outer join: RIGHT OUTER JOIN, mainly based on the right table, first query the right table, match the left table according to the association conditions after ON, and fill the unmatched ones with NULL, which can be abbreviated as RIGHT JOIN

Union query (UNION and UNION ALL)

SELECT * FROM A UNION SELECT * FROM B UNION ...
  • A union gathers multiple result sets together, with the result before UNION as the base. Note that all the queries in a union must return the same number of columns, and identical rows are merged (deduplicated)
  • With UNION ALL, duplicate rows are not merged
  • UNION ALL is more efficient than UNION, because it skips the deduplication step

Full join (FULL JOIN)

SELECT * FROM A LEFT JOIN B ON A.id=B.id
UNION
SELECT * FROM A RIGHT JOIN B ON A.id=B.id
  • MySQL does not support full joins
  • You can use LEFT JOIN and UNION and RIGHT JOIN in combination

11. Master-slave replication

1. What is master-slave replication?

Master-slave replication establishes one or more databases (slaves) that are exact copies of the master database; the master is usually the live, near-real-time business database.

2. What is the role of master-slave replication?

  • Read-write separation enables the database to support greater concurrency.
  • High availability, hot backup of data, as a backup database, after the primary database server fails, it can switch to the secondary database to continue working to avoid data loss.

3. What is the architecture of master-slave replication?

  • One master, one slave / one master, multiple slaves
    When the request pressure on the master is very high, a one-master-multiple-slaves architecture can be configured to separate reads from writes: the large number of read requests that do not need strictly real-time data are load-balanced across the slave databases, reducing the read pressure on the master. If the master goes down, one slave can be switched to master to keep providing service.

  • Master-master replication
    The dual-master architecture suits scenarios that need master-slave switching. The two databases are each other's master and slave. When the original master recovers from a failure, it is still the slave of the new master (the former slave), so it continues to replicate the new master's data. No matter how the master role is switched, the original master never drops out of the replication topology.

  • **Multiple masters, one slave** (supported since MySQL 5.7)

  • Cascade replication
    Each slave has its own binlog dump thread on the master pushing binlog to it, so as the number of slaves grows, the I/O and network pressure on the master grows too. The cascade replication architecture was created to address this.

Cascade replication builds on one-master-multiple-slaves by inserting a second-level master (Master2) between the master and the slaves. This second-level master only receives the binlog pushed by the first-level master and relays it to each slave, thereby reducing the push load on the first-level master.

4. What is the principle of master-slave replication?

The master keeps a binary log (binlog) that records all statements that modify data. The goal of master-slave synchronization is to copy the contents of the master's binlog into the slave's relay log and have the slave re-execute those statements.

The specific implementation requires three threads:

  • Binlog dump thread: whenever a slave connects to the master, the master creates a thread for it and sends it the binlog contents.
    On the slave, when replication starts, two threads are created to do the work:

  • Slave I/O thread: when START SLAVE is executed on the slave, an I/O thread is created; it connects to the master and asks it to send the updates recorded in its binlog. The I/O thread reads the updates sent by the master's binlog dump thread and copies them into local files, i.e. the relay log.

  • Slave SQL thread: the slave also creates a SQL thread, which reads the update events that the I/O thread wrote to the relay log and executes them.
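
For reference, a hedged sketch of pointing a slave at a master and starting these threads; the host, user, and log coordinates are placeholders, and recent MySQL versions also accept the equivalent CHANGE REPLICATION SOURCE TO / START REPLICA syntax.

    -- On the slave: configure the connection to the master
    CHANGE MASTER TO
        MASTER_HOST = '192.168.1.10',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '***',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS = 4;

    START SLAVE;          -- creates the slave I/O thread and SQL thread
    SHOW SLAVE STATUS\G   -- check Slave_IO_Running / Slave_SQL_Running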

5. What are asynchronous replication and semi-synchronous replication?

MySQL's master-slave replication has two replication methods, namely asynchronous replication and semi-synchronous replication :

  • Asynchronous replication
    MySQL's default master-slave replication method is asynchronous replication, because the Master does not consider whether the data reaches the Slave or whether the Slave is successfully executed.

    If it is necessary to achieve full synchronization, that is, the Master needs to wait for one or all Slaves to execute successfully before responding successfully, then the cluster efficiency can be imagined. Therefore, a compromise method appeared after MySQL 5.6 - semi-synchronization.

  • Semi-synchronous replication
    Whether one master with one slave or one master with multiple slaves, the master can report success to the requesting client as soon as it confirms that at least one slave has received the transaction. It does not wait for the slave to actually execute the transaction; it only waits for the slave to receive it and write it to its local relay log.

In addition, under semi-synchronous replication, if a transaction has been committed on the master but, while it is being pushed to a slave, the slave crashes or the network fails so that the slave never receives the binlog for that transaction, the master waits for a period determined by rpl_semi_sync_master_timeout (in milliseconds). If the binlog still cannot be delivered within that time, MySQL automatically falls back from semi-synchronous to asynchronous replication; once the slave reconnects to the master normally, the master automatically switches back to semi-synchronous replication.

The "semi" in semi-synchronous replication refers to the fact that although the master and slave binlogs are kept in sync, the master does not wait for the slave to apply the relay log before returning; it returns as soon as it confirms the slave has received the binlog, i.e. as soon as the binlogs are in sync. The slave's data therefore still lags behind the master by the time it takes the slave to apply the relay log, which is why it is only called semi-synchronous.
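
For reference, semi-synchronous replication ships as plugins and is controlled by system variables such as the rpl_semi_sync_master_timeout mentioned above (a hedged sketch; newer versions also provide source/replica-named equivalents):

    -- On the master
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    SET GLOBAL rpl_semi_sync_master_timeout = 1000;   -- ms to wait before falling back to async

    -- On each slave
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;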

6. Common problems and solutions in master-slave?

Problems:
1) Data may be lost after the main database goes down.

2) The slave has only a single SQL thread; when the write load on the master is high, replication is likely to lag.

**Solutions:**
1) Semi-synchronous replication: Ensure that the binlog is transferred to at least one slave database after the transaction is committed to solve the problem of data loss.

2) Parallel replication: Multi-thread apply binlog from the library to solve the problem of replication delay from the library.

12. Tuning

Tell me some experience in database optimization?

  1. Foreign key constraints affect the performance of inserts, updates, and deletes; if the application can guarantee data integrity itself, remove the foreign keys.
  2. Write SQL statements entirely in uppercase, especially column names: when a statement reaches the database server, the server first normalizes the SQL to uppercase before executing it, so writing it in uppercase up front saves that step.
  3. If the application can guarantee database integrity, the schema does not have to follow the three normal forms strictly.
  4. Do not create indexes indiscriminately; indexes speed up queries, but they also consume disk space.
  5. With JDBC, use PreparedStatement instead of Statement to build SQL. PreparedStatement performs better than Statement: the SQL statement is precompiled inside the PreparedStatement object, which can then be executed efficiently many times.

How to optimize SQL query statement?

  1. To optimize the query, you should try to avoid full table scans. First, you should consider building indexes on the columns involved in where and order by
  2. Indexes can improve queries
  3. Avoid using the * sign in the SELECT clause, and try to use all uppercase SQL
  4. Try to avoid judging whether the field is null in the where clause, otherwise it will cause the engine to give up using the index and perform a full table scan. Use IS NOT NULL
  5. The use of or in the where clause to connect the conditions will also cause the engine to give up using the index and perform a full table scan
  6. In and not in should also be used with caution, otherwise it will cause a full table scan

How do you know whether a SQL statement performs well or badly?

  1. Check the execution time of SQL
  2. Use the EXPLAIN keyword to see how MySQL would execute the query, so you can analyze the performance bottleneck of the statement or of the table structure (see the example below).
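
A hedged example on a hypothetical table; the columns to look at first are type, key, rows, and Extra.

    EXPLAIN SELECT id, name FROM user WHERE age = 20;
    -- type:  ALL means a full table scan; ref / range / const are progressively better
    -- key:   which index was actually chosen (NULL means none)
    -- rows:  estimated number of rows that must be examined
    -- Extra: e.g. Using index (covering index), Using filesort, Using temporary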

How to optimize large table data query

  1. Optimize the schema, the SQL statements, and the indexes;
  2. Add a cache: memcached, Redis;
  3. Master-slave replication and read-write separation;
  4. Vertical splitting: according to how loosely coupled the modules are, split one big system into several smaller systems, i.e. a distributed system;
  5. Horizontal sharding: for tables with very large data volumes this is the most troublesome step and the real test of skill. A reasonable sharding key must be chosen; to keep queries efficient the table structure may need to change and accept some redundancy, and the application's SQL also needs to change: try to include the sharding key in every statement so that the data can be located on a limited set of shards instead of scanning all of them;

How to deal with super large pages?

Large paging is generally solved from two directions.

  • At the database level, which is where we mainly focus (although the effect is limited). A query like select * from table where age > 20 limit 1000000,10 has room for optimization: it walks past 1,000,000 rows only to keep 10, so of course it is slow. It can be rewritten as select * from table t join (select id from table where age > 20 limit 1000000,10) tmp on t.id = tmp.id (MySQL does not accept LIMIT directly inside an IN subquery): the inner query still walks a million entries, but because the index covers it, only ids are read there and just 10 rows are fetched from the table, so it is much faster. If the ids are continuous, we can also write select * from table where id > 1000000 limit 10, which is also efficient. There are many possible optimizations, but the core idea is the same: load less data (sketches follow this list).
  • Reduce such requests at the requirements level: mainly, do not offer this kind of feature at all (jumping directly to a specific page millions of rows deep); only allow paging one page at a time or along a fixed route, which is predictable and can be cached, and this also prevents ID leakage and continuous malicious crawling.
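
Hedged sketches of the database-level rewrites (table and column names are hypothetical); in InnoDB the secondary index on age implicitly contains the primary key id, so the inner query below is covered by that index:

    -- Naive deep page: walks past 1,000,000 rows and keeps only 10
    SELECT * FROM t WHERE age > 20 LIMIT 1000000, 10;

    -- Deferred join: the inner query reads only ids from the index,
    -- and only the final 10 rows are fetched from the table
    SELECT t.*
    FROM t
    JOIN (SELECT id FROM t WHERE age > 20 LIMIT 1000000, 10) tmp ON t.id = tmp.id;

    -- If ids are continuous or the last seen id is known, seek instead of using an offset
    SELECT * FROM t WHERE id > 1000000 LIMIT 10;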

To handle super-large paging in practice, the main tool is caching: predictably fetch the content in advance, cache it in a KV store such as Redis, and return it directly.

Why try to set a primary key?

  • The primary key is what guarantees the uniqueness of each row in the table. Even if a business table has no natural primary key, it is recommended to add an auto-increment ID column as the primary key. With a primary key in place, later deletes, updates, and queries can be faster, and the range of rows they touch is safer and better defined.

Does the primary key use auto-increment ID or UUID?

  • It is recommended to use auto-increment ID instead of UUID.
  • Because in the InnoDB storage engine the primary key index is the clustered index, the primary key index and all the row data (in key order) are stored in the leaf nodes of the primary key's B+ tree. If the primary key is an auto-increment ID, new rows simply keep being appended at the end; if it is a UUID, the relative order of a new key and the existing keys is unpredictable, so rows are inserted at random positions, causing lots of data movement and page fragmentation and degrading insert performance.

In short, in the case of a larger amount of data, the performance of using the auto-increment primary key will be better.
The primary key is a clustered index. If there is no primary key, InnoDB will choose a unique key as the clustered index. If there is no unique key, an implicit primary key will be generated.

If I want to store a user's password hash, what field should I use to store it?

  • Fixed-length strings such as password hashes, salts, and user ID numbers should be stored using char instead of varchar, which can save space and improve retrieval efficiency.

How to optimize data access during query

  • Accessing too much data leads to query performance degradation
  • Determine if the application is retrieving more data than needed, possibly too many rows or columns
  • Confirm whether the MySQL server is analyzing a large number of unnecessary data rows
  • Avoid making the following SQL statement errors
  • Avoid querying data you don't need. Solution: use limit to solve
  • Multi-table association returns all columns. Workaround: specify column names
  • Always return all columns. Workaround: Avoid using SELECT *
  • Query the same data repeatedly. Solution: You can cache the data and read the cache directly next time
  • Use explain for analysis. If you find that the query needs to scan a large amount of data but only returns a small number of rows, you can optimize it by the following techniques:
  • Use index coverage scan to put all the columns into the index, so that the storage engine can return the result without going back to the table to get the corresponding row.
  • Change the structure of the database and tables, adjusting how strongly the tables are normalized
  • Rewrite the SQL statement so that the optimizer can execute the query in a more optimal way.

How to optimize long and difficult query statements

  • Analyzing whether one complex query or multiple simple queries is fast
  • MySQL can scan millions of rows of data in memory per second. In contrast, it is much slower to respond to data to the client.
  • It is good to use the smallest possible query, but sometimes it is necessary to break a large query into many smaller queries.
  • Divide a large query into multiple smaller identical queries
  • Deleting 10 million data at one time will consume more server overhead than deleting 10,000 at a time and pausing for a while.
  • Decompose the associated query to make the cache more efficient.
  • Executing a single query reduces contention for locks.
  • It is easier to split the database by doing associations at the application layer.
  • The query efficiency will be greatly improved.
  • Queries with fewer redundant records.

Optimize specific types of query statements

  • count(*) ignores individual columns and counts rows directly; do not use count(column_name)
  • In MyISAM, count(*) without any where condition is very fast.
  • When there is a where condition, MyISAM's count statistics are not necessarily faster than other engines.
  • You can use explain to query approximate values, replacing count(*) with approximate values
  • Add summary table
  • use cache

Optimizing join queries

  • Determine if there is an index in the ON or USING clause.
  • Make sure that the GROUP BY and ORDER BY only have columns from one table, so that it is possible for MySQL to use the index.

Optimizing subqueries

  • Replace the subquery with a join query where possible

Optimizing GROUP BY and DISTINCT

  • These queries can be optimized with indexes, which is the most effective optimization
  • In join queries, grouping by an identity (primary key) column is more efficient
  • If ORDER BY is not needed, add ORDER BY NULL to the GROUP BY and MySQL will skip the filesort
  • WITH ROLLUP super-aggregation can be moved into the application for processing

Optimize LIMIT pagination

  • When the LIMIT offset is large, the query efficiency is low
  • The maximum ID of the last query can be recorded, and the next query can be queried directly based on this ID

Optimizing UNION queries

  • UNION ALL is more efficient than UNION

Optimizing the WHERE clause

  • Most databases process conditions in order from left to right, put the conditions that can filter more data in front, and put the conditions that filter less in the back

Some Methods of SQL Statement Optimization

  • 1. To optimize the query, you should try to avoid full table scans. First, you should consider building indexes on the columns involved in where and order by.

  • 2. Try to avoid testing a field for NULL in the WHERE clause, otherwise the engine will give up the index and do a full table scan, for example:

    select id from t where num is null
    -- You can set a default value of 0 on num so the column never contains NULL, then query like this:
    select id from t where num=0
    
  • 3. Try to avoid using the != or <> operator in the where clause, otherwise the engine will give up using the index and perform a full table scan.

  • 4. Try to avoid using or to connect conditions in the where clause, otherwise it will cause the engine to give up using the index and perform a full table scan, such as:

    select id from t where num=10 or num=20
    -- Can be rewritten as:
    select id from t where num=10 union all select id from t where num=20
    
  • 5. In and not in should also be used with caution, otherwise it will cause a full table scan, such as:

    select id from t where num in(1,2,3)
    -- For a continuous range of values, use BETWEEN instead of IN:
    select id from t where num between 1 and 3
    
  • 6. The following query will also result in a full table scan: select id from t where name like '%Li%'. To improve efficiency, consider a full-text index instead.

  • 7. Using a local variable in the WHERE clause can also cause a full table scan. SQL resolves local variables only at runtime, but the optimizer cannot defer the choice of access plan to runtime; it must choose at compile time, when the value of the variable is still unknown and therefore cannot drive index selection. The following statement performs a full table scan:

    select id from t where num=@num
    -- Can be changed to force the query to use an index:
    select id from t with(index(index_name)) where num=@num
    -- (the with(index(...)) hint is SQL Server syntax; the MySQL equivalent is force index(index_name))
    
  • 8. Try to avoid performing expression operations on fields in the WHERE clause, which will cause the engine to give up the index and do a full table scan. For example:

    select id from t where num/2=100
    -- Should be changed to:
    select id from t where num=100*2
    
  • 9. Try to avoid applying functions to fields in the WHERE clause, which will cause the engine to give up the index and do a full table scan. For example:

    select id from t where substring(name,1,3)='abc'
    -- To find ids whose name starts with 'abc', change to:
    select id from t where name like 'abc%'
    
  • 10. Do not perform functions, arithmetic operations or other expression operations on the left side of "=" in the where clause, otherwise the system may not be able to use the index correctly.

database optimization

why optimize

  • The throughput bottleneck of the system often occurs in the access speed of the database
  • As the application runs, there will be more and more data in the database, and the processing time will slow down accordingly
  • The data is stored on the disk, and the read and write speed cannot be compared with that of the memory
Optimization principle: reduce system bottlenecks, reduce resource consumption, and improve system response speed.

Database structure optimization

  • A good database design scheme will often have a multiplier effect on the performance of the database.
  • It is necessary to consider many aspects such as data redundancy, query and update speed, and whether the data type of the field is reasonable.

Split a table with many fields into multiple tables

  • For a table with many fields, if some fields are used infrequently, these fields can be separated to form a new table.
  • When a table holds a large amount of data, queries are slowed down by the presence of these rarely used fields.

Add intermediate table

  • For tables that need to be queried frequently, an intermediate table can be established to improve query efficiency.
  • By establishing an intermediate table, insert the data that needs to be queried through the join into it, and then change the original join query into a query on the intermediate table (a sketch follows this list).
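
A hedged sketch of the intermediate-table idea, assuming hypothetical users and orders tables; user_order_stats is the intermediate table introduced for the example:

    -- Materialize a frequent join into an intermediate table.
    CREATE TABLE user_order_stats (
      user_id      BIGINT PRIMARY KEY,
      order_cnt    INT NOT NULL,
      total_amount DECIMAL(12,2) NOT NULL
    );

    INSERT INTO user_order_stats (user_id, order_cnt, total_amount)
    SELECT u.id, COUNT(o.id), COALESCE(SUM(o.amount), 0)
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id;

    -- The original join query becomes a single-table lookup:
    SELECT order_cnt, total_amount FROM user_order_stats WHERE user_id = 42;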

Add redundant fields

  • When designing data tables, try to follow the rules of paradigm theory, reduce redundant fields as much as possible, and make the database design look refined and elegant. However, adding redundant fields reasonably can improve query speed.
  • The higher a table's degree of normalization, the more relationships there are between tables, the more often joins are required, and the worse the performance.

Notice:

If the value of a redundant field is modified in one table, it must also be updated in the other tables, otherwise the data will become inconsistent.

What should you do if the CPU usage of the MySQL database soars to 500%?

  • When the CPU soars to 500%, first use the operating system's top command to check whether mysqld is the process consuming it. If not, find the process with high usage and deal with it.
  • If it is caused by mysqld, run SHOW PROCESSLIST to see whether any resource-intensive SQL is running. Find the expensive SQL and check whether the execution plan is reasonable, whether an index is missing, or whether the data volume is simply too large.
  • Generally you should kill these threads (while watching whether CPU usage drops), and after making the corresponding adjustments (such as adding indexes, rewriting the SQL, or changing memory parameters), run these SQL statements again.
  • It is also possible that no single SQL consumes many resources, but a large number of sessions connect suddenly and the CPU soars. In this case, analyze with the application side why the connection count surged, and then adjust accordingly, for example by limiting the number of connections.

How do you optimize large tables? How do you split databases and tables? What problems does splitting bring? Is middleware useful? Do you know how they work?

When the number of records in a single MySQL table is too large, the CRUD performance of the database will drop significantly. Some common optimization measures are as follows:

  1. Limit the data range: be sure to prohibit query statements that carry no condition limiting the range of data. For example, when users query order history, restrict it to within one month;
  2. Read/write separation: classic database splitting scheme, the main library is responsible for writing, and the slave library is responsible for reading;
  3. Cache: Use MySQL's cache, and consider using application-level cache for heavyweight, less-updated data;

In addition, the table can be optimized by splitting databases and tables, mainly vertical partitioning, vertical table splitting, horizontal partitioning, and horizontal table splitting.

1. Vertical partition

  • Split according to how related the tables in the database are. For example, if the user table contains both the user's login information and the user's basic information, it can be split into two separate tables, or even placed in a separate library, making it a vertical database split.
  • Simply put, vertical splitting means splitting the columns of a data table: a table with many columns is split into multiple tables. The figure below should make this easier to understand.

insert image description here

  • Advantages of vertical splitting: rows become smaller, fewer blocks need to be read during a query, and the number of I/O operations is reduced. In addition, vertical partitioning simplifies the table structure and makes it easier to maintain.
  • Disadvantages of vertical splitting: the primary key becomes redundant and redundant columns must be managed, and Join operations are introduced (these can be handled by joining at the application layer). In addition, vertical partitioning makes transactions more complex.

2. Vertical sub-table

  • Put the primary key and some of the columns in one table, and the primary key and the remaining columns in another table (a sketch follows)
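
A minimal sketch of such a vertical table split, with hypothetical user_base and user_extra tables sharing the same primary key:

    -- Hot columns stay in user_base; rarely used, wide columns move to user_extra.
    CREATE TABLE user_base (
      id    BIGINT PRIMARY KEY,
      name  VARCHAR(50),
      phone VARCHAR(20)
    );
    CREATE TABLE user_extra (
      id      BIGINT PRIMARY KEY,
      address VARCHAR(255),
      bio     TEXT
    );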

insert image description here

Applicable scene

  • 1. Some columns in the table are used frequently while others are not
  • 2. Rows become smaller, so a data page can hold more rows and fewer I/O operations are needed during queries

shortcoming

  • Some table-splitting strategies are based on the logic algorithm of the application layer. Once the logic algorithm changes, the entire table-splitting logic will change, and the scalability is poor.
  • For the application layer, logic algorithms increase development costs
  • Redundant columns must be managed, and querying all the data requires a join operation

3. Horizontal partition

  • Keep the structure of the data table unchanged and shard the rows according to some strategy, so that each piece of data ends up in a different table or database, achieving distribution. Horizontal splitting can support a very large amount of data.
  • Horizontal splitting splits the rows of a data table. When the number of rows in a table exceeds roughly 2 million, it starts to slow down; at that point the data of one table can be split across multiple tables. For example, the user information table can be split into several user information tables to avoid the performance impact of too much data in a single table (a sketch follows this list).
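
A minimal sketch of a horizontal split by id modulo 2, assuming a hypothetical user_info table; in practice the routing usually lives in the application or middleware layer:

    -- Same structure, different rows; the application routes each row by id % 2.
    CREATE TABLE user_info_0 LIKE user_info;
    CREATE TABLE user_info_1 LIKE user_info;

    -- e.g. id 1001 -> 1001 % 2 = 1 -> user_info_1
    INSERT INTO user_info_1 SELECT * FROM user_info WHERE id % 2 = 1;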

insert image description here

  • Horizontal splitting can support a very large amount of data. Note that splitting tables alone only solves the problem of a single table being too large; since the data still sits on the same machine, it does little for MySQL's concurrency, so when splitting horizontally it is best to split across databases as well.
  • Horizontal splitting supports very large data volumes and requires little change on the application side, but sharded transactions are hard to solve, cross-node Join performance is poor, and the logic becomes complex.
The author of 《Java工程师修炼之道》 recommends not sharding data if you can avoid it, because splitting brings all kinds of complexity in logic, deployment, and operations; with proper optimization an ordinary table can handle up to around ten million rows without much trouble. If sharding is truly necessary, prefer a client-side sharding architecture, which saves one network I/O round trip to the middleware.

4. Horizontal table splitting:

  • The table is very large; after splitting, a query reads fewer data and index pages, the index has fewer levels, and queries become faster.

insert image description here

Applicable scene

  • 1. The data in the table is naturally independent, for example the table records data by region or by period, and some of the data is frequently used while some is not.
  • 2. Data needs to be stored on multiple media.

Disadvantages of horizontal splitting

  • 1. It adds complexity to the application: queries usually need to reference multiple table names, and querying all the data requires a UNION operation.
  • 2. In many database applications this complexity outweighs the benefit it brings, since an unsplit table's query would only cost one extra index level of disk reads anyway.

Two common schemes for database sharding:

  • Client-side proxy: the sharding logic lives on the application side, packaged in a jar and implemented by modifying or wrapping the JDBC layer. Dangdang's Sharding-JDBC and Ali's TDDL are two commonly used implementations.
  • Middleware proxy: a proxy layer sits between the application and the data, and the sharding logic is maintained centrally in the middleware service. Mycat, 360's Atlas, and NetEase's DDB are implementations of this architecture.

Problems faced after sub-database sub-table

  • Transaction support: after splitting databases and tables, a transaction becomes a distributed transaction. Relying on the database's own distributed transaction management carries a high performance cost, while having the application coordinate a program-level logical transaction adds a programming burden.

  • Cross-library join

    As long as the data is split, cross-node Joins are unavoidable, but good design and partitioning can reduce how often they occur. A common solution is to do it in two queries: find the ids of the associated data in the result set of the first query, then issue a second request to fetch the associated data by those ids.

  • Cross-node count, order by, group by, and aggregate-function problems: these form one class of problem because they all need to be computed over the entire data set, and most proxies do not merge results automatically. Solution: as with cross-node Joins, fetch the results on each node and merge them on the application side. Unlike a Join, each node's query can run in parallel, so in many cases this is much faster than a single large table; but if the result set is large, application memory consumption becomes a problem.

  • Issues such as data migration, capacity planning, and capacity expansion. One scheme, from Taobao's integrated business platform team, uses the forward-compatible property of taking remainders against powers of 2 (for example, a number that is 1 mod 4 is also 1 mod 2) to allocate data, avoiding row-level data migration; but table-level migration is still required, and there are limits on the expansion scale and the number of sub-tables. Generally speaking, none of these schemes is ideal, and each has its shortcomings, which reflects from one angle how difficult Sharding expansion is.

  • ID problem

  • Once the database is split across multiple physical nodes, we can no longer rely on the database's own primary-key generation mechanism. On the one hand, an ID auto-generated by one shard is not guaranteed to be globally unique; on the other hand, the application needs to obtain the ID before inserting the data in order to route the SQL. Some common primary-key generation strategies:

    • UUID: using a UUID as the primary key is the simplest solution, but the drawbacks are obvious. A UUID is very long, so besides taking a lot of storage space, the main problem is with the index: both building the index and querying through it perform poorly.
    • Snowflake: Twitter's distributed auto-increment ID algorithm. Distributed systems often need to generate a globally unique ID, and Snowflake meets this need with a fairly simple implementation: apart from the configuration bits, the ID consists of a 41-bit millisecond timestamp, a 10-bit machine ID, and a 12-bit per-millisecond sequence.
  • Sorting and paging issues across shards

    Generally speaking, pagination needs to be sorted by a specified field. When the sort field is the sharding field, the sharding rules make it easy to locate the relevant shard; when it is not, things become more complicated. For the final result to be correct, the data must be sorted and returned within each shard node, and then the result sets returned by the different shards are merged and re-sorted before being returned to the user, as shown below:

    insert image description here

Origin blog.csdn.net/QRLYLETITBE/article/details/129096940