Big Data - Playing with Data - MySQL Specifications

Overall map (mind-map image omitted)

Main text

1. Database naming specifications

  • All database object names must be in lowercase and separated by underscores

  • All database object names must avoid MySQL reserved keywords (if a table name does contain a keyword, it must be enclosed in backticks when referenced in queries; see the example after this list)

  • Database object names should be self-explanatory and should not exceed 32 characters

  • Temporary tables must be prefixed with tmp_ and suffixed with a date; backup tables must be prefixed with bak_ and suffixed with a date (timestamp)

  • Columns that store the same data in different tables must have the same name and type (they are usually used as join columns; if joined columns have different types, MySQL converts them implicitly, which invalidates the index on the column and reduces query efficiency)
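A minimal sketch of keyword quoting, assuming a hypothetical table that was (against this rule) named with the reserved word order:

-- backticks are required whenever an identifier collides with a reserved keyword
SELECT id, user_id FROM `order` WHERE id = 1;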

2. Basic database design specifications

1. All tables must use the Innodb storage engine

Unless there are special requirements (that is, features InnoDB cannot provide, such as column storage or spatial data), all tables must use the InnoDB storage engine (MyISAM was the default before MySQL 5.5; InnoDB has been the default since 5.5). InnoDB supports transactions and row-level locks, recovers better after a crash, and performs better under high concurrency.

2. The character set of the database and table uniformly uses UTF8

A unified character set has better compatibility and avoids garbled text caused by character-set conversion; comparing data in different character sets requires conversion first, which causes indexes to fail.

3. All tables and fields need to add comments

Use the COMMENT clause to add table and column comments, maintaining the data dictionary from the start.
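A minimal sketch of adding comments with the COMMENT clause (table and column names are hypothetical):

CREATE TABLE user_info (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key',
    user_name VARCHAR(64) NOT NULL DEFAULT '' COMMENT 'login name',
    PRIMARY KEY (id)
) ENGINE = InnoDB DEFAULT CHARSET = utf8 COMMENT = 'user basic information';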

4. Try to control the size of a single table; it is recommended to keep it within 5 million rows

5 million rows is not a hard limit of MySQL, but beyond that size, modifying the table structure, backing up, and restoring all become much more problematic.

Historical data archiving (for log data), splitting into multiple databases and tables (for business data), and similar means can be used to control data size.

5. Use MySQL partition table carefully

Physically, a partitioned table appears as multiple files, but logically it appears as a single table. Choose the partition key carefully, since cross-partition queries may be less efficient. It is recommended to manage big data by splitting it into separate physical tables instead.

6. Try to separate hot and cold data and reduce the width of the table

MySQL limits each table to at most 4096 columns, and the size of each row cannot exceed 65535 bytes. Reducing table width helps to reduce disk IO and keep hot data in the memory cache (the wider the table, the more memory it occupies when loaded into the buffer pool and the more IO it consumes), makes better use of the cache by avoiding reads of useless cold data, and keeps columns that are often used together in one table (avoiding extra join operations).

7. It is forbidden to create reserved fields in the table

Reserved-field names rarely convey their meaning; the data type of a reserved field cannot be determined in advance, so an appropriate type cannot be chosen, and changing the type of a reserved field later will lock the table.

8. It is forbidden to store large binary data such as pictures and files in the database

Files are usually very large, which causes the data volume to grow rapidly in a short period of time, and reading them from the database involves a large number of random IO operations. Files are usually stored on a file server, with the database storing only the file address (path) information.

9. It is forbidden to do database stress testing online

10. It is forbidden to connect directly to the production environment database from the development or test environment

3. Database Field Design Specifications

1. Prioritize the selection of the smallest data type that meets storage needs

Reason

The larger a column is, the more space its index requires, so fewer index nodes fit into one page, more IO is needed to traverse the index, and index performance gets worse.

Method

1) Convert character strings to numeric types for storage, such as converting IP addresses to integer data.

MySQL provides two functions for handling IP addresses:

INET_ATON converts an IPv4 address string into an unsigned integer (stored in 4 bytes as INT UNSIGNED)

INET_NTOA converts the integer back into the dotted-quad address string

Before inserting data, use INET_ATON to convert the IP address into an integer, which saves space; when displaying data, use INET_NTOA to convert the integer back into an address.
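A minimal sketch of storing an IPv4 address as an integer (the access_log table is hypothetical):

CREATE TABLE access_log (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip INT UNSIGNED NOT NULL COMMENT 'IPv4 address stored as an unsigned integer'
) ENGINE = InnoDB;

-- convert on insert, convert back on display
INSERT INTO access_log (ip) VALUES (INET_ATON('192.168.1.10'));
SELECT INET_NTOA(ip) AS ip_text FROM access_log;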

2) For non-negative data (such as auto-increment ID, integer IP), it is better to use unsigned integer to store

Reason: for the same storage size, an unsigned type doubles the representable non-negative range compared with a signed type

SIGNED INT -2147483648~2147483647

UNSIGNED INT 0~4294967295

The N in VARCHAR(N) represents the number of characters, not the number of bytes

With UTF8, storing 255 Chinese characters in VARCHAR(255) takes up to 765 bytes; excessively long definitions consume more memory

2. Avoid using the TEXT and BLOB data types. The most common TEXT type can store up to 64 KB of data

It is recommended to split BLOB or TEXT columns into a separate extension table

MySQL in-memory temporary tables do not support large data types such as TEXT and BLOB. If a query involves such columns, in-memory temporary tables cannot be used for operations such as sorting, and disk-based temporary tables have to be used instead.

For such data, MySQL also has to perform a secondary lookup, which makes the SQL perform very poorly; this does not mean such types must never be used, however.

If you must use them, it is recommended to split the BLOB or TEXT column into a separate extension table. Do not use SELECT * in queries; fetch only the necessary columns, and do not query the TEXT column when its data is not needed.

TEXT and BLOB types can only use prefix indexes

Because MySQL limits the length of indexed fields, a TEXT column can only use a prefix index, and a TEXT column cannot have a default value.
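A minimal sketch of splitting a TEXT column into an extension table and giving it a prefix index (table and column names are hypothetical):

CREATE TABLE article (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(128) NOT NULL DEFAULT '' COMMENT 'article title'
) ENGINE = InnoDB;

CREATE TABLE article_content (
    article_id INT UNSIGNED NOT NULL PRIMARY KEY COMMENT 'references article.id',
    content TEXT NOT NULL,
    KEY idx_content (content(100))  -- prefix index: only the first 100 characters are indexed
) ENGINE = InnoDB;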

3. Avoid using ENUM types

Modifying ENUM values requires an ALTER statement

ORDER BY on an ENUM column is inefficient and requires extra operations

Do not use numeric values as ENUM enumeration values

4. Define all columns as NOT NULL as much as possible

Reason:

Indexing a column that allows NULL requires extra space to store, so it takes up more space;

NULL values require special handling in comparisons and calculations.

5. Use TIMESTAMP (4 bytes) or DATETIME type (8 bytes) to store time

TIMESTAMP can store times in the range 1970-01-01 00:00:01 ~ 2038-01-19 03:14:07.

TIMESTAMP occupies 4 bytes, the same as INT, but is more readable than INT.

Values beyond the TIMESTAMP range should be stored as DATETIME.

People often use strings to store date-type data (incorrect practice):

Disadvantage 1: Unable to calculate and compare with date functions

Disadvantage 2: Storing dates with strings takes up more space
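A minimal sketch of the two date/time types (the order_info table is hypothetical):

CREATE TABLE order_info (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    create_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'within the 1970-2038 TIMESTAMP range',
    expire_time DATETIME NOT NULL COMMENT 'may fall outside the TIMESTAMP range, so DATETIME is used'
) ENGINE = InnoDB;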

6. Financial amount data must use the DECIMAL type

Inexact floating point: FLOAT, DOUBLE

Exact numeric: DECIMAL

DECIMAL is an exact fixed-point type and does not lose precision in calculations. The space it occupies is determined by the defined precision (roughly every 9 digits are packed into 4 bytes). It can also be used to store integers larger than BIGINT allows.
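A minimal sketch of an amount column using DECIMAL (the payment table is hypothetical):

CREATE TABLE payment (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    amount DECIMAL(18, 2) NOT NULL DEFAULT 0.00 COMMENT 'exact amount; FLOAT/DOUBLE could lose precision'
) ENGINE = InnoDB;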

4. Index Design Specifications

1. Limit the number of indexes on each table. It is recommended that there are no more than 5 indexes on a single table

More indexes are not better! Indexes can increase efficiency as well as decrease efficiency.

Indexes can increase query efficiency, but also reduce the efficiency of insertion and update, and even reduce query efficiency in some cases.

When the MySQL optimizer decides how to optimize a query, it evaluates every available index based on its statistics in order to generate the best execution plan. If many indexes could serve the same query, this increases the time the optimizer needs to generate the execution plan, which also reduces query performance.

2. It is forbidden to create a separate index for each column in the table

Before version 5.6, one SQL statement could use only one index per table; since 5.6, although the index merge optimization exists, it is still far less effective than a proper composite index.
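A minimal sketch contrasting the two approaches (the order_info table and its columns are hypothetical):

-- not recommended: a separate index on every column
-- ALTER TABLE order_info ADD INDEX idx_customer (customer_id), ADD INDEX idx_status (order_status), ADD INDEX idx_time (create_time);

-- usually better: one composite index matching how the columns are queried together
ALTER TABLE order_info ADD INDEX idx_cust_status_time (customer_id, order_status, create_time);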

3. Each Innodb table must have a primary key

InnoDB tables are index-organized tables: the logical order of data storage is the same as the order of the index.

Each table can have multiple indexes, but the table data can be stored in only one order; InnoDB organizes the table according to the order of the primary key index.

Do not use frequently updated columns as the primary key, and do not use multi-column primary keys (they are equivalent to composite indexes). Do not use UUID, MD5, hash, or string columns as the primary key (they cannot guarantee sequentially growing values).

It is recommended to use the auto-increment ID value for the primary key.

5. Common index column recommendations

Columns that appear in the WHERE clause of SELECT, UPDATE, and DELETE statements

Fields included in ORDER BY, GROUP BY, DISTINCT

Do not build a separate index on every column that matches points 1 and 2; it is usually better to build a composite index on those fields together

Columns used to join multiple tables

6. How to choose the order of index columns

The purpose of an index is to locate data through the index, reduce random IO, and improve query performance; the fewer rows the index leaves to be fetched, the less data has to be read from disk.

Put the column with the highest selectivity on the far left of the composite index (selectivity = number of distinct values in the column / total number of rows in the table; see the sketch after this list);

Try to put columns with short field lengths on the leftmost side of the composite index (the shorter the field, the more index entries fit in one page, and the better the IO performance);

Put the most frequently used columns on the left side of the composite index (so that fewer separate indexes need to be built).
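A minimal sketch of measuring selectivity to decide the column order (the table and columns are hypothetical):

SELECT COUNT(DISTINCT customer_id) / COUNT(*) AS customer_selectivity,
       COUNT(DISTINCT order_status) / COUNT(*) AS status_selectivity
FROM order_info;
-- the column whose ratio is closer to 1 is the better candidate for the leftmost position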

7. Avoid creating redundant and duplicate indexes

Because this will increase the time for the query optimizer to generate an execution plan.

Duplicate index example: primary key(id), index(id), unique index(id)

Redundant index example: index(a,b,c), index(a,b), index(a)

8. Prioritize Covering Indexes

Covering indexes are preferred for frequent queries.

Covering index: an index that contains all the fields used by a query (the WHERE, SELECT, ORDER BY, and GROUP BY fields)

Benefits of covering indexes:

Avoids the second (primary key) lookup on InnoDB tables

InnoDB stores data in clustered index order; for InnoDB, a secondary index stores the row's primary key value in its leaf nodes.

If data is queried through a secondary index, after the matching key value is found, a second lookup through the primary key is needed to fetch the actual row. With a covering index, all required data can be obtained directly from the secondary index's key values, avoiding the second primary key lookup, reducing IO operations, and improving query efficiency.

Can turn random IO into sequential IO to speed up query efficiency

Because a covering index is stored in key-value order, an IO-intensive range search reads far less data per row than randomly fetching each row from disk; with a covering index, the random read IOs of a lookup become sequential IOs.
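A minimal sketch of a covering index, reusing the hypothetical order_info table:

ALTER TABLE order_info ADD INDEX idx_cust_status (customer_id, order_status);

EXPLAIN SELECT customer_id, order_status
FROM order_info
WHERE customer_id = 42;
-- when the query is served entirely from the index, the Extra column of EXPLAIN shows "Using index"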

9. Index SET specification

Try to avoid using foreign key constraints

It is not recommended to use foreign key constraints (foreign key), but be sure to create an index on the associated key between tables;

Foreign keys can be used to ensure the referential integrity of data, but it is recommended to implement it on the business side;

Foreign keys can affect write operations on parent and child tables, reducing performance.

10. Database SQL development specification

1. It is recommended to use precompiled statements for database operations

Precompiled (prepared) statements let execution plans be reused, reducing the time needed to compile SQL, and they also prevent the SQL injection problems caused by dynamically concatenated SQL. Passing only parameters is more efficient than passing whole SQL statements: the same statement is parsed once and executed many times, improving processing efficiency.
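A minimal sketch using MySQL's server-side PREPARE/EXECUTE syntax (in applications this is normally done through the driver's parameter binding; the customer table is hypothetical):

PREPARE stmt FROM 'SELECT name, phone FROM customer WHERE id = ?';
SET @id = 111;
EXECUTE stmt USING @id;
DEALLOCATE PREPARE stmt;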

2. Avoid implicit conversion of data types

Implicit data type conversion may cause the index to fail. For example: select name,phone from customer where id = '111'; (here id is a numeric column, so the value should be written as a number: where id = 111).

3. Make full use of the existing indexes on the table

Avoid query conditions with double % signs.

For example, a like '%123%' cannot use the index; if there is no leading % (only a trailing %), the index on the column can be used.

One SQL can only use one column in the composite index for range query

For example, with a composite index on columns a, b, and c, if the query condition contains a range query on column a, the index on columns b and c will not be used. When defining the composite index, if column a needs a range search, place column a on the right side of the composite index.

Use left join or not exists to optimize not in operation

Because NOT IN also usually causes the index not to be used.
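A minimal sketch of the two rewrites above, assuming hypothetical customer and order_info tables:

-- a trailing-% LIKE can use the index on name; a leading % cannot
SELECT id FROM customer WHERE name LIKE 'zhang%';

-- NOT IN rewritten with NOT EXISTS (a LEFT JOIN ... IS NULL form works as well)
SELECT c.id
FROM customer c
WHERE NOT EXISTS (SELECT 1 FROM order_info o WHERE o.customer_id = c.id);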

4. When designing the database, future expansion should be considered

5. Programs should use different accounts to connect to different databases, and cross-database queries between systems are not allowed

Leave room for database migration and sub-database sub-table

Reduce business coupling

Avoid security risks caused by excessive permissions

6. It is forbidden to use SELECT *; queries must use SELECT <field list>

reason:

Consume more CPU and IO and network bandwidth resources

Can't use covering index

Specifying a field list also reduces the impact of table structure changes

7. It is forbidden to use INSERT statements without field lists

For example: insert into t values ('a','b','c');

Instead, use: insert into t(c1,c2,c3) values ('a','b','c');

8. Avoid using subqueries, you can optimize subqueries as join operations

Usually, when the subquery appears in an IN clause and is simple SQL (with no UNION, GROUP BY, ORDER BY, or LIMIT clause), it can be converted into a join for optimization.

Reasons for poor subquery performance:

The result set of a subquery cannot use an index. Usually, the result set of a subquery will be stored in a temporary table. Neither the memory temporary table nor the disk temporary table will have an index, so the query performance will be affected to a certain extent;

Especially for subqueries that return a relatively large result set, the impact on query performance is greater;

Since subqueries will generate a large number of temporary tables and have no indexes, they will consume too much CPU and IO resources and generate a large number of slow queries.
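A minimal sketch of converting an IN subquery into a join (tables are hypothetical; customer.id is assumed unique, so the join does not duplicate rows):

-- subquery form
SELECT * FROM order_info WHERE customer_id IN (SELECT id FROM customer WHERE city = 'Beijing');

-- equivalent join form
SELECT o.*
FROM order_info o
JOIN customer c ON o.customer_id = c.id
WHERE c.city = 'Beijing';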

9. Avoid using JOIN to associate too many tables

MySQL has a join buffer; its size can be set with the join_buffer_size parameter.

In MySQL, when one SQL statement joins an additional table, one more join buffer is allocated; the more tables a single SQL statement joins, the more memory it occupies.

If a large number of multi-table association operations are used in the program, and the join_buffer_size setting is not reasonable, it is easy to cause server memory overflow, which will affect the stability of server database performance.

At the same time, for association operations, temporary table operations will be generated, which will affect query efficiency. Mysql allows up to 61 tables to be associated, and it is recommended not to exceed 5.

10. Reduce the number of interactions with the database

The database is more suitable for processing batch operations and merging multiple identical operations together, which can improve processing efficiency
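A minimal sketch of merging identical operations into one round trip (table and values are hypothetical, reusing the access_log example above):

-- one statement instead of three separate INSERTs
INSERT INTO access_log (ip) VALUES
    (INET_ATON('10.0.0.1')),
    (INET_ATON('10.0.0.2')),
    (INET_ATON('10.0.0.3'));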

11. When making OR comparisons on the same column, use IN instead of OR

The number of values in an IN list should not exceed 500. IN can use indexes effectively, whereas OR can rarely use an index in most cases.
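A minimal sketch of the preferred form (the customer table is hypothetical):

-- prefer
SELECT name, phone FROM customer WHERE id IN (101, 102, 103);

-- rather than
SELECT name, phone FROM customer WHERE id = 101 OR id = 102 OR id = 103;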

12. It is forbidden to use order by rand() for random sorting

ORDER BY RAND() loads all rows that match the condition into memory and sorts them there by randomly generated values, generating a random value for every row; if the matching data set is large, this consumes a lot of CPU, IO, and memory resources.

It is recommended to get a random value in the program and then get the data from the database
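A minimal sketch of the recommended pattern, assuming the program picks the random value (the article table is hypothetical):

-- not recommended: sorts every matching row by a random value
-- SELECT * FROM article ORDER BY RAND() LIMIT 1;

-- recommended: fetch the id range, choose a random id in the program, then look it up by primary key
SELECT MIN(id), MAX(id) FROM article;
SELECT * FROM article WHERE id >= 12345 LIMIT 1;  -- 12345 stands for the randomly chosen value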

13. It is forbidden to perform function conversion and calculation on columns in the WHERE clause

Indexes cannot be used when performing function conversions or calculations on columns.

Not recommended:

where date(create_time) = '20190101'

Recommended:

where create_time >= '20190101' and create_time < '20190102'

14. Use UNION ALL instead of UNION when there are obviously no duplicate values

UNION will put all the data of the two result sets into the temporary table and then perform the deduplication operation

UNION ALL will no longer deduplicate the result set
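A minimal sketch (the two archive tables are hypothetical):

SELECT order_id FROM order_2022
UNION ALL
SELECT order_id FROM order_2023;
-- UNION ALL skips the deduplication pass that UNION performs through a temporary table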

15. Split complex large SQL into multiple small SQL

Big SQL: SQL that is logically complex and requires a large amount of CPU for calculation

MySQL: One SQL can only use one CPU for calculation

After SQL is split, it can be executed in parallel to improve processing efficiency

11. Code of Conduct for Database Operations

1. Batch write operations (UPDATE, DELETE, INSERT) touching more than 1 million rows must be split into multiple batches

Large batch operations can cause serious master-slave delays

In a master-slave environment, large batch operations may cause serious replication delay. Large write operations generally take a long time to run, and they start executing on the slaves only after they have finished on the master, so a long delay builds up between the master and the slaves.

When the binlog is in row format, a large volume of log is generated

Large batch write operations generate a large volume of logs, especially with row-format binary logs: since row format records the modification of every row, the more data modified at once, the more log is generated, and the longer it takes to transfer and replay the log, which is another cause of master-slave delay.

Avoid large transaction operations

Large-batch data modifications run inside a single transaction, which locks a large number of rows in the table and causes heavy blocking, with a very large impact on MySQL performance.

In particular, long-term blocking will occupy all available connections of the database, which will prevent other applications in the production environment from connecting to the database. Therefore, it is necessary to pay attention to the batching of large-scale write operations.
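A minimal sketch of batching a large delete (table, condition, and batch size are hypothetical):

-- run repeatedly (from the application or a script) until ROW_COUNT() returns 0,
-- pausing briefly between batches so replication can catch up
DELETE FROM access_log WHERE create_time < '2022-01-01' LIMIT 5000;
SELECT ROW_COUNT();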

2. For large tables, use pt-online-schema-change to modify the table structure

Avoid master-slave delays caused by large table modifications

Avoid locking the table when modifying table fields

You must be cautious when modifying the table structure of large tables, because it causes serious table locking, which is intolerable especially in production environments.

pt-online-schema-change first creates a new table with the same structure as the original table, applies the structure change to the new table, copies the data from the original table into the new table, and adds triggers on the original table.

The triggers copy data newly written to the original table into the new table; once all rows have been copied, the new table is renamed to the original table's name and the original table is dropped.

Break down the original DDL operation into multiple small batches.
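A hedged sketch of a pt-online-schema-change invocation (database, table, column, and connection options are hypothetical; verify the options against your version of the tool and test with --dry-run first):

pt-online-schema-change \
  --alter "ADD COLUMN remark VARCHAR(255) NOT NULL DEFAULT ''" \
  D=mydb,t=order_info \
  --host=127.0.0.1 --user=dba_user --ask-pass \
  --execute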

3. It is forbidden to grant super permission to the account used by the program

When the maximum number of connections has been reached, MySQL still allows one extra connection for a user with the SUPER privilege, so SUPER should be reserved for the DBA account to log in and handle problems.

4. For the program to connect to the database account, follow the principle of least privilege

The database account used by a program may only be used within a single database and must not be used across databases; in principle, the account used by a program is not allowed to have the DROP privilege.
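A minimal sketch of a least-privilege application account (user, host range, and database names are hypothetical):

CREATE USER 'app_user'@'10.0.%.%' IDENTIFIED BY '********';
GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app_user'@'10.0.%.%';
-- no DROP, ALTER, or SUPER privileges, and only on the single app_db schema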


Source: blog.csdn.net/s_unbo/article/details/129460333