[Optimization] Optimization solutions for MySQL tables with tens of millions of rows

Problem overview

We use Alibaba Cloud RDS for MySQL (MySQL 5.6). One table records users' Internet access; six months of data comes to nearly 20 million rows, and keeping the most recent year brings it to about 40 million. Queries are extremely slow and the system hangs daily, which seriously affects the business.

Background: this is an old system. Whoever designed it had probably not finished college; the table design and the SQL statements are not merely garbage, they are painful to look at. The original developers have all left and maintenance has fallen to me. This is the legendary "run away when you can no longer maintain it", and I am the one who fell into the pit!

I tried to solve the problem, hence this write-up.

Overview of options

Option One: optimize the existing MySQL database. Pros: no impact on the existing business, no source code changes, lowest cost. Cons: optimization hits a ceiling; once the data passes 100 million rows it is game over.
Option Two: upgrade to a different database that is 100% MySQL-compatible. Pros: no impact on the existing business, no source code changes, and database performance improves with almost no work on your part. Cons: costs more money.
Option Three: go all the way to a big data solution and switch to a NewSQL/NoSQL database. Pros: no data capacity bottleneck. Cons: the source code must be modified, the business is affected, and the total cost is the highest.

Apply the three options in that order. Below the hundred-million-row level there is no need to switch to NoSQL; the development cost is too high. I tried all three and ended up with a working solution for each. Along the way I sent my regards, a thousand times over, to the developers who ran off :-)

Option One in detail: optimizing the existing MySQL database

After phone calls with Alibaba Cloud's database experts, searching Google for solutions, and asking the experts in chat groups, I summarized the following (all of it the essentials):

1. Consider performance during database design and table creation
2. Pay attention to optimization when writing SQL
3. Partitioning
4. Table splitting
5. Database splitting

1. Consider performance during database design and table creation

MySQL itself is highly flexible, which means performance is not guaranteed and depends heavily on the developers' skill: skilled developers get high performance out of MySQL. This is a common problem with many relational databases, which is why a company's DBA is usually paid so well.

When designing tables, keep the following in mind:

Avoid NULL values in fields. NULL is hard to optimize in queries and takes extra index space; a default of 0 instead of NULL is recommended.
Prefer INT to BIGINT, and add UNSIGNED if the value is never negative (this doubles the positive range). TINYINT, SMALLINT or MEDIUMINT is better still where it fits.
Use an enumeration or an integer instead of a string type
Prefer TIMESTAMP to DATETIME
Do not put too many fields in one table; under 20 is recommended
Store IP addresses as integers
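
The last tip can be sketched in Python. This is only an illustration and assumes IPv4 addresses; the helper names `ip_to_int` and `int_to_ip` are made up for this example. On the MySQL side, the built-in `INET_ATON()` and `INET_NTOA()` functions perform the same conversion.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 string to an unsigned 32-bit integer (hypothetical helper)."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value: int) -> str:
    """Convert the stored integer back to dotted notation (hypothetical helper)."""
    return str(ipaddress.IPv4Address(value))


stored = ip_to_int("192.168.1.10")
print(stored)             # small enough for a 4-byte INT UNSIGNED column
print(int_to_ip(stored))  # round-trips to the original string
```

An integer column takes 4 bytes and compares numerically, whereas the text form needs up to 15 characters and compares as a string.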

Indexes

More indexes is not better. Create them deliberately based on the queries, consider indexing the columns used in WHERE and ORDER BY, and use EXPLAIN to check whether a query uses an index or does a full table scan
Avoid NULL checks on fields in the WHERE clause where possible; they can cause the engine to give up on the index and scan the whole table
Fields with very few distinct values are poor index candidates, for example "gender", which has only two or three values
For character fields, build prefix indexes only
Preferably do not use a character field as the primary key
Do not use foreign keys; enforce the constraints in the application
Avoid UNIQUE where possible; enforce the constraints in the application
With a multi-column index, keep the column order consistent with the query conditions, and delete unnecessary single-column indexes

In short: use appropriate data types and choose appropriate indexes

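# Choosing appropriate data types
(1) Use the smallest data type that can hold the data: integer < date, time < char, varchar < blob
(2) Use simple data types. Integers are cheaper to process than characters because string comparison is more complex. For example, store times as int and store IP addresses as bigint using the conversion functions
(3) Use sensible field lengths; fixed-length tables are faster. Use enum and char rather than varchar
(4) Define fields as NOT NULL wherever possible
(5) Use text as little as possible; if it cannot be avoided, it is best moved to a separate table
# Choosing appropriate index columns
(1) Columns that are queried frequently: those appearing in WHERE, GROUP BY, ORDER BY and ON clauses
(2) Columns that appear in WHERE conditions with <, <=, =, >, >=, BETWEEN, IN, or LIKE with a string followed by the wildcard (%)
(3) Short columns: the smaller the indexed field the better, because the database stores data in pages and the more entries fit in one page the better
(4) Columns with high cardinality (many distinct values), which should come first in a composite index. Cardinality can be checked by counting the distinct values of the column: the larger the count, the higher the cardinality.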
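
The cardinality check in (4) can be tried out as follows. This is a rough sketch: the `visits` table, its columns and the sample data are invented for the example, and an in-memory SQLite database is used only as a stand-in so the code runs anywhere. The `COUNT(DISTINCT ...)` query itself is plain SQL that MySQL also accepts.

```python
import sqlite3

# SQLite stands in for MySQL here; the table and its data are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, user_id INTEGER, gender TEXT)"
)
rows = [(i, i % 500, "M" if i % 2 else "F") for i in range(1, 1001)]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)


def selectivity(column: str) -> float:
    """Distinct values divided by total rows; closer to 1 means more selective."""
    # The column name comes from the fixed tuple below, never from user input.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM visits"
    ).fetchone()
    return distinct / total


for col in ("user_id", "gender"):
    print(col, selectivity(col))
```

In this toy data set `user_id` is far more selective than `gender`, so it would go first in a composite index and `gender` would not be worth indexing on its own.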

The original developers have run off and the tables were created long ago, so I cannot change them. These measures could not be applied. Abandoned!

2. Pay attention to optimization when writing SQL

Use LIMIT to restrict the number of rows a query returns
Avoid SELECT *; list the fields you need
Use joins (JOIN) instead of subqueries
Split large DELETE or INSERT statements
Turn on the slow query log to find slow SQL
Do no arithmetic on columns: SELECT id FROM t WHERE age + 1 = 10. Any operation on a column, including functions and computed expressions, leads to a table scan; move the operation to the right-hand side of the comparison whenever possible
Keep SQL statements as simple as possible: one statement runs on only one CPU; split big statements into small ones to shorten lock time; one big statement can block the entire database
Rewrite OR as IN: OR is O(n) while IN is O(log n); keep the number of IN values under about 200
Do not use stored functions and triggers; implement the logic in the application
Avoid %xxx-style LIKE queries
Use JOIN sparingly
Compare values of the same type, for example '123' with '123' and 123 with 123
Avoid != or <> in the WHERE clause where possible, otherwise the engine may give up on the index and scan the whole table
For consecutive values use BETWEEN rather than IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
Do not fetch the whole table for list data; paginate with LIMIT, and keep the page size modest
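
The "no arithmetic on columns" rule from the list above can be observed with a small experiment. Treat this as a sketch under stated assumptions: it uses SQLite through Python's standard library rather than MySQL, the table `t` and index `idx_age` are invented, and SQLite's `EXPLAIN QUERY PLAN` text differs from MySQL's `EXPLAIN` output. The underlying idea, that an index on `age` cannot be searched when the column is wrapped in an expression, is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, age INTEGER)")
conn.execute("CREATE INDEX idx_age ON t (age)")
conn.executemany("INSERT INTO t (age) VALUES (?)", [(i % 90,) for i in range(1000)])


def plan(sql: str) -> str:
    """Return SQLite's query-plan text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last field of each plan row is the human-readable description.
    return " | ".join(row[-1] for row in rows)


# Arithmetic on the column: every entry has to be evaluated.
print(plan("SELECT id FROM t WHERE age + 1 = 10"))
# Same condition with the arithmetic moved to the right-hand side.
print(plan("SELECT id FROM t WHERE age = 10 - 1"))
```

The first plan reports a scan, while the second reports a search on `idx_age`. On MySQL the equivalent check is to run `EXPLAIN` on both statements and compare the `type` and `key` columns.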

The original developers have run off and the program is already live, so I cannot change the SQL. These measures could not be applied. Abandoned!

Storage engines

The two most widely used engines today are MyISAM and InnoDB:

1. MyISAM
MyISAM is the default engine in MySQL 5.1 and earlier. Its characteristics:

No row-level locks: reads take a read lock on every table involved, and writes take an exclusive lock on the table
No transactions
No foreign keys
No safe recovery after a crash
New rows can be inserted while the table is being read
Indexes on the first 500 characters of BLOB and TEXT columns, and full-text indexes
Delayed index updates, which greatly improve write performance
Compressed tables for data that will not be modified, which greatly reduce disk usage

2. InnoDB
InnoDB became the default engine in MySQL 5.5. Its characteristics:

Row-level locks, with MVCC to support high concurrency
Transactions
Foreign keys
Safe recovery after a crash
No full-text indexes (before MySQL 5.6)

Overall, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables

MyISAM may be very fast and uses little storage, but the application requires transaction support, so InnoDB is mandatory. Switching engines could not be done. Abandoned!

3. Partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting. You add the partition parameters when creating the table; it is transparent to the application and needs no code changes
To the user a partitioned table is a single logical table, but underneath it consists of multiple physical sub-tables. The partitioning code is essentially an object wrapper around a group of underlying tables, and to the SQL layer it is a black box that completely hides them. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index
SQL statements need to be optimized for partitioned tables. Include the partitioning column in the conditions so the query is narrowed to a small number of partitions; otherwise all partitions are scanned. EXPLAIN PARTITIONS shows which partitions a statement will touch, which helps with tuning. In my tests, queries whose conditions did not include the partitioning column also got faster, so this measure is worth trying.

Benefits of partitioning:

A single table can store more data
Data in a partitioned table is easier to maintain: large amounts of data can be deleted in bulk by clearing a whole partition, and new partitions can be added for newly inserted data. A single partition can also be optimized, checked and repaired on its own
Some queries can be confined to a few partitions based on their conditions, which makes them fast
Partitions can be placed on different physical devices, making efficient use of multiple hardware devices
Partitioning can avoid certain special bottlenecks, such as mutually exclusive access to a single InnoDB index or inode lock contention on the ext3 file system
Individual partitions can be backed up and restored

Limitations and drawbacks of partitioning:

A table can have at most 1024 partitions (8192 from MySQL 5.6.7)
If the table has a primary key or unique index, the partitioning columns must be included in every primary key and unique index
Partitioned tables cannot use foreign key constraints
NULL values defeat partition pruning
All partitions must use the same storage engine

Partition types:

RANGE partitioning: assigns rows to partitions based on column values falling within a given continuous range
LIST partitioning: similar to RANGE, except that the partition is selected by matching the column value against a set of discrete values
HASH partitioning: the partition is selected by the return value of a user-defined expression computed from the column values of the row being inserted. The expression can be any valid MySQL expression that yields a non-negative integer
KEY partitioning: similar to HASH, except that you supply only one or more columns and the MySQL server provides its own hash function. The columns do not have to be integers, because the server's hash function always returns an integer
For the details of MySQL partitioning, search Google or read the official documentation; this is only meant as a starting point.
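
To make the difference between RANGE and HASH concrete, here is a toy model in Python of how a row's partition is chosen. It is a sketch, not MySQL's implementation: the function names and the 12-partition monthly layout are assumptions made for the example. The HASH case follows the documented rule for plain `PARTITION BY HASH`, where the partition number is the remainder of the expression divided by the partition count.

```python
from bisect import bisect_right

# Hypothetical RANGE layout: partition i holds values below upper_bounds[i],
# mirroring twelve "VALUES LESS THAN (...)" clauses for months 1..12.
upper_bounds = list(range(2, 14))  # 2, 3, ..., 13


def range_partition(value: int) -> int:
    """Index of the first partition whose upper bound is greater than value."""
    idx = bisect_right(upper_bounds, value)
    if idx == len(upper_bounds):
        raise ValueError("value is beyond the last partition")
    return idx


def hash_partition(value: int, partitions: int = 64) -> int:
    """Plain HASH partitioning: remainder of the expression by the partition count."""
    return value % partitions


print(range_partition(11))       # month 11 lands in partition index 10
print(hash_partition(11901336))  # an id mapped onto one of 64 partitions
```

RANGE keeps neighbouring values together, which suits time-based cleanup, while HASH spreads consecutive ids evenly across all partitions.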

I first RANGE-partitioned the access record table by month into 12 partitions. Query efficiency improved about sixfold, which was not enough, so I switched to HASH partitioning on id with 64 partitions. Query speed improved markedly. Problem solved!

The results are as follows:

PARTITION BY HASH (id) PARTITIONS 64
select count(*) from readroom_website; -- 11901336 rows
/* Affected rows: 0  Found rows: 1  Warnings: 0  Duration for 1 query: 5.734 sec. */
select * from readroom_website where month(accesstime) = 11 limit 10;
/* Affected rows: 0  Found rows: 10  Warnings: 0  Duration for 1 query: 0.719 sec. */

4. Table splitting

Table splitting means this: if a large table has gone through all the optimizations above and queries still hang, split it into multiple tables, turn one query into several, and combine the results before returning them to the user.
Splitting can be vertical or horizontal, and is usually keyed on one field. For example, splitting on the id field into 100 tables gives table names of the form tableName_id%100
However, table splitting requires source code changes, creates a great deal of development work and greatly increases development cost. It only fits projects that anticipated large data volumes early in development and built table splitting in from the start; it is not suited to retrofitting an application that is already live, because the cost is too high! Options Two and Three cost less than this. Not recommended.
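
The tableName_id%100 scheme can be sketched as below, which also hints at why the application code has to change. The helper names are hypothetical; a real implementation would additionally run one query per table and merge the results.

```python
from collections import defaultdict
from typing import Dict, List

SHARD_COUNT = 100  # matches the "split by id into 100 tables" example


def shard_table(base_name: str, row_id: int) -> str:
    """Name of the physical table that holds row_id (hypothetical naming scheme)."""
    return f"{base_name}_{row_id % SHARD_COUNT}"


def group_ids_by_table(base_name: str, ids: List[int]) -> Dict[str, List[int]]:
    """Split one logical lookup into per-table id lists, one query per table."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for row_id in ids:
        groups[shard_table(base_name, row_id)].append(row_id)
    return dict(groups)


print(shard_table("tableName", 12345))
print(group_ids_by_table("tableName", [1, 101, 202, 5]))
```

Every place in the code that reads or writes the table has to go through a routing step like this, which is where the development cost comes from.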

5. Database splitting

This means splitting one database into several. I suggest simply setting up read/write splitting; true database splitting also brings a lot of development cost and the gain is not worth the loss! Not recommended.
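
Read/write splitting can be pictured with a minimal router like the one below. This is a deliberately simplified sketch: the class, the server names and the "first keyword" rule are assumptions for illustration. It ignores transactions, `SELECT ... FOR UPDATE` and replication lag, all of which a real setup has to handle.

```python
import itertools


class ReadWriteRouter:
    """Send plain SELECTs to replicas in rotation and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql: str) -> str:
        """Return the name of the server that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM readroom_website LIMIT 10"))
print(router.route("UPDATE readroom_website SET accesstime = NOW() WHERE id = 1"))
```

Routing only `SELECT` to replicas is the conservative choice: any statement the router does not recognise goes to the primary.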

Option Two in detail: upgrade to a database that is 100% MySQL-compatible

If MySQL's performance is not enough, switch. To avoid changing the source code and to migrate the existing business smoothly, the replacement must be 100% MySQL-compatible.

1. Open source choices

TiDB https://github.com/pingcap/tidb
CUBRID https://www.cubrid.org/
Open source databases bring substantial operations and maintenance costs, their industrial quality still lags behind MySQL, and there are many pitfalls to step into. If your company requires a self-hosted database, choose a product of this type.

2. Cloud database choices

Alibaba Cloud POLARDB
https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudEssentials.47.7a984b5cS7h4wH

Official description: POLARDB is Alibaba Cloud's self-developed next-generation cloud-native distributed relational database. It is 100% compatible with MySQL, offers up to 100 TB of storage, and delivers up to 6 times the performance of MySQL. POLARDB combines the stability, reliability and high performance of commercial databases with the simplicity, scalability and continuous iteration of open source databases, at 1/10 the cost of a commercial database.

I opened an instance and tested it. It supports free MySQL data migration with no operational cost, performance improved about 10 times, and the price is close to RDS. A good alternative!

Alibaba Cloud OceanBase
Used by Taobao, it withstands Double Eleven with outstanding performance. It is in public beta and I could not try it, but it is worth looking forward to
Alibaba Cloud HybridDB for MySQL (formerly PetaData)
https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudEssentials.54.7a984b5cS7h4wH

Official description: the cloud database HybridDB for MySQL (formerly PetaData) is an HTAP (Hybrid Transaction/Analytical Processing) relational database that supports both online transaction processing (OLTP) and online analytical processing (OLAP) on massive data.

I tested this one as well. It is a solution that handles both OLAP and OLTP, but the price is too high, up to 10 yuan per hour. It is wasteful for storage alone and suits workloads that combine storage and analysis.

Tencent Cloud DCDB
https://cloud.tencent.com/product/dcdb_for_tdsql

Official description: DCDB, also known as TDSQL, is a high-performance distributed database that is compatible with the MySQL protocol and syntax and supports automatic horizontal splitting: the business sees a complete logical table while the data is split evenly across multiple shards. Each shard uses a primary/standby architecture by default, and the product provides a full set of solutions including disaster recovery, restore, monitoring and expansion without downtime, for massive data scenarios at the TB or PB level.

I do not like using Tencent's products, so I will not say much. The reason is that when something goes wrong you cannot find anyone to help, and an unsolved production problem is a headache! It is cheap, though, and fine for very small companies to play with.

Option Three in detail: drop MySQL and process the data with a big data engine

Once the data passes 100 million rows there is no choice left; big data it is.

1. Open source solutions
The Hadoop family: just throw HBase/Hive at it. But the operations and maintenance cost is very high and most companies cannot afford it; without an investment of around 100,000 you will not get good results!
2. Cloud solutions
There are more of these, and they are the future trend: specialized companies provide professional big data services, small companies or individuals buy them, and big data becomes a public utility like water and electricity, present in every part of society.

In China, Alibaba Cloud is undoubtedly the best at this.

I chose Alibaba Cloud MaxCompute together with DataWorks. It is very comfortable to use, billed by usage, and extremely cheap.
MaxCompute can be thought of as a counterpart to the open source Hive. It lets you work on data with SQL, MapReduce, AI algorithms, Python scripts and shell scripts; data is presented as tables, stored in a distributed fashion, and processed by scheduled tasks and batch jobs. DataWorks provides a workflow-based way to manage your data processing tasks, with scheduling and monitoring.
You can of course choose other products such as Alibaba Cloud HBase. My workload is mainly offline processing, so I chose MaxCompute. Almost everything is done in a graphical interface; I wrote roughly 300 lines of SQL, and the cost of solving the data processing problem did not exceed 100 yuan.

Reposted from: https://www.cnblogs.com/yliucnblogs/p/10096530.html
