MySQL optimization strategies for large tables with tens of millions of rows

How to optimize a table with tens of millions of rows is a highly technical question. Our intuition usually jumps straight to sharding or data partitioning, so I wanted to take some time to add to and organize the lessons we have learned in this area; suggestions are welcome.

 


 

Ideas kept popping into my head at first, and after repeated self-criticism, and also drawing on the experience of several teams, I put together the following outline.

To thoroughly understand this problem, we have to go back to its origin. I break the question into three parts: "tens of millions", "large table", and "optimization", which correspond in turn to the "data volume", the "object", and the "goal".

 

I will explain each part in turn and give a series of solutions.

 

Data volume: tens of millions of rows

 

In fact, "tens of millions" is just a number; our impression is simply that the data volume is large.

 

Here we need to refine the concept, because the data volume changes with the business and over time. We should look at this metric dynamically, so that different scenarios get different handling strategies.

 

① The data volume is in the tens of millions and may reach a hundred million or more

 

This is usually log-style (flow) business data, which grows gradually over time; crossing the ten-million threshold is very easy.

 

② The data volume is in the tens of millions and relatively stable

 

If the data volume is relatively stable, the data generally belongs to state-type data. For example, with 10 million users, each user has a corresponding row in the table; as the business grows, the order of magnitude stays relatively stable.

 

③ The data volume is in the tens of millions but should not be that large

 

This situation is usually discovered passively, and often when it is already too late. For example, you find a configuration table containing tens of millions of rows, or a table whose data has been accumulating for a long time and 99% of it is expired or garbage data.

 

Data volume gives an overall picture; we also need a closer understanding of the data itself, which leads to the second part.

 

Object: the data table

 

Data processing in a database is like a set of pipes: the data to be processed flows through them, and the use and destination of that data differ.

 

Business data generally falls into three types:

 

① Flow (log) data

 

Flow data is stateless; individual transactions are unrelated to each other, and each business operation produces a new record over time.

 

Examples are transaction logs and payment logs: a new business operation only needs to insert a new record. The characteristic is that later data does not depend on earlier data, and all data flows into the database in chronological order.

 

② State data

 

State data is stateful; multiple transactions depend on it, and its accuracy must be guaranteed. For example, when recharging an account, the original balance must be read before the payment can succeed.

 

③ Configuration data

 

This type of data is small in volume, simple in structure, usually static, and changes infrequently.
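To make the difference between flow data and state data concrete, here is a minimal sketch; the table names and columns are hypothetical and not taken from any real system:

    -- Hypothetical flow (log) table: append-only, every business action inserts a new row.
    CREATE TABLE pay_flow (
        flow_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'flow id',
        user_id    BIGINT UNSIGNED NOT NULL COMMENT 'user id',
        amount     DECIMAL(12, 2)  NOT NULL COMMENT 'payment amount',
        created_at DATETIME        NOT NULL COMMENT 'creation time',
        PRIMARY KEY (flow_id),
        KEY idx_user_id_created_at (user_id, created_at)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

    -- Hypothetical state table: one row per user, updated in place,
    -- so accuracy depends on the current value.
    CREATE TABLE account (
        user_id BIGINT UNSIGNED NOT NULL COMMENT 'user id',
        balance DECIMAL(12, 2)  NOT NULL COMMENT 'current balance',
        PRIMARY KEY (user_id)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

    -- Flow data only ever grows by INSERT:
    INSERT INTO pay_flow (user_id, amount, created_at) VALUES (1001, 100.00, NOW());

    -- State data is read and updated in place:
    UPDATE account SET balance = balance + 100.00 WHERE user_id = 1001;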

 

At this point we have an overall picture. If we want to optimize, we are really facing a 3 x 3 matrix; if we also consider the read/write ratio (read-heavy/write-light, read-light/write-heavy, ...), it becomes 3 x 3 x 4 = 24 combinations. Obviously an exhaustive treatment is neither practical nor necessary; instead we can choose different strategies based on the storage properties and business characteristics of different kinds of data.

 

So we take the approach of grabbing the key points and sorting out some common optimization ideas. The core ideas are the ruler for our overall optimization design, and they determine the difficulty and the risks of the work.

 

For the optimization plan, I will elaborate from a business-oriented perspective.

 

Goal: Optimization

 

At this stage, the optimization schemes we talk about can be summarized into a relatively complete whole, divided into five parts:

In fact, the sub-database and sub-table (sharding) schemes we usually mention are only a small part of it; there is much more to expand on.

Understandably, the data volume we face must support a relatively large scale, and a DBA certainly maintains more than one table. How to manage tables better, support expansion as the business grows, and guarantee performance at the same time: these are the mountains placed in front of us.

 

We will discuss these five areas of improvement:

  • Design specifications

  • Business layer optimization

  • Architecture layer optimization

  • Database Optimization

  • Management Optimization

 

Design specifications

 

The first thing mentioned here is specification design, not some other lofty kind of design.

 

Hegel said that order is the first condition of freedom. This matters especially in scenarios with heavy division of labor and collaboration; otherwise, with too many teams involved, there will be plenty of problems.

 

I would like to mention the following specifications; they are only the development-related part and can be used as a reference.

The essence of a specification is not the solution itself but an effective way to eliminate some potential problems. For tables with tens of millions of rows, I have listed below some details of the specifications to follow; they cover some of the common problems in our basic design and usage.

 

For example, a table design in which every field is varchar(500) is in fact a very irregular implementation; we will expand on these specifications below.

 

Configuration specifications:

  • MySQL databases use the InnoDB storage engine by default.

  • Ensure a unified character set: the MySQL instance, databases, and tables use the UTF8 character set, and the character set used by the application for access, display, and so on is also uniformly set to UTF8.

    Note: UTF8 cannot store emoji-type data; UTF8MB4 is required for that and can be configured inside MySQL. MySQL 8.0 already defaults to UTF8MB4; this can be unified or customized according to the company's business situation.

  • MySQL's default transaction isolation level is RR (Repeatable Read); it is suggested to set it to RC (Read Committed) at initialization, which is better suited to OLTP business (a short sketch of checking these settings follows this list).

  • Tables in a database must be planned reasonably and the data volume of a single table controlled; for MySQL, it is suggested to keep the number of rows in a single table under 20 million.

  • Within a MySQL instance, keep the number of databases and tables as small as possible; generally no more than 50 databases per instance, and no more than 500 tables (including partitioned tables) per database.
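As a quick sanity check for a few of the items above, the following statements are a sketch (the exact values should follow your own company standard):

    -- Check the default storage engine and server character set.
    SHOW VARIABLES LIKE 'default_storage_engine';
    SHOW VARIABLES LIKE 'character_set_server';

    -- Apply Read Committed for the current session; the server-wide default
    -- is usually set in my.cnf (transaction-isolation = READ-COMMITTED).
    SET SESSION transaction_isolation = 'READ-COMMITTED';

    -- Confirm the effective isolation level
    -- (variable name in MySQL 8.0; older 5.7 releases also expose tx_isolation).
    SELECT @@transaction_isolation;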

 

Built table specifications:

  • The use of foreign key constraints in InnoDB is prohibited; referential integrity can be guaranteed at the application level.

  • High-precision floating-point values must be stored with DECIMAL instead of FLOAT or DOUBLE.

  • Do not define display widths for integers; for example, use INT instead of INT(4).

  • The ENUM type is not recommended; TINYINT can be used instead.

  • Try not to use TEXT or BLOB types; if they are unavoidable, it is recommended to split overly large or rarely used descriptive fields into a separate table. Also, do not use the database to store pictures or documents.

  • When storing years, use YEAR(4), not YEAR(2).

  • It is suggested to define fields as NOT NULL.

  • It is recommended that the DBA provide a SQL auditing tool, and table designs must pass a compliance review through that tool (a small example table follows this list).
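As an illustration only (the table and columns are made up), a definition that follows most of the points above might look like this:

    -- Illustrative only: InnoDB, utf8mb4, NOT NULL columns with comments,
    -- DECIMAL instead of FLOAT/DOUBLE, TINYINT instead of ENUM,
    -- INT types without display width, and no TEXT/BLOB columns.
    CREATE TABLE user_order (
        order_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'order id',
        user_id     BIGINT UNSIGNED NOT NULL COMMENT 'user id',
        order_state TINYINT         NOT NULL DEFAULT 0 COMMENT '0: created, 1: paid, 2: closed',
        amount      DECIMAL(12, 2)  NOT NULL DEFAULT 0.00 COMMENT 'order amount',
        created_at  DATETIME        NOT NULL COMMENT 'creation time',
        PRIMARY KEY (order_id),
        KEY idx_user_id_created_at (user_id, created_at)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4 COMMENT = 'sample order table';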

 

Naming conventions:

  • Database, table, and field names are all lowercase.

  • Database names, table names, field names, and index names use lowercase letters, separated by "_".

  • Database, table, and field names are recommended to be no longer than 12 characters. (They may be up to 64 characters, but for standardization, easy identification, and reduced transmission volume, keep them within 12.)

  • Database, table, and field names should ideally be self-explanatory, without needing comments.

 

A brief summary of the naming conventions for common objects is given in the table below, for reference:

Name list

 

Index specification:

  • Index naming suggestion: idx_col1_col2[_colN], uniq_col1_col2[_colN] (abbreviate if the field names are too long).

  • The number of fields in one index is recommended not to exceed 5.

  • Keep the number of indexes on a single table within 5.

  • InnoDB tables are generally recommended to have a primary key column; in high-availability cluster solutions it is practically mandatory.

  • When creating a composite index, put the fields with higher selectivity first.

  • UPDATE and DELETE statements need indexes added according to their WHERE conditions.

  • Fuzzy queries with a "%" prefix, such as LIKE "%weibo", are not recommended; they cannot use an index and cause a full table scan.

  • Use covering indexes appropriately. For example, for SELECT email, uid FROM user_email WHERE uid = xx, if uid is not the primary key, a covering index idx_uid_email (uid, email) can be created to improve query efficiency (see the sketch after this list).

  • Avoid applying functions to indexed fields in queries; otherwise the index cannot be used.

  • Confirm with the DBA whether an index needs to be changed.
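Taking the covering-index item above as an example (the user_email table is hypothetical), the idea looks roughly like this:

    -- Hypothetical table from the covering-index example.
    CREATE TABLE user_email (
        id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'row id',
        uid   BIGINT UNSIGNED NOT NULL COMMENT 'user id',
        email VARCHAR(128)    NOT NULL COMMENT 'email address',
        PRIMARY KEY (id)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

    -- Covering index: the query below can be answered from the index alone.
    ALTER TABLE user_email ADD KEY idx_uid_email (uid, email);

    -- EXPLAIN should show "Using index" in the Extra column,
    -- meaning no lookup back to the full row is needed.
    EXPLAIN SELECT email, uid FROM user_email WHERE uid = 100;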

 

Application Specifications:

  • Avoid stored procedures, triggers, and custom functions; they couple business logic to the database, which easily becomes a bottleneck when moving to a distributed scheme later.

  • Prefer UNION ALL and reduce the use of UNION: UNION ALL does not deduplicate and performs less sorting, so it is relatively faster; if deduplication is not needed, use UNION ALL.

  • Prefer LIMIT N and use LIMIT M, N less often, especially on large tables or when M is large.

  • Reduce or avoid sorting; for example, if a GROUP BY statement does not need ordering, ORDER BY NULL can be added.

  • Use COUNT(*) to count the rows of a table, instead of COUNT(primary_key) or COUNT(1).

    Avoid COUNT(*) on InnoDB tables where possible; for counts with strong real-time requirements, Memcache or Redis can be used, and for non-real-time counts, a separate statistics table refreshed periodically can be used.

  • When changing a field (MODIFY COLUMN / CHANGE COLUMN), the original comment attribute must be included, otherwise the comment is lost after the change.

  • Use prepared statements; they improve performance and help prevent SQL injection.

  • The number of values in an IN clause should not be too large.

  • UPDATE and DELETE statements must have an explicit WHERE condition.

  • Values in WHERE conditions must match the data type of the field, to avoid implicit type conversion by MySQL.

  • SELECT and INSERT statements must explicitly specify field names; SELECT * and INSERT INTO table_name VALUES () without a column list are prohibited.

  • Use batch submission for INSERT statements (INSERT INTO table_name VALUES (), (), () ...), but the number of value tuples should not be too large (a small sketch follows this list).
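A few of the points above in SQL form, reusing the hypothetical user_order table from the earlier sketch:

    -- Explicit field names plus a small batch INSERT (instead of INSERT INTO t VALUES ()).
    INSERT INTO user_order (user_id, order_state, amount, created_at)
    VALUES
        (1001, 0, 19.90, NOW()),
        (1002, 0, 29.90, NOW()),
        (1003, 0, 39.90, NOW());

    -- The WHERE value matches the column type: user_id is numeric, so do not quote it
    -- as a string, otherwise MySQL may perform an implicit conversion and skip the index.
    SELECT order_id, amount FROM user_order WHERE user_id = 1001;

    -- GROUP BY without the implicit sort (relevant on MySQL 5.7 and earlier;
    -- 8.0 no longer sorts GROUP BY results by default).
    SELECT user_id, COUNT(*) AS order_cnt FROM user_order GROUP BY user_id ORDER BY NULL;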

 

Business layer optimization

 

Business layer optimization should be the highest-yield kind of optimization, and it is completely visible to the business layer. It includes business splitting, data splitting, and two common optimization scenarios (read-heavy/write-light and read-light/write-heavy).

① Business splitting

 

Business splitting covers the following two aspects:

  • Splitting mixed business into independent services

  • Separating state data from historical data

 

Business splitting means peeling a hybrid business apart into clearer, independent services, such as service 1, service 2, and so on. The total volume is still large, but each part is relatively independent and reliability is still guaranteed.

 

For the separation of state data and historical data, I can give an example.

 

For example, we have a table account, and suppose a user's balance is 100.

We need data changes to remain traceable in history afterwards. The account state data is then updated, increasing the balance by 100, so the balance becomes 200.

 

This process corresponds to one UPDATE statement and one INSERT statement, which we can direct to two different tables: account and account_hist.

 

In account_hist this is recorded as two INSERT statements, and in account as a single UPDATE.
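A minimal sketch of what those statements might look like, assuming hypothetical columns on account and account_hist:

    -- account_hist: two append-only rows keep the change traceable (columns are assumed).
    INSERT INTO account_hist (user_id, change_amount, balance_after, created_at)
    VALUES (1001, 100, 100, '2019-10-21 10:00:00');

    INSERT INTO account_hist (user_id, change_amount, balance_after, created_at)
    VALUES (1001, 100, 200, '2019-10-21 10:05:00');

    -- account: the state row is simply updated in place.
    UPDATE account SET balance = balance + 100 WHERE user_id = 1001;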

This is also a very basic form of hot/cold separation; it can greatly reduce maintenance complexity and improve service response efficiency.

 

② Data splitting

 

Splitting by date: this is quite common, especially splitting along the date dimension; the change at the application level is small, but the scalability benefit is large (a small sketch follows the list below).

  • Split data by the day dimension, such as test_20191021.

  • Split data by the month dimension, such as test_201910.

  • Split data by the year dimension, such as test_2019.
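For example, period tables can be created from the same definition as a base table (test is a placeholder name):

    -- Placeholder names: create period tables from the same definition as the base table.
    CREATE TABLE test_20191021 LIKE test;   -- split by day
    CREATE TABLE test_201910   LIKE test;   -- split by month
    CREATE TABLE test_2019     LIKE test;   -- split by year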

 

Using a partitioning scheme: partitioning is also commonly used; hash and range partitioning are the most frequent.

 

In MySQL I do not recommend using partitioned tables: although the growing data is split at the storage level, in the final analysis the data is still hard to scale horizontally, and MySQL has better ways to scale.

 

③ Optimizing read-heavy, write-light scenarios

 

Use caching technology such as Redis so that read requests are absorbed at the cache layer; this can greatly reduce the query pressure that hotspot data places on MySQL.

 

④ Optimizing read-light, write-heavy scenarios

 

For read-light, write-heavy scenarios, three approaches can be used:

  • Use asynchronous submission: asynchrony at the application layer is the most direct way to improve performance and minimizes synchronous waits.

  • Use queueing: a large number of write requests can be pushed through a queue and the data written in batches.

  • Reduce the write frequency; this one is a bit harder to grasp, so I will give an example:

 

For business data such as points, where the business priority is slightly lower, if the data is updated too frequently you can widen the update interval appropriately (for example, from every minute to every 10 minutes) to reduce the update frequency.

 

For example, a state-data update that sets the points to 200 can be transformed into a coarser-grained update issued less frequently. If the business data is updated very frequently within a short time, say 100 updates within one minute that take the points from 100 to 10000, the updates can also be batched and submitted on a time schedule.

 

For example, with state data where the points are updated a little at a time: instead of generating 100 transactions (200 SQL statements), the work can be reduced to 2 SQL statements, roughly as sketched below.
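A rough sketch of the batching idea, assuming hypothetical user_points and points_hist tables:

    -- Naive pattern: 100 transactions (about 200 SQL statements) within one minute,
    -- each adding a small number of points, e.g. repeated 100 times:
    --   UPDATE user_points SET points = points + 99 WHERE user_id = 1001;
    --   INSERT INTO points_hist (user_id, change_amount, created_at) VALUES (1001, 99, NOW());

    -- Batched pattern: accumulate the delta in the application or a queue and flush it
    -- periodically, so the same minute of activity becomes two statements.
    UPDATE user_points
       SET points = points + 9900
     WHERE user_id = 1001;

    INSERT INTO points_hist (user_id, change_amount, points_after, created_at)
    VALUES (1001, 9900, 10000, NOW());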

For business metrics such as these, details like the update frequency can be discussed and decided according to the specific business scenario.

 

Architecture layer optimization

 

Architecture layer optimization is the kind of work we think of as high-tech; we need to introduce new capabilities at the architecture level based on the business scenario.

① System scaling scenarios

 

Use middleware technology: data routing and horizontal scaling can be achieved; common middleware includes MyCAT, ShardingSphere, ProxySQL, and so on.

Use read/write splitting: this scales read demand and is more relevant to state tables; where a certain replication delay is acceptable, a multi-replica pattern can scale reads horizontally, implemented with middleware such as MyCAT, ProxySQL, MaxScale, MySQL Router, and so on.

Use load balancing: common options are LVS, or domain-name and service-discovery-based technologies such as Consul.

 

② Mixed OLTP + OLAP business scenarios

 

NewSQL can be used, giving priority to HTAP technology stacks compatible with the MySQL protocol, such as TiDB.

 

③ Offline statistics business scenarios

 

There are several types of options to choose from:

  • Use a NoSQL-class system. There are two types: one is a data warehouse system compatible with the MySQL protocol, commonly Infobright or ColumnStore; the other is column-oriented storage in a heterogeneous direction, such as HBase.

  • Use a data warehouse system based on an MPP architecture, such as Greenplum, for statistics such as T+1 reports.

 

Database Optimization

 

At the database level there are actually many cards to play, but relatively speaking the room for improvement is not that large; let us go through them one by one.

① Transaction Optimization

 

Choose the transaction model according to the business scenario, that is, whether the business depends strongly on transactions. For transaction "dimensionality reduction" strategies, here are a few small examples.

 

Dimensionality reduction strategy 1: convert stored procedure calls into transparent SQL calls

 

For new business, using stored procedures is obviously not a good idea. Compared with commercial databases, the functionality and performance of MySQL stored procedures have yet to prove themselves, and for today's lightweight business processing, stored procedures are too "heavy".

 

Some applications appear to be deployed with a distributed architecture, but the calls at the database layer are based on stored procedures; because stored procedures encapsulate a lot of logic, they are hard to debug and not very portable.

 

In this way, both the business logic and the performance pressure sit at the database level, making the database layer an easy bottleneck and a truly distributed architecture hard to achieve.

 

So a clear direction for improvement is to transform stored procedures into SQL calls, which can greatly improve business processing efficiency; the database interface calls become simple, clear, and controllable.

 

Dimensionality reduction strategy 2: convert DDL operations into DML operations

 

Some businesses frequently have urgent needs to add yet another field to a table, which exhausts both the DBA and the business developers. Imagine a table with hundreds of fields, basically name1, name2 ... name100; this kind of design is problematic in itself, before even considering performance.

 

The reason is that business requirements change dynamically. For example, a piece of game equipment may have 20 attributes, which a month later grows to 40, so every piece of equipment carries 40 attributes whether it uses them or not, creating a lot of redundancy.

 

The design specifications above already mention some design basics; to them we should add: keep the number of fields limited. If extensibility is needed, it can be achieved through configuration instead, for example by converting dynamically added fields into configuration information.

 

Configuration information can then be modified and supplemented through DML, making data entry more dynamic and easy to extend.
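One way to sketch the idea is a narrow attribute table (all names are hypothetical): adding a new equipment attribute then becomes an INSERT (DML) instead of an ALTER TABLE (DDL):

    -- Hypothetical attribute table replacing ever-growing name1 ... name100 columns.
    CREATE TABLE equipment_attr (
        equip_id   BIGINT UNSIGNED NOT NULL COMMENT 'equipment id',
        attr_name  VARCHAR(64)     NOT NULL COMMENT 'attribute name',
        attr_value VARCHAR(255)    NOT NULL COMMENT 'attribute value',
        PRIMARY KEY (equip_id, attr_name)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

    -- A newly required attribute is just a row, added and maintained through DML.
    INSERT INTO equipment_attr (equip_id, attr_name, attr_value)
    VALUES (2001, 'fire_resistance', '40');

    UPDATE equipment_attr
       SET attr_value = '55'
     WHERE equip_id = 2001 AND attr_name = 'fire_resistance';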

 

Dimensionality reduction strategy 3: convert DELETE operations into efficient operations

 

Some businesses need to clean up periodic data regularly, for example keeping only one month of data in a table; data beyond that time range must be cleaned out.

 

If the table is relatively large, the cost of the DELETE operation is too high; there are two kinds of solutions that convert the DELETE into a more efficient form.

 

The first is to create periodic tables for the business, such as monthly, weekly, or daily tables, so that data cleanup becomes relatively controllable and efficient.

 

The second is to use MySQL's RENAME operation. For example, to clean up 99% of the data in a 20-million-row table, the 1% of data that must be kept can be quickly filtered by condition into a new table and the tables then swapped, achieving an almost instant switchover.
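One possible shape of that swap (big_table and the retention condition are placeholders):

    -- Keep only the rows that are still needed, then swap the tables atomically.
    CREATE TABLE big_table_new LIKE big_table;

    INSERT INTO big_table_new
    SELECT * FROM big_table
     WHERE created_at >= '2019-10-01';   -- placeholder retention condition

    RENAME TABLE big_table     TO big_table_arch,
                 big_table_new TO big_table;

    -- Once the business confirms everything is fine, drop or archive the old data:
    -- DROP TABLE big_table_arch;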

 

② SQL optimization

 

SQL actually calls for a relatively minimalist design; many of the points are already in the design specifications, and if the specifications are followed, most of the problems are basically eliminated.

 

Here I would add a few points:

  • Keep SQL statements simple; simplification is a great tool for SQL optimization, because simpler is better.

  • Avoid or eliminate complex multi-table joins as far as possible; joining a large table to another large table is a nightmare. Once that door is opened, more and more requirements will need joins and performance optimization becomes a road of no return. Large-table joins are also a MySQL weak point: even though Hash Join has been introduced, it is not the absolute killer it is in commercial databases, where it has existed for a long time, and problems still abound.

  • Avoid anti-joins and semi-joins in SQL as much as possible; these are where the optimizer is weak. What are anti-joins and semi-joins?

    They are actually easy to understand. For example: NOT IN and NOT EXISTS are anti-joins, while IN and EXISTS are semi-joins. On a table with tens of millions of rows, this difference can amount to several orders of magnitude in performance (a small sketch follows this list).
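As an illustration of that performance point (orders and blocked_users are hypothetical tables), the NOT IN anti-join can often be rewritten as a LEFT JOIN ... IS NULL, assuming the join column is NOT NULL on both sides:

    -- Anti-join written with NOT IN, often expensive on very large tables:
    SELECT o.order_id
      FROM orders o
     WHERE o.user_id NOT IN (SELECT b.user_id FROM blocked_users b);

    -- Equivalent LEFT JOIN ... IS NULL form
    -- (only equivalent when user_id is NOT NULL on both sides):
    SELECT o.order_id
      FROM orders o
      LEFT JOIN blocked_users b ON b.user_id = o.user_id
     WHERE b.user_id IS NULL;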

 

③ Index Tuning

 

Index optimization for a large table should aim for a degree of certainty:

  • First, there must be a primary key; this is one of the first rules in the design specifications and needs no further argument.

  • Second, SQL queries should be driven by an index or a unique index, keeping the query model as simple as possible.

  • Finally, eliminate range queries as far as possible, or at least minimize range scans, on a table with tens of millions of rows.

 

Management Optimization

 

This should be the most neglected part of all the solutions, which is why I put it at the end. Here I also pay tribute to my operations colleagues, who have always treated many of these problems as part of normal due diligence (and quietly taken the blame).

Cleaning up a large table with tens of millions of rows is generally time-consuming, so it is suggested to improve the hot/cold data separation strategy at design time. That may sound abstract, so here is an example: converting the DROP operation on a large table into a reversible DDL operation.

 

DROP is committed by default and is irreversible; in database operations it is practically synonymous with "delete and run". MySQL has no operation-level recovery for DROP short of restoring from a backup, but we can turn DROP into a kind of reversible DDL operation.

 

Each MySQL table has a corresponding .ibd file by default, and a DROP can actually be converted into a RENAME operation, that is, moving the table from testdb to testdb_arch.

 

From a permissions point of view, testdb_arch is not visible to the application, so the RENAME operation smoothly achieves the "delete" effect. Once it is confirmed after a certain period that the data can be cleaned up, the cleanup is invisible to the existing business flow, as shown below:
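A sketch with placeholder names (testdb, testdb_arch, and big_table are assumptions):

    -- An archive schema that the application account has no privileges on.
    CREATE DATABASE IF NOT EXISTS testdb_arch;

    -- "Drop" the table by moving it: only metadata and the .ibd file reference change,
    -- so it is fast and reversible.
    RENAME TABLE testdb.big_table TO testdb_arch.big_table;

    -- After the observation window, clean up for real:
    -- DROP TABLE testdb_arch.big_table;
    -- Or roll back instantly if the table turns out to still be needed:
    -- RENAME TABLE testdb_arch.big_table TO testdb.big_table;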

In addition, there are two more suggestions for large table changes: prefer online changes during off-peak hours, for example with the pt-osc tool, or make the change during a maintenance window; I will not expand on the details here.

 

To sum up in one sentence: optimizing a table with tens of millions of rows means optimizing at an acceptable cost based on the business scenario; it is definitely not an isolated optimization at a single level.

Source: Huperzine architecture notes

 


Origin blog.csdn.net/ailiandeziwei/article/details/104659171