Three solutions for optimizing a 20-million-row data table

The setup: Alibaba Cloud RDS for MySQL (MySQL 5.6). There is a user web-access record table that grew to nearly 20 million rows in six months, and the most recent year has reached 40 million rows. Queries are extremely slow and the system hangs daily, seriously affecting the business.

The premise: this is a legacy system. The table design and SQL statements are not just bad, they are unreadable. The original developers have all left, and maintenance has fallen to me. This is the legendary "run away and leave it unmaintained" situation, and I am the one who fell into the pit!!!

I set out to solve the problem, and this log is the result.

Overview of the options

Option 1: Optimize the existing MySQL database.

Advantages: does not affect the existing business and requires no code changes; the cost is lowest. Disadvantages: there is an optimization ceiling, roughly at the 100-million-row mark.

Option 2: Migrate to a database that is 100% MySQL-compatible.

Advantages: does not affect the existing business, requires no code changes, and needs almost no work on your part to improve database performance. Disadvantages: costs more money.

Option 3: Go straight to a big-data solution and replace MySQL with a NewSQL/NoSQL database.

Advantages: strong scalability, low cost, no data-capacity bottleneck. Disadvantages: requires modifying the source code.

The three options can be applied in order. If the data volume is below 100 million rows, there is no need to switch to NoSQL; the development cost would be too high. I tried all three options, and each one produced a workable solution. In the process I cursed the runaway developers ten thousand times :)

Option 1: Optimize the existing MySQL database

After talking with Alibaba Cloud database engineers, searching online, and asking experienced colleagues, the advice boils down to the following (all of it the essentials):

  1. Consider performance when designing the database and creating tables

  2. Write SQL with optimization in mind

  3. Partitioning

  4. Table sharding (splitting tables)

  5. Database sharding (splitting databases)

1. Consider performance when designing the database and creating tables

MySQL itself is highly flexible, which makes its performance depend heavily on the developer's skill: a skilled developer gets high performance out of MySQL, an unskilled one does not. This is a common trait of relational databases, and it is why a company's DBAs usually command large salaries.

Points to note when designing tables:

  • Avoid NULL values in table fields. NULLs are hard to optimize in queries and occupy extra index space; a default of 0 is recommended in place of NULL.

  • Prefer INT to BIGINT, and if the value is non-negative, add UNSIGNED (doubling the range). Better still, use TINYINT, SMALLINT, or MEDIUMINT where they fit.

  • Use ENUM or integer types instead of strings

  • Prefer TIMESTAMP over DATETIME

  • Do not put too many columns in one table; fewer than 20 is recommended

  • Store IP addresses as integers
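Pulling these points together, a minimal sketch of a table built this way (every name here is hypothetical, not taken from the original system):

```sql
-- Hypothetical access-log table applying the tips above: NOT NULL with
-- defaults instead of NULLs, small UNSIGNED integers, an ENUM instead
-- of a status string, TIMESTAMP rather than DATETIME, and the IP
-- stored as an unsigned integer.
CREATE TABLE access_log (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL DEFAULT 0,
    status     ENUM('ok', 'blocked', 'error') NOT NULL DEFAULT 'ok',
    ip         INT UNSIGNED NOT NULL DEFAULT 0,
    accesstime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
) ENGINE = InnoDB;

-- MySQL's INET_ATON/INET_NTOA convert IPs to and from integers:
INSERT INTO access_log (user_id, ip) VALUES (42, INET_ATON('10.0.0.1'));
SELECT user_id, INET_NTOA(ip) FROM access_log WHERE user_id = 42;
```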

Indexes

  • More indexes is not better. Create them in a targeted way based on the actual queries; consider indexing the columns involved in WHERE and ORDER BY clauses. Use EXPLAIN to check whether an index is used or a full table scan occurs.

  • Avoid NULL checks on columns in the WHERE clause; they can cause the engine to abandon the index and fall back to a full table scan

  • Columns with few distinct values are poor index candidates, such as a "gender" column with only two or three values

  • For character columns, create prefix indexes only

  • Character columns are best not used as the primary key

  • Avoid foreign keys; enforce referential integrity in the application

  • Try not to use UNIQUE constraints; enforce uniqueness in the application

  • When using multi-column indexes, keep the column order consistent with the query conditions, and remove single-column indexes that they make redundant
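As a sketch of the composite-index advice (table, index, and column names are hypothetical):

```sql
-- The query filters on user_id and sorts by accesstime, so the
-- composite index lists user_id first and accesstime second; a
-- separate single-column index on user_id would then be redundant.
ALTER TABLE access_log ADD INDEX idx_user_time (user_id, accesstime);

-- EXPLAIN reveals whether the index is used or the query falls back
-- to a full table scan.
EXPLAIN SELECT id FROM access_log
WHERE user_id = 42
ORDER BY accesstime DESC
LIMIT 10;
```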

In short: use appropriate data types and create appropriate indexes.

Choose appropriate data types

  • (1) Use the smallest data type that can hold the data; in order of increasing cost: integer < date/time < char/varchar < blob

  • (2) Use simple data types. Integers are cheaper to process than characters, because string comparison is more complex. For example, store times as integer values and IPs as integers via a conversion function

  • (3) Use reasonable column lengths; a fixed-length table is faster. Prefer ENUM and CHAR over VARCHAR where appropriate

  • (4) Define columns as NOT NULL wherever possible

  • (5) Use TEXT as little as possible; when it is needed, consider moving it into a separate table

Choose appropriate index columns

  • (1) Frequently queried columns: those appearing in WHERE, GROUP BY, ORDER BY, and ON clauses

  • (2) Columns compared with <, <=, =, >, >=, BETWEEN, IN, or LIKE 'string%' (trailing wildcard) in WHERE conditions

  • (3) Short columns: the smaller the indexed field, the better, because the database stores data in pages, and smaller entries fit more rows per page

  • (4) Columns with high dispersion (many distinct values) go first in a composite index. Dispersion can be checked by counting distinct column values; the larger the count, the higher the dispersion:
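The bullet above trails off; one common way to check dispersion, consistent with the advice, is to count distinct values per candidate column (table and column names are hypothetical):

```sql
-- The larger COUNT(DISTINCT ...) is, the higher the dispersion, and
-- the earlier that column should appear in a composite index.
SELECT COUNT(DISTINCT user_id) AS user_id_values,
       COUNT(DISTINCT status)  AS status_values
FROM access_log;
```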

The original developers have run away and the table is already in production; I cannot modify it. Verdict: cannot be implemented. Abandoned!

2. Write SQL with optimization in mind

  • Use LIMIT to cap the number of rows a query returns

  • Avoid SELECT *; list only the columns you need

  • Prefer joins to subqueries

  • Split very large DELETE or INSERT statements into batches

  • Enable the slow query log to find the slow SQL

  • Do not compute on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on the column (functions, calculated expressions, and so on) forces a table scan; move the computation to the right-hand side of the comparison

  • Keep each SQL statement as simple as possible: one statement can run on only one CPU; splitting big statements into small ones shortens lock times; a single huge statement can block the whole database

  • Rewrite OR as IN: OR is O(n) while IN is O(log n); keep the IN list to no more than about 200 values

  • Avoid stored functions and triggers; implement that logic in the application

  • Avoid LIKE '%xxx' queries; a leading wildcard cannot use an index

  • Use JOINs sparingly

  • Compare values of the same type, for example '123' with '123' and 123 with 123

  • Avoid the != and <> operators in the WHERE clause, otherwise the engine abandons the index and does a full table scan

  • For consecutive values, use BETWEEN rather than IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5

  • Do not fetch the whole table for list views; paginate with LIMIT, and do not let the page offset grow too large
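Several of these rules read more clearly as before/after rewrites (reusing the t, age, and num examples from the list):

```sql
-- Column computation defeats the index; move the arithmetic to the
-- constant side of the comparison.
SELECT id FROM t WHERE age + 1 = 10;   -- before: table scan
SELECT id FROM t WHERE age = 9;        -- after: can use an index on age

-- Rewrite OR chains as IN (keep the list under roughly 200 values).
SELECT id FROM t WHERE num = 1 OR num = 5 OR num = 9;   -- before
SELECT id FROM t WHERE num IN (1, 5, 9);                -- after

-- Paginate with LIMIT rather than pulling the whole table.
SELECT id FROM t ORDER BY id LIMIT 20 OFFSET 40;
```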

Again: the original developers have run away, the program is finished and live, and I cannot modify the SQL. Verdict: cannot be executed. Abandoned!

Storage engines

Currently, two engines, MyISAM and InnoDB, are widely used:

MyISAM

The MyISAM engine is the default in MySQL 5.1 and earlier. Its characteristics are:

  • No row locks: reads take locks on all tables involved, and writes take an exclusive lock on the table

  • Does not support transactions

  • Does not support foreign keys

  • Does not support safe recovery after a crash

  • Concurrent inserts: new records can be appended to a table while it is serving read queries

  • Supports indexes on the first 500 characters of BLOB and TEXT columns, and supports full-text indexes

  • Supports delayed index updates, which greatly improves write performance

  • For tables that will no longer be modified, supports compressed tables, which greatly reduce disk usage
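A sketch of the delayed-index-update option mentioned above (the table itself is hypothetical); compressed read-only tables, by contrast, are produced with the separate myisampack command-line tool:

```sql
-- DELAY_KEY_WRITE buffers index updates in memory and flushes them
-- when the table is closed; a crash can corrupt the index, so it
-- trades safety for write speed.
CREATE TABLE visit_archive (
    id  INT UNSIGNED NOT NULL,
    msg VARCHAR(255) NOT NULL DEFAULT ''
) ENGINE = MyISAM DELAY_KEY_WRITE = 1;
```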

InnoDB

InnoDB became the default engine as of MySQL 5.5. Its characteristics are:

  • Supports row locks and uses MVCC to support high concurrency

  • Supports transactions

  • Support foreign keys

  • Support safe recovery after crash

  • Does not support full-text indexes (before MySQL 5.6, which added InnoDB full-text support)

In general, MyISAM is suitable for SELECT-intensive tables, while InnoDB is suitable for INSERT- and UPDATE-intensive tables.

MyISAM may be very fast and compact on disk, but the application requires transaction support, so InnoDB is mandatory. Verdict: cannot be implemented. Abandoned!

3. Partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting. Users add partitioning clauses when creating the table; it is transparent to the application, which needs no code changes.

To the user, a partitioned table is a single independent logical table, but underneath it is made up of multiple physical sub-tables. The partitioning code is really a wrapper around a set of underlying table objects; to the SQL layer it is one complete table, a black box over the underlying storage. MySQL's implementation also means that indexes are defined per partition sub-table; there is no global index.

SQL against a partitioned table needs to be written with the partitioning in mind: the WHERE clause should include the partitioning column, so that the query is pruned down to a few partitions; otherwise all partitions are scanned. You can use EXPLAIN PARTITIONS to see which partitions a statement will hit and tune accordingly. In my tests, even queries that did not filter on the partitioning column got faster, so this measure is worth trying.
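A sketch of that check (the partitioned table and column here are hypothetical):

```sql
-- EXPLAIN PARTITIONS (MySQL 5.6 syntax) lists the partitions a
-- statement will touch; if the WHERE clause does not constrain the
-- partitioning column, every partition is listed, signalling a scan
-- of them all.
EXPLAIN PARTITIONS
SELECT * FROM user_visits WHERE visit_month = 11;
```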

The benefits of partitioning are:

  • Lets a single table store more data

  • Partitioned data is easier to maintain: you can delete large batches of data by clearing a whole partition, and add new partitions to hold newly inserted data; an individual partition can also be optimized, checked, and repaired on its own

  • Some queries can be pruned, based on their conditions, down to a few partitions and run very fast

  • Partition data can be spread across different physical devices, making use of multiple pieces of hardware

  • Partitioning can sidestep certain special bottlenecks, such as mutually exclusive access on a single InnoDB index, or inode lock contention in the ext3 filesystem

  • A single partition can be backed up and restored

Limitations and disadvantages of partitions:

  • A table can have at most 1024 partitions

  • If the table has a primary key or unique index, the partitioning columns must be included in every unique key, including the primary key

  • Partitioned tables cannot use foreign key constraints

  • NULL values defeat partition pruning

  • All partitions must use the same storage engine

Partition types:

  • RANGE partitioning: rows are assigned to partitions based on their column values falling within a given continuous interval

  • LIST partitioning: like RANGE, except the partition is chosen by matching the column value against a discrete set of values

  • HASH partitioning: the partition is chosen by the return value of a user-defined expression computed over the inserted row's column values; the expression can be any valid MySQL expression that yields a non-negative integer

  • KEY partitioning: like HASH, except that MySQL supplies its own hashing function over one or more columns, and those columns are not required to be integers
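Minimal declaration sketches for the four types (all table and column names are hypothetical):

```sql
-- RANGE: continuous intervals of an integer expression.
CREATE TABLE t_range (id INT NOT NULL)
PARTITION BY RANGE (id) (
    PARTITION p0   VALUES LESS THAN (1000000),
    PARTITION p1   VALUES LESS THAN (2000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- LIST: discrete sets of values.
CREATE TABLE t_list (region INT NOT NULL)
PARTITION BY LIST (region) (
    PARTITION p_north VALUES IN (1, 2),
    PARTITION p_south VALUES IN (3, 4)
);

-- HASH: partition number derived from an integer expression.
CREATE TABLE t_hash (id INT NOT NULL)
PARTITION BY HASH (id) PARTITIONS 8;

-- KEY: like HASH, but MySQL supplies the hashing function, and the
-- columns need not be integers.
CREATE TABLE t_key (name VARCHAR(50) NOT NULL)
PARTITION BY KEY (name) PARTITIONS 8;
```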

For details on MySQL partitioning concepts, search for yourself or consult the official documentation; this is only a summary.

I first RANGE-partitioned the access-record table into 12 partitions by month. Query efficiency improved about sixfold, but the effect was not dramatic. So I switched to HASH partitioning on id, with 64 partitions, and query speed improved markedly. Problem solved!

The result: PARTITION BY HASH (id) PARTITIONS 64

select count(*) from readroom_website; -- 11901336 rows

/* Affected rows: 0  Found rows: 1  Warnings: 0  Duration for 1 query: 5.734 sec. */

select * from readroom_website where month(accesstime) = 11 limit 10;

/* Affected rows: 0  Found rows: 10  Warnings: 0  Duration for 1 query: 0.719 sec. */
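For reference, the 12-way RANGE-by-month attempt described above might have looked like the sketch below. The original DDL is not shown in the post, so this assumes accesstime is usable in the partitioning expression, i.e. that it is part of every unique key on the table, as MySQL requires:

```sql
-- MONTH() is one of the functions MySQL permits in a partitioning
-- expression; RANGE needs an integer, and MONTH() yields 1..12.
-- A table can be partitioned only one way at a time, so this scheme
-- was replaced, not supplemented, by the HASH(id) scheme above.
ALTER TABLE readroom_website
PARTITION BY RANGE (MONTH(accesstime)) (
    PARTITION p01 VALUES LESS THAN (2),
    PARTITION p02 VALUES LESS THAN (3),
    PARTITION p03 VALUES LESS THAN (4),
    PARTITION p04 VALUES LESS THAN (5),
    PARTITION p05 VALUES LESS THAN (6),
    PARTITION p06 VALUES LESS THAN (7),
    PARTITION p07 VALUES LESS THAN (8),
    PARTITION p08 VALUES LESS THAN (9),
    PARTITION p09 VALUES LESS THAN (10),
    PARTITION p10 VALUES LESS THAN (11),
    PARTITION p11 VALUES LESS THAN (12),
    PARTITION p12 VALUES LESS THAN MAXVALUE
);
```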

4. Table sharding

Table sharding means that when a large table is still slow after the optimizations above, you split it into multiple tables, split one query into multiple queries, and merge the results before returning them to the user.

Table splits are either vertical or horizontal. Usually some field serves as the sharding key; for example, sharding on id into 100 tables named tableName_{id % 100}.
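A sketch of how such a scheme plays out in queries (the tableName_nn tables and the routing rule come from the hypothetical example above, not from real code in the system):

```sql
-- Point lookups are routed by the application: id % 100 picks the
-- shard, e.g. 1234567 % 100 = 67.
SELECT * FROM tableName_67 WHERE id = 1234567;

-- A query that cannot be routed by id must fan out across shards and
-- merge the results, typically generated by application code:
SELECT * FROM tableName_0 WHERE accesstime >= '2020-09-01'
UNION ALL
SELECT * FROM tableName_1 WHERE accesstime >= '2020-09-01';
-- ... and so on for all 100 shards.
```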

However: table sharding requires modifying the source code, which adds a great deal of development work and cost. It is therefore only suitable when a large data volume is anticipated early in development; retrofitting it onto an application already online is far too expensive!!! And its cost is not as low as that of options 2 and 3 below. Not recommended.

5. Database sharding

Splitting one database into several. Read/write separation is a recommended first step; truly splitting the database also brings heavy development costs, and the gain is not worth the loss. Not recommended.

Option 2: Migrate to a database that is 100% MySQL-compatible

If MySQL's performance is not good enough, replace it. To leave the source code unmodified and migrate the existing business smoothly, the replacement must be 100% MySQL-compatible.

Open-source options

  • TiDB https://github.com/pingcap/tidb

  • CUBRID https://www.cubrid.org/

Open-source databases bring significant operation-and-maintenance costs, and their production maturity still lags MySQL's; there are many pitfalls to step on. If your company requires you to host your own database, choose this type of product.

Cloud options

Alibaba Cloud PolarDB

https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudEssentials.47.7a984b5cS7h4wH

Official introduction: PolarDB is Alibaba Cloud's self-developed next-generation cloud-native distributed relational database. It is 100% MySQL-compatible, with storage capacity up to 100 TB and performance up to 6x that of MySQL. PolarDB combines the stability, reliability, and high performance of commercial databases with the simplicity, scalability, and continuous iteration of open-source databases, at about 1/10 the cost of a commercial database.

I opened a trial: it supports free MySQL data migration, has no extra operating cost, improved performance roughly tenfold, and is priced similarly to RDS. A good alternative!

Alibaba Cloud OceanBase

The database Taobao uses; it survives Double Eleven with outstanding performance. It was still in public beta, though, and I could not try it. Worth looking forward to.

Alibaba Cloud HybridDB for MySQL (formerly PetaData)

https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudEssentials.54.7a984b5cS7h4wH

Official introduction: HybridDB for MySQL (formerly PetaData) is a HTAP (Hybrid Transaction/Analytical Processing) relational database that supports both online transaction (OLTP) and online analysis (OLAP) of massive data.

I tested this too. It is a solution that handles both OLAP and OLTP, but the price is too high, up to 10 yuan per hour; that is wasteful for plain storage, and it is better suited to combined storage and analysis.

Tencent Cloud DCDB

https://cloud.tencent.com/product/dcdb_for_tdsql

Official introduction: DCDB, also known as TDSQL, is a high-performance distributed database compatible with the MySQL protocol and syntax that supports automatic horizontal sharding: the business sees one complete logical table, while the data is evenly split across multiple shards. Each shard uses a primary/standby architecture by default, with a full set of disaster-recovery, backup, monitoring, and non-stop scaling capabilities, suited to TB- or PB-scale data.

I prefer not to use Tencent, so I will not say much. The reason: in my experience it is hard to reach anyone when something goes wrong, so production issues go unsolved. But it is cheap and suits very small companies.

Option 3: Drop MySQL and process the data with a big-data engine

Once the data volume passes 100 million rows, there is little choice but big data.

Open source solution

The Hadoop family: HBase/Hive. But the operation and maintenance costs are high, and the average company cannot afford them. Without an investment of 100,000 yuan, there will be no good output!

Cloud solution

There are many such offerings, and they are the future trend: professional companies provide big-data services, and small companies or individuals purchase them. Big data becomes like water or electricity, a public utility present in every corner of society.

The best in China is Alibaba Cloud.

I chose Alibaba Cloud's MaxCompute with DataWorks, which is super comfortable to use, pay-as-you-go, and very low cost.

MaxCompute can be thought of as, roughly, a managed counterpart to Hive: it offers SQL, MapReduce, AI algorithms, Python scripts, and shell scripts for manipulating data. Data is presented as tables, stored in a distributed fashion, and processed through scheduled and batch jobs. DataWorks adds workflow-style management of data-processing tasks, with scheduling and monitoring.

Of course, you can also choose other products such as Alibaba Cloud HBase. My workload here is mainly offline processing, so I chose MaxCompute. The work was done almost entirely through the graphical interface; about 300 lines of SQL were written, and the total cost to solve the data-processing problem was under 100 yuan.

 


Origin blog.csdn.net/qq_46388795/article/details/108753906