Database Performance Optimization Solutions

Foreword

It is no exaggeration to say that for us back-end engineers — no matter which company or team we are in, or which system we work on — the first headache we run into is almost always database performance. If we have a mature methodology that lets everyone quickly and accurately select an appropriate optimization solution, I believe we can readily solve 80% or even 90% of the performance problems we encounter every day.

From the perspective of problem solving, we must first understand the cause of the problem; second, we need a process for thinking and judging, so that we can pick a reasonable layer at which to intervene; finally, we choose a suitable solution from the many available. The prerequisite for finding a suitable solution is understanding the advantages, disadvantages, and applicable scenarios of each option well enough. No solution is completely universal; there is no silver bullet in software engineering.

What follows are the eight major solutions I have used over many years of work, combined with material collected in my own study and organized into this systematic, comprehensive post. I hope it can offer colleagues in need some help in their work and growth.

Why is the database slow?

| Nature of slowness | Contributing factors |
|---|---|
| Lookup time complexity | Search algorithm; storage data structure |
| Total data volume | Addressed by data splitting |
| High load | Busy CPU and disk |

Whether it is a relational database or NoSQL, the query performance of any storage system depends mainly on three things:

  • lookup time complexity
  • total data volume
  • high load

There are two main factors that determine the lookup time complexity:

  • search algorithm
  • storage data structure

Regardless of the storage type, the smaller the data volume, the better the query performance naturally is. As the data volume grows, resource consumption (CPU, busy disk reads and writes) and query latency grow with it.

From the perspective of a relational database, the index structure is basically fixed as a B+Tree with O(log n) time complexity, and the storage structure is row-based. Therefore, what we can usually optimize in a relational database is only the data volume.

High load is caused by highly concurrent requests, complex queries, and so on, which keep the CPU and disk busy; once server resources run short, queries slow down and other problems follow. For this class of problem, clusters and data redundancy are generally used to share the pressure.


At what level should we think about optimization?

[Figure: the four optimization layers, from top to bottom — hardware, storage system, storage structure, concrete implementation]

As the figure above shows, there are four layers from top to bottom: hardware, storage system, storage structure, and concrete implementation. The layers are closely related, and each layer's upper layer is its carrier; therefore, the higher the layer, the higher the performance ceiling it can raise, but also the higher the optimization cost and the lower the cost-effectiveness. Take the bottom layer, concrete implementation, as an example: index optimization should cost the least — one could say that after adding indexes, both CPU consumption and response time drop immediately. But no matter how much a single statement is optimized and indexed, there are limits. When there is no room left at this layer, you have to think about the layer above, the storage structure, and consider optimizing the physical table design (sub-database/sub-table, compressing the data volume, and so on); for a document database, think about how documents are aggregated. If optimization at the storage-structure layer proves ineffective, you have to keep going up: is a relational database simply unsuited to the current business scenario, and if the storage is to be changed, which NoSQL should it be?

Therefore, our optimization approach is: for the sake of cost-effectiveness, start at the concrete-implementation layer, and only move up a layer when there is truly no optimization room left. Of course, if the company has money, it can solve the problem with cash at the hardware layer and bypass the other three layers entirely — a convenient emergency measure as well.

This article does not discuss the top (hardware) or bottom (concrete implementation) layers; it focuses on the middle two: the storage structure and the storage system.

Overview of the eight solutions

| Solution type | Solution | Data type | Benefit type | Target scenario |
|---|---|---|---|---|
| Reduce data volume | Data serialization storage | Static data | Short-term gain | Large data volume |
| Reduce data volume | Data archiving | Dynamic data | Mid-term gain | Large data volume |
| Reduce data volume | Intermediate table generation | Static data | Long-term gain | Large data volume, high load |
| Reduce data volume | Sub-database and sub-table | Dynamic data | Long-term gain | Large data volume, high load |
| Trade space for performance | Distributed cache | Static data | Short-term gain | High load |
| Trade space for performance | One master, many slaves | Dynamic data | Mid-term gain | High load |
| Choose the right storage system | CQRS | Dynamic data | Long-term gain | Large data volume, high load |
| Choose the right storage system | Replace the storage system | Dynamic data | Long-term gain | Large data volume, high load |

Database optimization solutions have three core essences: reducing the data volume, trading space for performance, and choosing an appropriate storage system. These correspond to the three causes of slowness explained at the beginning: total data volume, high load, and lookup time complexity.

A quick explanation of the benefit types: a short-term gain has a low processing cost and can handle emergencies, but accumulates technical debt over time; a long-term gain is the opposite — the processing cost is high in the short term, but the effect lasts and the scalability is better.

Static data here means data with a relatively low change frequency, few table joins, and little WHERE filtering; dynamic data, on the contrary, is updated frequently and filtered by dynamic conditions.

Reduce data volume

There are four solutions of the reduce-data-volume type: data serialization storage, data archiving, intermediate table generation, and sub-database/sub-table.

As mentioned above, regardless of the storage, the smaller the data volume, the better the query performance; as the data volume grows, resource consumption (CPU, busy disk I/O) and latency grow too. NoSQL systems on the market now basically support sharded storage, so their native distributed write capability already handles data volume very well. For relational databases, the search algorithm and storage structure leave little room for optimization, so we generally start by reducing the data volume. This class of solution therefore mainly targets relational databases.

Data archiving

| Practice | Scenario | Advantage | Shortcoming |
|---|---|---|---|
| Use database jobs to regularly move historical data into history tables (or databases) | Local hotspot data | No structural changes required, low invasiveness | Too much hotspot data can still cause performance issues |

Points to note: don't migrate too much at once; low-frequency migration in multiple batches is recommended. Also note that MySQL does not release disk space after deleting data; you can run OPTIMIZE TABLE to reclaim the storage, but it locks the table — if space is still sufficient, you can skip it.
  I recommend giving priority to this solution: migrate non-hot data into history tables via database jobs. If history data needs to be queried, add a new business entry point routed to the corresponding history table (or database). A sketch of such a job follows.
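
To make the batching concrete, here is a minimal sketch of such an archiving job in Python, assuming a hypothetical `orders` table with `id` and `created_at` columns and a matching `orders_history` table; the table names, batch size, and pymysql driver are illustrative choices:

```python
# Hedged sketch: move cold rows to a history table in small batches.
# `orders`, `orders_history`, and BATCH_SIZE are illustrative assumptions.
import pymysql

BATCH_SIZE = 5000  # small batches keep lock time and binlog growth bounded

def archive_batch(conn, cutoff_date):
    """Move one batch of rows older than cutoff_date into orders_history."""
    with conn.cursor() as cur:
        # Copy a bounded, deterministically ordered batch first...
        cur.execute(
            "INSERT INTO orders_history "
            "SELECT * FROM orders WHERE created_at < %s ORDER BY id LIMIT %s",
            (cutoff_date, BATCH_SIZE),
        )
        copied = cur.rowcount
        # ...then delete the same rows from the hot table (same order/limit).
        cur.execute(
            "DELETE FROM orders WHERE created_at < %s ORDER BY id LIMIT %s",
            (cutoff_date, BATCH_SIZE),
        )
    conn.commit()
    return copied  # 0 means the job is done until the next scheduled run
```

Run it on a low-frequency schedule and stop once it returns 0; OPTIMIZE TABLE, if needed at all, can wait for a maintenance window because it locks the table.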


Intermediate table (result table)

| Practice | Scenario | Advantage | Shortcoming |
|---|---|---|---|
| Use scheduled tasks to aggregate and group a business across multiple dimensions | Static data such as reports and leaderboards | High compression ratio | Developers must build it per business scenario |

The intermediate table (result table) uses scheduled tasks to precompute the results of complex queries and store them in an extra physical table. Because this table holds batch-aggregated data, it can be understood as a high-degree compression of the original business data. Take reports as an example: if a month has hundreds of thousands of source rows and we generate the monthly dimension via a scheduled task, we compress the original data to roughly one hundred-thousandth; quarterly and annual reports can then be computed from N monthly rows. Data processed this way stays within an acceptable range even over three, five, or ten years, and can still be computed accurately.

So is a lower data compression ratio always better? Here is a rule of thumb (a sketch of such a rollup job follows the list):

  • The more fields, the finer the granularity and the higher the flexibility: the intermediate table can be joined against different business tables.
  • The fewer fields, the coarser the granularity and the lower the flexibility: it is usually queried directly as a result table.
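
As a concrete illustration, here is a minimal sketch of such a scheduled rollup, assuming hypothetical `order_detail` and `order_report_monthly` tables; the grouping dimensions (month, region) are purely illustrative:

```python
# Hedged sketch: compress raw rows into a monthly intermediate (result) table.
# Table names and dimensions are illustrative assumptions.
MONTHLY_ROLLUP_SQL = """
INSERT INTO order_report_monthly (month, region, order_count, total_amount)
SELECT DATE_FORMAT(created_at, '%%Y-%%m') AS month,  -- %% escapes % for pymysql
       region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM order_detail
WHERE created_at >= %s AND created_at < %s
GROUP BY DATE_FORMAT(created_at, '%%Y-%%m'), region
"""

def run_monthly_rollup(conn, month_start, next_month_start):
    # Scheduled once per month; hundreds of thousands of rows collapse into
    # a handful of aggregate rows per region.
    with conn.cursor() as cur:
        cur.execute(MONTHLY_ROLLUP_SQL, (month_start, next_month_start))
    conn.commit()
```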

Data serialization storage

| Practice | Scenario | Advantage | Shortcoming |
|---|---|---|---|
| Store one-to-many data as a serialized string | Fields that do not require structured storage | High compression ratio | Serialized fields cannot be joined |


Serialized storage in the database is a good way to reduce data volume for business data that does not need structured storage, especially in business scenarios with an M*N data volume. If M is kept as the main table and the N side is serialized into it, the data volume stays on the order of M. Order address information is a typical example: such data generally never needs to be searched by its inner fields, so it fits this scheme well.
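
A minimal sketch of the idea, assuming a hypothetical `orders` table with an `address_json` column (`conn` is a pymysql-style connection as in the earlier sketches):

```python
# Hedged sketch: collapse a one-to-many child record (address info) into one
# JSON column on the parent row. Names are illustrative assumptions.
import json

def save_order(conn, order_id, amount, address):
    # address is a dict like {"province": ..., "city": ..., "detail": ...};
    # stored as one string it adds no child rows, but can never be joined on.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (id, amount, address_json) VALUES (%s, %s, %s)",
            (order_id, amount, json.dumps(address, ensure_ascii=False)),
        )
    conn.commit()

def load_order_address(conn, order_id):
    with conn.cursor() as cur:
        cur.execute("SELECT address_json FROM orders WHERE id = %s", (order_id,))
        row = cur.fetchone()
    return json.loads(row[0]) if row else None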

I consider this a temporary optimization scheme, both because serialized fields lose some query capability and because the scheme's optimization headroom is limited.

Sub-database and sub-table

Sub-database and sub-table is a classic database optimization scheme; in the era before NoSQL matured, it was practically a lifesaver.

Many peers still choose this optimization today, but from my point of view, sub-database/sub-table is a costly solution. A few suggestions:

  1. Split databases and tables only when there is truly no other way; it should be the last resort.
  2. Prefer NoSQL as a replacement, because NoSQL was born for scalability and high performance.
  3. Sub-database or sub-table? Split tables when the data volume is large; split databases when the concurrency is high.
  4. Don't over-plan for future expansion; do the split once and get it right, because technology updates too fast — it changes every 3-5 years.

Split method

| Split method | Angle | Advantage |
|---|---|---|
| Vertical split | Split by business | Reduces business coupling; fewer fields means more rows per physical page |
| Horizontal split | Sharding at the physical level | Radically reduces data volume |

Whenever splitting is involved — microservices, sub-database, or sub-table — there are two main ways to split: vertical splitting and horizontal splitting.

Vertical splitting works more from the business point of view, mainly to reduce business coupling. There is also a physical benefit: taking SQL Server as an example, one page is 8KB of storage; the more fields a table has, the more space one row naturally occupies, the fewer rows fit in one page, the more I/O each query needs, and the slower the performance. Conversely, reducing fields also improves performance. I once heard of peers whose 80-field tables started slowing down at a few million rows.

Horizontal splitting works more from the technical point of view: after splitting, every table has exactly the same structure. In short, the data of the original table is split across multiple physical tables by technical means, which fundamentally solves the data volume problem.


Routing method

| Algorithm | Advantage | Shortcoming |
|---|---|---|
| Range | Easy to locate data | Prone to uneven data (hotspots); easy to forget to create new tables |
| Hash | Even sharding | Requires a partition key, otherwise every table is scanned once; with sub-databases, RDBMS features (Join, aggregation, paging) are unavailable |
| Shard mapping table | Supplementary scheme | Requires a secondary query |

After horizontal splitting, data that would have lived in one table is written into different physical tables according to the partition key (sharding key); queries must likewise locate the corresponding physical table by the partition key in order to fetch the data.

There are generally three routing methods: range, Hash, and shard mapping table. Each has its own advantages and disadvantages; choose according to the business scenario.

Range routing splits by an interval of some element. Take time as an example: if a business wants to split by month, the tables look like table_2022-04. This also suits document stores and NoSQL such as ElasticSearch, and it is very convenient both for locating queries and for later cleanup and maintenance. The drawback is equally obvious: business peculiarities can make the data uneven, with large volume differences between ranges.

Hash is another common routing method: take the hash of the partition key modulo the table count so data is stored evenly across the physical tables. The drawback is a strong dependence on queries carrying the partition key; without it, the specific physical table cannot be located and every related table has to be queried once. Moreover, in the sub-database case, RDBMS features such as Join, aggregation, and paging can no longer be used.
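
A minimal sketch of Hash routing, assuming eight hypothetical physical tables named `order_0` through `order_7` and `user_id` as the partition key:

```python
# Hedged sketch: partition key modulo table count picks the physical table.
TABLE_COUNT = 8  # order_0 .. order_7; illustrative assumption

def route_table(user_id: int) -> str:
    return f"order_{user_id % TABLE_COUNT}"

def find_orders_by_user(conn, user_id):
    # With the partition key we hit exactly one physical table; without it,
    # all eight tables would have to be scanned once each.
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {route_table(user_id)} WHERE user_id = %s",
            (user_id,),
        )
        return cur.fetchall()
```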


There is usually only one partition key. If a business scenario must query by a field that is not the partition key, does everything really have to be scanned? This is where the shard mapping table comes in: simply put, an extra table records the mapping between that field and the partition key. For example, suppose an order table was split with UserID as the partition key, and we now want to query by OrderID; then an extra physical table must record the OrderID-to-UserID mapping. We first query the mapping table to get the partition key, then route to the corresponding physical table by its value. Some may ask whether each mapping relation gets its own table or several share one; I recommend handling each separately — if a mapping table accumulates too many fields, you are effectively back to the state before horizontal splitting, which is the old problem all over again.
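
A minimal sketch of the two-step lookup, reusing the `route_table` helper from the Hash sketch above and assuming a hypothetical `order_user_mapping` table:

```python
# Hedged sketch: resolve the partition key via the mapping table, then route.
def find_order_by_order_id(conn, order_id):
    with conn.cursor() as cur:
        # Step 1 (the "secondary query"): mapping table -> partition key.
        cur.execute(
            "SELECT user_id FROM order_user_mapping WHERE order_id = %s",
            (order_id,),
        )
        row = cur.fetchone()
        if row is None:
            return None
        # Step 2: route by partition key to the single physical table.
        cur.execute(
            f"SELECT * FROM {route_table(row[0])} WHERE order_id = %s",
            (order_id,),
        )
        return cur.fetchone()
```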

Trade space for performance

Both solutions of this type address high-load scenarios: distributed cache, and one master with many slaves.

Rather than calling this type "trading space for performance", I think "trading space for resources" is more accurate. The essence of both solutions is to share the load pressure through data redundancy, clustering, and similar means.

For relational databases, the ACID properties mean they inherently do not support distributed writes, but they do naturally support distributed reads.


Distributed cache

| Practice | Scenario | Shortcoming |
|---|---|---|
| Cache-Aside | Handling high-concurrency reads; pseudo-static data (business configuration, low-timeliness data) | Low cache hit rate in scenarios with many dynamic query conditions; data with strong real-time requirements takes considerable effort to handle |

Caching exists at several levels: client-side cache, API-service local cache, and distributed cache; here we only discuss the distributed cache. For a distributed cache we usually prefer a NoSQL key-value database such as Memcached or Redis; with its diverse data structures, high performance, and easy scalability, Redis has gradually come to dominate distributed caching.

There are also several main cache strategies: Cache-Aside, Read/Write-Through, and Write-Back. The one we use most is Cache-Aside; the flow is shown in the figure below.

[Figure: Cache-Aside read/write flow]
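
A minimal sketch of Cache-Aside with redis-py; the key format, TTL, and the `query_product_from_db` / `update_product_in_db` helpers are hypothetical:

```python
# Hedged sketch of Cache-Aside: read through the cache, fall back to the
# database on a miss, and evict on update. Helpers and names are assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300

def get_product(conn, product_id):
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                       # cache hit
    product = query_product_from_db(conn, product_id)   # miss -> database
    if product is not None:
        r.set(key, json.dumps(product), ex=TTL_SECONDS)
    return product

def update_product(conn, product_id, fields):
    update_product_in_db(conn, product_id, fields)      # write DB first...
    r.delete(f"product:{product_id}")                   # ...then evict cache
```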

I believe everyone is already fairly familiar with distributed caching, but there are still a few points I want to highlight:

Avoid cache abuse

Caches should be used on demand: by the 80/20 rule, 80% of performance problems are caused by the main 20% of features. Abusing the cache increases maintenance cost and makes some data-consistency problems hard to track down. Dynamic-condition queries and paging are especially tricky: the keys are assembled in many variations, and at volume the keys command is impractical. One workaround is to keep an extra key that stores, as a set, the keys of the cached records; on deletion, do two rounds — first fetch the key set, then iterate it and delete each entry. This whole operation is undeniably laborious; whoever has done it knows. A sketch follows.
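
A minimal sketch of that bookkeeping with redis-py (reusing the `r` client above); the key naming scheme is an illustrative assumption:

```python
# Hedged sketch: register every cached query variant in a Redis set so
# invalidation can enumerate them without the KEYS command.
def cache_list_result(list_id, cache_key, payload, ttl=300):
    r.set(cache_key, payload, ex=ttl)
    r.sadd(f"keyset:{list_id}", cache_key)   # record the key in the set

def invalidate_list(list_id):
    keyset = f"keyset:{list_id}"
    members = r.smembers(keyset)             # round 1: fetch the key set
    if members:
        r.delete(*members)                   # round 2: delete every entry
    r.delete(keyset)
```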


Avoid cache breakdown

When the cache holds no data, queries fall through to the database — that is cache penetration. Now suppose the data is empty at some cutoff point, for example a weekly leaderboard: every query that penetrates still finds the database empty, and the query itself is relatively CPU-expensive. When concurrency arrives without the cache layer to absorb it, the database's resource consumption spikes — that is cache breakdown. Excessive database resource consumption then causes other queries to time out, among other problems.

The fix is simple: cache the empty database result too, but give it a relatively short expiry. Some colleagues may ask, doesn't this cause data inconsistency? Any scheme involving data synchronization — distributed cache, the one-master-many-slaves discussed later, CQRS — implies data-consistency problems; so whenever these schemes are used, the corresponding business scenario must tolerate a certain degree of inconsistency.
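
A minimal sketch of the fix, continuing the weekly-leaderboard example; the sentinel value, TTLs, and `query_ranking_from_db` helper are hypothetical:

```python
# Hedged sketch: cache empty results too, with a deliberately short expiry.
EMPTY_SENTINEL = b"__EMPTY__"

def get_weekly_ranking(conn, week):
    key = f"ranking:week:{week}"
    cached = r.get(key)
    if cached is not None:
        return [] if cached == EMPTY_SENTINEL else json.loads(cached)
    rows = query_ranking_from_db(conn, week)   # the expensive query
    if rows:
        r.set(key, json.dumps(rows), ex=600)   # normal TTL for real data
    else:
        r.set(key, EMPTY_SENTINEL, ex=60)      # short TTL bounds the staleness
    return rows
```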

Not all slow queries are suitable

Generally speaking, slow queries are resource-hungry (CPU, disk I/O). For example, suppose a query takes 3 seconds. Run serially there is no problem, but if the feature sees roughly 100 QPS, then before the first result returns, every subsequent query penetrates to the database — meaning about 300 requests hit the database in those few seconds. If the database CPU then reaches 100%, all subsequent queries time out, the first result never gets cached, and you still end up with cache breakdown.

One master, many slaves

| Scenario | Advantage | Shortcoming |
|---|---|---|
| Sharing database read pressure; no better temporary way to lower database load has been found yet | Convenient for emergency adjustment; can be handled by operations alone | High hardware cost; limited scalability |

Another common way to share database pressure is read/write splitting, i.e. one master with many slaves. As we know, relational databases inherently lack distributed sharded storage — they do not support distributed writes — but they naturally support distributed reads. One-master-many-slaves deploys multiple read-only replica instances and shares the read pressure through redundant copies of the master's data; the routing can be implemented in code or handled by middleware, chosen according to the team's operations capability and code-component support.

Before a root-cause fix is found, one master with many slaves is a very good emergency solution, especially in today's cloud era when adding a replica is extremely convenient and usually needs only operations or a DBA, with no developer involvement. The scheme has drawbacks, of course: because the data cannot be sharded, each replica fully duplicates the master's data, which drives hardware costs up. Replicas also have an upper limit — too many of them increase the master's multi-threaded replication pressure.
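
A minimal sketch of code-level routing; hostnames and credentials are placeholders, and in practice middleware (a database proxy) often does this transparently:

```python
# Hedged sketch: writes always hit the master, reads are spread over replicas.
import random
import pymysql

def connect(host):
    return pymysql.connect(host=host, user="app", password="***", database="shop")

master = connect("db-master")
replicas = [connect("db-replica-1"), connect("db-replica-2")]

def get_connection(for_write: bool):
    # Reads tolerate replication lag; anything transactional goes to master.
    return master if for_write else random.choice(replicas)
```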


Choosing the right storage system

NoSQL mainly comes in five types: key-value, document, column, graph, and search engine. The storage system directly determines the search algorithm and the storage data structure, and therefore the business scenarios it can address. The rise of NoSQL also solved the problems relational databases had long faced (performance, high concurrency, scalability, and so on).

For example, ElasticSearch's search algorithm is the inverted index, which can replace the relational database's slow, expensive LIKE search (a full table scan). Redis's Hash structure gives O(1) time complexity, and combined with in-memory storage and sharded cluster deployment it can sustain hundreds of thousands of QPS.

This type therefore has two schemes: CQRS, and replacing (selecting) the storage. The two are essentially the same at the core — use a suitable store to make up for the relational database's weaknesses — differing only in how the transition is made.


CQRS

CQS (Command Query Separation) states that every method of an object is either a query or a command: it either returns state or changes state, but never both.

| Scenario | Advantage | Shortcoming |
|---|---|---|
| Need to keep using the relational database while also gaining NoSQL's high performance and scalability; data scenarios that tolerate non-real-time reads | Small change scope in the original application, compatible with old business — only the read layer is replaced; keeps the relational database's ACID properties while gaining NoSQL's scalability and performance | High hardware cost; data synchronization required |

Understanding CQRS requires understanding CQS first; for those still unclear, here it is in plain words: in an object's data-access methods, each method either only queries or only writes (updates). CQRS (Command Query Responsibility Segregation) builds on CQS: the physical database handles writes (updates), while a separate storage system handles queries. So when designing the storage architecture for certain business scenarios, we can rely on the relational database's ACID properties for updates and writes, and on NoSQL's high performance and scalability for query processing. The benefit is that we keep the advantages of both the relational database and NoSQL, and businesses that cannot switch storage in one cut get a smooth transition.

From the code's point of view, different storage systems are just different API calls, so the real difficulty of CQRS lies in data synchronization.
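
A minimal sketch of the split, assuming MySQL on the command side and Elasticsearch on the query side (elasticsearch-py 8.x-style calls); the index and column names are illustrative, and the synchronization between the two stores is covered next:

```python
# Hedged sketch of CQRS: ACID writes go to the relational database, search
# queries go to Elasticsearch. Names and mappings are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def create_order(conn, order):                 # command side (MySQL)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (id, user_id, amount, remark) "
            "VALUES (%s, %s, %s, %s)",
            (order["id"], order["user_id"], order["amount"], order["remark"]),
        )
    conn.commit()

def search_orders(keyword):                    # query side (Elasticsearch)
    resp = es.search(index="orders", query={"match": {"remark": keyword}})
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```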

Data synchronization methods

| Method | Real-time | Type | Advantage | Shortcoming |
|---|---|---|---|---|
| CDC (Change Data Capture) | High | Push | No business intrusion; covers multiple business entry points | Extra middleware |
| Domain events | High | Push | Highly readable | Must be handled at the framework/code level |
| Scheduled-task sync | Low | Pull | Same as CDC | Cannot detect physical deletes; full sync only |

Discussions of data synchronization generally divide it into push and pull:

Push means the data-changing side sends the change records, directly or indirectly, to the receiving side, which then handles consistency; this active approach offers high real-time performance.

Pull means the receiving side polls the database on a schedule to check whether any data needs syncing; this passive approach is simpler to implement than push, because push requires the changing side to support publishing a change log.

Push itself comes in two flavors: CDC (Change Data Capture) and domain events. For some legacy projects, certain business data has so many entry points that they cannot be fully and clearly mapped out; CDC is then an excellent choice, since the change records are captured at the lowest level, the database itself.

For projects that are already service-oriented, domain events are the more comfortable option: CDC requires enabling extra database features or deploying extra middleware, while domain events need neither; they also read better in code and fit developers' maintenance mindset more closely.
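
A minimal sketch of the domain-event flavor; the in-process bus stands in for a real message broker (Kafka, RabbitMQ, etc.), and the Elasticsearch handler reuses the `es` client from the CQRS sketch:

```python
# Hedged sketch: the write path publishes an event after committing; a
# subscriber mirrors the change into the query-side store.
from typing import Callable, Dict, List

class EventBus:
    def __init__(self):
        self._handlers: Dict[str, List[Callable]] = {}

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers.get(event_type, []):
            handler(payload)

bus = EventBus()

def on_order_created(payload: dict) -> None:
    # Query-side handler: mirror the new order into Elasticsearch.
    es.index(index="orders", id=payload["id"], document=payload)

bus.subscribe("OrderCreated", on_order_created)
# The command side calls bus.publish("OrderCreated", order_dict) after commit.
```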


Replacing (selecting) a storage system

Essentially, the core of this scheme is the same as CQRS; the main requirement is a thorough understanding of the strengths and weaknesses of the various NoSQL systems, so that a suitable storage system can be selected for the business scenario at hand. Here I recommend Martin Fowler's book "NoSQL Distilled"; I have read it several times, and it is a good introduction to the pros, cons, and usage scenarios of the various NoSQL stores.

Of course, when replacing the storage I also have a suggestion: add an intermediate version whose job is data synchronization and business switching. The data synchronization must guarantee full-plus-incremental processing and be restartable from scratch at any time. The business switch is mainly a temporary toggle for the subsequent release, to guard against a rough rollout or data inconsistency caused by the version update. After running for a period and verifying that the two storage systems hold consistent data, replace the underlying calls of the data access layer. This achieves a smooth switchover.

Conclusion

That concludes the introduction of the eight solutions. One more reminder: each solution has its own applicable scenarios, and we can only choose according to the business scenario; there is no one-size-fits-all solution and no silver bullet.

Most of these eight solutions involve data synchronization, and wherever there is data synchronization — one master with many slaves, distributed cache, or CQRS — there will be data-consistency problems; these solutions therefore suit read-heavy or read-only business scenarios better. For read-after-write scenarios, a transition page or an ad page that the user clicks through can buy enough time to mask the inconsistency.

Through this article, I believe you now have a comprehensive understanding of database optimization design. If you have more suggestions, please leave me feedback in the comments below.
