Common forms of database and table splitting, and the difficulties they bring

When database architecture and database optimization come up, we often hear keywords such as "database and table splitting" and "sharding". The good news is that the companies these friends work for are seeing (or are about to see) rapid business growth, which brings technical challenges with it. The worry is whether their systems really need database and table splitting, and whether it is really that easy to put into practice. With that in mind, the author has collected some of the problems that can arise when splitting databases and tables, and presents solutions and suggestions based on past experience.

Vertical table splitting

Vertical table splitting is common in everyday development and design. Colloquially it is called "splitting a big table into small tables", and the split is made along the columns (fields) of a relational table. Typically, when a table has many fields, you create a new "extension table" and move the fields that are rarely used or particularly long into it, as shown in the following figure:

Summary

 

When a table has many fields, splitting it does make development and maintenance easier (the author has seen a legacy system with a single large table of more than 100 columns). To some extent it also avoids the "cross-page" problem: MySQL and MSSQL both store rows in "data pages" underneath, and a row that spans pages can cause extra performance overhead. That topic is not expanded here; interested readers can look up the relevant material.

Splitting fields is best done at the database design stage. If it is done during development, existing queries have to be rewritten, which brings extra cost and risk, so proceed with caution.
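As a minimal sketch of the main table plus extension table idea, the following uses Python's built-in sqlite3 module. The table names (users, users_ext) and columns are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
import sqlite3

# Hypothetical schema: "users" keeps the short, frequently read columns;
# "users_ext" holds the rarely read or long columns, keyed by the same id.
SCHEMA = """
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL
);
CREATE TABLE users_ext (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    bio      TEXT,
    settings TEXT
);
"""


def create_db():
    """Build an in-memory database with one sample user."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'alice', 'alice@example.com')")
    conn.execute("INSERT INTO users_ext VALUES (1, 'a long biography ...', '{}')")
    return conn


def list_users(conn):
    # The common list query touches only the narrow main table.
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def user_detail(conn, user_id):
    # The detail view joins the extension table only when it is needed.
    return conn.execute(
        "SELECT u.id, u.name, e.bio FROM users u "
        "LEFT JOIN users_ext e ON e.user_id = u.id WHERE u.id = ?",
        (user_id,),
    ).fetchone()
```

Queries written against the original wide table would need the same kind of join after the split, which is why doing this late in development is costly.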

Vertical database splitting

Vertical database splitting has become very common now that "microservices" are prevalent. The basic idea is to divide data into different databases by business module, rather than keeping all tables in a single database as was done in the early days.

Summary

A "service-oriented" split at the system level resolves coupling and performance bottlenecks between business systems and makes the system easier to extend and maintain. A split at the database level works the same way. Much like the "governance" and "degradation" mechanisms for services, data of different business types can be managed, maintained, monitored, and scaled at different tiers.

As is well known, the database is often the component most likely to become the bottleneck of an application, and because the database is "stateful" it is harder to scale out horizontally than web and application servers. Database connections are a relatively scarce resource, and the processing capacity of a single machine is limited. Under high concurrency, vertical database splitting can, to some extent, break through the limits on IO, connection count, and single-machine hardware resources, and it is an important means of optimizing database architecture in large distributed systems.

However, many people have not worked out why they need to split in the first place, nor mastered the principles and techniques of splitting; they simply imitate the practices of large companies. This leads to many problems after the split (for example cross-database joins and distributed transactions).
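To make the split by business module concrete, here is a small sketch that maps tables to databases and checks whether a query would have to cross databases. The module, database, and table names are all hypothetical.

```python
# Hypothetical mapping from business module to its own database.
MODULE_DATABASES = {
    "user": "user_db",
    "order": "order_db",
    "product": "product_db",
}

# Hypothetical ownership of tables by business module.
TABLE_OWNERS = {
    "users": "user",
    "orders": "order",
    "order_items": "order",
    "products": "product",
}


def database_for(table):
    """Return the database that holds `table` after the vertical split."""
    return MODULE_DATABASES[TABLE_OWNERS[table]]


def needs_cross_db_join(*tables):
    """True when the given tables no longer live in a single database."""
    return len({database_for(t) for t in tables}) > 1
```

With this mapping, joining orders to order_items stays inside order_db, while joining orders to users spans two databases, which is exactly the situation discussed later in this article.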

Horizontal table splitting

Horizontal table splitting is also called row-wise splitting. It is fairly easy to understand: the rows of one table are distributed across several tables according to some rule (these tables still live in the same database), which reduces the amount of data in each table and improves query performance. The most common approach is to hash, or take the modulus of, a field such as the primary key or a timestamp and split on the result. As shown below:

Summary

Horizontal table splitting reduces the amount of data per table and relieves query performance bottlenecks to some extent. But the tables are still stored in the same database, so the IO bottleneck at the database level remains. For that reason this approach is generally not recommended.
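For reference, the routing rules mentioned above (modulo on the primary key, hashing, splitting by time) can be sketched as follows. The table prefix orders_ and the table count of 4 are assumptions made only for this example.

```python
import zlib

TABLE_COUNT = 4  # assumed number of split tables


def table_by_id(order_id):
    """Modulo routing on a numeric primary key."""
    return f"orders_{order_id % TABLE_COUNT}"


def table_by_key(key):
    """Hash routing for a non-numeric key.

    CRC32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    return f"orders_{zlib.crc32(key.encode('utf-8')) % TABLE_COUNT}"


def table_by_month(year, month):
    """Time-based routing: one table per calendar month."""
    return f"orders_{year:04d}{month:02d}"
```

For example, table_by_id(10) selects orders_2 and table_by_month(2024, 3) selects orders_202403. Whatever rule is chosen, every query must carry the routing field, otherwise all tables have to be scanned.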

Horizontal database and table splitting

Horizontal database and table splitting follows the same idea as the horizontal table splitting above; the only difference is that the resulting tables are stored in different databases. This is the approach chosen by many large Internet companies. As shown below:

In a sense, the "hot and cold data separation" used in some systems is a similar practice: rarely used historical data is migrated to other databases, and business functions usually query only the hot data by default. With high concurrency and massive data volumes, splitting databases and tables effectively relieves the performance bottleneck and load of a single machine and a single database, and breaks through the limits on IO, connection count, and hardware resources. The hardware cost is of course higher, and the approach brings complex technical problems of its own (for example complex queries across shards and transactions across shards).
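A minimal sketch of routing a key to both a database and a table is shown below. The counts (2 databases, 4 tables each) and the names order_db_N and orders_N are assumptions for illustration, not a recommendation.

```python
DB_COUNT = 2        # assumed number of databases
TABLES_PER_DB = 4   # assumed number of tables inside each database


def locate(user_id):
    """Map a sharding key to a (database, table) pair."""
    slot = user_id % (DB_COUNT * TABLES_PER_DB)
    db_index = slot // TABLES_PER_DB
    table_index = slot % TABLES_PER_DB
    return f"order_db_{db_index}", f"orders_{table_index}"


def group_by_shard(user_ids):
    """Group keys by shard, as a query touching several users must do."""
    shards = {}
    for uid in user_ids:
        shards.setdefault(locate(uid), []).append(uid)
    return shards
```

The second function hints at why cross-shard queries are harder: a single logical query over several users becomes one query per shard, and the results have to be merged in the application.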

Difficulties of database and table splitting

Problems brought by vertical database splitting, and how to address them:

The cross-database join problem

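Before the split, much of the data needed by list and detail pages could be fetched with an SQL join. After the split, the databases may be distributed across different instances and hosts, and joins become very troublesome. On top of that, cross-database joins are usually prohibited for reasons of architectural standards, performance, and security. So what can be done? First reconsider the design of the vertical split itself; if it can be adjusted, adjust it first. For cases where it cannot, the author summarizes below several common approaches drawn from practical experience and discusses where each applies.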

Several approaches to cross-database joins

Global tables

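A global table is a table that every module in the system may depend on, much like what we usually call a "data dictionary". To avoid cross-database joins, a copy of such a table can be kept in every other database. This kind of data is also rarely modified (sometimes almost never), so there is little need to worry about consistency.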

Field redundancy

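This is a typical denormalized design, fairly common in the Internet industry, usually adopted to avoid join queries for the sake of performance.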

Take a very simple scenario from e-commerce:

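Along with the seller Id, the order table also stores a redundant copy of the seller's Name field, so that querying an order's details no longer requires a lookup in the seller user table.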

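Field redundancy is convenient, and it is a case of trading space for time. Its applicability is limited, though; it fits best when only a few fields are depended on. The hardest part is data consistency, which is difficult to guarantee; it can be handled with database triggers or in the business code. How strict the consistency requirement is also depends on the business scenario. In the example above, if a seller changes their Name, does the order information need to be updated to match?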
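The order and seller example can be sketched with two separate in-memory SQLite databases standing in for the split databases. All table and function names here are hypothetical, and consistency is handled in application code rather than with triggers.

```python
import sqlite3


def setup():
    """Create two independent databases: one for sellers, one for orders."""
    user_db = sqlite3.connect(":memory:")
    order_db = sqlite3.connect(":memory:")
    user_db.execute("CREATE TABLE seller (id INTEGER PRIMARY KEY, name TEXT)")
    order_db.execute(
        "CREATE TABLE orders "
        "(id INTEGER PRIMARY KEY, seller_id INTEGER, seller_name TEXT)"
    )
    user_db.execute("INSERT INTO seller VALUES (7, 'Old Shop')")
    return user_db, order_db


def create_order(user_db, order_db, order_id, seller_id):
    # Copy the seller name into the order row at write time.
    (name,) = user_db.execute(
        "SELECT name FROM seller WHERE id = ?", (seller_id,)
    ).fetchone()
    order_db.execute(
        "INSERT INTO orders VALUES (?, ?, ?)", (order_id, seller_id, name)
    )


def order_detail(order_db, order_id):
    # Rendering the order needs no lookup in the seller database.
    return order_db.execute(
        "SELECT id, seller_id, seller_name FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()


def rename_seller(user_db, order_db, seller_id, new_name, sync_orders):
    user_db.execute(
        "UPDATE seller SET name = ? WHERE id = ?", (new_name, seller_id)
    )
    if sync_orders:
        # Business decision: propagate the new name to existing orders.
        # The two updates hit different databases and are not atomic.
        order_db.execute(
            "UPDATE orders SET seller_name = ? WHERE seller_id = ?",
            (new_name, seller_id),
        )
```

Whether sync_orders should be true is exactly the business question raised above: an order may be meant to show the seller name as it was at purchase time, in which case the redundant copy should be left alone.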

Data synchronization

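If tab_a in database A is related to tbl_b in database B, the specified tables can be synchronized on a schedule, which avoids complex cross-database queries. Synchronization itself puts some load on the database, so a balance has to be struck between performance impact and data freshness. In one project the author implemented this with an ETL tool.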
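The idea can be sketched as a full-refresh copy of tab_a into database B, after which tbl_b can be joined locally. This is only an illustration with in-memory SQLite databases; the replica name tab_a_copy and the column layout are assumptions, and a real setup would typically run such a job from a scheduler or an ETL tool.

```python
import sqlite3


def make_databases():
    """Build database A (owns tab_a) and database B (owns tbl_b)."""
    db_a = sqlite3.connect(":memory:")
    db_b = sqlite3.connect(":memory:")
    db_a.execute("CREATE TABLE tab_a (id INTEGER PRIMARY KEY, name TEXT)")
    db_a.executemany("INSERT INTO tab_a VALUES (?, ?)", [(1, "x"), (2, "y")])
    db_b.execute("CREATE TABLE tab_a_copy (id INTEGER PRIMARY KEY, name TEXT)")
    db_b.execute("CREATE TABLE tbl_b (id INTEGER PRIMARY KEY, a_id INTEGER)")
    db_b.execute("INSERT INTO tbl_b VALUES (10, 2)")
    return db_a, db_b


def sync_table(source, target):
    """Copy every row of tab_a from `source` into the replica in `target`."""
    rows = source.execute("SELECT id, name FROM tab_a").fetchall()
    with target:  # refresh the replica inside one transaction
        target.execute("DELETE FROM tab_a_copy")
        target.executemany(
            "INSERT INTO tab_a_copy (id, name) VALUES (?, ?)", rows
        )
    return len(rows)


def local_join(db_b):
    # After the sync, the join runs entirely inside database B.
    return db_b.execute(
        "SELECT t.id, c.name FROM tbl_b t "
        "JOIN tab_a_copy c ON c.id = t.a_id ORDER BY t.id"
    ).fetchall()
```

A full refresh is the simplest form; how often it runs determines how stale the replica may be, which is the balance between load and freshness mentioned above.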

Assembly at the system layer

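At the system level, data is fetched by calling the components or services of different modules, and the fields are then stitched together. It sounds easy, but in practice it is far from simple, especially when the database design has problems that cannot easily be fixed.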

Real situations are usually more complicated. The author describes them below in pseudocode, drawing on practical experience.

Simple list queries

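The pseudocode is easy to understand: first fetch the "my questions" list, then loop over the UserId values in the list, calling the user service it depends on to get each user's RealName, and finally assemble and return the result.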

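Experienced readers will see at once that the pseudocode above has an efficiency problem. Calling the service in a loop can mean RPCs in a loop and database queries in a loop, so it is not recommended. Now look at the improved version: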

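This implementation looks a little more elegant; it simply turns the looped calls into a single call. The user service will most likely use an IN query against its database, which is more efficient than the previous approach. (Rumor has it that IN queries cause full table scans and perform poorly, but the rumor should not be taken at face value. Query optimizers work from cost estimates, and in the author's tests an IN clause on an indexed column did use the index when the list of values was small. This is not expanded here; interested readers can test it themselves.)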
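The two variants described above can be sketched as follows. This is an illustration written for this text, not the author's original pseudocode: UserService, its methods, and the sample data are hypothetical stand-ins, and the call counter only makes the number of round trips visible.

```python
# In-memory stand-ins for the question module's data and the user data.
QUESTIONS = [
    {"id": 1, "title": "How to shard?", "user_id": 10},
    {"id": 2, "title": "Why split?", "user_id": 11},
    {"id": 3, "title": "Join or not?", "user_id": 10},
]
USERS = {10: "Zhang San", 11: "Li Si"}


class UserService:
    """Hypothetical remote user service; `calls` counts round trips."""

    def __init__(self):
        self.calls = 0

    def get_real_name(self, user_id):
        self.calls += 1
        return USERS[user_id]

    def get_real_names(self, user_ids):
        # One round trip; a real service might run a single IN query here.
        self.calls += 1
        return {uid: USERS[uid] for uid in user_ids}


def list_questions_naive(svc):
    # One service call per row: N rows cost N round trips.
    return [
        dict(q, real_name=svc.get_real_name(q["user_id"])) for q in QUESTIONS
    ]


def list_questions_batched(svc):
    # Collect the distinct user ids, then fetch all names in a single call.
    user_ids = {q["user_id"] for q in QUESTIONS}
    names = svc.get_real_names(user_ids)
    return [dict(q, real_name=names[q["user_id"]]) for q in QUESTIONS]
```

Both functions return the same assembled list; for the three sample questions the first makes three service calls and the second makes one.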

Summary

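For simple field assembly, we only need to fetch the "main table" data first, then, following the relationships, call the components or services of other modules to get the dependent fields (the user information in the example), and finally assemble the data.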

Usually a cache is used to avoid the overhead of frequent RPC calls and database queries.
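One simple way to picture such a cache is a wrapper that only asks the remote side for ids it has not seen yet. The class below is a hypothetical sketch: fetch_many stands for any batch call to the user service, and expiry and invalidation, which a real cache needs, are deliberately left out.

```python
class CachedUserNames:
    """Cache in front of a hypothetical batch lookup `fetch_many(ids)`."""

    def __init__(self, fetch_many):
        self._fetch_many = fetch_many
        self._cache = {}

    def get_many(self, user_ids):
        # Only ids that are not cached yet cause a remote call.
        missing = [uid for uid in set(user_ids) if uid not in self._cache]
        if missing:
            self._cache.update(self._fetch_many(missing))
        # Ids unknown to the remote side are simply left out of the result.
        return {
            uid: self._cache[uid] for uid in user_ids if uid in self._cache
        }
```

Because user names change rarely in many systems, even a short-lived cache like this can remove most of the repeated lookups made by list pages.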

List queries with conditional filtering

The examples above are all simple field assembly with no conditional filtering. Look at the SQL before the split:

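With a join query like this that also filters on conditions, assembling the data in code is actually very complicated (even more so when both the left and the right table carry filter conditions), and it can no longer be done as simply as in the earlier examples. Assembling it the simple way would return data that is incomplete and inaccurate.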

There are several ways to solve this:

  1. Fetch all the Q&A data, call the user service to assemble it, then filter on the state field, and finally sort, paginate, and return.

    This guarantees accurate and complete data, but the performance cost is very high, so it is not recommended.

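  2. Find the UserId values whose state field matches (or does not match), then use in / not in when querying the Q&A data to filter, sort, and paginate. Once the valid Q&A data has been filtered out, call the user service to fetch the data for assembly.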

    This is clearly more elegant. The author used this approach in a special scenario on an earlier project.
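The second approach can be sketched with two in-memory SQLite databases, one standing in for the user database and one for the Q&A database. The schema, sample rows, and function names are assumptions for illustration; in a real system the first and third steps would be calls to the user service rather than direct queries.

```python
import sqlite3


def make_databases():
    """Build a user database and a separate Q&A database with sample rows."""
    user_db = sqlite3.connect(":memory:")
    qa_db = sqlite3.connect(":memory:")
    user_db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, real_name TEXT, state INTEGER)"
    )
    user_db.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(10, "Zhang San", 1), (11, "Li Si", 0), (12, "Wang Wu", 1)],
    )
    qa_db.execute(
        "CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT, user_id INTEGER)"
    )
    qa_db.executemany(
        "INSERT INTO questions VALUES (?, ?, ?)",
        [(1, "q1", 10), (2, "q2", 11), (3, "q3", 12), (4, "q4", 10)],
    )
    return user_db, qa_db


def list_questions(user_db, qa_db, state, page, page_size):
    # Step 1: ask the user side which user ids satisfy the filter.
    user_ids = [
        row[0]
        for row in user_db.execute(
            "SELECT id FROM users WHERE state = ?", (state,)
        )
    ]
    if not user_ids:
        return []
    # Step 2: filter, sort, and paginate inside the Q&A database with IN.
    marks = ",".join("?" * len(user_ids))
    rows = qa_db.execute(
        f"SELECT id, title, user_id FROM questions WHERE user_id IN ({marks}) "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (*user_ids, page_size, (page - 1) * page_size),
    ).fetchall()
    if not rows:
        return []
    # Step 3: fetch names only for the users on this page and assemble.
    page_ids = sorted({row[2] for row in rows})
    marks = ",".join("?" * len(page_ids))
    names = dict(
        user_db.execute(
            f"SELECT id, real_name FROM users WHERE id IN ({marks})", page_ids
        )
    )
    return [(qid, title, names[uid]) for qid, title, uid in rows]
```

Sorting and pagination happen in the Q&A database, so the pages stay correct. The sketch presumes that the set of matching user ids is small enough to pass in an IN list, which may be why the author describes it as suited to a special scenario.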

Cross-database transactions (distributed transactions)

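Once databases are split by business, the problem of "distributed transactions" is unavoidable. Transactions that used to need only a simple Spring annotation in code now take considerable effort to keep consistent. This is not covered here;
interested readers can refer to 《分布式事务一致性解决方案》 (Solutions for Distributed Transaction Consistency), at:
http://www.infoq.com/cn/articles/solution-of-distributed-system-transaction-consistency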

Vertical database splitting: summary and practical advice

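This article has described several common ways of splitting and focused on the problems that vertical database splitting brings and how to address them. Readers may still have some questions.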

1. Does our current database need vertical splitting?

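It depends on the system architecture and the company's actual situation. If your system is still a simple monolith without much traffic or data, do not rush into vertical database splitting; there will be no benefit and it is unlikely to end well.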

Remember that "over-design" and "premature optimization" are mistakes many architects and engineers make.

2. Are there principles or techniques for vertical splitting?

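There is no golden rule or standard answer. Usually the database split follows the split of the system's business modules; a "user service", for example, may correspond to a "user database". The mapping is not necessarily strictly one-to-one, though, and in some cases the database split is coarser than the system split. The author has indeed seen systems where tables that belonged in database A were placed in database B, where databases and tables that could have been merged were kept separate, and where some tables looked fine in database A and equally reasonable in database B.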

How to design and weigh these trade-offs depends on the actual situation and on the skill of the architects and developers.

3. The examples above are all too simple. Our back-office reporting system joins n tables; how do we query after splitting the databases?

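Many friends have raised similar questions with me. In Internet business systems joins should be avoided as far as possible anyway; if there are many joins, either the design is unreasonable or the wrong technology was chosen. Please read up on OLAP and OLTP: in the traditional BI era, reporting systems were built on OLAP data warehouses (today they rely more on offline analysis, stream computing, and similar means), rather than running large numbers of joins and aggregations directly against the business database as described above.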

 

 

http://www.infoq.com/cn/articles/key-steps-and-likely-problems-of-split-table?utm_source=infoq&utm_campaign=user_page&utm_medium=link
