Divide and Conquer: A Brief Discussion on Sub-Database and Sub-Table and the Road to Practice | JD Cloud Technical Team

Foreword

I have talked a lot about microservices before. Microservices are themselves distributed systems, and their core idea is divide and conquer: a complex monolithic system is split into separate sub-services along business lines, which reduces complexity and, at the same time, improves the scalability of the system.

Today I want to talk about sub-database and sub-table, because it is an unavoidable step for any fast-growing business. I previously worked on a mall-related SaaS system in which the commodity pool was the storage bottleneck. The number of commodity pool records grows exponentially with the number of tenants and operations: it can reach tens of millions of rows in just a few months and may later exceed 100 million. Order data likewise keeps growing as the business grows.

For the storage layer, improving storage and query performance under large data volumes is a problem at a different level, but the idea is the same: divide and conquer.

What kind of problems we face

The retrieval performance of a relational database drops sharply once the amount of data exceeds a certain threshold. With massive data, keeping everything in a single table will obviously exceed what one database table can handle well.

In addition, although simply splitting tables solves slow retrieval caused by excessive data volume, it does not solve slow database responses caused by too many concurrent requests hitting the same database. Splitting databases is therefore needed to remove the performance bottleneck of a single database instance.
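
The difference between the two levels of splitting can be sketched as a routing function. This is only a minimal illustration, not the algorithm of any particular middleware: the class name `ShardRouter`, the counts (2 databases, 4 tables per database), and the `ds-N` / `t_order_N` naming are assumptions made for the example.

```java
// Hypothetical two-level router: first choose a database instance, then a table inside it.
final class ShardRouter {
    private final int dbCount;
    private final int tablesPerDb;

    ShardRouter(int dbCount, int tablesPerDb) {
        this.dbCount = dbCount;
        this.tablesPerDb = tablesPerDb;
    }

    // Database index: spreads connections and request load across instances.
    int dbIndex(long userId) {
        return (int) Math.floorMod(userId, (long) dbCount);
    }

    // Table index inside the chosen database: keeps each physical table small.
    int tableIndex(long userId) {
        return (int) Math.floorMod(userId / dbCount, (long) tablesPerDb);
    }

    String route(long userId) {
        return "ds-" + dbIndex(userId) + ".t_order_" + tableIndex(userId);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(2, 4);
        for (long userId : new long[] {0L, 1L, 7L, 10L}) {
            System.out.println(userId + " -> " + router.route(userId));
        }
    }
}
```

Splitting only by table would keep every request on one instance; the database index is what spreads concurrent requests over several instances.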

Database Architecture Scheme

Before talking about specific solutions, we need to understand the three architectures of the database and the solutions involved.

1. Shared Everything

Generally refers to a single-host environment in which CPU, memory, and disk are shared completely transparently. Its parallel processing capability is the weakest of the three and it is generally not designed for large-scale concurrency, but the architecture is relatively simple and it basically meets ordinary application requirements.

2. Shared Disk

Each processing unit has its own private CPU and memory and shares the disk system. Typical representatives are Oracle RAC and DB2 pureScale. Oracle RAC, for example, shares data through shared storage; parallel processing capability can be improved by adding nodes, and scalability is good. A Storage Area Network (SAN) connects multiple servers to the disk arrays over Fibre Channel, which reduces network overhead and improves read efficiency, so this architecture is often used in OLTP applications with high concurrency. It is similar to the SMP (Symmetric Multi-Processing) mode: once the storage interface is saturated, adding nodes no longer yields higher performance, and more nodes increase operation and maintenance costs.

3. Shared Nothing

Each processing unit has its own private CPU, memory, and disk. As the name suggests, no resources are shared. It is similar to the MPP (massively parallel processing) mode: processing units communicate with each other through a protocol, and parallel processing and scaling capabilities are better. Typical representatives are DB2 DPF and MySQL Cluster with sub-database and sub-table. Each node is independent and processes its own data; the processed results may be aggregated at an upper layer or passed between nodes.

What we usually call sharding is in fact Shared Nothing. A table is split horizontally at the physical storage level and distributed across multiple servers (or instances). Each server works independently and shares a common schema, as with MySQL Proxy and Google's various architectures, so processing power and capacity can be increased simply by adding servers.

As for MPP, it refers to analytical Massively Parallel Processing databases, which are optimized for analytical workloads that generally require aggregating and processing large data sets. MPP databases tend to be columnar: they typically store each column as an object rather than each row of a table. This architecture lets complex analytical queries be processed faster and more efficiently. Examples include Teradata, Greenplum, GaussDB 100, and TBase.
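
The Shared Nothing idea, in which each node processes only its own data and partial results are merged at an upper layer, can be sketched in a few lines. This is an in-memory illustration only: `ScatterGather` and `ShardNode` are hypothetical names, and in a real deployment the nodes would be separate servers reached over a network protocol.

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;

final class ScatterGather {
    // Rows are distributed by key; the key alone decides which node stores the row.
    static void insert(List<ShardNode> nodes, long userId, long amount) {
        int index = (int) Math.floorMod(userId, (long) nodes.size());
        nodes.get(index).insert(amount);
    }

    // The upper layer asks every node for its partial sum and merges the answers.
    static long totalAmount(List<ShardNode> nodes) {
        long total = 0;
        for (ShardNode node : nodes) {
            total += node.localSum();
        }
        return total;
    }

    public static void main(String[] args) {
        List<ShardNode> nodes = Arrays.asList(new ShardNode(), new ShardNode(), new ShardNode());
        insert(nodes, 1L, 100L);
        insert(nodes, 2L, 250L);
        insert(nodes, 5L, 50L);
        System.out.println("total = " + totalAmount(nodes));
    }
}

// Hypothetical stand-in for one shard: it owns its rows and shares nothing with other nodes.
final class ShardNode {
    private final List<Long> orderAmounts = new ArrayList<>();

    void insert(long amount) {
        orderAmounts.add(amount);
    }

    // Each node computes a partial result over its own data only.
    long localSum() {
        long sum = 0;
        for (long amount : orderAmounts) {
            sum += amount;
        }
        return sum;
    }
}
```

Adding capacity in this model means adding another node to the list, although existing rows would then need to be redistributed, which the sketch does not cover.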

Based on the above architectural solutions, we can provide solutions for large data volume storage:

The above solutions each have advantages and disadvantages. The biggest problem with the partition mode is that it is a quasi-Shared-Everything architecture that cannot scale CPU and memory horizontally, so it can basically be ruled out. NoSQL is actually a very good alternative, but NoSQL (including most open source NewSQL) consumes a lot of hardware and has high operation and maintenance costs. One of the most commonly used schemes is therefore MySQL-based sub-database and sub-table.

Sub-database and sub-table architecture scheme

For sub-database and sub-table, let us first look at what products are on the market.

| Component | Vendor | Features | Remarks |
| DBLE | ActionTech open source community | Highly scalable distributed middleware focusing on MySQL. | An enhanced version based on MyCAT. |
| Meituan Atlas | Meituan | Read-write separation, single-database sub-table. | Gradually being retired by the original vendor. |
| Cobar | Alibaba (B2B) | Cobar sits between the front-end application and the actual database as a proxy; the interface it exposes to the front end is the MySQL communication protocol. | The open source version only supports MySQL and does not support read-write separation. |
| MyCAT | Alibaba | A server that implements the MySQL protocol. Front-end users can treat it as a database proxy and access it with MySQL client tools and the command line; its back end communicates with multiple MySQL servers using the native MySQL protocol. | Developed on the basis of Alibaba's open source Cobar product. |
| Atlas | 360 | Read-write separation, static table partitioning. | No maintenance after 2015. |
| Kingshard | Open source project | A high-performance MySQL proxy developed in Go. | For basic read-write separation, Kingshard's performance is more than 80% of that of a direct MySQL connection. |
| TDDL | Alibaba Taobao | Dynamic data source, read-write separation, sub-database and sub-table. | Comes in two versions: one with middleware, and a direct Java version. |
| Zebra | Meituan Dianping | Dynamic data source, read-write separation, sub-database and sub-table, CAT monitoring. | Complete functions and monitoring, but complex to integrate and with many restrictions. |
| MTDDL | Meituan Dianping | Dynamic data source, read-write separation, distributed unique primary key generator, sub-database and sub-table, connection pool and SQL monitoring. | |
| Vitess | Google, YouTube | The cluster is managed through ZooKeeper and data processing is performed through RPC. It consists of three parts: server, command line, and GUI monitoring. | Used at massive scale at YouTube. |
| DRDS | Alibaba | DRDS (Distributed Relational Database Service) focuses on the scalability problem of stand-alone relational databases. It is lightweight (stateless), flexible, stable, and efficient. | Independently developed by Alibaba Group. |
| Sharding-proxy | Apache open source project | Provides a MySQL version that can be operated with any client compatible with the MySQL protocol (such as MySQL Command Client or MySQL Workbench), which is friendlier to DBAs. It is completely transparent to the application and can be used directly as MySQL. | Positioned as a transparent database proxy; provides a server version that encapsulates the database binary protocol to support heterogeneous languages. |
| Sharding-jdbc | Apache open source project | Fully compatible with JDBC and the various ORM frameworks. Works with any Java-based ORM framework, such as JPA, Hibernate, MyBatis, or Spring JDBC Template, or with JDBC used directly. Works with any third-party database connection pool, such as DBCP, C3P0, BoneCP, Druid, or HikariCP. Supports any database that implements the JDBC specification; currently MySQL, Oracle, SQLServer, and PostgreSQL. | Positioned as a lightweight Java framework that provides additional services at Java's JDBC layer. The client connects directly to the database and the framework is delivered as a jar, with no additional deployment or dependencies. It can be understood as an enhanced JDBC driver. |

Products for sub-database and sub-table fall into two modes: middleware mode and client mode.

1. Advantages and disadvantages of middleware mode

The middleware runs as an independent process, so it supports heterogeneous languages, is not intrusive to the existing program, and appears to the business side as a transparent MySQL service. The disadvantages are also obvious: high hardware consumption and high operation and maintenance costs (especially for on-premises deployments). In addition, because a proxy is placed in front of the relational database, problems become harder to debug.

2. Advantages and disadvantages of client mode

The main disadvantage of client mode is that it is intrusive to the code, so it can basically support only a single language. Also, because every client must establish connections to each schema, the number of connections needs to be carefully controlled when there are not many database instances. The advantages of client mode are equally obvious. Architecturally it is decentralized, which avoids the proxy becoming a point of failure as in middleware mode. Because there is no middle layer, it offers high performance, flexibility, and controllability. And because there is no proxy layer, there is no need to consider high availability and clustering of the proxy, so operation and maintenance costs are relatively low.
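
The connection pressure in client mode comes down to simple arithmetic, sketched below. All numbers are illustrative assumptions (50 application instances, a pool of 20 connections per data source, a limit of 800 connections per database instance), and `ConnectionBudget` is a hypothetical helper rather than part of any framework.

```java
// Hypothetical helper for estimating connection usage in client mode.
final class ConnectionBudget {
    // Every application instance opens its own pool to every database instance,
    // so one database instance sees appInstances * poolSize connections.
    static int connectionsPerDatabase(int appInstances, int poolSizePerDataSource) {
        return Math.multiplyExact(appInstances, poolSizePerDataSource);
    }

    // Largest pool size that keeps one database instance under its connection limit,
    // after reserving some connections for administration and other clients.
    static int maxPoolSize(int appInstances, int dbConnectionLimit, int reserved) {
        if (appInstances <= 0) {
            throw new IllegalArgumentException("appInstances must be positive");
        }
        return (dbConnectionLimit - reserved) / appInstances;
    }

    public static void main(String[] args) {
        System.out.println("connections per database: " + connectionsPerDatabase(50, 20));
        System.out.println("max pool size per data source: " + maxPoolSize(50, 800, 50));
    }
}
```

With these assumed numbers, a pool of 20 would already exceed the limit, which is why pool sizes have to be planned together with the number of application instances.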

Sharding-JDBC access practice

Sharding-JDBC is probably the best known of these products. It is positioned as a lightweight Java framework that provides additional services at Java's JDBC layer. The client connects directly to the database, and the framework is delivered as a jar with no additional deployment or dependencies. It can be understood as an enhanced JDBC driver, fully compatible with JDBC and the various ORM frameworks: it works with any JDBC-based ORM framework, such as JPA, Hibernate, MyBatis, or Spring JDBC Template, or with JDBC used directly. Its community activity and code quality are also very good. Next, I will go through the integration in detail.

1. Component integration

<dependency>
    <groupId>org.apache.shardingsphere</groupId>
    <artifactId>shardingsphere-jdbc-core-spring-boot-starter</artifactId>
    <version>5.0.0</version>
</dependency>
<dependency>
    <groupId>com.baomidou</groupId>
    <artifactId>dynamic-datasource-spring-boot-starter</artifactId>
    <version>3.4.0</version>
</dependency>

2. Bean configuration

Configure the Sharding-JDBC data source and add it to the dynamic data source so that requests can be routed between data sources.

Modify the MySQL data source configuration of the corresponding service in the original configuration center, and configure the data source that is not split by database or table as the default route of the dynamic data source.

3. Sharding-JDBC configuration
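
The routing idea behind this step can be sketched without any framework. The class below is not the API of dynamic-datasource-spring-boot-starter; `RoutingRegistry` and the route names `master` and `sharding` are hypothetical, and plain strings stand in for real `DataSource` objects so that the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: named targets plus a default route for everything else.
final class RoutingRegistry<T> {
    private final Map<String, T> targets = new HashMap<>();
    private final String defaultName;

    RoutingRegistry(String defaultName, T defaultTarget) {
        this.defaultName = defaultName;
        targets.put(defaultName, defaultTarget);
    }

    // Add another target, for example the data source built by Sharding-JDBC.
    void register(String name, T target) {
        targets.put(name, target);
    }

    // Unknown or missing names fall back to the default (unsharded) target.
    T resolve(String name) {
        T target = (name == null) ? null : targets.get(name);
        return target != null ? target : targets.get(defaultName);
    }

    public static void main(String[] args) {
        RoutingRegistry<String> registry = new RoutingRegistry<>("master", "plain MySQL data source");
        registry.register("sharding", "Sharding-JDBC data source");
        System.out.println(registry.resolve("sharding"));
        System.out.println(registry.resolve(null));
    }
}
```

Keeping the unsharded data source as the default means existing code paths keep working unchanged, and only the tables that are actually split need to select the sharding route.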

# ShardingSphere switch
spring.shardingsphere.enabled=true
spring.shardingsphere.props.sql.show=true
# When a configuration center is used, Standalone mode is sufficient (the three modes are Memory, Standalone, and Cluster)
spring.shardingsphere.mode.type=Standalone
# In Standalone mode use File, i.e. the current configuration file
spring.shardingsphere.mode.repository.type=File
# Whether the local configuration overwrites the configuration center. If true, the local configuration wins on every startup.
spring.shardingsphere.mode.overwrite=true
# Names of the actual data sources
spring.shardingsphere.datasource.names=ds-0,ds-1
# Configure the ds-0 data source
spring.shardingsphere.datasource.ds-0.jdbc-url=jdbc:mysql://****
spring.shardingsphere.datasource.ds-0.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds-0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds-0.username=
spring.shardingsphere.datasource.ds-0.password=
# Configure the ds-1 data source
spring.shardingsphere.datasource.ds-1.jdbc-url=jdbc:mysql://****
spring.shardingsphere.datasource.ds-1.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.ds-1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds-1.username=
spring.shardingsphere.datasource.ds-1.password=
# Configure the default database sharding key and the related tables
spring.shardingsphere.rules.sharding.default-database-strategy.standard.sharding-column=user_id
spring.shardingsphere.rules.sharding.default-database-strategy.standard.sharding-algorithm-name=database-inline
spring.shardingsphere.rules.sharding.binding-tables[0]=t_order,t_order_item
# Broadcast tables, i.e. tables whose inserts and deletes are applied in every database
spring.shardingsphere.rules.sharding.broadcast-tables=t_address

The above is a basic configuration. For the configuration of specific business scenarios, refer to the open source community documentation: https://shardingsphere.apache.org/document/4.1.0/cn/overview/

Summary

For specific business scenarios, we first divide business units based on DDD and do the vertical database split well. Then, for specific tables with huge business growth, we split horizontally across databases, for example the commodity pool table in the commodity subdomain and the order table in the order subdomain.

As for the table dimension, the vertical table split needs to be designed early in the business. For example, in the product pool design, the pool only needs to store relational information, while product detail information is stored separately in a base table.
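
To make the sharding rules in the configuration concrete, the sketch below mimics what they ask the framework to do. It does not call ShardingSphere. It assumes that the `database-inline` algorithm, whose expression is not shown in the configuration, is something like `ds-${user_id % 2}`, and `ShardingRuleSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the routing that the sharding rules describe.
final class ShardingRuleSketch {
    static final List<String> DATA_SOURCES = Arrays.asList("ds-0", "ds-1");
    static final List<String> BROADCAST_TABLES = Collections.singletonList("t_address");

    // Assumed inline expression: ds-${user_id % 2}.
    static String databaseFor(long userId) {
        return "ds-" + Math.floorMod(userId, (long) DATA_SOURCES.size());
    }

    // Data sources that a write to the given table has to reach.
    static List<String> writeTargets(String table, long userId) {
        if (BROADCAST_TABLES.contains(table)) {
            // Broadcast table: the same row is written to every database.
            return DATA_SOURCES;
        }
        // Sharded table: exactly one database, chosen by the sharding column user_id.
        return Collections.singletonList(databaseFor(userId));
    }

    public static void main(String[] args) {
        long userId = 7L;
        // Binding tables share the sharding key, so both rows land in the same database.
        System.out.println("t_order      -> " + writeTargets("t_order", userId));
        System.out.println("t_order_item -> " + writeTargets("t_order_item", userId));
        System.out.println("t_address    -> " + writeTargets("t_address", userId));
    }
}
```

Because `t_order` and `t_order_item` are bound and routed by the same `user_id`, a join between them for one user can be answered inside a single database, while `t_address` is available locally in every database.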
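
The vertical table split described for the product pool could be modeled roughly as follows. All class and field names are hypothetical and chosen only for illustration; the article does not give the actual schema.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a vertical table split for the product pool.
final class VerticalSplitSketch {

    // Narrow, fast-growing pool table: only the relation between a pool and a product.
    static final class PoolEntry {
        final long poolId;
        final long productId;

        PoolEntry(long poolId, long productId) {
            this.poolId = poolId;
            this.productId = productId;
        }
    }

    // Wide base table: product details are stored once, however many pools reference them.
    static final class ProductDetail {
        final long productId;
        final String title;
        final String description;

        ProductDetail(long productId, String title, String description) {
            this.productId = productId;
            this.title = title;
            this.description = description;
        }
    }

    // Resolving a pool entry to its details is a second lookup by product id.
    static ProductDetail detailOf(PoolEntry entry, Map<Long, ProductDetail> baseTable) {
        return baseTable.get(entry.productId);
    }

    public static void main(String[] args) {
        Map<Long, ProductDetail> baseTable = new HashMap<>();
        baseTable.put(42L, new ProductDetail(42L, "Sample product", "Long description text"));

        PoolEntry first = new PoolEntry(1L, 42L);
        PoolEntry second = new PoolEntry(2L, 42L);
        System.out.println(detailOf(first, baseTable).title);
        System.out.println(detailOf(second, baseTable) == detailOf(first, baseTable));
    }
}
```

The rows that multiply with tenants stay small, so the table that later needs horizontal splitting carries as little data per row as possible.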

Author: JD Logistics Zhao Yongping

Source: JD Cloud Developer Community

Origin my.oschina.net/u/4090830/blog/9696997