[Advanced] Detailed explanation of MySQL sub-database and sub-table



0. Preface

Suppose there is an e-commerce website. As the number of users and orders grows, a single database struggles to hold that much data, and queries gradually slow down. At that point, the data needs to be split across multiple databases and tables.

Why sub-database

We know that database connections are limited. In a high-concurrency scenario, a single MySQL server cannot handle a large number of requests. The now popular microservice architecture emerged to deal with high concurrency: it splits modules such as orders, users, and commodities into separate applications, and splits the single database into several databases by functional module (order database, user database, and commodity database) to share the read and write pressure.

  1. When a single database cannot handle high-load and large-scale data, the data can be distributed among multiple databases through sub-databases to achieve horizontal expansion and load balancing. Each database instance is responsible for processing part of the data and requests, improving the performance and throughput of the overall system.

  2. In some cases, data needs to be isolated, such as a multi-tenant system, and the data of each tenant needs to be stored separately to ensure data security and privacy. By sub-database, data of different tenants can be stored in different databases to avoid data confusion and conflict.

  3. When the application involves multiple geographical locations or data centers, the data can be distributed in different databases so that the data can be stored nearby, improving data access speed and reducing network latency.

  4. For large systems, the management and maintenance of data can become complex and difficult. Through sub-databases, data can be divided according to business functions or modules, simplifying data management and maintenance operations.

Why sub-table

The rumor of a MySQL performance bottleneck at 20 million rows

If you have worked as a developer for three to five years, you have probably heard, from senior colleagues or from online posts, about the single-table performance bottleneck in MySQL: once a single table exceeds 20 million rows, performance drops significantly. This rumor has been around for a long time and keeps being passed on. I don't think that is a bad thing, at least from the point of view of performance optimization. According to unofficial accounts, the claim is said to have originated at Baidu, was later carried to other companies by Baidu engineers, and gradually spread through the industry.


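A long time ago, out of curiosity, I ran a test to verify this. On an 8-core machine with 16 GB of memory and a mechanical hard disk, using a single table with 32 fields, once the table held more than 10 million rows and had no index optimization, a single query took roughly 5 to 8 seconds. After index optimization, that dropped to within 3 seconds. This can already be regarded as a performance bottleneck, and it shows up well below 20 million rows.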

Best practices given by Alibaba

Alibaba's "Java Development Manual" recommends sub-database and sub-table only when a single table exceeds 5 million rows or 2 GB in size. However, this value is not fixed; it depends on the MySQL configuration and the hardware of the machine. When a single table grows past a certain limit, memory can no longer hold its indexes, so subsequent SQL queries incur disk I/O and performance degrades. Upgrading the hardware may improve performance.

Alibaba's "Java Development Manual" adds that if the data volume is not expected to reach this level within three years, you should not split the database and table when creating the table. After a comprehensive evaluation of your own machines, 5 million rows can be used as a unified standard. According to actual tests, under the InnoDB engine, query performance on a single MySQL table with 8 million rows may already be poor. Queries may be faster with the MyISAM engine, but MyISAM is not as good as InnoDB for data integrity and transaction support. Therefore, an appropriate optimization scheme should be selected according to actual needs.
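To make the two thresholds concrete, here is a minimal sketch in Java. The class and method names are hypothetical, and the linear growth projection over 36 months is my own simplifying assumption rather than something the manual prescribes.

```java
// Sketch only: ShardingThreshold is a hypothetical helper, not part of any library.
final class ShardingThreshold {
    static final long MAX_ROWS = 5_000_000L;                 // 5 million rows
    static final long MAX_BYTES = 2L * 1024 * 1024 * 1024;   // 2 GB

    // True when a single table is already past either threshold.
    static boolean exceeds(long rows, long bytes) {
        return rows > MAX_ROWS || bytes > MAX_BYTES;
    }

    // Assumption: growth is linear, so the size after three years is
    // the current size plus 36 months of growth.
    static boolean expectedToExceedWithinThreeYears(long currentRows, long rowsPerMonth,
                                                    long currentBytes, long bytesPerMonth) {
        long months = 36;
        return exceeds(currentRows + rowsPerMonth * months,
                       currentBytes + bytesPerMonth * months);
    }
}
```

A table with 1 million rows that grows by 100,000 rows per month ends up at 4.6 million rows after three years, so under this rule of thumb it would not be split at creation time.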

1. Sub-database and sub-table

1.1. Vertical sub-database sub-table

An e-commerce website has a user module and an order module, so the user table and the order table can be split into different databases. For example, the user table (including basic user information and extended user information) and the order table that originally lived in the same database can be split into two databases: one stores the user's basic information, and the other stores the user's extended information and order information.

Most blog articles use an e-commerce system as the example because it is the most representative: its user module and order module can be split into different databases, which improves the scalability and performance of the system.

1.1.1. Vertical sub-database

Suppose there is an e-commerce website, which has a database that contains two tables: user table (User) and order table (Order). The user table contains the basic information of the user (such as user name, password, etc.) and the extended information of the user (such as the user's shopping preferences, historical orders, etc.). The order table contains the details of the order (such as order number, items purchased, quantity, price, etc.).

![Insert picture description here](https://img-blog.csdnimg.cn/be4c1eb2300347d081c2201bd777114e.png)
When users, orders, and products are split into different databases, the layout looks like this:

User Database

  • User Information Table
  • Authentication Information Table
  • Authorization Information Table
  • User Statistics Information Table

Order Database

  • Order Information Table
  • Order Status Table
  • Payment Information Table

Product Database

  • Product Information Table
  • Inventory Information Table
  • Price Information Table

Each database contains the tables and data associated with it. Storing users, orders, and products in separate databases brings the following advantages:

  1. Since users, orders, and products are in different databases, each database can be optimized for its specific needs. For example, the user database can be optimized for user queries and authentication, the order database for order processing and status tracking, and the product database for product retrieval and inventory management. This improves overall system performance.

  2. Each database can be scaled horizontally on its own. When the number of users, orders, or products increases, the corresponding database can be expanded by adding database instances or shards to meet the needs of high concurrency and large-scale data.

  3. Storing users, orders, and products in different databases provides data isolation and security. For example, permission to access the order database can be controlled separately from permission to access the user database or product database, which improves the security of the system.

1.1.2. Vertical table division

Consider the user portraits that are common in the e-commerce industry today. A user portrait is a label model of a user's basic attributes, purchasing power, behavioral characteristics, social network, psychological characteristics, and hobbies, obtained by analyzing user data. Designing database tables around all of that would be too complicated, so let's use a simple example that conveys the basic idea. A real split design is a very complicated process.

OK, let's design a simple user table for the e-commerce system. The example below contains the user's basic information and some user-related attributes.

class User {
  - user_id: int       // User ID
  - username: string   // Username
  - password: string   // Password
  - email: string      // Email
  - phone_number: string   // Mobile phone number
  - full_name: string  // Full name
  - gender: string     // Gender
  - date_of_birth: date    // Date of birth
  - registration_date: date    // Registration date
  - last_login_date: date      // Last login date
  - address: string    // Address
  - postal_code: string   // Postal code
  - country: string    // Country
  - state: string      // Province/state
  - city: string       // City
  - avatar: string     // Avatar
  - bio: string        // Personal bio
  - is_active: boolean     // Whether the account is activated
  - is_admin: boolean     // Whether the user is an administrator
  - balance: decimal      // Balance
  - email_verification_status: boolean    // Email verification status
  - phone_verification_status: boolean    // Phone verification status
  - referrer_id: int     // Referrer ID
  - registration_ip: string    // Registration IP address
  - last_login_ip: string     // Last login IP address
  - order_count: int    // Number of orders
  - wishlist: string    // Wishlist
  - user_level: string     // User level/role
}

OK, let's think about how to do a vertical split.

Let's talk about the principle of vertical splitting first.
Vertical splitting is the process of splitting one large table into several smaller tables by function or subject.
In the case of the user table, the following aspects may be considered for splitting:

  1. Authentication and Authorization Table
    This table contains information related to user authentication and authorization, such as username, password, etc. This table can be used in the process of user login and authorization verification. Splitting out this table improves security and performance because sensitive information can be managed and stored separately.

  2. Profile Table
    This table contains the user's profile information, such as email, registration date, and so on. This table can be used to display user profiles or perform user statistics analysis. Splitting this table can reduce the amount of data in the main user table and improve query performance.

  3. Activity Log Table
    This table is used to record the user's activity log, such as the last login date, operation records, etc. This table can be used to track user activity and generate log reports. Splitting this table can avoid frequent write operations on the main user table and reduce the load on the main table.

Following the splitting principle above, our split table structure is as follows.

The user table is split vertically into an authentication table, a profile table, and a user statistics table. Each table is a separate class that contains its share of the fields.

  1. Authentication Table
@startuml

class t_user_authentication {
  - user_id: int       // User ID
  - username: string   // Username
  - password: string   // Password
  - is_active: boolean     // Whether the account is activated
  - is_admin: boolean     // Whether the user is an administrator
  - email_verification_status: boolean    // Email verification status
  - phone_verification_status: boolean    // Phone verification status
}

@enduml
  2. Profile Table
@startuml

class  t_user_Profile {
  - user_id: int       // User ID
  - email: string      // Email
  - phone_number: string   // Mobile phone number
  - full_name: string  // Full name
  - gender: string     // Gender
  - date_of_birth: date    // Date of birth
  - address: string    // Address
  - postal_code: string   // Postal code
  - country: string    // Country
  - state: string      // Province/state
  - city: string       // City
  - avatar: string     // Avatar
  - bio: string        // Personal bio
  - registration_date: date    // Registration date
  - last_login_date: date      // Last login date
  - registration_ip: string    // Registration IP address
  - last_login_ip: string     // Last login IP address
}

@enduml
  3. User Statistics Table

class  t_user_Statistics {
  - user_id: int       // User ID
  - order_count: int    // Number of orders
  - wishlist: string    // Wishlist
  - user_level: string     // User level/role
  - balance: decimal      // Balance
  - referrer_id: int     // Referrer ID
}
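
To see what this split means for a single row, here is a rough Java sketch that takes one wide user row and breaks it into three narrower rows, one per table. The table and column names come from the classes above; the class UserRowSplitter and its split method are hypothetical names that I made up for illustration, and representing a row as a Map is only an assumption to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: UserRowSplitter is a hypothetical helper, not part of any library.
final class UserRowSplitter {
    // Columns of each split table, excluding user_id, which every table carries.
    static final Map<String, List<String>> COLUMNS_BY_TABLE = new LinkedHashMap<>();
    static {
        COLUMNS_BY_TABLE.put("t_user_authentication", List.of(
                "username", "password", "is_active", "is_admin",
                "email_verification_status", "phone_verification_status"));
        COLUMNS_BY_TABLE.put("t_user_Profile", List.of(
                "email", "phone_number", "full_name", "gender", "date_of_birth",
                "address", "postal_code", "country", "state", "city", "avatar", "bio",
                "registration_date", "last_login_date", "registration_ip", "last_login_ip"));
        COLUMNS_BY_TABLE.put("t_user_Statistics", List.of(
                "order_count", "wishlist", "user_level", "balance", "referrer_id"));
    }

    // Splits one wide row into one narrow row per table, keyed by table name.
    static Map<String, Map<String, Object>> split(Map<String, Object> wideRow) {
        Map<String, Map<String, Object>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> table : COLUMNS_BY_TABLE.entrySet()) {
            Map<String, Object> narrowRow = new LinkedHashMap<>();
            // user_id is repeated in every table so the rows can be matched up again.
            narrowRow.put("user_id", wideRow.get("user_id"));
            for (String column : table.getValue()) {
                if (wideRow.containsKey(column)) {
                    narrowRow.put(column, wideRow.get(column));
                }
            }
            result.put(table.getKey(), narrowRow);
        }
        return result;
    }
}
```

The point to notice is that user_id appears in all three rows. It is the only link between the tables after the split, so any query that needs fields from two of them has to look both up by user_id.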

 

2. Horizontal sub-database sub-table

2.1. Horizontal sub-database

In an e-commerce system, a single database may not be able to handle high load and large data volumes, so it becomes necessary to split databases or tables. This can also help reduce traffic and unnecessary load on each database.

Suppose we have an e-commerce platform with tens of millions of users and orders. If we put all the user data and order data in one database, then as time passes and the system grows, this database may run into many problems, such as slow queries and insufficient storage. Therefore, we may need to split the database horizontally.

Using user ID as the basis for sub-database, user data can be divided into multiple databases. For example, user data of user ID 1-50000 can be stored in database 1, user data of user ID 50001-100000 can be stored in database 2, and so on. In this way, queries and other operations can be performed in parallel on different databases, improving system efficiency and scalability.

The order data can be split across databases in a similar way, for example by order ID or by order date. For instance, the order data of order ID 1-10000 is stored in database 1, the order data of order ID 10001-20000 is stored in database 2, and so on.

After the horizontal sub-database is implemented in this way, the e-commerce platform can greatly reduce the load on the database, and improve system performance and data query efficiency.

Note that the key to horizontal sharding lies in choosing a reasonable sharding strategy, which needs to be set according to the specific situation of the data and business needs.
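
The user ID ranges described above can be sketched in a few lines of Java. The class name UserRangeRouter and the database names user_db_1, user_db_2 are hypothetical; the only thing taken from the text is the rule that every block of 50,000 consecutive user IDs goes to the next database.

```java
// Sketch only: UserRangeRouter and the "user_db_N" naming are assumptions for illustration.
final class UserRangeRouter {
    private static final long USERS_PER_DATABASE = 50_000L;

    // IDs 1-50000 map to user_db_1, 50001-100000 map to user_db_2, and so on.
    static String databaseFor(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("userId must be positive: " + userId);
        }
        long index = (userId - 1) / USERS_PER_DATABASE + 1;
        return "user_db_" + index;
    }

    public static void main(String[] args) {
        System.out.println(databaseFor(50_000));  // user_db_1
        System.out.println(databaseFor(50_001));  // user_db_2
    }
}
```

One thing to keep in mind with this kind of range rule: if IDs are assigned in increasing order, the newest users all land in the last database, so that database tends to receive most of the writes.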

2.2. Split order table horizontally

If you want to split the order table into two tables and use modulo for sharding, you can follow the example below:

Suppose we have an orders table called "orders" with the following fields:

  • order_id: order ID
  • customer_id: customer ID
  • order_date: order date
  • total_amount: the total amount of the order
  • shipping_address: shipping address

We can apply a modulo operation to the customer ID: divide the customer ID by 2 and place the order data into one of two order tables according to the remainder.

  1. orders_1: store the order data whose customer ID modulo result is 0.
  2. orders_2: store the order data whose customer ID modulo result is 1.

When the system receives a new order, it performs modulo calculation on the customer ID of the order. If the modulo result is 0, insert the order into the orders_1 table; if the modulo result is 1, insert the order into the orders_2 table.

When querying an order, the system determines which order table to query based on the result of the customer ID modulo. For example, to query the orders of the customer with ID 123, perform a modulo operation on the customer ID. If the result is 0, execute the query in the orders_1 table; if the result is 1, execute the query in the orders_2 table. Since 123 modulo 2 is 1, this query goes to orders_2.

By distributing the order table into different tables according to the modulo results of the customer ID, a simple horizontal split can be achieved, and the order data can be evenly distributed into two tables to improve query performance and scalability.

It should be noted that using modulo for sharding may lead to uneven distribution of data, because some customer IDs may tend to generate more orders, resulting in a large amount of data in a table. In practical applications, it is necessary to adjust and optimize according to the actual situation to ensure that the data is evenly distributed and meet the requirements of system performance and scalability.

This is an example of a simple horizontal split of the order table; the exact implementation may vary depending on system design and requirements.
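
The routing rule from this section can be sketched in Java as follows. The table names orders_1 and orders_2 and the remainder-to-table mapping come from the text; the class OrderTableRouter, its method names, and the exact SQL string are hypothetical and only meant as an illustration.

```java
// Sketch only: OrderTableRouter is a hypothetical helper, not part of any library.
final class OrderTableRouter {
    private static final int TABLE_COUNT = 2;

    // Remainder 0 goes to orders_1, remainder 1 goes to orders_2, as described above.
    static String tableFor(long customerId) {
        // floorMod keeps the remainder non-negative even for a negative ID.
        int remainder = (int) Math.floorMod(customerId, (long) TABLE_COUNT);
        return "orders_" + (remainder + 1);
    }

    // Builds the SQL text for looking up one customer's orders in the right table.
    static String selectByCustomer(long customerId) {
        return "SELECT order_id, order_date, total_amount FROM "
                + tableFor(customerId) + " WHERE customer_id = ?";
    }

    public static void main(String[] args) {
        System.out.println(tableFor(123));          // orders_2, because 123 % 2 == 1
        System.out.println(selectByCustomer(124));  // queries orders_1
    }
}
```

Because inserts and queries both go through the same tableFor method, all orders of one customer always end up in, and are read from, the same table.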

2. Understand the process and implementation plan

Remember one thing: no system is split into multiple databases and tables from day one, not even Taobao or JD.com. For small companies in particular, this kind of advanced design is cumbersome and can even be a hidden danger. It costs talent, resources, and operation and maintenance effort, and even the learning cost should not be underestimated. So the correct technical route and best practice for small and medium-sized companies is to evolve toward sub-database and sub-table step by step.

Let's take a scenario. There is an e-commerce website, and as the business develops, the number of users, products, and transactions all grow rapidly. The original single database can no longer keep up: queries are slow, system load is high, and there is even downtime. In this case, to improve the performance and stability of the system, the database needs to be split into sub-databases and sub-tables.

2.1. Question Discussion

The purpose of sub-database sub-table is to solve the problem that a single database cannot bear a large amount of data and high concurrency. However, sub-database and sub-table will also bring some problems, such as data consistency problems, distributed transaction problems, cross-database cross-table query problems, data migration problems, etc.

2.2. Derived sub-database sub-table strategy

Vertical database sharding and table sharding are common database splitting strategies to solve performance bottlenecks and scalability issues of a single database. The following are common strategies for vertical database and table division:

2.2.1. Vertical Sharding

  • Functional sub-database: Distribute data into different databases according to functional modules. For example, store user information in one database, order information in another, and so on. Each database focuses on processing data for a specific function, improving performance and scalability.
  • Vertical sub-databases can be divided according to business needs and data correlation, and data that is not strongly correlated can be stored in different databases to reduce the load of a single database.

2.2.2. Vertical Partitioning

  • Column division table: Distribute the fields in the table into different tables according to the correlation of the columns. Store frequently used fields and seldom used fields separately to reduce the amount of data when querying and improve query performance.
  • Entity sub-table: Distribute the rows in the table into different tables according to the association of data entities. Scatter data rows with strong associations into different tables to reduce data redundancy and improve query performance.

2.2.3. How do we choose the strategy of vertical database and table division

The choice of vertical sub-database and sub-table depends on the requirements of the system and data characteristics.

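  • Vertical sub-database suits the functional modules of a large system, where each module has its own data access pattern and load. For example: the user module, order module, and product module.

  • Vertical sub-table suits tables with a large number of columns or rows, where some columns or rows are accessed frequently and the rest are accessed rarely. For example: the user table we split above.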

2.3. Horizontal sub-database sub-table

Horizontal database sharding and table sharding are another common database splitting strategy, used to deal with large-scale data and high concurrent loads.

2.3.1. Horizontal Sharding

  • Divide databases according to data rows: Divide data rows into different databases according to certain rules (such as hash functions or ranges). Each database is only responsible for storing and processing a subset of data rows. For example, user data could be scattered across different databases based on the hash of the user ID.
  • Horizontal sub-databases can realize parallel processing and scalability of data, and each database can process its own data independently, thereby improving the throughput and performance of the system.

2.3.2. Horizontal Partitioning

  • Divide tables according to data rows: Divide the data table into multiple tables according to certain rules (such as hash functions or ranges). Each table contains only some rows of data, for example, order data is divided into multiple tables according to the range of order IDs.
  • Horizontal table partitioning can reduce the data volume of a single table, improve query performance and parallelism of data access. At the same time, it also provides better data management and maintenance flexibility.

In the strategy of horizontal sharding and table sharding, the following aspects need to be considered:

  1. Data partitioning rules: Determine the rules for partitioning data, such as based on hash value, range, time, etc. The rule should take into account the balance of data, avoid data skew and hot spots.

  2. Data Migration and Consistency: After sub-databases and tables, data migration and consistency issues need to be considered. When migrating data, you can use batch import, ETL tools or distributed data synchronization technology to ensure data consistency.

  3. Cross-database and cross-table queries: When cross-database and cross-table queries are required, appropriate query routing and distributed query mechanisms need to be designed. This includes implementing query routing logic at the application layer, or using distributed query tools and middleware to handle cross-database and cross-table queries.

  4. Transaction processing: In a horizontally sharded environment, a transaction may involve multiple databases or tables. Distributed transaction techniques, such as two-phase commit (2PC) or compensating transactions (Saga), need to be considered to ensure transaction consistency and reliability.
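
As an illustration of point 3, here is a minimal Java sketch of a cross-shard query handled at the application layer: the same "latest N orders" query is sent to every shard, and the partial results are merged in memory. The class ScatterGather, the OrderRow record, and the idea of passing each shard's result in as a list are all assumptions made to keep the example self-contained; a real system would fetch these lists from the shards over JDBC or through middleware.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: ScatterGather and OrderRow are hypothetical names for illustration.
final class ScatterGather {
    record OrderRow(long orderId, long createdAtEpochSecond) {}

    // Each inner list is one shard's own answer to
    // "ORDER BY created_at DESC LIMIT n". The global top n must be among them,
    // so merging, re-sorting, and cutting to n gives the overall result.
    static List<OrderRow> latest(List<List<OrderRow>> perShardTopN, int n) {
        return perShardTopN.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingLong(OrderRow::createdAtEpochSecond).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Note that every shard has to return n rows even though only n rows are kept in total. This is one reason why sorted and paginated queries become more expensive after sharding.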

2.3.3. Common horizontal table division strategies

  1. Range-based partitioning strategy: In this strategy, data is partitioned according to which range its value falls into. For example, when splitting by date, the data for 2019, 2020, and 2021 are each stored in a different table.

  2. List-based partitioning strategy: This is a strategy for discrete values that puts rows with a particular value into the same table. For example, if customers are divided by region, customers in the United States, customers in the United Kingdom, and customers in China are stored in separate tables.

  3. Hash-based table splitting strategy: Data is split according to the value of the hash function. This function takes as input row data (usually a primary key value) and returns a hash value that is used to determine which table to put the row into. The purpose is to randomize the data so that it is evenly distributed across multiple tables.

When implementing horizontal sharding in practice, the following factors should be considered:

  • When the query patterns of the application are not known in advance, hash partitioning may be the best choice, as it tends to spread data evenly across multiple tables.
  • If the query is mainly based on a specific column (such as "date" or "region"), then range partition or list partition may be a better choice.
  • Horizontal sharding can introduce added complexity, especially when handling transactions and maintaining referential integrity, so it needs to be done carefully.
  • By analyzing your workload and evaluating the impact of various strategies on query performance, you can decide which strategy is most appropriate.
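
The three strategies can be put side by side in a short Java sketch. The table names (orders_2020, customers_us, and so on), the region codes, and the class TableStrategies are hypothetical choices for illustration, and String.hashCode() is used as the hash function only because it is built in.

```java
import java.time.LocalDate;
import java.util.Map;

// Sketch only: TableStrategies and all table names are assumptions for illustration.
final class TableStrategies {
    // List strategy: an explicit mapping from a discrete value to a table.
    private static final Map<String, String> REGION_TABLES = Map.of(
            "US", "customers_us",
            "UK", "customers_uk",
            "CN", "customers_cn");

    // Range strategy: one table per calendar year of the order date.
    static String byYear(LocalDate orderDate) {
        return "orders_" + orderDate.getYear();
    }

    static String byRegion(String regionCode) {
        String table = REGION_TABLES.get(regionCode);
        if (table == null) {
            throw new IllegalArgumentException("No table configured for region " + regionCode);
        }
        return table;
    }

    // Hash strategy: hash the primary key and take it modulo the table count.
    // floorMod is used because hashCode() can be negative.
    static String byHash(String primaryKey, int tableCount) {
        return "orders_" + Math.floorMod(primaryKey.hashCode(), tableCount);
    }
}
```

The range and list versions make it obvious which table a "date" or "region" query must hit, while the hash version gives no such hint: a query that does not include the primary key has to visit every table.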

Use with mature components

Common frameworks:

  1. MyCat: An open source sub-database and table middleware that supports custom sharding rules and can achieve read-write separation, complete transparency of transactions and SQL.
  2. Sharding-JDBC: A lightweight Java framework that provides powerful functions such as database sub-table, read-write separation, distributed transactions and distributed sequences.
  3. ShardingSphere: It includes three independent products, Sharding-JDBC, Sharding-Proxy and Sharding-Sidecar, which can meet the needs of data sharding in different scenarios.

Complicated issues of sub-database and sub-table:

Problems faced after the stage of sub-database sub-table is completed

Once the technology has evolved to the point where the basic problems of sub-database and sub-table, the concurrency problem, and the performance problem are solved, we are left facing the following problems. They call for continued study and experimentation with different solutions, and we do not expand on them here.

1. Multi-site active-active deployment

In terms of geographical distribution of databases, disaster recovery, etc., if there are multiple databases and multiple tables, the complexity of data synchronization and backup will increase.

2. Data Migration Issues

After sub-database and sub-table, how to migrate the old data and how to keep the business running normally during the migration is a big problem.

3. Distributed transaction issues

In traditional monolithic databases, transactions are an important means to ensure data consistency. However, in the environment of sub-database and sub-table, the original transaction mechanism can no longer be used, and a new distributed transaction solution needs to be introduced.
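
One family of distributed transaction solutions is the compensating transaction (Saga) idea: each step commits locally in its own database, and if a later step fails, the steps that already succeeded are undone in reverse order. Below is a deliberately tiny Java sketch of that idea. The class SimpleSaga and the Step interface are hypothetical, and a real implementation would also need persistence, retries, and idempotent compensations, all of which are left out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch only: SimpleSaga and Step are hypothetical names, not a real framework API.
final class SimpleSaga {
    interface Step {
        void execute() throws Exception; // local transaction in one database
        void compensate();               // undo of that local transaction
    }

    // Runs the steps in order. On the first failure, compensates the steps
    // that already completed, newest first, and reports failure.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (Exception e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }
}
```

For example, with one step that reserves stock in the product database and one that creates the order in the order database, a failure of the second step triggers the compensation that releases the stock again.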

4. The problem of join query

After sub-database and sub-table, a join query that used to run inside a single database or table becomes difficult, and the data has to be associated and combined at the application layer.
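
Combining data at the application layer can look roughly like the Java sketch below. It assumes the orders have already been fetched from the order database and the matching users from the user database; the class AppLayerJoin and the three record types are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: AppLayerJoin and its record types are assumptions for illustration.
final class AppLayerJoin {
    record User(long userId, String username) {}
    record Order(long orderId, long userId, String totalAmount) {}
    record OrderView(long orderId, String username, String totalAmount) {}

    // Joins rows that were fetched separately from two databases.
    // Orders without a matching user are skipped, like an inner join.
    static List<OrderView> join(List<Order> orders, List<User> users) {
        Map<Long, User> usersById = new HashMap<>();
        for (User user : users) {
            usersById.put(user.userId(), user);
        }
        List<OrderView> views = new ArrayList<>();
        for (Order order : orders) {
            User user = usersById.get(order.userId());
            if (user != null) {
                views.add(new OrderView(order.orderId(), user.username(), order.totalAmount()));
            }
        }
        return views;
    }
}
```

In practice the second fetch is usually a single batched lookup (for example, an IN query on the collected user IDs) rather than one query per order, so the join costs two round trips instead of one per row.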

Example of policy implementation for sub-database and sub-table

  1. Function method (hashing, modulo, etc.): According to the function result of a certain field (such as user ID), the database and table are divided.
  2. Range method (enumeration, interval): For example, according to the date range, order ID range, etc., the database and table are divided.
  3. Consistent hashing: When adding or reducing database nodes, data migration can be minimized.

Let's write some pseudocode in Java to make it easy for everyone to understand.
Let's say we have multiple ways of determining which database the data should be stored in based on a certain value such as user ID or order ID. Let's use the function method, the range method and the consistent hash method to simulate it.

  1. Function method: We can use the remainder of the user ID divided by the number of databases as the index of the database.
public String getDatabase(int userId) {
    int dbCount = 2; // assume we have 2 databases
    // floorMod keeps the index non-negative even if userId is negative
    int dbIndex = Math.floorMod(userId, dbCount);
    return "db" + dbIndex;
}
  2. Range method: We can use the order ID as a range for division. For example, order IDs less than 10,000 are stored in one database, and order IDs greater than or equal to 10,000 are stored in another database.
public String getDatabase(int orderId) {
    // IDs below 10,000 go to db0; everything else goes to db1
    if (orderId < 10000) {
        return "db0";
    } else {
        return "db1";
    }
}
  3. Consistent hashing: This method requires a sorted data structure, such as TreeMap. First, we add each database (called a node) into a TreeMap keyed by its hash, then calculate the hash value of a key (such as a user ID) and look it up in the TreeMap. If no node has exactly that hash, we take the first node clockwise on the ring, that is, the node with the smallest hash greater than the key's hash, wrapping around to the first node when there is none.
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashing {
    // The ring: node hash -> node name, kept sorted by hash
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    public void addNode(String nodeName) {
        // String.hashCode() is used for brevity; it spreads nodes poorly on the ring
        int hash = nodeName.hashCode();
        nodes.put(hash, nodeName);
    }

    public void removeNode(String nodeName) {
        int hash = nodeName.hashCode();
        nodes.remove(hash);
    }

    public String getDatabase(String key) {
        if (nodes.isEmpty()) {
            return null;
        }

        int hash = key.hashCode();
        if (!nodes.containsKey(hash)) {
            // Walk clockwise: first node after the key's hash, or wrap around to the start
            SortedMap<Integer, String> tailMap = nodes.tailMap(hash);
            hash = tailMap.isEmpty() ? nodes.firstKey() : tailMap.firstKey();
        }
        return nodes.get(hash);
    }
}
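
A common refinement of the class above is to place each database on the ring many times ("virtual nodes") and to use a hash with better spread than String.hashCode(). The sketch below is one way to do that, not the only one: the class name VirtualNodeRing, the count of 100 virtual nodes, the "#" separator, and the choice of MD5 are all assumptions I made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: VirtualNodeRing is a hypothetical class, not part of any library.
final class VirtualNodeRing {
    private static final int VIRTUAL_NODES = 100; // assumed number of points per database
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            // Two-argument remove only deletes the point if it still belongs to this node.
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // Returns the owner of the first ring point at or after the key's hash,
    // wrapping around to the first point when the key hashes past the last one.
    String getDatabase(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Uses the first 8 bytes of the MD5 digest as a 64-bit ring position.
    static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The property that matters for migration is this: when a new database is added, a key either stays where it was or moves to the new database. It never moves between two old databases, which is what keeps data migration small compared with the plain modulo method.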

In fact, most of us use sub-database and sub-table components or middleware in real scenarios. The best known and most commonly used is sharding-jdbc, an excellent and lightweight component for exactly these scenarios. Although it has its share of bugs and deficiencies, it is enough to solve 90% of our problems. In a later chapter, I will write a detailed tutorial on using sharding-jdbc. Besides it, some domestic middleware is also good, as follows:

  1. TDDL: Alibaba's open-source sub-database and table-splitting middleware, which supports complex database-splitting and table-splitting strategies.
  2. Oceanus: Netease Cloud Database's sub-database and table-splitting solution, which supports functions such as automatic database and table sub-segmentation, read-write separation, and distributed transactions.



Origin blog.csdn.net/wangshuai6707/article/details/132637837