Advantages:
Splitting the database (sub-libraries) reduces the load on any single machine;
Splitting tables (sub-tables) improves the efficiency of data operations, especially write operations.
1. Vertical segmentation
Applicable scenarios: many tables and a large volume of data.
Features: Simple rules, clear business logic, and very low coupling between businesses. Tables used by the same business are placed in the same database. Within each group of tables produced by vertical segmentation, find the "root element" and segment horizontally by it: starting from the root element, put all the data directly or indirectly related to it into one shard (fragment). For example, on a social networking site almost all data is ultimately associated with a user, so segmenting by user is the best choice. Another example is a forum system. The user and forum modules should be placed in two separate shards during vertical segmentation. Within the forum module, the Forum is clearly the aggregate root, so it is natural to place all posts and replies in a Forum in the same shard as that Forum.
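The forum example can be sketched as a small routing function. This is only an illustration: the table names, the MODULE_OF_TABLE mapping, the shard count, and the db naming scheme are all assumptions, not something the text prescribes.

```python
# Hypothetical mapping from table to business module (the vertical step).
MODULE_OF_TABLE = {
    "user": "user",
    "forum": "forum",
    "post": "forum",
    "reply": "forum",
}

SHARDS_PER_MODULE = 4  # assumed number of shards inside each module


def shard_for(table: str, root_id: int) -> str:
    """Pick a shard from the table's module and the id of its root element.

    For forum tables root_id is the forum id; for the user table it is the
    user id. Rows that share a root element always land in the same shard.
    """
    module = MODULE_OF_TABLE[table]          # vertical segmentation
    index = root_id % SHARDS_PER_MODULE      # horizontal segmentation by root
    return f"{module}_db_{index}"
```

Because posts and replies are routed by the id of their Forum rather than by their own id, a query for one forum's content only needs to touch a single shard.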
2. Horizontal segmentation
Applicable scenarios: few tables but a large volume of data.
Features: The splitting rules are complex, and so is later maintenance. Large tables are split up, and different rows of the same table are placed in different databases.
The advantages of segmentation: the index overhead is reduced, and the table lock time of a single table write operation is reduced.
For example, suppose the article table holds 50 million rows and we need to insert one new row. After the insert completes, the database re-indexes this table, and indexing 50 million rows carries a system overhead that cannot be ignored. Conversely, if we divide this table into 100 tables, article_001 to article_100, the 50 million rows are spread evenly and each sub-table holds only 500,000 rows. Inserting into a table with only 500,000 rows reduces the indexing time by an order of magnitude, which greatly improves the runtime efficiency and the concurrency of the db.
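As a rough sketch of the arithmetic above, the snippet below maps an article id to one of the 100 sub-tables. The text does not say how rows are assigned to article_001 through article_100, so the modulo rule used here is an assumption.

```python
NUM_TABLES = 100                 # article_001 .. article_100
TOTAL_ROWS = 50_000_000          # 50 million rows in the original table


def article_table(article_id: int, num_tables: int = NUM_TABLES) -> str:
    """Return the sub-table name for an article id (assumed modulo rule)."""
    # article_id % num_tables is 0..99; add 1 so names run from 001 to 100.
    return f"article_{article_id % num_tables + 1:03d}"


# With an even spread, each sub-table holds this many rows.
rows_per_table = TOTAL_ROWS // NUM_TABLES
```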
Segmentation rules:
a. Divide by id range
e.g., using id as the key: ids 1 to 1000 go to db1, ids 1001 to 2000 go to db2, ids 2001 to 3000 go to db3, and so on
Advantages: Partial migration possible
Disadvantage: uneven distribution of data
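A minimal sketch of range-based routing, assuming ids start at 1 and every db holds a fixed block of 1000 ids as in the example. The function name and the db naming are illustrative.

```python
RANGE_SIZE = 1000  # ids per database, taken from the example


def range_db(record_id: int, range_size: int = RANGE_SIZE) -> str:
    """Map an id to its database by fixed-size id range."""
    if record_id < 1:
        raise ValueError("ids are assumed to start at 1")
    # ids 1..1000 -> db1, 1001..2000 -> db2, and so on.
    return f"db{(record_id - 1) // range_size + 1}"
```

Since each db owns a contiguous block of ids, one block can be moved to another machine without touching the rest. The flip side is that newer ids tend to be the busiest, so load concentrates on the latest db.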
b. Hash modulo
Hash the id (or use the id value directly if it is numeric), then take the result modulo a specific number. For example, to split one database into 4 databases, take the hash of the id modulo 4, that is, id % 4. Each operation then has four possible results: 1 corresponds to db1, 2 to db2, 3 to db3, and 0 to db4, so the data is distributed very evenly across the 4 dbs.
Pros: Evenly distributed data
Disadvantages: It is troublesome to migrate data, and data cannot be allocated according to machine performance
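The modulo rule can be sketched as below. Using CRC32 for non-numeric ids is an assumption chosen only because it is stable across runs; the text does not name a hash function.

```python
import zlib

NUM_DBS = 4  # number of databases, as in the example


def hash_db(record_id, num_dbs: int = NUM_DBS) -> str:
    """Map an id to a database by hash modulo."""
    if isinstance(record_id, int):
        key = record_id  # numeric ids are used directly
    else:
        # Assumed hash for non-numeric ids: CRC32 of the UTF-8 bytes.
        key = zlib.crc32(str(record_id).encode("utf-8"))
    remainder = key % num_dbs
    # Remainders 1, 2, 3 map to db1, db2, db3; remainder 0 maps to db4.
    return f"db{remainder if remainder != 0 else num_dbs}"
```

Changing num_dbs changes the result for most ids, which is why migrating data under this rule is troublesome.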
c. Save the database configuration in the authentication database
Create a separate db that stores the mapping between user_id and db. Every time you access the data, first query this db to find out which specific db holds it, and then perform the query you actually need.
Pros: Flexibility, one-to-one relationship
Disadvantages: One more query is required before each query, and the performance is greatly reduced
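A minimal sketch of the lookup step, using an in-memory SQLite database as a stand-in for the mapping db. The table name user_db_map, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Stand-in for the db that stores the user_id -> db mapping.
auth_db = sqlite3.connect(":memory:")
auth_db.execute(
    "CREATE TABLE user_db_map (user_id INTEGER PRIMARY KEY, db_name TEXT NOT NULL)"
)
auth_db.executemany(
    "INSERT INTO user_db_map VALUES (?, ?)",
    [(1, "db1"), (2, "db3"), (3, "db1")],  # hypothetical sample mapping
)
auth_db.commit()


def lookup_db(user_id: int) -> str:
    """The extra query: find which db holds this user's data."""
    row = auth_db.execute(
        "SELECT db_name FROM user_db_map WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no db mapping for user_id {user_id}")
    return row[0]
```

Moving one user to another db only requires updating that user's row in the mapping, which is where the flexibility comes from.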
Usually, horizontal and vertical segmentation are used together: the system as a whole is segmented vertically, and individual large tables are then segmented horizontally. In other words, vertical segmentation comes first and horizontal segmentation second.
3. Common problems of segmentation and coping strategies
a. Transaction issues:
There are currently two feasible solutions to the transaction problem: distributed transactions, and transactions implemented through joint control by the application and the database. Here is a simple comparison of the two.
Option 1: Use Distributed Transactions
Advantages: managed by database, simple and effective
Disadvantage: High performance cost, especially as the number of shards grows
Option 2: Controlled by the application and the database
Principle: Split a distributed transaction that spans multiple databases into multiple small transactions, each on a single database, and control these small transactions as a whole through the application.
Advantages: performance advantage
Disadvantage: Requires the application to be designed flexibly around transaction control. If Spring's transaction management is used, making this change will face certain difficulties.
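Option 2 might look roughly like the sketch below, which uses two in-memory SQLite databases as stand-ins for two shards. The account table and the helper names are hypothetical. This is a best-effort pattern, not an atomic one: if a commit fails after an earlier shard has already committed, the application has to compensate on its own.

```python
import sqlite3


def make_shard(balances):
    """Create a stand-in shard with a hypothetical account table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (user_id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", balances)
    conn.commit()
    return conn


def balance(conn, user_id):
    """Read one user's balance from a shard."""
    return conn.execute(
        "SELECT balance FROM account WHERE user_id = ?", (user_id,)
    ).fetchone()[0]


def run_small_transactions(steps):
    """Run one small transaction per shard, controlled by the application.

    steps is a list of (connection, sql, params). Every shard is committed
    only if all steps succeed; otherwise every touched shard is rolled back.
    """
    touched = []
    try:
        for conn, sql, params in steps:
            touched.append(conn)
            conn.execute(sql, params)  # opens a local transaction on this shard
    except sqlite3.Error:
        for conn in touched:
            conn.rollback()
        return False
    for conn in touched:
        conn.commit()  # not atomic across shards; see the caveat above
    return True
```

For example, a transfer between users on different shards passes a credit step for one connection and a debit step for the other. If the debit violates the balance check, the credit already executed on the other shard is rolled back as well.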
b. The problem of cross-node Join
As long as data is segmented, cross-node Join queries are inevitable, but good design and segmentation can reduce how often they occur. The common practice is to implement the query in two steps: find the ids of the associated data in the result set of the first query, then issue a second request to fetch the associated data by those ids.
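The two-step approach can be sketched as follows, with two in-memory SQLite databases standing in for two nodes. The orders and users tables and their contents are hypothetical.

```python
import sqlite3

# Node 1: holds orders (hypothetical schema and data).
order_db = sqlite3.connect(":memory:")
order_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)")
order_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, "book"), (2, 11, "pen"), (3, 10, "lamp")],
)

# Node 2: holds users (hypothetical schema and data).
user_db = sqlite3.connect(":memory:")
user_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
user_db.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(10, "alice"), (11, "bob"), (12, "carol")],
)


def orders_with_user_names():
    """Replace a cross-node JOIN with two queries joined in the application."""
    # First query: fetch the orders and collect the associated user ids.
    orders = order_db.execute(
        "SELECT id, user_id, item FROM orders ORDER BY id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    # Second query: fetch only the users referenced by the first result set.
    placeholders = ", ".join("?" * len(user_ids))
    rows = user_db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(rows)
    return [(order_id, item, names.get(user_id)) for order_id, user_id, item in orders]
```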
c. Cross-node count, order by, group by and aggregation function problems
These are a class of problems because they all require computation based on the entire set of data. Most proxies do not automatically handle merging. Solution: Similar to solving the cross-node join problem, the results are obtained on each node and merged on the application side. Unlike join, the query of each node can be executed in parallel, so it is often much faster than a single large table. But if the result set is large, the consumption of application memory is a problem.
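A minimal sketch of merging on the application side. Plain Python lists stand in for the nodes, and the (article_id, author) rows are made up. Each node computes its partial result in parallel, and the application combines them.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each list stands in for one node; rows are (article_id, author).
SHARDS = [
    [(4, "ann"), (8, "bob")],
    [(1, "ann"), (5, "cat")],
    [(2, "bob"), (6, "ann"), (10, "bob")],
]


def query_shard(rows):
    """Per-node work: local COUNT, local ORDER BY id, local GROUP BY author."""
    return len(rows), sorted(rows), Counter(author for _, author in rows)


def scatter_gather(shards):
    """Query every node in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(query_shard, shards))
    total = sum(count for count, _, _ in partials)                     # COUNT
    ordered = list(heapq.merge(*(rows for _, rows, _ in partials)))    # ORDER BY
    by_author = sum((groups for _, _, groups in partials), Counter())  # GROUP BY
    return total, ordered, by_author
```

Because each node returns its rows already sorted, heapq.merge combines them without re-sorting everything. The merged list still lives entirely in application memory, which is the cost noted above for large result sets.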