Principles and Algorithms of Database Fragmentation

1. The concept of data sharding

        Database sharding means splitting a large database into multiple smaller databases, each of which is called a shard. This distributes the database load across multiple servers, which relieves performance bottlenecks and improves availability.

        The core technique of data sharding is splitting a relational database into multiple databases (sub-databases) and multiple tables (sub-tables). Both can effectively avoid the query bottleneck that appears when the amount of data exceeds a tolerable threshold. In addition, splitting into sub-databases spreads access that would otherwise hit a single data node (for example, order queries go to the order node and user queries go to the user node).

        Splitting tables alone does not relieve the pressure on a data node, but because the tables stay on the same node, it lets as much work as possible run as local transactions rather than distributed transactions. Once an update spans databases, distributed transactions tend to complicate matters considerably (for example, deducting points and reducing inventory when a user places an order is troublesome enough on its own).

        Using a master-slave sharding setup effectively avoids a single point of failure for the data, thereby improving the availability of the data architecture (update operations go to the master node, query operations go to the slave nodes).
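
        The read/write split described above can be sketched as follows. This is only an illustration: the class name ReadWriteRouter, the node names, and the rule "anything that does not start with SELECT is a write" are assumptions made for the example, not part of any particular middleware.

```python
import random


class ReadWriteRouter:
    """Route writes to the master and reads to a randomly chosen slave."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def route(self, sql):
        # Simplifying assumption: a statement is a read only if it starts with SELECT.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self.slaves:
            return random.choice(self.slaves)
        # Writes, and reads when no slave is configured, go to the master.
        return self.master


router = ReadWriteRouter("master", ["slave_1", "slave_2"])
print(router.route("UPDATE users SET points = points - 10 WHERE uid = 7"))
print(router.route("SELECT * FROM users WHERE uid = 7"))
```

        The UPDATE is always routed to the master, while the SELECT lands on one of the two slaves.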

        Splitting data into sub-databases and sub-tables keeps the data volume of each table below the threshold and spreads traffic to handle high access volume, which makes it an effective way to deal with high-concurrency, massive-data systems.

When should sharding be used?

        Whether or not a sharded database architecture should be implemented is almost always a matter of debate. Some people think that sharding is an inevitable consequence of a database reaching a certain size. Others see it as a headache that should be avoided unless absolutely necessary, since sharding adds operational complexity. Sharding is typically considered in the following situations:

(1) Handling large-scale data: Rapid growth of data volume is common in modern applications. When the amount of data reaches the capacity limit of a single server, sharding distributes the data across multiple nodes so that the application can process it and make full use of cluster resources.

(2) Improving read and write performance: Sharding distributes the load across multiple nodes, thereby improving the read and write performance of the database. Each shard independently processes a portion of the data, which offloads individual nodes and allows queries and transactions to be processed in parallel.

(3) Increasing scalability: Sharding allows the capacity and throughput of the database to be expanded on demand. When the amount of data increases, you can add more shard nodes instead of upgrading the hardware or software of individual nodes.

(4) Reducing single points of failure: By distributing data across multiple nodes, sharding reduces the impact of a single node failure on the entire system. If a node fails, the other nodes are still available, which helps the availability and fault tolerance of the system.

(5) Providing geographic flexibility: Sharding enables data to be stored according to geographic location. This can help applications meet data storage compliance requirements and reduce data access latency.

2.1 Fragmentation principle

2.1.1 Vertical Sharding

1. Principle

        Splitting by business is called vertical sharding, also known as vertical splitting. Its core idea is a dedicated database for each purpose (much like the departments of a company, each performing its own duties). Before splitting, a data node holds multiple business tables, each storing different business data. After splitting, the tables are grouped by business and distributed to different data nodes, which spreads the pressure across those nodes. For example, user-related tables are placed in the user database, and order-related tables are placed in the order database.
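
        As a minimal sketch of this idea, vertical sharding amounts to a lookup from business table to the database that owns it. The table names and database names below are hypothetical and only serve the example.

```python
# Hypothetical mapping from business table to the database that owns it.
TABLE_TO_DATABASE = {
    "user": "user_db",
    "user_address": "user_db",
    "order": "order_db",
    "order_item": "order_db",
}


def database_for(table):
    """Return the database that holds the given business table."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no database configured for table {table!r}") from None


print(database_for("user"))   # user_db
print(database_for("order"))  # order_db
```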

        Vertical sharding requires adjustments to the architecture and design. Generally speaking, such adjustments cannot keep up with rapid changes in Internet business needs (such as the sudden surge in orders on Double Eleven); moreover, vertical sharding does not really remove the single-point bottleneck. It can alleviate the problems caused by data volume and access volume, but it cannot cure them. If the amount of data in a table still exceeds the threshold that a single node can carry after vertical splitting, horizontal sharding is required for further processing.

2. Features

(1) The structure of each database (table) is different;

(2) Each database (table) shares at least one column with the others (at least one common field remains after splitting, so that the databases or tables can be associated);

(3) The union of all databases (tables) is the full data set;

3. Advantages

(1) After the split, the business is clear (dedicated databases, split by business);

(2) It separates dynamic data from static data and hot data from cold data. Cold data is accessed rarely; hot data is accessed frequently.

(3) Data maintenance is simple, since data is placed on different machines according to business.

4. Disadvantages

(1) Unable to cope with rapid changes in business requirements;

(2) Some business tables can no longer be joined directly, which causes cross-database query problems.

2.1.2 Horizontal Sharding

1. Principle

        Horizontal sharding is also called horizontal splitting (much like a company setting up branches in different cities to handle daily work). Unlike vertical sharding, it does not classify data by business logic. Instead, it distributes data to multiple databases or tables according to a rule applied to one field (or several fields), so that each shard contains only a portion of the data. For example, when sharding by primary key, records with even primary keys go to database (or table) 0, and records with odd primary keys go to database (or table) 1.
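
        The even/odd example above can be sketched in a few lines. The function name shard_for and the sample records are illustrative assumptions.

```python
def shard_for(primary_key, shard_count=2):
    """With two shards, even keys map to shard 0 and odd keys to shard 1."""
    return primary_key % shard_count


records = [{"id": i, "name": f"user{i}"} for i in range(1, 7)]
shards = {0: [], 1: []}
for record in records:
    shards[shard_for(record["id"])].append(record)

print([r["id"] for r in shards[0]])  # [2, 4, 6]
print([r["id"] for r in shards[1]])  # [1, 3, 5]
```

        Every shard has the same record structure, and the union of the two shards is the original record set.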

        In theory, horizontal sharding breaks through the limit on the amount of data a single machine can process, and it can be expanded relatively freely. It is the standard solution for splitting databases and tables.

2. Features

(1) The structure of each database (table) is the same;

(2) The data in each database (table) is different;

(3) The union of all databases (tables) is the full data set;

3. Advantages

(1) The data of a single database (table) is kept at a certain amount, which is helpful for performance improvement;

(2) Improve the stability and load capacity of the system;

(3) The split databases (tables) all have the same structure, so few program changes are needed;

4. Disadvantages

(1) Expanding capacity is difficult and requires a lot of maintenance work, and the splitting rules are hard to abstract;

2.1.3 Problems caused by data fragmentation

        Once data is scattered across sub-databases and sub-tables, one important challenge is that database operations become much more complex for development engineers and database administrators, who need to know exactly which table in which database holds the data they want. By analogy, when everyone works in the same place it is easy to call on Zhang San or Li Si, but once everyone is posted to different parts of the country, communication becomes inconvenient.

(1) Data aggregation

        SQL that runs correctly on a single-node database does not necessarily run correctly on a sharded database. For example, splitting a table changes the table name, and operations such as paging, sorting, aggregation and grouping may be handled incorrectly. To query the first 10 rows of the order table, one SQL statement used to be enough; now the data lives on different nodes, and paging queries become a real headache.
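
        One common way to handle paging over sharded data is to let each shard return its own sorted rows and merge them in the application. The sketch below assumes every shard has already returned at least offset + limit rows sorted by the same key; the function name page_across_shards and the sample rows are made up for the example.

```python
import heapq
from itertools import islice


def page_across_shards(shard_results, offset, limit, key):
    """Merge per-shard sorted results and return one global page."""
    merged = heapq.merge(*shard_results, key=key)
    return list(islice(merged, offset, offset + limit))


# Each shard returns its own rows, already sorted by order id.
shard_0 = [{"order_id": 2}, {"order_id": 4}, {"order_id": 6}]
shard_1 = [{"order_id": 1}, {"order_id": 3}, {"order_id": 5}]

page = page_across_shards(
    [shard_0, shard_1], offset=2, limit=2, key=lambda row: row["order_id"]
)
print([row["order_id"] for row in page])  # [3, 4]
```

        Note the cost: to serve a page that starts at a given offset, each shard has to return offset + limit rows, so deep pages become expensive.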

(2) Distributed transactions

        With a table-splitting strategy, the amount of data per table is reduced while local transactions can still be used wherever possible, because all the tables sit on one data node, which avoids cross-database transactions.

        Where cross-database transactions cannot be avoided, some businesses still need to guarantee transactional consistency. XA-based distributed transactions often cannot deliver the performance needed in high-concurrency scenarios, so most such systems use flexible transactions with eventual consistency instead of strongly consistent transactions.

2.2 Fragmentation middleware

2.2.1 JDBC application layer fragmentation

Compared with a proxy, this approach performs better, but it restricts the application language to Java, which is a significant limitation.

1. sharding-jdbc (ShardingSphere)

2.2.2 Proxy layer fragmentation

No language restrictions!

1. mycat

2. mysql-proxy

2.3 Fragmentation algorithm

2.3.1 range

Idea: Based on the business primary key uid of the user center, divide the data horizontally into two database instances:

        db_1: store uid data from 0 to 10 million;

        db_2: store 10 million to 20 million uid data;
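
The range rule above can be sketched with a sorted list of boundaries and a binary search. The sketch assumes half-open ranges (db_1 holds uid 0 up to but not including 10 million, db_2 holds 10 million up to but not including 20 million), since the boundary case is not spelled out; the names UPPER_BOUNDS and route_by_range are illustrative.

```python
import bisect

# Exclusive upper bound of each range, in ascending order (assumed layout).
UPPER_BOUNDS = [10_000_000, 20_000_000]
DATABASES = ["db_1", "db_2"]


def route_by_range(uid):
    """Return the database whose uid range contains uid."""
    if uid < 0:
        raise ValueError("uid must be non-negative")
    index = bisect.bisect_right(UPPER_BOUNDS, uid)
    if index == len(DATABASES):
        raise ValueError(f"uid {uid} is outside every configured range")
    return DATABASES[index]


print(route_by_range(123))         # db_1
print(route_by_range(10_000_000))  # db_2
```

Adding db_3 in this sketch only means appending a new upper bound and a new database name; existing uids keep their location.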

(1) Advantages of the range algorithm

  • The segmentation strategy is simple. According to the id and the scope, the business can quickly locate which database the data is on;
  • The expansion is simple, if the capacity is not enough, just add db_3;

(2) Disadvantages of the range algorithm

  • The uid must be monotonically increasing;
  • The amount of data is uneven: the newly added db_3 holds little data at first;
  • The request volume is uneven. Generally speaking, newly registered users are more active, so db_2 often carries a higher load than db_1, resulting in unbalanced server utilization;

2.3.2 hash

Idea: Based on the business primary key uid of the user center, divide the data horizontally into two database instances:

        db_1: store the data whose uid modulo 2 is 1;

        db_0: store the data whose uid modulo 2 is 0;

(1) Advantages of hash algorithm

  • The segmentation strategy is simple. According to uid and hash, the business can quickly locate which database the data is on;
  • The amount of data is balanced, as long as the uid is uniform, the distribution of data on each database must be balanced;
  • The request volume is balanced, as long as the uid is uniform, the load distribution on each library must be balanced;

(2) Disadvantages of the hash algorithm

  • Expansion is troublesome. If capacity runs out, adding a database changes the modulus, so existing data has to be rehashed and migrated. How to migrate data smoothly is a problem that needs to be solved;
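
To get a feel for how much data a rehash moves, the following sketch counts the uids whose database changes when the number of databases changes under plain modulo sharding. The function name moved_fraction and the sample uid range are assumptions made for the example.

```python
def moved_fraction(uids, old_count, new_count):
    """Fraction of uids whose shard changes when the shard count changes."""
    uids = list(uids)
    moved = sum(1 for uid in uids if uid % old_count != uid % new_count)
    return moved / len(uids)


# Growing from 2 to 3 databases moves about two thirds of the uids.
print(moved_fraction(range(6000), 2, 3))
# Doubling from 2 to 4 databases moves half of them.
print(moved_fraction(range(6000), 2, 4))
```

With uniformly distributed uids, doubling the number of databases moves less data than adding a single one, which is one reason expansions are often planned as doublings.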

2.3.3 Index table

Idea: when the database that holds a record cannot be computed directly from its id, it can be looked up instead. If the db can be queried through the id, the problem is solved;

Solution:

(1) Create an index table that records the mapping id -> db;

(2) When accessing data by id, first look up the id in the index table to get its db, then go to that database;

(3) The index table has few attributes, so it can hold a large number of rows and generally does not need to be split across databases;

(4) If the index table does grow too large, it can itself be split across databases by id;

Advantages: adding nodes has no effect on existing data;

Potential disadvantage: every access costs one extra database query, which can roughly halve performance;
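
A minimal sketch of the index-table approach follows. A dict stands in for the index table, and the class name IndexTableRouter, the ids and the database names are hypothetical; in a real system the index would be a table queried over the network, which is where the extra lookup cost comes from.

```python
class IndexTableRouter:
    """Look up the database for an id in an index table (a dict here)."""

    def __init__(self):
        self.index = {}    # id -> database name
        self.lookups = 0   # counts the extra queries against the index

    def assign(self, record_id, database):
        self.index[record_id] = database

    def locate(self, record_id):
        self.lookups += 1
        return self.index[record_id]


router = IndexTableRouter()
router.assign(1001, "db_1")
router.assign(1002, "db_2")
print(router.locate(1001))  # db_1

# Adding db_3 only affects ids assigned from now on.
router.assign(1003, "db_3")
print(router.locate(1002))  # db_2
print(router.lookups)       # 2
```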

2.4 Implementing Fragmentation

        Enabling sharding involves many aspects, including database schema design, deployment configuration, and application changes. The following are the steps to enable sharding in general:

(1) Design a sharding strategy: First, you need to determine a sharding strategy suitable for the application, such as range-based, hash, or list-based methods. Choose an appropriate sharding strategy based on application requirements and data characteristics, and consider the choice of sharding keys.

(2) Database architecture design: Design the overall architecture of the database according to the sharding strategy. Determine the number of shards and node size, as well as the data association method and data routing rules between shards.

(3) Physical server deployment: Design, deploy and configure physical servers according to the database architecture. Each shard should be assigned to an independent physical node or server to ensure that each node has sufficient computing and storage resources.

(4) Database sharding initialization: Create a database instance on each sharding node and initialize it according to the sharding strategy. Create corresponding table structures, indexes, constraints, etc. to ensure that the database structure of each shard node is consistent.

(5) Data migration: Migrate existing data to shard clusters. Data is split and imported into each shard according to the sharding strategy to ensure data consistency and integrity. This may involve the process of data export, transformation and import.

(6) Application changes: Modify the application code so that it can adapt to the sharding architecture. Update the database connection configuration to ensure that the application can properly route and access the various shards. In addition, query statements, transaction processing, and data access logic need to be modified to adapt to the sharding environment.

(7) Load balancing and routing configuration: Configure the load balancing and routing mechanism to ensure that requests are evenly distributed across the sharded cluster. This can be achieved through a load balancer or proxy that routes requests to the appropriate shard nodes.

(8) Testing and monitoring: Conduct a comprehensive test on the sharding environment to ensure the correctness and performance of the sharding strategy. Set up a monitoring system to monitor the running status and performance indicators of each shard node in real time.

It should be noted that enabling sharding is a complex process that requires comprehensive consideration of application requirements, data characteristics, and system architecture. Before sharding, it is recommended to conduct sufficient planning and evaluation to ensure the correct implementation and operation and maintenance of sharding. At the same time, the complexity of data migration and the challenges of system upgrades also need to be considered.
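
As an illustration of step (5), the sketch below splits existing rows into per-shard batches using whatever routing function the sharding strategy provides, then checks that no row was lost or duplicated. The function name split_for_migration, the row layout and the two-shard modulo route are assumptions made for the example, not a complete migration procedure.

```python
from collections import defaultdict


def split_for_migration(rows, route):
    """Group existing rows into per-shard batches using a routing function."""
    batches = defaultdict(list)
    for row in rows:
        batches[route(row)].append(row)
    return dict(batches)


rows = [{"uid": uid, "name": f"user{uid}"} for uid in range(10)]
batches = split_for_migration(rows, route=lambda row: f"db_{row['uid'] % 2}")

# Integrity check: every row lands in exactly one batch.
migrated = sum(len(batch) for batch in batches.values())
assert migrated == len(rows)
print({shard: len(batch) for shard, batch in batches.items()})  # {'db_0': 5, 'db_1': 5}
```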

Origin blog.csdn.net/weixin_47156401/article/details/132368472