Database and table sharding (sub-database and sub-table) design

1. Data segmentation mode

  Vertical segmentation: tables (or schemas) are grouped and distributed to different databases (hosts).

  Horizontal segmentation: the rows of a single table are split across multiple databases (hosts) by some condition, based on the logical relationships in the data.

① Vertical segmentation

A database is made up of many tables, and each table corresponds to a different business. Vertical segmentation classifies the tables by business and distributes them to different databases, so that data and load are spread across several databases, each one dedicated to its own business.

The advantages are as follows:

1) After the split, the business is clear and the split rules are clear.

2) It is easy to integrate or expand between systems.

3) Tables can be placed on different machines according to cost, application level, application type, and so on, which makes them easier to manage.

4) It suits table designs that separate dynamic data from static data and hot data from cold data.

5) Simple data maintenance.

The disadvantages are as follows:

1) Some business tables can no longer be joined in the database. The association has to be resolved through service interfaces, which increases the complexity of the system.

2) Each business has its own limits, so a single database can still become a performance bottleneck, and it is not easy to expand the data or improve performance.

3) Complex transaction processing.

② Horizontal segmentation

Unlike vertical segmentation, horizontal segmentation does not classify tables. Instead, it distributes the rows of a table to multiple databases according to some rule applied to a field. Each table holds part of the data, and all the tables together hold the full data set.

Put simply, horizontal segmentation splits data by row: some rows of a table go to one database table, and other rows go to other database tables.

This kind of segmentation is driven by the data volume of a single table. It keeps any one table from growing too large, which protects single-table query performance. For example, the user information table can be split into User1, User2, and so on, all with exactly the same structure. Tables are usually split by a specific rule, such as taking the user ID modulo the number of tables.

The advantages are as follows:

1) The data volume of each database and each table stays within a bounded size, which helps performance.

2) The split tables share the same structure, so the application layer needs few changes; only routing rules have to be added.

3) Improve the stability and load capacity of the system.

The disadvantages are as follows:

1) After segmentation the data is scattered, so database joins are hard to use, and cross-database joins perform poorly.

2) The splitting rules are difficult to abstract.

3) Consistency of transactions that span shards is difficult to guarantee.

4) Expanding the data is difficult, and the maintenance workload is very large.

③ Sharding dimensions for horizontal segmentation

Data can be sharded along different dimensions. The sharding methods provided by Mycat are a useful reference (see section 3.4 of this book). Only the two most commonly used dimensions are introduced here.

1) Sharding by hash

Hash a chosen field of the data, then take the result modulo the total number of shards. Rows with the same remainder belong to the same shard. Dividing data into multiple shards this way is called hash sharding.
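A minimal sketch of hash sharding as described above. The shard count, the `user_N` table naming, and the choice of CRC32 as the hash are assumptions made for illustration; they are not taken from Mycat or from any particular middleware.

```python
import zlib

SHARD_COUNT = 4  # assumed number of shards, for illustration only


def shard_for(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Hash the sharding field, then take the result modulo the shard count."""
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % shard_count


def table_for_user(user_id: int, shard_count: int = SHARD_COUNT) -> str:
    """Numeric IDs can skip the hash step and use the ID modulo the shard count."""
    return f"user_{user_id % shard_count}"
```

With four shards, `table_for_user(10)` routes to `user_2`, and the same key always maps to the same shard, which is what the routing layer relies on.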

2) Sharding by time

Unlike hash sharding, this method distributes data to different shards by time range. For example, transaction data can be sharded by month or by quarter. The volume of transaction data determines which time period to use.
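Time-based routing can be sketched as follows. The `trade` base name and the `_YYYYMM` / `_YYYYqN` suffixes are hypothetical naming conventions chosen for this example.

```python
from datetime import date


def monthly_table(base: str, day: date) -> str:
    """Route a row to a per-month table, e.g. trade_201911."""
    return f"{base}_{day.year:04d}{day.month:02d}"


def quarterly_table(base: str, day: date) -> str:
    """Route a row to a per-quarter table, e.g. trade_2019q4."""
    quarter = (day.month - 1) // 3 + 1
    return f"{base}_{day.year:04d}q{quarter}"
```

Switching from monthly to quarterly tables only changes the routing function, which is why the sharding period can be chosen from the expected data volume.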

④ Problems shared by vertical and horizontal segmentation:

Distributed transactions.

Joins across nodes.

Merging, sorting, and paging across nodes.

Management of multiple data sources.

2. Distributed problems introduced by splitting databases and tables, and countermeasures

Data migration and expansion issues

The horizontal splitting strategies introduced earlier can be summarized as random splitting (for example, by hash) and contiguous splitting (for example, by time range). Contiguous splitting may suffer from data hotspots: some tables are queried frequently and come under heavy pressure, so the hot tables become the bottleneck of the whole database, while other tables hold historical data that is rarely queried. The advantage of contiguous splitting is that expansion is easier: old data does not need to be migrated, and capacity grows simply by adding new sub-tables. With random splitting the data is spread fairly evenly, so hotspots and concurrent-access bottlenecks are unlikely, but expanding the number of sub-tables requires migrating old data.
The design of horizontal splitting is therefore very important. Evaluate the business growth rate over the short and medium term, plan for the current data volume, factor in cost, and work out how many shards are needed. For data migration, the usual approach is to have a program read the existing data and then write it into each sub-table according to the chosen splitting strategy.
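The read-then-rewrite migration can be sketched with in-memory lists standing in for sub-tables. The `id` field and the modulo rule are assumptions for illustration; a real migration would read and write in batches against the databases.

```python
def reshard(old_shards: list, new_count: int) -> tuple:
    """Read every row from the old shards and write it to its new shard.

    Returns the new shards and the number of rows that changed shard.
    """
    new_shards = [[] for _ in range(new_count)]
    moved = 0
    for old_index, rows in enumerate(old_shards):
        for row in rows:
            new_index = row["id"] % new_count  # assumed rule: id modulo shard count
            new_shards[new_index].append(row)
            if new_index != old_index:
                moved += 1
    return new_shards, moved


# Eight rows split over two shards by id % 2, then expanded to four shards.
two_shards = [[{"id": i} for i in range(0, 8, 2)], [{"id": i} for i in range(1, 8, 2)]]
four_shards, moved_rows = reshard(two_shards, 4)
```

In this small example half of the rows (4 of 8) change shard when the count doubles, which illustrates why random splitting makes expansion expensive.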

Table association problem

With a single database and a single table, join queries are easy. Once databases and tables are split, joins run into cross-database and cross-table association problems. Joins should be avoided as far as possible from the start of the design. The association can be assembled in the program, or sidestepped through a denormalized design.
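Assembling the association in the program might look like the sketch below. `USER_DB`, `ORDER_DB`, and `fetch_users` are hypothetical in-memory stand-ins for two separate databases and a batched lookup query.

```python
# Hypothetical in-memory stand-ins for two separate databases.
USER_DB = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
ORDER_DB = [
    {"order_id": 100, "user_id": 1, "amount": 30},
    {"order_id": 101, "user_id": 2, "amount": 45},
    {"order_id": 102, "user_id": 1, "amount": 12},
]


def fetch_users(user_ids):
    """Stand-in for: SELECT ... FROM user WHERE id IN (...) on the user database."""
    return {uid: USER_DB[uid] for uid in user_ids if uid in USER_DB}


def orders_with_user_names(orders):
    """Join orders to users in application code instead of with a SQL JOIN."""
    user_ids = {o["user_id"] for o in orders}  # one batched lookup, not one per order
    users = fetch_users(user_ids)
    return [
        {**o, "user_name": users[o["user_id"]]["name"]}
        for o in orders
        if o["user_id"] in users
    ]


joined = orders_with_user_names(ORDER_DB)
```

Collecting the IDs first keeps the number of cross-database round trips at two, regardless of how many orders are on the page.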

Paging and sorting issues

List pages usually need to be sorted by a specified field before paging. With a single database and a single table this is easy. Once databases and tables are split, sorting across databases and across tables becomes a problem. To keep the final result accurate, the data must be sorted within each sub-table and returned, and the result sets returned by the different sub-tables must then be merged and sorted again before being returned to the user.

Method 1: Global View Method

(1) Rewrite order by time offset X limit Y as order by time offset 0 limit X+Y

(2) The service layer sorts the N*(X+Y) rows it receives in memory (N is the number of sub-databases), then takes the Y rows after offset X

As the page number grows, every sub-database has to return more rows, so the performance of this method keeps dropping.
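A minimal sketch of the global view method, with each shard modeled as a plain list of `time` values. Sorting and slicing the list stands in for the rewritten SQL that each sub-database would run.

```python
import heapq


def global_page(shards, offset, limit):
    """Return the rows for a global ORDER BY time OFFSET offset LIMIT limit."""
    per_shard = []
    for rows in shards:
        # Stand-in for: ORDER BY time OFFSET 0 LIMIT offset+limit on each shard.
        per_shard.append(sorted(rows)[: offset + limit])
    # Merge the already-sorted shard results, then apply the real offset in memory.
    merged = list(heapq.merge(*per_shard))
    return merged[offset : offset + limit]


shards = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
page = global_page(shards, offset=4, limit=3)
```

Each shard ships `offset + limit` rows, so the amount of data transferred and sorted grows with the page number.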

Method 2: Business Compromise Method (no jumping to arbitrary pages)

(1) Fetch the first page in the normal way and record time_max, the largest time on that page

(2) On every page turn, rewrite order by time offset X limit Y as where time > $time_max order by time limit Y, then update time_max from the new page

Each sub-database returns at most one page of data per request, so the performance stays constant.
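The page-by-page approach can be sketched as below, again with lists of `time` values standing in for shards. The sketch assumes `time` values are unique; with duplicates, rows that share the boundary value would be skipped unless a tie-breaking column is added.

```python
def next_page(shards, last_time, limit):
    """Fetch the page that follows last_time, asking each shard for one page at most."""
    candidates = []
    for rows in shards:
        # Stand-in for: WHERE time > last_time ORDER BY time LIMIT limit on each shard.
        candidates.extend(sorted(t for t in rows if t > last_time)[:limit])
    return sorted(candidates)[:limit]


shards = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
first = next_page(shards, float("-inf"), 3)  # no lower bound for the first page
second = next_page(shards, first[-1], 3)     # time_max of the previous page
```

However deep the user pages, the service layer never handles more than N * Y rows per request.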

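Method 3: Second Query Method

(1) Rewrite order by time offset X limit Y as order by time offset X/N limit Y and run it on each sub-database

(2) Find the minimum value time_min across all returned rows

(3) Run a second query on each sub-database using between: order by time between $time_min and $time_i_max, where time_i_max is the largest time that sub-database returned in step (1)

(4) Treat time_min as a virtual row in each sub-database, find the offset of time_min in each sub-database, and add these up to get the global offset of time_min

(5) Once the global offset of time_min is known, the rows for the global offset X limit Y follow directly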
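The five steps can be sketched as follows, with lists of `time` values standing in for sub-databases. The sketch assumes unique `time` values and that every shard returns a full first-pass page. The final completeness check is an addition of this sketch, not part of the method as described: when data is heavily skewed between shards, the second-pass window can miss rows, so the sketch raises an error instead of returning a wrong page.

```python
def second_query_page(shards, offset, limit):
    """Global ORDER BY time OFFSET offset LIMIT limit using two query rounds."""
    n = len(shards)
    local_offset = offset // n
    ordered = [sorted(rows) for rows in shards]  # stand-in for ORDER BY time

    # Step 1: each shard runs ORDER BY time OFFSET X/N LIMIT Y.
    first = [rows[local_offset : local_offset + limit] for rows in ordered]
    if any(len(p) < limit for p in first):
        raise ValueError("sketch assumes every shard returns a full first page")

    # Step 2: smallest time across all first-pass results.
    time_min = min(p[0] for p in first)

    # Step 3: second query, BETWEEN time_min AND that shard's own first-pass maximum.
    second = [
        [t for t in rows if time_min <= t <= p[-1]] for rows, p in zip(ordered, first)
    ]

    # Step 4: a shard that gained k rows in front of its first-pass page has
    # (local_offset - k) rows before time_min; the sum is time_min's global offset.
    min_global_offset = sum(
        local_offset - (len(s) - len(p)) for s, p in zip(second, first)
    )

    # Step 5: merge, then skip the remaining distance to the requested offset.
    merged = sorted(t for rows in second for t in rows)
    start = offset - min_global_offset
    page = merged[start : start + limit]

    # Added guard: the merged rows are complete only up to the smallest shard maximum.
    if len(page) < limit or page[-1] > min(p[-1] for p in first):
        raise ValueError("data too skewed for this sketch")
    return page


shards = [[1, 2, 3, 10, 11, 12, 20], [4, 5, 6, 13, 14, 15, 21], [7, 8, 9, 16, 17, 18, 22]]
page = second_query_page(shards, offset=6, limit=3)
```

Both rounds transfer only a small window per shard, which is the appeal of the method compared with the global view method.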

Distributed transaction problem

Once databases and tables are split, distributed transaction problems are bound to appear, so keeping data consistent becomes unavoidable. Distributed transactions currently have no fully satisfactory solution, and strong consistency is hard to achieve. In general, the stored data should look as consistent as possible to users, the system should be able to recover and correct itself within a short time, and the data should eventually become consistent (eventual consistency).

Distributed globally unique ID

With a single database and a single table, generating primary key IDs with the database's auto-increment feature is simple. Once data is distributed across different sub-tables, auto-increment can no longer be relied on, and a globally unique ID such as a UUID or GUID is needed. How to choose a suitable globally unique ID is introduced in a later chapter.

Industry solutions:

UUID: a 128-bit (16-byte) unique identifier.

Composition: current date and time plus clock sequence + globally unique network card MAC address

Advantages: simple to implement, generated locally with no network traffic, unaffected by data migration

Disadvantages: unordered, so an increasing trend cannot be guaranteed; usually stored as a string, which makes storage, transmission, and queries slow; not human-readable
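The size figures can be checked with Python's standard `uuid` module. `uuid.uuid1()` builds the time-plus-node variant; the `describe` helper is a hypothetical name used only for this example.

```python
import uuid


def describe(u: uuid.UUID) -> dict:
    """Report the UUID version and how large it is in binary and text form."""
    return {
        "version": u.version,
        "binary_bytes": len(u.bytes),  # compact storage form
        "text_chars": len(str(u)),     # hyphenated hexadecimal string form
    }


info = describe(uuid.uuid1())
```

The binary form is 16 bytes, while the usual text form takes 36 characters, which is part of why character storage and indexing of UUIDs is comparatively expensive.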

Snowflake algorithm

Twitter's distributed ID generation algorithm.

1 bit + 41 bits + 10 bits + 12 bits = 64 bits

Unused high (sign) bit + millisecond timestamp + machine code (data center + machine ID) + 12-bit sequence number

Domestic practice:

It is enough to keep the data center (IDC machine room) identifier unique.

Advantages: simple to implement, no network traffic, unaffected by data migration, and IDs follow an increasing trend

Disadvantages: strongly dependent on the clock (the servers' clocks must agree); when clocks differ, an increasing trend across machines cannot be guaranteed

Redis:

A simplified scheme built on Redis; the related business code is not included here.

Advantages: independent of the database, flexible and convenient, performs better than generating IDs in the database, no single point of failure (high availability)

Disadvantages: uses network resources, slower than local generation, requires adding an extra component
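A Snowflake-style generator can be sketched as below, using the widely documented layout of a 41-bit millisecond timestamp, a 10-bit machine code, and a 12-bit sequence under an unused sign bit. The class name, the epoch constant, and the injectable `clock` parameter are assumptions of this sketch, not Twitter's actual implementation.

```python
import threading
import time

EPOCH_MS = 1288834974657  # assumed custom epoch; any fixed past instant works


class SnowflakeGenerator:
    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id, epoch_ms=EPOCH_MS, clock=None):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.epoch_ms = epoch_ms
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # The clock dependency named in the text: refuse to issue IDs.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.epoch_ms
            return (
                (timestamp << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self.machine_id << self.SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(machine_id=5)
ids = [generator.next_id() for _ in range(1000)]
```

IDs from one generator are strictly increasing, and IDs from different machines differ in the machine code bits, so uniqueness depends on each machine code being assigned only once.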

3. Summary

Sub-database and sub-table are mainly used to deal with two common scenarios on the Internet: massive data and high concurrency. However, sub-database and sub-table is a double-edged sword. Although it can well deal with the impact and pressure of massive data and high concurrency on the database, it increases the complexity and maintenance cost of the system.

Therefore, my suggestion is to work from actual needs and not over-design. Do not split databases and tables at the start of a project. As the business grows and other optimizations are exhausted, then consider splitting databases and tables to improve system performance.

Origin blog.csdn.net/baidu_28068985/article/details/102895444