Common methods for splitting distributed databases

Abstract: This article introduces two ideas for splitting a database. The popular way to understand them: "vertical splitting" means the "columns" change while the "rows" stay unchanged, and "horizontal splitting" means the "rows" change while the "columns" stay unchanged.

The most important thing for "scalability" in a distributed system is to become "stateless" first, so that you can scale out horizontally as you please, without worrying about confusion when switching among multiple replicas. The article "Distributed System Concerns - Detailed Explanation of 'Stateless'" talks about this.

However, even with horizontal scaling done, the system is still in essence one "big program"; it has merely become "copyable".

If you want to eliminate the "big program", you have to "split" it, and a good split cannot be done without the core idea of "high cohesion, low coupling". That is the topic of "Distributed System Concerns - Detailed Explanation of 'High Cohesion and Low Coupling'".

Digression: when you run into a monolithic application that can no longer cope, the all-purpose advice from Brother Z is: consider "scaling" first and "splitting" later. It is the same as writing code: "adding" a new function is usually easier than changing an old one.

For "expansion", first consider "vertical expansion" (adding hardware, money can solve it is not a problem), and then consider "horizontal expansion" (stateless transformation + multi-node deployment, this is a minor operation).

"Slicing" is generally "vertical cutting" (according to business segmentation, this is a major operation), and occasionally "horizontal cutting" (in fact, it is the layering in a single application, such as front-end separation).

In the third part, "Distributed System Concerns - Resilient Architecture", we talked about two common "loosely coupled" architecture patterns, which take the "scalability" of an application to a higher level.

These are all jobs at the application level. Under normal circumstances, performing surgery at the application level, combined with making full use of caches, can support the system's growth for a long time, especially in "compute-intensive" scenarios where the data volume is small but the request volume is large.

However, if the scenario you work in is a mature project of a certain scale, then the further it develops, the more the bottleneck shows up in the database. You may even see the CPU under sustained high load, or outright downtime.

In such a scenario, the database itself has to go under the knife. This time, Brother Z will talk with you about good ways to give the database "scalability".

The core need

By the time a database needs surgery, the whole system has usually already grown into something quite large.

As mentioned earlier, the bottleneck at this point usually shows up in the "CPU".

Because for a database, expanding the hard disk and memory is relatively easy: you can simply "add" more of them.

The CPU is different. Once the CPU spikes, about all you can do is check whether the indexes are in good shape; beyond that, you can basically only watch.

So the idea for solving the problem naturally becomes: how to spread the CPU pressure of one database across multiple CPUs, ideally so that CPUs can be added on demand at any time.

But wait, isn't that exactly the "splitting" we did for the application? It is also an embodiment of the "divide and conquer" idea behind distributed systems.

Since it is splitting, it is essentially the same as for the application, likewise divided into "vertical splitting" and "horizontal splitting".

Vertical splitting

Vertical splitting is sometimes called "longitudinal splitting".

Like splitting an application, it is a splitting method that takes the "business" as its dimension: different business databases run on different database servers, each minding its own duties.

In general, Brother Z suggests you give priority to "vertical splitting" rather than "horizontal splitting". Why? Open any SQL file in the project at hand and I am sure you will find plenty of "join" and "transaction" keywords. Such cross-table queries and transactional operations are essentially a kind of "relational binding", and once the database is split, they no longer work.

At this point you have only two options.

  1. Discard the unnecessary "relational binding" logic. This requires business adjustments: remove non-essential "batch operation" features, or remove non-essential strong-consistency transactions. But as you know, some scenarios simply cannot be gotten around.
  2. Lift the "merge", "join", and similar logic upward, into the code of the business logic layer or even the application layer.

In the end, whichever you choose, the change is a big project.

To keep this project as small as possible and get better value for the effort, you need to stick to one principle: "avoid splitting up closely related tables".

Because the closer the relationship between two tables, the greater the demand for "join" and "transaction". Sticking to this principle lets the tables of the same module and of closely related businesses all land in the same database, so they can keep using "join" and "transaction" as before.

Therefore, we should give priority to the "vertical splitting" approach.

The idea of "vertical splitting" is very simple. Under normal circumstances, it is recommended to map the databases one-to-one onto the already-split applications, no more and no less.

In actual work, "vertical splitting" mainly tests your familiarity with the "business", so I won't go on about it here.

The advantages of "vertical splitting" are:

1. High cohesion and clear splitting rules. Compared with "horizontal splitting", data redundancy is lower.

2. A 1:1 relationship with the application, which makes maintenance and troubleshooting easy. Once abnormal data shows up in a database, you only need to check the programs associated with that database.

But this is not a "once-and-for-all" solution, because nobody can predict how the business will develop. So the most obvious disadvantage is: tables that are accessed extremely frequently or that hold huge amounts of data still hit a performance bottleneck.

If you really need to solve that problem, you have to bring out "horizontal splitting".

Digression: avoid "horizontal splitting" unless you are truly forced into it. You will understand why after reading what follows.

Next, Brother Z will give you a proper talk on "horizontal splitting", which is the focus of this article.

Horizontal splitting

Imagine that after "vertical splitting" you still find a table with more than one billion rows in one of the databases.

At this point you have to "horizontally split" that table. How should you think about it?

The train of thought Brother Z teaches you is:

  1. First find the most frequently "read" field.
  2. Look at how that field is actually used (mostly batch queries or mostly single-row queries? is it also an associated field of other tables? and so on).
  3. Then choose a suitable splitting scheme based on those characteristics.

Why find the high-frequency "read" field first?

Because in actual use, "read" operations usually far outnumber "write" operations: a "write" generally has to be pre-validated by a "read" first, while "reads" also have plenty of scenarios all their own. So catering to the higher-frequency "read" scenario inevitably produces more value.

For example, suppose the billion-row table is an order table with the following structure:

order (orderId long, createTime datetime, userId long)

Let's first look at the several "horizontal splitting" methods, and then we can see which scenario suits which method.

Range splitting

This is a "continuous" splitting method.

For example, splitting by time (createTime), we can divide by year and month: order_201901 in one database, order_201902 in another, and so on.

Splitting by order number (orderId), orders 100000 to 199999 can go into one database, 200000 to 299999 into the next, and so on.
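To make the routing concrete, here is a minimal sketch in Python. The database names (order_201901, order_db_1) and the 100,000-wide ID ranges follow the examples above; everything else is an illustrative assumption, not something from the original article.

```python
from datetime import datetime

def route_by_month(create_time: datetime) -> str:
    """Range routing by createTime: one database per month, e.g. order_201901."""
    return f"order_{create_time.year:04d}{create_time.month:02d}"

def route_by_id_range(order_id: int, range_size: int = 100_000) -> str:
    """Range routing by orderId: 100000-199999 -> one database, 200000-299999 -> the next."""
    return f"order_db_{order_id // range_size}"

print(route_by_month(datetime(2019, 1, 15)))  # order_201901
print(route_by_id_range(150_000))             # order_db_1
```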

The advantage of this splitting method is that the size of a single table is controllable, and expansion requires no data migration.

The disadvantage is just as obvious. Generally speaking, the more recent the time, or the larger the sequence number, the "newer" the data, and the more frequently it is accessed compared with the "old" data. The pressure therefore concentrates on the newest database, and the older a database is, the more idle it sits.

Hash splitting

In contrast to "range splitting", this is a "discrete" splitting method.

Its advantage is that it fixes the shortcoming of "range splitting": new data is spread across every node, so pressure no longer concentrates on a few of them.

Likewise, its disadvantage mirrors the advantage of "range splitting": once you expand a second time, data migration is unavoidable. The hash algorithm is fixed, so when the algorithm changes, the data distribution changes with it.

In most cases the hash can be a simple "modulo" operation. It looks like this:

Split into 10 databases, the formula is orderId % 10.

100000 % 10 = 0, allocated to db0.

100001 % 10 = 1, allocated to db1.

...

100010 % 10 = 0, allocated to db0.

100011 % 10 = 1, allocated to db1.
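In code, this routing is a one-liner. A minimal sketch, where the db0 to db9 naming follows the example above:

```python
NUM_DBS = 10

def route_by_hash(order_id: int) -> str:
    """Hash routing: shard chosen by orderId % 10."""
    return f"db{order_id % NUM_DBS}"

for oid in (100000, 100001, 100010, 100011):
    print(oid, "->", route_by_hash(oid))
# 100000 -> db0, 100001 -> db1, 100010 -> db0, 100011 -> db1
```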

In fact, in some scenarios we can use custom ID generation (see the earlier article "An Essential Medicine in Distributed Systems - Globally Unique ID Generation") to get two benefits at once: hash splitting scatters the hot data, and the dependence on a global table for locating specific rows is reduced.

For example, fold the mantissa of userId into orderId, so that orderId and userId yield the same modulo result. Let me give you an example:

Suppose a user's userId is 200004. Taking a 4-bit mantissa means taking userId % 16, which here is 4, represented as 0100.

Then we generate the first 60 bits of the orderId with a custom ID algorithm and append 0100 at the end.

As a result, orderId % 16 and userId % 16 give the same value, so sharding on the low 4 bits routes a user's orders and the user's own data to the same node. (Note that the modulus must match the number of embedded bits: with 4 bits appended, shard by 16, not by 10.)

Of course, it is hard to squeeze in factors other than userId. In other words, this trick buys you exactly one extra dimension without resorting to a global table.
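Here is a sketch of that ID composition, assuming 16 shards; the 60-bit prefix value and the helper name make_order_id are hypothetical stand-ins for whatever unique-ID scheme you actually use:

```python
SHARD_BITS = 4                # the low 4 bits carry the shard key
NUM_SHARDS = 1 << SHARD_BITS  # 16 shards

def make_order_id(user_id: int, prefix_60bit: int) -> int:
    # Append userId % 16 as the low 4 bits of the 64-bit orderId.
    return (prefix_60bit << SHARD_BITS) | (user_id % NUM_SHARDS)

user_id = 200004                                      # 200004 % 16 == 4 (binary 0100)
order_id = make_order_id(user_id, prefix_60bit=123456789)

# Both IDs now route to the same shard, with no global table needed.
assert order_id % NUM_SHARDS == user_id % NUM_SHARDS == 4
```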

The global table has now been mentioned twice, so what is a global table?

Global table

This method stores the partition key used as the basis for splitting, together with the id of each concrete row, in a separate database or table. For example, add a table like this:

nodeId    orderId
01        100001
02        100002
01        100003
01        100004
...

This way, the concrete data really is spread across different servers, but the global table leaves the system "scattered in form yet centralized in spirit".

Because a request cannot directly tell which server holds the data it needs, every operation must first query this global table to find out where the concrete data lives.

The side effect of this "centralized" model is that the bottleneck and the risk shift onto the global table itself. In exchange, the logic is simple.
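A minimal sketch of what routing through a global table looks like; the in-memory dict here is a stand-in assumption for the real lookup database:

```python
# Maps orderId -> nodeId, mirroring the example table above.
global_table = {
    100001: "01",
    100002: "02",
    100003: "01",
    100004: "01",
}

def locate(order_id: int) -> str:
    # Every read/write pays one extra lookup against the global table.
    node_id = global_table[order_id]
    return f"db{node_id}"

print(locate(100002))  # db02
```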

Okay, so how do you choose among these splitting schemes?

Brother Z's advice: unless the hot data is particularly concentrated, prefer "range splitting"; otherwise, choose one of the other two.

Between those two, the larger the data volume, the more you should lean toward hash splitting, because it is better than the global table in overall availability and performance; the trade-off is a higher implementation cost.

"Horizontal segmentation" can truly be "infinitely expanded", but there are corresponding drawbacks.

1) Batch query, paging, etc. need to do more extra work. Especially when there are multiple high-frequency fields in a table for where, order by or group by.

2) The splitting rules are not as clear as "vertical splitting".

So let's say one more "nonsense": there is no perfect plan but a suitable plan, which should be selected in combination with specific scenarios . (You are welcome to raise your doubts in the message area, and discuss with Brother Z)

How to implement

When you implement "horizontal splitting" concretely, you can work at two levels: the "table" level or the "database" level.

Table

Split into multiple tables within the same database, with table names order_0, order_1, order_2, and so on.

This solves the problem of a single table holding too much data, but not the CPU load problem. So when the CPU is not under much pressure, yet SQL runs slowly simply because the table is too large, this method is a good choice.

Database

Here the table name can stay the same, all called order, but split across 10 databases: db0.order, db1.order, db2.order, and so on.

This is the model we discussed in the previous section, so I won't repeat it.

Database + table

You can also split both the database and the table: for example, first split into 10 databases, then split each database into 10 tables.

This is really the idea behind a secondary index: the first hop, through the database, narrows things down and saves a certain amount of resource consumption.

For example, first split databases by year, then split tables by month. Then, as long as the data you need crosses only months and not years, the aggregation can be completed inside a single database, with no cross-database operation involved.
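A sketch of that two-level routing, with the database chosen by year and the table by month; the order_2019 / order_03 naming is an illustrative assumption:

```python
from datetime import datetime

def route(create_time: datetime) -> tuple[str, str]:
    db = f"order_{create_time.year}"          # first hop: pick the database by year
    table = f"order_{create_time.month:02d}"  # second hop: pick the table by month
    return db, table

print(route(datetime(2019, 3, 8)))  # ('order_2019', 'order_03')
# A query spanning months within 2019 stays inside one database,
# so the aggregation never has to cross databases.
```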

However, whichever way you proceed, you will still face the following two problems to some degree; there is no escaping them.

  1. Cross-database joins.
  2. Global aggregation or sorting operations.

The best way to solve the first problem is to change your programming mindset: express logic, relationships, and constraints in the application code as much as possible, and resist the temptation to do these things in SQL for convenience.

After all, code can be written "stateless" and scaled at any time, whereas SQL follows the data, and data is "state", which is inherently unfriendly to scaling.

Of course, the second-best option is redundancy: replicate the tables that joins depend on into each database as global tables. But this puts the "data consistency" work to a serious test, and it also costs a great deal of extra storage.

The solution to the second problem is to turn the original aggregation or sorting into two steps, and the traversal of the multiple nodes can be done in "parallel".
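For instance, a global "ORDER BY ... LIMIT n" can be done in two steps: each shard returns its local top-n in parallel, then the application merges the partial results. A minimal sketch, with the shard queries stubbed out as sorted in-memory lists:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Pretend each list is the result of "SELECT ... ORDER BY x LIMIT n"
# on one shard, already sorted per shard.
shards = [
    [3, 9, 12, 40],
    [1, 7, 25],
    [5, 8, 30, 33],
]

def query_shard(rows, n):
    return rows[:n]  # each shard only ships its local top-n

def global_top_n(n: int):
    # Step 1: hit all shards in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: query_shard(s, n), shards))
    # Step 2: merge the sorted partial results and keep the global top-n.
    return list(heapq.merge(*partials))[:n]

print(global_top_n(3))  # [1, 3, 5]
```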

So how does a program use the data once it has been split? There are two modes: "in-process" and "out-of-process".

"In-process" means doing the routing inside your own wrapped DAL framework, inside the ORM framework, or inside the database driver. A well-known solution of this model is Alibaba's tddl.

"Out-of-process" is the proxy model. Well-known solutions here include mycat, cobar, atlas, and quite a few more, because this model is "low-intrusion" for applications: to the application it looks like "a single database". However, the extra network communication costs some performance.

As per the old rules, let me share some best practices.

Best Practices

First, two tips for splitting data without taking the system down. Take horizontal splitting by the hash method as an example.

When splitting for the first time, you can bring up each new node as a replica of an original node in "master-slave" form, with full real-time synchronization.

Then, on that basis, delete the data that does not belong to the node. (Of course, not deleting it is fine too; it just wastes some space.)

This way there is no downtime at all.

Second, as time goes by, if the current layout can no longer keep up and a second split is needed, you can choose to expand by a multiple of 2.

That makes data migration very simple: only part of the data needs to move, and the idea is the same as in the first split.
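A small sketch of why doubling keeps the migration partial: with modulo routing, going from N to 2N shards means each key either stays where it is or moves to exactly one new shard, so at most half the data migrates. The shard counts below are illustrative assumptions:

```python
OLD_N, NEW_N = 4, 8  # doubling from 4 shards to 8

for order_id in range(100000, 100008):
    old = order_id % OLD_N
    new = order_id % NEW_N
    verdict = "stays" if old == new else f"moves to db{new}"
    print(order_id, f"db{old} ->", verdict)
# Keys with order_id % 8 < 4 stay put; the rest move to old_shard + 4.
```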

Of course, if the chosen scheme is "range splitting", a second split brings no such trouble: the data naturally flows to the newest node. For example, if we split tables by year and month, the data for March 2019 naturally lands in the xxxx_201903 table.

Having come this far, Brother Z still wants to emphasize: if you can avoid splitting, don't split. Try first to handle the problem with a solution such as "read-write separation".

If you really must split, do "vertical splitting" first, and only then consider "horizontal splitting".

Generally speaking, considering things in this order gives the best value for the effort.

Summary

Well, let’s summarize.

This time, I first introduced two ideas for splitting a database. The popular way to understand them: "vertical splitting" means the "columns" change while the "rows" stay unchanged, and "horizontal splitting" means the "rows" change while the "columns" stay unchanged.

Then I focused on the three ways to implement "horizontal splitting" and the concrete ideas behind them.

Finally, I shared some practical experience with you.

I hope this inspires you.
