Random talk about distributed patterns: sharding databases and tables

Since this is "random talk about sharding databases and tables", we should first decide what we will and will not cover.

  1. First, we will not discuss the implementation or source code of any specific sharding framework; that is outside our scope.
  2. We are discussing ideas: the common patterns for splitting databases and tables, the pitfalls, and the lessons learned, not implementation details. My own abilities are limited; I only hope to offer some rough ideas in the hope of drawing out better ones.
  3. We must be clear that sharding is not a silver bullet. It is simply a way to save costs when a single MySQL instance no longer has enough performance. The boss wants to save money while still supporting the business with stable, durable performance.

Programmers have racked their brains through day after day of hard work and practice, and have finally produced two main approaches:

  1. Embedded mode: a jar package is integrated into our code, and the code uses routing rules and sharding keys to split databases and tables. This is an in-process, embedded approach.
  2. Client-server (CS) mode: a third-party component, such as MyCat, or the proxy mode in ShardingSphere, which is similar to MyCat. This approach is centralized, so the high availability of the third-party component itself must be guaranteed.

If there were a better option, we would rather not shard at all, because sharding is itself a complicated solution; it is only a compromise. NewSQL and commercial databases can be more suitable (for example, Oracle's performance is sufficient in most scenarios, but its cost is high).

If one day there is an excellent and economical NewSQL database, such as OceanBase or TiDB, then we can basically say goodbye to sharding.

Fundamentally, we choose sharding because, on one hand, our costs cannot be too high; on another, the performance of a single database instance is not enough; and on yet another, NewSQL is currently immature and too expensive, so we dare not use it.

Sharding also has a lot of support behind it: rich open-source frameworks, active communities, and mature production cases, so we adopt it.

The direct reason is Alibaba's endorsement. The prevailing domestic attitude is "I use whatever Ali uses and do things however Ali does them." I find that kind of trend-following excessive. My view is that we should keep some technical foresight of our own and, ideally, depend not on Ali but only on the technology.

Having said all that, let's return to the topic and start looking at the questions.

1. Is it enough to only split tables? Or must we split both tables and databases, with the databases spread across multiple servers? How should we think about this?

My answer: look at the scale of the business, not only its current scale but also the business trend over the next 3-5 years.

When it comes to technology selection, our aim is always to choose what best fits the current business: the lowest cost and the highest return. What fits is what's best.

The best choice is the one the technical team can just barely handle. If the selection no longer suits the current business, switch to a more suitable one; that is the natural law of how things develop.

Maybe the business dies before it grows to the next level; then it is just as well that we did not waste money on better infrastructure, and we simply stop the loss.

Or maybe the current solution really is no longer enough, so we switch to a more powerful one. It will cost more money and require more people, but that does not bother us: our purpose all along is to support business growth through an appropriate technical architecture and better code.

The one-sentence summary: if the money has to be spent, then spend it.

Discussing by scenario

A picture is worth a thousand words. Let's look at these two scenarios separately.

For offline data analysis scenarios

Splitting tables alone is usually enough, because the data is mainly used for analysis, and once analyzed it can be deleted; an asynchronous task can drop data that is several months or days old.
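A minimal sketch of that pattern (illustrative names, not any particular framework's API): route rows into per-month sub-tables, and let an asynchronous cleanup job drop the months that fall outside the retention window.

```python
from datetime import date

def monthly_table(base: str, d: date) -> str:
    """Route a row into its per-month sub-table, e.g. events_202101."""
    return f"{base}_{d.year}{d.month:02d}"

def tables_to_drop(base: str, today: date, keep_months: int,
                   lookback: int = 12) -> list[str]:
    """Names of monthly sub-tables older than the retention window;
    an asynchronous cleanup task would DROP these."""
    names = []
    year, month = today.year, today.month
    for age in range(1, keep_months + lookback + 1):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        if age > keep_months:  # outside the retention window
            names.append(f"{base}_{year}{month:02d}")
    return names
```

A cron-style job could iterate over `tables_to_drop(...)` and issue `DROP TABLE` for each; the table name scheme itself is the "routing rule" here.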


For real-time business systems

If it is a 2C (consumer-facing) distributed business system that needs to carry huge traffic, it is recommended to split both the databases and the tables.


Prerequisite for sharding: estimate the business volume

The premise of sharding is estimating the business volume. Here is an empirical rule of thumb; it is not necessarily the most appropriate value, just a qualitative guide:

Below roughly 500-1000 QPS: primary-replica replication with read-write splitting is basically enough to support the business.

Around 1000-10000 QPS: it becomes reasonable to consider sharding databases and tables.

Two reference data points:

12,000 TPS / 30,000 QPS: 32 databases, 1024 tables
10+ million records / 16,000 QPS: 16 databases, 512 tables

Essentially, sharding is a one-shot deal. The early design is very important: it determines how hard later expansion and data migration will be. In the initial design we most likely need to plan for the next 3-5 years, or at least 1-2 years in the short term, and based on that plan decide whether to shard and how many databases and tables to use.

Back to the problem itself, this mainly depends on the current business volume and the growth rate of business volume.

Based on these dimensions, we give a set of formulas:

Data increment in year n: M = (1 + K)^n × N, where K is the annual data growth rate and N is the initial data volume.

Year 1 increment: M1 = (1 + K) × N
Year 2 increment: M2 = (1 + K)^2 × N
Year 3 increment: M3 = (1 + K)^3 × N

Total data after three years: M' = N + M1 + M2 + M3

Then calculate how many tables you need, assuming a single table carries about 10 million rows. It does not have to be exactly 10 million; 20-50 million can also be fine. This is first an empirical value; second, it should be confirmed by quantitative analysis.

Quantitative analysis means stress testing: take one database instance with your production configuration and measure, under that configuration and without hurting system throughput, the safe maximum capacity of a single table. That number can guide the early design well.
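Combining the growth formula above with the per-table empirical limit, a back-of-the-envelope estimate might look like the following sketch. The power-of-two rounding is a common convention that keeps modulo routing easy to expand later, and the 10-million-row default is just the empirical value mentioned above.

```python
import math

def total_rows(initial: int, growth: float, years: int) -> int:
    """Initial volume N plus each year's increment M_i = (1+K)^i * N,
    matching the formula in the text."""
    return initial + sum(int((1 + growth) ** i * initial)
                         for i in range(1, years + 1))

def tables_needed(initial: int, growth: float, years: int,
                  rows_per_table: int = 10_000_000) -> int:
    """Round the required table count up to the next power of two."""
    raw = math.ceil(total_rows(initial, growth, years) / rows_per_table)
    return 1 << max(0, (raw - 1).bit_length())
```

For example, starting at 100 million rows with 50% annual growth, a 3-year plan lands on 128 tables; stress-test results would then refine `rows_per_table`.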

Next question: when we do split databases, must every database be an independent instance?

No, we still have to analyze specific issues in detail.

For a development environment, i.e., a database for developers to write code against, multiple databases can share one machine. After all, a development environment has no real concurrency; as long as it is not used for stress testing, there is no problem.

For the production environment, besides deploying the databases across multiple machines, you must also consider read-write splitting and database high availability. The main difference between production and non-production environments is that production has high-availability requirements.

Think about the difference between the two: it comes down to cost control.

Our conclusion: whether each database gets its own machine instance depends on the scenario, the cost, and whether you actually need it. Analyze the specific situation.

2. How is the routing key generated? Can the Snowflake algorithm be used? If the original database used an auto-increment primary key with no unique business constraint, how can the old data be routed after migrating to a sharded layout?

Good question.

First of all, how to generate the routing key?

Essentially, this is the question of how to implement a reliable distributed ID generator. We will only discuss the ideas, because covering it fully would take an article of its own.

Ideas:

Some frameworks ship their own primary-key generators, such as the Snowflake-style algorithm in ShardingSphere/ShardingJDBC.

  1. UUID: a string; genuinely unique, but poorly readable, hard to do math on, not intuitive, relatively long, and space-hungry.
  2. Snowflake: usable as-is, or improved upon via Leaf. Leaf itself is a complete distributed ID-issuing service, and it comes with its own high-availability guarantees.

Of course there are other ways:

Since you are already sharding, your system is most likely distributed, so an in-process ID generator is not an ideal approach.

If you want a simple distributed ID service, you can build one on Redis increments (INCR), or on a database's auto-increment unique id, but either way you end up developing a small ID-issuing system yourself.
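As a sketch of the Snowflake idea, assuming the common 41-bit timestamp / 10-bit worker id / 12-bit sequence split (real frameworks vary in bit widths and epoch):

```python
import threading
import time

class SnowflakeSketch:
    """Illustrative Snowflake-style ID generator: millisecond timestamp,
    10-bit worker id, 12-bit per-millisecond sequence."""
    EPOCH = 1609459200000  # 2021-01-01 UTC; an arbitrary custom epoch

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:  # sequence exhausted: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence
```

Note this sketch ignores clock rollback, which a production issuer (such as Leaf) must handle; it only shows why the IDs are unique, roughly time-ordered, and cheap to generate.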

A diagram sketches these ideas; I will write them up in a separate article later.


To sum up: in essence this is the question of how to implement a reliable distributed ID generator.

So do not agonize over which specific ID-generation mechanism to depend on; just pay attention to the final selection and weigh the trade-offs.

3. If it is a single database with no routing key, and the primary key is the only unique identifier, how do I shard?

It is simple: whatever your original unique identifier was, you can keep using it after sharding.

However, because it carries no business meaning, it is recommended that after data migration you add a natural key with business meaning, and you will most likely need to configure new routing rules.

The specific process is:

  1. Migrate the data.
  2. Change the routing configuration: specify the new query rules and the sharding routing rules.
  3. Change the code: wherever there is CRUD code, such as in DAOs and repositories, add the routing rules. Simply put, you can still query, insert, and delete by the original id; the main change is that a routing rule now sits in front of those operations.

The core principles of migrating to a sharded layout: guarantee data integrity, and accept that the code must be refactored. A comprehensive solution that requires no code changes is almost impossible; we can only compromise and reduce the complexity.

After migration, the original primary-key ids are no longer contiguous, but they must stay unique. The new sharded tables should still have an auto-increment primary key, just one without business meaning. The reason a sharded table still needs an auto-increment primary key is mainly insert and query efficiency: lookups go through the primary-key index tree and then back to the table row.

In effect, your original auto-increment id carried business meaning. As an aside: please try not to let an auto-increment primary key carry business meaning.

3. How to choose the sharding key

Again there is no single precise answer; you have to choose according to the demands of the business scenario.

That is too general, so let's make it concrete with a few examples.

For the user table, use the user's unique identifier, e.g. userId, as the sharding key;
for the account table, use the account's unique identifier, e.g. accountId;
for the order table, use the order's unique identifier, e.g. orderId;
for merchant-related tables, use the merchant's unique identifier, e.g. merchantId;
......

If we want to look up a user's orders, then we should route by userId when inserting into the order table, ensuring that all of one user's orders land on the same table shard. That avoids introducing distributed transactions.

If the query dimension is not the user but something else, for example we want to query the orders of all users of a certain merchant:

then we should store another copy of the data keyed by the merchant's merchantId and route by merchant id. Any user order belonging to that merchant is also written into the merchant's order table, so all of a merchant's orders can be fetched from a single shard.

Two diagrams express the description above clearly:

For users, the sharding key works as follows: [figure: usertable.png]

For merchants, the sharding key works as follows: [figure: merchanttable.png]

So our conclusion is: choose according to the demands of the business scenario, analyze the concrete problem concretely, and try to ensure that no distributed transactions are introduced while keeping queries efficient.
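The key-based routing described above can be sketched as follows (shard counts and names are hypothetical; real middleware lets you plug in the algorithm via configuration):

```python
import hashlib

def route(sharding_key: str, db_count: int, tables_per_db: int) -> tuple[int, int]:
    """Map a sharding key to (database index, table index).
    A stable hash (not Python's salted built-in hash) keeps the
    same key on the same shard across processes and restarts."""
    h = int(hashlib.md5(sharding_key.encode()).hexdigest(), 16)
    slot = h % (db_count * tables_per_db)
    return slot // tables_per_db, slot % tables_per_db

def order_table_for_user(user_id: str) -> str:
    """All of one user's orders land on one physical table,
    so per-user queries never cross shards."""
    db, tbl = route(user_id, db_count=16, tables_per_db=32)
    return f"db_{db:02d}.t_order_{tbl:03d}"
```

Routing the merchant-keyed copy of the data would use the same `route` function with merchantId as the key; the two copies are kept in sync by double-writing.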

In addition, the mainstream approach for complex queries is either to double-write along the different dimensions, or to introduce a heterogeneous store and query it directly, such as Elasticsearch or Hive.

4. Batch inserts will write into every shard. In real business, do we use distributed transactions?

Question 3 has more or less already hinted at the answer.

When implementing sharding, we must try hard to avoid introducing distributed transactions.

As question 3 showed, if we have a routing key the problem becomes much simpler: in most cases we do not need distributed transactions at all, but without a routing key it gets painful.

For out-of-order inserts that must be transactional, distributed transactions are indeed required. But they are too inefficient and usually inappropriate.

First, out-of-order insert scenarios are rare. Second, introducing distributed transactions means heavyweight transactions with a significant impact on insert performance. Not the best way.

My suggestion is to build on eventual consistency. Otherwise, distributed transactions hurt efficiency too much and increase system complexity. The whole point of our design is to avoid complicated solutions; with the time saved you could be drinking tea or doing something else.

So the conclusion: avoid distributed transactions wherever possible; if you must introduce them, minimize the scope and strength of the transactions. Compromise, and think more about the feasibility of the scheme.
Performance matters a lot; if the job can be done without distributed transactions, do it through eventual consistency.

But if you say, "I simply cannot avoid distributed transactions, what then?" Then use them. Do not multiply entities without necessity, but if you truly have to, just use them; nothing more to say.

5. If a database has many tables and only one of them is sharded, where do the unsharded tables go? Are they assigned to one particular database among the shards?

Essentially, this is the problem of how unsharded data lives alongside sharded data.

In fact, sharding middleware usually has a feature for exactly this, often called default routing rules. How should we understand it?

For tables that are not sharded, the default routing rule is enough: such a table is always routed to the default DataSource.

It works like a whitelist. Check your middleware's documentation for how to configure the default routing rules; basically every middleware has considered this. For ShardingSphere, a configuration example looks roughly like this:

    # CustomerNoShardingDBAlgorithm: a custom default (no-sharding) database algorithm
    default-table-strategy:    # default table sharding strategy
        complex:
            sharding-columns: db_sharding_id
            algorithm-class-name: com.xxx.XxxClass
    tables:
        ops_account_info:      # must be configured for the default table strategy to apply
            actual-data-nodes: db-001.ops_account_info

To give a concrete example:

Suppose a service's original database has been sharded (for example, the user database is split into user01 and user02). Are the unsharded tables forced to route to one particular database (for example, all routed to user01)?

The essence is again the default routing rules: we only need to configure certain tables to follow them. Say we now have a user table, an order table, and a config table; the user and order tables are sharded, while config is not.

Then we just place the config table in any one of the user databases, say database 0, 1, or 2; anywhere is fine.

After placing it, we configure the default routing rules in the sharding middleware's configuration file, with a special entry for the config table, so that any query on config goes to that specified database.

Other tables are handled similarly: whenever there is such a need, add the corresponding configuration.

You must explicitly tell the middleware which tables do not follow the sharding rules, and where those tables live.
It is best to put them in a database with light request traffic, or even build a separate database just for the unsharded tables.
Configuring "no routing rules" for them is, in effect, still the default routing rule.

Why do you do this? What is the intention?

My understanding: we shard because the request volume is huge and the concurrency per database must be reduced; for tables with low request frequency, a single unsharded table is fine, and we can simply route it by the default routing rule.
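A toy model of the default routing rule (table names and rules here are hypothetical; real middleware expresses this via configuration, as in the ShardingSphere snippet above):

```python
# Sharded tables get an explicit rule; everything else falls
# through to the default data source (the "default routing rule").
SHARD_RULES = {
    "t_user":  lambda key: f"user_db_{key % 2}",   # user table: 2 databases
    "t_order": lambda key: f"order_db_{key % 4}",  # order table: 4 databases
}
DEFAULT_DATASOURCE = "user_db_0"  # unsharded tables (e.g. t_config) live here

def datasource_for(table: str, sharding_key: int = 0) -> str:
    """Return the data source a query for `table` should go to."""
    rule = SHARD_RULES.get(table)
    return rule(sharding_key) if rule else DEFAULT_DATASOURCE
```

The point is that the unsharded `t_config` table never needs a sharding key; the whitelist-style fallback sends it to one fixed database.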

8. The data migration process, and how to ensure data consistency

In one sentence: data migration relies on double-writing the data; data consistency relies on data-integrity verification.

The migration has the following steps:


  1. First modify the code: add the double-write logic for the sharded layout; release it.
  2. Start double-writing to synchronize incremental data. The main purpose of double-writing is to catch up with real-time data: pick a cutoff time after which the data is guaranteed complete. (In parallel, an asynchronous integrity-check program verifies the data. If double-writing itself is reliable, this comparison is optional, but it is better to do it.)
  3. Synchronize the full historical data and verify its integrity. The full backfill generally should not be done with synchronous writes: synchronous writing couples the code tightly and impacts the running system, so we usually backfill asynchronously.
  4. Remove the double-write code and switch all query logic to the sharded layout: via a switch, after full synchronization completes, cut over to reading and writing the sharded databases entirely. At that point the old path receives no traffic; find a release window to take the old logic offline. The system has then fully migrated to the sharded code path.
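The double-write phase of the steps above can be sketched as a thin DAO wrapper plus a phase switch (all names hypothetical; the dict-backed "stores" stand in for the old and new databases):

```python
from enum import Enum

class Phase(Enum):
    OLD_ONLY = 1      # before migration starts
    DOUBLE_WRITE = 2  # write both, still read from the old store
    NEW_ONLY = 3      # after full sync + verification, flip the switch

class MigratingDao:
    def __init__(self, old_store: dict, new_store: dict):
        self.old, self.new, self.phase = old_store, new_store, Phase.OLD_ONLY

    def write(self, key, value):
        if self.phase in (Phase.OLD_ONLY, Phase.DOUBLE_WRITE):
            self.old[key] = value
        if self.phase in (Phase.DOUBLE_WRITE, Phase.NEW_ONLY):
            self.new[key] = value  # in real life: routed via sharding rules

    def read(self, key):
        src = self.new if self.phase is Phase.NEW_ONLY else self.old
        return src.get(key)

def verify(old: dict, new: dict) -> list:
    """Integrity check: keys whose values differ between the stores."""
    return [k for k in old if new.get(k) != old[k]]
```

Records written before double-writing began show up in `verify` until the asynchronous historical backfill copies them over; only when `verify` is clean do we flip to `NEW_ONLY`.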

Finally, one thing I must say three times: regression test, regression test, regression test!!!

Original link: http://wuwenliang.net/2021/01/09/Distributed Routines: Discussing the Sub-database and Sub-table/

If you found this article helpful, you can follow my official account and reply with the keyword [Interview] to get a compilation of Java core knowledge points and an interview gift pack! There are more solid technical articles and materials to share, so that everyone can learn and improve together!

Origin blog.csdn.net/weixin_48182198/article/details/112436867