Well-reasoned and well-founded: choosing between a centralized and a distributed database

Whether an OLTP business system should use a centralized or a distributed database is a question frequently asked during the migration to domestic databases. Whether for the evolution of the technical architecture or for supporting the long-term growth of existing business, the question is worth discussing. In the current wave of distributed architecture, it can seem as if every system needs to be "empowered" by distribution. Is that really the case? A comprehensive analysis follows.

Author: Wang Hui

The article comes from the WeChat public account "Basic Technology Research"

1. Analysis of current usage

As of 2022 there are more than 200 domestic database vendors. Traditional centralized databases are mainly Renmin Jincang (Kingbase) and Dameng, with newer entrants such as PolarDB. Distributed databases include GaussDB, Kingwow, TDSQL, GoldenDB, OceanBase, and others. In fact, most of these databases offer both deployment modes, centralized and distributed; that is, the money you spend on a distributed database can also buy a centralized deployment, meeting different business needs.

One thing to note is that some distributed database vendors, even in centralized deployment, still require applications to connect through a compute node (CN), which then connects to the data nodes below it. This may be done for the sake of a unified architecture, and also because the compute node can detect a primary/standby switchover automatically and keep it transparent to the application. However, it inadvertently adds a layer of SQL parsing, which causes some loss of performance. Other vendors let applications connect to the database directly through their own JDBC/ODBC drivers or a VIP, avoiding this problem.

From the perspective of technical architecture, databases used in the financial industry are still mainly centralized, while distributed databases have become a strong complement in medium and large financial institutions. Research data from the "Financial Industry Database Supply Chain Security Development Report (2022)" shows that centralized databases still account for 89% of the financial industry overall: 80% in banking, and more than 90% in securities and insurance. Centralized databases play an important role in the digitalization of financial technology. Distributed databases account for 7% of the financial industry overall, exceed 17% in banking, and remain relatively rare in securities and insurance. In other words, for most of our business a centralized database is entirely sufficient.

2. Is distribution really needed?

Since a centralized database has only one primary data node, it naturally has the advantages of a simple architecture, easy operation and maintenance, good compatibility, and high cost-effectiveness.

However, it also has problems: it cannot break through the hardware limits of a single machine, cannot scale horizontally, and therefore has performance and capacity bottlenecks.

So when a centralized database cannot meet our performance and capacity requirements, distribution offers a good technical means. Before choosing a distributed solution to a centralized problem, it is recommended to ask the following questions first:

  1. Can the problem be solved by optimizing the centralized database itself, without major architectural changes, such as tuning parameters, optimizing SQL statements, or optimizing business logic?
  2. Can it be solved by increasing host resources, such as adding CPU and memory, or scaling vertically, for example moving from a virtual machine to a physical machine?
  3. Can it be solved by separating storage and compute? If single-machine capacity is insufficient, consider external storage or a storage-compute separation architecture to overcome the disk-capacity limit of a single machine.
  4. Can it be solved at the application layer, for example by changing the business architecture to microservices or a unitized architecture, i.e. implementing data splitting, distributed transactions, and horizontal scaling at the application layer while the database remains centralized? This approach demands a lot from developers and carries high business-transformation costs, so it needs to be weighed carefully.
  5. Do you fully understand the advantages and disadvantages of a distributed architecture, have you prepared for the operation, maintenance, and backup of a distributed database, and have you fully confirmed that your business can only be solved with one?

3. When to go distributed?

In the early days there was a rule of thumb that a table with 20 million rows should be split. This applied mainly to MySQL: by formula, once an OLTP table exceeds 20 million rows, the B+tree grows to 4 levels, adding an extra I/O read per lookup. With hardware upgrades and caching technology, however, that I/O impact can now largely be ignored. Today it is more common to use TPS or QPS indicators to judge whether distributed transformation is needed, for example when a single node hits a TPS bottleneck around 4,000, QPS reaches 80,000, or data volume reaches 2 TB; in such cases resolving the performance or capacity bottleneck through horizontal scaling is reasonable. There is no fixed formula, though; the judgment must be made against your own business scenario. Also consider future business growth, such as whether the system can meet demand for the next 3-5 years; forecast peaks and plan ahead to avoid a second transformation. At the same time, revisit the questions raised above to confirm the problem truly must be solved with a distributed database.
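The 20-million-row rule can be reproduced with a back-of-the-envelope calculation. The sketch below assumes InnoDB-style 16 KB pages, 8-byte bigint keys plus 6-byte page pointers in internal nodes, and roughly 1 KB rows; with other engines or row sizes the numbers shift, but the shape of the argument is the same:

```python
# Rough capacity estimate for an N-level B+tree (the classic
# "20 million rows" rule of thumb for MySQL/InnoDB).
PAGE_SIZE = 16 * 1024  # InnoDB default page size

def btree_capacity(levels=3, key_bytes=8, ptr_bytes=6, row_bytes=1024):
    # Internal pages hold (key, child-pointer) pairs; leaves hold full rows.
    fanout = PAGE_SIZE // (key_bytes + ptr_bytes)   # ~1170 children per page
    rows_per_leaf = PAGE_SIZE // row_bytes          # ~16 rows per leaf page
    return fanout ** (levels - 1) * rows_per_leaf

# A 3-level tree tops out just above 20 million rows; past that,
# the tree needs a 4th level and every lookup costs one more page read.
print(btree_capacity())  # 21902400
```

Once the working set is cached in the buffer pool, that extra page read rarely touches disk, which is exactly why the article argues the rule no longer forces a split on its own.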

Experimental data one (finding the inflection point)

The hardware is a Kunpeng (ARM architecture) virtual-machine environment configured with 16 cores and 64 GB of memory, running the Kylin v10 operating system on an ordinary SSD disk.

The figure below shows the test results of a domestic distributed database, deployed as 4 shards (unit: seconds).

For single-point index-based queries there is basically no gap. For full-table scans and two-table joins (the joined tables are uniformly 2 million rows, joined on the shard key), the difference is already about 5 times at 5 million rows, a significant improvement. To be honest, this inflection point arrives a bit early; it still needs to be verified against your own business scenarios to be accurate.

For data volumes below 5 million rows, test against your own business; the inflection point may well be at 3 million rows or even lower, and more test results are welcome. The experimental data may deviate due to various factors, so corrections are appreciated. We also hope readers will share their results in the comments so that everyone can verify the performance inflection point between distributed and centralized deployments; this provides a more accurate data basis for selection.

Experimental data two

The figure below shows the results of one vendor's stress test using the sysbench tool:

It can be seen that when resource usage of the centralized database reaches 75% on a medium-sized configuration, the maximum achievable TPS is 4595, with 5 ms latency at 400 concurrent connections. This is a reference value, and it is the basis for the "split when TPS exceeds roughly 5,000" threshold mentioned above. Of course, with larger resources this value can be higher; most accurately, verify your own TPS through stress testing in a real environment before judging.

4. How to use distribution well

As the name suggests, distribution means many hands make light work: it offers high availability, high scalability, high performance, and elastic scale-out and scale-in.

But as the number of data nodes and database components grows, problems such as architectural complexity, operational complexity, and high cost inevitably follow. At the same time, most distributed databases do not support special objects such as stored procedures and user-defined functions.

Distribution is a double-edged sword; how to use it well without getting hurt is very important.

1. Selection of shard key

The choice of the shard key is very important. The field chosen as the shard key should have relatively discrete values so that data is distributed evenly across the data nodes. When a single field is not discrete enough, consider combining multiple fields as the shard key. In general, the table's primary key is a good candidate; for example, choose the ID-card number as the distribution key in a personnel information table. Note also that most distributed databases do not support, or do not recommend, modifying the shard key.
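A minimal sketch (not any vendor's actual routing code) illustrates why discreteness matters: a high-cardinality key such as an ID number spreads rows evenly, while a low-cardinality key such as gender piles all rows onto a couple of shards.

```python
from collections import Counter
import hashlib

def shard_of(key: str, shards: int = 4) -> int:
    # Stable hash -> shard number; mirrors how hash distribution routes rows.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % shards

# Discrete key: 10,000 distinct ID-card-style numbers.
id_numbers = [f"11010119900101{i:04d}" for i in range(10000)]
# Non-discrete key: only two possible values.
genders = ["M" if i % 2 else "F" for i in range(10000)]

even = Counter(shard_of(k) for k in id_numbers)
skewed = Counter(shard_of(k) for k in genders)
print(sorted(even.values()))   # four counts, each close to 2500
print(sorted(skewed.values())) # all 10,000 rows land on at most 2 shards
```

With the skewed key, two of the four data nodes hold no data at all, so the cluster's capacity and throughput degenerate toward that of a single machine.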

2. Choice of distribution method

A common choice is hash distribution, which spreads data relatively evenly; range and list distribution are also available. Ultimately the choice must be made for the specific business scenario. In addition, frequently accessed configuration tables and small tables used in join queries should be defined as global tables, so that queries can be satisfied on a single data node and cross-node data interaction is avoided.

3. Standardize the writing of SQL statements

Queries should filter on the shard key, and multi-table joins should join on the shard key. If the shard key is not used, cross-node data transfer occurs: some distributed databases pull all the data up to the compute nodes for aggregation, joining, and sorting. When the data volume is large, compute-node resources can be exhausted in an instant, leaving the database unable to serve requests.
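The routing decision can be sketched as follows. This is a hypothetical simplification, not a real optimizer: a query whose equality predicates cover the shard key is pruned to one shard, while anything else must scatter to every shard and merge the results on the compute node.

```python
# Toy single-shard-vs-scatter-gather routing decision (illustrative only).
SHARD_KEY = "id_no"
NUM_SHARDS = 4

def route(eq_predicates: dict) -> list:
    """Return the list of shards a query must touch."""
    if SHARD_KEY in eq_predicates:
        # Shard key present: hash its value and route to exactly one node.
        return [hash(eq_predicates[SHARD_KEY]) % NUM_SHARDS]
    # Shard key absent: every shard must be scanned and results merged.
    return list(range(NUM_SHARDS))

print(len(route({"id_no": "110101199001010001"})))  # 1 shard touched
print(len(route({"name": "Wang Hui"})))             # 4 shards touched
```

The second query's cost grows with the number of shards and with result size, which is exactly the failure mode described above when non-shard-key queries flood the compute node.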

4. Avoid cross-node data transmission

As mentioned above, putting the shard key in the query conditions is precisely how cross-node transfer is avoided as far as possible. Cross-node data transfer goes over the network, and network transfer and read/write performance lag far behind local disk, so performance drops noticeably; in extreme cases a query may never return at all.

5. Avoid distributed transactions

Distributed transaction processing has a long code path; this is determined by its nature, as most databases implement it on the 2PC (two-phase commit) principle. Therefore distributed transactions should be avoided as much as possible, generally kept within 10% of all transactions. Too many distributed transactions will certainly hurt performance and also challenge the consistency of business data.
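A toy two-phase-commit coordinator shows why the path is long: every participant is contacted twice, and a single "no" vote forces a global rollback. Real databases add write-ahead logging, timeouts, and crash recovery on top of this skeleton; the sketch below is illustrative only.

```python
class Participant:
    """One data node taking part in a distributed transaction."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"
    def prepare(self):                       # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit
    def commit(self):                        # phase 2a: make it durable
        self.state = "committed"
    def rollback(self):                      # phase 2b: undo
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: all participants must vote yes.
    if all(p.prepare() for p in participants):
        for p in participants:               # phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:                   # any "no" -> global rollback
        p.rollback()
    return "aborted"

nodes = [Participant("dn1"), Participant("dn2"), Participant("dn3")]
print(two_phase_commit(nodes))               # committed
nodes[1].can_commit = False
print(two_phase_commit(nodes))               # aborted
```

Each transaction thus costs two network round-trips per participant, and a coordinator failure between the phases can leave participants blocked in the "prepared" state, which is why keeping distributed transactions to a small fraction of the workload matters.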

5. In-depth analysis: Is distribution a database solution or an application solution?

Distribution can be implemented either in the database (a distributed database) or in the application. Most development teams, especially in traditional industries and at financial institutions such as city commercial banks, have weaker development capabilities and smaller staffs than the large banks. They prefer the database to do more, such as handling distributed transactions and data splitting, and to stay as transparent to developers as possible. Therefore they adopt a distributed database directly; taking the unitized architecture as an example, as shown below:

However, some important business systems, or teams with real development capability, will consider implementing distribution at the application layer because they want more control. If a distributed transaction fails and the mechanism lives in the database layer, it is a black box to developers: they can only hope the database's distributed-transaction handling copes, and they cannot intervene. If it is implemented at the business layer instead, they can use message queues, TCC, Saga, and similar patterns together with log information and data-compensation mechanisms to handle the failure themselves. So they implement distribution in the application layer while each database remains centralized, each storing part of the business data, in a unitized architecture as shown below:

The differences between centralized and distributed databases in implementing distributed methods are summarized as follows:

With a centralized database, the application layer bears higher requirements, since it must implement the distributed characteristics itself. However, changes at the database level are relatively few, because centralized databases have better compatibility than distributed ones.

With a distributed database, the application does not need to implement distributed features; they are transparent to it. However, distributed databases have poor compatibility with, or no support at all for, special objects such as stored procedures and functions, which forces the application to adapt to the database.

6. Summary

At a roundtable forum on database innovation, a fellow speaker said that a centralized database is like a sheep, docile and easy to manage, while a distributed database is a wild horse, unruly and hard to control. This reminds me of Song Dongye's song "Miss Dong", which sings: "I fell in love with a wild horse, but my home has no grassland, and that drives me to despair…". The wild horse of the distributed database can be tamed to let you gallop across the prairie; otherwise it will make you suffer and struggle. In truth, most developers still hope the database will do more, so they can change less, and that it will be more transparent, simpler, and even smarter.

Finally, I would say that our domestic databases still have a long way to go. Customers actually care more about the polishing of basic functions than the addition of new ones. If we do a good job on the core storage engine and the ecosystem, then for OLTP databases we would not need to debate this topic so deeply.

If there is any inaccurate or unprofessional expression in the article, please correct me. Thank you.

For more technical articles, please visit: https://opensource.actionsky.com/

About SQLE

SQLE is a comprehensive SQL quality management platform covering SQL auditing and management from development through production. It supports mainstream open-source, commercial, and domestic databases, provides process automation for development and operations, improves go-live efficiency, and raises data quality.

Get SQLE

Repository: https://github.com/actiontech/sqle
Documentation: https://actiontech.github.io/sqle-docs/
Release notes: https://github.com/actiontech/sqle/releases
Audit plugin development docs: https://actiontech.github.io/sqle-docs/docs/dev-manual/plugins/howtouse


Origin: my.oschina.net/actiontechoss/blog/10314327