Database | The future belongs neither to centralized nor to distributed

  • 1. The debate between centralized and distributed

Distributed or centralized databases: which owns the future? Is a distributed database really necessary? Debates of this kind have flared up frequently of late. Each side has ample arguments and even cites survey data from authoritative organizations to support its view.

The case for centralized. Small enterprises have small data volumes and low concurrency, while even in large enterprises more than half of the business systems are small or medium-sized, alongside some edge computing scenarios. Here centralized databases hold stronger advantages in performance, cost, and development and operations complexity, and are the natural product of choice.

The case for distributed. Large business systems with massive data and high concurrency, core systems with extremely high reliability requirements, Internet services whose load fluctuates drastically over time, and so on; especially where the performance and reliability of domestic hardware cannot reach the level of mainframes and minicomputers. In these cases the high availability and scalability of distributed databases are the greater advantage, and distribution becomes a must.

Starting from the premise that "both types of business scenario will exist forever", the two sides debate back and forth, and the final conclusion is always the same: centralized and distributed each have their own strengths and weaknesses, and will coexist with their respective business scenarios for a long time to come.

Such a conclusion seems unshakable. Yet when we combine the development history of databases with the continuous evolution of technical architecture and look ahead to how that architecture will evolve, today's debate starts to look futile.

  • 2. In the future of databases, there will be neither distributed nor centralized

    1. Markets overlap, competition is fierce, and convergence of capabilities is an inevitable result.

Databases have been in development for more than 60 years, starting with GE's IDS in 1961. In June 1970, Edgar Frank Codd formally introduced the relational model. After that, pioneers such as Oracle, Informix, DB2, and Sybase elevated database software into a core technology of IT systems. And so the world remained for a long time: a centralized world, with no distribution in sight.

As databases were applied to more and more core scenarios, the demands on system safety rose. So, on top of the centralized database, active-standby replication was introduced for high availability; read-write separation was used as a stopgap for the single machine's inability to handle high concurrency; and then the multi-instance shared-storage architecture appeared (in fact already an early distributed architecture). At that point the centralized database had reached its peak.
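The read-write separation mentioned above can be illustrated with a minimal sketch. This is a toy router, not any particular product's implementation; the class name, node addresses, and the prefix-based SQL check are all illustrative assumptions (a real proxy parses SQL properly).

```python
import random

class ReadWriteRouter:
    """Toy read-write splitter: writes go to the primary,
    reads are spread across standby replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas or [primary]

    def route(self, sql):
        # Simplification: treat SELECT/SHOW statements as reads,
        # everything else as a write that must hit the primary.
        if sql.lstrip().upper().startswith(("SELECT", "SHOW")):
            return random.choice(self.replicas)
        return self.primary

router = ReadWriteRouter("db-primary:5432",
                         ["db-replica1:5432", "db-replica2:5432"])
print(router.route("SELECT * FROM orders"))       # one of the replicas
print(router.route("INSERT INTO orders VALUES (1)"))  # db-primary:5432
```

Note the limitation this sketch makes visible: reads scale with the number of replicas, but every write still lands on the single primary, which is exactly why read-write separation only "temporarily" relieved the concurrency problem.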

If business needs and database technology had continued to accommodate each other, there would be no native distributed databases today. Unfortunately, the relationship between the two eventually broke down. The rise of the Internet was followed by the boom of the mobile Internet. In the past, people had to wait for an institution to open during the day and gather at its premises, which naturally throttled the pressure on the database. Now it is different: people can connect to an organization's business systems at any time, from any location.

In the Internet era, data processing pressure is everywhere. Concurrency has jumped from hundreds and thousands of requests to hundreds of thousands and millions, and data volumes have grown from GB to TB, PB, and even EB. For 20 years the information age has pushed business demands on data processing to new heights, but the database had become somewhat dated and could not keep up. To solve the new problems, the sharding architecture (splitting across databases and tables) appeared briefly, but it was abandoned because of the application rework it required, its limited scaling ability, and the difficulty of operating and maintaining it.

It was the rise of native distributed databases, which do not rely on a traditional database underneath, that finally supported the new business needs. At first, the market targets of distributed databases were only high concurrency, massive data, and core business; the markets of centralized and distributed did not overlap and did not conflict. But as native distributed technology matured and centralized databases steadily improved, the overlap between the two markets grew larger and larger, and the resulting conflict intensified. The debate mentioned above, on whether a distributed database is really necessary, occurs precisely in the business scenarios where the two overlap.

Whether centralized or distributed, each architecture is in fact a transitional product of its era. Barring external shocks (such as sudden changes in data volume, data models, or hardware), in order to survive in a limited space the two must compete for each other's business scenarios, which means each must fix its own product's pain points in the other's scenarios. Convergence of capabilities is therefore inevitable, covering deployment architecture, cost, product functionality, performance, and more. The product the market ultimately selects will be a database that supports both centralized deployment (cost, simplicity, performance) and distributed deployment (scalability, high availability). By then, purely distributed and purely centralized databases will have been eliminated entirely; at least in OLTP scenarios, this is the inevitable outcome.

In the future, when users meet, they will no longer ask "Is your xx system distributed or centralized?", because such questions will be out of fashion. The conversation will instead go like this:

A: How many nodes are used for the database of your xx system?

B: Hi, ours is a small system with only one node. What about you?

A: We used 5 nodes. I met C two days ago and he said that they used 500 nodes.

    2. Technical feasibility of the "architectural integration" of centralized and distributed

In all likelihood, a database with such an integrated architecture will not be a brand-new product built from 0 to 1, but the result of upgrading and iterating existing products; after all, the players in the market are already numerous enough, and no new vendors are likely to join in the short term. Then, as long as a centralized database can achieve distributed deployment, or a distributed database can support centralized deployment, architectural integration is essentially achieved (in fact, such products have already begun to take shape).

Can a centralized database scale out into a fully distributed one?

Take Oracle, the strongest centralized database, as an example. According to its official website, Oracle RAC on shared storage can already support up to 100 nodes, computing power that could cover most needs in TP scenarios. Judging from real delivery cases, however, deployments generally do not exceed 4 nodes. Constrained by its design, RAC performance is hard to improve beyond 3 or 4 nodes, and when the data volume on shared storage grows too large, it hits performance bottlenecks that are difficult to overcome. Moreover, technology like Oracle RAC remains in the hands of very few centralized vendors and has never been widely adopted; indeed, it was precisely because RAC could not solve these problems that today's distributed architectures were created. So it is very hard for a centralized database to reach distributed deployments of hundreds or thousands of nodes: the core architecture would need so much adjustment that it would amount to rewriting a new product.

So, can a distributed database be deployed on a single machine?

In terms of functionality, distributed databases have in recent years essentially caught up with centralized ones, whether in SQL coverage, transactional ACID properties, or the surrounding ecosystem. Although they have not yet reached Oracle's level in the high-end market, they are already capable of replacing Oracle.

So all that remains is to examine, from the technical architecture, whether a distributed database can be deployed on a single machine. Most distributed databases comprise management nodes (metadata management, global transaction management), computing nodes, storage nodes, and other roles. This creates the illusion that a distributed database must, by its architecture, be composed of multiple node roles and can hardly shrink onto a single machine (cramming multiple roles onto one node always feels like neither fish nor fowl).

In actual engineering, however, these management roles are deployed separately only for ease of understanding, implementation, and resource isolation. Eliminating the multi-role split in favor of fully peer-to-peer deployment is the first condition for running a distributed database on a single machine; the second is flexible online controllability, for example adjusting the number of data replicas online from 1 to many. Quite a few distributed databases in China already implement peer-to-peer deployment. OceanBase still has singleton components such as the RootService cluster management service and the GTS global timestamp service, but these have been integrated into the computing node, so it can be regarded as essentially peer-to-peer; another distributed database, QianBase, goes further, using an HLC hybrid logical clock and timestamp ordering to remove the global clock service and distributing cluster management across the computing nodes, achieving full peer equality.
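The HLC mentioned above is what lets each node issue causally ordered timestamps without a central clock service. The sketch below is the generic hybrid logical clock algorithm, not QianBase's actual implementation (which is not public here); the class and method names are illustrative, and `pt` is a caller-supplied physical clock source.

```python
class HLC:
    """Hybrid Logical Clock sketch: timestamps are (l, c) pairs, where l
    tracks the max physical time seen and c breaks ties, so ordering is
    preserved even when physical clocks stall or drift."""

    def __init__(self, pt):
        self.pt = pt  # callable returning physical time (e.g. ms)
        self.l = 0    # wall-clock component
        self.c = 0    # logical counter

    def now(self):
        """Timestamp for a local or send event."""
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def update(self, l_msg, c_msg):
        """Merge the timestamp carried by a received message."""
        l_old = self.l
        self.l = max(l_old, l_msg, self.pt())
        if self.l == l_old == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)

# Even with a frozen physical clock, timestamps keep advancing.
clk = HLC(pt=lambda: 100)
print(clk.now())            # (100, 0)
print(clk.now())            # (100, 1)
print(clk.update(105, 0))   # (105, 1): jumps past the remote timestamp
```

The design point: each node can generate these timestamps independently, which is what allows the global clock role to be dissolved into the computing nodes.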

For this type of distributed database, then, when it is deployed on a single machine with the replica count set to 1, all the capabilities of the distributed architecture (except high availability) are retained, which essentially achieves single-machine deployment. What remains is to keep optimizing transaction processing so that single-machine performance matches that of a centralized deployment; at that point, "architectural integration" will be fully realized. Judging from current trends in technical architecture, cloud native, cross-region deployment, AI large models, and other emerging market demands, it is comparatively easier to evolve the future "integrated architecture" from a native distributed database.
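The "same code path, different replica count" idea can be sketched as follows. This is a hypothetical placement function, not any vendor's scheme: node names and the hash-ring-style placement are illustrative assumptions, chosen only to show how setting replicas to 1 on a single node degenerates into a centralized layout.

```python
import hashlib

def place_replicas(partition_key, nodes, replicas=3):
    """Illustrative replica placement: hash the partition key to pick a
    starting node, then take the next `replicas` nodes in ring order."""
    if replicas > len(nodes):
        raise ValueError("replica count cannot exceed node count")
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Distributed deployment: 5 nodes, 3 copies of each partition.
print(place_replicas("orders#42", ["n1", "n2", "n3", "n4", "n5"], replicas=3))

# "Centralized" deployment: the same code path, 1 node, 1 copy.
print(place_replicas("orders#42", ["n1"], replicas=1))  # ['n1']
```

Nothing in the function changes between the two calls; only the cluster topology and replica setting do, which is the crux of the architectural-integration argument.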

    3. A lasting balance between centralized and distributed cannot hold

Could centralized and distributed instead become "good friends" and complement each other? Impossible! Had there been no scenario conflict between the two, it might have worked, but that is no longer the case. Their overlapping scenarios keep growing, even faster than the overall market itself. These overlapping scenarios will not disappear, and users in them want everything at once: the low cost and simple operations of centralized, plus the high availability and scalability of distributed. In the end, fierce competition will produce exactly such integrated, "have it both ways" database products.

Moreover, if "architectural integration" never materializes, then as hardware performance and stability improve, centralized capabilities will keep growing and will swallow distributed scenarios until distributed systems lose all room to live. So at present, from both standpoints, each side is racing to break through the other's architectural advantage first. Centralized systems also harbor another hope: that before distributed architectural integration is completed, hardware will take a huge qualitative leap and centralized scenarios will expand greatly. But that probability seems rather small.

The current debate between distributed and centralized is temporary and short-lived. In this interim stage, both A and B meet the need, each with its own pros and cons, and the better C has not yet truly appeared, so there is nothing wrong with using either; it simply depends on which product's characteristics the user prefers. In the scenarios covered by both, drawing a verdict in favor of either one is a rather one-sided view.

Speaking as a database practitioner: within the next five years, genuinely "architecture-integrated" databases will appear and quickly capture the market. Within ten years, there will probably be no purely centralized or purely distributed databases left, while the market itself will have opened up new space.

Origin blog.csdn.net/CSHARP0409/article/details/134985276