Starting from storage-compute separation: the financial industry's road to distributed database transformation

China's database development officially began in the 1990s and has now spanned nearly 30 years. Around the year 2000, open-source databases led by MySQL gradually entered production workloads under the impetus of Internet companies. To make MySQL, whose single-instance capability is mediocre, meet high-performance requirements, these companies began splitting databases and tables through various layers of routing logic, forming a new technical route: the distributed database.

At the 2023 Huawei Financial Innovation Data Infrastructure Summit, Huawei and a group of financial-industry partners, including banks and securities firms, jointly released the OceanData distributed database storage solution, proposing professional all-flash storage and a storage-compute separation architecture to drive the innovation and upgrade of distributed databases and help financial-industry databases transform smoothly. The release sparked heated discussion in the industry. What is a "distributed database storage solution"? Why did Huawei release it jointly with financial-industry partners? And why so much emphasis on "storage-compute separation"? This article starts from these questions and analyzes in detail the solution Huawei released at the summit, as well as the story behind the distributed transformation of financial-industry databases.

Storage-compute separation: an important choice for financial database transformation

In the financial industry, the main database scenarios include core transactions, internet-finance apps, analytical applications, and internal applications such as office systems. Core transaction scenarios typically run on IBM mainframes with DB2 or minicomputers with Oracle, and place very high requirements on database latency, strong consistency, and availability. Internet-finance applications demand high concurrency and easy scaling, along with high data consistency and availability, and typically use MySQL-like distributed databases plus container technology.

Today, as financial business scales grow and the policy landscape changes, the database systems behind these business scenarios are gradually being migrated to domestic databases. However, outside non-critical scenarios such as office systems, the transformation of core business scenarios is still progressing slowly. The fundamental problem is that domestic databases, especially distributed ones, mostly adopt an integrated storage-compute architecture, which lacks availability, performance, and manageability guarantees at the data level and therefore struggles to meet the requirements of the financial industry.

From a technical point of view, it is indeed difficult for a database with an integrated storage-compute architecture to meet the demands of core financial scenarios. The integrated architecture stores data on the servers' local disks; because local disks lack reliability, such databases usually improve availability through a one-primary, multiple-standby design. But synchronizing data between primary and standby creates an irreconcilable contradiction: with synchronous replication, where the standby is kept logically consistent with the primary, the primary's performance suffers badly; with semi-synchronous or asynchronous replication, full consistency between primary and standby cannot be guaranteed. This contradiction cannot be resolved within the distributed system itself, which makes it almost impossible for such databases to satisfy the financial core's high-performance and strong-consistency requirements at the same time.
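The trade-off can be sketched with a toy latency model. This is illustrative only, with assumed numbers rather than measurements from any real deployment:

```python
# Toy model of the primary/standby replication trade-off (not any vendor's
# implementation). Latency figures are illustrative assumptions.

def commit_latency(local_write_ms, network_rtt_ms, mode):
    """Per-transaction commit latency for a one-primary/one-standby pair."""
    if mode == "sync":
        # Primary writes locally, ships the log over the network, and waits
        # for the standby's durable write: consistent but slow.
        return local_write_ms + network_rtt_ms + local_write_ms
    elif mode == "async":
        # Primary commits as soon as its own write lands: fast, but data in
        # flight is lost if the primary fails before the standby catches up.
        return local_write_ms
    raise ValueError(mode)

sync_ms = commit_latency(local_write_ms=0.5, network_rtt_ms=2.0, mode="sync")
async_ms = commit_latency(local_write_ms=0.5, network_rtt_ms=2.0, mode="async")
print(sync_ms, async_ms)  # synchronous mode pays the full round trip on every commit
```

The point is structural: whichever mode is chosen, either every commit pays the network round trip or consistency is sacrificed.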

For financial core scenarios, disaster recovery is a mandatory capability. The "Commercial Bank Data Center Supervision Guidelines" require commercial banks with total assets above 100 billion yuan, as well as provincial rural credit cooperatives, to build remote disaster recovery centers reaching at least RTO < 2 days and RPO of 0-30 minutes. In practice, the core systems of the five major banks and leading joint-stock banks are required to reach the highest level: RTO measured in minutes and RPO = 0. The primary-standby replication problems described above become even more pronounced in disaster recovery, because longer links bring more complex failure modes (link jitter, fiber degradation, and so on), and no vendor has yet maturely solved these problems on servers plus Ethernet. As a result, almost all databases with an integrated storage-compute architecture use a single cluster stretched across sites with asynchronous replication, and there are essentially no cases of strongly consistent disaster recovery for distributed databases under core business.
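To make the RPO figures concrete, here is some back-of-the-envelope arithmetic on what asynchronous replication lag means at transaction scale. The throughput and lag numbers are assumptions for illustration:

```python
# Illustrative arithmetic only: how asynchronous replication lag translates
# into data loss (the RPO window) when the primary site fails.

def lost_transactions(tps, replication_lag_s):
    """Transactions committed on the primary but not yet shipped at failure time."""
    return int(tps * replication_lag_s)

# A hypothetical core doing 5,000 TPS with 30 seconds of replication lag:
print(lost_transactions(5000, 30))  # 150000 transactions fall inside the RPO window
```

An RPO of even 30 seconds can thus mean six-figure transaction loss at core-banking throughput, which is why RPO = 0 is demanded.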

In terms of manageability, the integrated architecture also has many problems. Because compute and storage are tightly bound, resources cannot be expanded independently, so one side is inevitably wasted; in practice, limited server capacity means compute is usually the side wasted more. Many financial users that led the way in distributed database transformation already run close to ten thousand servers, yet CPU utilization is below 10%. Those servers also carry hundreds of thousands of hard disks, whose health and failure management is an even bigger headache for operations and maintenance.


Analyzing the root causes of these problems: beyond the "CAP curse" that no distributed system can escape, the fundamental issue is that servers themselves are not designed with much attention to data reliability or manageability, while today's database vendors mostly possess only northbound capabilities for user-facing functionality and lack the southbound technical accumulation to improve data availability and management efficiency. They cannot compensate for the servers' deficiencies, so the database system as a whole fails to meet core financial requirements.

Today, with changes in policy and business needs, moving domestic distributed databases into the core is the general trend. Rather than waiting for the application layer's slow technological evolution, it is clearly more feasible to hand professional tasks to professionals. Professional storage focuses on high data availability, high performance, and easy management, and its technology has evolved for nearly 40 years; with that accumulation and practical experience, it can close the loop on many distributed database problems faster. More importantly, professional storage leads the distributed database toward a Share Storage data-sharing architecture, which makes it feasible to change the underlying logic of primary-standby data synchronization, and that logic is the root of the contradiction between performance and strong consistency in distributed databases. These points constitute the technical inevitability of distributed databases moving toward storage-compute separation. Today, AWS Aurora, openGauss, PolarDB, and TDSQL are all converging on storage-compute separation architectures; this is no coincidence.

How does the distributed new core live up to the name of "core"?

For various reasons, some large banks have begun transforming their core systems through database and table sharding, using distributed databases to build a new generation of core systems: the distributed new core. However, as noted above, a bank's core system has extremely stringent reliability requirements that most current distributed databases cannot meet, so the business carried by the new distributed core is often not all that "core". How can the distributed new core live up to the name? Huawei's OceanData distributed database storage solution gives its own answer.

Huawei's OceanData distributed database storage solution builds on storage-layer synchronous replication to realize dual-cluster disaster recovery for distributed databases. Compared with the traditional approach of stretching a single distributed database cluster across sites, this solution effectively achieves fault isolation between clusters, and it is currently the only solution that meets financial core disaster recovery requirements. So how did Huawei do it?

Take the following figure as an example. Most current distributed databases use the replication method on the left: the primary node's logs are synchronized to each standby node over the Ethernet network between servers. Whether logical or physical replication is used, the biggest problem with this approach is that servers lack data high-availability guarantees, so the application layer must handle the many split-brain, link-jitter, bad-block, bit-error, and packet-loss problems that arise in remote disaster recovery, and no current database can guarantee all of them. Many databases rely on the Paxos protocol just to handle split brain, which already has a huge impact on business performance. This is why no distributed database currently supports multi-site strong-consistency disaster recovery with RPO = 0.

[Figure: comparison of replication methods; the left side shows server-to-server log replication between primary and standby nodes]

Huawei's OceanData distributed database storage solution has the distributed database write its Redo Log stream to storage; the powerful synchronous replication of OceanStor storage then copies the logs to the storage in the remote disaster recovery data center, where real-time log replay keeps the standby database consistent with the primary. Thanks to the strong replication capability of professional storage, a transaction on the primary does not have to wait for a successful write at the remote standby; it can commit as soon as the log is written, which greatly improves performance compared with today's distributed databases. In this solution, the primary data center and the disaster recovery data center run two isolated clusters, completely separated from the data plane to the management plane, avoiding the risk that a single failure paralyzes both clusters at once. Many readers will have noticed that Huawei's solution resembles Oracle's ADG implementation, and Oracle indeed achieved this by relying on the mature, powerful replication capabilities of enterprise-class storage.
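The log-shipping-and-replay idea can be sketched in a few lines. This is a toy model of the general technique, not Huawei's actual code: the primary commits once its redo record is durable, and the standby reconstructs an identical state by replaying the shipped log in order:

```python
# Toy sketch of redo-log shipping and replay (illustrative, not vendor code).

class Node:
    def __init__(self):
        self.data = {}      # key -> value: the "database" contents
        self.redo_log = []  # ordered redo records

    def commit(self, key, value):
        # Durable log write is the commit point; the data page follows.
        self.redo_log.append((key, value))
        self.data[key] = value

    def replay(self, records):
        # Standby applies redo records in order to reconstruct primary state.
        for key, value in records:
            self.data[key] = value

primary, standby = Node(), Node()
primary.commit("acct:1", 100)
primary.commit("acct:1", 80)
primary.commit("acct:2", 50)
standby.replay(primary.redo_log)     # storage-level replication + replay
print(standby.data == primary.data)  # True: standby is consistent with primary
```

Because replication happens at the log level below the database, the primary's commit path never waits on the remote site.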

Beyond storage synchronous replication, the OceanData solution also has an exclusive answer to the problem of degraded link quality in remote disaster recovery. In traditional disaster recovery solutions, degradation of the replication link can only be detected through heartbeat packets between the primary and disaster recovery centers; detecting the fault and switching to a redundant link takes several minutes, during which a financial enterprise may see thousands of failed transactions. Huawei's solution uses SOCC technology coordinating WDM devices and storage devices to detect link degradation within milliseconds and complete link switchover within two seconds, keeping the replication link unobstructed at all times and avoiding the performance impact and transaction failures that jitter causes in front-end business. The powerful, effective reliability guarantees of professional storage have always been the guardian of financial enterprises' core assets; the era of the distributed new core is no exception.
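The difference between minutes-level and seconds-level failover is easy to quantify. The throughput and timing figures below are assumptions for illustration, not vendor benchmarks:

```python
# Back-of-the-envelope arithmetic: transactions at risk during a
# replication-link failover at a given throughput. Assumed numbers only.

def transactions_at_risk(tps, detect_s, switch_s):
    """Transactions issued while the replication link is unavailable."""
    return int(tps * (detect_s + switch_s))

# Heartbeat-based detection: assume ~3 minutes to detect plus 1 minute to switch.
heartbeat = transactions_at_risk(tps=3000, detect_s=180, switch_s=60)
# SOCC-style detection: millisecond detection is negligible; 2 s switchover dominates.
socc = transactions_at_risk(tps=3000, detect_s=0, switch_s=2)
print(heartbeat, socc)
```

At an assumed 3,000 TPS, a minutes-long blind window exposes hundreds of thousands of transactions, versus a few thousand with second-level switchover.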

How can financial microservice transformation achieve financial-grade high reliability?

Thanks to lightweight deployment and agile management, container technology has gradually seen wide use in the financial industry. Many financial companies choose to deploy fast-changing businesses such as internet-finance apps on containers, and a wave of microservice transformation has also emerged for some traditional applications.

According to the "Container Stateful Application Research Report", databases are the most containerized category of application, driven by a variety of application needs. However, containers' reliability guarantees for applications are actually rather weak; for stateful applications such as distributed databases in particular, many failure scenarios require manual intervention. Yet for financial enterprise applications, reliability is the one feature that can least be ignored. Imagine a microservice-transformed environment with tens of thousands of pods running at once: if every error required manual intervention, operation and maintenance costs would be unimaginable, and the stability of the container cluster would be completely uncontrollable.

To address this pain point, Huawei's OceanData distributed database storage solution uses a self-developed container storage interface to create a highly reliable, agile containerized deployment solution for distributed databases, giving financial microservice-transformed applications financial-grade high reliability as well.

First, some background on traditional container fault handling. Taking the most commonly used Kubernetes as an example, stateful applications are generally managed with the StatefulSet controller. When a node fails, all pods on it lose contact with the Kubernetes management node. Because Kubernetes cannot tell whether the node has truly failed, or whether the pods and services on it have actually stopped, it will not automatically restart those pods on another healthy node, as doing so could cause unpredictable business errors. Instead, manual intervention is required to forcibly remove the failed node from the cluster, after which Kubernetes restarts the pods and services on a new node.

Even once the failed pod is restarted, a database with an integrated storage-compute architecture faces another problem: data recovery. Since the data on the failed node's local disk is inaccessible, the new node must rebuild the data in full before it can become a real replica, usually by copying from the primary node. Depending on business pressure, rebuilding a single node generally takes 6-10 hours, during which business performance and reliability are affected.
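The hours-long rebuild figure follows directly from data volume and usable network bandwidth. The inputs below are illustrative assumptions:

```python
# Illustrative arithmetic: how long a full replica rebuild takes when all
# data must be copied over the network. Inputs are assumptions.

def rebuild_hours(data_tb, usable_gbps):
    """Hours to copy data_tb terabytes at usable_gbps gigabits per second."""
    bits = data_tb * 8 * 1000**4           # TB -> bits (decimal units)
    return bits / (usable_gbps * 1e9) / 3600

# A hypothetical 10 TB replica with ~3 Gbps of usable replication bandwidth:
print(round(rebuild_hours(data_tb=10, usable_gbps=3), 1))  # ~7.4 hours
```

Remounting shared storage avoids this copy entirely, which is why only the delta written during the outage needs to be synchronized.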


To address the above pain points, Huawei's OceanData distributed database storage solution improves database reliability from two aspects.

First, based on storage-compute separation, the database is converted to a Share Storage architecture: data availability is guaranteed by professional storage, so no data is lost when a server fails. When a new node is brought up, the data can simply be remounted to it, and only the data written during the failure window needs to be synchronized incrementally. Recovery time shrinks to about 5 minutes, greatly reducing business impact and reliability risk.

Second, by monitoring the IO state between storage and host, the solution proactively detects server failures, actively evicts faulty nodes through Huawei's self-developed container storage interface, and mounts the faulty pod's PV to the newly launched pod, making up for native Kubernetes's reliance on manual handling. Relying on the advantages of storage-compute separation, Huawei's solution effectively solves these inherent problems of cloud-native deployment, enabling financial microservice transformation to truly meet financial-grade high-availability requirements.
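The recovery flow just described can be simulated in miniature. This is an assumed sketch of the decision logic, not Huawei's CSI code: the storage side's IO state identifies the dead node, the affected pod is evicted, and its persistent volume is re-attached to a replacement node:

```python
# Minimal simulation of storage-driven pod recovery (assumed logic).

def recover(pods, io_alive):
    """pods: {pod_name: node}; io_alive: {node: bool} as seen by the storage array.
    Returns the list of recovery actions taken."""
    actions = []
    for pod, node in list(pods.items()):
        if not io_alive.get(node, False):
            # Storage has lost IO from this node: treat it as failed,
            # evict the pod, and re-attach its volume elsewhere.
            actions.append(f"evict {pod} from {node}")
            pods[pod] = "node-spare"  # hypothetical replacement node
            actions.append(f"remount PV of {pod} on node-spare")
    return actions

acts = recover({"mysql-0": "node-a", "mysql-1": "node-b"},
               {"node-a": False, "node-b": True})
print(acts)  # only the pod on the dead node is touched
```

The key point is that the failure signal comes from the storage path rather than from Kubernetes heartbeats, so no human has to arbitrate whether the node is really dead.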

Can traditional cores adopt distributed databases without sharding databases and tables?

Having covered the scenarios of distributed transformation, let's finally discuss this topic: can we use distributed databases without sharding databases and tables? I believe this is the wish of many DBAs. Distributed transformation of a traditional core involves splitting and sharding large tables, rewriting stored procedures, and reworking upper-layer business logic for unitization, transaction-consistency guarantees, and serialization corrections, all of which consume enormous time and manpower and are hard to get perfect. Even after the transformation, cross-table joins and transactions execute far less efficiently, and many stored procedures and large transactions can no longer be used, creating further difficulties for later development.

Although sharding does have advantages in high-concurrency scenarios, for most non-Internet companies the cost is out of proportion to the benefit. Is there a way to help financial companies smoothly clear this hardest hurdle? Huawei's OceanData distributed database storage solution has an answer here too.

Tracing back to the source, why do distributed databases need to shard databases and tables at all? The fundamental reason is that the single-instance capability of open-source ecosystem databases, represented by MySQL, is no match for commercial databases such as Oracle. With RAC, multiple Oracle instances can read and write concurrently, performance scales up, and a single node's failure does not affect the business. Because MySQL does not support RAC, it can at most do single-writer, multi-reader; performance hits a bottleneck, and a primary switchover interrupts the business. To shrink the performance impact and the blast radius, the only option has been to shard. If an open-source database such as MySQL could also read and write on multiple nodes, even with a RAC-like mechanism, would sharding still be necessary?
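For readers unfamiliar with sharding mechanics, here is a minimal illustration of the generic technique (not any product's router): each row is hashed to one MySQL instance, which is precisely why cross-shard joins and transactions need extra application-level machinery:

```python
# Generic hash-based shard routing (illustrative; shard names are hypothetical).

import zlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def route(customer_id: str) -> str:
    """Deterministically map a shard key to one database instance."""
    return SHARDS[zlib.crc32(customer_id.encode()) % len(SHARDS)]

a, b = route("cust-1001"), route("cust-2002")
print(a, b)  # two customers may land on different instances...
# ...so a JOIN or transaction spanning them cannot run inside one database.
```

Every query must carry the shard key, and anything that spans shards (joins, stored procedures, large transactions) falls back to slower application-side coordination, which is the pain the article describes.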


Huawei's OceanData distributed database storage solution starts from exactly this breakthrough. On top of shared storage, Huawei has launched two engines, the "Shentian" multi-write engine and the "DBStor" shared acceleration engine, to realize multi-read, multi-write for open-source ecosystems such as MySQL.

The "Shentian" multi-write engine implements ShareMemory on top of ShareStorage. It changes the traditional logic of doing distributed transformation at the persistence layer into distributed transformation at the cache layer; through high-performance data processing in the cache layer and RDMA's high-performance communication, it makes up for the performance and reliability problems brought by persistence-layer data synchronization. The "Shentian" multi-write engine mainly provides four functions:

  • Global resource management: registers global node resources and transfers page and lock resources, uniformly managing and allocating the resources scattered across each node's cache, providing an execution-level guarantee for global concurrent reads and writes;
  • Global lock management: handles the logic for requesting and releasing lock resources between nodes, supporting spinlocks, distributed latches, deadlock detection, and more, serving as the main guarantor of transaction consistency during global concurrent reads and writes;
  • Global page management: handles the logic for loading, requesting, and releasing page resources between nodes, providing a logical-level guarantee of a consistent global resource view to support concurrent reads and writes and distributed MVCC;
  • Global cluster management: registers node information on the shared storage, manages each node's process and network status, and exchanges status information with other nodes and the shared storage; when a single node fails, it triggers arbitration, leader election, and other recovery processes, working with the shared storage to keep the whole cluster serving externally.
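The lock- and page-management functions above can be sketched with a toy global lock manager. This is illustrative only, not Shentian internals: a node must hold a page's lock before writing its cached copy, which is what keeps a multi-writer cache coherent:

```python
# Toy global page-lock manager (illustrative sketch, not engine internals).

class GlobalLockManager:
    def __init__(self):
        self.owners = {}  # page_id -> node currently holding the write lock

    def acquire(self, node, page_id):
        """Grant the lock if the page is free or already owned by this node."""
        holder = self.owners.get(page_id)
        if holder in (None, node):
            self.owners[page_id] = node
            return True   # granted: the node may modify its cached page
        return False      # another node holds it: wait for a lock/page transfer

    def release(self, node, page_id):
        if self.owners.get(page_id) == node:
            del self.owners[page_id]

glm = GlobalLockManager()
print(glm.acquire("node-1", "page-42"))  # True: first writer gets the lock
print(glm.acquire("node-2", "page-42"))  # False: must wait for the handover
glm.release("node-1", "page-42")
print(glm.acquire("node-2", "page-42"))  # True: lock handed over to node-2
```

A real engine would add the page-content transfer, deadlock detection, and RDMA transport described above, but the ownership rule is the core of multi-primary cache coherence.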

On the basis of global data sharing, the "Shentian" multi-write engine unifies data at the cache layer, so that what used to be primary and standby nodes can maintain real-time strong consistency and evolve into a multi-primary architecture with concurrent reads and writes, greatly improving the cluster's performance, throughput, and reliability.

The "DBStor" shared acceleration engine mainly builds on ShareStorage to enhance read and write capability. Its main functions include:

  • Database IO protocol stack simplification: reduces the frequent context switches database IO makes between the database, file system, and block device on the host side, cutting IO path latency;
  • Database operator pushdown: perceives database semantics, pushes some SQL operations such as large-table scans down to storage, and provides index storage, greatly optimizing SQL execution efficiency;
  • High-speed network: processes IO over Huawei's NoF+ high-speed network, greatly optimizing transmission latency and keeping the network stable and free of jitter.
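Operator pushdown, the second item above, is a generic technique worth illustrating (this sketch is not DBStor internals): filtering rows at the storage layer, instead of shipping the whole table to the database host, slashes the data moved across the network:

```python
# Generic illustration of predicate pushdown (not DBStor internals).

# A hypothetical 100,000-row account table living on the storage side.
table = [{"id": i, "balance": i * 10} for i in range(100_000)]

def scan_no_pushdown(table):
    # Without pushdown: the entire table crosses the network to the host,
    # which then filters it.
    return table

def scan_with_pushdown(table, min_balance):
    # With pushdown: storage evaluates the predicate and ships only matches.
    return [r for r in table if r["balance"] >= min_balance]

full = scan_no_pushdown(table)
pushed = scan_with_pushdown(table, 999_000)
print(len(full), len(pushed))  # 100000 vs 100 rows shipped
```

For selective predicates on large tables, the rows transferred (and the host CPU spent filtering them) drop by orders of magnitude, which is where the claimed SQL efficiency gains come from.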

The "DBStor" shared acceleration engine improves the open-source database's execution efficiency, greatly increasing its performance and substantially closing the processing-capability gap with Oracle.

From past to present, the demands of financial core systems have not fundamentally changed, so the ideas for solving them should be similar. Huawei's OceanData distributed database storage solution follows this philosophy: with professional storage plus innovative engines, it fills the gap left by distributed databases' lack of counterparts to Oracle ASM and RAC, moves the distributed logic of the persistence layer up to the cache layer, and solves the industry-wide problem that "distributed databases must shard databases and tables". Judging from results in production, large-table performance is basically on par with Oracle, node failures no longer interrupt the business, and reliability has also caught up with Oracle. The distributed transformation of core-system databases finally has a smooth path that requires no sharding.

Amid the general trend of demand and policy, database transformation in the financial industry is inevitable. Yet whichever path is taken, high availability remains the financial industry's constant core demand. Over the past few decades, the good cooperation between traditional databases and highly reliable professional storage gradually built the financial industry's reliable, highly available systems; now, in the new ecosystem of distributed databases, we can see that storage-compute separation plus professional storage remains a solid guardian of database high reliability.


Origin blog.csdn.net/dobigdata/article/details/130077341