Cloud Database Technology Salon | Demystifying Data Replication Technology under Multi-Cloud and Multi-Source - NineData

Abstract: With the arrival of the data intelligence era, data management under a multi-cloud, multi-source architecture has become essential infrastructure for enterprises. We believe that data access, data integration and distribution, data security, and data quality are the foundation of such an architecture, as well as the starting point for building it. This talk introduces NineData, a cloud-native data management platform for multi-cloud and multi-source environments, with a focus on data management and replication technologies for MySQL and ClickHouse.

The MySQL x ClickHouse special session of the first Cloud Database Technology Salon of 2023 was successfully held at the Hangzhou Haizhi Center. The salon was co-hosted by NineData, Caigen Development, and Liangcang Taiyan. Chen Changcheng (Tianyu), Vice President of Jiuzhang Arithmetic Technology, presented the talk "Demystifying Data Replication Technology under Multi-Cloud and Multi-Source - NineData".

This article is organized from the recording of the talk and the presentation slides.

Chen Changcheng (Tianyu), Vice President of Jiuzhang Arithmetic Technology and former senior technical expert at Alibaba Cloud, has worked in the database field for 15 years. He led the evolution of Alibaba's database infrastructure (the move off IOE to distributed systems, multi-site active-active deployment, containerization, and separation of storage and compute) as well as the construction of its cloud-native database tool ecosystem.

My topic, "Demystifying Data Replication Technology under Multi-Cloud and Multi-Source - NineData", mainly covers the product and its underlying technology, so it should be fairly easy to follow.

Hello everyone, I am Chen Changcheng from the Jiuzhang Arithmetic team. Our team's mission is to empower everyone with data and the cloud. Many of our members are senior technical experts from Alibaba, Huawei, and IBM. Our CEO, Ye Zhengsheng, previously served as general manager of product management and solutions in the Alibaba Cloud database division. I personally spent more than ten years at Alibaba, mainly responsible for the evolution of the infrastructure PaaS layer and the construction of the cloud-native database tool system.

Let me introduce the agenda for this session. First, I will discuss the opportunities and challenges enterprises face in digital transformation in the multi-cloud and data era. Then I will introduce the NineData product platform, dive into NineData's data replication technology, and close with two typical customer cases.

First, let's look at a Gartner report, which shows that more than 80% of enterprises will choose multi-cloud or hybrid cloud. According to a Percona report, more than 70% of enterprises will use multiple databases to handle multi-source data. Industry analysis also shows that enterprises that make effective use of multi-source infrastructure and new data architectures see significant gains in innovation capability and overall profitability. However, in this new multi-cloud and data era, enterprises still face problems that urgently need to be solved, such as data silos, the complexity of managing multi-source data, and development efficiency.

Looking at enterprise data governance and data management as a whole, there are many issues across every aspect. From NineData's perspective, the question is how to solve these problems.

Here is how we think about it. Across the entire enterprise data management process, we use built-in security capabilities to improve development efficiency and security, applying security throughout the full lifecycle of database design, development, change, and release. At the same time, we use multi-source data replication technology to keep data flowing; we use data backup and online query technology to protect enterprise data assets and unlock the value of cold data; and we use innovative techniques to protect data assets and increase data value. In addition, we improve enterprise data quality through technical means such as structure comparison, data comparison, and multi-environment data comparison.

My second topic is the overall architecture of the NineData data management platform. The bottom layer is an infrastructure designed for multi-cloud and multi-source, on top of which sit the four core functional modules introduced earlier: backup, replication, comparison, and SQL development. Next, I will introduce the platform from the multi-cloud and multi-source perspectives.

From the multi-cloud perspective, the main purpose of NineData is to help enterprises manage data sources scattered across multiple clouds or a hybrid cloud. To this end, we designed a flexible cloud-native architecture, an elastic architecture, and the corresponding network structure, so that enterprises can manage data sources spread across cloud vendors and on-premises environments. We also offer dedicated clusters so that enterprises can use their own resources exclusively; for example, worker nodes can be placed on-premises or inside the enterprise's VPC, keeping data in a closed internal loop and improving both data security and worker execution efficiency.

As a cloud-native SaaS product, NineData naturally provides on-demand deployment and elastic scaling.

On the enterprise network side, many customers are unwilling to expose their database's public network port, especially for critical data. We therefore designed a database gateway. Users only need to install the gateway, which connects out to our central management node and establishes a reverse access channel, enabling unified management of data sources scattered across locations, including internal ones. In addition, NineData workers can be deployed in the user's own environment so that the data link forms a closed internal loop: the management plane controls instance-level tasks through the central console, while the entire data channel stays inside the user's environment, achieving unified management.

On the multi-source side, we designed a unified data access layer. To support many data sources, we abstracted the data source concept and built a unified abstraction covering attribute configuration, connection management, connectivity checks, and security authentication. In this way, all data sources can be connected in a uniform manner, and the four main functional modules all work against the same data source abstraction, so a single registration enables every feature. For users, the key point is that once all data sources are registered, they can be managed in a unified way.

Next, let me introduce SQL development. NineData aims to let enterprises manage and query their online data with our product, and to manage the entire lifecycle from data development to change release in a unified way. The NineData Personal Edition is comparable to client tools such as MySQL Workbench or Navicat, except that it is delivered as SaaS; we will also launch a client version in the future. For the Enterprise Edition, we aim to improve efficiency through a visual interface for database design, development, and change release, while building security and permission capabilities into the entire production process.

How do we improve the efficiency of individuals and teams? Within an enterprise there are usually multiple environments that need things like structure synchronization, and different environments have different requirements. For an agile business, you may want less administration so you can innovate faster; for core businesses, you will pay more attention to stability and permission management.

To address this, NineData provides many SQL development rules covering a wide range of situations, such as how to determine whether a large-scale data change will affect the stability of the production database and how to isolate or block it, and, for table structure changes, how to determine whether an online DDL will add significant load. In addition, users can define safe execution policies, and we provide more than 90 development rule templates that can be used as-is or customized.

NineData SQL development has always been committed to improving user efficiency. Large-model intelligence such as ChatGPT is currently very popular in the industry, and we also use this technology to strengthen NineData's product capabilities and improve efficiency for users. In particular, users can quickly query the data they need through a natural-language dialogue against a data source; for example, a user can easily find employees whose salary exceeds a certain amount. NineData also supports optimizing SQL statements and table structures for database developers and DBAs. Everyone is welcome to try it; you may be pleasantly surprised.

Next, I will focus on some of NineData's data replication technology. For multi-source, multi-cloud data replication, the main challenge is that there are many types of databases, each with independently designed data types and data structures, so enabling data to flow freely between them is a major challenge. In addition, cloud vendors provide rich data sources, but there are subtle and sometimes large differences among them. And as customers' businesses grow, their data centers and international operations may require data topologies that span long geographic distances.

The figure below shows a typical data replication link topology. After you configure the source and target, NineData starts the link. A pre-check runs first to verify the network connection, account credentials, and so on. Then the structure is migrated, and full and incremental data are fetched and written. Our design goal is to support many data sources with good extensibility, so that new data sources can be integrated easily. As just mentioned, there are many different data sources with subtle differences between them, so to keep a synchronization link running stably over the long term, we also invest heavily in its observability and the ability to intervene.

Within the transmission kernel module, the most fundamental capability is ensuring consistency, which is our bottom line. High throughput and low latency are also required. The rest of this talk is organized around these points.

We have just briefly introduced the architecture of NineData's replication module. Compared with other products in the industry, we invest heavily in performance metrics collection, exception handling, and data consistency.

First, let's look at throughput. Take full-data migration as an example: suppose the source has many tables of different sizes to process. If we start three concurrent threads, the small tables may finish quickly while a large table is still being processed by a single thread; table-level concurrency alone has the same problem. To improve overall efficiency, we therefore need intra-table concurrency, which requires splitting a table into even slices. Our strategy is one-click splitting by default, slicing by primary key, then non-null unique key, then nullable unique key, then ordinary key, and so on, in order to achieve the most balanced concurrent processing possible. A sketch of the idea follows.
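
To make the slicing strategy more concrete, here is a minimal Python sketch of intra-table concurrency under a few simplifying assumptions: the table has a dense numeric primary key, and `src`/`dst` are hypothetical connection wrappers with `fetch_one`, `fetch_all`, and `bulk_insert` helpers. It illustrates the idea rather than NineData's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 50_000  # rows per slice; tuned per table in practice

def make_chunks(lo, hi, size=CHUNK_SIZE):
    """Split the primary-key range [lo, hi] into contiguous, roughly even slices."""
    chunks, start = [], lo
    while start <= hi:
        end = min(start + size - 1, hi)
        chunks.append((start, end))
        start = end + 1
    return chunks

def copy_chunk(src, dst, table, pk, lo, hi):
    """Read one slice from the source and bulk-write it to the target."""
    rows = src.fetch_all(f"SELECT * FROM {table} WHERE {pk} BETWEEN %s AND %s", (lo, hi))
    dst.bulk_insert(table, rows)

def copy_table(src, dst, table, pk="id", workers=8):
    """Copy one table with intra-table concurrency instead of one thread per table."""
    lo, hi = src.fetch_one(f"SELECT MIN({pk}), MAX({pk}) FROM {table}")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk_lo, chunk_hi in make_chunks(lo, hi):
            pool.submit(copy_chunk, src, dst, table, pk, chunk_lo, chunk_hi)
```

A real splitter also falls back to non-null unique keys, nullable unique keys, or ordinary keys when there is no suitable primary key, as described above.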

Once fetch performance on the source is good and scales linearly, the throughput bottleneck is usually no longer the channel but the write into the target database, so the write strategy matters a great deal. If every SQL statement has to be parsed individually on the target side, performance will inevitably suffer, so we adopt batch submission. Compression is another consideration: on machines with few CPU cores, enabling compression can hurt performance significantly, while in our tests, once there are more than about four cores, whether compression is enabled makes relatively little difference.
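
As an illustration of batched writes (a sketch, not NineData's writer), the helper below groups rows into a single multi-row INSERT so the target parses far fewer statements; the connection details, table, and column names are assumptions.

```python
import pymysql  # assuming a MySQL-compatible target

def batch_insert(conn, table, columns, rows, batch_size=500):
    """Write rows as multi-row INSERTs to cut per-statement parse cost on the target."""
    placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    prefix = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    with conn.cursor() as cur:
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            stmt = prefix + ", ".join([placeholder] * len(batch))
            flat = [value for row in batch for value in row]
            cur.execute(stmt, flat)
    conn.commit()

# Hypothetical usage:
# conn = pymysql.connect(host="target-host", user="repl", password="...", database="db")
# batch_insert(conn, "orders", ["id", "status", "amount"], fetched_rows)
```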

During use you may encounter issues such as this: with ordinary concurrent writes, 100 GB of data on the source can become 150 GB on the target, because out-of-order commits within a single table can leave data holes. NineData therefore optimizes both the slice size and the ordering of concurrent commits.

Because data in a full copy flows through in a single pass to the target and is then discarded, the objects involved are short-lived. We therefore apply targeted tuning to the JVM parameters based on this characteristic of the full-replication model.

The second part is low latency. How do we build for low latency? We consider it from two angles. The first is the performance of the channel itself, including techniques such as batching and hot-data merging. For hot-data merging, if a record changes from A1 to A2 and then to A3, the usual synchronization model replays every modification; with hot-row merging enabled, the change can be mapped directly to A3, without inserting A1 or updating to A2. This writes the final state of the data directly, merging the queue in memory. There are other performance-level techniques as well; for example, in the Redis replication link we reduce the serialization cost of the queue, so that the overhead of the queue itself is minimized.
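
The hot-row merging described above can be sketched roughly as follows: changes queued for the same key are collapsed in memory so only the final state is written. The event shape (op, key, row) is an assumption for illustration.

```python
from collections import OrderedDict

def merge_hot_rows(events):
    """Collapse multiple changes to the same key into one final-state event.

    E.g. INSERT A1 -> UPDATE A2 -> UPDATE A3 becomes a single INSERT carrying A3's values."""
    merged = OrderedDict()  # key -> (op, row)
    for op, key, row in events:          # op in {"insert", "update", "delete"}
        if key not in merged:
            merged[key] = (op, row)
            continue
        prev_op, _ = merged[key]
        if op == "delete":
            if prev_op == "insert":
                del merged[key]          # insert then delete cancels out entirely
            else:
                merged[key] = ("delete", row)
        else:
            # keep the earliest op type (an insert stays an insert) but the latest values
            merged[key] = (prev_op if prev_op == "insert" else op, row)
    return list(merged.items())
```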

The management layer is also very important for low latency; this is experience gained from many years of practice. Two examples. The first: replication falls behind, but the logs on the database server have already been purged. As a cloud-native product, what do we do? We check, through the cloud vendor's interface, whether the logs have been uploaded to OSS or another object store; if so, we automatically fetch them and continue from the previous position, avoiding a full re-pull and reducing delay.

The second: we also record a safe position. Suppose there are 16 threads writing concurrently to the target, each committing different tables and records. When recording the checkpoint, we record the position of the slowest of the 16 to avoid data loss, which means the other 15 threads are ahead of the checkpoint. If the process crashes and restarts, the faster threads will replay data they have already applied, which can make it look to the user as if the data has rolled back.

We have made two optimizations for this. The first is table-level positions: each table keeps its own latest position, and during replay any event at or before a table's position is discarded, avoiding an apparent position rollback. The second applies to normal operations and maintenance: before shutting down, we drain and commit everything in the queue so that all 16 threads reach a consistent position, and only then stop the process. This gives us a clean shutdown, so users never see data being replayed after a restart, which is one of the most elegant approaches.
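
A simplified sketch of this checkpointing scheme, with illustrative names: the global restart position is the slowest writer's position, while each table keeps its own latest position so that already-applied events can be discarded during replay.

```python
class CheckpointTracker:
    def __init__(self, num_writers):
        self.writer_pos = [None] * num_writers   # last committed log position per writer
        self.table_pos = {}                      # table -> latest committed log position

    def on_commit(self, writer_id, table, position):
        self.writer_pos[writer_id] = position
        prev = self.table_pos.get(table)
        if prev is None or position > prev:
            self.table_pos[table] = position

    def safe_position(self):
        """Global restart point: the slowest writer, so no committed data is ever skipped."""
        committed = [p for p in self.writer_pos if p is not None]
        return min(committed) if committed else None

    def should_apply(self, table, position):
        """On replay after a crash, discard events a table has already applied."""
        prev = self.table_pos.get(table)
        return prev is None or position > prev
```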

Ensuring consistency is even more important, and there are two key aspects. The first is the consistency of the data itself. Suppose we have five transactions, T1 through T5, where B1 is an order-status record: the order is created as B1, the user's payment updates it to B3, and at that point a logistics record L is generated. With ordinary row-level synchronization, orders and logistics records live in different tables and therefore in different queues, and row-level concurrency cannot guarantee their order. As a result, B1 and L may appear in the target database at the same time; in other words, the order has only just been created, yet the logistics record already exists. For an online business this clearly violates business logic and cannot support normal operation.

We therefore built a transaction consistency capability that users can enable. When it is enabled, we check whether each record in transaction T3 depends on any earlier transaction; if it does, T3 waits for those transactions to commit before committing itself, ensuring data consistency. So the first round commits only T1, T2, and T4, while T3 commits after T2 has committed. This synchronization mechanism guarantees data consistency.
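
A minimal sketch of the dependency check, under the assumption that each transaction can be reduced to the set of row keys it touches (the data structures are illustrative, not NineData's internals):

```python
class TxnScheduler:
    """Hold a transaction until every earlier transaction touching the same rows commits."""

    def __init__(self):
        self.in_flight = {}   # txn_id -> set of row keys the transaction touches

    def submit(self, txn_id, row_keys):
        keys = set(row_keys)
        # earlier, still-uncommitted transactions that share at least one row
        deps = {t for t, touched in self.in_flight.items() if touched & keys}
        self.in_flight[txn_id] = keys
        return deps            # caller commits txn_id only after all deps have committed

    def committed(self, txn_id):
        self.in_flight.pop(txn_id, None)


# In the example from the talk: T3 touches the same order row as T2, so T3 must wait
# for T2, while T1, T2, and T4 share no rows and can be committed concurrently.
```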

The second issue is consistency on the reader/parser side. Take table structure as an example: when the table structure changes, the usual solution is to query the source for the current structure. Most data logs contain only the data and the table name, without structure or type information, so the parser has to query the data source for the structure and stitch the final result together. However, by the time of that query a second DDL may already have been applied at the source, so the parser gets the already re-modified structure, and the stitched data ends up inconsistent and incorrect.

We therefore built a DDL parsing capability: once a DDL is parsed, it is replayed directly in the parsing thread of the synchronization process. At the same time, we record a version for every change, generating a new version on each replay while keeping the old versions. In this way, the Meta structure of any table can be looked up for any point in time, without having to replay from the beginning as some industry implementations do, an approach that makes problem diagnosis very difficult and cannot yield the latest or current Meta state.
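
Conceptually, the versioned metadata can be sketched like this (illustrative structures, not NineData's implementation): every DDL adds a new schema version keyed by log position, old versions are never deleted, and the parser can resolve any table's structure at any position.

```python
import bisect

class SchemaRegistry:
    """Keep every schema version so any log position can be resolved to a structure."""

    def __init__(self):
        self.positions = {}   # table -> ascending list of DDL log positions
        self.schemas = {}     # (table, log_position) -> list of column definitions

    def apply_ddl(self, table, log_position, new_columns):
        # DDLs are applied in log order, so appending keeps the positions sorted;
        # old versions are kept, never deleted.
        self.positions.setdefault(table, []).append(log_position)
        self.schemas[(table, log_position)] = list(new_columns)

    def columns_at(self, table, log_position):
        """Return the table's columns as of a given log position, or None if unknown."""
        history = self.positions.get(table, [])
        idx = bisect.bisect_right(history, log_position) - 1
        return self.schemas[(table, history[idx])] if idx >= 0 else None
```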

In terms of data comparison, NineData offers it as a first-class product capability, because we believe data comparison has a major impact on overall data quality. We therefore built comprehensive functionality for structure comparison, data comparison, and generation of corrective SQL. Second, we take into account the load that comparison places on the user's source and target databases, which matters a great deal to operations staff, so we provide many strategies: re-checking only inconsistent data, controlling concurrency and throttling, setting sampling ratios and filter conditions, comparing only data within a given range, and so on. We have also done work on performance: pulling all data out of both the source and the target consumes a lot of compute and bandwidth, so we push computation down to the databases in a more elegant way.
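
To illustrate the pushdown idea, here is a sketch assuming MySQL-compatible endpoints and hypothetical column names: each side computes a per-chunk checksum inside the database, so only small aggregates cross the network, and full rows are fetched only for chunks whose checksums differ.

```python
def chunk_checksum(conn, table, pk, lo, hi):
    """Push the checksum computation down into the database for one primary-key range."""
    sql = f"""
        SELECT COUNT(*),
               BIT_XOR(CRC32(CONCAT_WS('#', {pk}, col_a, col_b)))  -- col_a/col_b are placeholders
        FROM {table} WHERE {pk} BETWEEN %s AND %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lo, hi))
        return cur.fetchone()

def compare_chunk(src, dst, table, pk, lo, hi):
    """Fetch full rows only when a chunk's checksums disagree."""
    if chunk_checksum(src, table, pk, lo, hi) == chunk_checksum(dst, table, pk, lo, hi):
        return []                                   # consistent; nothing crosses the network
    return diff_rows(src, dst, table, pk, lo, hi)   # hypothetical row-by-row recheck helper
```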

In terms of extensibility, how does NineData support new data sources quickly? We need to quickly support the conversion of structures and data types and quickly productize new channels; these are important considerations for us. Our overall design idea is to turn the N x M topology of sources times targets into an N + M problem.

Let's start with data types, because data types largely determine final consistency, and the industry has defined many intermediate types. NineData also defines a set of intermediate types: the better the abstraction, the fewer intermediate types are needed, which means less conversion work when adding a new data source. Abstracting into as small a canonical set as possible is therefore the better overall approach.
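
A toy example of this two-step mapping (the type names below are illustrative, not NineData's actual canonical set): each source maps into the canonical types and each target maps out of them, so supporting a new source or target costs one mapping rather than one converter per source-target pair.

```python
# Step 1: source type -> intermediate (canonical) type
MYSQL_TO_CANONICAL = {
    "TINYINT": "INT8", "INT": "INT32", "BIGINT": "INT64",
    "VARCHAR": "STRING", "DATETIME": "TIMESTAMP", "DECIMAL": "DECIMAL",
}

# Step 2: intermediate type -> target type
CANONICAL_TO_CLICKHOUSE = {
    "INT8": "Int8", "INT32": "Int32", "INT64": "Int64",
    "STRING": "String", "TIMESTAMP": "DateTime64(3)", "DECIMAL": "Decimal(38, 10)",
}

def convert_type(mysql_type: str) -> str:
    """MySQL -> canonical -> ClickHouse: N + M mappings instead of N x M converters."""
    return CANONICAL_TO_CLICKHOUSE[MYSQL_TO_CANONICAL[mysql_type.upper()]]
```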

The second aspect of pluggability is a framework we provide for relational data submission. Being sufficiently pluggable means providing frameworks at various points in the lower layers to improve overall reusability. For example, when data reaches our writer, DDL and DML must be ordered at commit time: we make sure the table structure change has been applied before letting the subsequent DML through. There are also in-memory structures at the database and table level for resolving lock conflicts around DDL. At the table level, DML for the same table can be grouped, which lets us batch SQL and reduce the number of writes; at the row level, changes to the same unique key can be hot-merged, as described earlier. These framework abstractions mean that data sources added later naturally inherit these capabilities. During data flow, the framework can also build transaction-dependency topologies to determine whether data can be committed and whether the data it depends on has already been applied.

Today is the MySQL and ClickHouse session, and we have done a lot of practical work in both areas. MySQL and ClickHouse differ considerably in data types, and ClickHouse also supports many table engines. Let's first look at the choice of engine. We investigated industry implementations including MaterializeMySQL and Airbyte. ReplacingMergeTree can work, and it is also the engine MaterializeMySQL uses internally; MaterializeMySQL wraps queries to hide some of the details, but without sufficient query encapsulation, differences between vendors' synchronization tools in how they handle the version and sign columns will ultimately cause users' query results to differ.

With CollapsingMergeTree, the sign column is -1 or 1, and insert, update, and delete operations in incremental replication map onto it naturally. So our first implementation used CollapsingMergeTree to synchronize data to the desired target. In practice we found that some customers were still using ReplacingMergeTree, so we support that as well, providing both options.
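
A rough sketch of how binlog-style events could be mapped onto CollapsingMergeTree rows via the sign column (the table definition and event shape below are illustrative):

```python
# Illustrative target table with a sign column, as CollapsingMergeTree requires:
#   CREATE TABLE orders (
#       id UInt64, status String, amount Decimal(18, 2), _sign Int8
#   ) ENGINE = CollapsingMergeTree(_sign) ORDER BY id;

def to_collapsing_rows(op, before, after):
    """Map one change event onto CollapsingMergeTree rows via the sign column."""
    if op == "insert":
        return [{**after, "_sign": 1}]
    if op == "delete":
        return [{**before, "_sign": -1}]            # cancel the existing row
    if op == "update":
        return [{**before, "_sign": -1},            # cancel the old version
                {**after, "_sign": 1}]              # write the new version
    raise ValueError(f"unknown op: {op}")
```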

ClickHouse has relatively many data types, more than a database like MySQL, so there are many choices in type mapping, along with many default values. If a value is not explicitly mapped to NULL, an initial default value may appear instead; for a Point type, for example, it may come out as a zero value such as (0, 0). Such behaviors can cause users to find discrepancies when comparing source and target data.

For the commit path and channel performance, one approach, similar to Airbyte's, is to merge all incremental data into a file; since many inserts, deletes, and updates in ClickHouse engines are effectively appends, this method is relatively simple, but it introduces large delays. In our implementation we therefore commit with SQL: incoming changes are converted into batch commits on the fly, controlled dynamically, for example by a threshold of 1,000 rows or 0.5 seconds, so commits can happen within hundreds of milliseconds. In addition, ClickHouse's JDBC driver performs poorly when parsing each individual statement, so we made optimizations and adopted batch submission to improve performance.
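
The dynamic batching can be sketched as follows (thresholds and the flush callback are illustrative): a batch is flushed as one submit when it reaches a row-count threshold or a time threshold, whichever comes first. A production version would also flush from a background timer when no new events arrive.

```python
import time

class MicroBatcher:
    """Flush a batch at max_rows rows or after max_wait seconds, whichever comes first."""

    def __init__(self, flush_fn, max_rows=1000, max_wait=0.5):
        self.flush_fn = flush_fn      # e.g. a function issuing one multi-row INSERT
        self.max_rows = max_rows
        self.max_wait = max_wait
        self.buffer = []
        self.first_at = None

    def add(self, row):
        if not self.buffer:
            self.first_at = time.monotonic()
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or time.monotonic() - self.first_at >= self.max_wait):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # one batched submit instead of row-by-row writes
            self.buffer = []
            self.first_at = None
```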

Next, observability and the ability to intervene. Because there are so many data source types, users may, for example, write new data to the target during synchronization, causing conflicts between the two sides. For observability, we not only expose the basic state to the user but also show the statements each thread is submitting. For instance, if 16 threads are running, we show which SQL each of the 16 threads is executing and whether a task is blocked by a DDL. Similar to MySQL's processlist, you can see what each thread is doing, how long it has been running, and other details.

In terms of "intervention ability", we also focus on construction, because during the synchronization process, users may need to add new objects to the synchronization link, or encounter some abnormal situations and need to intervene. Or you need to rewrite and submit operations, etc. This kind of problem is often encountered.

For example, the figure below shows fine-grained exception handling. If a statement hits a structural conflict on the target during execution, the system pops up the statement, and the user can edit the SQL directly in the dialog and re-execute it, so that the new SQL takes effect. This kind of problem often occurs during table structure changes, and the user can also choose to skip or ignore it and continue with subsequent operations.

For the multi-cloud, multi-source NineData data replication feature, we have done a great deal of work on functional completeness, structure migration, pre-checks, and performance to keep it highly competitive in the market. The product is available now, and you are welcome to try it on the cloud.

Here are two brief cases. The first is a large real-estate company. The business runs a large number of databases, and its development process involves many partners, such as ISVs and third-party software development providers, so access to data sources has to be delegated to these partners. With the previous manual process, permission management became very complicated and cumbersome, making unified management difficult. NineData provides a solution for unified management of data sources: all of the enterprise's data sources are managed in one place, and developer account provisioning, permission requests, and the data development process are all made visual and streamlined, greatly improving development and collaboration efficiency.

The second case is a cross-border e-commerce scenario whose analytics and operations are based on ClickHouse. Their production MySQL instances are scattered around the world, for example in Japan and South Korea, and they aggregate online data from these locations into a domestic ClickHouse cluster for unified analysis and operational decision-making. They use our NineData replication product for this. NineData has advantages in cross-region replication: our parsing, reading, and writing modules can be deployed in different locations, with the parsing module close to the user's source and the writing module close to the target, achieving better overall performance.

That's all for my sharing. Thank you all.

Focusing on the theme "Technology Evolution, Making Data Smarter", the conference brought together six database experts from ByteDance, Alibaba Cloud, Jiuzhang Arithmetic Technology, Huawei Cloud, Tencent Cloud, and Baidu to discuss technology trends and share real-world enterprise cases with technology enthusiasts.
