[Database Technology] NineData Data Replication: Accelerating Real-Time Data Warehouse Construction

On August 30, the online joint press conference themed "Real-Time Data-Driven, Leading Enterprise Intelligent Data Management", co-hosted by NineData and SelectDB, was successfully held. The two parties focused on real-time data warehouse technology and data development capabilities, demonstrating how broad ecosystem compatibility connects with a rich set of big data products, helping enterprises quickly build data analysis businesses, and jointly exploring real-time data-driven solutions for future enterprise intelligent data management.

This article is based on the keynote speech delivered by Chen Changcheng (Tianyu), Vice President of Jiuzhang Arithmetic Technology, at the NineData X SelectDB joint conference.

Technology Sharing | NineData Data Replication: Accelerating Real-Time Data Warehouse Construction
Chen Changcheng (Tianyu), Vice President of Jiuzhang Arithmetic Technology

Chen Changcheng formerly served as head of the Alibaba Cloud Database Center and general manager of the Alibaba Cloud Database Ecosystem Tools Department, and was a senior technical expert at Alibaba Cloud. He led Alibaba Cloud's database infrastructure through three architectural transformations: moving off IOE to a distributed architecture, geo-distributed multi-active deployment, and containerization with storage-compute separation. He also led the upgrade of the cloud-native tooling architecture and built one-stop management capabilities, and has published numerous technical patents and papers at top database conferences such as VLDB and ICDE.

01 NineData Product Introduction

In the era of data and the cloud, enterprise digitalization faces many challenges. Reports from Gartner and Percona indicate that more than 80% of enterprises will choose multi-cloud or hybrid cloud, and more than 70% will use multiple databases to meet business needs. Industry analysis also shows that enterprises that make effective use of multi-source infrastructure and new data architectures see significant gains in innovation capability and overall profitability. Yet data management in this era brings additional challenges that urgently need solving, such as data silos, the complexity of managing multi-source heterogeneous data, and development efficiency.

Facing these common problems and challenges, Jiuzhang Arithmetic Technology built the NineData cloud-native intelligent data management platform. The bottom IaaS interface layer connects all kinds of data sources across scenarios in a unified way. On top of it, NineData builds four core functional modules, data backup, data replication, data comparison, and SQL development, tightly integrated with enterprise databases, search, message queues, and data warehouses, to help enterprises protect data assets, flexibly build infrastructure on multi-cloud and hybrid cloud, unify security management, and improve database development efficiency.

Next, an introduction to SQL development. It productizes best practices so that all developers inside and outside the enterprise follow a unified data access specification and work more efficiently. Enterprises today face multiple clouds and multiple data sources, and although each data source has its own CLI or graphical management tool, those tools bring the following problems:

  1. Permissions are decentralized, auditing is lacking, and security control is difficult;

  2. Tools vary in maturity, the experience is uneven, and development efficiency is low;

  3. No unified standard can be formed, so production database stability is not guaranteed;

  4. Multiple environments and multiple data sources cannot be managed in a unified way.

In response, NineData designed enterprise-grade database security management: it connects the various data sources of cloud vendors and self-built systems through unified data source access, and provides task flows and approval flows, security rule configuration, permission management and operation auditing, and SSO support. A rule engine uniformly governs enterprise data (instances, databases, tables), account roles, and operation types, with built-in best practices for stable and secure database SQL development, covering database access permission management, change management, sensitive data management, data import and export, and more. NineData offers two service modes, a simple GUI-based personal edition and a collaboration-oriented enterprise edition, and combines large-model AIGC capabilities to boost developer efficiency with natural language data queries, table structure design and rewriting, and SQL optimization suggestions.

In data replication scenarios, enterprises likewise face multiple data sources, cross-cloud data connectivity, long-term cross-region replication, and the performance and stability problems these create. NineData data replication aims to provide the infrastructure for data flow, eliminating the friction caused by different database types, vendors, and environments, and helping enterprises maximize the value of their data. NineData currently supports 13 data source types with one-way and two-way links, strong replication performance, and a complete data comparison function, with more to come.

02 Data replication technology architecture

First, the overall architecture of NineData. On top of multi-cloud and multi-source capabilities, we built data backup, data replication, data comparison, and SQL development.

2.1 The multi-cloud perspective

From the multi-cloud perspective, to help enterprises uniformly manage data sources scattered across multi-cloud or hybrid cloud, we designed a flexible cloud-native architecture with containerized elastic scale-up, a dedicated network architecture, and more.

Support for dedicated clusters

While supporting multi-cloud, dedicated-cluster technology lets each enterprise have exclusive use of its own resources. For instance, worker nodes can be placed on premises or inside the enterprise VPC, keeping data in an internal closed loop and improving both data security and worker execution efficiency.

Cloud-native SaaS mode

As a cloud-native SaaS product, NineData offers on-demand deployment and elastic scaling as baseline capabilities.

Network security

On the network side, for security reasons many enterprise customers do not want to expose their databases' public ports. We therefore designed a database gateway: the user simply starts a NineData database gateway, which connects out to our central management nodes and establishes a reverse access channel, enabling management of data sources scattered anywhere, including internal ones. In addition, NineData workers can also be deployed locally so the data link forms an internal closed loop, while the management link is still handled uniformly through the central console.

2.2 The multi-source perspective

For multiple sources, we designed a unified data source access layer. To access many types of data sources, we built a unified abstraction over connection pool management, property configuration, connection checks, and security authentication, so all data sources can be accessed uniformly. Our four functional modules all share this access layer, so a single connection makes every function available, giving users truly unified management.

In NineData's product design, security is not a single task or feature but is instilled from beginning to end. Throughout product design, development, and operations we have invested heavily in data transmission encryption, white-screen (console-based) O&M, and operation auditing, and NineData further protects data through internal testing and third-party audits.

In a typical NineData replication link topology, after you configure the source and target, NineData starts the whole link. A precheck first verifies network connectivity, account credentials, and so on; then structure replication runs, followed by full and incremental data capture and write.

At the product level we must support multi-cloud and multi-source. Beyond the elastic and network architectures introduced above, we made important designs for data type compatibility and scalability in the replication module. Because long-running multi-source heterogeneous replication inevitably hits a small number of incompatible scenarios, we also focused on observability and intervention features. The bottom line of the data transmission kernel is data consistency, together with leading throughput and latency, so we invested heavily there. The core feature sharing below revolves around these points.

03 Core Features of Data Replication

3.1 Throughput capacity

Taking full-migration performance as an example, there are several important optimizations:

Large table migration performance

Large table migration is the most common bottleneck. Suppose the source contains many tables with very different data volumes. If we start three concurrent threads, the tables with little data finish quickly while a large table is still being processed by a single thread; table-level concurrency alone has the same problem. To improve overall efficiency we must add intra-table concurrency, which raises the question of how to slice a table evenly. Our strategy is one-click splitting by default, choosing the split key in order of preference: primary key, non-null unique key, nullable unique key, then ordinary key, so chunks are as balanced as possible for concurrent processing (see the sketch below).
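A minimal sketch of this kind of primary-key range chunking, assuming a numeric primary key; class and method names are hypothetical, not NineData's actual implementation:

```java
// Hypothetical sketch of intra-table chunking by primary key range.
// Assumes a numeric primary key; names are illustrative, not NineData's API.
import java.util.ArrayList;
import java.util.List;

public class TableChunker {
    public record Chunk(long lowInclusive, long highExclusive) {}

    /**
     * Split [minPk, maxPk] into fixed-size ranges so several threads can
     * copy one large table concurrently instead of one thread per table.
     */
    public static List<Chunk> split(long minPk, long maxPk, long chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (long lo = minPk; lo <= maxPk; lo += chunkSize) {
            chunks.add(new Chunk(lo, Math.min(lo + chunkSize, maxPk + 1)));
        }
        return chunks;
    }
    // Each worker thread then runs a bounded range scan, e.g.:
    // SELECT * FROM t WHERE id >= ? AND id < ? ORDER BY id
}
```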

Concurrent writing also has a space implication: writing 100 GB at the source may become 150 GB at the target, because out-of-order commits within a single table can leave data holes. NineData therefore tunes slice size and commit order to control write amplification.

Target database write

Maximum performance comes from writing to the target in the least costly way. Once the channel itself scales linearly, the throughput bottleneck moves from the channel to the target database, so the write pattern matters a great deal. If each SQL statement must be parsed on the target side, performance inevitably suffers, so we adopt batched submission. Compression settings also need attention relative to CPU count: with few CPUs, enabling compression can hurt performance significantly.
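To illustrate the batched-submission idea, here is a minimal JDBC sketch assuming a MySQL-compatible target; connection details are placeholders, and `rewriteBatchedStatements=true` is the standard MySQL Connector/J flag that collapses a batch into multi-row INSERTs so the server parses far fewer statements:

```java
// A minimal sketch of batched writes over JDBC (MySQL-compatible target assumed).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchWriter {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://target-host:3306/demo?rewriteBatchedStatements=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO orders(id, status) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 10_000; i++) {
                    ps.setLong(1, i);
                    ps.setString(2, "CREATED");
                    ps.addBatch();
                    if ((i + 1) % 1_000 == 0) { // commit in chunks, not per row
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();
                conn.commit();
            }
        }
    }
}
```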

Memory optimization

Memory optimization also improves throughput. Full replication is characterized by loading batches into memory, writing them quickly to the target, and then discarding them, so we made targeted JVM parameter optimizations to reduce memory and CPU overhead and improve channel performance.
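As a hedged illustration of this kind of tuning (the talk does not disclose NineData's actual JVM settings), a load-write-discard workload generally favors a large young generation and a low-pause collector, along these lines:

```bash
# Illustrative JVM flags for a load-write-discard workload; these are
# assumptions for demonstration, not NineData's actual configuration.
# Short-lived batch objects favor a large young generation and a low-pause GC.
java -Xms4g -Xmx4g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=100 \
     -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=40 \
     -jar replicator.jar
```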

3.2 Low Latency

So how does NineData achieve low latency? We approach it from multiple dimensions.

Channel performance

At the channel level, this includes batching and hot-data merging. For hot-data merging: if a record changes from A1 to A2 and then to A3, the usual synchronization model replays every modification, but with hotspot merging enabled the changes can be collapsed directly into a final INSERT of A3, never inserting A1 or updating A2. The final-state data is written directly, with the merge performed right in the in-memory queue. There are other channel-level designs as well; for example, the Redis replication link reduces queue serialization cost so that queue consumption overhead is minimized.
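The hot-merge idea can be sketched as follows; the event model and names are illustrative, not NineData's code:

```java
// A sketch of in-queue hot-data merging, assuming row events keyed by primary
// key. Successive changes to one key collapse to a single final-state operation.
import java.util.LinkedHashMap;
import java.util.Map;

public class HotMergeQueue {
    public enum Op { INSERT, UPDATE, DELETE }
    public record RowEvent(Op op, String pk, Map<String, Object> image) {}

    // LinkedHashMap keeps arrival order of keys while letting later
    // events on the same key overwrite earlier ones.
    private final Map<String, RowEvent> pending = new LinkedHashMap<>();

    public void offer(RowEvent e) {
        RowEvent prev = pending.get(e.pk());
        if (prev != null && prev.op() == Op.INSERT && e.op() == Op.UPDATE) {
            // INSERT(A1) + UPDATE(A3) => INSERT(A3): write final state once.
            pending.put(e.pk(), new RowEvent(Op.INSERT, e.pk(), e.image()));
        } else if (prev != null && prev.op() == Op.INSERT && e.op() == Op.DELETE) {
            pending.remove(e.pk()); // insert then delete cancels out
        } else {
            pending.put(e.pk(), e);
        }
    }

    public Iterable<RowEvent> drain() { return pending.values(); }
}
```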

Channel Management Design

The channel management layer is also crucial to overall low latency; this is experience gained from years of practice. The link must be able to handle all kinds of synchronization anomalies at minimal cost.

(a) Reduce the need for full re-pulls after anomalies. Suppose a link falls behind and the logs on the database server have already been purged; as a cloud-native product, what can we do? We call the source database's API to check whether logs have been uploaded to OSS or another object store. If so, we fetch them automatically and continue from the previous position, avoiding a full re-pull and reducing delay.

(b) Roll back as little data as possible. We designed table-level safe positions: each table keeps its own latest position, and if an event at or before that position arrives during replay, we discard it to avoid position rollback.

(c) Shut down cleanly. For routine O&M operations, NineData's replication threads first flush everything in the queue so that, say, all 16 threads reach a consistent position, and only then close the process. This clean shutdown means users never have to replay data after a restart, which is one of the most elegant approaches.
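A rough sketch of such a clean shutdown, under the assumption that each worker owns a queue it can drain; all names are hypothetical:

```java
// A sketch of a clean shutdown: stop intake, drain all worker queues to a
// consistent position, then exit. Structure is illustrative, not NineData's code.
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class CleanShutdown {
    interface Worker { void stopIntake(); void flushQueue(); }

    public static void shutdown(List<Worker> workers) throws InterruptedException {
        CountDownLatch drained = new CountDownLatch(workers.size());
        for (Worker w : workers) {
            w.stopIntake(); // no new events accepted
            new Thread(() -> { w.flushQueue(); drained.countDown(); }).start();
        }
        drained.await();           // all threads now at a consistent position
        persistCheckpoint(workers); // record the common position once, then exit
    }

    static void persistCheckpoint(List<Worker> workers) { /* save position */ }
}
```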

3.3 Consistent Data Synchronization

The importance of data consistency is unquestionable. Here we introduce NineData's design from two angles: data consistency and DDL structure consistency. NineData also implements a complete data comparison function.

Data consistency

First, consistency of the data itself: how do we ensure transaction consistency? Suppose we have five transactions T1 to T5, where B1 is an order-status record: the order is created as B1, the user pays at B3, and at that point a logistics record L is generated. With ordinary row-level synchronization, orders and logistics records live in different tables, and row-level concurrency cannot guarantee their order. B1 and L could therefore appear in the target database at the same time, meaning a logistics record exists the moment the order is created. For an online business this clearly violates business logic and cannot support normal operation.

We therefore built a transaction-consistency capability that users can enable. When enabled, we check whether each record in a transaction such as T3 depends on any earlier transaction. If it does, T3 waits for those transactions to commit before committing itself, ensuring data consistency. In this example, the first round commits T1, T2, and T4, while T3 commits only after T2 has committed. This is the synchronization mechanism that guarantees consistency.
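One way to implement this dependency gating is sketched below, assuming each transaction carries the set of row keys it touches; this illustrates the mechanism described, not NineData's implementation:

```java
// A sketch of transaction-dependency gating: a transaction commits only after
// every earlier transaction that touched the same rows has committed.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class TxnScheduler {
    public record Txn(String id, Set<String> rowKeys) {}

    private final Map<String, CompletableFuture<Void>> lastWriter = new HashMap<>();

    /** Commit txn only after the transactions it depends on have committed. */
    public synchronized CompletableFuture<Void> submit(Txn txn, Runnable commit) {
        Set<CompletableFuture<Void>> deps = new HashSet<>();
        for (String key : txn.rowKeys()) {
            CompletableFuture<Void> prev = lastWriter.get(key);
            if (prev != null) deps.add(prev); // e.g. T3 must wait for T2
        }
        CompletableFuture<Void> done = CompletableFuture
                .allOf(deps.toArray(new CompletableFuture[0]))
                .thenRun(commit);             // independent txns commit in parallel
        for (String key : txn.rowKeys()) lastWriter.put(key, done);
        return done;
    }
}
```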

Consistency of DDL change synchronization

Next, consistency of DDL change synchronization. Take table structure as an example: when a structure change occurs, the usual approach is to query the source for the table structure. Because most data logs carry only data and table names, lacking structure and type information, we must query the data source for structural information and assemble the final result. By the time we query, however, a second DDL may already have run at the source, so we fetch an already re-modified structure, and the assembled data ends up inconsistent and wrong.

We therefore developed a DDL parsing capability: once a DDL is parsed, it is replayed directly in the synchronization link's parsing thread. We record a version for each change; replay generates a new version without deleting the old one. This way, the Meta structure of any table at any point in time can be looked up directly, rather than replayed from the beginning as in other industry practices.
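A sketch of such a versioned metadata registry, keyed by log position so any table's structure at any point can be looked up without replaying; names are illustrative:

```java
// A sketch of a versioned schema registry: each DDL yields a new immutable
// version keyed by log position, old versions are kept, and the structure in
// effect at any position can be looked up directly.
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class SchemaRegistry {
    public record TableMeta(long version, String ddl) {}

    // table name -> (log position -> schema version created at that position)
    private final Map<String, NavigableMap<Long, TableMeta>> versions =
            new ConcurrentHashMap<>();

    /** Apply a parsed DDL: append a new version, never delete old ones. */
    public void applyDdl(String table, long logPosition, String ddl) {
        versions.computeIfAbsent(table, t -> new ConcurrentSkipListMap<>())
                .put(logPosition, new TableMeta(logPosition, ddl));
    }

    /** Fetch the schema in effect at a given log position. */
    public TableMeta metaAt(String table, long logPosition) {
        NavigableMap<Long, TableMeta> m = versions.get(table);
        if (m == null) return null;
        Map.Entry<Long, TableMeta> e = m.floorEntry(logPosition);
        return e == null ? null : e.getValue();
    }
}
```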

Data comparison

NineData builds data comparison as a first-class product capability, because we believe comparison strongly affects overall data quality. We therefore cover structure comparison, data comparison, and corrective SQL generation comprehensively. Second, we consider the load that comparison places on the user's source and target databases, which matters a great deal to production staff, so we provide many strategies: re-checking only inconsistent data, concurrency control and rate limiting, sampling ratios and conditional filtering, comparing only data within a given range, and so on. We also made unique performance optimizations. Conventional comparison pulls all data from both source and target, consuming heavy compute and bandwidth; instead, we push the computation down in a more elegant way and return only the inconsistent rows for field-by-field comparison (see the sketch below).
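The pushdown idea can be sketched with chunk-level checksums computed inside each database, so only mismatched chunks are pulled for field-by-field comparison. The SQL below is illustrative MySQL (BIT_XOR over CRC32); the talk does not specify NineData's actual queries:

```java
// A sketch of compute pushdown for data comparison: checksum chunks on each
// side in SQL, then pull rows only for the chunks whose checksums differ.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkComparator {
    /** Checksum one primary-key range on one side; cheap on bandwidth. */
    static long chunkChecksum(Connection conn, long lo, long hi) throws Exception {
        String sql = "SELECT BIT_XOR(CRC32(CONCAT_WS('#', id, status, amount))) " +
                     "FROM orders WHERE id >= ? AND id < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }

    static boolean chunkMatches(Connection src, Connection dst, long lo, long hi)
            throws Exception {
        // Only when checksums differ do we fetch the chunk's rows and
        // compare them field by field.
        return chunkChecksum(src, lo, hi) == chunkChecksum(dst, lo, hi);
    }
}
```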

3.4 Scalability: accelerating real-time data warehouse construction

On scalability: how does NineData support new data sources quickly? We need to quickly support structure and data type conversion, and productize channels quickly; these are our key considerations. The overall design idea is to turn the N x M topology of arbitrary sources to targets into an N + M problem.

Let's talk about data types first, since types bear most directly on final consistency, and the industry has defined many intermediate types. NineData also defines intermediate types: the better the abstraction, the fewer the types, and the less conversion work a new data source requires. Abstracting down to a smaller set is therefore the better overall approach.
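A toy sketch of the N + M idea: each source maps into a small intermediate type set and each target maps out of it, so a new source needs only one converter rather than one per target. The type names here are hypothetical:

```java
// A sketch of the N + M idea with a small intermediate type set: each source
// maps in, each target maps out, so adding a source costs one converter, not N.
import java.util.Map;

public class TypeBridge {
    enum Intermediate { INT64, DECIMAL, STRING, BYTES, DATETIME, BOOL }

    // Source-side converter: MySQL column type -> intermediate type.
    static final Map<String, Intermediate> FROM_MYSQL = Map.of(
            "BIGINT",   Intermediate.INT64,
            "DECIMAL",  Intermediate.DECIMAL,
            "VARCHAR",  Intermediate.STRING,
            "DATETIME", Intermediate.DATETIME);

    // Target-side converter: intermediate type -> SelectDB/Doris column type.
    static final Map<Intermediate, String> TO_SELECTDB = Map.of(
            Intermediate.INT64,    "BIGINT",
            Intermediate.DECIMAL,  "DECIMAL",
            Intermediate.STRING,   "VARCHAR",
            Intermediate.DATETIME, "DATETIME");

    static String convert(String mysqlType) {
        return TO_SELECTDB.get(FROM_MYSQL.get(mysqlType));
    }
}
```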

Second, plug-in capture and write modules maximize code reuse and improve product efficiency and stability. We provide a framework called Relational Data Commitment that abstracts DDL/DML conflict waiting at the database, table, and primary key levels, transaction conflict waiting, hotspot merging, and batched SQL optimization, so data sources added later naturally inherit these capabilities.

At present, NineData broadly supports MySQL, PostgreSQL, SQLServer, Redis, MongoDB, Kafka, ElasticSearch, SelectDB (Doris), and other databases, including the corresponding products of mainstream cloud vendors. Here we focus on the feature design for SelectDB and ClickHouse.

Structure replication

NineData supports automatic synchronization of all MySQL DDLs to SelectDB, including Distribute Key adaptation and SQL rewriting, rewriting of cross-database CREATE TABLE LIKE statements, and more.

Data replication

We define a one-to-one mapping from NineData intermediate types to SelectDB data types, covering data type and character set mapping. For time types, cross-time-zone migration is supported according to the server's Global TimeZone.

Data processing

Synchronization objects can be filtered during replication, together with operation-type filtering (for example, replicating INSERTs but not DELETEs), filtering based on computed conditions, and data type conversion; a sketch follows.
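A minimal sketch of operation-type and condition filtering; the event shape and names are illustrative:

```java
// A sketch of operation-type and condition filtering in a replication pipeline.
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class EventFilter {
    enum Op { INSERT, UPDATE, DELETE }
    record Event(Op op, String table, Map<String, Object> row) {}

    static Predicate<Event> build(Set<Op> allowedOps, Predicate<Event> condition) {
        return e -> allowedOps.contains(e.op()) && condition.test(e);
    }

    public static void main(String[] args) {
        // Replicate INSERTs only, and only rows with amount >= 100.
        Predicate<Event> filter = build(
                EnumSet.of(Op.INSERT),
                e -> ((Number) e.row().getOrDefault("amount", 0)).doubleValue() >= 100);

        Event keep = new Event(Op.INSERT, "orders", Map.<String, Object>of("amount", 250));
        Event drop = new Event(Op.DELETE, "orders", Map.<String, Object>of("amount", 250));
        System.out.println(filter.test(keep)); // true
        System.out.println(filter.test(drop)); // false
    }
}
```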

Performance optimization

Beyond the write merging provided by the replication framework, NineData supports full and incremental writes in Stream mode. In a same-region cloud test of MySQL -> SelectDB (Doris) with 30-way concurrency, throughput reached 209 MB/s and 880K RPS (average row size around 250 bytes).

NineData also designed carefully for ClickHouse. For structure mapping, users can choose CollapsingMergeTree or ReplacingMergeTree as the target engine, and the mapping of various ClickHouse data types is supported, including handling of default-value differences. On performance: approaches like Airbyte's merge all incremental data into one file, which is relatively simple since many inserts, deletes, and updates in ClickHouse become plain appends, but it introduces high latency. We instead submit via SQL, moving whatever has accumulated into batched commits under dynamic control: for example, a batch commits once it exceeds 1000 rows or 0.5 seconds, so commits can land within a few hundred milliseconds. In addition, ClickHouse's JDBC driver is slow at parsing individual statements, so we optimized with batched submission to improve performance.
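The count-or-time flush policy described above ("more than 1000 rows or 0.5 seconds") can be sketched like this; thresholds and the class itself are illustrative:

```java
// A sketch of "flush at N rows or T milliseconds, whichever comes first".
import java.util.ArrayList;
import java.util.List;

public class MicroBatcher<T> {
    private final int maxRows;
    private final long maxWaitMillis;
    private final List<T> buffer = new ArrayList<>();
    private long firstEventAt = -1;

    public MicroBatcher(int maxRows, long maxWaitMillis) {
        this.maxRows = maxRows;             // e.g. 1000 rows
        this.maxWaitMillis = maxWaitMillis; // e.g. 500 ms
    }

    /** Returns a batch to commit when a threshold is crossed, else null. */
    public synchronized List<T> offer(T event) {
        if (buffer.isEmpty()) firstEventAt = System.currentTimeMillis();
        buffer.add(event);
        boolean full = buffer.size() >= maxRows;
        boolean aged = System.currentTimeMillis() - firstEventAt >= maxWaitMillis;
        if (full || aged) {
            List<T> batch = new ArrayList<>(buffer);
            buffer.clear();
            return batch; // caller commits this batch via SQL
        }
        return null;
    }
}
```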

3.5 High availability mechanism

Node disaster recovery

All NineData components use a high-availability architecture to avoid single points of failure. Task nodes run on a distributed container cluster, and the disaster recovery system automatically detects abnormal tasks and nodes and completes cross-machine task failover.

Task robustness

Dynamic memory management, combined with dynamic sharding, dynamic batching, streaming reads and writes, and elastic scaling, improves the link's ability to adapt to load and effectively guarantees task stability.

Breakpoint resume

All modules record positions regularly, including structure replication, full replication, incremental replication, and data comparison; if a task or service node fails, the task restarts from the breakpoint. Thorough retry and intervention mechanisms improve link robustness under poor networks, heavy data loads, hardware failures, and similar scenarios.
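A simple sketch of periodic position checkpointing and breakpoint resume; the file-based store and field names are illustrative stand-ins, not NineData's implementation:

```java
// A sketch of periodic position checkpointing and resume after restart.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PositionCheckpointer {
    private final Path file;
    private volatile String position = ""; // e.g. binlog file + offset

    public PositionCheckpointer(Path file) { this.file = file; }

    public void advance(String newPosition) { this.position = newPosition; }

    /** Persist the latest position every few seconds. */
    public void start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            try {
                Files.writeString(file, position);
            } catch (Exception e) {
                e.printStackTrace(); // a real system would alert here
            }
        }, 5, 5, TimeUnit.SECONDS);
    }

    /** On restart, resume from the last persisted breakpoint. */
    public String resume() throws Exception {
        return Files.exists(file) ? Files.readString(file) : null;
    }
}
```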

3.6 Observability and Intervention Capabilities

Observability

(1) The replication system provides second-granularity log management for each link. You can view second-level RPS, the cumulative count of DDL/DML operations, queue backlog, and other indicators, and check the status of each module of a task.

(2) Submission thread status can be viewed in real time. For example, if 16 threads are running, we display which SQL each thread is currently executing and whether the task is blocked by a DDL, in a form similar to MySQL's PROCESSLIST: what operation each thread is performing and how long it has been running. During synchronization, users may run into conflicts caused by, for example, new writes on the target side. So in observability we not only fully expose the basic state to the user but also surface the statements each thread is submitting.

Intervention capabilities

(1) Modifying synchronization objects. On long-running replication tasks, business changes may require adding new objects to the synchronization link. Users can add them directly in the UI; the backend performs structure initialization, full migration, and incremental sync for the new objects, then merges them into the existing link once they catch up.

(2) Mature exception handling. For failed tasks, NineData displays the specific error; users can fix and retry statements at the SQL level, or skip them, so tasks recover quickly from occasional target-side double writes or structural inconsistencies, preserving the timeliness and correctness of target data.

3.7 Data replication capability summary

NineData data replication is designed to serve customers' data flow scenarios anywhere, with any data, at any time (Any Where, Any Data, Any Time). Mainstream data sources are well covered; the product is compatible with and adapted to complex network environments, with deep adaptation to VPNs, dedicated lines, bastion hosts, and cloud-vendor VPC and private network access, and can serve customers through SaaS or dedicated clusters while ensuring data security and replication stability.

04 Typical User Cases

4.1 A state-owned cloud customer

A large state-owned cloud uses NineData for data replication. The customer has 30+ regions across the country with large volumes of data to synchronize, and also needs to offer data replication products to its own customers, for example when those customers migrate to this cloud from other cloud vendors or self-built systems. The scenarios are complex: cloud migration, cross-cloud migration, cross-region migration, data disaster recovery, geo-distributed multi-active, and more, over a very complex network environment spanning links within and between regions and between other cloud vendors and customer-owned systems. After evaluating the solutions of mainstream cloud vendors and data replication vendors on the market, the customer chose NineData.

4.2 A cross-border e-commerce company

A cross-border e-commerce company built a real-time data warehouse through NineData to guide operational analysis and decision-making. The customer's analytics and operations run on ClickHouse, while production MySQL instances are scattered worldwide, for example in Japan and South Korea; online data from each location is aggregated to a domestic ClickHouse cluster for unified analysis and operational decisions, using NineData replication. NineData has advantages in cross-region replication: the parsing, reading, and writing modules can be deployed in different places, with parsing close to the user's source and writing close to the destination, achieving better overall performance.

4.3 A large real estate enterprise

A large real estate enterprise uses NineData for unified data management. The business runs many databases, and development involves many partners such as ISVs and third-party software providers, so access to data sources must be delegated to those partners. Under the previous manual process, permission management was complicated and cumbersome, making unified governance difficult. NineData's unified data source management solution brings all enterprise data sources under one roof and optimizes developer account initialization, permission applications, and visualized data development workflows, greatly improving development and collaboration efficiency.

Finally, NineData has established close partnerships with data source and cloud vendors, has obtained multiple certifications in cloud services, information security management, and quality management, and is widely used by leading companies across industries. NineData is committed to providing customers with more stable and intelligent database services, helping customers quickly build unified data management so that everyone can make good use of data and the cloud. You are welcome to try it.

NineData is a new-generation cloud-native intelligent data management platform covering data replication, SQL development, data backup, data comparison, and more. Built on leading cloud-native and AIGC technologies, it provides an intelligent data management platform designed for the cloud and AI era. As the industry's first platform supporting seamless integration across clouds and local IDCs, it helps customers easily complete cloud data migration, real-time cross-cloud data transfer, ETL, data backup, enterprise-grade intelligent SQL development, database development standards, production changes, sensitive data management, and more, making customers' use of data safer and more efficient.
