Chen Changcheng: NineData’s technical practice for Doris real-time data warehouse integration

At the recently concluded Doris Summit Asia 2023 in Beijing, Chen Changcheng, Vice President of Jiuzhang Arithmetic Technology, was invited to give a talk titled "NineData's Technical Practice for Doris Real-time Data Warehouse Integration".

Chen Changcheng, Vice President of Jiuzhang Arithmetic Technology

Challenges of multi-cloud and multi-source enterprise data management

Industry reports show that more than 81% of enterprises use a multi-cloud or hybrid cloud architecture, and more than 70% use multiple data types. Enterprises that make skilled use of infrastructure and data architecture innovate much faster than their peers. Multi-cloud and multi-source setups also bring challenges of their own: complex infrastructure management, data silos, and reduced development efficiency.

Faced with these problems, Jiuzhang Arithmetic developed NineData, a cloud-native intelligent data management platform. Its bottom layer provides a unified abstraction over data sources and the IaaS layer and connects to the various cloud vendors and data sources. On top of that sit four functional modules: data replication, data comparison, SQL development, and data backup. These modules interact with the enterprise's managed database PaaS, search platform, message queue, and big data platform to give the enterprise unified multi-cloud, multi-source data management.

NineData data management platform architecture diagram

Cloud native data replication architecture

For data integration in a multi-cloud, multi-source setting, enterprises need to extract data from many data sources and exchange data across cloud vendors. Enterprises with multiple data centers or overseas operations also inevitably face the challenge of long-term data synchronization across regions. NineData believes that a cloud-native data replication architecture needs four characteristics:

(1) Scalable: new data sources can be connected quickly

(2) Resilient: adapts to various environments, vendors, and complex network conditions

(3) Manageable: management of large numbers of environments and links, plus consistency comparison

(4) Observable: links can be observed and intervened in

NineData's multi-cloud goal is database access and management for AnyWhere, AnyNetwork, and AnyDatabase, so that users can manage data sources in any location through a unified console. NineData workers are deployed as close to the user as possible, so that the data link runs locally while task status is reported to the central console. Workers can connect to user data sources over a private VPC network or the public network. For databases that are not exposed to the public network, the NineData database gateway provides local access with remote replication and management. NineData also supports dedicated cluster deployment for financial enterprise customers.

NineData cloud native data replication architecture

For multiple data sources, NineData uses a unified data source abstraction to manage connection attributes, accounts and passwords, connection pools, network connection methods, and so on. Once a data source is registered, it can be used with all NineData functions, including SQL development, data replication, data comparison, and data backup.

Real-time data warehouse Doris data integration practice

NineData's Doris integration practice focuses on consistency, high throughput, low latency, and the ability to observe and intervene. NineData currently supports more than 60 data sources. For real-time integration into the data warehouse, we consider the following aspects:

(1) DDL adaptation: structure migration at initialization, followed by automatic synchronization of new incremental DDL. Doris's table structure is highly compatible with MySQL; the main adaptations are the distribution key and compatibility for cross-database CREATE TABLE LIKE. NineData fills in the distribution key automatically, trying the primary key first and then a unique key, and also lets users choose one from a drop-down list, which keeps the experience smooth.

(2) Data type mapping: this covers data types (such as BIGINT UNSIGNED -> LARGEINT), character sets (Doris mainly uses utf8), and time zones. When production databases from multiple regions and time zones are aggregated into Doris, time zones must be adapted automatically.

(3) Data ETL conversion: when synchronizing MySQL to Doris, we want the table structure synchronized as is, which in practice is closer to ELT. First make sure the data is synchronized quickly, accurately, and stably, then build the warehouse's dimension tables, materialized views, and so on on top of this raw ODS data. Some production data does not need to reach the warehouse and must be filtered out, and some needs simple calculation or tagging before it is synchronized. This is EtLT.

(4) Commit performance: a common concern in data warehouse integration, covered separately below.
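The distribution key fallback and the type mapping from items (1) and (2) can be sketched as follows. This is a minimal illustration, not NineData's implementation: `SourceTable`, `map_type`, `pick_distribution_key`, and `distribution_clause` are hypothetical names, the bucket count of 10 is an arbitrary assumption, and only the BIGINT UNSIGNED -> LARGEINT mapping comes from the text.

```python
from dataclasses import dataclass, field

# Only BIGINT UNSIGNED -> LARGEINT comes from the article; unknown types
# pass through unchanged in this sketch.
TYPE_MAP = {"BIGINT UNSIGNED": "LARGEINT"}


@dataclass
class SourceTable:
    """Hypothetical description of a MySQL table (not a NineData type)."""
    name: str
    columns: dict                                     # column name -> MySQL type
    primary_key: list = field(default_factory=list)
    unique_keys: list = field(default_factory=list)   # list of column lists


def map_type(mysql_type):
    """Normalize case and whitespace, then look the type up in TYPE_MAP."""
    normalized = " ".join(mysql_type.upper().split())
    return TYPE_MAP.get(normalized, normalized)


def pick_distribution_key(table, user_choice=None):
    """User choice wins; otherwise primary key first, then the first unique key."""
    if user_choice:
        return list(user_choice)
    if table.primary_key:
        return list(table.primary_key)
    if table.unique_keys:
        return list(table.unique_keys[0])
    raise ValueError(f"{table.name}: no key available, ask the user to choose")


def distribution_clause(table, buckets=10, user_choice=None):
    """Build the DISTRIBUTED BY clause of a Doris CREATE TABLE statement."""
    cols = ", ".join(f"`{c}`" for c in pick_distribution_key(table, user_choice))
    return f"DISTRIBUTED BY HASH({cols}) BUCKETS {buckets}"


# A table without a primary key falls back to its unique key.
orders = SourceTable(
    "orders",
    {"id": "bigint unsigned", "sn": "varchar(32)"},
    unique_keys=[["sn"]],
)
print(map_type(orders.columns["id"]))   # LARGEINT
print(distribution_clause(orders))      # DISTRIBUTED BY HASH(`sn`) BUCKETS 10
```

Raising an error when no key exists mirrors the drop-down in the text: when nothing can be inferred, the choice goes back to the user.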

Here are some key points in practice:

3.1 Consistency

Real-time log-based CDC actually needs two inputs to parse data correctly. Taking MySQL as an example, you need the database's binlog (which contains the before and after images of the data) and also the table structure as it was at the moment MySQL generated that log entry; only then can the corresponding DML statement be assembled correctly. When DML and DDL are interleaved, which happens often in production databases, obtaining the table structure at that moment is hard. NineData therefore implemented a DDLParser in the synchronization module that simulates MySQL's execution of each DDL log entry, updates the module's Meta cache, and stores it in versions. This way, the table structure metadata of any table at any point in time can be retrieved.

NineData’s data consistency
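The idea of versioned metadata can be sketched as a store that keeps every schema version keyed by log position and looks up the version in effect for a given row event. This is a simplified illustration under stated assumptions: log positions are plain increasing integers, DDL handling is reduced to a single `add_column` helper, and all names (`VersionedSchemaStore`, `decode_row`) are hypothetical rather than NineData's DDLParser.

```python
import bisect


class VersionedSchemaStore:
    """Keeps every schema version of a table, keyed by log position."""

    def __init__(self):
        self._positions = {}  # table -> sorted list of log positions
        self._schemas = {}    # table -> list of column tuples, same order

    def record(self, table, position, columns):
        """Store the schema that takes effect at `position`."""
        positions = self._positions.setdefault(table, [])
        schemas = self._schemas.setdefault(table, [])
        if positions and position <= positions[-1]:
            raise ValueError("schema versions must be recorded in log order")
        positions.append(position)
        schemas.append(tuple(columns))

    def schema_at(self, table, position):
        """Return the columns in effect at the given log position."""
        positions = self._positions.get(table, [])
        index = bisect.bisect_right(positions, position) - 1
        if index < 0:
            raise KeyError(f"no schema known for {table} at {position}")
        return self._schemas[table][index]

    def add_column(self, table, position, column):
        """Stand-in for simulating ALTER TABLE ... ADD COLUMN seen at `position`."""
        current = self._schemas[table][-1]
        self.record(table, position, current + (column,))


def decode_row(store, table, position, values):
    """Pair a positional row image with the column names valid at that time."""
    columns = store.schema_at(table, position)
    if len(columns) != len(values):
        raise ValueError("row image does not match the schema version")
    return dict(zip(columns, values))


store = VersionedSchemaStore()
store.record("users", 100, ["id", "name"])
store.add_column("users", 250, "email")

# A row logged before the DDL is decoded with the old two-column schema,
# even though the cache already knows about the newer version.
old_row = decode_row(store, "users", 180, (1, "ann"))
new_row = decode_row(store, "users", 300, (2, "bob", "bob@example.com"))
```

Keeping all versions instead of only the latest is what allows a replay from an earlier position to decode rows correctly after later DDL has already been seen.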

3.2 High throughput

Full synchronization performance is an important aspect of data warehouse integration, which often involves synchronizing data from multiple data sources to one data warehouse. NineData's work includes the following three parts:

(1) During full synchronization to Doris, the production database contains tables of very different sizes. If there are many small tables and one or two large ones, the small tables finish quickly while the large tables drag on long after. We therefore slice a single table for concurrent processing and keep the slices even enough to make full use of concurrency, so that all workers finish at about the same time. NineData slices in order of preference by the table's primary key, unique key, non-null index, and so on, and supports resuming from a checkpoint at slice granularity.

(2) Memory: JVM memory is tuned for scenarios with short-lived data, such as full synchronization.

(3) Writes are batched and merged, and both full and incremental data are written in Stream mode, based on the characteristics of Doris. In actual measurements, a concurrency of 30 reached 209 MB/s and 880,000 RPS.
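The single-table slicing and slice-level resume from item (1) can be sketched as below. This is an illustration under assumptions, not NineData's code: SQLite stands in for the source database, `plan_slices` and `run_slices` are hypothetical names, slices are cut every `rows_per_slice` rows along the primary key, and the `completed` set plays the role of a persisted checkpoint.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def plan_slices(conn, table, key, rows_per_slice):
    """Cut a table into (lower_exclusive, upper_inclusive) key ranges.

    Boundaries are taken every `rows_per_slice` rows in key order, so slices
    hold equal row counts even if key values are sparse. The last slice is
    open-ended (upper is None) and may be empty. `table` and `key` are
    trusted identifiers in this sketch.
    """
    slices, lower = [], None
    while True:
        where = "" if lower is None else f"WHERE {key} > ?"
        params = () if lower is None else (lower,)
        row = conn.execute(
            f"SELECT {key} FROM {table} {where} ORDER BY {key} LIMIT 1 OFFSET ?",
            params + (rows_per_slice - 1,),
        ).fetchone()
        if row is None:
            slices.append((lower, None))
            return slices
        slices.append((lower, row[0]))
        lower = row[0]


def run_slices(slices, copy_slice, completed, workers=4):
    """Copy pending slices concurrently; `completed` acts as the checkpoint.

    If `copy_slice` raises, the error propagates and `completed` still holds
    the slices that finished, so a rerun skips them. `copy_slice` should be
    idempotent because a slice can be copied again after a failure.
    """
    pending = [s for s in slices if s not in completed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(copy_slice, s): s for s in pending}
        for future in as_completed(futures):
            future.result()
            completed.add(futures[future])
    return completed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(1, 11)])

slices = plan_slices(conn, "t", "id", rows_per_slice=4)

copied = []                  # stand-in for "write this slice to Doris"
checkpoint = {(None, 4)}     # pretend the first slice finished in an earlier run
run_slices(slices, copied.append, checkpoint)
```

Cutting by row count rather than by key value range is one way to keep slices even when primary keys have gaps; other strategies are possible.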

3.3 Low latency

NineData achieves low latency along several dimensions: merging hotspot updates within the link, table-level safe points that reduce how much data must be pulled again, automatic retrieval of backed-up logs from cloud-native RDS, and graceful exit (clean shutdown) during planned operations and maintenance. Together these keep running links minimally affected by delays, whatever their cause.
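Hotspot update merging can be illustrated with a small sketch that collapses all changes to the same primary key within one batch into a single net change. It assumes that each event is a hypothetical `(op, key, row)` tuple, that update events carry the full after image, and that primary keys do not change; it is not NineData's merging logic.

```python
def merge_batch(events):
    """Collapse a batch of row changes to one net change per primary key.

    events: iterable of (op, key, row) with op in {"insert", "update", "delete"}.
    Assumes updates carry the full after image of the row.
    """
    merged = {}  # key -> (op, row); dicts keep insertion order
    for op, key, row in events:
        previous = merged.get(key)
        if previous is None:
            merged[key] = (op, row)
        elif op == "delete":
            if previous[0] == "insert":
                del merged[key]                 # insert + delete cancel out
            else:
                merged[key] = ("delete", None)  # update + delete -> delete
        elif previous[0] == "insert":
            merged[key] = ("insert", row)       # insert + update -> insert latest image
        else:
            merged[key] = ("update", row)       # update/delete + write -> latest image
    return [(op, key, row) for key, (op, row) in merged.items()]


batch = [
    ("insert", 1, {"v": 1}),
    ("update", 1, {"v": 2}),
    ("update", 2, {"v": 5}),
    ("update", 2, {"v": 6}),
    ("insert", 3, {"v": 0}),
    ("delete", 3, None),
    ("update", 4, {"v": 1}),
    ("delete", 4, None),
]
net_changes = merge_batch(batch)  # eight events shrink to three writes
```

A row that is updated thousands of times per second then costs one write per batch on the target instead of one write per change, which is where the latency gain comes from.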

3.4 Link built-in ETL capability

The link's built-in ETL includes object name mapping (database, table, and column names can all be mapped), data filtering (for example, configuring a SQL expression that uses functions to compute on and filter data, such as gmt_create>='2019-09-09 11:11:11'), and operation type filtering (configuring which operation types incremental replication should copy for fine-grained control, for example copying only Insert/Delete/Update/Create Table/Alter Table and ignoring other operations).
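The three in-link rules above can be sketched as one function applied to each change event. This is a hedged illustration: `LinkRules` and `apply_rules` are hypothetical names, the event shape is an assumption, only DML events are handled, and a Python predicate stands in for the SQL expression that the product lets users configure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LinkRules:
    """Hypothetical per-link configuration (not NineData's actual model)."""
    table_map: dict = field(default_factory=dict)    # source table -> target table
    column_map: dict = field(default_factory=dict)   # source table -> {column: new name}
    row_filter: Optional[Callable] = None            # stand-in for a SQL expression
    allowed_ops: frozenset = frozenset({"INSERT", "UPDATE", "DELETE"})


def apply_rules(event, rules):
    """Return the transformed event, or None if a rule filters it out.

    event: {"op": str, "table": str, "row": dict}, DML only in this sketch.
    """
    if event["op"] not in rules.allowed_ops:
        return None
    if rules.row_filter is not None and not rules.row_filter(event["row"]):
        return None
    table = event["table"]
    renames = rules.column_map.get(table, {})
    row = {renames.get(col, col): value for col, value in event["row"].items()}
    return {"op": event["op"], "table": rules.table_map.get(table, table), "row": row}


rules = LinkRules(
    table_map={"orders": "ods_orders"},
    column_map={"orders": {"gmt_create": "created_at"}},
    # Mirrors the example expression gmt_create>='2019-09-09 11:11:11';
    # timestamps in this fixed format compare correctly as strings.
    row_filter=lambda row: row["gmt_create"] >= "2019-09-09 11:11:11",
    allowed_ops=frozenset({"INSERT", "UPDATE"}),
)

kept = apply_rules(
    {"op": "INSERT", "table": "orders",
     "row": {"id": 1, "gmt_create": "2020-01-01 00:00:00"}},
    rules,
)
too_old = apply_rules(
    {"op": "INSERT", "table": "orders",
     "row": {"id": 2, "gmt_create": "2018-01-01 00:00:00"}},
    rules,
)
wrong_op = apply_rules(
    {"op": "DELETE", "table": "orders",
     "row": {"id": 1, "gmt_create": "2020-01-01 00:00:00"}},
    rules,
)
```

Filtering runs before renaming in this sketch so that the filter expression can refer to source column names.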

3.5 Scalability

Data warehouse integration involves many data sources. To make it easier to add more, we abstracted intermediate data types for structure conversion and data conversion, which allows fast conversion between heterogeneous sources, and we abstracted the replication framework itself, so that new data sources can be connected quickly by developing plug-ins on top of it.

NineData data management platform architecture diagram
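The benefit of intermediate data types can be shown with a small sketch: every source maps into a shared intermediate type and every target maps out of it, so adding a source needs one new mapping instead of one per target. The intermediate type names and all table entries below are illustrative assumptions, not NineData's actual type system.

```python
# Source type -> hypothetical intermediate type.
TO_INTERMEDIATE = {
    "mysql": {"BIGINT UNSIGNED": "UINT64", "TEXT": "STRING", "DATETIME": "TIMESTAMP"},
    "postgresql": {"NUMERIC": "DECIMAL", "TEXT": "STRING", "TIMESTAMP": "TIMESTAMP"},
}

# Hypothetical intermediate type -> target type.
FROM_INTERMEDIATE = {
    "doris": {
        "UINT64": "LARGEINT",
        "STRING": "STRING",
        "TIMESTAMP": "DATETIME",
        "DECIMAL": "DECIMAL",
    },
}


def register_source(name, mapping):
    """Plug in a new source by declaring only its mapping to intermediate types."""
    TO_INTERMEDIATE[name] = {k.upper(): v for k, v in mapping.items()}


def convert_type(source, target, source_type):
    """Convert a column type in two hops: source -> intermediate -> target."""
    intermediate = TO_INTERMEDIATE[source][source_type.upper()]
    return FROM_INTERMEDIATE[target][intermediate]


print(convert_type("mysql", "doris", "bigint unsigned"))   # LARGEINT

# Adding a source does not touch any target mapping.
register_source("oracle", {"NUMBER": "DECIMAL"})
print(convert_type("oracle", "doris", "number"))           # DECIMAL
```

With N sources and M targets this layout needs N + M mapping tables rather than N × M direct ones.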

3.6 Observable and intervenable

(1) Data comparison is a key NineData function and gives good visibility into the data consistency of replication into the data warehouse. NineData's full comparison pushes computation down to the database to reduce network consumption and improve performance, and supports rate limiting to protect production databases. A quick comparison checks the row count and the MAX, MIN, and AVG values of the data to judge consistency more accurately.

NineData data comparison function

In the results view, NineData color-marks the inconsistent fields in each row and generates revised SQL to repair them.

NineData supports intelligent verification of data and generation of revised SQL

(2) Beyond traditional monitoring and alerting, NineData has two distinctive functions. First, while the synchronization module is running, you can see which SQL statement each thread is currently submitting and how long it has been executing, for example when a DDL is particularly slow. Second, if a command being replicated throws an error, the customer can modify and retry it at the SQL statement level, or skip it, to intervene quickly and restore the link.

NineData’s observable and intervenable capabilities
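The quick comparison from item (1) of this section, which pushes aggregates down to each database instead of transferring rows, can be sketched as follows. Two in-memory SQLite databases stand in for the source and for Doris, and `table_fingerprint` and `quick_compare` are hypothetical names; matching aggregates are a fast pre-check and not a proof that every row is identical.

```python
import sqlite3


def table_fingerprint(conn, table, column):
    """Compute COUNT/MAX/MIN/AVG inside the database; only four values travel.

    `table` and `column` are trusted identifiers in this sketch.
    """
    sql = (
        f"SELECT COUNT(*), MAX({column}), MIN({column}), AVG({column}) "
        f"FROM {table}"
    )
    return conn.execute(sql).fetchone()


def quick_compare(source, target, table, column):
    """Return {metric: (source_value, target_value)} for metrics that differ."""
    names = ("count", "max", "min", "avg")
    src = table_fingerprint(source, table, column)
    dst = table_fingerprint(target, table, column)
    return {n: (s, d) for n, s, d in zip(names, src, dst) if s != d}


def make_db(amounts):
    """Build a throwaway database holding one `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)", [(a,) for a in amounts]
    )
    return conn


source_db = make_db([10, 20, 30])
target_db = make_db([10, 20, 35])   # one row drifted on the target side

differences = quick_compare(source_db, target_db, "orders", "amount")
```

Here the row counts and minimums agree while the maximum and average do not, which is enough to flag the table for a full row-level comparison.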

Typical business scenarios and technology outlook

We believe that in a multi-cloud, multi-source context, with cloud-native data warehouses developing rapidly, only cloud-native real-time data integration can keep up with the demands of the times: it can quickly aggregate data from various vendors and of various types, with on-demand, out-of-the-box capabilities.

Real-time data integration technology trends

Metadata-Driven integration manages an enterprise's multiple data sources in a unified way and builds unified metadata and a data catalog. A complete view of data as a means of production is also particularly important for enterprises, and NineData's SQL development can help supply these capabilities. Meanwhile, the traditional approach of first building a central data platform to centralize big data and only then considering the output no longer meets enterprise needs. Purpose-Driven integration is more popular with enterprise users because it gives them a clear estimate of the results to expect from their investment in data warehouse integration. By building federated queries or logical views in advance, users can preview reports as they would look after integration and evaluate the related link and storage costs before investing. A real-time data integration platform should also offer self-service so that users can try it out and decide for themselves.

Given the current progress of AIGC, we believe large models have good prospects as intelligent assistants for enterprise data management.

Origin my.oschina.net/ninedata/blog/10143753